3 Essential Statistics for Exploratory Data Analysis on Simple Data

Alexmetelli
4 min read · Mar 4, 2021

With implementations in Python.

Photo by https://unsplash.com/@glenncarstenspeters

In this article I will explain three basic statistics for exploratory data analysis on simple data. It is aimed at people approaching data analysis for the first time and is by no means an exhaustive guide.

I will use Python in a Jupyter notebook. I deliberately avoided specialised libraries such as NumPy because I believe that understanding these basic concepts and implementing them in “raw” Python is much more valuable.

Let’s start with some definitions:

Data can be defined as what we collect to make decisions. A distinction can be made between simple, composite and complex data.

Slide from Dr. Felix Reidl

For the purpose of this article we will consider only simple data.

A statistic is a summary that quantifies an aspect of the data. Some statistics work better on specific types of data, so an initial distinction is necessary.

We will implement the following statistics:

Slide from Dr. Felix Reidl

We will use a data set of 18,500 homes for sale in London. Each entry contains the price, the size in number of rooms, and the location expressed as a travel fare zone. The data has been divided into three lists: “price”, “size” and “zone”.

Central Tendency

The central tendency describes the middle of the data.

The arithmetic mean is the sum of the values divided by the number of values: mean = (x₁ + x₂ + … + xₙ) / n.

Here we apply the formula to the “price” list:
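A minimal sketch of the mean in plain Python could look like the following. The short `price` list here is a made-up sample for illustration only, not the article's real data, and the function name `mean` is my own choice.

```python
# Hypothetical sample values; the real "price" list holds 18,500 entries.
price = [250000, 375000, 499950, 685000, 1200000]


def mean(values):
    """Return the arithmetic mean: sum of the values divided by their count."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


print(mean(price))
```

The explicit check for an empty list avoids a less helpful `ZeroDivisionError`.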

The result is £551,677 for our data. The mean is one of the most common statistics, but it can be distorted by a few extreme values and it is undefined for ordinal data.

The median is the middle value of a sorted data set. If n is the number of data points, then for odd n the median is the ((n + 1)/2)-th value. For even n it is the mean of the (n/2)-th and (n/2 + 1)-th values. In either case there is an equal number of values above and below the median.
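The odd and even cases can be sketched in plain Python as below. The `size` list is a hypothetical sample, and the function name `median` is an assumption of this sketch. Note that Python lists are 0-indexed, so the ((n + 1)/2)-th value sits at index `n // 2`.

```python
def median(values):
    """Return the middle value of the sorted data (mean of the two middle values if n is even)."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median() needs at least one value")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)-th value, which is index n // 2 when counting from 0.
        return ordered[middle]
    # Even n: mean of the (n / 2)-th and (n / 2 + 1)-th values.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Hypothetical sample of room counts.
size = [3, 1, 2, 2, 5, 1, 4]
print(median(size))
```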

Applying the function to our data set, we find that the properties for sale have a median size of 2 rooms.

Mean and median are useful statistics, but they can’t be applied to categorical data; this is where the mode helps. It is simply the most frequent value in the data set:
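One way to sketch the mode without any library is to count occurrences in a dictionary. The `zone` list is a hypothetical sample and the function name `mode` is my own. If several values tie for most frequent, this version returns the one that appears first in the data.

```python
def mode(values):
    """Return the most frequent value (the first one encountered in case of a tie)."""
    if not values:
        raise ValueError("mode() needs at least one value")
    counts = {}
    for value in values:
        counts[value] = counts.get(value, 0) + 1
    # max() keeps the first key with the highest count, and dicts preserve insertion order.
    return max(counts, key=counts.get)


# Hypothetical sample of travel fare zones.
zone = [2, 3, 3, 1, 3, 2, 4]
print(mode(zone))
```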

In our data set the zone is a perfect example of categorical data, and the function shows us that zone 3 is the most common location.

Measures of Dispersion

The dispersion tells us whether the data is tightly concentrated or widely spread around the central tendency.

The range is the most basic measure: it is the difference between the maximum and the minimum value in the data set.
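A possible implementation is a one-liner built on Python's `max` and `min`. The name `value_range` is my own choice (it avoids shadowing the built-in `range`), and the sample list is hypothetical.

```python
def value_range(values):
    """Return the difference between the largest and the smallest value."""
    if not values:
        raise ValueError("value_range() needs at least one value")
    return max(values) - min(values)


# Hypothetical sample prices in pounds.
sample_prices = [250000, 375000, 499950, 685000, 1200000]
print(value_range(sample_prices))
```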

The code gives a range of £1,346,000 for the price and 9 rooms for the size.

The variance is the average squared deviation from the mean: σ² = Σ (xᵢ − μ)² / n, where μ is the mean and n is the number of values.

In practice the standard deviation is much more useful because it is expressed in the same units as the data. It is simply the square root of the variance.

To compute it in Python we calculate the variance first and then take its square root:
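The two steps might be sketched as follows. This assumes the population form of the variance (dividing by n, as in the "average squared deviation" definition); the sample variance would divide by n − 1 instead. Function names and the sample data are my own.

```python
def variance(values):
    """Return the population variance: the average squared deviation from the mean."""
    n = len(values)
    if n == 0:
        raise ValueError("variance() needs at least one value")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


def standard_deviation(values):
    """Return the square root of the population variance."""
    return variance(values) ** 0.5


# Hypothetical sample with mean 5.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(variance(sample), standard_deviation(sample))
```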

The standard deviation tells us how tightly the data are clustered around the mean. It has many other properties, but they are beyond the scope of this article.

The Nth percentile (for 0 < N < 100) is any value such that N% of the data lies below it and (100 - N)% lies above it. When the percentile falls between two data points, we will follow the same convention used for the median and take the mean of those two points.

Commonly used percentiles are the quartiles:

  • Q1 is the 25th percentile,
  • Q2 is the 50th percentile,
  • Q3 is the 75th percentile.

The interquartile range is defined as IQR = Q3 - Q1.
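Here is one possible sketch of a percentile function and the quartiles built on it. There are several percentile conventions in use. This version assumes the position in the sorted list is (n − 1) × N / 100 and, when that position falls between two data points, takes their mean, which reproduces the median rule for N = 50. The function name and the sample data are hypothetical.

```python
def percentile(values, n_percent):
    """Return the n_percent-th percentile, averaging the two neighbours when between points."""
    if not 0 < n_percent < 100:
        raise ValueError("n_percent must be strictly between 0 and 100")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    # Position (n - 1) * N / 100, split into a whole index and a remainder.
    lower, remainder = divmod((len(ordered) - 1) * n_percent, 100)
    lower = int(lower)
    if remainder == 0:
        return ordered[lower]
    return (ordered[lower] + ordered[lower + 1]) / 2


# Hypothetical sample data.
data = [1, 2, 3, 4, 5, 6, 7, 8]
q1 = percentile(data, 25)
q2 = percentile(data, 50)
q3 = percentile(data, 75)
iqr = q3 - q1
print(q1, q2, q3, iqr)
```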

The quartiles give us a good high-level description of a data set. Applying the function to the prices in our data, we find that

Q1 = 375000.0, Q2 = 499950.0, Q3 = 685000.

To end this article we need to mention the box plot, a visual representation of the quartiles: a rectangle is drawn from Q1 to Q3, and Q2 is marked with a line inside it.

The box plot is often drawn with whiskers, which extend from Q1 down to the lowest data point larger than Q1 - 1.5 × IQR and from Q3 up to the highest data point smaller than Q3 + 1.5 × IQR. Values outside the whiskers are marked as outliers, meaning that they lie far from the rest of the distribution.
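The whisker rule can be sketched numerically, without drawing anything. The function below is a hypothetical helper that takes already computed quartiles `q1` and `q3` as arguments; the sample values and quartiles are made up for illustration. It follows the strict "larger than" and "smaller than" wording above, so a point lying exactly on a fence counts as an outlier.

```python
def whiskers_and_outliers(values, q1, q3):
    """Return (lower whisker end, upper whisker end, outliers) for the 1.5 x IQR rule."""
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # Points strictly between the fences; the whiskers end at the extremes of these.
    inside = [x for x in values if lower_fence < x < upper_fence]
    outliers = [x for x in values if not lower_fence < x < upper_fence]
    return min(inside), max(inside), outliers


# Hypothetical sample with one extreme value; q1 and q3 are assumed precomputed.
values = [1, 2, 3, 4, 5, 6, 7, 8, 30]
low, high, outliers = whiskers_and_outliers(values, q1=3, q3=7)
print(low, high, outliers)
```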

I know this is all very basic, but I hope this article makes someone else’s life easier.

Many thanks for reading.
