How to find Outliers in Data?

Soham Shinde
Aug 30, 2022
5 min read

Updated: Sep 1, 2022

Hello readers, I hope you are doing well. Today’s topic is outliers. We will cover:

1. Introduction to outliers

2. What is an outlier?

3. Why are they created?

4. Why is it important to remove them from data?

5. How to remove them using mathematical concepts (using the IQR method)

6. Some facts about outliers

1. Introduction to outliers:

When I was in grad school, I worked on sales data from one of the biggest supermarket chains in the world. I had a lot of data to handle: more than 4000 rows and 6 columns. Almost every data set of this kind has some outliers. Outliers are data points that lie outside the range of the regular data points. One thing to keep in mind is that we look for outliers only in numeric variables (integer or continuous) or ordinal variables.

A categorical variable will not have outliers in this sense. Suppose we have product categories such as Dairy, Furniture, and Automobile. We cannot have an outlier in this type of data, because the categories have no numeric scale on which a value could be far from the rest. So outlier removal applies to attributes that hold numeric values.

2. What is an outlier?

An outlier is any data point that falls outside the regular range of the data. Put more simply, outliers are the data points that are unusual compared with the rest of the data. Suppose we record shipment delivery times in hours and minutes. An outlier could be a shipment that was delivered 4 days after shipping. The average shipping time is 3 days, that is, 72 hours. If we see a delivery time of 96 hours, we know that the shipment was delivered late, and that point can be an outlier.

Definition of an outlier: an outlier is an observation that differs significantly from the other observations. Outliers can enter a data set for various reasons. In the next section, we will see how outliers are formed in the data.

3. Why are they created?

It is also important to know how outliers are formed while collecting the data. Let us see some important points and reasons for the creation of outliers.

Data collection / Sampling error: Data often comes from different regions, sensors, countries, or temperature conditions, depending on what kind of data we are collecting. Data collection is the most frequent source of outliers. For example, the sample we have collected can include a person who does not belong to the target population.
NA values: When we collect data for an experiment, we can get null values in the data set. This can happen when a mechanical fault stops a sensor from picking up readings, so we get Not Available (NA) values. When we count the data points, we may see a lot of NA values. If there are any, we have to remove them.
Data entry error / Typo: We have to be careful when entering data, because a single typo can cause a major discrepancy in the values. Suppose we have the ages of adult women: 26, 39, 23, 42, 36, 23, 38, 42, 29, 390.

We can see that the last age is “390”. The zero was added by mistake, so the intended value was most likely 39.

Natural occurrence: These outliers are a genuine part of the data. For example, suppose we have the heights of all men and women in a country and plot the distribution as a histogram. Some people will be very short and some will be very tall. These people can be outliers in the dataset.
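
The NA and typo cases above can be sketched in a few lines of plain Python. This is only an illustration under assumptions: the list reuses the ages from the example with two missing entries added, and the 18 to 120 plausibility range is an assumed domain rule, not something taken from a real data set.

```python
# Ages from the example above, plus two missing (NA) entries added for illustration.
ages = [26, 39, None, 23, 42, 36, 23, 38, None, 42, 29, 390]

# Assumed plausibility range for adult ages; adjust it to your own domain.
MIN_AGE, MAX_AGE = 18, 120

# Positions of the missing values.
missing = [i for i, age in enumerate(ages) if age is None]

# Present values that fall outside the plausible range (likely typos).
suspect = [(i, age) for i, age in enumerate(ages)
           if age is not None and not (MIN_AGE <= age <= MAX_AGE)]

# Values that are present and plausible.
clean = [age for age in ages if age is not None and MIN_AGE <= age <= MAX_AGE]

print("missing positions:", missing)
print("suspect entries:", suspect)
print("clean values:", clean)
```

A range check like this only catches values that are impossible for the domain. A natural outlier, such as a very tall person, would pass it, which is why a statistical rule is still needed later.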

4. Why is it important to remove them from data?

I have included this question because we have to understand the reason for removing outliers. It is not always necessary to remove them, and we will touch on that as well. Let us see some reasons to remove outliers:

Inaccurate mean: The mean can shift a lot if we do not remove the outliers. The mean is the sum of all the values in a column divided by the number of values, so every outlier is included in the calculation and pulls the result toward itself.
Reduce storage space: Removing outliers can reduce the space the data takes up on disk. When there are very many of them, they can take up a lot of space and slow down reading the data for calculation.
Improving performance: Have you ever had slow output while performing exploratory data analysis and plotting graphs? One possible reason is that the data contains excess data points in the form of outliers, which can slow down the model, the data processing, or the reports.
When not to remove outliers: This is one of the most important questions to keep in mind. It is critical to decide whether a value really is an outlier. Domain knowledge is often needed to decide whether to remove extreme values from an attribute or column. For example, suppose we have the temperature of the day as an attribute. It is much colder at night, so night-time readings are low. If we need the temperature every 2 hours of the day, we cannot remove the cold readings, because we know that nights can be very cold. These are not outliers; they are essential data points that we need.
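
To make the "inaccurate mean" point concrete, here is a small sketch using the ages from the data entry example earlier in the post. It compares the mean and the median with and without the mistyped value 390. Treating 390 as the only bad value is an assumption made for this illustration.

```python
from statistics import mean, median

# Ages from the data entry example, including the mistyped 390.
ages = [26, 39, 23, 42, 36, 23, 38, 42, 29, 390]

# The same list with the mistyped value dropped.
ages_without_typo = [age for age in ages if age != 390]

mean_with = mean(ages)
mean_without = mean(ages_without_typo)
median_with = median(ages)
median_without = median(ages_without_typo)

print(f"mean with outlier:      {mean_with:.2f}")
print(f"mean without outlier:   {mean_without:.2f}")
print(f"median with outlier:    {median_with}")
print(f"median without outlier: {median_without}")
```

One wrong value moves the mean from about 33.1 to 68.8, while the median only moves from 36 to 37. That is why the mean is the statistic that suffers most from outliers.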

5. How to remove them using mathematical concepts? (Using the IQR method)

We will use the IQR technique to remove outliers from the data. IQR stands for interquartile range. The IQR proximity rule states that data points below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are considered outliers and are eligible for deletion. Below is the simple formula for the IQR and for the limits that decide the outliers.

IQR = Q3 - Q1

Outlier < Q1 - (1.5 * IQR)

Outlier > Q3 + (1.5 * IQR)

Let’s try to understand the meaning of quartile. If we sort the data in ascending order and divide it into 4 equal parts, the quartiles are the values at the cut points. The first quartile (Q1) is the 25th percentile of the data, the second quartile (Q2) is the 50th percentile (the median), and the third quartile (Q3) is the 75th percentile. The image below shows the code I have written in Python to find the limits for outliers.

Picture: Python code to find the outlier in the speed variable
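
As a stand-in for the screenshot, here is a minimal sketch of the IQR rule in plain Python. It is not the original code: the speed values are invented for illustration, `iqr_limits` is a hypothetical helper name, and the quartiles come from the standard library's `statistics.quantiles` with the "inclusive" method. Other tools use slightly different quartile conventions, so the limits can differ a little between them.

```python
from statistics import quantiles


def iqr_limits(values, k=1.5):
    """Return the (lower, upper) outlier limits from the IQR rule."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2, Q3.
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Invented speed readings in knots, for illustration only.
speeds = [12.0, 14.5, 15.0, 16.2, 17.8, 18.0,
          18.4, 19.1, 20.3, 21.0, 22.5, 45.0]

lower, upper = iqr_limits(speeds)
outliers = [s for s in speeds if s < lower or s > upper]

print(f"lower limit: {lower:.2f} knots")
print(f"upper limit: {upper:.2f} knots")
print("outliers:", outliers)
```

On this invented sample, only the 45.0 knot reading falls outside the limits. The numbers will not match the limits reported for the real speed data.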

Running the code from the picture gives 2 limits in the output: any speed below 7.18 knots is an outlier, and any speed above 28.31 knots is an outlier. The next step is to remove the outliers from the dataset using the lower and upper limits. After removing them, we can visualize the data to check that they are gone. The box plot below shows no outliers below the minimum or above the maximum value.

Picture: Box plot of speed attribute after removing outliers
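
The removal step can be sketched as follows, again in plain Python with invented speed values rather than the real dataset. Instead of a box plot, the check at the end simply confirms that every remaining value lies inside the limits.

```python
from statistics import quantiles

# Invented speed readings in knots (not the article's dataset).
speeds = [3.0, 12.0, 14.5, 15.0, 16.2, 17.8, 18.0,
          18.4, 19.1, 20.3, 21.0, 22.5, 45.0]

# Quartiles and IQR limits, using the "inclusive" quartile convention.
q1, _q2, q3 = quantiles(speeds, n=4, method="inclusive")
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Keep the values inside the limits and record the ones that were dropped.
kept = [s for s in speeds if lower <= s <= upper]
removed = [s for s in speeds if s < lower or s > upper]

# Sanity check: nothing outside the limits is left.
assert all(lower <= s <= upper for s in kept)

print("removed:", removed)
print("rows kept:", len(kept), "of", len(speeds))
```

Keeping the removed values in their own list is a deliberate choice here. It makes it easy to review them with domain knowledge before deleting anything for good.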

6. Some facts about outliers.

Outliers can be caused by timing issues, since the timing of data collection also matters.
They can occur naturally. We have to look for these instances.
Sometimes outliers are important, as they can give vital information about the data collection process.
Do not be afraid of outliers; they are just data points. We have to decide, based on the analysis, whether to keep them or remove them.
