Introduction to Exploratory Data Analysis

What, why, and how to do Exploratory Data Analysis?

Sep 13, 2020

As a newcomer to data science and analytics, I have no clue what is actually done in a data science or data analysis project. One thing that confuses me is Exploratory Data Analysis (EDA). So, after reading some explanations of EDA and some example EDA projects on Kaggle, I am writing this article to review my understanding of it. If you find anything that needs correcting in my article, please tell me! Thanks!

What is EDA?

EDA is one of the steps in a data science or data analysis project, carried out to familiarize ourselves with our data. In EDA, we play with our dataset, and we can use this step to validate the assumptions we hold about the data.

However, note that we don’t need to start from assumptions when we analyze the dataset. Simply by playing with the data we have, as it is, we can find insightful information!

Why do we do EDA?

The dataset that we analyze may later be used to build a machine learning (ML) model or to perform statistical tests. Some ML algorithms and statistical tests make assumptions that the dataset must meet. EDA is used to check whether our dataset fulfills those assumptions. By the way, hypothesis testing can also be done at this stage!
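
To make the idea of checking an assumption more concrete, here is a minimal sketch of one possible check. It assumes we plan to use a method that expects roughly symmetric data, and it uses skewness as a quick indicator. The spending values are made up for illustration, and the log transform is just one common remedy, not the only one.

```python
import numpy as np
import pandas as pd

# Hypothetical sample: spending values with one very large entry.
spending = pd.Series([20, 22, 25, 27, 30, 32, 35, 40, 45, 250], name="spending")

# Skewness near 0 suggests a roughly symmetric distribution;
# a large positive value suggests a long right tail.
raw_skew = spending.skew()

# A log transform is one common way to reduce right skew.
log_skew = np.log(spending).skew()

print(f"skewness before log transform: {raw_skew:.2f}")
print(f"skewness after log transform:  {log_skew:.2f}")
```

If the skewness is still large after the transform, that is a hint to look closer at the extreme values before choosing a method.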

Furthermore, the main purpose of this stage is to deeply understand our dataset: its descriptive statistics, data types, and data distributions, as well as possible problems such as anomalies, outliers, and missing values. By becoming truly familiar with our dataset, we hope to be able to choose and run the appropriate type of analysis. Conversely, we can prepare our dataset so that it is ready for any further analysis.
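
As a rough sketch of that first look at a dataset, the snippet below uses pandas on a tiny made-up table. The column names and values are my own assumptions for illustration; with real data you would load a file instead of building the DataFrame by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset; column names and values are made up.
df = pd.DataFrame({
    "age": [23, 35, 41, np.nan, 29],
    "city": ["Jakarta", "Bandung", None, "Jakarta", "Surabaya"],
    "income": [3.5, 5.2, 4.8, 6.1, 120.0],
})

print(df.dtypes)      # data type of each column
print(df.describe())  # descriptive statistics for the numeric columns

# Number of missing values in each column.
missing_per_column = df.isna().sum()
print(missing_per_column)

# Frequency of each category in a categorical column.
print(df["city"].value_counts())
```

In this toy table, the last income value is far larger than the others, which is exactly the kind of thing the descriptive statistics should prompt us to investigate.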

How to do EDA?

Broadly speaking, four things are done in EDA:

Data Understanding: understanding the descriptive statistics, data types (categorical/numerical), data distributions, and the correlation between predictor and target variables or among predictor variables.
Data Cleansing: checking for and resolving errors (duplicated entries, outliers, missing values, inappropriate data values/types, etc.).
Data Wrangling: transforming the raw data into another format that is more suitable for further analysis.
Data Analysis: as the name implies, performing the analysis needed to answer the business questions or to gain insightful information by asking a series of why-questions about our dataset.
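
The cleansing, wrangling, and analysis steps above could look something like the sketch below. The order data is hypothetical, and filling the missing price with the median is only one possible choice that I assume here for simplicity; the right choice depends on the dataset.

```python
import pandas as pd

# Hypothetical raw order data; names and values are made up.
raw = pd.DataFrame({
    "order_id": [1, 2, 2, 3, 4],
    "category": ["book", "toy", "toy", "book", "toy"],
    "price": ["10.5", "20.0", "20.0", None, "15.0"],
})

# Data cleansing: remove duplicated entries.
clean = raw.drop_duplicates().copy()

# Data cleansing: prices were stored as text, so convert them to numbers.
clean["price"] = pd.to_numeric(clean["price"])

# Data cleansing: fill the missing price with the median of the known prices.
clean["price"] = clean["price"].fillna(clean["price"].median())

# Data wrangling and analysis: summarize into one row per category.
summary = clean.groupby("category")["price"].agg(["count", "mean"])
print(summary)
```

The summary table is the kind of simple query result that can then be used to answer a question such as which category has the higher average price.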

In EDA, we don’t just do the analysis with some simple queries. We also do data visualization to get more insightful information from our dataset. Three types of data visualization can be done in EDA:

Univariate Visualization: visualizing a single variable to become familiar with its characteristics, such as its distribution.
Bivariate Visualization: visualizing two variables together to observe the relationship between them.
Multivariate Visualization: visualizing three or more variables together to trace the relationships among them.
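
Here is a minimal sketch of the three types of visualization using matplotlib. The dataset is made up, and the chart types (a histogram, a scatter plot, and a scatter plot colored by a third variable) are just one assumption about how each type could be drawn; many other chart types would work too.

```python
import matplotlib
matplotlib.use("Agg")  # draw to a file, so no display is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy dataset; values are made up.
df = pd.DataFrame({
    "height": [150, 160, 165, 170, 175, 180, 185],
    "weight": [50, 56, 61, 68, 72, 80, 88],
    "age": [21, 25, 30, 35, 40, 45, 50],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Univariate: distribution of a single variable.
axes[0].hist(df["height"], bins=5)
axes[0].set_title("Univariate: height")

# Bivariate: relationship between two variables.
axes[1].scatter(df["height"], df["weight"])
axes[1].set_title("Bivariate: height vs. weight")

# Multivariate: a third variable encoded as color.
points = axes[2].scatter(df["height"], df["weight"], c=df["age"], cmap="viridis")
fig.colorbar(points, ax=axes[2], label="age")
axes[2].set_title("Multivariate: height, weight, age")

fig.tight_layout()
fig.savefig("eda_plots.png")
```

The file name eda_plots.png is arbitrary; in a notebook you would normally skip the Agg backend line and just display the figure.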

Conclusions

EDA is one of the stages in a data science or data analysis project, and its main purpose is to deeply understand our dataset. By becoming familiar with the dataset, we can decide correctly what analysis to do next. Apart from data familiarization, we can also prepare our data so that it is ready for further analysis or for building an ML model. Finally, combining simple queries with data visualization helps us get the most out of this stage.

References

For a deeper understanding of EDA and its steps in detail, I recommend reading these resources.

Thanks for reading my article! Feel free to point out anything in it that needs correcting!