Exploratory Data Analysis (EDA) examines and visualizes data to understand its main characteristics, identify patterns, spot anomalies, and test hypotheses. It helps summarize the data and uncover insights before applying more advanced data analysis techniques.
What Is Exploratory Data Analysis?
Exploratory Data Analysis is a data analytics process that aims to understand the data in depth and learn its different characteristics, often using visual means. This allows one to get a better feel for the data and find useful patterns.
Figure 1: Exploratory Data Analysis
It is crucial to understand your data in depth before you perform data analysis and run the data through an algorithm. You need to know the patterns in your data and determine which variables are important and which do not play a significant role in the output. Further, some variables may be correlated with other variables. You also need to recognize errors in your data.
Exploratory data analysis can do all of this. It helps you gather insights, get a better sense of the data, and remove irregularities and unnecessary values.
- Helps you prepare your dataset for analysis.
- Allows a machine learning model to make better predictions on your dataset.
- Gives you more accurate results.
- Helps you choose a better machine learning model.
Figure 2: Exploratory Data Analysis uses
Steps Involved in Exploratory Data Analysis
1. Understand the Data
Familiarize yourself with the data set, understand the domain, and identify the objectives of the analysis.
2. Data Collection
Collect the required data from various sources such as databases, web scraping, or APIs.
3. Data Cleaning
- Handle missing values: Impute or remove missing data.
- Remove duplicates: Ensure there are no duplicate records.
- Correct data types: Convert data types to appropriate formats.
- Fix errors: Address any inconsistencies or errors in the data.
4. Data Transformation
- Normalize or standardize the data if necessary.
- Create new features through feature engineering.
- Aggregate or disaggregate data based on analysis needs.
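As a rough illustration of the normalize/standardize step above, the sketch below applies min-max scaling and z-score standardization to a single column. The `salary` column and its values are made up for the example and are not taken from any dataset in this article.

```python
import pandas as pd

# Hypothetical toy column; the values are invented for illustration.
df = pd.DataFrame({"salary": [20000, 50000, 60000, 100000]})
col = df["salary"]

# Min-max normalization rescales the values to the [0, 1] range.
df["salary_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization (z-score) shifts the mean to 0 and scales to unit
# standard deviation. ddof=0 uses the population standard deviation.
df["salary_z"] = (col - col.mean()) / col.std(ddof=0)

print(df)
```

Min-max scaling is sensitive to extreme values, so standardization is often the safer choice when outliers have not yet been handled.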
5. Data Integration
Integrate data from various sources to create a complete data set.
6. Data Exploration
- Univariate Analysis: Analyze individual variables using summary statistics and visualizations (e.g., histograms, box plots).
- Bivariate Analysis: Analyze the relationship between two variables with scatter plots, correlation coefficients, and cross-tabulations.
- Multivariate Analysis: Investigate interactions between multiple variables using pair plots and correlation matrices.
7. Data Visualization
Visualize data distributions and relationships using visual tools such as bar charts, line charts, scatter plots, heatmaps, and box plots.
8. Descriptive Statistics
Calculate central tendency measures (mean, median, mode) and dispersion measures (range, variance, standard deviation).
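A minimal sketch of these measures with pandas follows. The `ages` series is a made-up example; note that pandas computes the sample variance and standard deviation (`ddof=1`) by default.

```python
import pandas as pd

# Hypothetical ages, invented for illustration.
ages = pd.Series([25, 30, 30, 35, 60], name="age")

# Measures of central tendency
mean_age = ages.mean()            # arithmetic mean
median_age = ages.median()        # middle value of the sorted data
mode_age = ages.mode().iloc[0]    # most frequent value (first one if tied)

# Measures of dispersion
age_range = ages.max() - ages.min()
variance = ages.var()             # sample variance (ddof=1)
std_dev = ages.std()              # sample standard deviation (ddof=1)

print(mean_age, median_age, mode_age, age_range, variance, std_dev)
```

For a quick overview of several of these values at once, `ages.describe()` returns the count, mean, standard deviation, minimum, quartiles, and maximum.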
9. Identify Patterns and Outliers
Detect patterns, trends, and outliers in the data using visualizations and statistical methods.
10. Hypothesis Testing
Formulate and test hypotheses using statistical tests (e.g., t-tests, chi-square tests) to validate assumptions or relationships in the data.
11. Data Summarization
Summarize findings with descriptive statistics, visualizations, and key insights.
12. Documentation and Reporting
- Document the EDA process, findings, and insights in a clear and structured way.
- Create reports and presentations to convey results to stakeholders.
13. Iterate and Refine
Continuously refine the analysis based on feedback and additional questions during the process.
Importance of Exploratory Data Analysis in Data Science
Exploratory Data Analysis is a critical step in the data science process. It is the foundation for understanding and interpreting complex data sets. EDA helps data scientists identify patterns, spot anomalies, test hypotheses, and check assumptions through various statistical and graphical techniques. By thoroughly exploring the data, practitioners can uncover underlying structures, detect outliers, and determine the relationships between variables, all of which is essential for developing accurate predictive models.
Furthermore, Exploratory Data Analysis allows the identification of data quality issues, such as missing values or errors, which can be addressed before proceeding to more advanced analysis. This preliminary analysis enhances the reliability and accuracy of the subsequent modeling and ensures that the insights derived are valid and actionable. EDA allows data scientists to make informed decisions and derive meaningful insights that drive business strategies and solutions.
Types of Exploratory Data Analysis (EDA)
1. Univariate Analysis
- Definition: Focuses on analyzing a single variable at a time.
- Purpose: To understand the variable's distribution, central tendency, and spread.
- Techniques:
- Descriptive statistics (mean, median, mode, variance, standard deviation).
- Visualizations (histograms, box plots, bar charts, pie charts).
2. Bivariate Analysis
- Definition: Examines the relationship between two variables.
- Purpose: To understand how one variable affects or is associated with another.
- Techniques:
- Scatter plots.
- Correlation coefficients (Pearson, Spearman).
- Cross-tabulations and contingency tables.
- Visualizations (line plots, scatter plots, pair plots).
3. Multivariate Analysis
- Definition: Investigates interactions between three or more variables.
- Purpose: To understand the complex relationships and interactions in the data.
- Techniques:
- Multivariate plots (pair plots, parallel coordinates plots).
- Dimensionality reduction techniques (PCA, t-SNE).
- Cluster analysis.
- Heatmaps and correlation matrices.
4. Descriptive Statistics
- Definition: Summarizes the main features of a data set.
- Purpose: To provide a quick overview of the data.
- Techniques:
- Measures of central tendency (mean, median, mode).
- Measures of dispersion (range, variance, standard deviation).
- Frequency distributions.
5. Graphical Analysis
- Definition: Uses visual tools to explore data.
- Purpose: To identify patterns, trends, and data anomalies through visualization.
- Techniques:
- Charts (bar charts, histograms, pie charts).
- Plots (scatter plots, line plots, box plots).
- Advanced visualizations (heatmaps, violin plots, pair plots).
6. Dimensionality Reduction
- Definition: Reduces the number of variables under consideration.
- Purpose: To simplify models, reduce computation time, and mitigate the curse of dimensionality.
- Techniques:
- Principal Component Analysis (PCA).
- t-Distributed Stochastic Neighbor Embedding (t-SNE).
- Linear Discriminant Analysis (LDA).
Exploratory Data Analysis Tools
Using the following tools for exploratory data analysis, data scientists can effectively gain deeper insights and prepare data for advanced analytics and modeling.
1. Python Libraries
- Pandas: Provides data structures and functions needed to manipulate structured data seamlessly.
- Use: Data cleaning, manipulation, and summary statistics.
- NumPy: Supports large, multi-dimensional arrays and matrices and provides a collection of mathematical functions to operate on them.
- Use: Numerical computations and data manipulation.
- Matplotlib: A plotting library that produces static, animated, and interactive visualizations.
- Use: Basic plots like line charts, scatter plots, and bar charts.
- Seaborn: Built on Matplotlib, it provides a high-level interface for drawing attractive statistical graphics.
- Use: Advanced visualizations like heatmaps, violin plots, and pair plots.
- SciPy: Builds on NumPy and provides many higher-level scientific algorithms.
- Use: Statistical analysis and additional mathematical functions.
- Plotly: A graphing library that makes interactive, publication-quality graphs online.
- Use: Interactive and dynamic visualizations.
2. R Libraries
- ggplot2: A framework for creating graphics using the principles of the Grammar of Graphics.
- Use: Complex and multi-layered visualizations.
- dplyr: A set of tools for data manipulation, offering consistent verbs to address common data manipulation tasks.
- Use: Data wrangling and manipulation.
- tidyr: Provides functions to help you organize your data in a tidy way.
- Use: Data cleaning and tidying.
- shiny: An R package that makes building interactive web apps straight from R easy.
- Use: Interactive data analysis applications.
- plotly: Also available in R for creating interactive visualizations.
- Use: Interactive visualizations.
3. Integrated Development Environments (IDEs)
- Jupyter Notebook: An open-source web application that allows you to create and share documents that contain live code, equations, visualizations, and narrative text.
- Use: Combining code execution, rich text, and visualizations.
- RStudio: An integrated development environment for R that offers tools for writing and debugging code, building software, and analyzing data.
- Use: R development and analysis.
4. Data Visualization Tools
- Tableau: A top data visualization tool that facilitates the creation of diverse charts and dashboards.
- Use: Interactive and shareable dashboards.
- Power BI: A Microsoft business analytics service offering interactive visualizations and business intelligence features.
- Use: Interactive reports and dashboards.
5. Statistical Analysis Tools
- SPSS: A comprehensive statistics package from IBM.
- Use: Complex statistical data analysis.
- SAS: A software suite developed by SAS Institute for advanced analytics, business intelligence, data management, and predictive analytics.
- Use: Statistical analysis and data management.
6. Data Cleaning Tools
- OpenRefine: A powerful tool for cleaning messy data, transforming formats, and enhancing it with web services and external data.
- Use: Data cleaning and transformation.
- SQL Databases: Tools like MySQL, PostgreSQL, and SQLite are used to manage and query relational databases.
- Use: Data extraction, transformation, and basic analysis.
Market Analysis With Exploratory Data Analysis
Now, perform Exploratory Data Analysis on market analysis data. You start by importing all necessary modules.
Figure 3: Importing necessary modules
Then, you read in the data as a pandas data frame.
Figure 4: Market Analysis Data
The dataset is not formatted correctly. The first two rows do not contain the actual column names, just arbitrary values.
Importing Data
When importing your data, skip the first two rows so that the arbitrary values are ignored. This ensures that your column names are populated correctly.
Figure 5: Importing Market Analysis Data
The dataset is imported correctly now. The column names are in the correct row, and you’ve dropped the arbitrary data.
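The import step can be sketched roughly as follows. Since the original file is not available here, the example reads from an in-memory string that mimics the layout described above (two junk rows, then the header). The column names and values are assumptions made for illustration.

```python
import io
import pandas as pd

# Stand-in for the survey file: two rows of arbitrary values, then the
# real header row. All names and values here are hypothetical.
raw = io.StringIO(
    "banking marketing,,\n"
    "customer id and age,salary and balance,\n"
    "customerid,age,salary\n"
    "1,58,100000\n"
    "2,44,60000\n"
)

# skiprows=2 discards the first two lines, so the third line is used
# as the header row.
data = pd.read_csv(raw, skiprows=2)

print(data.columns.tolist())
print(data.head())
```

With a real file, pass its path to `pd.read_csv` in place of the `raw` buffer.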
The above data was collected while taking a survey. Information about the survey takers, like their occupation, salary, whether they have taken a loan, age, etc., is given. You will use exploratory data analysis to find patterns in this data and correlations between columns. You will also perform basic data-cleaning steps.
Data Cleaning
The next step is data cleaning. Drop the customer ID column, as it is just the row number indexed from 1. Also, split the ‘jobedu’ column into two: one for the job and one for the education field. After splitting, you can drop the ‘jobedu’ column, as it is no longer needed.
Figure 6: Cleaning Market Analysis Data
This is what the dataset looks like now.
Figure 7: Market Analysis Data
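A minimal sketch of these cleaning steps is shown below. Only the ‘jobedu’ column name comes from the text; the `customerid` column name, the comma separator inside ‘jobedu’, and the sample rows are assumptions.

```python
import pandas as pd

# Hypothetical rows; 'customerid' and the comma separator are assumptions.
data = pd.DataFrame({
    "customerid": [1, 2, 3],
    "jobedu": ["management,tertiary", "technician,secondary", "admin.,unknown"],
})

# The ID column only repeats the row number, so drop it.
data = data.drop(columns="customerid")

# Split 'jobedu' at the first comma into two new columns.
data[["job", "education"]] = data["jobedu"].str.split(",", n=1, expand=True)

# The combined column is no longer needed once it has been split.
data = data.drop(columns="jobedu")

print(data)
```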
Missing Values
The data has some missing values in its columns. There are three major categories of missing values:
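- MCAR (Missing completely at random): The values are missing purely by chance. Whether a value is missing does not depend on any other value, observed or unobserved.
- MAR (Missing at random): Whether a value is missing depends on other observed features, but not on the missing value itself.
- MNAR (Missing not at random): Whether a value is missing depends on the missing value itself, so there is a systematic reason why it is absent.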
Let’s check the columns which have missing values.
Figure 8: Missing values
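One common way to run this check in pandas is to count the null entries per column, as sketched below. The small data frame is a made-up stand-in for the survey data.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the survey data.
data = pd.DataFrame({
    "age": [58, np.nan, 33, 47],
    "month": ["may", "may", None, "jun"],
    "response": ["no", "yes", "no", None],
    "salary": [100000, 60000, 120000, 20000],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing_counts = data.isnull().sum()
print(missing_counts)

# Names of the columns that contain at least one missing value.
print(missing_counts[missing_counts > 0].index.tolist())
```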
The missing age values cannot be reliably reconstructed from the other columns, so drop all rows without an age value.
Figure 9: Missing age values
Next, fill in the missing values in the month column with the most commonly occurring month. Compute the mode of the month column to find that value, then fill in the missing entries with it, for example using pandas' fillna function.
Figure 10: Filling in missing month values
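The mode-based fill described above might look like the following sketch. The sample values are invented for illustration.

```python
import pandas as pd

# Hypothetical month values with two missing entries.
data = pd.DataFrame({"month": ["may", "may", None, "jun", None]})

# mode() returns a Series because several values can tie for most
# frequent; take the first entry.
month_mode = data["month"].mode()[0]

# Replace the missing entries with the most frequent month.
data["month"] = data["month"].fillna(month_mode)

print(data["month"].tolist())
```

Mode imputation keeps every row but inflates the share of the most common category, so it is best suited to columns with only a few missing values.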
Check to see the number of missing values left in your data.
Figure 11: Missing values
Now, only the response column has missing values. You cannot fill in these values: if the user hasn't filled in the response, you cannot auto-generate it, so drop these rows.
Figure 12: Dropping Missing response values
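Dropping the rows that lack an age or a response can be sketched with `dropna` and its `subset` parameter. The data frame below is a made-up example.

```python
import numpy as np
import pandas as pd

# Hypothetical rows with gaps in 'age' and 'response'.
data = pd.DataFrame({
    "age": [58, np.nan, 33, 47],
    "response": ["no", "yes", None, "yes"],
})

# Keep only the rows where both 'age' and 'response' are present,
# then renumber the index.
clean = data.dropna(subset=["age", "response"]).reset_index(drop=True)

print(clean)
```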
Finally, the data is clean. You can now start finding the outliers.
Handling Outliers
There are two types of outliers in data:
- Univariate outliers: Univariate outliers are the data points whose values lie outside the expected range. Here, only a single variable is being considered.
- Multivariate outliers: These outliers depend on the relationship between two or more variables. When you plot a single variable on its own, none of its values may lie beyond the expected range. Still, when you plot the same variable against another variable, some points may lie far from the expected values.
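As an illustration of flagging univariate outliers, the sketch below uses the interquartile range (IQR) rule. This is one common convention rather than a method prescribed by this article, and the `balance` values are made up.

```python
import pandas as pd

# Hypothetical balances; the last value is deliberately extreme.
balance = pd.Series([100, 200, 250, 300, 350, 400, 5000], name="balance")

# Quartiles and interquartile range.
q1 = balance.quantile(0.25)
q3 = balance.quantile(0.75)
iqr = q3 - q1

# Points more than 1.5 * IQR outside the quartiles are flagged.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = balance[(balance < lower) | (balance > upper)]

print(outliers.tolist())
```

The 1.5 multiplier is a convention. Flagged points should still be inspected before deciding whether to keep, transform, or remove them.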
Univariate Analysis
Now, consider the different jobs for which you have data. Plotting the job column as a bar graph, sorted in ascending order of the number of people in each job, shows the most common jobs among the survey takers. Normalize the counts so that they are expressed as proportions of the total and are comparable.
Figure 13: Plotting the number of people performing a certain job
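A rough sketch of such a plot is shown below. The job values are invented, and the non-interactive `Agg` backend is selected only so that the example runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical job values for six survey takers.
jobs = pd.Series(
    ["management", "technician", "technician",
     "blue-collar", "blue-collar", "blue-collar"],
    name="job",
)

# normalize=True converts counts to proportions; sort_values() orders
# them in ascending order.
job_share = jobs.value_counts(normalize=True).sort_values()

# Horizontal bar chart of the share of respondents per job.
ax = job_share.plot.barh()
ax.set_xlabel("Share of respondents")
plt.tight_layout()

print(job_share)
```

In an interactive session, call `plt.show()` to display the chart, or `plt.savefig(...)` to write it to a file.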
Moving on, plot a pie chart to compare the education qualifications of the people in the survey. Almost half of the people have only secondary school education, and one-fourth have a tertiary education.
Figure 14: Plotting the education qualification of people
Bivariate Analysis
Bivariate analysis is of three main types:
1. Numeric-Numeric Analysis
When both variables being compared contain numeric data, the analysis is called numeric-numeric analysis. You can use scatter plots, pair plots, and correlation matrices to compare two numeric columns.
Scatter Plot
A scatter plot represents every data point in the graph. It shows how the data in one column varies with the corresponding data points in another column. For example, plot one scatter plot of individuals' salaries against their bank balances, and another of their bank balances against their ages.
Figure 15: Plotting a scatter plot of Salary vs. Balance
Looking at the above plot, you can say that regardless of an individual's salary, the average bank balance ranges from 0 to 25,000. The majority of people have a bank balance below 40k.
Figure 16: Plotting a scatter plot of Balance vs Age
From the above graph, you can conclude that the average balance of people, regardless of age, is around 25,000. This is the average balance, irrespective of age and salary.
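The two scatter plots discussed above could be produced along these lines. The numbers are made up and will not reproduce the figures, and the `Agg` backend is chosen only so that the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical values for three numeric columns.
data = pd.DataFrame({
    "salary": [20000, 50000, 60000, 100000, 120000],
    "balance": [100, 400, 300, 900, 1500],
    "age": [25, 40, 31, 52, 38],
})

# Two panels side by side: salary vs. balance and age vs. balance.
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
data.plot.scatter(x="salary", y="balance", ax=axes[0])
data.plot.scatter(x="age", y="balance", ax=axes[1])
fig.tight_layout()
```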
Pair Plot
Pair plots are used to compare multiple variables simultaneously. They draw a scatter plot for every pair of input variables in a single grid, which saves space and lets you compare several relationships at a glance. Let's plot the pair plot for salary, balance, and age.
Figure 17: Plotting a pairplot
The figures below show the pair plots for salary, balance, and age. Each variable is plotted against the others on both the x- and y-axes.
Figure 18: Pairplots of salary, balance, and age
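A pair plot can be sketched as follows. Seaborn's `pairplot` is a common choice for this. To avoid an extra dependency, the example uses pandas' `scatter_matrix` as a stand-in, which draws a similar grid. The data values are made up.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import pandas as pd
from pandas.plotting import scatter_matrix

# Hypothetical values for the three numeric columns.
data = pd.DataFrame({
    "salary": [20000, 50000, 60000, 100000, 120000, 55000],
    "balance": [100, 400, 300, 900, 1500, 250],
    "age": [25, 40, 31, 52, 38, 29],
})

# One scatter plot per pair of columns, with histograms on the diagonal.
axes = scatter_matrix(
    data[["salary", "balance", "age"]],
    figsize=(8, 8),
    diagonal="hist",
)
```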
Correlation Matrix
A correlation matrix shows the correlation between each pair of variables. The correlation coefficient measures how strongly, and in which direction, two variables move together. The table below shows the correlation between salary, age, and balance. A coefficient close to 1 or -1 indicates a strong linear relationship, while a value near 0 indicates a weak one. Note that correlation describes association only; it does not by itself show that a change in one variable causes a change in the other.
Figure 19: Correlation matrix between salary, balance, and age
The above matrix tells us that balance has a higher correlation coefficient with both age and salary, so these pairs are more strongly associated. Age and salary have a lower correlation coefficient.
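Computing such a matrix in pandas is a single call to `corr`, which uses the Pearson coefficient by default. The values below are made up, so the coefficients will differ from those in the figure.

```python
import pandas as pd

# Hypothetical values for the three numeric columns.
data = pd.DataFrame({
    "salary": [20000, 50000, 60000, 100000, 120000],
    "balance": [100, 400, 300, 900, 1500],
    "age": [25, 40, 31, 52, 38],
})

# Pairwise Pearson correlation between the selected columns.
corr = data[["salary", "balance", "age"]].corr()

print(corr.round(2))
```

Passing `method="spearman"` to `corr` gives a rank-based alternative that is less sensitive to outliers.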
2. Numeric-Categorical Analysis
When one variable is of numeric type, and another is a categorical variable, you perform numeric-categorical analysis.
You can use the groupby function to arrange the data into groups. Rows that have the same value in a particular column are grouped together. This way, you can summarize a numeric column separately for each category, for example by computing its mean within each group.
Figure 20: Groupby of response with respect to salary
The above values tell you the average salary of the people who have responded yes or no in the response column.
You can also find the middle value of salary or the median value of the people who have responded with yes and no in our survey.
Figure 21: Median of groupby of response with respect to salary
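The grouped mean and median can be sketched as below. The salaries are invented for illustration and are not the survey's values.

```python
import pandas as pd

# Hypothetical responses and salaries.
data = pd.DataFrame({
    "response": ["no", "no", "no", "yes", "yes", "yes"],
    "salary": [20000, 60000, 70000, 50000, 60000, 100000],
})

# Group rows by their response, then summarize salary within each group.
mean_salary = data.groupby("response")["salary"].mean()
median_salary = data.groupby("response")["salary"].median()

print(mean_salary)
print(median_salary)
```

Comparing the mean and the median within each group is a quick check for skew: a mean well above the median suggests a few large salaries are pulling the average up.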
You can also plot the box plot of response vs salary. A boxplot will show you the range of values that fall under a certain category.
Figure 22: Boxplot of response with respect to salary
The above plot tells you that the salaries of people who said no on the survey range from 20k to 70k with a median salary of 60k, while the salaries of people who said yes range from 50k to 100k, also with a median salary of 60k.
3. Categorical-Categorical Analysis
When both the variables contain categorical data, you perform categorical-categorical analysis. First, convert the categorical response column into a numerical column with 1 corresponding to a positive response and 0 corresponding to a negative response.
Figure 23: Changing categorical to numerical values
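One way to do this conversion is with `numpy.where`, as in the sketch below. The new column name `response_rate` and the sample values are assumptions.

```python
import numpy as np
import pandas as pd

# Hypothetical survey responses.
data = pd.DataFrame({"response": ["yes", "no", "no", "yes"]})

# 1 for a positive response, 0 for a negative one.
# 'response_rate' is a hypothetical name for the new column.
data["response_rate"] = np.where(data["response"] == "yes", 1, 0)

print(data)
```

Because the column holds only 0s and 1s, its mean within any group equals the share of positive responses in that group.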
Now, plot the response rate against marital status. The figure below shows the mean response rate, that is, the share of people who responded yes to the survey, for each marital status.
Figure 24: Plotting marital status against the response rate
Also, plot the mean response rate against loan status.
Figure 25: Plotting loan status against the response rate
You can conclude that people who have taken a loan are likelier to respond with a no on the survey.
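The group-wise response rates behind these plots can be computed as in the sketch below. The column names `marital`, `loan`, and `response_rate` and all values are assumptions; the toy data happens to show a lower response rate among people with a loan, but it does not reproduce the survey's actual figures.

```python
import pandas as pd

# Hypothetical survey rows.
data = pd.DataFrame({
    "marital": ["married", "single", "single", "divorced", "married", "single"],
    "loan": ["yes", "no", "no", "yes", "no", "no"],
    "response": ["no", "yes", "no", "no", "yes", "yes"],
})

# Encode the response as 1 (yes) or 0 (no).
data["response_rate"] = (data["response"] == "yes").astype(int)

# The mean of a 0/1 column within a group is that group's share of
# positive responses.
by_marital = data.groupby("marital")["response_rate"].mean()
by_loan = data.groupby("loan")["response_rate"].mean()

print(by_marital)
print(by_loan)
```

Calling `.plot.bar()` on either result draws the corresponding bar chart.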
Conclusion
Exploratory Data Analysis provides valuable insights through data exploration, cleaning, and visualization. By understanding the fundamental steps of EDA and applying them to market analysis, professionals can make data-driven decisions and uncover hidden trends. Mastering EDA techniques is essential for anyone looking to excel in data science.
FAQs
1. What Are the Benefits of EDA?
Exploratory Data Analysis helps identify patterns, detect outliers, understand relationships between variables, and improve data quality, leading to more accurate and reliable models.
2. How Does EDA Differ From Data Cleaning?
Exploratory Data Analysis involves analyzing and visualizing data to understand its characteristics, while data cleaning focuses on correcting errors, handling missing values, and ensuring data consistency.
3. Can EDA Be Performed on Any Type of Data?
Yes, Exploratory Data Analysis can be performed on any type of data, including structured, unstructured, and semi-structured data, though the techniques and tools may vary.
4. What Are Some Common Visualizations Used in EDA?
Common visualizations in Exploratory Data Analysis include histograms, scatter plots, box plots, bar charts, line charts, heatmaps, and pair plots.
5. What Should Be Done if Outliers Are Found During EDA?
Investigate the cause of outliers to determine if they are errors, natural variations, or significant insights. Based on the context of the analysis, decide whether to retain, transform, or remove them.