4 Tools to Speed Up Exploratory Data Analysis (EDA) in Python

 


Easy-to-use tools that automate EDA in one line of code

Data preparation and exploratory data analysis take a lot of time and effort from data professionals. Wouldn't it be nice to have packages that let you explore your data quickly, in one line of code?

This article covers four Python packages that can automate your data exploration and analysis. I will go through each of them: what it does and how you can use it.

  1. DataPrep
  2. Pandas Profiling
  3. SweetViz
  4. AutoViz

DataPrep

The DataPrep ecosystem currently consists of three components: Connector, EDA, and Clean API.

Connector simplifies data collection from web APIs by providing a standard set of operations. The EDA component handles exploratory data analysis, and Clean API provides functions for quickly and efficiently cleaning and validating data.

For example, using the Philly Parking Violations dataset, we can call plot() to get an overview of the DataFrame, or plot its correlations with a single line of code using plot_correlation().

It is also possible to generate a detailed report with one line of code using DataPrep. Here is the create_report() function called on a DataFrame.
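As a minimal sketch of those two calls, assuming DataPrep is installed, the helper below wraps them in one function. The name quick_overview is a placeholder of mine and not part of DataPrep; only plot and plot_correlation come from the library.

```python
def quick_overview(df, column=None):
    """Return DataPrep's overview plots and correlation plots for df.

    quick_overview is a hypothetical helper, not a DataPrep function.
    """
    # Imported here so a missing DataPrep install only raises an error
    # when the helper is actually called.
    # Note the singular name: plot_correlation.
    from dataprep.eda import plot, plot_correlation

    # plot(df) summarises every column; plot(df, "col") focuses on one column.
    overview = plot(df) if column is None else plot(df, column)
    # plot_correlation(df) draws correlation matrices for the DataFrame.
    correlations = plot_correlation(df)
    return overview, correlations
```

In a notebook, something like overview, correlations = quick_overview(pd.read_csv("parking_violations.csv")) should give two objects that render as interactive plots when displayed. The file name is the one used in this article's example.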

import pandas as pd
from dataprep.eda import create_report
df = pd.read_csv("parking_violations.csv")
create_report(df)

You get back an extensive, interactive report that covers not only the variables but also correlations, interactions, and missing values.

DataPrep EDA report.

DataPrep reduces the time and effort you need as a data scientist to explore a dataset. With just one line of code, you get an overview of your dataset, its missing values, its correlations, and a statistical description, as we have seen above.

To install DataPrep, run:

pip install dataprep

For more information, see the DataPrep documentation.

Pandas Profiling

Pandas Profiling lets you perform EDA similar to the other packages in this article. It covers a wide range of use cases and has more tutorials than the others.

With just one line of code, you can generate an EDA report using Pandas Profiling with descriptive statistics, correlations, missing values, text analysis, and more.

Let us call ProfileReport() on the Philly DataFrame to generate an EDA report.

from pandas_profiling import ProfileReport
profile = ProfileReport(df, title="Report")
profile

Pandas Profiling generates a similar report with a sleek User Interface (UI).
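Outside a notebook, a bare profile expression displays nothing, so one option is to write the report to an HTML file. The sketch below assumes pandas-profiling is installed. The helper name save_profile and the default file name are placeholders of mine; ProfileReport and to_file come from the library.

```python
def save_profile(df, path="report.html", title="Report"):
    """Write a Pandas Profiling report for df to an HTML file.

    save_profile is a hypothetical helper, not part of the library.
    """
    # Needs pandas-profiling installed. Newer releases of the project are
    # published as ydata-profiling, where the import is
    # `from ydata_profiling import ProfileReport` instead.
    from pandas_profiling import ProfileReport

    profile = ProfileReport(df, title=title)
    profile.to_file(path)  # the .html suffix selects HTML output
    return path
```

Calling save_profile(df) on the Philly DataFrame would then leave a report.html file that you can open in any browser.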

You can install it with the pip package manager by running:

pip install pandas-profiling[notebook]

See the GitHub repository for more tutorials and documentation.

SweetViz

SweetViz also provides interactive EDA with just two lines of code. In addition, you can easily compare two datasets, such as the training and test sets of a machine learning project.

To get a report from SweetViz, you can run the following command on any data frame, and it will generate an HTML report.
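A minimal sketch of that call, assuming SweetViz is installed (pip install sweetviz): analyze() builds the report and show_html() writes it out. The helper name sweetviz_report, the default file name, and the "Train"/"Test" labels are placeholders of mine.

```python
def sweetviz_report(df, path="sweetviz_report.html", compare_to=None):
    """Write a SweetViz HTML report for df, optionally compared to a second frame.

    sweetviz_report is a hypothetical helper, not part of SweetViz.
    """
    import sweetviz as sv  # needs sweetviz installed

    if compare_to is None:
        # The two-line case: analyze one DataFrame.
        report = sv.analyze(df)
    else:
        # Compare two DataFrames, e.g. training and test sets.
        report = sv.compare([df, "Train"], [compare_to, "Test"])

    # open_browser=False only writes the file; the default opens it as well.
    report.show_html(path, open_browser=False)
    return path
```

For the comparison case, sweetviz_report(train_df, compare_to=test_df) would place both datasets side by side in a single report, where train_df and test_df stand for your own DataFrames.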

AutoViz

AutoViz provides similar functionality. You can generate much more detailed plots for your dataset with AutoViz using only one line of code. Here is a report generated with AutoViz using the Philly Parking Dataset.

from autoviz.AutoViz_Class import AutoViz_Class
AV = AutoViz_Class()
df_av = AV.AutoViz("parking.csv")

Note that you do not even need Pandas to read the data. AutoViz will load it when you provide the path to the dataset. Here is the report we generate with AutoViz.

AutoViz gives you many more plots (e.g., violin plots, box plots, and more) as well as statistical and probability values. However, the UI is not as neat as the other reports, and the plots are not interactive.
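If the data is already loaded in a DataFrame, AutoViz can also take it directly instead of a file path. The sketch below assumes AutoViz is installed and relies on the dfte and depVar parameters as listed in the AutoViz README. The helper name autoviz_from_frame is a placeholder of mine.

```python
def autoviz_from_frame(df, target=""):
    """Run AutoViz on an in-memory DataFrame instead of a CSV path.

    autoviz_from_frame is a hypothetical helper, not part of AutoViz.
    """
    from autoviz.AutoViz_Class import AutoViz_Class  # needs autoviz installed

    AV = AutoViz_Class()
    # An empty filename plus dfte=df tells AutoViz to use the DataFrame.
    # depVar names the target (dependent) column; "" means no target.
    return AV.AutoViz("", dfte=df, depVar=target)
```

Passing a target column name should make AutoViz plot the other variables against it, which can be handy before modelling.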

To install AutoViz, run the following command:

pip install autoviz

Final Thoughts

The four packages offer broadly similar functionality. Each lets you automate your EDA with a single, intuitive line of code.

Of the four packages in this article, DataPrep goes furthest beyond simple EDA. It can help you ingest data from more sources and offers a speedup on large datasets.

