2023-02-12
Libraries for Automated Exploratory Data Analysis (EDA)
EDA Made Easy - Discover Top-10 Python Libraries That Will Take Your Data Analysis to the Next Level! Learn the Secrets of Automated EDA!
Exploratory Data Analysis (EDA) is an important step in the data analysis process. It allows us to explore and understand the dataset, identify patterns, and make informed decisions about data cleaning, feature engineering, and modeling. In recent years, several Python libraries have been developed to automate and streamline the EDA process. Here are 10 popular Python libraries for automated EDA:
Top-10 Tools for Automated EDA
1. Pandas Profiling
Pandas Profiling (published as ydata-profiling since 2023) generates a report with descriptive statistics and visualizations for each variable in a Pandas DataFrame. The report includes correlations, missing values, and data types.
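A minimal sketch of producing such a report, assuming the `ydata-profiling` package is installed. The helper name `profile_to_html` and the default output path are illustrative, not part of the library.

```python
def profile_to_html(df, output_path="profile_report.html", title="Profiling Report"):
    """Write an HTML profiling report for a pandas DataFrame and return its path."""
    # Imported inside the function so the snippet loads even when the
    # package is missing; assumes `pip install ydata-profiling`.
    from ydata_profiling import ProfileReport

    report = ProfileReport(df, title=title)
    report.to_file(output_path)  # the .html extension selects HTML output
    return output_path
```

Calling `profile_to_html(df)` on any DataFrame should leave a standalone HTML file that can be opened in a browser.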
2. DataPrep
DataPrep provides a set of functions for data cleaning and preprocessing, including automatic column type detection, outlier detection, and missing value imputation.
pip install dataprep
The following code demonstrates how to use DataPrep.EDA to create a profile report for the Titanic dataset:
from dataprep.datasets import load_dataset
from dataprep.eda import create_report
# Load the Titanic sample dataset that ships with DataPrep
df = load_dataset("titanic")
# Build the profile report and open it in the default web browser
create_report(df).show_browser()
3. Sweetviz
Sweetviz generates a report with detailed visualizations and statistical analysis for each variable in a Pandas DataFrame. The report includes comparisons between different subgroups and correlation matrices.
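The sketch below shows the two most common entry points as I understand them: a report for a single DataFrame and a side-by-side comparison of two DataFrames (for example train and test). It assumes `sweetviz` is installed; the helper names, labels, and file paths are made up for illustration.

```python
def sweetviz_report(df, output_path="sweetviz_report.html"):
    """Analyze one DataFrame and save the report as HTML."""
    import sweetviz as sv  # lazy import; assumes `pip install sweetviz`

    report = sv.analyze(df)
    # open_browser=False only writes the file instead of launching a browser
    report.show_html(filepath=output_path, open_browser=False)
    return output_path


def sweetviz_compare(train_df, test_df, output_path="sweetviz_compare.html"):
    """Compare two DataFrames (labelled Train and Test) in one report."""
    import sweetviz as sv

    report = sv.compare([train_df, "Train"], [test_df, "Test"])
    report.show_html(filepath=output_path, open_browser=False)
    return output_path
```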
4. Lux
Lux is a library for interactive data visualization in Jupyter notebooks. When a Pandas DataFrame is displayed, its recommendation system suggests relevant visualizations, and the suggestions adapt to the columns or values you mark as being of interest.
5. dabl
dabl is a library that provides a set of functions for automated data analysis and machine learning. It includes tools for data cleaning, feature engineering, and modeling, and provides an easy-to-use interface for non-experts.
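As a rough illustration of the data cleaning and plotting side of dabl, the helper below cleans a DataFrame and then plots the features against a target column. It assumes `dabl` is installed; `dabl_overview` is a hypothetical wrapper name.

```python
def dabl_overview(df, target_col):
    """Clean a DataFrame with dabl and plot its features against the target."""
    import dabl  # lazy import; assumes `pip install dabl`

    # Detect column types and apply dabl's default cleaning steps
    clean_df = dabl.clean(df)
    # Draw dabl's overview plots of the features versus the target column
    dabl.plot(clean_df, target_col=target_col)
    return clean_df
```

In a plain script (outside a notebook) the figures may only appear after calling matplotlib's `plt.show()`.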
6. AutoViz
AutoViz is a library that automatically generates visualizations for the variables in a Pandas DataFrame. It produces different types of charts such as scatterplots, histograms, and bar charts, and it can be used for both regression and classification tasks.
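A small sketch of running the library on an in-memory DataFrame, assuming the `autoviz` package is installed. The wrapper name `autoviz_charts` is illustrative, and argument defaults may differ between versions.

```python
def autoviz_charts(df, target=""):
    """Generate automatic charts for a DataFrame, optionally against a target column."""
    # Lazy import; assumes `pip install autoviz`
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    # filename stays empty because the data is passed in memory via dfte;
    # depVar names the target column ("" means no target)
    av.AutoViz(filename="", dfte=df, depVar=target)
```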
7. Klib
Klib is a library that provides a set of functions for data cleaning and preprocessing, including feature selection, missing value imputation, and correlation analysis. It includes useful visualizations and statistical analysis for each variable.
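The following sketch strings together three functions that cover the features mentioned above: a missing-value plot, a correlation plot, and the default cleaning routine. It assumes `klib` is installed; `klib_overview` is a hypothetical helper name.

```python
def klib_overview(df):
    """Plot missing values and correlations, then return a cleaned copy of df."""
    import klib  # lazy import; assumes `pip install klib`

    klib.missingval_plot(df)  # overview of missing values per column
    klib.corr_plot(df)        # heatmap of correlations between numeric columns
    # Apply klib's default cleaning steps and return the cleaned DataFrame
    return klib.data_cleaning(df)
```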
8. ExplainerDashboard
ExplainerDashboard is a library that provides a dashboard for exploring and visualizing the results of machine learning models. It includes visualizations for feature importance, confusion matrices, and partial dependence plots.
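A minimal sketch of starting a dashboard for an already fitted classifier, assuming `explainerdashboard` is installed and that `model` is a fitted scikit-learn style classifier. The wrapper name and the port value are illustrative.

```python
def launch_dashboard(model, X_test, y_test, port=8050):
    """Start an interactive dashboard for a fitted classifier (blocks until stopped)."""
    # Lazy import; assumes `pip install explainerdashboard`
    from explainerdashboard import ClassifierExplainer, ExplainerDashboard

    # The explainer wraps the model together with held-out data
    explainer = ClassifierExplainer(model, X_test, y_test)
    # Serves the dashboard on the given local port
    ExplainerDashboard(explainer).run(port=port)
```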
9. PyCaret
PyCaret is a library for automated machine learning that includes tools for data preprocessing, feature selection, and model training. It includes a user-friendly interface that allows non-experts to build and deploy machine learning models.
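A short sketch of the typical classification workflow, assuming a recent PyCaret release (3.x) is installed. The helper name `pycaret_best_model` and the seed value are illustrative.

```python
def pycaret_best_model(df, target, seed=123):
    """Set up a PyCaret classification experiment and return the best model."""
    # Lazy import; assumes `pip install pycaret`
    from pycaret.classification import compare_models, setup

    # setup() infers column types and prepares the preprocessing pipeline;
    # session_id fixes the random seed for reproducibility
    setup(data=df, target=target, session_id=seed)
    # Train and cross-validate the available models, returning the top one
    return compare_models()
```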
10. Missingno
Missingno is a library that provides a set of tools for visualizing and understanding missing data in a dataset. It includes tools for matrix visualization, bar charts, and heatmaps.
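The three visualizations mentioned above can be produced roughly as follows, assuming `missingno` and `matplotlib` are installed. `plot_missingness` is a hypothetical helper name.

```python
def plot_missingness(df):
    """Draw the matrix, bar, and heatmap views of missing data in df."""
    # Lazy imports; assumes `pip install missingno matplotlib`
    import matplotlib.pyplot as plt
    import missingno as msno

    msno.matrix(df)   # row-by-row view of where values are missing
    msno.bar(df)      # number of non-missing values per column
    msno.heatmap(df)  # correlation of missingness between columns
    plt.show()        # needed when running outside a notebook
```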
Honorable mentions
Two other tools can also be useful during data exploration.
Featuretools
Featuretools is a library for automated feature engineering that allows you to automatically generate features from multiple tables. It includes tools for handling time-based data and can generate a set of feature definitions in just a few lines of code.
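To make the multi-table idea concrete, here is a sketch that builds features for customers from a related orders table. It assumes the Featuretools 1.x API; the table names, column names (`customer_id`, `order_id`, `order_date`), and the helper name are all hypothetical.

```python
def customer_features(customers_df, orders_df):
    """Run Deep Feature Synthesis over two related tables (hypothetical schema)."""
    import featuretools as ft  # lazy import; assumes `pip install featuretools`

    es = ft.EntitySet(id="shop")
    es = es.add_dataframe(
        dataframe_name="customers", dataframe=customers_df, index="customer_id"
    )
    es = es.add_dataframe(
        dataframe_name="orders",
        dataframe=orders_df,
        index="order_id",
        time_index="order_date",  # marks when each order happened
    )
    # One customer has many orders (parent -> child relationship)
    es = es.add_relationship(
        parent_dataframe_name="customers",
        parent_column_name="customer_id",
        child_dataframe_name="orders",
        child_column_name="customer_id",
    )
    # Returns the feature matrix and the list of feature definitions
    return ft.dfs(entityset=es, target_dataframe_name="customers", max_depth=2)
```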
PyExplainer
PyExplainer is a library that allows you to easily explain and interpret the results of machine learning models. It includes tools for feature importance, partial dependence plots, and permutation feature importance.
Any comments or suggestions? Let me know.
References
There is an interesting article that features EDA tools:
It covers:
- pandas-profiling (Python)
- summarytools (R)
- explore (R)
- dataMaid (R)
To cite this article:
@article{Saf2023Libraries,
  author = {Krystian Safjan},
  title = {Libraries for Automated Exploratory Data Analysis (EDA)},
  journal = {Krystian's Safjan Blog},
  year = {2023},
}