Data Science Team & Tech Lead

Comparing Exploratory Data Analysis Libraries

I have tested and reviewed several Python packages for data processing and exploratory data analysis (EDA). Most of them either automate parts of that workflow or provide a suite of functions for manipulating and visualising data.

The main objective is to find Python packages that shorten the time needed for data processing and EDA.

In summary, sweetviz may be the best option in a business setting (where the focus is on business understanding), while autoviz and pandas-profiling are solid choices in an R&D setting (where the focus is on deep-dive analysis).

Comparison

Package                          sweetviz  pandas-profiling  autoviz  lux-api  dtale   dataprep
Version                          2.1.3     3.0.0             0.0.83   0.3.2    1.56.0  0.3.0
Recommended for exploration      Yes       No                Yes      No       No      Yes
Recommended for production       No        No                No       No       No      No
Ease of use                      Yes       Yes               Yes      Yes      Yes     Yes
Computation speed                Fast      Medium            Fast     Fast     Fast    Fast
Installation complexity          Low       Low               Low      Low      Medium  Low
Target variable-centric          Yes       No                Yes      Yes      No      Yes
Missing data check               No        Yes               No       No       Yes     Yes
Per variable summary statistics  Yes       Yes               No       No       Yes     Yes
AutoEDA focus                    No        Yes               Yes      No       No      Yes
Score                            4         3                 2        2        1       1
Table comparing autoEDA libraries for exploratory data analysis.
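
To make the comparison easier to query, here is a minimal sketch that transcribes the Yes/No rows of the table into a dictionary and filters on them. The key names (`explore`, `target_centric`, and so on) and the `shortlist` helper are my own shorthand for illustration; only the Yes/No values come from the table.

```python
# Yes/No rows transcribed from the comparison table; key names are illustrative.
LIBRARIES = {
    "sweetviz":         {"explore": True,  "target_centric": True,  "missing_check": False, "summary_stats": True,  "auto_eda": False},
    "pandas-profiling": {"explore": False, "target_centric": False, "missing_check": True,  "summary_stats": True,  "auto_eda": True},
    "autoviz":          {"explore": True,  "target_centric": True,  "missing_check": False, "summary_stats": False, "auto_eda": True},
    "lux-api":          {"explore": False, "target_centric": True,  "missing_check": False, "summary_stats": False, "auto_eda": False},
    "dtale":            {"explore": False, "target_centric": False, "missing_check": True,  "summary_stats": True,  "auto_eda": False},
    "dataprep":         {"explore": True,  "target_centric": True,  "missing_check": True,  "summary_stats": True,  "auto_eda": True},
}


def shortlist(**required):
    """Return the package names whose table entries match every required flag."""
    return sorted(
        name
        for name, flags in LIBRARIES.items()
        if all(flags[key] == value for key, value in required.items())
    )


# Packages recommended for exploration that can also centre the analysis on a target.
print(shortlist(explore=True, target_centric=True))
```

Filtering this way only restates the table; it does not replace trying a library on your own data.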

Unique Points of Each Library

sweetviz

  • Beautiful and simple visualisation that is good for business explanation. 
  • Limited features beyond simple visualisation.
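
For reference, a target-centric sweetviz report can be produced in a few lines. This is a minimal sketch that assumes sweetviz is installed; the helper name `sweetviz_report`, its arguments and the default output path are placeholders of mine rather than anything from the library.

```python
def sweetviz_report(df, target=None, out_path="sweetviz_report.html"):
    """Write a sweetviz HTML report, optionally centred on a target column.

    Hypothetical helper: the name, arguments and default path are illustrative.
    """
    # Imported lazily so this module still loads when sweetviz is not installed.
    import sweetviz as sv

    # target_feat makes each per-variable panel show its relation to the target.
    # sweetviz accepts numerical or boolean targets.
    report = sv.analyze(df, target_feat=target)

    # Write a standalone HTML file without opening a browser tab.
    report.show_html(filepath=out_path, open_browser=False)
    return out_path
```

Calling `sweetviz_report(df, target="churned")` on a pandas DataFrame with a boolean `churned` column (a made-up column name) would write the report to the default path.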

pandas-profiling

  • Good for generic data quality checks.
  • No option to specify target variable for tailored EDA.
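
A generic data quality report with pandas-profiling might look like the sketch below. It assumes pandas-profiling 3.x is installed; the helper name `profiling_report`, the report title and the default output path are illustrative choices of mine.

```python
def profiling_report(df, out_path="profile_report.html", minimal=False):
    """Write a pandas-profiling HTML report covering every column of df.

    Hypothetical helper: the name, title and default path are illustrative.
    """
    # Imported lazily so this module still loads when the package is missing.
    from pandas_profiling import ProfileReport

    # There is no target argument here: the report treats all columns alike.
    # minimal=True switches off the more expensive computations on large data.
    profile = ProfileReport(df, title="Data quality profile", minimal=minimal)

    # The file extension selects the output format (HTML here).
    profile.to_file(out_path)
    return out_path
```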

autoviz

  • Holistic visualisations that are good for deep dive analysis. 
  • Claims to select plots/analyses intelligently, but I have yet to see the impact.
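
As an illustration of how autoviz is driven, here is a minimal sketch that passes an in-memory DataFrame and an optional target column. It assumes autoviz is installed; the helper name `autoviz_charts` and its arguments are my own, and the keyword values shown are simply the commonly documented defaults.

```python
def autoviz_charts(df, target="", max_rows=150000):
    """Run AutoViz on an in-memory DataFrame, optionally with a target column.

    Hypothetical helper: the name and arguments are illustrative.
    """
    # Imported lazily so this module still loads when autoviz is not installed.
    from autoviz.AutoViz_Class import AutoViz_Class

    av = AutoViz_Class()
    return av.AutoViz(
        filename="",            # empty because the data is passed via dfte
        sep=",",
        depVar=target,          # target (dependent) variable; "" means none
        dfte=df,                # the DataFrame to visualise
        header=0,
        verbose=0,
        lowess=False,
        chart_format="svg",
        max_rows_analyzed=max_rows,
        max_cols_analyzed=30,
    )
```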

lux-api

  • Generates very simple distribution plots for single variables or pairs of variables.
  • Plots of interest must be selected manually.
  • Charts are not embedded in the notebook output; they are available only as widgets.

dtale

  • Very good as a Python-based data manipulation tool with GUI.
  • Contains a full suite of tools for data manipulation/analysis/visualisation, almost matching similar commercial data analytics tools.
  • Too much manual effort required to set up all the analyses/plots needed.
  • No dashboard deployment support.
  • No option to specify target variable for tailored EDA.
  • Complex dependencies on many libraries.
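
Launching the dtale GUI on a DataFrame is itself short; the manual effort comes afterwards, inside the interface. A minimal sketch, assuming dtale is installed (the helper name `open_dtale` and its `open_tab` flag are placeholders of mine):

```python
def open_dtale(df, open_tab=False):
    """Start a dtale session for df and return the running instance.

    Hypothetical helper: the name and the open_tab flag are illustrative.
    """
    # Imported lazily so this module still loads when dtale is not installed.
    import dtale

    # Starts a local web server that serves the interactive grid for df.
    instance = dtale.show(df)

    if open_tab:
        # Open the session in a separate browser tab.
        instance.open_browser()
    return instance
```

In a notebook, the returned instance renders the grid inline when it is the last expression in a cell.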

dataprep

  • EDA module is like an extension to pandas-profiling. Very comprehensive.
  • Contains three modules: one to collect data, one to explore data and one to clean data.
  • Potentially faster on large data due to use of Dask.
  • Can deep dive into selected columns/variables.
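
The two ways of using the dataprep EDA module mentioned above, a full report or a deep dive into one column, can be sketched as follows. This assumes dataprep is installed; the helper name `dataprep_explore` and its `column` argument are my own.

```python
def dataprep_explore(df, column=None):
    """Return a full dataprep report, or a deep dive into a single column.

    Hypothetical helper: the name and the column argument are illustrative.
    """
    # Imported lazily so this module still loads when dataprep is not installed.
    from dataprep.eda import create_report, plot

    if column is None:
        # Full report across all columns, similar in spirit to pandas-profiling.
        return create_report(df)

    # Passing a column name narrows the analysis to that one variable.
    return plot(df, column)
```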

Comments

One response to “Comparing Exploratory Data Analysis Libraries”

  1. Kushwant Singh

    Hi Yee Lim,

    Thanks for your writeup on the various EDA libraries that you assessed. Do you plan to include the dataprep EDA library in the future?
    https://dataprep.ai/
