Blog

AutoML Regression – pycaret

Code template for running autoML regression in pycaret.

Link to website : https://pycaret.org/

Link to repository : https://github.com/pycaret/pycaret

Download notebook

2022-01-30
AutoML Classification – pycaret

Code template for running autoML classification in pycaret.

Link to website : https://pycaret.org/

Link to repository : https://github.com/pycaret/pycaret

Download notebook

2022-01-30
AutoML Regression – auto-sklearn

Code template for running autoML regression in auto-sklearn.

Link to website : https://automl.github.io/auto-sklearn/master/

Link to repository : https://github.com/automl/auto-sklearn

Download notebook

2022-01-30
AutoML Classification – auto-sklearn

Code template for running autoML classification in auto-sklearn.

Link to website : https://automl.github.io/auto-sklearn/master/

Link to repository : https://github.com/automl/auto-sklearn

Download notebook

2022-01-30
AutoML Regression – autokeras

Code template for running autoML regression in autokeras.

Link to website : https://autokeras.com/

Link to repository : https://github.com/keras-team/autokeras

Download notebook

2022-01-30
AutoML Classification – autokeras

Code template for running autoML classification in autokeras.

Link to website : https://autokeras.com/

Link to repository : https://github.com/keras-team/autokeras

Download notebook

2022-01-30

Comparing Exploratory Data Analysis Libraries

I have tested and reviewed a few Python packages for data processing and/or exploratory data analysis (EDA). Most of these packages attempt to automate parts of the data processing and/or EDA process, or provide a suite of functions to manipulate and visualize data.

The main objective here is to review and explore Python packages that will shorten the time needed for data processing and/or exploratory data analysis.

In summary, sweetviz may be the best option in a business setting (with a focus on business understanding), while autoviz and pandas-profiling are solid choices in an R&D setting (with a focus on deep-dive analysis).

Comparison

Package	sweetviz	pandas-profiling	autoviz	lux-api	dtale	dataprep
Version	2.1.3	3.0.0	0.0.83	0.3.2	1.56.0	0.3.0
Recommended for exploration	Yes	No	Yes	No	No	Yes
Recommended for production	No	No	No	No	No	No
Ease of use	Yes	Yes	Yes	Yes	Yes	Yes
Computation speed	Fast	Medium	Fast	Fast	Fast	Fast
Installation complexity	Low	Low	Low	Low	Medium	Low
Target variable-centric	Yes	No	Yes	Yes	No	Yes
Missing data check	No	Yes	No	No	Yes	Yes
Per variable summary statistics	Yes	Yes	No	No	Yes	Yes
AutoEDA focus	No	Yes	Yes	No	No	Yes
Score	4	3	2	2	1	1

Table comparing autoEDA libraries for exploratory data analysis.
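
To make the "Missing data check" and "Per variable summary statistics" rows concrete, here is a minimal sketch of what such a check computes. It uses only the Python standard library and is not how any of the packages above implement it; the function name `summarize_columns` and the sample rows are made up for illustration.

```python
import statistics


def summarize_columns(rows):
    """Per-column missing-value count, plus basic statistics for numeric columns."""
    summary = {}
    for column in rows[0].keys():
        values = [row[column] for row in rows]
        present = [v for v in values if v is not None]
        info = {"missing": len(values) - len(present)}
        # Only numeric columns get mean/min/max.
        if present and all(isinstance(v, (int, float)) for v in present):
            info.update(
                mean=statistics.mean(present),
                min=min(present),
                max=max(present),
            )
        summary[column] = info
    return summary


# Hypothetical records; None marks a missing value.
rows = [
    {"age": 34, "income": 52000, "city": "Oslo"},
    {"age": None, "income": 61000, "city": "Bergen"},
    {"age": 29, "income": None, "city": None},
    {"age": 41, "income": 58000, "city": "Oslo"},
]

report = summarize_columns(rows)
for column, info in report.items():
    print(column, info)
```

The autoEDA packages produce this kind of per-column report (and much more) in a single call, which is where the time saving comes from.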

Unique Points of Each Library

sweetviz

Beautiful and simple visualisation that is good for business explanation.
Limited features beyond simple visualisation.

View notebook

pandas-profiling

Good for generic data quality check.
No option to specify target variable for tailored EDA.

View notebook

autoviz

Holistic visualisations that are good for deep dive analysis.
Claims to do smart selection of plots/analyses, but I have yet to see the impact.

View notebook

lux-api

Generates very simple single-variable or pairwise distribution plots.
Selection of the plots of interest is manual.
Charts are not integrated into the notebook (available as widgets only).

View notebook

dtale

Very good as a Python-based data manipulation tool with GUI.
Contains a full suite of tools for data manipulation/analysis/visualisation, almost matching similar commercial data analytics tools.
Too much manual effort required to set up all the analyses/plots needed.
No dashboard deployment support.
No option to specify target variable for tailored EDA.
Complex dependencies on many libraries.

View notebook

dataprep

EDA module is like an extension to pandas-profiling. Very comprehensive.
Contains 3 modules: to collect data, explore data, and clean data.
Potentially faster on large data due to use of Dask.
Possible to deep-dive into selected columns/variables.

View notebook

2022-01-29

Comparing Time Series Machine Learning Libraries

I have tested and reviewed a few Python packages for time-series data analysis, mostly for forecasting. Most of these packages are one-stop shop machine learning packages, and some of them also include autoML functionality.

The main objective here is to review and explore Python packages that will shorten the time needed for time-series data analysis.

In summary, kats is the most promising one-stop shop machine learning package for time-series analysis. pycaret-ts-alpha is likely to be a strong contender once it matures out of the alpha status and gets integrated officially into pycaret.

These libraries tend to be a bit rough around the edges in terms of documentation and API implementation, especially the newer packages. Support for multivariate time series forecasting is also on the weaker side, as most of them focus on univariate time series forecasting.

pytorch-forecasting deserves a special mention as it is the only library with a deep learning focus. While I agree that deep learning is very sexy to play with, I am still quite reserved about applying deep learning to time series problems. Compared to traditional statistical models that have tens of parameters, deep learning models often have millions or billions of parameters to be trained. Fitting an N-BEATS model that has 1.6 million parameters on the air passenger data with hundreds of data points feels wrong.

Or as John von Neumann famously said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
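
The worry about parameter counts can be illustrated with a toy example. The sketch below has nothing to do with N-BEATS or any of the libraries reviewed here: it fits a noisy linear trend once with a 2-parameter least-squares line and once with an interpolating polynomial that has as many parameters as training points. The data, the noise values, and all function names are made up for illustration.

```python
# Toy illustration: a model with as many parameters as data points
# reproduces the training data exactly but extrapolates poorly.


def fit_line(xs, ys):
    """Ordinary least-squares line: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def lagrange_predict(xs, ys, x_new):
    """Evaluate the unique degree-(n-1) polynomial through all n points."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x_new - xj) / (xi - xj)
        total += term
    return total


def mean_abs_error(actual, predicted):
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical series: trend y = 2x + 1 plus fixed, made-up "noise".
noise = [0.3, -0.4, 0.2, -0.1, 0.4, -0.3, 0.1, -0.2]
train_x = list(range(8))
train_y = [2 * x + 1 + e for x, e in zip(train_x, noise)]
test_x = [8, 9, 10]
test_y = [2 * x + 1 for x in test_x]  # noise-free future values

intercept, slope = fit_line(train_x, train_y)
line_train_mae = mean_abs_error(train_y, [intercept + slope * x for x in train_x])
line_test_mae = mean_abs_error(test_y, [intercept + slope * x for x in test_x])
poly_train_mae = mean_abs_error(
    train_y, [lagrange_predict(train_x, train_y, x) for x in train_x]
)
poly_test_mae = mean_abs_error(
    test_y, [lagrange_predict(train_x, train_y, x) for x in test_x]
)

print("2-parameter line: train MAE", line_train_mae, "test MAE", line_test_mae)
print("8-parameter poly: train MAE", poly_train_mae, "test MAE", poly_test_mae)
```

The polynomial matches every training point, yet its forecasts for the next three steps are far worse than the line's, because the extra parameters were spent fitting the noise.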

Comparison

Package	kats	pmdarima	sktime	pytorch-forecasting	pycaret-ts-alpha	autots
Version	0.1.0	1.8.2	0.5.3	0.9.0	3.0.0.dev1624743408	0.3.2
Recommended for exploration	No	No	No	No	No	No
Recommended for production	Yes	Yes	No	No	No	No
Ease of use	Yes	Yes	No	Yes	No	Yes
Computation speed	Fast	Fast	Fast	Slow	Medium	Slow
Installation complexity	Low	Low	Low	Medium	Medium	Low
One-stop shop	Yes	No	Yes	No	Yes	Yes
AutoML focus	No	No	No	No	Yes	Yes
Deep learning focus	No	No	No	Yes	No	No
Score	4	3	2	2	1	1

Table comparing autoML libraries for time-series analysis.

Unique Points of Each Library

kats

Has a promising list of time-series models implemented, including a good selection of algorithms for change detection and time-series feature extraction.
Seems good for stable/production deployment due to stable implementation and good documentation.

View notebook on forecasting

View notebook on detection
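
For readers unfamiliar with change detection, here is a minimal sketch of the idea: locating a single shift in the mean level of a series. This is not Kats's implementation or API. The function name `find_mean_shift` and the series are hypothetical, and the method shown (choosing the split that minimises the squared error around the two segment means) is just one simple approach.

```python
def find_mean_shift(series, min_segment=2):
    """Return the index where splitting the series into two segments
    minimises the total within-segment squared error.

    The returned index is the first position of the second segment.
    Returns None if the series is too short to split.
    """

    def sse(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values)

    best_index, best_cost = None, float("inf")
    for split in range(min_segment, len(series) - min_segment + 1):
        cost = sse(series[:split]) + sse(series[split:])
        if cost < best_cost:
            best_index, best_cost = split, cost
    return best_index


# Made-up series whose level jumps from about 10 to about 15.
series = [10.1, 9.8, 10.3, 9.9, 10.0, 15.2, 14.9, 15.1, 15.3, 14.8]
change_index = find_mean_shift(series)
print("Estimated change point at index", change_index)
```

A library detector would typically also report a confidence or significance level for the detected change and handle multiple change points, which this sketch does not.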

pmdarima

Standard baseline model for time-series forecasting.
Modelled after equivalent in R.
Easier to use as part of a larger one-stop shop library.

View notebook

sktime

Has a promising list of time-series models and time-series algorithms implemented.
Less complete documentation and non-standardised APIs make exploring them slightly trickier.

View notebook

pytorch-forecasting

Has a focus on using deep learning models for time-series forecasting.
Very interesting selection of DL models for time-series forecasting.
The non-conventional models implemented in PyTorch Forecasting were often developed very recently.
Each model has millions of parameters (requiring a lot of data to train) and is slow to train.

View notebook

pycaret-ts-alpha

Has strong potential as it is based on the pycaret framework, but is currently in alpha.
Makes experimentation easy and standardised.

View notebook

autots

Uses a genetic algorithm to find an ensemble of the best models, with options to set up weighted metrics to evaluate model performance.
Very slow in building the model ensemble.
Documentation and tutorials are lacking, which makes AutoTS slightly more time-consuming to use than other similar packages.

View notebook
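
The idea of ranking candidate models by a weighted combination of metrics can be sketched as follows. This is not AutoTS's API or its actual scoring formula; the metric weights, model names, and forecasts are all hypothetical.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(
        sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)
    )


def weighted_score(actual, predicted, weights):
    """Weighted sum of the error metrics named in `weights` (lower is better)."""
    metrics = {"mae": mae(actual, predicted), "rmse": rmse(actual, predicted)}
    return sum(weights[name] * metrics[name] for name in weights)


# Made-up hold-out values and forecasts from two hypothetical models.
actual = [100, 110, 120, 130]
candidates = {
    "model_a": [98, 112, 119, 131],   # small errors everywhere
    "model_b": [100, 110, 120, 150],  # perfect except one large miss
}

# Weighting RMSE more heavily penalises large individual misses.
weights = {"mae": 1.0, "rmse": 2.0}

scores = {
    name: weighted_score(actual, forecast, weights)
    for name, forecast in candidates.items()
}
best_model = min(scores, key=scores.get)
print(scores, "best:", best_model)
```

Changing the weights changes which kind of error the search favours avoiding, which is the practical use of such an option.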

2022-01-29