Data Science Team & Tech Lead

Blog

  • AutoML Regression – pycaret

    Code template for running autoML regression in pycaret.

    Link to website : https://pycaret.org/

    Link to repository : https://github.com/pycaret/pycaret

  • AutoML Classification – pycaret

    Code template for running autoML classification in pycaret.

    Link to website : https://pycaret.org/

    Link to repository : https://github.com/pycaret/pycaret

  • AutoML Regression – auto-sklearn

    Code template for running autoML regression in auto-sklearn.

    Link to website : https://automl.github.io/auto-sklearn/master/

    Link to repository : https://github.com/automl/auto-sklearn

  • AutoML Classification – auto-sklearn

    Code template for running autoML classification in auto-sklearn.

    Link to website : https://automl.github.io/auto-sklearn/master/

    Link to repository : https://github.com/automl/auto-sklearn

  • AutoML Regression – autokeras

    Code template for running autoML regression in autokeras.

    Link to website : https://autokeras.com/

    Link to repository : https://github.com/keras-team/autokeras

  • AutoML Classification – autokeras

    Code template for running autoML classification in autokeras.

    Link to website : https://autokeras.com/

    Link to repository : https://github.com/keras-team/autokeras

  • Comparing Exploratory Data Analysis Libraries

    I have tested and reviewed a few Python packages for data processing and/or exploratory data analysis (EDA). Most of these packages attempt to automate parts of the data processing and/or EDA process, or provide a suite of functions to manipulate and visualize data.

    The main objective here is to review and explore Python packages that will shorten the time needed for data processing and/or exploratory data analysis.

    In summary, sweetviz may be the best option in a business setting (with a focus on business understanding), while autoviz and pandas-profiling are solid choices in an R&D setting (with a focus on deep-dive analysis).

    Comparison

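    | Package                         | sweetviz | pandas-profiling | autoviz | lux-api | dtale  | dataprep |
    |---------------------------------|----------|------------------|---------|---------|--------|----------|
    | Version                         | 2.1.3    | 3.0.0            | 0.0.83  | 0.3.2   | 1.56.0 | 0.3.0    |
    | Recommended for exploration     | Yes      | No               | Yes     | No      | No     | Yes      |
    | Recommended for production      | No       | No               | No      | No      | No     | No       |
    | Ease of use                     | Yes      | Yes              | Yes     | Yes     | Yes    | Yes      |
    | Computation speed               | Fast     | Medium           | Fast    | Fast    | Fast   | Fast     |
    | Installation complexity         | Low      | Low              | Low     | Low     | Medium | Low      |
    | Target variable-centric         | Yes      | No               | Yes     | Yes     | No     | Yes      |
    | Missing data check              | No       | Yes              | No      | No      | Yes    | Yes      |
    | Per variable summary statistics | Yes      | Yes              | No      | No      | Yes    | Yes      |
    | AutoEDA focus                   | No       | Yes              | Yes     | No      | No     | Yes      |
    | Score                           | 4        | 3                | 2       | 2       | 1      | 1        |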
    Table comparing autoEDA libraries for exploratory data analysis.

    Unique Points of Each Library

    sweetviz

    • Beautiful and simple visualisation that is good for business explanation. 
    • Limited features beyond simple visualisation.
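
    The target variable-centric behaviour noted in the comparison table can be tried with a few lines. This is a minimal sketch rather than a full template: the helper name build_target_report, its default file name and the "price" column in the usage note are hypothetical, while analyze(..., target_feat=...) and show_html(...) are sweetviz calls.

```python
def build_target_report(df, target, path="sweetviz_report.html"):
    """Write a sweetviz HTML report that relates each column to `target`.

    `df` is expected to be a pandas DataFrame. The helper name and the
    default output path are assumptions made for this sketch.
    """
    # Imported lazily so this module still loads where sweetviz is absent.
    import sweetviz as sv

    # target_feat tells sweetviz which column the analysis should centre on.
    report = sv.analyze(df, target_feat=target)

    # Save the report without opening a browser tab.
    report.show_html(filepath=path, open_browser=False)
    return path
```

    With a DataFrame df that has a (hypothetical) price column, build_target_report(df, "price") would write sweetviz_report.html to the working directory.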

    pandas-profiling

    • Good for generic data quality checks.
    • No option to specify target variable for tailored EDA.

    autoviz

    • Holistic visualisations that are good for deep dive analysis. 
    • Claims to select plots/analyses smartly, but I have yet to see the impact.

    lux-api

    • Generates very simple single-variable or pairwise distribution plots.
    • Selecting the plots of interest is manual.
    • Charts are not embedded in the notebook output (available as widgets only).

    dtale

    • Very good as a Python-based data manipulation tool with GUI.
    • Contains a full suite of tools for data manipulation/analysis/visualisation, almost matching similar commercial data analytics tools.
    • Too much manual effort required to set up all the analyses/plots needed.
    • No dashboard deployment support.
    • No option to specify target variable for tailored EDA.
    • Complex dependencies on many libraries.

    dataprep

    • EDA module is like an extension to pandas-profiling. Very comprehensive.
    • Contains 3 modules: one to collect data, one to explore data and one to clean data.
    • Potentially faster on large data due to use of Dask.
    • Possible to deep-dive into selected columns/variables.

  • Comparing Time Series Machine Learning Libraries

    I have tested and reviewed a few Python packages for time-series data analysis, mostly for forecasting. Most of these packages are one-stop shop machine learning packages, and some of them also include autoML functionality.

    The main objective here is to review and explore Python packages that will shorten the time needed for time-series data analysis.

    In summary, kats is the most promising one-stop shop machine learning package for time-series analysis. pycaret-ts-alpha is likely to be a strong contender once it matures out of alpha status and is officially integrated into pycaret.

    These libraries tend to be a bit rough around the edges in terms of documentation and API implementation, especially the newer packages. Support for multivariate time series forecasting is also on the weaker side, as most of them focus on univariate forecasting.

    pytorch-forecasting deserves a special mention as it is the only library with a deep learning focus. While I agree that deep learning is very sexy to play with, I am still quite reserved about applying deep learning to time series problems. Compared to traditional statistical models that have tens of parameters, deep learning models often have millions or billions of parameters to train. Fitting an N-BEATS model that has 1.6 million parameters on the air passenger data with only hundreds of data points feels wrong.

    Or as John von Neumann famously said, “With four parameters I can fit an elephant, and with five I can make him wiggle his trunk.”
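
    To make the parameter-count worry concrete, here is a toy sketch in plain Python. It is only an analogy, not N-BEATS and not the air passenger data: the eight-point series and its noise values are made up for illustration. A model with as many parameters as data points (an interpolating polynomial) reproduces the history exactly, yet a two-parameter line forecasts the next step far better.

```python
# Toy series: a linear trend plus small, fixed "noise" (hypothetical numbers).
xs = list(range(8))
noise = [0.3, -0.2, 0.1, -0.4, 0.2, -0.1, 0.4, -0.3]
ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x (2 parameters)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sxy / sxx
    a = my - b * mx
    return lambda x: a + b * x


def fit_interpolant(xs, ys):
    """Lagrange polynomial through every point (len(xs) parameters)."""
    def predict(x):
        total = 0.0
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = yi
            for j, xj in enumerate(xs):
                if j != i:
                    term *= (x - xj) / (xi - xj)
            total += term
        return total
    return predict


def max_abs_error(model, xs, ys):
    """Largest absolute miss of `model` over the given points."""
    return max(abs(model(x) - y) for x, y in zip(xs, ys))


line = fit_line(xs, ys)
poly = fit_interpolant(xs, ys)

# In-sample: the interpolant looks perfect, the line does not.
in_sample_line = max_abs_error(line, xs, ys)
in_sample_poly = max_abs_error(poly, xs, ys)

# One step beyond the data; the noise-free trend gives 2*8 + 1 = 17.
x_next, y_true = 8, 17.0
out_line = abs(line(x_next) - y_true)
out_poly = abs(poly(x_next) - y_true)

print(f"in-sample error  line={in_sample_line:.3f}  poly={in_sample_poly:.3f}")
print(f"next-step error  line={out_line:.3f}  poly={out_poly:.3f}")
```

    The interpolant has memorised the noise, so its one-step-ahead error is orders of magnitude larger than that of the simple line, even though its in-sample error is zero.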

    Comparison

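    | Package                     | kats  | pmdarima | sktime | pytorch-forecasting | pycaret-ts-alpha    | autots |
    |-----------------------------|-------|----------|--------|---------------------|---------------------|--------|
    | Version                     | 0.1.0 | 1.8.2    | 0.5.3  | 0.9.0               | 3.0.0.dev1624743408 | 0.3.2  |
    | Recommended for exploration | No    | No       | No     | No                  | No                  | No     |
    | Recommended for production  | Yes   | Yes      | No     | No                  | No                  | No     |
    | Ease of use                 | Yes   | Yes      | No     | Yes                 | No                  | Yes    |
    | Computation speed           | Fast  | Fast     | Fast   | Slow                | Medium              | Slow   |
    | Installation complexity     | Low   | Low      | Low    | Medium              | Medium              | Low    |
    | One-stop shop               | Yes   | No       | Yes    | No                  | Yes                 | Yes    |
    | AutoML focus                | No    | No       | No     | No                  | Yes                 | Yes    |
    | Deep learning focus         | No    | No       | No     | Yes                 | No                  | No     |
    | Score                       | 4     | 3        | 2      | 2                   | 1                   | 1      |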
    Table comparing autoML libraries for time-series analysis.

    Unique Points of Each Library

    kats

    • Has a promising list of time-series models implemented, including a good selection of algorithms for change detection and time-series feature extraction.   
    • Seems good for stable/production deployment due to stable implementation and good documentation.

    pmdarima

    • Standard baseline model for time-series forecasting.
    • Modelled after its R equivalent, auto.arima.
    • Easier to use as part of a larger one-stop shop library.
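
    As a rough idea of what the pmdarima baseline looks like in practice, here is a minimal sketch. The helper name auto_arima_forecast and its default horizon and season length are assumptions for illustration; auto_arima(...) and predict(n_periods=...) are pmdarima calls.

```python
def auto_arima_forecast(y, horizon=12, season_length=12):
    """Fit a seasonal ARIMA chosen by pmdarima and forecast `horizon` steps.

    `y` is a one-dimensional sequence of observations in time order.
    The helper name and the defaults (monthly data with yearly
    seasonality) are assumptions made for this sketch.
    """
    # Imported lazily so this module still loads where pmdarima is absent.
    import pmdarima as pm

    # auto_arima searches over ARIMA orders and keeps the best candidate
    # by its information criterion; m is the seasonal period.
    model = pm.auto_arima(
        y,
        seasonal=True,
        m=season_length,
        suppress_warnings=True,
    )

    # Forecast the next `horizon` periods after the end of `y`.
    return model.predict(n_periods=horizon)
```

    Any other forecasting model can then be judged against this baseline on the same hold-out period.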

    sktime

    • Has a promising list of time-series models and time-series algorithms implemented.
    • Less complete documentation and non-standardised APIs make exploring the library slightly trickier.

    pytorch-forecasting

    • Has a focus on using deep learning models for time-series forecasting.
    • Very interesting selection of DL models for time-series forecasting.
    • The non-conventional models implemented in PyTorch Forecasting were often developed very recently.
    • Each model has millions of parameters (requires a lot of data to train) and is slow to train.

    pycaret-ts-alpha

    • Has strong potential as it is based on the pycaret framework, but it is currently in alpha.
    • Makes experimentation easy and standardised.

    autots

    • Uses a genetic algorithm to find an ensemble of the best models, with options to set up weighted metrics to evaluate model performance.
    • Very slow in building the model ensemble.
    • Documentation and tutorials are lacking, which makes AutoTS slightly more time-consuming to use than other similar packages.
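
    The idea of combining several weighted metrics into one score can be sketched in plain Python. This is an illustrative scheme of my own, not AutoTS's actual formula or API: the function names, the scale-by-best normalisation and the toy forecasts are all assumptions.

```python
def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def rmse(actual, forecast):
    """Root mean squared error; punishes large misses more than MAE."""
    return (sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual)) ** 0.5


METRICS = {"mae": mae, "rmse": rmse}


def rank_models(actual, forecasts, weights):
    """Return model names ordered best-first by a weighted metric score.

    Each metric is divided by the best (lowest) value any model achieved,
    so metrics on different scales can be added together. This scaling is
    an assumption made for this sketch.
    """
    raw = {
        name: {m: fn(actual, fc) for m, fn in METRICS.items()}
        for name, fc in forecasts.items()
    }
    # Guard against division by zero when a model is perfect on a metric.
    best = {m: max(min(r[m] for r in raw.values()), 1e-12) for m in METRICS}
    score = {
        name: sum(weights.get(m, 0.0) * r[m] / best[m] for m in METRICS)
        for name, r in raw.items()
    }
    return sorted(score, key=score.get)


# Hypothetical hold-out values and two candidate forecasts.
actual = [10, 12, 14, 16]
forecasts = {
    "steady": [11, 13, 15, 17],  # off by 1 everywhere
    "spiky": [10, 12, 14, 19],   # perfect except for one miss of 3
}

mae_first = rank_models(actual, forecasts, {"mae": 1.0})
rmse_first = rank_models(actual, forecasts, {"rmse": 1.0})
print(mae_first, rmse_first)
```

    Weighting only MAE ranks the spiky forecast first, while weighting only RMSE ranks the steady forecast first, because RMSE penalises the single large miss more heavily. The choice of weights therefore decides which model wins.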