Data Science Team & Tech Lead

Blog

  • Midpoint review of M6 competition – results

    As a quick follow-up to my last post on the midpoint review of the M6 competition, I have looked into the actual performance statistics of my entries in the first half of the competition.

    The results are surprising in a few aspects:

    • The results exhibit huge fluctuations from month to month. (Perhaps partially reflecting the current uncertain and volatile market conditions.)
    • My forecasting results are better than my own initial expectation, beating the benchmark (i.e. 0.16) in most months. (Something happened in Mar & Apr 2022 that I will explain below.)
    • My investment decision results are as bad as expected, due to the conservative strategy that I have used.
    | Submission Number | Date | Overall Rank | Performance (Forecasts) | Rank (Forecasts) | Performance (Decisions) | Rank (Decisions) |
    | --- | --- | --- | --- | --- | --- | --- |
    | 1st | Feb 2022 | 36.5 | 0.15984 | 46 | 4.28492 | 27 |
    | 2nd | Mar 2022 | 111.5 | 0.16619 | 92 | -6.08808 | 131 |
    | 3rd | Apr 2022 | 115.5 | 0.16132 | 109 | 0.04113 | 122 |
    | 4th | May 2022 | 64 | 0.15949 | 60 | -1.96181 | 68 |
    | 5th | June 2022 | 65 | 0.15294 | 6 | -0.7612 | 124 |
    | 6th | July 2022 | 18 | 0.14366 | 1 | 3.79556 | 35 |
    M6 results for Dull AI thus far, broken down by individual month. Note that the results for July 2022 are not finalized yet.

    Huge fluctuations in results

    The fluctuations in results for both the forecasting and investment decision categories are huge, changing quite wildly from month to month despite (mostly) the same approaches being used throughout the period.

    This can be explained by the widely known low signal-to-noise ratio observed in stock markets, which makes forecasting and investment decisions difficult problems to solve.

    However, the stock markets in the first half of 2022 were also more volatile than usual, due to a combination of factors at work (including the pandemic, war, recession and inflation). One example of this can be seen in the elevated levels of the CBOE Volatility Index (VIX), which roughly tracks the “fear” in the S&P500 index.

    CBOE Volatility Index (VIX) taken from Google

    Forecasting results are better than my own expectations

    My forecasting submission was able to beat the benchmark (i.e. score lower than 0.16) in 3 of the previous 5 months. The benchmark of 0.16 corresponds to a forecast that assumes no knowledge of the future ranked returns of each stock.
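
    (As a quick sanity check of that number: my understanding is that M6 scores forecasts with a Ranked Probability Score over the five rank quintiles, averaged across quintiles. Under that assumption, the minimal sketch below confirms that the uniform 0.2-per-quintile forecast has an expected score of 0.16.)

    import numpy as np

    # "No knowledge" forecast: equal 0.2 probability for each of the 5 rank quintiles.
    forecast_cum = np.cumsum([0.2] * 5)

    scores = []
    for true_quintile in range(5):
        # Cumulative one-hot outcome: 0 before the true quintile, 1 from it onwards.
        outcome_cum = (np.arange(5) >= true_quintile).astype(float)
        scores.append(np.mean((forecast_cum - outcome_cum) ** 2))

    # Each quintile is equally likely under "no knowledge", so average over outcomes.
    print(round(float(np.mean(scores)), 2))  # -> 0.16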

    Something actually happened behind the scenes in the 2 months where I failed to beat the benchmark. It mostly came down to me lacking the time to properly vet and test the solution. For the submissions in Mar and Apr 2022, my algorithm spat out results that I uploaded as usual. But when I noticed my performance had degraded severely in those 2 months, I made a few changes to the algorithm:

    • Set up unit tests around a few of the critical functions
    • Checked data sources in detail to ensure quality, and removed any low-quality data sources as inputs
    • Rewrote my own forecasting module, casting away a pre-built pipeline that I had used from a library
    • Added diagnostic checks around the models trained each time, to ensure that nothing was off during the training process (see the sketch below for the kind of check I mean)
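
    To give a flavour of the last point, below is a minimal sketch of such a diagnostic check (the function and variable names are hypothetical, not my actual pipeline): after each training run, verify that the model's scores actually separate the known best-performing stocks from the known worst-performing ones in a holdout set.

    import numpy as np

    def check_rank_separation(model, X_holdout, y_holdout, min_gap=0.0):
        """Fail loudly if the trained model scores best and worst stocks interchangeably.

        model, X_holdout, y_holdout are placeholders: y_holdout holds the realised
        rank labels of each stock as a numpy array.
        """
        scores = model.predict(X_holdout)
        top = scores[y_holdout == y_holdout.max()]     # known best performers
        bottom = scores[y_holdout == y_holdout.min()]  # known worst performers
        gap = float(np.mean(top) - np.mean(bottom))
        if gap <= min_gap:
            raise ValueError(f"Best and worst stocks scored interchangeably (gap={gap:.4f})")
        return gap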

    A funny side story: it was only after I added the diagnostic checks that I noticed the models I trained in Mar and Apr 2022 could not effectively distinguish between the worst and best-performing stocks (i.e. they scored them almost interchangeably).

    Even now, I still do not know what the exact cause was (if any), whether my algorithm is mostly bug-free now, or whether my current results are just a fluke.

    Investment decision results are as bad as expected

    My investment decision results are as bad as expected, for two reasons. Firstly, I spent the least time on this part, without having set up a proper framework (i.e. backtesting integration) even now. Secondly, and partially because of the first reason, I am running my investments using a very traditional and conservative approach, which I will not name for now as the competition is still ongoing.

    Compared to the forecasting component, my investment decision component is so simple that anyone could replicate it in an Excel file. The only more complex part is the portfolio allocation algorithm: I have a simple one written to manage risks, but it is trivial enough that it does not warrant any special mention.
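
    For illustration only (this is not my competition strategy, just an example of how little code a simple risk-managed allocation can take), an inverse-volatility weighting scheme fits in a few lines:

    import numpy as np

    def inverse_volatility_weights(returns):
        """Weight each asset inversely to its historical return volatility.

        returns: 2D numpy array of shape (n_days, n_assets).
        """
        vol = returns.std(axis=0)  # per-asset volatility
        raw = 1.0 / vol            # less volatile -> larger weight
        return raw / raw.sum()     # normalise weights to sum to 1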

    As a fun fact, I made a drastic change to my investment strategy as of the 6th submission. It will be interesting to see how it turns out.

    Ending

    While I feel nervous posting this performance summary knowing that the results for my latest (and best so far) 6th submission are still pending, I have decided to share it now just to stick to my blog post schedule.

    Hopefully you get something out of this post, and I hope it will not be laughed at later as an example of premature celebration. Let’s see how things go in the remaining half of this competition!

  • Midpoint review of M6 competition

    With June over, we are now at the halfway point of the M6 competition. It may be a good time to do a quick review of my progress and learnings from the competition so far.

    (And also to get me into the habit of regularly writing blog posts!)

    Progress in M6 competition

    For a brief period at the beginning of the M6 competition, I was among the top 20 on the leaderboard (by overall rank). But ever since then, I have been languishing between ranks 80 and 110.

    I tried a few ways to improve the results (e.g. adding unit testing, expanding the security universe), but either the results did not improve, or I lacked the time/energy to implement the ideas fully.

    As of now, I still have 24 items/ideas on my to-do list to be tested or implemented to improve my solution for the M6 competition!

    That being said, I would still say that I have achieved my original goal, which was to use the M6 competition as a motivator to build an investment pipeline (including automated data retrieval, forecasting and portfolio optimisation).

    If you are interested in my exact methodology, perhaps as a counter-example of what not to do, I will share it once the competition is over.

    Learnings from M6 competition

    1. Getting access to required investment data is hard

    Before you jump in and mention that everyone can easily get free price data from Yahoo Finance or other similar sources, I just want to say that I agree with you.

    But getting access to price data is only the first step. Typically you will also want to be able to screen securities to create your investable universe, and this screening requires more than direct price data, e.g. market cap and valuation metrics. The problem becomes even tougher if you intend to create a cross-country/cross-exchange universe.

    Assuming that you have access to a screening capability (the easiest way is to buy it from a provider), the next step is to build a stable (ideally automated) connection to your chosen data provider. This step can be tricky: how stable the provided data API is tends to correlate with how much you are willing to pay for the data service.

    Don’t even start thinking about merging data from multiple providers. Just trying to get the data index (in this case, tickers or other security identifiers) to align will drive you crazy if this is not part of your full-time job.

    Lastly, once you have all of this in place, there remains the question of data quality. I briefly tested adjusted OHLC EOD price data from a few retail investor-friendly data sources (i.e. with an annual subscription price of less than 4 digits).

    My rough conclusions are:

    • Yahoo Finance
      • Pretty good pricing data that agrees with direct data from exchanges.
      • But often has random outlier spikes (e.g. 100x price on a single day; see the sketch after this list).
      • Possible to get some fundamental data as well, but the API is very unstable.
      • Free (but in a grey area).
    • Interactive Brokers
      • Very good pricing data for recent dates.
      • But historical adjusted prices are systematically off, perhaps due to a different adjustment calculation method.
      • Requires an IB account and a running instance of TWS to get data.
      • Very cheap subscription price: single digits per month for all US pricing data.
    • EOD Historical
      • Just started testing it out, as it was also used by the M6 competition to calculate rankings.
      • No comment on data quality yet, as I have yet to test it.
      • Very user-friendly API and reasonable pricing.
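
    As a concrete example of the kind of quality check I mean, here is a minimal sketch that pulls EOD prices via the yfinance package (one common way to access Yahoo Finance data) and flags suspicious single-day spikes. The ticker and the 10x threshold are arbitrary choices for illustration.

    import yfinance as yf

    # Download EOD prices for a sample ticker; squeeze to a Series.
    prices = yf.download("AAPL", start="2020-01-01", end="2022-06-30")["Close"].squeeze()

    # Flag days where the price moves by more than 10x in either direction -
    # almost certainly a data error rather than a real market move.
    ratio = prices / prices.shift(1)
    spikes = prices[(ratio > 10) | (ratio < 0.1)]
    print(spikes)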

    2. Forecasting prices is hard (really hard)

    As with any kind of predictive modelling, forecasting stock prices is hard. Any type of price is hard to forecast, because there is often no ground truth for a price, and the relationships between the factors affecting it change often.

    Most of the time, a price reflects not the intrinsic value of an item, but how much someone is willing to pay for it at that moment in time.

    With this in mind, I have a feeling (just a hypothesis that I have not yet checked out) that an approach that models prices (or other price proxies) as point estimates is very unlikely to work out. A probabilistic approach seems to be the best bet, but it makes the computation and interpretation of the results (e.g. how to trade on the estimates) more difficult. Plus, this approach falls slightly outside my knowledge domain.

    The difficulty of this problem can be seen in the huge fluctuations on the leaderboard as well. A +/- 20 position move from month to month is not a rare occurrence, although this may also be due to (1) the current ultra-volatile stock/economic environment and (2) changes in participants’ forecasting methods across submissions.

    This is a rather long competition that lasts for one year. But I have the feeling that a stock market forecasting competition may need to run for 2-5 years to filter out the methods that are winning just due to chance. Only then can we see who is truly swimming naked. (Disclaimer: I am not implying that my method can stand the test of time, because I don’t think it can.)

    3. Setting up trading strategy and portfolio allocation is also hard

    This is another huge topic by itself, and it is often rather distinct from the stock price forecasting problem. As mentioned by Prof. Makridakis (the M6 competition organiser) in one of his LinkedIn posts, there is not a strong correlation between accurate forecasting and good investment returns.

    As I have only a rather basic understanding of how to build a profitable trading strategy and portfolio allocation, I am not able to comment much here. But I would say that in the absence of strong convictions, buying the market is not a bad idea in general.

    4. Clean code/solution structuring helps

    For most M6 participants, focusing on code/solution structuring is perhaps among the last things they would do (or so I guess, please correct me if I am wrong). What I mean by code/solution structuring is ensuring that the various parts of the code (e.g. data processing, forecasting, portfolio optimisation) are written and structured according to software engineering best practices.

    For me, this is the part that I spent the most time on. I know that some of you will be laughing at me because you think I deserve to rank near the bottom for this (again, I agree with you). But I truly enjoyed the time that I took to (1) structure my code to follow the Python package cookiecutter template, (2) incorporate CI/CD practices (e.g. using Git, pre-commit), and (3) write clean code with proper linting and docstrings.

    As I work on the codebase on a part-time basis, having a clean code structure has enabled me to dive back into parts of the codebase easily. It reduces the time I need to figure out how my code all links together, and hopefully ensures that my code is reusable if I decide to repurpose it for something else.

    Ending

    That’s all for my midpoint review. I will continue to participate in the M6 competition by making submissions, but I doubt I will have the time/energy to reverse the tide. Either way, I have already got a lot out of this competition.

    If you have read through my lengthy post, I hope you gained some useful insights (or at least had a fun read)!

  • A beginner’s guide to the folder structure generated by cookiecutter-pypackage

    cookiecutter-pypackage offers a well-equipped, standard project template for creating a Python package. However, for many first-time users, the automatically generated folder structure can be quite intimidating.

    The typical folder structure generated by cookiecutter-pypackage (https://github.com/cheeyeelim/cookiecutter-pypackage) (v1.1.2) looks like this:

    A typical cookiecutter-pypackage folder structure

    Note that all mentions of {{cookiecutter.project_slug}} and {{cookiecutter.pkg_name}} will be replaced by user-supplied strings when you create the project using cookiecutter https://github.com/cheeyeelim/cookiecutter-pypackage.git (which is derived from https://github.com/waynerv/cookiecutter-pypackage).

    (For steps on how to use cookiecutter-pypackage, please refer to https://cheeyeelim.github.io/cookiecutter-pypackage/latest/.)

    We will break down the key files and folders created below:

    .github folder

    .github
    |- workflows
    |-- dev.yml
    |-- preview.yml
    |-- release.yml
    |- ISSUE_TEMPLATE.md

    The .github folder contains configurations used by GitHub repos. The YAML files in the workflows folder specify the CI/CD steps to run using GitHub Actions (https://docs.github.com/en/actions). Each YAML file represents a workflow that can be triggered differently and can contain different steps. For example, the release.yml workflow is only triggered when a push event occurs for a tag, and it then processes the repo to publish the Python library to PyPI.

    ISSUE_TEMPLATE.md is used as the default template when a user creates an issue on GitHub for this specific project.

    {{cookiecutter.pkg_name}} folder

    {{cookiecutter.pkg_name}}
    |- __init__.py
    |- {{cookiecutter.pkg_name}}.py
    |- cli.py

    The {{cookiecutter.pkg_name}} folder should be quite self-explanatory: it holds all the core Python code needed by the Python package. __init__.py contains the meta information of the Python package, such as author and version.

    {{cookiecutter.pkg_name}}.py represents the main entry point of the Python package. cli.py specifies the available command line entry points (i.e. allowing this Python package to be run directly as a shell command, rather than through a Python script). The command line interface is provided by click (https://click.palletsprojects.com/).
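
    For reference, a minimal cli.py built with click might look like the sketch below. (The command and option names are hypothetical placeholders, not what the template generates verbatim.)

    import click

    @click.command()
    @click.option("--name", default="world", help="Who to greet.")
    def main(name):
        """Example console entry point for the package."""
        click.echo(f"Hello, {name}!")

    if __name__ == "__main__":
        main()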

    docs folder

    docs
    |- api.md
    |- changelog.md
    |- contributing.md
    |- index.md
    |- installation.md
    |- usage.md

    The docs folder holds Markdown documents that will be built by mike (https://github.com/jimporter/mike) (based on mkdocs) into the documentation for this Python package. It contains default text that should apply to most projects, but do check and adjust it manually.

    index.md and changelog.md automatically load their content from the README.md and CHANGELOG.md files at the project root. This ensures that processes that depend on README.md and CHANGELOG.md can find them easily (e.g. README.md at the project root will be used as the repo README).

    api.md is another special Markdown file: it contains API documentation automatically created from docstrings. This allows API documentation to be created with relatively little effort.
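
    For this to work well, your functions need informative docstrings. A hypothetical example is shown below (the exact docstring style to use depends on the docs plugin the template wires up):

    def scale(values, factor=2.0):
        """Multiply each value by a constant factor.

        Args:
            values: Iterable of numbers to scale.
            factor: Multiplier applied to each value.

        Returns:
            List of scaled values.
        """
        return [v * factor for v in values]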

    tests folder

    tests
    |- __init__.py
    |- test_{{cookiecutter.pkg_name}}.py

    The tests folder is the standard folder that holds testing scripts for a Python package. A very simple test script template is provided in test_{{cookiecutter.pkg_name}}.py. These test scripts will later be run using tox (https://tox.wiki/en/latest/) and pytest (https://docs.pytest.org/), which together allow easy testing under multiple system configurations in one go.
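
    As a reminder of the pytest convention, any function named test_* in this folder is discovered and run automatically. A trivial placeholder (replace it with real checks on your package code) looks like this:

    def test_smoke():
        """Trivial example test that pytest picks up automatically."""
        result = 1 + 1
        assert result == 2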

    {{cookiecutter.project_slug}} folder – project root

    {{cookiecutter.project_slug}} (project root)
    |- .bumpversion.cfg
    |- .editorconfig
    |- .gitignore
    |- .pre-commit-config.yaml
    |- CHANGELOG.md
    |- LICENSE
    |- makefile
    |- mkdocs.yml
    |- poetry.toml
    |- pyproject.toml
    |- README.md
    |- setup.cfg

    There are many files at project root level.

    .bumpversion.cfg is the configuration file for bump2version (https://github.com/c4urself/bump2version), which helps update all version strings in the source code with a single command.

    .editorconfig (https://editorconfig.org/) is a file that defines coding styles and text editor configurations for multiple IDEs.

    .gitignore should be a well-known file for anyone who has worked with Git before. It contains a list of patterns matching files and folders that should be excluded from Git tracking.

    .pre-commit-config.yaml is the configuration file for pre-commit (https://pre-commit.com/). pre-commit introduces commands (usually linters and auto-formatters) that run automatically right before git commit, and will stop the commit if any check fails.

    CHANGELOG.md should be used by the package author to record the features added and bugs fixed in each version. It should follow the changelog format defined at https://keepachangelog.com/en/1.0.0/.

    LICENSE is another common file that specifies the copyright license associated with the Python package. cookiecutter-pypackage helps generate this file based on the license type specified by the user during project setup.

    makefile contains many commonly used commands with recommended default parameters to be used with the make command. For example, make clean will clean up temporary files and caches generated during the development process.

    mkdocs.yml is the configuration file for mkdocs (https://www.mkdocs.org/), which helps generate the documentation using the templates specified in the docs folder.

    poetry (https://python-poetry.org/) is an amazing Python packaging and dependency manager that I highly recommend. poetry.toml is the configuration file for general poetry behaviour. Currently, poetry.toml contains instructions to make poetry store virtual environments under the project folder, rather than centrally.

    pyproject.toml contains general Python project-specific build configurations as specified in PEP 518 (https://www.python.org/dev/peps/pep-0518/). It holds the meta information on the Python package that poetry uses for the PyPI release, as well as all the dependency information (with associated version constraints) that poetry uses to install and manage virtual environments. Note that pyproject.toml can also contain configurations for other tools, such as isort.

    README.md, as indicated by its name, is the file that contains the overview information that introduces users to the Python package. It is usually the first document that a new user sees, so it should contain the key information needed to get started (e.g. how to install, how to run the tool).

    setup.cfg is another configuration file that contains general Python project-specific build configurations. It complements pyproject.toml by specifying configurations for other tools, such as flake8 and tox.

    Hopefully this long post can help ease your transition into working with a Python-based cookiecutter template!

  • Trying to use cookiecutter-pypackage and ended up contributing to it

    cookiecutter-pypackage provides an amazing framework for developing a Python package. The original version (https://github.com/audreyfeldroy/cookiecutter-pypackage) has been further extended by many contributors to include useful tools such as poetry and pre-commit.

    I decided to start using one of the newer versions of cookiecutter-pypackage (https://github.com/waynerv/cookiecutter-pypackage) (v1.1.1), before running into an error in one of the GitHub Actions workflows: mindsers/changelog-reader-action@v2 failed to recognise the correct version number due to an incorrectly formatted CHANGELOG.

    While fixing the bug and working through the structure of cookiecutter-pypackage, I decided to overhaul the entire project, which led to me forking it and creating my own version of cookiecutter-pypackage (https://github.com/cheeyeelim/cookiecutter-pypackage) (v1.1.2).

    This version of cookiecutter-pypackage has the following key updates:

    • Added mike to provide versioning support for documentation created using mkdocs
    • Added tox-conda to provide conda support when using tox
    • Improved testing by checking full cookiecutter project creation

    Only after all these updates were implemented did I remember that my original intention in using cookiecutter-pypackage was to create a low-code Natural Language Processing (NLP) library for exploratory data analytics. But at least now I have it working and tailored to my needs, so I finally get to create JustNLP (https://github.com/cheeyeelim/justnlp) using cookiecutter-pypackage!

  • Loading child CSS style in WordPress

    If you are a WordPress newbie who wants to add your own CSS styles on top of an existing WordPress theme using a child theme, like me, you may also be following the instructions on WordPress.

    As I was working with the twenty-twenty-one theme, which uses get_template functions to load CSS styles, I added the code template given there into a functions.php file within the child theme.

    add_action( 'wp_enqueue_scripts', 'my_theme_enqueue_styles' );
    function my_theme_enqueue_styles() {
        wp_enqueue_style( 'child-style', get_stylesheet_uri(),
            array( 'parenthandle' ), 
            wp_get_theme()->get('Version') // this only works if you have Version in the style header
        );
    }

    However, my activated child theme only loaded the parent theme and its CSS style on my website.

    Only the parent theme is loaded.

    After researching for a few hours and finally coming across this forum post, I realised that, among all the string arguments here, 'parenthandle' should be changed to point to the name of the parent CSS style as defined in the parent functions.php file, i.e. 'twenty-twenty-one-style' (if you use the twenty-twenty-one theme like me).

    Changing the code to this will solve the problem.

    add_action( 'wp_enqueue_scripts', 'my_theme_enqueue_styles' );
    function my_theme_enqueue_styles() {
        wp_enqueue_style( 'child-style', get_stylesheet_uri(),
            array( 'twenty-twenty-one-style' ), 
            wp_get_theme()->get('Version') // this only works if you have Version in the style header
        );
    }

    And your child theme should now be loaded.

    Now both parent and child themes are loaded.
  • Time Series Forecasting – sktime

    Code template for running time series forecasting in sktime.

    Link to website: https://sktime.org/

    Link to repository: https://github.com/alan-turing-institute/sktime

  • Time Series Forecasting – pytorch-forecasting

    Code template for running time series forecasting in pytorch-forecasting.

    Link to website: https://pytorch-forecasting.readthedocs.io/en/stable/

    Link to repository: https://github.com/jdb78/pytorch-forecasting

  • Time Series Forecasting – pmdarima

    Code template for running time series forecasting in pmdarima.

    Link to website: http://alkaline-ml.com/pmdarima/

    Link to repository: https://github.com/alkaline-ml/pmdarima

  • Time Series Forecasting – kats

    Code template for running time series forecasting in kats.

    Link to website: https://facebookresearch.github.io/Kats/

    Link to repository: https://github.com/facebookresearch/Kats

  • Time Series Detection – kats

    Code template for running time series detection in kats.

    Link to website: https://facebookresearch.github.io/Kats/

    Link to repository: https://github.com/facebookresearch/Kats