Data Science Team & Tech Lead

Tag: Software Engineering

  • Brief Review on GitHub Copilot

    Time really does fly. It is now almost the end of 2024.

    To close off 2024, I will be writing a post on a different topic each week until the new year arrives.

    My first post is about GitHub Copilot.

    I’m rather late to the game in terms of adopting GitHub Copilot for my personal projects.

    But it has really blown me away so far.

    Copilot helped me navigate the complex territory of Kubernetes/Helm YAML manifests, but was less helpful when I was working with polars.

    Some quick pros and cons are listed below.

    Pros:

    ➕ Amazing context search ability based on currently opened files.

    When asked a question, it automatically searches the files currently open in VS Code for relevant context to produce a more relevant answer. This means it can suggest functions/methods from libraries that you are using, and variable/column names that follow your conventions.

    ➕ Great at explaining hard-to-search technical terms (e.g. special characters in Bash, regex).

    In the days before LLMs, it was really hard to search for special characters on Google, especially if you did not know what they were called in English. But Copilot has no problem breaking down a string of special characters and explaining them one by one. In fact, Copilot taught me about heredocs in Bash.

    Cons:

    ➖ Not useful on newer or rapidly changing libraries (e.g. polars).

    Copilot does suggest wrong syntax from time to time, but it suffers the most when asked to work with newer or rapidly changing libraries. With polars, it kept suggesting older APIs, e.g. with_column and groupby, instead of with_columns and group_by (illustrated in the short sketch below).

    ➖ Can suggest convoluted solutions when simpler ones exist.

    To illustrate with a recent example: when asked how to access a single value in a polars DataFrame, Copilot suggested selecting a column and converting it into a Series before accessing the value by index, when in reality the value can be accessed directly with square brackets or item().
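
    To make the two cons above concrete, here is a small polars sketch with made-up data (API names are current as of recent polars releases):

        import polars as pl

        df = pl.DataFrame({"group": ["a", "a", "b"], "value": [1, 2, 3]})

        # Current API names that Copilot kept getting wrong:
        df = df.with_columns((pl.col("value") * 2).alias("value_x2"))  # not with_column
        summary = df.group_by("group").agg(pl.col("value").sum())      # not groupby

        # Accessing a single value directly, instead of column -> Series -> index:
        first_value = df[0, "value"]
        total_b = summary.filter(pl.col("group") == "b")["value"].item()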

  • Technical Debt vs. Mortgage: A Data Science Homeowner’s Guide

    (I used ChatGPT to help me make the written content more “engaging” and “LinkedIn-like”, so I am keeping the 2 versions below for comparison purposes.)


    [ChatGPT rewritten version]

    Building a minimal viable product (MVP) in data science is like buying your first home with the maximum mortgage.

    It’s often necessary to move quickly and show business value (aka “get a place to live in”), but in doing so, we often accumulate a mountain of technical debt—just like a hefty mortgage.

    But here’s the thing: While you’re using the data science product (or living comfortably in your home), don’t forget to pay down that technical debt—just like you wouldn’t skip your mortgage payments!

    Sure, you might get by without addressing it for a while, but trust me, no one wants to be hit with a foreclosure notice or an unmanageable pile of tech debt later on.

    The key takeaway? Keep building, but always have a plan to pay it down. Your future self will thank you!


    [Original version]

    Building a minimal viable product in data science is like buying your first home using maximum mortgage.

    It is often a necessity to do this to show business value (get a place to live in) fast, which means accumulating a huge amount of technical debt (mortgage) along the way.

    However we should not forget that while using the data science product (or living in your home), it is important to pay down the technical debt (mortgage) periodically.

    While it may be possible to get away without paying down the technical debt for quite some time, I would definitely not recommend anyone skipping their mortgage payments!

  • Onto Kubernetes

    I have always been told that using Kubernetes is too complex and overkill for most purposes.

    That put me off for years, until I finally decided to take the plunge into the Kubernetes world 2 months back, embarking on a mission to migrate my entire personal stack onto Kubernetes.

    The tipping point came when it became increasingly hard to manage the 4 virtual machines, 7 applications, and 10+ containers. Manually managing the infrastructure and resources took up all my free time, leaving little time for actual development.

    Heeding the warnings of others, I approached Kubernetes cautiously, spending the first month reading a book on the basics (Kubernetes in Action by Marko Luksa).

    By the end of the first month, I thought I was ready, as I had experience with container technology and all my applications were dockerised. So I spun up my first-ever Kubernetes cluster (managed service obviously) to begin my migration.

    I ended up spending another 2 weeks fighting with Helm and helmfile (as I had sworn to work off manifests only, without relying on ad-hoc commands for everything).

    And another 2 weeks to get my web services accessible from the outside (e.g. load balancer, TLS – why are some Kubernetes settings done via annotations?).

    Maybe I was initially too optimistic, but I have now finally managed to get my key services running smoothly on Kubernetes.

    So what is my take on Kubernetes for now?

    The complexity seems to be manageable, as long as you have some knowledge of system admin and container technology. Without that knowledge though, I can see how hard it will be to debug any deployment that goes wrong, trying to dig through layers upon layers of abstraction provided by Kubernetes.

    In terms of cloud computing costs, my bill was almost exactly the same pre- and post-migration, despite using a managed Kubernetes service.

    Hopefully, this will not become my famous last words down the road.

  • New additions to family – Traefik and Airflow

    Added Traefik and Airflow to the family of services behind my personal websites.

    Traefik – an amazing modern reverse proxy that integrates extremely well with docker containers, saving me a lot of trouble with manual configuration (looking at you, nginx).

    Traefik makes it trivially simple to route internet traffic to multiple Dash docker containers, just by adding labels to docker compose services.

    Despite the flexibility offered by Traefik, I still feel more comfortable using nginx as the first-layer reverse proxy, as it has worked very well for me for a long time.

    Airflow – an industry-standard tool for scheduling workflows. I finally have a proper way to run and schedule long-running tasks, without having to resort to manual execution.

    Setting up DAGs to run on Airflow is relatively easy. What I did not expect was the complexity of setting up the Airflow infrastructure itself: Airflow is not a single service, but multiple services that talk to one another.

    Configuring them took some time, but the official Airflow docker image greatly simplified the process. That being said, standing up Airflow almost doubled my shoestring cloud budget.

    Now, time to write and get some DAGs running!
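
    To give a flavour, here is a minimal sketch of what such a DAG could look like (assuming a recent Airflow 2.x release; the DAG id, schedule, and command are made up for illustration):

        from datetime import datetime

        from airflow import DAG
        from airflow.operators.bash import BashOperator

        # One daily task; a real DAG would chain several tasks together.
        with DAG(
            dag_id="nightly_backup",            # hypothetical DAG id
            start_date=datetime(2024, 1, 1),
            schedule="@daily",
            catchup=False,
        ) as dag:
            backup = BashOperator(
                task_id="run_backup",
                bash_command="echo 'backing up...'",  # placeholder command
            )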

  • FIRE planner

    Built a new dashboard to help you plan financially for a potential FIRE (financial independence, retire early).

    Link to dashboard : https://cheeyeelim.com/apps/fireplanner

    It takes a few inputs to help you visualize your (+ your partner’s) personal cash flows throughout your lifetime.

    Besides simple income and expense adjustments, it also simulates housing/mortgage and child-related expenses.

    Unfortunately, the dashboard is specific to Singapore-based residents for now, as it uses Singapore average values (e.g. costs to raise children) and incorporates only the Singapore retirement scheme (i.e. CPF).

    All parameters are based on point estimates for now (e.g. inflation, investment return), so complex scenario simulations are not supported.
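
    As a rough illustration of what a point-estimate projection means (this is not the dashboard’s actual code; the figures and assumptions below are made up):

        # Project savings year by year using single point estimates,
        # rather than simulating a range of scenarios.
        def project_net_worth(
            savings: float = 50_000,          # hypothetical starting savings
            income: float = 60_000,           # annual income
            expenses: float = 40_000,         # annual expenses
            inflation: float = 0.03,          # point estimate
            investment_return: float = 0.05,  # point estimate
            years: int = 40,
        ) -> list[float]:
            balances = []
            for _ in range(years):
                savings = savings * (1 + investment_return) + (income - expenses)
                income *= 1 + inflation       # assume income keeps pace with inflation
                expenses *= 1 + inflation
                balances.append(savings)
            return balances

        print(round(project_net_worth()[-1]))  # projected balance at the end of the horizon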

    p/s : This dashboard took me longer than expected to build, not due to the complexity of the simulation, but the high number of user inputs supported.

  • Infrastructure and framework behind my personal websites

    I decided to set up my own website at the end of 2020.

    3 years later, I run 2 websites backed by multiple supporting services (see image below), all set up and operated by myself.

    My goals are (1) to set up a robust infrastructure that can ensure my websites/services are always up, and (2) to set up a development framework that minimises maintenance effort.

    For the infrastructure, each service is dockerised with custom images and deployed on my favourite cloud service provider (DigitalOcean).

    Uptime monitor (UptimeRobot) and web analytics service (Google Analytics) have been set up to constantly check the status of the services.

    As for the development framework, I develop locally on VS Code with Windows Subsystem for Linux (WSL), with enforced linting and formatting via pre-commit hooks.

    Code is pushed to repos on GitHub, while images are pushed to the container registry on Docker Hub.

    I paid special attention to code quality, especially for the Python code, to make maintenance easier. But the overall code quality is not as high as I would like, because I need to work with multiple languages (i.e. Python, Bash, Javascript, PHP, HTML/CSS, SQL) on this stack and I am less familiar with some of them.

    So far I am quite on track with these goals, with (1) the services maintaining 99.5% uptime each year over the past 3 years and (2) each service taking about 3-4 hours of maintenance time per year. Granted, I am not operating high-volume or complex websites, but even this requires some discipline.

    I realise there are some parts that are still missing from this stack/setup, for example, full CI/CD integration, Kubernetes for service deployment, and MLOps services.

    But perhaps I should stop tinkering with the infrastructure, and start creating more content?

  • Makefile

    A small titbit to share today: the Makefile.

    A Makefile can be used to define sets of commonly used commands, to save time and to ensure the commands run in the correct order with the needed prerequisites.

    For example, you can define a list of build-related commands under a target called “build”.

    # GPG_TTY is needed so GPG can prompt for the password when docker authenticates (see the p/s below)
    export GPG_TTY := $(shell tty)

    build:
        docker-compose build image-1
        docker-compose push image-1

    Then next time you can execute the build by calling “make build”, instead of manually typing out all the commands in sequence.

    Recently I have started to use it more often, as it really simplifies the development and deployment steps.

    (p/s: In case you are wondering about the GPG_TTY environment variable, that is needed for GPG to properly prompt for the password when docker is authenticating with its private container registry.)

  • Things they didn’t teach you in software engineering

    Whenever you feel disillusioned by the mismatch between what you were taught in university/bootcamp and what you actually work on in a job, I recommend reading this article: https://vadimkravcenko.com/shorts/things-they-didnt-teach-you/.

    It is written for software engineers, but many of the points apply to data science/analytics as well.