Data Science Team & Tech Lead

Tag: Cloud Deployment

  • Onto Kubernetes – Part 4

    With the Prometheus-Loki-Grafana stack deployed, the observability layer on the Kubernetes cluster is now complete. Operational metrics gathered by Prometheus and logs aggregated by Loki are finally fed into Grafana for easier visualisation.

    I have spoken about Prometheus (https://prometheus.io/) in my last post, so I will not repeat it here.

    In terms of Loki (https://grafana.com/oss/loki/), it is easier to set up than I originally thought, given the number of components that make it up (12 in total). Roughly half of these are core components that Loki needs to function (e.g. distributor, ingester), while the other half are optional supporting components that can be safely turned off (e.g. query scheduler, table manager).
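
    For illustration, a minimal helm values sketch that disables some of the optional components might look like the snippet below. The exact keys depend on the Loki chart and version, so treat the names here as assumptions to be checked against your chart's values file.

      # values-loki.yaml (sketch; key names vary across Loki helm charts/versions)
      queryScheduler:
        enabled: false   # optional: the query frontend can talk to queriers directly
      tableManager:
        enabled: false   # optional: not needed with newer index stores
      ruler:
        enabled: false   # optional: only needed for recording/alerting rules in Loki
      distributor:
        replicas: 1      # core: receives and validates incoming log streams
      ingester:
        replicas: 1      # core: writes log chunks to storage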

    Most of the heavy lifting is done by Promtail, as it automatically discovers target logs to be scraped and pushes them to Loki. In contrast, a separate exporter needs to be set up per pod/service to make metrics visible to Prometheus, which involves a lot more effort. That being said, Promtail has now been deprecated (https://grafana.com/docs/loki/latest/send-data/promtail/).
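
    As a rough sketch of why so little per-service work is needed: a single Promtail scrape config with Kubernetes service discovery covers every pod on a node. The config below is trimmed down, and the push URL and label names are assumptions for illustration.

      # promtail-config.yaml (minimal sketch)
      clients:
        - url: http://loki-gateway/loki/api/v1/push   # assumed Loki push endpoint
      scrape_configs:
        - job_name: kubernetes-pods
          kubernetes_sd_configs:
            - role: pod                # discover every pod via the Kubernetes API
          relabel_configs:
            - source_labels: [__meta_kubernetes_namespace]
              target_label: namespace  # carry the namespace over as a Loki label
            - source_labels: [__meta_kubernetes_pod_name]
              target_label: pod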

    As for Grafana (https://grafana.com/), on the surface it seems like a simple dashboard solution built mostly for observability purposes, but it impressed me in a few ways.

    Grafana is very efficient in its resource usage relative to how fast it performs. The entire Grafana instance can run on less than 200 MB of memory while doing live refreshes of data-heavy dashboards. Other dashboard solutions I have worked with would not cope with the same amount of data at the same refresh rate without significant configuration effort (setting up caching, etc.).

    As I am working with the Grafana Operator (https://github.com/grafana/grafana-operator), configuring data sources and dashboards in Grafana is very easy. I just have to define GrafanaDashboard and GrafanaDataSource CRDs, and Grafana picks them up automatically.
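
    A minimal sketch of what these custom resources look like (based on the operator's v1alpha1 API; the data source URL, labels and dashboard JSON below are placeholder assumptions):

      apiVersion: integreatly.org/v1alpha1
      kind: GrafanaDataSource
      metadata:
        name: prometheus-ds
      spec:
        name: prometheus-ds.yaml
        datasources:
          - name: Prometheus
            type: prometheus
            url: http://prometheus-operated:9090   # assumed in-cluster Prometheus service
            access: proxy
      ---
      apiVersion: integreatly.org/v1alpha1
      kind: GrafanaDashboard
      metadata:
        name: cluster-overview
        labels:
          app: grafana            # must match the dashboard selector on the Grafana instance
      spec:
        json: |
          { "title": "Cluster overview", "panels": [] }   # raw dashboard JSON goes here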

    The definition of Grafana dashboards in JSON format is interesting as well. It is easy to version control and can be modified in its raw text form. The only complaint I have is that the public gallery of Grafana dashboards (https://grafana.com/grafana/dashboards/) offers a relatively limited selection. In my anecdotal experience, most dashboard submissions there are outdated, so they are not deployable out of the box.

    Besides observability for metrics and logs, another common observability implementation is to monitor internal network traffic using a service mesh like Linkerd (https://linkerd.io/). I will leave this for another day, purely due to a lack of time.

    The last piece of core supporting Kubernetes infrastructure that I will set up next is backups of Kubernetes resources and volumes using Velero (https://velero.io/). While most of my Kubernetes deployments are stateless, I do run a few databases as well, and losing their data would be a disaster without a backup.

  • Onto Kubernetes – Part 3

    I am still working on the Kubernetes stack behind my personal website whenever I have some free time.

    The goal is still the same – to build my own personal Kubernetes-powered data science/machine learning production deployment stack (and yes, I know about Kubeflow/AWS SageMaker/Databricks/etc.).

    However, my key objective is no longer to find out whether using Kubernetes saves maintenance effort (short answer – not much at a small scale), but to see what a best-practice end-state Kubernetes stack looks like and how much effort is needed to get there.

    So what have I been up to? Some of my time in this period has been spent on fixing minor issues that were not noticed during the initial deployments.

    Example 1: my WordPress pod was losing my custom theme every time the pod restarted. Why? Because the persistent volume seems to get overwritten each time by the Bitnami WordPress helm chart that I am using. The solution? I implemented a custom init container that repopulates the WordPress root directory by pulling a backup from S3.
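
    Roughly, the init container looks like the sketch below. The bucket name, paths and the values hook are placeholder assumptions; check the exact option supported by your chart version.

      # helm values sketch for the Bitnami WordPress chart (hook name assumed)
      initContainers:
        - name: restore-theme
          image: amazon/aws-cli:2.15.0
          command: ["sh", "-c"]
          args:
            - aws s3 sync s3://my-backup-bucket/wordpress/ /bitnami/wordpress/   # hypothetical bucket
          volumeMounts:
            - name: wordpress-data        # the chart's persistent volume for /bitnami/wordpress
              mountPath: /bitnami/wordpress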

    Example 2: a subset of my pods had been crashing regularly because a node kept becoming unhealthy. Why? Because my custom Airflow and Dash containers seem to have unknown memory leaks, leading to resource starvation on the node and causing pods to be evicted. The solution? I manually set custom resource requests and limits for all Kubernetes containers after monitoring their typical utilisation. (I had been putting this off for a while, thinking I could get by fine, but this incident proved me wrong.)
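
    For reference, requests and limits are set per container in the pod spec; the numbers below are purely illustrative and should come from observed usage.

      # container spec fragment (illustrative numbers)
      resources:
        requests:
          cpu: 100m        # what the scheduler reserves for the container
          memory: 256Mi
        limits:
          cpu: 500m        # throttled above this
          memory: 512Mi    # OOM-killed above this, instead of starving the node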

    The majority of my time has been spent on setting up proper (1) secret management (using Hashicorp Vault + External Secrets Operator) and (2) monitoring (using Prometheus + Grafana) stacks.

    On secret management: HashiCorp Vault + External Secrets Operator have been relatively easy to use, with well-constructed and well-documented helm charts.

    The concept behind HashiCorp Vault is relatively easy to understand (i.e. think of it as a password manager). The key trap for any beginner is the sealing/unsealing part. The vault is initialised with a root token and a number of unseal keys, and it needs to be unsealed with those keys before it is functional. But if the Vault instance ever gets restarted, it becomes sealed again and no one can read the secrets (i.e. passwords) stored in it until it is unsealed once more.

    A sealed vault needs to be manually unsealed unless you have auto-unseal implemented. However, implementing auto-unseal requires another secure key/secret management platform, which turns the problem into a chicken-and-egg one. This is one area that I feel is better solved with a managed solution (which unfortunately DigitalOcean does not offer at the moment).

    External Secrets Operator (ESO) works great, but it does take some time to understand the underlying concepts. In short: Vault <- SecretStore <- ExternalSecret <- Secret. To get a Kubernetes Secret created automatically, one needs to define an ExternalSecret (which tells ESO which secret to retrieve and create) and a SecretStore (which tells ESO where and how to access the vault). The key beginner trap here is the creation and deletion policy: if not set properly, secrets may be automatically deleted by garbage collection, bringing down services in Kubernetes (since most services rely on secrets in one form or another).
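
    A minimal sketch of the two resources (the Vault address, auth method, paths and key names are placeholder assumptions):

      apiVersion: external-secrets.io/v1beta1
      kind: SecretStore
      metadata:
        name: vault-store
      spec:
        provider:
          vault:
            server: http://vault.vault.svc:8200   # assumed in-cluster Vault address
            path: secret                          # KV mount path
            version: v2
            auth:
              tokenSecretRef:                     # one of several supported auth methods
                name: vault-token
                key: token
      ---
      apiVersion: external-secrets.io/v1beta1
      kind: ExternalSecret
      metadata:
        name: postgres-credentials
      spec:
        refreshInterval: 1h
        secretStoreRef:
          name: vault-store
          kind: SecretStore
        target:
          name: postgres-credentials    # the Kubernetes Secret that ESO will create
          creationPolicy: Owner         # the beginner trap: controls creation/garbage collection
          deletionPolicy: Retain        # keep the Secret even if the ExternalSecret goes away
        data:
          - secretKey: password
            remoteRef:
              key: postgres             # path of the secret in Vault (assumed)
              property: password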

    On monitoring: Prometheus is a very well-established and well-documented tool, so setting it up with a helm chart is a breeze (in fact there are so many Prometheus helm chart implementations that you can definitely pick one that suits your needs). In short, one way Prometheus works is: Prometheus -> Prometheus operator -> ServiceMonitor -> Service -> Exporter -> Pod/Container to be monitored. The key beginner trap here is to think of Prometheus as just another service, when it is in fact a stack of services. The first time the Prometheus pods were spun up after installation, my nodes were completely full, with two Prometheus pods unable to be scheduled.
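
    The ServiceMonitor is the piece that glues this chain together. A minimal sketch, where the labels and port name are assumptions that must match the operator's selector and the target Service:

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: postgres-metrics
        labels:
          release: prometheus       # must match the operator's serviceMonitorSelector
      spec:
        selector:
          matchLabels:
            app: postgres           # selects the Service exposing the exporter
        endpoints:
          - port: metrics           # named port on that Service
            interval: 30s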

    The complexity of Prometheus comes from the sheer number of services, i.e. the main Prometheus, the Prometheus operator, Alertmanager and many, many different types of exporters. While most services/helm charts have great support for Prometheus (i.e. they already expose metrics in Prometheus format), the challenge lies in getting these metrics to Prometheus, as more often than not you need an exporter. The exporter can run centrally (e.g. the kube-state-metrics exporter), run on each node (e.g. the node exporter), or, most often, run as a sidecar inside the pods (e.g. the Apache exporter for WordPress, the Flask exporter for Flask, the Postgres exporter for Postgres). Configuring all these exporters to make metrics visible to Prometheus is not hard, but it is definitely laborious.
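
    As an example of the sidecar pattern, a Postgres exporter can be added as a second container in the database pod. The image tag and connection string below are illustrative assumptions.

      # extra sidecar container in the PostgreSQL pod spec (illustrative)
      - name: postgres-exporter
        image: quay.io/prometheuscommunity/postgres-exporter:v0.15.0
        env:
          - name: DATA_SOURCE_NAME    # connection string the exporter reads metrics through
            value: postgresql://exporter:password@localhost:5432/postgres?sslmode=disable
        ports:
          - name: metrics
            containerPort: 9187       # default port the exporter listens on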

    For now, I have managed to get all metrics fed into Prometheus, except for Dash, for which I could not find a pre-built exporter. The next steps will be to spin up Grafana so that I can better visualise the metrics, and to set up some key alerting rules. With this, hopefully I can avoid having my Dash instances stuck in a crash loop for a month due to missing secrets without me knowing about it.

    After getting Prometheus + Grafana up and running, Loki, the log aggregation system, will be next. However, the number of services that come with Loki does scare me as well.

  • Onto Kubernetes – Part 2

    About 2 months ago, I started migrating my entire personal stack onto Kubernetes from regular virtual servers.

    So what has happened in the meantime? Have I freed up more operational maintenance time for more interesting data science development work yet?

    Unfortunately the answer is no, at least for now.

    It turns out that migrating Airflow and MLflow onto Kubernetes is harder than I thought. This is because both of these tools require multiple backend services to run smoothly, including a relational database (PostgreSQL in my case) and an in-memory database (Redis).

    Previously to speed up my development progress, I had been using managed instances of PostgreSQL and Redis offered by DigitalOcean. They are extremely easy to set up and I was able to start using them within minutes.

    However, I eventually ran into weird runtime issues in Airflow and MLflow that ultimately boiled down to specific configuration issues within PostgreSQL and Redis. While managed services are easier to get started with, debugging and customising them is typically harder due to restricted access to certain logs and backend configurations.

    So I told myself: if I can work with managed PostgreSQL and Redis, how hard could it be to self-host them directly in Kubernetes, which would give me the freedom to customise them to work with Airflow and MLflow as needed?

    Or so I thought.

    I spent the next few days properly exposing PostgreSQL and Redis ports via ingress-nginx, then another few days setting up PgBouncer connection pooling for PostgreSQL, then another few days setting up the Airflow environment to work with a custom DAG package, then another few days making sure all services interact correctly with the new self-hosted PostgreSQL and Redis instances.
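
    For the curious, exposing non-HTTP ports through ingress-nginx is done via its TCP services ConfigMap rather than an Ingress resource. A sketch, where the namespaces and service names are assumptions:

      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: tcp-services                      # referenced by the controller's --tcp-services-configmap flag
        namespace: ingress-nginx
      data:
        "5432": "databases/postgresql:5432"     # <namespace>/<service>:<port> for PostgreSQL
        "6379": "databases/redis-master:6379"   # and for Redis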

    After many “few more days” than I expected, my entire personal stack is finally fully migrated onto Kubernetes (components as shown in the attached diagram).

    So what’s next, you ask? Is the platform all set and ready to go?

    Not yet, unfortunately. To make sure the Kubernetes-based platform can survive for longer with minimal maintenance, I will be setting up proper secret management, a monitoring solution and CI/CD integration next.

    Another “few more days” to go eh?

  • Onto Kubernetes

    I have always been told that using Kubernetes is too complex and overkill for most purposes.

    That put me off for years, until I finally decided to take the plunge into the Kubernetes world 2 months back, embarking on a mission to migrate my entire personal stack onto Kubernetes.

    The tipping point for me arrived when it became increasingly hard to manage the 4 virtual machines, 7 applications, and 10+ containers. The manual management of infrastructure and resources took up all my free time, leaving little time for actual development.

    Heeding the warnings of others, I approached Kubernetes cautiously, spending the first month reading a book on the basics (Kubernetes in Action by Marko Luksa).

    By the end of the first month, I thought I was ready, as I had experience with container technology and all my applications were dockerised. So I spun up my first-ever Kubernetes cluster (managed service obviously) to begin my migration.

    I ended up spending another 2 weeks fighting with helm and helmfile (as I had sworn to work off manifests only, without command-lining everything).
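
    The manifest-only workflow essentially boils down to declaring every release in a helmfile; a stripped-down sketch, where the repositories and release names are just examples:

      # helmfile.yaml (stripped-down sketch)
      repositories:
        - name: bitnami
          url: https://charts.bitnami.com/bitnami
        - name: ingress-nginx
          url: https://kubernetes.github.io/ingress-nginx
      releases:
        - name: wordpress
          namespace: web
          chart: bitnami/wordpress
          values:
            - values/wordpress.yaml     # hypothetical per-release values file
        - name: ingress-nginx
          namespace: ingress-nginx
          chart: ingress-nginx/ingress-nginx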

    And another 2 weeks to get my web services accessible from the outside (e.g. load balancer, TLS – why are some Kubernetes settings done via annotations?).
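
    The annotation-driven configuration looks roughly like this on an Ingress (the issuer name, host and backend service are placeholders, and it assumes cert-manager and ingress-nginx are installed):

      apiVersion: networking.k8s.io/v1
      kind: Ingress
      metadata:
        name: website
        annotations:
          cert-manager.io/cluster-issuer: letsencrypt-prod   # TLS certificate handled via an annotation
      spec:
        ingressClassName: nginx
        tls:
          - hosts: [www.example.com]
            secretName: website-tls
        rules:
          - host: www.example.com
            http:
              paths:
                - path: /
                  pathType: Prefix
                  backend:
                    service:
                      name: wordpress     # assumed backend service
                      port:
                        number: 80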

    Maybe I was initially too optimistic, but at least I have now finally managed to get my key services running smoothly on Kubernetes.

    So what is my take on Kubernetes for now?

    The complexity seems manageable, as long as you have some knowledge of system administration and container technology. Without that knowledge, though, I can see how hard it would be to debug any deployment that goes wrong, digging through the layers upon layers of abstraction provided by Kubernetes.

    In terms of cloud computing costs, they were almost exactly the same before and after the Kubernetes migration, despite using a managed Kubernetes service.

    Hopefully, this will not become my famous last words down the road.

  • New additions to family – Traefik and Airflow

    Added Traefik and Airflow to the family of services behind my personal websites.

    Traefik – an amazing modern reverse proxy that integrates extremely well with docker containers, saving me a lot of trouble in manual configuration (looking at you, nginx).

    Traefik makes it trivially simple to route internet traffic to multiple Dash docker containers, just by adding labels to docker compose services.
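
    For example, a Dash service only needs a few labels for Traefik to pick it up. The image, hostname and entrypoint below are illustrative, and the sketch assumes Traefik v2 with its Docker provider enabled.

      # docker-compose.yml fragment (illustrative)
      services:
        dash-app:
          image: my-dash-app:latest          # hypothetical Dash image
          labels:
            - "traefik.enable=true"
            - "traefik.http.routers.dash.rule=Host(`dash.example.com`)"
            - "traefik.http.routers.dash.entrypoints=websecure"
            - "traefik.http.services.dash.loadbalancer.server.port=8050"  # Dash's default port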

    Despite the flexibility offered by Traefik, I feel more comfortable keeping nginx as the first-layer reverse proxy, as it has worked very well for me for a long time.

    Airflow – an industry-standard tool for scheduling workflows. I finally have a proper tool to run and schedule long-running tasks, without having to resort to manual executions.

    Setting up DAGs to run on Airflow is relatively easy. But what I did not expect was the complexity of setting up the Airflow infrastructure. In essence, Airflow consists not of a single service, but of multiple services that talk to one another.

    Configuring them took some time, but the official Airflow docker image has greatly simplified this process. That being said, standing up Airflow almost doubled my shoestring cloud budget.
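
    To give a sense of the footprint, a Celery-based Airflow deployment under docker compose typically involves something like the services below. This is a trimmed sketch of the moving parts, not the full official compose file, and the image tags are illustrative.

      # docker-compose.yml sketch of the moving parts
      services:
        postgres:                    # metadata database
          image: postgres:15
        redis:                       # Celery message broker
          image: redis:7
        airflow-webserver:
          image: apache/airflow:2.9.0
          command: webserver
        airflow-scheduler:
          image: apache/airflow:2.9.0
          command: scheduler
        airflow-worker:
          image: apache/airflow:2.9.0
          command: celery worker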

    Now, time to write and get some DAGs running!