Data Science Team & Tech Lead

Blog

Delegating to chatGPT & Stable Difussion

This week, I am jumping on the bandwagon by delegating the task of creating my LinkedIn post to chatGPT (for text) and Stable Diffusion (for image) with the following prompts.

⌨ chatGPT : Can you write an opinion article, discussing your opinion on this topic, “In the future, do you see a shift in companies from hiring more data scientists, to hiring more specialized roles such as data analysts, data engineers or machine learning engineers?”

? Stable Diffusion : A realistic photo of a tech geek opening a door to the future with bright light coming out of the door

———————————————————————

In my opinion, the future will likely see a shift in companies from hiring more generalized data scientists to hiring more specialized roles such as data analysts, data engineers, and machine learning engineers.

As data continues to play an increasingly important role in driving business decisions, companies will need to have experts who can not only analyze and interpret data, but also be able to build and maintain the infrastructure necessary to store, process, and extract insights from large and complex data sets.

Data analysts, for example, are experts at using data to help organizations make informed decisions. They are skilled at using a variety of tools and techniques to collect, process, and analyze data, and they are able to present their findings in a way that is easily understandable to non-technical stakeholders.

Data engineers, on the other hand, are focused on building and maintaining the infrastructure that is needed to store, process, and analyze data at scale. They are experts at designing and implementing scalable and reliable data pipelines, and they are able to work with a variety of technologies and tools to ensure that data is available and accessible to the people who need it.

Machine learning engineers, meanwhile, are focused on building and deploying machine learning models that can be used to make predictions and automate complex tasks. They are skilled at using a variety of machine learning algorithms and techniques, and they are able to work with large and complex data sets to build models that can accurately predict outcomes and automate tasks.

Overall, I believe that the future will see a shift towards hiring more specialized roles such as data analysts, data engineers, and machine learning engineers, as these roles are better equipped to handle the increasing complexity and scale of data that organizations are dealing with. By hiring experts in these specialized roles, companies will be able to more effectively harness the power of data to drive business decisions and stay competitive in an increasingly data-driven world.

2023-01-30
State of DS Survey by Anaconda – Skill Gaps

According to the State of Data Science survey done by Anaconda, the top 5 most important skill gaps in data science are:

⭐️ Engineering skills

⭐️ Probability and statistics

⭐️ Business knowledge

⭐️ Big data management

⭐️ Communication skills

These skill gaps cut across multiple knowledge domains, from technical skills to soft skills, reflecting the needs for data scientists or a data science team to possess these skills to ensure project success.

Also note that these are mostly fundamental skills, rather than specialised skills such as NLP. Perhaps reflecting the nascency of data science in most organisations.

2022-11-29
Illustrated stable diffusion from Jay Alammar

If you are interested in knowing how Stable Diffusion generates amazing AI arts, but are put off by the steep technical details,

This illustrated guide from Jay Alammar should help you, https://jalammar.github.io/illustrated-stable-diffusion/.

The guide helpfully breaks down the model into components and substitutes complex equations with simple flowcharts.

P/S: I also highly recommend his illustrated guide on Transformers (https://jalammar.github.io/illustrated-transformer/), which explains Transformers really well.

2022-11-22
Short review of Designing Machine Learning System

I have finally finished “Designing Machine Learning Systems” after a few weekends of focus reading. It is one of the rare technical books that I finished in its entirety, and I thoroughly enjoyed it.

Just to offer a quick book review below.

The book is amazing in the following aspects:

✅Provides a high-level overview of practical aspects in ML system design

✅Shares best practices on bringing ML models to production at scale

✅Offers insights on the latest industry-adopted trends and tools

However the book is not meant for nor offers the following:

ℹ️A step-by-step guide to setup an ML system

ℹ️Theoretical explanations on ML concepts

ℹ️Use case/project recipes to use as starting templates

That being said, I have gained a lot of practical tips from the book, and looking forward to putting some of the knowledge I learned into practice.

Thanks Chip Huyen for the great book and looking forward to the next edition!

2022-11-15
Automatic speech recognition – Whisper OpenAI

Whisper is a recently released transformer-based automatic speech recognition (ASR) model from OpenAI.

It can be used for:

?Language identification

?Voice activity detection

?Multi-lingual speech recognition

?Multi-lingual speech translation

When evaluated on the ESB datasets (including LibriSpeech, Common Voice), Whisper outperformed Conformer RNN-T from NVidia and Wav2Vec2 from Meta.

Link to blog: https://openai.com/blog/whisper/
Link to repo: https://github.com/openai/whisper
Link to benchmarking study: https://arxiv.org/abs/2210.13352

2022-11-08
Data versioning

“Data versioning is like flossing. Everyone agrees it’s a good thing to do, but few do it.” ~ Chip Huyen, Designing Machine Learning Systems

Unlike code versioning, it is a lot more difficult to implement data versioning in data science / machine learning projects.

It is because of the following reasons:

➡️ Data is often larger than codes.

➡️ Varying definitions of what constitutes a difference between two data versions and how to resolve merge conflicts.

➡️ Regulations on data protection and privacy make keeping historical data difficult.

Do you floss… erm version your data often?

2022-11-01
Concept drift vs data drift vs covariate drift

Do you always get confused among concept drift vs data drift vs covariate drift like me?

The diagram (from a research paper, https://arxiv.org/abs/1511.03816) provides a clear illustration of the different terms.

In summary, concept drift in data refers to changes in environmental conditions that differ from the original environmental conditions under which a model is trained.

Concept drifts can involve both changes in distributions or in relationships between the variables.

When concept drift occurs, it can lead to a degradation of a model’s predictive performance over time, which then requires the model to be retrained to learn from the new environment.

2022-10-25
MLU-Explain : Visual explanation of ML concepts

A very cool website from Amazon that explains various machine learning concepts using interactive and visual essays.

https://mlu-explain.github.io/

Using simple and interesting examples, the website really brings to life many core concepts in machine learning and makes them accessible to more people.

This reminds me of how I learned physics during my high school era. There was a website that explains the concepts in physics better than my teachers at school. Sadly, I no longer remember its website address.

2022-10-18
ML system design – problem definition & consulting

I recently started reading the excellent book called Designing Machine Learning Systems by Chip Huyen.

In the first few chapters, the book illustrated very clearly the differing stakeholder expectations of an ML system, by using a restaurant recommendation app as an example.

Data scientists / ML engineers ➡️ Want a model that recommends the best restaurant for each user

Sales team ➡️ Want a model that brings in the most revenue

Product team ➡️ Want a model that has low latency in returning results

IT / Platform Ops team ➡️ Want a model that is easy to maintain (e.g. bug-free, scalable)

The book proposes a technical approach to satisfy these different objectives by framing them formally as objective functions and solving them with multiple models. I fully agree with this approach from a technical point of view (i.e. data science).

However as I straddle the two worlds of data science & business consulting in my day job, I started reflecting on how could I complement the approach by applying some techniques I learned from consulting.

In a business setting, we often need to also manage actual stakeholders themselves, that view the world (and the project) through their own lens shaped by their unique backgrounds and motivations.

To really ensure the success of the project (i.e. the ML system), often we (or someone on the project team like the project manager or product owner) need to manage the stakeholders carefully.

These usually involve:

➡️ Talking to the stakeholders to understand their needs (e.g. the real needs and the needs that they mentioned)

➡️ Aligning the interests of multiple stakeholders (e.g. finding common grounds)

➡️ Communicating the proposed approach/result to the stakeholders (e.g. ensuring everyone understands the pros & cons of the approach).

If we can tackle the seemingly technical question of designing an ML system by considering it from the perspectives of both computers and people, perhaps we can increase the success of any ML endeavour in a business setting by many folds.

2022-10-10
The happiness equation

(Image source: https://twitter.com/aurelien_gohier/status/1062248485154705408)

We always tell ourselves that happiness will follow after we have worked hard and achieved great success.

”I will be very happy if I achieve the next milestone, be it a job promotion or buying a house or earning my first bucket of gold.”

However often after we achieved those successes, happiness fades away quickly as we see that there is the next thing that we need to work towards.

”I need to become the manager now that I have become the team lead.”

”I need to upgrade my house to a bigger one now that the first property I have is just an apartment.”

”I need to earn a million now that I have my first 100k.”

What if we try to achieve happiness within us, rather than conditioning our happiness on external factors?

A very good point of view proposed by the book, “The happiness equation” by Neil Pasricha.

2022-10-02