Data Science Team & Tech Lead

Tag: Machine Learning

  • Rocket League reinforcement learning-trained bot

    As a passionate gamer, I have been reading about the Rocket League Nexto cheat situation with keen interest (https://kotaku.com/rocket-league-machine-learning-cheating-nexto-bot-1849980593).

    For those unfamiliar, Rocket League is a competitive online game in which players control cars to play football.

    Someone built a bot trained via reinforcement learning and offered it as a cheating solution, helping people win ranked competitive games against unsuspecting opponents.

    Unsurprisingly, the bot is extremely good at the game and has helped cheaters win against many human players.

    To be fair, most online competitive games have plenty of “hackers” or people who use cheating solutions to win, so there is nothing new about this.

    However, most cheating solutions are simple hacks (e.g. wallhacks, auto-aim) bolted onto the existing game so that the cheating player gains certain advantages.

    Using a full-blown reinforcement learning-trained bot to cheat in online games is rare, which makes this news interesting.

    One of the reasons this is possible for Rocket League is that someone built a third-party Python package, RLGym, which solves the challenges of (1) creating a representative gym environment and (2) interacting with the game (reading the game state & relaying user input).
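
    To give a feel for how that works: RLGym exposes the game through the familiar Gym-style reset/step loop, so an interaction sketch looks roughly like the one below (the make() call is simplified here, and a random action stands in for a trained policy).

    ```python
    # Rough sketch of a Gym-style interaction loop with RLGym.
    # The make() call is simplified; real training would replace the random
    # action with a policy being trained via reinforcement learning.
    import rlgym

    env = rlgym.make()  # launches Rocket League and wraps it as an environment

    obs = env.reset()
    done = False
    while not done:
        action = env.action_space.sample()          # placeholder for a trained policy
        obs, reward, done, info = env.step(action)  # relays input & reads game state
    ```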

    While I do not condone cheating with reinforcement learning-trained bots, I have a small dream that one day game developers will set up dedicated channels/servers where only user-designed bots can join and fight it out to claim the throne of best AI.

    Such a geeky thing I know, but imagine how fun that would be!

    (P/S: The consultant in me says that game developers will never set this up, as such a feature would cost a lot to implement & maintain but is unlikely to bring in additional revenue streams.)

  • Timeline of LLM

    “Classic quant signals might work, but you can’t explain them; ChatGPT might not work, but it can explain itself.

    In a sense this is the opposite of a classic “black box” machine-learning investment algorithm.”

    This is an interesting take by Matt Levine on ChatGPT's ability to act as an asset manager (https://www.bloomberg.com/opinion/articles/2023-01-26/chatgpt-is-not-much-of-a-pitch-robot).

    One of the reasons ChatGPT is so popular is its ability to elaborate on topics confidently and in a structured way.

    Maybe this highlights an important skill for aspiring/practising data scientists to acquire?

    Anyway, if you are lost in the recent deluge of large language models (LLMs), I recommend this nicely curated timeline by Dr Alan Thompson: https://lifearchitect.ai/timeline/.

  • Delegating to ChatGPT & Stable Diffusion

    This week, I am jumping on the bandwagon by delegating the task of creating my LinkedIn post to ChatGPT (for the text) and Stable Diffusion (for the image), with the following prompts.

    ⌨ ChatGPT: Can you write an opinion article, discussing your opinion on this topic, “In the future, do you see a shift in companies from hiring more data scientists, to hiring more specialized roles such as data analysts, data engineers or machine learning engineers?”

    🖼 Stable Diffusion: A realistic photo of a tech geek opening a door to the future with bright light coming out of the door
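
    (For anyone curious, the image half can be reproduced locally with Hugging Face's diffusers library; a minimal sketch below, where the checkpoint name and output file name are my own assumptions.)

    ```python
    # Minimal local Stable Diffusion sketch using Hugging Face diffusers.
    # The checkpoint name and output file name are assumptions for illustration.
    import torch
    from diffusers import StableDiffusionPipeline

    pipe = StableDiffusionPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
    ).to("cuda")

    prompt = ("A realistic photo of a tech geek opening a door to the future "
              "with bright light coming out of the door")
    pipe(prompt).images[0].save("linkedin_post.png")
    ```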

    ———————————————————————

    In my opinion, the future will likely see a shift in companies from hiring more generalized data scientists to hiring more specialized roles such as data analysts, data engineers, and machine learning engineers.

    As data continues to play an increasingly important role in driving business decisions, companies will need to have experts who can not only analyze and interpret data, but also be able to build and maintain the infrastructure necessary to store, process, and extract insights from large and complex data sets.

    Data analysts, for example, are experts at using data to help organizations make informed decisions. They are skilled at using a variety of tools and techniques to collect, process, and analyze data, and they are able to present their findings in a way that is easily understandable to non-technical stakeholders.

    Data engineers, on the other hand, are focused on building and maintaining the infrastructure that is needed to store, process, and analyze data at scale. They are experts at designing and implementing scalable and reliable data pipelines, and they are able to work with a variety of technologies and tools to ensure that data is available and accessible to the people who need it.

    Machine learning engineers, meanwhile, are focused on building and deploying machine learning models that can be used to make predictions and automate complex tasks. They are skilled at using a variety of machine learning algorithms and techniques, and they are able to work with large and complex data sets to build models that can accurately predict outcomes and automate tasks.

    Overall, I believe that the future will see a shift towards hiring more specialized roles such as data analysts, data engineers, and machine learning engineers, as these roles are better equipped to handle the increasing complexity and scale of data that organizations are dealing with. By hiring experts in these specialized roles, companies will be able to more effectively harness the power of data to drive business decisions and stay competitive in an increasingly data-driven world.

  • Illustrated Stable Diffusion from Jay Alammar

    If you are interested in how Stable Diffusion generates amazing AI art but are put off by the steep technical details, this illustrated guide from Jay Alammar should help: https://jalammar.github.io/illustrated-stable-diffusion/.

    The guide helpfully breaks down the model into components and substitutes complex equations with simple flowcharts.

    P/S: I also highly recommend his illustrated guide on Transformers (https://jalammar.github.io/illustrated-transformer/), which explains the architecture really well.

  • Short review of Designing Machine Learning Systems

    I have finally finished “Designing Machine Learning Systems” after a few weekends of focused reading. It is one of the rare technical books I have finished in its entirety, and I thoroughly enjoyed it.

    Here is a quick book review.

    The book is amazing in the following aspects:

    ✅Provides a high-level overview of practical aspects in ML system design

    ✅Shares best practices on bringing ML models to production at scale

    ✅Offers insights on the latest industry-adopted trends and tools

    However, the book is not meant to offer the following:

    ℹ️A step-by-step guide to setting up an ML system

    ℹ️Theoretical explanations on ML concepts

    ℹ️Use case/project recipes to use as starting templates

    That being said, I have gained a lot of practical tips from the book, and I am looking forward to putting some of what I learned into practice.

    Thanks Chip Huyen for the great book and looking forward to the next edition!

  • Automatic speech recognition – Whisper OpenAI

    Whisper is a recently released transformer-based automatic speech recognition (ASR) model from OpenAI.

    It can be used for:

    ➡️ Language identification

    ➡️ Voice activity detection

    ➡️ Multilingual speech recognition

    ➡️ Multilingual speech translation

    When evaluated on the ESB benchmark datasets (including LibriSpeech and Common Voice), Whisper outperformed Conformer RNN-T from NVIDIA and Wav2Vec2 from Meta.

    Link to blog: https://openai.com/blog/whisper/
    Link to repo: https://github.com/openai/whisper
    Link to benchmarking study: https://arxiv.org/abs/2210.13352
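
    Getting a transcription out of the open-source package takes only a few lines; a minimal sketch (the audio file name is a placeholder):

    ```python
    # Minimal transcription sketch with the openai/whisper package;
    # "audio.mp3" is a placeholder file name.
    import whisper

    model = whisper.load_model("base")       # downloads the base checkpoint
    result = model.transcribe("audio.mp3")   # detects the language, then transcribes
    print(result["text"])
    ```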

  • Data versioning

    “Data versioning is like flossing. Everyone agrees it’s a good thing to do, but few do it.” ~ Chip Huyen, Designing Machine Learning Systems

    Unlike code versioning, data versioning is a lot more difficult to implement in data science / machine learning projects.

    This is for the following reasons:

    ➡️ Data is often much larger than code.

    ➡️ There are varying definitions of what constitutes a difference between two data versions, and of how merge conflicts should be resolved.

    ➡️ Regulations on data protection and privacy make keeping historical data difficult.
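
    For those who want to start flossing, one popular tool is DVC, which keeps lightweight pointers to data under git. A sketch of reading back a pinned version via its Python API (the file path and revision tag are placeholders):

    ```python
    # Sketch: reading a specific, pinned version of a dataset with DVC's
    # Python API. The file path and git revision tag are placeholders.
    import dvc.api

    with dvc.api.open("data/train.csv", rev="v1.0") as f:
        train_csv = f.read()  # exactly the bytes the v1.0 model saw in training
    ```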

    Do you floss… erm version your data often?

  • Concept drift vs data drift vs covariate drift

    Do you always get confused among concept drift vs data drift vs covariate drift like me?

    The diagram in this research paper (https://arxiv.org/abs/1511.03816) provides a clear illustration of the different terms.

    In summary, concept drift refers to a change in the environment relative to the original conditions under which a model was trained.

    Concept drift can involve changes in the data distributions, in the relationships between variables, or both.

    When concept drift occurs, it can degrade a model’s predictive performance over time, which then requires the model to be retrained to learn from the new environment.
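
    As a toy illustration of one flavour of drift (a shift in an input feature’s distribution), a two-sample Kolmogorov-Smirnov test can flag when production data no longer looks like the training data; the data and threshold below are made up:

    ```python
    # Toy illustration: flag a shift in one feature's distribution with a
    # two-sample Kolmogorov-Smirnov test. Data and threshold are made up.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(42)
    train_feature = rng.normal(loc=0.0, size=5_000)  # training-time distribution
    live_feature = rng.normal(loc=0.5, size=5_000)   # shifted production data

    stat, p_value = ks_2samp(train_feature, live_feature)
    if p_value < 0.01:
        print(f"Possible drift detected (KS statistic = {stat:.3f})")
    ```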

  • ML system design – problem definition & consulting

    I recently started reading the excellent book Designing Machine Learning Systems by Chip Huyen.

    In the first few chapters, the book illustrates very clearly the differing stakeholder expectations of an ML system, using a restaurant recommendation app as an example.

    Data scientists / ML engineers ➡️ Want a model that recommends the best restaurant for each user

    Sales team ➡️ Want a model that brings in the most revenue

    Product team ➡️ Want a model that has low latency in returning results

    IT / Platform Ops team ➡️ Want a model that is easy to maintain (e.g. bug-free, scalable)

    The book proposes a technical approach to satisfying these different objectives: framing them formally as objective functions and solving them with multiple models. I fully agree with this approach from a technical (i.e. data science) point of view.
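
    In spirit, the decoupling looks something like the sketch below, with one model per objective combined through tunable weights (the model interfaces and weights are placeholders, not the book’s code). A nice side effect is that rebalancing the objectives becomes a matter of tweaking weights rather than retraining models.

    ```python
    # Sketch of decoupled objectives: one model per stakeholder objective,
    # combined with tunable weights. Models and weights are placeholders.
    def rank_restaurants(restaurants, quality_model, revenue_model,
                         w_quality=0.7, w_revenue=0.3):
        """Rank restaurants by a weighted sum of per-objective model scores."""
        def combined_score(restaurant):
            return (w_quality * quality_model.predict(restaurant)
                    + w_revenue * revenue_model.predict(restaurant))

        return sorted(restaurants, key=combined_score, reverse=True)
    ```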

    However, as I straddle the two worlds of data science & business consulting in my day job, I started reflecting on how I could complement the approach with some techniques I learned from consulting.

    In a business setting, we often also need to manage the stakeholders themselves, who view the world (and the project) through their own lens, shaped by their unique backgrounds and motivations.

    To really ensure the success of the project (i.e. the ML system), we (or someone on the project team, such as the project manager or product owner) often need to manage the stakeholders carefully.

    This usually involves:

    ➡️ Talking to the stakeholders to understand their needs (both the needs they state and their real, underlying needs)

    ➡️ Aligning the interests of multiple stakeholders (e.g. finding common ground)

    ➡️ Communicating the proposed approach/results to the stakeholders (e.g. ensuring everyone understands the pros & cons of the approach)

    If we can tackle the seemingly technical question of designing an ML system from the perspectives of both computers and people, perhaps we can increase the odds of success of any ML endeavour in a business setting many times over.

  • Time Series Forecasting – sktime

    Code template for running time series forecasting in sktime.
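
    A minimal sketch along those lines, using the airline dataset bundled with sktime (the choice of ThetaForecaster is just one example):

    ```python
    # Minimal sktime forecasting sketch on the bundled airline dataset.
    # ThetaForecaster is just one example forecaster; others are drop-in.
    from sktime.datasets import load_airline
    from sktime.forecasting.theta import ThetaForecaster

    y = load_airline()                   # monthly airline passenger counts
    forecaster = ThetaForecaster(sp=12)  # sp=12 for monthly seasonality
    forecaster.fit(y)
    y_pred = forecaster.predict(fh=list(range(1, 13)))  # next 12 months
    print(y_pred)
    ```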

    Link to website: https://sktime.org/

    Link to repository: https://github.com/alan-turing-institute/sktime