Lies, Damned Lies, and Data Science

In 2012, a viral TED talk by Amy Cuddy became the spark that set psychological research ablaze and pushed the field of Psychology into an existential crisis. As it turns out, many psychology research results were not reproducible. Reproducibility is a keystone of the scientific method, because it reinforces the validity of a discovery. “Science doesn’t care what you believe”. If you can’t reproduce the same result given the same experimental setting, then is it even real?

Data Science is due for a similar reckoning. In this article, I will talk about all the ingredients that make up this explosive cocktail and how you can hold your data scientists accountable.

Ritualistic Statistics

In Psychology research, the causes of this crisis have been largely attributed to (1) a lack of incentive to reproduce research experiments that have already been done, and (2) ritualistic statistics.

In essence, psychological experiments are designed to discover whether there is a notable and “real” difference between two groups: one group receives a treatment, and one does not (the control group). There exist multiple types of statistical tests, depending on the particular properties of what you are observing. Most of those tests rely on a mechanism called the P-value. You run your test, you check your P-value, and that tells you whether you have a statistically significant (“real”/“noticeable”/“valid”) difference between the two groups.

What caused the reproducibility crisis in Psychology is that researchers started doing statistics on auto-pilot. This is called ritualistic statistics.

Group treatment -> P-value -> Publish results. Rinse and repeat.
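To make the ritual concrete, here is a minimal sketch of what it often boils down to in Python, using SciPy’s two-sample t-test on made-up control and treatment values (all numbers are purely illustrative):

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(42)

    # Made-up outcome values for a control group and a treatment group.
    control = rng.normal(loc=50.0, scale=10.0, size=30)
    treatment = rng.normal(loc=55.0, scale=10.0, size=30)

    # The ritual: run the test, read the p-value, declare significance if p < 0.05.
    t_stat, p_value = stats.ttest_ind(treatment, control)
    print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

    if p_value < 0.05:
        print("Statistically significant difference -> write it up.")
    else:
        print("Not significant -> into the file drawer.")

Nothing in those few lines forces you to ask whether the test was appropriate in the first place, which is exactly the problem.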

Most often, statistical tests assume something about the nature of the data and the nature of the Truth being tested. For instance, the Student’s t-test assumes that the distribution of the parameter you are observing is Normal (i.e., it has the famous “bell curve”).

It is a reasonable assumption to make, because we know that with a large enough sample, the mean (i.e., average) of that sample will match the population mean (i.e., everyone, thus the “Truth”), and the Central Limit Theorem tells us that the distribution of that sample mean will be approximately Normal. Unfortunately, this assumption is often wrong in the context being observed. They should have heeded the words of Mark Twain.

“Lies, Damned Lies, and Statistics”
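Twain’s warning applies very concretely here. Before blindly running a t-test, a practitioner can at least look at the shape of the data; a minimal sketch, using fabricated, heavily skewed data with SciPy’s Shapiro-Wilk normality test and a non-parametric alternative:

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)

    # Fabricated, heavily skewed data (think revenue, latencies, session lengths):
    # exactly the kind of data that is not bell-shaped.
    control = rng.lognormal(mean=3.0, sigma=1.0, size=30)
    treatment = rng.lognormal(mean=3.2, sigma=1.0, size=30)

    # Shapiro-Wilk: a small p-value suggests the Normality assumption is violated.
    for name, sample in [("control", control), ("treatment", treatment)]:
        _, p = stats.shapiro(sample)
        print(f"{name}: Shapiro-Wilk p = {p:.4f}")

    # A non-parametric test (Mann-Whitney U) does not assume Normality.
    u_stat, p_value = stats.mannwhitneyu(treatment, control)
    print(f"Mann-Whitney U = {u_stat:.1f}, p = {p_value:.4f}")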

You know who else uses statistics a lot? Data Scientists… 😬

What is Data Science?

In the last decade, Data Science has emerged and evolved into one of the most in-demand software engineering specializations, while often remaining an ill-defined term. Unlike Machine Learning, Data Science is not an academic discipline with its own set of algorithms and methods.

In my opinion, the term Data Science was taken up by the industry to generally capture applied Machine Learning and quantitative analysis methods. A quick search for “data scientist” roles on LinkedIn Jobs will return a laundry list of skill requirements. From Machine Learning and data analysis, to software engineering, MLOps (DevOps for ML models), and data engineering, this very wide range of expected skills makes it even harder to define a standard of competencies for Data Scientists.

Not all Data Scientists are Created Equal

Go on LinkedIn, and search for profiles of “data scientists”. What do you see? You will find that data scientists have an unusually wide spectrum of backgrounds compared to other engineering careers. It is common for data scientists to come from another (tech or non-tech) career via a bootcamp, or to have some experience as a data analyst, or to hold one or multiple degrees in a field with a strong statistical background (e.g., Physics or Cognitive Science), all the way up to a PhD in Computer Science with a thesis on an AI topic. This vast spectrum of expertise offers a unique diversity of thought in the industry, which is highly desirable and healthy for any project.

There is immense diversity, but also disparities in skill, expertise, and knowledge among Data Scientists. The difference in skills between a data scientist with, say, a Bachelor’s in Physics and one with a PhD in Computer Science is comparable to the difference in competence between a certified nursing assistant and a surgeon: one is a practitioner of routine procedures, while the other may be tasked with advanced, less paint-by-numbers work. That being said, experience may alleviate some of those disparities.

In practice, depending on their backgrounds, data scientists may have large knowledge gaps in computer science, software engineering, theory of computation, and even statistics in the context of machine learning, despite those topics being fundamental to any ML project.

But it’s ok, because you can just call the API, and Python is easy to learn. Right? 😬

The Blessing and the Curse of Ease

Few technological revolutions have come with as low a barrier to entry as Machine Learning. The advent of Machine Learning is due in part to the Open Source community’s effort to build libraries in Python. Practitioners have access to open source Python libraries such as scikit-learn, nltk, PyTorch, PySpark, or, more recently, ray.

Learning Python does not make you a good software engineer

From a programming language point of view, Python is easy to learn, and its simple syntax lets you produce code quickly.

However, it is a double-edged sword. Python abstracts away a lot of complexity (e.g., types, memory management, compilation), which may lead to poor software engineering habits, especially if it is the only object-oriented language you know.
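As a toy illustration (the helper function below is hypothetical), Python will cheerfully run code that a statically typed language would reject, and sometimes produce a silently wrong answer instead of an error:

    # A hypothetical helper: nothing in the signature says what "amount" must be.
    def add_interest(amount, periods=2):
        return amount + amount * periods

    print(add_interest(100))     # 300 -- the intended arithmetic
    print(add_interest("100"))   # '100100100' -- no error, just a silently wrong result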

Jupyter Notebooks make you a worse software engineer

Notebooks such as Jupyter Notebooks or Databricks Notebooks are a useful tool in the arsenal of a data scientist. A notebook lets you run pieces of code in isolated, interactive units called ‘cells’. It is particularly well suited to data analysis and to quick iteration over a statistical process. However, a notebook will compound potentially weak software engineering practices. For instance, I have seen notebook practitioners develop the undesirable habit of importing packages and functions in the cell where they invoke them, rather than at the top of the file. They may practice poor modularity, and may not understand the concept of scope or namespace.
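As a contrived illustration (the cell below is hypothetical), this is the kind of code that accumulates in the middle of a notebook:

    # --- Cell 7 of a hypothetical notebook ---
    import pandas as pd                                   # imported in the cell that uses it
    from sklearn.linear_model import LogisticRegression  # ditto, instead of at the top of the file

    # Tiny illustrative dataset standing in for whatever earlier cells produced.
    df = pd.DataFrame({"x1": [0, 1, 2, 3], "x2": [1, 0, 1, 0], "label": [0, 0, 1, 1]})
    model = LogisticRegression().fit(df[["x1", "x2"]], df["label"])

    # In a proper module, imports would live once at the top of the file and the
    # logic would sit in functions with explicit inputs, instead of depending on
    # whatever global state earlier cells happened to leave behind.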

Black Box API

The amazing accessibility of ML algorithms is mostly achieved through open source APIs. An API is a collection (or library) of functions, and open source APIs are freely available libraries. An ML API in Python is therefore a library of ML algorithms implemented in Python. It makes implementing ML projects extremely fast and practical.

However, convenience is a slippery slope toward ritualistic Machine Learning. A practitioner with knowledge gaps about the assumptions of the statistical models used in ML is even more likely to use APIs as black boxes. How many times have you seen a data scientist just throw data at a model and take the output at face value? And if the performance is not adequate, just throw more data at it! That will do it! The problem is that an algorithm will always give an answer, even if the data violates a statistical assumption. The algorithm doesn’t “know” about the assumption. It is the responsibility of the practitioner to know and uphold the assumptions of a model.
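As a sketch of the pattern (with synthetic data, purely for illustration): scikit-learn’s LinearRegression will happily fit a straight line through data that is clearly not linear, and report a score, no questions asked:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(7)

    # Synthetic data with a clearly non-linear relationship.
    X = rng.uniform(-3, 3, size=(200, 1))
    y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

    # The API never complains: it always returns coefficients and a score.
    model = LinearRegression().fit(X, y)
    print("R^2:", round(model.score(X, y), 2))   # looks passable at a glance

    # Nothing above checked linearity, residuals, or whether a linear model
    # makes sense at all; the library will not do it for you.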

To a hammer, everything is a nail

Another dangerous side effect of the ease of use and access of even the latest AI models is that, to a hammer, everything is a nail. Have you noticed how many people are trying to throw an LLM at every problem? A few years ago, it was Deep Learning; a few years before that, it was Bayesian Networks. All of these can be fantastic tools, but not for every task. Once again, each model comes with its assumptions and requirements. Hype, ease of access, and knowledge gaps among practitioners all combine into weak science at best, or intellectual dishonesty at worst.

Accountability for Data Scientists

A Gartner prediction is often quoted on the topic of AI project failures:

Through 2022, 85% of AI projects will deliver erroneous outcomes due to bias in data, algorithms or the teams responsible for managing them.

Building products using Machine Learning and data is still difficult. The tooling infrastructure is still very immature, and the non-standard combination of data and software creates unforeseen challenges for engineering teams. But in my view, a lot of the failures come from this explosive cocktail of ritualistic Machine Learning:

  • Weak software engineering knowledge and practices compounded by the tools themselves;
  • Knowledge gaps in mathematical, statistical, and computational methods, which encourage treating APIs as black boxes;
  • Ill-defined range of competence for the role of data scientist, reinforced by a pool of candidates with an unusually wide range of backgrounds;
  • A tendency to follow the hype rather than the science.

What can you do?

Hold your data scientists accountable using Science.

  • At a minimum, any AI/ML project should include an Exploratory Data Analysis, whose results directly support the design choices for feature engineering and model selection (a minimal sketch follows this list).
  • Data scientists should be encouraged to think outside the box of ML, which is a very small box indeed, and train themselves in other areas of AI, such as data mining, optimization, and graph analytics.
  • Data scientists should be trained to use eXplainable AI methods to provide context about the algorithm’s performance beyond traditional metrics like accuracy, false positive rate (FPR), or false negative rate (FNR).
  • Data scientists should be held to similar standards as other software engineering specialties, with code reviews, code documentation, and architectural designs.
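To ground that first point, here is a minimal, hypothetical sketch of what “at a minimum” could look like in Python; the dataset and column names are invented for illustration:

    import pandas as pd

    # Hypothetical dataset standing in for whatever you are actually modeling.
    df = pd.DataFrame({
        "age": [23, 35, 41, None, 29, 52],
        "income": [38_000, 52_000, None, 61_000, 45_000, 83_000],
        "churned": [1, 0, 0, 1, 0, 0],
    })

    # A bare-minimum EDA pass: distributions, missing values, correlations,
    # written down *before* any feature engineering or model selection.
    print(df.describe(include="all"))
    print(df.isna().mean().sort_values(ascending=False))   # share of missing values per column
    print(df.select_dtypes("number").corr())               # pairwise correlations

    # Every later design choice (imputation, transformations, candidate models)
    # should point back to something observed in this step.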

Until such practices are established as the norm, I’ll remain skeptical of Data Science.