Nearly two years after the launch of ChatGPT, the hype and - unfortunately - the h y s t e r i a surrounding AI and LLMs has not ebbed.

The bold nature of some of those statements is aggravated by the exoneration of scientists and technologists from the burden of proof. The lack of reasoned, measured, and evidence-based discourse about genAI, especially from very eminent researchers, should make you feel uneasy. There is so much money involved, and the technology is likely to directly impact your job, your safety, your mental health, and your planet. It also allows companies to discount risk and responsibility, and pass that hot potato to the next downstream user.

It fosters an ecosystem of middlemen start-ups with dubious added-value. Have your data scientists promised you magical systems via “fine-tuning”, but produced only tepid results? How many AI practitioners have given you a hand-wave as an answer to questions about cost and safety of those systems? Many are already profiteering from this hysteria to sell you yet another flavor of snake oil, excepts this time, the oil “understands” or “will take over”. 🙄

Let’s pull the curtain back on LLMs and see what the science actually says about those systems, their safety, and their “intelligence”. 🧙

An important feature of a learning machine is that its teacher will often be very largely ignorant of quite what is going on inside. Alan Turing, Computing Machinery and Intelligence, 1950

Why Are We so Easily Fooled by LLMs?

In his novel “Blindsight” published in 2006, Peter Watts offers a thought experiment: Is consciousness required for intelligence? In search of answers, Watt’s team of scientists use language (written, spoken, or symbolic) to evaluate intelligence and consciousness of a newly discovered alien species. The novel reflects back to the reader our biased view about intelligence and the perceived relationship between language and intelligence. This is a long held view in western societies, at least as far back as the ancient Greeks.

“Man cannot be a subject of instruction otherwise than through the ear” - Aristotle

An archaic definition of dumb is the inability to speak. Today, the continued prejudices faced by Deaf and non-verbal autistic people are further examples of how the mastery of “through the ear” language is equated to intelligence. So why are we fooled so easily by LLMs? Because LLMs parrot language with sufficient mastery to fool a superficial assessment.

While philosophers and computer scientists have long debated the possibility of mind and understanding within machines (e.g., Chinese room experiment, Computationalism, and my absolute favorite: Chomsky vs Norvig), there is currently no evidence supporting it. On the contrary, very recent research show the shallowness of LLMs reasoning. So let’s dive in that research, so that we won’t be fooled twice! 🏊‍♀️

What Do Deep Neural Networks Learn?

The exploration of the interaction between input data and algorithmic learning is a vast area for research, but let’s focus on some recent findings. Researchers have known for at least a decade that deep neural networks don’t learn the way we originally thought they did. Take the now famous example of the panda + noise = gibbon described by Goodfellow I. et al. in their paper: “Explaining and Harnessing Adversarial examples”. In their work, an image of a panda is originally identified by a deep learning algorithm with 57.7% confidence (It’s not an amazing confidence by the way. A flip of a coin has a 50% confidence rate.) When some random noise is applied to the panda image, the algorithm now identifies the image as a gibbon with 99.3% confidence.

This type of findings implies that the parameters that the algorithm is focusing on to identify a panda vs a gibbon are not what is perceptible to the human eye, AND that those parameters are not that good at “actual learning”. A more recent example of this limitation is shortcut learning. Researchers have been exploring how deep neural networks take “shortcuts” in their learning by focusing on surface parameters and uninformative but persistent correlations. An amusing example of shortcut learning is investigated by Beery et al., where they show that a picture of a cow is not correctly identified because the training set contained only cows in green pastures. When presented with a cow on the beach, the algorithm failed.

Data Scientists and AI practitioners should pay close attention to this type of research because it informs methodologies to measure robustness and generalization (how well your model perform when the environment has a different distribution of phenomena as you expected). In view of the rapidly tightening regulatory landscape in the US and Europe, auditing robustness should already be in your 2025 roadmap… 😬

LLMs as “Learners”

LLMs are Deep Neural Networks on steroids, which mathematically improvise. If you want a deeper explanation, there exists many resources to learn about how LLMs work, but not many are as good as Stephen Wolfram’s. It is a long introduction, but extremely worthy of your time if you are inclined to dissect the Wizard of Oz. 🔪 🧙

Are LLM Intelligent?

Intelligence is an ill-defined concept, because it is a collection of various skills. Some of those skills can be reliably measured (repeated measurement lead to the same results) but they do not form the panacea of intelligence. The definition of intelligence is also deeply cultural, and will vary based on societies’ values.

“Everyone is a genius. But if you judge a fish by its ability to climb a tree, it will live its whole life believing that it is stupid.” - Probably not Albert Einstein

So what do we do? We can measure an LLM’s ability to do very specific things. It makes for great headlines when ChatGPT passed the bar exam or the CPA exam. Search for an exam and they probably tested it already.

The performance evaluation set against a standard is called benchmarking. Many benchmarking datasets have been created in the past two years, like Big Bench or Microsoft Eureka. There are leaderboards for benchmarks, and different versions of LLMs (e.g., GPT-3, GPT-4) are regularly compared using those benchmarks.

Thus, there is solid evidence that LLMs are able to perform some tasks seemingly very well. But is it “thinking”? My TI-89 is an amazing calculator, but no one is entertaining the idea that it is “thinking” or even sentient. But then again, it doesn’t talk (yet).

Do LLMs reason?

I have already written about my reasoning behind my belief that LLMs don’t think. But even within the Natural Language Processing (NLP) community (NLP is the family of algorithms that processes human languages as text/sound. LLMs are part of the field of NLP, and before their explosions in fame, ‘NLP’ was the general term used in the industry to talk about systems that can process and produce human languages.), opinions are divided. In the Spring of 2022 ( a few month before the launch of ChatGPT), a survey was administered among 327 NLP researchers to understand what researchers thought about NLP systems. To the question “Some generative model trained only on text, given enough data and computational resources, could understand natural language in some non-trivial sense”, respondents were split 49/51% toward agreement. However, NLP scientists and other AI experts like myself are not cognition experts. This is perfectly highlighted by the fact that this study deliberately did not provide participants with a definition of what to understand means, nor did they ask the participants to define their own definition of it, because it is “a topic of discussion”. In short, AI experts are divided down the middle about whether LLMs “understand”, whatever that means.

Fortunately, empirically evidence is slowly trickling in toward a more robust answer. In October 2024, Apple published a detailed study on mathematical reasoning. They used a benchmark of grade-school-level mathematical questions. I strongly recommend everyone to read this paper. Don’t be intimidated by the “academic” label, this paper is beautifully and accessibly written:

LLMs exhibit noticeable variance when responding to different instantiations of the same question. Specifically, the performance of all models declines when only the numerical values in the question are altered in the GSM-Symbolic benchmark. Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.

This is consistent with what we know and understand concretely about LLMs: they are a super-guesser of the next best token.

What I Think You Should Think About LLMs

I think the hysteria around LLMs has created a great veiling cloud of mysticism, which allows snake oil merchants to make outlandish, unsubstantiated, and unchecked claims to profit from you, at great expense from our societies.

I think you should be skeptical of any claims that LLMs “understand” or “reason” unless those words are precisely defined and the claims scientifically proven. Without a definition, it is relying on confusion and our natural tendencies to anthropomorphize machine computation into human reasoning. Without scientific proof, it’s pure fantasy! And when it is coming from Data Scientists, it is nearing intellectual dishonesty and deontological failure. Your skepticism should be proportional to how much someone may profit from your belief of those claims.

I think you should balance curiosity and caution. Yes, LLM-based tools can phenomenally improve your productivity. You should make yourself familiar with what they can and cannot do. You should also make yourself very familiar with the substantial risks they carry for your organization. Disclaimer: Those risks keep me happily employed 😛.

I think you should be wary of using LLMs as a one-solution fit all, especially given their cost and risks. The recent explosion in middleware start-ups whose value proposal is an API call to OpenAI is the greatest example of someone trying to use a hammer on every nail-looking thing. Yes, an LLM could do it, but is it the best (most efficient, cost-effective, performant) solution, or is it the trendy solution?

I think you should expect the next energy crisis to impact the LLMs market. LLMs are extremely greedy in energy, water, human, and cloud capital. Training a GPT model requires an estimated 10 Gigawatt-hours (10 millions kilowatt-hours), and that ChatGPT’s daily usage is also estimated at 1 Gigawatt-hours. The average US household consumes 0.010 gigawatt-hours per year. Yet, data centers are being built in locations where the electricity grid might not be up to par. Time will tell… 🤠

I think you should expect the performance of foundation models to start plateau-ing. The era of pre-training is estimated to come to an end very soon. It means that foundation models have absorbed all the data available to them (legally or not), and their performance were somewhat (but not always) correlated with the amount of data available to those models. Perversely, their performance may be affected by AI-generated content on the same Internet they have been feeding from, creating a ouroboros-shaped nightmare for LLM pre-training. In Machine Learning development, the next step would be to either refine the data and/or refine the algorithm. Both those steps will take a lot time and innovation to happen.

There is a lot to think about regarding LLMs… Until either the market or grounded scientific and evidence-based discourse tame this LLM hysteria, I will remain skeptical of unsupported claims, snake-oil merchants API-call shops. And you should too.

Enough with the LLM hysteria