Will LLMs make classic data science obsolete?

Published on Nov 7, 2025 in Benchmarks  

With the AI trend, many “AI engineers” have appeared. Most of them think they are the kings of the world because they have interacted with an LLM through an API. But since they don’t know much about what they are doing, they talk nonsense.

I met one of them explaining that classic data science will disappear because you can do anything with LLMs, just by asking them. Especially since, according to this person, it is much easier to use an LLM, for example, to do classification.

Since LLMs are unreliable for classification, I told him he was talking bullshit.

Months passed, and I thought: even if the idea is basically stupid, why not run a comparative test?

I had already done a classification test on SMS spam to compare the performance between Python and Golang.

So I created an equivalent microservice that queries an LLM through the OpenAI API.

```python
#!/usr/bin/python3

import os
import json
import time
from flask import Flask, request, jsonify
from openai import OpenAI
from dotenv import load_dotenv

load_dotenv()
api_key = os.environ.get("API_KEY")
base_url = os.environ.get("API_URL")

if api_key is None or base_url is None:
    print("Please set API_KEY and API_URL in .env")
    exit(1)

client = OpenAI(base_url=base_url, api_key=api_key)

app = Flask(__name__)

def classify_with_llm(message):
    # Triple single quotes here: the prompt itself wraps the message in a
    # triple-double-quoted block, which would otherwise terminate the f-string
    # (the original triple-double-quoted version was a syntax error).
    prompt = f'''
You are a spam detection model.
Parse the following text and respond STRICTLY with valid JSON with no other text:
{{"label": "spam"}} if it's spam, or {{"label": "ham"}} otherwise.

Text to analyze:

"""{message}"""
'''
    try:
        resp = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "system", "content": "You detect spam."},
                        {"role": "user", "content": prompt}],
            temperature=0.0,
            max_tokens=16,
        )
        text = resp.choices[0].message.content.strip()
        data = json.loads(text)
        label = data.get("label")
        if label in ("spam", "ham"):
            return label
        else:
            raise ValueError(f"Label inattendu : {label}")
    except (json.JSONDecodeError, KeyError, ValueError) as e:
        raise ValueError(f"Invalid response from LLM: {e}")

@app.route('/predict', methods=['PUT'])
def predict():
    data = request.get_json()
    if not data:
        return jsonify({'error': 'No data provided'}), 400
    message = data.get('message')
    if not message:
        return jsonify({'error': 'No message provided'}), 400
    try:
        prediction = classify_with_llm(message)  # pass the string, not a one-element list
    except ValueError:
        return jsonify({'error': 'Internal Server Error'}), 500
    return jsonify({'data': prediction})

if __name__ == '__main__':
    app.run(port=8000, debug=True)
```
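To sanity-check the service, a quick client call can be made from Python. A minimal sketch using the requests library (the message text is made up for the example):

```python
import requests

# Send one message to the local service started above and print the verdict.
resp = requests.put(
    "http://localhost:8000/predict",
    json={"message": "WINNER!! Claim your free prize now, reply YES"},
)
print(resp.json())  # e.g. {"data": "spam"}
```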

First of all, the Python version that uses a classic data science algorithm is simpler (31 lines of code vs 50 for the LLM version). So the argument that using an LLM rather than a data science algorithm is easier is false.
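To make the comparison concrete, here is the gist of the classic approach: a minimal sketch assuming scikit-learn, with toy training data standing in for the real SMS corpus (this is illustrative, not the exact code from my earlier test):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy examples; the real test trained on the labeled SMS spam dataset.
texts = ["WINNER!! Claim your free prize now", "See you at lunch?"]
labels = ["spam", "ham"]

# Bag-of-words counts + Multinomial Naive Bayes: the whole model in two lines.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(texts, labels)

print(model.predict(["Free entry in a wkly comp"])[0])  # "spam" or "ham"
```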

Now, let’s compare reliability and performance. I ran a local LLM on an A5000 GPU, much more powerful hardware than what my test with the classic classification algorithm used. The model that gave me the best results is Mistral 24B (quantized to 5.5 bits to run on an A5000).

Here’s what it looks like:

```plaintext
Total messages: 1113
True Positive: 165
False Positive: 242
True Negative: 703
False Negative: 3
Accuracy: 77.99%
Average time: 394.14 ms
```
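For transparency, here is how the accuracy figure falls out of the confusion matrix above (plain arithmetic):

```python
tp, fp, tn, fn = 165, 242, 703, 3            # confusion matrix from the run above
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 868 / 1113
print(f"{accuracy:.2%}")                     # 77.99%
```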

So the LLM version is wrong about 15 times more often than the classic version (a 22.01% error rate vs 1.44%, i.e. 77.99% accuracy vs 98.56%), and it is 128 times slower, while using much more powerful hardware.

LLMs are meant to generate content under human supervision, given their tendency to hallucinate. They are bad at tasks like classification or decision-making. And unsurprisingly, my test confirms it.

Because LLMs speak our language and have some reasoning ability, some people use them as omniscient oracles. But they are not. Really not. And since they have no understanding of the world, they have no ability to improve themselves.

Despite all the marketing promises of an AGI arriving tomorrow (though I think that after three years of such promises, no serious person still believes this), LLMs remain limited by their training data.

People who only know LLMs want to use them to solve every problem. But there are old-school tools that are much more efficient, both in compute and accuracy.

Of course, to use a classic algorithm, you need a dataset with labeled data. The appeal of an LLM rests on the idea that you don’t have to train anything yourself. But this idea comes from the fact that people mostly know generalist LLMs. For specific tasks, it’s definitely worth training your own model.
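Training such a model takes minutes, not weeks. Extending the gist shown earlier, a minimal sketch assuming scikit-learn and pandas, with a hypothetical labeled CSV (the file name and the text and label columns are placeholders for whatever your dataset uses):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical labeled dataset: one text column, one "spam"/"ham" label column.
df = pd.read_csv("sms_labeled.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42
)

# Fit on the training split, then measure accuracy on held-out messages.
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))
```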

In the end, AI is just a rebranding of applied math.

What is the secret?

Labelled data!

And an algorithm built on two-and-a-half-century-old math, like Multinomial Naive Bayes, isn’t going to be obsolete anytime soon.
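For the record, the old math in question is Bayes’ theorem. Multinomial Naive Bayes just applies it to word counts, in its standard textbook form:

```latex
% Pick the class c maximizing the prior times the likelihood
% of the observed word counts f_i:
\hat{c} = \arg\max_{c} \, P(c) \prod_{i} P(w_i \mid c)^{f_i}
```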

Don’t miss my upcoming posts: hit the follow button on my LinkedIn profile.