Are LLMs stochastic parrots?
Published on Aug 30, 2025 in LLM from scratch
A few days ago, I read that LLMs are stochastic parrots. But is that really the case?
Actually, not really.
An LLM is a neural network that takes a sequence of tokens as input. The text is split by a tokenizer, which replaces each piece of a word with a numerical id. The number of inputs the network accepts determines the maximum size of the context.
The network has as many outputs as there are possible tokens: 50,000 outputs, for example, if its vocabulary contains 50,000 tokens.
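To see what that looks like in practice, here is a tiny sketch using the tiktoken library (one tokenizer among many; it assumes tiktoken is installed, and the example sentence is just for illustration):

```python
import tiktoken  # an open-source BPE tokenizer library

enc = tiktoken.get_encoding("gpt2")
tokens = enc.encode("LLMs are not parrots")
print(tokens)       # a list of integer token ids, not words
print(enc.n_vocab)  # 50257 for this vocabulary: one network output per token
```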
The calculation between the inputs and the outputs of the network is perfectly deterministic. So LLMs are non-stochastic parrots 🤣.
The stochastic dimension in text generation occurs after the LLM, during sampling.
But what is sampling?
In fact, once the computation is done for a given input, the output of the network gives us logits: raw scores between -infinity and +infinity that indicate how likely each token is to come next.
Of course, we want real probabilities, i.e. values between 0 and 1 that sum to 1. To get them, we use the softmax function.
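Here is a minimal NumPy sketch of that step (the toy logits are made up for illustration):

```python
import numpy as np

def softmax(logits):
    # Subtract the max logit for numerical stability before exponentiating
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # toy logits for a 3-token vocabulary
probs = softmax(logits)
print(probs)        # ~[0.659, 0.242, 0.099]
print(probs.sum())  # 1.0
```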
Then comes decoding. The simplest option is greedy decoding, i.e. always taking the most likely token. In that case, a given input will always produce the same output. But in practice we rarely do this, because the results are boring.
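Greedy decoding is then a one-liner on top of those probabilities, as this minimal sketch shows:

```python
import numpy as np

def greedy_decode(probs):
    # Deterministic: the same probabilities always yield the same token
    return int(np.argmax(probs))

print(greedy_decode(np.array([0.659, 0.242, 0.099])))  # always 0
```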
That’s where sampling comes in. If you’ve tinkered a bit with LLMs, you’ve probably heard of top k, top p, temperature…
With the top k parameter, we decide to draw randomly among the most likely tokens. If, for example, top k is 3, we take the three most likely tokens and draw randomly according to their respective probabilities.
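A minimal sketch of top-k sampling, assuming `probs` comes out of the softmax above (the helper name and defaults are mine):

```python
import numpy as np

def top_k_sample(probs, k=3, rng=None):
    rng = rng or np.random.default_rng()
    # Keep the k most likely token ids
    top_ids = np.argsort(probs)[-k:]
    # Renormalize their probabilities so they sum to 1, then draw one
    top_probs = probs[top_ids] / probs[top_ids].sum()
    return int(rng.choice(top_ids, p=top_probs))
```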
With top p, we set a probability, for example 0.9, and we keep tokens, starting from the most probable, until the sum of their probabilities reaches top p. Then all that remains is to draw randomly among them according to their respective probabilities.
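Here is the same idea as a sketch, again with names and defaults of my own choosing:

```python
import numpy as np

def top_p_sample(probs, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    # Sort token ids from most to least likely
    order = np.argsort(probs)[::-1]
    # Keep tokens until their cumulative probability reaches p
    cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
    kept = order[:cutoff]
    # Renormalize the kept probabilities, then draw one token
    kept_probs = probs[kept] / probs[kept].sum()
    return int(rng.choice(kept, p=kept_probs))
```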
Temperature is used to change the result of the softmax function. We divide the logits by the temperature before applying the function.
If the temperature is low, it amplifies the differences between logits and gives a more deterministic result, close to greedy decoding. Conversely, a higher temperature lets less likely tokens come out and gives more “creativity” to the result.
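A minimal sketch of temperature scaling, using the same toy logits as before:

```python
import numpy as np

def softmax_with_temperature(logits, temperature=1.0):
    # Divide the logits by the temperature before softmax:
    # T < 1 sharpens the distribution, T > 1 flattens it
    scaled = logits / temperature
    exp = np.exp(scaled - np.max(scaled))  # max-shift for numerical stability
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])
print(softmax_with_temperature(logits, 0.5))  # sharper, close to greedy
print(softmax_with_temperature(logits, 2.0))  # flatter, more "creative"
```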
Of course, all these parameters can be combined.
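To make that concrete, here is one possible pipeline chaining the three knobs (the function name, the order of the steps, and the default values are my assumptions; real implementations vary):

```python
import numpy as np

def sample_next_token(logits, temperature=0.8, k=50, p=0.9, rng=None):
    rng = rng or np.random.default_rng()
    # 1. Temperature reshapes the distribution (divide logits, then softmax)
    scaled = logits / temperature
    exp = np.exp(scaled - np.max(scaled))
    probs = exp / exp.sum()
    # 2. Top-k keeps only the k most likely tokens
    order = np.argsort(probs)[::-1][:k]
    kept_probs = probs[order] / probs[order].sum()
    # 3. Top-p truncates the tail of the renormalized distribution
    cutoff = int(np.searchsorted(np.cumsum(kept_probs), p)) + 1
    final_ids = order[:cutoff]
    final_probs = probs[final_ids] / probs[final_ids].sum()
    # 4. Draw the next token at random
    return int(rng.choice(final_ids, p=final_probs))

# Example with toy logits for a 5-token vocabulary
print(sample_next_token(np.array([2.0, 1.5, 0.5, -1.0, -3.0]), k=3))
```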
In conclusion, the stochastic dimension in text generation is real, but it comes not from the LLM but from the sampling step: the LLM itself is 100% deterministic.
Don’t miss my upcoming posts — hit the follow button on my LinkedIn profile