What AI text generators can do (and can’t do)

Hello Neural Enthusiasts!

The release of ChatGPT at the end of 2022 generated a lot of enthusiasm, but also a lot of confusion. On my side, I began experimenting in my free time both with the creative side of LLMs (generative literature) and with the possibility of building accurate assistants. I also had the opportunity to develop a domain-specific assistant in a corporate context using retrieval augmented generation and fine-tuning. This experience allowed me to understand the possibilities and the limits of LLMs. As I started approaching prospects for freelance missions, I realized that most people don’t know the limits of the “magic” of LLMs. In this article, I will try to give an overview of what is and isn’t possible.

The problem of humanization

If LLMs generate so much confusion for the uninitiated, it is because we now have machines that speak our language, and we tend to humanize them. Even for insiders, it’s sometimes complicated. You’ve probably heard the story of the engineer at Google who became convinced that the chatbot he was testing was sentient. The marketing around the current AI trend adds even more confusion.

AI is not human. Even if it seems to perform miracles in certain areas, what seems simple to a human is often impossible for the generative AI currently available to us (or very expensive for such a trivial task).

How do LLMs work?

What most people call an LLM is actually a decoder-only transformer neural network used for text generation. An artificial neural network is a computational system with inputs and outputs, loosely inspired by the functioning of biological neurons.

As input, the generative LLM receives tokens. Tokens are groups of characters (about four on average in English). Tokenization is a way of encoding text that is optimized for LLMs. The generative LLM has as many outputs as there are tokens in its vocabulary; current tokenizers typically use vocabularies of roughly 32,000 to 100,000 different tokens. The only ability of a generative LLM is to predict the probability of the next token for a given text.
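To make this concrete, here is a minimal sketch of tokenization using the Hugging Face transformers library. The GPT-2 tokenizer is just an example; any tokenizer will show the same idea.

```python
# Minimal tokenization sketch with the Hugging Face "transformers" library.
# The GPT-2 tokenizer is only an example; each model ships its own tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

text = "What fruits are harvested in August?"
token_ids = tokenizer.encode(text)                   # one integer per token
tokens = tokenizer.convert_ids_to_tokens(token_ids)  # the corresponding sub-word strings

print(token_ids)             # list of integers
print(tokens)                # sub-word pieces ("Ġ" marks a leading space in GPT-2's tokenizer)
print(tokenizer.vocab_size)  # size of the vocabulary (about 50,000 for GPT-2)
```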

To use a generative LLM, we encode the prompt into tokens and feed those tokens to the LLM as input. Each model accepts a maximum number of input tokens, which limits the size of the context that can be given to the network. The LLM then calculates the probability of every token in its vocabulary being the next one and places this result at its output. We can choose to take the most likely token; this is called greedy decoding, but it gives boring answers. In general we use sampling, a strategy that I will not detail here.

Then we simply append the new token to the input and start again…
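Here is a minimal sketch of that loop with PyTorch and a Hugging Face causal model, just to show the mechanics. In practice you would call the library’s built-in generation method; GPT-2 is only used as a small example.

```python
# Minimal sketch of the token-by-token generation loop described above.
# In practice you would simply call model.generate(); this is for illustration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer.encode("The capital of France is", return_tensors="pt")

with torch.no_grad():
    for _ in range(10):                          # generate 10 new tokens
        logits = model(input_ids).logits         # shape: (1, seq_len, vocab_size)
        next_token_logits = logits[0, -1]        # scores for the next token only
        probs = torch.softmax(next_token_logits, dim=-1)

        # Greedy decoding: always take the most likely token (deterministic, "boring").
        next_id = torch.argmax(probs)

        # Sampling alternative: draw the next token according to its probability.
        # next_id = torch.multinomial(probs, num_samples=1)[0]

        # Append the new token to the input and start again.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=-1)

print(tokenizer.decode(input_ids[0]))
```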

With this explanation alone, you can already see that a generative LLM understands absolutely nothing. It has no long-term strategy. It can’t see any further than the end of the next token.

The pre-training

For a neural network to work, it must have the right parameters, and this is where machine learning comes in. The first step is pre-training: the model is fed a large corpus of text. This part is long and expensive, but whether you go with OpenAI or open source, you don’t have to take care of this step yourself. After pre-training, the model is able to reproduce the structure of natural language, but it risks generating a lot of meaningless content, which is what we call hallucination.

The fine-tuning

To reduce hallucinations, the model must be specialized, i.e. trained for a specific domain. This is called fine-tuning. There are several strategies that I will not detail here (unsupervised, supervised, reinforcement learning with human feedback, etc.).

Fine-tuning is a complex process that requires expertise. You must create the right dataset, and there are many parameters to test before finding the right recipe, which takes a certain amount of know-how. If you go down this path, expect it to take time to reach a satisfactory result.

At this stage, it is important to note that if an LLM seems capable of reasoning, it is only an emergent behavior resulting from the knowledge of language it has stored. The generative model is fundamentally not built for that.

It should also be noted that an LLM does not know how to do math. If we ask it, “Is 7 greater than 5?”, and it has the information in its training data, it will be able to answer correctly. But with this type of question in a more general context, there is a good chance that it will hallucinate an answer.

Prompt engineering

Before developing this point, I need to clarify what I am going to talk about. Some people use “prompt engineering” to mean the art of talking to chatbots in the right way, and others dismiss prompt engineering as a joke rather than a real area of expertise. Here, I will talk about building the chatbot’s context. Talking about context engineering might remove this ambiguity, but it is not a commonly used name.

As an LLM only knows how to predict the next token, it must be bootstrapped correctly. Earlier, I said that it was enough to place the user’s prompt at the input of the model. In reality, it’s a little more complex: we also need to add a pre-prompt to give more context to the LLM and allow it to generate a coherent response. For example, “You are an assistant who answers the user’s questions in the most useful way…”
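With a chat API, the pre-prompt is usually passed as a “system” message ahead of the user’s message. Here is a rough illustration with the OpenAI Python client; the model name and the wording are only examples.

```python
# Rough illustration of a pre-prompt ("system" message) with the OpenAI client.
# The model name and the prompt wording are only examples.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The pre-prompt bootstraps the model before the user's prompt is added.
        {"role": "system", "content": "You are an assistant who answers the user's questions in the most useful way."},
        {"role": "user", "content": "What is retrieval augmented generation?"},
    ],
)
print(response.choices[0].message.content)
```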

In fact, the LLM is a kind of machine that can be programmed with natural language. We can give it a personality simply by adding to the context, “Personality: description of personality”. We can give it a scenario in the same way. We can give it examples of how it is supposed to “speak”. We can also give it a specific memory…
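Since the context is just text, this “programming” is nothing more than string assembly. A minimal sketch, with invented values, could look like this:

```python
# Sketch of "programming with natural language": the context is plain text,
# assembled from whatever sections you need. All values below are invented.
personality = "Cheerful market gardener who loves puns."
scenario = "The user is planning their vegetable garden for next season."
examples = (
    "User: When do I plant tomatoes?\n"
    "Assistant: After the last frost, usually in May."
)
memory = "Last time, the user said they live in Brittany."

system_prompt = (
    "You are an assistant who answers the user's questions in the most useful way.\n"
    f"Personality: {personality}\n"
    f"Scenario: {scenario}\n"
    f"Examples of how you speak:\n{examples}\n"
    f"Memory of previous conversations:\n{memory}"
)
print(system_prompt)
```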

Retrieval Augmented Generation

Fine-tuning is not the only way to avoid hallucinations. Simply adding information about a topic in the context of the LLM makes it capable of generating more accurate answers. As fine-tuning is not so easy to implement, another method, RAG, has become very popular.

To implement this method, you simply cut the specific information you want to give to the LLM into chunks. With models like SBERT, we can associate a vector with each chunk that represents its topic. SBERT is an encoder transformer: it takes tokenized text as input and outputs a vector. This type of model can be used for classification; in our case, SBERT was trained specifically to measure the thematic proximity between two texts. To do this, we compute the cosine of the angle formed by the two vectors representing each text. The closer the result is to 1, the more similar the topics of the two texts.
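Here is a minimal sketch of this retrieval step with the sentence-transformers library: encode the chunks, compare them to the user’s question with cosine similarity, and paste the best matches into the prompt. The model name and the data are only examples.

```python
# Minimal RAG retrieval sketch with the sentence-transformers library.
# The model name and the chunks are only examples.
import torch
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

chunks = [
    "Strawberries are harvested from May to July.",
    "Leeks can stay in the ground all winter.",
    "Tomatoes are harvested from July to October.",
]
chunk_vectors = encoder.encode(chunks, convert_to_tensor=True)  # one vector per chunk

def build_prompt(question: str, top_k: int = 2) -> str:
    question_vector = encoder.encode(question, convert_to_tensor=True)
    # Cosine similarity between the question and each chunk: closer to 1 = same topic.
    scores = util.cos_sim(question_vector, chunk_vectors)[0]
    best = torch.topk(scores, k=min(top_k, len(chunks))).indices
    context = "\n".join(chunks[int(i)] for i in best)
    return (
        "Answer using only the information below.\n"
        f"Information:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_prompt("When can I pick strawberries?"))
# The resulting prompt is then sent to the LLM as usual.
```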

With this technique, it is easy to add specific knowledge in the context of the LLM based on a user’s questions. But the capability of this technique is also very limited.

For example, if you use an LLM that was not trained on fruits and vegetables (a stupid example), you can easily create a RAG with data on this topic. If the user asks specific questions containing the name of a fruit or vegetable, the RAG system will know how to place the right data in the context of the LLM, and it will be able to generate a correct answer. But if the user asks a general question that requires a synthesis, for example, “What fruits and vegetables are harvested in August?”, the LLM will be unable to answer correctly.

RAG is only capable of adding knowledge to the LLM by topic similarity. If you want a model capable of synthesizing a topic, you have to go through fine-tuning, and that is not the same cost.

For your information, when you create a custom GPT with OpenAI, under the hood it is a RAG, with all its limitations.

The RAG system is ultimately a semantic search engine that allows information to be added to the context of the LLM. The problem is that this information is static. If you want live information, it is also possible to interface your LLM with a dynamic search engine that crawls up-to-date data.

API

Another way to have up-to-date data is to use an API. But this method is also very limited, even more so than a static RAG. With a RAG, you can at least check the similarity between the user prompt and your data.

Let’s take the (still stupid) example of fruits and vegetables. Imagine this time that you have an API for retrieving data on fruits and vegetables. If an endpoint lets you list all fruits and vegetables, you can retrieve this data and store it in your app. But you will only have the names of the fruits and vegetables, and topic similarity is probably not the right tool here. The best we can do is a simple lexical comparison: if the right keyword is present (the name of a fruit or vegetable), we query every available endpoint for that specific fruit or vegetable. Then we can add all the information to the context.
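A sketch of this keyword-routing approach might look like the following. The API base URL and its endpoints are entirely hypothetical.

```python
# Sketch of the lexical keyword matching described above.
# The API base URL and its endpoints are entirely hypothetical.
import requests

BASE_URL = "https://example.com/api"  # hypothetical API

# Fetched once from a hypothetical "list everything" endpoint and kept in the app.
known_items = {item["name"].lower() for item in requests.get(f"{BASE_URL}/produce").json()}

def enrich_context(user_prompt: str) -> str:
    """If a known fruit or vegetable name appears in the prompt, query every
    endpoint we have for it and dump the results into the context."""
    found = [name for name in known_items if name in user_prompt.lower()]
    pieces = []
    for name in found:
        for endpoint in ("details", "seasons", "nutrition"):  # hypothetical endpoints
            response = requests.get(f"{BASE_URL}/produce/{name}/{endpoint}")
            if response.ok:
                pieces.append(response.text)
    return "\n".join(pieces)

context = enrich_context("When are strawberries harvested?")
# `context` is then prepended to the LLM prompt, exactly as with a static RAG.
```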

We also run into the same limitation as with a static RAG for general questions that require a synthesis. The most frustrating part is that even if the API offers, for example, a specific endpoint to answer the question “What fruits and vegetables are harvested in August?”, it will be very difficult for the system to understand the user’s question and make the right query to the right endpoint. Once again, despite the illusion, generative AI understands nothing.

Another method would be to use a classification model to route user requests to the correct endpoint. But unlike SBERT, which was specifically trained for topic similarity, this time we would have to train a specific encoder model ourselves, with all that it costs (dataset creation, experimentation, failure, improvement, a lot of time…), and without being sure that it will work correctly.
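If you did train such a router, using it would be the easy part. In the sketch below, the model name stands for a hypothetical fine-tuned classifier, and the label-to-endpoint mapping is invented.

```python
# Sketch of request routing with a classifier. "your-org/endpoint-router" is a
# placeholder for a hypothetical encoder you would have to fine-tune yourself.
from transformers import pipeline

router = pipeline("text-classification", model="your-org/endpoint-router")

# Each label the classifier was trained on maps to one (hypothetical) API endpoint.
LABEL_TO_ENDPOINT = {
    "harvest_by_month": "/produce/harvest?month={month}",
    "item_details": "/produce/{name}/details",
}

prediction = router("What fruits and vegetables are harvested in August?")[0]
endpoint = LABEL_TO_ENDPOINT.get(prediction["label"])
print(prediction, endpoint)
```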

100% accurate

A final word about hallucinations. It is common to want a 100% accurate assistant, but that doesn’t exist. The best method to limit hallucinations is to combine RAG and fine-tuning. And even so, a 98% accurate assistant is already a good score. With the best app in the world, you will still have errors. And such an app will not be built in a few days.

Conclusion

In this article, I have done my best to share my knowledge of the subject with people who are not familiar with how LLMs work. I hope it has been useful to you.

If you work in the world of AI, you have probably run into these kinds of misunderstandings about what AI can and can’t do. If you find my explanations useful, don’t hesitate to share this article.