Hello World LLM!

Published on Jan 14, 2024 in Former LLM Course  

In our two previous lessons, you learned how to install the necessary operating system requirements and the Docker environment to work with LLMs. Now it’s time to create your “Hello World” LLM application.

For this tutorial, we will be using Hugging Face’s Transformers library and the GPTQ version of OpenHermes 2.5, a fine-tuned Mistral 7B model.

As Hugging Face repositories use Git LFS to store the model weights, you will have to install it on your local machine and enable it for your user.

terminal
sudo apt install git-lfs
git lfs install

Inside your model directory (shared with your Docker environment), you can clone the repository.

terminal
git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ

Please note that you have several branches to choose from. In this case, I’ve selected the 4-bit branch with a group size of 32. For reference, smaller group sizes yield higher-quality results (at the cost of slightly more VRAM). However, 1g or 1gs means there’s no configured group size, and the GPTQ default is 1024 (lower quality).
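If you want to see every available branch before cloning, you can also list them with the huggingface_hub library. This is a minimal sketch, assuming huggingface_hub is installed in your environment:

python
from huggingface_hub import list_repo_refs

# Each branch of the repository corresponds to a different
# quantization configuration (bits, group size, act order)
refs = list_repo_refs("TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ")

for branch in refs.branches:
    print(branch.name)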

Edit the config.json file in the model directory. Search for the line that says "use_exllama": false and replace false with true. With 4-bit models, the Transformers library can then take advantage of the ExLlama kernels, which improves performance significantly.
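If you prefer doing this from a script rather than a text editor, here is a minimal sketch. It assumes the directory name created by the git clone above and the standard Transformers GPTQ layout, where the flag lives under "quantization_config":

python
import json

# Assumption: the model directory keeps the name created by the git clone above
config_path = "OpenHermes-2.5-Mistral-7B-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Enable the ExLlama kernels for the 4-bit GPTQ weights
config["quantization_config"]["use_exllama"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)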

python
import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
from auto_gptq import exllama_set_max_input_length

# Directory created by the git clone above
model_name = "OpenHermes-2.5-Mistral-7B-GPTQ"

# Set the Transformers seed so the generated text differs on each run

seed = random.randint(0, 2**32-1)
set_seed(seed)

# Load the tokenizer

tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    clean_up_tokenization_spaces=True,
    legacy=True
)

# Load the model

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=False
)

# Set the ExLlama context size to 8k tokens.
# It isn't strictly required for this example,
# but now you know how to do it.

model = exllama_set_max_input_length(model, max_input_length=8192)

system = "You are a helpful AI assistant"
question = "Who discovered America?"

# This model is trained using the ChatML prompt format

prompt = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"

print(f"\nUser question: {question}\n")
print(f"Formated prompt:\n {prompt}\n")

# Run the text generator

pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=500,
)
result = pipe(prompt)[0]["generated_text"]

# Remove the prompt from the result

length = len(prompt)
answer = result[length:]

# Calculate the number of generated tokens
# (skip special tokens so the BOS token isn't counted)

tokens = len(tokenizer.encode(answer, add_special_tokens=False))

print(f"\nBot answer: {answer}")
print(f"\nGenerated tokens: {tokens}")

And that’s it! You now have everything you need to get started with LLMs. The next lesson will cover writing prompts.