Lesson 3: Hello World LLM!

Hello Neural Explorers!

In our two previous lessons, you learned how to install the necessary operating system requirements and set up the Docker environment for coding with LLMs. Now it’s time to create your “Hello World” LLM application.

For this tutorial, we will be using Hugging Face’s Transformers library and the GPTQ version of OpenHermes 2.5, a fine-tuned Mistral 7B model.
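The Docker image from the previous lessons should already include the required Python libraries. If yours doesn’t, they can be installed with pip (a minimal set, assuming a CUDA-enabled PyTorch is already present; accelerate is needed for device_map="auto" and optimum for the Transformers GPTQ integration):

pip install transformers accelerate optimum auto-gptq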

As Hugging Face uses Git LFS, you will have to install it on your local machine.

sudo apt install git-lfs

Inside your model directory (shared with your Docker environment), you can clone the repository.

git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ

Note that the repository offers several quantization branches to choose from. In this case I’ve selected the 4-bit branch with a group size of 32. As a rule, smaller group sizes yield higher-quality results. A branch tagged -1g has no configured group size; GPTQ then falls back to its default of 1024, which gives lower quality.
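If you want to check which quantization branches are available before cloning, Git can list them straight from the Hugging Face repository:

git ls-remote --heads https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ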

Edit the config.json file in the model directory. Find the line that says "use_exllama": false and replace false with true. With 4-bit models, the Transformers library can then take advantage of the ExLlama kernels, which improves performance significantly.
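If you prefer to make that edit from Python instead of a text editor, here is a small sketch. It assumes the flag sits inside the quantization_config section of config.json, which is where TheBloke’s GPTQ repositories keep it.

import json

config_path = "OpenHermes-2.5-Mistral-7B-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Enable the ExLlama kernels for the 4-bit quantized weights
# (assumes the flag lives under quantization_config)
config["quantization_config"]["use_exllama"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)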


import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
from auto_gptq import exllama_set_max_input_length
 
# Path to the locally cloned model directory
model_name = "OpenHermes-2.5-Mistral-7B-GPTQ"
 
# Seed the Transformers generator so that each run
# produces different text
 
seed = random.randint(0, 2**32-1)
set_seed(seed)
 
# Load the tokenizer
 
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    clean_up_tokenization_spaces=True,
    legacy=True
)
 
# Load the model
 
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=False
)
 
# Set the exllama context size to 8k tokens
# It's not necessarily required for this example
# But now you know how to do it
 
model = exllama_set_max_input_length(model, max_input_length=8192)
 
system = "You are a helpful AI assistant"
question = "Who discovered America?"
 
# This model is trained using the ChatML prompt format
 
prompt = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
 
print(f"\nUser question: {question}\n")
print(f"Formated prompt:\n {prompt}\n")
 
# Run the text generator
 
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=500,
)
result = pipe(prompt)[0]["generated_text"]
 
# Remove the prompt from the result
 
length = len(prompt)
answer = result[length:]
 
# Calculate the number of generated tokens
 
tokens = len(tokenizer.encode(answer))
 
print(f"\nBot answer: {answer}")
print(f"\nGenerated tokens: {tokens}")

And that’s it! You now have everything you need to get started with LLMs. The next lesson will cover writing prompts.