Lesson 3: Hello World LLM!

Hello Neural Explorers!

In our two previous lessons, you learned how to install the necessary operating system requirements and set up the Docker environment for coding with LLMs. Now it’s time to create your “Hello World” LLM application.

For this tutorial, we will be using Hugging Face’s Transformers library and the GPTQ version of OpenHermes 2.5, a fine-tuned Mistral 7B model.
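The Docker image from the previous lessons should already include the required Python libraries. If yours doesn’t, they can be installed with pip (a minimal set, assuming a CUDA-enabled PyTorch is already present; accelerate is needed for device_map="auto" and optimum for the Transformers GPTQ integration):

pip install transformers accelerate optimum auto-gptq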

As Hugging Face uses Git LFS, you will have to install it on your local machine.

sudo apt install git-lfs

Inside your model directory (shared with your Docker environment), you can clone the repository.

git clone --branch gptq-4bit-32g-actorder_True https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ

Note that the repository offers several quantization branches to choose from. In this case I’ve selected the 4-bit branch with a group size of 32. As a rule, smaller group sizes yield higher-quality results. A branch tagged -1g has no configured group size; GPTQ then falls back to its default of 1024, which gives lower quality.
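If you want to check which quantization branches are available before cloning, Git can list them straight from the Hugging Face repository:

git ls-remote --heads https://huggingface.co/TheBloke/OpenHermes-2.5-Mistral-7B-GPTQ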

Edit the config.json file in the model directory. Find the line that says "use_exllama": false and replace false with true. With 4-bit models, the Transformers library can then take advantage of the ExLlama kernels, which improves performance significantly.
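If you prefer to make that edit from Python instead of a text editor, here is a small sketch. It assumes the flag sits inside the quantization_config section of config.json, which is where TheBloke’s GPTQ repositories keep it.

import json

config_path = "OpenHermes-2.5-Mistral-7B-GPTQ/config.json"

with open(config_path) as f:
    config = json.load(f)

# Enable the ExLlama kernels for the 4-bit quantized weights
# (assumes the flag lives under quantization_config)
config["quantization_config"]["use_exllama"] = True

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)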


import random
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline, set_seed
from auto_gptq import exllama_set_max_input_length
 
# Path to the locally cloned model directory
model_name = "OpenHermes-2.5-Mistral-7B-GPTQ"
 
# Seed the Transformers generator so that each run
# produces different text
 
seed = random.randint(0, 2**32-1)
set_seed(seed)
 
# Load the tokenizer
 
tokenizer = AutoTokenizer.from_pretrained(
    model_name,
    clean_up_tokenization_spaces=True,
    legacy=True
)
 
# Load the model
 
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=False
)
 
# Set the exllama context size to 8k tokens
# It's not necessarily required for this example
# But now you know how to do it
 
model = exllama_set_max_input_length(model, max_input_length=8192)
 
system = "You are a helpful AI assistant"
question = "Who discovered America?"
 
# This model is trained using the ChatML prompt format
 
prompt = f"<|im_start|>system\n{system}<|im_end|>\n<|im_start|>user\n{question}<|im_end|>\n<|im_start|>assistant\n"
 
print(f"\nUser question: {question}\n")
print(f"Formated prompt:\n {prompt}\n")
 
# Run the text generator
 
pipe = pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    do_sample=True,
    max_new_tokens=500,
)
result = pipe(prompt)[0]["generated_text"]
 
# Remove the prompt from the result
 
length = len(prompt)
answer = result[length:]
 
# Calculate the number of generated tokens
 
tokens = len(tokenizer.encode(answer))
 
print(f"\nBot answer: {answer}")
print(f"\nGenerated tokens: {tokens}")

And that’s it! You now have everything you need to get started with LLMs. The next lesson will cover writing prompts.