Discovering PyTorch
Published on Apr 6, 2025 in LLM from scratch, PyTorch
I’ve already shared an article about creating a neural network with PyTorch. As that article shows, it’s pretty straightforward once you know how a neural network works.
This is true for classical/simple neural networks. But in this new series of articles, I want to guide you step by step in creating a generative transformer.
PyTorch offers ready-made building blocks for creating transformers. It even provides the attention mechanism as a standalone function. But there is so much variation in how transformers are implemented that most of the time you’ll want to build one from scratch.
Creating a transformer isn’t that complicated. But unlike most neural networks, relying on torch.nn alone isn’t enough: you also need to know the basics of tensor manipulation.
The best way to discover this is to open a Python shell and tinker with the functions provided by PyTorch to understand their behavior.
To follow this series of articles, you will need some basic knowledge of linear algebra. If you know what vectors and matrices are, and the operations that can be performed on them, that should be enough.
If you don’t know what a tensor is, it’s just a generalization of vectors and matrices: a vector is a 1-dimensional tensor, a matrix is a 2-dimensional tensor, and so on. The attention mechanism implemented in transformers mainly uses 4-dimensional tensors (batch size × sequence length × number of heads × size of a head).
To begin, launch the Python shell and import torch.
$ python3
Python 3.11.2 (main, Nov 30 2024, 21:22:50) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>>
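For example, the 4-dimensional attention tensor mentioned above could be created like this (the sizes here are arbitrary, chosen just for illustration):
>>> x = torch.randn(1, 10, 4, 16)  # batch x sequence x heads x head size
>>> x.size()
torch.Size([1, 10, 4, 16])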
Creation and initialization of tensors
Apart from arange and linspace, which create one-dimensional tensors, the functions below create n-dimensional tensors, where n is the number of positional parameters given. The value of each parameter sets the size of the corresponding dimension.
Some of these functions also accept named parameters, such as dtype to set the data type, or device to choose whether the tensor is created on the host or on a GPU. This article is just an introduction, so I advise you to consult the documentation to find out more.
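For example, here is how dtype and device can be used (the second call assumes you have a CUDA-capable GPU):
>>> torch.zeros(2, 2, dtype=torch.int64)
tensor([[0, 0],
        [0, 0]])
>>> torch.zeros(2, 2, device="cuda")  # only works with a CUDA-capable GPU
tensor([[0., 0.],
        [0., 0.]], device='cuda:0')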
The empty function creates an uninitialized tensor. It therefore contains arbitrary values.
>>> torch.empty(5)
tensor([1.6289e-22, 4.5769e-41, 1.6289e-22, 4.5769e-41, 4.4842e-44])
>>> torch.empty(5, 3)
tensor([[1.6745e-35, 0.0000e+00, 1.1005e-35],
[0.0000e+00, 1.1210e-43, 0.0000e+00],
[8.9683e-44, 0.0000e+00, 1.6694e-35],
[0.0000e+00, 0.0000e+00, 0.0000e+00],
[0.0000e+00, 0.0000e+00, 0.0000e+00]])
The zeros function creates a tensor initialized to zero.
>>> torch.zeros(3, 3)
tensor([[0., 0., 0.],
[0., 0., 0.],
[0., 0., 0.]])
The ones function creates a tensor initialized to one.
>>> torch.ones(3, 3)
tensor([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
The rand function creates a tensor initialized with random values drawn from a uniform distribution between zero and one.
>>> torch.rand(3, 3)
tensor([[0.1536, 0.9394, 0.6958],
[0.5869, 0.5589, 0.6007],
[0.1248, 0.3264, 0.5541]])
The randn function creates a tensor initialized with random values drawn from a normal distribution with mean zero and standard deviation one.
>>> torch.randn(3, 3)
tensor([[-0.2956, 1.0668, 1.3782],
[ 1.8092, 0.2827, -0.7969],
[ 1.3981, 1.2128, 0.9647]])
The eye function creates an identity matrix (a 2-dimensional tensor).
>>> torch.eye(3, 3)
tensor([[1., 0., 0.],
[0., 1., 0.],
[0., 0., 1.]])
The arange and linspace functions create sequences.
With arange, you can just specify the end value; the sequence then starts at zero with a step of one.
>>> torch.arange(5)
tensor([0, 1, 2, 3, 4])
You can also specify the start and end.
>>> torch.arange(1, 5)
tensor([1, 2, 3, 4])
You can also change the step.
>>> torch.arange(1, 5, 0.5)
tensor([1.0000, 1.5000, 2.0000, 2.5000, 3.0000, 3.5000, 4.0000, 4.5000])
With linspace, you specify the start value, the end value, and the number of elements.
>>> torch.linspace(1, 5, 8)
tensor([1.0000, 1.5714, 2.1429, 2.7143, 3.2857, 3.8571, 4.4286, 5.0000])
Manipulating Dimensions
The view method applies to a torch.Tensor and lets you keep its contents while separating or merging its dimensions. For this to work, the shapes must of course be compatible (in the example below, 3×3 = 9).
>>> x = torch.randn(3, 3)
>>> x
tensor([[-1.9500, 0.8733, 1.2472],
[ 0.7804, -0.0726, -1.1400],
[-0.5425, -1.0257, 0.9130]])
>>> x.size()
torch.Size([3, 3])
>>> y = x.view(9)
>>> y
tensor([-1.9500, 0.8733, 1.2472, 0.7804, -0.0726, -1.1400, -0.5425, -1.0257,
0.9130])
>>> y.size()
torch.Size([9])
>>> z = y.view(3, 3)
>>> z
tensor([[-1.9500, 0.8733, 1.2472],
[ 0.7804, -0.0726, -1.1400],
[-0.5425, -1.0257, 0.9130]])
>>> z.size()
torch.Size([3, 3])
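You can also pass -1 for at most one dimension, and PyTorch will infer its size from the remaining ones:
>>> z.view(-1).size()
torch.Size([9])
>>> z.view(-1, 3).size()
torch.Size([3, 3])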
The unsqueeze method applies to a torch.Tensor and inserts a dimension of size one. You pass the index of the new dimension as a parameter. In PyTorch, the dimensions of an n-dimensional tensor are numbered from 0 to n - 1; you can also use negative values to count from the end (-1 for the last dimension, -2 for the second to last, etc.)
>>> x = torch.randn(4)
>>> x
tensor([ 0.0913, 0.4442, -1.0917, -0.6743])
>>> x.unsqueeze(0)
tensor([[ 0.0913, 0.4442, -1.0917, -0.6743]])
>>> x.unsqueeze(1)
tensor([[ 0.0913],
[ 0.4442],
[-1.0917],
[-0.6743]])
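Negative indices count from the end; on this 1-dimensional tensor, -1 is equivalent to 1:
>>> x.unsqueeze(-1)
tensor([[ 0.0913],
        [ 0.4442],
        [-1.0917],
        [-0.6743]])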
The squeeze method applies to a torch.Tensor and removes all dimensions of size one. The dimension(s) to remove can also be passed as a parameter, either as an integer or as a tuple of integers. If a specified dimension is not of size one, it is left unchanged.
>>> x = torch.randn(1, 2, 1, 2)
>>> x
tensor([[[[ 1.2558, -0.3881]],
[[-1.0312, -0.5278]]]])
>>> x.squeeze()
tensor([[ 1.2558, -0.3881],
[-1.0312, -0.5278]])
>>> x.squeeze().size()
torch.Size([2, 2])
>>> x.squeeze(0).size()
torch.Size([2, 1, 2])
>>> x.squeeze(1).size()
torch.Size([1, 2, 1, 2])
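In recent versions of PyTorch (2.0 and later), you can also pass a tuple to remove several size-one dimensions at once:
>>> x.squeeze((0, 2)).size()
torch.Size([2, 2])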
The t method applies to a torch.Tensor of dimension one or two and performs a transpose.
>>> x = torch.randn(3, 3)
>>> x
tensor([[-0.2568, -0.9684, 1.4707],
[ 0.9398, 0.5700, -1.3556],
[ 0.2480, -0.4194, 0.2585]])
>>> x.t()
tensor([[-0.2568, 0.9398, 0.2480],
[-0.9684, 0.5700, -0.4194],
[ 1.4707, -1.3556, 0.2585]])
The transpose method applies to a torch.Tensor and swaps the two dimensions passed as parameters. See the unsqueeze method above for how dimensions are numbered.
>>> x = torch.randn(3, 3, 2)
>>> x
tensor([[[ 2.6529, -0.7599],
[ 1.0370, -0.3682],
[-0.8821, 1.3665]],
[[-1.1326, -0.1308],
[ 0.0860, 0.8604],
[-0.4915, 0.7214]],
[[ 1.0046, -0.8917],
[ 1.8634, -0.1128],
[-0.5388, 0.2329]]])
>>> x.transpose(0,2)
tensor([[[ 2.6529, -1.1326, 1.0046],
[ 1.0370, 0.0860, 1.8634],
[-0.8821, -0.4915, -0.5388]],
[[-0.7599, -0.1308, -0.8917],
[-0.3682, 0.8604, -0.1128],
[ 1.3665, 0.7214, 0.2329]]])
>>> x.transpose(0,1)
tensor([[[ 2.6529, -0.7599],
[-1.1326, -0.1308],
[ 1.0046, -0.8917]],
[[ 1.0370, -0.3682],
[ 0.0860, 0.8604],
[ 1.8634, -0.1128]],
[[-0.8821, 1.3665],
[-0.4915, 0.7214],
[-0.5388, 0.2329]]])
The permute method applies to a torch.Tensor and lets you reorder all of its dimensions at once.
>>> x.permute(2, 1, 0)
tensor([[[ 2.6529, -1.1326, 1.0046],
[ 1.0370, 0.0860, 1.8634],
[-0.8821, -0.4915, -0.5388]],
[[-0.7599, -0.1308, -0.8917],
[-0.3682, 0.8604, -0.1128],
[ 1.3665, 0.7214, 0.2329]]])
Operations on Tensors
The split and chunk methods split a torch.Tensor into several sub-tensors. split takes as a parameter the size of each sub-tensor you want to obtain, while chunk takes the number of sub-tensors. For tensors with more than one dimension, a second parameter specifies along which dimension to split.
>>> x = torch.randn(6)
>>> x
tensor([-0.4718, 0.8133, -0.6558, 0.8866, -0.6374, -0.1219])
>>> x.split(2)
(tensor([-0.4718, 0.8133]), tensor([-0.6558, 0.8866]), tensor([-0.6374, -0.1219]))
>>> x.split(3)
(tensor([-0.4718, 0.8133, -0.6558]), tensor([ 0.8866, -0.6374, -0.1219]))
>>> x.chunk(2)
(tensor([-0.4718, 0.8133, -0.6558]), tensor([ 0.8866, -0.6374, -0.1219]))
>>> x.chunk(3)
(tensor([-0.4718, 0.8133]), tensor([-0.6558, 0.8866]), tensor([-0.6374, -0.1219]))
>>> x = torch.randn(4, 4)
>>> x
tensor([[-1.2150, 0.4500, -0.0291, -1.6581],
[ 1.2452, 2.2687, -2.7824, 1.6155],
[ 0.3234, -1.5922, -0.5113, 1.3072],
[-0.3231, -0.0239, 0.8616, -0.3413]])
>>> x.split(2, 0)
(tensor([[-1.2150, 0.4500, -0.0291, -1.6581],
[ 1.2452, 2.2687, -2.7824, 1.6155]]), tensor([[ 0.3234, -1.5922, -0.5113, 1.3072],
[-0.3231, -0.0239, 0.8616, -0.3413]]))
>>> x.split(2, 1)
(tensor([[-1.2150, 0.4500],
[ 1.2452, 2.2687],
[ 0.3234, -1.5922],
[-0.3231, -0.0239]]), tensor([[-0.0291, -1.6581],
[-2.7824, 1.6155],
[-0.5113, 1.3072],
[ 0.8616, -0.3413]]))
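The split method also accepts a list of sizes instead of a single size, so the sub-tensors don’t all have to be the same length:
>>> x.split([1, 3], 0)
(tensor([[-1.2150,  0.4500, -0.0291, -1.6581]]), tensor([[ 1.2452,  2.2687, -2.7824,  1.6155],
        [ 0.3234, -1.5922, -0.5113,  1.3072],
        [-0.3231, -0.0239,  0.8616, -0.3413]]))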
The cat function concatenates tensors. It takes a tuple of tensors and, optionally, the dimension along which to concatenate.
>>> x = torch.randn(3, 3)
>>> y = torch.randn(3, 3)
>>> x
tensor([[-0.2900, -0.6176, 2.3342],
[-0.5439, 0.3578, -0.2407],
[ 1.8569, -0.6359, 2.1390]])
>>> y
tensor([[-2.0391, 0.0218, -1.1134],
[ 2.2730, 1.5351, -0.7073],
[-0.0502, 0.5240, -0.7461]])
>>> torch.cat((x, y))
tensor([[-0.2900, -0.6176, 2.3342],
[-0.5439, 0.3578, -0.2407],
[ 1.8569, -0.6359, 2.1390],
[-2.0391, 0.0218, -1.1134],
[ 2.2730, 1.5351, -0.7073],
[-0.0502, 0.5240, -0.7461]])
>>> torch.cat((x, y), 1)
tensor([[-0.2900, -0.6176, 2.3342, -2.0391, 0.0218, -1.1134],
[-0.5439, 0.3578, -0.2407, 2.2730, 1.5351, -0.7073],
[ 1.8569, -0.6359, 2.1390, -0.0502, 0.5240, -0.7461]])
The stack function is similar to cat, but it concatenates the tensors along a new dimension.
>>> torch.stack((x, y))
tensor([[[-0.2900, -0.6176, 2.3342],
[-0.5439, 0.3578, -0.2407],
[ 1.8569, -0.6359, 2.1390]],
[[-2.0391, 0.0218, -1.1134],
[ 2.2730, 1.5351, -0.7073],
[-0.0502, 0.5240, -0.7461]]])
>>> torch.stack((x, y), 1)
tensor([[[-0.2900, -0.6176, 2.3342],
[-2.0391, 0.0218, -1.1134]],
[[-0.5439, 0.3578, -0.2407],
[ 2.2730, 1.5351, -0.7073]],
[[ 1.8569, -0.6359, 2.1390],
[-0.0502, 0.5240, -0.7461]]])
The matmul function performs matrix multiplication between two tensors; you can also use the @ operator. The add function adds two tensors; you can also use the + operator. And the mul function multiplies tensors element by element; you can also use the * operator.
>>> x = torch.randn(3, 3)
>>> y = torch.randn(3, 3)
>>> x
tensor([[-1.4301, -0.5729, 0.6620],
[ 0.8530, 1.3618, -0.3986],
[-0.2317, 0.4181, 1.3767]])
>>> y
tensor([[ 0.6150, 1.3889, 1.3452],
[ 2.9333, 0.6640, 0.5375],
[-1.4516, 0.1693, -0.5814]])
>>> torch.matmul(x, y)
tensor([[-3.5209, -2.2546, -2.6167],
[ 5.0977, 2.0215, 2.1112],
[-0.9144, 0.1890, -0.8874]])
>>> x @ y
tensor([[-3.5209, -2.2546, -2.6167],
[ 5.0977, 2.0215, 2.1112],
[-0.9144, 0.1890, -0.8874]])
>>> torch.add(x, y)
tensor([[-0.8152, 0.8160, 2.0072],
[ 3.7864, 2.0257, 0.1389],
[-1.6833, 0.5875, 0.7953]])
>>> x + y
tensor([[-0.8152, 0.8160, 2.0072],
[ 3.7864, 2.0257, 0.1389],
[-1.6833, 0.5875, 0.7953]])
>>> torch.mul(x, y)
tensor([[-0.8795, -0.7957, 0.8905],
[ 2.5022, 0.9042, -0.2142],
[ 0.3363, 0.0708, -0.8004]])
>>> x * y
tensor([[-0.8795, -0.7957, 0.8905],
[ 2.5022, 0.9042, -0.2142],
[ 0.3363, 0.0708, -0.8004]])
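A detail worth knowing for transformers: matmul is not limited to matrices. With higher-dimensional tensors, the leading dimensions are treated as batch dimensions and the matrix product is applied to the last two (the shapes below are arbitrary, just for illustration):
>>> a = torch.randn(2, 4, 3, 5)
>>> b = torch.randn(2, 4, 5, 3)
>>> torch.matmul(a, b).size()
torch.Size([2, 4, 3, 3])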
The sum and mean methods allow you to calculate the sum and mean of the elements of a tensor.
>>> x = torch.rand(5)
>>> x
tensor([0.3798, 0.6449, 0.4116, 0.0874, 0.4655])
>>> x.sum()
tensor(1.9892)
>>> x.mean()
tensor(0.3978)
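Both methods also accept a dimension as a parameter to reduce along that dimension only (I use ones here so the result is predictable):
>>> x = torch.ones(2, 3)
>>> x.sum(0)
tensor([2., 2., 2.])
>>> x.sum(1)
tensor([3., 3.])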
Broadcasting
Of course, for some operations the shapes of the tensors must be compatible. When they are not, PyTorch tries to apply broadcasting: it compares the dimensions starting from the last one and, where possible, virtually expands the tensor with missing (or size-one) dimensions so that the shapes match.
>>> x = torch.randn(3, 3)
>>> y = torch.ones(3)
>>> x
tensor([[ 0.1552, 0.5561, -0.4699],
[ 2.4877, 1.5746, 2.3511],
[ 0.1658, 0.0917, -0.4135]])
>>> x + y
tensor([[1.1552, 1.5561, 0.5301],
[3.4877, 2.5746, 3.3511],
[1.1658, 1.0917, 0.5865]])
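Broadcasting combines nicely with unsqueeze: turning a 1-dimensional tensor into a column lets you compute, for example, an outer product:
>>> a = torch.arange(3).unsqueeze(1)  # shape (3, 1)
>>> b = torch.arange(3)               # shape (3,)
>>> a * b                             # broadcasts to shape (3, 3)
tensor([[0, 0, 0],
        [0, 1, 2],
        [0, 2, 4]])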
Triangular Matrices
The triu and tril functions create upper and lower triangular matrices. They take a tensor as a parameter and return its upper (triu) or lower (tril) triangular part, setting the other elements to zero. This is used in the attention mechanism of a generative (decoder-only) transformer to make it autoregressive, i.e. dependent only on the previous values in the sequence. The optional diagonal parameter shifts the diagonal that serves as the boundary. This is used in sparse transformers with strided attention (large-context models).
>>> torch.triu(torch.ones(3, 3))
tensor([[1., 1., 1.],
[0., 1., 1.],
[0., 0., 1.]])
>>> torch.triu(torch.ones(3, 3), diagonal=1)
tensor([[0., 1., 1.],
[0., 0., 1.],
[0., 0., 0.]])
>>> torch.triu(torch.ones(3, 3), diagonal=2)
tensor([[0., 0., 1.],
[0., 0., 0.],
[0., 0., 0.]])
>>> torch.triu(torch.ones(3, 3), diagonal=-1)
tensor([[1., 1., 1.],
[1., 1., 1.],
[0., 1., 1.]])
>>> torch.tril(torch.ones(3, 3))
tensor([[1., 0., 0.],
[1., 1., 0.],
[1., 1., 1.]])
>>> torch.tril(torch.ones(3, 3), diagonal=1)
tensor([[1., 1., 0.],
[1., 1., 1.],
[1., 1., 1.]])
>>> torch.tril(torch.ones(3, 3), diagonal=-1)
tensor([[0., 0., 0.],
[1., 0., 0.],
[1., 1., 0.]])
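To give you a taste of what’s coming in this series, here is a minimal sketch of how tril is typically used to build a causal attention mask (the names and sizes are mine, just for illustration): forbidden positions are set to -inf so that softmax assigns them a probability of zero.
>>> scores = torch.randn(4, 4)           # hypothetical attention scores for a sequence of 4
>>> mask = torch.tril(torch.ones(4, 4))  # 1 where a position may attend, 0 elsewhere
>>> masked = scores.masked_fill(mask == 0, float("-inf"))
>>> probs = torch.softmax(masked, dim=-1)  # each row sums to 1
>>> (probs * (1 - mask)).sum()             # probability mass on future positions
tensor(0.)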
Of course, feel free to tinker with all of this in the Python shell to fully understand how it works and see what is and isn’t possible.
Don’t miss my upcoming posts — hit the follow button on my LinkedIn profile