Grouped Query Attention (GQA)
Core Idea
With standard Multi-Head Attention, every query head has its own corresponding key and value matrix. The key idea behind Grouped Query Attention is to group query heads so that each group shares a single key and value matrix, reducing the number of key/value matrices (and the memory they require) while keeping performance similar.
In Multi-Head Attention, every query matrix has a corresponding key and value matrix whereas with Grouped Query Attention, queries are grouped to share the key and value matrices.
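To make the grouping concrete, below is a minimal PyTorch sketch of a grouped-query attention layer (illustrative, not our exact training code; the class and argument names are assumptions): n_heads query heads share n_kv_heads key/value heads, and the K/V tensors are simply repeated per group before the usual attention computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_heads query heads share n_kv_heads key/value heads."""

    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Fewer key/value projections than query projections.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads reuses the same K/V head.
        group_size = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # standard attention
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```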
Experiments
The GPU memory used per iteration has dropped by ~100MB with the Grouped Query Attention. The perplexity lines are on top of each other, indicating that the performance is similar.
The main metrics of interest are GPU memory used throughout training and perplexity. As we can see from the experimental results, Grouped Query Attention reduces GPU memory usage by ~100MB while keeping perplexity and loss at a similar level to normal Multi-Head Attention. On top of that, the memory advantage becomes even more pronounced for larger models with bigger dimension sizes and more heads and layers. Therefore, we can conclude that Grouped Query Attention is an effective method to reduce GPU memory usage while keeping performance at a similar level. Notably, GQA completed ~1000 more training steps than the baseline model (visible from the ending dots of both lines in the perplexity graph), confirming its faster training.
Rotary Position Embedding
Core Idea
Rotary Position Embedding (RoPE) is a method to encode positional information. With regular Absolute Positional Encoding, the positional information is added statically to the input embeddings. However, with RoPE, the query and key matrices are multiplied with a rotation matrix in order to capture the positional information. A small walkthrough of the mathematical formulation can be found below:
Mathematical Formulation of RoPE
1. The Rotation Operation
In 2D space, rotating a vector (x, y) by angle θ is defined as:
Rθ(x, y) = (x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ))
2. Complex Number Representation
This rotation can be represented using complex numbers. For z = x + iy:
Rθ(z) = e^(iθ) · z = (cos(θ) + i·sin(θ))(x + iy)
3. Position-Frequency Mapping
For each position m and dimension-pair index d in a model of dimension D, we define the angle of rotation as:
θ_(m,d) = m · θ_base^(-2d/D)
Where θ_base is typically 10000. Consecutive dimensions are paired (dimensions 2d and 2d+1) and each pair represents a single complex number, with the first dimension corresponding to the real component and the second to the imaginary component.
4. Application to Attention
For query q at position m and key k at position n:
q·k becomes R_(mθ)(q) · R_(nθ)(k) = q · R_((n−m)θ)(k)
This shows how RoPE naturally captures only the relative position between m and n in the attention mechanism while also preserving the magnitudes of the vectors.
The attention mechanism in RoPE produces outputs that are primarily functions of two factors: the relative distance between tokens and the content of the token embeddings themselves.
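As a sanity check on steps 3 and 4, here is a small PyTorch sketch (illustrative, not our training implementation; the function name and example values are assumptions) that computes the angles θ_(m,d) and applies the rotation via complex multiplication, so that the resulting dot products depend only on the relative offset and the content:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Rotate a (seq_len, D) tensor: dimension pairs (2d, 2d+1) are treated as
    # complex numbers and multiplied by e^(i * m * base^(-2d/D)) for position m.
    seq_len, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # base^(-2d/D)
    angles = positions[:, None].float() * inv_freq[None, :]               # θ_(m,d)
    x_complex = torch.view_as_complex(x.float().reshape(seq_len, D // 2, 2))
    rotated = x_complex * torch.polar(torch.ones_like(angles), angles)    # multiply by e^(iθ)
    return torch.view_as_real(rotated).reshape(seq_len, D)

# The attention score then depends only on the relative offset and the content:
q = torch.randn(2, 8)           # queries at positions 3 and 7
k = torch.randn(2, 8)           # keys at positions 3 and 7
pos = torch.tensor([3, 7])
q_rot, k_rot = rope_rotate(q, pos), rope_rotate(k, pos)
# (q_rot[0] · k_rot[1]) is a function of q[0], k[1] and the offset 7 - 3 only,
# not of the absolute positions themselves.
```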
Experiments
Perplexity with RoPE remains consistently lower than baseline (see top-right figure), but the implementation reduces inference speed due to additional matrix multiplication requirements.
The RoPE implementation demonstrates ~1.5 point lower perplexity scores compared to the baseline, indicating a measurable advantage. This improvement likely stems from RoPE's theoretical foundations: positional information is encoded explicitly while vector magnitudes remain preserved, potentially leading to more stable training. In contrast, Absolute Positional Encoding alters vector magnitudes arbitrarily, which may contribute to less stable training dynamics. Despite the perplexity improvements, our RoPE implementation revealed a significant drawback: decreased token processing speed during inference due to the additional matrix multiplications required for the rotations. This slower inference may become a limitation for practical applications.
SwiGLU
Core Idea
SwiGLU is an activation function that uses Gated Linear Units (GLU) and uses a Swish function instead of the standard Sigmoid function. The formula for SwiGLU is:
SwiGLU(x, W, V, b, c) = Swish(xW + b) ⊗ (xV + c)
Where Swish(x) = x · sigmoid(βx), and ⊗ represents element-wise multiplication. All the parameters are trainable.
SwiGLU was introduced in the PaLM paper by Google Research as an improvement over other activation functions for transformer models. While there isn't a comprehensive theoretical explanation for why SwiGLU outperforms alternatives like ReLU or GELU, empirical results show it leads to better performance in large language models. It's now commonly used in many state-of-the-art transformer architectures.

Illustration of SwiGLU and other similar activation functions
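Below is a minimal PyTorch sketch of a SwiGLU feed-forward block matching the formula above (illustrative; the output projection back to the model dimension and the use of F.silu, which fixes β = 1, are assumptions rather than details of our exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # Gated feed-forward block using the SwiGLU activation.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim)    # gate branch: xW + b
        self.v = nn.Linear(dim, hidden_dim)    # value branch: xV + c
        self.out = nn.Linear(hidden_dim, dim)  # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(xW + b) ⊗ (xV + c); F.silu is Swish with β fixed to 1.
        return self.out(F.silu(self.w(x)) * self.v(x))
```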
Experiments
SwiGLU reduces perplexity by ~2 points but decreases throughput by ~100,000, slowing training and increasing GPU memory usage by ~350MB.
SwiGLU, despite being a small implementation change, has decreased perplexity by ~2 points throughout training. Therefore, SwiGLU is an easy way to decrease the model's perplexity considering its ease of implementation. One obvious drawback of SwiGLU is the additional trainable parameters it introduces. Despite the clear advantage for model perplexity, these extra parameters make the model slower by several metrics. First, as evident from the zoomed-in perplexity figure, the SwiGLU training is cut off at ~20k steps, while the baseline continued training until ~22k. Since we allowed each model to train for 3 hours, this indicates that training the SwiGLU model was slower than the baseline. When we dig further, we also see that the training throughput is ~100,000 tokens lower than the baseline. Lastly, the new parameters add ~350MB of GPU memory usage throughout training iterations. Even though SwiGLU boosts model capability, there are concerns about memory usage and training and inference speed.
SOAP Optimiser
Core Idea
The SOAP optimizer (ShampoO with Adam in the Preconditioner's eigenbasis) aims for rapid early-stage progress. The idea is to bridge ideas from quasi-Newton methods and modern large-scale deep learning optimization. It is particularly notable for its ability to approximate second-order updates using only first-order information, while retaining compatibility with large-scale training. The core idea is to track Shampoo-style Kronecker-factored gradient statistics as an approximation to the curvature of the loss landscape, and to run an Adam-style update in the eigenbasis of that preconditioner, updating parameters more efficiently than standard first-order methods like Adam.
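The following is a heavily simplified sketch of that rotate-into-the-eigenbasis idea for a single 2D parameter, not the full SOAP algorithm (bias correction, handling of the momentum when the basis changes, and many practical details are omitted); all names and hyperparameter values here are illustrative:

```python
import torch

def soap_like_step(param, grad, state, lr=3e-4, betas=(0.9, 0.95),
                   eps=1e-8, precond_freq=10):
    # Hypothetical helper: one simplified SOAP-style step for a 2D parameter.
    b1, b2 = betas
    if not state:  # lazy initialization of optimizer state
        state["step"] = 0
        state["L"] = grad.new_zeros(grad.shape[0], grad.shape[0])  # ~E[G G^T]
        state["R"] = grad.new_zeros(grad.shape[1], grad.shape[1])  # ~E[G^T G]
        state["QL"] = torch.eye(grad.shape[0], device=grad.device)
        state["QR"] = torch.eye(grad.shape[1], device=grad.device)
        state["m"] = torch.zeros_like(grad)  # Adam first moment (rotated basis)
        state["v"] = torch.zeros_like(grad)  # Adam second moment (rotated basis)
    state["step"] += 1

    # Shampoo-style Kronecker-factored gradient statistics.
    state["L"].mul_(b2).add_(grad @ grad.T, alpha=1 - b2)
    state["R"].mul_(b2).add_(grad.T @ grad, alpha=1 - b2)

    # Periodically refresh the eigenbasis of the preconditioner factors.
    if state["step"] % precond_freq == 1:
        state["QL"] = torch.linalg.eigh(state["L"]).eigenvectors
        state["QR"] = torch.linalg.eigh(state["R"]).eigenvectors

    # Rotate the gradient into the preconditioner's eigenbasis ...
    g_rot = state["QL"].T @ grad @ state["QR"]

    # ... run a plain Adam-style update there ...
    state["m"].mul_(b1).add_(g_rot, alpha=1 - b1)
    state["v"].mul_(b2).add_(g_rot * g_rot, alpha=1 - b2)
    update_rot = state["m"] / (state["v"].sqrt() + eps)

    # ... and rotate the update back before applying it.
    param.data -= lr * (state["QL"] @ update_rot @ state["QR"].T)
```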
Experiments
Here, it proved somewhat counter-productive: SOAP reduced throughput by almost 40%, although its perplexity was slightly better than AdamW's, overtaking it after step 2500. Because of the throughput reduction, the SOAP implementation is not advantageous according to our analysis.
MUON Optimiser
Core Idea
MUON (MomentUm Orthogonalized by Newton-Schulz) is an optimizer designed only for the hidden layers of neural networks, particularly those with 2D parameters. It operates by first computing the standard momentum-based gradient updates and then applying a Newton-Schulz iteration to orthogonalize these updates. This orthogonalization process aims to improve the conditioning of the updates, potentially leading to better training dynamics and efficiency.
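A sketch of the orthogonalization step is shown below. The quintic Newton-Schulz coefficients follow a publicly available Muon implementation and should be treated as an assumption here; the surrounding momentum and learning-rate logic is omitted:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately orthogonalize a 2D update matrix with a quintic Newton-Schulz iteration.
    # Coefficients taken from a public Muon implementation (assumption, not verified here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Usage sketch: orthogonalize the momentum buffer before applying it as the update.
# momentum.mul_(beta).add_(grad)
# param -= lr * newton_schulz_orthogonalize(momentum)
```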
Experiments
MUON gives slightly worse perplexity and throughput than the baseline, although this may be well within the margin of error.
However, in our experiments on GPT-2 Small, a dense, low-parameter regime, MUON underperformed: it yielded +0.8 perplexity and 6% slower training compared to AdamW. We attribute this to a mismatch between MUON's hyperparameter tuning and the dense characteristics of our toy workload. Even though we performed hyperparameter tuning (as far as our computational budget allowed), it was not enough to bring out the true potential expected from MUON, suggesting that it is less robust to hyperparameter choices.
RMSNorm
RMSNorm is a simple change to the normalization step in which the mean subtraction of LayerNorm is dropped and activations are rescaled only by their root mean square. We measured a +2% tokens/s boost with equal perplexity. The change is independent of the other modifications and easy to integrate, but the effect size is small.
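For reference, a minimal PyTorch sketch of such an RMSNorm module (illustrative; our training code may differ in details such as the epsilon value):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Normalize by the root mean square of the activations; no mean subtraction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain, as in LayerNorm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```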