Grouped Query Attention (GQA)
Core Idea
With standard Multi-Head Attention, every query head has its own corresponding key and value matrix. The key idea behind Grouped Query Attention is to group query heads so that each group shares a single key and value matrix, reducing the number of key/value matrices (and the memory they require) while keeping performance similar.
In Multi-Head Attention, every query matrix has a corresponding key and value matrix whereas with Grouped Query Attention, queries are grouped to share the key and value matrices.
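To make the grouping concrete, below is a minimal PyTorch sketch of a grouped-query attention layer (illustrative, not our exact training code; the class and argument names are assumptions): n_heads query heads share n_kv_heads key/value heads, and the K/V tensors are simply repeated per group before the usual attention computation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroupedQueryAttention(nn.Module):
    """Minimal GQA: n_heads query heads share n_kv_heads key/value heads."""

    def __init__(self, dim: int, n_heads: int, n_kv_heads: int):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.q_proj = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        # Fewer key/value projections than query projections.
        self.k_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, T, _ = x.shape
        q = self.q_proj(x).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(B, T, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Each group of query heads reuses the same K/V head.
        group_size = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(group_size, dim=1)
        v = v.repeat_interleave(group_size, dim=1)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # standard attention
        return self.o_proj(out.transpose(1, 2).reshape(B, T, -1))
```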
Experiments
The GPU memory used per iteration has dropped by ~100MB with the Grouped Query Attention. The perplexity lines are on top of each other, indicating that the performance is similar.
The main metrics of interest are GPU memory used throughout training and perplexity. As we can see from the experimental results, Grouped Query Attention reduces GPU memory usage by ~100MB while keeping perplexity and loss at a similar level to normal Multi-Head Attention. On top of that, the memory advantage becomes even more pronounced for larger models with bigger dimension sizes and more heads and layers. Therefore, we can conclude that Grouped Query Attention is an effective method to reduce GPU memory usage while keeping performance at a similar level. Notably, GQA completed ~1000 more training steps than the baseline model (visible from the ending dots of both lines in the perplexity graph), confirming its faster training.
Rotary Position Embedding
Core Idea
Rotary Position Embedding (RoPE) is a method to encode positional information. With regular Absolute Positional Encoding, the positional information is added statically to the input embeddings. However, with RoPE, the query and key matrices are multiplied with a rotation matrix in order to capture the positional information. A small walkthrough of the mathematical formulation can be found below:
Mathematical Formulation of RoPE
1. The Rotation Operation
In 2D space, rotating a vector (x, y) by angle θ is defined as:
Rθ(x, y) = (x·cos(θ) - y·sin(θ), x·sin(θ) + y·cos(θ))
2. Complex Number Representation
This rotation can be represented using complex numbers. For z = x + iy:
Rθ(z) = e^(iθ) · z = (cos(θ) + i·sin(θ))(x + iy)
3. Position-Frequency Mapping
For each position m and dimension-pair index d in a model of dimension D, we define the angle of rotation as:
θ_(m,d) = m · θ_base^(-2d/D)
Where θ_base is typically 10000. Consecutive dimensions are paired (dimensions 2d and 2d+1) and each pair represents a single complex number, with the first dimension corresponding to the real component and the second to the imaginary component.
4. Application to Attention
For query q at position m and key k at position n:
q·k becomes R_(mθ)(q) · R_(nθ)(k) = q · R_((n−m)θ)(k)
This shows how RoPE naturally captures only the relative position between m and n in the attention mechanism while also preserving the magnitudes of the vectors.
The attention mechanism in RoPE produces outputs that are primarily functions of two factors: the relative distance between tokens and the content of the token embeddings themselves.
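As a sanity check on steps 3 and 4, here is a small PyTorch sketch (illustrative, not our training implementation; the function name and example values are assumptions) that computes the angles θ_(m,d) and applies the rotation via complex multiplication, so that the resulting dot products depend only on the relative offset and the content:

```python
import torch

def rope_rotate(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # Rotate a (seq_len, D) tensor: dimension pairs (2d, 2d+1) are treated as
    # complex numbers and multiplied by e^(i * m * base^(-2d/D)) for position m.
    seq_len, D = x.shape
    inv_freq = base ** (-torch.arange(0, D, 2, dtype=torch.float32) / D)  # base^(-2d/D)
    angles = positions[:, None].float() * inv_freq[None, :]               # θ_(m,d)
    x_complex = torch.view_as_complex(x.float().reshape(seq_len, D // 2, 2))
    rotated = x_complex * torch.polar(torch.ones_like(angles), angles)    # multiply by e^(iθ)
    return torch.view_as_real(rotated).reshape(seq_len, D)

# The attention score then depends only on the relative offset and the content:
q = torch.randn(2, 8)           # queries at positions 3 and 7
k = torch.randn(2, 8)           # keys at positions 3 and 7
pos = torch.tensor([3, 7])
q_rot, k_rot = rope_rotate(q, pos), rope_rotate(k, pos)
# (q_rot[0] · k_rot[1]) is a function of q[0], k[1] and the offset 7 - 3 only,
# not of the absolute positions themselves.
```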
Experiments
Perplexity with RoPE remains consistently lower than baseline (see top-right figure), but the implementation reduces inference speed due to additional matrix multiplication requirements.
The RoPE implementation demonstrates ~1.5 point lower perplexity scores compared to the baseline, indicating a measurable advantage. This improvement likely stems from RoPE's theoretical foundations: positional information is encoded explicitly while vector magnitudes remain preserved, potentially leading to more stable training. In contrast, Absolute Positional Encoding alters vector magnitudes arbitrarily, which may contribute to less stable training dynamics. Despite the perplexity improvements, our RoPE implementation revealed a significant drawback: decreased token processing speed during inference due to the additional matrix multiplications required for the rotations. This slower inference may become a limitation for practical applications.
SwiGLU
Core Idea
SwiGLU is an activation function that uses Gated Linear Units (GLU) and uses a Swish function instead of the standard Sigmoid function. The formula for SwiGLU is:
SwiGLU(x, W, V, b, c) = Swish(xW + b) ⊗ (xV + c)
Where Swish(x) = x · sigmoid(βx), and ⊗ represents element-wise multiplication. All the parameters are trainable.
SwiGLU was introduced in the PaLM paper by Google Research as an improvement over other activation functions for transformer models. While there isn't a comprehensive theoretical explanation for why SwiGLU outperforms alternatives like ReLU or GELU, empirical results show it leads to better performance in large language models. It's now commonly used in many state-of-the-art transformer architectures.

Illustration of SwiGLU and other similar activation functions
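Below is a minimal PyTorch sketch of a SwiGLU feed-forward block matching the formula above (illustrative; the output projection back to the model dimension and the use of F.silu, which fixes β = 1, are assumptions rather than details of our exact implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    # Gated feed-forward block using the SwiGLU activation.
    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.w = nn.Linear(dim, hidden_dim)    # gate branch: xW + b
        self.v = nn.Linear(dim, hidden_dim)    # value branch: xV + c
        self.out = nn.Linear(hidden_dim, dim)  # projection back to the model dimension

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Swish(xW + b) ⊗ (xV + c); F.silu is Swish with β fixed to 1.
        return self.out(F.silu(self.w(x)) * self.v(x))
```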
Experiments
SwiGLU reduces perplexity by ~2 points but decreases throughput by ~100,000, slowing training and increasing GPU memory usage by ~350MB.
SwiGLU, despite being a small implementation change, has decreased perplexity by ~2 points throughout training. Therefore, SwiGLU is an easy way to decrease the model's perplexity considering its ease of implementation. One obvious drawback of SwiGLU is the additional trainable parameters it introduces. Despite the clear advantage for model perplexity, these extra parameters make the model slower by several metrics. First, as evident from the zoomed-in perplexity figure, the SwiGLU training is cut off at ~20k steps, while the baseline continued training until ~22k. Since we allowed each model to train for 3 hours, this indicates that training the SwiGLU model was slower than the baseline. When we dig further, we also see that the training throughput is ~100,000 tokens lower than the baseline. Lastly, the new parameters add ~350MB of GPU memory usage throughout training iterations. Even though SwiGLU boosts model capability, there are concerns about memory usage and training and inference speed.
SOAP Optimiser
Core Idea
The SOAP optimizer (ShampoO with Adam in the Preconditioner's eigenbasis) aims for rapid early-stage progress. The idea is to bridge ideas from quasi-Newton methods and modern large-scale deep learning optimization. It is particularly notable for its ability to approximate second-order updates using only first-order information, while retaining compatibility with large-scale training. The core idea is to track Shampoo-style Kronecker-factored gradient statistics as an approximation to the curvature of the loss landscape, and to run an Adam-style update in the eigenbasis of that preconditioner, updating parameters more efficiently than standard first-order methods like Adam.
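The following is a heavily simplified sketch of that rotate-into-the-eigenbasis idea for a single 2D parameter, not the full SOAP algorithm (bias correction, handling of the momentum when the basis changes, and many practical details are omitted); all names and hyperparameter values here are illustrative:

```python
import torch

def soap_like_step(param, grad, state, lr=3e-4, betas=(0.9, 0.95),
                   eps=1e-8, precond_freq=10):
    # Hypothetical helper: one simplified SOAP-style step for a 2D parameter.
    b1, b2 = betas
    if not state:  # lazy initialization of optimizer state
        state["step"] = 0
        state["L"] = grad.new_zeros(grad.shape[0], grad.shape[0])  # ~E[G G^T]
        state["R"] = grad.new_zeros(grad.shape[1], grad.shape[1])  # ~E[G^T G]
        state["QL"] = torch.eye(grad.shape[0], device=grad.device)
        state["QR"] = torch.eye(grad.shape[1], device=grad.device)
        state["m"] = torch.zeros_like(grad)  # Adam first moment (rotated basis)
        state["v"] = torch.zeros_like(grad)  # Adam second moment (rotated basis)
    state["step"] += 1

    # Shampoo-style Kronecker-factored gradient statistics.
    state["L"].mul_(b2).add_(grad @ grad.T, alpha=1 - b2)
    state["R"].mul_(b2).add_(grad.T @ grad, alpha=1 - b2)

    # Periodically refresh the eigenbasis of the preconditioner factors.
    if state["step"] % precond_freq == 1:
        state["QL"] = torch.linalg.eigh(state["L"]).eigenvectors
        state["QR"] = torch.linalg.eigh(state["R"]).eigenvectors

    # Rotate the gradient into the preconditioner's eigenbasis ...
    g_rot = state["QL"].T @ grad @ state["QR"]

    # ... run a plain Adam-style update there ...
    state["m"].mul_(b1).add_(g_rot, alpha=1 - b1)
    state["v"].mul_(b2).add_(g_rot * g_rot, alpha=1 - b2)
    update_rot = state["m"] / (state["v"].sqrt() + eps)

    # ... and rotate the update back before applying it.
    param.data -= lr * (state["QL"] @ update_rot @ state["QR"].T)
```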
Experiments
Here, it proved somewhat counter-productive: SOAP reduced throughput by almost 40%, although its perplexity was slightly better than AdamW's, overtaking it after step 2500. Because of the throughput reduction, the SOAP implementation is not advantageous according to our analysis.
MUON Optimiser
Core Idea
MUON (MomentUm Orthogonalized by Newton-Schulz) is an optimizer designed only for the hidden layers of neural networks, particularly those with 2D parameters. It operates by first computing the standard momentum-based gradient updates and then applying a Newton-Schulz iteration to orthogonalize these updates. This orthogonalization process aims to improve the conditioning of the updates, potentially leading to better training dynamics and efficiency.
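A sketch of the orthogonalization step is shown below. The quintic Newton-Schulz coefficients follow a publicly available Muon implementation and should be treated as an assumption here; the surrounding momentum and learning-rate logic is omitted:

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately orthogonalize a 2D update matrix with a quintic Newton-Schulz iteration.
    # Coefficients taken from a public Muon implementation (assumption, not verified here).
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)           # normalize so the iteration converges
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # work with the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

# Usage sketch: orthogonalize the momentum buffer before applying it as the update.
# momentum.mul_(beta).add_(grad)
# param -= lr * newton_schulz_orthogonalize(momentum)
```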
Experiments
MUON gives slightly worse perplexity and throughput than the baseline, although this may be well within the margin of error.
However, in our experiments on GPT-2 Small, a dense, low-parameter regime, MUON underperformed: it yielded +0.8 perplexity and 6% slower training compared to AdamW. We attribute this to a mismatch between MUON's hyperparameter tuning and the dense characteristics of our toy workload. Even though we performed hyperparameter tuning (as far as our computational budget allowed), it was not enough to bring out the true potential expected from MUON, suggesting that it is less robust to hyperparameter choices.
RMSNorm
RMSNorm is a simple change to the normalization step in which the mean subtraction of LayerNorm is dropped and activations are rescaled only by their root mean square. We measured a +2% tokens/s boost with equal perplexity. The change is independent of the other modifications and easy to integrate, but the effect size is small.
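For reference, a minimal PyTorch sketch of such an RMSNorm module (illustrative; our training code may differ in details such as the epsilon value):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    # Normalize by the root mean square of the activations; no mean subtraction.
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain, as in LayerNorm

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * (x / rms)
```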