
Hybrid Transformer: ~60% Faster Inference by Approximating Deep Attention

This repository contains the code and results for an experiment demonstrating that the dynamic attention mechanism in a Transformer's deeper layers can be replaced by a tiny, static MLP with no significant loss in model quality.

This project provides a corrected, architecturally sound implementation of the "Static Function Hypothesis." The final, aligned Hybrid GPT model achieves a ~60% inference speedup over the standard architecture while maintaining conversational ability.


🧠 Key Findings

By replacing the multi-head attention blocks in the final two layers of a 4-layer GPT model with a tiny, per-token MLP, we achieved the following results on the final, DPO-aligned models:

| Metric (Context Size 256) | Final Standard GPT | Final Hybrid GPT | Improvement |
|---|---|---|---|
| Final Validation Loss | 5.02 | 5.09 | ~ Parity |
| Prefill Time | 0.0047 s | 0.0029 s | 38% faster |
| Decode Speed | 206.27 tok/s | 326.11 tok/s | 58% faster |

These results indicate that a hybrid approach can substantially speed up inference while preserving the model's language capabilities, even after a full alignment pipeline.
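
For context, the prefill and decode numbers above come from a timing harness. The sketch below shows one way such numbers could be measured in PyTorch; the `model(idx)` interface and the cache-free decode loop are simplifying assumptions, not the exact code in benchmark.py.

```python
import time
import torch

@torch.no_grad()
def quick_benchmark(model, prompt_ids, new_tokens=256, device="cpu"):
    """Rough prefill/decode timing for a causal LM whose forward pass
    returns logits of shape (batch, seq, vocab). Hypothetical helper,
    not this repository's benchmark.py."""
    model.eval().to(device)
    idx = prompt_ids.to(device)

    # Prefill: one forward pass over the whole prompt.
    t0 = time.perf_counter()
    model(idx)
    prefill_s = time.perf_counter() - t0

    # Decode: greedy, one token per step. A real benchmark would reuse a
    # KV cache; this sketch recomputes the full sequence for simplicity.
    t0 = time.perf_counter()
    for _ in range(new_tokens):
        logits = model(idx)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    decode_tok_per_s = new_tokens / (time.perf_counter() - t0)

    return prefill_s, decode_tok_per_s
```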

The Static Function Hypothesis

This experiment is based on the hypothesis that a Transformer operates as a computational hierarchy:

  • Early Layers perform complex, dynamic feature extraction on raw input, requiring the full power of self-attention.
  • Deeper Layers perform more regular, predictable operations on the already-structured data from earlier layers.

Our results show that the function of these deeper layers is simple enough to be learned and replaced by a tiny, efficient MLP, removing the computational bottleneck of attention where it is no longer necessary.
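
To make the idea concrete, here is a minimal PyTorch sketch of what replacing the attention sub-layer with a per-token MLP can look like. The class names and dimensions are illustrative assumptions, not the repository's actual modules.

```python
import torch.nn as nn

class StaticTokenMLP(nn.Module):
    """Tiny MLP applied to each token independently. Because it ignores
    other positions, it needs no KV cache and costs O(T) per sequence
    instead of attention's O(T^2). Illustrative, not the repo's class."""
    def __init__(self, n_embd, hidden_mult=0.25):
        super().__init__()
        hidden = max(1, int(n_embd * hidden_mult))
        self.net = nn.Sequential(
            nn.Linear(n_embd, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_embd),
        )

    def forward(self, x):              # x: (batch, seq, n_embd)
        return self.net(x)

class HybridBlock(nn.Module):
    """Transformer block for the deeper layers: the attention sub-layer
    is swapped for the static per-token MLP; everything else is standard."""
    def __init__(self, n_embd):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.mixer = StaticTokenMLP(n_embd)   # replaces multi-head attention
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.mixer(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```

In a 4-layer model like the one used here, the first two blocks would keep standard self-attention and the final two would use a block of this kind.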

How to Run the Experiment

This project uses uv for environment and package management.

1. Set Up the Environment

This command will read the pyproject.toml file, create a virtual environment, and install all necessary dependencies.

uv sync

2. Run the Full Experiment Pipeline

The following commands will run the entire experiment in the correct order, from data preparation to final benchmarking.

Bash
# 1. Download all datasets (Shakespeare, SFT, DPO)
uv run python prepare_data.py
uv run python prepare_chat_data.py

# 2. Pre-train the base models
uv run python train_base_model.py --model_type standard --output_path base_standard_model.pt
uv run python train_base_model.py --model_type hybrid --output_path base_hybrid_model.pt

# 3. Supervised Fine-Tune (SFT) both models
uv run python train_sft.py --model_type standard
uv run python train_sft.py --model_type hybrid

# 4. Direct Preference Optimization (DPO) on both models
uv run python train_dpo.py --model_type standard
uv run python train_dpo.py --model_type hybrid

# 5. Benchmark the final, aligned models
uv run python benchmark.py
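
For readers unfamiliar with the DPO step above: the objective rewards the model for assigning a larger log-probability margin to the chosen response over the rejected one than a frozen reference model does. Below is a minimal sketch of the standard DPO loss, not necessarily the exact implementation in train_dpo.py.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over per-example summed log-probabilities.
    beta scales how strongly the policy is pushed away from the reference."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```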

3. Chat with the Final Models

You can interact with your fully-aligned chatbots to qualitatively compare their performance.

Bash
# Chat with the standard model
uv run python chat.py dpo_standard_model.pt

# Chat with the faster hybrid model
uv run python chat.py dpo_hybrid_model.pt
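
Under the hood, a chat script like the ones above typically loads the checkpoint and samples tokens autoregressively. A generic sampling loop is sketched below; the `model(idx)` interface and the temperature/top-k values are assumptions, not chat.py's actual parameters.

```python
import torch

@torch.no_grad()
def sample(model, idx, max_new_tokens=200, temperature=0.8, top_k=50):
    """Autoregressive sampling with temperature and top-k filtering.
    Assumes the forward pass returns logits of shape (batch, seq, vocab)
    and that idx stays within the model's context window."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```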

Citation

If you use this work, please consider citing it:

@misc{bee2025hybrid,
  author = {Bee, M.},
  title = {A Hybrid Approach to Efficient Transformer Inference},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/MikeyBeez/hybrid-transformer-experiment_2}},
}

Acknowledgments

This project is built upon the excellent nanoGPT repository by Andrej Karpathy and uses alignment principles from nanochat.
