
Hybrid Transformer: ~60% Faster Inference by Approximating Deep Attention

This repository contains the code and results for an experiment demonstrating that the dynamic attention mechanism in a Transformer's deeper layers can be replaced by a tiny, static MLP with no significant loss in model quality.

This project provides a corrected, architecturally sound implementation of the "Static Function Hypothesis." The final, aligned Hybrid GPT model achieves a ~60% inference speedup over the standard architecture while maintaining conversational ability.


🧠 Key Findings

By replacing the multi-head attention blocks in the final two layers of a 4-layer GPT model with a tiny, per-token MLP, we achieved the following results on the final, DPO-aligned models:

| Metric (Context Size 256) | Final Standard GPT | Final Hybrid GPT | Improvement |
|---|---|---|---|
| Final Validation Loss | 5.02 | 5.09 | ~ Parity |
| Prefill Time | 0.0047 s | 0.0029 s | 38% faster |
| Decode Speed | 206.27 tok/s | 326.11 tok/s | 58% faster |

These results indicate that a hybrid approach can substantially speed up inference while preserving the model's language capabilities, even after a full alignment pipeline.
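
For context, the prefill and decode numbers above come from a timing harness. The sketch below shows one way such numbers could be measured in PyTorch; the `model(idx)` interface and the cache-free decode loop are simplifying assumptions, not the exact code in benchmark.py.

```python
import time
import torch

@torch.no_grad()
def quick_benchmark(model, prompt_ids, new_tokens=256, device="cpu"):
    """Rough prefill/decode timing for a causal LM whose forward pass
    returns logits of shape (batch, seq, vocab). Hypothetical helper,
    not this repository's benchmark.py."""
    model.eval().to(device)
    idx = prompt_ids.to(device)

    # Prefill: one forward pass over the whole prompt.
    t0 = time.perf_counter()
    model(idx)
    prefill_s = time.perf_counter() - t0

    # Decode: greedy, one token per step. A real benchmark would reuse a
    # KV cache; this sketch recomputes the full sequence for simplicity.
    t0 = time.perf_counter()
    for _ in range(new_tokens):
        logits = model(idx)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        idx = torch.cat([idx, next_id], dim=1)
    decode_tok_per_s = new_tokens / (time.perf_counter() - t0)

    return prefill_s, decode_tok_per_s
```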

The Static Function Hypothesis

This experiment is based on the hypothesis that a Transformer operates as a computational hierarchy:

  • Early Layers perform complex, dynamic feature extraction on raw input, requiring the full power of self-attention.
  • Deeper Layers perform more regular, predictable operations on the already-structured data from earlier layers.

Our results show that the function of these deeper layers is simple enough to be learned and replaced by a tiny, efficient MLP, removing the computational bottleneck of attention where it is no longer necessary.
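
To make the idea concrete, here is a minimal PyTorch sketch of what replacing the attention sub-layer with a per-token MLP can look like. The class names and dimensions are illustrative assumptions, not the repository's actual modules.

```python
import torch.nn as nn

class StaticTokenMLP(nn.Module):
    """Tiny MLP applied to each token independently. Because it ignores
    other positions, it needs no KV cache and costs O(T) per sequence
    instead of attention's O(T^2). Illustrative, not the repo's class."""
    def __init__(self, n_embd, hidden_mult=0.25):
        super().__init__()
        hidden = max(1, int(n_embd * hidden_mult))
        self.net = nn.Sequential(
            nn.Linear(n_embd, hidden),
            nn.GELU(),
            nn.Linear(hidden, n_embd),
        )

    def forward(self, x):              # x: (batch, seq, n_embd)
        return self.net(x)

class HybridBlock(nn.Module):
    """Transformer block for the deeper layers: the attention sub-layer
    is swapped for the static per-token MLP; everything else is standard."""
    def __init__(self, n_embd):
        super().__init__()
        self.ln1 = nn.LayerNorm(n_embd)
        self.mixer = StaticTokenMLP(n_embd)   # replaces multi-head attention
        self.ln2 = nn.LayerNorm(n_embd)
        self.mlp = nn.Sequential(
            nn.Linear(n_embd, 4 * n_embd),
            nn.GELU(),
            nn.Linear(4 * n_embd, n_embd),
        )

    def forward(self, x):
        x = x + self.mixer(self.ln1(x))
        x = x + self.mlp(self.ln2(x))
        return x
```

In a 4-layer model like the one used here, the first two blocks would keep standard self-attention and the final two would use a block of this kind.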

How to Run the Experiment

This project uses uv for environment and package management.

1. Set Up the Environment

This command will read the pyproject.toml file, create a virtual environment, and install all necessary dependencies.

uv sync

2. Run the Full Experiment Pipeline

The following commands will run the entire experiment in the correct order, from data preparation to final benchmarking.

Bash
# 1. Download all datasets (Shakespeare, SFT, DPO)
uv run python prepare_data.py
uv run python prepare_chat_data.py

# 2. Pre-train the base models
uv run python train_base_model.py --model_type standard --output_path base_standard_model.pt
uv run python train_base_model.py --model_type hybrid --output_path base_hybrid_model.pt

# 3. Supervised Fine-Tune (SFT) both models
uv run python train_sft.py --model_type standard
uv run python train_sft.py --model_type hybrid

# 4. Direct Preference Optimization (DPO) on both models
uv run python train_dpo.py --model_type standard
uv run python train_dpo.py --model_type hybrid

# 5. Benchmark the final, aligned models
uv run python benchmark.py
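
For readers unfamiliar with the DPO step above: the objective rewards the model for assigning a larger log-probability margin to the chosen response over the rejected one than a frozen reference model does. Below is a minimal sketch of the standard DPO loss, not necessarily the exact implementation in train_dpo.py.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss over per-example summed log-probabilities.
    beta scales how strongly the policy is pushed away from the reference."""
    policy_margin = policy_chosen_logps - policy_rejected_logps
    ref_margin = ref_chosen_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()
```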

3. Chat with the Final Models

You can interact with your fully-aligned chatbots to qualitatively compare their performance.

Bash
# Chat with the standard model
uv run python chat.py dpo_standard_model.pt

# Chat with the faster hybrid model
uv run python chat.py dpo_hybrid_model.pt
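
Under the hood, a chat script like the ones above typically loads the checkpoint and samples tokens autoregressively. A generic sampling loop is sketched below; the `model(idx)` interface and the temperature/top-k values are assumptions, not chat.py's actual parameters.

```python
import torch

@torch.no_grad()
def sample(model, idx, max_new_tokens=200, temperature=0.8, top_k=50):
    """Autoregressive sampling with temperature and top-k filtering.
    Assumes the forward pass returns logits of shape (batch, seq, vocab)
    and that idx stays within the model's context window."""
    for _ in range(max_new_tokens):
        logits = model(idx)[:, -1, :] / temperature
        if top_k is not None:
            v, _ = torch.topk(logits, top_k)
            logits[logits < v[:, [-1]]] = -float("inf")
        probs = torch.softmax(logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        idx = torch.cat([idx, next_id], dim=1)
    return idx
```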

Citation

If you use this work, please consider citing it:

@misc{bee2025hybrid,
  author = {Bee, M.},
  title = {A Hybrid Approach to Efficient Transformer Inference},
  year = {2025},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/MikeyBeez/hybrid-transformer-experiment_2}},
}

Acknowledgments

This project is built upon the excellent nanoGPT repository by Andrej Karpathy and uses alignment principles from nanochat.
