LegoFuzz

LegoFuzz is an LLM-based fuzzing framework. It currently supports testing C compilers, such as GCC and LLVM.

The core idea behind LegoFuzz is to separate the whole testing process into two phases: offline and online. The offline phase queries LLMs to collect valid code snippets, which enables us to control the quality of code as well as the cost for LLMs querying. In contrast, the online phase eliminates the dependency on LLMs by reusing these pre-generated code snippets. Through our proposed iterative program synthesis, the online phase constructs increasingly complex yet valid programs for executing testing.

This project is partly based on Creal.

Structure of the project

├── synthesize.py         # For iterative program synthesis
├── fuzz.py               # For conducting fuzzing
├── transformer           # LLMs-based real code-aligned code generation
│   ├── config            # Configuration for LLMs
│   └── generate.py       # For generating code with LLMs
├── databaseconstructor   # Constructing function database
│   ├── functionextractor  
│   │   └── extract.py    # For extracting valid functions
│   └── generate.py       # For generating IO for functions
├── profiler              
│   └── profile.py        # For profiling functions
└── utils                 # Development utilities

QuickStart

Step 1: Environment setup

Python >= 3.10
Csmith (Please install it following Csmith)
CSMITH_HOME: After installing Csmith, please set the environment variable CSMITH_HOME to the installation path, with which we can locate $CSMITH_HOME/include/csmith.h.
CompCert (Please install it following CompCert)
clang >= 18, libclang-dev
diopter == 0.0.24 (pip install diopter==0.0.24)
termcolor (pip install termcolor)
openai (pip install openai)
together (pip install together)

Step 2: Synthesize a program

Please run git lfs install && git lfs pull first to download functions.json.

$ ./synthesize.py --src functions.json --dst ./tmp --iter 10

LegoFuzz will synthesize a program with 10 iterations. The synthesized program is stored in the dst directory, which is "./tmp" in this case.

Note: Due to GitHub's 2GB file size limitation, we are unable to provide the complete database containing 553,246 functions. Instead, functions.json currently includes 250,000 functions of the dataset. You can follow the full run instructions to generate the database yourself.

We are also preparing a formal artifact that will provide the complete dataset. Please stay tuned for the artifact release! 🚀

Full run with LegoFuzz

Before running, please setup the enviroment following Step 1 in Quickstart.

Phase 1: Offline Code Database Construction with LLMs

Step 1: Real Code-aligned Code Generation

All configurations related to LLMs, including model and prompt settings, are defined in transformer/config/config.yaml. Feel free to customize these settings to suit your needs.

Before getting started, ensure that your API key is properly configured by setting the environment variable:

$ cd transformer
$ echo "<API_KEY_NAME>=<API_KEY_SECRET>" > .env

LegoFuzz currently supports three API providers: OpenAI, TogetherAI, and DeepSeek. You can replace <API_KEY_NAME> with one of the following:

OPENAI_API_KEY
TOGETHER_API_KEY
DEEPSEEK_API_KEY

If you wish to use a different platform, you can extend support in transformer/config/models.py. For instance, to integrate OpenRouter API, modify the LLMClient type and specify base_url="https://openrouter.ai/api/v1" in the init method. We've provided essential API hooks for easy customization.

Once your API key is set, you can generate aligned C code using the following command:

$ ./generate.py --src <DIR_SRC> --dst <DIR_DST> openai

Parameters explanation:

--src SRC             Path to the source directory containing C files
--dst DST             Directory to save generated C files
--model {openai,deepseek,togetherai}
                    Which LLM model to use (openai, deepseek, togetherai)
--max_files MAX_FILES 
                    Maximum number of C files to process (Optional)

Step 2: Code Database Construction

Before proceeding, ensure you have followed the instructions in databaseconstructor/functionextractor/README.md and profiler/README.md to build the function extractor and profiler.

Firstly, extract functions from the LLMs-generated C files:

$ cd databaseconstructor/functionextractor
$ ./extract.py --src <DIR_C_FILES> --dst ./functions.json

Then generate I/O pairs for the functions with verification:

$ cd ..
$ ./generate.py --src functions.json --dst ./functions_io.json

Finally, proflie the functions:

$ cd ../profiler
$ ./profile.py --src ../databaseconstructor/functions_io.json --dst ./functions_profiled.json

If you have multiple functions, there may be duplicate randomly generated names. Use the deduplication script: ./dedup.py functions_profiled.json

After these steps, you will have a fully constructed function database.

Phase 2: Online Iterative Program Synthesis

Once the function database is ready (profiler/functions_profiled.json), you can synthesize programs using:

$ ./synthesize.py --src profiler/functions_profiled.json --dst ./tmp --prob 80 --num_mutant 10 --iter 100

This command generates 10 mutants in the tmp directory, starting from an initial seed function.

Parameters explanation:

--src SRC                path to the function database json file.
--dst DST                path to the destination dir.
--prob PROB              probability of replacing an expression (default=80). 
--num_mutant NUM_MUTANT  number of mutants to generate (default=1).
--iter ITER              number of iterations for one synthesis (default=100).
--no-rand                randomize the number of iterations.
--inline                 inline the function call.
--debug                  print debug information.

Fuzzing execution

The fuzzing process continuously synthesizes 10 mutants per iteration and applies differential testing to detect crashes or miscompilation bugs.

Before starting fuzzing, configure the compiler settings by copying the example file:

$ cp compilers.in.example compilers.in

In compilers.in, specify the compiler commands you want to test, such as:

gcc -O0
gcc -O1

Then, start to fuzz:

$ ./fuzz.py --cpu 4 --config compilers.in

This will launch the fuzzing process using 4 CPU cores. A fuzz directory will be created and any detected bugs will be stored in bugs directory.

Parameters explanation:

--cpu CPU        Number of CPUs to run in parallel (default: all available cores)
--config CONFIG  Path to compiler config file (default: ./compilers.in)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

LegoFuzz

Structure of the project

QuickStart

Full run with LegoFuzz

Phase 1: Offline Code Database Construction with LLMs

Phase 2: Online Iterative Program Synthesis

Fuzzing execution

About

Uh oh!

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 90 Commits
databaseconstructor		databaseconstructor
profiler		profiler
transformer		transformer
utils		utils
.gitattributes		.gitattributes
.gitignore		.gitignore
LICENSE		LICENSE
README.md		README.md
compilers.in.example		compilers.in.example
functions.json		functions.json
fuzz.py		fuzz.py
synthesize.py		synthesize.py

License

cardigan1008/LegoFuzz

Folders and files

Latest commit

History

Repository files navigation

LegoFuzz

Structure of the project

QuickStart

Full run with LegoFuzz

Phase 1: Offline Code Database Construction with LLMs

Phase 2: Online Iterative Program Synthesis

Fuzzing execution

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Uh oh!

Languages