This repository contains the code and data for the paper "LinuxFLBench: Benchmarking and Enhancing LLM-based Agents in Localizing Linux Kernel Bugs".
LINUXFLBENCH is a new benchmark of 250 Fault Localization tasks derived from real-world Linux kernel bugs.
- The dataset is located at `dataset/LINUXFLBENCH_dataset.jsonl`, in JSON Lines format (a loading sketch follows this list).
- Each line is a real Linux kernel bug sample, with fields including:
  - `id`: Bug ID
  - `title`: Bug title
  - `description`: Detailed bug description
  - `Kernel Version`: The version of the Linux kernel in which the bug occurred (e.g., 5.6.7)
  - `patch`: Patch content for the fix
  - `paths`: Source file paths involved (i.e., the localization target files)
  - `methods`: Function names involved
  - Additional metadata: kernel version, component, hardware, etc.
- The dataset covers various kernel versions and is suitable for evaluating LLM- and agent-based fault localization in large, complex systems (i.e., the Linux kernel).
- The source code for different Linux kernel versions can be downloaded from here.
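The following is a minimal loading sketch using the `jsonlines` package from the dependency list; it relies only on the field names documented above, and everything else in it is illustrative:

```python
import jsonlines

# Iterate over the benchmark: one JSON object (one bug sample) per line.
with jsonlines.open("dataset/LINUXFLBENCH_dataset.jsonl") as reader:
    for bug in reader:
        # Documented fields: id, title, description, Kernel Version,
        # patch, paths, methods, plus extra metadata (component, hardware, ...).
        print(bug["id"], bug["Kernel Version"])
        print("  ground-truth files:", bug["paths"])
```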
The main code is under the `code/` directory, organized as follows:

- `scale/`: Candidate file expansion and reasoning
  - `scaling_candidates_with_dir.py`: Directory-based candidate expansion
  - `scaling_candidates_with_guess.py`: LLM-based candidate expansion
- `merge/`: Multi-method result fusion and reranking
  - `merge.py`: Fusion of multiple ranking results
  - `rerank.py`: LLM-based candidate reranking
- `mail/`: Mail-related scripts (see the retrieval sketch after this list)
  - `mails_retrieval.py`: Retrieves relevant emails from the mail dataset based on queries
  - `search_mails_bm25s.py`: BM25-based mail search utilities
- `method_fl/`: Method-level fault localization based on the predicted code files
  - `method_localize.py`: Method-level fault localization script
- `eval/`: Evaluation and metrics
  - `evaluate.py`: Main evaluation script
  - `evaluation_metrics.py`: Common metrics such as Recall@K and MRR
- `utils.py`, `file_parser.py`: General utility functions
- The mail data for retrieval can be downloaded from here.
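For orientation, here is a minimal BM25 retrieval sketch in the spirit of `search_mails_bm25s.py`. It assumes the `bm25s` package (suggested by the script name) and a hypothetical in-memory list of mail texts; it is not the repository's actual utility:

```python
import bm25s

# Hypothetical corpus: plain-text bodies of kernel mailing-list messages.
mails = [
    "ext4: fix use-after-free in ext4_xattr_set_entry",
    "sched/fair: avoid division by zero during load balancing",
    "usb: xhci: handle NULL pointer dereference on disconnect",
]

# Build a BM25 index over the tokenized mails.
retriever = bm25s.BM25()
retriever.index(bm25s.tokenize(mails, stopwords="en"))

# Retrieve the top-k mails for a bug-report query; results are corpus indices.
query = "NULL pointer dereference when unplugging a USB device"
indices, scores = retriever.retrieve(bm25s.tokenize(query), k=2)

for idx, score in zip(indices[0], scores[0]):
    print(f"{score:.2f}  {mails[idx]}")
```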
- Candidate Expansion: Use scripts in `scale/` to expand candidate file lists for each bug (e.g., Directory-Aware Expansion, Potential Cause Expansion).
- Candidate Integration: Use scripts in `merge/` to fuse multiple candidate ranking results and rerank them with an LLM (an illustrative fusion sketch follows this list).
- Evaluation: Use scripts in `eval/` to evaluate the final results with metrics such as Recall@K and MRR.
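The integration step merges several ranked candidate lists into a single ranking. As an illustration only, the sketch below uses reciprocal rank fusion (RRF), a common list-fusion technique; it is not necessarily the scheme implemented in `merge.py`:

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked file lists; a higher fused score means a better candidate.

    rankings: list of ranked lists of file paths (best first).
    k: smoothing constant in the standard RRF score 1 / (k + rank).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, path in enumerate(ranking, start=1):
            scores[path] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Illustrative inputs: rankings from directory-based and LLM-based expansion.
fused = reciprocal_rank_fusion([
    ["fs/ext4/xattr.c", "fs/ext4/inode.c", "mm/filemap.c"],
    ["fs/ext4/xattr.c", "mm/filemap.c", "fs/ext4/super.c"],
])
print(fused)
```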
All experimental results are located in the `result/` directory and can be used for reproduction.
This project requires Python 3.8+ and the following packages:
- openai
- jsonlines
Install dependencies with pip:

```bash
pip install openai jsonlines
```
Some scripts require an OpenAI API key and `base_url` to be configured; see each script's arguments for details.
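A minimal configuration sketch for the v1-style `openai` client is shown below; the environment-variable names and model name are illustrative (the actual scripts take these values via command-line arguments such as `--api_key` and `--gpt_base_url`):

```python
import os
from openai import OpenAI

# Point the client at any OpenAI-compatible endpoint.
client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    base_url=os.environ.get("OPENAI_BASE_URL", "https://api.openai.com/v1"),
)

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model name
    messages=[{"role": "user", "content": "Which kernel source file most likely contains this bug? ..."}],
)
print(response.choices[0].message.content)
```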
Example: Directory-Aware Expansion
```bash
python code/scale/scaling_candidates_with_dir.py \
  --data_path dataset/LINUXFLBENCH_dataset.jsonl \
  --save_path results/dir_scaling.jsonl \
  --gpt_base_url https://api.openai.com/v1 \
  --api_key YOUR_API_KEY \
  --kernel_path /path/to/linux/kernel/
```
Evaluate the results:
```bash
python code/eval/evaluate.py --path results/dir_scaling.jsonl
```
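For reference, here is a self-contained sketch of the two reported metrics over ranked file lists; the function names and example paths are illustrative and are not the code in `evaluation_metrics.py`:

```python
def recall_at_k(ranked_files, gold_files, k):
    """Fraction of ground-truth files that appear in the top-k predictions."""
    hits = sum(1 for f in gold_files if f in ranked_files[:k])
    return hits / len(gold_files)

def mrr(ranked_files, gold_files):
    """Reciprocal rank of the first ground-truth file in the ranking (0 if none)."""
    for rank, f in enumerate(ranked_files, start=1):
        if f in gold_files:
            return 1.0 / rank
    return 0.0

# Illustrative example: a predicted ranking vs. the ground-truth `paths` field.
pred = ["drivers/usb/core/hub.c", "fs/ext4/xattr.c", "kernel/sched/fair.c"]
gold = ["fs/ext4/xattr.c"]
print(recall_at_k(pred, gold, k=1))  # 0.0
print(recall_at_k(pred, gold, k=3))  # 1.0
print(mrr(pred, gold))               # 0.5
```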
For more details, usage, or questions, please open an issue or contact the authors.