Skip to content

FutureComputing4AI/vtpipeline-rs

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

VTPipeline

A pipeline for querying/downloading hashes, scan reports, and files from VirusTotal. This package implements the methodology used to build the EMBER2024 dataset.

Installation

git clone 
cd vtpipeline-rs/
cargo build --release

Setup

Basic configuration

To get started, create a config file for VTPipeline. An example can be found in example/vt.toml. Set data_dir to the path where you want your base directory to be, and set vt_api_key to your VirusTotal API key.

data_dir = "/path/to/base"
vt_api_key = "aabbcc"

VTPipeline searches for the config file in /etc/vtpipeline/ by default, or you can pass it as a command-line argument.

Configuring file thresholds

You can configure which types of files you want to query. A default list of files, and how many of each to query on a given day, is located in data/file_thresholds.json

Configuring supported antivirus products

A list of default-supported antivirus products is located in data/avs.json. It also lists if there are any AV products with known relationships to other AVs.

Query hashes from VirusTotal

Use the vtpipeline hashes command to query hashes of files that were first uploaded to VirusTotal within the last 24 hours, or 90 days ago.

vtpipeline hashes --date-offset 1

Query scans and files from VirusTotal

After querying hashes from VirusTotal, you can retrieve scans and download files using the vtpipeline scans command.

vtpipeline scans --date-offset 90

The TLSH fuzzy hash was used to identify and down-sample near-duplicate files before downloading them. This is not exactly the same as the EMBER2024 dataset implementation, which entirely removes near-duplicates.

vtpipeline scans --date-offset 90 --diff-threshold 30

If you use VTPipeline, please cite it using:

@inproceedings{joyce2025ember,
      title={EMBER2024 - A Benchmark Dataset for Holistic Evaluation of Malware Classifiers},
      author={Robert J. Joyce and Gideon Miller and Phil Roth and Richard Zak and Elliott Zaresky-Williams and Hyrum Anderson and Edward Raff and James Holt},
      year={2025},
      booktitle={Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining},
}

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published