Make De-duplication Multi-threaded and Happen Only During Pre-processing #2747
base: main
Conversation
Walkthrough

The deduplication logic for datasets has been refactored and consolidated. Deduplication now occurs once after dataset merging in the pre-processing workflow, controlled by configuration, and is removed from later stages. The deduplication implementation is now parallelized and hash-based, improving efficiency and ensuring deduplication is performed only during pre-processing.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Config
    participant SFT
    participant Utils
    Config->>SFT: Provide cfg.dataset_exact_deduplication, cfg.dataset_processes
    SFT->>SFT: Merge datasets (if applicable)
    alt Deduplication enabled
        SFT->>Utils: deduplicate_and_log_datasets(merged_dataset, num_proc)
        Utils->>Utils: deduplicate_dataset (parallel hash-based deduplication)
        Utils-->>SFT: Deduplicated dataset
    end
    SFT->>SFT: Drop long sequences
    SFT-->>Caller: Return processed dataset
```
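A minimal sketch of how this consolidated call site might look in `src/axolotl/utils/data/sft.py` (the actual sft.py code is not quoted in this thread; the config keys follow the diagram above, and the helper name and tuple unpacking order are assumptions):

```python
# Sketch only: assumes deduplicate_and_log_datasets returns
# (train_dataset, eval_dataset, dataset), matching its signature.
from axolotl.utils.data.utils import deduplicate_and_log_datasets


def _merge_and_deduplicate(merged_dataset, cfg):
    # Deduplicate exactly once, right after merging and before the
    # drop-long-sequences step, gated by the config flag.
    if cfg.dataset_exact_deduplication:
        _, _, merged_dataset = deduplicate_and_log_datasets(
            dataset=merged_dataset,
            num_proc=cfg.dataset_processes,
        )
    return merged_dataset
```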
Assessment against linked issues

Out-of-scope changes: No out-of-scope changes found.
Actionable comments posted: 2
🔭 Outside diff range comments (1)
src/axolotl/utils/data/utils.py (1)
110-116: ⚠️ Potential issue: Fix type annotation for consistency.

The `num_proc` parameter should use `Optional[int]` for proper type annotation. Apply this diff:

```diff
 def deduplicate_and_log_datasets(
     *,
     train_dataset: Dataset = None,
     eval_dataset: Dataset = None,
     dataset: Dataset = None,
-    num_proc: int = None,
+    num_proc: Optional[int] = None,
 ) -> tuple[Dataset, Dataset, Dataset]:
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- src/axolotl/utils/data/sft.py (3 hunks)
- src/axolotl/utils/data/utils.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/axolotl/utils/data/sft.py (1)
- src/axolotl/utils/data/utils.py (1): deduplicate_and_log_datasets (110-143)
🪛 GitHub Actions: lint
src/axolotl/utils/data/sft.py
[error] 341-343: Pre-commit black hook reformatted this file. Run 'black' to fix code style issues.
src/axolotl/utils/data/utils.py
[error] 1-1: Black formatting failed. Files src/axolotl/utils/data/utils.py and src/axolotl/utils/data/sft.py were reformatted by the black hook.
[warning] 101-101: Pylint warning W0622: Redefining built-in 'hash' (redefined-builtin).
[error] 79-115: Mypy type checking errors: Incompatible default for argument 'num_proc' (default None, expected int), missing type annotation for 'other_hashes', incompatible assignment types (set[Any] assigned to list[Any]).
🪛 Pylint (3.3.7)
src/axolotl/utils/data/utils.py
[warning] 101-101: Redefining built-in 'hash' (W0622)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.6.0)
🔇 Additional comments (2)
src/axolotl/utils/data/utils.py (1)
72-74: LGTM! The helper function correctly computes SHA256 hash for dataset rows (a hedged sketch of such a helper follows after these comments).

src/axolotl/utils/data/sft.py (1)

444-461: Deduplication consolidation looks good! The removal of deduplication calls from the train/test split logic is correct. Deduplication now happens once after dataset merging (line 344), which is more efficient and aligns with the PR objective of deduplication happening only during pre-processing.
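The `compute_row_hash` helper itself is not quoted in this thread; the following is a minimal sketch of what an SHA256 row-hash helper could look like (the serialization strategy and the `row_hash` column name are assumptions based on the review comments):

```python
import hashlib


def compute_row_hash(example: dict) -> dict:
    # Serialize the example deterministically, hash it with SHA256, and
    # return a dict so the function plugs directly into Dataset.map().
    serialized = str(sorted(example.items())).encode("utf-8")
    return {"row_hash": hashlib.sha256(serialized).hexdigest()}
```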
Actionable comments posted: 1
♻️ Duplicate comments (1)
src/axolotl/utils/data/utils.py (1)
77-108: Fix type annotation for the num_proc parameter.

The type annotation for `num_proc` is incorrect: it is annotated as `int` but has a default value of `None`, which causes type checking errors. Apply this diff to fix the type annotation:

```diff
 def deduplicate_dataset(
     dataset: Dataset,
     other_dataset: Dataset = None,
-    num_proc: int = None,
+    num_proc: Optional[int] = None,
 ) -> Dataset:
```

The rest of the function implementation looks correct: the variable initialization properly uses sets, and the loop variable is appropriately named `row_hash` to avoid shadowing the built-in `hash` function.
📒 Files selected for processing (2)
- src/axolotl/utils/data/sft.py (3 hunks)
- src/axolotl/utils/data/utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/axolotl/utils/data/sft.py
🧰 Additional context used
🪛 Pylint (3.3.7)
src/axolotl/utils/data/utils.py
[convention] 12-12: standard import "typing.Optional" should be placed before third party imports "huggingface_hub", "numpy", "requests", "datasets.Dataset" (C0411)
🪛 GitHub Actions: lint
src/axolotl/utils/data/utils.py
[error] 1-1: Pre-commit black hook failed: file was reformatted by black. Run 'black src/axolotl/utils/data/utils.py' to fix formatting.
[error] 1-1: Pre-commit isort hook failed: file was modified to fix import sorting. Run 'isort src/axolotl/utils/data/utils.py' to fix import order.
[error] 80-153: Mypy type checking errors: Incompatible default for argument 'num_proc' (None vs int); multiple calls to 'deduplicate_dataset' with argument 'num_proc' of type 'int | None' instead of expected 'int'. See lines 80, 129, 141, 153.
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.5.1)
🔇 Additional comments (6)
src/axolotl/utils/data/utils.py (6)
73-74: LGTM! Clean helper function implementation. The `compute_row_hash` function is well-implemented and provides a clean interface for computing SHA256 hashes of dataset examples.

82-82: LGTM! Correct variable initialization. The variable initialization now correctly uses `set()` for `other_hashes`, `seen_hashes`, and `unique_indices`, which aligns with their usage patterns in the function.

84-100: Efficient vectorized hash computation approach. The new implementation using `dataset.map()` with `compute_row_hash` is a significant improvement over manual iteration. The parallel processing support via `num_proc` will enhance performance for large datasets (a hedged sketch of this map-plus-set approach follows after this comments block).

102-106: LGTM! Clean deduplication logic. The deduplication loop correctly uses set operations for efficient lookup and avoids variable naming conflicts by using `row_hash` instead of `hash`.

116-116: LGTM! Consistent type annotation. The `num_proc` parameter correctly uses the `Optional[int]` type annotation, which is consistent with the intended usage pattern.

126-142: Verify consistent parameter passing in deduplication calls. The updated function calls correctly pass the `num_proc` parameter to enable parallel processing. The logging messages provide good visibility into the deduplication process. However, ensure that the type annotation issue in `deduplicate_dataset` is fixed to prevent type checking errors in these calls.
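The exact function body is not quoted here; the following is a minimal sketch of the map-plus-set approach described above, reusing the `compute_row_hash` sketch from earlier. The review snippets indicate the PR stores unique indices in a set; this sketch uses a list purely to preserve row order.

```python
from typing import Optional

from datasets import Dataset


def deduplicate_dataset(
    dataset: Dataset,
    other_dataset: Optional[Dataset] = None,
    num_proc: Optional[int] = None,
) -> Dataset:
    # Hash every row in parallel; map() returns a copy with a "row_hash"
    # column, leaving the original dataset untouched.
    hashed = dataset.map(compute_row_hash, num_proc=num_proc)

    other_hashes: set[str] = set()
    if other_dataset is not None:
        other_hashed = other_dataset.map(compute_row_hash, num_proc=num_proc)
        other_hashes = set(other_hashed["row_hash"])
        del other_hashed

    seen_hashes: set[str] = set()
    unique_indices: list[int] = []
    for idx, row_hash in enumerate(hashed["row_hash"]):
        # Skip rows already seen in this dataset or present in the other one.
        if row_hash in seen_hashes or row_hash in other_hashes:
            continue
        seen_hashes.add(row_hash)
        unique_indices.append(idx)

    return dataset.select(unique_indices)
```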
Review thread on this excerpt from src/axolotl/utils/data/utils.py:

```python
continue
seen_hashes.add(row_hash)
unique_indices.add(idx)
```
It might be worth dropping the added `row_hash` column here now that it's done. The previous implementation did not modify the dataset.
This shouldn't be modifying it either. The only change that happens to `dataset` is at the end when it returns with `.select`. The `row_hash` stuff comes from a copy in `hashed` and `other_hashed` (which I should probably delete after grabbing the hashes).
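A quick illustration of that point (a sketch, not the PR code; it assumes the `compute_row_hash` helper sketched earlier):

```python
# Mapping adds "row_hash" only to the copy; the original dataset keeps its
# columns and is only touched by the final .select().
hashed = dataset.map(compute_row_hash, num_proc=num_proc)
assert "row_hash" in hashed.column_names
assert "row_hash" not in dataset.column_names
del hashed  # free the copy once its hashes have been collected
```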
Actionable comments posted: 1
📒 Files selected for processing (1)
- src/axolotl/utils/data/utils.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.5.1)
🔇 Additional comments (5)
src/axolotl/utils/data/utils.py (5)
7-7: Import addition looks good. The `Optional` import is properly placed according to PEP 8 conventions and is needed for the type annotations in the refactored functions.

73-74: Clean helper function for hash computation. The function correctly computes SHA256 hashes for dataset examples and returns the expected dictionary format for use with `dataset.map()`.

77-106: Excellent refactoring for multi-threaded deduplication. The refactored implementation successfully addresses the PR objectives:
- Multi-threading: uses the `num_proc` parameter with `dataset.map()` for parallel hash computation
- Efficiency: replaces the iterative approach with vectorized operations and set-based lookups
- Memory management: explicit cleanup of intermediate variables
- Correctness: proper type initialization and variable naming (addresses past review concerns)

The hash-based approach without collision verification aligns with the stated conclusion that SHA256 collisions are too unlikely to justify additional checks.

114-114: Parameter addition supports parallelization. The `num_proc` parameter correctly enables parallel processing for the deduplication operations.

126-127: Proper integration of parallel processing parameter. All calls to `deduplicate_dataset` correctly pass the `num_proc` parameter, enabling the multi-threaded deduplication feature described in the PR objectives (a hedged sketch of this call pattern follows after this comments block). Also applies to: 139-139, 151-151.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/axolotl/utils/data/utils.py (1)
104-104: Move memory cleanup after return statement. The `del` statement for `hashes`, `other_hashes`, and `seen_hashes` occurs before the return, but `unique_indices` is still needed. Consider moving this cleanup or removing it, since these variables will be garbage collected anyway when the function exits. Apply this diff to clean up the memory management:

```diff
-    del hashes, other_hashes, seen_hashes
     return dataset.select(unique_indices)
```
📒 Files selected for processing (1)
- src/axolotl/utils/data/utils.py (5 hunks)
🧰 Additional context used
🪛 GitHub Actions: lint
src/axolotl/utils/data/utils.py
[error] 90-92: Pre-commit black hook failed: file was reformatted due to code style issues. Run 'black src/axolotl/utils/data/utils.py' to fix.
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
🔇 Additional comments (4)
src/axolotl/utils/data/utils.py (4)
73-74: LGTM: Clean hash computation implementation. The `compute_row_hash` function provides a clean interface for computing SHA256 hashes of dataset examples, supporting the new parallelized deduplication approach.

89-89: Good memory management practices. The explicit deletion of intermediate variables (`del hashed` and `del other_hashed`) is excellent for memory efficiency, especially important when processing large datasets. Also applies to: 96-96.

98-102: Verify deduplication logic correctness. The deduplication logic looks correct: it properly checks against both `seen_hashes` (within the dataset) and `other_hashes` (from the other dataset) before adding indices. This ensures no duplicates within the dataset or between datasets.
86-86: ❓ Verification inconclusive. Fix code formatting to resolve pipeline failure.

The pipeline indicates a black formatting issue. The code needs to be reformatted according to black style guidelines. Run the following command to fix the formatting issue (also applies to: 93-93):

🏁 Script executed:

```bash
#!/bin/bash
# Description: Fix black formatting issues in the file
# Expected: The file should be reformatted according to black style guidelines
black src/axolotl/utils/data/utils.py
```

Length of output: 135

Install Black and reformat code to resolve pipeline failure

It looks like the `black` formatter isn't available in the environment (`black: command not found`). Please take the following steps:

1. Add Black to your development dependencies (for example, run `pip install black` or update your `requirements-dev.txt`/`poetry`/`pipenv` files).
2. Re-run the formatter on the affected file(s):

```bash
black src/axolotl/utils/data/utils.py
```

3. Confirm that both line 86 and line 93 (and any other Black warnings) are now correctly formatted.
Uses only the SHA256 hash. Collision is too unlikely to be worth checking for.
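For context, a back-of-the-envelope birthday-bound estimate (not from the PR) supports that judgement: for $n$ rows and a 256-bit hash, the collision probability is roughly

$$p \approx \frac{n(n-1)}{2 \cdot 2^{256}} \approx \frac{(2.62 \times 10^{5})^{2}}{2^{257}} \approx 3 \times 10^{-67}$$

for the 262K-sample run mentioned below, which is negligible.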
262K samples de-duped in ~4 minutes. Previously it took ~15 minutes to de-dupe 162K samples, so roughly 6x faster (~65K samples/min vs. ~11K samples/min), or even faster if you count it de-duping multiple times before.
Resolves #2719