
Make De-duplication Multi-threaded and Happen Only During Pre-processing #2747


Open

xzuyn wants to merge 11 commits into main

Conversation

@xzuyn (Contributor) commented Jun 1, 2025

Uses only the SHA256 hash; a collision is too unlikely to be worth checking for.

262K samples de-duplicated in ~4 minutes, versus ~15 minutes for 162K samples previously, so roughly 6x higher throughput, and better still in practice since the old code de-duplicated multiple times per run.

[2025-06-01 06:26:34,020] [INFO] [axolotl.utils.data.utils.deduplicate_and_log_datasets:132] [PID:74277] [RANK:0] Train dataset is None. Skipping deduplication.
[2025-06-01 06:26:34,020] [INFO] [axolotl.utils.data.utils.deduplicate_and_log_datasets:145] [PID:74277] [RANK:0] Eval dataset is None. Skipping deduplication.
[2025-06-01 06:26:34,020] [INFO] [axolotl.utils.data.utils.deduplicate_and_log_datasets:148] [PID:74277] [RANK:0] Starting deduplication for combined dataset. Original size: 262437
Map (num_proc=12): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 262437/262437 [03:45<00:00, 1163.76 examples/s]
[2025-06-01 06:30:20,862] [INFO] [axolotl.utils.data.utils.deduplicate_and_log_datasets:152] [PID:74277] [RANK:0] Deduplication complete for combined dataset. New size: 258523

Resolves #2719
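
For illustration, a minimal sketch of the hash-then-select approach described above, using only the Hugging Face datasets API. The names (compute_row_hash, deduplicate) are illustrative and may not match the actual helpers in src/axolotl/utils/data/utils.py:

    import hashlib
    from typing import Optional

    from datasets import Dataset


    def compute_row_hash(example: dict) -> dict:
        # Hash the stringified row; SHA256 collisions are treated as negligible.
        return {"row_hash": hashlib.sha256(str(example).encode("utf-8")).hexdigest()}


    def deduplicate(dataset: Dataset, num_proc: Optional[int] = None) -> Dataset:
        # Parallel hash pass over the dataset, then a single-threaded scan that
        # keeps the index of the first occurrence of each hash.
        hashed = dataset.map(compute_row_hash, num_proc=num_proc)
        seen_hashes: set[str] = set()
        unique_indices: list[int] = []
        for idx, row_hash in enumerate(hashed["row_hash"]):
            if row_hash in seen_hashes:
                continue
            seen_hashes.add(row_hash)
            unique_indices.append(idx)
        return dataset.select(unique_indices)


    if __name__ == "__main__":
        ds = Dataset.from_dict({"text": ["a", "b", "a", "c"]})
        print(len(deduplicate(ds, num_proc=2)))  # 3

The parallel map is where the speedup comes from; the final scan is a cheap O(n) pass over precomputed hashes.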

Summary by CodeRabbit

  • Refactor
    • Improved dataset deduplication with faster, parallel processing using hash-based methods.
    • Deduplication now occurs once immediately after dataset merging, reducing redundant steps.
  • Chores
    • Simplified dataset preparation workflow for better consistency and efficiency.


coderabbitai bot commented Jun 1, 2025

Warning

Rate limit exceeded

@xzuyn has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 1 minute and 32 seconds before requesting another review.


📥 Commits

Reviewing files that changed from the base of the PR and between e331df1 and 783579b.

📒 Files selected for processing (1)
  • src/axolotl/utils/data/utils.py (5 hunks)

Walkthrough

The deduplication logic for datasets has been refactored and consolidated. Deduplication now occurs once after dataset merging in the pre-processing workflow, controlled by configuration, and is removed from later stages. The deduplication implementation is now parallelized and hash-based, improving efficiency and ensuring deduplication is performed only during pre-processing.

Changes

File(s) and change summary:
  • src/axolotl/utils/data/sft.py: Deduplication moved to a single point after dataset merging; removed from later stages.
  • src/axolotl/utils/data/utils.py: Refactored deduplication to a parallel, hash-based approach; updated function signatures; added a helper for row hashing.

Sequence Diagram(s)

sequenceDiagram
    participant Config
    participant SFT
    participant Utils

    Config->>SFT: Provide cfg.dataset_exact_deduplication, cfg.dataset_processes
    SFT->>SFT: Merge datasets (if applicable)
    alt Deduplication enabled
        SFT->>Utils: deduplicate_and_log_datasets(merged_dataset, num_proc)
        Utils->>Utils: deduplicate_dataset (parallel hash-based deduplication)
        Utils-->>SFT: Deduplicated dataset
    end
    SFT->>SFT: Drop long sequences
    SFT-->>Caller: Return processed dataset
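
A rough Python sketch of the flow above; prepare_after_merge and the plain-namespace cfg are stand-ins for illustration, not axolotl's real API:

    from types import SimpleNamespace

    # Stand-in for axolotl's YAML-backed config; only the two fields used in
    # the diagram are modeled here.
    cfg = SimpleNamespace(dataset_exact_deduplication=True, dataset_processes=4)


    def prepare_after_merge(merged_dataset, cfg, dedupe_fn):
        # Deduplicate exactly once, right after merging, and only when the
        # config flag is set; later stages (dropping long sequences, the
        # train/eval split) no longer deduplicate.
        if cfg.dataset_exact_deduplication:
            merged_dataset = dedupe_fn(merged_dataset, num_proc=cfg.dataset_processes)
        return merged_dataset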

Assessment against linked issues

Objectives from the linked issue:
  • Move deduplication to before saving in pre-processing phase (#2719)
  • Prevent deduplication from running before training phase (#2719)
  • Make deduplication code multi-threaded/parallelized (#2719)

Assessment against linked issues: Out-of-scope changes

No out-of-scope changes found.

Poem

In the land of datasets, so vast and so wide,
A bunny dedupes with parallel pride.
Once done at each stage, now just once—what a hop!
No more waiting for hashes to stop.
With swift little paws and a whiskery cheer,
Pre-processed sets are now crystal clear!
🐇✨


coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
src/axolotl/utils/data/utils.py (1)

110-116: ⚠️ Potential issue

Fix type annotation for consistency.

The num_proc parameter should use Optional[int] for proper type annotation.

Apply this diff:

 def deduplicate_and_log_datasets(
     *,
     train_dataset: Dataset = None,
     eval_dataset: Dataset = None,
     dataset: Dataset = None,
-    num_proc: int = None,
+    num_proc: Optional[int] = None,
 ) -> tuple[Dataset, Dataset, Dataset]:
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d5d0dc5 and 9357fe6.

📒 Files selected for processing (2)
  • src/axolotl/utils/data/sft.py (3 hunks)
  • src/axolotl/utils/data/utils.py (2 hunks)
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/axolotl/utils/data/sft.py (1)
src/axolotl/utils/data/utils.py (1)
  • deduplicate_and_log_datasets (110-143)
🪛 GitHub Actions: lint
src/axolotl/utils/data/sft.py

[error] 341-343: Pre-commit black hook reformatted this file. Run 'black' to fix code style issues.

src/axolotl/utils/data/utils.py

[error] 1-1: Black formatting failed. Files src/axolotl/utils/data/utils.py and src/axolotl/utils/data/sft.py were reformatted by the black hook.


[warning] 101-101: Pylint warning W0622: Redefining built-in 'hash' (redefined-builtin).


[error] 79-115: Mypy type checking errors: Incompatible default for argument 'num_proc' (default None, expected int), missing type annotation for 'other_hashes', incompatible assignment types (set[Any] assigned to list[Any]).

🪛 Pylint (3.3.7)
src/axolotl/utils/data/utils.py

[warning] 101-101: Redefining built-in 'hash'

(W0622)

⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.6.0)
🔇 Additional comments (2)
src/axolotl/utils/data/utils.py (1)

72-74: LGTM!

The helper function correctly computes SHA256 hash for dataset rows.

src/axolotl/utils/data/sft.py (1)

444-461: Deduplication consolidation looks good!

The removal of deduplication calls from the train/test split logic is correct. Deduplication now happens once after dataset merging (line 344), which is more efficient and aligns with the PR objective of "deduplication happening only during pre-processing."

coderabbitai bot left a comment

Actionable comments posted: 1

♻️ Duplicate comments (1)
src/axolotl/utils/data/utils.py (1)

77-108: Fix type annotation for num_proc parameter.

The type annotation for num_proc is incorrect - it's annotated as int but has a default value of None, which causes type checking errors.

Apply this diff to fix the type annotation:

 def deduplicate_dataset(
     dataset: Dataset,
     other_dataset: Dataset = None,
-    num_proc: int = None,
+    num_proc: Optional[int] = None,
 ) -> Dataset:

The rest of the function implementation looks correct - the variable initialization properly uses sets, and the loop variable is appropriately named row_hash to avoid shadowing the built-in hash function.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 9357fe6 and 055bf97.

📒 Files selected for processing (2)
  • src/axolotl/utils/data/sft.py (3 hunks)
  • src/axolotl/utils/data/utils.py (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/axolotl/utils/data/sft.py
🧰 Additional context used
🪛 Pylint (3.3.7)
src/axolotl/utils/data/utils.py

[convention] 12-12: standard import "typing.Optional" should be placed before third party imports "huggingface_hub", "numpy", "requests", "datasets.Dataset"

(C0411)

🪛 GitHub Actions: lint
src/axolotl/utils/data/utils.py

[error] 1-1: Pre-commit black hook failed: file was reformatted by black. Run 'black src/axolotl/utils/data/utils.py' to fix formatting.


[error] 1-1: Pre-commit isort hook failed: file was modified to fix import sorting. Run 'isort src/axolotl/utils/data/utils.py' to fix import order.


[error] 80-153: Mypy type checking errors: Incompatible default for argument 'num_proc' (None vs int); multiple calls to 'deduplicate_dataset' with argument 'num_proc' of type 'int | None' instead of expected 'int'. See lines 80, 129, 141, 153.

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.5.1)
🔇 Additional comments (6)
src/axolotl/utils/data/utils.py (6)

73-74: LGTM! Clean helper function implementation.

The compute_row_hash function is well-implemented and provides a clean interface for computing SHA256 hashes of dataset examples.


82-82: LGTM! Correct variable initialization.

The variable initialization now correctly uses set() for other_hashes, seen_hashes, and unique_indices, which aligns with their usage patterns in the function.


84-100: Efficient vectorized hash computation approach.

The new implementation using dataset.map() with compute_row_hash is a significant improvement over manual iteration. The parallel processing support via num_proc will enhance performance for large datasets.


102-106: LGTM! Clean deduplication logic.

The deduplication loop correctly uses set operations for efficient lookup and avoids variable naming conflicts by using row_hash instead of hash.


116-116: LGTM! Consistent type annotation.

The num_proc parameter correctly uses Optional[int] type annotation, which is consistent with the intended usage pattern.


126-142: Verify consistent parameter passing in deduplication calls.

The updated function calls correctly pass the num_proc parameter to enable parallel processing. The logging messages provide good visibility into the deduplication process.

However, ensure that the type annotation issue in deduplicate_dataset is fixed to prevent type checking errors in these calls.

Quoted code context:

        continue
    seen_hashes.add(row_hash)
    unique_indices.add(idx)

Collaborator commented:

It might be worth dropping the added row_hash column here now that it's done. The previous implementation did not modify the dataset.

xzuyn (author) replied:

This shouldn't be modifying it either; the only change to dataset happens at the end, when it returns with .select. The row_hash column only exists on the copies hashed and other_hashed (which I should probably delete after grabbing the hashes).
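
To illustrate the point with a hypothetical snippet (not the PR's code): Dataset.map writes its output to a new dataset, so the original is left untouched, and remove_columns can keep only the hash column on the copy if memory is a concern:

    import hashlib

    from datasets import Dataset


    def compute_row_hash(example: dict) -> dict:
        return {"row_hash": hashlib.sha256(str(example).encode("utf-8")).hexdigest()}


    ds = Dataset.from_dict({"text": ["a", "b", "a"]})

    # map() returns a new dataset; ds keeps its original columns.
    hashed = ds.map(compute_row_hash, remove_columns=ds.column_names)

    print(ds.column_names)      # ['text']
    print(hashed.column_names)  # ['row_hash']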

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d314ca9 and 1be5174.

📒 Files selected for processing (1)
  • src/axolotl/utils/data/utils.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.5.1)
🔇 Additional comments (5)
src/axolotl/utils/data/utils.py (5)

7-7: Import addition looks good.

The Optional import is properly placed according to PEP 8 conventions and is needed for the type annotations in the refactored functions.


73-74: Clean helper function for hash computation.

The function correctly computes SHA256 hashes for dataset examples and returns the expected dictionary format for use with dataset.map().


77-106: Excellent refactoring for multi-threaded deduplication.

The refactored implementation successfully addresses the PR objectives:

  • Multi-threading: Uses num_proc parameter with dataset.map() for parallel hash computation
  • Efficiency: Replaces iterative approach with vectorized operations and set-based lookups
  • Memory management: Explicit cleanup of intermediate variables
  • Correctness: Proper type initialization and variable naming (addresses past review concerns)

The hash-based approach without collision verification aligns with the stated conclusion that SHA256 collisions are too unlikely to justify additional checks.


114-114: Parameter addition supports parallelization.

The num_proc parameter correctly enables parallel processing for the deduplication operations.


127-127: Proper integration of parallel processing parameter.

All calls to deduplicate_dataset correctly pass the num_proc parameter, enabling the multi-threaded deduplication feature described in the PR objectives.

Also applies to: 139-139, 151-151

coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
src/axolotl/utils/data/utils.py (1)

104-104: Remove the redundant memory cleanup before the return.

The del statement for hashes, other_hashes, and seen_hashes runs immediately before the return, while unique_indices is still needed; since all of these locals are garbage-collected when the function exits anyway, the explicit cleanup can simply be dropped.

Apply this diff to clean up the memory management:

-    del hashes, other_hashes, seen_hashes
-
     return dataset.select(unique_indices)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 1be5174 and 7af308a.

📒 Files selected for processing (1)
  • src/axolotl/utils/data/utils.py (5 hunks)
🧰 Additional context used
🪛 GitHub Actions: lint
src/axolotl/utils/data/utils.py

[error] 90-92: Pre-commit black hook failed: file was reformatted due to code style issues. Run 'black src/axolotl/utils/data/utils.py' to fix.

⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.5.1)
🔇 Additional comments (4)
src/axolotl/utils/data/utils.py (4)

73-74: LGTM: Clean hash computation implementation.

The compute_row_hash function provides a clean interface for computing SHA256 hashes of dataset examples, supporting the new parallelized deduplication approach.


89-89: Good memory management practices.

The explicit deletion of intermediate variables (del hashed and del other_hashed) is excellent for memory efficiency, especially important when processing large datasets.

Also applies to: 96-96


98-102: Verify deduplication logic correctness.

The deduplication logic looks correct - it properly checks against both seen_hashes (within the dataset) and other_hashes (from the other dataset) before adding indices. This ensures no duplicates within the dataset or between datasets.


86-86:

❓ Verification inconclusive

Fix code formatting to resolve pipeline failure.

The pipeline indicates a black formatting issue. The code needs to be reformatted according to black style guidelines.

Run the following command to fix the formatting issue:

Also applies to: 93-93


🏁 Script executed:

#!/bin/bash
# Description: Fix black formatting issues in the file
# Expected: The file should be reformatted according to black style guidelines

black src/axolotl/utils/data/utils.py

Length of output: 135


Install Black and reformat code to resolve pipeline failure

It looks like the black formatter isn’t available in the environment (black: command not found). Please take the following steps:

  • Add Black to your development dependencies (for example, run pip install black or update your requirements-dev.txt/poetry/pipenv files).

  • Re-run the formatter on the affected file(s):

    black src/axolotl/utils/data/utils.py
  • Confirm that both line 86 and line 93 (and any other Black warnings) are now correctly formatted.

@winglian (Collaborator) commented Jun 5, 2025

@djsaunde does #2707 overlap or supersede this?

@djsaunde (Member) commented Jun 5, 2025

@djsaunde does #2707 overlap or supersede this?

Should go in after.

Successfully merging this pull request may close the following issue: Save De-duplicated Set During Pre-processing.