Data loader refactor #2707


Merged

merged 36 commits into from Jun 10, 2025
Conversation

Member

@djsaunde djsaunde commented May 22, 2025

Description

Data loading refactor, with an emphasis on sft.py, rl.py, and related modules.

Motivation and Context

The current state of data loading involves a lot of indirection, undocumented code, missing type annotations, etc. This refactor aims to clean things up to improve readability and extensibility.

Also closes #2684 via filelock implementation (credit to @casper-hansen for reference code).

How has this been tested?

TODO


Summary by CodeRabbit

  • New Features

    • Introduced a modular system for dataset wrapping and tokenization supporting diverse dataset types and prompt styles.
    • Added generalized dataset preparation workflows for supervised fine-tuning and reinforcement learning with distributed synchronization and caching.
    • Implemented a file-based locking mechanism to coordinate dataset loading and preparation across concurrent processes.
  • Improvements

    • Enhanced dataset loading from local, cloud, remote, and URL sources with improved error handling and modular design.
    • Streamlined deduplication and sequence filtering for better clarity and reliability.
    • Improved code clarity, documentation, and consistent use of modern Python type hints.
    • Simplified dataset preparation logic with better modularity and distributed coordination.
    • Refined retry strategies and hashing utilities with clearer documentation.
    • Safer configuration validation preventing attribute errors.
    • Added explicit public APIs and standardized dataset cache paths.
    • Updated dataset wrapper selection with extensible handlers for various dataset categories.
    • Simplified and clarified dataset loading and preparation function signatures and internal logic.
    • Removed redundant CLI argument dependencies in dataset loading calls across tests.
  • Bug Fixes

    • Corrected subprocess termination check logic to properly detect process exit.
    • Fixed potential attribute errors in configuration validation by using safe attribute access.
  • Tests

    • Updated tests to reflect new dataset preparation and deduplication APIs.
    • Added comprehensive tests for the file-based locking mechanism ensuring safe concurrent dataset loading.
    • Adjusted tests to remove CLI argument dependencies in dataset loading calls.

@djsaunde djsaunde self-assigned this May 22, 2025
Copy link

coderabbitai bot commented May 22, 2025

Caution

Review failed

The pull request is closed.

Walkthrough

This pull request refactors and modularizes dataset preparation, loading, and wrapping logic across supervised fine-tuning (SFT) and reinforcement learning (RL) workflows. It introduces new modules, updates function signatures, improves distributed synchronization and deduplication, and aligns tests with the new APIs. Docstrings and type hints are also modernized throughout.

Changes

Files/Paths and change summaries:

  • src/axolotl/common/const.py, src/axolotl/train.py, tests/e2e/*: Reformatted docstrings to single-line style; no logic changes.
  • src/axolotl/common/datasets.py, src/axolotl/prompt_tokenizers.py: Updated function names, signatures, and imports for dataset preparation and wrapping; added an abstract method for the wrapping strategy.
  • src/axolotl/datasets.py: Modernized type hints, reformatted docstrings, minor variable changes; no logic changes.
  • src/axolotl/prompt_strategies/messages/__init__.py: Removed a redundant return None statement.
  • src/axolotl/utils/data/__init__.py: Updated imports, renamed functions, added __all__ for an explicit API.
  • src/axolotl/utils/data/rl.py: Refactored for general RL dataset handling; modularized loading, transformation, filtering, distributed coordination, and caching.
  • src/axolotl/utils/data/sft.py: Major refactor: split logic into modular functions, improved distributed synchronization, renamed and retyped the API, removed hardcoded prompt handling.
  • src/axolotl/utils/data/shared.py: Refactored dataset loading: modularized by source, improved error handling, renamed functions, added a fingerprint utility.
  • src/axolotl/utils/data/utils.py: Refactored deduplication logic, improved docstrings, simplified error handling, updated function signatures.
  • src/axolotl/utils/data/wrappers.py (new): Introduced centralized, extensible dataset wrapping/strategy selection for SFT.
  • src/axolotl/utils/schemas/config.py: Made micro_batch_size access safer with getattr in validation.
  • src/axolotl/utils/data/lock.py (new): Added a FileLockLoader class for process synchronization during dataset preparation using file locks and ready flags.
  • src/axolotl/loaders/tokenizer.py: Added type annotations; renamed a local variable to avoid shadowing.
  • tests/prompt_strategies/test_dpo_chatml.py, tests/test_datasets.py, tests/test_exact_deduplication.py, tests/core/test_builders.py: Updated imports, patch targets, and function calls to match the new dataset preparation APIs and signatures; adjusted test logic for deduplication.
  • tests/e2e/multigpu/test_locking.py (new): Added tests for FileLockLoader synchronization behavior and concurrency handling.

Sequence Diagram(s)

sequenceDiagram
    participant Trainer as Trainer/CLI
    participant SFT as prepare_datasets (SFT)
    participant RL as prepare_preference_datasets (RL)
    participant Shared as load_dataset_with_config
    participant Wrappers as get_dataset_wrapper
    participant Lock as FileLock/ReadyFlag

    Trainer->>SFT: prepare_datasets(cfg, tokenizer, ...)
    SFT->>Lock: Acquire file lock, check ready flag
    alt Not ready
        SFT->>Shared: load_dataset_with_config(...)
        SFT->>Wrappers: get_dataset_wrapper(...)
        SFT->>Lock: Save processed dataset, set ready flag
    else Ready
        SFT->>Shared: load processed dataset
    end
    SFT->>Trainer: Return train/eval datasets, steps, prompters

    Trainer->>RL: prepare_preference_datasets(cfg)
    RL->>Lock: Acquire file lock, check ready flag
    alt Not ready
        RL->>Shared: load_dataset_with_config(...)
        RL->>RL: Transform, filter, cache dataset
        RL->>Lock: Save processed dataset, set ready flag
    else Ready
        RL->>Shared: load processed dataset
    end
    RL->>Trainer: Return train/eval datasets
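To make the locking flow above concrete, here is a minimal sketch of the lock/ready-flag pattern, assuming the filelock package; the lock and flag file names match those discussed later in the review, but the class body and the prepare_fn hook are illustrative rather than the repository's exact API.

import time
from pathlib import Path

from filelock import FileLock


class FileLockLoader:
    """Coordinate dataset preparation across concurrent processes.

    Illustrative sketch only; the real implementation lives in
    src/axolotl/utils/data/lock.py.
    """

    def __init__(self, dataset_prepared_path: str):
        base = Path(dataset_prepared_path)
        base.mkdir(parents=True, exist_ok=True)
        self.lock_file_path = base / "datasets_prep.lock"
        self.ready_flag_path = base / "datasets_ready.flag"

    def load(self, prepare_fn):
        # The rank that acquires the lock first does the data preprocessing.
        with FileLock(str(self.lock_file_path)):
            if not self.ready_flag_path.exists():
                dataset = prepare_fn()        # tokenize / transform / save to disk
                self.ready_flag_path.touch()  # signal: cached dataset is ready
                return dataset
        # All other ranks wait for the flag, then load the cached result.
        while not self.ready_flag_path.exists():
            time.sleep(1)
        return prepare_fn()  # hits the on-disk cache on this pass

One subtlety the reviewers raise below: the flag must only be touched after the dataset is fully written to disk, otherwise waiting ranks can observe a partial cache.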

Assessment against linked issues

Objective: Fix distributed timeout during parallel dataset tokenization in Axolotl training, ensuring stable, deadlock-free synchronization (#2684)

Poem

In fields of data, rabbits hop—
Refactoring code, we never stop!
Locks and wrappers, deduped delight,
Distributed woes are now set right.
With modular paths and type hints clear,
Our datasets run without a fear.
🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between d523857 and 9b1b33d.

📒 Files selected for processing (2)
  • src/axolotl/utils/data/rl.py (2 hunks)
  • tests/e2e/multigpu/test_locking.py (1 hunks)

@djsaunde djsaunde changed the title Data load refactor Data loader refactor May 22, 2025

codecov bot commented May 22, 2025

@djsaunde djsaunde force-pushed the data-load-refactor branch 2 times, most recently from 3dad97a to 255757c Compare May 27, 2025 16:12
@djsaunde
Member Author

Current test failures will be fixed once #2608 is merged

@djsaunde djsaunde force-pushed the data-load-refactor branch from 255757c to daf5076 Compare May 29, 2025 16:06
@djsaunde djsaunde marked this pull request as ready for review May 29, 2025 16:13
@djsaunde djsaunde requested review from winglian, NanoCode012 and SalmanMohammadi and removed request for winglian May 29, 2025 16:13
@djsaunde
Member Author

Opening up for review, but there's still plenty to do IMO under axolotl.utils.data.

Main areas of focus (under axolotl/utils/data): rl.py, sft.py, shared.py, wrappers.py, utils.py.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🔭 Outside diff range comments (1)
src/axolotl/utils/data/rl.py (1)

161-161: ⚠️ Potential issue

Fix incorrect log message in save function.

The log message says "Loading" but this function is saving the dataset.

-        LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
+        LOG.info(f"Saving prepared dataset to disk at {prepared_ds_path}...")
♻️ Duplicate comments (1)
src/axolotl/utils/data/rl.py (1)

50-83: Apply the same distributed coordination fix as suggested for sft.py.

This function has the same race condition issue where the ready flag is created before datasets are fully saved to disk.

🧹 Nitpick comments (3)
src/axolotl/utils/data/shared.py (1)

116-121: Consider using contextlib.suppress for cleaner exception handling.

The static analysis tool correctly identifies an opportunity to simplify the exception handling.

Apply this refactor for more Pythonic code:

+import contextlib
 # ... other imports ...

     is_cloud_dataset = False
     if remote_fs:
-        try:
-            is_cloud_dataset = remote_fs.exists(dataset_config.path)
-        except (FileNotFoundError, ConnectionError):
-            pass
+        with contextlib.suppress(FileNotFoundError, ConnectionError):
+            is_cloud_dataset = remote_fs.exists(dataset_config.path)
🧰 Tools
🪛 Ruff (0.11.9)

117-120: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)

src/axolotl/utils/data/wrappers.py (2)

120-127: Extract duplicated error handling logic

The error handling logic for unhandled dataset types is duplicated. Consider extracting it to a helper function.

Add this helper function at the module level:

def _raise_unhandled_dataset_error(dataset_type: str) -> None:
    """Raise a ValueError for unhandled dataset types with helpful suggestions."""
    suffix = ""
    if ":load_" in dataset_type:
        suffix = f" Did you mean {dataset_type.replace(':load_', '.load_')}?"
    
    error_message = f"unhandled prompt tokenization strategy: {dataset_type}. {suffix}"
    LOG.error(error_message)
    raise ValueError(error_message)

Then replace both occurrences with:

-    ds_type = dataset_config.type
-    suffix = ""
-    if ":load_" in ds_type:
-        suffix = f" Did you mean {ds_type.replace(':load_', '.load_')}?"
-
-    error_message = f"unhandled prompt tokenization strategy: {ds_type}. {suffix}"
-    LOG.error(error_message)
-    raise ValueError(error_message)
+    _raise_unhandled_dataset_error(dataset_config.type)

Also applies to: 172-180


234-416: Consider refactoring similar handler functions

Many handler functions follow nearly identical patterns, differing only in the prompter and strategy classes used. This could be refactored to reduce code duplication.

Consider creating a generic handler factory:

def _create_generic_handler(
    prompter_class: type[Prompter],
    strategy_class: type[PromptTokenizingStrategy],
) -> Callable:
    """Create a handler function for a specific prompter and strategy combination."""
    def handler(
        dataset_prompt_style: str | None,
        tokenizer: PreTrainedTokenizer,
        cfg: DictDefault,
        dataset: Dataset | IterableDataset,
        dataset_kwargs: dict[str, Any],
    ) -> tuple[Dataset | IterableDataset, Prompter]:
        dataset_prompter = prompter_class(dataset_prompt_style)
        dataset_strategy = strategy_class(
            dataset_prompter,
            tokenizer,
            cfg.train_on_inputs,
            cfg.sequence_len,
        )
        dataset_wrapper = wrap_dataset_for_tokenized_prompt(
            dataset_strategy,
            dataset,
            **dataset_kwargs,
        )
        return dataset_wrapper, dataset_prompter
    
    return handler

# Then define handlers more concisely:
DATASET_HANDLERS = {
    "alpaca": _create_generic_handler(AlpacaPrompter, AlpacaPromptTokenizingStrategy),
    "summarizetldr": _create_generic_handler(SummarizeTLDRPrompter, SummarizeTLDRPromptTokenizingStrategy),
    # ... etc
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ec4ebfd and ce70f27.

📒 Files selected for processing (19)
  • src/axolotl/common/const.py (1 hunks)
  • src/axolotl/common/datasets.py (5 hunks)
  • src/axolotl/datasets.py (5 hunks)
  • src/axolotl/prompt_strategies/messages/__init__.py (0 hunks)
  • src/axolotl/prompt_tokenizers.py (2 hunks)
  • src/axolotl/train.py (1 hunks)
  • src/axolotl/utils/data/__init__.py (1 hunks)
  • src/axolotl/utils/data/pretraining.py (1 hunks)
  • src/axolotl/utils/data/rl.py (5 hunks)
  • src/axolotl/utils/data/sft.py (4 hunks)
  • src/axolotl/utils/data/shared.py (2 hunks)
  • src/axolotl/utils/data/utils.py (5 hunks)
  • src/axolotl/utils/data/wrappers.py (1 hunks)
  • src/axolotl/utils/schemas/config.py (1 hunks)
  • tests/e2e/test_dpo.py (1 hunks)
  • tests/e2e/test_llama_pretrain.py (2 hunks)
  • tests/prompt_strategies/test_dpo_chatml.py (2 hunks)
  • tests/test_datasets.py (15 hunks)
  • tests/test_exact_deduplication.py (15 hunks)
💤 Files with no reviewable changes (1)
  • src/axolotl/prompt_strategies/messages/__init__.py
🧰 Additional context used
🧬 Code Graph Analysis (8)
tests/prompt_strategies/test_dpo_chatml.py (1)
src/axolotl/utils/data/rl.py (1)
  • prepare_preference_datasets (32-99)
src/axolotl/prompt_tokenizers.py (1)
src/axolotl/prompt_strategies/messages/chat.py (1)
  • wrap_dataset (33-48)
src/axolotl/utils/data/__init__.py (5)
src/axolotl/utils/data/pretraining.py (2)
  • encode_pretraining (20-176)
  • wrap_pretraining_dataset (179-239)
src/axolotl/utils/data/rl.py (1)
  • prepare_preference_datasets (32-99)
src/axolotl/utils/data/wrappers.py (1)
  • get_dataset_wrapper (46-127)
src/axolotl/utils/data/sft.py (1)
  • prepare_datasets (48-69)
src/axolotl/utils/data/utils.py (1)
  • md5 (72-77)
tests/test_datasets.py (2)
src/axolotl/utils/data/rl.py (1)
  • prepare_preference_datasets (32-99)
src/axolotl/utils/data/sft.py (1)
  • _load_tokenized_prepared_datasets (256-304)
src/axolotl/datasets.py (4)
src/axolotl/prompt_tokenizers.py (2)
  • PromptTokenizingStrategy (43-105)
  • supports_batched (70-71)
src/axolotl/prompt_strategies/chat_template.py (1)
  • supports_batched (328-330)
src/axolotl/prompt_strategies/pretrain.py (1)
  • supports_batched (21-22)
src/axolotl/prompt_strategies/stepwise_supervised.py (1)
  • supports_batched (101-102)
src/axolotl/utils/data/shared.py (2)
src/axolotl/utils/data/utils.py (1)
  • md5 (72-77)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/data/sft.py (8)
src/axolotl/prompters.py (1)
  • Prompter (26-29)
src/axolotl/utils/data/pretraining.py (1)
  • wrap_pretraining_dataset (179-239)
src/axolotl/utils/data/shared.py (3)
  • datasets_with_name_generator (50-80)
  • generate_split_fingerprints (327-339)
  • load_dataset_with_config (83-138)
src/axolotl/utils/data/utils.py (4)
  • deduplicate_and_log_datasets (111-147)
  • drop_long_seq_in_dataset (150-202)
  • md5 (72-77)
  • retry_on_request_exceptions (31-69)
tests/conftest.py (1)
  • retry_on_request_exceptions (28-49)
src/axolotl/utils/data/wrappers.py (1)
  • get_dataset_wrapper (46-127)
src/axolotl/utils/distributed.py (1)
  • is_local_main_process (90-93)
src/axolotl/utils/trainer.py (1)
  • calculate_total_num_steps (393-510)
src/axolotl/utils/data/utils.py (3)
tests/test_exact_deduplication.py (1)
  • cfg (203-217)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/samplers/utils.py (1)
  • get_dataset_lengths (8-21)
🪛 Ruff (0.11.9)
src/axolotl/utils/data/shared.py

117-120: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)

⏰ Context from checks skipped due to timeout of 90000ms (6)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
🔇 Additional comments (41)
src/axolotl/common/const.py (1)

1-1: LGTM! Docstring formatting improvement.

The conversion from multi-line to single-line docstring format improves consistency across the codebase while preserving the original meaning.

tests/e2e/test_dpo.py (1)

1-1: LGTM! Docstring formatting improvement.

The single-line docstring format improves consistency while maintaining the original content and meaning.

tests/e2e/test_llama_pretrain.py (2)

1-1: LGTM! Module docstring formatting improvement.

The single-line format improves consistency across the codebase.


21-21: LGTM! Class docstring formatting improvement.

The single-line format aligns with the module-level docstring formatting and improves overall consistency.

src/axolotl/utils/data/pretraining.py (1)

253-253: LGTM! Improved function call clarity.

Converting from positional to keyword argument makes the code more explicit and readable, which aligns with the PR's goal of improving code clarity.

tests/prompt_strategies/test_dpo_chatml.py (2)

10-10: LGTM! Updated import aligns with dataset preparation refactoring.

The import correctly reflects the renamed function prepare_preference_datasets that was introduced as part of the dataset loading refactor.


58-58: LGTM! Function call updated to use refactored API.

The function call correctly uses the new prepare_preference_datasets function, maintaining the same parameter signature and return value expectations. This aligns with the broader dataset preparation refactoring mentioned in the PR objectives.

src/axolotl/train.py (1)

56-57: LGTM! Docstring formatting improvement.

The docstring formatting has been updated to a more consistent style while preserving the original content.

src/axolotl/prompt_tokenizers.py (2)

6-6: Good addition of required import.

The Dataset import is necessary for the type hint in the new abstract method wrap_dataset.


32-40: Excellent formalization of the dataset wrapping interface.

The addition of the abstract wrap_dataset method properly formalizes the interface that subclasses must implement. The method signature with optional process_count, keep_in_memory, and flexible **kwargs parameters aligns well with existing implementations and provides good flexibility for different dataset wrapping strategies.

src/axolotl/utils/schemas/config.py (1)

1200-1200: Excellent defensive programming improvement.

Using getattr(self, "micro_batch_size", 1) instead of direct attribute access prevents potential AttributeError when micro_batch_size is not set, while maintaining the existing validation logic with a sensible default value.

src/axolotl/utils/data/__init__.py (3)

1-1: Clear and specific module docstring.

The updated docstring properly identifies this as the initializer for the axolotl.utils.data module.


3-12: Well-organized import consolidation reflecting the refactoring.

The import updates properly reflect the function renaming and consolidation from the broader refactoring:

  • load_prepare_preference_datasetsprepare_preference_datasets
  • Multiple SFT imports consolidated to prepare_datasets
  • Addition of get_dataset_wrapper (moved from SFT module)
  • Addition of md5 utility function

The import organization is clean and logical.


14-21: Excellent addition of explicit public API definition.

The __all__ list clearly defines the module's public interface, making it easier for users to understand what functionality is available and ensuring consistent exports. All the listed functions align with the refactored dataset preparation workflow.

tests/test_exact_deduplication.py (5)

17-18: LGTM: Import updates align with refactored APIs

The import changes correctly reflect the function renames from prepare_dataset to prepare_datasets and load_prepare_preference_datasets to prepare_preference_datasets, maintaining consistency with the refactored data loading modules.


230-232: Correct patch target for renamed function

The patch target correction from "axolotl.utils.data.rl.load_dataset_w_config" to "axolotl.utils.data.rl.load_dataset_with_config" properly reflects the function rename in the shared module.

Also applies to: 256-258


242-242: Function call updates align with refactored RL module

The calls to prepare_preference_datasets correctly replace the previous load_prepare_preference_datasets function, maintaining the expected return tuple structure.

Also applies to: 270-270


324-324: Updated function calls match new SFT module API

The function calls correctly use the renamed prepare_datasets function, and the return value unpacking remains consistent with the expected 4-tuple format.

Also applies to: 351-351, 378-378


74-77:

✅ Verification successful

Verify the updated deduplication API usage

The API change from separate train_dataset and eval_dataset parameters to a single dataset parameter with optional dataset_name aligns with the simplified deduplication interface mentioned in the AI summary.

Let me verify that the new API signature matches the implementation:


🏁 Script executed:

#!/bin/bash
# Description: Verify the deduplicate_and_log_datasets function signature
# Expected: Function should accept dataset and optional dataset_name/other_dataset parameters

ast-grep --pattern 'def deduplicate_and_log_datasets($$$)'

Length of output: 2785


✅ deduplicate_and_log_datasets signature verified

The deduplicate_and_log_datasets function in
src/axolotl/utils/data/utils.py (lines 111–116) is defined as:

  • dataset: Dataset
  • other_dataset: Dataset | None = None
  • dataset_name: str | None = "train"
  • other_name: str | None = "eval"

Your test calls:

train_dataset, _ = deduplicate_and_log_datasets(dataset=self.dataset)
eval_dataset, _ = deduplicate_and_log_datasets(
    dataset=self.dataset, dataset_name="eval"
)

These correctly use the default dataset_name="train" for training and override it to "eval" for evaluation. No changes required.

src/axolotl/datasets.py (5)

22-29: Improved docstring formatting enhances readability

The restructured docstring with a proper Args section follows Python documentation conventions and improves code documentation quality.


35-36: Modernized type hints using Python 3.10+ union syntax

The change from Optional[int] and Optional[bool] to int | None and bool | None respectively uses the more concise union syntax introduced in Python 3.10. This improves code readability and follows modern Python typing conventions.


78-78: Type hint and explicit list conversion improvements

The updates include:

  1. Modern union syntax for Dataset | IterableDataset
  2. Explicit conversion list(dataset.features.keys()) for clarity

These changes improve type safety and code explicitness.

Also applies to: 85-85


96-103: Consistent docstring formatting and type hint modernization

The docstring formatting follows the same improved pattern as the previous class, and the type hint list[IterableDataset] uses modern Python syntax.

Also applies to: 113-113


177-181: Improved logging message formatting

The multi-line f-string formatting makes the warning message more readable while maintaining the same information content.

src/axolotl/common/datasets.py (5)

12-12: Import updates align with refactored modules

The imports correctly reference the renamed functions prepare_datasets and prepare_preference_datasets from their respective modules.

Also applies to: 17-17


40-40: Improved function signature and safer attribute access

The changes include:

  1. Making cli_args mandatory (removing Optional)
  2. Using getattr(cli_args, "iterable", False) for safer attribute access

This improves API clarity and prevents potential AttributeError exceptions.

Also applies to: 55-55


57-57: Updated function calls and debug condition handling

The function call update to prepare_datasets is correct, and the debug condition logic has been restructured for better readability. The conditions appear to check the same attributes but with improved formatting.

Also applies to: 64-69


124-129: Explicit keyword arguments improve code clarity

Using explicit keyword arguments in the check_dataset_labels call makes the function invocation more readable and less prone to errors from positional argument changes.


110-110:

✅ Verification successful

Review the inverted total_num_steps calculation logic

The logic has been inverted - previously it calculated total_num_steps by default and set it to None for GRPO, now it initializes as None and only calculates for non-GRPO types. This appears to match the AI summary description.

Let me verify this change aligns with the RL module logic:

Also applies to: 112-116


🏁 Script executed:

#!/bin/bash
# Description: Check how total_num_steps is handled in RL training
# Expected: Should find logic that distinguishes GRPO from other RL types

rg -A 5 -B 5 "total_num_steps.*GRPO|GRPO.*total_num_steps"

Length of output: 60


I’ll gather more context around how total_num_steps and GRPO are used across the codebase to be sure.

#!/bin/bash
set -e

echo "===== Occurrences of GRPO ====="
rg -n "GRPO" || echo "(none found)"

echo
echo "===== Occurrences of total_num_steps ====="
rg -n "total_num_steps" || echo "(none found)"

echo
echo "===== Definition of prepare_preference_datasets ====="
rg -n "def prepare_preference_datasets" -A50 src/axolotl/common/datasets.py || echo "File or function not found"

I’ll search the repo for all uses of total_num_steps and GRPO, and then inspect the prepare_preference_datasets definition to see how the flag is applied.


🏁 Script executed:

#!/bin/bash
set -e

echo "===== References to total_num_steps ====="
rg -n "total_num_steps" || echo "(none found)"

echo
echo "===== References to GRPO ====="
rg -n "GRPO" || echo "(none found)"

echo
echo "===== prepare_preference_datasets definition ====="
rg -n -C5 "def prepare_preference_datasets" src/axolotl/common/datasets.py || echo "(not found)"

Length of output: 10569


I need to see the exact logic in src/axolotl/common/datasets.py around the prepare_preference_datasets definition. Let’s dump lines 80–140 to inspect the total_num_steps handling.

#!/bin/bash
set -e
echo "===== Lines 80–140 of datasets.py ====="
sed -n '80,140p' src/axolotl/common/datasets.py

I need to inspect the prepare_preference_datasets implementation. Dumping lines 1–200 of the file:


🏁 Script executed:

#!/bin/bash
set -e
echo "===== src/axolotl/common/datasets.py (lines 1–200) ====="
sed -n '1,200p' src/axolotl/common/datasets.py

Length of output: 4510


total_num_steps logic for GRPO verified

The code correctly initializes total_num_steps to None and only computes it when cfg.rl is not RLType.GRPO, preserving the previous behavior for GRPO runs.

• File: src/axolotl/common/datasets.py (lines 108–116)

tests/test_datasets.py (5)

14-15: Import updates correctly reference refactored functions

The imports now use prepare_preference_datasets and the internal function _load_tokenized_prepared_datasets, aligning with the modular refactoring described in the AI summary.


30-32: Updated generator type hint for fixture

The type hint Generator[PreTrainedTokenizer, Any, Any] properly reflects the fixture's generator nature, improving type safety.


67-70: Consistent patching pattern ensures test isolation

All test methods now consistently wrap calls to _load_tokenized_prepared_datasets with patches to override DEFAULT_DATASET_PREPARED_PATH. This ensures proper test isolation by using temporary directories instead of default paths.

Also applies to: 114-117, 146-149, 184-187, 222-225, 254-257, 286-289, 340-343, 424-427, 463-466
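The pattern amounts to roughly the following (a sketch; the patch target module path and the call under test are assumptions, not copied from the test file):

import tempfile
from unittest.mock import patch


def test_loads_into_temp_dir(tokenizer, cfg):
    # Redirect the prepared-dataset cache into a throwaway directory so the
    # test never reads from or writes to the real default path.
    with tempfile.TemporaryDirectory() as tmp_dir:
        with patch(
            "axolotl.utils.data.sft.DEFAULT_DATASET_PREPARED_PATH", tmp_dir
        ):
            ...  # invoke _load_tokenized_prepared_datasets(...) here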


311-311: Function calls updated to use refactored RL API

The calls correctly use prepare_preference_datasets instead of the previous function name, maintaining the expected return tuple structure.

Also applies to: 375-375


368-368: Patch target corrections for renamed functions

The patch targets correctly reference load_dataset_with_config instead of the old load_dataset_w_config function name.

Also applies to: 417-417

src/axolotl/utils/data/sft.py (2)

47-69: Clean refactoring of the main entry point!

The function has been well-refactored with:

  • Clear type annotations
  • Comprehensive docstring
  • Clean separation between pretraining and standard dataset preparation

137-666: Excellent modularization of dataset preparation logic!

The refactoring successfully breaks down the monolithic dataset preparation into well-defined, single-responsibility functions. The improvements include:

  • Clear separation of pretraining vs standard dataset flows
  • Comprehensive caching with Hub integration
  • Proper handling of distributed environments
  • Type hints throughout
src/axolotl/utils/data/shared.py (1)

33-340: Well-structured refactoring of dataset loading utilities!

The improvements include:

  • Clearer function names following Python conventions
  • Modular loading functions for different data sources
  • Comprehensive error handling with helpful messages
  • New generate_split_fingerprints utility for consistent dataset splitting
🧰 Tools
🪛 Ruff (0.11.9)

117-120: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)
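Regarding the new generate_split_fingerprints utility mentioned above, a hedged sketch of how such a helper can work, so that every rank derives identical split fingerprints from the same inputs (the hashing scheme here is illustrative, not necessarily the repository's):

import hashlib


def generate_split_fingerprints(seed: int, val_set_size: float) -> tuple[str, str]:
    # Deterministic fingerprints: identical (seed, val_set_size) inputs yield
    # identical fingerprints, so distributed workers agree on the split.
    base = f"{seed}|{val_set_size}"
    train_fp = hashlib.md5(f"{base}|train".encode()).hexdigest()
    test_fp = hashlib.md5(f"{base}|test".encode()).hexdigest()
    return train_fp, test_fp


# Possible usage with datasets.Dataset.train_test_split, which accepts
# explicit new-fingerprint arguments:
# train_fp, test_fp = generate_split_fingerprints(cfg.seed, cfg.val_set_size)
# ds.train_test_split(test_size=cfg.val_set_size, seed=cfg.seed,
#                     train_new_fingerprint=train_fp,
#                     test_new_fingerprint=test_fp)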

src/axolotl/utils/data/rl.py (2)

203-258: Excellent implementation of sequence length filtering for RL!

The function properly handles multiple RL types with:

  • Clear validation of required keys
  • Accurate length calculations for each type
  • Helpful error messages

165-384: Well-designed RL dataset preparation module!

The refactoring successfully:

  • Modularizes RL dataset loading and processing
  • Supports multiple RL types (DPO, ORPO, KTO, etc.)
  • Implements proper caching and distributed coordination
  • Includes comprehensive filtering and validation
src/axolotl/utils/data/utils.py (2)

173-179: LGTM! Clean error handling with contextlib.suppress

The use of contextlib.suppress(AttributeError) is cleaner than try-except blocks. Note that this will silently ignore any AttributeError, not just from get_dataset_lengths.


111-117:

✅ Verification successful

Verify callers of deduplicate_and_log_datasets for breaking changes

The function signature has been simplified from supporting three datasets to only two. This is a breaking change that could affect existing code.

Run the following script to verify all callers have been updated:


🏁 Script executed:

#!/bin/bash
# Description: Check for calls to deduplicate_and_log_datasets to ensure they match the new signature

# Search for function calls with old three-dataset pattern
echo "Checking for potential outdated calls with three datasets:"
rg -A 3 "deduplicate_and_log_datasets\s*\([^,]+,[^,]+,[^,]+" --type py

# Search for all current calls to the function
echo -e "\nAll current calls to deduplicate_and_log_datasets:"
rg -A 3 "deduplicate_and_log_datasets\s*\(" --type py

Length of output: 3931


No outdated calls to deduplicate_and_log_datasets detected

A workspace-wide search confirms that there are no remaining calls passing three dataset arguments. All usages now match the new signature (only dataset and optional other_dataset, with named parameters for dataset_name/other_name). No breaking changes remain—no further action is needed.


# The rank that acquires the lock first does the data preprocessing
with FileLock(str(lock_file_path)):
    ready_flag_path = Path(dataset_prepared_path) / "datasets_ready.flag"
Collaborator


will this get removed when we reprocess a dataset? This can be an issue especially when modifying dataset transforms and reprocessing to test.

Member Author


Hm good question. I need to add some tests around this.

In hindsight, this bit should have been a separate PR

@axolotl-ai-cloud axolotl-ai-cloud deleted a comment from coderabbitai bot May 29, 2025
Collaborator

@NanoCode012 NanoCode012 left a comment


I had some reviews from a few days ago. Feel free to ignore them if they're outdated.

Collaborator

@NanoCode012 NanoCode012 left a comment


Thank you for this refactor! We accumulated a lot of tech debt as we kept adding more dataset formats early on.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🔭 Outside diff range comments (1)
src/axolotl/utils/data/rl.py (1)

160-162: ⚠️ Potential issue

Fix incorrect log message in save function.

The log message incorrectly states "Loading prepared dataset from disk" when it should indicate saving/preparing the dataset.

Apply this diff to fix the misleading log message:

-        LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
+        LOG.info(f"Saving prepared dataset to disk at {prepared_ds_path}...")
♻️ Duplicate comments (3)
src/axolotl/utils/data/rl.py (3)

46-52: Verify cleanup of lock and flag files during reprocessing.

This mirrors the previous concern about whether lock/flag files get cleaned up during dataset reprocessing. Consider implementing cleanup logic when cfg.is_preprocess is True or when dataset transforms are modified.

#!/bin/bash
# Check if there's any cleanup logic for lock/flag files when reprocessing
rg -A 10 -B 5 "is_preprocess.*True|datasets_ready\.flag.*unlink|datasets_prep\.lock.*unlink"

221-221: Clarify GRPO handling in sequence length filtering.

As noted in previous reviews, GRPO returns True without any actual filtering. Consider adding a comment explaining why GRPO doesn't need sequence length filtering, or implement appropriate filtering logic if needed.

Add a clarifying comment:

    if rl is RLType.GRPO:
+        # GRPO doesn't use preference datasets, so no sequence length filtering needed
        return True

32-37: 🛠️ Refactor suggestion

Consider extracting caching/deduplication logic to shared module.

As mentioned in previous reviews, since SFT also needs caching and deduplication logic, consider extracting the file locking and coordination mechanism to shared.py for reuse across different training types.

🧹 Nitpick comments (2)
src/axolotl/utils/data/rl.py (2)

4-4: Consider the necessity of the time import.

The time import is only used for time.sleep(1) on line 82. Consider if this polling approach is the most efficient method for coordination, or if there are better alternatives.


81-82: Consider implementing exponential backoff for polling.

The current implementation uses a simple 1-second sleep in a polling loop. For better resource utilization and responsiveness, consider implementing exponential backoff.

Apply this pattern to improve polling efficiency:

-            while not ready_flag_path.exists():
-                time.sleep(1)
+            wait_time = 0.1
+            while not ready_flag_path.exists():
+                time.sleep(wait_time)
+                wait_time = min(wait_time * 1.5, 5.0)  # Cap at 5 seconds
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce70f27 and d7c31fa.

📒 Files selected for processing (3)
  • src/axolotl/train.py (1 hunks)
  • src/axolotl/utils/data/rl.py (5 hunks)
  • src/axolotl/utils/schemas/config.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • src/axolotl/train.py
  • src/axolotl/utils/schemas/config.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
src/axolotl/utils/data/rl.py (6)
src/axolotl/loaders/tokenizer.py (1)
  • load_tokenizer (120-281)
src/axolotl/prompt_strategies/orpo/chat_template.py (1)
  • load (26-43)
src/axolotl/utils/data/shared.py (3)
  • datasets_with_name_generator (50-80)
  • generate_split_fingerprints (327-339)
  • load_dataset_with_config (83-138)
src/axolotl/utils/data/utils.py (2)
  • deduplicate_and_log_datasets (111-147)
  • md5 (72-77)
src/axolotl/utils/distributed.py (1)
  • is_main_process (72-87)
src/axolotl/utils/schemas/enums.py (1)
  • RLType (22-30)
⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.5.1)
  • GitHub Check: pre-commit
  • GitHub Check: pre-commit
🔇 Additional comments (3)
src/axolotl/utils/data/rl.py (3)

32-99: Excellent refactoring with comprehensive distributed coordination.

The main entry function is well-structured with proper file locking, ready flag coordination, and clear separation of concerns. The docstring is comprehensive and the logic handles both distributed and non-distributed scenarios effectively.


184-188: Good tokenizer injection pattern.

The dynamic tokenizer injection based on function signature inspection is a clean solution that maintains flexibility while avoiding unnecessary tokenizer loading.
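The pattern being praised can be sketched as follows (an illustrative helper; the actual code in rl.py differs in details such as when the tokenizer is loaded):

import inspect
from typing import Callable


def maybe_inject_tokenizer(transform_fn: Callable, tokenizer) -> Callable:
    # Pass the tokenizer only to transforms whose signature declares it,
    # so transforms that never need one stay tokenizer-free.
    if "tokenizer" in inspect.signature(transform_fn).parameters:
        return lambda example: transform_fn(example, tokenizer=tokenizer)
    return transform_fn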


354-356: Excellent use of walrus operator for concise preprocessing check.

The walrus operator usage in the conditional assignment makes the code more concise and readable while maintaining clarity about the preprocessing status.
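For readers unfamiliar with it, the walrus operator binds a value and tests it in a single expression (a generic Python 3.8+ example, not the repository's exact line):

def compute():  # stand-in for an expensive preprocessing check
    return 42


# Bind result and branch on it in one expression.
if (result := compute()) is not None:
    print(f"got {result}")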

@djsaunde djsaunde marked this pull request as draft May 30, 2025 18:19
@djsaunde
Copy link
Member Author

Converting back to draft since there's a few bigger pieces I want to address.

@djsaunde djsaunde force-pushed the data-load-refactor branch from afc8c96 to 669579a Compare June 10, 2025 21:55

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (3)
src/axolotl/utils/data/rl.py (2)

109-111: 🛠️ Refactor suggestion

Blindly indexing dataset["train"] can raise KeyError

When a DatasetDict lacks a "train" split (e.g. only "validation"),
this line crashes. Fail fast with a clear error or fall back to the
first split.

-    if isinstance(dataset, DatasetDict):
-        dataset = dataset["train"]
+    if isinstance(dataset, DatasetDict):
+        if "train" not in dataset:
+            raise ValueError(
+                "Expected a 'train' split in the loaded DatasetDict "
+                f"but found {list(dataset.keys())}"
+            )
+        dataset = dataset["train"]

219-223: ⚠️ Potential issue

KTO datasets without an explicit type are left un-transformed

The else branch assumes the raw dataset already has
prompt/completion pre-tokenised.
For cfg.rl is RLType.KTO this is almost never true and will surface as
schema errors later.

-        else:
-            # If no `type` is provided, assume the dataset is already in the expected format with
-            # "prompt", "chosen", and "rejected" already preprocessed
-            split_datasets[i] = data_set
+        else:
+            if cfg.rl is RLType.KTO:  # fallback to default KTO transform
+                ds_transform_fn = load_kto("user_defined.default", cfg, dataset_idx=i)
+                split_datasets[i] = _map_dataset(cfg, data_set, ds_transform_fn, tokenizer)
+            else:
+                # Assume the dataset is already in final form
+                split_datasets[i] = data_set
src/axolotl/utils/data/sft.py (1)

457-464: 🛠️ Refactor suggestion

Validate dataset_shard_idx is within range before sharding

Supplying an out-of-bounds index crashes with a cryptic IndexError.
Emit a clear validation error instead.

-    if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+    if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+        if not (0 <= cfg.dataset_shard_idx < cfg.dataset_shard_num):
+            raise ValueError(
+                f"dataset_shard_idx ({cfg.dataset_shard_idx}) must be in "
+                f"[0, {cfg.dataset_shard_num - 1}]"
+            )
🧹 Nitpick comments (1)
src/axolotl/utils/data/shared.py (1)

126-129: Prefer contextlib.suppress for silent existence probe

Using try/except: pass to swallow FileNotFoundError, ConnectionError is verbose; contextlib.suppress conveys the intent
clearly.

-    try:
-        is_cloud_dataset = remote_fs.exists(dataset_config.path)
-    except (FileNotFoundError, ConnectionError):
-        pass
+    from contextlib import suppress
+    with suppress(FileNotFoundError, ConnectionError):
+        is_cloud_dataset = remote_fs.exists(dataset_config.path)
🧰 Tools
🪛 Ruff (0.11.9)

126-129: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc802af and 669579a.

📒 Files selected for processing (62)
  • src/axolotl/common/const.py (1 hunks)
  • src/axolotl/common/datasets.py (5 hunks)
  • src/axolotl/core/builders/causal.py (1 hunks)
  • src/axolotl/datasets.py (5 hunks)
  • src/axolotl/loaders/tokenizer.py (3 hunks)
  • src/axolotl/prompt_strategies/messages/__init__.py (0 hunks)
  • src/axolotl/prompt_tokenizers.py (2 hunks)
  • src/axolotl/train.py (1 hunks)
  • src/axolotl/utils/data/__init__.py (1 hunks)
  • src/axolotl/utils/data/lock.py (1 hunks)
  • src/axolotl/utils/data/pretraining.py (1 hunks)
  • src/axolotl/utils/data/rl.py (2 hunks)
  • src/axolotl/utils/data/sft.py (3 hunks)
  • src/axolotl/utils/data/shared.py (2 hunks)
  • src/axolotl/utils/data/utils.py (5 hunks)
  • src/axolotl/utils/data/wrappers.py (1 hunks)
  • src/axolotl/utils/schemas/config.py (2 hunks)
  • tests/core/test_builders.py (2 hunks)
  • tests/e2e/integrations/test_cut_cross_entropy.py (3 hunks)
  • tests/e2e/integrations/test_hooks.py (1 hunks)
  • tests/e2e/integrations/test_kd.py (2 hunks)
  • tests/e2e/integrations/test_liger.py (2 hunks)
  • tests/e2e/integrations/test_llm_compressor.py (1 hunks)
  • tests/e2e/multigpu/solo/test_grpo.py (1 hunks)
  • tests/e2e/multigpu/test_locking.py (1 hunks)
  • tests/e2e/patched/test_4d_multipack_llama.py (2 hunks)
  • tests/e2e/patched/test_activation_checkpointing.py (1 hunks)
  • tests/e2e/patched/test_fa_xentropy.py (1 hunks)
  • tests/e2e/patched/test_falcon_samplepack.py (2 hunks)
  • tests/e2e/patched/test_fused_llama.py (1 hunks)
  • tests/e2e/patched/test_llama_s2_attention.py (2 hunks)
  • tests/e2e/patched/test_lora_llama_multipack.py (2 hunks)
  • tests/e2e/patched/test_mistral_samplepack.py (2 hunks)
  • tests/e2e/patched/test_mixtral_samplepack.py (2 hunks)
  • tests/e2e/patched/test_phi_multipack.py (2 hunks)
  • tests/e2e/patched/test_resume.py (1 hunks)
  • tests/e2e/patched/test_unsloth_qlora.py (3 hunks)
  • tests/e2e/solo/test_flex.py (1 hunks)
  • tests/e2e/solo/test_relora_llama.py (1 hunks)
  • tests/e2e/test_deepseekv3.py (2 hunks)
  • tests/e2e/test_dpo.py (1 hunks)
  • tests/e2e/test_embeddings_lr.py (2 hunks)
  • tests/e2e/test_falcon.py (3 hunks)
  • tests/e2e/test_gemma2.py (2 hunks)
  • tests/e2e/test_gemma3_text.py (2 hunks)
  • tests/e2e/test_llama.py (4 hunks)
  • tests/e2e/test_llama_pretrain.py (3 hunks)
  • tests/e2e/test_llama_vision.py (2 hunks)
  • tests/e2e/test_lora_llama.py (1 hunks)
  • tests/e2e/test_mamba.py (1 hunks)
  • tests/e2e/test_mistral.py (2 hunks)
  • tests/e2e/test_mixtral.py (5 hunks)
  • tests/e2e/test_optimizers.py (5 hunks)
  • tests/e2e/test_packing_loss.py (1 hunks)
  • tests/e2e/test_phi.py (2 hunks)
  • tests/e2e/test_process_reward_model_smollm2.py (1 hunks)
  • tests/e2e/test_qat.py (1 hunks)
  • tests/e2e/test_reward_model_smollm2.py (1 hunks)
  • tests/e2e/test_schedulers.py (1 hunks)
  • tests/prompt_strategies/test_dpo_chatml.py (2 hunks)
  • tests/test_datasets.py (15 hunks)
  • tests/test_exact_deduplication.py (14 hunks)
💤 Files with no reviewable changes (1)
  • src/axolotl/prompt_strategies/messages/__init__.py
✅ Files skipped from review due to trivial changes (11)
  • tests/e2e/test_dpo.py
  • tests/e2e/test_qat.py
  • tests/e2e/test_reward_model_smollm2.py
  • tests/e2e/patched/test_unsloth_qlora.py
  • tests/e2e/test_schedulers.py
  • src/axolotl/utils/data/pretraining.py
  • tests/e2e/integrations/test_cut_cross_entropy.py
  • tests/e2e/test_mixtral.py
  • tests/e2e/test_optimizers.py
  • tests/e2e/test_process_reward_model_smollm2.py
  • tests/e2e/test_llama.py
🚧 Files skipped from review as they are similar to previous changes (46)
  • src/axolotl/common/const.py
  • tests/e2e/test_falcon.py
  • tests/e2e/integrations/test_llm_compressor.py
  • tests/e2e/test_lora_llama.py
  • tests/e2e/solo/test_flex.py
  • tests/e2e/test_phi.py
  • tests/e2e/test_llama_pretrain.py
  • tests/e2e/solo/test_relora_llama.py
  • tests/e2e/test_mamba.py
  • tests/e2e/test_embeddings_lr.py
  • tests/e2e/patched/test_activation_checkpointing.py
  • tests/e2e/test_mistral.py
  • tests/e2e/patched/test_fa_xentropy.py
  • src/axolotl/core/builders/causal.py
  • tests/e2e/patched/test_lora_llama_multipack.py
  • tests/e2e/integrations/test_hooks.py
  • tests/e2e/patched/test_mixtral_samplepack.py
  • tests/e2e/test_gemma3_text.py
  • tests/e2e/patched/test_phi_multipack.py
  • tests/e2e/test_packing_loss.py
  • tests/e2e/test_deepseekv3.py
  • tests/e2e/patched/test_mistral_samplepack.py
  • tests/e2e/patched/test_falcon_samplepack.py
  • tests/e2e/patched/test_resume.py
  • tests/e2e/multigpu/solo/test_grpo.py
  • tests/e2e/patched/test_4d_multipack_llama.py
  • tests/e2e/patched/test_fused_llama.py
  • tests/prompt_strategies/test_dpo_chatml.py
  • tests/e2e/integrations/test_liger.py
  • src/axolotl/loaders/tokenizer.py
  • tests/e2e/test_llama_vision.py
  • tests/e2e/test_gemma2.py
  • tests/e2e/patched/test_llama_s2_attention.py
  • src/axolotl/train.py
  • tests/e2e/integrations/test_kd.py
  • src/axolotl/prompt_tokenizers.py
  • src/axolotl/utils/data/lock.py
  • tests/core/test_builders.py
  • tests/test_exact_deduplication.py
  • src/axolotl/common/datasets.py
  • src/axolotl/datasets.py
  • src/axolotl/utils/data/__init__.py
  • tests/test_datasets.py
  • src/axolotl/utils/data/utils.py
  • src/axolotl/utils/schemas/config.py
  • src/axolotl/utils/data/wrappers.py
🧰 Additional context used
🧠 Learnings (1)
src/axolotl/utils/data/sft.py (1)
Learnt from: winglian
PR: axolotl-ai-cloud/axolotl#2707
File: src/axolotl/utils/data/sft.py:247-254
Timestamp: 2025-05-29T22:23:39.312Z
Learning: In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.
🧬 Code Graph Analysis (1)
src/axolotl/utils/data/shared.py (4)
src/axolotl/utils/data/utils.py (2)
  • deduplicate_and_log_datasets (111-147)
  • md5 (72-77)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/logging.py (1)
  • get_logger (53-62)
tests/e2e/multigpu/test_locking.py (1)
  • cfg (25-27)
🪛 Ruff (0.11.9)
src/axolotl/utils/data/shared.py

126-129: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)

tests/e2e/multigpu/test_locking.py

107-110: Return the negated condition directly

(SIM103)

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.5.1)
  • GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
  • GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

♻️ Duplicate comments (3)
src/axolotl/utils/data/rl.py (2)

109-111: Blind "train" access may raise KeyError

DatasetDict objects are unconditionally indexed with "train". If a dataset
only exposes "validation" / "test" (quite common for RL preference sets)
this will crash.

Guard the key or fall back gracefully – this was already pointed out in an
earlier review.


220-223: KTO datasets with missing type still skip mandatory preprocessing

The else branch keeps the raw dataset untouched even when cfg.rl is RLType.KTO, so un-tokenised "prompt" / "completion" pairs slip through.
The earlier review suggested routing this branch through load_kto; the code
is still unchanged.

src/axolotl/utils/data/sft.py (1)

457-465: Shard-index bounds still unchecked

The earlier review noted that dataset_shard_idx should be validated against
dataset_shard_num before calling .shard(). The guard is still missing.

🧹 Nitpick comments (6)
tests/e2e/multigpu/test_locking.py (2)

90-131: Avoid patching Path with a bare Mock – keep the real interface

Replacing ready_flag_path with a plain Mock means any method other than exists() (touch, write_text, etc.) would raise AttributeError if the implementation of FileLockLoader changes.
Create a real temporary file instead and patch only Path.exists:

-        mock_ready_flag_path = Mock()
-        ...
-        loader.ready_flag_path = mock_ready_flag_path
+        real_flag = loader.ready_flag_path
+
+        # Override Path.exists only
+        original_exists = Path.exists
+        Path.exists = lambda self: mock_exists() if self == real_flag else original_exists(self)

This keeps the object’s full pathlib.Path API intact and isolates the behaviour you’re testing.

🧰 Tools
🪛 Ruff (0.11.9)

107-110: Return the negated condition directly

(SIM103)


100-110: Simplify the conditional chain flagged by Ruff (SIM103)

The three branches can be collapsed into a single expression, improving readability and satisfying the linter:

-            if exists_call_count == 1:
-                return True
-            if exists_call_count <= 3:
-                return False
-            return True
+            return not (1 < exists_call_count <= 3)
🧰 Tools
🪛 Ruff (0.11.9)

107-110: Return the negated condition directly

(SIM103)

src/axolotl/utils/data/rl.py (1)

258-269: Docstring does not match return type

_load_or_create_dataset_split claims to return
“Tuple of (dataset, is_preprocessed)” but actually returns just the Dataset.

Either return the tuple or update the docstring to avoid confusion.

src/axolotl/utils/data/utils.py (1)

101-108: Deduplication loads the entire dataset into Python memory

Iterating row-by-row (for idx, row in enumerate(dataset)) materialises every
sample before selecting uniques. On multi-million-row Arrow datasets this
blows memory and negates parquet/arrow efficiency.

Consider a vectorised Arrow approach:

-unique_indices = []
-for idx, row in enumerate(dataset):
-    row_hash = sha256(str(row))
-    if row_hash not in seen_hashes:
-        seen_hashes.add(row_hash)
-        unique_indices.append(idx)
-
-return dataset.select(unique_indices), seen_hashes
+# hash every row in a single pass using pyarrow compute
+import pyarrow.compute as pc
+hashes = pc.hash_sha256(dataset.data)          # vectorised
+mask   = ~pc.is_in(hashes, value_set=list(seen_hashes))
+seen_hashes.update(set(hashes.filter(mask).to_pylist()))
+return dataset.filter(mask.to_pylist()), seen_hashes

This keeps the pipeline in C++ and is orders of magnitude faster/more memory
friendly.

src/axolotl/utils/data/shared.py (2)

107-114: Avoid passing None (or a list) to datasets.load_dataset kwargs

Passing None values straight through to load_dataset() is fragile, and a non-string name (e.g. a list) can raise a ValueError deep inside datasets. Building the kwargs dict first and then filtering out None values keeps the call-site clean and avoids these subtle failures.

-    load_dataset_kwargs = {
-        "split": dataset_config.split if dataset_config.split else None,
-        "name": dataset_config.name,
-        "streaming": streaming,
-        "trust_remote_code": dataset_config.trust_remote_code,
-    }
+    # Filter out parameters that are not set
+    load_dataset_kwargs = {
+        k: v
+        for k, v in {
+            "split": dataset_config.split,
+            "name": dataset_config.name,
+            "streaming": streaming,
+            "trust_remote_code": dataset_config.trust_remote_code,
+        }.items()
+        if v is not None
+    }

125-130: Replace empty except …: pass with contextlib.suppress

Suppressing exceptions explicitly is clearer, keeps the variable scope clean, and satisfies ruff’s SIM105 suggestion.

-        try:
-            is_cloud_dataset = remote_fs.exists(dataset_config.path)
-        except (FileNotFoundError, ConnectionError):
-            pass
+        # with `import contextlib` added at the top of the module
+        with contextlib.suppress(FileNotFoundError, ConnectionError):
+            is_cloud_dataset = remote_fs.exists(dataset_config.path)
🧰 Tools
🪛 Ruff (0.11.9)

126-129: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between cc802af and 669579a.

📒 Files selected for processing (62)
  • src/axolotl/common/const.py (1 hunks)
  • src/axolotl/common/datasets.py (5 hunks)
  • src/axolotl/core/builders/causal.py (1 hunks)
  • src/axolotl/datasets.py (5 hunks)
  • src/axolotl/loaders/tokenizer.py (3 hunks)
  • src/axolotl/prompt_strategies/messages/__init__.py (0 hunks)
  • src/axolotl/prompt_tokenizers.py (2 hunks)
  • src/axolotl/train.py (1 hunks)
  • src/axolotl/utils/data/__init__.py (1 hunks)
  • src/axolotl/utils/data/lock.py (1 hunks)
  • src/axolotl/utils/data/pretraining.py (1 hunks)
  • src/axolotl/utils/data/rl.py (2 hunks)
  • src/axolotl/utils/data/sft.py (3 hunks)
  • src/axolotl/utils/data/shared.py (2 hunks)
  • src/axolotl/utils/data/utils.py (5 hunks)
  • src/axolotl/utils/data/wrappers.py (1 hunks)
  • src/axolotl/utils/schemas/config.py (2 hunks)
  • tests/core/test_builders.py (2 hunks)
  • tests/e2e/integrations/test_cut_cross_entropy.py (3 hunks)
  • tests/e2e/integrations/test_hooks.py (1 hunks)
  • tests/e2e/integrations/test_kd.py (2 hunks)
  • tests/e2e/integrations/test_liger.py (2 hunks)
  • tests/e2e/integrations/test_llm_compressor.py (1 hunks)
  • tests/e2e/multigpu/solo/test_grpo.py (1 hunks)
  • tests/e2e/multigpu/test_locking.py (1 hunks)
  • tests/e2e/patched/test_4d_multipack_llama.py (2 hunks)
  • tests/e2e/patched/test_activation_checkpointing.py (1 hunks)
  • tests/e2e/patched/test_fa_xentropy.py (1 hunks)
  • tests/e2e/patched/test_falcon_samplepack.py (2 hunks)
  • tests/e2e/patched/test_fused_llama.py (1 hunks)
  • tests/e2e/patched/test_llama_s2_attention.py (2 hunks)
  • tests/e2e/patched/test_lora_llama_multipack.py (2 hunks)
  • tests/e2e/patched/test_mistral_samplepack.py (2 hunks)
  • tests/e2e/patched/test_mixtral_samplepack.py (2 hunks)
  • tests/e2e/patched/test_phi_multipack.py (2 hunks)
  • tests/e2e/patched/test_resume.py (1 hunks)
  • tests/e2e/patched/test_unsloth_qlora.py (3 hunks)
  • tests/e2e/solo/test_flex.py (1 hunks)
  • tests/e2e/solo/test_relora_llama.py (1 hunks)
  • tests/e2e/test_deepseekv3.py (2 hunks)
  • tests/e2e/test_dpo.py (1 hunks)
  • tests/e2e/test_embeddings_lr.py (2 hunks)
  • tests/e2e/test_falcon.py (3 hunks)
  • tests/e2e/test_gemma2.py (2 hunks)
  • tests/e2e/test_gemma3_text.py (2 hunks)
  • tests/e2e/test_llama.py (4 hunks)
  • tests/e2e/test_llama_pretrain.py (3 hunks)
  • tests/e2e/test_llama_vision.py (2 hunks)
  • tests/e2e/test_lora_llama.py (1 hunks)
  • tests/e2e/test_mamba.py (1 hunks)
  • tests/e2e/test_mistral.py (2 hunks)
  • tests/e2e/test_mixtral.py (5 hunks)
  • tests/e2e/test_optimizers.py (5 hunks)
  • tests/e2e/test_packing_loss.py (1 hunks)
  • tests/e2e/test_phi.py (2 hunks)
  • tests/e2e/test_process_reward_model_smollm2.py (1 hunks)
  • tests/e2e/test_qat.py (1 hunks)
  • tests/e2e/test_reward_model_smollm2.py (1 hunks)
  • tests/e2e/test_schedulers.py (1 hunks)
  • tests/prompt_strategies/test_dpo_chatml.py (2 hunks)
  • tests/test_datasets.py (15 hunks)
  • tests/test_exact_deduplication.py (14 hunks)
💤 Files with no reviewable changes (1)
  • src/axolotl/prompt_strategies/messages/__init__.py
✅ Files skipped from review due to trivial changes (14)
  • tests/e2e/test_dpo.py
  • tests/e2e/test_reward_model_smollm2.py
  • tests/e2e/test_packing_loss.py
  • tests/e2e/patched/test_fa_xentropy.py
  • tests/e2e/patched/test_unsloth_qlora.py
  • tests/e2e/test_schedulers.py
  • tests/e2e/test_falcon.py
  • tests/e2e/test_optimizers.py
  • tests/e2e/test_mixtral.py
  • tests/e2e/solo/test_relora_llama.py
  • tests/e2e/patched/test_phi_multipack.py
  • src/axolotl/utils/data/lock.py
  • tests/e2e/test_process_reward_model_smollm2.py
  • src/axolotl/utils/data/wrappers.py
🚧 Files skipped from review as they are similar to previous changes (42)
  • src/axolotl/common/const.py
  • tests/e2e/test_llama_pretrain.py
  • tests/e2e/patched/test_activation_checkpointing.py
  • tests/e2e/integrations/test_llm_compressor.py
  • tests/e2e/test_mamba.py
  • tests/e2e/multigpu/solo/test_grpo.py
  • tests/e2e/test_phi.py
  • src/axolotl/utils/data/pretraining.py
  • tests/e2e/patched/test_falcon_samplepack.py
  • tests/e2e/test_embeddings_lr.py
  • tests/e2e/patched/test_lora_llama_multipack.py
  • tests/e2e/patched/test_fused_llama.py
  • tests/e2e/test_gemma2.py
  • tests/prompt_strategies/test_dpo_chatml.py
  • tests/e2e/test_llama_vision.py
  • tests/e2e/patched/test_resume.py
  • tests/e2e/test_deepseekv3.py
  • tests/e2e/patched/test_mixtral_samplepack.py
  • tests/e2e/solo/test_flex.py
  • tests/e2e/integrations/test_cut_cross_entropy.py
  • tests/e2e/test_mistral.py
  • src/axolotl/train.py
  • tests/e2e/integrations/test_liger.py
  • tests/e2e/test_qat.py
  • tests/e2e/test_llama.py
  • tests/e2e/patched/test_mistral_samplepack.py
  • tests/e2e/integrations/test_hooks.py
  • tests/e2e/test_lora_llama.py
  • tests/e2e/patched/test_4d_multipack_llama.py
  • tests/e2e/integrations/test_kd.py
  • tests/e2e/patched/test_llama_s2_attention.py
  • src/axolotl/loaders/tokenizer.py
  • tests/core/test_builders.py
  • tests/e2e/test_gemma3_text.py
  • tests/test_exact_deduplication.py
  • src/axolotl/prompt_tokenizers.py
  • src/axolotl/utils/schemas/config.py
  • src/axolotl/core/builders/causal.py
  • tests/test_datasets.py
  • src/axolotl/utils/data/__init__.py
  • src/axolotl/datasets.py
  • src/axolotl/common/datasets.py
🧰 Additional context used
🧠 Learnings (1)
src/axolotl/utils/data/sft.py (1)
Learnt from: winglian
PR: axolotl-ai-cloud/axolotl#2707
File: src/axolotl/utils/data/sft.py:247-254
Timestamp: 2025-05-29T22:23:39.312Z
Learning: In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.
🧬 Code Graph Analysis (3)
src/axolotl/utils/data/utils.py (3)
tests/test_exact_deduplication.py (1)
  • cfg (201-215)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/samplers/utils.py (1)
  • get_dataset_lengths (8-21)
src/axolotl/utils/data/shared.py (4)
src/axolotl/utils/data/utils.py (2)
  • deduplicate_and_log_datasets (111-147)
  • md5 (72-77)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/logging.py (1)
  • get_logger (53-62)
tests/e2e/multigpu/test_locking.py (1)
  • cfg (25-27)
src/axolotl/utils/data/sft.py (11)
src/axolotl/prompters.py (1)
  • Prompter (26-29)
src/axolotl/utils/data/lock.py (3)
  • FileLockLoader (17-66)
  • load (33-44)
  • cleanup (54-66)
src/axolotl/utils/data/pretraining.py (1)
  • wrap_pretraining_dataset (179-239)
src/axolotl/utils/data/shared.py (8)
  • create_train_validation_split (364-393)
  • datasets_with_name_generator (59-89)
  • generate_dataset_hash_from_config (494-513)
  • load_dataset_with_config (92-147)
  • load_preprocessed_dataset (443-471)
  • merge_datasets (516-538)
  • save_preprocessed_dataset (405-440)
  • try_load_from_hub (474-491)
src/axolotl/utils/data/utils.py (3)
  • deduplicate_and_log_datasets (111-147)
  • drop_long_seq_in_dataset (150-202)
  • retry_on_request_exceptions (31-69)
src/axolotl/utils/data/wrappers.py (1)
  • get_dataset_wrapper (58-132)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/distributed.py (1)
  • is_local_main_process (90-93)
src/axolotl/prompt_strategies/__init__.py (1)
  • load (12-50)
src/axolotl/prompt_strategies/stepwise_supervised.py (1)
  • load (105-116)
src/axolotl/utils/trainer.py (2)
  • calculate_total_num_steps (393-510)
  • process_datasets_for_packing (238-369)
🪛 Ruff (0.11.9)
src/axolotl/utils/data/shared.py

126-129: Use contextlib.suppress(FileNotFoundError, ConnectionError) instead of try-except-pass

Replace with contextlib.suppress(FileNotFoundError, ConnectionError)

(SIM105)

tests/e2e/multigpu/test_locking.py

107-110: Return the negated condition directly

(SIM103)

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: docker-e2e-tests-1st (124, 12.4.1, 3.11, 2.6.0, 1, vllm)
  • GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
  • GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
  • GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
🔇 Additional comments (2)
tests/e2e/multigpu/test_locking.py (1)

184-196: Use autospec=True when patching FileLock to prevent false positives

Patching without autospec replaces FileLock with a generic MagicMock, so calls like .acquire() or attribute access that don’t exist on the original object will silently succeed, masking breakages.

-with patch("axolotl.utils.data.lock.FileLock") as mock_filelock:
+with patch("axolotl.utils.data.lock.FileLock", autospec=True) as mock_filelock:

This binds the mock to the real signature and attributes of FileLock, giving you stricter, more trustworthy tests.

src/axolotl/utils/data/shared.py (1)

274-288: isfile/isdir aren’t consistently implemented across fsspec filesystems

AzureBlobFileSystem (and some others) only implement exists/info. Depending on the backend this branch can raise AttributeError, breaking cloud loads.

Consider falling back to remote_fs.exists() when the attribute is missing:

if getattr(remote_fs, "isdir", None) and remote_fs.isdir(dataset_config.path):
    …
elif getattr(remote_fs, "isfile", None) and remote_fs.isfile(dataset_config.path):
    …
elif remote_fs.exists(dataset_config.path):
    # treat as file by default


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

♻️ Duplicate comments (5)
src/axolotl/utils/data/rl.py (2)

111-113: ⚠️ Potential issue

Guard against missing "train" split before indexing DatasetDict

Blindly doing dataset = dataset["train"] will raise a KeyError when the prepared dataset contains only "validation" / "test" or custom split names.
Add an existence check (or a fallback + explicit error) before indexing.

-    if isinstance(dataset, DatasetDict):
-        dataset = dataset["train"]
+    if isinstance(dataset, DatasetDict):
+        if "train" not in dataset:
+            raise ValueError("Expected 'train' split in DatasetDict")
+        dataset = dataset["train"]

203-205: 🛠️ Refactor suggestion

Indexing bug – config list can be shorter than split_datasets

datasets_with_name_generator() may expand a single config into multiple datasets (multiple names or preprocessing shards).
Using datasets_configs[i] assumes a 1-to-1 correspondence and will mis-align or raise IndexError.

Iterate over (dataset, dataset_cfg) pairs instead:

-for i, data_set in enumerate(split_datasets):
-    _type = datasets_configs[i]["type"]
+for data_set, ds_cfg in zip(split_datasets, datasets_with_name_generator(datasets_configs)):
+    _type = ds_cfg["type"]
src/axolotl/utils/data/sft.py (2)

498-500: _apply_dataset_sharding still fails when a DatasetDict is returned

As raised in an earlier review: when a prepared dataset pushed to the Hub contains multiple splits, the object coming back is a DatasetDict, and passing it straight into .shard() crashes.
Handle DatasetDict (e.g. pick the requested split) inside _apply_dataset_sharding.
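
A minimal sketch of such a guard; the function signature is assumed from the
surrounding discussion, and the split-selection policy is illustrative:

from datasets import Dataset, DatasetDict

def _apply_dataset_sharding(dataset, cfg) -> Dataset:
    if isinstance(dataset, DatasetDict):
        # prefer the train split; otherwise fall back to the only/first split
        dataset = dataset["train"] if "train" in dataset else next(iter(dataset.values()))
    return dataset.shard(num_shards=cfg.dataset_shard_num, index=cfg.dataset_shard_idx)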


459-466: 🛠️ Refactor suggestion

Validate shard index is within bounds before calling .shard()

Supplying an out-of-range dataset_shard_idx will raise deep inside datasets.Dataset.shard, giving users a cryptic error. Validate early and fail fast:

-    if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+    if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+        if not 0 <= cfg.dataset_shard_idx < cfg.dataset_shard_num:
+            raise ValueError(
+                f"dataset_shard_idx ({cfg.dataset_shard_idx}) "
+                f"must be within [0, {cfg.dataset_shard_num - 1}]"
+            )
tests/e2e/multigpu/test_locking.py (1)

63-70: Race-condition around call_count is still unfixed

call_count += 1 is executed in three concurrent threads without any synchronisation.
A LOAD / INPLACE_ADD / STORE byte-code triplet is not atomic; increments can be lost, yielding duplicate "data_<n>" values and flaky assertions.

The same issue was pointed out in an earlier review and remains unresolved. Guard the counter with a threading.Lock() (or switch to itertools.count()).
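
A sketch of the lock-based fix; names follow the review's description of the
test, and the function name is illustrative:

import threading

call_count = 0
count_lock = threading.Lock()

def mock_load_fn():
    global call_count
    with count_lock:  # serialise the read-modify-write so no increment is lost
        call_count += 1
        return f"data_{call_count}"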

🧹 Nitpick comments (2)
src/axolotl/utils/data/rl.py (1)

269-272: Docstring out of sync with return value

The function returns a Dataset, not a Tuple. Update the docstring to prevent confusion.

tests/e2e/multigpu/test_locking.py (1)

103-106: Micro-simplification opportunity

The two consecutive if/return branches can be collapsed into a single boolean return, improving clarity:

-            if exists_call_count <= 3:
-                return False
-            return True
+            return exists_call_count > 3
🧰 Tools
🪛 Ruff (0.11.9)

103-106: Return the negated condition directly

(SIM103)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 669579a and d523857.

📒 Files selected for processing (3)
  • src/axolotl/utils/data/rl.py (2 hunks)
  • src/axolotl/utils/data/sft.py (3 hunks)
  • tests/e2e/multigpu/test_locking.py (1 hunks)
🧰 Additional context used
🧠 Learnings (1)
src/axolotl/utils/data/sft.py (1)
Learnt from: winglian
PR: axolotl-ai-cloud/axolotl#2707
File: src/axolotl/utils/data/sft.py:247-254
Timestamp: 2025-05-29T22:23:39.312Z
Learning: In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.
🧬 Code Graph Analysis (2)
tests/e2e/multigpu/test_locking.py (2)
src/axolotl/utils/data/lock.py (3)
  • FileLockLoader (17-66)
  • load (33-44)
  • cleanup (54-66)
src/axolotl/utils/dict.py (1)
  • DictDefault (6-38)
src/axolotl/utils/data/sft.py (9)
src/axolotl/prompters.py (1)
  • Prompter (26-29)
src/axolotl/utils/data/lock.py (3)
  • FileLockLoader (17-66)
  • load (33-44)
  • cleanup (54-66)
src/axolotl/utils/data/pretraining.py (1)
  • wrap_pretraining_dataset (179-239)
src/axolotl/utils/data/shared.py (6)
  • create_train_validation_split (364-393)
  • datasets_with_name_generator (59-89)
  • generate_dataset_hash_from_config (494-513)
  • load_preprocessed_dataset (443-471)
  • merge_datasets (516-538)
  • try_load_from_hub (474-491)
src/axolotl/utils/data/utils.py (3)
  • deduplicate_and_log_datasets (111-147)
  • drop_long_seq_in_dataset (150-202)
  • retry_on_request_exceptions (31-69)
src/axolotl/utils/data/wrappers.py (1)
  • get_dataset_wrapper (58-132)
src/axolotl/utils/distributed.py (1)
  • is_local_main_process (90-93)
src/axolotl/prompt_strategies/__init__.py (1)
  • load (12-50)
src/axolotl/utils/trainer.py (2)
  • calculate_total_num_steps (393-510)
  • process_datasets_for_packing (238-369)
🪛 ast-grep (0.38.1)
tests/e2e/multigpu/test_locking.py

[warning] 92-92: The function mktemp is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use NamedTemporaryFile() instead and pass it the delete=False parameter.
Context: tempfile.mktemp()
Note: [CWE-377]: Insecure Temporary File [OWASP A01:2021]: Broken Access Control [REFERENCES]
https://docs.python.org/3/library/tempfile.html#tempfile.mktemp
https://owasp.org/Top10/A01_2021-Broken_Access_Control

(avoid-mktemp-python)

🪛 Ruff (0.11.9)
tests/e2e/multigpu/test_locking.py

103-106: Return the negated condition directly

(SIM103)

⏰ Context from checks skipped due to timeout of 90000ms (9)
  • GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
  • GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
  • GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
  • GitHub Check: PyTest (3.11, 2.7.0)
  • GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.5.1)
  • GitHub Check: PyTest (3.11, 2.6.0)
  • GitHub Check: pre-commit

@djsaunde djsaunde merged commit 00cda8c into main Jun 10, 2025
8 of 12 checks passed
@djsaunde djsaunde deleted the data-load-refactor branch June 10, 2025 23:53
Development

Successfully merging this pull request may close these issues.

Distributed Timeout during Dataset Tokenization