Data loader refactor #2707
Conversation
Caution: Review failed. The pull request is closed.

Walkthrough

This pull request refactors and modularizes dataset preparation, loading, and wrapping logic across supervised fine-tuning (SFT) and reinforcement learning (RL) workflows. It introduces new modules, updates function signatures, improves distributed synchronization and deduplication, and aligns tests with the new APIs. Docstrings and type hints are also modernized throughout.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Trainer as Trainer/CLI
    participant SFT as prepare_datasets (SFT)
    participant RL as prepare_preference_datasets (RL)
    participant Shared as load_dataset_with_config
    participant Wrappers as get_dataset_wrapper
    participant Lock as FileLock/ReadyFlag

    Trainer->>SFT: prepare_datasets(cfg, tokenizer, ...)
    SFT->>Lock: Acquire file lock, check ready flag
    alt Not ready
        SFT->>Shared: load_dataset_with_config(...)
        SFT->>Wrappers: get_dataset_wrapper(...)
        SFT->>Lock: Save processed dataset, set ready flag
    else Ready
        SFT->>Shared: load processed dataset
    end
    SFT->>Trainer: Return train/eval datasets, steps, prompters

    Trainer->>RL: prepare_preference_datasets(cfg)
    RL->>Lock: Acquire file lock, check ready flag
    alt Not ready
        RL->>Shared: load_dataset_with_config(...)
        RL->>RL: Transform, filter, cache dataset
        RL->>Lock: Save processed dataset, set ready flag
    else Ready
        RL->>Shared: load processed dataset
    end
    RL->>Trainer: Return train/eval datasets
```
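For reference, the lock-and-flag coordination in the diagram reduces to a small pattern. The sketch below is a minimal illustration under stated assumptions, not the PR's actual loader: `process_fn` and `load_fn` are hypothetical stand-ins, and the `datasets_prep.lock`/`datasets_ready.flag` file names mirror the ones discussed later in this thread.

```python
import time
from pathlib import Path

from filelock import FileLock  # pip install filelock


def load_with_file_lock(prepared_path: str, process_fn, load_fn):
    """First process to take the lock preprocesses; others wait on the flag."""
    base = Path(prepared_path)
    base.mkdir(parents=True, exist_ok=True)
    lock_path = base / "datasets_prep.lock"
    ready_flag = base / "datasets_ready.flag"

    with FileLock(str(lock_path)):
        if not ready_flag.exists():
            dataset = process_fn()  # tokenize / transform / save to disk
            ready_flag.touch()      # only after the save completes
            return dataset

    # The lock holder (or another rank) already finished: wait, then load.
    while not ready_flag.exists():
        time.sleep(1)
    return load_fn()
```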
_Force-pushed from `3dad97a` to `255757c`._
Current test failures will be fixed once #2608 is merged.
_Force-pushed from `255757c` to `daf5076`._
Opening up for review, but there's still plenty to do IMO under …. Main areas of focus (under …).
Actionable comments posted: 6
🔭 Outside diff range comments (1)

src/axolotl/utils/data/rl.py (1)

- `161-161`: ⚠️ Potential issue: Fix incorrect log message in save function. The log message says "Loading" but this function is saving the dataset.

```diff
- LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
+ LOG.info(f"Saving prepared dataset to disk at {prepared_ds_path}...")
```
♻️ Duplicate comments (1)

src/axolotl/utils/data/rl.py (1)

- `50-83`: Apply the same distributed coordination fix as suggested for sft.py. This function has the same race condition issue, where the ready flag is created before datasets are fully saved to disk (see the sketch below).
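A minimal sketch of the ordering being asked for, assuming the Hugging Face `datasets` `save_to_disk` API and the flag file discussed in this PR; the helper name is hypothetical:

```python
from pathlib import Path

from datasets import Dataset


def save_then_mark_ready(dataset: Dataset, prepared_path: str) -> None:
    """Persist the dataset first, then publish the ready flag."""
    base = Path(prepared_path)
    dataset.save_to_disk(str(base))          # blocks until fully written
    (base / "datasets_ready.flag").touch()   # other ranks may now proceed
```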
🧹 Nitpick comments (3)

src/axolotl/utils/data/shared.py (1)

- `116-121`: Consider using `contextlib.suppress` for cleaner exception handling. The static analysis tool correctly identifies an opportunity to simplify the exception handling. Apply this refactor for more Pythonic code:

```diff
+import contextlib
 # ... other imports ...

 is_cloud_dataset = False
 if remote_fs:
-    try:
-        is_cloud_dataset = remote_fs.exists(dataset_config.path)
-    except (FileNotFoundError, ConnectionError):
-        pass
+    with contextlib.suppress(FileNotFoundError, ConnectionError):
+        is_cloud_dataset = remote_fs.exists(dataset_config.path)
```
🧰 Tools: 🪛 Ruff (0.11.9): 117-120: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
src/axolotl/utils/data/wrappers.py (2)

- `120-127`: Extract duplicated error handling logic. The error handling logic for unhandled dataset types is duplicated. Consider extracting it to a helper function. Add this helper function at the module level:

```python
def _raise_unhandled_dataset_error(dataset_type: str) -> None:
    """Raise a ValueError for unhandled dataset types with helpful suggestions."""
    suffix = ""
    if ":load_" in dataset_type:
        suffix = f" Did you mean {dataset_type.replace(':load_', '.load_')}?"
    error_message = f"unhandled prompt tokenization strategy: {dataset_type}. {suffix}"
    LOG.error(error_message)
    raise ValueError(error_message)
```

Then replace both occurrences with:

```diff
- ds_type = dataset_config.type
- suffix = ""
- if ":load_" in ds_type:
-     suffix = f" Did you mean {ds_type.replace(':load_', '.load_')}?"
-
- error_message = f"unhandled prompt tokenization strategy: {ds_type}. {suffix}"
- LOG.error(error_message)
- raise ValueError(error_message)
+ _raise_unhandled_dataset_error(dataset_config.type)
```

Also applies to: 172-180
- `234-416`: Consider refactoring similar handler functions. Many handler functions follow nearly identical patterns, differing only in the prompter and strategy classes used. This could be refactored to reduce code duplication. Consider creating a generic handler factory:

```python
def _create_generic_handler(
    prompter_class: type[Prompter],
    strategy_class: type[PromptTokenizingStrategy],
) -> Callable:
    """Create a handler function for a specific prompter and strategy combination."""

    def handler(
        dataset_prompt_style: str | None,
        tokenizer: PreTrainedTokenizer,
        cfg: DictDefault,
        dataset: Dataset | IterableDataset,
        dataset_kwargs: dict[str, Any],
    ) -> tuple[Dataset | IterableDataset, Prompter]:
        dataset_prompter = prompter_class(dataset_prompt_style)
        dataset_strategy = strategy_class(
            dataset_prompter,
            tokenizer,
            cfg.train_on_inputs,
            cfg.sequence_len,
        )
        dataset_wrapper = wrap_dataset_for_tokenized_prompt(
            dataset_strategy,
            dataset,
            **dataset_kwargs,
        )
        return dataset_wrapper, dataset_prompter

    return handler


# Then define handlers more concisely:
DATASET_HANDLERS = {
    "alpaca": _create_generic_handler(AlpacaPrompter, AlpacaPromptTokenizingStrategy),
    "summarizetldr": _create_generic_handler(
        SummarizeTLDRPrompter, SummarizeTLDRPromptTokenizingStrategy
    ),
    # ... etc
}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (19)

- `src/axolotl/common/const.py` (1 hunks)
- `src/axolotl/common/datasets.py` (5 hunks)
- `src/axolotl/datasets.py` (5 hunks)
- `src/axolotl/prompt_strategies/messages/__init__.py` (0 hunks)
- `src/axolotl/prompt_tokenizers.py` (2 hunks)
- `src/axolotl/train.py` (1 hunks)
- `src/axolotl/utils/data/__init__.py` (1 hunks)
- `src/axolotl/utils/data/pretraining.py` (1 hunks)
- `src/axolotl/utils/data/rl.py` (5 hunks)
- `src/axolotl/utils/data/sft.py` (4 hunks)
- `src/axolotl/utils/data/shared.py` (2 hunks)
- `src/axolotl/utils/data/utils.py` (5 hunks)
- `src/axolotl/utils/data/wrappers.py` (1 hunks)
- `src/axolotl/utils/schemas/config.py` (1 hunks)
- `tests/e2e/test_dpo.py` (1 hunks)
- `tests/e2e/test_llama_pretrain.py` (2 hunks)
- `tests/prompt_strategies/test_dpo_chatml.py` (2 hunks)
- `tests/test_datasets.py` (15 hunks)
- `tests/test_exact_deduplication.py` (15 hunks)
💤 Files with no reviewable changes (1)

- `src/axolotl/prompt_strategies/messages/__init__.py`
🧰 Additional context used

🧬 Code Graph Analysis (8)

- tests/prompt_strategies/test_dpo_chatml.py (1)
  - src/axolotl/utils/data/rl.py (1): `prepare_preference_datasets` (32-99)
- src/axolotl/prompt_tokenizers.py (1)
  - src/axolotl/prompt_strategies/messages/chat.py (1): `wrap_dataset` (33-48)
- src/axolotl/utils/data/__init__.py (5)
  - src/axolotl/utils/data/pretraining.py (2): `encode_pretraining` (20-176), `wrap_pretraining_dataset` (179-239)
  - src/axolotl/utils/data/rl.py (1): `prepare_preference_datasets` (32-99)
  - src/axolotl/utils/data/wrappers.py (1): `get_dataset_wrapper` (46-127)
  - src/axolotl/utils/data/sft.py (1): `prepare_datasets` (48-69)
  - src/axolotl/utils/data/utils.py (1): `md5` (72-77)
- tests/test_datasets.py (2)
  - src/axolotl/utils/data/rl.py (1): `prepare_preference_datasets` (32-99)
  - src/axolotl/utils/data/sft.py (1): `_load_tokenized_prepared_datasets` (256-304)
- src/axolotl/datasets.py (4)
  - src/axolotl/prompt_tokenizers.py (2): `PromptTokenizingStrategy` (43-105), `supports_batched` (70-71)
  - src/axolotl/prompt_strategies/chat_template.py (1): `supports_batched` (328-330)
  - src/axolotl/prompt_strategies/pretrain.py (1): `supports_batched` (21-22)
  - src/axolotl/prompt_strategies/stepwise_supervised.py (1): `supports_batched` (101-102)
- src/axolotl/utils/data/shared.py (2)
  - src/axolotl/utils/data/utils.py (1): `md5` (72-77)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
- src/axolotl/utils/data/sft.py (8)
  - src/axolotl/prompters.py (1): `Prompter` (26-29)
  - src/axolotl/utils/data/pretraining.py (1): `wrap_pretraining_dataset` (179-239)
  - src/axolotl/utils/data/shared.py (3): `datasets_with_name_generator` (50-80), `generate_split_fingerprints` (327-339), `load_dataset_with_config` (83-138)
  - src/axolotl/utils/data/utils.py (4): `deduplicate_and_log_datasets` (111-147), `drop_long_seq_in_dataset` (150-202), `md5` (72-77), `retry_on_request_exceptions` (31-69)
  - tests/conftest.py (1): `retry_on_request_exceptions` (28-49)
  - src/axolotl/utils/data/wrappers.py (1): `get_dataset_wrapper` (46-127)
  - src/axolotl/utils/distributed.py (1): `is_local_main_process` (90-93)
  - src/axolotl/utils/trainer.py (1): `calculate_total_num_steps` (393-510)
- src/axolotl/utils/data/utils.py (3)
  - tests/test_exact_deduplication.py (1): `cfg` (203-217)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
  - src/axolotl/utils/samplers/utils.py (1): `get_dataset_lengths` (8-21)
🪛 Ruff (0.11.9)

- src/axolotl/utils/data/shared.py: 117-120: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
⏰ Context from checks skipped due to timeout of 90000ms (6)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
🔇 Additional comments (41)
src/axolotl/common/const.py (1)

- `1-1`: LGTM! Docstring formatting improvement. The conversion from multi-line to single-line docstring format improves consistency across the codebase while preserving the original meaning.

tests/e2e/test_dpo.py (1)

- `1-1`: LGTM! Docstring formatting improvement. The single-line docstring format improves consistency while maintaining the original content and meaning.

tests/e2e/test_llama_pretrain.py (2)

- `1-1`: LGTM! Module docstring formatting improvement. The single-line format improves consistency across the codebase.
- `21-21`: LGTM! Class docstring formatting improvement. The single-line format aligns with the module-level docstring formatting and improves overall consistency.

src/axolotl/utils/data/pretraining.py (1)

- `253-253`: LGTM! Improved function call clarity. Converting from positional to keyword argument makes the code more explicit and readable, which aligns with the PR's goal of improving code clarity.
tests/prompt_strategies/test_dpo_chatml.py (2)

- `10-10`: LGTM! Updated import aligns with the dataset preparation refactoring. The import correctly reflects the renamed function `prepare_preference_datasets` that was introduced as part of the dataset loading refactor.
- `58-58`: LGTM! Function call updated to use the refactored API. The function call correctly uses the new `prepare_preference_datasets` function, maintaining the same parameter signature and return value expectations. This aligns with the broader dataset preparation refactoring mentioned in the PR objectives.

src/axolotl/train.py (1)

- `56-57`: LGTM! Docstring formatting improvement. The docstring formatting has been updated to a more consistent style while preserving the original content.
src/axolotl/prompt_tokenizers.py (2)

- `6-6`: Good addition of a required import. The `Dataset` import is necessary for the type hint in the new abstract method `wrap_dataset`.
- `32-40`: Excellent formalization of the dataset wrapping interface. The addition of the abstract `wrap_dataset` method properly formalizes the interface that subclasses must implement. The method signature, with optional `process_count` and `keep_in_memory` parameters plus flexible `**kwargs`, aligns well with existing implementations and provides good flexibility for different dataset wrapping strategies.
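From that description, the abstract method plausibly looks like the sketch below. This paraphrases the class rather than quoting it; treat the parameter defaults and return type as assumptions, not the PR's verbatim signature.

```python
from abc import ABC, abstractmethod

from datasets import Dataset


class PromptTokenizingStrategy(ABC):
    """Abstract interface that prompt tokenizing strategies implement (sketch)."""

    @abstractmethod
    def wrap_dataset(
        self,
        dataset: Dataset,
        process_count: int | None = None,
        keep_in_memory: bool | None = False,
        **kwargs,
    ) -> Dataset:
        """Tokenize / wrap a raw dataset into trainer-ready form."""
```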
src/axolotl/utils/schemas/config.py (1)

- `1200-1200`: Excellent defensive programming improvement. Using `getattr(self, "micro_batch_size", 1)` instead of direct attribute access prevents a potential `AttributeError` when `micro_batch_size` is not set, while maintaining the existing validation logic with a sensible default value.

src/axolotl/utils/data/__init__.py (3)
- `1-1`: Clear and specific module docstring. The updated docstring properly identifies this as the initializer for the `axolotl.utils.data` module.
- `3-12`: Well-organized import consolidation reflecting the refactoring. The import updates properly reflect the function renaming and consolidation from the broader refactoring:
  - `load_prepare_preference_datasets` → `prepare_preference_datasets`
  - Multiple SFT imports consolidated to `prepare_datasets`
  - Addition of `get_dataset_wrapper` (moved from the SFT module)
  - Addition of the `md5` utility function

  The import organization is clean and logical.
- `14-21`: Excellent addition of an explicit public API definition. The `__all__` list clearly defines the module's public interface, making it easier for users to understand what functionality is available and ensuring consistent exports. All the listed functions align with the refactored dataset preparation workflow.
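Inferred from the imports discussed above, the module's public surface is roughly the following sketch (the exact ordering and any additional names are assumptions):

```python
"""Initializer for the axolotl.utils.data module (sketch)."""

from axolotl.utils.data.pretraining import encode_pretraining, wrap_pretraining_dataset
from axolotl.utils.data.rl import prepare_preference_datasets
from axolotl.utils.data.sft import prepare_datasets
from axolotl.utils.data.utils import md5
from axolotl.utils.data.wrappers import get_dataset_wrapper

__all__ = [
    "encode_pretraining",
    "wrap_pretraining_dataset",
    "prepare_preference_datasets",
    "prepare_datasets",
    "md5",
    "get_dataset_wrapper",
]
```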
tests/test_exact_deduplication.py (5)

- `17-18`: LGTM: Import updates align with refactored APIs. The import changes correctly reflect the function renames from `prepare_dataset` to `prepare_datasets` and `load_prepare_preference_datasets` to `prepare_preference_datasets`, maintaining consistency with the refactored data loading modules.
- `230-232`: Correct patch target for the renamed function. The patch target correction from `"axolotl.utils.data.rl.load_dataset_w_config"` to `"axolotl.utils.data.rl.load_dataset_with_config"` properly reflects the function rename in the shared module. Also applies to: 256-258
- `242-242`: Function call updates align with the refactored RL module. The calls to `prepare_preference_datasets` correctly replace the previous `load_prepare_preference_datasets` function, maintaining the expected return tuple structure. Also applies to: 270-270
- `324-324`: Updated function calls match the new SFT module API. The function calls correctly use the renamed `prepare_datasets` function, and the return value unpacking remains consistent with the expected 4-tuple format. Also applies to: 351-351, 378-378
- `74-77`: ✅ Verification successful: Verify the updated deduplication API usage. The API change from separate `train_dataset` and `eval_dataset` parameters to a single `dataset` parameter with optional `dataset_name` aligns with the simplified deduplication interface mentioned in the AI summary. Verifying that the new API signature matches the implementation:

🏁 Script executed:

```bash
#!/bin/bash
# Description: Verify the deduplicate_and_log_datasets function signature
# Expected: Function should accept dataset and optional dataset_name/other_dataset parameters
ast-grep --pattern 'def deduplicate_and_log_datasets($$$)'
```

Length of output: 2785

✅ `deduplicate_and_log_datasets` signature verified. The function in `src/axolotl/utils/data/utils.py` (lines 111-116) is defined with:

- `dataset: Dataset`
- `other_dataset: Dataset | None = None`
- `dataset_name: str | None = "train"`
- `other_name: str | None = "eval"`

The test calls:

```python
train_dataset, _ = deduplicate_and_log_datasets(dataset=self.dataset)
eval_dataset, _ = deduplicate_and_log_datasets(
    dataset=self.dataset, dataset_name="eval"
)
```

These correctly use the default `dataset_name="train"` for training and override it to `"eval"` for evaluation. No changes required.

src/axolotl/datasets.py (5)
- `22-29`: Improved docstring formatting enhances readability. The restructured docstring with a proper Args section follows Python documentation conventions and improves code documentation quality.
- `35-36`: Modernized type hints using Python 3.10+ union syntax. The change from `Optional[int]` and `Optional[bool]` to `int | None` and `bool | None` uses the more concise union syntax introduced in Python 3.10. This improves code readability and follows modern Python typing conventions.
- `78-78`: Type hint and explicit list conversion improvements. The updates include modern union syntax for `Dataset | IterableDataset` and the explicit conversion `list(dataset.features.keys())` for clarity. These changes improve type safety and code explicitness. Also applies to: 85-85
- `96-103`: Consistent docstring formatting and type hint modernization. The docstring formatting follows the same improved pattern as the previous class, and the type hint `list[IterableDataset]` uses modern Python syntax. Also applies to: 113-113
- `177-181`: Improved logging message formatting. The multi-line f-string formatting makes the warning message more readable while maintaining the same information content.
src/axolotl/common/datasets.py (5)
- `12-12`: Import updates align with the refactored modules. The imports correctly reference the renamed functions `prepare_datasets` and `prepare_preference_datasets` from their respective modules. Also applies to: 17-17
- `40-40`: Improved function signature and safer attribute access. The changes include making `cli_args` mandatory (removing `Optional`) and using `getattr(cli_args, "iterable", False)` for safer attribute access. This improves API clarity and prevents potential `AttributeError` exceptions. Also applies to: 55-55
- `57-57`: Updated function calls and debug condition handling. The function call update to `prepare_datasets` is correct, and the debug condition logic has been restructured for better readability. The conditions appear to check the same attributes but with improved formatting. Also applies to: 64-69
- `124-129`: Explicit keyword arguments improve code clarity. Using explicit keyword arguments in the `check_dataset_labels` call makes the function invocation more readable and less prone to errors from positional argument changes.
- `110-110`: ✅ Verification successful: Review the inverted `total_num_steps` calculation logic. The logic has been inverted: previously it calculated `total_num_steps` by default and set it to `None` for GRPO; now it initializes as `None` and only calculates it for non-GRPO types. This appears to match the AI summary description. Also applies to: 112-116

🏁 Script executed:

```bash
#!/bin/bash
# Description: Check how total_num_steps is handled in RL training
# Expected: Should find logic that distinguishes GRPO from other RL types
rg -A 5 -B 5 "total_num_steps.*GRPO|GRPO.*total_num_steps"
```

Length of output: 60

Gathering more context on how `total_num_steps` and `GRPO` are used across the codebase:

🏁 Script executed:

```bash
#!/bin/bash
set -e
echo "===== References to total_num_steps ====="
rg -n "total_num_steps" || echo "(none found)"
echo
echo "===== References to GRPO ====="
rg -n "GRPO" || echo "(none found)"
echo
echo "===== prepare_preference_datasets definition ====="
rg -n -C5 "def prepare_preference_datasets" src/axolotl/common/datasets.py || echo "(not found)"
```

Length of output: 10569

Inspecting the `prepare_preference_datasets` implementation directly:

🏁 Script executed:

```bash
#!/bin/bash
set -e
echo "===== src/axolotl/common/datasets.py (lines 1-200) ====="
sed -n '1,200p' src/axolotl/common/datasets.py
```

Length of output: 4510

`total_num_steps` logic for GRPO verified. The code correctly initializes `total_num_steps` to `None` and only computes it when `cfg.rl` is not `RLType.GRPO`, preserving the previous behavior for GRPO runs. File: src/axolotl/common/datasets.py (lines 108-116)
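Condensing the verified control flow into a hedged sketch (the helper name and the exact `calculate_total_num_steps` call signature are assumptions, not the PR's verbatim code):

```python
from axolotl.utils.schemas.enums import RLType
from axolotl.utils.trainer import calculate_total_num_steps


def maybe_total_num_steps(cfg, train_dataset):
    """GRPO runs skip the up-front step calculation; other RL types compute it."""
    if cfg.rl is RLType.GRPO:
        return None
    return calculate_total_num_steps(cfg, train_dataset)
```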
tests/test_datasets.py (5)

- `14-15`: Import updates correctly reference the refactored functions. The imports now use `prepare_preference_datasets` and the internal function `_load_tokenized_prepared_datasets`, aligning with the modular refactoring described in the AI summary.
- `30-32`: Updated generator type hint for the fixture. The type hint `Generator[PreTrainedTokenizer, Any, Any]` properly reflects the fixture's generator nature, improving type safety.
- `67-70`: Consistent patching pattern ensures test isolation. All test methods now consistently wrap calls to `_load_tokenized_prepared_datasets` with patches to override `DEFAULT_DATASET_PREPARED_PATH`. This ensures proper test isolation by using temporary directories instead of default paths. Also applies to: 114-117, 146-149, 184-187, 222-225, 254-257, 286-289, 340-343, 424-427, 463-466
- `311-311`: Function calls updated to use the refactored RL API. The calls correctly use `prepare_preference_datasets` instead of the previous function name, maintaining the expected return tuple structure. Also applies to: 375-375
- `368-368`: Patch target corrections for the renamed functions. The patch targets correctly reference `load_dataset_with_config` instead of the old `load_dataset_w_config` function name. Also applies to: 417-417
src/axolotl/utils/data/sft.py (2)
- `47-69`: Clean refactoring of the main entry point! The function has been well-refactored with:
  - Clear type annotations
  - A comprehensive docstring
  - Clean separation between pretraining and standard dataset preparation

  (a sketch of the entry point's shape follows the next item)

- `137-666`: Excellent modularization of dataset preparation logic! The refactoring successfully breaks down the monolithic dataset preparation into well-defined, single-responsibility functions. The improvements include:
  - Clear separation of pretraining vs standard dataset flows
  - Comprehensive caching with Hub integration
  - Proper handling of distributed environments
  - Type hints throughout
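Based on the walkthrough's description of the return values (train/eval datasets, step count, prompters), the entry point plausibly has a shape like this sketch; the parameter list, the return annotation, and the two private helpers are assumptions:

```python
from datasets import Dataset
from transformers import PreTrainedTokenizer

from axolotl.prompters import Prompter
from axolotl.utils.dict import DictDefault


def prepare_datasets(
    cfg: DictDefault,
    tokenizer: PreTrainedTokenizer,
) -> tuple[Dataset, Dataset | None, int, list[Prompter]]:
    """Return (train_dataset, eval_dataset, total_num_steps, prompters)."""
    if cfg.pretraining_dataset:
        # Streaming pretraining path (hypothetical helper name)
        return _prepare_pretraining_datasets(cfg, tokenizer)
    # Standard tokenized-and-cached path (hypothetical helper name)
    return _prepare_standard_datasets(cfg, tokenizer)
```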
src/axolotl/utils/data/shared.py (1)

- `33-340`: Well-structured refactoring of dataset loading utilities! The improvements include:
  - Clearer function names following Python conventions
  - Modular loading functions for different data sources
  - Comprehensive error handling with helpful messages
  - A new `generate_split_fingerprints` utility for consistent dataset splitting (sketched after the tool note below)

  🧰 Tools: 🪛 Ruff (0.11.9): 117-120: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
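The review never shows `generate_split_fingerprints` itself; given the `md5` helper it sits next to, a plausible sketch is below. The hash inputs and the return convention are assumptions.

```python
import hashlib


def md5(to_hash: str, encoding: str = "utf-8") -> str:
    """MD5 hex digest of a string (mirrors the md5 helper cited in this review)."""
    return hashlib.md5(to_hash.encode(encoding)).hexdigest()


def generate_split_fingerprints(
    dataset_id: str, val_set_size: float, seed: int
) -> tuple[str, str]:
    """Derive deterministic fingerprints for the train and test splits."""
    fingerprint = md5(f"{dataset_id}|{val_set_size}|{seed}")
    return f"{fingerprint}_train", f"{fingerprint}_test"
```

Fingerprints like these could then feed `Dataset.train_test_split(train_new_fingerprint=..., test_new_fingerprint=...)` so repeated runs produce identical splits.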
src/axolotl/utils/data/rl.py (2)

- `203-258`: Excellent implementation of sequence length filtering for RL! The function properly handles multiple RL types with:
  - Clear validation of required keys
  - Accurate length calculations for each type
  - Helpful error messages

  (see the filtering sketch after the next item)

- `165-384`: Well-designed RL dataset preparation module! The refactoring successfully:
  - Modularizes RL dataset loading and processing
  - Supports multiple RL types (DPO, ORPO, KTO, etc.)
  - Implements proper caching and distributed coordination
  - Includes comprehensive filtering and validation
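As promised above, here is a minimal sketch of DPO-style length filtering, assuming the `prompt`/`chosen`/`rejected` keys named elsewhere in this review; counting tokens via a plain tokenizer call is an assumed implementation detail, not the PR's exact code.

```python
from transformers import PreTrainedTokenizer


def fits_sequence_len(
    sample: dict, tokenizer: PreTrainedTokenizer, sequence_len: int
) -> bool:
    """Return True if prompt + the longer completion fit within sequence_len."""
    required = ("prompt", "chosen", "rejected")
    if missing := [key for key in required if key not in sample]:
        raise ValueError(f"DPO sample missing required keys: {missing}")

    prompt_len = len(tokenizer(sample["prompt"])["input_ids"])
    chosen_len = len(tokenizer(sample["chosen"])["input_ids"])
    rejected_len = len(tokenizer(sample["rejected"])["input_ids"])
    return prompt_len + max(chosen_len, rejected_len) <= sequence_len


# usage:
# dataset = dataset.filter(lambda s: fits_sequence_len(s, tokenizer, cfg.sequence_len))
```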
src/axolotl/utils/data/utils.py (2)

- `173-179`: LGTM! Clean error handling with `contextlib.suppress`. The use of `contextlib.suppress(AttributeError)` is cleaner than try-except blocks. Note that this will silently ignore any `AttributeError`, not just those from `get_dataset_lengths`.
- `111-117`: ✅ Verification successful: Verify callers of `deduplicate_and_log_datasets` for breaking changes. The function signature has been simplified from supporting three datasets to only two. This is a breaking change that could affect existing code. Running the following script to verify all callers have been updated:

🏁 Script executed:

```bash
#!/bin/bash
# Description: Check for calls to deduplicate_and_log_datasets to ensure they match the new signature

# Search for function calls with the old three-dataset pattern
echo "Checking for potential outdated calls with three datasets:"
rg -A 3 "deduplicate_and_log_datasets\s*\([^,]+,[^,]+,[^,]+" --type py

# Search for all current calls to the function
echo -e "\nAll current calls to deduplicate_and_log_datasets:"
rg -A 3 "deduplicate_and_log_datasets\s*\(" --type py
```

Length of output: 3931

No outdated calls to `deduplicate_and_log_datasets` detected. A workspace-wide search confirms that there are no remaining calls passing three dataset arguments. All usages now match the new signature (only `dataset` and optional `other_dataset`, with named parameters for `dataset_name`/`other_name`). No breaking changes remain; no further action is needed.
src/axolotl/utils/data/rl.py (outdated diff excerpt)

```python
# The rank that acquires the lock first does the data preprocessing
with FileLock(str(lock_file_path)):
    ready_flag_path = Path(dataset_prepared_path) / "datasets_ready.flag"
```
will this get removed when we reprocess a dataset? This can be an issue especially when modifying dataset transforms and reprocessing to test.
Hm good question. I need to add some tests around this.
In hindsight, this bit should have been a separate PR
I had some reviews from a few days ago. Feel free to ignore them if they're outdated.
Thank you for this refactor! We accumulated a lot of tech debt as we kept adding more dataset formats early on.
Actionable comments posted: 1
🔭 Outside diff range comments (1)

src/axolotl/utils/data/rl.py (1)

- `160-162`: ⚠️ Potential issue: Fix incorrect log message in save function. The log message incorrectly states "Loading prepared dataset from disk" when it should indicate saving/preparing the dataset. Apply this diff to fix the misleading log message:

```diff
- LOG.info(f"Loading prepared dataset from disk at {prepared_ds_path}...")
+ LOG.info(f"Saving prepared dataset to disk at {prepared_ds_path}...")
```
♻️ Duplicate comments (3)

src/axolotl/utils/data/rl.py (3)

- `46-52`: Verify cleanup of lock and flag files during reprocessing. This mirrors the previous concern about whether lock/flag files get cleaned up during dataset reprocessing. Consider implementing cleanup logic when `cfg.is_preprocess` is True or when dataset transforms are modified (see the sketch below).

```bash
#!/bin/bash
# Check if there's any cleanup logic for lock/flag files when reprocessing
rg -A 10 -B 5 "is_preprocess.*True|datasets_ready\.flag.*unlink|datasets_prep\.lock.*unlink"
```
- `221-221`: Clarify GRPO handling in sequence length filtering. As noted in previous reviews, GRPO returns `True` without any actual filtering. Consider adding a comment explaining why GRPO doesn't need sequence length filtering, or implement appropriate filtering logic if needed. Add a clarifying comment:

```diff
 if rl is RLType.GRPO:
+    # GRPO doesn't use preference datasets, so no sequence length filtering needed
     return True
```
- `32-37`: 🛠️ Refactor suggestion: Consider extracting caching/deduplication logic to a shared module. As mentioned in previous reviews, since SFT also needs caching and deduplication logic, consider extracting the file locking and coordination mechanism to `shared.py` for reuse across different training types.
🧹 Nitpick comments (2)

src/axolotl/utils/data/rl.py (2)

- `4-4`: Consider the necessity of the `time` import. The `time` import is only used for `time.sleep(1)` on line 82. Consider whether this polling approach is the most efficient method for coordination, or if there are better alternatives.
- `81-82`: Consider implementing exponential backoff for polling. The current implementation uses a simple 1-second sleep in a polling loop. For better resource utilization and responsiveness, consider implementing exponential backoff:

```diff
- while not ready_flag_path.exists():
-     time.sleep(1)
+ wait_time = 0.1
+ while not ready_flag_path.exists():
+     time.sleep(wait_time)
+     wait_time = min(wait_time * 1.5, 5.0)  # Cap at 5 seconds
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- `src/axolotl/train.py` (1 hunks)
- `src/axolotl/utils/data/rl.py` (5 hunks)
- `src/axolotl/utils/schemas/config.py` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- src/axolotl/train.py
- src/axolotl/utils/schemas/config.py
🧰 Additional context used

🧬 Code Graph Analysis (1)

- src/axolotl/utils/data/rl.py (6)
  - src/axolotl/loaders/tokenizer.py (1): `load_tokenizer` (120-281)
  - src/axolotl/prompt_strategies/orpo/chat_template.py (1): `load` (26-43)
  - src/axolotl/utils/data/shared.py (3): `datasets_with_name_generator` (50-80), `generate_split_fingerprints` (327-339), `load_dataset_with_config` (83-138)
  - src/axolotl/utils/data/utils.py (2): `deduplicate_and_log_datasets` (111-147), `md5` (72-77)
  - src/axolotl/utils/distributed.py (1): `is_main_process` (72-87)
  - src/axolotl/utils/schemas/enums.py (1): `RLType` (22-30)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: pre-commit
- GitHub Check: pre-commit
🔇 Additional comments (3)

src/axolotl/utils/data/rl.py (3)

- `32-99`: Excellent refactoring with comprehensive distributed coordination. The main entry function is well-structured, with proper file locking, ready-flag coordination, and clear separation of concerns. The docstring is comprehensive, and the logic handles both distributed and non-distributed scenarios effectively.
- `184-188`: Good tokenizer injection pattern. The dynamic tokenizer injection based on function signature inspection is a clean solution that maintains flexibility while avoiding unnecessary tokenizer loading (a sketch of this pattern follows below).
- `354-356`: Excellent use of the walrus operator for a concise preprocessing check. The walrus operator usage in the conditional assignment makes the code more concise and readable while maintaining clarity about the preprocessing status.
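The signature-inspection idea above can be sketched as follows; the helper name and the `tokenizer` keyword are illustrative assumptions, not the PR's code.

```python
import inspect


def call_transform(transform_fn, cfg, load_tokenizer):
    """Pass a tokenizer to the transform only if its signature asks for one."""
    kwargs = {}
    if "tokenizer" in inspect.signature(transform_fn).parameters:
        kwargs["tokenizer"] = load_tokenizer(cfg)  # loaded lazily, only when needed
    return transform_fn(cfg, **kwargs)
```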
Converting back to draft since there are a few bigger pieces I want to address.
_Force-pushed from `afc8c96` to `669579a`._
Actionable comments posted: 6
♻️ Duplicate comments (3)

src/axolotl/utils/data/rl.py (2)

- `109-111`: 🛠️ Refactor suggestion: Blindly indexing `dataset["train"]` can raise `KeyError`. When a `DatasetDict` lacks a `"train"` split (e.g. only `"validation"`), this line crashes. Fail fast with a clear error or fall back to the first split.

```diff
- if isinstance(dataset, DatasetDict):
-     dataset = dataset["train"]
+ if isinstance(dataset, DatasetDict):
+     if "train" not in dataset:
+         raise ValueError(
+             "Expected a 'train' split in the loaded DatasetDict "
+             f"but found {list(dataset.keys())}"
+         )
+     dataset = dataset["train"]
```

- `219-223`: ⚠️ Potential issue: KTO datasets without an explicit `type` are left un-transformed. The `else` branch assumes the raw dataset already has `prompt`/`completion` pre-tokenized. For `cfg.rl is RLType.KTO` this is almost never true and will surface as schema errors later.

```diff
- else:
-     # If no `type` is provided, assume the dataset is already in the expected format with
-     # "prompt", "chosen", and "rejected" already preprocessed
-     split_datasets[i] = data_set
+ else:
+     if cfg.rl is RLType.KTO:  # fallback to default KTO transform
+         ds_transform_fn = load_kto("user_defined.default", cfg, dataset_idx=i)
+         split_datasets[i] = _map_dataset(cfg, data_set, ds_transform_fn, tokenizer)
+     else:
+         # Assume the dataset is already in final form
+         split_datasets[i] = data_set
```

src/axolotl/utils/data/sft.py (1)

- `457-464`: 🛠️ Refactor suggestion: Validate that `dataset_shard_idx` is within range before sharding. Supplying an out-of-bounds index crashes with a cryptic `IndexError`. Emit a clear validation error instead.

```diff
- if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+ if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+     if not (0 <= cfg.dataset_shard_idx < cfg.dataset_shard_num):
+         raise ValueError(
+             f"dataset_shard_idx ({cfg.dataset_shard_idx}) must be in "
+             f"[0, {cfg.dataset_shard_num - 1}]"
+         )
```
🧹 Nitpick comments (1)

src/axolotl/utils/data/shared.py (1)

- `126-129`: Prefer `contextlib.suppress` for a silent existence probe. Using `try/except: pass` to swallow `FileNotFoundError, ConnectionError` is verbose; `contextlib.suppress` conveys the intent clearly.

```diff
- try:
-     is_cloud_dataset = remote_fs.exists(dataset_config.path)
- except (FileNotFoundError, ConnectionError):
-     pass
+ from contextlib import suppress
+ with suppress(FileNotFoundError, ConnectionError):
+     is_cloud_dataset = remote_fs.exists(dataset_config.path)
```

  🧰 Tools: 🪛 Ruff (0.11.9): 126-129: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (62)

- `src/axolotl/common/const.py` (1 hunks)
- `src/axolotl/common/datasets.py` (5 hunks)
- `src/axolotl/core/builders/causal.py` (1 hunks)
- `src/axolotl/datasets.py` (5 hunks)
- `src/axolotl/loaders/tokenizer.py` (3 hunks)
- `src/axolotl/prompt_strategies/messages/__init__.py` (0 hunks)
- `src/axolotl/prompt_tokenizers.py` (2 hunks)
- `src/axolotl/train.py` (1 hunks)
- `src/axolotl/utils/data/__init__.py` (1 hunks)
- `src/axolotl/utils/data/lock.py` (1 hunks)
- `src/axolotl/utils/data/pretraining.py` (1 hunks)
- `src/axolotl/utils/data/rl.py` (2 hunks)
- `src/axolotl/utils/data/sft.py` (3 hunks)
- `src/axolotl/utils/data/shared.py` (2 hunks)
- `src/axolotl/utils/data/utils.py` (5 hunks)
- `src/axolotl/utils/data/wrappers.py` (1 hunks)
- `src/axolotl/utils/schemas/config.py` (2 hunks)
- `tests/core/test_builders.py` (2 hunks)
- `tests/e2e/integrations/test_cut_cross_entropy.py` (3 hunks)
- `tests/e2e/integrations/test_hooks.py` (1 hunks)
- `tests/e2e/integrations/test_kd.py` (2 hunks)
- `tests/e2e/integrations/test_liger.py` (2 hunks)
- `tests/e2e/integrations/test_llm_compressor.py` (1 hunks)
- `tests/e2e/multigpu/solo/test_grpo.py` (1 hunks)
- `tests/e2e/multigpu/test_locking.py` (1 hunks)
- `tests/e2e/patched/test_4d_multipack_llama.py` (2 hunks)
- `tests/e2e/patched/test_activation_checkpointing.py` (1 hunks)
- `tests/e2e/patched/test_fa_xentropy.py` (1 hunks)
- `tests/e2e/patched/test_falcon_samplepack.py` (2 hunks)
- `tests/e2e/patched/test_fused_llama.py` (1 hunks)
- `tests/e2e/patched/test_llama_s2_attention.py` (2 hunks)
- `tests/e2e/patched/test_lora_llama_multipack.py` (2 hunks)
- `tests/e2e/patched/test_mistral_samplepack.py` (2 hunks)
- `tests/e2e/patched/test_mixtral_samplepack.py` (2 hunks)
- `tests/e2e/patched/test_phi_multipack.py` (2 hunks)
- `tests/e2e/patched/test_resume.py` (1 hunks)
- `tests/e2e/patched/test_unsloth_qlora.py` (3 hunks)
- `tests/e2e/solo/test_flex.py` (1 hunks)
- `tests/e2e/solo/test_relora_llama.py` (1 hunks)
- `tests/e2e/test_deepseekv3.py` (2 hunks)
- `tests/e2e/test_dpo.py` (1 hunks)
- `tests/e2e/test_embeddings_lr.py` (2 hunks)
- `tests/e2e/test_falcon.py` (3 hunks)
- `tests/e2e/test_gemma2.py` (2 hunks)
- `tests/e2e/test_gemma3_text.py` (2 hunks)
- `tests/e2e/test_llama.py` (4 hunks)
- `tests/e2e/test_llama_pretrain.py` (3 hunks)
- `tests/e2e/test_llama_vision.py` (2 hunks)
- `tests/e2e/test_lora_llama.py` (1 hunks)
- `tests/e2e/test_mamba.py` (1 hunks)
- `tests/e2e/test_mistral.py` (2 hunks)
- `tests/e2e/test_mixtral.py` (5 hunks)
- `tests/e2e/test_optimizers.py` (5 hunks)
- `tests/e2e/test_packing_loss.py` (1 hunks)
- `tests/e2e/test_phi.py` (2 hunks)
- `tests/e2e/test_process_reward_model_smollm2.py` (1 hunks)
- `tests/e2e/test_qat.py` (1 hunks)
- `tests/e2e/test_reward_model_smollm2.py` (1 hunks)
- `tests/e2e/test_schedulers.py` (1 hunks)
- `tests/prompt_strategies/test_dpo_chatml.py` (2 hunks)
- `tests/test_datasets.py` (15 hunks)
- `tests/test_exact_deduplication.py` (14 hunks)
💤 Files with no reviewable changes (1)

- `src/axolotl/prompt_strategies/messages/__init__.py`
✅ Files skipped from review due to trivial changes (11)
- tests/e2e/test_dpo.py
- tests/e2e/test_qat.py
- tests/e2e/test_reward_model_smollm2.py
- tests/e2e/patched/test_unsloth_qlora.py
- tests/e2e/test_schedulers.py
- src/axolotl/utils/data/pretraining.py
- tests/e2e/integrations/test_cut_cross_entropy.py
- tests/e2e/test_mixtral.py
- tests/e2e/test_optimizers.py
- tests/e2e/test_process_reward_model_smollm2.py
- tests/e2e/test_llama.py
🚧 Files skipped from review as they are similar to previous changes (46)
- src/axolotl/common/const.py
- tests/e2e/test_falcon.py
- tests/e2e/integrations/test_llm_compressor.py
- tests/e2e/test_lora_llama.py
- tests/e2e/solo/test_flex.py
- tests/e2e/test_phi.py
- tests/e2e/test_llama_pretrain.py
- tests/e2e/solo/test_relora_llama.py
- tests/e2e/test_mamba.py
- tests/e2e/test_embeddings_lr.py
- tests/e2e/patched/test_activation_checkpointing.py
- tests/e2e/test_mistral.py
- tests/e2e/patched/test_fa_xentropy.py
- src/axolotl/core/builders/causal.py
- tests/e2e/patched/test_lora_llama_multipack.py
- tests/e2e/integrations/test_hooks.py
- tests/e2e/patched/test_mixtral_samplepack.py
- tests/e2e/test_gemma3_text.py
- tests/e2e/patched/test_phi_multipack.py
- tests/e2e/test_packing_loss.py
- tests/e2e/test_deepseekv3.py
- tests/e2e/patched/test_mistral_samplepack.py
- tests/e2e/patched/test_falcon_samplepack.py
- tests/e2e/patched/test_resume.py
- tests/e2e/multigpu/solo/test_grpo.py
- tests/e2e/patched/test_4d_multipack_llama.py
- tests/e2e/patched/test_fused_llama.py
- tests/prompt_strategies/test_dpo_chatml.py
- tests/e2e/integrations/test_liger.py
- src/axolotl/loaders/tokenizer.py
- tests/e2e/test_llama_vision.py
- tests/e2e/test_gemma2.py
- tests/e2e/patched/test_llama_s2_attention.py
- src/axolotl/train.py
- tests/e2e/integrations/test_kd.py
- src/axolotl/prompt_tokenizers.py
- src/axolotl/utils/data/lock.py
- tests/core/test_builders.py
- tests/test_exact_deduplication.py
- src/axolotl/common/datasets.py
- src/axolotl/datasets.py
- src/axolotl/utils/data/__init__.py
- tests/test_datasets.py
- src/axolotl/utils/data/utils.py
- src/axolotl/utils/schemas/config.py
- src/axolotl/utils/data/wrappers.py
🧰 Additional context used

🧠 Learnings (1)

src/axolotl/utils/data/sft.py (1)

Learnt from: winglian (PR: axolotl-ai-cloud/axolotl#2707, File: src/axolotl/utils/data/sft.py:247-254, Timestamp: 2025-05-29T22:23:39.312Z): In distributed training scenarios with batch dispatching, placeholder datasets for non-zero ranks may intentionally use temporary files that persist during training. These files are typically very small and don't require explicit cleanup due to their minimal resource impact and specific training requirements.

🧬 Code Graph Analysis (1)

- src/axolotl/utils/data/shared.py (4)
  - src/axolotl/utils/data/utils.py (2): `deduplicate_and_log_datasets` (111-147), `md5` (72-77)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
  - src/axolotl/utils/logging.py (1): `get_logger` (53-62)
  - tests/e2e/multigpu/test_locking.py (1): `cfg` (25-27)
🪛 Ruff (0.11.9)

- src/axolotl/utils/data/shared.py: 126-129: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
- tests/e2e/multigpu/test_locking.py: 107-110: Return the negated condition directly (SIM103)
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
Actionable comments posted: 6
♻️ Duplicate comments (3)

src/axolotl/utils/data/rl.py (2)

- `109-111`: Blind `"train"` access may raise `KeyError`. `DatasetDict` objects are unconditionally indexed with `"train"`. If a dataset only exposes `"validation"`/`"test"` (quite common for RL preference sets), this will crash. Guard the key or fall back gracefully; this was already pointed out in an earlier review.
- `220-223`: KTO datasets with a missing `type` still skip mandatory preprocessing. The `else` branch keeps the raw dataset untouched even when `cfg.rl is RLType.KTO`, so un-tokenized `"prompt"`/`"completion"` pairs slip through. The earlier review suggested routing this branch through `load_kto`; the code is still unchanged.

src/axolotl/utils/data/sft.py (1)

- `457-465`: Shard-index bounds still unchecked. The earlier review noted that `dataset_shard_idx` should be validated against `dataset_shard_num` before calling `.shard()`. The guard is still missing.
🧹 Nitpick comments (6)

tests/e2e/multigpu/test_locking.py (2)

- `90-131`: Avoid patching `Path` with a bare `Mock`; keep the real interface. Replacing `ready_flag_path` with a plain `Mock` means any method other than `exists()` (`touch`, `write_text`, etc.) would raise `AttributeError` if the implementation of `FileLockLoader` changes. Create a real temporary file instead and patch only `Path.exists`:

```diff
- mock_ready_flag_path = Mock()
- ...
- loader.ready_flag_path = mock_ready_flag_path
+ real_flag = loader.ready_flag_path
+
+ # Override Path.exists only
+ original_exists = Path.exists
+ Path.exists = lambda self: mock_exists() if self == real_flag else original_exists(self)
```

  This keeps the object's full `pathlib.Path` API intact and isolates the behaviour you're testing.

- `100-110`: Simplify the conditional chain flagged by Ruff (SIM103). The three branches can be collapsed into a single expression, improving readability and satisfying the linter:

```diff
- if exists_call_count == 1:
-     return True
- if exists_call_count <= 3:
-     return False
- return True
+ return not (1 < exists_call_count <= 3)
```

  🧰 Tools: 🪛 Ruff (0.11.9): 107-110: Return the negated condition directly (SIM103)

src/axolotl/utils/data/rl.py (1)

- `258-269`: Docstring does not match the return type. `_load_or_create_dataset_split` claims to return "Tuple of (dataset, is_preprocessed)" but actually returns just the `Dataset`. Either return the tuple or update the docstring to avoid confusion.

src/axolotl/utils/data/utils.py (1)

- `101-108`: Deduplication loads the entire dataset into Python memory. Iterating row by row (`for idx, row in enumerate(dataset)`) materialises every sample before selecting uniques. On multi-million-row Arrow datasets this blows up memory and negates parquet/arrow efficiency. Consider a vectorised Arrow approach:

```diff
-unique_indices = []
-for idx, row in enumerate(dataset):
-    row_hash = sha256(str(row))
-    if row_hash not in seen_hashes:
-        seen_hashes.add(row_hash)
-        unique_indices.append(idx)
-
-return dataset.select(unique_indices), seen_hashes
+# hash every row in a single pass using pyarrow compute
+import pyarrow.compute as pc
+hashes = pc.hash_sha256(dataset.data)  # vectorised
+mask = ~pc.is_in(hashes, value_set=list(seen_hashes))
+seen_hashes.update(set(hashes.filter(mask).to_pylist()))
+return dataset.filter(mask.to_pylist()), seen_hashes
```

  This keeps the pipeline in C++ and is orders of magnitude faster and more memory-friendly.

src/axolotl/utils/data/shared.py (2)

- `107-114`: Avoid passing `None` (or a list) to `datasets.load_dataset` kwargs. `load_dataset()` rejects unknown/`None` parameters. Building the kwargs dict first and then filtering out unset values keeps the call site clean and prevents subtle `ValueError`s when, e.g., `dataset_config.name` is a list or `None`.

```diff
- load_dataset_kwargs = {
-     "split": dataset_config.split if dataset_config.split else None,
-     "name": dataset_config.name,
-     "streaming": streaming,
-     "trust_remote_code": dataset_config.trust_remote_code,
- }
+ # Filter out parameters that are not set
+ load_dataset_kwargs = {
+     k: v
+     for k, v in {
+         "split": dataset_config.split,
+         "name": dataset_config.name,
+         "streaming": streaming,
+         "trust_remote_code": dataset_config.trust_remote_code,
+     }.items()
+     if v is not None
+ }
```

- `125-130`: Replace empty `except ...: pass` with `contextlib.suppress`. Suppressing exceptions explicitly is clearer, keeps the variable scope clean, and satisfies ruff's SIM105 suggestion.

```diff
- try:
-     is_cloud_dataset = remote_fs.exists(dataset_config.path)
- except (FileNotFoundError, ConnectionError):
-     pass
+ import contextlib
+ with contextlib.suppress(FileNotFoundError, ConnectionError):
+     is_cloud_dataset = remote_fs.exists(dataset_config.path)
```

  🧰 Tools: 🪛 Ruff (0.11.9): 126-129: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (62): the same 62 files listed in the previous review.
💤 Files with no reviewable changes (1)

- `src/axolotl/prompt_strategies/messages/__init__.py`
✅ Files skipped from review due to trivial changes (14)
- tests/e2e/test_dpo.py
- tests/e2e/test_reward_model_smollm2.py
- tests/e2e/test_packing_loss.py
- tests/e2e/patched/test_fa_xentropy.py
- tests/e2e/patched/test_unsloth_qlora.py
- tests/e2e/test_schedulers.py
- tests/e2e/test_falcon.py
- tests/e2e/test_optimizers.py
- tests/e2e/test_mixtral.py
- tests/e2e/solo/test_relora_llama.py
- tests/e2e/patched/test_phi_multipack.py
- src/axolotl/utils/data/lock.py
- tests/e2e/test_process_reward_model_smollm2.py
- src/axolotl/utils/data/wrappers.py
🚧 Files skipped from review as they are similar to previous changes (42)
- src/axolotl/common/const.py
- tests/e2e/test_llama_pretrain.py
- tests/e2e/patched/test_activation_checkpointing.py
- tests/e2e/integrations/test_llm_compressor.py
- tests/e2e/test_mamba.py
- tests/e2e/multigpu/solo/test_grpo.py
- tests/e2e/test_phi.py
- src/axolotl/utils/data/pretraining.py
- tests/e2e/patched/test_falcon_samplepack.py
- tests/e2e/test_embeddings_lr.py
- tests/e2e/patched/test_lora_llama_multipack.py
- tests/e2e/patched/test_fused_llama.py
- tests/e2e/test_gemma2.py
- tests/prompt_strategies/test_dpo_chatml.py
- tests/e2e/test_llama_vision.py
- tests/e2e/patched/test_resume.py
- tests/e2e/test_deepseekv3.py
- tests/e2e/patched/test_mixtral_samplepack.py
- tests/e2e/solo/test_flex.py
- tests/e2e/integrations/test_cut_cross_entropy.py
- tests/e2e/test_mistral.py
- src/axolotl/train.py
- tests/e2e/integrations/test_liger.py
- tests/e2e/test_qat.py
- tests/e2e/test_llama.py
- tests/e2e/patched/test_mistral_samplepack.py
- tests/e2e/integrations/test_hooks.py
- tests/e2e/test_lora_llama.py
- tests/e2e/patched/test_4d_multipack_llama.py
- tests/e2e/integrations/test_kd.py
- tests/e2e/patched/test_llama_s2_attention.py
- src/axolotl/loaders/tokenizer.py
- tests/core/test_builders.py
- tests/e2e/test_gemma3_text.py
- tests/test_exact_deduplication.py
- src/axolotl/prompt_tokenizers.py
- src/axolotl/utils/schemas/config.py
- src/axolotl/core/builders/causal.py
- tests/test_datasets.py
- src/axolotl/utils/data/__init__.py
- src/axolotl/datasets.py
- src/axolotl/common/datasets.py
🧰 Additional context used

🧠 Learnings (1): src/axolotl/utils/data/sft.py (1): the same winglian learning (sft.py:247-254, persistent placeholder-dataset temp files for non-zero ranks) recorded in the previous review.
🧬 Code Graph Analysis (3)

- src/axolotl/utils/data/utils.py (3)
  - tests/test_exact_deduplication.py (1): `cfg` (201-215)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
  - src/axolotl/utils/samplers/utils.py (1): `get_dataset_lengths` (8-21)
- src/axolotl/utils/data/shared.py (4)
  - src/axolotl/utils/data/utils.py (2): `deduplicate_and_log_datasets` (111-147), `md5` (72-77)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
  - src/axolotl/utils/logging.py (1): `get_logger` (53-62)
  - tests/e2e/multigpu/test_locking.py (1): `cfg` (25-27)
- src/axolotl/utils/data/sft.py (11)
  - src/axolotl/prompters.py (1): `Prompter` (26-29)
  - src/axolotl/utils/data/lock.py (3): `FileLockLoader` (17-66), `load` (33-44), `cleanup` (54-66)
  - src/axolotl/utils/data/pretraining.py (1): `wrap_pretraining_dataset` (179-239)
  - src/axolotl/utils/data/shared.py (8): `create_train_validation_split` (364-393), `datasets_with_name_generator` (59-89), `generate_dataset_hash_from_config` (494-513), `load_dataset_with_config` (92-147), `load_preprocessed_dataset` (443-471), `merge_datasets` (516-538), `save_preprocessed_dataset` (405-440), `try_load_from_hub` (474-491)
  - src/axolotl/utils/data/utils.py (3): `deduplicate_and_log_datasets` (111-147), `drop_long_seq_in_dataset` (150-202), `retry_on_request_exceptions` (31-69)
  - src/axolotl/utils/data/wrappers.py (1): `get_dataset_wrapper` (58-132)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
  - src/axolotl/utils/distributed.py (1): `is_local_main_process` (90-93)
  - src/axolotl/prompt_strategies/__init__.py (1): `load` (12-50)
  - src/axolotl/prompt_strategies/stepwise_supervised.py (1): `load` (105-116)
  - src/axolotl/utils/trainer.py (2): `calculate_total_num_steps` (393-510), `process_datasets_for_packing` (238-369)
🪛 Ruff (0.11.9)

- src/axolotl/utils/data/shared.py: 126-129: Use `contextlib.suppress(FileNotFoundError, ConnectionError)` instead of `try`-`except`-`pass` (SIM105)
- tests/e2e/multigpu/test_locking.py: 107-110: Return the negated condition directly (SIM103)
⏰ Context from checks skipped due to timeout of 90000ms (4)
- GitHub Check: docker-e2e-tests-1st (124, 12.4.1, 3.11, 2.6.0, 1, vllm)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (126, 12.6.3, 3.11, 2.7.0, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
🔇 Additional comments (2)

tests/e2e/multigpu/test_locking.py (1)

- `184-196`: Use `autospec=True` when patching `FileLock` to prevent false positives. Patching without `autospec` replaces `FileLock` with a generic `MagicMock`, so calls like `.acquire()` or attribute access that don't exist on the original object will silently succeed, masking breakages.

```diff
-with patch("axolotl.utils.data.lock.FileLock") as mock_filelock:
+with patch("axolotl.utils.data.lock.FileLock", autospec=True) as mock_filelock:
```

  This binds the mock to the real signature and attributes of `FileLock`, giving you stricter, more trustworthy tests.

src/axolotl/utils/data/shared.py (1)

- `274-288`: `isfile`/`isdir` aren't consistently implemented across fsspec filesystems. `AzureBlobFileSystem` (and some others) only implement `exists`/`info`. Depending on the backend this branch can raise `AttributeError`, breaking cloud loads. Consider falling back to `remote_fs.exists()` when the attribute is missing:

```python
if getattr(remote_fs, "isdir", None) and remote_fs.isdir(dataset_config.path):
    ...
elif getattr(remote_fs, "isfile", None) and remote_fs.isfile(dataset_config.path):
    ...
elif remote_fs.exists(dataset_config.path):
    # treat as file by default
    ...
```
Actionable comments posted: 2
♻️ Duplicate comments (5)

src/axolotl/utils/data/rl.py (2)

- `111-113`: ⚠️ Potential issue: Guard against a missing `"train"` split before indexing `DatasetDict`. Blindly doing `dataset = dataset["train"]` will raise a `KeyError` when the prepared dataset contains only `"validation"`/`"test"` or custom split names. Add an existence check (or a fallback plus an explicit error) before indexing.

```diff
- if isinstance(dataset, DatasetDict):
-     dataset = dataset["train"]
+ if isinstance(dataset, DatasetDict):
+     if "train" not in dataset:
+         raise ValueError("Expected 'train' split in DatasetDict")
+     dataset = dataset["train"]
```

- `203-205`: 🛠️ Refactor suggestion: Indexing bug; the config list can be shorter than `split_datasets`. `datasets_with_name_generator()` may expand a single config into multiple datasets (multiple names or preprocessing shards). Using `datasets_configs[i]` assumes a 1-to-1 correspondence and will misalign or raise `IndexError`. Iterate over `(dataset, dataset_cfg)` pairs instead:

```diff
-for i, data_set in enumerate(split_datasets):
-    _type = datasets_configs[i]["type"]
+for data_set, ds_cfg in zip(split_datasets, datasets_with_name_generator(datasets_configs)):
+    _type = ds_cfg["type"]
```

src/axolotl/utils/data/sft.py (2)

- `498-500`: 🛠️ Refactor suggestion: `_apply_dataset_sharding` still fails when a `DatasetDict` is returned. As previously raised: when a prepared dataset pushed to the hub contains multiple splits, the object coming back is a `DatasetDict`. Passing it straight into `.shard()` crashes. Handle `DatasetDict` (e.g. pick the requested split) inside `_apply_dataset_sharding`.
- `459-466`: 🛠️ Refactor suggestion: Validate that the shard index is within bounds before calling `.shard()`. Supplying an out-of-range `dataset_shard_idx` will raise deep inside `datasets.Dataset.shard`, giving users a cryptic error. Validate early and fail fast:

```diff
- if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+ if cfg.dataset_shard_num and cfg.dataset_shard_idx is not None:
+     if not 0 <= cfg.dataset_shard_idx < cfg.dataset_shard_num:
+         raise ValueError(
+             f"dataset_shard_idx ({cfg.dataset_shard_idx}) "
+             f"must be within [0, {cfg.dataset_shard_num - 1}]"
+         )
```

tests/e2e/multigpu/test_locking.py (1)

- `63-70`: Race condition around `call_count` is still unfixed. `call_count += 1` is executed in three concurrent threads without any synchronisation. A `LOAD`/`INPLACE_ADD`/`STORE` bytecode triplet is not atomic; increments can be lost, yielding duplicate `"data_<n>"` values and flaky assertions. The same issue was pointed out in an earlier review and remains unresolved. Guard the counter with a `threading.Lock()` or switch to `itertools.count()`, as in the sketch below.
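As referenced above, a minimal sketch of the two suggested fixes (the variable names are assumptions about the test's internals):

```python
import itertools
import threading

# Option 1: guard the shared counter with a lock.
call_count = 0
count_lock = threading.Lock()


def next_name_locked() -> str:
    global call_count
    with count_lock:
        call_count += 1
        return f"data_{call_count}"


# Option 2: itertools.count(); advancing it with next() is effectively
# atomic in CPython, so no explicit lock is needed.
counter = itertools.count(1)


def next_name_atomic() -> str:
    return f"data_{next(counter)}"
```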
🧹 Nitpick comments (2)

src/axolotl/utils/data/rl.py (1)

- `269-272`: Docstring out of sync with the return value. The function returns a `Dataset`, not a `Tuple`. Update the docstring to prevent confusion.

tests/e2e/multigpu/test_locking.py (1)

- `103-106`: Micro-simplification opportunity. The two consecutive `if`/`return` branches can be collapsed by returning the negated condition directly, improving clarity:

```diff
- if exists_call_count <= 3:
-     return False
- return True
+ return exists_call_count > 3
```

  🧰 Tools: 🪛 Ruff (0.11.9): 103-106: Return the negated condition directly (SIM103)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)

- `src/axolotl/utils/data/rl.py` (2 hunks)
- `src/axolotl/utils/data/sft.py` (3 hunks)
- `tests/e2e/multigpu/test_locking.py` (1 hunks)
🧰 Additional context used

🧠 Learnings (1): src/axolotl/utils/data/sft.py (1): the same winglian learning (sft.py:247-254, persistent placeholder-dataset temp files for non-zero ranks) recorded in the earlier reviews.
🧬 Code Graph Analysis (2)

- tests/e2e/multigpu/test_locking.py (2)
  - src/axolotl/utils/data/lock.py (3): `FileLockLoader` (17-66), `load` (33-44), `cleanup` (54-66)
  - src/axolotl/utils/dict.py (1): `DictDefault` (6-38)
- src/axolotl/utils/data/sft.py (9)
  - src/axolotl/prompters.py (1): `Prompter` (26-29)
  - src/axolotl/utils/data/lock.py (3): `FileLockLoader` (17-66), `load` (33-44), `cleanup` (54-66)
  - src/axolotl/utils/data/pretraining.py (1): `wrap_pretraining_dataset` (179-239)
  - src/axolotl/utils/data/shared.py (6): `create_train_validation_split` (364-393), `datasets_with_name_generator` (59-89), `generate_dataset_hash_from_config` (494-513), `load_preprocessed_dataset` (443-471), `merge_datasets` (516-538), `try_load_from_hub` (474-491)
  - src/axolotl/utils/data/utils.py (3): `deduplicate_and_log_datasets` (111-147), `drop_long_seq_in_dataset` (150-202), `retry_on_request_exceptions` (31-69)
  - src/axolotl/utils/data/wrappers.py (1): `get_dataset_wrapper` (58-132)
  - src/axolotl/utils/distributed.py (1): `is_local_main_process` (90-93)
  - src/axolotl/prompt_strategies/__init__.py (1): `load` (12-50)
  - src/axolotl/utils/trainer.py (2): `calculate_total_num_steps` (393-510), `process_datasets_for_packing` (238-369)
🪛 ast-grep (0.38.1)

- tests/e2e/multigpu/test_locking.py: [warning] 92-92: The function `mktemp` is deprecated. When using this function, it is possible for an attacker to modify the created file before the filename is returned. Use `NamedTemporaryFile()` instead and pass it the `delete=False` parameter. Context: `tempfile.mktemp()`. Note: [CWE-377] Insecure Temporary File; [OWASP A01:2021] Broken Access Control. References: https://docs.python.org/3/library/tempfile.html#tempfile.mktemp, https://owasp.org/Top10/A01_2021-Broken_Access_Control (avoid-mktemp-python)

🪛 Ruff (0.11.9)

- tests/e2e/multigpu/test_locking.py: 103-106: Return the negated condition directly (SIM103)
⏰ Context from checks skipped due to timeout of 90000ms (9)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.6.0, vllm, 2, true)
- GitHub Check: test-axolotl-multigpu (124, 12.4.1, 3.11, 2.5.1, 2, true)
- GitHub Check: PyTest from Source Dist (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.6.0)
- GitHub Check: PyTest (3.11, 2.7.0)
- GitHub Check: PyTest from Source Dist (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.5.1)
- GitHub Check: PyTest (3.11, 2.6.0)
- GitHub Check: pre-commit
Description

Data loading refactor, with an emphasis on `sft.py`, `rl.py`, and related modules.

Motivation and Context

The current state of data loading involves a lot of misdirection, undocumented code, missing typing, etc. This refactor aims to clean things up to improve readability and extensibility.

Also closes #2684 via the filelock implementation (credit to @casper-hansen for reference code).
How has this been tested?
TODO