Fix MLFlow Save Model for TE #1353

Merged
merged 2 commits into main from chuck/fix_te_mlflow_save on Jul 12, 2024

Conversation

@j316chuck (Contributor) commented Jul 12, 2024

Description

Fixes MLflow model saving when training with TransformerEngine (TE) FP8; saving previously crashed inside huggingface_hub's state-dict sharding (traceback below).

Error log from the failing runs:

[rank0]: │ /usr/lib/python3/dist-packages/llmfoundry/callbacks/hf_checkpointer.py:589   │
[rank0]: │ in _save_checkpoint                                                          │
[rank0]: │                                                                              │
[rank0]: │   586 │   │   │   │   │   │   model_saving_kwargs['transformers_model'] = co │
[rank0]: │   587 │   │   │   │   │   │   model_saving_kwargs.update(self.mlflow_logging │
[rank0]: │   588 │   │   │   │   │                                                      │
[rank0]: │ ❱ 589 │   │   │   │   │   mlflow_logger.save_model(**model_saving_kwargs)    │
[rank0]: │   590 │   │   │   │   │                                                      │
[rank0]: │   591 │   │   │   │   │   # Upload the license file generated by mlflow duri │
[rank0]: │   592 │   │   │   │   │   license_filename = _maybe_get_license_filename(    │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/composer/loggers/mlflow_logger.py:382 in      │
[rank0]: │ save_model                                                                   │
[rank0]: │                                                                              │
...
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/huggingface_hub/serialization/_base.py:103 in │
[rank0]: │ split_state_dict_into_shards_factory                                         │
[rank0]: │                                                                              │
[rank0]: │   100 │   │   │   continue                                                   │
[rank0]: │   101 │   │                                                                  │
[rank0]: │   102 │   │   # If a `tensor` shares the same underlying storage as another  │
[rank0]: │ ❱ 103 │   │   storage_id = get_storage_id(tensor)                            │
[rank0]: │   104 │   │   if storage_id is not None:                                     │
[rank0]: │   105 │   │   │   if storage_id in storage_id_to_tensors:                    │
[rank0]: │   106 │   │   │   │   # We skip this tensor for now and will reassign to cor │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/huggingface_hub/serialization/_torch.py:106   │
[rank0]: │ in get_storage_id                                                            │
[rank0]: │                                                                              │
[rank0]: │   103 │                                                                      │
[rank0]: │   104 │   Taken from https://github.com/huggingface/transformers/blob/1ecf5f │
[rank0]: │   105 │   """                                                                │
[rank0]: │ ❱ 106 │   if tensor.device.type == "xla" and is_torch_tpu_available():       │
[rank0]: │   107 │   │   # NOTE: xla tensors dont have storage                          │
[rank0]: │   108 │   │   # use some other unique id to distinguish.                     │
[rank0]: │   109 │   │   # this is a XLA tensor, it must be created using torch_xla's   │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────╯
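
The traceback shows the full failure path: llm-foundry's hf_checkpointer hands the model to Composer's mlflow_logger.save_model, which ends up in huggingface_hub's split_state_dict_into_shards_factory; that helper calls get_storage_id(tensor) on every entry to deduplicate tensors that share storage, and this is where TE FP8 checkpoints blow up. As a rough illustration of the fix direction (a minimal sketch, not this PR's actual diff; the helper name to_plain_state_dict is hypothetical), one way to sidestep the crash is to re-materialize every parameter as an ordinary torch.Tensor before the save path runs:

import torch

def to_plain_state_dict(model: torch.nn.Module) -> dict:
    # Hypothetical helper, not from the PR: tensor subclasses such as
    # TransformerEngine's FP8 wrappers may not expose the standard
    # storage that huggingface_hub's get_storage_id() expects, so each
    # entry is cloned into a fresh, contiguous torch.Tensor first.
    plain = {}
    for name, tensor in model.state_dict().items():
        # detach().clone().contiguous() produces a tensor backed by its
        # own standard storage; for plain tensors this is just a copy.
        plain[name] = tensor.detach().clone().contiguous()
    return plain

For a plain PyTorch model this is a no-op copy, so the sketch is safe to run unconditionally; the cost is one extra materialization of the state dict.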

Tests

  • Before: te-fp8-llama-magicoder-4ep-sTHxj7 🔴 (failed with the traceback above)
  • After: te-fp8-llama-magicoder-4ep-6sfMZh (run with the fix applied)

@j316chuck j316chuck requested a review from a team as a code owner July 12, 2024 05:30
@j316chuck j316chuck enabled auto-merge (squash) July 12, 2024 05:35
@j316chuck j316chuck merged commit 502eb12 into main Jul 12, 2024
9 checks passed
@dakinggg dakinggg deleted the chuck/fix_te_mlflow_save branch August 6, 2024 18:41