Fix MLFlow Save Model for TE #1353

Merged
merged 2 commits into main from chuck/fix_te_mlflow_save on Jul 12, 2024

Conversation

@j316chuck (Contributor) commented Jul 12, 2024

Description

Fixes MLflow model saving when training with TransformerEngine (TE) FP8; saving previously crashed inside huggingface_hub's state-dict sharding (traceback below).

Error log from the failing runs:

[rank0]: │ /usr/lib/python3/dist-packages/llmfoundry/callbacks/hf_checkpointer.py:589   │
[rank0]: │ in _save_checkpoint                                                          │
[rank0]: │                                                                              │
[rank0]: │   586 │   │   │   │   │   │   model_saving_kwargs['transformers_model'] = co │
[rank0]: │   587 │   │   │   │   │   │   model_saving_kwargs.update(self.mlflow_logging │
[rank0]: │   588 │   │   │   │   │                                                      │
[rank0]: │ ❱ 589 │   │   │   │   │   mlflow_logger.save_model(**model_saving_kwargs)    │
[rank0]: │   590 │   │   │   │   │                                                      │
[rank0]: │   591 │   │   │   │   │   # Upload the license file generated by mlflow duri │
[rank0]: │   592 │   │   │   │   │   license_filename = _maybe_get_license_filename(    │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/composer/loggers/mlflow_logger.py:382 in      │
[rank0]: │ save_model                                                                   │
[rank0]: │                                                                              │
...
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/huggingface_hub/serialization/_base.py:103 in │
[rank0]: │ split_state_dict_into_shards_factory                                         │
[rank0]: │                                                                              │
[rank0]: │   100 │   │   │   continue                                                   │
[rank0]: │   101 │   │                                                                  │
[rank0]: │   102 │   │   # If a `tensor` shares the same underlying storage as another  │
[rank0]: │ ❱ 103 │   │   storage_id = get_storage_id(tensor)                            │
[rank0]: │   104 │   │   if storage_id is not None:                                     │
[rank0]: │   105 │   │   │   if storage_id in storage_id_to_tensors:                    │
[rank0]: │   106 │   │   │   │   # We skip this tensor for now and will reassign to cor │
[rank0]: │                                                                              │
[rank0]: │ /usr/lib/python3/dist-packages/huggingface_hub/serialization/_torch.py:106   │
[rank0]: │ in get_storage_id                                                            │
[rank0]: │                                                                              │
[rank0]: │   103 │                                                                      │
[rank0]: │   104 │   Taken from https://github.com/huggingface/transformers/blob/1ecf5f │
[rank0]: │   105 │   """                                                                │
[rank0]: │ ❱ 106 │   if tensor.device.type == "xla" and is_torch_tpu_available():       │
[rank0]: │   107 │   │   # NOTE: xla tensors dont have storage                          │
[rank0]: │   108 │   │   # use some other unique id to distinguish.                     │
[rank0]: │   109 │   │   # this is a XLA tensor, it must be created using torch_xla's   │
[rank0]: ╰──────────────────────────────────────────────────────────────────────────────╯
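
The traceback shows the full failure path: llm-foundry's hf_checkpointer hands the model to Composer's mlflow_logger.save_model, which ends up in huggingface_hub's split_state_dict_into_shards_factory; that helper calls get_storage_id(tensor) on every entry to deduplicate tensors that share storage, and this is where TE FP8 checkpoints blow up. As a rough illustration of the fix direction (a minimal sketch, not this PR's actual diff; the helper name to_plain_state_dict is hypothetical), one way to sidestep the crash is to re-materialize every parameter as an ordinary torch.Tensor before the save path runs:

import torch

def to_plain_state_dict(model: torch.nn.Module) -> dict:
    # Hypothetical helper, not from the PR: tensor subclasses such as
    # TransformerEngine's FP8 wrappers may not expose the standard
    # storage that huggingface_hub's get_storage_id() expects, so each
    # entry is cloned into a fresh, contiguous torch.Tensor first.
    plain = {}
    for name, tensor in model.state_dict().items():
        # detach().clone().contiguous() produces a tensor backed by its
        # own standard storage; for plain tensors this is just a copy.
        plain[name] = tensor.detach().clone().contiguous()
    return plain

For a plain PyTorch model this is a no-op copy, so the sketch is safe to run unconditionally; the cost is one extra materialization of the state dict.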

Tests

  • Before: te-fp8-llama-magicoder-4ep-sTHxj7 🔴 (failed with the traceback above)
  • After: te-fp8-llama-magicoder-4ep-6sfMZh (run with the fix applied)

@j316chuck j316chuck requested a review from a team as a code owner July 12, 2024 05:30
@j316chuck j316chuck enabled auto-merge (squash) July 12, 2024 05:35
@j316chuck j316chuck merged commit 502eb12 into main Jul 12, 2024
9 checks passed
@dakinggg dakinggg deleted the chuck/fix_te_mlflow_save branch August 6, 2024 18:41