GENERATE V2 HOT DOG #1563
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1563
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 026107d with merge base b4fea32.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@@ -252,6 +252,17 @@ class Recipe:
        ],
        supports_distributed=False,
    ),
    Recipe(
        name="dev/generate_v2",
Do we wanna call it "dev/generate_v2"? We could do what we did for FSDP2 and just call it "generate_v2", which is snappier from the CLI.
@pbontrager yelled at me for that
I didn't want to call the recipe generate_v2 once it's moved out of dev. But I'm fine with not including dev in the CLI name.
could move the old recipe into dev as /generate_old
Yeah I would just remove dev from the CLI name
recipes/dev/generate_v2.py
Outdated
self._logger.info(
    f"Memory used: {torch.cuda.max_memory_allocated() / 1e9:.02f} GB"
)
use our friendly util for this?
Overkill
recipes/dev/generate_v2.py
Outdated
class SingleTurnYAMLToMessages(Transform):
    """
    Converts a single turn conversation in YAML format to a list of messages.
    """
Just curious, why does it have to be in YAML format? Like why not just call .to_dict or whatever so this isn't as tied to configs?
Separately, is this the long-term home for this class or should it go in data?
Recipes ARE tied to configs though. So, this recipe needs a way to translate YAML to Message.
The core of the recipe can be ripped out / modified in any way now that things are in messages. Eventually, this should go in data, but not until we get some usage.
By YAML you mean an omegaconf DictConfig, right? That's why I'm suggesting to just use .to_dict cause then we get a primitive type straightaway (I think). But anyways I don't disagree with your points here
I'm being dumb but I'm not sure I understand. Can you show a code sample?
I mean in practice it's basically the same thing since the signature in SingleTurnYAMLToMessages already takes Dict[str, Any]. But really you are doing self.to_messages(cfg.prompt) and (I think) cfg.prompt is an OmegaConf.DictConfig. So might be better to just do self.to_messages(OmegaConf.to_container(cfg.prompt)) so that SingleTurnYAMLToMessages is actually just taking a vanilla dict rather than some OmegaConf class.
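A quick sketch of that suggestion, assuming cfg.prompt is an omegaconf DictConfig as discussed above (the helper name here is made up for illustration, not part of the PR):

```python
from typing import Any, Dict

from omegaconf import DictConfig, OmegaConf


def prompt_to_dict(prompt: DictConfig) -> Dict[str, Any]:
    # Convert the OmegaConf node into a vanilla dict so downstream code
    # (e.g. SingleTurnYAMLToMessages) never sees an OmegaConf type.
    return OmegaConf.to_container(prompt, resolve=True)


# Hypothetical usage inside the recipe:
#   messages = self.to_messages(prompt_to_dict(cfg.prompt))
```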
recipes/dev/generate_v2.py
Outdated
    return messages


def batch_to_device(batch: dict, device: torch.device) -> None:
Put in training/ or something?
Yeah @pbontrager is going to add this in his PR. This code will be deleted here.
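For reference, a minimal sketch of what a helper with this signature typically does; this is an assumption about the intended behavior, not the version that will land in training/:

```python
import torch


def batch_to_device(batch: dict, device: torch.device) -> None:
    """Move all tensor values in a (possibly nested) batch dict to `device`, in place."""
    for key, value in batch.items():
        if isinstance(value, torch.Tensor):
            batch[key] = value.to(device)
        elif isinstance(value, dict):
            batch_to_device(value, device)
```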
recipes/dev/generate_v2.py
Outdated
logits = self.model(**batch)[:, -1]
token = sample(logits, cfg.temperature, cfg.top_k)
generated_tokens.append(token.item())

if is_multimodal_input:
    cache_mask = {"encoder_mask": batch["encoder_mask"][:, -1:]}
shape comments here would be nice
Also dumb q: how should I think about batch["encoder_mask"][:, -1:]? Like is this just some kind of stand-in for "here are all the text tokens seen by the final image"? And why do we not need to update it as we generate more text tokens? I know we aggressively pad in the image dimension (I think in the collate function?), but didn't think we were doing this in the text dimension.
cc @pbontrager
This isn't a pretty solution, but essentially what's happening is that during decoding your attention shapes are [1 x kv_cache_size], where 1 is the token being decoded. Inside TransformerDecoder we take a single slice of a causal mask for our self-attention masks. For cross attention, the mask is static during decoding, and the assumption being made here is that the decoding tokens have the same mask as the final token of prefill, i.e. the decoded tokens can see whatever images the previous token could.
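To make the shapes concrete, here is a toy illustration (not recipe code; the real mask's image-token dimension is set by the vision encoder):

```python
import torch

# Toy cross-attention mask: [batch, text_len, num_image_tokens]
batch_size, text_len, num_image_tokens = 1, 6, 4
encoder_mask = torch.ones(batch_size, text_len, num_image_tokens, dtype=torch.bool)

# During decoding, every new token reuses the mask row of the final prefill
# position: it can attend to whatever image tokens the last prompt token could.
decode_mask = encoder_mask[:, -1:]
print(decode_mask.shape)  # torch.Size([1, 1, 4])
```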
It's a beautiful recipe, much more transparent than our existing one. Add tests 😃
recipes/dev/generate_v2.py
Outdated
generated_tokens = []
t0 = time.perf_counter()
logits = self.model(**batch)[:, -1]
token = sample(logits, cfg.temperature, cfg.top_k)
generated_tokens.append(token.item())

if is_multimodal_input:
    cache_mask = {"encoder_mask": batch["encoder_mask"][:, -1:]}
else:
    cache_mask = {}

# 5. Continue generating
for _ in range(cfg.max_new_tokens):
    if token.item() in self.model_transform.stop_tokens:
        break
    logits = self.model(token, **cache_mask)[:, -1]
    token = sample(logits, cfg.temperature, cfg.top_k)
    generated_tokens.append(token.item())
t = time.perf_counter() - t0
Is this going to be replaced with a call to generation.generate? If so, does that need to be updated to support encoder masks?
My thinking is to actually leave this logic in the recipe as it increases visibility into how we actually do generate and allows users to hack around specific parts. Eventually, I'd like to deprecate the generation.generate for use in the generation recipe. We can still use it for DPO.
Then, we can expand our generation utils to easily support more versions of sampling or maybe even beam search decoding.
I guess generation.generate would also be used for evals?
When I put up the PR, I'd actually love to have a discussion on this. I'm honestly not sure of the best way to balance code readability + abstractions for this kind of thing right now.
I think this is definitely the way forward for a generation recipe just used to test out your finetuned models; we can benefit from this simplicity by not needing any funky masking/padding logic. It's precise and very easy to understand.
I think it's mostly okay to sacrifice this simplicity when we need to cover broader use-cases and maximise performance, like in PPO/evals/agentic etc. How easy is it to understand all the compile integrations in the codebase? Not very. The best we can do imo is surface as much of the generation logic as we can so users can reference a single file for all the chunkier generation functionality.
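For readers following the decoding loop quoted above, here is roughly what a temperature + top-k sample step does. This is a generic sketch, not torchtune's actual sample implementation:

```python
from typing import Optional

import torch


def sample(logits: torch.Tensor, temperature: float = 1.0, top_k: Optional[int] = None) -> torch.Tensor:
    """Sample one token id per row from [batch, vocab] logits."""
    logits = logits / max(temperature, 1e-5)
    if top_k is not None:
        # Keep only the top_k logits per row; everything else gets -inf.
        topk_vals, _ = torch.topk(logits, top_k, dim=-1)
        logits = torch.where(
            logits < topk_vals[:, -1:], torch.full_like(logits, float("-inf")), logits
        )
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # shape [batch, 1]
```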
from torchtune.modules.transforms import Transform


class SingleTurnYAMLToMessages(Transform):
Should this end up somewhere like data/_utils.py?
Yep, eventually will move it there.
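As a rough picture of what the transform does, here is a simplified text-only sketch, assuming torchtune.data.Message takes role and content; the actual class in this PR also handles the image entry shown in the configs below:

```python
from typing import Any, Dict, List

from torchtune.data import Message


class SingleTurnYAMLToMessages:
    """Turn a {"system": ..., "user": ...} prompt dict from the config into Messages."""

    def __call__(self, prompt: Dict[str, Any]) -> List[Message]:
        messages = []
        if "system" in prompt:
            messages.append(Message(role="system", content=prompt["system"]))
        # Text-only case; the real transform also prepends an image when one is given.
        messages.append(Message(role="user", content=prompt["user"]))
        return messages
```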
This looks really clean. Generation has felt like a bit of a second-class citizen in torchtune, and I think this pushes us in the right direction in setting the bar for clean + hackable code across all our recipes. I really like how transparent it makes going from MM prompt -> message; this is probably the most accessible example we have in our codebase. I don't have a huge amount to add on top of Evan's comments. What's the minimum feature set you'd like this to support before it lands?
Minimum feature set is what I have up now. Ideal state would have compile, as well. Quantization will definitely come later.
# Generation arguments
prompt:
  system: You are a helpful assistant who responds like the author Shakespeare.
  user:
Does this support multiple image/text entries or just prepend image?
prepend
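For concreteness, the multimodal prompt from the config dump later in this thread ends up as a plain dict like the one below, with the image prepended to the user text rather than interleaved (sketch only):

```python
prompt = {
    "system": "You are a helpful assistant who responds like the author Shakespeare.",
    "user": {
        "image": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
        "text": "What is in this image?",
    },
}
```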
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@           Coverage Diff            @@
##             main    #1563    +/-   ##
==========================================
- Coverage   71.11%   71.08%   -0.03%
==========================================
  Files         297      298       +1
  Lines       15120    15165      +45
==========================================
+ Hits        10752    10780      +28
- Misses       4368     4385      +17

Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry.
How is the prompt template applied here if we are not calling
We are! The tokenizer / transform call method uses
self.to_messages = SingleTurnYAMLToMessages()

@torch.inference_mode()
def generate(self, cfg: DictConfig):
I think this looks pretty neat!
model:
  _component_: torchtune.models.llama2.llama2_7b

# Transform arguments
transform:
  _component_: torchtune.models.llama2.llama2_tokenizer
  path: /tmp/Llama-2-7b-chat-hf/tokenizer.model
Can we update the defaults to something newer than Llama2?
This is just for an example + the test.
The real default is coming soon.
In reality, I want to move generation to under each model.
self.model_transform = config.instantiate(cfg.transform)
self.to_messages = SingleTurnYAMLToMessages()

def log_metrics(self, total_time: int, tokens_per_second: float) -> None:
very nice very hackable
hot dog
Is there any reason why this wouldn't work with compile? edit: maybe because I broke something? Currently working this out.
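Not part of this PR, but as a sketch of where compile could slot in (the config flag name is hypothetical):

```python
import torch
import torch.nn as nn


def maybe_compile(model: nn.Module, enable: bool) -> nn.Module:
    # Compile the decoder once after checkpoint load, guarded by a config flag
    # such as cfg.compile (assumed name).
    return torch.compile(model) if enable else model
```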
@ebsmothers Why? I think it makes sense b/c it's buried in dev. Lots of our users use the recipes from the git clone installation.
Discussed the rationale in this comment. Our FSDP2 recipe was just
Finally, a generate recipe that doesn't make me wanna eat fire ants.
Output from generate with Llama2
(joe-torchtune-2) [[email protected] ~/projects/joe-torchtune (b4fea32)]$ tune run dev/generate_v2 --config llama2/generation_v2
2024-09-23:19:51:51,843 INFO [_logging.py:101] Running InferenceRecipe with resolved config:

checkpointer:
  _component_: torchtune.training.FullModelHFCheckpointer
  checkpoint_dir: /tmp/Llama-2-7b-chat-hf
  checkpoint_files:
  model_type: LLAMA2
  output_dir: ./
device: cuda
dtype: bf16
log_level: INFO
max_new_tokens: 200
model:
  _component_: torchtune.models.llama2.llama2_7b
prompt:
  system: You are a helpful and creative AI assistant.
  user: What is the capital of France?
seed: 1234
temperature: 0.6
top_k: 300
transform:
  _component_: torchtune.models.llama2.llama2_tokenizer
  max_seq_len: 2048
  path: /tmp/Llama-2-7b-chat-hf/tokenizer.model
2024-09-23:19:52:01,241 INFO [generate_v2.py:90] Model was initialized with precision torch.bfloat16.
2024-09-23:19:52:09,680 INFO [generate_v2.py:201]
Oh, how delightful! adjusts glasses The capital of France is... drumroll Paris! 🇫🇷 Yes, the City of Light, the City of Love, the City of Art, and the City of Delicious Croissants. 🥐 Is there anything else I can help you with? 😊
2024-09-23:19:52:09,684 INFO [generate_v2.py:108] Time for inference: 7.57 sec total, 11.09 tokens/sec
2024-09-23:19:52:09,684 INFO [generate_v2.py:111] Bandwidth achieved: 151.59 GB/s
2024-09-23:19:52:09,684 INFO [generate_v2.py:114] Max memory allocated: 13.95 GB
Output from a MM model
(joe-torchtune-2) [[email protected] ~/projects/(fe6d3ad0d)]$ tune run generate.py --config multimodal_generation.yaml
2024-09-23:19:36:31,579 INFO [_logging.py:101] Running InferenceRecipe with resolved config:

checkpointer:
  _component_: checkpointer.FullModelMetaCheckpointer
  checkpoint_dir: ../joe-torchtune
  checkpoint_files:
  model_type:
  output_dir: ./
device: cuda
dtype: bf16
log_level: INFO
max_new_tokens: 200
model:
  _component_:
prompt:
  system: You are a helpful assistant who responds like the author Shakespeare.
  user:
    image: https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg
    text: What is in this image?
seed: 1234
temperature: 0.6
top_k: 300
transform:
  _component_:
  max_seq_len: 8192
  tokenizer_path: ../joe-torchtune/tokenizer.model
2024-09-23:19:36:37,868 INFO [generate.py:90] Model was initialized with precision torch.bfloat16.
2024-09-23:19:36:48,979 INFO [generate.py:201]
Fair patron, thou dost behold a wondrous sight, a monument of liberty, a beacon of hope, that doth stand tall on Liberty Island, in the midst of the Hudson River's flow. 'Tis the Statue of Liberty, a gift from France, a symbol of freedom's call, that doth welcome all, who seek a new life, in this fair land of America's thrall.
2024-09-23:19:36:48,987 INFO [generate.py:108] Time for inference: 9.75 sec total, 8.51 tokens/sec
2024-09-23:19:36:48,987 INFO [generate.py:111] Bandwidth achieved: 187.46 GB/s
2024-09-23:19:36:48,987 INFO [generate.py:114] Max memory allocated: 22.32 GB
Limitations: