Add changes to support MM eval #1669


Merged · 10 commits into pytorch:main from update-eval-recipe-for-mm · Sep 25, 2024

Conversation

@joecummings (Contributor, Author) commented Sep 25, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.

  • I did not change any public API
  • I have added an example to docs or docstrings


pytorch-bot bot commented Sep 25, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1669

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 04d1e53 with merge base 18efc81:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

```python
except ImportError:
    logger.error(
        "Recipe requires EleutherAI Eval Harness v0.4. Please install with `pip install lm_eval==0.4.*`"
    )

lm_eval_version = importlib.metadata.version("lm_eval")
```
@joecummings (Contributor, Author):

Instead of checking via importing legacy functions, we actually check the version.

This is the correct way lol
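
As a side note, a minimal sketch of that version-based guard using only the standard library (importlib.metadata ships with Python 3.8+); the error message is copied from the snippet above:

```python
import importlib.metadata

try:
    lm_eval_version = importlib.metadata.version("lm_eval")
except importlib.metadata.PackageNotFoundError:
    # Package is missing entirely; surface the same install hint.
    raise ImportError(
        "Recipe requires EleutherAI Eval Harness v0.4. "
        "Please install with `pip install lm_eval==0.4.*`"
    )
```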

```python
        return self._device

    @property
    def cache_hook(self):
```
@joecummings (Contributor, Author):

Have to do this to make the harness happy.
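
For context, a minimal sketch of the kind of shim being described, assuming the harness only needs a cache_hook attribute it can set and read (the lm_eval wiring here is paraphrased, not quoted from the diff):

```python
@property
def cache_hook(self):
    # The Eleuther harness sets a cache hook on the LM wrapper and reads it
    # back later; exposing it as a property keeps this wrapper compatible.
    return self._cache_hook

def set_cache_hook(self, cache_hook) -> None:
    self._cache_hook = cache_hook
```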

```python
for text, images in zip(all_texts, all_images):
    # Ensure images are all RGB
    proper_images = []
    for image in images:
```
@joecummings (Contributor, Author):

Not all the images are in RGB format so we need to convert them

```python
        text, image_tag=self._image_str, images=proper_images
    )
    messages.append(Message(role="user", content=content))
    messages.append(Message(role="assistant", content=""))
```
@joecummings (Contributor, Author):

Append assistant message to kick start generation.

Contributor:

Is it fine for the content here to be empty, or should it be defined somewhere?

@joecummings (Contributor, Author):

No, the content should be empty.

Collaborator:

So to properly prompt the model for generation we need to:

  • Make sure we trail with an empty assistant message
  • Set inference=True on tokenize_messages

Both of these are hard to remember... not for this PR, but we should see if there's a more intuitive way to do this (see the sketch after this exchange).

@joecummings (Contributor, Author), Sep 25, 2024:

Yeah, definitely agree. It shouldn't be this hard to know what's going to give proper generation.
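
For readers following along, a minimal sketch of the two-step pattern discussed above, assuming torchtune's Message type; the tokenize_messages call and its inference flag are paraphrased from this thread rather than copied from the diff:

```python
from torchtune.data import Message

messages = [
    Message(role="user", content=content),
    # Step 1: trail with an empty assistant message so the prompt ends
    # exactly where the model should begin generating.
    Message(role="assistant", content=""),
]

# Step 2: tokenize with inference=True so the tokenizer does not close
# off the assistant turn (flag name taken from the comment above).
tokens, _ = tokenizer.tokenize_messages(messages, inference=True)
```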

```python
batch_size=self._batch_size,
dtype=self._dtype,

# Finally, we set up the actual EvalWrapper class
eleuther_model_wrapper = (
```
@joecummings (Contributor, Author):

Per our discussion earlier, this is simplified, @felipemello1.

@felipemello1 (Contributor), Sep 25, 2024:

Maybe worth logging to let the user know whether we see it as a multimodal or a text-only eval? Could unify it with `self.logger.info(f"Running evaluation on the following tasks: {self.tasks}")`. A sketch follows below.
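
A sketch of what that unified log line could look like; the is_multimodal flag is hypothetical:

```python
# Hypothetical unification of the two log lines suggested above.
modality = "multimodal" if is_multimodal else "text-only"
self.logger.info(
    f"Running {modality} evaluation on the following tasks: {self.tasks}"
)
```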


```python
# Log metrics
self.logger.info(f"Eval completed in {t1:.02f} seconds.")
self.logger.info(
```
@joecummings (Contributor, Author):

I added some memory logging.
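
The diff itself isn't shown here; as a rough sketch of this kind of memory logging using only the plain torch.cuda API (CUDA-only, attribute names assumed):

```python
import torch

if self._device.type == "cuda":
    # Report the peak allocation observed during the eval run.
    peak_gib = torch.cuda.max_memory_allocated(self._device) / 1024**3
    self.logger.info(f"Peak memory allocated: {peak_gib:.02f} GiB")
```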

```diff
@@ -22,21 +22,21 @@ class TestEleutherEval:
     @pytest.mark.parametrize(
         "eval_name, expected_acc, bsz",
         [
-            ("truthfulqa_gen", 0.1, 8),
+            ("truthfulqa_gen", 0.1, 4),
```
@joecummings (Contributor, Author):

No need to do bsz 8, that's just slow as hell.

```diff
@@ -122,3 +122,40 @@ def test_eval_recipe_errors_without_lm_eval(self, caplog, monkeypatch, tmpdir):

         err_log = caplog.messages[-1]
         assert "Recipe requires EleutherAI Eval Harness v0.4" in err_log

+    @pytest.mark.integration_test
+    def test_eval_recipe_errors_with_generate_until_and_mc_tasks(
```
@joecummings (Contributor, Author):

more tests good, yes?

Contributor:

I thought @SalmanMohammadi added a very similar test already? Also, the test isn't added here?

@joecummings (Contributor, Author):

I've decided to be a bad person and add it in a follow-up.

@SalmanMohammadi (Collaborator), Sep 25, 2024:

I can just take it out when I fix this behaviour.


```python
if not lm_eval_version >= "0.4.2":
    raise ImportError(
        "lm_eval version must be >= 0.4.2. Please install lm_eval >= 0.4.2."
```
Contributor:

FYI, I had to create this import_guard.py for another PR because I was using it in multiple files. I wonder if it would make sense to move this there. My first intuition is that we should NOT, as keeping this logic closer to the code is better, unless you are using it somewhere else too: https://github.com/pytorch/torchtune/blob/main/torchtune/utils/_import_guard.py

@joecummings (Contributor, Author):

This is good to keep in mind - my guess is that we will end up utilizing version and package guards much more frequently as our recipes expand.
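
One caveat on the guard quoted earlier: comparing version strings lexicographically misorders releases like 0.4.10 (the string "0.4.10" sorts before "0.4.2"). A hedged sketch of a more robust check, assuming the packaging dependency is available:

```python
from importlib.metadata import version

from packaging.version import Version  # assumes `packaging` is installed

if Version(version("lm_eval")) < Version("0.4.2"):
    raise ImportError(
        "lm_eval version must be >= 0.4.2. Please install lm_eval >= 0.4.2."
    )
```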

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Sep 25, 2024.
```python
# +1% on truthfulqa_mc2 with a LoRA finetune. lit-gpt also sets this to False, see
# https://github.com/Lightning-AI/lit-gpt/blob/main/eval/lm_eval_harness.py#L66,
# though notably fast-gpt does the opposite
return self._transform.tokenizer.encode(string, add_bos=False, add_eos=False)
```
Contributor:

kwargs are not passed through. Should we forward *args and **kwargs?

@ebsmothers (Contributor), Sep 25, 2024:

In general a Transform may not have a tokenizer field, right? May wanna add a check or something. I see you use this in a couple of places, so maybe just in __init__?

@joecummings (Contributor, Author):

A multimodal transform will.

Collaborator:

The transform should have the encode method directly, no?

But yeah, in general I'm leaning towards just calling them both tokenizers now, since it still takes in a list of messages...

@joecummings (Contributor, Author):

Makes sense: we should have a unified language going forward and do a global s/ replace at some point.

WRT this specific issue, no, I don't think transforms are guaranteed to have the encode method. And the encode output is NOT being used as input to the model; it's just for sorting input lengths on the backend for Eleuther.
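
A minimal sketch of the init-time check @ebsmothers suggested; the error message is illustrative, not from the diff:

```python
# Fail fast in __init__ if the transform doesn't expose a tokenizer, since
# tok_encode relies on self._transform.tokenizer for length sorting.
if not hasattr(self._transform, "tokenizer"):
    raise ValueError(
        "Expected the model transform to expose a `tokenizer` attribute."
    )
```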

```python
def tok_batch_multimodal_encode(
    self,
    all_texts: List[str],
    all_images: List[List[PIL.Image.Image]],
```
Contributor:

Nit: fine to keep "all_", but it feels weird. I think the names should match encode to reduce the number of bumps along the way from differing arg names.

@joecummings (Contributor, Author):

Are `batch_texts` and `batch_images` better?

Contributor:

Your call, just a nit.

```python
# it into a Message format for our tokenizer
all_encoded_messages = []

for text, images in zip(all_texts, all_images):
```
Contributor:

We should probably add a check here that these are lists. If not, maybe we should wrap them in a list for the bsz=1 case.

@joecummings (Contributor, Author):

Not an issue: they are definitely lists even if bsz = 1.

```python
proper_images = []
for image in images:
    if image.mode != "RGB":
        image = image.convert("RGB")
```
@felipemello1 (Contributor), Sep 25, 2024:

Our image transform takes care of it, I believe, as long as the input is PIL. I don't like the idea that the eval is processing images.

@joecummings (Contributor, Author):

When I remove this, our transform does not work. Worth investigating whether our transform should take care of this, but I can tell you right now it doesn't seem to.

Contributor:

Yeah, this should be in the model transform. I think we enforce 3 channels now, but not RGB vs. BGR, which might be what's happening here.

Collaborator:

Worth just adding Phil's comment in the code?


```python
# Pad the encoded messages
tok_batch = padded_collate_tiled_images_and_mask(
    all_encoded_messages,
```
Contributor:

Nit (feel free to ignore): not a big fan of "all_" and "tok_", but I think it may help the reader understand it's not a single sample?

@felipemello1 (Contributor):

Please, if you make changes that are applicable to the other class, let's keep them in sync.

```diff
@@ -41,7 +41,7 @@ jobs:
       run: |
         python -m pip install torch torchvision torchao
         python -m pip install -e ".[dev]"
-        python -m pip install lm-eval==0.4.*
+        python -m pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git@fb963f0f0a5b28b69763590bb59676072cf43a01
```
Contributor:

Do we plan to add something about this to our README?

@joecummings (Contributor, Author):

README would be ideal

Contributor:

U gonna add?

@joecummings (Contributor, Author):

no u

Collaborator:

Sorry, why are we pinning to a commit here?

@joecummings (Contributor, Author):

Eleuther added support for multimodal eval in this commit, but hasn't released a v0.4.5 patch yet. Once they do, a lot of this nonsense can go away.

```python
content = format_content_with_images(
    text, image_tag=self._image_tag, images=proper_images
)
messages.append(Message(role="user", content=content))
```
Contributor:

Dumb q: how do we support system messages?

@joecummings (Contributor, Author):

MMMU does not have system messages, so we don't support them right now.


```python
# 4. Prefill step
generated_tokens = []
logits = self.model(prompt, **batch)[:, -1]
```
Contributor:

Just wondering: why do we pop prompt earlier just to get its shape and then reinsert it here? Seems unintuitive.

@joecummings (Contributor, Author):

This mirrors our generate recipe, where the tokens used in the forward pass are passed in positionally and the rest of the batch is unrolled via the double asterisk (see the sketch below).
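
A sketch of that calling convention; the pop/reinsert step and the batch key name are illustrative, and only the prefill line comes from the diff:

```python
import torch

# `prompt` was popped from the batch earlier to read its shape; it is passed
# back positionally here while the remaining fields (encoder_input,
# encoder_mask, ...) are unrolled via **batch, as in the generate recipe.
prompt = batch.pop("tokens")                   # illustrative key name
logits = self.model(prompt, **batch)[:, -1]    # prefill: last-position logits
next_token = torch.argmax(logits, dim=-1)      # greedy decoding only
```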

```python
):
    # TODO (@joecummings): Remove this init function so we don't load in extraneous stuff
```
Contributor:

Sorry, what does this mean? Are some of these fields unused?

@joecummings (Contributor, Author):

Uhmmmmm, turns out we actually load in a copy of a GPT-2 model (not that large, so it's okay) when we do this call. We overwrite anything that would affect generation, but it's extra memory overhead we shouldn't need.

Contributor:

How big? If nontrivial, let's definitely file a follow-up task (I assume we can just load in some dummy empty model instead?).


codecov-commenter commented Sep 25, 2024

Codecov Report

Attention: Patch coverage is 1.57068% with 188 lines in your changes missing coverage. Please review.

Project coverage is 26.07%. Comparing base (50b24e5) to head (04d1e53).
Report is 18 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| recipes/eleuther_eval.py | 0.00% | 184 Missing ⚠️ |
| tests/recipes/test_eleuther_eval.py | 42.85% | 4 Missing ⚠️ |
Additional details and impacted files
```
@@             Coverage Diff             @@
##             main    #1669       +/-   ##
===========================================
- Coverage   71.11%   26.07%   -45.04%     
===========================================
  Files         297      299        +2     
  Lines       15120    15392      +272     
===========================================
- Hits        10752     4013     -6739     
- Misses       4368    11379     +7011     

Flag  Coverage Δ
      26.07% <1.57%> (-45.04%) ⬇️
```

Flags with carried forward coverage won't be shown.


Comment on lines 221 to 225
```python
    for k, v in model_state_dict.items():
        model_state_dict[k] = v.to(self._device)
    model.load_state_dict(model_state_dict, assign=True)
else:
    model.load_state_dict(model_state_dict)
```
Contributor:

Was this changed deliberately? Based on #1403, I think we want to load the state dict with assign=True when quantization is enabled.

@joecummings (Contributor, Author):

Yeah.
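
For the record, a sketch of the quantization-aware load being referenced; the condition name is assumed rather than taken from the diff (see #1403):

```python
if self._quantization_mode is not None:  # assumed flag, per the discussion
    # Quantized checkpoints: move tensors to the target device first and load
    # with assign=True so the loaded tensors replace the module's parameters.
    for k, v in model_state_dict.items():
        model_state_dict[k] = v.to(self._device)
    model.load_state_dict(model_state_dict, assign=True)
else:
    model.load_state_dict(model_state_dict)
```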

"Any decoding strategy other than greedy is not supported."
)

if bsz > 1:
Collaborator:

Bit out of the loop here: what needs to change to make this happen?

@joecummings (Contributor, Author):

So currently we have no utils that support batching for Fusion models (encoder_input and encoder_mask are never accounted for). My plan would be to add it here first, then upstream it to the generate function.

Comment on lines +252 to +253

```python
encoder_max_seq_len=self.model_transform.image_seq_len
* self._max_images_per_sample,
```

Collaborator:

nit

Suggested change:

```diff
-encoder_max_seq_len=self.model_transform.image_seq_len
-* self._max_images_per_sample,
+encoder_max_seq_len= (self.model_transform.image_seq_len
+* self._max_images_per_sample),
```

Collaborator:

unless this turns it into a tuple or smth

- Loading model in fp32 or bf16. Fp16 is currently not supported.
- Quantization and torch.compile (for text-only models) is supported.

We recommend launching evaluation using the tune CLI:
Collaborator:

Are you planning to add in compile support?

@joecummings (Contributor, Author):

Yes definitely.

Collaborator:

Take out compile for now?

@SalmanMohammadi (Collaborator):

very very nice

@SalmanMohammadi (Collaborator):

What's a guy gotta do to see some outputs round here?

"multimodal generation."
)

# 1. Setup caches for a given batch size
Collaborator:

We're not actually doing this rn

```python
if self.model.caches_are_enabled():
    self.model.reset_caches()
else:
    self.model.setup_caches(
```
Collaborator:

What if `self._enable_kv_cache` is False?

@joecummings (Contributor, Author):

Yeah, good catch. I'm actually considering removing this option, since using the KV cache is strictly faster.

For text-only models this was not a huge deal, but for multimodal models things are getting slowwwwww. I'll drop a note that this is what we're doing (see the sketch below).
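
Until that note lands, a sketch of the guard being asked about; the flag name comes from the review comment and the setup_caches arguments are illustrative:

```python
if self._enable_kv_cache:  # flag name taken from the comment above
    if self.model.caches_are_enabled():
        self.model.reset_caches()
    else:
        with self.device:
            self.model.setup_caches(batch_size=1, dtype=self._dtype)
```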

```python
if self.model.caches_are_enabled():
    self.model.reset_caches()
else:
    with self.device:
```
Collaborator:

Same comment as above about `self._enable_kv_cache`.

@joecummings (Contributor, Author):

> What's a guy gotta do to see some outputs round here?

|                 Tasks                 |Version|Filter|n-shot|Metric|   |Value |   |Stderr|
|---------------------------------------|------:|------|------|------|---|-----:|---|-----:|
|mmmu_val                               |      0|none  |      |acc   |↑  |0.3233|±  |0.0271|
| - Art and Design                      |      0|none  |      |acc   |↑  |0.2000|±  |0.0656|
|  - Art                                |      0|none  |None  |acc   |↑  |0.1000|±  |0.1000|
|  - Art Theory                         |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Design                             |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Music                              |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
| - Business                            |      0|none  |      |acc   |↑  |0.3200|±  |0.0680|
|  - Accounting                         |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
|  - Economics                          |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
|  - Finance                            |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Manage                             |      0|none  |None  |acc   |↑  |0.5000|±  |0.1667|
|  - Marketing                          |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
| - Health and Medicine                 |      0|none  |      |acc   |↑  |0.3600|±  |0.0706|
|  - Basic Medical Science              |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
|  - Clinical Medicine                  |      0|none  |None  |acc   |↑  |0.5000|±  |0.1667|
|  - Diagnostics and Laboratory Medicine|      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
|  - Pharmacy                           |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
|  - Public Health                      |      0|none  |None  |acc   |↑  |0.4000|±  |0.1633|
| - Humanities and Social Science       |      0|none  |      |acc   |↑  |0.4000|±  |0.0791|
|  - History                            |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Literature                         |      0|none  |None  |acc   |↑  |0.5000|±  |0.1667|
|  - Psychology                         |      0|none  |None  |acc   |↑  |0.4000|±  |0.1633|
|  - Sociology                          |      0|none  |None  |acc   |↑  |0.5000|±  |0.1667|
| - Science                             |      0|none  |      |acc   |↑  |0.3000|±  |0.0585|
|  - Biology                            |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Chemistry                          |      0|none  |None  |acc   |↑  |0.0000|±  |0.0000|
|  - Geography                          |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Math                               |      0|none  |None  |acc   |↑  |0.4000|±  |0.1633|
|  - Physics                            |      0|none  |None  |acc   |↑  |0.7000|±  |0.1528|
| - Tech and Engineering                |      0|none  |      |acc   |↑  |0.3429|±  |0.0583|
|  - Agriculture                        |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Architecture and Engineering       |      0|none  |None  |acc   |↑  |0.4000|±  |0.1633|
|  - Computer Science                   |      0|none  |None  |acc   |↑  |0.4000|±  |0.1633|
|  - Electronics                        |      0|none  |None  |acc   |↑  |0.3000|±  |0.1528|
|  - Energy and Power                   |      0|none  |None  |acc   |↑  |0.4000|±  |0.1633|
|  - Materials                          |      0|none  |None  |acc   |↑  |0.2000|±  |0.1333|
|  - Mechanical Engineering             |      0|none  |None  |acc   |↑  |0.5000|±  |0.1667|

@joecummings merged commit 7207d3d into pytorch:main on Sep 25, 2024 · 17 checks passed

@joecummings deleted the update-eval-recipe-for-mm branch on September 25, 2024 at 16:28