Fixed reporting of single value of loss and ppl across devices. #496

quic-meetkuma · 2025-07-07T10:12:19Z

Fixed reporting of single value of loss and ppl across devices.
Minor refactoring and variable renames for consistency.

Signed-off-by: meetkuma <[email protected]>

quic-swatia · 2025-07-07T11:25:35Z

QEfficient/finetune/utils/train_utils.py


+            if local_rank == 0:


local_rank will be None in the non-DDP case, so the check `local_rank == 0` fails and TensorBoard is not updated in non-DDP runs. Defining the following function in helper.py and calling it here will help:
    import os

    def is_rank_zero():
        return int(os.getenv("LOCAL_RANK", 0)) == 0

quic-meetkuma · Jul 8, 2025

Good catch, will update!
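To illustrate the reviewer's point, here is a minimal sketch of why a `local_rank == 0` check skips logging outside DDP and how an environment-based check avoids that. It assumes `LOCAL_RANK` is set by the launcher in DDP runs and is unset in single-process runs; `should_log_naive` is a hypothetical name used only for comparison and is not part of the PR.

```python
import os


def is_rank_zero() -> bool:
    # LOCAL_RANK is assumed to be unset in non-DDP runs, so default to rank 0.
    return int(os.getenv("LOCAL_RANK", 0)) == 0


def should_log_naive(local_rank) -> bool:
    # Hypothetical stand-in for the original check: None == 0 is False,
    # so a non-DDP run (local_rank is None) never logs.
    return local_rank == 0


if __name__ == "__main__":
    os.environ.pop("LOCAL_RANK", None)  # simulate a non-DDP run
    print(should_log_naive(None), is_rank_zero())

    os.environ["LOCAL_RANK"] = "1"  # simulate a non-zero DDP rank
    print(should_log_naive(1), is_rank_zero())
```

In the non-DDP case the naive check declines to log while `is_rank_zero()` allows it; on a non-zero rank both decline.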

quic-swatia · 2025-07-07T11:27:04Z

QEfficient/finetune/utils/train_utils.py

        # Update the learning rate as needed
        lr_scheduler.step()

        if train_config.run_validation:
-            if train_config.enable_ddp:
-                dist.barrier()


Moving lines #368 and #369 won't help. We can keep them here.

quic-meetkuma · Jul 8, 2025

This is a refactoring change: the barrier was moved inside the evaluation function.

quic-swatia · 2025-07-07T11:41:16Z

QEfficient/finetune/utils/train_utils.py

-    eval_epoch_loss = (
-        0.0 if eval_loss == 0.0 else eval_loss / (step + 1 - num_dummy_samples / train_config.val_batch_size)
-    )
+    eval_loss = 0.0 if eval_loss == 0.0 else eval_loss / (step + 1 - num_dummy_samples / train_config.val_batch_size)


Since we use the variable name train_epoch_loss for the average train loss of the epoch, it would be good to keep the name eval_epoch_loss for the average evaluation loss of the epoch, for uniformity.

quic-meetkuma · Jul 8, 2025

Check the other variables returned from this function. I made the names consistent with them.
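For reference, the averaging expression in the diff above can be sketched as a standalone function. This assumes (the thread does not state it) that `eval_loss` is the sum of per-step losses, that `step` is the zero-based index of the last batch, and that `num_dummy_samples` counts padding samples that should not count toward the average. The function name `average_eval_loss` is hypothetical.

```python
def average_eval_loss(
    eval_loss: float, step: int, num_dummy_samples: int, val_batch_size: int
) -> float:
    """Average the accumulated loss over the effective number of batches."""
    if eval_loss == 0.0:
        return 0.0
    # step + 1 batches were processed (assuming a zero-based step index).
    # Dummy samples are converted to a fraction of a batch and subtracted.
    effective_batches = step + 1 - num_dummy_samples / val_batch_size
    return eval_loss / effective_batches


if __name__ == "__main__":
    # 4 batches of size 8 with 4 dummy samples: 4 - 0.5 = 3.5 effective batches.
    print(average_eval_loss(7.0, step=3, num_dummy_samples=4, val_batch_size=8))
```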

quic-mamta · 2025-07-08T06:14:16Z

QEfficient/finetune/utils/train_utils.py

+        dist.all_reduce(eval_loss, op=dist.ReduceOp.SUM)
+        eval_loss /= get_num_ddp_devices()
+        dist.all_reduce(eval_metric, op=dist.ReduceOp.SUM)
+        eval_metric /= get_num_ddp_devices()


Won't this make each JSON file contain the same data for train_epoch_loss, train_epoch_metric, val_epoch_loss and val_epoch_metric?

quic-meetkuma · Jul 8, 2025

train_epoch_loss and train_epoch_metric are all-reduced after they are populated at L357.

For eval_loss and eval_metric, you are right. I will move this all_reduce after L386.
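The concern in this thread can be illustrated with a small pure-Python simulation; it does not call `torch.distributed`. It assumes that `dist.all_reduce(..., op=dist.ReduceOp.SUM)` leaves every rank holding the sum of all ranks' values, so dividing by the device count gives the cross-device mean. The name `simulate_all_reduce_mean` and the sample values are hypothetical.

```python
from typing import List


def simulate_all_reduce_mean(per_rank_values: List[float]) -> List[float]:
    """Mimic all_reduce(SUM) followed by division by the number of ranks."""
    total = sum(per_rank_values)  # after a SUM all-reduce, each rank holds this
    world_size = len(per_rank_values)
    return [total / world_size for _ in per_rank_values]


if __name__ == "__main__":
    eval_loss_per_rank = [1.0, 2.0, 3.0, 6.0]  # made-up per-device losses

    # Reducing before per-rank logging: every rank records the same value.
    logged_after_reduce = simulate_all_reduce_mean(eval_loss_per_rank)

    # Logging first and reducing afterwards: per-rank records keep their local
    # values, while the reported value is still a single cross-device mean.
    logged_before_reduce = list(eval_loss_per_rank)
    reported = simulate_all_reduce_mean(eval_loss_per_rank)[0]

    print(logged_after_reduce, logged_before_reduce, reported)
```

Under these assumptions, the position of the reduction relative to the per-rank bookkeeping decides whether per-device files hold identical or device-local values.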

Fixed reporting of single value of loss and ppl across devices.

6b8aaad

Signed-off-by: meetkuma <[email protected]>

quic-swatia reviewed Jul 7, 2025

Updated local_rank usages with is_rank_zero function call.

76ce094

Signed-off-by: meetkuma <[email protected]>

quic-meetkuma marked this pull request as ready for review July 8, 2025 05:31

quic-meetkuma requested review from quic-rishinr, ochougul, quic-hemagnih and quic-amitraj as code owners July 8, 2025 05:32

quic-mamta reviewed Jul 8, 2025
