feat: introduces session-centric tracing APIs #539
Summary of Changes
Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly refines the rollout execution system by enhancing API clarity, improving the structure of workflow inputs, and overhauling the performance tracing mechanism. The changes aim to make the system more robust, easier to debug, and more consistent in its data handling and type definitions. It focuses on providing a clearer understanding of request flows and execution phases without altering core functionalities.
Pull Request Overview
This pull request refactors the workflow API and performance tracing infrastructure to improve type safety, consistency, and observability. The changes introduce a structured WorkflowTaskInput wrapper to standardize how data flows through workflows and rename the should_accept parameter to should_accept_fn for clarity across the codebase.
Key changes:
- Introduced `WorkflowTaskInput` dataclass to wrap workflow input data with request tracing metadata
- Renamed `should_accept` parameter to `should_accept_fn` throughout the codebase for naming consistency
- Enhanced performance tracing with phase-based event tracking (generate, reward calculation) and extensible event handling
- Added input validation and improved error handling in workflow execution paths
- Modernized type hints using PEP 604 syntax (`dict | None` instead of `Optional[Dict]`)
Reviewed Changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| areal/api/workflow_api.py | Introduced WorkflowTaskInput dataclass and made RolloutWorkflow an abstract base class with updated signature |
| areal/api/engine_api.py | Updated API method signatures to use should_accept_fn parameter |
| areal/api/controller_api.py | Updated controller API to use should_accept_fn parameter |
| areal/core/workflow_executor.py | Refactored to use WorkflowTaskInput, improved error handling, and integrated new tracing API |
| areal/core/dist_rollout.py | Updated parameter names and type hints for consistency |
| areal/core/remote_inf_engine.py | Updated to use should_accept_fn parameter |
| areal/engine/fsdp_engine.py | Updated to use should_accept_fn parameter |
| areal/engine/megatron_engine.py | Updated to use should_accept_fn parameter |
| areal/engine/vllm_remote.py | Updated to use should_accept_fn parameter |
| areal/engine/sglang_remote.py | Updated to use should_accept_fn and modernized type hints |
| areal/experimental/sglang_engine.py | Updated to use should_accept_fn, removed duplicate import |
| areal/workflow/rlvr.py | Updated to use WorkflowTaskInput, added type hints, and integrated performance tracing |
| areal/workflow/vision_rlvr.py | Updated to use WorkflowTaskInput, added type hints, and integrated performance tracing |
| areal/workflow/multi_turn.py | Updated to use WorkflowTaskInput, added validation logic, and improved type hints |
| areal/experimental/workflow/multi_turn_v2.py | Updated to use WorkflowTaskInput and improved type hints |
| areal/utils/perf_tracer.py | Major refactoring to support phase-based event tracking with extensible architecture |
| areal/utils/image.py | Simplified return type and removed unused import |
| areal/tests/test_perf_tracer.py | Updated test to use new event-based tracing API |
| recipe/AEnt/gsm8k_aent_grpo.py | Updated to use should_accept_fn parameter |
| examples//.py | Updated all example files to use should_accept_fn parameter |
| examples/math/gsm8k_ppo.py | Updated assertion formatting and parameter name |
| examples/lora/gsm8k_grpo_lora.py | Updated type hints and parameter name |
| notebook/math_reflection_en.ipynb | Updated notebook to use should_accept_fn parameter |
Code Review
This pull request introduces a significant and well-executed refactoring of the rollout APIs and performance tracing system. The changes improve clarity, robustness, and diagnostics. Renaming should_accept to should_accept_fn and introducing WorkflowTaskInput make the APIs more intuitive. The new phase-aware event tracing is a major enhancement. I've found one minor potential issue regarding a tensor shape change in one of the workflows, which could be an unintended side effect of the refactoring.
57d7c83 to 521adab
Pull Request Overview
Copilot reviewed 34 out of 34 changed files in this pull request and generated 2 comments.
247ff4f to 7e28625
/gemini review
Code Review
This pull request is a significant and well-executed refactoring of the tracing and rollout APIs. The introduction of a structured, phase-aware session tracing system with decorators and context managers greatly improves code clarity and maintainability. The consistent renaming from 'request' to 'session' and the tightening of type hints across the codebase are also welcome improvements. My review includes a couple of suggestions: one to improve type safety in an experimental workflow and another to fix a potential bug in how rewards are calculated in the vision workflow.
Pull Request Overview
Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.
Replaces request tracing with a session-scoped tracer so rollout executors, engines, and workflows can log lifecycle and phase spans in a single record. Adds context-aware helpers and decorators to propagate session IDs through async code, enabling phase timing, counter aggregation, and structured flush thresholds. Updates configs, tests, docs, and example presets to align with the session tracer and tightens validation around workflow submission and tracing hooks. Adds request phase tracing helpers (#543)
0a209a2 to dc81216
/gemini review
Code Review
This pull request introduces a significant and well-executed refactoring of the tracing system, moving from request-centric to session-centric tracing. The new SessionTracer API is a major improvement, with a robust, extensible, and rule-driven design that leverages context variables, decorators, and context managers to provide a clean and ergonomic developer experience. The changes are consistently applied across the codebase, including updates to configurations, tests, and documentation.
I've included a couple of minor suggestions for further refinement, but overall, this is excellent work that enhances the observability and maintainability of the system.
Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.
* Introduces session-centric tracing APIs

  Replaces request tracing with a session-scoped tracer so rollout executors, engines, and workflows can log lifecycle and phase spans in a single record. Adds context-aware helpers and decorators to propagate session IDs through async code, enabling phase timing, counter aggregation, and structured flush thresholds. Updates configs, tests, docs, and example presets to align with the session tracer and tightens validation around workflow submission and tracing hooks.

  Adds request phase tracing helpers (inclusionAI#543)

* Resolve review comments
Description
Introduces session-centric tracing APIs
Replaces request tracing with a session-scoped tracer so rollout executors, engines, and workflows can log lifecycle and phase spans in a single record. Adds context-aware helpers and decorators to propagate session IDs through async code, enabling phase timing, counter aggregation, and structured flush thresholds. Updates configs, tests, docs, and example presets to align with the session tracer and tightens validation around workflow submission and tracing hooks.
This PR also adds request phase tracing helpers (#543).
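One plausible reading of "counter aggregation" and "flush thresholds" is sketched below: finished session records are counted by status and buffered, then written out as JSON lines once the buffer reaches a threshold. The class `BufferedSessionLog` and its `flush_threshold` parameter are assumptions made for illustration, not the PR's actual API or config keys.

```python
import io
import json
from collections import Counter


class BufferedSessionLog:
    """Hypothetical buffer that flushes finished sessions in batches."""

    def __init__(self, sink, flush_threshold=2):
        self.sink = sink  # any object with a write() method
        self.flush_threshold = flush_threshold
        self.counters = Counter()  # aggregated counts per session status
        self._buffer = []

    def finish(self, record):
        """Count the finished session and flush if the buffer is full."""
        self.counters[record["status"]] += 1
        self._buffer.append(record)
        if len(self._buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write every buffered record as one JSON line, then clear."""
        for record in self._buffer:
            self.sink.write(json.dumps(record) + "\n")
        self._buffer.clear()


sink = io.StringIO()
log = BufferedSessionLog(sink, flush_threshold=2)
for sid, status in enumerate(["accepted", "rejected", "accepted"]):
    log.finish({"session_id": sid, "status": status})

flushed_lines = sink.getvalue().splitlines()  # two records written so far
log.flush()  # a final flush drains the remaining record
```

Batching writes this way keeps tracing overhead off the rollout hot path, at the cost of losing the unflushed tail if the process dies before a final `flush()`.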
A typical request trace looks like:
    {
      "request_id": 0,
      "rank": 0,
      "status": "accepted",
      "submit_ts": 8738314.227833074,
      "enqueue_ts": 8738314.228516107,
      "wait_return_ts": 8738318.774206188,
      "rollout_stats": {
        "accepted": 8,
        "enqueued": 192,
        "rejected": 0,
        "running": 56
      },
      "queue_wait_s": 0.0006830338388681412,
      "runner_wait_s": 0.20169099420309067,
      "execution_s": 3.5818509366363287,
      "post_wait_s": 0.7621481493115425,
      "total_s": 4.54637311398983,
      "generate_s": 3.315287623554468,
      "reward_calc_s": 0.168320894241333,
      "phases": {
        "execution": [
          { "start_ts": 8738314.430207102, "end_ts": 8738318.012058038 }
        ],
        "generate": [
          { "start_ts": 8738314.44885158, "end_ts": 8738317.764139203 }
        ],
        "reward": [
          { "start_ts": 8738317.764251094, "end_ts": 8738317.9342864 }
        ]
      }
    }
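The duration fields in the sample trace appear to be derived from its timestamps. The sketch below recomputes them from the values shown above; the formulas are inferred from the numbers, not taken from the tracer's source, and `span_total` is a hypothetical helper.

```python
import math

# Timestamps copied from the sample trace above.
trace = {
    "submit_ts": 8738314.227833074,
    "enqueue_ts": 8738314.228516107,
    "wait_return_ts": 8738318.774206188,
    "phases": {
        "execution": [{"start_ts": 8738314.430207102, "end_ts": 8738318.012058038}],
        "generate": [{"start_ts": 8738314.44885158, "end_ts": 8738317.764139203}],
    },
}


def span_total(spans):
    """Sum the lengths of all spans recorded for one phase."""
    return sum(s["end_ts"] - s["start_ts"] for s in spans)


execution = trace["phases"]["execution"][0]
derived = {
    # Time between submission and entering the queue.
    "queue_wait_s": trace["enqueue_ts"] - trace["submit_ts"],
    # Time spent queued before a runner started executing.
    "runner_wait_s": execution["start_ts"] - trace["enqueue_ts"],
    "execution_s": span_total(trace["phases"]["execution"]),
    # Time between the end of execution and the result being returned.
    "post_wait_s": trace["wait_return_ts"] - execution["end_ts"],
    "total_s": trace["wait_return_ts"] - trace["submit_ts"],
    "generate_s": span_total(trace["phases"]["generate"]),
}
```

Under this reading, the four stage durations (`queue_wait_s`, `runner_wait_s`, `execution_s`, `post_wait_s`) sum to `total_s`. Note that `reward_calc_s` (about 0.168 s) is slightly shorter than the `reward` span (about 0.170 s), so it is presumably measured separately rather than derived from the span.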