feat: introduces session-centric tracing APIs #539

Merged
garrett4wade merged 3 commits into main from rchardx/trace on Nov 10, 2025

@rchardx
Collaborator

@rchardx rchardx commented Nov 6, 2025

Description

Introduces session-centric tracing APIs

Replaces request tracing with a session-scoped tracer so rollout executors, engines, and workflows can log lifecycle and phase spans in a single record. Adds context-aware helpers and decorators to propagate session IDs through async code, enabling phase timing, counter aggregation, and structured flush thresholds. Updates configs, tests, docs, and example presets to align with the session tracer and tightens validation around workflow submission and tracing hooks.

This PR also adds request phase tracing helpers (#543).

A typical request trace looks like:

{
    "request_id": 0,
    "rank": 0,
    "status": "accepted",
    "submit_ts": 8738314.227833074,
    "enqueue_ts": 8738314.228516107,
    "wait_return_ts": 8738318.774206188,
    "rollout_stats": {
        "accepted": 8,
        "enqueued": 192,
        "rejected": 0,
        "running": 56
    },
    "queue_wait_s": 0.0006830338388681412,
    "runner_wait_s": 0.20169099420309067,
    "execution_s": 3.5818509366363287,
    "post_wait_s": 0.7621481493115425,
    "total_s": 4.54637311398983,
    "generate_s": 3.315287623554468,
    "reward_calc_s": 0.168320894241333,
    "phases": {
        "execution": [
            {
                "start_ts": 8738314.430207102,
                "end_ts": 8738318.012058038
            }
        ],
        "generate": [
            {
                "start_ts": 8738314.44885158,
                "end_ts": 8738317.764139203
            }
        ],
        "reward": [
            {
                "start_ts": 8738317.764251094,
                "end_ts": 8738317.9342864
            }
        ]
    }
}
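
The duration fields in this sample appear to be derived from its timestamps. The sketch below recomputes them; the formulas are inferred from the sample values and are not taken from the tracer's source. Note that `reward_calc_s` (about 0.168 s) does not equal the length of the `reward` phase span (about 0.170 s), so it is presumably measured separately and is left out here.

```python
import math

# Timestamps copied from the sample trace above.
trace = {
    "submit_ts": 8738314.227833074,
    "enqueue_ts": 8738314.228516107,
    "wait_return_ts": 8738318.774206188,
    "phases": {
        "execution": [{"start_ts": 8738314.430207102, "end_ts": 8738318.012058038}],
        "generate": [{"start_ts": 8738314.44885158, "end_ts": 8738317.764139203}],
    },
}


def span_total(spans):
    """Sum the lengths of all spans recorded for one phase."""
    return sum(s["end_ts"] - s["start_ts"] for s in spans)


def derive_durations(trace):
    """Recompute duration fields from raw timestamps (inferred relationships)."""
    execution = trace["phases"]["execution"]
    exec_start = execution[0]["start_ts"]
    exec_end = execution[-1]["end_ts"]
    return {
        "queue_wait_s": trace["enqueue_ts"] - trace["submit_ts"],
        "runner_wait_s": exec_start - trace["enqueue_ts"],
        "execution_s": span_total(execution),
        "post_wait_s": trace["wait_return_ts"] - exec_end,
        "total_s": trace["wait_return_ts"] - trace["submit_ts"],
        "generate_s": span_total(trace["phases"]["generate"]),
    }


derived = derive_durations(trace)

# The four consecutive waits partition the interval from submit to return.
parts = ("queue_wait_s", "runner_wait_s", "execution_s", "post_wait_s")
assert math.isclose(sum(derived[k] for k in parts), derived["total_s"], abs_tol=1e-6)
```

Under these assumptions, `queue_wait_s`, `runner_wait_s`, `execution_s`, and `post_wait_s` add up to `total_s`, which matches the reported sample.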

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not
    work as expected)
  • Documentation update
  • Code refactoring (no functional changes)
  • Performance improvement
  • Test coverage improvement

Checklist

  • I have read the Contributing Guide
  • I have run formatting tools (pre-commit or manual)
  • I have run relevant unit tests and they pass
  • I have added tests for new functionality
  • I have updated documentation if needed
  • My branch is up to date with main
  • This PR introduces breaking changes (if yes, fill out details below)
  • If this PR changes documentation, I have built and previewed it locally with
    jb build docs
  • No critical issues raised by AI reviewers (/gemini review)

@rchardx rchardx requested a review from Copilot November 6, 2025 06:02
@gemini-code-assist
Contributor

Summary of Changes

Hello @rchardx, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refines the rollout execution system by enhancing API clarity, improving the structure of workflow inputs, and overhauling the performance tracing mechanism. The changes aim to make the system more robust, easier to debug, and more consistent in its data handling and type definitions. It focuses on providing a clearer understanding of request flows and execution phases without altering core functionalities.

Highlights

  • API Clarity and Consistency: The should_accept parameter has been consistently renamed to should_accept_fn across various API, engine, and example files. This change improves clarity by explicitly indicating its role as a function or callback for filtering.
  • Workflow Execution Refactoring: The core RolloutWorkflow is now an abstract base class, and a new WorkflowTaskInput dataclass has been introduced. This standardizes the passing of structured task metadata, including request IDs for tracing, into workflow execution methods, enhancing modularity and maintainability.
  • Enhanced Performance Tracing: The perf_tracer module has undergone a significant overhaul. It now uses a RequestTraceEvent enum and a more structured RequestRecord dataclass to capture phase-aware events (e.g., ENQUEUED, EXECUTION_START, GENERATE_END, REWARD_START). This provides more granular and flexible diagnostic capabilities for understanding request lifecycles.
  • Type Hinting and Data Streamlining: Type hints have been modernized across several files, moving to Python 3.9+ syntax (e.g., list, dict, | None). The ModelRequest's image_data now exclusively expects base64 encoded strings, and tensor creations in workflow files explicitly specify dtype for improved type safety and consistency.
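
The phase-aware event model in the third highlight can be illustrated with a small sketch. `TraceEvent` and `TraceRecord` below are stand-ins for the `RequestTraceEvent` enum and `RequestRecord` dataclass named in the summary; only `ENQUEUED`, `EXECUTION_START`, `GENERATE_END`, and `REWARD_START` come from the text, and every other member, field, and method is an assumption made for illustration.

```python
import enum
from dataclasses import dataclass, field
from typing import Optional


class TraceEvent(enum.Enum):
    # Four of these names appear in the summary; the matching start/end
    # counterparts are assumed so that each phase can be closed.
    ENQUEUED = "enqueued"
    EXECUTION_START = "execution_start"
    EXECUTION_END = "execution_end"
    GENERATE_START = "generate_start"
    GENERATE_END = "generate_end"
    REWARD_START = "reward_start"
    REWARD_END = "reward_end"


@dataclass
class TraceRecord:
    request_id: int
    enqueue_ts: Optional[float] = None
    phases: dict[str, list[dict[str, float]]] = field(default_factory=dict)
    open_starts: dict[str, float] = field(default_factory=dict)

    def apply(self, event: TraceEvent, ts: float) -> None:
        """Fold one timestamped event into the record."""
        if event is TraceEvent.ENQUEUED:
            self.enqueue_ts = ts
            return
        # "generate_start" -> phase "generate", edge "start"
        phase, _, edge = event.value.rpartition("_")
        if edge == "start":
            self.open_starts[phase] = ts
        else:
            # Raises KeyError if the phase was never started.
            start_ts = self.open_starts.pop(phase)
            self.phases.setdefault(phase, []).append(
                {"start_ts": start_ts, "end_ts": ts}
            )


record = TraceRecord(request_id=0)
for event, ts in [
    (TraceEvent.ENQUEUED, 0.0),
    (TraceEvent.EXECUTION_START, 1.0),
    (TraceEvent.GENERATE_START, 1.5),
    (TraceEvent.GENERATE_END, 4.0),
    (TraceEvent.REWARD_START, 4.0),
    (TraceEvent.REWARD_END, 4.5),
    (TraceEvent.EXECUTION_END, 5.0),
]:
    record.apply(event, ts)
```

Pairing start and end events per phase yields the same `phases` layout as the sample trace in the PR description, with one list of spans per phase name.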

Contributor

Copilot AI left a comment

Pull Request Overview

This pull request refactors the workflow API and performance tracing infrastructure to improve type safety, consistency, and observability. The changes introduce a structured WorkflowTaskInput wrapper to standardize how data flows through workflows and rename the should_accept parameter to should_accept_fn for clarity across the codebase.

Key changes:

  • Introduced WorkflowTaskInput dataclass to wrap workflow input data with request tracing metadata
  • Renamed should_accept parameter to should_accept_fn throughout the codebase for naming consistency
  • Enhanced performance tracing with phase-based event tracking (generate, reward calculation) and extensible event handling
  • Added input validation and improved error handling in workflow execution paths
  • Modernized type hints using PEP 604 syntax (dict | None instead of Optional[Dict])
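
As a rough sketch of how the first two changes fit together, the example below wraps workflow input in a dataclass and filters results through a callback. `TaskInput` and `BaseWorkflow` are stand-ins for `WorkflowTaskInput` and `RolloutWorkflow`; their fields, the `run` method, and the `submit` helper are assumptions made for illustration and do not reproduce the signatures in `areal/api/workflow_api.py`.

```python
import asyncio
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class TaskInput:
    """Workflow input plus tracing metadata (field names are assumed)."""
    data: dict[str, Any]
    request_id: Optional[int] = None


class BaseWorkflow(ABC):
    @abstractmethod
    async def run(self, task: TaskInput) -> Optional[dict[str, Any]]:
        """Run one episode; return None when there is no usable result."""


class LengthRewardWorkflow(BaseWorkflow):
    async def run(self, task: TaskInput) -> Optional[dict[str, Any]]:
        # Validate input up front instead of failing deep inside the episode.
        if not isinstance(task.data, dict):
            raise TypeError("task.data must be a dict")
        prompt = task.data.get("prompt", "")
        return {"request_id": task.request_id, "reward": float(len(prompt))}


async def submit(
    workflow: BaseWorkflow,
    task: TaskInput,
    should_accept_fn: Optional[Callable[[dict[str, Any]], bool]] = None,
) -> Optional[dict[str, Any]]:
    """Run the workflow and drop results the callback rejects."""
    result = await workflow.run(task)
    if result is None:
        return None
    if should_accept_fn is not None and not should_accept_fn(result):
        return None
    return result


task = TaskInput(data={"prompt": "abc"}, request_id=7)
accepted = asyncio.run(
    submit(LengthRewardWorkflow(), task, should_accept_fn=lambda r: r["reward"] > 2)
)
rejected = asyncio.run(
    submit(LengthRewardWorkflow(), task, should_accept_fn=lambda r: r["reward"] > 10)
)
```

The `_fn` suffix signals that the argument is a callable rather than a boolean flag, which is the naming rationale given for the rename.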

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| areal/api/workflow_api.py | Introduced WorkflowTaskInput dataclass and made RolloutWorkflow an abstract base class with updated signature |
| areal/api/engine_api.py | Updated API method signatures to use should_accept_fn parameter |
| areal/api/controller_api.py | Updated controller API to use should_accept_fn parameter |
| areal/core/workflow_executor.py | Refactored to use WorkflowTaskInput, improved error handling, and integrated new tracing API |
| areal/core/dist_rollout.py | Updated parameter names and type hints for consistency |
| areal/core/remote_inf_engine.py | Updated to use should_accept_fn parameter |
| areal/engine/fsdp_engine.py | Updated to use should_accept_fn parameter |
| areal/engine/megatron_engine.py | Updated to use should_accept_fn parameter |
| areal/engine/vllm_remote.py | Updated to use should_accept_fn parameter |
| areal/engine/sglang_remote.py | Updated to use should_accept_fn and modernized type hints |
| areal/experimental/sglang_engine.py | Updated to use should_accept_fn, removed duplicate import |
| areal/workflow/rlvr.py | Updated to use WorkflowTaskInput, added type hints, and integrated performance tracing |
| areal/workflow/vision_rlvr.py | Updated to use WorkflowTaskInput, added type hints, and integrated performance tracing |
| areal/workflow/multi_turn.py | Updated to use WorkflowTaskInput, added validation logic, and improved type hints |
| areal/experimental/workflow/multi_turn_v2.py | Updated to use WorkflowTaskInput and improved type hints |
| areal/utils/perf_tracer.py | Major refactoring to support phase-based event tracking with extensible architecture |
| areal/utils/image.py | Simplified return type and removed unused import |
| areal/tests/test_perf_tracer.py | Updated test to use new event-based tracing API |
| recipe/AEnt/gsm8k_aent_grpo.py | Updated to use should_accept_fn parameter |
| examples//.py | Updated all example files to use should_accept_fn parameter |
| examples/math/gsm8k_ppo.py | Updated assertion formatting and parameter name |
| examples/lora/gsm8k_grpo_lora.py | Updated type hints and parameter name |
| notebook/math_reflection_en.ipynb | Updated notebook to use should_accept_fn parameter |

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and well-executed refactoring of the rollout APIs and performance tracing system. The changes improve clarity, robustness, and diagnostics. Renaming should_accept to should_accept_fn and introducing WorkflowTaskInput make the APIs more intuitive. The new phase-aware event tracing is a major enhancement. I've found one minor potential issue regarding a tensor shape change in one of the workflows, which could be an unintended side effect of the refactoring.

@rchardx rchardx force-pushed the rchardx/trace branch 2 times, most recently from 57d7c83 to 521adab, on November 6, 2025 06:20
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 34 out of 34 changed files in this pull request and generated 2 comments.


@garrett4wade garrett4wade mentioned this pull request Nov 6, 2025
25 tasks
@rchardx
Collaborator Author

rchardx commented Nov 10, 2025

/gemini review

@rchardx rchardx changed the title from "Streamlines rollout APIs and tracing" to "feat: introduces session-centric tracing APIs" on Nov 10, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request is a significant and well-executed refactoring of the tracing and rollout APIs. The introduction of a structured, phase-aware session tracing system with decorators and context managers greatly improves code clarity and maintainability. The consistent renaming from 'request' to 'session' and the tightening of type hints across the codebase are also welcome improvements. My review includes a couple of suggestions: one to improve type safety in an experimental workflow and another to fix a potential bug in how rewards are calculated in the vision workflow.

Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 22 out of 22 changed files in this pull request and generated 5 comments.


@rchardx rchardx added the safe-to-test label (Ready to run unit-tests in a PR.) on Nov 10, 2025
@rchardx rchardx requested a review from Copilot November 10, 2025 11:16
@rchardx
Collaborator Author

rchardx commented Nov 10, 2025

/gemini review

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a significant and well-executed refactoring of the tracing system, moving from request-centric to session-centric tracing. The new SessionTracer API is a major improvement, with a robust, extensible, and rule-driven design that leverages context variables, decorators, and context managers to provide a clean and ergonomic developer experience. The changes are consistently applied across the codebase, including updates to configurations, tests, and documentation.

I've included a couple of minor suggestions for further refinement, but overall, this is excellent work that enhances the observability and maintainability of the system.

Contributor

Copilot AI left a comment

Copilot reviewed 22 out of 22 changed files in this pull request and generated 2 comments.


@garrett4wade garrett4wade merged commit ca801cc into main Nov 10, 2025
1 check passed
@garrett4wade garrett4wade deleted the rchardx/trace branch November 10, 2025 11:24
Bruce-rl-hw pushed a commit to Bruce-rl-hw/AReaL-vllm that referenced this pull request Dec 4, 2025
* Introduces session-centric tracing APIs

Replaces request tracing with a session-scoped tracer so rollout executors, engines, and workflows can log lifecycle and phase spans in a single record. Adds context-aware helpers and decorators to propagate session IDs through async code, enabling phase timing, counter aggregation, and structured flush thresholds. Updates configs, tests, docs, and example presets to align with the session tracer and tightens validation around workflow submission and tracing hooks.

Adds request phase tracing helpers (inclusionAI#543)

* Resolve review comments
leandermaben pushed a commit to leandermaben/AReaL that referenced this pull request Mar 24, 2026