
Assistant: Initial pass at implementing a data summary tool for Python #8208


Open: melissa-barca wants to merge 5 commits into main from feature/ai-data

Conversation

melissa-barca
Contributor

@melissa-barca melissa-barca commented Jun 19, 2025

First pass at #7114

Gives Assistant a getDataSummary tool, currently implemented only for Python, that returns a JSON-structured summary of a data object by using the Positron API to communicate with the variables comm. I updated the Python backend for variables to reuse existing functionality from the data explorer.

I used the inspectVariables tool as a guide for retrieving info from the variables comm.
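For readers unfamiliar with the idea, here is a minimal sketch of what a JSON-structured data summary can look like. It is not the code in this PR: `summarize_columns`, the plain-dict table representation, and the `column_name`, `null_count`, `num_rows`, and `num_columns` fields are assumptions made for illustration. Only `type_display` and `summary_stats` are field names that appear in the diff.

```python
import json
import statistics
from typing import Any


def summarize_columns(columns: dict[str, list[Any]]) -> str:
    """Return a JSON summary of a column-oriented table (hypothetical helper)."""
    summary = []
    for name, values in columns.items():
        non_null = [v for v in values if v is not None]
        entry: dict[str, Any] = {
            "column_name": name,  # assumed field name
            "type_display": type(non_null[0]).__name__ if non_null else "unknown",
            "null_count": len(values) - len(non_null),  # assumed field name
        }
        # Only numeric columns get min/max/mean; bool is excluded on purpose.
        is_numeric = bool(non_null) and all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in non_null
        )
        if is_numeric:
            entry["summary_stats"] = {
                "min": min(non_null),
                "max": max(non_null),
                "mean": statistics.fmean(non_null),
            }
        summary.append(entry)
    num_rows = max((len(v) for v in columns.values()), default=0)
    return json.dumps(
        {"num_rows": num_rows, "num_columns": len(columns), "columns": summary}
    )
```

A call such as `summarize_columns({"a": [1, 2, None, 3]})` returns a single JSON string that a tool result can carry as-is.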


Release Notes

New Features

  • N/A

Bug Fixes

  • N/A

QA Notes

@:data-explorer
@:assistant
@:variables
@:plots
@:viewer

@melissa-barca melissa-barca requested a review from wesm June 19, 2025 21:15

github-actions bot commented Jun 19, 2025

E2E Tests 🚀
This PR will run tests tagged with: @:critical @:data-explorer @:assistant @:variables @:plots @:viewer


Contributor

@wesm wesm left a comment


This looks like a great start. My main suggestion is to rename the API that routes requests to the variables comm to something more generic (it can just query a single session variable at a time), so that we can use it to add more data querying tools without having to modify the Positron API each time.

The other change we will want to make is to handle these tool calls asynchronously so that they do not block the variables comm. This basically means copying the pattern the data explorer comm uses for the get_column_profiles request (and its corresponding return_column_profiles front-end API; see https://github.com/posit-dev/positron/blob/main/extensions/positron-python/python_files/posit/positron/data_explorer.py#L492-L519).
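A toy model of that asynchronous request/response pattern, for illustration only: the request returns immediately, and the result is delivered later as a separate event. This is not Positron's implementation. The `AsyncProfileComm` class, the way `callback_id` is generated, the event payload shape, and `wait_idle` are all assumptions; only the `get_column_profiles` and `return_column_profiles` names come from the comment above.

```python
import queue
import threading
import uuid
from typing import Any, Callable


class AsyncProfileComm:
    """Toy comm: requests return at once; results arrive later as events."""

    def __init__(self, send_event: Callable[[dict[str, Any]], None]) -> None:
        self._send_event = send_event
        self._jobs: queue.Queue = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def get_column_profiles(self, compute: Callable[[], Any]) -> str:
        """Queue the work and return a callback id without waiting."""
        callback_id = str(uuid.uuid4())  # assumed: id is generated here
        self._jobs.put((callback_id, compute))
        return callback_id

    def _run(self) -> None:
        # Worker loop: compute each profile, then emit the result as an event.
        while True:
            callback_id, compute = self._jobs.get()
            try:
                try:
                    payload = {"callback_id": callback_id, "profiles": compute()}
                except Exception as exc:
                    payload = {"callback_id": callback_id, "error": str(exc)}
                self._send_event(
                    {"method": "return_column_profiles", "params": payload}
                )
            finally:
                self._jobs.task_done()

    def wait_idle(self) -> None:
        """Block until every queued job has emitted its event (for demos/tests)."""
        self._jobs.join()
```

The caller matches the later event to its request through `callback_id`, so other messages can be handled while the profile is still being computed.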

"type_display": column.type_display,
"summary_stats": summary_stats,
}
)
Contributor


It's a good starting point to have this tool surfaced in the variables comm. Since computing summary stats or other computed profiles can be expensive (and thus block other message handling in the variables comm), we'll probably want to separate "expensive" requests (e.g. summary stats, frequency tables, histograms) from "cheap" requests (like asking for the schema), and make sure that the expensive requests are performed in an asynchronous-response pattern like the get_column_profiles request in the data explorer. This doesn't all have to get done in this PR, so it can be follow-up work.
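One way to picture the cheap/expensive split is a small router that answers cheap queries inline and hands expensive ones to a background worker. This is a sketch under assumptions, not a proposal for the actual comm code: `QueryRouter`, the query kind names, and the use of a thread pool (standing in for whatever job queue the kernel provides) are all hypothetical.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Union

# Hypothetical query kinds, grouped by cost as the comment suggests.
CHEAP = {"schema"}
EXPENSIVE = {"summary_stats", "frequency_table", "histogram"}


class QueryRouter:
    """Answer cheap queries inline; defer expensive ones to a worker."""

    def __init__(self, handlers: dict[str, Callable[[Any], Any]]) -> None:
        self._handlers = handlers
        self._pool = ThreadPoolExecutor(max_workers=1)

    def handle(self, kind: str, value: Any) -> Union[Any, Future]:
        if kind in CHEAP:
            # Cheap: compute and return the result directly.
            return self._handlers[kind](value)
        if kind in EXPENSIVE:
            # Expensive: return a Future right away; the result arrives later.
            return self._pool.submit(self._handlers[kind], value)
        raise ValueError(f"unknown query kind: {kind}")
```

With this shape, a schema request never waits behind a histogram computation, which is the property the comment is after.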

@melissa-barca melissa-barca force-pushed the feature/ai-data branch 2 times, most recently from 29b64a0 to 94cb220 Compare June 27, 2025 04:03
@melissa-barca melissa-barca requested a review from jmcphers June 27, 2025 04:42
@melissa-barca melissa-barca marked this pull request as ready for review June 27, 2025 04:49
Contributor

@wesm wesm left a comment


This is close to a good stopping point for the initial pass. I think the main thing that needs to get fixed is the return type for the query_variable_data RPC. Since it isn't easy to access all of the data explorer comm types in all the layers where this function is called, we can just return serialized JSON from the function for now (effectively schema: string, column_profiles: string[]).

# Create a temporary table view with a temporary comm
temp_state = DataExplorerState("temp_summary")
temp_comm = PositronComm.create(target_name="temp_summary", comm_id="temp_summary_comm")
table_view = _get_table_view(value, temp_comm, temp_state, self.kernel.job_queue)
Contributor


Maybe later we can set up a persistent data explorer comm to use for Assistant tool calls. I realized just now, after my earlier comment about the async column profiles (not needed for now), that these depend on there being a live comm available to send the frontend event with the asynchronous result. We can look more closely at this later.

"description": "Result of the summarize operation",
"type": "object",
"properties": {
"children": {
Contributor


This is returning a different return type right now (with the schema and column profiles, so a lot more complex). I think to avoid having to drag along the schema and profile result types (and mainly having to expose these in the Positron runtime / extHost API), we can just return the schema and profiles as a serialized JSON string to sidestep this issue for now. It would be good to make these results well-typed everywhere, but there's a bunch of plumbing needed.
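A rough sketch of the "serialized JSON" workaround, assuming the result shape mentioned earlier in the review (schema: string, column_profiles: string[]). The `ColumnProfile` and `QueryVariableDataResult` classes, the `column_name` field, and the `to_wire` helper are hypothetical names for illustration and are not the types in this PR.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Any


@dataclass
class ColumnProfile:
    """Stand-in for a typed profile that lives only on the Python side."""

    column_name: str  # assumed field name
    type_display: str
    summary_stats: dict[str, Any] = field(default_factory=dict)


@dataclass
class QueryVariableDataResult:
    # Both fields hold serialized JSON, so no data explorer types
    # have to cross the API boundary.
    schema: str
    column_profiles: list[str]


def to_wire(
    schema: dict[str, Any], profiles: list[ColumnProfile]
) -> QueryVariableDataResult:
    """Flatten typed results into JSON strings before returning them."""
    return QueryVariableDataResult(
        schema=json.dumps(schema),
        column_profiles=[json.dumps(asdict(p)) for p in profiles],
    )
```

The trade-off is the one noted in the comment: the layers in between only see strings, so consumers must parse the JSON themselves until the results are typed end to end.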

@wesm wesm changed the title initial pass at implementing a data summary tool for Python Assistant: Initial pass at implementing a data summary tool for Python Jun 30, 2025
@wesm wesm force-pushed the feature/ai-data branch from b902acc to df11174 Compare June 30, 2025 18:59
@wesm
Contributor

wesm commented Jun 30, 2025

I rebased this today and will work on some unit tests for the Python backend portion before it can be merged.

@wesm wesm force-pushed the feature/ai-data branch 2 times, most recently from 36d49d5 to b0bb2d8 Compare July 1, 2025 23:19
melissa-barca and others added 5 commits July 2, 2025 16:50
improve logging performance to satisfy linter

clean up code

provide temp comm to satisfy pyright

modify openRPC specs to autogen comms code and fix bug with passing
'path' parameter, also rename summarizeData function to make it more
generic

create data explorer helper functions

revert formatting change
@wesm wesm force-pushed the feature/ai-data branch from b0bb2d8 to 3d81ecd Compare July 2, 2025 23:58
Labels
None yet
Projects
None yet

2 participants