Refactored retrieval of last API call timestamps to improve performance. #156
Conversation
- Changed from using an aggregation pipeline on `get_timeseries_db` to a loop fetching data from `get_profile_db`.
- New approach iterates over `uuid_list`, fetching profile data and extracting `last_call_ts` for each user.
- Simplifies logic, avoids heavy aggregation, and reduces database load. Results in a significant performance improvement.
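For context, a minimal sketch of the two query shapes being compared. The aggregation pipeline is an illustrative reconstruction (the metadata.write_ts field and the pipeline stages are assumptions, not the removed code); the per-profile lookup mirrors the loop shown in the diff further down.

from uuid import UUID

import emission.core.get_database as edb

def last_call_ts_via_aggregation(uuid_list):
    # Hypothetical shape of the replaced approach: one heavy aggregation over the
    # timeseries collection to find each user's most recent write timestamp.
    # Field names here are assumptions for illustration only.
    pipeline = [
        {"$match": {"user_id": {"$in": [UUID(u) for u in uuid_list]}}},
        {"$group": {"_id": "$user_id",
                    "last_call_ts": {"$max": "$metadata.write_ts"}}},
    ]
    return {doc["_id"]: doc["last_call_ts"]
            for doc in edb.get_timeseries_db().aggregate(pipeline)}

def last_call_ts_via_profiles(uuid_list):
    # Shape of the new approach: one small profile_db lookup per user,
    # reading the stored last_call_ts field instead of scanning the timeseries.
    return {u: (edb.get_profile_db().find_one({'user_id': UUID(u)}) or {}).get('last_call_ts')
            for u in uuid_list}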
@TeachMeTW I would like some more technical detail in the benefits and impact. I am not a general audience 😄 Can you please clarify:
@shankari Here's a breakdown of the clarification you requested:
Old: Screen.Recording.2025-01-02.at.10.18.22.AM.mov
New: Screen.Recording.2025-01-02.at.10.19.27.AM.mov

Takeaways

As we can see, it is much faster. I would just need to test my hypothesis above about whether loading a specific UUID and updating it would classify that user as active.
@TeachMeTW did you look through the code to check if there are any other sites that are making direct DB calls, or calls to the timeseries for profile data? When we finish a particular performance improvement, I want to get it done fully and not over multiple weeks. Each improvement is already carefully scoped to be small and self-contained.
Hypothesis verified.

First I ran …

Next I ran …

Lastly, I ran … to simulate an API call.
On the home page, I only found one Mongo query, which was the active_users card that this PR addresses.

Previously, in …

However, in …:

not_excluded_uuid_query = {'user_id': {'$nin': [UUID(uuid) for uuid in excluded_uuids]}}

In …
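As a hedged usage sketch (not code from this repo), a filter like this can be applied once to the profile collection and projected down to just the fields the card needs; the helper name and the choice of collection are assumptions.

from uuid import UUID

import emission.core.get_database as edb

def get_non_excluded_profiles(excluded_uuids):
    # Hypothetical: exclude the configured uuids in a single query and project
    # only the fields used downstream (user_id and last_call_ts).
    not_excluded_uuid_query = {
        'user_id': {'$nin': [UUID(u) for u in excluded_uuids]}
    }
    return list(edb.get_profile_db().find(
        not_excluded_uuid_query, {'_id': 0, 'user_id': 1, 'last_call_ts': 1}))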
…get. Reduced operation bloat by leaving only one for loop that does the same thing.
@TeachMeTW I asked:

…

I see an investigation of the first part but not of the second.
This is what I found:
@TeachMeTW I am merging this now, but this is not sufficient.

You need to list out the calls to …
current_timestamp = arrow.utcnow().timestamp()
for npu in uuid_list:
    user_uuid = UUID(npu)
    profile_data = edb.get_profile_db().find_one({'user_id': user_uuid})
    if profile_data:
        last_call_ts = profile_data.get('last_call_ts')
        if last_call_ts and (current_timestamp - arrow.get(last_call_ts).timestamp()) <= threshold:
            number_of_active_users += 1
esdsq.store_dashboard_time("admin/home/get_number_of_active_users/total_time", total_timer)
While this works, it also makes multiple calls to the database; for …
A better option, given that the profile entries are small and not very numerous, is to read the entire profile_db and then do a pandas merge to create a single table for display.
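A minimal sketch of that suggestion, assuming the profile documents carry user_id and last_call_ts as in the diff above; the helper name, projection, and 'active' column are illustrative placeholders rather than the dashboard's actual code.

from uuid import UUID

import arrow
import pandas as pd
import emission.core.get_database as edb

def get_active_user_table(uuid_list, threshold):
    # One bulk read of the (small) profile collection instead of N find_one calls.
    profiles = list(edb.get_profile_db().find(
        {}, {'_id': 0, 'user_id': 1, 'last_call_ts': 1}))
    profiles_df = pd.DataFrame(profiles, columns=['user_id', 'last_call_ts'])

    uuids_df = pd.DataFrame({'user_id': [UUID(u) for u in uuid_list]})

    # Single table joining the requested uuids with their last call timestamps.
    merged = uuids_df.merge(profiles_df, on='user_id', how='left')

    cutoff = arrow.utcnow().timestamp() - threshold
    merged['active'] = merged['last_call_ts'].apply(
        lambda ts: pd.notna(ts) and arrow.get(ts).timestamp() >= cutoff)
    return merged

The count the active_users card needs is then merged['active'].sum(), and the same frame could back a per-user display table.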
@shankari I thought of doing a single DB call as suggested, but would a pandas merge really be needed? It's only getting the number of active users. I believe in this approach:
num_active = edb.get_profile_db().count_documents({
    "user_id": {"$in": uuid_objs},
    "last_call_ts": {"$gte": cutoff_dt},
})
Far, far simpler, and it accomplishes the same thing; it's not exactly slow either. I tested it and it seems to work. What do you think?
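For completeness, a hedged sketch of how the inputs to that query might be built. uuid_objs and cutoff_dt are not defined in the snippet above, and whether last_call_ts is stored as an epoch number or a datetime determines what the cutoff should be; this assumes the epoch-style value implied by the arrow comparison in the diff.

from uuid import UUID

import arrow
import emission.core.get_database as edb

def count_active_users(uuid_list, threshold):
    # Assumption: last_call_ts is an epoch timestamp comparable to arrow values.
    uuid_objs = [UUID(u) for u in uuid_list]
    cutoff_dt = arrow.utcnow().timestamp() - threshold

    # Single server-side count instead of N find_one calls or a client-side merge.
    return edb.get_profile_db().count_documents({
        "user_id": {"$in": uuid_objs},
        "last_call_ts": {"$gte": cutoff_dt},
    })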
Description
This PR refactors the process for retrieving the last API call timestamps to significantly improve performance. The previous implementation relied on a complex aggregation pipeline, which has been replaced with a simpler, iterative approach using the profile DB, as in previous enhancements.
Benefits
Impact
The changes result in faster execution and lower computational costs, particularly for scenarios with large user datasets like openaccess.
Testing
Active Users
Points of concern