Refactored retrieval of last API call timestamps to improve performance. #156
Conversation
- Changed from using an aggregation pipeline on `get_timeseries_db` to a loop fetching data from `get_profile_db`.
- New approach iterates over `uuid_list`, fetching profile data and extracting `last_call_ts` for each user.
- Simplifies logic, avoids heavy aggregation, and reduces database load. Results in a significant performance improvement.
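For context, a minimal sketch of the two query shapes being compared. The aggregation pipeline is an illustrative reconstruction (the metadata.write_ts field and the pipeline stages are assumptions, not the removed code); the per-profile lookup mirrors the loop shown in the diff further down.

from uuid import UUID

import emission.core.get_database as edb

def last_call_ts_via_aggregation(uuid_list):
    # Hypothetical shape of the replaced approach: one heavy aggregation over the
    # timeseries collection to find each user's most recent write timestamp.
    # Field names here are assumptions for illustration only.
    pipeline = [
        {"$match": {"user_id": {"$in": [UUID(u) for u in uuid_list]}}},
        {"$group": {"_id": "$user_id",
                    "last_call_ts": {"$max": "$metadata.write_ts"}}},
    ]
    return {doc["_id"]: doc["last_call_ts"]
            for doc in edb.get_timeseries_db().aggregate(pipeline)}

def last_call_ts_via_profiles(uuid_list):
    # Shape of the new approach: one small profile_db lookup per user,
    # reading the stored last_call_ts field instead of scanning the timeseries.
    return {u: (edb.get_profile_db().find_one({'user_id': UUID(u)}) or {}).get('last_call_ts')
            for u in uuid_list}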
@TeachMeTW I would like some more technical detail in the benefits and impact. I am not a general audience 😄 Can you please clarify:
@shankari Here's a breakdown of the clarification you requested:
Old: Screen.Recording.2025-01-02.at.10.18.22.AM.mov
New: Screen.Recording.2025-01-02.at.10.19.27.AM.mov

Takeaways

As we can see, it is much faster. I would just need to test my hypothesis above about whether loading a specific UUID and updating it would classify that user as active.
@TeachMeTW did you look through the code to check if there are any other sites that are making direct DB calls, or calls to the timeseries for profile data? When we finish a particular performance improvement, I want to get it done fully and not over multiple weeks. Each improvement is already carefully scoped to be small and self-contained.
Hypothesis verified.

First I ran …

Next I ran …

Lastly, I ran … to simulate an API call.
On the home page, I only found one Mongo query, which was the active_users card that this PR addresses.

Previously, in …

However, in …:

not_excluded_uuid_query = {'user_id': {'$nin': [UUID(uuid) for uuid in excluded_uuids]}}

In …
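As a hedged usage sketch (not code from this repo), a filter like this can be applied once to the profile collection and projected down to just the fields the card needs; the helper name and the choice of collection are assumptions.

from uuid import UUID

import emission.core.get_database as edb

def get_non_excluded_profiles(excluded_uuids):
    # Hypothetical: exclude the configured uuids in a single query and project
    # only the fields used downstream (user_id and last_call_ts).
    not_excluded_uuid_query = {
        'user_id': {'$nin': [UUID(u) for u in excluded_uuids]}
    }
    return list(edb.get_profile_db().find(
        not_excluded_uuid_query, {'_id': 0, 'user_id': 1, 'last_call_ts': 1}))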
…get. Reduced operation bloat by leaving only one for loop that does the same thing.
@TeachMeTW I asked:

…

I see an investigation of the first part but not of the second.
This is what I found:
@TeachMeTW I am merging this now, but this is not sufficient.

You need to list out the calls to …
current_timestamp = arrow.utcnow().timestamp()
for npu in uuid_list:
    user_uuid = UUID(npu)
    profile_data = edb.get_profile_db().find_one({'user_id': user_uuid})
    if profile_data:
        last_call_ts = profile_data.get('last_call_ts')
        if last_call_ts and (current_timestamp - arrow.get(last_call_ts).timestamp()) <= threshold:
            number_of_active_users += 1
esdsq.store_dashboard_time("admin/home/get_number_of_active_users/total_time", total_timer)
While this works, it also makes multiple calls to the database; for …
A better option, given that the profile entries are small and not very numerous, is to read the entire profile_db and then do a pandas merge to create a single table for display.
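A minimal sketch of that suggestion, assuming the profile documents carry user_id and last_call_ts as in the diff above; the helper name, projection, and 'active' column are illustrative placeholders rather than the dashboard's actual code.

from uuid import UUID

import arrow
import pandas as pd
import emission.core.get_database as edb

def get_active_user_table(uuid_list, threshold):
    # One bulk read of the (small) profile collection instead of N find_one calls.
    profiles = list(edb.get_profile_db().find(
        {}, {'_id': 0, 'user_id': 1, 'last_call_ts': 1}))
    profiles_df = pd.DataFrame(profiles, columns=['user_id', 'last_call_ts'])

    uuids_df = pd.DataFrame({'user_id': [UUID(u) for u in uuid_list]})

    # Single table joining the requested uuids with their last call timestamps.
    merged = uuids_df.merge(profiles_df, on='user_id', how='left')

    cutoff = arrow.utcnow().timestamp() - threshold
    merged['active'] = merged['last_call_ts'].apply(
        lambda ts: pd.notna(ts) and arrow.get(ts).timestamp() >= cutoff)
    return merged

The count the active_users card needs is then merged['active'].sum(), and the same frame could back a per-user display table.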
@shankari I thought of doing a single DB call as suggested, but would a pandas merge really be needed? It's only getting the number of active users. I believe in this approach:
num_active = edb.get_profile_db().count_documents({
    "user_id": {"$in": uuid_objs},
    "last_call_ts": {"$gte": cutoff_dt},
})
Far, far simpler, and it accomplishes the same thing; it's not exactly slow either. I tested it and it seems to work. What do you think?
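For completeness, a hedged sketch of how the inputs to that query might be built. uuid_objs and cutoff_dt are not defined in the snippet above, and whether last_call_ts is stored as an epoch number or a datetime determines what the cutoff should be; this assumes the epoch-style value implied by the arrow comparison in the diff.

from uuid import UUID

import arrow
import emission.core.get_database as edb

def count_active_users(uuid_list, threshold):
    # Assumption: last_call_ts is an epoch timestamp comparable to arrow values.
    uuid_objs = [UUID(u) for u in uuid_list]
    cutoff_dt = arrow.utcnow().timestamp() - threshold

    # Single server-side count instead of N find_one calls or a client-side merge.
    return edb.get_profile_db().count_documents({
        "user_id": {"$in": uuid_objs},
        "last_call_ts": {"$gte": cutoff_dt},
    })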
Description
This PR refactors the process for retrieving the last API call timestamps to significantly improve performance. The previous implementation relied on a complex aggregation pipeline, which has been replaced with a simpler, iterative approach using the profile DB, as in previous enhancements.
Benefits
Impact
The changes result in faster execution and lower computational costs, particularly for scenarios with large user datasets like openaccess.
Testing
Active Users
Points of concern