🚀 Perform overpass queries in batches #1026


Draft · wants to merge 6 commits into master

Conversation

JGreenlee (Contributor)

The existing implementation predicted modes for sections one by one: if a section's sensed mode was IN_VEHICLE or UNKNOWN, we queried transit stops at its start and end locations.

This was inefficient in 2 ways:

  • for consecutive IN_VEHICLE / UNKNOWN sections, we queried the same point twice, because the first section's end is the second section's start
  • we issued every query individually, which led to many small queries in quick succession

This can be optimized by doing the predictions in two passes.
On the first pass, we make whatever predictions we can without any transit stops.
Then we query Overpass for stops near all the remaining sections' locations.
Finally, we do a second pass on those sections, now that the transit stops at each location have been retrieved.

In match_stops.py, get_stops_near now accepts a list of locations rather than just one location. Updated TestOverpass to reflect this, and also made test_get_stops_near a little more comprehensive.
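As a rough illustration of the batching idea (not the actual match_stops.py code; the `node["public_transport"]` filter is an assumption), one query can concatenate a clause per bounding box under a single Overpass QL header:

```python
# Hypothetical sketch: one Overpass QL query covering several bounding
# boxes, instead of one query per location. The node filter is
# illustrative, not the exact clause match_stops.py uses.
def get_query_for_bboxes(bboxes):
    query = '[out:json][timeout:25];\n(\n'
    for (min_lat, min_lon, max_lat, max_lon) in bboxes:
        # One clause per location, all sent in a single request
        query += f'  node["public_transport"]({min_lat},{min_lon},{max_lat},{max_lon});\n'
    query += ');\nout;'
    return query

q = get_query_for_bboxes([(37.39, -122.09, 37.40, -122.08),
                          (37.41, -122.07, 37.42, -122.06)])
print(q.count('node["public_transport"]'))  # 2 -- one clause per bbox
```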

Testing done:
Tests pass. I also manually inspected the created analysis/inferred_section entries for shankari_2016-07-27 and shankari_2016-08-04 on this branch vs. master to ensure the sensed_modes were the same

@JGreenlee (Contributor Author)

Results are promising so far:

[Image: Pipeline stage runtimes for MODE_INFERENCE]

@JGreenlee (Contributor Author)

TODO:

  • Add limit to how many locations are in one query
    • While batching multiple locations into one query does appear to be faster overall, the current implementation puts all sections into a single query, so we risk creating a query so large that it consistently times out
  • Address configurability regressions (if they are important)
    • section.endStopRadius is not used. Can it be removed or might it be used in the future?
    • The query template is now defined inline instead of a separate file in conf/. Is it important for this to be configurable?

@TeachMeTW (Contributor)

  • Add limit to how many locations are in one query

Added MAX_BBOXES_PER_QUERY

  • The query template is now defined inline instead of a separate file in conf/. Is it important for this to be configurable?

Added the option to load the query from config, falling back to the default inline query
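A minimal sketch of the chunking that a cap like MAX_BBOXES_PER_QUERY implies. The helper name matches the `chunk_list` exercised by the tests in this PR, but this body is illustrative, and the cap value here is made up:

```python
from itertools import islice

MAX_BBOXES_PER_QUERY = 10  # illustrative cap, not necessarily the tuned value

def chunk_list(data, chunk_size):
    # Yield successive chunks of at most chunk_size elements each
    it = iter(data)
    while chunk := list(islice(it, chunk_size)):
        yield chunk

print(list(chunk_list(list(range(1, 11)), 3)))
# [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
```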



```python
def get_query_for_bboxes(bboxes):
    query = '[out:json][timeout:25];\n'
```
Contributor Author:

The hard part about making the query configurable is that the query now has two distinct parts: the header, which is defined here, and the body, which is repeated once per location.
The current solution only makes the body configurable. Now that I am thinking about it, I am not even sure we need the body to be configurable, because the code expects those specific features when it parses the result. If someone wants to change which features are queried, they will have to make code changes anyway.

If there is one thing that definitely should be configurable, I think it is the timeout threshold, which goes in the header and is currently 25.
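For instance, the header could take the timeout as a parameter while the body stays fixed. A sketch under the assumption that only the timeout needs to vary; the function name is hypothetical:

```python
DEFAULT_TIMEOUT_SECS = 25  # the value currently hard-coded in the header

def get_query_header(timeout=DEFAULT_TIMEOUT_SECS):
    # Only the timeout is configurable; the body stays fixed because
    # the parsing code depends on the specific features being queried
    return f'[out:json][timeout:{timeout}];\n'

print(get_query_header(40))  # header with a larger timeout
```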

```python
enetm.MAX_BBOXES_PER_QUERY = original_max
enetm.make_request_and_catch = original_make_request_and_catch

def test_get_predicted_transit_mode_different_sizes(self):
```
Contributor Author:

Can you explain what this one is supposed to do?

Contributor:

The goal here is to verify that the function behaves correctly regardless of the number of stops provided. It checks that the function scales correctly and returns the expected result for a range of input sizes. Though it could be argued that it is redundant, since we have test_get_predicted_transit_mode_many_chunks.

Comment on lines 113 to 115
```python
stops = enetm.get_stops_near(coords, 150.0)
# Expect one chunk per coordinate = 20 chunks.
self.assertEqual(len(stops), 20)
```
Contributor Author:

This works, but not for the reason you think it does.

Looking at your implementation of get_stops_near, the return value is not separated by chunk; the chunks have already been merged together.
The length of this list will match the length of coords (in this case 20), regardless of what MAX_BBOXES_PER_QUERY is.
So this isn't really validating the chunking at all.

Come up with a better way to test the chunking, and do it with MAX_BBOXES_PER_QUERY=10.

Hint: maybe you can measure how many API calls were made?
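One way to count API calls is to stand in a mock for the request function and check how many times it was invoked. A self-contained sketch, not the real enetm module; the `fetch` hook and all names here are hypothetical:

```python
from unittest import mock

def get_stops_near(coords, radius, fetch, max_per_query=10):
    # Issue one batched Overpass request per chunk of coordinates
    results = []
    for i in range(0, len(coords), max_per_query):
        results.extend(fetch(coords[i:i + max_per_query]))
    return results

# A MagicMock stands in for the network call and records every request
counting_fetch = mock.MagicMock(return_value=[])
get_stops_near([(0.0, 0.0)] * 20, 150.0, fetch=counting_fetch)
print(counting_fetch.call_count)  # 2 -- 20 coords / 10 per query
```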

Contributor:

@JGreenlee So you suggest that with MAX_BBOXES_PER_QUERY=10 and 20 dummy coordinates, we expect 2 API calls?

Comment on lines 73 to 91
```python
def test_chunk_list(self):
    # Case 1: List of 10 elements with chunk size of 3.
    data = list(range(1, 11))  # [1, 2, ..., 10]
    chunk_size = 3
    chunks = list(enetm.chunk_list(data, chunk_size))
    expected_chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10]]
    self.assertEqual(chunks, expected_chunks)

    # Case 2: Exact division
    data_exact = list(range(1, 10))  # [1, 2, ..., 9]
    chunk_size = 3
    chunks_exact = list(enetm.chunk_list(data_exact, chunk_size))
    expected_chunks_exact = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    self.assertEqual(chunks_exact, expected_chunks_exact)

    # Case 3: Empty list
    data_empty = []
    chunks_empty = list(enetm.chunk_list(data_empty, chunk_size))
    self.assertEqual(chunks_empty, [])
```
Contributor Author:

I don't see much value in testing this function by itself.
Validate the behavior of get_public_transit_stops, and that will tell us whether the chunking is working correctly.

Contributor Author:

If there are going to be a lot of new tests and they are specific to functions in match_stops.py, then we should probably move those to a new file called TestMatchStops or similar.

However, I don't think we really need all these new tests; we just need one that ensures the correct number of Overpass requests is made (e.g. 1 call for <=10 locations, 2 calls for <=20 locations, etc.)

@JGreenlee (Contributor Author)

I am looking at the tests we've added here and the tests that already exist, and I'm thinking there really should be tests for rule_engine.py.

I can see emission/tests/analysisTests/modeInferTests/TestPipeline.py, which tests emission/analysis/classification/inference/mode/pipeline.py.
But I do not see tests for emission/analysis/classification/inference/mode/rule_engine.py.

So, what is the situation with these 2 separate implementations? rule_engine.py appears to be the one that is actually used in intake_stage.py; however, pipeline.py is the only one we have tests for.

@JGreenlee (Contributor Author)

OK, I think the old master used pipeline.py and a separate branch, gis-based-mode-detection, used rule_engine.py (I vaguely remember).
As of 2023, gis-based-mode-detection is now master, so pipeline.py is unused.

gis-based-mode-detection, created by #712, never had tests for rule_engine.py

So intake_stage.py uses eacimr (rule_engine.py):

```python
eacimr.predict_mode(uuid)
```

while the tests' runIntakePipeline uses eacimp (pipeline.py):

```python
eacimp.predict_mode(uuid)
```

If there's a regression with rule_engine.py, our tests are not going to pick it up. So this should definitely be cleaned up.

TeachMeTW added a commit to TeachMeTW/e-mission-server that referenced this pull request Apr 1, 2025
- Created TestMatchStops.py with full test coverage for transit stop matching functionality
- Updated match_stops.py to incorporate caching of overpass api results
- Implemented TestRuleEngine.py with test cases for mode inference rules. Originally there was no test coverage. This PR seeks to rectify that. See: e-mission#1026 (comment).
- Added additional cases based on 'bad' labels, see e-mission/e-mission-docs#1124 (comment)
- Regenerated ground truths now that we are using rule engine.

For Transit Matching Logic Tests:
- Test Overpass already tests the get_stops_near and predicted_transit_modes. TestMatchStops focuses on the caching mechanism to validate it works.

For RuleEngine Tests:
- Seeks to test several mode predictions such as walking, cycling, driving, etc based on different factors.
- Cases include empty sections, AirOrHSR, Motorized, Unknown, Combination
- Added a test based on prefixed modes like 'XMAS:Train' (patched and mocked until a real example is available)
TeachMeTW added a commit to TeachMeTW/e-mission-server that referenced this pull request Apr 9, 2025

added shankari xmas real data to test behavior on prefix modes like XMAS:Train
TeachMeTW added a commit to TeachMeTW/e-mission-server that referenced this pull request Apr 9, 2025
TeachMeTW and others added 6 commits April 9, 2025 20:17
Modified TestOverpass to skip iff geofabrik_overpass_key is not configured (which is the case for local testing; it should pass in GitHub CI)

Modified match_stops to add a limit to bboxes per query (MAX_BBOXES_PER_QUERY)

Modified get_query_for_bbox to use the default inline query, with an option for future configuration.
Update emission/individual_tests/TestOverpass.py

Co-authored-by: Jack Greenlee <[email protected]>
Updated tests to one that ensures the correct number of Overpass requests are made (e.g. 1 call for <=10 locations, 2 calls for <=20 locations, etc)
Using a `set` to store these coordinates instead of a `list` guarantees that we do not have any duplicates, which saves us from including the same location twice in our Overpass queries. (We cast the coordinates to tuples because sets can only store immutable types.)

This will apply to any consecutive sections that are IN_VEHICLE or UNKNOWN, where the end location of the first section is the start of the second section.
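A small illustration of that deduplication (the coordinates are made up, and lists stand in for the location objects):

```python
# Consecutive sections share a boundary point, so the raw list has a duplicate
coords = [[37.39, -122.08], [37.40, -122.07], [37.39, -122.08]]

# Tuples are hashable, so they can live in a set; lists cannot
unique_coords = {tuple(c) for c in coords}
print(len(coords), len(unique_coords))  # 3 2
```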
@JGreenlee (Contributor Author)

Rebased this onto TeachMeTW:test_coverage_ms_re
Will come back to this when #1039 is merged

TeachMeTW added a commit to TeachMeTW/e-mission-server that referenced this pull request Apr 24, 2025
@shankari shankari force-pushed the master branch 4 times, most recently from a2a9a44 to e50e9f3 Compare June 4, 2025 15:34