Skip to content

perf(api): speed up finding-groups /resources endpoint #10816

Merged

AdriiiPRodri merged 2 commits into master from perf/api-finding-groups-resources-rfm-query-plan on Apr 21, 2026

Conversation

@AdriiiPRodri
Contributor

@AdriiiPRodri AdriiiPRodri commented Apr 21, 2026

Context

Users reported that the finding groups page felt extremely slow, especially when drilling into the /finding-groups/{check_id}/resources and /finding-groups/latest/{check_id}/resources endpoints. Profiling against a realistic dev dataset showed the request spending ~3.5 s in SQL for a response that ultimately returns ~10 rows.

Description

Profiling with django-silk + EXPLAIN ANALYZE traced the cost to two independent problems, both inside FindingGroupViewSet on the /resources action:

  1. Redundant existence probe on every request. _paginated_resource_response ran has_mappings = self._build_resource_mapping_queryset(...).exists() before handing control to _mapping_paginated_response, which then re-ran essentially the same semi-join as part of its paginator COUNT. For the 99% of requests that hit groups with resources (non-IaC), that extra probe was pure waste. The exists() over partitioned ResourceFindingMapping cost ~586 ms of real DB time per request on the test dataset.

  2. Planner picked a Merge Semi Join we didn't want. _build_resource_mapping_queryset filtered RFM with finding_id__in=Subquery(findings_qs). PostgreSQL kept choosing a Merge Semi Join: it read the finding_id index of each RFM partition ordered by finding_id and filtered tenant_id as a post-step. On the monthly partition currently being appended to, that meant reading ~470k index entries to keep ~38k matching the tenant, only to surface the first row of the semi-join result. Classic case of the planner treating the post-filter as free when it wasn't.
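
For illustration, the pre-change shape looked roughly like the sketch below (names follow the helpers and model mentioned above, but tenant handling and import paths are condensed; treat it as a sketch, not the literal code in views.py):

```python
from django.db.models import Subquery

# ResourceFindingMapping is the partitioned RFM model; in the real code it is
# imported from the API models module.

def build_rfm_queryset_before(findings_qs, tenant_id):
    # Problem 2: the Subquery filter steers PostgreSQL towards a Merge Semi
    # Join over the partitions' finding_id indexes, post-filtering tenant_id.
    return ResourceFindingMapping.objects.filter(
        tenant_id=tenant_id,
        finding_id__in=Subquery(findings_qs.values("id")),
    )

def has_mappings_before(findings_qs, tenant_id):
    # Problem 1: an existence probe that re-runs essentially the same
    # semi-join the paginator COUNT is about to execute anyway.
    return build_rfm_queryset_before(findings_qs, tenant_id).exists()
```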

This PR avoids both problems without schema changes:

  • Drops the has_mappings.exists() pre-check. Instead, _paginated_resource_response now calls _mapping_paginated_response straight away and uses the paginator's total count to decide when to fall back to _combined_paginated_response for pure-orphan/IaC groups. One less query per request on the common path; behavior for IaC groups is unchanged.
  • Materializes the finding IDs as a plain Python list (finding_id__in=[uuid, uuid, …]) rather than Subquery(...). With an inlined value list the planner switches to Nested Loop + Index Scan on the RFM finding_id index, keyed directly by the constant list. RFM aggregation queries now execute in under 1 ms of DB time.
  • Adds a request-scoped cache (_resolve_finding_ids) so the three helpers that build different RFM querysets from the same filtered findings set (_build_resource_mapping_queryset, _build_resource_ordering_queryset, _build_resource_aggregation) reuse the same materialized list instead of re-running the findings query 3–4× per request.
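
A minimal sketch of the post-change query construction (same caveats as above; the helper name here is illustrative):

```python
def build_rfm_queryset_after(finding_ids, tenant_id):
    # finding_ids is a plain Python list of UUIDs materialized once per
    # request by _resolve_finding_ids. The inlined value list lets the
    # planner key the partitions' finding_id indexes directly:
    #   Index Cond: (finding_id = ANY ('{...}'::uuid[]))
    return ResourceFindingMapping.objects.filter(
        tenant_id=tenant_id,
        finding_id__in=finding_ids,
    )
```

Keying the IaC/orphan fallback off the paginator's total count, instead of a dedicated exists() probe, is what removes one query from the common path.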

EXPLAIN ANALYZE (real captures)

Before: has_mappings.exists() pre-check (~586 ms real DB time)
Limit  (cost=282.16..293.91 rows=1 width=4) (actual time=584.270..584.283 rows=1 loops=1)
  ->  Merge Semi Join  (cost=282.16..1494541.31 rows=127167 width=4) (actual time=584.268..584.280 rows=1 loops=1)
        Merge Cond: (resource_finding_mappings.finding_id = v0.id)
        ->  Append  (cost=1.75..359959.95 rows=321941 width=16) (actual time=0.051..111.924 rows=38230 loops=1)
              ->  Index Scan using resource_finding_mappings_2026_mar_finding_id_idx on resource_finding_mappings_2026_mar
                    Filter: (tenant_id = 'a3b26b14-cbfe-40e8-b54c-dd6e2bfa8780'::uuid)
                    Rows Removed by Filter: 431220
              ->  Index Scan using resource_finding_mappings_2026_apr_finding_id_idx on resource_finding_mappings_2026_apr  (never executed)
                    Filter: (tenant_id = 'a3b26b14-cbfe-40e8-b54c-dd6e2bfa8780'::uuid)
              …

Reading ~470k index entries just to post-filter tenant_id and keep ~38k, only to return a single row via LIMIT 1.

Before: main RFM aggregation (Merge Semi Join, ~21 ms real DB time but unstable under load)
Finalize GroupAggregate  (actual time=18.082..20.780 rows=10 loops=1)
  ->  Gather Merge  (…)
        ->  Partial GroupAggregate
              ->  Sort
                    ->  Hash Join
                          ->  Nested Loop
                                ->  Nested Loop
                                      ->  Merge Semi Join  (actual time=1.252..1.294 rows=3 loops=3)
                                            Merge Cond: (rfm.finding_id = v0.id)
                                            ->  Sort  (rfm sorted by finding_id)
                                            ->  Sort  (findings subquery sorted by id)
After: same aggregation, Nested Loop + Index Scan driven by the materialized list (< 1 ms real DB time)
GroupAggregate  (cost=334.37..334.55 rows=2 width=369) (actual time=0.422..0.667 rows=10 loops=1)
  Group Key: resource_finding_mappings.resource_id
  ->  Sort  (actual time=0.347..0.351 rows=10 loops=1)
        Sort Key: resource_finding_mappings.resource_id
        ->  Nested Loop  (actual time=0.082..0.334 rows=10 loops=1)
              ->  Nested Loop
                    ->  Nested Loop
                          ->  Append  (actual time=0.038..0.132 rows=10 loops=1)
                                ->  Index Scan using resource_finding_mappings_2026_mar_finding_id_idx
                                      Index Cond: (finding_id = ANY ('{019ce202-b11a-…, 019ce202-b11c-…, …}'::uuid[]))
                                …

The semi-join is gone; the RFM access is a targeted Index Cond: finding_id = ANY (…) driven straight by the inlined UUID list, which is exactly what we want on partitioned RFM.

Notes on the per-request cache

The cache in _resolve_finding_ids lives on self (the FindingGroupViewSet instance). DRF creates a fresh ViewSet instance for every request (as_view() calls cls(**initkwargs) per dispatch), so the cache is effectively request-scoped. Reviewers have asked a few times whether this could cause the endpoint to serve stale data; the short answer is no:

  • Between requests: each request starts with self._finding_ids_cache = None and re-materializes the finding IDs against the current DB state. There is no cross-request caching.
  • Inside the same request: the cache forces all RFM helpers (_build_resource_mapping_queryset, _build_resource_ordering_queryset, _build_resource_aggregation) to operate on the same snapshot of finding IDs. This is actually more consistent than the previous behavior, where each helper re-ran the findings subquery and could each observe slightly different state under PostgreSQL's default read-committed isolation.
  • Concurrent writes during the request:
    • New findings created after materialization are not visible in this request (same as before — the old code would also race depending on when each subquery ran).
    • New ResourceFindingMapping rows for already-cached finding IDs are still visible, because each RFM query runs live; only the IN (...) list is frozen.
    • Deleted findings leave their UUIDs in the inlined list, but the RFM join simply returns no rows for them — same net result as the un-cached version.
  • Identity check: the cache is keyed by identity (is) on the filtered_queryset object, not by equality. If a caller were ever to pass a different queryset instance (a clone), the cache would miss and we would re-materialize, so there is no risk of serving data from a different filter set.
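
For reference, a minimal sketch of the cache shape (only _resolve_finding_ids and _finding_ids_cache are names from the actual change; the identity-key attribute below is named illustratively):

```python
class FindingIdCacheMixin:
    """Illustrative shape of the per-request cache, not the exact code."""

    _finding_ids_cache = None
    _finding_ids_cache_for = None  # illustrative name for the identity key

    def _resolve_finding_ids(self, filtered_queryset):
        # Identity check ("is"), not equality: a different queryset object,
        # even an equal clone, misses the cache and re-materializes.
        if self._finding_ids_cache_for is not filtered_queryset:
            self._finding_ids_cache = list(
                filtered_queryset.values_list("id", flat=True)
            )
            self._finding_ids_cache_for = filtered_queryset
        return self._finding_ids_cache
```

Since DRF builds a fresh ViewSet instance per dispatch, both attributes start at their class defaults on every request, which is what makes the cache request-scoped.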

Comparison

Measured against the same tenant in the dev environment. Numbers are silk-reported totals (AWS RDS in eu-west-1 accessed from a local Docker container, so each round-trip inflates query timings; DB time itself is sub-millisecond after the fix).

/finding-groups/latest/s3_bucket_public_access/resources (group with mappings, common path)

| Variant | Total time | SQL time | # SQL queries | Slowest RFM query (real DB time) |
| --- | --- | --- | --- | --- |
| Baseline (before) | 7 872 – 8 709 ms | 3 128 – 3 549 ms | 8 | 586 ms (Merge Semi Join) |
| After dropping the exists() pre-check | 7 073 ms | 2 650 ms | 7 | 586 ms (Merge Semi Join still there on the count) |
| After materialization + per-request cache | 6 626 – 6 717 ms | 2 490 – 2 614 ms | 8 | < 1 ms (Nested Loop + Index Scan) |

/finding-groups/latest/aws-autoscaling-enable-at-rest-encryption/resources (IaC-like group with no mappings, orphan fallback path)

| Variant | Total time | SQL time | # SQL queries |
| --- | --- | --- | --- |
| Baseline (before) | 6 681 ms | 2 554 ms | 8 |
| Without the per-request cache | 8 280 – 8 315 ms | 3 273 – 3 346 ms | 11 (3× duplicate materialization) |
| With per-request cache | 6 949 ms | 2 679 ms | 9 (+1 single materialization query) |

The orphan path is essentially on par with the baseline; the mapping path (which is the common case) sees the SQL plan collapse from 586 ms per RFM touch to sub-millisecond, and shaves ~1.3–2 s off the request against the remote RDS setup used for benchmarking. In production, where the API is co-located with the database, the remaining silk-reported latency (which is dominated by network round-trip time) will collapse as well.

Steps to review

  1. Code review FindingGroupViewSet in api/src/backend/api/v1/views.py:
    • _resolve_finding_ids helper and its request-scoped cache.
    • _build_resource_mapping_queryset now uses the Python list instead of Subquery.
    • _paginated_resource_response drops the has_mappings.exists() branch and piggybacks on the paginator count to detect orphan-only groups.
  2. Exercise the endpoint end-to-end against a dataset that has both a mapping-backed group and an IaC/orphan-only group, and confirm:
    • Rows returned match the previous behavior.
    • Pagination (page[number], page[size]) and sort (sort=status,severity,delta,...) still work.
    • Resource filters (filter[resource_uid], filter[resource_type], etc.) keep returning the same resources.
  3. Optional: capture EXPLAIN ANALYZE on ResourceFindingMapping queries to confirm the planner now uses Index Scan using …_finding_id_idx with Index Cond: (finding_id = ANY (...)) instead of the previous Merge Semi Join.
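
One way to capture that from a Django shell, sketched with placeholder values (tenant_id, check_id, and the exact findings filter are stand-ins for your own data; import paths depend on the project layout):

```python
# Stand-in for the filtered findings set of the group under review.
finding_ids = list(
    Finding.objects.filter(tenant_id=tenant_id, check_id=check_id)
    .values_list("id", flat=True)
)

plan = ResourceFindingMapping.objects.filter(
    tenant_id=tenant_id,
    finding_id__in=finding_ids,
).explain(analyze=True)

print(plan)
# Expected after this PR:
#   Index Scan using resource_finding_mappings_..._finding_id_idx
#   Index Cond: (finding_id = ANY ('{...}'::uuid[]))
```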

Checklist

Community Checklist
  • This feature/issue is listed in here or roadmap.prowler.com
  • Is it assigned to me, if not, request it via the issue/feature in here or Prowler Community Slack

API

  • All issue/task requirements work as expected on the API
  • Endpoint response output (same shape as before; performance-only change)
  • EXPLAIN ANALYZE output for new/modified queries or indexes (see Description)
  • Performance test results (see Comparison table)
  • Verify if API specs need to be regenerated.
  • Check if version updates are required.
  • Ensure new entries are added to api/CHANGELOG.md

License

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Two query-plan fixes in the FindingGroup viewset's /resources action:

1. Drop the has_mappings.exists() pre-check in _paginated_resource_response.
   That probe ran on every non-filtered request and re-did the same
   semi-join the paginator already runs; use the paginator's total count
   to detect orphan-only (IaC) groups instead.

2. Materialize finding_ids into a Python list before filtering
   ResourceFindingMapping. finding_id__in=Subquery(findings_qs) made the
   planner pick a Merge Semi Join that scanned hundreds of thousands of
   RFM index entries to post-filter tenant_id. An inlined UUID list flips
   the plan to Nested Loop + Index Scan on finding_id (sub-millisecond
   on real DB time).

A request-scoped cache in _resolve_finding_ids deduplicates the
materialization across helpers that build different RFM querysets from
the same filtered findings set, so we do the findings round-trip once
per request.

On the mapping path this cuts ~1-2s off the request in the measured
dev environment while keeping behavior identical for IaC/orphan groups.
@AdriiiPRodri AdriiiPRodri requested a review from a team as a code owner April 21, 2026 10:30
AdriiiPRodri added a commit that referenced this pull request Apr 21, 2026
@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

✅ All necessary CHANGELOG.md files have been updated.

@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

Conflict Markers Resolved

All conflict markers have been successfully resolved in this pull request.

@AdriiiPRodri AdriiiPRodri force-pushed the perf/api-finding-groups-resources-rfm-query-plan branch from 76d15b3 to 0b8bb98 on April 21, 2026 10:31
@github-actions
Contributor

github-actions Bot commented Apr 21, 2026

🔒 Container Security Scan

Image: prowler-api:85d786e
Last scan: 2026-04-21 10:40:36 UTC

📊 Vulnerability Summary

| Severity | Count |
| --- | --- |
| 🔴 Critical | 5 |
| Total | 5 |

4 package(s) affected

⚠️ Action Required

Critical severity vulnerabilities detected. These should be addressed before merging:

  • Review the detailed scan results
  • Update affected packages to patched versions
  • Consider using a different base image if updates are unavailable


@codecov

codecov Bot commented Apr 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 93.47%. Comparing base (dcec79d) to head (0b8bb98).
⚠️ Report is 11 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master   #10816   +/-   ##
=======================================
  Coverage   93.47%   93.47%           
=======================================
  Files         228      228           
  Lines       32027    32064   +37     
=======================================
+ Hits        29936    29973   +37     
  Misses       2091     2091           
| Flag | Coverage Δ |
| --- | --- |
| api | 93.47% <100.00%> (+<0.01%) ⬆️ |

Flags with carried forward coverage won't be shown.

| Components | Coverage Δ |
| --- | --- |
| prowler | ∅ <ø> (∅) |
| api | 93.47% <100.00%> (+<0.01%) ⬆️ |

Contributor

@josema-xyz josema-xyz left a comment


🙌

@AdriiiPRodri AdriiiPRodri merged commit 548389d into master Apr 21, 2026
41 checks passed
@AdriiiPRodri AdriiiPRodri deleted the perf/api-finding-groups-resources-rfm-query-plan branch April 21, 2026 10:54
@AdriiiPRodri AdriiiPRodri added the backport-to-v5.24 Backport PR to the v5.24 branch label Apr 21, 2026
@prowler-bot prowler-bot added the was-backported The PR was successfully backported to the target branch label Apr 21, 2026
@prowler-bot
Collaborator

💚 All backports created successfully

Branch: v5.24

Questions?

Please refer to the Backport tool documentation and see the GitHub Action logs for details.


Labels

backport-to-v5.24 Backport PR to the v5.24 branch component/api was-backported The PR was successfully backported to the target branch
