Skip to content

Fix unsafe type assertions and add panic recovery in background handlers#430

Open
WHOIM1205 wants to merge 1 commit intolitmuschaos:stagingfrom
WHOIM1205:fix/unsafe-type-assertions
Open

Fix unsafe type assertions and add panic recovery in background handlers#430
WHOIM1205 wants to merge 1 commit intolitmuschaos:stagingfrom
WHOIM1205:fix/unsafe-type-assertions

Conversation

@WHOIM1205
Copy link
Copy Markdown

Summary

This PR fixes a critical crash in the ChartHub community analytics background handler caused by unsafe type assertions on GitHub API responses.

The handler was assuming that all GitHub API responses always contain type and name fields with string values. In real production scenarios (rate limiting, API errors, malformed responses), this assumption breaks and causes a panic inside a background goroutine. Since the goroutine had no recovery mechanism, it would terminate permanently, causing community analytics data to stop updating silently.

This change makes the handler resilient to unexpected GitHub API responses and ensures analytics updates continue even after transient failures.


Problem

The background analytics handler performs direct type assertions on GitHub API responses:

  • dirD["type"].(string)
  • dirD["name"].(string)

If the GitHub API returns:

  • a rate-limit response,
  • an error payload,
  • missing fields,
  • or null values,

the type assertion panics.

Because the handler runs in an unsupervised background goroutine:

  • the panic kills the goroutine,
  • no restart or retry occurs,
  • analytics data freezes indefinitely,
  • the /community API continues serving stale data with no visible error.

This failure mode is completely silent and production-impacting.


Root Cause

  • Incorrect assumption that external GitHub API responses always follow the expected schema
  • Unsafe type assertions without the comma-ok idiom
  • No panic recovery in a long-running background goroutine

Fix

This PR introduces three minimal, targeted fixes:

  1. Safe type assertions

    • Uses the comma-ok idiom when reading type and name fields
    • Skips malformed or unexpected entries instead of panicking
  2. Panic recovery in the background handler

    • Adds scoped recover() to prevent permanent goroutine death
    • Logs panics for observability while allowing future update cycles to continue
  3. HTTP response validation

    • Verifies GitHub API response status before attempting to parse JSON

These changes preserve existing behavior while preventing crashes and silent failures.


How to Reproduce (Before Fix)

Scenario 1: GitHub API Rate Limit

  1. Deploy ChartHub without a GitHub API token
  2. Allow the analytics handler to run until the rate limit is exceeded
  3. GitHub returns a 403 with an error JSON payload
  4. The handler panics on dirD["type"].(string)
  5. Background analytics stop updating permanently

Scenario 2: GitHub API Error

  1. Temporarily block api.github.com or simulate a 5xx response
  2. The handler receives an unexpected JSON structure
  3. Unsafe type assertion panics
  4. Analytics goroutine dies silently

Impact After Fix

  • Background analytics survive GitHub API errors, rate limits, and schema changes
  • Community data continues updating after transient failures
  • Panics no longer permanently disable analytics
  • Failures are logged and observable instead of silent

This significantly improves reliability for production and LFX/CNCF-scale deployments where API rate limits and transient failures are common.


Scope of Change

  • No architectural changes
  • No behavioral changes beyond added safety
  • Minimal, production-safe fix

Checklist

  • Prevents panic from unsafe type assertions
  • Handles external API failures safely
  • Avoids silent background task failure
  • Production-tested failure scenarios

Signed-off-by: WHOIM1205 <rathourprateek8@gmail.com>
@WHOIM1205
Copy link
Copy Markdown
Author

Hi @amityt

This PR adds defensive handling for GitHub API responses in the background analytics handler to prevent panics from unsafe type assertions and to ensure the updater continues running across transient API errors (rate limits, outages, unexpected payloads).

I’d really appreciate a review when you have time — especially on the error-handling approach and goroutine recovery logic.

Thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant