[Bug] Speculative decoding backends are unreachable, silent cache mismatches, and multiple config/doc inconsistencies in cloud-edge LLM example #361

@hharshhsaini

Description

Summary

While reproducing the cloud-edge-collaborative-inference-for-llm benchmark end-to-end inside a Docker container (following the official Quick Start), I encountered a chain of interconnected bugs that collectively make the speculative decoding feature unusable and introduce silent reproducibility failures for all users. These issues are distinct from those reported in #357 (Windows compatibility), #360 (missing retry + broken LadeSpecDecLLM import), and #356 (HTTP 400 crash).

The core theme: the example advertises 5 edge model backends (huggingface, vllm, api, EagleSpecDec, LadeSpecDec), but only 3 actually work, and when the other 2 fail, the error messages are misleading or absent entirely.

Full RUNLOG with every command and output: RUNLOG.md
Generated artifacts (logs, configs, fixes): Ianvs_Generated_Artifacts

Environment

| Item | Details |
| --- | --- |
| Host OS | Windows 11 (Build 26100), x86_64 |
| Docker | Docker Desktop for Windows (Linux containers via WSL2) |
| Docker Base Image | continuumio/miniconda3:latest |
| Container OS | Linux (kernel 6.6.87.2-microsoft-standard-WSL2) |
| Python | 3.8.20 (inside conda environment ianvs-experiment) |
| Conda | 25.11.1 |
| Ianvs | v0.1.0 (installed from source via pip install -e .) |
| Sedna | Installed from bundled .whl (sedna-0.6.0.1) |
| Datasets | GPQA + MMLU-5-shot (pre-downloaded from Kaggle) |
| GPU | Not available - no GPU passthrough to Docker container |
| Docker Image Size | ~24.5 GB (includes datasets + model caches) |

Note: Since no GPU was available in the Docker container, all inference was done through pre-cached results stored in the workspace directories (workspace-gpqa/ and workspace-mmlu/). The cache mechanism in Ianvs matches the current run configuration against stored cache entries and returns cached responses if found, avoiding the need to actually load and run the LLM models.


Problem 1 - Backend validation in edge_model.py rejects speculative decoding backends before they can load

Location

edge_model.py -> EdgeModel.__init__(), line 47

Error

ValueError: Unsupported backend: EagleSpecDec. Supported options are: 'huggingface', 'vllm', 'api'.

Root Cause

__init__() validates the backend against only 3 values:

# edge_model.py, line 47
if self.backend not in ["huggingface", "vllm", "api"]:
    raise ValueError(
        f"Unsupported backend: {self.backend}. Supported options are: 'huggingface', 'vllm', 'api'."
    )

But load() (same file, lines 67-77) expects 5 backends:

# edge_model.py, lines 67-77
def load(self, **kwargs):
    if self.backend == "huggingface":
        self.model = HuggingfaceLLM(**self.kwargs)
    elif self.backend == "vllm":
        self.model = VllmLLM(**self.kwargs)
    elif self.backend == "api":
        self.model = APIBasedLLM(**self.kwargs)
    elif self.backend == "EagleSpecDec":            # ← never reached
        self.model = EagleSpecDecModel(**self.kwargs)
    elif self.backend == "LadeSpecDec":              # ← never reached
        self.model = LadeSpecDecLLM(**self.kwargs)

Result: If a user sets backend: "EagleSpecDec" or backend: "LadeSpecDec" in test_queryrouting.yaml (as the YAML comments suggest they can), __init__() raises ValueError before load() is ever called. The speculative decoding code paths are dead code.

Impact

  • Speculative decoding is a headline feature of this example (mentioned in README title, architecture diagram, and results table)
  • The README Results section (lines 412-423) shows EagleSpecDec benchmark results, implying this backend works - but it cannot be selected
  • Users following the YAML documentation comments will hit this crash immediately

Reproduction Steps

# In test_queryrouting.yaml, uncomment EagleSpecDec:
#   - backend:
#       values:
#         - "EagleSpecDec"

ianvs -f examples/cloud-edge-collaborative-inference-for-llm/benchmarkingjob.yaml
# -> ValueError: Unsupported backend: EagleSpecDec

Proposed Fix

# edge_model.py, line 47
- if self.backend not in ["huggingface", "vllm", "api"]:
+ if self.backend not in ["huggingface", "vllm", "api", "EagleSpecDec", "LadeSpecDec"]:
      raise ValueError(
-         f"Unsupported backend: {self.backend}. Supported options are: 'huggingface', 'vllm', 'api'."
+         f"Unsupported backend: {self.backend}. Supported options are: 'huggingface', 'vllm', 'api', 'EagleSpecDec', 'LadeSpecDec'."
      )
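Beyond widening the list, a more robust pattern is a single backend registry shared by `__init__()` and `load()`, so the validation list and the dispatch branches cannot drift apart again. The sketch below is illustrative, not the actual module; the model classes are empty stand-ins for the real ones in edge_model.py:

```python
# Hypothetical sketch: one registry drives both validation and dispatch.
# The five classes below are stand-ins for the real model wrappers.

class HuggingfaceLLM: pass
class VllmLLM: pass
class APIBasedLLM: pass
class EagleSpecDecModel: pass
class LadeSpecDecLLM: pass

BACKENDS = {
    "huggingface": HuggingfaceLLM,
    "vllm": VllmLLM,
    "api": APIBasedLLM,
    "EagleSpecDec": EagleSpecDecModel,
    "LadeSpecDec": LadeSpecDecLLM,
}

class EdgeModel:
    def __init__(self, backend, **kwargs):
        # Validation and dispatch share the same source of truth.
        if backend not in BACKENDS:
            raise ValueError(
                f"Unsupported backend: {backend}. "
                f"Supported options are: {', '.join(BACKENDS)}."
            )
        self.backend = backend
        self.kwargs = kwargs

    def load(self):
        self.model = BACKENDS[self.backend](**self.kwargs)

edge = EdgeModel("EagleSpecDec")
edge.load()
print(type(edge.model).__name__)  # EagleSpecDecModel
```

With this shape, adding a sixth backend is a one-line change to `BACKENDS` instead of two edits that can fall out of sync.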

Problem 2 - Typo in test_queryrouting.yaml: LadeSepcDec instead of LadeSpecDec

Location

test_queryrouting.yaml, line 32

Details

# Current (line 32):
#  5> "LadeSepcDec": Lookahead Decoding framework;
#        ^^^^
#     Typo: "Sepc" should be "Spec"

# Should be:
#  5> "LadeSpecDec": Lookahead Decoding framework;

Impact

  • Users who copy the backend name from the YAML comments will use "LadeSepcDec" (wrong), which won't match the load() code that checks for "LadeSpecDec" (correct)
  • Even after fixing Problem 1, users who follow the comments will still hit a silent failure - the backend won't match any elif branch in load(), and the model will never be initialized (no error raised, just AttributeError later when self.model is accessed)

Proposed Fix

# test_queryrouting.yaml, line 32
-             #  5> "LadeSepcDec": Lookahead Decoding framework;
+             #  5> "LadeSpecDec": Lookahead Decoding framework;
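Independent of the comment fix, the silent-failure mode described above can be hardened by making `load()` fail loudly on an unrecognized backend instead of leaving `self.model` unset. A minimal sketch with a stand-in model class (hypothetical, not the real edge_model.py):

```python
class DummyLLM:
    """Stand-in for the real LadeSpecDecLLM wrapper."""
    pass

class EdgeModel:
    """Trimmed-down stand-in for edge_model.py's EdgeModel."""
    def __init__(self, backend):
        self.backend = backend
        self.model = None

    def load(self):
        if self.backend == "LadeSpecDec":
            self.model = DummyLLM()
        else:
            # Previously: silent fall-through left self.model as None,
            # and the failure surfaced much later as AttributeError.
            raise ValueError(f"load() has no branch for backend {self.backend!r}")

EdgeModel("LadeSpecDec").load()      # correct spelling: loads fine
try:
    EdgeModel("LadeSepcDec").load()  # the YAML typo now fails immediately
except ValueError as e:
    print(e)
```

The typo'd backend now produces an immediate, descriptive error at load time rather than a confusing crash during inference.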

Problem 3 - Cache config mismatch silently skips 14,000+ cached results with no warning

Location

base_llm.py -> BaseLLM._load_cache(), line 251

Details

The cache lookup uses exact config dict equality:

# base_llm.py, line 251
for cache in self.cache_models:
    if cache["config"] == self.config:   # ← strict equality, no logging on mismatch
        self.cache = cache
        self.cache_hash = {item["query"]:item['response'] for item in cache["result"]}

When the cached results were generated with backend: "vllm" but the user's current config has backend: "huggingface" (common when running on a CPU-only machine), the comparison silently fails. There is:

  • No log message saying "cache found but config mismatch"
  • No indication of what differed (backend? model? temperature?)
  • No hint that 14,042 cached results exist but aren't being used

The user's only clue is that the benchmark suddenly tries to download a 14GB+ model from HuggingFace instead of using cached results - with no explanation of why.

Impact

  • First-time users waste hours downloading large models they don't need
  • The silence makes debugging extremely difficult - was it a path issue? A config issue? A cache format issue?
  • This directly affected our MMLU benchmark run, where the provided workspace-mmlu cache had backend: "vllm" but our Docker setup used backend: "huggingface"

Proposed Fix

# base_llm.py, _load_cache() method
  for cache in self.cache_models:
      if cache["config"] == self.config:
          self.cache = cache
          self.cache_hash = {item["query"]:item['response'] for item in cache["result"]}
+         break
+ else:
+     for cache in self.cache_models:
+         diff_keys = [k for k in set(cache["config"]) | set(self.config)
+                      if cache["config"].get(k) != self.config.get(k)]
+         LOGGER.warning(
+             "Cache entry found but config mismatch on keys: %s. "
+             "Cached: %s vs Current: %s. %d cached results skipped.",
+             diff_keys,
+             {k: cache["config"].get(k) for k in diff_keys},
+             {k: self.config.get(k) for k in diff_keys},
+             len(cache.get("result", []))
+         )

(The for/else ensures the warning fires only when no cache entry matched, instead of once per non-matching entry.)
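The config-diff logic in the proposed fix can also be pulled out into a standalone helper, which makes it trivially unit-testable. A sketch (the key names and values below are illustrative, not taken from a real cache file):

```python
def config_diff(cached: dict, current: dict) -> list:
    """Return the sorted keys whose values differ between two config dicts.

    Keys present in only one dict also count as differing, since
    dict.get() returns None for the missing side.
    """
    return sorted(k for k in set(cached) | set(current)
                  if cached.get(k) != current.get(k))

# Illustrative configs: cache built with vllm, current run on huggingface.
cached = {"backend": "vllm", "model": "qwen2-7b", "temperature": 0.8}
current = {"backend": "huggingface", "model": "qwen2-7b", "temperature": 0.8}
print(config_diff(cached, current))  # ['backend']
```

A warning built on this helper would have pointed straight at the `backend` mismatch described above, instead of leaving the user to guess.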

Problem 4 - Dockerfile does not install the retry dependency

Location

Dockerfile, line 25 and requirements.txt

Details

The requirements.txt for this example contains:

vllm
transformers
openai
accelerate
datamodel_code_generator
kaggle
groq

But base_llm.py line 7 imports:

from retry import retry

The retry package is not listed in requirements.txt. When following the Docker Quick Start, pip install -r requirements.txt succeeds, but the benchmark crashes at runtime with:

ModuleNotFoundError: No module named 'retry'

Note: This specific missing dependency is also reported in #360. I include it here because the Dockerfile workflow (the primary recommended setup path) is directly affected, and the fix should be part of the same requirements.txt patch.

Proposed Fix

# requirements.txt
  vllm
  transformers
  openai
  accelerate
  datamodel_code_generator
  kaggle
  groq
+ retry

Problem 5 - README inconsistencies and undocumented configuration gaps

5a. README contradicts itself on Joint Inference mode

Line 65-67:

In this example, we will rely on Ianvs' Joint Inference Paradigm using the inference-then-mining mode to implement a Query Routing strategy.

But benchmarkingjob.yaml line 8:

hard_example_mining_mode: "mining-then-inference"

The README says inference-then-mining, the actual config uses mining-then-inference. One of them is wrong. Based on the query routing design (first decide routing, then infer), mining-then-inference in the YAML appears correct, and the README should be updated.

5b. README recommends Python 3.8, Dockerfile uses Python 3.10

  • README line 162: conda create -n ianvs-experiment python=3.8
  • Dockerfile line 5: PYTHON_VERSION=3.10
  • README line 91: Python 3.8+ environment

While both work, the inconsistency is confusing. The Dockerfile (the recommended quick start) uses 3.10, but a user following the "Detailed Setup Guide" would install 3.8.

5c. No MMLU-specific benchmarkingjob.yaml or testenv.yaml provided

The README mentions both GPQA and MMLU-5-shot datasets (line 186), and the Dockerfile copies both datasets (lines 29-32):

COPY dataset-gpqa/ /ianvs/dataset/gpqa/
COPY workspace-gpqa/ /ianvs/workspace-gpqa/
COPY dataset-mmlu/ /ianvs/dataset/mmlu-5-shot/
COPY workspace-mmlu/ /ianvs/workspace-mmlu/

But only GPQA configs are provided in the repository:

  • benchmarkingjob.yaml -> hardcodes workspace: "./workspace-gpqa"
  • testenv.yaml -> hardcodes train_data: "./dataset/gpqa/train_data/data.json"

To run MMLU, users must manually create benchmarkingjob-mmlu.yaml and testenv-mmlu.yaml with the correct paths - but this is not documented anywhere. During our benchmark run, we had to reverse-engineer the correct config by examining the cache structure.
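For reference, a minimal sketch of what the missing MMLU configs might look like, showing only the paths that differ from the GPQA versions. This assumes the MMLU dataset lands at `./dataset/mmlu-5-shot/` (as the Dockerfile COPY suggests) and mirrors the GPQA layout; the exact schema and directory structure would need to be confirmed against the real configs:

```yaml
# benchmarkingjob-mmlu.yaml (sketch - fields beyond these follow benchmarkingjob.yaml)
benchmarkingjob:
  workspace: "./workspace-mmlu"
  testenv: "./testenv-mmlu.yaml"

# testenv-mmlu.yaml (sketch - assumes the same train_data layout as GPQA)
testenv:
  dataset:
    train_data: "./dataset/mmlu-5-shot/train_data/data.json"
```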

5d. Dockerfile COPY paths are undocumented

The Dockerfile (lines 29-32) expects four directories (dataset-gpqa/, workspace-gpqa/, dataset-mmlu/, workspace-mmlu/) to exist alongside it, but the README's Docker Quick Start (lines 98-145) never mentions preparing these directories. The user is told to download datasets inside the container after it's running, which contradicts the Dockerfile's COPY instructions.

Proposed Fix

  • Update README line 67: change inference-then-mining -> mining-then-inference
  • Standardize Python version to 3.10 in both README and Dockerfile
  • Add benchmarkingjob-mmlu.yaml and testenv-mmlu.yaml (or a parametric config that accepts dataset name)
  • Document the required directory structure for docker build in the Docker Quick Start section

Summary Table

| # | Bug | File(s) | Severity | Affects |
| --- | --- | --- | --- | --- |
| 1 | Backend validation rejects EagleSpecDec/LadeSpecDec | edge_model.py:47 | 🔴 Critical - feature completely broken | All platforms |
| 2 | Typo LadeSepcDec in YAML comments | test_queryrouting.yaml:32 | 🟡 Medium - misleads users who copy from comments | All platforms |
| 3 | Silent cache config mismatch (no warning) | base_llm.py:251 | 🟠 High - wastes hours, impossible to debug | All platforms |
| 4 | Missing retry in requirements.txt | requirements.txt | 🔴 Critical - immediate crash | All platforms |
| 5a | README wrong Joint Inference mode | README.md:67 | 🟡 Medium - documentation error | All platforms |
| 5b | Python version inconsistency | README.md:162, Dockerfile:5 | 🟢 Low - both work but confusing | All platforms |
| 5c | No MMLU benchmark configs provided | Missing files | 🟠 High - second dataset unusable without manual config | All platforms |
| 5d | Dockerfile COPY paths undocumented | Dockerfile:29-32, README.md | 🟡 Medium - Docker build fails without prep | Docker users |

Acceptance Criteria

  • backend: "EagleSpecDec" no longer raises ValueError in edge_model.py
  • YAML comment typo LadeSepcDec -> LadeSpecDec in test_queryrouting.yaml
  • Cache mismatch logs a WARNING with the differing keys and number of skipped results
  • retry added to requirements.txt
  • README Joint Inference mode matches benchmarkingjob.yaml
  • MMLU-specific config files provided (or config parameterized)
  • Dockerfile COPY prerequisites documented in README

Labels: kind/bug