# [Bug] Speculative decoding backends are unreachable, silent cache mismatches, and multiple config/doc inconsistencies in cloud-edge LLM example #361
## Summary
While reproducing the cloud-edge-collaborative-inference-for-llm benchmark end-to-end inside a Docker container (following the official Quick Start), I encountered a chain of interconnected bugs that collectively make the speculative decoding feature unusable and introduce silent reproducibility failures for all users. These issues are distinct from those reported in #357 (Windows compatibility), #360 (missing retry + broken LadeSpecDecLLM import), and #356 (HTTP 400 crash).
The core theme: the example advertises 5 edge model backends (`huggingface`, `vllm`, `api`, `EagleSpecDec`, `LadeSpecDec`), but only 3 actually work, and when the other 2 fail, the error messages are misleading or entirely absent.
- Full RUNLOG with every command and output: RUNLOG.md
- Generated artifacts (logs, configs, fixes): Ianvs_Generated_Artifacts
## Environment
| Item | Details |
|---|---|
| Host OS | Windows 11 (Build 26100), x86_64 |
| Docker | Docker Desktop for Windows (Linux containers via WSL2) |
| Docker Base Image | continuumio/miniconda3:latest |
| Container OS | Linux (kernel 6.6.87.2-microsoft-standard-WSL2) |
| Python | 3.8.20 (inside conda environment ianvs-experiment) |
| Conda | 25.11.1 |
| Ianvs | v0.1.0 (installed from source via pip install -e .) |
| Sedna | Installed from bundled .whl (sedna-0.6.0.1) |
| Datasets | GPQA + MMLU-5-shot (pre-downloaded from Kaggle) |
| GPU | Not available - no GPU passthrough to Docker container |
| Docker Image Size | ~24.5 GB (includes datasets + model caches) |
Note: Since no GPU was available in the Docker container, all inference was done through pre-cached results stored in the workspace directories (`workspace-gpqa/` and `workspace-mmlu/`). The cache mechanism in Ianvs matches the current run configuration against stored cache entries and returns cached responses if found, avoiding the need to actually load and run the LLM models.
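The matching step described above can be sketched in a few lines. The `config`/`result`/`query`/`response` field names follow the `base_llm.py` code quoted in Problem 3; the entries themselves are invented for illustration:

```python
# Sketch of the Ianvs cache lookup described above. Field names mirror
# base_llm.py; the cache contents here are invented for illustration.
cache_models = [
    {
        "config": {"backend": "vllm", "model": "example-edge-model"},
        "result": [{"query": "Q1", "response": "A1"}],
    },
]
current_config = {"backend": "vllm", "model": "example-edge-model"}

cache_hash = {}
for cache in cache_models:
    if cache["config"] == current_config:  # exact dict equality, as in Ianvs
        cache_hash = {item["query"]: item["response"] for item in cache["result"]}

print(cache_hash.get("Q1"))  # -> A1
```

Because the match is exact dict equality, any divergence between the stored config and the current run configuration means no cache hit — which is the root of Problem 3 below.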
## Problem 1 - Backend validation in edge_model.py rejects speculative decoding backends before they can load
### Location

`edge_model.py` -> `EdgeModel.__init__()`, line 47
### Error

```
ValueError: Unsupported backend: EagleSpecDec. Supported options are: 'huggingface', 'vllm', 'api'.
```
### Root Cause

`__init__()` validates the backend against only 3 values:
```python
# edge_model.py, line 47
if self.backend not in ["huggingface", "vllm", "api"]:
    raise ValueError(
        f"Unsupported backend: {self.backend}. Supported options are: 'huggingface', 'vllm', 'api'."
    )
```

But `load()` (same file, lines 74-77) expects 5 backends:
```python
# edge_model.py, lines 67-77
def load(self, **kwargs):
    if self.backend == "huggingface":
        self.model = HuggingfaceLLM(**self.kwargs)
    elif self.backend == "vllm":
        self.model = VllmLLM(**self.kwargs)
    elif self.backend == "api":
        self.model = APIBasedLLM(**self.kwargs)
    elif self.backend == "EagleSpecDec":  # ← never reached
        self.model = EagleSpecDecModel(**self.kwargs)
    elif self.backend == "LadeSpecDec":   # ← never reached
        self.model = LadeSpecDecLLM(**self.kwargs)
```

Result: If a user sets `backend: "EagleSpecDec"` or `backend: "LadeSpecDec"` in `test_queryrouting.yaml` (as the YAML comments suggest they can), `__init__()` raises `ValueError` before `load()` is ever called. The speculative decoding code paths are dead code.
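The ordering can be reproduced with a stripped-down sketch (`EdgeModel` here is a stand-in with the same validate-then-load structure, not the real Ianvs class):

```python
# Minimal stand-in reproducing the validation-before-load ordering.
class EdgeModel:
    def __init__(self, backend):
        # __init__ validates against only 3 backends...
        if backend not in ["huggingface", "vllm", "api"]:
            raise ValueError(f"Unsupported backend: {backend}")
        self.backend = backend

    def load(self):
        # ...so these branches can never execute: __init__ raises first.
        if self.backend == "EagleSpecDec":
            return "EagleSpecDecModel"
        if self.backend == "LadeSpecDec":
            return "LadeSpecDecLLM"

try:
    EdgeModel("EagleSpecDec").load()
except ValueError as e:
    print(e)  # -> Unsupported backend: EagleSpecDec
```

No input can reach the `EagleSpecDec`/`LadeSpecDec` branches of `load()`, which is exactly the dead-code situation in `edge_model.py`.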
### Impact
- Speculative decoding is a headline feature of this example (mentioned in the README title, architecture diagram, and results table)
- The README Results section (lines 412-423) shows `EagleSpecDec` benchmark results, implying this backend works - but it cannot be selected
- Users following the YAML documentation comments will hit this crash immediately
### Reproduction Steps
```bash
# In test_queryrouting.yaml, uncomment EagleSpecDec:
# - backend:
#     values:
#       - "EagleSpecDec"
ianvs -f examples/cloud-edge-collaborative-inference-for-llm/benchmarkingjob.yaml
# -> ValueError: Unsupported backend: EagleSpecDec
```

### Proposed Fix
```diff
 # edge_model.py, line 47
-if self.backend not in ["huggingface", "vllm", "api"]:
+if self.backend not in ["huggingface", "vllm", "api", "EagleSpecDec", "LadeSpecDec"]:
     raise ValueError(
-        f"Unsupported backend: {self.backend}. Supported options are: 'huggingface', 'vllm', 'api'."
+        f"Unsupported backend: {self.backend}. Supported options are: 'huggingface', 'vllm', 'api', 'EagleSpecDec', 'LadeSpecDec'."
     )
```

## Problem 2 - Typo in test_queryrouting.yaml: LadeSepcDec instead of LadeSpecDec
### Location

`test_queryrouting.yaml`, line 32
### Details
```yaml
# Current (line 32):
# 5> "LadeSepcDec": Lookahead Decoding framework;
#        ^^^^
# Typo: "Sepc" should be "Spec"

# Should be:
# 5> "LadeSpecDec": Lookahead Decoding framework;
```

### Impact
- Users who copy the backend name from the YAML comments will use `"LadeSepcDec"` (wrong), which won't match the `load()` code that checks for `"LadeSpecDec"` (correct)
- Even after fixing Problem 1, users who follow the comments will still hit a silent failure - the backend won't match any `elif` branch in `load()`, and the model will never be initialized (no error raised, just an `AttributeError` later when `self.model` is accessed)
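The silent-failure path can be sketched with a stand-in class (validation is assumed to have already passed, so the typo'd name reaches `load()` unchecked):

```python
# Stand-in showing why a typo'd backend fails silently: no elif matches,
# so self.model is never assigned. Not the real Ianvs class.
class EdgeModel:
    def __init__(self, backend):
        self.backend = backend  # validation assumed already passed

    def load(self):
        if self.backend == "LadeSpecDec":   # correct spelling
            self.model = "LadeSpecDecLLM"
        # "LadeSepcDec" (the typo) matches nothing: no error, no model

m = EdgeModel("LadeSepcDec")
m.load()
print(hasattr(m, "model"))  # -> False; AttributeError surfaces later at inference
```

A fall-through `else: raise ValueError(...)` at the end of `load()` would turn this silent failure into an immediate, diagnosable error.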
### Proposed Fix
```diff
 # test_queryrouting.yaml, line 32
-# 5> "LadeSepcDec": Lookahead Decoding framework;
+# 5> "LadeSpecDec": Lookahead Decoding framework;
```

## Problem 3 - Cache config mismatch silently skips 14,000+ cached results with no warning
### Location

`base_llm.py` -> `BaseLLM._load_cache()`, line 251
### Details
The cache lookup uses exact config dict equality:

```python
# base_llm.py, line 251
for cache in self.cache_models:
    if cache["config"] == self.config:  # ← strict equality, no logging on mismatch
        self.cache = cache
        self.cache_hash = {item["query"]: item['response'] for item in cache["result"]}
```

When the cached results were generated with `backend: "vllm"` but the user's current config has `backend: "huggingface"` (common when running on a CPU-only machine), the comparison silently fails. There is:
- No log message saying "cache found but config mismatch"
- No indication of what differed (backend? model? temperature?)
- No hint that 14,042 cached results exist but aren't being used
The user's only clue is that the benchmark suddenly tries to download a 14GB+ model from HuggingFace instead of using cached results - with no explanation of why.
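The failure mode is easy to demonstrate. The sketch below uses invented config dicts, and shows that computing *which* keys differ (the basis of the fix proposed below) is cheap:

```python
# Invented configs illustrating the silent mismatch described above.
cached_config = {"backend": "vllm", "model": "example-edge-model", "temperature": 0.8}
current_config = {"backend": "huggingface", "model": "example-edge-model", "temperature": 0.8}

# Strict equality fails on a single differing key, so the cache is skipped...
assert cached_config != current_config

# ...but the differing keys are trivial to compute, and would make a useful warning:
diff_keys = sorted(
    k for k in set(cached_config) | set(current_config)
    if cached_config.get(k) != current_config.get(k)
)
print(diff_keys)  # -> ['backend']
```

One line of logging with `diff_keys` would have turned hours of debugging into a single glance at the log.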
### Impact
- First-time users waste hours downloading large models they don't need
- The silence makes debugging extremely difficult - was it a path issue? A config issue? A cache format issue?
- This directly affected our MMLU benchmark run, where the provided `workspace-mmlu` cache had `backend: "vllm"` but our Docker setup used `backend: "huggingface"`
### Proposed Fix
```diff
 # base_llm.py, _load_cache() method
 for cache in self.cache_models:
     if cache["config"] == self.config:
         self.cache = cache
         self.cache_hash = {item["query"]:item['response'] for item in cache["result"]}
+    else:
+        diff_keys = [k for k in set(list(cache["config"].keys()) + list(self.config.keys()))
+                     if cache["config"].get(k) != self.config.get(k)]
+        LOGGER.warning(
+            "Cache entry found but config mismatch on keys: %s. "
+            "Cached: %s vs Current: %s. %d cached results skipped.",
+            diff_keys,
+            {k: cache["config"].get(k) for k in diff_keys},
+            {k: self.config.get(k) for k in diff_keys},
+            len(cache.get("result", []))
+        )
```

## Problem 4 - Dockerfile does not install the retry dependency
### Location

`Dockerfile`, line 25 and `requirements.txt`
### Details
The `requirements.txt` for this example contains:

```
vllm
transformers
openai
accelerate
datamodel_code_generator
kaggle
groq
```

But `base_llm.py` line 7 imports:

```python
from retry import retry
```

The `retry` package is not listed in `requirements.txt`. When following the Docker Quick Start, `pip install -r requirements.txt` succeeds, but the benchmark crashes at runtime with:

```
ModuleNotFoundError: No module named 'retry'
```
Note: This specific missing dependency is also reported in #360. I include it here because the Dockerfile workflow (the primary recommended setup path) is directly affected, and the fix should be part of the same `requirements.txt` patch.
### Proposed Fix
```diff
 # requirements.txt
 vllm
 transformers
 openai
 accelerate
 datamodel_code_generator
 kaggle
 groq
+retry
```

## Problem 5 - README inconsistencies and undocumented configuration gaps
### 5a. README contradicts itself on Joint Inference mode
README lines 65-67:

> In this example, we will rely on Ianvs' Joint Inference Paradigm using the `inference-then-mining` mode to implement a Query Routing strategy.

But `benchmarkingjob.yaml` line 8:

```yaml
hard_example_mining_mode: "mining-then-inference"
```

The README says `inference-then-mining`; the actual config uses `mining-then-inference`. One of them is wrong. Based on the query routing design (first decide routing, then infer), `mining-then-inference` in the YAML appears correct, and the README should be updated.
### 5b. README recommends Python 3.8, Dockerfile uses Python 3.10
- README line 162: `conda create -n ianvs-experiment python=3.8`
- Dockerfile line 5: `PYTHON_VERSION=3.10`
- README line 91: "Python 3.8+ environment"

While both versions work, the inconsistency is confusing. The Dockerfile (the recommended quick start) uses 3.10, but a user following the "Detailed Setup Guide" would install 3.8.
### 5c. No MMLU-specific benchmarkingjob.yaml or testenv.yaml provided
The README mentions both GPQA and MMLU-5-shot datasets (line 186), and the Dockerfile copies both datasets (lines 29-32):

```dockerfile
COPY dataset-gpqa/ /ianvs/dataset/gpqa/
COPY workspace-gpqa/ /ianvs/workspace-gpqa/
COPY dataset-mmlu/ /ianvs/dataset/mmlu-5-shot/
COPY workspace-mmlu/ /ianvs/workspace-mmlu/
```

But only GPQA configs are provided in the repository:

- `benchmarkingjob.yaml` -> hardcodes `workspace: "./workspace-gpqa"`
- `testenv.yaml` -> hardcodes `train_data: "./dataset/gpqa/train_data/data.json"`

To run MMLU, users must manually create `benchmarkingjob-mmlu.yaml` and `testenv-mmlu.yaml` with the correct paths - but this is not documented anywhere. During our benchmark run, we had to reverse-engineer the correct config by examining the cache structure.
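For reference, the fragments below show what such configs might look like. All paths are assumptions inferred from the Dockerfile `COPY` destinations and the GPQA configs, not files that exist in the repository:

```yaml
# Hypothetical benchmarkingjob-mmlu.yaml fragment (paths inferred, not official)
benchmarkingjob:
  workspace: "./workspace-mmlu"
  testenv: "./testenv-mmlu.yaml"
---
# Hypothetical testenv-mmlu.yaml fragment (paths inferred, not official)
testenv:
  dataset:
    train_data: "./dataset/mmlu-5-shot/train_data/data.json"
```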
### 5d. Dockerfile COPY paths are undocumented

The Dockerfile (lines 29-32) expects four directories (`dataset-gpqa/`, `workspace-gpqa/`, `dataset-mmlu/`, `workspace-mmlu/`) to exist alongside it, but the README's Docker Quick Start (lines 98-145) never mentions preparing these directories. The user is told to download datasets inside the container after it's running, which contradicts the Dockerfile's COPY instructions.
### Proposed Fix
- Update README line 67: change `inference-then-mining` -> `mining-then-inference`
- Standardize the Python version to 3.10 in both README and Dockerfile
- Add `benchmarkingjob-mmlu.yaml` and `testenv-mmlu.yaml` (or a parametric config that accepts a dataset name)
- Document the required directory structure for `docker build` in the Docker Quick Start section
## Summary Table
| # | Bug | File(s) | Severity | Affects |
|---|---|---|---|---|
| 1 | Backend validation rejects EagleSpecDec/LadeSpecDec | `edge_model.py:47` | 🔴 Critical - feature completely broken | All platforms |
| 2 | Typo `LadeSepcDec` in YAML comments | `test_queryrouting.yaml:32` | 🟡 Medium - misleads users who copy from comments | All platforms |
| 3 | Silent cache config mismatch (no warning) | `base_llm.py:251` | 🟠 High - wastes hours, impossible to debug | All platforms |
| 4 | Missing `retry` in requirements.txt | `requirements.txt` | 🔴 Critical - immediate crash | All platforms |
| 5a | README wrong Joint Inference mode | `README.md:67` | 🟡 Medium - documentation error | All platforms |
| 5b | Python version inconsistency | `README.md:162`, `Dockerfile:5` | 🟢 Low - both work but confusing | All platforms |
| 5c | No MMLU benchmark configs provided | Missing files | 🟠 High - second dataset unusable without manual config | All platforms |
| 5d | Dockerfile COPY paths undocumented | `Dockerfile:29-32`, `README.md` | 🟡 Medium - Docker build fails without prep | Docker users |
## Acceptance Criteria
- [ ] `backend: "EagleSpecDec"` no longer raises `ValueError` in `edge_model.py`
- [ ] YAML comment typo `LadeSepcDec` -> `LadeSpecDec` fixed in `test_queryrouting.yaml`
- [ ] Cache mismatch logs a `WARNING` with the differing keys and number of skipped results
- [ ] `retry` added to `requirements.txt`
- [ ] README Joint Inference mode matches `benchmarkingjob.yaml`
- [ ] MMLU-specific config files provided (or config parameterized)
- [ ] Dockerfile `COPY` prerequisites documented in README
## Related Issues
- #360 - Missing `retry` dependency + broken `LadeSpecDecLLM` import + undocumented OpenAI requirement
- #357 - Windows compatibility + inaccessible GPQA dataset
- #356 - HTTP 400 crash during cloud API inference
- #338 - `KeyError` crashes during metric computation