[PRE-TASK] Scientific Integrity Failures in the GAN + Self-Taught Learning Lifelong Learning Benchmark #365

@phantom-712

Description


examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/


TL;DR: This example contains a one-line validation bug: training applies an encoder transform but validation does not, which has silently invalidated all reported validation metrics (BUG-5, P0) since the example's creation. Separately, the STL encoder is trained on GAN-generated images and then applied to real Cityscapes images (BUG-4, P0), making results not directly comparable to any standard Cityscapes benchmark without additional controls or disclaimers. The pipeline is also not runnable end-to-end: required config paths ship empty with no validation (BUG-1), hardcoded paths contradict config.yaml (BUG-2/3), and a hardcoded absolute ResNet-18 pretrained path in deeplabv3/model/resnet.py (BUG-6, P2) exists only on the original developer's machine.

Severity scheme: Bugs are classified P0 (Scientific Integrity), P2 (Runtime Crash), and P3 (Environment/Configuration) in descending order of severity.

1. Background

1.1 Example Introduction

This example claims to demonstrate the integration of Generative Adversarial Networks (GANs) and Self-Taught Learning (STL) for solving small sample and heterogeneous data issues in lifelong learning scenarios, specifically for semantic segmentation on the Cityscapes dataset. The example is the only implementation in the Ianvs repository that combines GAN-based data generation with self-taught learning for unseen task processing in lifelong learning. It serves as a reference implementation for researchers working on domain adaptation and few-shot learning in edge AI scenarios.

1.2 Problem Description

The investigation revealed a two-layered problem that fundamentally compromises the scientific validity of this benchmark:

LAYER 1 (P0 — Scientific Integrity): Training/Validation Distribution Mismatch

The training loop applies encoder preprocessing to images, but the validation loop does not, creating a fundamental distribution mismatch that invalidates all validation metrics. The exact code comparison demonstrates this:

Training Loop (deeplabv3/train.py:89-95):

for step, (imgs, label_imgs) in enumerate(train_loader):
    imgs = Variable(imgs).cuda()
    # encoder images
    imgs = encoder(imgs)  # <-- ENCODER APPLIED
    label_imgs = Variable(label_imgs.type(torch.LongTensor)).cuda()
    outputs = network(imgs)  # Network receives ENCODED images
    loss = loss_fn(outputs, label_imgs)

Validation Loop (deeplabv3/train.py:121-127):

for step, (imgs, label_imgs, img_ids) in enumerate(val_loader):
    with torch.no_grad():
        imgs = Variable(imgs).cuda()
        # <-- NO ENCODER APPLIED!
        label_imgs = Variable(label_imgs.type(torch.LongTensor)).cuda()
        outputs = network(imgs)  # Network receives ORIGINAL images
        loss = loss_fn(outputs, label_imgs)

Impact: All validation metrics are scientifically invalid. The model is trained on encoded image distributions but validated on original image distributions, so validation loss, early-stopping decisions, and model selection are all meaningless. The fix is a single line (adding imgs = encoder(imgs) to the validation loop); its absence has been silently invalidating the entire evaluation since the example was created.

Why this breaks metrics (concrete scenario): Suppose the encoder is a learned nonlinear transform $E(\cdot)$ with downsampling/feature extraction. Training optimizes $\mathcal{L}(f(E(x)), y)$ while validation reports $\mathcal{L}(f(x), y)$. Even if training loss decreases smoothly, validation loss can appear noisy or erratic not because of overfitting, but because $x$ and $E(x)$ belong to different input distributions (different statistics and possibly different effective spatial resolution). The validation curve therefore evaluates the network on inputs it was never trained to consume, making the reported train/val relationship and any early-stopping logic meaningless.
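The mismatch described above can be illustrated with a toy stand-in for the encoder (a hypothetical stride-2 downsampling plus a shift in channel statistics; the real $E$ is a learned network, and the shapes below only mirror those reported later for BUG-5):

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_encoder(x):
    # Hypothetical stand-in for the learned transform E(.): stride-2
    # downsampling plus a shift in channel statistics (illustrative only).
    return x[..., ::2, ::2] * 0.5 + 1.0

raw = rng.normal(size=(3, 3, 1024, 512))  # what the validation loop feeds the network
enc = toy_encoder(raw)                    # what the training loop feeds the network

# The network consumes two different input distributions:
print(raw.shape, enc.shape)   # (3, 3, 1024, 512) (3, 3, 512, 256)
print(float(raw.mean()), float(enc.mean()))  # the statistics differ as well
```

Any comparison between a loss computed on `enc`-like inputs and one computed on `raw`-like inputs is a comparison across distributions, not across epochs.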

LAYER 2 (P0 — Scientific Integrity): Encoder Data Distribution Mismatch

The encoder used for preprocessing is trained on GAN-generated fake images (selftaughtlearning/train.py:118), but then applied to real Cityscapes images during DeepLabV3 training (deeplabv3/train.py:92). This creates a domain transfer problem where:

  1. Encoder training data: GAN-generated synthetic images (../data/fake_imgs/)
  2. Encoder application: Real Cityscapes images (cityscapes_data_path)
  3. Model training: Network receives encoder(real_images) → encoded representations

This makes results incomparable to any standard Cityscapes benchmark (which uses original images), and the "improvement" attributed to self-taught learning cannot be isolated from the encoder's preprocessing effect.
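A toy numerical sketch of why this domain transfer matters: fit a linear "encoder" (top-k PCA, a crude stand-in for the autoencoder) on one distribution and apply it to another. The distributions, dimensions, and parameters here are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: "fake" (GAN-generated) and "real" feature vectors drawn from
# different distributions; all numbers here are invented for illustration.
fake = rng.normal(loc=0.0, scale=1.0, size=(500, 16))
real = rng.normal(loc=0.8, scale=1.6, size=(500, 16))

# "Encoder" = top-8 PCA fit on fake data only, mirroring how the STL encoder
# sees only ../data/fake_imgs/ during its own training.
mu = fake.mean(axis=0)
_, _, vt = np.linalg.svd(fake - mu, full_matrices=False)
proj = vt[:8]

def recon_error(x):
    z = (x - mu) @ proj.T   # encode into the subspace learned on fake data
    xhat = z @ proj + mu    # decode back
    return float(np.mean((x - xhat) ** 2))

# Reconstruction error grows on the domain the encoder never saw.
print(recon_error(fake), recon_error(real))
```

The same effect applies to the learned autoencoder: its transform is tuned to fake-image statistics, so its behavior on real Cityscapes images is an uncontrolled variable in the benchmark.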

Additional Runtime and Configuration Bugs (P2/P3):

(BUG-4 and BUG-5 are the P0 scientific integrity failures described in LAYER 1 and LAYER 2 above — they are also documented with code-level evidence in Section 1.3. The bugs below are the P2/P3 runtime and configuration failures that prevent the pipeline from executing at all.)

BUG-1 (config.yaml): cityscapes_data_path, cityscapes_meta_path, and class_weights ship as empty strings with no startup validation. The pipeline continues silently past config loading until a cryptic downstream crash with no indication of which key is missing.

BUG-2 (deeplabv3/train.py:46): Hardcoded encoder path '../self-taught-learning/train_results/encoder_models4/encoder50.pth' has three simultaneous mismatches vs config.yaml — wrong parent directory, wrong folder suffix, and wrong checkpoint epoch — causing FileNotFoundError.

BUG-3 (GAN/generate_fake_imgs.py:23,42): Hardcoded GAN model name 'test1' vs config 'test2', and output directory 'fake_imgs1' vs expected 'fake_imgs', causing path mismatches that prevent execution.

BUG-6 (deeplabv3/model/resnet.py:179): Hardcoded absolute path '/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth' to a ResNet-18 pretrained checkpoint that only exists on the original developer's machine. Prevents DeepLabV3 backbone initialization on any other system. Discovered as the fourth consecutive runtime crash in forward execution order during fix verification. Fix: configurable path + graceful random-init fallback.

Note: deeplabv3/model/aspp.py:46 and deeplabv3/model/deeplabv3.py:33 also use the deprecated nn.functional.upsample — should be nn.functional.interpolate for PyTorch compatibility. Observed as a warning during the mock verification run.

1.3 Debug Process and Evidence

Environment: Ubuntu 24.04.1 LTS, Python 3.12.3 (WSL2), commit 1d3da5b. The official guide specifies Python 3.9; no version guard exists, so mismatches surface as generic errors. All logs are from direct execution.


BUG-1 — ValueError: Empty config paths (P3)

Location: config.yaml / deeplabv3/train.py startup

Cause: cityscapes_data_path, cityscapes_meta_path, class_weights ship as empty strings with no validation. Pipeline continues silently until a downstream crash. The log below is from the fixed code with startup validation added — this is the intended user-facing error after BUG-1 is patched. The original unpatched code produces a cryptic downstream crash with no indication of which config key is missing.

Log:

Traceback (most recent call last):
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/train.py", line 161, in <module>
    train_deepblabv3()
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/train.py", line 54, in train_deepblabv3
    raise ValueError(f"config.yaml: '{key}' is empty. Please set this path.")
ValueError: config.yaml: 'cityscapes_data_path' is empty. Please set this path.

Fix: Add startup key validation raising ValueError with the missing key name.
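A minimal sketch of this validation, assuming config.yaml loads into a flat dict (the actual structure may nest these keys differently):

```python
REQUIRED_KEYS = ("cityscapes_data_path", "cityscapes_meta_path", "class_weights")

def validate_config(configs):
    # Fail fast with the missing key's name instead of a cryptic downstream crash.
    for key in REQUIRED_KEYS:
        if not configs.get(key):
            raise ValueError(f"config.yaml: '{key}' is empty. Please set this path.")

# An incomplete config now raises immediately with an actionable message:
try:
    validate_config({"cityscapes_data_path": "",
                     "cityscapes_meta_path": "/data/meta",
                     "class_weights": "class_weights.pkl"})
except ValueError as e:
    print(e)  # config.yaml: 'cityscapes_data_path' is empty. Please set this path.
```

Calling this once at the top of train_deepblabv3() produces exactly the user-facing error shown in the log above.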


BUG-2 — FileNotFoundError: Hardcoded encoder checkpoint path (P2)

Location: deeplabv3/train.py, line 46

Cause: Three simultaneous mismatches vs config: wrong parent dir (self-taught-learning vs selftaughtlearning), wrong suffix (encoder_models4 vs encoder_models), wrong epoch (encoder50 vs encoder100).

Log:

Traceback (most recent call last):
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/train.py", line 152, in <module>
    train_deepblabv3()
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/train.py", line 48, in train_deepblabv3
    encoder.load_state_dict(torch.load(
                            ^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 1500, in load
    with _open_file_like(f, "rb") as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 768, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 749, in __init__
    super().__init__(open(name, mode))  # noqa: SIM115
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '../self-taught-learning/train_results/encoder_models4/encoder50.pth'

Fix: Construct path from config keys STL[3]['name'] and STL[0]['iter'].


BUG-3 — FileNotFoundError: Hardcoded GAN checkpoint path (P2)

Location: GAN/generate_fake_imgs.py, lines 23 and 42

Cause: Hardcoded name 'test1' vs config 'test2'; output dir 'fake_imgs1' vs expected 'fake_imgs'.

Log:

Traceback (most recent call last):
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/GAN/generate_fake_imgs.py", line 26, in <module>
    weights = torch.load(os.getcwd() + '/train_results/test1/models/50000.pth')
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 1500, in load
    with _open_file_like(f, "rb") as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 768, in _open_file_like
    return _open_file(name_or_buffer, mode)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 749, in __init__
    super().__init__(open(name, mode))  # noqa: SIM115
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/train_results/test1/models/50000.pth'

Fix: Read GAN[3]['name'] and GAN[0]['iter'] from config; fix output dir.


Minor issue — ModuleNotFoundError: util

Location: GAN/generate_fake_imgs.py

Cause: Running from inside GAN/ sets sys.path[0] to GAN/, so from util import load_yaml fails because the example root is not on sys.path.

Log:

ModuleNotFoundError: No module named 'util'

Fix:

import os
import sys
sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), "..")))

BUG-4 — Encoder domain mismatch: trained on fake images, applied to real images (P0)

Location: selftaughtlearning/train.py:118, deeplabv3/train.py:92

Cause: The STL encoder is trained exclusively on GAN-generated fake images (../data/fake_imgs/) but is then applied to real Cityscapes images during DeepLabV3 training. This means the encoder has never seen real image statistics during its own training, yet its learned transform is applied to real images before segmentation. No crash occurs — this is a silent architectural mismatch discovered through code review.

Evidence (code-level):

# selftaughtlearning/train.py:118 — encoder trained on fake images:
dataset = DatasetAutoEncoder(fake_images_path='../data/fake_imgs/')

# deeplabv3/train.py:92 — same encoder applied to real Cityscapes images:
imgs = encoder(imgs)  # imgs loaded from cityscapes_data_path (real images)

Impact: The "improvement" attributed to self-taught learning cannot be isolated from the encoder's domain-transfer effect. Results are not directly comparable to any standard Cityscapes benchmark, which evaluates on original images without this preprocessing step.

Fix: Option A — retrain encoder on real Cityscapes images; Option B — document the domain transfer assumption explicitly in the README with a disclaimer. Proposed choice: Option B initially, with Option A as a future enhancement (see Section 4.2).


BUG-5 — Silent distribution mismatch: encoder not applied in validation (P0)

Location: deeplabv3/train.py, line 123 (missing line)

Cause: Training optimizes L(f(E(x)), y); validation evaluates L(f(x), y). No crash — silently invalid metrics throughout.

Evidence (before fix):

TRAIN input shape: torch.Size([3, 3, 512, 256])
VAL input shape:   torch.Size([3, 3, 1024, 512])

The encoder's stride-2 downsampling halves spatial resolution. Training and validation operate on inputs with different statistics and spatial dimensions, making all reported validation metrics invalid.

After fix:

TRAIN input shape: torch.Size([3, 3, 512, 256])
VAL input shape:   torch.Size([3, 3, 512, 256])

Fix (one line): imgs = encoder(imgs) added at deeplabv3/train.py:123.


BUG-6 — FileNotFoundError: Hardcoded absolute ResNet-18 path (P2)

Location: deeplabv3/model/resnet.py, line 179

Cause: Backbone loads from a hardcoded absolute path on the original developer's machine. Fourth consecutive runtime crash found in forward execution order — only reached after BUG-1/2/3 are fixed. (BUG-4 and BUG-5 are silent failures discovered through code analysis, not crashes.)

Log:

Traceback (most recent call last):
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/train.py", line 175, in <module>
    train_deepblabv3()
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/train.py", line 79, in train_deepblabv3
    network = DeepLabV3(model_id, project_dir=os.getcwd()).to(device)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/model/deeplabv3.py", line 20, in __init__
    self.resnet = ResNet18_OS8() # NOTE! specify the type of ResNet here
                  ^^^^^^^^^^^^^^
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/model/resnet.py", line 230, in ResNet18_OS8
    return ResNet_BasicBlock_OS8(num_layers=18)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/model/resnet.py", line 179, in __init__
    resnet.load_state_dict(torch.load("/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth"))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 1500, in load
    with _open_file_like(f, "rb") as opened_file:
         ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/ianvs_venv/lib/python3.12/site-packages/torch/serialization.py", line 749, in __init__
    super().__init__(open(name, mode))  # noqa: SIM115
                     ^^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: '/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth'

Fix:

def _maybe_load_weights(model, default_path, env_key):
    candidate = os.environ.get(env_key) or default_path
    if candidate and os.path.exists(candidate):
        model.load_state_dict(torch.load(candidate, map_location="cpu"), strict=False)
        print(f"Loaded pretrained weights from {candidate}")
    else:
        print(f"Warning: {env_key} not found ('{candidate}'); using random initialization")

_maybe_load_weights(resnet,
    "/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth",
    "RESNET18_WEIGHTS")

Pipeline verification (mock run — all 6 bugs fixed):

Verified on 10 synthetic Cityscapes-format images, dummy GAN/STL checkpoints, 3 epochs, random ResNet backbone initialization.

Note on mock-only limitations:

  • class_weights.pkl length mismatch with num_classes → fell back to unweighted CrossEntropyLoss (mock-only; production data has correct length).
  • DeepLab head logit size [3, 20, 256, 512] vs synthetic label size [3, 512, 1024] → size mismatch before final loss computation. This is a mock artifact: synthetic labels were generated at full resolution while the DeepLab OS-8 head outputs at 1/8 resolution. All original bugs (BUG-1 through BUG-6) fire earlier in the pipeline and are fully evidenced above.

Log:
INFO: epoch: 1/3
Warning: pretrained weights not found for RESNET18_WEIGHTS (looked for '/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth'); using random initialization
pretrained resnet, 18
Warning (mock run): using unweighted CrossEntropyLoss with class_weights shape=(19,)
/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/model/aspp.py:46: UserWarning: `nn.functional.upsample` is deprecated. Use `nn.functional.interpolate` instead.
  out_img = F.upsample(out_img, size=(feature_map_h, feature_map_w), mode="bilinear")
/mnt/f/KUBE_EDGE_T1_FINAL/ianvs/examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/deeplabv3/model/deeplabv3.py:33: UserWarning: `nn.functional.upsample` is deprecated. Use `nn.functional.interpolate` instead.
  output = F.upsample(output, size=(h, w), mode="bilinear")
Traceback (most recent call last):
  File ".../deeplabv3/train.py", line 181, in <module>
    train_deepblabv3()
  File ".../deeplabv3/train.py", line 129, in train_deepblabv3
    loss = loss_fn(outputs, label_imgs)
RuntimeError: size mismatch (got input: [3, 20, 256, 512] , target: [3, 512, 1024]

1.4 Impact Assessment

This matters critically for several reasons:

  1. Invalid Validation Metrics: Researchers using this benchmark get completely invalid validation metrics. Any model selection, early stopping, or performance assessment based on validation loss is scientifically meaningless.

  2. Incomparable Results: Results cannot be compared to Cityscapes standard benchmarks (mIoU papers) because the model is trained on encoded distributions while standard evaluation uses original images. This violates the fundamental principle of fair comparison in machine learning research.

  3. Unverifiable Claims: The "Self-Taught Learning" improvement cannot be isolated or verified because:

    • The encoder preprocessing effect is confounded with the learning algorithm
    • Domain transfer from fake to real images is unvalidated
    • No ablation study exists comparing with/without encoder
  4. Silent Failure: BUG-5 has a one-line fix (imgs = encoder(imgs)), yet the bug has been silently invalidating the entire evaluation since the example was created. Based on git history, the example code was last modified on 2024-01-04 (commit 78a2437), meaning this bug has existed for over a year, potentially affecting multiple researchers.

  5. Reproducibility Crisis: The hardcoded paths (BUG-2, BUG-3, BUG-6) prevent the example from running at all, making it impossible for researchers to reproduce or verify any claims.

  6. Quantitative reach (as of 2026-02-19): The kubeedge/ianvs repository has ~150 stars and ~83 forks, indicating an active user base. This example is the only GAN+STL lifelong-learning reference implementation in the repo, so any researcher attempting to use it as a baseline or reference is affected by the scientific integrity and reproducibility issues above.

1.5 Uniqueness

This is the only issue in the kubeedge/ianvs repository that targets Cityscapes-Synthia-4 (the unseen_task_processing-GANwithSelfTaughtLearning example). A full search of the issue tracker confirms that no existing issue, PR, or community guide references this example's path, its deeplabv3/train.py, selftaughtlearning/, or GAN-based preprocessing pipeline. Every other pre-task candidate has focused on a different example entirely.

The three closest existing issues each address a different target.

Beyond targeting an untouched example, this investigation is also distinct in the type of findings it reports. Other candidates address runtime crashes and dependency issues. This proposal additionally surfaces two silent scientific integrity failures (BUG-4, BUG-5) that produce no crash and no error message — bugs that a researcher could run past for months without knowing their results were invalid. A benchmark that runs and produces wrong numbers is more dangerous than one that crashes.


2. Goals

  1. Fix BUG-5: Apply encoder consistently in validation — Implement the one-line fix to apply encoder preprocessing in the validation loop, ensuring training and validation use the same data distribution. Success metric: the validation loop contains the same preprocessing step(s) as training (encoder call parity), and train/val loss curves become meaningfully comparable (e.g., Pearson correlation of loss curves increases to > 0.85 on a controlled run where all else is held constant).

  2. Fix BUG-4: Document or redesign encoder training pipeline — Either train the encoder on real Cityscapes images instead of GAN fake images, or clearly document the domain transfer assumption with a disclaimer that results are not directly comparable to standard Cityscapes benchmarks. Success metric: documentation (or code) makes the train/eval data distributions explicit, and the repository provides a clear "comparison mode" statement (standard Cityscapes vs encoded Cityscapes) so results cannot be misinterpreted as directly comparable.

  3. Fix BUG-1: Add startup config validation — Add explicit validation at deeplabv3/train.py startup that checks all required config.yaml keys (cityscapes_data_path, cityscapes_meta_path, class_weights) and raises a clear, actionable error immediately if any are empty. Success metric: running the example with an incomplete config produces a single named error identifying the missing key, rather than a cryptic downstream crash with no indication of the root cause.

  4. Fix BUG-2, BUG-3, and BUG-6: Replace all hardcoded paths with configurable alternatives — Update deeplabv3/train.py:46, GAN/generate_fake_imgs.py:23,42, and deeplabv3/model/resnet.py:179 to use configuration values or environment variables instead of hardcoded paths. All three bugs share the same root cause — developer machine paths left in production code — and the same fix pattern. Success metric: each stage fails fast with a single actionable error if configuration is incomplete; no hardcoded 'test1', 'fake_imgs1', 'encoder_models4', or /home/nailtu/... paths remain in any entrypoint.

  5. Add integration tests verifying train/val preprocessing parity — Create GitHub Actions workflow that verifies training and validation loops apply identical preprocessing transformations. Success metric: CI fails on the pre-fix code and passes on the fixed code, preventing future regressions of BUG-5-class issues.
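The loss-curve correlation check in Goal 1's success metric can be computed as sketched below; the curves here are synthetic stand-ins, not real training output:

```python
import numpy as np

rng = np.random.default_rng(2)
epochs = np.arange(50)

# Toy loss curves, purely illustrative: after the fix, validation should track
# training; before the fix, validation is erratic noise unrelated to training.
train_loss = np.exp(-epochs / 20) + rng.normal(scale=0.01, size=50)
val_fixed = np.exp(-epochs / 20) + 0.05 + rng.normal(scale=0.02, size=50)
val_broken = rng.normal(loc=1.0, scale=0.3, size=50)

def pearson(a, b):
    # Pearson correlation coefficient of two loss curves.
    return float(np.corrcoef(a, b)[0, 1])

r_fixed = pearson(train_loss, val_fixed)    # near 1.0: clears the > 0.85 bar
r_broken = pearson(train_loss, val_broken)  # near 0: pre-fix behavior
print(round(r_fixed, 2), round(r_broken, 2))
```

The same function applied to real recorded loss curves would provide the before/after evidence for the BUG-5 PR.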


3. Scope

3.1 Target Users

This example is primarily used by researchers and engineers who want a working reference implementation of GAN + Self-Taught Learning (STL) for unseen task processing in lifelong learning semantic segmentation. These users are uniquely impacted because the repo does not provide an alternative GAN+STL lifelong-learning baseline to compare against, so they are likely to rely on this exact pipeline for (a) validation-loss-based model selection and reporting, (b) ablation claims about STL benefits, and (c) reproducing the example as a starting point for new edge AI benchmarks. When validation is computed on a different input distribution (BUG-5) and the encoder is trained on synthetic images but applied to real ones (BUG-4), downstream conclusions (early stopping, "improvement" attribution, and comparability to Cityscapes benchmarks) are systematically compromised even for careful users.

3.2 Uniqueness vs Existing Issues

This investigation is distinct from other existing issues because this example has gone unaddressed for a long time. It covers the example's problems in detail, centered on the following bugs:

  • BUG-5: Train/validation distribution mismatch unique to this example's validation loop implementation
  • BUG-4: Encoder domain transfer from GAN fake images to real images, specific to this example's architecture
  • BUG-2/BUG-3/BUG-6: Hardcoded paths in GAN, STL, and ResNet modules that are example-specific

These are not general infrastructure issues but fundamental flaws in this specific example's implementation that compromise scientific validity.

3.3 In Scope / Out of Scope

In Scope:

  • deeplabv3/train.py — Fix validation loop encoder application (BUG-5), fix encoder path (BUG-2), add config validation (BUG-1)
  • deeplabv3/model/resnet.py — Replace hardcoded pretrained ResNet-18 path with configurable loading and random-init fallback (BUG-6)
  • selftaughtlearning/train.py — Document encoder training data source (part of BUG-4 resolution)
  • GAN/generate_fake_imgs.py — Fix hardcoded paths (BUG-3)
  • config.yaml — Add validation and documentation for required paths
  • README.md — Update with data flow diagram and domain transfer disclaimers
  • GitHub Actions CI — Add test for train/val preprocessing parity

Out of Scope:

  • Improving GAN architecture or training stability
  • Hosting or distributing Cityscapes dataset
  • Improving model accuracy beyond fixing scientific validity issues
  • Modifying core Ianvs framework (all bugs are contained within example directory)

4. Detailed Design

4.1 Architecture

The fixes are organized into two layers:

Example Layer (No Core Changes Required):

  • BUG-2, BUG-3, BUG-5, BUG-6 fixes are contained within the example directory
  • No modifications to core/, testcasecontroller/, or other Ianvs framework code
  • All changes are in examples/cityscapes/lifelong_learning_bench/unseen_task_processing-GANwithSelfTaughtLearning/

Architectural Reconsideration:

  • BUG-4 requires a decision about encoder training data source
  • This is an example-level architectural choice, not a core framework change
  • Two options: (A) Train encoder on real images, (B) Document domain transfer assumption

Justification: All bugs are self-contained within the example directory. No core Ianvs changes are required, making this a low-risk fix that doesn't affect other examples or the framework itself.

4.2 Module Details

Fix for BUG-5 (deeplabv3/train.py:123):

Before:

for step, (imgs, label_imgs, img_ids) in enumerate(val_loader):
    with torch.no_grad():
        imgs = Variable(imgs).cuda()
        label_imgs = Variable(label_imgs.type(torch.LongTensor)).cuda()
        outputs = network(imgs)

After:

for step, (imgs, label_imgs, img_ids) in enumerate(val_loader):
    with torch.no_grad():
        imgs = Variable(imgs).cuda()
        # encoder images - match training distribution
        imgs = encoder(imgs)
        label_imgs = Variable(label_imgs.type(torch.LongTensor)).cuda()
        outputs = network(imgs)

Fix for BUG-2 (deeplabv3/train.py:45-46):

Before:

encoder = Encoder().cuda()
encoder.load_state_dict(torch.load(
    '../self-taught-learning/train_results/encoder_models4/encoder50.pth'))

After:

stl_name = configs['STL'][3]['name']
stl_epochs = configs['STL'][0]['iter']
encoder = Encoder().cuda()
encoder_path = f'../selftaughtlearning/train_results/{stl_name}/encoder{stl_epochs}.pth'
encoder.load_state_dict(torch.load(encoder_path))
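The STL[3]['name'] / STL[0]['iter'] indexing assumes config.yaml stores each section as a list of single-key mappings; a sketch of the assumed layout follows (the intermediate entries are hypothetical, only the 'iter' and 'name' positions are taken from the fix above):

```python
# Hypothetical config layout implied by the STL[3]['name'] / STL[0]['iter']
# indexing. The 'lr' and 'batch_size' entries are invented placeholders.
configs = {
    "STL": [{"iter": 100}, {"lr": 0.001}, {"batch_size": 16},
            {"name": "encoder_models"}],
}

stl_name = configs["STL"][3]["name"]
stl_epochs = configs["STL"][0]["iter"]
encoder_path = f"../selftaughtlearning/train_results/{stl_name}/encoder{stl_epochs}.pth"
print(encoder_path)  # ../selftaughtlearning/train_results/encoder_models/encoder100.pth
```

With these values the constructed path matches the checkpoint (encoder100.pth under encoder_models/) that the config actually produces, resolving all three mismatches at once.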

Fix for BUG-3 (GAN/generate_fake_imgs.py:23, 42):

Before (line 23):

weights = torch.load(os.getcwd() + '/train_results/test1/models/50000.pth')

After (line 23):

configs = load_yaml('../config.yaml')
gan_name = configs['GAN'][3]['name']
gan_iter = configs['GAN'][0]['iter']
weights = torch.load(f'train_results/{gan_name}/models/{gan_iter}.pth')

Before (line 42):

io.imsave('../data/fake_imgs1/' + str(index) + '.png', fake_image)

After (line 42):

io.imsave('../data/fake_imgs/' + str(index) + '.png', fake_image)

Fix for BUG-6 (deeplabv3/model/resnet.py:179):

Before:

resnet.load_state_dict(torch.load(
    "/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth"))

After:

def _maybe_load_weights(model, default_path, env_key):
    candidate = os.environ.get(env_key) or default_path
    if candidate and os.path.exists(candidate):
        model.load_state_dict(torch.load(candidate, map_location="cpu"), strict=False)
        print(f"Loaded pretrained weights from {candidate}")
    else:
        print(f"Warning: {env_key} not found ('{candidate}'); using random initialization")

_maybe_load_weights(resnet,
    "/home/nailtu/PycharmProjects/deeplabv3-master/pretrained_models/resnet/resnet18-5c106cde.pth",
    "RESNET18_WEIGHTS")

BUG-4 Architectural Decision:

Two options for resolving the encoder domain transfer issue:

Option A: Train encoder on real images

  • Modify selftaughtlearning/train.py to train encoder on real Cityscapes images instead of GAN fake images
  • Pros: Eliminates domain transfer, makes results comparable to standard benchmarks
  • Cons: Requires real image dataset, changes the "self-taught learning from unlabeled data" claim

Option B: Document domain transfer assumption

  • Keep current architecture but add clear documentation
  • Add clarification in README: "Results represent a distinct evaluation protocol using encoded image representations. For direct comparison to standard Cityscapes benchmarks, encoder preprocessing must be applied consistently at evaluation time."
  • Pros: Preserves original design intent, minimal code changes
  • Cons: Results remain incomparable without re-evaluation

Proposed Choice: Option B (documentation) for initial fix, with Option A as future enhancement. Rationale: Option B provides immediate scientific transparency with minimal risk, while Option A requires architectural changes that should be validated with ablation studies.


5. Road Map

Phase                 Weeks   Key Deliverables
1 — Fix & Verify      1–4     BUG-5/2/3/6 fixes applied, PRs opened, loss curves compared
2 — BUG-4 Resolution  5–8     Encoder decision + ablation study complete
3 — CI & Docs         9–12    GitHub Actions CI, TROUBLESHOOTING.md, README updated

Phase 1: Fix & Verify (Weeks 1-4)

Week 1: Apply fixes for BUG-5, BUG-2, BUG-3, BUG-6

  • Day 1-2: Implement encoder application in validation loop (BUG-5)
  • Day 3-4: Replace hardcoded encoder path with config-driven path (BUG-2)
  • Day 5: Replace hardcoded GAN paths with config-driven paths (BUG-3)
  • Day 6: Replace hardcoded ResNet-18 path with configurable loading (BUG-6)
  • Day 7: Add configuration validation for empty paths (BUG-1)

Week 2: Test fixes locally

  • Day 1-3: Set up Cityscapes dataset (or mock dataset for testing)
  • Day 4-5: Run GAN training pipeline (GAN/train.py)
  • Day 6-7: Run STL training pipeline (selftaughtlearning/train.py)

Week 3: Verify validation fix

  • Day 1-3: Run DeepLabV3 training with fixes applied
  • Day 4-5: Compare training vs validation loss curves — they should now track together (previously validation was erratic due to distribution mismatch)
  • Day 6-7: Document verification results, capture loss curves as evidence
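The "track together" criterion can be made concrete with a small pass/fail helper rather than an eyeball check. The threshold below is an illustrative choice, not a principled statistic:

```python
def losses_track_together(train_losses, val_losses, max_gap_ratio=0.5):
    """Check that validation loss stays within a bounded gap of training loss.

    Before the BUG-5 fix, validation loss diverged from training loss
    because of the train/val preprocessing mismatch; after the fix the
    two curves should stay within a modest relative gap. The ratio
    used here is an assumed placeholder to be tuned on real runs.
    """
    for t, v in zip(train_losses, val_losses):
        if abs(v - t) > max_gap_ratio * max(t, 1e-8):
            return False
    return True
```

A helper like this also gives Phase 3 a ready-made assertion to run in CI over logged loss histories.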

Week 4: Prepare PRs

  • Day 1-3: Create PR for BUG-5 fix with before/after loss curve comparison
  • Day 4-5: Create PRs for BUG-2/BUG-3/BUG-6 fixes with execution verification
  • Day 6-7: Code review preparation, address any feedback

Phase 2: BUG-4 Resolution (Weeks 5-8)

Week 5: Architectural decision

  • Day 1-2: Research domain transfer literature, evaluate Option A vs Option B
  • Day 3-4: Consult with maintainers on architectural preference
  • Day 5-7: Document decision rationale

Week 6-7: Implement chosen approach

  • If Option A: Modify selftaughtlearning/train.py to train on real images, update data pipeline
  • If Option B: Add comprehensive documentation to README, create data flow diagram, add disclaimers
  • Conduct ablation study: Train DeepLabV3 with encoder vs without encoder, compare results

Week 8: Validate and document

  • Day 1-3: Run full pipeline with chosen approach, verify results
  • Day 4-5: Document ablation study results, create comparison tables
  • Day 6-7: Update README with findings and recommendations

Phase 3: CI & Documentation (Weeks 9-12)

Week 9-10: GitHub Actions CI

  • Day 1-3: Design test that verifies train/val preprocessing parity
  • Day 4-7: Implement CI workflow that:
    • Extracts preprocessing steps from training loop
    • Extracts preprocessing steps from validation loop
    • Compares them programmatically
    • Fails if mismatch detected
  • Day 8-10: Test CI on example, verify it catches BUG-5-type issues
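The parity check can start as a deliberately simple static scan of the two loop bodies; a sketch (the watched call names are placeholders for the example's actual transforms, and a real workflow might parse the AST instead of raw text):

```python
def preprocessing_steps(loop_source):
    """Extract the ordered preprocessing calls from a loop's source text.

    A minimal static check for CI: scan for a fixed set of transform
    calls and record their order of appearance, ignoring comments.
    """
    watched = ("normalize(", "resize(", "encoder(")
    steps = []
    for line in loop_source.splitlines():
        code = line.split("#", 1)[0]
        for call in watched:
            if call in code:
                steps.append(call.rstrip("("))
    return steps

def assert_train_val_parity(train_src, val_src):
    """Fail CI when train and validation preprocessing diverge (BUG-5)."""
    t, v = preprocessing_steps(train_src), preprocessing_steps(val_src)
    assert t == v, f"train/val preprocessing mismatch: {t} != {v}"
```

Run against the original example's loops, a check like this would have flagged the missing encoder call on the first CI run.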

Week 11: Documentation updates

  • Day 1-3: Create data flow diagram showing GAN → STL → DeepLabV3 pipeline
  • Day 4-5: Write TROUBLESHOOTING.md with common issues and solutions
  • Day 6-7: Update README with:
    • Clear explanation of data distributions
    • Domain transfer assumptions (if Option B chosen)
    • Comparison guidelines for researchers

Week 12: Final validation and submission

  • Day 1-3: Run full end-to-end test with all fixes applied
  • Day 4-5: Review all documentation for clarity and completeness
  • Day 6-7: Final PR submission, respond to review feedback

Demo

The investigation was conducted on a live clone of the repository at commit 1d3da5b in a WSL2 Ubuntu 24.04.1 environment. Each bug was triggered deliberately in forward execution order and its traceback captured before the fix was applied — the logs in Section 1.3 are direct copy-pastes from terminal output, not reconstructions.

For pipeline verification, a 10-image synthetic Cityscapes-format dataset was created with the correct directory structure (leftImg8bit/, gtFine/), dummy GAN and STL encoder checkpoints were generated from the actual model architectures, and config.yaml was pointed at the mock paths. With all six bugs fixed, the mock run reached epoch: 1/3, successfully loaded the encoder, initialized the DeepLabV3 network with the _maybe_load_weights fallback, and entered the training loop — confirmed by the live terminal output captured in Section 1.3. The BUG-5 fix was verified independently by instrumenting both loops with shape-print statements, producing the before/after tensor size evidence shown above.

The mock run stops at a label/logit size mismatch that is a synthetic data artifact (full-resolution labels vs OS-8 head output). This does not affect any of the original bug findings — all six bugs fire before the training loop body and are independently evidenced by their own tracebacks or shape diagnostics.


Conclusion

This investigation uncovered six bugs in the unseen_task_processing-GANwithSelfTaughtLearning example, two of which are silent P0 scientific integrity failures that have existed since the example was created in January 2024. The most critical — BUG-5, the missing encoder(imgs) call in the validation loop — is provable with a single tensor shape comparison and fixable in one line of code, yet it has been silently invalidating all reported validation metrics for over a year. BUG-4 compounds this by making the encoder itself a domain transfer artifact, rendering results incomparable to any standard Cityscapes benchmark. The four additional P2/P3 bugs (BUG-1/2/3/6) mean the pipeline cannot be executed at all on any machine other than the original developer's workstation, so no researcher has been able to discover or reproduce these issues independently.

The proposed fixes are entirely self-contained within the example directory, require no changes to the core Ianvs framework, and range from one-line additions to config-driven path substitutions. The 12-week roadmap delivers scientific correctness first (Phases 1–2), then CI infrastructure to prevent regressions (Phase 3). Taken together, these changes would restore this example to a state where its results are reproducible, its validation metrics are scientifically valid, and its claims about self-taught learning improvement can be fairly evaluated.


Thank you for taking the time to read this.

By Ansuman Patra
