Problem
The .git directory weighs 177 MB despite roughly 72 MB of working tree content, making fresh clones significantly slower than necessary. A single 164 MB pack file containing 12,525 objects accounts for nearly all of this overhead, and git count-objects -vH reports 5,599 non-delta objects, indicating a substantial proportion of binary or incompressible content.
Investigation methodology
The analysis combined several approaches to identify where the bloat originates:
- Measured the .git directory size with du -sh .git and inspected ls -lh .git/objects/pack/ to quantify the overhead at 177 MB total, dominated by a single 164 MB pack file.
- Ran git count-objects -vH for git's own size accounting, confirming 5,599 non-delta objects, a high count that points to many binary files.
- Enumerated the largest blobs across all history using git rev-list --objects --all | git cat-file --batch-check to surface objects that persist in the pack regardless of whether they still exist in the current working tree.
- Scanned the working tree for large files by extension (GIF, ZIP, PNG, tar.bz2, CSV, JSON) using find with size sorting to identify current binary bloat.
- Checked git log --diff-filter=D --summary to identify large files that were committed and later deleted but still consume space in the object database.
- Verified .gitattributes contents and checked git lfs ls-files. Git LFS is not configured.
- Measured per-directory sizes with du -sh */ to understand which areas of the repo contribute most.
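The blob enumeration step can be sketched as below. This is a minimal illustration, not the exact commands used in the investigation: the function names rank_blobs and largest_blobs are hypothetical, and the --batch-check format string is an assumption chosen so that each line carries the object type, size, and path.

```shell
#!/bin/sh
# Sketch only: rank_blobs and largest_blobs are hypothetical helper names.

# Read lines of the form "<type> <size> <path>" on stdin and print the
# N largest blobs as "<size><TAB><path>", largest first (N defaults to 10).
rank_blobs() {
    limit="${1:-10}"
    awk '$1 == "blob" {
        size = $2
        sub(/^[^ ]+ [^ ]+ /, "")   # drop type and size, keep the full path
        print size "\t" $0
    }' | sort -k1,1nr | head -n "$limit"
}

# Feed every object reachable from any ref through rank_blobs.
# %(rest) carries the path that rev-list prints after each object id.
largest_blobs() {
    git rev-list --objects --all |
        git cat-file --batch-check='%(objecttype) %(objectsize) %(rest)' |
        rank_blobs "$@"
}
```

Running largest_blobs 20 inside the repository would list the twenty largest blobs in history, including files that no longer exist in the working tree.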
Root cause analysis
The following table lists the sources of bloat in descending order of impact:
| Source | Size | Notes |
| --- | --- | --- |
| Deleted miniconda.sh at 3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh | 95.8 MB | Added in commit d0227b2, deleted in commit 32b3313 (Oct 2023). Still stored in the git object database and downloaded by every clone. This single file accounts for ~54% of the .git directory size. |
| Demo GIF files (automate-smhp-demo.gif, automate-smhp-eks-demo.gif) | ~30 MB | Two animated GIFs used for documentation. |
| Lambda function ZIP (grafana-service-token-lambda-function.zip) | 15 MB | Binary archive checked into the tree. |
| PNG screenshots across 0.docs/ and other directories | ~10 MB | Dozens of image files spread across multiple directories. |
| Other binaries (hwloc-2.9.2-h2bc3f7f_0.tar.bz2, slurm-esm1nv-train-102.out, etc.) | ~7 MB | Miscellaneous archives and log files. |
Proposed next steps
The most impactful remediation is purging the deleted miniconda.sh from history, which alone would reclaim roughly 96 MB. Beyond that, migrating current large binaries to Git LFS and adding guardrails would prevent recurrence.
Specifically:
- Purge miniconda.sh from history using git filter-repo --path 3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh --invert-paths.
- Migrate large binaries (GIFs, ZIPs, archives) to Git LFS via git lfs migrate import.
- Add .gitattributes rules to enforce LFS tracking for *.gif, *.zip, *.tar.bz2, *.png, and similar extensions.
- The expected result is roughly a 60% reduction in clone size (from 177 MB down to approximately 60–70 MB).
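The steps above could be scripted roughly as follows. This is a sketch under stated assumptions rather than a tested migration: the function names rewrite_history and lfs_attributes are hypothetical, the --everything and --include options passed to git lfs migrate import are one plausible choice, and the pattern list simply mirrors the extensions named above. git lfs migrate import normally writes .gitattributes entries itself, so lfs_attributes is only there to show the rule format a reviewer should expect to see.

```shell
#!/bin/sh
# Sketch only: nothing runs until one of these functions is called.

# Path of the deleted installer, taken from the root cause table.
PURGE_PATH='3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh'

# Print one .gitattributes rule per pattern, in the form that
# `git lfs track` writes.
lfs_attributes() {
    for pattern in "$@"; do
        printf '%s filter=lfs diff=lfs merge=lfs -text\n' "$pattern"
    done
}

# Rewrite history. Intended for a fresh clone: git filter-repo refuses to
# run elsewhere unless forced, and it removes the origin remote afterwards.
rewrite_history() {
    git filter-repo --path "$PURGE_PATH" --invert-paths &&
        git lfs migrate import --everything \
            --include='*.gif,*.zip,*.tar.bz2,*.png'   # assumed pattern list
}
```

For example, lfs_attributes '*.gif' '*.zip' prints the two tracking rules for those extensions, which can be compared against the .gitattributes file produced by the migration.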
Rollout considerations
Because history rewriting is a breaking change for existing clones and forks, the following precautions are warranted:
- Announce to contributors before the history rewrite so they can prepare.
- Combine the filter-repo purge and LFS migration into a single coordinated force push to minimize disruption.
- Update contributing guidelines to mention git lfs install as a prerequisite for development.
- Provide re-sync instructions for existing forks (e.g., fresh clone or git fetch --refetch).
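As an illustration of what the re-sync instructions and a post-rewrite check might look like, here is a hedged sketch. The function names path_in_history and resync_clone are hypothetical, the default branch name main is an assumption, and the hard reset shown is one common alternative to a fresh clone that discards local commits on the branch.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and nothing runs until called.

# Read `git rev-list --objects --all` output ("<id> <path>") on stdin and
# succeed only if the given path is still reachable from some ref.
path_in_history() {
    awk -v target="$1" '
        { sub(/^[^ ]+ ?/, "") }        # strip the object id
        $0 == target { found = 1 }
        END { exit !found }
    '
}

# Re-sync an existing clone to the rewritten branch.
# WARNING: discards local commits and uncommitted changes on that branch.
resync_clone() {
    branch="${1:-main}"   # assumed default branch name
    git fetch origin &&
        git reset --hard "origin/$branch"
}
```

After the force push, a maintainer could run git rev-list --objects --all | path_in_history '3.test_cases/9.nemo-multimodal/nemo_configs/miniconda.sh' in a fresh clone; a non-zero exit status would suggest the file is no longer reachable.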