Summary
The [[git_repos]] mechanism with mount_as creates a fragile setup where scripts from one version can call core libraries from another version, leading to hard-to-debug errors.
Problem
When using [[git_repos]] with mount_as, a partial override occurs:
- Container has built-in package (e.g., Megatron-Bridge v0.4.0rc0 at /opt/Megatron-Bridge)
- External git clone (e.g., v0.3.1) provides entry scripts via PYTHONPATH
- Scripts from v0.3.1 import core modules from container's v0.4.0rc0
This causes:
- ModuleNotFoundError - Different module structure between versions
- API mismatches - Functions/parameters differ between versions
- Silent failures - No validation that git repo version is compatible with container
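One way to confirm the mix is to ask Python where each module would actually be loaded from. The following is a minimal diagnostic sketch, not part of CloudAI: the helper names (resolve_origin, is_under, classify) are hypothetical. The container root /opt/Megatron-Bridge and the module name megatron.core are taken from this report.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def resolve_origin(module_name: str) -> Optional[Path]:
    """Return the file a module would be imported from, or None if it cannot be found."""
    try:
        # Note: for dotted names, find_spec imports the parent package first.
        spec = importlib.util.find_spec(module_name)
    except (ImportError, ValueError):
        return None
    # Built-in, frozen, and namespace packages have no single file location.
    if spec is None or not spec.has_location or spec.origin is None:
        return None
    return Path(spec.origin)


def is_under(path: Path, root: str) -> bool:
    """True if path lies inside the directory tree rooted at root."""
    try:
        path.relative_to(Path(root))
        return True
    except ValueError:
        return False


def classify(module_name: str, container_root: str) -> str:
    """Label a module as 'container', 'external', or 'missing'."""
    origin = resolve_origin(module_name)
    if origin is None:
        return "missing"
    return "container" if is_under(origin, container_root) else "external"


if __name__ == "__main__":
    # Container path and module name come from this report.
    print("megatron.core ->", classify("megatron.core", "/opt/Megatron-Bridge"))
```

Running this inside the job environment for the entry script's imports should show whether some modules resolve to the external clone while others resolve to the container copy, which is the partial override described above.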
Observed Errors
ModuleNotFoundError: No module named 'megatron.core'
ValueError: Currently there is no support for Pipeline parallelism with CPU offloading
Root Causes
- Partial mounting - mount_as overwrites some paths but not others
- Two sources of truth - [[git_repos]] commit vs container's built-in version
- Implicit dependencies - No enforcement that versions match
Proposed Solutions
- Version validation - Validate git repo commit is compatible with container
- Full override or none - mount_as must override entire package or nothing
- Container-only mode - Warn if [[git_repos]] targets a package already in container
- Deprecate partial mounts - Remove support for mounting over container paths
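The version validation idea could look roughly like the sketch below. It rests on several assumptions: the function names are hypothetical, no such hook is known to exist in CloudAI, and "compatible" is defined here simply as "same major.minor", which may be too strict or too loose for Megatron-Bridge. How the two version strings are obtained (git tag, package metadata) is left out.

```python
import re
from typing import Optional, Tuple


def major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v0.4.0rc0' or '0.3.1'."""
    match = re.match(r"v?(\d+)\.(\d+)", version.strip())
    if match is None:
        raise ValueError(f"Cannot parse version: {version!r}")
    return int(match.group(1)), int(match.group(2))


def versions_compatible(repo_version: str, container_version: str) -> bool:
    """Assumed policy: versions are compatible when major.minor match."""
    return major_minor(repo_version) == major_minor(container_version)


def check_override(package: str, repo_version: str, container_version: str) -> Optional[str]:
    """Return a warning message if the git repo and container versions differ, else None."""
    if versions_compatible(repo_version, container_version):
        return None
    return (
        f"{package}: git repo version {repo_version} does not match "
        f"container version {container_version}; scripts and core modules may be mixed"
    )


if __name__ == "__main__":
    # Versions taken from the Environment section of this report.
    warning = check_override("Megatron-Bridge", "v0.3.1", "v0.4.0rc0")
    if warning:
        print("WARNING:", warning)
```

Surfacing this warning before the job starts would turn the silent failure into an explicit one. Whether it should be a warning or a hard error is a design choice for the maintainers.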
Environment
- CloudAI version: v1.6.beta6
- Container: nvcr.io/nvidian/nemo:26.04.rc2 (Megatron-Bridge v0.4.0rc0)
- External repo: Megatron-Bridge v0.3.1