Prerequisite
Environment
OrderedDict([('sys.platform', 'linux'), ('Python', '3.9.23 (main, Jun 5 2025, 13:40:20) [GCC 11.2.0]'), ('CUDA available', True), ('MUSA available', False), ('numpy_random_seed', 2147483648), ('GPU 0', 'NVIDIA GeForce RTX 3090'), ('CUDA_HOME', '/usr/local/cuda'), ('NVCC', 'Cuda compilation tools, release 12.1, V12.1.105'), ('GCC', 'gcc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0'), ('PyTorch', '2.1.2+cu121'), ('PyTorch compiling details', 'PyTorch built with:\n - GCC 9.3\n - C++ Version: 201703\n - Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications\n - Intel(R) MKL-DNN v3.1.1 (Git Hash 64f6bcbcbab628e96f33a62c3e975f8535a7bde4)\n - OpenMP 201511 (a.k.a. OpenMP 4.5)\n - LAPACK is enabled (usually provided by MKL)\n - NNPACK is enabled\n - CPU capability usage: AVX2\n - CUDA Runtime 12.1\n - NVCC architecture flags: -gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90\n - CuDNN 8.9.2\n - Magma 2.6.1\n - Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=12.1, CUDNN_VERSION=8.9.2, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=old-style-cast -Wno-invalid-partial-specialization -Wno-unused-private-field 
-Wno-aligned-allocation-unavailable -Wno-missing-braces -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.1.2, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF, \n'), ('TorchVision', '0.16.2+cu121'), ('OpenCV', '4.9.0'), ('MMEngine', '0.10.7'), ('MMPose', '1.3.2+')])
mmcv 2.1.0
mmdet 3.3.0
mmengine 0.10.7
mmpose 1.3.2 /root/autodl-tmp/mmpose-main
mmpretrain 1.2.0
Reproduces the problem - code sample
I trained ViTPose-S on a single RTX 3090 with the official config, without modifying any code: https://github.com/open-mmlab/mmpose/blob/main/configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_ViTPose-small_8xb64-210e_coco-256x192.py
Reproduces the problem - command or script
python tools/train.py configs/body_2d_keypoint/topdown_heatmap/coco/td-hm_ViTPose-small_8xb64-210e_coco-256x192.py
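For context, the `8xb64` in the config name indicates 8 GPUs with 64 samples each (total batch size 512), whereas this command runs on one GPU. Below is a minimal sketch of the linear learning-rate scaling rule for that batch-size change. The base learning rate of 5e-4 is an assumption used for illustration (check `optim_wrapper` in the config), and `linear_scaled_lr` is a hypothetical helper, not an MMPose function. Whether the training script applies such scaling automatically (for example through an `--auto-scale-lr` option) is also something to verify rather than take from this sketch.

```python
def linear_scaled_lr(base_lr: float, base_batch_size: int,
                     gpus: int, samples_per_gpu: int) -> float:
    """Scale base_lr by the ratio of the actual to the reference batch size."""
    actual_batch_size = gpus * samples_per_gpu
    return base_lr * actual_batch_size / base_batch_size


# Assumed value for illustration only; read the real one from the config.
BASE_LR = 5e-4
# Reference batch size implied by the "8xb64" config name: 8 GPUs x 64 samples.
BASE_BATCH_SIZE = 8 * 64

# Single-GPU run with the unmodified per-GPU batch size of 64.
single_gpu_lr = linear_scaled_lr(BASE_LR, BASE_BATCH_SIZE,
                                 gpus=1, samples_per_gpu=64)
print(f"reference batch size: {BASE_BATCH_SIZE}")
print(f"linearly scaled lr for 1 x 64: {single_gpu_lr:.3e}")
```

Under these assumptions the single-GPU batch is one eighth of the reference batch, so the scaled learning rate is one eighth of the base value.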
Reproduces the problem - error message
Table 1: Training Stage (snapshot at the 50th iteration of each epoch)

| epoch | source | loss ‰ | acc_pose ‰ | lr (×10⁻⁵) | grad_norm ‰ |
|---|---|---|---|---|---|
| 20 | official | 0.8 | 71.8 | 12.7 | 2.0 |
| 20 | mine | 0.86 | 71.8 | 2.75 | 3.0 |
| 30 | official | 0.8 | 74.1 | 12.7 | 1.9 |
| 30 | mine | 0.83 | 77.5 | 2.75 | 2.8 |
| 40 | official | 0.8 | 75.8 | 12.7 | 1.8 |
| 40 | mine | 0.81 | 73.4 | 2.75 | 3.0 |
| 50 | official | 0.7 | 76.4 | 12.7 | 1.8 |
| 50 | mine | 0.81 | 68.3 | 2.75 | 2.9 |
| 60 | official | 0.7 | 76.7 | 12.7 | 1.6 |
| 60 | mine | 0.79 | 72.9 | 2.75 | 3.0 |
| 70 | official | 0.7 | 77.3 | 12.7 | 1.3 |
| 70 | mine | 0.79 | 71.7 | 2.75 | 3.0 |

Table 2: Validation Stage (end-of-epoch COCO metrics)

| epoch | source | AP | AP.5 | AP.75 | AP(M) | AP(L) | AR | AR.5 | AR.75 | AR(M) | AR(L) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 20 | official | 65.1 | 86.9 | 72.5 | 58.2 | 66.7 | 71.4 | 91.5 | 78.2 | 67.2 | 77.3 |
| 20 | mine | 65.1 | 87.0 | 73.0 | 62.1 | 71.0 | 71.3 | 91.6 | 78.3 | 67.0 | 77.5 |
| 30 | official | 66.3 | 87.6 | 73.7 | 59.1 | 68.4 | 72.5 | 92.1 | 79.2 | 68.1 | 78.8 |
| 30 | mine | 66.4 | 87.5 | 74.1 | 63.4 | 72.2 | 72.3 | 91.8 | 79.3 | 68.1 | 78.3 |
| 40 | official | 67.7 | 88.0 | 75.1 | 60.9 | 69.6 | 73.8 | 92.3 | 80.5 | 69.7 | 79.7 |
| 40 | mine | 66.4 | 87.3 | 74.3 | 63.6 | 72.4 | 72.8 | 91.8 | 80.1 | 68.7 | 78.8 |
| 50 | official | 68.4 | 88.2 | 76.1 | 61.5 | 70.4 | 74.4 | 92.5 | 81.5 | 70.4 | 80.2 |
| 50 | mine | 67.1 | 87.5 | 74.9 | 64.2 | 72.9 | 73.3 | 92.0 | 80.3 | 69.1 | 79.2 |
| 60 | official | 69.1 | 88.3 | 76.9 | 62.3 | 70.7 | 75.1 | 92.6 | 82.0 | 71.2 | 80.7 |
| 60 | mine | 67.5 | 88.0 | 75.4 | 64.9 | 73.0 | 73.5 | 92.1 | 80.7 | 69.7 | 79.0 |
| 70 | official | 69.5 | 88.7 | 77.5 | 62.6 | 71.6 | 75.4 | 92.9 | 82.4 | 71.3 | 81.2 |
| 70 | mine | 67.6 | 87.9 | 75.5 | 64.7 | 73.4 | 73.7 | 92.3 | 81.0 | 69.6 | 79.5 |

Is this normal? The gap between my run and the official results keeps widening as training progresses. Any help would be appreciated!
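To quantify the widening gap, here is a small sketch that computes the per-epoch AP difference (official minus mine) from the AP column of Table 2. The values are copied from the table; `ap_gaps` is only a hypothetical helper for this illustration.

```python
# AP values copied from Table 2: epoch -> (official, mine).
ap_by_epoch = {
    20: (65.1, 65.1),
    30: (66.3, 66.4),
    40: (67.7, 66.4),
    50: (68.4, 67.1),
    60: (69.1, 67.5),
    70: (69.5, 67.6),
}


def ap_gaps(table):
    """Return {epoch: official AP - my AP}, rounded to one decimal place."""
    return {
        epoch: round(official - mine, 1)
        for epoch, (official, mine) in sorted(table.items())
    }


gaps = ap_gaps(ap_by_epoch)
for epoch, gap in gaps.items():
    print(f"epoch {epoch}: official - mine = {gap:+.1f} AP")
```

By this measure the two runs are level at epoch 20, and the official run is ahead by 1.9 AP at epoch 70.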
Additional information
No response