
A few tests failed/had errors in nvidia pytorch 20.12 container with A100 GPU #1472

Closed
@IsaacYangSLA


Describe the bug
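Several tests fail or raise errors when running the test suite in the nvcr.io/nvidia/pytorch:20.12-py3 container on an A100 GPU. The detailed log is at the end of this report.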

To Reproduce
Launch nvcr.io/nvidia/pytorch:20.12-py3 on an A100 host.
nvidia-smi shows:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.08       Driver Version: 455.08       CUDA Version: 11.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  A100-PCIE-40GB      On   | 00000000:65:00.0 Off |                    0 |
| N/A   33C    P0    37W / 250W |      0MiB / 40536MiB |      0%      Default |
|                               |                      |             Disabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Then run:
pip install -r requirements-dev.txt
BUILD_MONAI=1 ./runtests.sh --coverage

Expected behavior
All tests pass.

MONAI rev id is 91cb8cd

Additional context

ERROR: test_training (tests.test_integration_determinism.TestDeterminism)

AssertionError:
Not equal to tolerance rtol=0.0001, atol=0

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 5.6044949e-05
Max relative difference: 0.00010458
x: array(0.535983)
y: array(0.535927)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 409, in _wrapper
raise RuntimeError(res.traceback) from res
RuntimeError: Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 348, in run_process
output = func(*args, **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 434, in _call_original_func
return f(*args, **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_integration_determinism.py", line 84, in test_training
np.testing.assert_allclose(loss, 0.535927, rtol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=0.0001, atol=0

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 5.6044949e-05
Max relative difference: 0.00010458
x: array(0.535983)
y: array(0.535927)
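
For reference, `np.testing.assert_allclose` passes only when `|actual - desired| <= atol + rtol * |desired|`. A minimal sketch of that check in plain Python, using the rounded values printed in the log above (the helper name `within_tolerance` is made up for illustration), shows how narrowly this assertion misses:

```python
def within_tolerance(actual, desired, rtol=1e-7, atol=0.0):
    """Element-wise criterion applied by numpy.testing.assert_allclose."""
    return abs(actual - desired) <= atol + rtol * abs(desired)


# Rounded values copied from the test_integration_determinism failure.
actual, desired = 0.535983, 0.535927

allowed = 1e-4 * abs(desired)      # largest difference accepted at rtol=1e-4
observed = abs(actual - desired)   # difference actually seen in the log

print(f"allowed={allowed:.3e} observed={observed:.3e}")
print(within_tolerance(actual, desired, rtol=1e-4))  # False
```

The observed difference exceeds the allowed one by only a few percent, so the loss is close to the reference value but not within the configured tolerance.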

======================================================================
ERROR: test_timing (tests.test_integration_segmentation_3d.IntegrationSegmentation3D)

ValueError: no matched results for integration_segmentation_3d, output_sums. [0.14197655735151385, 0.15187345208036218, 0.15151798637064173, 0.13981437171922162, 0.18797424777831578, 0.16945214449897825, 0.14668689379487942, 0.16789344992671038, 0.15692868772420235, 0.17890303083214196, 0.16228913079728044, 0.16761449125061936, 0.14524114928247697, 0.11657973897988946, 0.16152112759094686, 0.20063607303425035, 0.17549273136006815, 0.10394395512027575, 0.19296821173905265, 0.20234056500046443, 0.19596162310697743, 0.20778924939444293, 0.16200035873640076, 0.13186463537569404, 0.14872206861575096, 0.1424331264902646, 0.23080494038241672, 0.16095933686883251, 0.14811113251292585, 0.10420227534253262, 0.11886731056829931, 0.13106810760639204, 0.11427714341377837, 0.15350518705361246, 0.1628292835845796, 0.19326941589027985, 0.22188361089428374, 0.1803359227170212, 0.18965555393236588, 0.08550083029685855].

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 409, in _wrapper
raise RuntimeError(res.traceback) from res
RuntimeError: Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 348, in run_process
output = func(*args, **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 434, in _call_original_func
return f(*args, **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_integration_segmentation_3d.py", line 288, in test_timing
self.train_and_infer(idx=3)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_integration_segmentation_3d.py", line 274, in train_and_infer
self.assertTrue(test_integration_value(TASK, key="output_sums", data=results[8:], rtol=1e-2))
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/testing_data/integration_answers.py", line 327, in test_integration_value
raise ValueError(f"no matched results for {test_name}, {key}. {data}.")
ValueError: no matched results for integration_segmentation_3d, output_sums. [0.14197655735151385, 0.15187345208036218, 0.15151798637064173, 0.13981437171922162, 0.18797424777831578, 0.16945214449897825, 0.14668689379487942, 0.16789344992671038, 0.15692868772420235, 0.17890303083214196, 0.16228913079728044, 0.16761449125061936, 0.14524114928247697, 0.11657973897988946, 0.16152112759094686, 0.20063607303425035, 0.17549273136006815, 0.10394395512027575, 0.19296821173905265, 0.20234056500046443, 0.19596162310697743, 0.20778924939444293, 0.16200035873640076, 0.13186463537569404, 0.14872206861575096, 0.1424331264902646, 0.23080494038241672, 0.16095933686883251, 0.14811113251292585, 0.10420227534253262, 0.11886731056829931, 0.13106810760639204, 0.11427714341377837, 0.15350518705361246, 0.1628292835845796, 0.19326941589027985, 0.22188361089428374, 0.1803359227170212, 0.18965555393236588, 0.08550083029685855].

======================================================================
ERROR: test_training (tests.test_integration_segmentation_3d.IntegrationSegmentation3D)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_integration_segmentation_3d.py", line 280, in test_training
results = self.train_and_infer(i)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_integration_segmentation_3d.py", line 274, in train_and_infer
self.assertTrue(test_integration_value(TASK, key="output_sums", data=results[8:], rtol=1e-2))
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/testing_data/integration_answers.py", line 327, in test_integration_value
raise ValueError(f"no matched results for {test_name}, {key}. {data}.")
ValueError: no matched results for integration_segmentation_3d, output_sums. [0.14197655735151385, 0.15187345208036218, 0.15151798637064173, 0.13981437171922162, 0.18797424777831578, 0.16945214449897825, 0.14668689379487942, 0.16789344992671038, 0.15692868772420235, 0.17890303083214196, 0.16228913079728044, 0.16761449125061936, 0.14524114928247697, 0.11657973897988946, 0.16152112759094686, 0.20063607303425035, 0.17549273136006815, 0.10394395512027575, 0.19296821173905265, 0.20234056500046443, 0.19596162310697743, 0.20778924939444293, 0.16200035873640076, 0.13186463537569404, 0.14872206861575096, 0.1424331264902646, 0.23080494038241672, 0.16095933686883251, 0.14811113251292585, 0.10420227534253262, 0.11886731056829931, 0.13106810760639204, 0.11427714341377837, 0.15350518705361246, 0.1628292835845796, 0.19326941589027985, 0.22188361089428374, 0.1803359227170212, 0.18965555393236588, 0.08550083029685855].

======================================================================
ERROR: test_training (tests.test_integration_sliding_window.TestIntegrationSlidingWindow)

AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 4.
Max relative difference: 0.00011897
x: array(33617.)
y: array(33621)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 409, in _wrapper
raise RuntimeError(res.traceback) from res
RuntimeError: Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 348, in run_process
output = func(*args, **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/utils.py", line 434, in _call_original_func
return f(*args, **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_integration_sliding_window.py", line 86, in test_training
np.testing.assert_allclose(np.sum(output_image), 33621)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0

Mismatched elements: 1 / 1 (100%)
Max absolute difference: 4.
Max relative difference: 0.00011897
x: array(33617.)
y: array(33621)
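
This assertion runs with the default `rtol=1e-07`, which for a sum near 33621 allows a difference of only about 0.0034. As a rough sketch (the helper `required_rtol` is hypothetical, and this is not a recommendation to loosen the test), the smallest relative tolerance that would accept the logged value can be computed as:

```python
def required_rtol(actual, desired):
    """Smallest rtol (with atol=0) satisfying |actual - desired| <= rtol * |desired|."""
    return abs(actual - desired) / abs(desired)


# Values copied from the test_integration_sliding_window failure.
needed = required_rtol(33617.0, 33621)
print(f"{needed:.8f}")  # 0.00011897
```

The result matches the "Max relative difference" reported in the log.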

======================================================================
FAIL: test_affine_transform_2d (tests.test_affine_transform.TestAffineTransform)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_affine_transform.py", line 208, in test_affine_transform_2d
np.testing.assert_allclose(out, expected, atol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.0001

Mismatched elements: 12 / 12 (100%)
Max absolute difference: 0.00219655
Max relative difference: 3070.99998842
x: array([[[[2.197266e-03, 4.995321e-01, 9.994913e-01, 1.499634e+00],
[3.866943e+00, 1.365621e+00, 1.865580e+00, 2.365723e+00],
[7.732300e+00, 3.037472e+00, 2.731669e+00, 3.231812e+00]]]],
dtype=float32)
y: array([[[[7.152557e-07, 4.999999e-01, 1.000000e+00, 1.500000e+00],
[3.866026e+00, 1.366025e+00, 1.866025e+00, 2.366025e+00],
[7.732052e+00, 3.035899e+00, 2.732051e+00, 3.232051e+00]]]])

======================================================================
FAIL: test_affine_transform_3d (tests.test_affine_transform.TestAffineTransform)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_affine_transform.py", line 258, in test_affine_transform_3d
np.testing.assert_allclose(out, expected, atol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.0001

Mismatched elements: 44 / 48 (91.7%)
Max absolute difference: 0.00202166
Max relative difference: 31533.83072917
x: array([[[[[1.892090e-03, 5.017701e-01],
[2.367615e+00, 1.367493e+00],
[4.733337e+00, 2.402832e+00],...
y: array([[[[[6.000000e-08, 5.000001e-01],
[2.366025e+00, 1.366025e+00],
[4.732051e+00, 2.401924e+00],...

======================================================================
FAIL: test_to_norm_affine_0 (tests.test_affine_transform.TestToNormAffine)

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
return func(*(a + p.args), **p.kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_affine_transform.py", line 98, in test_to_norm_affine
np.testing.assert_allclose(new_affine, expected, atol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.0001

Mismatched elements: 2 / 9 (22.2%)
Max absolute difference: 0.00032559
Max relative difference: 0.00097667
x: array([[[ 1.333008, 0. , 0.333008],
[ 0. , 0.399902, -0.600098],
[ 0. , 0. , 1. ]]], dtype=float32)
y: array([[[ 1.333333, 0. , 0.333333],
[ 0. , 0.4 , -0.6 ],
[ 0. , 0. , 1. ]]])
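
The mismatched entries above look like values that kept only about three decimal digits, which is what one would expect if float32 values were reduced to TF32 precision (10 explicit mantissa bits) on the A100. This is only a hypothesis; nothing in this report confirms that TF32 is involved. A minimal pure-Python sketch, assuming round-to-nearest on the dropped bits (the helper `round_to_tf32` is made up for illustration and does not call any GPU library), reproduces three of the logged values:

```python
import struct


def round_to_tf32(x: float) -> float:
    """Round x to float32, then keep only 10 explicit mantissa bits.

    Assumption: the 13 dropped bits are rounded to nearest (ties away
    from zero). This only emulates the precision loss; it is not the
    hardware conversion itself.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 1 << 12              # add half of the dropped 13-bit range
    bits &= ~((1 << 13) - 1)     # clear the 13 low mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


# Expected entries from test_to_norm_affine_0 and their reduced-precision forms.
print(round_to_tf32(4 / 3))   # 1.3330078125   (log shows 1.333008)
print(round_to_tf32(0.4))     # 0.39990234375  (log shows 0.399902)
print(round_to_tf32(-0.6))    # -0.60009765625 (log shows -0.600098)
```

If this is the cause, the errors come from reduced-precision arithmetic on this GPU rather than from the transform logic itself, but that would need to be verified separately.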

======================================================================
FAIL: test_to_norm_affine_1 (tests.test_affine_transform.TestToNormAffine)

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
return func(*(a + p.args), **p.kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_affine_transform.py", line 98, in test_to_norm_affine
np.testing.assert_allclose(new_affine, expected, atol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.0001

Mismatched elements: 2 / 9 (22.2%)
Max absolute difference: 0.00024414
Max relative difference: 0.00048828
x: array([[[ 1.25 , 0. , 0.25 ],
[ 0. , 0.499878, -0.500244],
[ 0. , 0. , 1. ]]], dtype=float32)
y: array([[[ 1.25, 0. , 0.25],
[ 0. , 0.5 , -0.5 ],
[ 0. , 0. , 1. ]]])

======================================================================
FAIL: test_to_norm_affine_2 (tests.test_affine_transform.TestToNormAffine)

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
return func(*(a + p.args), **p.kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_affine_transform.py", line 98, in test_to_norm_affine
np.testing.assert_allclose(new_affine, expected, atol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.0001

Mismatched elements: 2 / 16 (12.5%)
Max absolute difference: 0.00032559
Max relative difference: 0.00097667
x: array([[[ 2. , 0. , 0. , 1. ],
[ 0. , 1.333008, 0. , 0.333008],
[ 0. , 0. , 0.399902, -0.600098],
[ 0. , 0. , 0. , 1. ]]], dtype=float32)
y: array([[[ 2. , 0. , 0. , 1. ],
[ 0. , 1.333333, 0. , 0.333333],
[ 0. , 0. , 0.4 , -0.6 ],
[ 0. , 0. , 0. , 1. ]]])

======================================================================
FAIL: test_to_norm_affine_3 (tests.test_affine_transform.TestToNormAffine)

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
return func(*(a + p.args), **p.kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_affine_transform.py", line 98, in test_to_norm_affine
np.testing.assert_allclose(new_affine, expected, atol=1e-4)
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 1527, in assert_allclose
assert_array_compare(compare, actual, desired, err_msg=str(err_msg),
File "/opt/conda/lib/python3.8/site-packages/numpy/testing/_private/utils.py", line 840, in assert_array_compare
raise AssertionError(msg)
AssertionError:
Not equal to tolerance rtol=1e-07, atol=0.0001

Mismatched elements: 2 / 16 (12.5%)
Max absolute difference: 0.00024414
Max relative difference: 0.00048828
x: array([[[ 1.5 , 0. , 0. , 0.5 ],
[ 0. , 1.25 , 0. , 0.25 ],
[ 0. , 0. , 0.499878, -0.500244],
[ 0. , 0. , 0. , 1. ]]], dtype=float32)
y: array([[[ 1.5 , 0. , 0. , 0.5 ],
[ 0. , 1.25, 0. , 0.25],
[ 0. , 0. , 0.5 , -0.5 ],
[ 0. , 0. , 0. , 1. ]]])

======================================================================
FAIL: test_content (tests.test_ensemble_evaluator.TestEnsembleEvaluator)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_ensemble_evaluator.py", line 60, in test_content
val_engine.run()
File "/home/jenkins/agent/workspace/MONAI-nightly/monai/engines/evaluator.py", line 101, in run
super().run()
File "/home/jenkins/agent/workspace/MONAI-nightly/monai/engines/workflow.py", line 166, in run
super().run(data=self.data_loader, max_epochs=self.state.max_epochs)
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 691, in run
return self._internal_run()
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 762, in _internal_run
self._handle_exception(e)
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
raise e
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 730, in _internal_run
time_taken = self._run_once_on_dataset()
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 828, in _run_once_on_dataset
self._handle_exception(e)
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 467, in _handle_exception
raise e
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 812, in _run_once_on_dataset
self._fire_event(Events.ITERATION_COMPLETED)
File "/opt/conda/lib/python3.8/site-packages/ignite/engine/engine.py", line 423, in _fire_event
func(*first, *(event_args + others), **kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_ensemble_evaluator.py", line 58, in run_post_transform
torch.testing.assert_allclose(engine.state.output[f"pred{i}"], expected_value)
File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 215, in assert_allclose
raise AssertionError("expected tensor shape {0} doesn't match with actual tensor "
AssertionError: expected tensor shape torch.Size([]) doesn't match with actual tensor shape torch.Size([1, 1])!

======================================================================
FAIL: test_one_save_one_load (tests.test_handler_checkpoint_loader.TestHandlerCheckpointLoader)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_handler_checkpoint_loader.py", line 43, in test_one_save_one_load
torch.testing.assert_allclose(net2.state_dict()["weight"], 0.1)
File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 215, in assert_allclose
raise AssertionError("expected tensor shape {0} doesn't match with actual tensor "
AssertionError: expected tensor shape torch.Size([]) doesn't match with actual tensor shape torch.Size([1])!

======================================================================
FAIL: test_save_single_device_load_multi_devices (tests.test_handler_checkpoint_loader.TestHandlerCheckpointLoader)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_handler_checkpoint_loader.py", line 86, in test_save_single_device_load_multi_devices
torch.testing.assert_allclose(net2.state_dict()["module.weight"].cpu(), 0.1)
File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 215, in assert_allclose
raise AssertionError("expected tensor shape {0} doesn't match with actual tensor "
AssertionError: expected tensor shape torch.Size([]) doesn't match with actual tensor shape torch.Size([1])!

======================================================================
FAIL: test_two_save_one_load (tests.test_handler_checkpoint_loader.TestHandlerCheckpointLoader)

Traceback (most recent call last):
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_handler_checkpoint_loader.py", line 65, in test_two_save_one_load
torch.testing.assert_allclose(net2.state_dict()["weight"], 0.1)
File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 215, in assert_allclose
raise AssertionError("expected tensor shape {0} doesn't match with actual tensor "
AssertionError: expected tensor shape torch.Size([]) doesn't match with actual tensor shape torch.Size([1])!

======================================================================
FAIL: test_value_cuda_0 (tests.test_lltm.TestLLTM)

Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/parameterized/parameterized.py", line 533, in standalone_func
return func(*(a + p.args), **p.kwargs)
File "/home/jenkins/agent/workspace/MONAI-nightly/tests/test_lltm.py", line 53, in test_value_cuda
torch.testing.assert_allclose(new_h, expected_h.to(device), rtol=0.0001, atol=1e-04)
File "/opt/conda/lib/python3.8/site-packages/torch/testing/__init__.py", line 232, in assert_allclose
raise AssertionError(msg)
AssertionError: With rtol=0.0001 and atol=0.0001, found 2 element(s) (out of 8) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.00022590160369873047 (0.616874098777771 vs. 0.6171000003814697), which occurred at index (2, 1).


Ran 1957 tests in 2009.114s

FAILED (failures=11, errors=4, skipped=4)

Labels

bug: Something isn't working
