Description
Created from #5388 (comment).
All CUDA CI jobs and several Linux jobs are failing with the following:

```text
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[binary-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[multiclass-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[regression]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[ranking]
= 4 failed, 700 passed, 10 skipped, 2 xfailed, 395 warnings in 655.51s (0:10:55) =
```
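For reference, one way to run just those failing tests locally (a sketch, assuming a checkout with the test environment already set up and the CI working directory) might be:

```shell
# select only the failing parametrized tests by name
pytest ../tests/python_package_test/test_dask.py -k "test_machines_should_be_used_if_provided"
```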
`client.restart()` calls in that test are resulting in the following:

```text
    raise TimeoutError(f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s")
E   asyncio.exceptions.TimeoutError: 1/2 nanny worker(s) did not shut down within 120s
```
Traceback:
```text
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
    return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
    return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
    result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
    value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
    result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
    return await func(*args, **kwargs)
```
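The traceback shows that `Client.restart()` forwards `timeout` (and `wait_for_workers`) on to the scheduler, so one possible mitigation, a sketch I have not verified against 2022.7.1, would be to give the nannies a longer shutdown window than the 120s seen in the failure:

```python
from distributed import Client

client = Client()  # assumes an existing local cluster; for illustration only

# per the traceback above, `timeout` is forwarded to scheduler.restart(),
# so raising it may avoid the 120s nanny-shutdown TimeoutError
client.restart(timeout=300)
```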
It looks like those jobs are getting `dask` and `distributed` 2022.7.1, which hit conda-forge 3 days ago:

```text
dask-2022.7.1              | pyhd8ed1ab_0         5 KB  conda-forge
dask-core-2022.7.1         | pyhd8ed1ab_0       840 KB  conda-forge
dbus-1.13.6                | h5008d03_3         604 KB  conda-forge
distributed-2022.7.1       | pyhd8ed1ab_0       735 KB  conda-forge
```
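A quick way to confirm which versions a given environment actually resolved to (a trivial check, not taken from the failing logs):

```python
# print the resolved dask / distributed versions in the test environment
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
```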
Reproducible example
Here's an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true
I don't believe the failure is related to anything specific in the PR that the failed build came from.
Additional Comments
Note that this should not be a concern for jobs using Python < 3.8, as `dask` / `distributed` have dropped support for those Python versions.
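Given that, one hypothetical pin (untested) that would skip only the broken release while leaving Python < 3.8 jobs untouched is an exclusion constraint rather than an exact pin (see the failed exact pin below):

```shell
# '!=' excludes just 2022.7.1; older-Python jobs resolve to pre-3.8-compatible
# releases anyway, so this spec should stay satisfiable everywhere (assumption)
conda install -c conda-forge 'dask!=2022.7.1' 'distributed!=2022.7.1'
```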
Logs from an example build on #5388 where I tried to pin to exactly `dask==2022.7.0` (build link):
```text
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - dask==2022.7.0 -> python[version='>=3.8']
  - distributed==2022.7.0 -> python[version='>=3.8']
```
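If an exact pin is needed, a sketch of one way around that solver conflict, assuming the pin could go through pip-style requirements rather than conda, is a PEP 508 environment marker so the pin only applies where it is installable:

```text
# requirements.txt sketch (hypothetical): pins apply only on Python >= 3.8,
# leaving older-Python environments free to resolve earlier releases
dask==2022.7.0; python_version >= "3.8"
distributed==2022.7.0; python_version >= "3.8"
```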