[ci] [dask] CI jobs failing with Dask 2022.7.1 #5390

Open
@jameslamb

Description

Created from #5388 (comment).

All CUDA CI jobs and several Linux jobs are failing with the following test failures:

FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[binary-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[multiclass-classification]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[regression]
FAILED ../tests/python_package_test/test_dask.py::test_machines_should_be_used_if_provided[ranking]
= 4 failed, 700 passed, 10 skipped, 2 xfailed, 395 warnings in 655.51s (0:10:55) =

The client.restart() calls in that test result in the following error:

raise TimeoutError(f"{len(bad_nannies)}/{len(nannies)} nanny worker(s) did not shut down within {timeout}s")
E asyncio.exceptions.TimeoutError: 1/2 nanny worker(s) did not shut down within 120s

Traceback:
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3360: in restart
    return self.sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:338: in sync
    return sync(
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:405: in sync
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:378: in f
    result = yield future
/root/miniforge/envs/test-env/lib/python3.9/site-packages/tornado/gen.py:762: in run
    value = future.result()
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/client.py:3329: in _restart
    await self.scheduler.restart(timeout=timeout, wait_for_workers=wait_for_workers)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:1153: in send_recv_from_rpc
    return await send_recv(comm=comm, op=key, **kwargs)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:943: in send_recv
    raise exc.with_traceback(tb)
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/core.py:769: in _handle_comm
    result = await result
/root/miniforge/envs/test-env/lib/python3.9/site-packages/distributed/utils.py:778: in wrapper
    return await func(*args, **kwargs)
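
For reference, here is a minimal sketch of the call pattern that hits the timeout: a multi-worker cluster with nanny-managed worker processes, followed by a `client.restart()`. The cluster setup below is only an assumption for illustration, not the actual test fixture.

```python
# Minimal sketch of the failing call pattern (illustrative setup only,
# not the actual test fixture): two workers running under nannies.
from distributed import Client, LocalCluster

if __name__ == "__main__":
    with LocalCluster(n_workers=2, threads_per_worker=1) as cluster:
        with Client(cluster) as client:
            # In the real test, Dask LightGBM training with an explicit
            # `machines` string runs here before the restart.

            # restart() asks each nanny to kill and respawn its worker
            # process; on the CI machines with dask/distributed 2022.7.1
            # this is where "1/2 nanny worker(s) did not shut down within
            # 120s" is raised.
            client.restart()
```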

It looks like those jobs are picking up dask and distributed 2022.7.1:

    dask-2022.7.1              |     pyhd8ed1ab_0           5 KB  conda-forge
    dask-core-2022.7.1         |     pyhd8ed1ab_0         840 KB  conda-forge
    dbus-1.13.6                |       h5008d03_3         604 KB  conda-forge
    distributed-2022.7.1       |     pyhd8ed1ab_0         735 KB  conda-forge

which hit conda-forge 3 days ago.

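To confirm which versions a given CI environment actually resolved, a quick check like the following (run inside the test environment) should be enough:

```python
# Print the dask / distributed versions the environment resolved.
import dask
import distributed

print("dask:", dask.__version__)
print("distributed:", distributed.__version__)
```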

Reproducible example

Here's an example: https://github.com/microsoft/LightGBM/runs/7522939980?check_suite_focus=true

I don't believe the failure is related to anything specific in the PR that the failed build came from.

Additional Comments

Note that this should not be a concern for jobs using Python < 3.8, as dask / distributed have dropped support for those Python versions, so those jobs will keep resolving older releases.

Logs from an example build on #5388 where I tried to pin to exactly dask==2022.7.0 (build link):

UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:

Specifications:

  - dask==2022.7.0 -> python[version='>=3.8']
  - distributed==2022.7.0 -> python[version='>=3.8']
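
If we do end up pinning, the constraint only makes sense on the Python >= 3.8 jobs, since those are the only environments that can install the new releases at all. A rough sketch of that conditional logic follows; the helper name and how it would be wired into the CI scripts are just illustrative assumptions.

```python
# Illustrative sketch only: choose conda specs for dask / distributed based
# on the running Python version. On Python >= 3.8 we pin to 2022.7.0 to
# avoid the broken 2022.7.1; older Pythons resolve earlier releases that
# still support them (which predate this regression), so no pin is needed
# there, and a flat pin is unsatisfiable anyway, as the log above shows.
import sys


def dask_conda_specs():
    if sys.version_info >= (3, 8):
        return ["dask==2022.7.0", "distributed==2022.7.0"]
    return ["dask", "distributed"]


if __name__ == "__main__":
    print(" ".join(dask_conda_specs()))
```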
