Add a 5 min default timeout for deadlocks #16342
2287b53 to
2a5b779
Compare
crates/uv-fs/src/locked_file.rs
Outdated
| timeout: Duration, | ||
| ) -> Option<Output> { | ||
| let (sender, receiver) = std::sync::mpsc::channel(); | ||
| thread::spawn(move || { |
This should happen rarely and already involves waiting, so we can spawn a thread. I quickly looked into making it generally async but it didn't seem worth the churn.
Hm, we only have a couple calls to the blocking / sync versions of the lock APIs. I'd be tempted to make them async.
I wonder if we should move this timeout handling to the API functions so we can run_with_timeout in the blocking / sync versions and just use an async timeout in the async versions? I'm wary of spawning a thread just for a timeout in the async case.
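For context, the thread-plus-channel approach under discussion can be sketched roughly as follows. This is a minimal sketch and not necessarily the code in the PR: only the `timeout: Duration` parameter, the `Option<Output>` return type, the `mpsc` channel, and the `thread::spawn` call appear in the diff above; the closure parameter `f` and the use of `recv_timeout` are assumptions.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `f` on a helper thread and wait at most `timeout` for its result.
/// The parameter `f` and the `recv_timeout` call are assumptions for illustration.
fn run_with_timeout<Output: Send + 'static>(
    f: impl FnOnce() -> Output + Send + 'static,
    timeout: Duration,
) -> Option<Output> {
    let (sender, receiver) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out, so ignore send errors.
        let _ = sender.send(f());
    });
    // `None` means the deadline passed (or the helper thread panicked before sending).
    receiver.recv_timeout(timeout).ok()
}
```

Note that in this sketch the helper thread is not cancelled on timeout; it keeps running, still blocked in `f`, until `f` returns or the process exits. That cost is one reason to prefer an async timeout in the async code paths.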
| /// Parsed value of `UV_LOCK_TIMEOUT`, with a default of 5 min. | ||
| static LOCK_TIMEOUT: LazyLock<Duration> = LazyLock::new(|| { | ||
| let default_timeout = Duration::from_secs(300); | ||
| let Some(lock_timeout) = env::var_os(EnvVars::UV_LOCK_TIMEOUT) else { | ||
| return default_timeout; | ||
| }; | ||
| | ||
| if let Some(lock_timeout) = lock_timeout | ||
| .to_str() | ||
| .and_then(|lock_timeout| lock_timeout.parse::<u64>().ok()) | ||
| { | ||
| Duration::from_secs(lock_timeout) | ||
| } else { | ||
| warn!( | ||
| "Could not parse value of {} as integer: {:?}", | ||
| EnvVars::UV_LOCK_TIMEOUT, | ||
| lock_timeout | ||
| ); | ||
| default_timeout | ||
| } | ||
| }); |
It'd be nice to move this to our standard environment variable parsing in EnvironmentOptions instead; I don't want to keep adding ad-hoc parsing like this.
If you want to defer it to reduce churn, that's okay, but we should add it to the tracking issue and make sure it's moved.
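One way to make this parsing easier to relocate later is to separate it from the environment lookup and the `LazyLock`. The following is a hypothetical sketch: `parse_lock_timeout` and `DEFAULT_LOCK_TIMEOUT` are illustrative names and not part of uv or of `EnvironmentOptions`; only the 300 second default and the parse-as-`u64`-seconds behavior come from the diff above.

```rust
use std::ffi::OsStr;
use std::time::Duration;

/// Default taken from the diff above (5 min); the constant name is hypothetical.
const DEFAULT_LOCK_TIMEOUT: Duration = Duration::from_secs(300);

/// Hypothetical pure helper: interpret the raw value of `UV_LOCK_TIMEOUT`.
/// Returns the default when unset, the parsed duration when valid, and an
/// error message when the value is not a non-negative integer of seconds.
fn parse_lock_timeout(raw: Option<&OsStr>) -> Result<Duration, String> {
    let Some(raw) = raw else {
        return Ok(DEFAULT_LOCK_TIMEOUT);
    };
    raw.to_str()
        .and_then(|value| value.parse::<u64>().ok())
        .map(Duration::from_secs)
        .ok_or_else(|| format!("Could not parse value of UV_LOCK_TIMEOUT as integer: {raw:?}"))
}
```

A caller would pass `std::env::var_os("UV_LOCK_TIMEOUT").as_deref()` and, on `Err`, log the warning and fall back to the default, which matches the behavior in the diff.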
I've added it to #14720, do you want a separate tracking issue?
I had looked into parsing this centrally but the locks are called in a lot of locations including e.g. a LazyLock in a Default impl (
uv/crates/uv-auth/src/middleware.rs
Line 79 in 7f38a8a
| match TextCredentialStore::read(&path) { |
crates/uv-fs/src/locked_file.rs
Outdated
| #[derive(Debug, Error)] | ||
| pub enum LockedFileError { | ||
| #[error( | ||
| "Timeout ({}s) when waiting for lock on `{}` at `{}`, is another uv process running? Set `{}` to increase the timeout.", |
We might want to say "You can set ... to increase the timeout" instead of "Set", which makes it sound like you should do that as the solution.
crates/uv/tests/it/cache_clean.rs
Outdated
| // Write a test package that builds for a while | ||
| let child_pyproject_toml = context.temp_dir.child("pyproject.toml"); | ||
| child_pyproject_toml.write_str(indoc! {r#" | ||
| [project] | ||
| name = "child" | ||
| version = "0.1.0" | ||
| requires-python = ">=3.9" | ||
| [build-system] | ||
| requires = [] | ||
| backend-path = ["."] | ||
| build-backend = "build_backend" | ||
| "#})?; | ||
| // File to wait until the lock is acquired from starting the build. | ||
| let ready_file = context.temp_dir.child("ready_file.txt"); | ||
| let build_backend = context.temp_dir.child("build_backend.py"); | ||
| build_backend.write_str(&formatdoc! {r#" | ||
| import time | ||
| from pathlib import Path | ||
| Path(r"{}").touch() | ||
| # Make the test fail quickly if something goes wrong | ||
| time.sleep(10) | ||
| "#, | ||
| // Don't run tests in directories with double quotes, please. | ||
| ready_file.display(), | ||
| })?; | ||
| | ||
| let mut child = context.pip_install().arg(".").spawn()?; | ||
| | ||
| // Wait until we've acquired the lock in the first process. | ||
| while !ready_file.exists() { | ||
| std::thread::sleep(std::time::Duration::from_millis(1)); | ||
| } |
I think this is more complicated than it needs to be. We can just do
uv/crates/uv/tests/it/cache_clean.rs
Line 74 in 107d4e0
| let _cache = uv_cache::Cache::from_path(context.cache_dir.path()).with_exclusive_lock(); |
zanieb
left a comment
#16342 (comment) is my main remaining caveat.
We should probably also add a note in https://docs.astral.sh/uv/concepts/cache since that's the main place this will be relevant.
On the timing, I guess I might expect something like 60s rather than 5m? 5m is nice and conservative though; we could reduce it later once we see that 5m doesn't break anything.
| ) -> anyhow::Result<Vec<PathBuf>> { | ||
| let cache = Cache::from_path(temp_dir.child("cache").to_path_buf()).init()?; | ||
| let cache = Cache::from_path(temp_dir.child("cache").to_path_buf()) | ||
| .init_no_wait()? |
That's a riskier change because it assumes tests do not lock or spawn something in the background and then operate on Python versions.
The alternative is making every integration test async
e15037e to
762fbf6
Compare
I rewrote it entirely async and removed the duplication between sync and async as well as shared and exclusive.
I can see some (e.g. Rust) builds taking >60s, so I'd like to go with a higher timeout.
ac8cb01 to
fce9233
Compare
fce9233 to
9f4664d
Compare
I've rebased on top of dropping fs2.
3febf8f to
38699ba
Compare
38699ba to
3e2d46e
Compare
3e2d46e to
4d335c0
Compare
When a process is running and another calls `uv cache clean` or `uv cache prune` we currently deadlock - sometimes until the CI timeout (astral-sh/setup-uv#588). To avoid this, we add a default 5 min timeout waiting for a lock. 5 min balances allowing in-progress builds to finish, especially with larger native dependencies, while also giving timely errors for deadlocks on (remote) systems.

Handle not found errors better
Windows fix
Fix windows
Update docs
Review
Simplify test case
Clippy
Use async and write docs
Update snapshot
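The idea of bounding the wait for a lock can be illustrated with an in-process analogy. This is only a sketch under stated assumptions: it polls a `std::sync::Mutex` rather than a file lock, `lock_with_timeout` and the 5 ms polling interval are hypothetical, and it is not how uv implements the timeout (the PR ended up using an async approach).

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical helper: poll `try_lock` until it succeeds or `timeout` elapses.
/// Returns `None` on timeout instead of blocking forever.
fn lock_with_timeout<T>(lock: &Mutex<T>, timeout: Duration) -> Option<MutexGuard<'_, T>> {
    let deadline = Instant::now() + timeout;
    loop {
        match lock.try_lock() {
            Ok(guard) => return Some(guard),
            // A previous holder panicked; the data is still reachable, so hand out the guard.
            Err(TryLockError::Poisoned(poisoned)) => return Some(poisoned.into_inner()),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return None;
                }
                // Back off briefly before retrying (interval chosen arbitrarily).
                thread::sleep(Duration::from_millis(5));
            }
        }
    }
}
```

With a helper of this shape, a caller that gets `None` can report an error naming the contended resource and the timeout, rather than hanging until an external CI timeout fires.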
4d335c0 to
a36e589
Compare
This MR contains the following updates:

| Package | Update | Change |
|---|---|---|
| [astral-sh/uv](https://github.com/astral-sh/uv) | patch | `0.9.13` -> `0.9.17` |

MR created with the help of [el-capitano/tools/renovate-bot](https://gitlab.com/el-capitano/tools/renovate-bot).

**Proposed changes to behavior should be submitted there as MRs.**

---

### Release Notes

<details>
<summary>astral-sh/uv (astral-sh/uv)</summary>

### [`v0.9.17`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0917)

[Compare Source](astral-sh/uv@0.9.16...0.9.17)

Released on 2025-12-09.

##### Enhancements

- Add `torch-tensorrt` and `torchao` to the PyTorch list ([#17053](astral-sh/uv#17053))
- Add hint for misplaced `--verbose` in `uv tool run` ([#17020](astral-sh/uv#17020))
- Add support for relative durations in `exclude-newer` (a.k.a., dependency cooldowns) ([#16814](astral-sh/uv#16814))
- Add support for relocatable nushell activation script ([#17036](astral-sh/uv#17036))

##### Bug fixes

- Respect dropped (but explicit) indexes in dependency groups ([#17012](astral-sh/uv#17012))

##### Documentation

- Improve `source-exclude` reference docs ([#16832](astral-sh/uv#16832))
- Recommend `UV_NO_DEV` in Docker installs ([#17030](astral-sh/uv#17030))
- Update `UV_VERSION` in docs for GitLab CI/CD ([#17040](astral-sh/uv#17040))

### [`v0.9.16`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0916)

[Compare Source](astral-sh/uv@0.9.15...0.9.16)

Released on 2025-12-06.

##### Python

- Add CPython 3.14.2
- Add CPython 3.13.11

##### Enhancements

- Add a 5m default timeout to acquiring file locks to fail faster on deadlock ([#16342](astral-sh/uv#16342))
- Add a stub `debug` subcommand to `uv pip` announcing its intentional absence ([#16966](astral-sh/uv#16966))
- Add bounds in `uv add --script` ([#16954](astral-sh/uv#16954))
- Add brew specific message for `uv self update` ([#16838](astral-sh/uv#16838))
- Error when built wheel is for the wrong platform ([#16074](astral-sh/uv#16074))
- Filter wheels from PEP 751 files based on `--no-binary` et al in `uv pip compile` ([#16956](astral-sh/uv#16956))
- Support `--target` and `--prefix` in `uv pip list`, `uv pip freeze`, and `uv pip show` ([#16955](astral-sh/uv#16955))
- Tweak language for build backend validation errors ([#16720](astral-sh/uv#16720))
- Use explicit credentials cache instead of global static ([#16768](astral-sh/uv#16768))
- Enable SIMD in HTML parsing ([#17010](astral-sh/uv#17010))

##### Preview features

- Fix missing preview warning in `uv workspace metadata` ([#16988](astral-sh/uv#16988))
- Add a `uv auth helper --protocol bazel` command ([#16886](astral-sh/uv#16886))

##### Bug fixes

- Fix Pyston wheel compatibility tags ([#16972](astral-sh/uv#16972))
- Allow redundant entries in `tool.uv.build-backend.module-name` but emit warnings ([#16928](astral-sh/uv#16928))
- Fix infinite loop in non-attribute re-treats during HTML parsing ([#17010](astral-sh/uv#17010))

##### Documentation

- Clarify `--project` flag help text to indicate project discovery ([#16965](astral-sh/uv#16965))
- Regenerate the crates.io READMEs on release ([#16992](astral-sh/uv#16992))
- Update Docker integration guide to prefer `COPY` over `ADD` for simple cases ([#16883](astral-sh/uv#16883))
- Update PyTorch documentation to include information about supporting CUDA 13.0.x ([#16957](astral-sh/uv#16957))
- Update the versioning policy ([#16710](astral-sh/uv#16710))
- Upgrade PyTorch documentation to latest versions ([#16970](astral-sh/uv#16970))

### [`v0.9.15`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0915)

[Compare Source](astral-sh/uv@0.9.14...0.9.15)

Released on 2025-12-02.

##### Python

- Add CPython 3.14.1
- Add CPython 3.13.10

##### Enhancements

- Add ROCm 6.4 to `--torch-backend=auto` ([#16919](astral-sh/uv#16919))
- Add a Windows manifest to uv binaries ([#16894](astral-sh/uv#16894))
- Add LFS toggle to Git sources ([#16143](astral-sh/uv#16143))
- Cache source reads during resolution ([#16888](astral-sh/uv#16888))
- Allow reading requirements from scripts without an extension ([#16923](astral-sh/uv#16923))
- Allow reading requirements from scripts with HTTP(S) paths ([#16891](astral-sh/uv#16891))

##### Configuration

- Add `UV_HIDE_BUILD_OUTPUT` to omit build logs ([#16885](astral-sh/uv#16885))

##### Bug fixes

- Fix `uv-trampoline-builder` builds from crates.io by moving bundled executables ([#16922](astral-sh/uv#16922))
- Respect `NO_COLOR` and always show the command as a header when paging `uv help` output ([#16908](astral-sh/uv#16908))
- Use `0o666` permissions for flock files instead of `0o777` ([#16845](astral-sh/uv#16845))
- Revert "Bump `astral-tl` to v0.7.10 ([#16887](astral-sh/uv#16887))" to narrow down a regression causing hangs in metadata retrieval ([#16938](astral-sh/uv#16938))

##### Documentation

- Link to the uv version in crates.io member READMEs ([#16939](astral-sh/uv#16939))

### [`v0.9.14`](https://github.com/astral-sh/uv/blob/HEAD/CHANGELOG.md#0914)

[Compare Source](astral-sh/uv@0.9.13...0.9.14)

Released on 2025-12-01.

##### Performance

- Bump `astral-tl` to v0.7.10 to enable SIMD for HTML parsing ([#16887](astral-sh/uv#16887))

##### Bug fixes

- Allow earlier post releases with exclusive ordering ([#16881](astral-sh/uv#16881))
- Prefer updating existing `.zshenv` over creating a new one in `tool update-shell` ([#16866](astral-sh/uv#16866))
- Respect `-e` flags in `uv add` ([#16882](astral-sh/uv#16882))

##### Enhancements

- Attach subcommand to User-Agent string ([#16837](astral-sh/uv#16837))
- Prefer `UV_WORKING_DIR` over `UV_WORKING_DIRECTORY` for consistency ([#16884](astral-sh/uv#16884))

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever MR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this MR and you won't be reminded about this update again.

---

- [ ] If you want to rebase/retry this MR, check this box

---

This MR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
When a process is running and another calls `uv cache clean` or `uv cache prune` we currently deadlock - sometimes until the CI timeout (astral-sh/setup-uv#588). To avoid this, we add a default 5 min timeout waiting for a lock. 5 min balances allowing in-progress builds to finish, especially with larger native dependencies, while also giving timely errors for deadlocks on (remote) systems.
Commit 1 is a refactoring.
This branch also fixes a problem with the logging where acquired and released resources currently mismatch: