Description
I went to start a new branch for the `gba` crate, and things were going well until LLVM demanded that I add a bunch of definitions for atomic intrinsics:
= note: arm-none-eabi-ld: D:\dev\gba\target\thumbv4t-none-eabi\debug\deps\libgba-e42d0bb0e1acbf2b.rlib(gba-e42d0bb0e1acbf2b.8ymripsqk83bpm.rcgu.o): in function `core::sync::atomic::atomic_store':
C:\Users\Daniel\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\core\src\sync/atomic.rs:2948: undefined reference to `__sync_lock_test_and_set_4'
arm-none-eabi-ld: D:\dev\gba\target\thumbv4t-none-eabi\debug\examples\hello-4b3a62c592de993e.5c84v82b0il7lif5.rcgu.o: in function `core::sync::atomic::atomic_store':
C:\Users\Daniel\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\core\src\sync/atomic.rs:2948: undefined reference to `__sync_lock_test_and_set_2'
arm-none-eabi-ld: D:\dev\gba\target\thumbv4t-none-eabi\debug\examples\hello-4b3a62c592de993e.5c84v82b0il7lif5.rcgu.o: in function `core::sync::atomic::atomic_load':
C:\Users\Daniel\.rustup\toolchains\nightly-x86_64-pc-windows-msvc\lib\rustlib\src\rust\library\core\src\sync/atomic.rs:2963: undefined reference to `__sync_val_compare_and_swap_2'
These are all part of LLVM's `__sync_*` libcall family, and they are probably caused by the atomic accesses I'm using:
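For illustration (the actual snippet from the branch is not reproduced here, and the static names below are made up), accesses of roughly this shape would be expected to pull in exactly those symbols:

```rust
// Hypothetical illustration only: SeqCst stores/loads of Atomic* values are the
// kind of access that ends up needing the __sync_* symbols from the errors above.
use core::sync::atomic::{AtomicU16, AtomicU32, Ordering};

static KEYS: AtomicU16 = AtomicU16::new(0);
static FRAME: AtomicU32 = AtomicU32::new(0);

pub fn publish(keys: u16, frame: u32) {
    KEYS.store(keys, Ordering::SeqCst);   // presumably links against __sync_lock_test_and_set_2
    FRAME.store(frame, Ordering::SeqCst); // presumably links against __sync_lock_test_and_set_4
}

pub fn read_keys() -> u16 {
    KEYS.load(Ordering::SeqCst)           // presumably links against __sync_val_compare_and_swap_2
}
```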
Now, the GBA uses the ARMv4T architecture, which has no atomic instructions at the CPU level, but the `+forced-atomic` feature was recently added to many target definitions for older targets. My understanding of how this was supposed to work is that an aligned atomic load/store would just use the normal load/store instruction, and then the "advanced" atomic actions like `fetch_add` would call a function. However, it seems to be the case that (using the `thumbv4t-none-eabi` target) LLVM still wants to call functions for plain loads and stores.
Activity
thomcc commented on Sep 3, 2022
Related to (possibly dupe of) #99668.
nikic commented on Sep 3, 2022
Yeah, this shouldn't be happening. Based on the symbol names, the atomic load or store seems to get lowered to atomic CAS at some point.
nikic commented on Sep 3, 2022
Indeed: https://llvm.godbolt.org/z/TvjP1on6f
thomcc commented on Sep 3, 2022
There's no __sync equivalent to plain load/store, so I believe this is to force going through libcalls on targets where you'd need to take a global lock around all access to support RMWs (which I guess LLVM assumes we might want to support).
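As a rough sketch of what that fallback could look like on a single-core machine, here is a hypothetical hand-written version of one of the missing helpers, where masking interrupts stands in for the global lock; the IME address is GBA-specific, and none of this is what compiler-rt or the crate actually provides:

```rust
// Hypothetical sketch: on a single-core system, disabling interrupts around a
// plain read-modify-write is the moral equivalent of taking a global lock.
const IME: *mut u16 = 0x0400_0208 as *mut u16; // GBA interrupt master enable register

// __sync_lock_test_and_set_4 is an atomic exchange: write `val`, return the old value.
#[no_mangle]
pub unsafe extern "C" fn __sync_lock_test_and_set_4(ptr: *mut u32, val: u32) -> u32 {
    let prev_ime = IME.read_volatile();
    IME.write_volatile(0);        // no interrupt handler can run in here...
    let old = ptr.read_volatile();
    ptr.write_volatile(val);
    IME.write_volatile(prev_ime); // ...so the exchange observes no tearing
    old
}
```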
thomcc commented on Sep 3, 2022
This is a lot like the current situation for `compiler_fence` -- on some embedded targets LLVM seems to consider `__sync` to be a good fallback: https://github.com/llvm/llvm-project/blob/9905dae5e18cd55ee6bb8678c95fed940ded1ef9/llvm/lib/CodeGen/SelectionDAG/LegalizeDAG.cpp#L3937-L3939
nikic commented on Sep 3, 2022
Ah, I see the problem. These are seq_cst atomic loads/stores, but apparently thumbv4t has no `dmb` instruction, so it's not possible to lower atomic load/store with stronger than monotonic ordering natively. Thus we get libcalls even with the `+atomics-32` feature.

This does look correct to me, and it also looks like this is not new LLVM 15 behavior either. So to clarify, did this code (using these atomic orderings) work previously?
I think this means that we do need to declare atomics as completely unsupported for thumbv4t targets.
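A minimal sketch of the ordering distinction from the diagnosis above (assumed behavior under the forced-atomics support, not verified against any particular LLVM version):

```rust
// Sketch of the diagnosis: monotonic (Relaxed) load/store can lower to a plain
// instruction, while SeqCst would need a barrier that ARMv4T lacks, so it
// falls back to a libcall.
use core::sync::atomic::{AtomicU16, Ordering};

static FLAG: AtomicU16 = AtomicU16::new(0);

pub fn store_relaxed(x: u16) {
    FLAG.store(x, Ordering::Relaxed); // expected: a plain strh
}

pub fn store_seq_cst(x: u16) {
    FLAG.store(x, Ordering::SeqCst); // expected: a call to a __sync_* helper
}
```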
Lokathor commented on Sep 3, 2022
Ah, I should be more clear: This code is all new code, and I have never before even attempted to use `Atomic*` types on this target.

However, I was told by both Ralf and m-ou-se that plain loads and stores are "effectively atomic".
So when the Forced Atomic stuff was put in I went to try it out, and it did not work as expected. What I expected, given that it's an ARMv4T, is that all atomic orderings would lower to plain `ldr` and `str`. There's no need for `dmb` because there's only one CPU and it's not capable of executing operations out of order. The only thing that needs to respect the atomic orderings is LLVM itself when it's optimizing; other than that, the CPU will only perform operations in exactly the order they appear in the assembly.

What I am really after is a way for the main program (just one thread) to communicate with the interrupt handler.
So, if you think that everything here is working as far as you expect, then I might have to tell Ralf and m-ou-se that there's no "effective atomics" after all and they need to go back to the drawing board.
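For concreteness, the communication pattern in question is roughly the following; the names are made up, and Relaxed is only a placeholder, since which ordering is actually sufficient is exactly what is in dispute here:

```rust
// Hypothetical single-writer/single-reader sketch: the IRQ handler publishes a
// value and the main loop polls it. On this single-core, in-order CPU a plain
// aligned ldrh/strh is all the hardware needs; only LLVM's reordering matters.
use core::sync::atomic::{AtomicU16, Ordering};

static LAST_KEYS: AtomicU16 = AtomicU16::new(0);

// called from the interrupt handler
pub fn on_vblank_irq(keys: u16) {
    LAST_KEYS.store(keys, Ordering::Relaxed);
}

// called from the main loop
pub fn poll_keys() -> u16 {
    LAST_KEYS.load(Ordering::Relaxed)
}
```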
Lokathor commented on Sep 3, 2022
On advice from thomcc, I tried other orderings:

Lokathor commented on Sep 3, 2022
Using a `compiler_fence` to prevent that should be fine here, except that it hits #62256 and becomes a library call to `__sync_synchronize`, which means that a 2 (load) or 3 (store) cycle operation gets 6 cycles of overhead added just to jump to an empty function and immediately return.

So yeah, atomics are completely broken on this target at the moment.

In the meantime it can be worked around with inline asm, but then I either have to make every access a macro call (to get good register allocation) or wrap it in a method that gets inlined, but then every access will always be fighting for `r0`.
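A sketch of the inline-asm workaround being alluded to, with made-up wrapper names (the real macros or methods in the crate will differ); routing the access through `asm!` keeps LLVM from emitting a libcall, though the macro-vs-method trade-off mentioned above still applies to however these are exposed:

```rust
// Hypothetical workaround sketch: hand-written halfword load/store via asm!.
// LLVM treats these as opaque memory operations, so it will not merge them,
// reorder them past writes, or replace them with libcalls.
use core::arch::asm;

#[inline(always)]
pub unsafe fn asm_read_u16(addr: *const u16) -> u16 {
    let val: u16;
    asm!("ldrh {v}, [{a}]", a = in(reg) addr, v = out(reg) val,
         options(nostack, preserves_flags));
    val
}

#[inline(always)]
pub unsafe fn asm_write_u16(addr: *mut u16, val: u16) {
    asm!("strh {v}, [{a}]", a = in(reg) addr, v = in(reg) val,
         options(nostack, preserves_flags));
}
```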
taiki-e commented on Sep 4, 2022
FWIW, (when you use core::sync::atomic's atomics) whether this works or not actually seems to depend on compiler optimizations. (In cases where `Atomic*::{load,store}` cannot be inlined enough, libcalls will still be generated: taiki-e/portable-atomic#36 (comment).)

Lokathor commented on Sep 4, 2022
Ah yes, I hadn't considered that. For this platform, I always use opt-level = 3 for all builds.