Ensure proper memory barriers around ossl_rcu_deref/ossl_rcu_assign_ptr

Since the addition of macos14 M1 runners to our CI jobs, we've been
seeing periodic random failures in the test_threads CI job.
Specifically, we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t) went
backwards.
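
For context, the pattern the test exercises looks roughly like the
following (a minimal sketch against OpenSSL's internal rcu API from
include/internal/rcu.h; the reader loop and variable names here are
illustrative, not the actual test_threads source):

#include <assert.h>
#include <stdint.h>
#include "internal/rcu.h"     /* OpenSSL-internal ossl_rcu_* API */

/* the writer thread increments a counter and republishes it through
 * this pointer via ossl_rcu_assign_ptr() */
static uint64_t *shared;

static void reader(CRYPTO_RCU_LOCK *lock)
{
    uint64_t prev = 0, cur;

    for (;;) {
        ossl_rcu_read_lock(lock);
        cur = *(uint64_t *)ossl_rcu_deref(&shared);
        ossl_rcu_read_unlock(lock);
        assert(cur >= prev);  /* the check that fired on M1 runners */
        prev = cur;
    }
}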

Taking a look at the disassembled code in the failing case, we see
that __atomic_load_n, as emitted by clang 15, looks like this:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000     ldapr   x0, [x0]
10012048c: d65f03c0     ret

Notably, when compiling with gcc on the same system we get this output
instead:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: c8dffc00     ldar   x0, [x0]
10012048c: d65f03c0     ret
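
Both listings are lowerings of the same acquire load; the C involved
is essentially the following (a sketch of the rcu deref path this
commit patches, assuming the pre-patch macro definitions):

/* acquire-load an RCU-protected pointer; on aarch64, clang 15 lowers
 * __ATOMIC_ACQUIRE to ldapr while gcc emits ldar */
void *ossl_rcu_uptr_deref(void **p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}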

Checking the Arm docs for the difference between ldar and ldapr:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDAPR--Load-Acquire-RCpc-Register-
https://developer.arm.com/documentation/dui0802/b/A64-Data-Transfer-Instructions/LDAR

It seems that the ldar instruction provides a global CPU fence, not
completing until all writes in a given CPU's writeback queue have
completed.

Conversely, the ldapr instruction attempts to achieve performance
improvements by honoring the Local Ordering register available in the
system coprocessor, only flushing writes in the same address region
as other CPUs on the system.
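
The difference matters for the message-passing pattern rcu relies on,
in which a writer publishes a pointer with a release store and
readers pick it up with an acquire load (a generic sketch, not
OpenSSL code):

#include <stddef.h>
#include <stdint.h>

static uint64_t payload;
static uint64_t *published;

void writer(void)
{
    payload = 42;  /* write the data first */
    /* then publish with release semantics (stlr on aarch64) */
    __atomic_store_n(&published, &payload, __ATOMIC_RELEASE);
}

uint64_t reader(void)
{
    /* acquire load (ldar or ldapr); a reader that observes the new
     * pointer must also observe the payload written before it */
    uint64_t *p = __atomic_load_n(&published, __ATOMIC_ACQUIRE);

    return p == NULL ? 0 : *p;
}

The failures described above look like exactly this guarantee being
violated: a reader observed a stale value through a freshly acquired
pointer.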

I believe that on virtualized M1 CPUs the ldapr instruction is not
properly ordering writes, leading to an out-of-order read despite the
needed fencing.
I've opened an issue with apple on this here:
https://developer.apple.com/forums/thread/749530

I believe that it is not safe to issue an ldapr instruction unless
the programmer knows that the Local Ordering registers are properly
configured for use on the system.

So to fix it, I'm proposing with this patch that, in the event that:
1) __APPLE__ is defined
AND
2) __clang__ is defined
AND
3) __aarch64__ is defined

during the build, we override the ATOMIC_LOAD_N macro in the rcu
code with a custom function that uses inline assembly to emit the
ldar instruction rather than the ldapr instruction (see the diff
below). The above conditions should restrict this to recent Mac
CPUs, and only to the case where the affected clang compiler emits
the offending instruction.

I've run this patch 10 times in our CI and failed to reproduce the
issue, whereas previously I could routinely trigger it within 5 runs.

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/23974)
Neil Horman 2024-03-26 09:59:14 -04:00 (committed by Tomas Mraz)
parent 65fe3e846f
commit f5b5a35c84

@@ -46,7 +46,32 @@
 # endif
 # if defined(__GNUC__) && defined(__ATOMIC_ACQUIRE) && !defined(BROKEN_CLANG_ATOMICS)
-#  define ATOMIC_LOAD_N(p,o) __atomic_load_n(p, o)
+#  if defined(__APPLE__) && defined(__clang__) && defined(__aarch64__)
+/*
+ * Apple M1 virtualized cpu seems to have some problem using the ldapr instruction
+ * (see https://github.com/openssl/openssl/pull/23974)
+ * When using the native apple clang compiler, this instruction is emitted for
+ * atomic loads, which is bad. So, if
+ * 1) We are building on a target that defines __APPLE__ AND
+ * 2) We are building on a target using clang (__clang__) AND
+ * 3) We are building for an M1 processor (__aarch64__)
+ * Then we should not use __atomic_load_n and instead implement our own
+ * function to issue the ldar instruction instead, which produces the proper
+ * sequencing guarantees
+ */
+static inline void *apple_atomic_load_n(void **p)
+{
+    void *ret;
+
+    /* the "memory" clobber keeps the compiler from reordering other
+     * memory accesses across this acquire load */
+    __asm volatile("ldar %0, [%1]" : "=r" (ret) : "r" (p) : "memory");
+
+    return ret;
+}
+#  define ATOMIC_LOAD_N(p, o) apple_atomic_load_n((void **)p)
+#  else
+#  define ATOMIC_LOAD_N(p,o) __atomic_load_n(p, o)
+#  endif
 #  define ATOMIC_STORE_N(p, v, o) __atomic_store_n(p, v, o)
 #  define ATOMIC_STORE(p, v, o) __atomic_store(p, v, o)
 #  define ATOMIC_EXCHANGE_N(p, v, o) __atomic_exchange_n(p, v, o)
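
For reference, this is roughly how the rcu code consumes these
macros (a sketch based on the function names in the commit title;
the bodies are illustrative, not part of this diff):

/* readers: acquire load; with this patch, on __APPLE__ + __clang__ +
 * __aarch64__ builds ATOMIC_LOAD_N expands to apple_atomic_load_n()
 * and emits ldar instead of ldapr */
void *ossl_rcu_uptr_deref(void **p)
{
    return ATOMIC_LOAD_N(p, __ATOMIC_ACQUIRE);
}

/* writers: the paired release store (stlr on aarch64), unchanged by
 * this patch */
void ossl_rcu_assign_uptr(void **p, void **v)
{
    ATOMIC_STORE(p, v, __ATOMIC_RELEASE);
}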