Ensure proper memory barriers around ossl_rcu_deref/ossl_rcu_assign_ptr

Since the addition of macos14 M1 runners to our CI jobs, we've been
seeing periodic random failures in the test_threads CI job.
Specifically, we've seen instances in which the shared pointer in the
test (which points to a monotonically incrementing uint64_t) went
backwards.
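
For context, the pattern the test exercises looks roughly like the
following (a minimal sketch against OpenSSL's internal rcu API from
include/internal/rcu.h; the reader loop and variable names here are
illustrative, not the actual test_threads source):

#include <assert.h>
#include <stdint.h>
#include "internal/rcu.h"     /* OpenSSL-internal ossl_rcu_* API */

/* the writer thread increments a counter and republishes it through
 * this pointer via ossl_rcu_assign_ptr() */
static uint64_t *shared;

static void reader(CRYPTO_RCU_LOCK *lock)
{
    uint64_t prev = 0, cur;

    for (;;) {
        ossl_rcu_read_lock(lock);
        cur = *(uint64_t *)ossl_rcu_deref(&shared);
        ossl_rcu_read_unlock(lock);
        assert(cur >= prev);  /* the check that fired on M1 runners */
        prev = cur;
    }
}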

Taking a look at the disassembled code in the failing case, we see
that __atomic_load_n, as emitted by clang 15, looks like this:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: f8bfc000     ldapr   x0, [x0]
10012048c: d65f03c0     ret

Notably, when compiling with gcc on the same system we get this output
instead:
0000000100120488 <_ossl_rcu_uptr_deref>:
100120488: c8dffc00     ldar   x0, [x0]
10012048c: d65f03c0     ret
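
Both listings are lowerings of the same acquire load; the C involved
is essentially the following (a sketch of the rcu deref path this
commit patches, assuming the pre-patch macro definitions):

/* acquire-load an RCU-protected pointer; on aarch64, clang 15 lowers
 * __ATOMIC_ACQUIRE to ldapr while gcc emits ldar */
void *ossl_rcu_uptr_deref(void **p)
{
    return __atomic_load_n(p, __ATOMIC_ACQUIRE);
}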

Checking the Arm docs for the difference between ldar and ldapr:
https://developer.arm.com/documentation/ddi0602/2023-09/Base-Instructions/LDAPR--Load-Acquire-RCpc-Register-
https://developer.arm.com/documentation/dui0802/b/A64-Data-Transfer-Instructions/LDAR

It seems that the ldar instruction provides a global CPU fence, not
completing until all writes in a given CPU's writeback queue have
completed.

Conversely, the ldapr instruction attempts to achieve performance
improvements by honoring the Local Ordering register available in the
system coprocessor, only flushing writes in the same address region
as other CPUs on the system.
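
The difference matters for the message-passing pattern rcu relies on,
in which a writer publishes a pointer with a release store and
readers pick it up with an acquire load (a generic sketch, not
OpenSSL code):

#include <stddef.h>
#include <stdint.h>

static uint64_t payload;
static uint64_t *published;

void writer(void)
{
    payload = 42;  /* write the data first */
    /* then publish with release semantics (stlr on aarch64) */
    __atomic_store_n(&published, &payload, __ATOMIC_RELEASE);
}

uint64_t reader(void)
{
    /* acquire load (ldar or ldapr); a reader that observes the new
     * pointer must also observe the payload written before it */
    uint64_t *p = __atomic_load_n(&published, __ATOMIC_ACQUIRE);

    return p == NULL ? 0 : *p;
}

The failures described above look like exactly this guarantee being
violated: a reader observed a stale value through a freshly acquired
pointer.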

I believe that on virtualized M1 CPUs the ldapr instruction is not
properly ordering writes, leading to an out-of-order read despite the
needed fencing.
I've opened an issue with apple on this here:
https://developer.apple.com/forums/thread/749530

I believe that it is not safe to issue an ldapr instruction unless
the programmer knows that the Local Ordering registers are properly
configured for use on the system.

So to fix it, I'm proposing with this patch that, in the event that:
1) __APPLE__ is defined
AND
2) __clang__ is defined
AND
3) __aarch64__ is defined

during the build, we override the ATOMIC_LOAD_N macro in the rcu
code with a custom function that uses inline assembly to emit the
ldar instruction rather than the ldapr instruction (see the diff
below). The above conditions should restrict this to recent Mac
CPUs, and only to the case where the affected clang compiler emits
the offending instruction.

I've run this patch 10 times in our CI and failed to reproduce the
issue, whereas previously I could routinely trigger it within 5 runs.

Reviewed-by: Tom Cosgrove <tom.cosgrove@arm.com>
Reviewed-by: Paul Dale <ppzgs1@gmail.com>
Reviewed-by: Tomas Mraz <tomas@openssl.org>
(Merged from https://github.com/openssl/openssl/pull/23974)
Neil Horman 2024-03-26 09:59:14 -04:00 (committed by Tomas Mraz)
parent 65fe3e846f
commit f5b5a35c84

@@ -46,7 +46,32 @@
 # endif
 # if defined(__GNUC__) && defined(__ATOMIC_ACQUIRE) && !defined(BROKEN_CLANG_ATOMICS)
-#  define ATOMIC_LOAD_N(p,o) __atomic_load_n(p, o)
+#  if defined(__APPLE__) && defined(__clang__) && defined(__aarch64__)
+/*
+ * Apple M1 virtualized cpu seems to have some problem using the ldapr instruction
+ * (see https://github.com/openssl/openssl/pull/23974)
+ * When using the native apple clang compiler, this instruction is emitted for
+ * atomic loads, which is bad. So, if
+ * 1) We are building on a target that defines __APPLE__ AND
+ * 2) We are building on a target using clang (__clang__) AND
+ * 3) We are building for an M1 processor (__aarch64__)
+ * Then we should not use __atomic_load_n and instead implement our own
+ * function to issue the ldar instruction instead, which produces the proper
+ * sequencing guarantees
+ */
+static inline void *apple_atomic_load_n(void **p)
+{
+    void *ret;
+
+    /* the "memory" clobber keeps the compiler from reordering other
+     * memory accesses across this acquire load */
+    __asm volatile("ldar %0, [%1]" : "=r" (ret) : "r" (p) : "memory");
+
+    return ret;
+}
+#  define ATOMIC_LOAD_N(p, o) apple_atomic_load_n((void **)p)
+#  else
+#  define ATOMIC_LOAD_N(p,o) __atomic_load_n(p, o)
+#  endif
 #  define ATOMIC_STORE_N(p, v, o) __atomic_store_n(p, v, o)
 #  define ATOMIC_STORE(p, v, o) __atomic_store(p, v, o)
 #  define ATOMIC_EXCHANGE_N(p, v, o) __atomic_exchange_n(p, v, o)
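
For reference, this is roughly how the rcu code consumes these
macros (a sketch based on the function names in the commit title;
the bodies are illustrative, not part of this diff):

/* readers: acquire load; with this patch, on __APPLE__ + __clang__ +
 * __aarch64__ builds ATOMIC_LOAD_N expands to apple_atomic_load_n()
 * and emits ldar instead of ldapr */
void *ossl_rcu_uptr_deref(void **p)
{
    return ATOMIC_LOAD_N(p, __ATOMIC_ACQUIRE);
}

/* writers: the paired release store (stlr on aarch64), unchanged by
 * this patch */
void ossl_rcu_assign_uptr(void **p, void **v)
{
    ATOMIC_STORE(p, v, __ATOMIC_RELEASE);
}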