We had already briefly talked about using atomics to implement a mutex. However, using atomics in the general case can be tricky and surfaces some low-level synchronisation issues that the programmer needs to be able to control. This is what memory models are for.
Atomics were introduced in the C++11 and C11 standards, including functions and types to use them from either language. As mentioned, we briefly looked at them and at the generated assembly used to implement them on different platforms. Here we will look again at the generated code but, in order to explore the available memory models, we will only check the Intel platform.
Quick Refresher
Before going into details, let's quickly refresh how to use atomic types and what their impact on the generated code is. Let's take this simple example:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>
atomic_int cnt;
int cnt1;
atomic_int flag[2] = {1,1};
void *task (void *p) {
    int *flag = (int*)p;
    while (*flag) {
        cnt++;
        cnt1++;
    }
    return NULL;
}

int main () {
    pthread_t tid[2];

    if (pthread_create (&tid[0], NULL, task, &flag[0]) < 0) exit (EXIT_FAILURE);
    if (pthread_create (&tid[1], NULL, task, &flag[1]) < 0) exit (EXIT_FAILURE);
    sleep (5);
    flag[0] = 0;
    flag[1] = 0;
    pthread_join (tid[0], NULL);
    pthread_join (tid[1], NULL);
    printf ("%d atomic %d non-atomic\n", cnt, cnt1);
    return 0;
}
The program just creates two threads that increment two shared counters, one atomic and the other not. If you compile and run the program, you will see that the two printed values don't match, as expected. We already knew that. Let's take a quick look at the generated code for the thread function.
$ objdump -d atomic01 | grep -A31 "<task>:"
000000000000080a <task>:
 80a:   55                      push   %rbp
 80b:   48 89 e5                mov    %rsp,%rbp
 80e:   48 83 ec 30             sub    $0x30,%rsp
 812:   48 89 7d d8             mov    %rdi,-0x28(%rbp)
 816:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 81d:   00 00
 81f:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 823:   31 c0                   xor    %eax,%eax
 825:   48 8b 45 d8             mov    -0x28(%rbp),%rax
 829:   48 89 45 f0             mov    %rax,-0x10(%rbp)
 82d:   eb 24                   jmp    853 <task+0x49>
 82f:   c7 45 e8 01 00 00 00    movl   $0x1,-0x18(%rbp)
 836:   8b 45 e8                mov    -0x18(%rbp),%eax
 839:   f0 0f c1 05 df 07 20    lock xadd %eax,0x2007df(%rip)  # 201020 <cnt>  <=======
 840:   00
 841:   89 45 ec                mov    %eax,-0x14(%rbp)
 844:   8b 05 d2 07 20 00       mov    0x2007d2(%rip),%eax     # 20101c <cnt1>
 84a:   83 c0 01                add    $0x1,%eax
 84d:   89 05 c9 07 20 00       mov    %eax,0x2007c9(%rip)     # 20101c <cnt1>
 853:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 857:   8b 00                   mov    (%rax),%eax
 859:   85 c0                   test   %eax,%eax
 85b:   75 d2                   jne    82f <task+0x25>
 85d:   90                      nop
 85e:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
 862:   64 48 33 14 25 28 00    xor    %fs:0x28,%rdx
 869:   00 00
 86b:   74 05                   je     872 <task+0x68>
 86d:   e8 2e fe ff ff          callq  6a0 <__stack_chk_fail@plt>
 872:   c9                      leaveq
 873:   c3                      retq
In the code above we can clearly see the lock prefix used to atomically increment (xadd) the atomic shared counter, but not for the increment of the non-atomic one.
The previous code is equivalent to this:
void *task (void *p) {
    int *flag = (int*)p;
    while (*flag) {
        atomic_fetch_add_explicit (&cnt, 1, __ATOMIC_SEQ_CST);
        cnt1++;
    }
    return NULL;
}
Despite the fact that the compiler generates simpler code to increment the atomic using this function (or alternatively __atomic_fetch_add or __atomic_add_fetch), we get a lock-prefixed add instruction generated, the same as before.
The last version of the function shows a third parameter in the atomic modification function. __ATOMIC_SEQ_CST stands for Sequentially Consistent. This is the default memory model used by the compiler when no memory model is specified, as happened in our initial example, where we just assigned a value to an atomic variable.
The memory model tells the compiler which code to generate in order to establish certain constraints on the way atomic values are read from and written to memory. In the rest of the post I will use the examples from the GCC wiki to illustrate what happens with each memory model. That page uses C++ syntax; you can check this GCC page for the C equivalents that I'm going to use.
The problem
Before diving into the code, we should explain the issue that these memory models are intended to address.
Imagine that we have two threads running the following code (taken from the page mentioned above):
THREAD1 THREAD2
y = 1 if (x.load() == 2)
x.store (2) assert (y == 1)
Looking at this code, many programmers will expect that the assert in thread2 should always hold. However, the compiler may optimise the code, since the code of the two threads is, in principle, unrelated and can be optimised independently. The compiler may decide that it is better to first store x and then set y. If thread2 starts executing in between, the assert can fail.
Note that, from the point of view of the compiler, changing the order of the assignments in thread1 doesn't make any difference, as nothing in that thread depends on it. So, specifying a memory model when accessing atomic variables allows us to control what we let the compiler (and also the processor) optimise and what not, at an inter-thread level.
In addition to compiler optimisations, many processors also reorder instructions to improve performance. Specifically, memory accesses can be reordered, since accessing main memory is more expensive than accessing the processor's internal registers and caches.
Summing up, atomics are somewhat like a volatile modifier on steroids.
So, the key concept here is the so-called happens-before relation. In the code above, it is clear to us that the assignment to y happens before the store to x. The memory model allows us to specify whether this is really what we want, and therefore force the compiler (and also the processor) to execute the code as it is written, at the expense of a performance penalty.
NOTE: Happens-before in this case means that whenever we store 2 in x, the assignment of 1 to y has already happened. In this situation, when the if in thread2 is true, the assert will always pass, because the y assignment happened before.
Sequentially Consistent and Relaxed
The two basic memory models are Sequentially Consistent, which is, let's say, the most constrained model, in the sense that all the code is forced to execute as written, and Relaxed, which basically lets the compiler and the processor do whatever they want. The Sequentially Consistent model roughly disables all compiler and processor reordering for that piece of code, while Relaxed keeps all of it in place.
Let's take our previous theoretical example and implement it, so we can check what happens with each of these models:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdatomic.h>
#include <assert.h>
#include <pthread.h>

atomic_int x = 0;
int y = 0;

void *task1 (void *p) {
    y = 1;
    __atomic_store_n (&x, 2, MEMORY_MODEL);
    return NULL;
}

void *task2 (void *p) {
    if (__atomic_load_n (&x, MEMORY_MODEL) == 2)
        assert (y == 1);
    return NULL;
}

int main () {
    pthread_t tid[2];

    printf ("x = %d | y = %d\n", x, y);
    if (pthread_create (&tid[0], NULL, task1, NULL) < 0) exit (EXIT_FAILURE);
    if (pthread_create (&tid[1], NULL, task2, NULL) < 0) exit (EXIT_FAILURE);
    pthread_join (tid[0], NULL);
    pthread_join (tid[1], NULL);
    printf ("x = %d | y = %d\n", x, y);
    return 0;
}
Let's see what we get for the SEQ_CST memory model:
$ gcc -DMEMORY_MODEL=__ATOMIC_SEQ_CST -o atomic03-param atomic03-param.c -lpthread
$ objdump -d atomic03-param | grep -A31 "<task1>:"
000000000000081a <task1>:
 81a:   55                      push   %rbp
 81b:   48 89 e5                mov    %rsp,%rbp
 81e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 822:   c7 05 ec 07 20 00 01    movl   $0x1,0x2007ec(%rip)   # 201018 <y>
 829:   00 00 00
 82c:   b8 02 00 00 00          mov    $0x2,%eax
 831:   89 05 dd 07 20 00       mov    %eax,0x2007dd(%rip)   # 201014 <x>
 837:   0f ae f0                mfence
 83a:   90                      nop
 83b:   5d                      pop    %rbp
 83c:   c3                      retq

000000000000083d <task2>:
 83d:   55                      push   %rbp
 83e:   48 89 e5                mov    %rsp,%rbp
 841:   48 83 ec 10             sub    $0x10,%rsp
 845:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 849:   8b 05 c5 07 20 00       mov    0x2007c5(%rip),%eax   # 201014 <x>
 84f:   83 f8 02                cmp    $0x2,%eax
 852:   75 2a                   jne    87e <task2+0x41>
 854:   8b 05 be 07 20 00       mov    0x2007be(%rip),%eax   # 201018 <y>
 85a:   83 f8 01                cmp    $0x1,%eax
 85d:   74 1f                   je     87e <task2+0x41>
 85f:   48 8d 0d c8 01 00 00    lea    0x1c8(%rip),%rcx      # a2e <__PRETTY_FUNCTION__.3952>
 866:   ba 15 00 00 00          mov    $0x15,%edx
 86b:   48 8d 35 92 01 00 00    lea    0x192(%rip),%rsi      # a04 <_IO_stdin_used+0x4>
 872:   48 8d 3d 9c 01 00 00    lea    0x19c(%rip),%rdi      # a15 <_IO_stdin_used+0x15>
 879:   e8 52 fe ff ff          callq  6d0 <__assert_fail@plt>
 87e:   90                      nop
 87f:   c9                      leaveq
 880:   c3                      retq
We can see that task2 is left untouched: no special synchronisation code has been issued. In the task1 function we do not see a lock prefix, but we can spot an mfence instruction. Now it is time to state a couple of things to understand why task1 was compiled like that:
- The first thing is that load and store instructions on aligned memory are already atomic on Intel processors. In the previous example we had an add instruction, which doesn't fit in this category and therefore needs to be prefixed with lock.
- We see the mfence instruction because, as mentioned earlier, Sequentially Consistent is the default memory model used for atomic data.
So the remaining question is what that mfence instruction does. From the Intel software development manual:
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior to the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
What this means in our example is that the stores to y and x are guaranteed to be performed in that order, and that the updates will be globally visible, that is, visible to all other threads. This is the way to ensure that the assert in task2 always passes (whenever the value of x is 2).
Let's see what happens when we update x using the RELAXED memory model. The generated code looks like this:
$ gcc -DMEMORY_MODEL=__ATOMIC_RELAXED -o atomic03-param atomic03-param.c -lpthread
$ objdump -d atomic03-param | grep -A10 "<task1>:"
000000000000081a <task1>:
 81a:   55                      push   %rbp
 81b:   48 89 e5                mov    %rsp,%rbp
 81e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 822:   c7 05 ec 07 20 00 01    movl   $0x1,0x2007ec(%rip)   # 201018 <y>
 829:   00 00 00
 82c:   b8 02 00 00 00          mov    $0x2,%eax
 831:   89 05 dd 07 20 00       mov    %eax,0x2007dd(%rip)   # 201014 <x>
 837:   90                      nop
 838:   5d                      pop    %rbp
 839:   c3                      retq
I have removed task2 as it is not relevant for our discussion.
As we can see, when using the __ATOMIC_RELAXED memory model, no special consideration is taken by the compiler with regard to the atomic values. The code will indeed be faster (no mfence forcing the ordering of operations), but then there is no guarantee on the sequencing of store/load operations. It is up to the programmer to ensure such synchronisation... if needed.
So now, when task2 is executed, there is nothing in the system imposing that the store to y happens before the store to x, and therefore the assert may fail.
Note that, in both cases, when task1 reaches the retq instruction, y will be set to 1 and x to 2, independently of which store was done first. But if task2 gets executed while task1 is midway through the function, and the processor performed the assignments in a different order, then the result of task2 will be different. The mfence instruction ensures that, after each assignment, the values are globally visible and that the assignments are done in the intended order.
Acquire/Release/Consume
In addition to the strict sequencing of SEQ_CST and the complete lack of sequencing of RELAXED, there are three additional models that can be used. According to the GCC documentation, this is what they do:
__ATOMIC_CONSUME
This is currently implemented using the stronger __ATOMIC_ACQUIRE memory order
because of a deficiency in C++11’s semantics for memory_order_consume.
__ATOMIC_ACQUIRE
Creates an inter-thread happens-before constraint from the release (or stronger)
semantic store to this acquire load. Can prevent hoisting of code to before
the operation.
__ATOMIC_RELEASE
Creates an inter-thread happens-before constraint to acquire (or stronger)
semantic loads that read from this release store. Can prevent sinking of
code to after the operation.
Well, the first thing we see is that __ATOMIC_CONSUME is currently just the same as __ATOMIC_ACQUIRE. __ATOMIC_RELEASE ensures that no operation before the indicated store can be moved after it, while __ATOMIC_ACQUIRE ensures that no operation after the indicated load can be moved before it. These two work together like a lock for a given atomic, but the synchronisation happens at the processor level.
For testing this, I used the following program:
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdatomic.h>
#include <assert.h>
#include <pthread.h>

atomic_int x = 0;
atomic_int y = 0;

void *task1 (void *p) {
    __atomic_store_n (&y, 20, __ATOMIC_RELEASE);
    return NULL;
}

void *task2 (void *p) {
    __atomic_store_n (&x, 10, __ATOMIC_RELEASE);
    return NULL;
}

void *task3 (void *p) {
    assert (__atomic_load_n (&y, __ATOMIC_ACQUIRE) == 20 &&
            __atomic_load_n (&x, __ATOMIC_ACQUIRE) == 0);
    return NULL;
}

void *task4 (void *p) {
    assert (__atomic_load_n (&y, __ATOMIC_ACQUIRE) == 0 &&
            __atomic_load_n (&x, __ATOMIC_ACQUIRE) == 10);
    return NULL;
}

int main () {
    pthread_t tid[4];

    printf ("x = %d | y = %d\n", x, y);
    if (pthread_create (&tid[0], NULL, task1, NULL) < 0) exit (EXIT_FAILURE);
    if (pthread_create (&tid[1], NULL, task2, NULL) < 0) exit (EXIT_FAILURE);
    if (pthread_create (&tid[2], NULL, task3, NULL) < 0) exit (EXIT_FAILURE);
    if (pthread_create (&tid[3], NULL, task4, NULL) < 0) exit (EXIT_FAILURE);
    pthread_join (tid[0], NULL);
    pthread_join (tid[1], NULL);
    pthread_join (tid[2], NULL);
    pthread_join (tid[3], NULL);
    printf ("x = %d | y = %d\n", x, y);
    return 0;
}
Anyway, to complete the explanation of these memory models: in the example above, either of the asserts (the one in task3 or the one in task4) can pass. There is no order imposed on the variables; either of them can be updated first, and it is not forced that one has to be set before the other. Note that if the SEQ_CST memory model were used, that wouldn't happen. In that case, one of the assignments would happen before the other, but the order would be determined at run-time. In other words, with the SEQ_CST model, one of the asserts will pass and the other won't, while with the ACQUIRE/RELEASE model, both may pass.
In theory this model allows the compiler to optimise the code further, while the loads and stores of certain variables are still guaranteed to become globally visible.
If you compile it, you will just get the same code as for non-atomic values. ACQUIRE/RELEASE is intended to allow the compiler to apply more optimisations, but I'm afraid that for such a simple program there are not many optimisations to perform, so we cannot really see any difference.
Actually, the cppreference page describing these memory models explains that on strongly-ordered systems (which includes x86) no special instruction is issued, and it is just the compiler that is instructed to avoid certain optimisations (like moving loads and stores of the affected atomics around). ARM and PowerPC, however, are weakly-ordered systems.
So, What happens with x86?
At this point things have become pretty confusing. Intel processors can run instructions out of order, but they are strongly-ordered systems for which memory accesses are guaranteed to be performed in the indicated order... So why do we need the mfence instructions on Intel?
OUT-OF-ORDER Processors
It may be difficult to understand what follows, as all the code we see looks in the right order. The point is that the processor is allowed to reorder instructions: regardless of how the program looks in memory, the instructions may be executed in a different order, producing the exact same result... likely improving performance. This works fine in most cases, but when there are inter-thread interactions, such reordering may make the program behave differently, as we discussed in our previous example. For more details take a look at the Out-of-Order Execution page on Wikipedia.
Apparently, the x86 architecture is indeed strongly-ordered, and in most cases, for simple aligned stores and loads, everything just works. By design, all stores have RELEASE semantics and all loads have ACQUIRE semantics, and that is why no special code is generated.
With all this said, one might start to believe that the mfence instruction is not really needed and is just included in the Sequentially Consistent model just in case. However, after digging further into the issue I found the following.
In the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, we find some details.
Section 11.10 introduces the so-called store buffer. This buffer is used to temporarily hold writes to memory (store operations), delaying the actual writes and thereby improving processor performance: the processor can continue executing instructions without waiting for memory writes, and it can also be more efficient when it actually writes to memory.
This section also states that all memory writes are performed in order (even when they may be delayed) and indicates in which circumstances this store buffer is drained to memory. We can see the mention of the mfence instruction in that list. This section refers us to Section 8.2, where write ordering is discussed along with further details of how the store buffer operates.
Section 8.2, Memory Ordering, starts by mentioning that the i386 processor is a strongly-ordered system where all accesses happen as they are supposed to, but that the Pentium 4, Xeon and P6 families of processors allow deviations from this behaviour to improve performance. In particular, the section mentions reads (loads) going ahead of buffered writes.
I won't go into the details of this section. It is pretty long and dense, but the curious reader will find two subsections describing the models for the Pentium and i486, and another for the P6 and more recent processors (which is way more complex). After that, several examples illustrating the memory-ordering principles described in the first two subsections are discussed. You can refer to the post Who ordered memory fences on an x86? for a more comprehensive reading on some of the cases described in the Intel manual.
Anything else I should know about x86 processors?
Well, just a couple of things. You can find the details in the same Section 8.2, but I'll include a very brief summary here... this time for my own convenience.
The first thing is that stores from fast-string operations may appear to be executed out of order. This is basically the use of rep stos instructions in your code. In this case, the stores of the data may not be in order, and additional synchronisation means have to be provided.
The second is that Section 8.2.5 tells us how to strengthen or weaken the memory model. It mentions four mechanisms:
- I/O instructions, locking instructions (like xchg), lock-prefixed instructions and serialising instructions (like cpuid or iret) force stronger ordering.
- The LFENCE, SFENCE and MFENCE instructions force memory ordering and serialisation, allowing some control over which instructions are affected (L stands for Load and S for Store).
- The memory type range registers (MTRRs), available on Pentium 4, Xeon and P6 processors, allow modifying the memory ordering for areas of physical memory.
- Finally, the PAT (Page Attribute Table) can do the same for a page or group of pages. This mechanism is available on Pentium 4, Xeon and Pentium III.
The last two are somewhat intended to deal with memory-mapped devices where, instead of using I/O instructions (case 1, in and out), device control is done by writing to specific memory areas. In those cases, we want the writes to be performed in order... Usually you have to do two writes in a row: one to a control register to choose the operation, and another to a data register to write the value. For memory-mapped hardware, those registers will be mapped at different memory locations (usually consecutive, though).
So, the relevant part of all this is summarised in the last paragraph of the section, which basically says that, in order to write portable code (also for future processors), and since newer processors (Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors) no longer implement the strong memory-ordering model (except when using MTRRs with the so-called strong uncached (UC) memory type), programmers should assume the processor-ordering model or an even weaker memory-ordering model.
Summing up, x86 is overall a strongly-ordered system with certain exceptions. ACQUIRE/RELEASE semantics seem to be ensured by the processor, but in order to fully ensure sequencing, special instructions are provided, and these are used for the Sequentially Consistent C/C++ memory model. Even when in some cases they may not be necessary, the system manuals tell us to use them in order to ensure compatibility with future processors.
So, What happens with ARM or PowerPC?
Let's take a quick look, just to satisfy our curiosity, at what happens with a weakly-ordered system.
$ arm-linux-gnueabi-gcc -o atomic05.arm atomic05.c -lpthread
$ arm-linux-gnueabi-objdump -d atomic05.arm | grep -A44 "<task2>:"
00010630 <task2>:
   10630:   e92d4810    push   {r4, fp, lr}
   10634:   e28db008    add    fp, sp, #8
   10638:   e24dd00c    sub    sp, sp, #12
   1063c:   e50b0010    str    r0, [fp, #-16]
   10640:   e59f4018    ldr    r4, [pc, #24]   ; 10660 <task2+0x30>
   10644:   eb00042b    bl     116f8 <__sync_synchronize>   ; <========
   10648:   e3a0300a    mov    r3, #10
   1064c:   e5843000    str    r3, [r4]
   10650:   e1a00000    nop    ; (mov r0, r0)
   10654:   e1a00003    mov    r0, r3
   10658:   e24bd008    sub    sp, fp, #8
   1065c:   e8bd8810    pop    {r4, fp, pc}
   10660:   00022044    .word  0x00022044

00010664 <task3>:
   10664:   e92d4810    push   {r4, fp, lr}
   10668:   e28db008    add    fp, sp, #8
   1066c:   e24dd00c    sub    sp, sp, #12
   10670:   e50b0010    str    r0, [fp, #-16]
   10674:   e59f3044    ldr    r3, [pc, #68]   ; 106c0 <task3+0x5c>
   10678:   e5934000    ldr    r4, [r3]
   1067c:   eb00041d    bl     116f8 <__sync_synchronize>   ; <=========
   10680:   e3540014    cmp    r4, #20
   10684:   1a000004    bne    1069c <task3+0x38>
   10688:   e59f3034    ldr    r3, [pc, #52]   ; 106c4 <task3+0x60>
   1068c:   e5934000    ldr    r4, [r3]
   10690:   eb000418    bl     116f8 <__sync_synchronize>   ; <===========
   10694:   e3540000    cmp    r4, #0
   10698:   0a000004    beq    106b0 <task3+0x4c>
   1069c:   e59f3024    ldr    r3, [pc, #36]   ; 106c8 <task3+0x64>
   106a0:   e3a0201e    mov    r2, #30
   106a4:   e59f1020    ldr    r1, [pc, #32]   ; 106cc <task3+0x68>
   106a8:   e59f0020    ldr    r0, [pc, #32]   ; 106d0 <task3+0x6c>
   106ac:   ebffff93    bl     10500 <__assert_fail@plt>
   106b0:   e1a00000    nop    ; (mov r0, r0)
   106b4:   e1a00003    mov    r0, r3
   106b8:   e24bd008    sub    sp, fp, #8
   106bc:   e8bd8810    pop    {r4, fp, pc}
   106c0:   00022048    .word  0x00022048
   106c4:   00022044    .word  0x00022044
   106c8:   000119ac    .word  0x000119ac
   106cc:   000118ec    .word  0x000118ec
   106d0:   000118f8    .word  0x000118f8
We can see the calls to __sync_synchronize... which ends up calling the kuser_cmpxchg function, as described in part 2 of this series.
For the PowerPC we can also see how special instructions are needed to ensure the RELEASE/ACQUIRE semantics:
$ powerpc-linux-gnu-gcc -o atomic05.ppc atomic05.c -lpthread
$ powerpc-linux-gnu-objdump -d atomic05.ppc | grep -A50 "<task2>:"
10000664 <task2>:
10000664:   94 21 ff e0     stwu   r1,-32(r1)
10000668:   93 e1 00 1c     stw    r31,28(r1)
1000066c:   7c 3f 0b 78     mr     r31,r1
10000670:   90 7f 00 0c     stw    r3,12(r31)
10000674:   3d 20 10 02     lis    r9,4098
10000678:   39 40 00 0a     li     r10,10
1000067c:   7c 20 04 ac     lwsync                    # <========
10000680:   91 49 00 2c     stw    r10,44(r9)
10000684:   60 00 00 00     nop
10000688:   7d 23 4b 78     mr     r3,r9
1000068c:   39 7f 00 20     addi   r11,r31,32
10000690:   83 eb ff fc     lwz    r31,-4(r11)
10000694:   7d 61 5b 78     mr     r1,r11
10000698:   4e 80 00 20     blr

1000069c <task3>:
1000069c:   94 21 ff e0     stwu   r1,-32(r1)
100006a0:   7c 08 02 a6     mflr   r0
100006a4:   90 01 00 24     stw    r0,36(r1)
100006a8:   93 e1 00 1c     stw    r31,28(r1)
100006ac:   7c 3f 0b 78     mr     r31,r1
100006b0:   90 7f 00 0c     stw    r3,12(r31)
100006b4:   3d 20 10 02     lis    r9,4098
100006b8:   81 29 00 30     lwz    r9,48(r9)
100006bc:   7f 89 48 00     cmpw   cr7,r9,r9
100006c0:   40 9e 00 04     bne    cr7,100006c4 <task3+0x28>
100006c4:   4c 00 01 2c     isync                     # <==========
100006c8:   2b 89 00 14     cmplwi cr7,r9,20
100006cc:   40 9e 00 20     bne    cr7,100006ec <task3+0x50>
100006d0:   3d 20 10 02     lis    r9,4098
100006d4:   81 29 00 2c     lwz    r9,44(r9)
100006d8:   7f 89 48 00     cmpw   cr7,r9,r9
100006dc:   40 9e 00 04     bne    cr7,100006e0 <task3+0x44>
100006e0:   4c 00 01 2c     isync                     # <===============
100006e4:   2f 89 00 00     cmpwi  cr7,r9,0
100006e8:   41 9e 00 24     beq    cr7,1000070c <task3+0x70>
100006ec:   3d 20 10 00     lis    r9,4096
100006f0:   38 c9 0d 74     addi   r6,r9,3444
100006f4:   38 a0 00 1e     li     r5,30
100006f8:   3d 20 10 00     lis    r9,4096
100006fc:   38 89 0c b4     addi   r4,r9,3252
10000700:   3d 20 10 00     lis    r9,4096
10000704:   38 69 0c c0     addi   r3,r9,3264
10000708:   48 00 04 f9     bl     10000c00 <__assert_fail@plt>
1000070c:   60 00 00 00     nop
10000710:   7d 23 4b 78     mr     r3,r9
10000714:   39 7f 00 20     addi   r11,r31,32
10000718:   80 0b 00 04     lwz    r0,4(r11)
1000071c:   7c 08 03 a6     mtlr   r0
10000720:   83 eb ff fc     lwz    r31,-4(r11)
If you do a quick search on Google, you will quickly find that lwsync (lightweight sync) is actually used for RELEASE, while isync is used for ACQUIRE. There is also a sync/hwsync (heavyweight sync) instruction that imposes a full memory barrier, and we can see it when using the SEQ_CST memory model.
Conclusions
We have gone quickly through the different memory models supported by gcc atomics, which allow the developer to get better control over what code gets executed and when. We have also seen that it is not that easy to find proper, simple examples. For the simple ones we have seen in this post, either the compiler issues an mfence instruction or it doesn't, and there is not much room for the optimiser to modify a function that just assigns a couple of values.
Memory models and atomics become relevant when implementing higher-level synchronisation mechanisms, as we found out when trying to implement a mutex. In normal software we will more likely use a mutex, semaphore, condition variable or barrier than implement our own mechanism using atomics and compiler intrinsics. Anyway, it is good to understand what happens under the hood and that these problems exist... because many times a slightly different version of this issue shows up at a higher level, and it is easier to identify when it sounds a little bit familiar.
■