Concurrency. Atomics and Memory Models
2021-03-24
By
David "DeMO" Martínez Oliveira

We had already briefly talked about atomics to implement a mutex. However, the use of atomics in the general case can be a bit tricky and raises some low-level synchronisation issues that the programmer needs to be able to control. This is what memory models are for.

Atomics were introduced in the C11 and C++11 standards, which provide functions and types to use them from either language. As mentioned, we briefly looked at them and at the generated assembly used to implement them on different platforms. This time we will look at the generated code again but, in order to explore the available memory models, we will mostly check the Intel platform.

Quick Refresher

Before going into details, let's quickly refresh how to use atomic types and what their impact on the generated code is. Let's take this simple example:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdatomic.h>
#include <pthread.h>

atomic_int cnt;
int        cnt1;
atomic_int flag[2] = {1,1};

void *task (void *p) {
  int *flag = (int*)p;
  while (*flag) {
    cnt++;
    cnt1++;
  }
}

int main () {
  pthread_t tid[2];

  /* pthread_create returns 0 on success and a positive error number on failure */
  if (pthread_create (&tid[0], NULL, task, &flag[0]) != 0) exit (EXIT_FAILURE);
  if (pthread_create (&tid[1], NULL, task, &flag[1]) != 0) exit (EXIT_FAILURE);
  sleep (5);
  flag[0] = 0;
  flag[1] = 0;
  pthread_join (tid[0], NULL);
  pthread_join (tid[1], NULL);
  printf ("%d atomic %d non-atomic\n", cnt, cnt1);
  
}

The program just creates two threads that increment two shared counters, one atomic and the other not. If you compile and run the program you will see that the two values it prints do not match: the non-atomic counter loses updates. We already know that. Let's quickly look at the generated code for the thread function.

$ objdump -d atomic01 | grep -A31 "<task>:"
000000000000080a <task>:
 80a:   55                      push   %rbp
 80b:   48 89 e5                mov    %rsp,%rbp
 80e:   48 83 ec 30             sub    $0x30,%rsp
 812:   48 89 7d d8             mov    %rdi,-0x28(%rbp)
 816:   64 48 8b 04 25 28 00    mov    %fs:0x28,%rax
 81d:   00 00
 81f:   48 89 45 f8             mov    %rax,-0x8(%rbp)
 823:   31 c0                   xor    %eax,%eax
 825:   48 8b 45 d8             mov    -0x28(%rbp),%rax
 829:   48 89 45 f0             mov    %rax,-0x10(%rbp)
 82d:   eb 24                   jmp    853 <task+0x49>
 82f:   c7 45 e8 01 00 00 00    movl   $0x1,-0x18(%rbp)
 836:   8b 45 e8                mov    -0x18(%rbp),%eax
 839:   f0 0f c1 05 df 07 20    lock xadd %eax,0x2007df(%rip)        # 201020 <cnt> <=======
 840:   00
 841:   89 45 ec                mov    %eax,-0x14(%rbp)
 844:   8b 05 d2 07 20 00       mov    0x2007d2(%rip),%eax        # 20101c <cnt1>
 84a:   83 c0 01                add    $0x1,%eax
 84d:   89 05 c9 07 20 00       mov    %eax,0x2007c9(%rip)        # 20101c <cnt1>
 853:   48 8b 45 f0             mov    -0x10(%rbp),%rax
 857:   8b 00                   mov    (%rax),%eax
 859:   85 c0                   test   %eax,%eax
 85b:   75 d2                   jne    82f <task+0x25>
 85d:   90                      nop
 85e:   48 8b 55 f8             mov    -0x8(%rbp),%rdx
 862:   64 48 33 14 25 28 00    xor    %fs:0x28,%rdx
 869:   00 00
 86b:   74 05                   je     872 <task+0x68>
 86d:   e8 2e fe ff ff          callq  6a0 <__stack_chk_fail@plt>
 872:   c9                      leaveq
 873:   c3                      retq

In the code above we can clearly see the lock prefix used to atomically increment (xadd) the atomic shared counter, while the non-atomic counter is incremented with a plain load, add and store sequence.

The previous code is equivalent to this:

void *task (void *p) {
  int *flag = (int*)p;
  while (*flag) {
    atomic_fetch_add_explicit(&cnt, 1,__ATOMIC_SEQ_CST);
    cnt1++;
  }
}

Even though the compiler generates slightly simpler code to increment the atomic when using this function (or alternatively __atomic_fetch_add or __atomic_add_fetch), we still get a locked add instruction, the same as before.

The last version of the function shows a third parameter in the atomic modification function. __ATOMIC_SEQ_CST stands for ATOMIC Sequentially Consistent. This is the default memory model used by the compiler when none is specified, as happened in our initial example, where we just incremented or assigned the atomic variables directly.
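As a small sketch of that default, the three spellings below all request the sequentially consistent model, so they can be mixed freely on the same counter. The names `counter`, `bump` and `run_equivalent_increments`, as well as the iteration count, are made up for this illustration and are not part of the original program.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define ITERATIONS 100000        /* arbitrary value chosen for this sketch */

static atomic_int counter;       /* static storage: starts at zero */

/* Each iteration bumps the counter three times, once per spelling.
   All three use the sequentially consistent memory model. */
static void *bump (void *arg) {
  (void) arg;
  for (int i = 0; i < ITERATIONS; i++) {
    counter++;                                  /* implicit seq_cst */
    atomic_fetch_add (&counter, 1);             /* implicit seq_cst */
    atomic_fetch_add_explicit (&counter, 1,
                               memory_order_seq_cst); /* explicit   */
  }
  return NULL;
}

/* Hypothetical helper: runs two threads and returns the final count. */
int run_equivalent_increments (void) {
  pthread_t tid[2];

  atomic_store (&counter, 0);
  for (int i = 0; i < 2; i++) {
    int rc = pthread_create (&tid[i], NULL, bump, NULL);
    assert (rc == 0);            /* sketch only: real code should handle this */
    (void) rc;
  }
  for (int i = 0; i < 2; i++)
    pthread_join (tid[i], NULL);
  return atomic_load (&counter);
}
```

Because every increment is atomic, two threads doing three increments per iteration always end up with exactly 2 * 3 * ITERATIONS, no matter how they interleave.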

The memory model tells the compiler which code to generate in order to establish certain constraints on the way atomic values are read from and written to memory. In the rest of the post I will use the examples from the GCC wiki to illustrate what happens with each memory model. That page uses C++ syntax. You can check this GCC page for the C equivalent that I'm going to use.

The problem

Before diving into the code, we should explain the issue that these memory models are intended to address.

Imagine that we have two threads running the following code (taken from the page mentioned above):

THREAD1              THREAD2
y = 1                if (x.load() == 2)
x.store (2)              assert (y == 1)

Looking at this code, many programmers will expect that the assert in thread 2 can never fail. However, the code of both threads is, in principle, unrelated, so the compiler may optimise each one independently. It may then decide that it is better to store x first and set y afterwards. If thread 2 runs in between those two stores, the assert could fail.

Note that, from the point of view of the compiler, changing the order of the assignments in thread 1 doesn't make any difference, as we are not doing anything with them. So, specifying a memory model when accessing atomic variables will allow us to control what we let the compiler (and also the processor) optimise and what not, at an inter-thread level.

In addition to the compiler optimisations, many processors also change the order of instructions to improve performance. Memory accesses in particular can be reordered, as accessing memory is more expensive than accessing the processor's internal storage.

Summing up, atomics are somewhat like a volatile modifier, but on steroids.

So, the key concept here is the so-called Happens-before. In the code above, it is clear to us that the assignment of y happens before the store to x. The memory model allows us to specify whether this is really what we want and therefore force the compiler and also the processor to execute the code as it is written, at the expense of a performance penalty.

NOTE: Happens-before in this case means that whenever we store 2 in x, the assignment of 1 to y has already happened. In this situation, when the if condition in thread 2 is true, the assert will always pass because the y assignment happened before the store to x.

Sequentially Consistent and Relaxed

The two basic memory models are Sequentially Consistent and Relaxed. Sequentially Consistent is the most constrained model, in the sense that the memory accesses around the atomic operation are forced to happen as written. Relaxed basically lets the compiler and the processor do whatever they normally do. Roughly speaking, the Sequentially Consistent model disables the compiler and processor reordering optimisations for that piece of code, while Relaxed keeps all of them in place.

Let's take our previous theoretical example and implement it so we can check what happens with each one of these models:

#include <stdio.h>   /* printf */
#include <stdlib.h>  /* exit, EXIT_FAILURE */
#include <unistd.h>
#include <stdatomic.h>
#include <assert.h>
#include <pthread.h>

atomic_int x = 0;
int y = 0;

void *task1 (void *p) {
  y = 1;
  __atomic_store_n (&x, 2, MEMORY_MODEL);
}

void *task2 (void *p) {
  if (__atomic_load_n (&x, MEMORY_MODEL) == 2)
    assert (y == 1);
}

int main () {
  pthread_t tid[2];

  printf ("x = %d | y = %d\n", x , y);
  if (pthread_create (&tid[0], NULL, task1, NULL) < 0) exit (EXIT_FAILURE);
  if (pthread_create (&tid[1], NULL, task2, NULL) < 0) exit (EXIT_FAILURE);

  pthread_join (tid[0], NULL);
  pthread_join (tid[1], NULL);
  printf ("x = %d | y = %d\n", x ,y);
 
}

Let's see what we get for the SEQ_CST memory model:

$ gcc -DMEMORY_MODEL=__ATOMIC_SEQ_CST -o atomic03-param atomic03-param.c -lpthread
$ objdump -d atomic03-param | grep -A31 "<task1>:"
000000000000081a <task1>:
 81a:   55                      push   %rbp
 81b:   48 89 e5                mov    %rsp,%rbp
 81e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 822:   c7 05 ec 07 20 00 01    movl   $0x1,0x2007ec(%rip)        # 201018 <y>
 829:   00 00 00
 82c:   b8 02 00 00 00          mov    $0x2,%eax
 831:   89 05 dd 07 20 00       mov    %eax,0x2007dd(%rip)        # 201014 <x>
 837:   0f ae f0                mfence
 83a:   90                      nop
 83b:   5d                      pop    %rbp
 83c:   c3                      retq

000000000000083d <task2>:
 83d:   55                      push   %rbp
 83e:   48 89 e5                mov    %rsp,%rbp
 841:   48 83 ec 10             sub    $0x10,%rsp
 845:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 849:   8b 05 c5 07 20 00       mov    0x2007c5(%rip),%eax        # 201014 <x>
 84f:   83 f8 02                cmp    $0x2,%eax
 852:   75 2a                   jne    87e <task2+0x41>
 854:   8b 05 be 07 20 00       mov    0x2007be(%rip),%eax        # 201018 <y>
 85a:   83 f8 01                cmp    $0x1,%eax
 85d:   74 1f                   je     87e <task2+0x41>
 85f:   48 8d 0d c8 01 00 00    lea    0x1c8(%rip),%rcx        # a2e <__PRETTY_FUNCTION__.3952>
 866:   ba 15 00 00 00          mov    $0x15,%edx
 86b:   48 8d 35 92 01 00 00    lea    0x192(%rip),%rsi        # a04 <_IO_stdin_used+0x4>
 872:   48 8d 3d 9c 01 00 00    lea    0x19c(%rip),%rdi        # a15 <_IO_stdin_used+0x15>
 879:   e8 52 fe ff ff          callq  6d0 <__assert_fail@plt>
 87e:   90                      nop
 87f:   c9                      leaveq
 880:   c3                      retq

We can see that task2 is left untouched: no special synchronisation code has been emitted. In the task1 function we do not see a lock prefix, but we can spot an mfence instruction right after the store to x. Now it is time to state a couple of things to understand why task1 was compiled like that:

  • The first thing is that load and store instructions on aligned memory are already atomic on Intel processors. In the previous example we had an add instruction, which doesn't fit in this category and therefore needs the lock prefix.
  • We see the mfence instruction because we asked for the Sequentially Consistent model which, as mentioned earlier, is also the default memory model used for atomic data.

So the remaining question is what that mfence instruction does. From the Intel Software Developer's Manual:

Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.

What this means in our example is that the stores to y and x are guaranteed to be performed in that order and that both updates will be globally visible, that is, visible to all other threads, before task1 continues. This is how we ensure that the assert in task2 always holds (whenever the value of x is 2).
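In the program above task2 may well run before x is written, in which case the assert is never evaluated. A minimal variation, sketched here with the C11 `atomic_store_explicit` and `atomic_load_explicit` functions instead of the GCC builtins, makes the reader wait until it sees x == 2 so that the check is exercised on every run. The names `writer`, `reader` and `run_message_passing` are made up for this sketch.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x;   /* the atomic that publishes y           */
static int y;          /* plain int, written before x is stored */

static void *writer (void *arg) {
  (void) arg;
  y = 1;                                               /* plain store */
  atomic_store_explicit (&x, 2, memory_order_seq_cst); /* publish     */
  return NULL;
}

static void *reader (void *arg) {
  int *seen = arg;
  /* Spin until the store to x becomes visible to this thread. */
  while (atomic_load_explicit (&x, memory_order_seq_cst) != 2)
    ;
  *seen = y;           /* the store to y happened before: reads 1 */
  return NULL;
}

/* Hypothetical helper: returns the value of y observed by the reader. */
int run_message_passing (void) {
  pthread_t r, w;
  int seen = -1;
  int rc;

  atomic_store (&x, 0);
  y = 0;
  rc = pthread_create (&r, NULL, reader, &seen);
  assert (rc == 0);    /* sketch only: real code should handle this */
  rc = pthread_create (&w, NULL, writer, NULL);
  assert (rc == 0);
  (void) rc;
  pthread_join (w, NULL);
  pthread_join (r, NULL);
  return seen;
}
```

The busy-wait loop is only acceptable in a toy like this one; it is there to guarantee that the reader gets past the check on x.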

Let's now see what happens when we update x using the RELAXED memory model. The generated code looks like this:

$ gcc -DMEMORY_MODEL=__ATOMIC_RELAXED -o atomic03-param atomic03-param.c -lpthread
$ objdump -d atomic03-param | grep -A10 "<task1>:"
000000000000081a <task1>:
 81a:   55                      push   %rbp
 81b:   48 89 e5                mov    %rsp,%rbp
 81e:   48 89 7d f8             mov    %rdi,-0x8(%rbp)
 822:   c7 05 ec 07 20 00 01    movl   $0x1,0x2007ec(%rip)        # 201018 <y>
 829:   00 00 00
 82c:   b8 02 00 00 00          mov    $0x2,%eax
 831:   89 05 dd 07 20 00       mov    %eax,0x2007dd(%rip)        # 201014 <x>
 837:   90                      nop
 838:   5d                      pop    %rbp
 839:   c3                      retq

I have removed task2 as it is not relevant for our discussion.

As we can see, when using the __ATOMIC_RELAXED memory model, the compiler takes no special measures regarding the atomic values. The code will indeed be faster (no mfence forcing the ordering of operations), but then nothing can be assumed about the sequencing of store/load operations. It is up to the programmer to ensure such synchronisation... if needed.

So, now, when task2 is executed there is nothing in the system imposing that the store on y happens before the store on x and therefore, the assert may fail.

Note that, in both cases, when task1 reaches the retq instruction, y will be set to 1 and x to 2, independently of which one was written first. If task2 gets executed in the middle of task1, and the assignments were performed in a different order by the processor, then the result of task2 will be different. The mfence instruction ensures that, once it completes, the preceding assignments are globally visible and were performed in the intended order.
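A case where no ordering is needed at all is a plain event counter: each increment must be atomic, but nobody relies on it to publish other data. The following sketch, with made-up names (`events`, `count_events`, `run_relaxed_counter`) and arbitrary sizes, uses the relaxed model for the increments and relies on pthread_join to make the final value visible to the main thread.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define NTHREADS   4        /* arbitrary values chosen for this sketch */
#define PER_THREAD 50000

static atomic_long events;

/* Only the atomicity of each increment matters here, not its ordering
   with respect to other memory accesses, so relaxed is enough. */
static void *count_events (void *arg) {
  (void) arg;
  for (int i = 0; i < PER_THREAD; i++)
    atomic_fetch_add_explicit (&events, 1, memory_order_relaxed);
  return NULL;
}

/* Hypothetical helper: returns the total once all threads have finished. */
long run_relaxed_counter (void) {
  pthread_t tid[NTHREADS];

  atomic_store (&events, 0);
  for (int i = 0; i < NTHREADS; i++) {
    int rc = pthread_create (&tid[i], NULL, count_events, NULL);
    assert (rc == 0);       /* sketch only: real code should handle this */
    (void) rc;
  }
  /* pthread_join synchronises with the end of each thread, so the
     relaxed load below sees every increment. */
  for (int i = 0; i < NTHREADS; i++)
    pthread_join (tid[i], NULL);
  return atomic_load_explicit (&events, memory_order_relaxed);
}
```

No increment is lost even with the relaxed model; what we give up is any guarantee about how these increments are ordered relative to accesses to other variables.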

Acquire/Release/Consume

In addition to the strict sequencing of SEQ_CST and the complete lack of sequencing of RELAXED, there are three additional models that can be used. According to the GCC documentation, this is what they do:

__ATOMIC_CONSUME

    This is currently implemented using the stronger __ATOMIC_ACQUIRE memory order 
    because of a deficiency in C++11’s semantics for memory_order_consume. 
    
__ATOMIC_ACQUIRE

    Creates an inter-thread happens-before constraint from the release (or stronger) 
    semantic store to this acquire load. Can prevent hoisting of code to before 
    the operation. 
    
__ATOMIC_RELEASE

    Creates an inter-thread happens-before constraint to acquire (or stronger) 
    semantic loads that read from this release store. Can prevent sinking of 
    code to after the operation.

Well, the first thing we see is that __ATOMIC_CONSUME is currently just the same as __ATOMIC_ACQUIRE. __ATOMIC_RELEASE ensures that any memory operation written before the indicated store cannot be moved after it, while __ATOMIC_ACQUIRE ensures that any memory operation written after the indicated load cannot be moved before it. The two work together like a lock on a given atomic: everything the releasing thread did before its store is visible to the thread whose acquire load reads that value, and the synchronisation happens at the processor level.
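To make the lock analogy concrete, here is a minimal spinlock sketch built on the C11 `atomic_flag` type: taking the lock is an acquire operation and dropping it is a release operation. It is not a replacement for a real mutex, and the names `spin_lock`, `spin_unlock`, `worker` and `run_spinlock_demo` are made up for this illustration.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

#define ITERATIONS 100000       /* arbitrary value chosen for this sketch */

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long shared;             /* plain variable protected by the lock */

static void spin_lock (void) {
  /* ACQUIRE: accesses after this point cannot be moved before it. */
  while (atomic_flag_test_and_set_explicit (&lock, memory_order_acquire))
    ;                           /* busy-wait until the flag was clear */
}

static void spin_unlock (void) {
  /* RELEASE: accesses before this point cannot be moved after it. */
  atomic_flag_clear_explicit (&lock, memory_order_release);
}

static void *worker (void *arg) {
  (void) arg;
  for (int i = 0; i < ITERATIONS; i++) {
    spin_lock ();
    shared++;                   /* non-atomic, but inside the critical section */
    spin_unlock ();
  }
  return NULL;
}

/* Hypothetical helper: returns the protected counter after two workers. */
long run_spinlock_demo (void) {
  pthread_t tid[2];

  shared = 0;
  for (int i = 0; i < 2; i++) {
    int rc = pthread_create (&tid[i], NULL, worker, NULL);
    assert (rc == 0);           /* sketch only: real code should handle this */
    (void) rc;
  }
  for (int i = 0; i < 2; i++)
    pthread_join (tid[i], NULL);
  return shared;
}
```

The release in spin_unlock pairs with the acquire in the next successful spin_lock, which is what keeps the plain increment of shared inside the critical section.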

To test the acquire/release models, I used the following program:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <stdatomic.h>
#include <assert.h>
#include <pthread.h>

atomic_int x = 0;
atomic_int y = 0;

void *task1 (void *p) {
  __atomic_store_n (&y, 20, __ATOMIC_RELEASE);
}

void *task2 ( void *p) {
    __atomic_store_n (&x, 10, __ATOMIC_RELEASE);
}

void *task3 (void *p) {
  assert (__atomic_load_n (&y, __ATOMIC_ACQUIRE) == 20 &&
      __atomic_load_n (&x, __ATOMIC_ACQUIRE) == 0);
}

void *task4 (void *p) {
  assert (__atomic_load_n (&y, __ATOMIC_ACQUIRE) == 0 &&
      __atomic_load_n (&x, __ATOMIC_ACQUIRE) == 10);
}

int main () {
  pthread_t tid[4];

  printf ("x = %d | y = %d\n", x, y);
  if (pthread_create (&tid[0], NULL, task1, NULL) < 0) exit (EXIT_FAILURE);
  if (pthread_create (&tid[1], NULL, task2, NULL) < 0) exit (EXIT_FAILURE);
  if (pthread_create (&tid[2], NULL, task3, NULL) < 0) exit (EXIT_FAILURE);
  if (pthread_create (&tid[3], NULL, task4, NULL) < 0) exit (EXIT_FAILURE);

  pthread_join (tid[0], NULL);
  pthread_join (tid[1], NULL);
  pthread_join (tid[2], NULL);
  pthread_join (tid[3], NULL);
  printf ("x = %d | y = %d\n", x, y);
}

To complete the explanation of these memory models: in the example above, both asserts (the one in task3 and the one in task4) can pass in the same run. There is no order imposed between the two variables; either of them can be updated first, and different threads are not forced to agree on which one was. Note that if the SEQ_CST memory model were used, that couldn't happen. In that case, one of the assignments will happen before the other for every thread, although which one comes first is determined at run-time. In other words, with the SEQ_CST model at most one of the asserts can pass, while with the ACQUIRE/RELEASE model both may pass.
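A common way to turn the SEQ_CST guarantee into something we can check is to let each reader wait for one variable and then test the other. Under the sequentially consistent model all threads agree on a single order for the two stores, so at least one of the readers must see both. The sketch below follows that shape; all names (`write_x`, `read_x_then_y`, `min_z_over_runs`...) are chosen for this illustration only.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static atomic_int x, y, z;

/* atomic_store/atomic_load without an explicit order are seq_cst. */
static void *write_x (void *arg) { (void) arg; atomic_store (&x, 1); return NULL; }
static void *write_y (void *arg) { (void) arg; atomic_store (&y, 1); return NULL; }

static void *read_x_then_y (void *arg) {
  (void) arg;
  while (!atomic_load (&x))
    ;                                   /* wait for the store to x */
  if (atomic_load (&y))
    atomic_fetch_add (&z, 1);           /* saw x and then also y   */
  return NULL;
}

static void *read_y_then_x (void *arg) {
  (void) arg;
  while (!atomic_load (&y))
    ;                                   /* wait for the store to y */
  if (atomic_load (&x))
    atomic_fetch_add (&z, 1);           /* saw y and then also x   */
  return NULL;
}

/* Runs the four threads once and returns how many readers saw both stores. */
int run_total_order_once (void) {
  void *(*fn[4]) (void *) = { write_x, write_y, read_x_then_y, read_y_then_x };
  pthread_t tid[4];

  atomic_store (&x, 0);
  atomic_store (&y, 0);
  atomic_store (&z, 0);
  for (int i = 0; i < 4; i++) {
    int rc = pthread_create (&tid[i], NULL, fn[i], NULL);
    assert (rc == 0);                   /* sketch only */
    (void) rc;
  }
  for (int i = 0; i < 4; i++)
    pthread_join (tid[i], NULL);
  return atomic_load (&z);
}

/* Hypothetical helper: smallest z observed over several runs. */
int min_z_over_runs (int runs) {
  int min = 2;
  for (int i = 0; i < runs; i++) {
    int v = run_total_order_once ();
    if (v < min) min = v;
  }
  return min;
}
```

With acquire/release operations instead of the sequentially consistent ones, the language would also allow z to end up as 0, because the two readers could disagree on which store came first. Whether that is ever observed depends on the hardware.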

In theory this model allows the compiler to optimise the code further, while still ensuring that the loads and stores of certain variables become globally visible.

If you compile it for x86, you will get just the same code as for non-atomic values. ACQUIRE/RELEASE is intended to allow the compiler to apply more optimisations, but I'm afraid that for such a simple program there are not many optimisations to perform, so we cannot really see any difference.

Actually, the cppreference page describing these memory models explains that on strongly-ordered systems (which include x86) no special instruction is emitted; the compiler is merely instructed to avoid certain optimisations (like moving around loads and stores of the affected atomics). ARM and PowerPC, however, are weakly-ordered systems.

So, What happens with x86?

At this point things have become pretty confusing. Intel processors can run instructions out of order, but they are strongly-ordered systems for which memory accesses are guaranteed to be performed in the indicated order... So why do we need the mfence instruction on Intel?

OUT-OF-ORDER Processors

It may be difficult to understand what follows, as all the code we see looks like it is in the right order. The point is that the processor is allowed to reorder the instructions: regardless of how the program looks in memory, the instructions may be executed in a different order that produces the exact same result, likely improving performance. This works fine in most cases, but when we have inter-thread interactions such a reordering may make the program behave differently, as we discussed in our previous example.

For more details take a look at the Out-of-Order Execution page on Wikipedia. Do not forget to donate and to ask your company/organization to do so.

Apparently, the x86 architecture is indeed strongly-ordered and in most cases, for simple aligned stores and loads, everything just works. By design, all stores have RELEASE semantics and all loads have ACQUIRE semantics, and that is why no special code is generated.

All this said, I started to believe that the mfence instruction is not really needed and is included in the Sequentially Consistent model just in case. However, after digging further into the issue I found the following.

In the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, we find some details.

Section 11.10 introduces the so-called STORE BUFFER. This buffer is used to temporarily hold writes to memory (store operations), delaying the actual writes and therefore improving processor performance: the processor can continue executing instructions without waiting for memory writes, and it can also be more efficient when it finally writes to memory.

This section also states that all memory writes are performed in order (even when they may be delayed) and indicates in which circumstances the store buffer is drained to memory. We can see the mfence instruction mentioned in that list. The section refers us to section 8.2, where write ordering is discussed together with further details on how the store buffer operates.

Section 8.2 Memory Ordering starts by mentioning that the i386 processor is a strongly-ordered system where all accesses happen as they are supposed to, but that the Pentium 4, Xeon and P6 families of processors allow deviations from this behaviour to improve performance. In particular, the section mentions reads (loads) going ahead of buffered writes.
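Reads going ahead of buffered writes is exactly what the classic "store buffering" test looks for: each thread stores to one variable and then loads the other. The sketch below counts how often both threads read 0. The names (`SB_ORDER`, `left`, `right`, `count_both_zero`) are made up for this illustration; with the sequentially consistent model that outcome is forbidden by the language, whereas rebuilding with -DSB_ORDER=memory_order_relaxed allows it, although it may still be rare to observe because the threads here are not started at exactly the same time.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

/* Memory order used by every access; override at build time if desired. */
#ifndef SB_ORDER
#define SB_ORDER memory_order_seq_cst
#endif

static atomic_int a, b;
static int r1, r2;     /* results, only read after both threads are joined */

static void *left (void *arg) {
  (void) arg;
  atomic_store_explicit (&a, 1, SB_ORDER);      /* may sit in the store buffer */
  r1 = atomic_load_explicit (&b, SB_ORDER);     /* may run ahead of the store  */
  return NULL;
}

static void *right (void *arg) {
  (void) arg;
  atomic_store_explicit (&b, 1, SB_ORDER);
  r2 = atomic_load_explicit (&a, SB_ORDER);
  return NULL;
}

/* Hypothetical helper: number of runs in which both loads returned 0. */
int count_both_zero (int runs) {
  int hits = 0;

  for (int i = 0; i < runs; i++) {
    pthread_t t1, t2;
    int rc;

    atomic_store (&a, 0);
    atomic_store (&b, 0);
    rc = pthread_create (&t1, NULL, left, NULL);
    assert (rc == 0);                           /* sketch only */
    rc = pthread_create (&t2, NULL, right, NULL);
    assert (rc == 0);
    (void) rc;
    pthread_join (t1, NULL);
    pthread_join (t2, NULL);
    if (r1 == 0 && r2 == 0)
      hits++;
  }
  return hits;
}
```

This is the one reordering that plain x86 loads and stores do allow, which is why the sequentially consistent store needs something extra such as the mfence we saw earlier.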

I won't go into the details of these sections. They are pretty long and dense, but the curious reader will find two sections describing the models, one for the Pentium and i486 and another for P6 and more recent processors (which is way more complex). After that, several examples are discussed to illustrate the memory ordering principles described in the first two sections. You can refer to the post Who ordered memory fences on an x86? for a more comprehensive reading of some of the cases described in the Intel manual.

Anything else I should know about x86 processors?

Well, just a couple of things. You can find the details in the same section 8.2, but I'll include here a very brief summary... this time for my own convenience.

The first thing is that the stores performed by fast-string operations may appear to execute out of order. This basically concerns the use of rep stos instructions in your code. In that case the data may not be stored in order, and additional synchronisation means have to be provided.

The second is that section 8.2.5 tells us how to strengthen or weaken the memory model. It mentions 4 mechanisms:

  • I/O instructions, locking instructions (like xchg), lock-prefixed instructions and serialising instructions (like cpuid or iret) force stronger ordering.
  • The L/S/MFENCE instructions force memory ordering and serialisation, allowing some control over the instructions affected (L stands for Load and S stands for Store).
  • The memory type range registers (MTRRs), available on Pentium 4, Xeon and P6, allow us to modify the memory ordering for areas of physical memory.
  • Finally, the PAT (Page Attribute Table) can do the same for a page or group of pages. This mechanism is available on Pentium 4, Xeon and Pentium III.

The last two are kind of intended to deal with memory-mapped devices where, instead of using I/O instructions (case 1, in and out), device control is done by writing to specific memory areas. In those cases, we want the writes to be performed in order. Usually you have to do two writes in a row, one to a control register to choose the operation and another to a data register to write the value. Those registers will be mapped at different memory locations for memory-mapped hardware (usually consecutive, though).

So, the relevant part of all this is summarised in the last paragraph of the section. It basically says that newer processors (Intel Core 2 Duo, Intel Atom, Intel Core Duo, Pentium 4, Intel Xeon, and P6 family processors) no longer implement the strong memory-ordering model (except when using MTRRs with the so-called strong uncached (UC) memory type). Therefore, in order to write portable code (also for future processors), programmers should assume the processor-ordering model or a weaker memory-ordering model.

Summing up, x86 is overall a strongly-ordered system with certain exceptions. ACQUIRE/RELEASE semantics seem to be ensured by the processor, but in order to fully guarantee sequencing, special instructions are provided and used for the Sequentially Consistent C/C++ memory model. Even when they may not be necessary in some cases, the system manuals tell us to use them in order to ensure compatibility with future processors.
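From C we do not write those instructions by hand. Besides attaching a memory model to each atomic operation, the standard `atomic_thread_fence` function lets us place a stand-alone fence and leave it to the compiler to pick whatever the target needs. The sketch below publishes a plain value using relaxed atomic accesses plus a release fence on one side and an acquire fence on the other. The names `payload`, `ready`, `producer`, `consumer` and `run_fence_demo` are made up for this illustration, and I am assuming (without having checked the generated code) that on x86 these two particular fences only restrain the compiler.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

static int payload;        /* plain data to publish                 */
static atomic_int ready;   /* flag accessed with relaxed operations */

static void *producer (void *arg) {
  (void) arg;
  payload = 42;
  /* Release fence: the write to payload cannot sink below this point. */
  atomic_thread_fence (memory_order_release);
  atomic_store_explicit (&ready, 1, memory_order_relaxed);
  return NULL;
}

static void *consumer (void *arg) {
  int *out = arg;
  while (!atomic_load_explicit (&ready, memory_order_relaxed))
    ;                      /* busy-wait until the flag is set */
  /* Acquire fence: the read of payload cannot be hoisted above this point. */
  atomic_thread_fence (memory_order_acquire);
  *out = payload;
  return NULL;
}

/* Hypothetical helper: returns the payload as seen by the consumer. */
int run_fence_demo (void) {
  pthread_t p, c;
  int out = -1;
  int rc;

  payload = 0;
  atomic_store (&ready, 0);
  rc = pthread_create (&c, NULL, consumer, &out);
  assert (rc == 0);        /* sketch only: real code should handle this */
  rc = pthread_create (&p, NULL, producer, NULL);
  assert (rc == 0);
  (void) rc;
  pthread_join (p, NULL);
  pthread_join (c, NULL);
  return out;
}
```

Stating the ordering we need in the source, rather than relying on what a given x86 model happens to guarantee, is what keeps this kind of code valid on weaker or future processors.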

So, What happens with ARM or PowerPC?

Let's take a quick look, just to satisfy our curiosity, at what happens on a weakly-ordered system.

$ arm-linux-gnueabi-gcc -o atomic05.arm atomic05.c -lpthread
$ arm-linux-gnueabi-objdump  -d atomic05.arm | grep -A44 "<task2>:"
00010630 <task2>:
   10630:       e92d4810        push    {r4, fp, lr}
   10634:       e28db008        add     fp, sp, #8
   10638:       e24dd00c        sub     sp, sp, #12
   1063c:       e50b0010        str     r0, [fp, #-16]
   10640:       e59f4018        ldr     r4, [pc, #24]   ; 10660 <task2+0x30>
   10644:       eb00042b        bl      116f8 <__sync_synchronize>              ; <========
   10648:       e3a0300a        mov     r3, #10
   1064c:       e5843000        str     r3, [r4]
   10650:       e1a00000        nop                     ; (mov r0, r0)
   10654:       e1a00003        mov     r0, r3
   10658:       e24bd008        sub     sp, fp, #8
   1065c:       e8bd8810        pop     {r4, fp, pc}
   10660:       00022044        .word   0x00022044

00010664 <task3>:
   10664:       e92d4810        push    {r4, fp, lr}
   10668:       e28db008        add     fp, sp, #8
   1066c:       e24dd00c        sub     sp, sp, #12
   10670:       e50b0010        str     r0, [fp, #-16]
   10674:       e59f3044        ldr     r3, [pc, #68]   ; 106c0 <task3+0x5c>
   10678:       e5934000        ldr     r4, [r3]
   1067c:       eb00041d        bl      116f8 <__sync_synchronize>            ; <=========
   10680:       e3540014        cmp     r4, #20
   10684:       1a000004        bne     1069c <task3+0x38>
   10688:       e59f3034        ldr     r3, [pc, #52]   ; 106c4 <task3+0x60>
   1068c:       e5934000        ldr     r4, [r3]
   10690:       eb000418        bl      116f8 <__sync_synchronize>            ; <===========
   10694:       e3540000        cmp     r4, #0
   10698:       0a000004        beq     106b0 <task3+0x4c>
   1069c:       e59f3024        ldr     r3, [pc, #36]   ; 106c8 <task3+0x64>
   106a0:       e3a0201e        mov     r2, #30
   106a4:       e59f1020        ldr     r1, [pc, #32]   ; 106cc <task3+0x68>
   106a8:       e59f0020        ldr     r0, [pc, #32]   ; 106d0 <task3+0x6c>
   106ac:       ebffff93        bl      10500 <__assert_fail@plt>
   106b0:       e1a00000        nop                     ; (mov r0, r0)
   106b4:       e1a00003        mov     r0, r3
   106b8:       e24bd008        sub     sp, fp, #8
   106bc:       e8bd8810        pop     {r4, fp, pc}
   106c0:       00022048        .word   0x00022048
   106c4:       00022044        .word   0x00022044
   106c8:       000119ac        .word   0x000119ac
   106cc:       000118ec        .word   0x000118ec
   106d0:       000118f8        .word   0x000118f8

We can see the calls to __sync_synchronize. That function ends up calling one of the kernel-provided kuser helpers, as described in part 2 of this series.

For the PowerPC we can also see how special instructions are needed to ensure the RELEASE/ACQUIRE semantics:

$ powerpc-linux-gnu-gcc -o atomic05.ppc atomic05.c -lpthread
$ powerpc-linux-gnu-objdump -d atomic05.ppc | grep -A50 "<task2>:"
10000664 <task2>:
10000664:       94 21 ff e0     stwu    r1,-32(r1)
10000668:       93 e1 00 1c     stw     r31,28(r1)
1000066c:       7c 3f 0b 78     mr      r31,r1
10000670:       90 7f 00 0c     stw     r3,12(r31)
10000674:       3d 20 10 02     lis     r9,4098
10000678:       39 40 00 0a     li      r10,10
1000067c:       7c 20 04 ac     lwsync               # <========
10000680:       91 49 00 2c     stw     r10,44(r9)
10000684:       60 00 00 00     nop
10000688:       7d 23 4b 78     mr      r3,r9
1000068c:       39 7f 00 20     addi    r11,r31,32
10000690:       83 eb ff fc     lwz     r31,-4(r11)
10000694:       7d 61 5b 78     mr      r1,r11
10000698:       4e 80 00 20     blr

1000069c <task3>:
1000069c:       94 21 ff e0     stwu    r1,-32(r1)
100006a0:       7c 08 02 a6     mflr    r0
100006a4:       90 01 00 24     stw     r0,36(r1)
100006a8:       93 e1 00 1c     stw     r31,28(r1)
100006ac:       7c 3f 0b 78     mr      r31,r1
100006b0:       90 7f 00 0c     stw     r3,12(r31)
100006b4:       3d 20 10 02     lis     r9,4098
100006b8:       81 29 00 30     lwz     r9,48(r9)
100006bc:       7f 89 48 00     cmpw    cr7,r9,r9
100006c0:       40 9e 00 04     bne     cr7,100006c4 <task3+0x28>
100006c4:       4c 00 01 2c     isync                     # <==========
100006c8:       2b 89 00 14     cmplwi  cr7,r9,20
100006cc:       40 9e 00 20     bne     cr7,100006ec <task3+0x50>
100006d0:       3d 20 10 02     lis     r9,4098
100006d4:       81 29 00 2c     lwz     r9,44(r9)
100006d8:       7f 89 48 00     cmpw    cr7,r9,r9
100006dc:       40 9e 00 04     bne     cr7,100006e0 <task3+0x44>
100006e0:       4c 00 01 2c     isync                     # <===============
100006e4:       2f 89 00 00     cmpwi   cr7,r9,0
100006e8:       41 9e 00 24     beq     cr7,1000070c <task3+0x70>
100006ec:       3d 20 10 00     lis     r9,4096
100006f0:       38 c9 0d 74     addi    r6,r9,3444
100006f4:       38 a0 00 1e     li      r5,30
100006f8:       3d 20 10 00     lis     r9,4096
100006fc:       38 89 0c b4     addi    r4,r9,3252
10000700:       3d 20 10 00     lis     r9,4096
10000704:       38 69 0c c0     addi    r3,r9,3264
10000708:       48 00 04 f9     bl      10000c00 <__assert_fail@plt>
1000070c:       60 00 00 00     nop
10000710:       7d 23 4b 78     mr      r3,r9
10000714:       39 7f 00 20     addi    r11,r31,32
10000718:       80 0b 00 04     lwz     r0,4(r11)
1000071c:       7c 08 03 a6     mtlr    r0
10000720:       83 eb ff fc     lwz     r31,-4(r11)

If you do a quick search on Google, you will find that lwsync (lightweight sync) is used for RELEASE while isync is used for ACQUIRE. There is also a sync/hwsync (heavyweight sync) instruction that imposes a full memory barrier, and we can see it when using the SEQ_CST memory model.

Conclusions

We have gone quickly through the different memory models supported by GCC atomics, which give the developer better control over what code gets executed when. We have also seen that it is not that easy to find proper and simple examples. For the simple ones we have seen in this post, either we get the compiler to emit an mfence instruction or not, and there is not much room for the optimiser to modify a function that just assigns a couple of values.

Memory models and atomics become relevant when implementing higher-level synchronisation mechanisms, as we found out when trying to implement a mutex. In normal SW, we will likely use a mutex, semaphore, condition variable or barrier rather than implementing our own mechanism using atomics with inline functions. Anyway, it is good to understand what happens under the hood and that these problems exist, because many times a slightly different version of this issue may show up at a higher level, and it will be easier to identify when it sounds a little bit familiar.

Header Image Credits: Dan Meyers
