(proclaim λ) -

I've been reading about multi-threaded programming out of curiosity over lock-free algorithms.

The first thing I've implemented for that was actually something completely different, namely a short example of memory reordering ported over from a C++ program.

In Common Lisp this relies on SBCL because of the availability of memory barriers and the Linux-specific API to set the thread affinity to a single CPU core (sb-cpu-affinity). Of course, given the appropriate compability libraries this could easily be done on other implementations too.

(in-package #:cl-user)

(defun main (&key single-cpu-p barrierp)
  (let ((begin-semaphore-1 (sb-thread:make-semaphore))
        (begin-semaphore-2 (sb-thread:make-semaphore))
        (end-semaphore (sb-thread:make-semaphore))
        (iterations 0)
        (detected 0)
        (x 0)
        (y 0)
        (r1 0)
        (r2 0)
        done)
    (declare (fixnum x y r1 r2))
    (declare (boolean done))
    (declare (optimize (speed 3) (safety 0) (space 0) (debug 0) (compilation-speed 0)))
    (labels ((single-cpu ()
               (when single-cpu-p
                 sb-cpu-affinity::
                 (with-cpu-affinity-mask (mask :save T)
                   (clear-cpu-affinity-mask mask)
                   (setf (cpu-affinity-p 0 mask) T))))
             (thread-1 ()
               (single-cpu)
               (loop
                 (when done
                   (return))
                 (sb-thread:wait-on-semaphore begin-semaphore-1)
                 (setf x 1)
                 (when barrierp
                   (sb-thread:barrier (:memory)))
                 (setf r1 y)
                 (sb-thread:signal-semaphore end-semaphore)))
             (thread-2 ()
               (single-cpu)
               (loop
                 (when done
                   (return))
                 (sb-thread:wait-on-semaphore begin-semaphore-2)
                 (setf y 1)
                 (when barrierp
                   (sb-thread:barrier (:memory)))
                 (setf r2 x)
                 (sb-thread:signal-semaphore end-semaphore))))
      (let ((thread-1 (sb-thread:make-thread #'thread-1))
            (thread-2 (sb-thread:make-thread #'thread-2)))
        (unwind-protect
             (loop
               (setf x 0 y 0)
               (sb-thread:signal-semaphore begin-semaphore-1)
               (sb-thread:signal-semaphore begin-semaphore-2)
               (sb-thread:wait-on-semaphore end-semaphore)
               (sb-thread:wait-on-semaphore end-semaphore)
               (when (and (eql r1 0) (eql r2 0))
                 (incf detected)
                 (format T "~D reorders detected after ~D iterations, every ~D on average~%" detected iterations (floor iterations detected)))
               (incf iterations))
          (setf done T)
          (sb-thread:signal-semaphore begin-semaphore-1)
          (sb-thread:signal-semaphore begin-semaphore-2)
          (sb-thread:join-thread thread-1)
          (sb-thread:join-thread thread-2))))))

On the JVM, respectively Kotlin, the code looks again remarkably similar. I haven't yet looked at thread affinity, but what's interesting here are two aspects related to the JVM memory model. Using volatile is, in contrast to C, viable, as it generates the necessary memory barriers.

import java.util.concurrent.Semaphore
import kotlin.concurrent.thread

class HelloWorld {
    var beginSempahore1 = Semaphore(0)
    var beginSempahore2 = Semaphore(0)
    var endSemaphore = Semaphore(0)

    var detected = 0
    var iterations = 0

    @Volatile var x = 0
    @Volatile var y = 0
    var r1 = 0
    var r2 = 0

    fun run() {
        thread {
            while (true) {
                beginSempahore1.acquire()
                x = 1
                r1 = y
                endSemaphore.release()
            }
        }

        thread {
            while (true) {
                beginSempahore2.acquire()
                y = 1
                r2 = x
                endSemaphore.release()
            }
        }

        while (true) {
            iterations += 1
            x = 0
            y = 0
            beginSempahore1.release()
            beginSempahore2.release()
            endSemaphore.acquire()
            endSemaphore.acquire()
            if (r1 == 0 && r2 == 0) {
                detected++
                println("$detected reorders detected after $iterations iterations")
            }
        }
    }
}

fun main(args: Array<String>) {
    HelloWorld().run()
}

Also, when using a JVM plugin, the generated assembly code can be dumped during compilation, which allows us to confirm whether those instructions actually have been generated; for the first loop:

  0x00007f9ad4f503c9: mov    (%rcx),%r11d       ;*getfield this$0
                                                ; - HelloWorld$run$1::invoke@11 (line 21)

  0x00007f9ad4f503cc: test   %r11d,%r11d
  0x00007f9ad4f503cf: je     0x00007f9ad4f507a9
  0x00007f9ad4f503d5: movl   $0x1,0x14(%r12,%r11,8)
  0x00007f9ad4f503de: lock addl $0x0,(%rsp)     ;*putfield x
                                                ; - HelloWorld::setX@2 (line 12)
                                                ; - HelloWorld$run$1::invoke@15 (line 21)

  0x00007f9ad4f503e3: mov    (%rcx),%r10d       ;*getfield this$0
                                                ; - HelloWorld$run$1::invoke@19 (line 22)

  0x00007f9ad4f503e6: test   %r10d,%r10d
  0x00007f9ad4f503e9: je     0x00007f9ad4f507cd  ;*invokevirtual getY
                                                ; - HelloWorld$run$1::invoke@26 (line 22)

  0x00007f9ad4f503ef: mov    0x18(%r12,%r10,8),%r11d  ;*getfield y
                                                ; - HelloWorld::getY@1 (line 13)
                                                ; - HelloWorld$run$1::invoke@26 (line 22)

  0x00007f9ad4f503f4: mov    %r11d,0x1c(%r12,%r10,8)  ;*putfield r1
                                                ; - HelloWorld::setR1@2 (line 14)
                                                ; - HelloWorld$run$1::invoke@29 (line 22)

And for the second loop:

  0x00007f9ad4f47269: mov    (%rcx),%r11d       ;*getfield this$0
                                                ; - HelloWorld$run$2::invoke@11 (line 30)

  0x00007f9ad4f4726c: test   %r11d,%r11d
  0x00007f9ad4f4726f: je     0x00007f9ad4f47629
  0x00007f9ad4f47275: movl   $0x1,0x18(%r12,%r11,8)
  0x00007f9ad4f4727e: lock addl $0x0,(%rsp)     ;*putfield y
                                                ; - HelloWorld::setY@2 (line 13)
                                                ; - HelloWorld$run$2::invoke@15 (line 30)

  0x00007f9ad4f47283: mov    (%rcx),%r10d       ;*getfield this$0
                                                ; - HelloWorld$run$2::invoke@19 (line 31)

  0x00007f9ad4f47286: test   %r10d,%r10d
  0x00007f9ad4f47289: je     0x00007f9ad4f4764d  ;*invokevirtual getX
                                                ; - HelloWorld$run$2::invoke@26 (line 31)

  0x00007f9ad4f4728f: mov    0x14(%r12,%r10,8),%r11d  ;*getfield x
                                                ; - HelloWorld::getX@1 (line 12)
                                                ; - HelloWorld$run$2::invoke@26 (line 31)

  0x00007f9ad4f47294: mov    %r11d,0x20(%r12,%r10,8)  ;*putfield r2
                                                ; - HelloWorld::setR2@2 (line 15)
                                                ; - HelloWorld$run$2::invoke@29 (line 31)

And the main loop:

  0x00007fd383813838: lock addl $0x0,(%rsp)     ;*putfield x
                                                ; - HelloWorld::run@58 (line 38)

  0x00007fd38381383d: mov    $0x0,%edx
  0x00007fd383813842: mov    %edx,0x18(%rsi)
  0x00007fd383813845: lock addl $0x0,(%rsp)     ;*putfield y
                                                ; - HelloWorld::run@63 (line 39)

  0x00007fd38381384a: mov    0x24(%rsi),%edi

Note that in contrast to the original and the Common Lisp version there's one additional memory barrier here that's unnecessary.

Memory reordering in Common Lisp and Kotlin