Relocation and Thread Local Storage

Over the past two days, the following excellent articles taught me a lot about how the loader performs relocation for ELF executables and how thread local storage is handled:

Why did I suddenly start learning about ELF executable loading and thread local storage? It was triggered by a specific problem: while porting a userspace threading implementation (Shenango), I needed to come up with the aarch64 equivalent of the inline assembly asm volatile ("addl $1, %%fs:preempt_cnt@tpoff" : : : "memory", "cc");.

The inline assembly increments a 32-bit value stored in memory. More specifically, the value is a thread local variable called preempt_cnt. The add instruction encodes the offset of preempt_cnt from the thread pointer (@tpoff stands for thread pointer offset), and the full address is formed by adding that offset to the base address held by the FS segment register. On X86, the FS and GS registers were introduced to manage additional segments, while on modern systems they are repurposed for Thread Local Storage (TLS). When the kernel switches threads, it sets the FS base (Windows and macOS use GS) to the thread pointer (TP), the address of the memory allocated for the thread's context. User code can use FS to address thread local variables, but traditionally cannot change the base itself without going through the kernel.
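A minimal sketch of the X86_64 side, to make the single-instruction increment concrete. The function names preempt_disable and preempt_enable and the uint32_t type are assumptions for illustration, and the plain C fallback exists only so the sketch builds on other targets.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 32-bit per-thread counter, as described in the text. */
__thread uint32_t preempt_cnt;

/* Increment the calling thread's preempt_cnt (hypothetical name). */
static inline void preempt_disable(void)
{
#if defined(__x86_64__) && defined(__linux__)
    /* One instruction: the displacement preempt_cnt@tpoff is added to the
     * FS base (the thread pointer) to form the address. */
    __asm__ volatile("addl $1, %%fs:preempt_cnt@tpoff" : : : "memory", "cc");
#else
    /* Portable fallback; the compiler picks the TLS access sequence. */
    preempt_cnt++;
#endif
}

/* Decrement the calling thread's preempt_cnt (hypothetical name). */
static inline void preempt_enable(void)
{
    assert(preempt_cnt > 0);
#if defined(__x86_64__) && defined(__linux__)
    __asm__ volatile("subl $1, %%fs:preempt_cnt@tpoff" : : : "memory", "cc");
#else
    preempt_cnt--;
#endif
}
```

Because the variable is thread local, each thread that calls these functions only ever touches its own copy of the counter.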

The preempt_cnt variable is defined in a .c file as follows and is archived as part of a static library.

There are two inline functions manipulating preempt_cnt in the header file:

And here is a simple multi-threaded program to demonstrate the usage:

If we compile it on an X86_64 machine, we can see from the following assembly snippet how the embedded add instruction is encoded: the offset (0xfffffffffffffffc, which means -4) takes up 4 bytes (32 bits).
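To see why a 4-byte field can stand for 0xfffffffffffffffc, here is a small sketch of the arithmetic: the 32-bit displacement is sign-extended to 64 bits before it is added to the thread pointer. The helper names and the sample thread pointer value are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend the 4-byte displacement stored in the instruction to 64 bits.
 * The uint32_t -> int32_t conversion wraps on GCC and Clang. */
static int64_t disp32_to_offset(uint32_t disp32)
{
    return (int64_t)(int32_t)disp32;
}

/* Address the add instruction touches for a given thread pointer. */
static uint64_t tls_address(uint64_t thread_pointer, uint32_t disp32)
{
    return thread_pointer + (uint64_t)disp32_to_offset(disp32);
}
```

For example, assert(disp32_to_offset(0xfffffffcu) == -4) holds, so with a thread pointer of 0x7f0000001000 the instruction would touch 0x7f0000000ffc, just below the thread pointer.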

This is because preempt_cnt uses the relocation type R_X86_64_TPOFF32 (try readelf -r tls.o | grep preempt), which means that a 32-bit offset from the thread pointer, hard-coded at link time, is sufficient for the instruction to generate the real virtual memory address during execution (the thread pointer itself is dynamic), so the loader does not need to modify the code or populate a Global Offset Table (GOT) entry for this instruction. Sorry for throwing so much terminology at you; I strongly recommend gaining a basic understanding of relocation from the excellent trilogy (1, 2, 3) written by Eli Bendersky. I will not even try to give my own explanation, so here is an extremely simplified explanation of relocation from chao-tic:

Relocation is required because the compiler doesn’t know where the variables defined in shared objects are located at runtime, so a Global Offset Table (GOT) is set aside by the compiler and only gets filled in the appropriate location values at runtime by the dynamic linker.

The first challenge is that the aarch64 architecture does not have an FS segment register like X86. Instead it provides a system register, TPIDR_EL0, which, as the name and the documentation suggest, is supposed to store the thread ID; the ID can then be used to look up a table to find the thread-related information. In practice, kernel designers (e.g., Fuchsia) simplify the process by directly storing the address of the Thread Control Block (TCB, which usually sits at the beginning or the end of TLS). It is worth mentioning that the TLS layout on aarch64 follows ELF variant I, while X86 follows variant II for historical reasons. Figure 1 and Figure 2 in ELF Handling For Thread-Local Storage are very intuitive and helpful for understanding the two layout variants. Unfortunately, TPIDR_EL0 cannot be used directly as a base by load instructions the way FS can on X86, so we have to use the mrs instruction to read the thread pointer from TPIDR_EL0 into a general purpose register first.
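A minimal sketch of reading the thread pointer and observing the two layout variants. The names tls_probe, read_thread_pointer and tls_layout_matches_variant are hypothetical, and the X86_64 branch assumes a Linux C library (glibc or musl) that stores the thread pointer itself in the word at %fs:0.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical thread local variable used only to probe the layout. */
static __thread int tls_probe = 1;

/* Return the calling thread's thread pointer, or 0 if not supported here. */
static inline uintptr_t read_thread_pointer(void)
{
    uintptr_t tp = 0;
#if defined(__aarch64__) && defined(__linux__)
    /* TPIDR_EL0 is a system register, so copy it into a GPR with mrs. */
    __asm__ volatile("mrs %0, tpidr_el0" : "=r"(tp));
#elif defined(__x86_64__) && defined(__linux__)
    /* Assumption: the word at %fs:0 holds the thread pointer itself. */
    __asm__ volatile("movq %%fs:0, %0" : "=r"(tp));
#endif
    return tp;
}

/* 1 if tls_probe lies on the side of the thread pointer that the variant
 * predicts for a main executable; also 1 on platforms not probed here. */
static inline int tls_layout_matches_variant(void)
{
    uintptr_t tp = read_thread_pointer();
    uintptr_t var = (uintptr_t)&tls_probe;

    if (tp == 0)
        return 1;        /* platform not covered by this sketch */
#if defined(__aarch64__)
    return var > tp;     /* variant I: TLS blocks follow the TCB */
#else
    return var < tp;     /* variant II: TLS blocks precede the TCB */
#endif
}
```

In a normally linked Linux executable, assert(tls_layout_matches_variant()) should hold on both architectures; the difference in sign is exactly why the X86_64 offset seen earlier is negative.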

The second challenge is how to encode the offset of a thread local variable in the inline assembly. The X86 syntax does not carry over to aarch64, because aarch64 instructions have a fixed length of 32 bits (the 16-bit Thumb encodings belong to 32-bit Arm, not to aarch64). Since an offset of up to 64 bits cannot fit into a single instruction, aarch64 mostly relies on the GOT (e.g., the C statement preempt_cnt++; would be compiled to use the GOT by default). As for inline assembly support, a list of operators and their corresponding aarch64 relocation types is available in System V ABI for the Arm 64-bit Architecture (AArch64) and ELF for the Arm 64-bit Architecture (AArch64).

So the first solution I came up with uses the GOT (lines 34 to 63 in preempt.h). This solution finds the address of the GOT entry for preempt_cnt via _GLOBAL_OFFSET_TABLE_ and :gottprel_lo12:, and takes one extra load from the GOT entry to fetch the thread pointer offset of preempt_cnt.
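A minimal sketch of a GOT-based increment, not the code from preempt.h: instead of going through _GLOBAL_OFFSET_TABLE_, it locates the GOT entry with the conventional :gottprel: and :gottprel_lo12: operator pair that compilers emit for the "initial exec" model. The function name preempt_disable_got is hypothetical, and the plain C fallback is only there so the sketch builds on other targets.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 32-bit per-thread counter, as described in the text. */
__thread uint32_t preempt_cnt;

/* Increment preempt_cnt, fetching its TP offset from the GOT. */
static inline void preempt_disable_got(void)
{
#if defined(__aarch64__) && defined(__linux__)
    uint64_t tp, off;
    uint32_t val;

    __asm__ volatile(
        "mrs  %0, tpidr_el0\n\t"                          /* thread pointer     */
        "adrp %1, :gottprel:preempt_cnt\n\t"              /* page of GOT entry  */
        "ldr  %1, [%1, #:gottprel_lo12:preempt_cnt]\n\t"  /* load the TP offset */
        "ldr  %w2, [%0, %1]\n\t"                          /* read the counter   */
        "add  %w2, %w2, #1\n\t"
        "str  %w2, [%0, %1]"                              /* write it back      */
        : "=&r"(tp), "=&r"(off), "=&r"(val)
        :
        : "memory");
#else
    /* Portable fallback; the compiler picks the TLS access sequence. */
    preempt_cnt++;
#endif
}
```

The second ldr is the extra memory access: the offset is not in the instruction stream but in a GOT slot that is filled in when the program is loaded. Note that a linker may relax this pair into immediate moves when producing an executable.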

As we have learned from the X86_64 assembly, the thread pointer offset is known to the static linker (static TLS has all its information in the executable, ready to load upon launch), and the offset value should be relatively small (I checked that it is 0x14). Therefore, I tried to use the :tprel_lo12: operator to encode the thread pointer offset directly into an ldr instruction (which has a 12-bit immediate field). However, I kept getting this linker error:

relocation truncated to fit: R_AARCH64_TLSLE_LDST64_TPREL_LO12 against symbol `preempt_cnt' defined in .tbss section in preempt.o

I thought the error occurred because I had picked the wrong operator, so I tried many other operators, as well as any flags that looked related, such as -fno-pie, -fno-pic, -mtls-size=12, and -mcmodel=tiny. At one point I thought the -mtls-size=12 flag had worked, but it turned out it had not. However, during the trial and error, I observed that linking passed if I had only one thread local variable, or if I added the -static flag.

So I downloaded the binutils source code to find out whether the linker made mistakes in arranging thread local variables. In bfd/elfxx-aarch64.c, around line 300, one statement gave me the clue: if (old_addend & ((1 << howto->rightshift) - 1)) return bfd_reloc_overflow;. I finally realized that “truncated” in the error message could actually mean that the low-order bits are non-zero and would be dropped to fit the value into the 12-bit immediate field. Then I checked that the preempt_cnt data type is 4 bytes long, so the second thread local variable is not 8-byte aligned, while my ldr instruction was the 64-bit form, whose scaled immediate drops bit 2. After I changed to the 32-bit ldr, the error was gone.
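The check can be replayed with plain arithmetic. The sketch below mirrors the quoted condition; the helper name and the enum constants are made up, with the shift values 3 and 2 corresponding to the 8-byte and 4-byte scaling of the 64-bit and 32-bit ldr immediates.

```c
#include <assert.h>
#include <stdint.h>

/* Shift applied to the offset before it goes into the imm12 field. */
enum { LDR64_RIGHTSHIFT = 3, LDR32_RIGHTSHIFT = 2 };

/* Mirrors the quoted check: non-zero when any of the low `rightshift` bits,
 * which the scaled immediate cannot encode, are set in the offset. */
static int low_bits_would_be_dropped(uint64_t tp_offset, unsigned rightshift)
{
    return (tp_offset & ((UINT64_C(1) << rightshift) - 1)) != 0;
}
```

With the offset 0x14 from above, low_bits_would_be_dropped(0x14, LDR64_RIGHTSHIFT) is non-zero (bit 2 is set), while low_bits_would_be_dropped(0x14, LDR32_RIGHTSHIFT) is 0. An 8-byte aligned offset such as 0x10, which is what a single variable placed right after a 16-byte TCB would get (an assumption about the layout, not something measured here), passes both checks, which fits the observation that linking succeeded with only one thread local variable.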

Finally, we can verify that the second solution (so-called “local exec”) saves instructions (one of which is a memory access) by comparing the disassembly of the two solutions, as follows.
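For reference, a minimal sketch of what a local exec increment can look like in inline assembly. It is an illustration rather than the code from preempt.h: the function name preempt_disable_le is hypothetical, it assumes preempt_cnt sits within the first 4 KiB of the static TLS block so that :tprel_lo12: alone covers the offset, and the plain C fallback is only there so the sketch builds on other targets.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: a 32-bit per-thread counter, as described in the text. */
__thread uint32_t preempt_cnt;

/* Increment preempt_cnt with the TP offset encoded in the instructions. */
static inline void preempt_disable_le(void)
{
#if defined(__aarch64__) && defined(__linux__)
    uint64_t tp;
    uint32_t val;

    __asm__ volatile(
        "mrs %0, tpidr_el0\n\t"                         /* thread pointer */
        /* 32-bit (w register) accesses: the immediate is scaled by 4, so a
         * 4-byte aligned offset such as 0x14 is encodable. */
        "ldr %w1, [%0, #:tprel_lo12:preempt_cnt]\n\t"
        "add %w1, %w1, #1\n\t"
        "str %w1, [%0, #:tprel_lo12:preempt_cnt]"
        : "=&r"(tp), "=&r"(val)
        :
        : "memory");
#else
    /* Portable fallback; the compiler picks the TLS access sequence. */
    preempt_cnt++;
#endif
}
```

Compared with a GOT-based sequence, there is no address computation for a GOT entry and no load of the offset from memory; the offset is resolved by the static linker and lives in the immediate fields of ldr and str.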

Credits