Relocation and Thread Local Storage
10 Mar 2022
Linux, ELF, loader, linking, PIC, thread_local_storage, assembly, compiler
Over the past two days, the following excellent articles taught me a lot about how the loader relocates ELF executables and how thread local storage is handled:
- Load-time relocation of shared libraries
- Position Independent Code (PIC) in shared libraries
- Position Independent Code (PIC) in shared libraries on x64
- A Deep dive into (implicit) Thread Local Storage
- All about thread-local storage
Why did I suddenly start learning about ELF executable loading and thread local storage?
It was triggered by a specific problem:
while porting a userspace threading implementation (Shenango) to aarch64,
I needed to come up with the aarch64 equivalent of this inline assembly:
asm volatile ("addl $1, %%fs:preempt_cnt@tpoff" : : : "memory", "cc");
This inline assembly increments a 32-bit value stored in memory,
more specifically a thread local variable called preempt_cnt.
The add instruction encodes the offset of preempt_cnt
from the thread pointer (@tpoff stands for thread pointer offset),
and the CPU forms the full address by adding that offset to the base held in the FS segment register.
On X86, the FS and GS registers were introduced as extra segment registers for segmented memory addressing,
while on X86_64, where segmentation is otherwise mostly unused, they are repurposed for Thread Local Storage (TLS).
When the kernel switches threads, it sets up the FS register (Windows and macOS use GS)
with the address (the thread pointer, TP) of the memory allocated for the thread's context.
User code can address thread local variables relative to FS, but is not supposed to change its value.
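To make the FS-relative access concrete, here is a minimal sketch of such an increment wrapped in a function. The helper names preempt_cnt_inc and preempt_cnt_read are hypothetical (they are not taken from the original header), and the FS-relative path assumes X86_64 Linux with the variable linked into the executable itself; everywhere else the sketch falls back to plain C.

```c
/* Stand-in for the article's counter: one copy per thread. */
__thread unsigned int preempt_cnt;

/* Hypothetical helper: increment the calling thread's counter. */
static inline void preempt_cnt_inc(void)
{
#if defined(__x86_64__) && defined(__linux__)
    /* A single instruction: the displacement is the variable's offset from
     * the thread pointer, and the CPU adds the FS base at run time. */
    __asm__ volatile("addl $1, %%fs:preempt_cnt@tpoff" : : : "memory", "cc");
#else
    /* Portable fallback: let the compiler choose the TLS access sequence. */
    preempt_cnt++;
#endif
}

/* Hypothetical helper: read the calling thread's counter. */
static inline unsigned int preempt_cnt_read(void)
{
    return preempt_cnt;
}
```

The "memory" clobber matters here: the asm statement names no C operands, so without it the compiler would be free to assume preempt_cnt is unchanged across the statement.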
The preempt_cnt variable is defined in a .c file as follows and is archived as part of a static library.
Two inline functions in the header file manipulate preempt_cnt:
And here is a simple multi-threaded program to demonstrate the usage:
If we compile it on an X86_64 machine,
the following assembly snippet shows how the embedded add instruction is encoded:
the offset (0xfffffffffffffffc, i.e. -4) takes up 4 bytes (32 bits).
This is because preempt_cnt uses the relocation type R_X86_64_TPOFF32
(try readelf -r tls.o | grep preempt),
which means that a hard-coded 32-bit offset from the thread pointer is sufficient for the instruction to generate the real virtual address during execution
(the thread pointer itself is only known at run time, but the offset from it is fixed), so the loader does not need to modify the code or populate a Global Offset Table (GOT) entry for this instruction.
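The arithmetic behind that offset can be sketched in a few lines of C. The function names and the thread pointer value used in the usage note are made up for illustration; only the 32-bit displacement 0xfffffffc (that is, -4) comes from the text above.

```c
#include <stdint.h>

/* Sign-extend a 32-bit displacement to 64 bits, as the CPU does before
 * adding it to the FS base. */
static inline int64_t sign_extend_disp32(uint32_t disp32)
{
    return (int64_t)(int32_t)disp32;
}

/* Effective address = thread pointer + sign-extended displacement. */
static inline uint64_t tls_effective_address(uint64_t thread_pointer,
                                             uint32_t disp32)
{
    return thread_pointer + (uint64_t)sign_extend_disp32(disp32);
}
```

With a hypothetical thread pointer of 0x7f0000001000, the displacement 0xfffffffc resolves to 0x7f0000000ffc, four bytes below the thread pointer. Only the thread pointer differs between threads; the displacement baked into the instruction is the same for all of them.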
Sorry for throwing so much terminology at you,
but I strongly recommend getting a basic understanding of relocation from the excellent trilogy
(1, 2, 3) written by Eli Bendersky.
I will not even try to give my own explanation,
so here is an extremely simplified explanation of relocation from chao-tic:
Relocation is required because the compiler doesn’t know where the variables defined in shared objects are located at runtime, so a Global Offset Table (GOT) is set aside by the compiler and only gets filled in the appropriate location values at runtime by the dynamic linker.
The first challenge is that the aarch64 architecture does not have an FS segment register like X86.
Instead it provides a system register, TPIDR_EL0,
which, as the name and the documentation suggest,
is supposed to store the thread ID,
which can then be used to look up thread-related information in some table.
In practice, kernel designers (e.g., Fuchsia) simplify the process by directly storing the address of the Thread Control Block
(TCB, which usually sits at the beginning or the end of TLS).
It is worth mentioning that the TLS layout on aarch64 follows variant I of the ELF TLS specification,
while X86 follows variant II for historical reasons.
Figure 1 and Figure 2 in ELF Handling For Thread-Local Storage are very intuitive and helpful for understanding the two layout variants.
Unfortunately,
the system register TPIDR_EL0 cannot be used directly as a base in load instructions the way FS can on X86,
so we have to use the mrs instruction to read the thread pointer from TPIDR_EL0
into a general purpose register first.
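The difference between the two architectures can be observed with a small sketch that reads the thread pointer and measures how far a thread local variable sits from it. The names tls_probe, read_thread_pointer and tls_probe_offset are hypothetical, and the sketch assumes Linux on aarch64 or X86_64 with the variable linked into the main executable; on other platforms read_thread_pointer simply returns 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical thread local variable used only for this illustration. */
__thread int tls_probe;

/* Read the thread pointer; returns 0 where this sketch has no code path. */
static inline uintptr_t read_thread_pointer(void)
{
    uintptr_t tp = 0;
#if defined(__aarch64__) && defined(__linux__)
    /* TPIDR_EL0 is a system register, so copy it out with mrs first. */
    __asm__ volatile("mrs %0, tpidr_el0" : "=r"(tp));
#elif defined(__x86_64__) && defined(__linux__)
    /* On X86_64 Linux the word stored at %fs:0 is the thread pointer itself. */
    __asm__ volatile("movq %%fs:0, %0" : "=r"(tp));
#endif
    return tp;
}

/* Signed distance from the thread pointer to tls_probe. */
static inline ptrdiff_t tls_probe_offset(void)
{
    return (ptrdiff_t)((uintptr_t)&tls_probe - read_thread_pointer());
}
```

Under these assumptions the offset should come out negative on X86_64 (variant II, TLS below the thread pointer) and positive on aarch64 (variant I, TLS above the thread pointer, after the TCB).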
The second challenge is how to encode the offset of a thread local variable in the inline assembly.
The same syntax does not apply to aarch64 assembly,
because aarch64 (A64) instructions have a fixed length of 32 bits
(the 16-bit Thumb encodings exist only in the 32-bit Arm state, not in aarch64).
Since a long offset of up to 64 bits cannot fit in a single instruction,
aarch64 mostly relies on the GOT
(e.g., the C statement preempt_cnt++; would be compiled to use the GOT by default).
As for inline assembly support,
a list of operators and their corresponding aarch64 relocation types is available in System V ABI for the Arm 64-bit Architecture (AArch64) and ELF for the Arm 64-bit Architecture (AArch64).
So the first solution I figured out uses the GOT (lines 34 to 63 in preempt.h).
This solution finds the address of the GOT entry for preempt_cnt
via _GLOBAL_OFFSET_TABLE_ and :gottprel_lo12:,
and takes one extra load from the GOT entry to fetch the thread pointer offset of preempt_cnt.
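As a rough illustration of a GOT-based access, here is a sketch that is not the preempt.h listing referred to above. It uses the conventional adrp plus :gottprel: / :gottprel_lo12: pair to reach the GOT entry, rather than going through _GLOBAL_OFFSET_TABLE_, and the function name preempt_cnt_inc_got is hypothetical. It assumes aarch64 Linux with a GNU-compatible assembler; on other targets it falls back to plain C.

```c
/* Stand-in for the article's counter: one copy per thread. */
__thread unsigned int preempt_cnt;

/* Hypothetical helper: increment preempt_cnt through its GOT entry. */
static inline void preempt_cnt_inc_got(void)
{
#if defined(__aarch64__) && defined(__linux__)
    unsigned long off, tp;
    unsigned int val;
    __asm__ volatile(
        "adrp %0, :gottprel:preempt_cnt\n\t"              /* page of the GOT entry   */
        "ldr  %0, [%0, #:gottprel_lo12:preempt_cnt]\n\t"  /* extra load: TP offset   */
        "mrs  %1, tpidr_el0\n\t"                          /* thread pointer          */
        "ldr  %w2, [%1, %0]\n\t"                          /* load the 32-bit counter */
        "add  %w2, %w2, #1\n\t"
        "str  %w2, [%1, %0]"                              /* store it back           */
        : "=&r"(off), "=&r"(tp), "=&r"(val)
        :
        : "memory");
#else
    /* Portable fallback for targets this sketch does not cover. */
    preempt_cnt++;
#endif
}
```

The second instruction is the memory access that the GOT approach costs: the offset is fetched from the GOT at run time instead of being encoded in the instruction stream.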
As we learned from the X86_64 assembly,
the thread pointer offset is known to the static linker
(static TLS has all the information in the executable, ready to load upon launch),
and the offset value should be relatively small (I checked: it is 0x14).
Therefore, I tried to use the :tprel_lo12: operator to encode the thread pointer offset directly into an ldr instruction (which has a 12-bit immediate field).
However, I kept getting this linker error:
relocation truncated to fit: R_AARCH64_TLSLE_LDST64_TPREL_LO12 against symbol `preempt_cnt' defined in .tbss section in preempt.o
I thought the error occurred because I had picked the wrong operator,
so I tried many other operators,
as well as every compiler and linker flag that looked related,
such as -fno-pie, -fno-pic, -mtls-size=12, and -mcmodel=tiny.
At one point I thought the -mtls-size=12 flag worked, but it turned out it did not.
However, during the trial and error
I observed that linking succeeded if I had only one thread local variable,
or if I added the -static flag.
So I downloaded the binutils source code
to find out whether the linker was making a mistake when arranging thread local variables.
In bfd/elfxx-aarch64.c, at about line 300, one statement gave me the clue:
if (old_addend & ((1 << howto->rightshift) - 1)) return bfd_reloc_overflow;
I finally realized that “truncated” in the error message
could actually mean that some low-order bits are non-zero and would be lost when the offset is shifted to fit into the 12-bit immediate field.
Then I checked: the preempt_cnt data type is 4 bytes long,
so the second thread local variable is not 8-byte aligned,
while my ldr instruction was in its 64-bit form, which scales the immediate by 8 and would therefore drop bit 2 of the offset.
After I changed to the 32-bit ldr, the error was gone.
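The check can be replayed on the 0x14 offset with a few lines of C. The function name offset_is_encodable is hypothetical, and the shift amounts are an assumption based on how A64 scales load/store immediates by the access size: 3 for the 64-bit ldr and 2 for the 32-bit one.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mirrors the quoted linker check: a scaled load/store immediate can only
 * encode offsets whose low `rightshift` bits are all zero. */
static inline bool offset_is_encodable(uint64_t offset, unsigned rightshift)
{
    return (offset & ((UINT64_C(1) << rightshift) - 1)) == 0;
}
```

For the offset 0x14 (binary 10100), offset_is_encodable(0x14, 3) is false because bit 2 is set, which corresponds to the overflow error, while offset_is_encodable(0x14, 2) is true, which corresponds to the 32-bit ldr linking cleanly.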
Finally, we can verify that the second solution (so-called “local exec”) saves instructions (one of which is a memory access) by comparing the disassembly of the two as follows.
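For reference, a local exec increment might look roughly like the sketch below. It is not the original preempt.h code: the function name preempt_cnt_inc_le is hypothetical, and it assumes aarch64 Linux, a toolchain that implements the R_AARCH64_TLSLE_LDST32_TPREL_LO12 relocation, a variable linked into the executable itself, and a thread pointer offset below 4096 so that it fits the 12-bit field. Other targets take the plain C fallback.

```c
/* Stand-in for the article's counter: one copy per thread. */
__thread unsigned int preempt_cnt;

/* Hypothetical helper: increment preempt_cnt with the offset encoded
 * directly in the load and store instructions. */
static inline void preempt_cnt_inc_le(void)
{
#if defined(__aarch64__) && defined(__linux__)
    unsigned long tp;
    unsigned int val;
    __asm__ volatile(
        "mrs %0, tpidr_el0\n\t"                          /* thread pointer           */
        "ldr %w1, [%0, #:tprel_lo12:preempt_cnt]\n\t"    /* 32-bit load, no GOT read */
        "add %w1, %w1, #1\n\t"
        "str %w1, [%0, #:tprel_lo12:preempt_cnt]"        /* 32-bit store             */
        : "=&r"(tp), "=&r"(val)
        :
        : "memory");
#else
    /* Portable fallback for targets this sketch does not cover. */
    preempt_cnt++;
#endif
}
```

Using the w register form is what selects the 32-bit load and store, so the offset only has to be 4-byte aligned.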
Credits
- Load-time relocation of shared libraries
- Position Independent Code (PIC) in shared libraries
- Position Independent Code (PIC) in shared libraries on x64
- A Deep dive into (implicit) Thread Local Storage
- All about thread-local storage
- ELF Handling For Thread-Local Storage
- System V ABI for the Arm 64-bit Architecture (AArch64)
- ELF for the Arm 64-bit Architecture (AArch64)