Golang Internals, Part 5: the Runtime Bootstrap Process

by Siarhei MatsiukevichApril 2, 2015

This blog post explores the specifics of resizable stacks and internal TLS implementation.

Golang Internals Go Runtime and Bootstrapping

The bootstrapping process is the key to understanding how the Go runtime works. Learning it is essential, if you want to move forward with Go. So, the fifth installment in our Golang Internals series is dedicated to the Go runtime and, specifically, the Go bootstrap process. This time you will learn about:

Go bootstrapping
resizable stacks implementation
internal TLS implementation

Note that this post contains a lot of assembler code, and you will need at least some basic knowledge of it to proceed (here is a quick guide to the Go assembler). Let’s get going!

Table of Contents

Finding an entry point

First, we need to find which function is executed immediately after we start a Go program. To do this, we will write a simple Go app.

package main

func main() {
	print(123)
}

Then we need to compile and link it.

go tool 6g test.go
go tool 6l test.6

This will create an executable file called 6.out in your current directory. The next step involves the objdump tool, which is specific to Linux. Windows and Mac users can find analogs or skip this step altogether. Now, run the following command.

objdump -f 6.out

You should get the output, which will contain the start address.

6.out:     file format elf64-x86-64
architecture: i386:x86-64, flags 0x00000112:
EXEC_P, HAS_SYMS, D_PAGED
start address 0x000000000042f160

Next, we need to disassemble the executable and find which function is located at this address.

objdump -d 6.out > disassemble.txt

Then we need to open the disassemble.txt file and search for 42f160. The output that we got is shown below.

000000000042f160 <_rt0_amd64_linux>:
  42f160:	48 8d 74 24 08       		lea    0x8(%rsp),%rsi
  42f165:	48 8b 3c 24          		mov    (%rsp),%rdi
  42f169:	48 8d 05 10 00 00 00 	lea    0x10(%rip),%rax        # 42f180 <main>
  42f170:	ff e0               		 	jmpq   *%rax

Nice, we have found it! The entry point for my OS and architecture is a function called _rt0_amd64_linux.

The starting sequence

Now, we need to find this function in the Go runtime sources. It is located in the rt0_linux_amd64.s file. If you look inside the Go runtime package, you can find many file names with postfixes related to OS and architecture names. When a runtime package is built, only the files that correspond to the current OS and architecture are selected. The rest are skipped. Let’s take a closer look at rt0_linux_amd64.s.

TEXT _rt0_amd64_linux(SB),NOSPLIT,$-8
	LEAQ	8(SP), SI // argv
	MOVQ	0(SP), DI // argc
	MOVQ	$main(SB), AX
	JMP	AX

TEXT main(SB),NOSPLIT,$-8
	MOVQ	$runtime·rt0_go(SB), AX
	JMP	AX

The _rt0_amd64_linux function is very simple. It calls the main function and saves arguments (argc and argv) in registers (DI and SI). The arguments are located in the stack and can be accessed via the SP (stack pointer) register. The main function is also very simple. It calls the runtime.rt0_go function, which is longer and more complicated, so we will break it into small parts and describe each separately. The first section goes like this.

	MOVQ	DI, AX		// argc
	MOVQ	SI, BX		// argv
	SUBQ	$(4*8+7), SP		// 2args 2auto
	ANDQ	$~15, SP
	MOVQ	AX, 16(SP)
	MOVQ	BX, 24(SP)

Here, we put some previously saved command line argument values inside the AX and BX decrease stack pointers. We also add space for two more four-byte variables and adjust it to be 16-bit aligned. Finally, we move the arguments back to the stack.

	// create istack out of the given (operating system) stack.
	// _cgo_init may update stackguard.
	MOVQ	$runtime·g0(SB), DI
	LEAQ	(-64*1024+104)(SP), BX
	MOVQ	BX, g_stackguard0(DI)
	MOVQ	BX, g_stackguard1(DI)
	MOVQ	BX, (g_stack+stack_lo)(DI)
	MOVQ	SP, (g_stack+stack_hi)(DI)

The second part is a bit more tricky. First, we load the address of the global runtime.g0 variable into the DI register. This variable is defined in the proc1.go file and belongs to the runtime,g type. Variables of this type are created for each goroutine in the system. As you can guess, runtime.g0 describes a root goroutine. Then, we initialize the fields that describe the stack of the root goroutine. The meaning of stack.lo and stack.hi should be clear. These are pointers to the beginning and the end of the stack for the current goroutine, but what are the stackguard0 and stackguard1 fields? To understand this, we need to set aside the investigation of the runtime.rt0_go function and take a closer look at stack growth in Go.

Resizable stack implementation in Go

The Go language uses resizable stacks. Each goroutine starts with a small stack, and its size changes each time a certain threshold is reached. Obviously, there is a way to check whether we have reached this threshold or not. In fact, the check is performed at the beginning of each function. To see how it works, let’s compile our sample program one more time with the -S flag (this will show the generated assembler code). The beginning of the main function looks like this.

"".main t=1 size=48 value=0 args=0x0 locals=0x8
	0x0000 00000 (test.go:3)	TEXT	"".main+0(SB),$8-0
	0x0000 00000 (test.go:3)	MOVQ	(TLS),CX
	0x0009 00009 (test.go:3)	CMPQ	SP,16(CX)
	0x000d 00013 (test.go:3)	JHI	,22
	0x000f 00015 (test.go:3)	CALL	,runtime.morestack_noctxt(SB)
	0x0014 00020 (test.go:3)	JMP	,0
	0x0016 00022 (test.go:3)	SUBQ	$8,SP

First, we load a value from thread local storage (TLS) to the CX register (we have already explained what TLS is in the previous article). This value always contains a pointer to the runtime.g structure that corresponds to the current goroutine. Then we compare the stack pointer to the value located at an offset of 16 bytes in the runtime.g structure. We can easily calculate that this corresponds to the stackguard0 field.

So, this is how we check if we have reached the stack threshold. If we haven’t reached it yet, the check fails. In this case, we call the runtime.morestack_noctxt function repeatedly until enough memory has been allocated for the stack. The stackguard1 field works very similarly to stackguard0, but it is used inside the C stack growth prologue instead of Go. The inner workings of runtime.morestack_noctxt is also a very interesting topic, but we will discuss it later. For now, let’s return to the bootstrap process.

Further investigation of Go bootstrapping

We will proceed with the starting sequence by looking at the next portion of code inside the runtime.rt0_go function.

	// find out information about the processor we're on
	MOVQ	$0, AX
	CPUID
	CMPQ	AX, $0
	JE	nocpuinfo

	// Figure out how to serialize RDTSC.
	// On Intel processors LFENCE is enough. AMD requires MFENCE.
	// Don't know about the rest, so let's do MFENCE.
	CMPL	BX, $0x756E6547  // "Genu"
	JNE	notintel
	CMPL	DX, $0x49656E69  // "ineI"
	JNE	notintel
	CMPL	CX, $0x6C65746E  // "ntel"
	JNE	notintel
	MOVB	$1, runtime·lfenceBeforeRdtsc(SB)
notintel:

	MOVQ	$1, AX
	CPUID
	MOVL	CX, runtime·cpuid_ecx(SB)
	MOVL	DX, runtime·cpuid_edx(SB)
nocpuinfo:

This part is not crucial for understanding major Go concepts, so we will look through it briefly. Here, we are trying to figure out which processor we are using. If it is Intel, we set the runtime·lfenceBeforeRdtsc variable. The runtime·cputicks method is the only place where this variable is used. This method utilizes a different assembler instruction to get cpu ticks depending on the value of runtime·lfenceBeforeRdtsc. Finally, we call the CPUID assembler instruction, execute it, and save the result in the runtime·cpuid_ecx and runtime·cpuid_edx variables. These are used in the alg.go file to select a proper hashing algorithm that is natively supported by your computer’s architecture.

Ok, let’s move on and examine another portion of code.

	// if there is an _cgo_init, call it.
	MOVQ	_cgo_init(SB), AX
	TESTQ	AX, AX
	JZ	needtls
	// g0 already in DI
	MOVQ	DI, CX	// Win64 uses CX for first parameter
	MOVQ	$setg_gcc<>(SB), SI
	CALL	AX

	// update stackguard after _cgo_init
	MOVQ	$runtime·g0(SB), CX
	MOVQ	(g_stack+stack_lo)(CX), AX
	ADDQ	$const__StackGuard, AX
	MOVQ	AX, g_stackguard0(CX)
	MOVQ	AX, g_stackguard1(CX)

	CMPL	runtime·iswindows(SB), $0
	JEQ ok

This fragment is only executed when cgo is enabled.

The next code fragment is responsible for setting up TLS.

needtls:
	// skip TLS setup on Plan 9
	CMPL	runtime·isplan9(SB), $1
	JEQ ok
	// skip TLS setup on Solaris
	CMPL	runtime·issolaris(SB), $1
	JEQ ok

	LEAQ	runtime·tls0(SB), DI
	CALL	runtime·settls(SB)

	// store through it, to make sure it works
	get_tls(BX)
	MOVQ	$0x123, g(BX)
	MOVQ	runtime·tls0(SB), AX
	CMPQ	AX, $0x123
	JEQ 2(PC)
	MOVL	AX, 0	// abort

We have already mentioned TLS before. Now, it is time to understand how it is implemented.

Internal TLS implementation

If you look at the previous code fragment carefully, you can easily understand what the only lines that do actual work are.

LEAQ	runtime·tls0(SB), DI
	CALL	runtime·settls(SB)

All the other stuff is used to skip TLS setup, when it is not supported on your OS, and check that TLS works correctly. The two lines above store the address of the runtime·tls0 variable in the DI register and call the runtime·settls function. The code of this function is shown below.

// set tls base to DI
TEXT runtime·settls(SB),NOSPLIT,$32
	ADDQ	$8, DI	// ELF wants to use -8(FS)

	MOVQ	DI, SI
	MOVQ	$0x1002, DI	// ARCH_SET_FS
	MOVQ	$158, AX	// arch_prctl
	SYSCALL
	CMPQ	AX, $0xfffffffffffff001
	JLS	2(PC)
	MOVL	$0xf1, 0xf1  // crash
	RET

From the comments, we can understand that this function makes the arch_prctl system call and passes ARCH_SET_FS as an argument. We can also see that this system call sets a base for the FS segment register. In our case, we set TLS to point to the runtime·tls0 variable.

Do you remember the instruction that we saw at the beginning of the assembler code for the main function?

	0x0000 00000 (test.go:3)	MOVQ	(TLS),CX

We have previously explained that it loads the address of the runtime.g structure instance into the CX register. This structure describes the current goroutine and is stored in a thread local storage. Now, we can find out and understand how this instruction is translated into a machine assembler. If you open the previously created disassembly.txt file and look for the main.main function, the first instruction inside it should look like this.

400c00:       64 48 8b 0c 25 f0 ff    mov    %fs:0xfffffffffffffff0,%rcx

The colon in this instruction (%fs:0xfffffffffffffff0) stands for segmentation addressing (you can read more on it in this tutorial).

Returning to the starting sequence

Finally, let’s look at the last two parts of the runtime.rt0_go function.

ok:
	// set the per-goroutine and per-mach "registers"
	get_tls(BX)
	LEAQ	runtime·g0(SB), CX
	MOVQ	CX, g(BX)
	LEAQ	runtime·m0(SB), AX

	// save m->g0 = g0
	MOVQ	CX, m_g0(AX)
	// save m0 to g0->m
	MOVQ	AX, g_m(CX)

Here, we load the TLS address into the BX register and save the address of the runtime·g0 variable in TLS. We also initialize the runtime.m0 variable. If runtime.g0 stands for root goroutine, then runtime.m0 corresponds to the root operating system thread used to run this goroutine. We may take a closer look at runtime.g0 and runtime.m0 structures in the upcoming blog posts.

The final part of the starting sequence initializes arguments and calls different functions, but this is a topic for a separate discussion. So, we have learned the inner mechanisms of the bootstrap process and found out how stacks are implemented. To move forward, we needs to analyze the last part of the starting sequence, which will be the subject of our next blog post.