The Linux MM System: Initialization


There are three essential stages in the MM initialization process: turning on paging, initializing the kernel page tables, and performing further VM subsystem initialization tasks.


Turning On Paging (i386)

The kernel code is loaded at physical address 0x100000 (1MB), which is then remapped to PAGE_OFFSET+0x100000 when paging is turned on. This is done using compiled-in page tables (in arch/i386/kernel/head.S) that map the physical range 0-8MB both to itself and to PAGE_OFFSET...PAGE_OFFSET+8MB. Then we jump to start_kernel() in init/main.c, which is located at PAGE_OFFSET+some_address. This is a bit tricky: it is critical that the code in head.S that turns on paging do so in such a way that the address space it is executing from remains valid; hence the 0-8MB identity mapping. start_kernel() is not called until paging is turned on, and it assumes it is running at PAGE_OFFSET+whatever. Thus the page tables in head.S must also map the addresses used by the kernel code for the jump to start_kernel() to succeed; hence the PAGE_OFFSET mapping.

There is some magical code right after paging is enabled in head.S:

/*
 * Enable paging
 */
3:
	movl $swapper_pg_dir-__PAGE_OFFSET,%eax
	movl %eax,%cr3		/* set the page table pointer.. */
	movl %cr0,%eax
	orl $0x80000000,%eax
	movl %eax,%cr0		/* ..and set paging (PG) bit */
	jmp 1f			/* flush the prefetch-queue */
1:
	movl $1f,%eax
	jmp *%eax		/* make sure eip is relocated */
1:
The code between the two 1: labels loads the address of the second label 1: into EAX and jumps there. At this point the instruction pointer EIP is pointing to physical location 1MB+something. The labels are all in kernel virtual space (PAGE_OFFSET+something), so this code effectively relocates the instruction pointer from physical to virtual space.

The start_kernel() function initializes all kernel data and then starts the "init" kernel thread. One of the first things that happens in start_kernel() is a call to setup_arch(), an architecture-specific setup function which handles low-level initialization details. For x86 platforms, that function lives in arch/i386/kernel/setup.c.

The first memory-related thing setup_arch() does is compute the number of low-memory and high-memory pages available; the boundary page frame numbers are stored in the global variables highstart_pfn (the first high-memory page) and highend_pfn (the last page). High memory is memory not directly mappable into kernel VM; this is discussed further below.

Next, setup_arch() calls init_bootmem() to initialize the boot-time memory allocator. The bootmem allocator is used only during boot, to allocate pages for permanent kernel data. We will not be much concerned with it henceforth. The important thing to remember is that the bootmem allocator provides pages for kernel initialization, and those pages are permanently reserved for kernel purposes, almost as if they were loaded with the kernel image; they do not participate in any MM activity after boot.


Turning On Paging (x86_64)

The kernel code is loaded at physical address 0x100000 (1MB), which is then remapped to __START_KERNEL_map+0x100000 when paging is turned on. This is done using compiled-in page tables (in arch/x86_64/kernel/head.S) that map the low physical range both to itself and to the kernel mapping at __START_KERNEL_map.

$ cat  /proc/kallsyms | more
ffffffff80100f18 T _stext
ffffffff80100f18 T stext
ffffffff80101000 T init_level4_pgt
ffffffff80102000 T level3_ident_pgt
ffffffff80103000 T level3_kernel_pgt
ffffffff80104000 T level2_ident_pgt
ffffffff801040a0 T temp_boot_pmds
ffffffff80105000 T level2_kernel_pgt
ffffffff80106000 T empty_zero_page
ffffffff80107000 T empty_bad_page
ffffffff80108000 T empty_bad_pte_table
ffffffff80109000 T empty_bad_pmd_table
ffffffff8010a000 T level3_physmem_pgt
ffffffff8010b000 T wakeup_level4_pgt
ffffffff8010c000 T boot_level4_pgt
ffffffff8010d000 t run_init_process

/* include/asm-x86_64/page.h, line 79 */
 79 #define __PHYSICAL_START        ((unsigned long)CONFIG_PHYSICAL_START)
/* arch/x86_64/defconfig, line 123 */
123 CONFIG_PHYSICAL_START=0x100000
/* include/asm-x86_64/page.h, lines 81-82 */
 81 #define __START_KERNEL_map      0xffffffff80000000UL
 82 #define __PAGE_OFFSET           0xffff810000000000UL

Then we jump to start_kernel() in init/main.c, which is located at __START_KERNEL_map+some_address (as the kallsyms listing above shows). This is a bit tricky: it is critical that the code in head.S that turns on paging do so in such a way that the address space it is executing from remains valid; hence the identity mapping of the low physical range. start_kernel() is not called until paging is turned on, and it assumes it is running in the kernel mapping. Thus the page tables in head.S must also map the addresses used by the kernel code for the jump to start_kernel() to succeed; hence the __START_KERNEL_map mapping.

/* arch/x86_64/kernel/head.S, line 198 */
198         .quad   x86_64_start_kernel
/* arch/x86_64/kernel/head64.c, line 79 */
 79 void __init x86_64_start_kernel(char * real_mode_data)
/*     ...which at line 117 calls */
117    start_kernel();
/* init/main.c, line 445 */
445 asmlinkage void __init start_kernel(void)

There is some magical code right after paging is enabled in head.S:

/* arch/x86_64/kernel/head.S:
 * Enable paging
 */
 88     btsl    $31, %eax     /* Enable paging and in turn activate Long Mode */
 89     btsl    $0, %eax      /* Enable protected mode */
 90     /* Make changes effective */
 91     movl    %eax, %cr0
 92     /*
 93      * At this point we're in long mode but in 32bit compatibility mode
 94      * with EFER.LME = 1, CS.L = 0, CS.D = 1 (and in turn
 95      * EFER.LMA = 1). Now we want to jump in 64bit mode, to do that we 
 96      * use the new gdt/idt that has __KERNEL_CS with CS.L = 1.
 97      */
 98     ljmp    $__KERNEL_CS, $(startup_64 - __START_KERNEL_map)
 99 
100     .code64
101     .org 0x100      
102     .globl startup_64
103 startup_64:
At the ljmp, the CPU is in 32-bit compatibility mode, executing at an identity-mapped physical address near 1MB. The far jump loads __KERNEL_CS, whose CS.L = 1 bit switches execution to full 64-bit mode, and transfers control to startup_64 at its identity-mapped physical address (the label's virtual address minus __START_KERNEL_map). From startup_64 onward, the kernel runs as 64-bit code in the __START_KERNEL_map mapping.

The start_kernel() function initializes all kernel data and then starts the "init" kernel thread. One of the first things that happens in start_kernel() is a call to setup_arch(), an architecture-specific setup function which handles low-level initialization details. For x86_64 platforms, that function lives in arch/x86_64/kernel/setup.c.

The first memory-related thing setup_arch() does is call setup_memory_region() to obtain the memory map (from the BIOS e820 data) and compute the number of pages available. Note that on x86_64, unlike i386, all physical memory can be directly mapped into kernel VM, so there is no high-memory zone.

 /* include/asm-x86_64/e820.h */
 25 #define HIGH_MEMORY     (1024*1024)
 26 
 27 #define LOWMEMSIZE()    (0x9f000)
 /* arch/x86_64/kernel/init_task.c */
 17 struct mm_struct init_mm = INIT_MM(init_mm);
 /* arch/x86_64/kernel/setup.c */
 136 struct resource data_resource = {
 137         .name = "Kernel data",
 138         .start = 0,
 139         .end = 0,
 140         .flags = IORESOURCE_RAM,
 141 };
 142 struct resource code_resource = {
 143         .name = "Kernel code",
 144         .start = 0,
 145         .end = 0,
 146         .flags = IORESOURCE_RAM,
 147 };

After setup_arch(), setup_per_cpu_areas() is called, which uses alloc_bootmem() to allocate the per-CPU data areas from the boot-time memory allocator. The bootmem allocator is used only during boot, to allocate pages for permanent kernel data. We will not be much concerned with it henceforth. The important thing to remember is that the bootmem allocator provides pages for kernel initialization, and those pages are permanently reserved for kernel purposes, almost as if they were loaded with the kernel image; they do not participate in any MM activity after boot.


Initializing the Kernel Page Tables

Thereafter, setup_arch() calls paging_init(), defined in arch/i386/mm/init.c. (On the x86_64 architecture there are two versions of paging_init(), one for NUMA and one for non-NUMA configurations.) This function does several things. First, it calls pagetable_init() to map the entire physical memory, or as much of it as will fit between PAGE_OFFSET and 4GB, starting at PAGE_OFFSET.

In pagetable_init(), we actually build the kernel page tables that map the entire physical memory range to PAGE_OFFSET. This is simply a matter of doing the arithmetic and stuffing the correct values into the page directory and page tables. The mapping is created in swapper_pg_dir, the kernel page directory, which is also the page directory used to initiate paging. (Virtual addresses up to the next 4MB boundary past the end of memory are actually mapped here when using 4MB pages, but "that's OK as we won't use that memory anyway".) If there is physical memory left unmapped here - that is, memory with physical address greater than 4GB-PAGE_OFFSET - that memory is unusable unless the CONFIG_HIGHMEM option is set.

Near the end of pagetable_init() we call fixrange_init() to reserve pagetables (but not populate them) for compile-time-fixed virtual-memory mappings. These tables map virtual addresses that are hard-coded into the kernel, but which are not part of the loaded kernel data. The fixmap tables are mapped to physical pages allocated at run time, using the set_fixmap() call.

After initializing the fixmaps, if CONFIG_HIGHMEM is set, we also allocate some pagetables for the kmap() allocator. kmap() allows the kernel to map any page of physical memory into the kernel virtual address space for temporary use. It's used, for example, to provide mappings on an as-needed basis for physical pages that aren't directly mappable during pagetable_init().

The fixmap and kmap pagetables occupy a portion of the top of kernel virtual space - addresses which therefore cannot be used to permanently map physical pages in the PAGE_OFFSET mapping. For this reason, 128MB at the top of kernel VM is reserved (the vmalloc allocator also uses addresses in this range). Any physical pages that would otherwise land in that reserved range of the PAGE_OFFSET mapping are instead (if CONFIG_HIGHMEM is specified) included in the high memory zone, accessible to the kernel only via kmap(). If CONFIG_HIGHMEM is not set, those pages are completely unusable. This becomes an issue only on machines with a large amount of RAM (900-odd MB or more). For example, if PAGE_OFFSET==3GB and the machine has 2GB of RAM, only the first 1GB-128MB (896MB) of physical memory can be mapped between PAGE_OFFSET and the beginning of the fixmap/kmap address range. The remaining pages are still usable - in fact, for user-process mappings they act the same as direct-mapped pages - but the kernel cannot access them directly.

Back in paging_init(), we possibly initialize the kmap() system further by calling kmap_init(), which simply caches the first kmap pagetable [in the TLB?]. Then, we initialize the zone allocator by computing the zone sizes and calling free_area_init() to build the mem_map and initialize the freelists. All freelists are initialized empty and all pages are marked "reserved" (not accessible to the VM system); this situation is rectified later.

When paging_init() completes, we have in physical memory [note - this is not quite right for 2.4]:

0x00000000: 0-page
0x00100000: kernel-text
0x????????: kernel_data
0x????????=_end: whole-mem pagetables
0x????????: fixmap pagetables
0x????????: zone data (mem_map, zone_structs, freelists &c)
0x????????=start_mem: free pages
This chunk of memory is mapped by swapper_pg_dir and the whole-mem-pagetables to address PAGE_OFFSET.


Further VM Subsystem Initialization Tasks

Here we are back in start_kernel(). After paging_init() completes, we do some additional setup of other kernel subsystems, some of which allocate additional kernel memory using the bootmem allocator. Important among these, from the MM point of view, is kmem_cache_init(), which initializes the slab allocator data.

Shortly after kmem_cache_init() is called, we call mem_init(). This function completes the freelist initialization begun in free_area_init() by clearing the PG_RESERVED bit in the zone data for free physical pages; clearing the PG_DMA bit for pages that can't be used for DMA; and freeing all usable pages into their respective zones. That last step, done in free_all_bootmem_core() in bootmem.c, is interesting: it builds the buddy bitmaps and freelists describing all existing non-reserved pages by simply freeing them and letting free_pages_ok() do the right thing. Once mem_init() is called, the bootmem allocator is no longer usable, since all its pages have been freed into the zone allocator's world.

Segmentation

Segments are just used to carve up the linear address space into arbitrary chunks. The linear space is what's managed by the VM subsystem. The x86 architecture supports segmentation in hardware: you can specify addresses as offsets into a particular segment, where a segment is defined as a range of linear (virtual) addresses with particular characteristics such as protection. In fact, you must use the segmentation mechanism on x86 machines; so we set up four segments:

	kernel code (__KERNEL_CS)
	kernel data (__KERNEL_DS)
	user code (__USER_CS)
	user data (__USER_DS)

All four have base 0 and a 4GB limit; they differ only in privilege level (0 for kernel, 3 for user) and in whether they are code or data segments. Thus, we effectively allow access to the entire virtual address space using any of the available segment selectors.


Thanks to Andrea Russo for clearing up this Intel segmentation business.




Questions and comments to Joe Knapka

The LXR links in this page were produced by lxrreplace.tcl, which is available for free.
