Documentation/x86/intel_mpx.txt

1. Intel(R) MPX Overview
========================

Intel(R) Memory Protection Extensions (Intel(R) MPX) is a new capability
introduced into Intel Architecture. Intel MPX provides hardware features
that can be used in conjunction with compiler changes to check memory
references whose normal compile-time intentions are usurped at runtime
by a buffer overflow or underflow.

For more information, please refer to the Intel(R) Architecture Instruction
Set Extensions Programming Reference, Chapter 9: Intel(R) Memory Protection
Extensions.

Note: Currently no hardware with the MPX ISA is available, but it is always
possible to use SDE (Intel(R) Software Development Emulator) instead, which
can be downloaded from
http://software.intel.com/en-us/articles/intel-software-development-emulator


2. How to get the advantage of MPX
==================================

For MPX to work, changes are required in the kernel, binutils and compiler.
No source changes are required for applications, just a recompile.

A lot of moving parts have to work together for this to work right. The
following is how we expect the compiler, application and kernel to
cooperate.

1) Application developer compiles with -fmpx. The compiler will add the
   instrumentation as well as some setup code called early after the app
   starts. New instruction prefixes are no-ops on old CPUs.
2) That setup code allocates (virtual) space for the "bounds directory",
   points the "bndcfgu" register to the directory and notifies the kernel
   (via the new prctl(PR_MPX_ENABLE_MANAGEMENT)) that the app will be using
   MPX.
3) The kernel detects that the CPU has MPX, allows the new prctl() to
   succeed, and notes the location of the bounds directory. Userspace is
   expected to keep the bounds directory at that location. We note it
   instead of reading it each time because the 'xsave' operation needed
   to access the bounds directory register is expensive.
4) If the application needs to spill bounds out of the 4 registers, it
   issues a bndstx instruction. Since the bounds directory is empty at
   this point, a bounds fault (#BR) is raised, the kernel allocates a
   bounds table (in the user address space) and makes the relevant entry
   in the bounds directory point to the new table.
5) If the application violates the bounds specified in the bounds registers,
   a separate kind of #BR is raised which will deliver a signal with
   information about the violation in the 'struct siginfo'.
6) Whenever memory is freed, we know that it can no longer contain valid
   pointers, and we attempt to free the associated space in the bounds
   tables. If an entire table becomes unused, we will attempt to free
   the table and remove the entry in the directory.

To summarize, there are essentially three things interacting here:

GCC with -fmpx:
 * enables annotation of code with MPX instructions and prefixes
 * inserts code early in the application to call into the "gcc runtime"
GCC MPX Runtime:
 * checks for hardware MPX support in cpuid leaf
 * allocates virtual space for the bounds directory (malloc() essentially)
 * points the hardware BNDCFGU register at the directory
 * calls a new prctl(PR_MPX_ENABLE_MANAGEMENT) to notify the kernel to
   start managing the bounds directories
Kernel MPX Code:
 * checks for hardware MPX support in cpuid leaf
 * handles #BR exceptions and sends SIGSEGV to the app when it violates
   bounds, like during a buffer overflow
 * when bounds are spilled into an unallocated bounds table, notices
   this in the #BR exception, allocates the virtual space, then
   updates the bounds directory to point to the new table. It keeps
   special track of the memory with a VM_MPX flag.
 * frees unused bounds tables at the time that the memory they described
   is unmapped


3. How does MPX kernel code work
================================

Handling #BR faults caused by MPX
---------------------------------

When MPX is enabled, there are 2 new situations that can generate
#BR faults:
  * New bounds tables (BT) need to be allocated to save bounds.
  * A bounds violation is caused by MPX instructions.

We hook the #BR handler to handle these two new situations.

On-demand kernel allocation of bounds tables
--------------------------------------------

MPX only has 4 hardware registers for storing bounds information. If
MPX-enabled code needs more than these 4 registers, it needs to spill
them somewhere. It has two special instructions for this which allow
the bounds to be moved between the bounds registers and some new "bounds
tables".

MPX adds new causes of #BR exceptions. They are conceptually similar
to a page fault and are raised by the MPX hardware both on bounds
violations and when the tables are not present. The kernel handles the
#BR exceptions for not-present tables by carving the space out of the
process's normal address space and then pointing the bounds directory
at it.
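
As a rough illustration of how an address selects a directory entry and a
table entry, the sketch below assumes the 64-bit layout described in the
Intel programming reference rather than in this document: a 2GB directory
of 8-byte entries indexed by address bits 47:20, and 4MB tables of 32-byte
entries indexed by bits 19:3. The address in question is that of the memory
location holding the pointer. All names are hypothetical.

```c
#include <stdint.h>

/* Assumed 64-bit MPX layout; these constants are not from this document. */
#define MPX_BD_ENTRY_BYTES	8ULL		/* one directory entry */
#define MPX_BD_NR_ENTRIES	(1ULL << 28)	/* 2GB directory / 8 bytes */
#define MPX_BT_ENTRY_BYTES	32ULL		/* one table entry */
#define MPX_BT_NR_ENTRIES	(1ULL << 17)	/* 4MB table / 32 bytes */

/* Directory index: address bits 47:20, one entry per 1MB of address space. */
static inline uint64_t mpx_bd_index(uint64_t addr)
{
	return (addr >> 20) & (MPX_BD_NR_ENTRIES - 1);
}

/* Table index: address bits 19:3, one entry per 8-byte pointer slot. */
static inline uint64_t mpx_bt_index(uint64_t addr)
{
	return (addr >> 3) & (MPX_BT_NR_ENTRIES - 1);
}

/* Byte offset, within the directory, of the entry covering 'addr'. */
static inline uint64_t mpx_bd_entry_offset(uint64_t addr)
{
	return mpx_bd_index(addr) * MPX_BD_ENTRY_BYTES;
}
```

Under these assumptions each 4MB table describes 1MB of address space,
which is where the worst-case 4x overhead comes from.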

The tables need to be accessed and controlled by userspace because
the instructions for moving bounds in and out of them are extremely
frequent. They potentially happen every time a register points to
memory. Any direct kernel involvement (like a syscall) to access the
tables would obviously destroy performance.

Why not do this in userspace? MPX does not strictly require anything in
the kernel. It can theoretically be done completely from userspace. Here
are a few ways this could be done. We don't think any of them are practical
in the real world, but here they are.

Q: Can virtual space simply be reserved for the bounds tables so that we
   never have to allocate them?
A: An MPX-enabled application may create a lot of bounds tables in its
   address space to save bounds information. These tables can take
   up huge swaths of memory (as much as 80% of the memory on the system)
   even if we clean them up aggressively. In the worst-case scenario, the
   tables can be 4x the size of the data structure being tracked. IOW, a
   1-page structure can require 4 bounds-table pages. An X-GB virtual
   area needs 4*X GB of virtual space, plus 2GB for the bounds directory.
   If we were to preallocate them for the 128TB of user virtual address
   space, we would need to reserve 512TB+2GB, which is larger than the
   entire virtual address space today. This means they cannot be reserved
   ahead of time. Also, a single process's pre-populated bounds directory
   consumes 2GB of virtual *AND* physical memory. IOW, it's completely
   infeasible to prepopulate bounds directories.
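
The arithmetic in the answer above can be checked with a few lines of C.
This is only a sketch of the worst-case estimate (4x the tracked area plus
a 2GB directory); the function name is made up for illustration.

```c
#include <stdint.h>

#define GB	(1ULL << 30)
#define TB	(1ULL << 40)

/*
 * Worst-case virtual space that would have to be reserved up front for
 * an area of 'area_bytes': 4x the area for bounds tables, plus 2GB for
 * the bounds directory.
 */
static inline uint64_t mpx_worst_case_reserve(uint64_t area_bytes)
{
	return 4 * area_bytes + 2 * GB;
}
```

For the full 128TB of user address space this gives 512TB+2GB, which is
more than the address space being described.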

Q: Can we preallocate bounds table space at the same time memory is
   allocated which might contain pointers that might eventually need
   bounds tables?
A: This would work if we could hook the site of each and every memory
   allocation syscall. This can be done for small, constrained applications.
   But it isn't practical at a larger scale since a given app has no
   way of controlling how all the parts of the app might allocate memory
   (think libraries). The kernel is really the only place to intercept
   these calls.

Q: Could a bounds fault be handed to userspace and the tables allocated
   there in a signal handler instead of in the kernel?
A: mmap() is not on the list of async-signal-safe functions, and even
   if mmap() would work it still requires locking or nasty tricks to
   keep track of the allocation state there.

Having ruled out all of the userspace-only approaches for managing
bounds tables that we could think of, we create them on demand in
the kernel.

Decoding MPX instructions
-------------------------

If a #BR is generated due to a bounds violation caused by MPX, we need
to decode the MPX instruction to get the violation address and store
this address in the extended struct siginfo.

The _sigfault field of struct siginfo is extended as follows:

        /* SIGILL, SIGFPE, SIGSEGV, SIGBUS */
        struct {
                void __user *_addr; /* faulting insn/memory ref. */
    #ifdef __ARCH_SI_TRAPNO
                int _trapno;    /* TRAP # which caused the signal */
    #endif
                short _addr_lsb; /* LSB of the reported address */
                struct {
                        void __user *_lower;
                        void __user *_upper;
                } _addr_bnd;
        } _sigfault;

The '_addr' field refers to the violation address, and the new '_addr_bnd'
field refers to the upper/lower bounds when a #BR is caused.

Glibc will also be updated to support this new siginfo, so users
can get the violation address and bounds when bounds violations occur.
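
From userspace, the extended fields could be read roughly as follows. This
is a sketch under assumptions not stated in this document: that the C
library exposes the new fields through the si_lower and si_upper accessors,
and that the kernel marks a bounds violation with si_code SEGV_BNDERR. The
struct and function names are hypothetical.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <signal.h>
#include <stdint.h>

#ifndef SEGV_BNDERR
#define SEGV_BNDERR 3	/* assumed si_code for a failed bounds check */
#endif

/* Hypothetical container for what the kernel reported. */
struct mpx_violation {
	uintptr_t addr;		/* address that violated the bounds */
	uintptr_t lower;	/* lower bound in effect */
	uintptr_t upper;	/* upper bound in effect */
};

/*
 * Copy the violation details out of 'si'. Returns 1 if the signal
 * describes a bounds violation, 0 for any other signal or si_code.
 */
static inline int mpx_decode_siginfo(const siginfo_t *si,
				     struct mpx_violation *out)
{
	if (si->si_signo != SIGSEGV || si->si_code != SEGV_BNDERR)
		return 0;

	out->addr  = (uintptr_t)si->si_addr;
	out->lower = (uintptr_t)si->si_lower;
	out->upper = (uintptr_t)si->si_upper;
	return 1;
}
```

A handler installed with sigaction() and SA_SIGINFO would pass its
siginfo_t argument to such a helper before deciding how to report the
fault.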

Cleanup unused bounds tables
----------------------------

When a BNDSTX instruction attempts to save bounds to a bounds directory
entry marked as invalid, a #BR is generated. This is an indication that
no bounds table exists for this entry. In this case the fault handler
will allocate a new bounds table on demand.

Since the kernel allocated those tables on demand without userspace
knowledge, it is also responsible for freeing them when the associated
mappings go away.

The solution is to hook do_munmap() and check whether the process is
MPX enabled. If it is, the bounds tables covering the virtual address
region being unmapped are freed as well.
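
To make the idea concrete, the sketch below works out which bounds tables
lie entirely inside an unmapped range and could therefore be freed whole.
It is not the kernel's implementation; it assumes one table per 1MB of
address space (the 64-bit layout from the Intel reference), and the names
are hypothetical. Tables only partly covered by the range would need their
affected entries cleared instead.

```c
#include <stdint.h>

#define MPX_BT_COVERS_SHIFT	20	/* assumed: one table per 1MB */

/* A run of consecutive bounds directory entries. */
struct mpx_bd_range {
	uint64_t first;		/* index of the first directory entry */
	uint64_t count;		/* number of entries, may be 0 */
};

/*
 * Directory entries whose whole 1MB window lies inside [start, end).
 * Assumes user addresses, so start + 1MB cannot overflow.
 */
static inline struct mpx_bd_range
mpx_fully_covered_tables(uint64_t start, uint64_t end)
{
	uint64_t span = 1ULL << MPX_BT_COVERS_SHIFT;
	/* Round start up and end down to a 1MB boundary. */
	uint64_t first = (start + span - 1) >> MPX_BT_COVERS_SHIFT;
	uint64_t last = end >> MPX_BT_COVERS_SHIFT;	/* exclusive */
	struct mpx_bd_range r = { first, last > first ? last - first : 0 };

	return r;
}
```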

Adding new prctl commands
-------------------------

Two new prctl commands are added to enable and disable MPX bounds table
management in the kernel.

    #define PR_MPX_ENABLE_MANAGEMENT        43
    #define PR_MPX_DISABLE_MANAGEMENT       44

The runtime library in userspace is responsible for allocating the bounds
directory, so the kernel has to use the XSAVE instruction to get the base
of the bounds directory from the BNDCFGU register.

But XSAVE is expected to be very expensive. As an optimization, the kernel
gets the base of the bounds directory once, during execution of the
PR_MPX_ENABLE_MANAGEMENT command, and saves it in struct mm_struct for
future use.
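
A runtime library might wrap the two commands as shown below. This is a
minimal sketch, assuming a Linux C library with <sys/prctl.h> and assuming
the unused prctl() arguments are passed as zero. The wrapper names are
hypothetical. Because the kernel records the directory base when the enable
command runs, the call belongs after BNDCFGU has been pointed at the
directory.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Values as listed in this document, in case the headers lack them. */
#ifndef PR_MPX_ENABLE_MANAGEMENT
#define PR_MPX_ENABLE_MANAGEMENT	43
#define PR_MPX_DISABLE_MANAGEMENT	44
#endif

/* Ask the kernel to manage bounds tables. Returns 0 or a negative errno. */
static inline int mpx_enable_kernel_management(void)
{
	if (prctl(PR_MPX_ENABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
		return -errno;
	return 0;
}

/* Ask the kernel to stop managing them. Returns 0 or a negative errno. */
static inline int mpx_disable_kernel_management(void)
{
	if (prctl(PR_MPX_DISABLE_MANAGEMENT, 0, 0, 0, 0) == -1)
		return -errno;
	return 0;
}
```

On a CPU or kernel without MPX support the enable call is expected to
fail, and the caller should carry on without kernel-managed tables.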


4. Special rules
================

1) If userspace is requesting help from the kernel to do the management
of bounds tables, it may not create or modify entries in the bounds directory.

Users can certainly allocate bounds tables themselves, forcibly point the
bounds directory at them through the XSAVE instruction, and set the valid
bit of the directory entry to make it valid. But the kernel will decline
to assist in managing these tables.

2) Userspace may not take multiple bounds directory entries and point
them at the same bounds table.

This is allowed architecturally. For more information, see "Intel(R)
Architecture Instruction Set Extensions Programming Reference" (9.3.4).

However, if users did this, the kernel might be fooled into unmapping an
in-use bounds table since it does not recognize sharing.