Documentation/security/self-protection.txt

   1 # Kernel Self-Protection
   2
   3 Kernel self-protection is the design and implementation of systems and
   4 structures within the Linux kernel to protect against security flaws in
   5 the kernel itself. This covers a wide range of issues, including removing
   6 entire classes of bugs, blocking security flaw exploitation methods,
   7 and actively detecting attack attempts. Not all topics are explored in
   8 this document, but it should serve as a reasonable starting point and
   9 answer any frequently asked questions. (Patches welcome, of course!)
  10
  11 In the worst-case scenario, we assume an unprivileged local attacker
  12 has arbitrary read and write access to the kernel's memory. In many
  13 cases, bugs being exploited will not provide this level of access,
  14 but with systems in place that defend against the worst case we'll
  15 cover the more limited cases as well. A higher bar, and one that should
  16 still be kept in mind, is protecting the kernel against a _privileged_
  17 local attacker, since the root user has access to a vastly increased
  18 attack surface. (Especially when they have the ability to load arbitrary
  19 kernel modules.)
  20
  21 The goals for successful self-protection systems would be that they
  22 are effective, on by default, require no opt-in by developers, have no
  23 performance impact, do not impede kernel debugging, and have tests. It
  24 is uncommon that all these goals can be met, but it is worth explicitly
  25 mentioning them, since these aspects need to be explored, dealt with,
  26 and/or accepted.
  27
  28
  29 ## Attack Surface Reduction
  30
  31 The most fundamental defense against security exploits is to reduce the
  32 areas of the kernel that can be used to redirect execution. This includes
  33 limiting the exposed APIs available to userspace, making in-kernel
  34 APIs hard to use incorrectly, minimizing the areas of writable kernel
  35 memory, etc.
  36
  37 ### Strict kernel memory permissions
  38
  39 When all of kernel memory is writable, it becomes trivial for attacks
  40 to redirect execution flow. To reduce the availability of these targets
  41 the kernel needs to protect its memory with a tight set of permissions.
  42
  43 #### Executable code and read-only data must not be writable
  44
  45 Any areas of the kernel with executable memory must not be writable.
  46 While this obviously includes the kernel text itself, we must consider
  47 all additional places too: kernel modules, JIT memory, etc. (There are
  48 temporary exceptions to this rule to support things like instruction
  49 alternatives, breakpoints, kprobes, etc. If these must exist in a
  50 kernel, they are implemented in a way where the memory is temporarily
  51 made writable during the update, and then returned to the original
  52 permissions.)
  53
  54 In support of this are (the poorly named) CONFIG_DEBUG_RODATA and
  55 CONFIG_DEBUG_SET_MODULE_RONX, which seek to make sure that code is not
  56 writable, data is not executable, and read-only data is neither writable
  57 nor executable.
  58
  59 #### Function pointers and sensitive variables must not be writable
  60
  61 Vast areas of kernel memory contain function pointers that are looked
  62 up by the kernel and used to continue execution (e.g. descriptor/vector
  63 tables, file/network/etc operation structures, etc). The number of these
  64 variables must be reduced to an absolute minimum.
  65
  66 Many such variables can be made read-only by setting them "const"
  67 so that they live in the .rodata section instead of the .data section
  68 of the kernel, gaining the protection of the kernel's strict memory
  69 permissions as described above.
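
A minimal userspace sketch of the "const" approach follows. The struct and
function names are hypothetical and not kernel APIs; the point is only that
a const-qualified operations table can be placed in .rodata by the toolchain,
so its function pointers are covered by read-only mappings.

```c
/* Hypothetical operations table, loosely modeled on kernel *_operations structs. */
struct demo_ops {
    int (*open)(int id);
    int (*close)(int id);
};

static int demo_open(int id)
{
    return id + 1;
}

static int demo_close(int id)
{
    return id - 1;
}

/* "const" allows the table to live in .rodata instead of .data. */
static const struct demo_ops demo_default_ops = {
    .open  = demo_open,
    .close = demo_close,
};

/* Callers only receive a pointer-to-const, so they cannot retarget a hook. */
const struct demo_ops *demo_get_ops(void)
{
    return &demo_default_ops;
}
```

An assignment such as demo_get_ops()->open = NULL is rejected at compile
time, and a write through a cast pointer would fault once the section is
mapped read-only.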
  70
  71 Variables that are initialized once at __init time can be marked
  72 with the (new and under development) __ro_after_init attribute so
  73 that they become read-only once initialization has finished.
  74
  75 What remains are variables that are updated rarely (e.g. GDT). These
  76 will need another infrastructure (similar to the temporary exceptions
  77 made to kernel code mentioned above) that allows them to spend the rest
  78 of their lifetime read-only. (For example, when being updated, only the
  79 CPU thread performing the update would be given uninterruptible write
  80 access to the memory.)
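
As a rough userspace analogy only (not the kernel infrastructure itself),
the "read-only except during updates" pattern can be sketched with mmap()
and mprotect(). All names below are hypothetical. Note one important
difference: mprotect() changes the mapping for every thread in the process,
whereas the text above asks for write access limited to the updating CPU
thread.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical rarely-updated value that lives alone in its own page. */
static long *guarded_value;
static size_t guarded_len;

/* Allocate one page, store the initial value, then drop write permission. */
int guarded_init(long initial)
{
    long page = sysconf(_SC_PAGESIZE);
    void *p;

    if (page <= 0)
        return -1;
    p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    guarded_value = p;
    guarded_len = (size_t)page;
    *guarded_value = initial;
    return mprotect(guarded_value, guarded_len, PROT_READ);
}

/* Open a short write window, update, and return to read-only. */
int guarded_update(long value)
{
    if (mprotect(guarded_value, guarded_len, PROT_READ | PROT_WRITE) != 0)
        return -1;
    *guarded_value = value;
    return mprotect(guarded_value, guarded_len, PROT_READ);
}

long guarded_read(void)
{
    return *guarded_value;
}
```

Any store to *guarded_value outside guarded_update() faults, which is the
property the paragraph above is after.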
  81
  82 #### Segregation of kernel memory from userspace memory
  83
  84 The kernel must never execute userspace memory. The kernel must also never
  85 access userspace memory without an explicit expectation to do so. These
  86 rules can be enforced either by support of hardware-based restrictions
  87 (x86's SMEP/SMAP, ARM's PXN/PAN) or via emulation (ARM's Memory Domains).
  88 By blocking userspace memory in this way, execution and data parsing
  89 cannot be passed to trivially-controlled userspace memory, forcing
  90 attacks to operate entirely in kernel memory.
  91
  92 ### Reduced access to syscalls
  93
  94 One trivial way to eliminate many syscalls for 64-bit systems is building
  95 without CONFIG_COMPAT. However, this is rarely a feasible scenario.
  96
  97 The "seccomp" system provides an opt-in feature made available to
  98 userspace, which provides a way to reduce the number of kernel entry
  99 points available to a running process. This limits the breadth of kernel
 100 code that can be reached, possibly reducing the availability of a given
 101 bug to an attack.
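
The sketch below shows the simplest form of this opt-in, seccomp strict
mode, from the userspace side. The function name and return convention are
made up for illustration. A child process enables strict mode and then
issues a syscall outside the small allowed set; the parent reports whether
the kernel killed it. Real deployments more commonly install a BPF filter,
which is not shown here.

```c
#define _GNU_SOURCE
#include <linux/seccomp.h>
#include <signal.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a child in strict seccomp mode was killed for calling
 * getpid(), 0 if seccomp could not be enabled, and -1 on other errors.
 */
int strict_seccomp_blocks_getpid(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* Strict mode permits only read(), write(), _exit() and sigreturn(). */
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(2);
        syscall(SYS_getpid);    /* not on the allowed list: SIGKILL */
        syscall(SYS_exit, 3);   /* only reached if the call above was permitted */
    }
    if (waitpid(pid, &status, 0) != pid)
        return -1;
    if (WIFSIGNALED(status) && WTERMSIG(status) == SIGKILL)
        return 1;
    if (WIFEXITED(status) && WEXITSTATUS(status) == 2)
        return 0;
    return -1;
}
```

On a kernel with seccomp available the function is expected to return 1:
the getpid() entry point is simply unreachable for the restricted process.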
 102
 103 An area of improvement would be creating viable ways to keep access to
 104 things like compat, user namespaces, BPF creation, and perf limited only
 105 to trusted processes. This would keep the scope of kernel entry points
 106 restricted to the more regular set of interfaces normally available to
 107 unprivileged userspace.
 108
 109 ### Restricting access to kernel modules
 110
 111 The kernel should never allow an unprivileged user the ability to
 112 load specific kernel modules, since that would provide a facility to
 113 unexpectedly extend the available attack surface. (The on-demand loading
 114 of modules via their predefined subsystems, e.g. MODULE_ALIAS_*, is
 115 considered "expected" here, though additional consideration should be
 116 given even to these.) For example, loading a filesystem module via an
 117 unprivileged socket API is nonsense: only the root or physically local
 118 user should trigger filesystem module loading. (And even this can be up
 119 for debate in some scenarios.)
 120
 121 To protect against even privileged users, systems may need to either
 122 disable module loading entirely (e.g. monolithic kernel builds or
 123 modules_disabled sysctl), or provide signed modules (e.g.
 124 CONFIG_MODULE_SIG_FORCE, or dm-verity with LoadPin), to keep from having
 125 root load arbitrary kernel code via the module loader interface.
 126
 127
 128 ## Memory integrity
 129
 130 There are many memory structures in the kernel that are regularly abused
 131 to gain execution control during an attack. By far the most commonly
 132 understood is that of the stack buffer overflow in which the return
 133 address stored on the stack is overwritten. Many other examples of this
 134 kind of attack exist, and protections exist to defend against them.
 135
 136 ### Stack buffer overflow
 137
 138 The classic stack buffer overflow involves writing past the expected end
 139 of a variable stored on the stack, ultimately writing a controlled value
 140 to the stack frame's stored return address. The most widely used defense
 141 is the presence of a stack canary between the stack variables and the
 142 return address (CONFIG_CC_STACKPROTECTOR), which is verified just before
 143 the function returns. Other defenses include things like shadow stacks.
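
The canary idea can be modeled in plain C, as below. This is a simulation
under stated assumptions, not what the compiler emits: the "frame" is an
ordinary struct (hypothetical layout), the secret is passed in by the
caller, and the oversized copy is clamped to the struct so the demo stays
well defined.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_BUF_LEN 16

/* Hypothetical frame layout: a buffer followed directly by a canary word. */
struct demo_frame {
    unsigned char buf[DEMO_BUF_LEN];
    uint64_t canary;
};

/*
 * Copies len bytes into the frame as a buggy function might (no bounds
 * check against buf), then verifies the canary as an epilogue would.
 * Returns 1 if the canary survived, 0 if the overflow was detected.
 */
int demo_copy_and_check(const unsigned char *src, size_t len, uint64_t secret)
{
    struct demo_frame frame;

    /* Keep the write inside the frame object so the demo stays well defined. */
    if (len > sizeof(frame))
        len = sizeof(frame);

    frame.canary = secret;          /* prologue: place the canary */
    memcpy(&frame, src, len);       /* unchecked copy starting at buf */
    return frame.canary == secret;  /* epilogue: verify before "returning" */
}
```

A copy of DEMO_BUF_LEN bytes leaves the canary intact, while anything
longer has to overwrite it before it could reach a return address placed
after it.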
 144
 145 ### Stack depth overflow
 146
 147 A less well understood attack is using a bug that triggers the
 148 kernel to consume stack memory with deep function calls or large stack
 149 allocations. With this attack it is possible to write beyond the end of
 150 the kernel's preallocated stack space and into sensitive structures. Two
 151 important changes need to be made for better protections: moving the
 152 sensitive thread_info structure elsewhere, and adding a faulting memory
 153 hole at the bottom of the stack to catch these overflows.
 154
 155 ### Heap memory integrity
 156
 157 The structures used to track heap free lists can be sanity-checked during
 158 allocation and freeing to make sure they aren't being used to manipulate
 159 other memory areas.
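
One possible shape for such a check, sketched on a generic doubly linked
list rather than on any particular allocator: before unlinking an entry,
confirm that its neighbours still point back at it. The type and function
names are hypothetical.

```c
#include <stddef.h>

/* Minimal doubly linked free-list node (hypothetical, for illustration). */
struct demo_node {
    struct demo_node *next;
    struct demo_node *prev;
};

void demo_list_init(struct demo_node *head)
{
    head->next = head;
    head->prev = head;
}

void demo_list_add(struct demo_node *node, struct demo_node *head)
{
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/*
 * Unlink node only if its neighbours still point back at it.
 * Returns 0 on success, -1 if the links look corrupted.
 */
int demo_list_del_checked(struct demo_node *node)
{
    if (node->next == NULL || node->prev == NULL)
        return -1;
    if (node->prev->next != node || node->next->prev != node)
        return -1;
    node->prev->next = node->next;
    node->next->prev = node->prev;
    node->next = NULL;
    node->prev = NULL;
    return 0;
}
```

Without the check, an attacker who controls next and prev turns the unlink
into a write to a location of their choosing; with it, a second removal of
the same entry or a forged link is refused.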
 160
 161 ### Counter integrity
 162
 163 Many places in the kernel use atomic counters to track object references
 164 or perform similar lifetime management. When these counters can be made
 165 to wrap (over or under) this traditionally exposes a use-after-free
 166 flaw. By trapping atomic wrapping, this class of bug vanishes.
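
A minimal sketch of a counter that refuses to wrap, written with C11
atomics. The names are hypothetical and this is not the kernel's
implementation. The design choice is to pin a saturated counter forever,
trading a possible memory leak for the absence of a use-after-free.

```c
#include <limits.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical reference counter that saturates instead of wrapping. */
struct demo_ref {
    atomic_uint count;
};

void demo_ref_init(struct demo_ref *r, unsigned int n)
{
    atomic_init(&r->count, n);
}

/* Take a reference. Refuses when the count is 0 (already released) or saturated. */
bool demo_ref_get(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == UINT_MAX)
            return false;
    } while (!atomic_compare_exchange_weak(&r->count, &old, old + 1));
    return true;
}

/* Drop a reference. Returns true only when the caller released the last one. */
bool demo_ref_put(struct demo_ref *r)
{
    unsigned int old = atomic_load(&r->count);

    do {
        if (old == 0 || old == UINT_MAX)
            return false;   /* underflow attempt or saturated: leave untouched */
    } while (!atomic_compare_exchange_weak(&r->count, &old, old - 1));
    return old == 1;
}
```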
 167
 168 ### Size calculation overflow detection
 169
 170 Similar to counter overflow, integer overflows (usually size calculations)
 171 need to be detected at runtime to kill this class of bug, which
 172 traditionally leads to being able to write past the end of kernel buffers.
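
For illustration, a size calculation can be checked at runtime with the
GCC/Clang overflow builtins, as in the sketch below. The helper names are
hypothetical, and the builtins are a compiler extension rather than
standard C.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Compute n * size + header, reporting overflow instead of wrapping. */
bool demo_array_size(size_t n, size_t size, size_t header, size_t *out)
{
    size_t bytes;

    if (__builtin_mul_overflow(n, size, &bytes))
        return false;
    if (__builtin_add_overflow(bytes, header, &bytes))
        return false;
    *out = bytes;
    return true;
}

/* Allocate a zeroed array, failing cleanly when the size would overflow. */
void *demo_alloc_array(size_t n, size_t size)
{
    size_t bytes;

    if (!demo_array_size(n, size, 0, &bytes))
        return NULL;
    return calloc(1, bytes ? bytes : 1);
}
```

With a wrapping multiplication, a huge element count yields a small
allocation followed by writes past its end; here the request fails
instead.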
 173
 174
 175 ## Statistical defenses
 176
 177 While many protections can be considered deterministic (e.g. read-only
 178 memory cannot be written to), some protections provide only statistical
 179 defense, in that an attack must gather enough information about a
 180 running system to overcome the defense. While not perfect, these do
 181 provide meaningful defenses.
 182
 183 ### Canaries, blinding, and other secrets
 184
 185 It should be noted that things like the stack canary discussed earlier
 186 are technically statistical defenses, since they rely on a (leakable)
 187 secret value.
 188
 189 Blinding literal values for things like JITs, where the executable
 190 contents may be partially under the control of userspace, needs a similar
 191 secret value.
 192
 193 It is critical that the secret values used are separate (e.g.
 194 different canary per stack) and high entropy (e.g. is the RNG actually
 195 working?) in order to maximize their success.
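
The blinding of literal values mentioned above can be sketched as follows.
This is a simplified model with hypothetical names: a JIT would emit the
masked value and the key as two separate immediates and recombine them at
run time, so the attacker-chosen constant never appears verbatim in
executable memory. The key is assumed to come from a working RNG and to be
fresh for each constant.

```c
#include <stdint.h>

/* Hypothetical pair emitted in place of a single attacker-chosen immediate. */
struct demo_blinded_imm {
    uint32_t masked;    /* imm ^ key, emitted instead of imm */
    uint32_t key;       /* per-constant random value supplied by the caller */
};

/* Split an immediate so the raw value never appears verbatim. */
struct demo_blinded_imm demo_blind(uint32_t imm, uint32_t key)
{
    struct demo_blinded_imm b = { .masked = imm ^ key, .key = key };

    return b;
}

/* What the emitted instruction pair would compute at run time. */
uint32_t demo_unblind(struct demo_blinded_imm b)
{
    return b.masked ^ b.key;
}
```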
 196
 197 ### Kernel Address Space Layout Randomization (KASLR)
 198
 199 Since the location of kernel memory is almost always instrumental in
 200 mounting a successful attack, making the location non-deterministic
 201 raises the difficulty of an exploit. (Note that this in turn makes
 202 the value of leaks higher, since they may be used to discover desired
 203 memory locations.)
 204
 205 #### Text and module base
 206
 207 By relocating the physical and virtual base address of the kernel at
 208 boot-time (CONFIG_RANDOMIZE_BASE), attacks needing kernel code will be
 209 frustrated. Additionally, offsetting the module loading base address
 210 means that even systems that load the same set of modules in the same
 211 order every boot will not share a common base address with the rest of
 212 the kernel text.
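
The core arithmetic of picking a randomized base can be sketched as below.
This is an assumption-laden model, not the boot code: the function name,
the "0 means failure" convention, and the caller-supplied entropy value are
all hypothetical, and the modulo step introduces a small bias that a real
implementation might want to avoid.

```c
#include <stdint.h>

/*
 * Pick an aligned base in [min, max - size] from caller-supplied entropy.
 * Returns 0 when the image does not fit or align is not a power of two.
 */
uint64_t demo_pick_base(uint64_t min, uint64_t max, uint64_t size,
                        uint64_t align, uint64_t entropy)
{
    uint64_t first, last, slots;

    if (align == 0 || (align & (align - 1)) != 0)
        return 0;
    if (max < min || size > max - min)
        return 0;

    first = (min + align - 1) & ~(align - 1);   /* round min up */
    last = (max - size) & ~(align - 1);         /* round last start down */
    if (first < min || first > last)            /* first < min: rounding wrapped */
        return 0;

    slots = (last - first) / align + 1;
    return first + (entropy % slots) * align;
}
```

The number of slots bounds how many guesses an attacker needs, which is why
both a wide range and a good entropy source matter.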
 213
 214 #### Stack base
 215
 216 If the base address of the kernel stack is not the same between processes,
 217 or even not the same between syscalls, targets on or beyond the stack
 218 become more difficult to locate.
 219
 220 #### Dynamic memory base
 221
 222 Much of the kernel's dynamic memory (e.g. kmalloc, vmalloc, etc) ends up
 223 being relatively deterministic in layout due to the order of early-boot
 224 initializations. If the base address of these areas is not the same
 225 between boots, targeting them is frustrated, requiring a leak specific
 226 to the region.
 227
 228
 229 ## Preventing Leaks
 230
 231 Since the locations of sensitive structures are the primary target for
 232 attacks, it is important to defend against leaks of both kernel memory
 233 addresses and kernel memory contents (since they may contain kernel
 234 addresses or other sensitive things like canary values).
 235
 236 ### Unique identifiers
 237
 238 Kernel memory addresses must never be used as identifiers exposed to
 239 userspace. Instead, use an atomic counter, an idr, or similar unique
 240 identifier.
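
A small sketch of the atomic counter option, using C11 atomics in place of
kernel primitives. The object type and names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical object that needs an identifier visible to userspace. */
struct demo_obj {
    uint64_t id;
    int payload;
};

static atomic_uint_fast64_t demo_next_id = 1;

/* Assign an opaque, monotonically increasing ID instead of exposing &obj. */
void demo_obj_init(struct demo_obj *obj, int payload)
{
    obj->id = atomic_fetch_add(&demo_next_id, 1);
    obj->payload = payload;
}
```

The identifier stays unique across objects yet says nothing about where
any of them live in memory.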
 241
 242 ### Memory initialization
 243
 244 Memory copied to userspace must always be fully initialized. If it is
 245 not explicitly cleared with memset(), this will require changes to the
 246 compiler to make sure structure holes are cleared.
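
The explicit memset() approach looks roughly like this. The structure is
hypothetical and chosen so that common ABIs insert a hole after the first
member; the helper that inspects the hole exists only to make the effect
visible.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical reply with a hole after "tag" on common ABIs. */
struct demo_reply {
    char tag;
    uint64_t value;
};

/* Clear the whole object first so padding never carries stale bytes. */
void demo_fill_reply(struct demo_reply *out, char tag, uint64_t value)
{
    memset(out, 0, sizeof(*out));
    out->tag = tag;
    out->value = value;
}

/* Returns 1 if every byte between the two members is zero. */
int demo_padding_is_clear(const struct demo_reply *r)
{
    const unsigned char *p = (const unsigned char *)r;
    size_t i;

    for (i = sizeof(r->tag); i < offsetof(struct demo_reply, value); i++)
        if (p[i] != 0)
            return 0;
    return 1;
}
```

Assigning the members alone would leave whatever was previously in the
hole, and copying the whole struct out would then leak it.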
 247
 248 ### Memory poisoning
 249
 250 When releasing memory, it is best to poison the contents (clear stack on
 251 syscall return, wipe heap memory on a free), to avoid reuse attacks that
 252 rely on the old contents of memory. This frustrates many uninitialized
 253 variable attacks, stack info leaks, heap info leaks, and use-after-free
 254 attacks.
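
Wiping on free can be sketched with a toy fixed-size pool, as below.
Everything here is hypothetical (pool size, slot size, and the fill byte
are arbitrary choices for the sketch); a static pool is used so the old
contents can be inspected safely after release.

```c
#include <stddef.h>
#include <string.h>

#define DEMO_SLOTS      4
#define DEMO_SLOT_SIZE  32
#define DEMO_POISON     0x6b    /* arbitrary fill byte chosen for this sketch */

static unsigned char demo_pool[DEMO_SLOTS][DEMO_SLOT_SIZE];
static int demo_used[DEMO_SLOTS];

/* Hand out a free slot, or NULL when the pool is exhausted. */
void *demo_alloc(void)
{
    int i;

    for (i = 0; i < DEMO_SLOTS; i++) {
        if (!demo_used[i]) {
            demo_used[i] = 1;
            return demo_pool[i];
        }
    }
    return NULL;
}

/* Overwrite the slot before releasing it so old contents cannot be reused. */
void demo_free(void *p)
{
    int i;

    for (i = 0; i < DEMO_SLOTS; i++) {
        if (p == (void *)demo_pool[i] && demo_used[i]) {
            memset(demo_pool[i], DEMO_POISON, DEMO_SLOT_SIZE);
            demo_used[i] = 0;
            return;
        }
    }
}
```

The next user of the slot sees only the poison pattern, so neither an
uninitialized read nor a stale pointer recovers the previous contents.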
 255
 256 ### Destination tracking
 257
 258 To help kill classes of bugs that result in kernel addresses being
 259 written to userspace, the destination of writes needs to be tracked. If
 260 the buffer is destined for userspace (e.g. seq_file backed /proc files),
 261 it should automatically censor sensitive values.
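
One way to picture this, as a sketch only: the formatting helper is told
where its output is headed and censors addresses accordingly. The function
name and the boolean destination flag are hypothetical; real tracking would
be derived from the buffer itself rather than passed in by the caller.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Format a pointer into buf. When the output is headed for userspace the
 * address is replaced with zeros. Returns the snprintf() result.
 */
int demo_format_ptr(char *buf, size_t len, const void *ptr, bool to_userspace)
{
    uintptr_t shown = to_userspace ? 0 : (uintptr_t)ptr;

    return snprintf(buf, len, "%016llx", (unsigned long long)shown);
}
```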