.\" Copyright (c) 2021, IBM Corporation.
.\" Written by Mike Rapoport <rppt@linux.ibm.com>
.\"
.\" Based on memfd_create(2) man page
.\" Copyright (C) 2014 Michael Kerrisk <mtk.manpages@gmail.com>
.\" and Copyright (C) 2014 David Herrmann <dh.herrmann@gmail.com>
.\"
.\" SPDX-License-Identifier: GPL-2.0-or-later
.\"
.TH memfd_secret 2 2023-03-30 "Linux man-pages 6.05.01"
.SH NAME
memfd_secret \- create an anonymous RAM-based file
to access secret memory regions
.SH LIBRARY
Standard C library
.RI ( libc ", " \-lc )
.SH SYNOPSIS
.nf
.PP
.BR "#include <sys/syscall.h>" " /* Definition of " SYS_* " constants */"
.B #include <unistd.h>
.PP
.BI "int syscall(SYS_memfd_secret, unsigned int " flags );
.fi
.PP
.IR Note :
glibc provides no wrapper for
.BR memfd_secret (),
necessitating the use of
.BR syscall (2).
.SH DESCRIPTION
.BR memfd_secret ()
creates an anonymous RAM-based file and returns a file descriptor
that refers to it.
The file provides a way to create and access memory regions
with stronger protection than usual RAM-based files and
anonymous memory mappings.
Once all open references to the file are closed,
it is automatically released.
The initial size of the file is set to 0.
Following the call, the file size should be set using
.BR ftruncate (2).
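.PP
The create-then-size sequence described above can be sketched as follows.
This is a minimal illustration, not a reference implementation:
the helper name
.I secret_fd_create
is hypothetical, and the
.B #ifdef
guard is an assumption to cope with older headers that lack
.BR SYS_memfd_secret .

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: create a secret memory file and give it a size.
 * Returns the new file descriptor, or -1 with errno set.
 */
static int secret_fd_create(off_t size)
{
    assert(size > 0);
#ifdef SYS_memfd_secret
    /* No glibc wrapper exists, so go through syscall(2); flags are 0. */
    int fd = (int) syscall(SYS_memfd_secret, 0U);
    if (fd == -1)
        return -1;

    /* The file starts with size 0; set the size before mapping it. */
    if (ftruncate(fd, size) == -1) {
        int saved = errno;
        close(fd);
        errno = saved;      /* report the ftruncate(2) failure */
        return -1;
    }
    return fd;
#else
    /* Assumption: headers without the syscall number mean "unsupported". */
    errno = ENOSYS;
    return -1;
#endif
}
```

A caller would typically follow a successful call with
.BR mmap (2)
and close the descriptor once it is no longer needed.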
.PP
The memory areas backing the file created with
.BR memfd_secret ()
are visible only to the processes that have access to the file descriptor.
The memory region is removed from the kernel page tables
and only the page tables of the processes holding the file descriptor
map the corresponding physical memory.
(Thus, the pages in the region can't be accessed by the kernel itself,
so that, for example, pointers to the region can't be passed to
system calls.)
.PP
The following values may be bitwise ORed in
.I flags
to control the behavior of
.BR memfd_secret ():
.TP
.B FD_CLOEXEC
Set the close-on-exec flag on the new file descriptor,
which causes the region to be removed from the process on
.BR execve (2).
See the description of the
.B O_CLOEXEC
flag in
.BR open (2).
.PP
As its return value,
.BR memfd_secret ()
returns a new file descriptor that refers to an anonymous file.
This file descriptor is opened for both reading and writing
.RB ( O_RDWR )
and
.B O_LARGEFILE
is set for the file descriptor.
.PP
With respect to
.BR fork (2)
and
.BR execve (2),
the usual semantics apply for the file descriptor created by
.BR memfd_secret ().
A copy of the file descriptor is inherited by the child produced by
.BR fork (2)
and refers to the same file.
The file descriptor is preserved across
.BR execve (2),
unless the close-on-exec flag has been set.
.PP
The memory region is locked into memory in the same way as with
.BR mlock (2),
so that it will never be written into swap,
and hibernation is inhibited for as long as any
.BR memfd_secret ()
descriptions exist.
However, the implementation of
.BR memfd_secret ()
will not try to populate the whole range during the
.BR mmap (2)
call that attaches the region into the process's address space;
instead, the pages are only actually allocated
as they are faulted in.
The amount of memory allowed for memory mappings
of the file descriptor obeys the same rules as
.BR mlock (2)
and cannot exceed
.BR RLIMIT_MEMLOCK .
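.PP
As a rough sketch of the mapping step, the following maps a secret region
and compares a requested length against the soft
.B RLIMIT_MEMLOCK
limit.
The helper names
.I fits_memlock_limit
and
.I secret_alloc
are hypothetical.
The limit check is only an approximation, since it ignores memory
the process has already locked, and the use of
.B MAP_SHARED
is an assumption about what the implementation accepts.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: returns 1 if len is within the soft
 * RLIMIT_MEMLOCK limit, 0 if it is not, and -1 on error.
 */
static int fits_memlock_limit(size_t len)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_MEMLOCK, &rl) == -1)
        return -1;
    return rl.rlim_cur == RLIM_INFINITY || len <= rl.rlim_cur;
}

/*
 * Hypothetical helper: map len bytes of secret memory.
 * Returns the mapping, or NULL with errno set.
 */
static void *secret_alloc(size_t len)
{
    assert(len > 0);
#ifdef SYS_memfd_secret
    int fd = (int) syscall(SYS_memfd_secret, 0U);
    if (fd == -1)
        return NULL;

    void *p = MAP_FAILED;
    if (ftruncate(fd, (off_t) len) == 0) {
        /* Pages are not populated here; they are faulted in on first use. */
        p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    }

    /* The mapping keeps the file alive, so the descriptor can be closed. */
    int saved = errno;
    close(fd);
    errno = saved;
    return p == MAP_FAILED ? NULL : p;
#else
    errno = ENOSYS;
    return NULL;
#endif
}
```

The region is released with
.BR munmap (2)
using the same length that was passed to
.IR secret_alloc .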
.SH RETURN VALUE
On success,
.BR memfd_secret ()
returns a new file descriptor.
On error, \-1 is returned and
.I errno
is set to indicate the error.
.SH ERRORS
.TP
.B EINVAL
.I flags
included unknown bits.
.TP
.B EMFILE
The per-process limit on the number of open file descriptors has been reached.
.TP
.B ENFILE
The system-wide limit on the total number of open files has been reached.
.TP
.B ENOMEM
There was insufficient memory to create a new anonymous file.
.TP
.B ENOSYS
.BR memfd_secret ()
is not implemented on this architecture,
or has not been enabled on the kernel command line with
.BR secretmem.enable =1.
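.PP
A program that must also run on kernels without
.BR memfd_secret ()
may want to probe for it and sort the errors listed above into a few cases.
The sketch below is one way to do that.
The enum, its constants, and both function names are hypothetical,
and treating headers that lack
.B SYS_memfd_secret
as "unsupported" is an assumption.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical classification of the outcomes of memfd_secret(). */
enum secret_status {
    SECRET_OK,            /* call succeeded */
    SECRET_UNSUPPORTED,   /* ENOSYS: not implemented or not enabled */
    SECRET_FD_LIMIT,      /* EMFILE or ENFILE: descriptor limits reached */
    SECRET_NO_MEMORY,     /* ENOMEM */
    SECRET_OTHER          /* anything else, including EINVAL */
};

/* Map an errno value (0 meaning success) to a status. */
static enum secret_status classify_secret_errno(int err)
{
    assert(err >= 0);     /* errno values are non-negative */
    switch (err) {
    case 0:       return SECRET_OK;
    case ENOSYS:  return SECRET_UNSUPPORTED;
    case EMFILE:
    case ENFILE:  return SECRET_FD_LIMIT;
    case ENOMEM:  return SECRET_NO_MEMORY;
    default:      return SECRET_OTHER;
    }
}

/* Try the system call once and report what happened. */
static enum secret_status secret_probe(void)
{
#ifdef SYS_memfd_secret
    int fd = (int) syscall(SYS_memfd_secret, 0U);
    if (fd == -1)
        return classify_secret_errno(errno);
    close(fd);
    return SECRET_OK;
#else
    return SECRET_UNSUPPORTED;
#endif
}
```

A caller could fall back to ordinary anonymous memory locked with
.BR mlock (2)
when the probe reports
.IR SECRET_UNSUPPORTED ,
accepting the weaker protection.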
.SH STANDARDS
Linux.
.SH HISTORY
Linux 5.14.
.SH NOTES
The
.BR memfd_secret ()
system call is designed to allow a user-space process
to create a range of memory that is inaccessible to anybody else,
the kernel included.
There is no 100% guarantee that the kernel won't be able to access
memory ranges backed by
.BR memfd_secret ()
in any circumstances, but nevertheless,
it is much harder to exfiltrate data from these regions.
.PP
.BR memfd_secret ()
provides the following protections:
.IP \[bu] 3
Enhanced protection
(in conjunction with all the other in-kernel attack prevention systems)
against ROP attacks.
Absence of any in-kernel primitive for accessing memory backed by
.BR memfd_secret ()
means that a one-gadget ROP attack
can't work to perform data exfiltration.
The attacker would need to find enough ROP gadgets
to reconstruct the missing page table entries,
which significantly increases the difficulty of the attack,
especially when other protections like the kernel stack size limit
and address space layout randomization are in place.
.IP \[bu]
Prevent cross-process user-space memory exposures.
Once a region for a
.BR memfd_secret ()
memory mapping is allocated,
the user can't accidentally pass it into the kernel
to be transmitted somewhere.
The memory pages in this region cannot be accessed via the direct map
and they are disallowed in
.BR get_user_pages ().
.IP \[bu]
Harden against exploited kernel flaws.
In order to access memory areas backed by
.BR memfd_secret (),
a kernel-side attack would need to
either walk the page tables and create new ones,
or spawn a new privileged user-space process to perform
secrets exfiltration using
.BR ptrace (2).
.PP
The way
.BR memfd_secret ()
allocates and locks the memory may impact overall system performance.
Therefore, the system call is disabled by default and is available only
if the system administrator has turned it on using the
"secretmem.enable=y" kernel parameter.
.PP
To prevent potential data leaks of memory regions backed by
.BR memfd_secret ()
from a hibernation image,
hibernation is prevented when there are active
.BR memfd_secret ()
users.
.SH SEE ALSO
.BR fcntl (2),
.BR ftruncate (2),
.BR mlock (2),
.BR memfd_create (2),
.BR mmap (2),
.BR setrlimit (2)