QEMU virtio-fs shared file system daemon
========================================

Synopsis
--------

**virtiofsd** [*OPTIONS*]

Description
-----------

Share a host directory tree with a guest through a virtio-fs device.  This
program is a vhost-user backend that implements the virtio-fs device.  Each
virtio-fs device instance requires its own virtiofsd process.

This program is designed to work with QEMU's ``--device vhost-user-fs-pci``
but should work with any virtual machine monitor (VMM) that supports
vhost-user.  See the Examples section below.

This program must be run as the root user.  The program drops privileges where
possible during startup although it must be able to create and access files
with any uid/gid:

* The ability to invoke syscalls is limited using seccomp(2).
* Linux capabilities(7) are dropped.

In "namespace" sandbox mode the program switches into a new mount
namespace and invokes pivot_root(2) to make the shared directory tree its root.
New pid and net namespaces are also created to isolate the process.

In "chroot" sandbox mode the program invokes chroot(2) to make the shared
directory tree its root. This mode is intended for container environments where
the container runtime has already set up the namespaces and the program does
not have permission to create namespaces itself.

Both sandbox modes prevent "file system escapes" due to symlinks and other file
system objects that might lead to files outside the shared directory.

Options
-------

.. program:: virtiofsd

.. option:: -h, --help

  Print help.

.. option:: -V, --version

  Print version.

.. option:: -d

  Enable debug output.

.. option:: --syslog

  Print log messages to syslog instead of stderr.

.. option:: -o OPTION

  * debug -
    Enable debug output.

  * flock|no_flock -
    Enable/disable flock.  The default is ``no_flock``.

  * modcaps=CAPLIST -
    Modify the list of capabilities allowed; CAPLIST is a colon separated
    list of capabilities, each preceded by either + or -, e.g.
    ``+sys_admin:-chown``.

  * log_level=LEVEL -
    Print only log messages matching LEVEL or more severe.  LEVEL is one of
    ``err``, ``warn``, ``info``, or ``debug``.  The default is ``info``.

  * posix_lock|no_posix_lock -
    Enable/disable remote POSIX locks.  The default is ``no_posix_lock``.

  * readdirplus|no_readdirplus -
    Enable/disable readdirplus.  The default is ``readdirplus``.

  * sandbox=namespace|chroot -
    Sandbox mode:

    - namespace: Create mount, pid, and net namespaces and pivot_root(2) into
      the shared directory.
    - chroot: chroot(2) into the shared directory (use in containers).

    The default is ``namespace``.

  * source=PATH -
    Share host directory tree located at PATH.  This option is required.

  * timeout=TIMEOUT -
    I/O timeout in seconds.  The default depends on the ``--cache`` option.

  * writeback|no_writeback -
    Enable/disable writeback cache. The cache allows the FUSE client to buffer
    and merge write requests.  The default is ``no_writeback``.

  * xattr|no_xattr -
    Enable/disable extended attributes (xattr) on files and directories.  The
    default is ``no_xattr``.

  * posix_acl|no_posix_acl -
    Enable/disable POSIX ACL support.  POSIX ACLs are disabled by default.

  * security_label|no_security_label -
    Enable/disable security label support. Security labels are disabled by
    default. When enabled, the client can send a MAC label for a file during
    file creation. Typically this is expected to be an SELinux security
    label. The server will try to set that label on the newly created file
    atomically wherever possible.

  * killpriv_v2|no_killpriv_v2 -
    Enable/disable ``FUSE_HANDLE_KILLPRIV_V2`` support. KILLPRIV_V2 is enabled
    by default as long as the client supports it. Enabling this option helps
    with performance in the write path.
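
  The ``modcaps=CAPLIST`` format described above can be illustrated with a
  short Python sketch.  It is not part of virtiofsd: the function name
  ``parse_modcaps`` is hypothetical and the error handling is an assumption.

  .. code-block:: python

     def parse_modcaps(caplist):
         """Split a colon separated CAPLIST into (added, dropped) names."""
         added, dropped = [], []
         for item in caplist.split(":"):
             # Each capability must be preceded by either + or -.
             if len(item) < 2 or item[0] not in "+-":
                 raise ValueError(f"bad capability entry: {item!r}")
             (added if item[0] == "+" else dropped).append(item[1:])
         return added, dropped


     added, dropped = parse_modcaps("+sys_admin:-chown")
     print("add:", added, "drop:", dropped)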

.. option:: --socket-path=PATH

  Listen on vhost-user UNIX domain socket at PATH.

.. option:: --socket-group=GROUP

  Set the vhost-user UNIX domain socket gid to GROUP.

.. option:: --fd=FDNUM

  Accept connections from vhost-user UNIX domain socket file descriptor FDNUM.
  The file descriptor must already be listening for connections.
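
  As an illustration, a management program could create the listening socket
  itself and hand it over.  The following Python sketch assumes a ``virtiofsd``
  binary on ``PATH``; the helper names and the listen backlog of 1 are
  assumptions made for the example, not requirements of virtiofsd.

  .. code-block:: python

     import socket
     import subprocess


     def listening_socket(path):
         """Create a UNIX domain socket that is already listening at path."""
         sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
         sock.bind(path)
         sock.listen(1)  # backlog of 1 is an assumption for this sketch
         return sock


     def virtiofsd_command(sock, source):
         """Build the command line that passes the socket via --fd."""
         return ["virtiofsd", f"--fd={sock.fileno()}", "-o", f"source={source}"]


     def launch(sock, source):
         """Start virtiofsd, keeping the descriptor open in the child."""
         # pass_fds leaves the descriptor open under the same number.
         return subprocess.Popen(virtiofsd_command(sock, source),
                                 pass_fds=[sock.fileno()])

  ``launch()`` is only defined here, not called, since it needs root
  privileges and an installed virtiofsd.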

.. option:: --thread-pool-size=NUM

  Restrict the number of worker threads per request queue to NUM.  The default
  is 64.

.. option:: --cache=none|auto|always

  Select the desired trade-off between coherency and performance.  ``none``
  forbids the FUSE client from caching to achieve best coherency at the cost of
  performance.  ``auto`` behaves similarly to NFS with a 1 second metadata cache
  timeout.  ``always`` sets a long cache lifetime at the expense of coherency.
  The default is ``auto``.

Extended attribute (xattr) mapping
----------------------------------

By default the names of xattrs used by the client are passed through to the
server file system.  This can be a problem where either those xattr names are
used by something on the server (e.g. selinux client/server confusion) or if
``virtiofsd`` is running in a container with restricted privileges where it
cannot access some attributes.

Mapping syntax
~~~~~~~~~~~~~~

A mapping of xattr names can be made using ``-o xattrmap=mapping``, where the
``mapping`` string consists of a series of rules.

The first matching rule terminates the mapping.
The set of rules must include a terminating rule to match any remaining attributes
at the end.

Each rule consists of a number of fields separated with a separator that is the
first non-white space character in the rule.  This separator must then be used
for the whole rule.
White space may be added before and after each rule.

Using ':' as the separator, a rule is of the form:

``:type:scope:key:prepend:``

**scope** is:

- 'client' - match 'key' against an xattr name from the client for
  setxattr/getxattr/removexattr
- 'server' - match 'prepend' against an xattr name from the server
  for listxattr
- 'all' - can be used to make a single rule where both the server
  and client matches are triggered.

**type** is one of:

- 'prefix' - is designed to prepend and strip a prefix;  the modified
  attributes then being passed on to the client/server.

- 'ok' - Causes the rule set to be terminated when a match is found
  while allowing matching xattrs through unchanged.
  It is intended both as a way of explicitly terminating
  the list of rules, and to allow some xattrs to skip following rules.

- 'bad' - If a client tries to use a name matching 'key' it is
  denied using EPERM; when the server passes an attribute
  name matching 'prepend' it is hidden.  In many ways its use is very like
  'ok' as either an explicit terminator or for special handling of certain
  patterns.

- 'unsupported' - If a client tries to use a name matching 'key' it is
  denied using ENOTSUP; when the server passes an attribute
  name matching 'prepend' it is hidden.  In many ways its use is very like
  'ok' as either an explicit terminator or for special handling of certain
  patterns.

**key** is a string tested as a prefix on an attribute name originating
on the client.  It may be empty, in which case a 'client' rule
will always match on client names.

**prepend** is a string tested as a prefix on an attribute name originating
on the server, and used as a new prefix.  It may be empty
in which case a 'server' rule will always match on all names from
the server.

e.g.:

  ``:prefix:client:trusted.:user.virtiofs.:``

  will match 'trusted.' attributes in client calls and prefix them before
  passing them to the server.

  ``:prefix:server::user.virtiofs.:``

  will strip 'user.virtiofs.' from all server replies.

  ``:prefix:all:trusted.:user.virtiofs.:``

  combines the previous two cases into a single rule.

  ``:ok:client:user.::``

  will allow get/set xattr for 'user.' xattrs and ignore
  following rules.

  ``:ok:server::security.:``

  will pass 'security.' xattrs in listxattr from the server
  and ignore following rules.

  ``:ok:all:::``

  will terminate the rule search passing any remaining attributes
  in both directions.

  ``:bad:server::security.:``

  would hide 'security.' xattr's in listxattr from the server.
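
The rule syntax and first-match behaviour described above can be sketched in
Python.  This is an illustrative model written from this description, not the
virtiofsd implementation: the names ``Rule``, ``parse_xattrmap``,
``map_client_name`` and ``map_server_name`` are hypothetical, rule types are
not validated, and the handling of names that no rule matches is an
assumption.

.. code-block:: python

   import errno
   from typing import List, NamedTuple, Optional


   class Rule(NamedTuple):
       kind: str      # 'prefix', 'ok', 'bad' or 'unsupported'
       scope: str     # 'client', 'server' or 'all'
       key: str
       prepend: str


   def parse_xattrmap(mapping: str) -> List[Rule]:
       """Split a mapping string into rules ('map' rules are not handled)."""
       rules = []
       rest = mapping.lstrip()
       while rest:
           sep = rest[0]  # first non-white-space character of the rule
           fields = rest[1:].split(sep, 4)
           if len(fields) != 5:
               raise ValueError(f"incomplete rule: {rest!r}")
           rules.append(Rule(*fields[:4]))
           rest = fields[4].lstrip()  # white space is allowed between rules
       return rules


   def map_client_name(rules: List[Rule], name: str) -> str:
       """Name passed to the server for setxattr/getxattr/removexattr."""
       for rule in rules:
           if rule.scope in ("client", "all") and name.startswith(rule.key):
               if rule.kind == "prefix":
                   return rule.prepend + name
               if rule.kind == "ok":
                   return name
               if rule.kind == "bad":
                   raise OSError(errno.EPERM, "denied by xattrmap", name)
               if rule.kind == "unsupported":
                   raise OSError(errno.ENOTSUP, "denied by xattrmap", name)
       # Assumption: a name that no rule matches is denied.
       raise OSError(errno.EPERM, "no matching rule", name)


   def map_server_name(rules: List[Rule], name: str) -> Optional[str]:
       """Name shown to the client by listxattr, or None if it is hidden."""
       for rule in rules:
           if rule.scope in ("server", "all") and name.startswith(rule.prepend):
               if rule.kind == "prefix":
                   return name[len(rule.prepend):]
               if rule.kind == "ok":
                   return name
               if rule.kind in ("bad", "unsupported"):
                   return None
       return None  # assumption: unmatched names are hidden


   rules = parse_xattrmap(":prefix:all:trusted.:user.virtiofs.::ok:all:::")
   print(map_client_name(rules, "trusted.foo"))
   print(map_server_name(rules, "user.virtiofs.trusted.foo"))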

A simpler 'map' type provides a shorter syntax for the common case:

``:map:key:prepend:``

The 'map' type adds a number of separate rules to add **prepend** as a prefix
to the matched **key** (or all attributes if **key** is empty).
There may be at most one 'map' rule and it must be the last rule in the set.
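
As a rough Python sketch, the expansion of a 'map' rule can be modelled as
follows.  The rule lists mirror the equivalences given under "Mapping
examples" in this document and are not taken from the virtiofsd source; the
function names ``expand_map`` and ``format_rules`` are hypothetical.

.. code-block:: python

   def expand_map(key, prepend):
       """Rules equivalent to a 'map' rule, as (type, scope, key, prepend)."""
       rules = [("prefix", "all", key, prepend)]
       if not key:
           # Every attribute is prefixed: hide anything the host set
           # without the prefix.
           rules.append(("bad", "all", "", ""))
       else:
           rules += [
               ("bad", "server", "", key),      # hide unprefixed matches on host
               ("bad", "client", prepend, ""),  # block direct access to target
               ("ok", "all", "", ""),           # let everything else through
           ]
       return rules


   def format_rules(rules, sep=":"):
       """Join rules into a mapping string using sep as the field separator."""
       return "".join(sep + sep.join(rule) + sep for rule in rules)


   print(format_rules(expand_map("", "user.virtiofs.")))
   print(format_rules(expand_map("trusted.", "user.virtiofs."), sep="/"))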

Note: When the 'security.capability' xattr is remapped, the daemon has to do
extra work to remove it during many operations, which the host kernel normally
does itself.

Security considerations
~~~~~~~~~~~~~~~~~~~~~~~

Operating systems typically partition the xattr namespace using
well-defined name prefixes. Each partition may have different
access controls applied. For example, on Linux there are multiple
partitions:

 * ``system.*`` - access varies depending on attribute & filesystem
 * ``security.*`` - only processes with CAP_SYS_ADMIN
 * ``trusted.*`` - only processes with CAP_SYS_ADMIN
 * ``user.*`` - any process granted by file permissions / ownership

Other operating systems, such as FreeBSD, have different name prefixes
and access control rules.

When remapping attributes on the host, it is important to
ensure that the remapping does not allow a guest user to
evade the guest access control rules.

Consider if ``trusted.*`` from the guest was remapped to
``user.virtiofs.trusted.*`` on the host. An unprivileged
user in a Linux guest has the ability to write to xattrs
under ``user.*``. Thus the user can evade the access
control restriction on ``trusted.*`` by instead writing
to ``user.virtiofs.trusted.*``.
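
The collision can be made concrete with a small Python model of such an
insecure mapping.  The function ``insecure_host_name`` and the attribute name
``trusted.secret`` are hypothetical; the model assumes that 'trusted.' names
are prefixed and all other names are passed through unchanged.

.. code-block:: python

   def insecure_host_name(guest_name):
       """Host xattr name under a mapping that only remaps 'trusted.'."""
       if guest_name.startswith("trusted."):
           return "user.virtiofs." + guest_name
       return guest_name  # everything else passes through unchanged


   # Name used by a privileged guest process (needs CAP_SYS_ADMIN in the guest).
   privileged = insecure_host_name("trusted.secret")
   # Name an unprivileged guest user can write, given suitable file permissions.
   unprivileged = insecure_host_name("user.virtiofs.trusted.secret")

   # Both guest names land on the same host attribute.
   print(privileged == unprivileged)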

As noted above, the partitions used and the access controls
applied will vary across guest operating systems, so it is not wise to
try to predict what the guest OS will use.

The simplest way to avoid an insecure configuration is
to remap all xattrs at once, to a given fixed prefix.
This is shown in example (1) below.

If selectively mapping only a subset of xattr prefixes,
then rules must be added to explicitly block direct
access to the target of the remapping. This is shown
in example (2) below.

Mapping examples
~~~~~~~~~~~~~~~~

1) Prefix all attributes with 'user.virtiofs.'

::

 -o xattrmap=":prefix:all::user.virtiofs.::bad:all:::"


This uses two rules, using : as the field separator;
the first rule prefixes and strips 'user.virtiofs.',
the second rule hides any non-prefixed attributes that
the host set.

This is equivalent to the 'map' rule:

::

 -o xattrmap=":map::user.virtiofs.:"

2) Prefix 'trusted.' attributes, allow others through

::

   "/prefix/all/trusted./user.virtiofs./
    /bad/server//trusted./
    /bad/client/user.virtiofs.//
    /ok/all///"


Here there are four rules, using / as the field
separator, and also demonstrating that new lines can
be included between rules.
The first rule is the prefixing of 'trusted.' and
stripping of 'user.virtiofs.'.
The second rule hides unprefixed 'trusted.' attributes
on the host.
The third rule stops a guest from explicitly setting
the 'user.virtiofs.' path directly to prevent access
control bypass on the target of the earlier prefix
remapping.
Finally, the fourth rule lets all remaining attributes
through.

This is equivalent to the 'map' rule:

::

 -o xattrmap="/map/trusted./user.virtiofs./"

3) Hide 'security.' attributes, and allow everything else

::

    "/bad/all/security./security./
     /ok/all///"

The first rule combines what could be separate client and server
rules into a single 'all' rule, matching 'security.' in either
client arguments or lists returned from the host.  This stops
the client seeing any 'security.' attributes on the server and
stops it setting any.

SELinux support
---------------

SELinux support can be enabled by running virtiofsd with the option
``-o security_label``.  This tries to save the guest's security context
in the ``security.selinux`` xattr on the host, which might fail if the host's
SELinux policy does not permit virtiofsd to perform this operation.

Hence, it is preferable to remap the guest's ``security.selinux`` xattr to,
say, ``trusted.virtiofs.security.selinux`` on the host:

``-o xattrmap=:map:security.selinux:trusted.virtiofs.:``

This ensures that the guest's and the host's SELinux xattrs on the same file
remain separate and do not interfere with each other.  It also allows both
host and guest to implement their own separate SELinux policies.

Setting a trusted xattr on the host requires CAP_SYS_ADMIN, so this capability
needs to be added to the daemon:

``-o modcaps=+sys_admin``

Granting CAP_SYS_ADMIN increases the risk to the system: virtiofsd becomes more
powerful and, if it is compromised, it can do a lot of damage to the host
system.  Keep this trade-off in mind when making a decision.

Examples
--------

Export ``/var/lib/fs/vm001/`` on vhost-user UNIX domain socket
``/var/run/vm001-vhost-fs.sock``:

.. parsed-literal::

  host# virtiofsd --socket-path=/var/run/vm001-vhost-fs.sock -o source=/var/lib/fs/vm001
  host# |qemu_system| \\
        -chardev socket,id=char0,path=/var/run/vm001-vhost-fs.sock \\
        -device vhost-user-fs-pci,chardev=char0,tag=myfs \\
        -object memory-backend-memfd,id=mem,size=4G,share=on \\
        -numa node,memdev=mem \\
        ...
  guest# mount -t virtiofs myfs /mnt