From 9b1a5bc365e379b4b13849adacfde3427f55ca38 Mon Sep 17 00:00:00 2001
From: =?utf8?q?Zbigniew=20J=C4=99drzejewski-Szmek?=
Date: Tue, 15 Oct 2024 18:53:00 +0200
Subject: [PATCH] man/systemd-nspawn: emphasise that user namespaces are strongly recommended

---
 man/systemd-nspawn.xml | 65 +++++++++++++++++++++++-------------------
 1 file changed, 35 insertions(+), 30 deletions(-)

diff --git a/man/systemd-nspawn.xml b/man/systemd-nspawn.xml
index cd7d349b95..4feedd8644 100644
--- a/man/systemd-nspawn.xml
+++ b/man/systemd-nspawn.xml
@@ -46,8 +46,8 @@
 systemd-nspawn may be used to run a command or OS in a light-weight namespace
 container. In many ways it is similar to
 chroot1, but more powerful
-since it fully virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems and
-the host and domain name.
+since it virtualizes the file system hierarchy, as well as the process tree, the various IPC subsystems, and
+the host and domain names.

 systemd-nspawn may be invoked on any directory tree containing an operating system tree,
 using the command line option. By using the option an OS
@@ -59,11 +59,14 @@ project='man-pages'>chroot1
 systemd-nspawn may be used to boot full Linux-based operating systems in a container.

-systemd-nspawn limits access to various kernel interfaces in the container to read-only,
-such as /sys/, /proc/sys/ or /sys/fs/selinux/. The
-host's network interfaces and the system clock may not be changed from within the container. Device nodes may not
-be created. The host system cannot be rebooted and kernel modules may not be loaded from within the
-container.
+systemd-nspawn limits access to various kernel interfaces in the container to
+read-only, such as /sys/, /proc/sys/, or
+/sys/fs/selinux/. The host's network interfaces and the system clock may not be
+changed from within the container. Device nodes may not be created. The host system cannot be rebooted
+and kernel modules may not be loaded from within the container. This sandbox can easily be
+circumvented from within the container if user namespaces are not used. This means that
+untrusted code must always be run in a user namespace, see the discussion of the
+option below.

 Use a tool like dnf8,

 Note that systemd-nspawn will mount file systems private to the container to
-/dev/, /run/ and similar. These will not be visible outside of the
-container, and their contents will be lost when the container exits.
+/dev/, /run/, and similar. These will not be visible outside of
+the container, and their contents will be lost when the container exits.

 Note that running two systemd-nspawn containers from the same directory tree will not make processes in them
 see each other. The PID namespace separation of the two containers is complete and the containers
@@ -810,17 +813,6 @@
 range. In this mode, the number of UIDs/GIDs assigned to the container is 65536, and the owner UID/GID of the
 root directory must be a multiple of 65536.

-If the parameter is no, user namespacing is turned off. This is
-the default.
-
-If the parameter is identity, user namespacing is employed with
-an identity mapping for the first 65536 UIDs/GIDs. This is mostly equivalent to
-. While it does not provide UID/GID isolation, since all
-host and container UIDs/GIDs are chosen identically it does provide process capability isolation,
-and hence is often a good choice if proper user namespacing with distinct UID maps is not
-appropriate.
-
 The special value pick turns on user namespacing. In this case the UID/GID
 range is automatically chosen. As first step, the file owner UID/GID of the root directory of the
 container's directory tree is read, and it is checked that no other container is
@@ -837,22 +829,35 @@
 for it, and thus in the (possibly expensive) file ownership adjustment operation. However,
 subsequent invocations of the container will be cheap (unless of course the picked UID/GID range is
 assigned to a different use by then).
+
+If the parameter is no, user namespacing is turned off. This is
+the default when systemd-nspawn is invoked directly. (Note that the
+systemd-nspawn@.service unit enables private users.) This option is not
+secure and must not be used to run untrusted code.
+
+If the parameter is identity, user namespacing is employed with
+an identity mapping for the first 65536 UIDs/GIDs. This is mostly equivalent to
+. While it does not provide UID/GID isolation, since all
+host and container UIDs/GIDs are chosen identically it does provide process capability isolation,
+but may be useful if proper user namespacing with distinct UID maps is not possible. This option is
+not secure and must not be used to run untrusted code.

-It is recommended to assign at least 65536 UIDs/GIDs to each container, so that the usable UID/GID range in the
-container covers 16 bit. For best security, do not assign overlapping UID/GID ranges to multiple containers. It is
-hence a good idea to use the upper 16 bit of the host 32-bit UIDs/GIDs as container identifier, while the lower 16
-bit encode the container UID/GID used. This is in fact the behavior enforced by the
-option.
+It is recommended to assign at least 65536 UIDs/GIDs to each container, so that the usable
+UID/GID range in the container covers 16 bits. For best security, do not assign overlapping UID/GID
+ranges to multiple containers. It is hence a good idea to use the upper 16 bit of the host 32-bit
+UIDs/GIDs as container identifier, while the lower 16 bits encode the container UID/GID used. This is
+in fact the behavior enforced by the option.

-When user namespaces are used, the GID range assigned to each container is always chosen identical to the
-UID range.
+When user namespaces are used, the GID range assigned to each container is always chosen
+identical to the UID range.

-In most cases, using is the recommended option as it enhances
-container security massively and operates fully automatically in most cases.
+In most cases, using is the recommended option as user
+namespacing is required for security, and this option massively enhances container security while
+operating fully automatically in most cases.

 Note that the picked UID/GID range is not written to /etc/passwd or
-/etc/group. In fact, the allocation of the range is not stored persistently anywhere,
+/etc/group. In fact, the allocation of the range is not stored persistently,
 except in the file ownership of the files and directories of the container.

 Note that when user namespacing is used file ownership on disk reflects this, and all of the container's
-- 
2.25.1