author | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-10 20:49:52 +0000 |
---|---|---|
committer | Daniel Baumann <daniel.baumann@progress-linux.org> | 2024-04-10 20:49:52 +0000 |
commit | 55944e5e40b1be2afc4855d8d2baf4b73d1876b5 (patch) | |
tree | 33f869f55a1b149e9b7c2b7e201867ca5dd52992 /docs | |
parent | Initial commit. (diff) | |
Adding upstream version 255.4. (tag: upstream/255.4)
Signed-off-by: Daniel Baumann <daniel.baumann@progress-linux.org>
Diffstat (limited to 'docs')
100 files changed, 15889 insertions, 0 deletions
diff --git a/docs/.gitattributes b/docs/.gitattributes new file mode 100644 index 0000000..dbe845e --- /dev/null +++ b/docs/.gitattributes @@ -0,0 +1,2 @@ +*.png binary generated +*.woff binary diff --git a/docs/.gitignore b/docs/.gitignore new file mode 100644 index 0000000..ab2d677 --- /dev/null +++ b/docs/.gitignore @@ -0,0 +1,2 @@ +/_site/ +/.jekyll-cache/ diff --git a/docs/API_FILE_SYSTEMS.md b/docs/API_FILE_SYSTEMS.md new file mode 100644 index 0000000..84a1900 --- /dev/null +++ b/docs/API_FILE_SYSTEMS.md @@ -0,0 +1,52 @@ +--- +title: API File Systems +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# API File Systems + +_So you are seeing all kinds of weird file systems in the output of mount(8) that are not listed in `/etc/fstab`, and you wonder what those are, how you can get rid of them, or at least change their mount options._ + +The Linux kernel provides a number of different ways for userspace to communicate with it. For many facilities there are system calls, others are hidden behind Netlink interfaces, and still others are exposed via virtual file systems such as `/proc` or `/sys`. These file systems are programming interfaces; they are not actually backed by real, persistent storage. They simply use the file system interface of the kernel as an interface to various unrelated mechanisms. Similarly, there are file systems that userspace uses for its own API purposes, to store shared memory segments, shared temporary files or sockets. In this article we want to discuss all these kinds of _API file systems_. 
More specifically, here's a list of these file systems typical Linux systems currently have: + +* `/sys` for exposing kernel devices, drivers and other kernel information to userspace +* `/proc` for exposing kernel settings, processes and other kernel information to userspace +* `/dev` for exposing kernel device nodes to userspace +* `/run` as location for userspace sockets and files +* `/tmp` as location for volatile, temporary userspace file system objects (X) +* `/sys/fs/cgroup` (and file systems below that) for exposing the kernel control group hierarchy +* `/sys/kernel/security`, `/sys/kernel/debug` (X), `/sys/kernel/config` (X) for exposing special purpose kernel objects to userspace +* `/sys/fs/selinux` for exposing SELinux security data to userspace +* `/dev/shm` as location for userspace shared memory objects +* `/dev/pts` for exposing kernel pseudo TTY device nodes to userspace +* `/proc/sys/fs/binfmt_misc` for registering additional binary formats in the kernel (X) +* `/dev/mqueue` for exposing mqueue IPC objects to userspace (X) +* `/dev/hugepages` as a userspace API for allocating "huge" memory pages (X) +* `/sys/fs/fuse/connections` for exposing kernel FUSE connections to userspace (X) +* `/sys/firmware/efi/efivars` for exposing firmware variables to userspace + +All these _API file systems_ are mounted during very early boot-up of systemd and are generally not listed in `/etc/fstab`. Depending on the used kernel configuration some of these API file systems might not be available and others might exist instead. As these interfaces are important for kernel-to-userspace and userspace-to-userspace communication they are mounted automatically and without configuration or interference by the user. Disabling or changing their parameters might hence result in applications breaking as they can no longer access the interfaces they need. 
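To see which of these API file systems are actually mounted on a given machine, the kernel's mount table can be filtered by file system type. The following is a minimal sketch and not part of systemd: the function name `list_api_mounts` and the set of type names it matches are assumptions made for illustration, and the set may need adjusting for a particular kernel configuration.

```sh
# list_api_mounts: read mount-table lines in the /proc/self/mounts format
# ("source target fstype options dump pass") on stdin and print
# "target fstype options" for every entry whose type is a virtual/API one.
# The type names below are an assumption made for this sketch.
list_api_mounts() {
    while read -r _source target fstype options _rest; do
        case "$fstype" in
            proc|sysfs|devtmpfs|devpts|cgroup|cgroup2|securityfs|debugfs|configfs)
                # interfaces exposed by the kernel
                printf '%s %s %s\n' "$target" "$fstype" "$options"
                ;;
            selinuxfs|binfmt_misc|mqueue|hugetlbfs|fusectl|efivarfs|tmpfs)
                # further kernel interfaces and tmpfs-backed userspace locations
                printf '%s %s %s\n' "$target" "$fstype" "$options"
                ;;
        esac
    done
}

# Usage on a live system:
#   list_api_mounts </proc/self/mounts
```

Entries backed by real storage (for example an `ext4` root file system) are skipped, so the output shows only the mounts discussed in the list above together with their current options.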
+ +Even though the default settings of these file systems should normally be suitable for most setups, in some cases it might make sense to change the mount options, or possibly even disable some of these file systems. + +Even though normally none of these API file systems are listed in `/etc/fstab` they may be added there. If so, any options specified therein will be applied to that specific API file system. Hence: to alter the mount options or other parameters of these file systems, simply add them to `/etc/fstab` with the appropriate settings and you are done. Using this technique it is possible to change the source, type of a file system in addition to simply changing mount options. That is useful to turn `/tmp` to a true file system backed by a physical disk. + +It is possible to disable the automatic mounting of some (but not all) of these file systems, if that is required. These are marked with (X) in the list above. You may disable them simply by masking them: + +```sh +systemctl mask dev-hugepages.mount +``` + +This has the effect that the huge memory page API FS is not mounted by default, starting with the next boot. See [Three Levels of Off](http://0pointer.de/blog/projects/three-levels-of-off.html) for more information on masking. + +The systemd service [systemd-remount-fs.service](http://www.freedesktop.org/software/systemd/man/systemd-remount-fs.service.html) is responsible for applying mount parameters from `/etc/fstab` to the actual mounts. + +## Why are you telling me all this? I just want to get rid of the tmpfs backed /tmp! + +You have three options: + +1. Disable any mounting on `/tmp` so that it resides on the same physical file system as the root directory. For that, execute `systemctl mask tmp.mount` +2. Mount a different, physical file system to `/tmp`. For that, simply create an entry for it in `/etc/fstab` as you would do for any other file system. +3. Keep `/tmp` but increase/decrease the size of it. 
For that, also just create an entry for it in `/etc/fstab` as you would do for any other `tmpfs` file system, and use the right `size=` option. diff --git a/docs/ARCHITECTURE.md b/docs/ARCHITECTURE.md new file mode 100644 index 0000000..1478ea0 --- /dev/null +++ b/docs/ARCHITECTURE.md @@ -0,0 +1,225 @@ +--- +title: systemd Repository Architecture +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# The systemd Repository Architecture + +## Code Map + +This document provides a high-level overview of the various components of the +systemd repository. + +## Source Code + +Directories in `src/` provide the implementation of all daemons, libraries and +command-line tools shipped by the project. There are many, and more are +constantly added, so we will not enumerate them all here; the directory +names are self-explanatory. + +### Shared Code + +The code that is shared between components is split into a few directories, +each with a different purpose: + +- `src/basic/` and `src/fundamental/`: those directories contain code + primitives that are used by all other code. `src/fundamental/` is stricter, + because it is used for EFI and user-space code, while `src/basic/` is only used + for user-space code. The code in `src/fundamental/` cannot depend on any + other code in the tree, and `src/basic/` can depend only on itself and + `src/fundamental/`. For user-space, a static library is built from this code + and linked statically in various places. + +- `src/libsystemd/` implements the `libsystemd.so` shared library (also + available as static `libsystemd.a`). This code may use anything in + `src/basic/` or `src/fundamental/`. + +- `src/shared/` provides various utilities and code shared between other + components that is exposed as the `libsystemd-shared-<nnn>.so` shared library. + +The other subdirectories implement individual components. 
They may depend only +on `src/fundamental/` + `src/basic/`, or also on `src/libsystemd/`, or also on +`src/shared/`. + +You might wonder what kind of code belongs where. In general, the rule is that +code should be linked as few times as possible, ideally only once. Thus code that +is used by "higher-level" components (e.g. our binaries which are linked to +`libsystemd-shared-<nnn>.so`), would go to a subdirectory specific to that +component if it is only used there. If the code is to be shared between +components, it'd go to `src/shared/`. Shared code that is used by multiple +components that do not link to `libsystemd-shared-<nnn>.so` may live either in +`src/libsystemd/`, `src/basic/`, or `src/fundamental/`. Any code that is used +only for EFI goes under `src/boot/efi/`, and `src/fundamental/` if it is shared +with non-EFI components. + +To summarize: + +`src/fundamental/` +- may be used by all code in the tree +- may not use any code outside of `src/fundamental/` + +`src/basic/` +- may be used by all code in the tree +- may not use any code outside of `src/fundamental/` and `src/basic/` + +`src/libsystemd/` +- may be used by all code in the tree that links to `libsystemd.so` +- may not use any code outside of `src/fundamental/`, `src/basic/`, and + `src/libsystemd/` + +`src/shared/` +- may be used by all code in the tree, except for code in `src/basic/`, + `src/libsystemd/`, `src/nss-*`, `src/login/pam_systemd.*`, and files under + `src/journal/` that end up in `libjournal-client.a` convenience library. +- may not use any code outside of `src/fundamental/`, `src/basic/`, + `src/libsystemd/`, `src/shared/` + +### PID 1 + +Code located in `src/core/` implements the main logic of the systemd system (and user) +service manager. + +BPF helpers written in C and used by PID 1 can be found under `src/core/bpf/`. + +#### Implementing Unit Settings + +The system and session manager supports a large number of unit settings. 
These can generally +be configured in three ways: + +1. Via textual, INI-style configuration files called *unit* *files* +2. Via D-Bus messages to the manager +3. Via the `systemd-run` and `systemctl set-property` commands + +From a user's perspective, the third is a wrapper for the second. To implement a new unit +setting, it is necessary to support all three input methods: + +1. *unit* *files* are parsed in `src/core/load-fragment.c`, with many simple and fixed-type +unit settings being parsed by common helpers, with the definition in the generator file +`src/core/load-fragment-gperf.gperf.in` +2. D-Bus messages are defined and parsed in `src/core/dbus-*.c` +3. `systemd-run` and `systemctl set-property` do client-side parsing and translation into +D-Bus messages in `src/shared/bus-unit-util.c` + +So that they are exercised by the fuzzing CI, new unit settings should also be listed in the +text files under `test/fuzz/fuzz-unit-file/`. + +### systemd-udev + +Sources for the udev daemon and command-line tool (single binary) can be found under +`src/udev/`. + +### Unit Tests + +Source files found under `src/test/` implement unit-level testing, mostly for +modules found in `src/basic/` and `src/shared/`, but not exclusively. Each test +file is compiled in a standalone binary that can be run to exercise the +corresponding module. While most of the tests can be run by any user, some +require privileges, and will attempt to clearly log about what they need +(mostly in the form of effective capabilities). These tests are self-contained, +and generally safe to run on the host without side effects. + +Ideally, every module in `src/basic/` and `src/shared/` should have a +corresponding unit test under `src/test/`, exercising every helper function. + +### Fuzzing + +Fuzzers are a type of unit tests that execute code on an externally-supplied +input sample. Fuzzers are called `fuzz-*`. 
Fuzzers for `src/basic/` and +`src/shared` live under `src/fuzz/`, and those for other parts of the codebase +should be located next to the code they test. + +Files under `test/fuzz/` contain input data for fuzzers, one subdirectory for +each fuzzer. Some of the files are "seed corpora", i.e. files that contain +lists of settings and input values intended to generate initial coverage, and +other files are samples saved by the fuzzing engines when they find an issue. + +When adding new input samples under `test/fuzz/*/`, please use some +short-but-meaningful names. Names of meson tests include the input file name +and output looks awkward if they are too long. + +Fuzzers are invoked primarily in three ways: firstly, each fuzzer is compiled +as a normal executable and executed for each of the input samples under +`test/fuzz/` as part of the test suite. Secondly, fuzzers may be instrumented +with sanitizers and invoked as part of the test suite (if `-Dfuzz-tests=true` +is configured). Thirdly, fuzzers are executed through fuzzing engines that try +to find new "interesting" inputs through coverage feedback and massive +parallelization; see the links for oss-fuzz in [Code quality](CODE_QUALITY). +For testing and debugging, fuzzers can be executed as any other program, +including under `valgrind` or `gdb`. + +## Integration Tests + +Sources in `test/TEST-*` implement system-level testing for executables, +libraries and daemons that are shipped by the project. They require privileges +to run, and are not safe to execute directly on a host. By default they will +build an image and run the test under it via `qemu` or `systemd-nspawn`. + +Most of those tests should be able to run via `systemd-nspawn`, which is +orders-of-magnitude faster than `qemu`, but some tests require privileged +operations like using `dm-crypt` or `loopdev`. They are clearly marked if that +is the case. + +See `test/README.testsuite` for more specific details. 
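As an illustration of the first invocation mode described under Fuzzing above (each fuzzer built as a normal executable and run once per input sample), the loop could be sketched as follows. This is not the project's test harness: the function name `run_fuzzer_samples` is hypothetical, and the sketch assumes the fuzzer binary accepts the path of one sample file as its only argument.

```sh
# run_fuzzer_samples FUZZER SAMPLE_DIR
# Run the FUZZER executable once for every regular file in SAMPLE_DIR.
# Prints "FAIL <sample>" for each sample on which the fuzzer exits non-zero
# and returns 1 if any sample failed, 0 otherwise.
# Assumption: the fuzzer takes the sample path as its single argument.
run_fuzzer_samples() {
    fuzzer=$1
    sample_dir=$2
    failed=0
    for sample in "$sample_dir"/*; do
        [ -f "$sample" ] || continue          # skip subdirectories and empty globs
        if ! "$fuzzer" "$sample" >/dev/null 2>&1; then
            printf 'FAIL %s\n' "$sample"
            failed=1
        fi
    done
    return "$failed"
}

# Hypothetical usage (the build path is an assumption):
#   run_fuzzer_samples build/fuzz-unit-file test/fuzz/fuzz-unit-file
```

Because each sample is a separate process invocation, a crashing input can be replayed on its own under `valgrind` or `gdb`, as the text above suggests.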
+ +## hwdb + +Rules built in the static hardware database shipped by the project can be found +under `hwdb.d/`. Some of these files are updated automatically, some are filled +by contributors. + +## Documentation + +### systemd.io + +Markdown files found under `docs/` are automatically published on the +[systemd.io](https://systemd.io) website using GitHub Pages. A minimal unit test +to ensure the formatting doesn't have errors is included in the +`meson test -C build/ github-pages` run as part of the CI. + +### Man pages + +Manpages for binaries and libraries, and the D-Bus interfaces, can be found under +`man/` and should ideally be kept in sync with changes to the corresponding +binaries and libraries. + +### Translations + +Translation files for binaries and daemons, provided by volunteers, can be found +under `po/` in the usual format. They are kept up to date by contributors and by +automated tools. + +## System Configuration files and presets + +Presets (or templates from which they are generated) for various daemons and tools +can be found under various directories such as `factory/`, `modprobe.d/`, `network/`, +`presets/`, `rules.d/`, `shell-completion/`, `sysctl.d/`, `sysusers.d/`, `tmpfiles.d/`. + +## Utilities for Developers + +`tools/`, `coccinelle/`, `.github/`, `.semaphore/`, `.mkosi/` host various +utilities and scripts that are used by maintainers and developers. They are not +shipped or installed. + +# Service Manager Overview + +The Service Manager takes configuration in the form of unit files, credentials, +kernel command line options and D-Bus commands, and based on those manages the +system and spawns other processes. It runs in system mode as PID 1, and in user +mode with one instance per user session. + +When starting a unit requires forking a new process, configuration for the new +process will be serialized and passed over to the new process, created via a +posix_spawn() call. 
This is done in order to avoid excessive processing after +a fork() but before an exec(), which is against glibc's best practices and can +also result in a copy-on-write trap. The new process will start as the +`systemd-executor` binary, which will deserialize the configuration and apply +all the options (sandboxing, namespacing, cgroup, etc.) before exec'ing the +configured executable. + +``` + ┌──────┐posix_spawn() ┌───────────┐execve() ┌────────┐ + │ PID1 ├─────────────►│sd-executor├────────►│program │ + └──────┘ (memfd) └───────────┘ └────────┘ +``` diff --git a/docs/AUTOMATIC_BOOT_ASSESSMENT.md b/docs/AUTOMATIC_BOOT_ASSESSMENT.md new file mode 100644 index 0000000..2fbf86e --- /dev/null +++ b/docs/AUTOMATIC_BOOT_ASSESSMENT.md @@ -0,0 +1,219 @@ +--- +title: Automatic Boot Assessment +category: Booting +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Automatic Boot Assessment + +systemd provides support for automatically reverting back to the previous +version of the OS or kernel in case the system consistently fails to boot. The +[Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification/#boot-counting) +describes how to annotate boot loader entries with a counter that specifies how +many attempts should be made to boot it. This document describes how systemd +implements this scheme. + +The many different components involved in the implementation may be used +independently and in combination with other software to, for example, support +other boot loaders or take actions outside of the boot loader. + +Here's a brief overview of the complete set of components: + +* The + [`kernel-install(8)`](https://www.freedesktop.org/software/systemd/man/kernel-install.html) + script can optionally create boot loader entries that carry an initial boot + counter (the initial counter is configurable in `/etc/kernel/tries`). 
+ +* The + [`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html) + boot loader optionally maintains a per-boot-loader-entry counter described by + the [Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification/#boot-counting) + that is decreased by one on each attempt to boot the entry, prioritizing + entries that have non-zero counters over those which already reached a + counter of zero when choosing the entry to boot. + +* The `boot-complete.target` target unit (see + [`systemd.special(7)`](https://www.freedesktop.org/software/systemd/man/systemd.special.html)) + serves as a generic extension point both for units that are necessary to + consider a boot successful (e.g. `systemd-boot-check-no-failures.service` + described below), and units that want to act only if the boot is + successful (e.g. `systemd-bless-boot.service` described below). + +* The + [`systemd-boot-check-no-failures.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-boot-check-no-failures.service.html) + service is a simple service health check tool. When enabled it becomes an + indirect dependency of `systemd-bless-boot.service` (by means of + `boot-complete.target`, see below), ensuring that the boot will not be + considered successful if there are any failed services. + +* The + [`systemd-bless-boot.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-bless-boot.service.html) + service automatically marks a boot loader entry, for which boot counting as + mentioned above is enabled, as "good" when a boot has been determined to be + successful, thus turning off boot counting for it. + +* The + [`systemd-bless-boot-generator(8)`](https://www.freedesktop.org/software/systemd/man/systemd-bless-boot-generator.html) + generator automatically pulls in `systemd-bless-boot.service` when use of + `systemd-boot` with boot counting enabled is detected. 
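The counter bookkeeping shared by these components can be sketched in a few lines of shell. In the Boot Loader Specification scheme the counter sits in the entry file name as `+LEFT` or `+LEFT-DONE` right before the suffix. The helper names `split_counter`, `next_entry_name` and `blessed_entry_name` are hypothetical, the sketch assumes plain decimal counters without leading zeros, and it only computes names; it is not how `systemd-boot` or `systemd-bless-boot` are implemented.

```sh
# split_counter NAME: parse "base+LEFT[-DONE].suffix" into the global
# variables base, suffix, left and tries_done (POSIX sh has no locals).
# Returns 1 if NAME has no suffix or no well-formed counter.
split_counter() {
    case "$1" in *.*) ;; *) return 1 ;; esac
    suffix=.${1##*.}                 # e.g. ".conf" or ".efi"
    stem=${1%.*}
    case "$stem" in *+*) ;; *) return 1 ;; esac
    base=${stem%+*}
    counter=${stem##*+}
    case "$counter" in
        *-*) left=${counter%%-*}; tries_done=${counter#*-} ;;
        *)   left=$counter; tries_done=0 ;;   # missing second number means zero
    esac
    case "$left" in ''|*[!0-9]*) return 1 ;; esac
    case "$tries_done" in ''|*[!0-9]*) return 1 ;; esac
}

# next_entry_name NAME: print the name an entry would get before a boot
# attempt (tries left decremented, tries done incremented). Names without
# a counter are printed unchanged. Returns 1, printing nothing, when the
# tries-left counter is already zero (entry considered "bad").
next_entry_name() {
    split_counter "$1" || { printf '%s\n' "$1"; return 0; }
    [ "$left" -gt 0 ] || return 1
    printf '%s+%s-%s%s\n' "$base" "$((left - 1))" "$((tries_done + 1))" "$suffix"
}

# blessed_entry_name NAME: print the name with the counter removed, which
# is the "good" state with boot counting turned off.
blessed_entry_name() {
    if split_counter "$1"; then
        printf '%s%s\n' "$base" "$suffix"
    else
        printf '%s\n' "$1"
    fi
}
```

For example, `next_entry_name 4.14.11-300.fc27.x86_64+3.conf` prints `4.14.11-300.fc27.x86_64+2-1.conf`, and `blessed_entry_name 4.14.11-300.fc27.x86_64+1-2.conf` prints `4.14.11-300.fc27.x86_64.conf`.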
+ +## Details + +As described in the +[Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification/#boot-counting), +the boot counting data is stored in the file name of the boot loader entries as +a plus (`+`), followed by a number, optionally followed by `-` and another +number, right before the file name suffix (`.conf` or `.efi`). + +The first number is the "tries left" counter encoding how many attempts to boot +this entry shall still be made. The second number is the "tries done" counter, +encoding how many failed attempts to boot it have already been made. Each time +a boot loader entry marked this way is booted the first counter is decremented, +and the second one incremented. (If the second counter is missing, then it is +assumed to be equivalent to zero.) If the boot attempt completed successfully +the entry's counters are removed from the name (entry state "good"), thus +turning off boot counting for the future. + +## Walkthrough + +Here's an example walkthrough of how this all fits together. + +1. The user runs `echo 3 >/etc/kernel/tries` to enable boot counting. + +2. A new kernel is installed. `kernel-install` is used to generate a new boot + loader entry file for it. Let's say the version string for the new kernel is + `4.14.11-300.fc27.x86_64`, a new boot loader entry + `/boot/loader/entries/4.14.11-300.fc27.x86_64+3.conf` is hence created. + +3. The system is booted for the first time after the new kernel has been + installed. The boot loader now sees the `+3` counter in the entry file + name. It hence renames the file to `4.14.11-300.fc27.x86_64+2-1.conf` + indicating that at this point one attempt has started. + After the rename completed, the entry is booted as usual. + +4. Let's say this attempt to boot fails. On the following boot the boot loader + will hence see the `+2-1` tag in the name, and will hence rename the entry file to + `4.14.11-300.fc27.x86_64+1-2.conf`, and boot it. + +5. 
Let's say the boot fails again. On the subsequent boot the loader will hence + see the `+1-2` tag, and rename the file to + `4.14.11-300.fc27.x86_64+0-3.conf` and boot it. + +6. If this boot also fails, on the next boot the boot loader will see the tag + `+0-3`, i.e. the counter reached zero. At this point the entry will be + considered "bad", and ordered after all non-bad entries. The next newest + boot entry is now tried, i.e. the system automatically reverted to an + earlier version. + +The above describes the walkthrough when the selected boot entry continuously +fails. Let's have a look at an alternative ending to this walkthrough. In this +scenario the first 4 steps are the same as above: + +1. *as above* + +2. *as above* + +3. *as above* + +4. *as above* + +5. Let's say the second boot succeeds. The kernel initializes properly, systemd + is started and invokes all generators. + +6. One of the generators started is `systemd-bless-boot-generator` which + detects that boot counting is used. It hence pulls + `systemd-bless-boot.service` into the initial transaction. + +7. `systemd-bless-boot.service` is ordered after and `Requires=` the generic + `boot-complete.target` unit. This unit is hence also pulled into the initial + transaction. + +8. The `boot-complete.target` unit is ordered after and pulls in various units + that are required to succeed for the boot process to be considered + successful. One such unit is `systemd-boot-check-no-failures.service`. + +9. The graphical desktop environment installed on the machine starts a + service called `graphical-session-good.service`, which is also ordered before + `boot-complete.target`, that registers a D-Bus endpoint. + +10. `systemd-boot-check-no-failures.service` is run after all its own + dependencies completed, and assesses that the boot completed + successfully. It hence exits cleanly. + +11. `graphical-session-good.service` waits for a user to log in. 
In the user + desktop environment, one minute after the user has logged in and started the + first program, a user service is invoked which makes a D-Bus call to + `graphical-session-good.service`. Upon receiving that call, + `graphical-session-good.service` exits cleanly. + +12. This allows `boot-complete.target` to be reached. This signifies to the + system that this boot attempt shall be considered successful. + +13. Which in turn permits `systemd-bless-boot.service` to run. It now + determines which boot loader entry file was used to boot the system, and + renames it dropping the counter tag. Thus + `4.14.11-300.fc27.x86_64+1-2.conf` is renamed to + `4.14.11-300.fc27.x86_64.conf`. From this moment boot counting is turned + off for this entry. + +14. On the following boot (and all subsequent boots after that) the entry is + now seen with boot counting turned off, no further renaming takes place. + +## How to adapt this scheme to other setups + +Of the stack described above many components may be replaced or augmented. Here +are a couple of recommendations. + +1. To support alternative boot loaders in place of `systemd-boot` two scenarios + are recommended: + + a. Boot loaders already implementing the Boot Loader Specification can + simply implement the same rename logic, and thus integrate fully with + the rest of the stack. + + b. Boot loaders that want to implement boot counting and store the counters + elsewhere can provide their own replacements for + `systemd-bless-boot.service` and `systemd-bless-boot-generator`, but should + continue to use `boot-complete.target` and thus support any services + ordered before that. + +2. 
To support additional components that shall succeed before the boot is + considered successful, simply place them in units (if they aren't already) + and order them before the generic `boot-complete.target` target unit, + combined with `Requires=` dependencies from the target, so that the target + cannot be reached when any of the units fail. You may add any number of + units like this, and only if they all succeed the boot entry is marked as + good. Note that the target unit shall pull in these boot checking units, not + the other way around. + + Depending on the setup, it may be most convenient to pull in such units + through normal enablement symlinks, or during early boot using a + [`generator`](https://www.freedesktop.org/software/systemd/man/systemd.generator.html), + or even during later boot. In the last case, care must be taken to ensure + that the start job is created before `boot-complete.target` has been + reached. + +3. To support additional components that shall only run on boot success, simply + wrap them in a unit and order them after `boot-complete.target`, pulling it + in. + + Such unit would be typically wanted (or required) by one of the + [`bootup`](https://www.freedesktop.org/software/systemd/man/bootup.html) targets, + for example, `multi-user.target`. To avoid potential loops due to conflicting + [default dependencies](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#Default%20Dependencies) + ordering, it is recommended to also add an explicit dependency (e.g. + `After=multi-user.target`) to the unit. This overrides the implicit ordering + and allows `boot-complete.target` to start after the given bootup target. + +## FAQ + +1. *I have a service which — when it fails — should immediately cause a + reboot. How does that fit in with the above?* — That's orthogonal to + the above, please use `FailureAction=` in the unit file for this. + +2. 
*Under some condition I want to mark the current boot loader entry as bad + right-away, so that it never is tried again, how do I do that?* — You may + invoke `/usr/lib/systemd/systemd-bless-boot bad` at any time to mark the + current boot loader entry as "bad" right-away so that it isn't tried again + on later boots. diff --git a/docs/AUTOPKGTEST.md b/docs/AUTOPKGTEST.md new file mode 100644 index 0000000..393b74e --- /dev/null +++ b/docs/AUTOPKGTEST.md @@ -0,0 +1,92 @@ +--- +title: Autopkgtest - Defining tests for Debian packages +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Test description + +Full system integration/acceptance testing is done through [autopkgtests](https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/doc/README.package-tests.rst). These test the actual installed binary distribution packages. They are run in QEMU or containers and thus can do intrusive and destructive things such as installing arbitrary packages, modifying arbitrary files in the system (including grub boot parameters), rebooting, or loading kernel modules. + +The tests for systemd are defined in the [Debian package's debian/tests](https://salsa.debian.org/systemd-team/systemd/tree/master/debian/tests) directory. For validating a pull request, the Debian package is built using the unpatched code from that PR (via the [checkout-upstream](https://salsa.debian.org/systemd-team/systemd/blob/master/debian/extra/checkout-upstream) script), and the tests run against these built packages. Note that some tests which check Debian specific behaviour are skipped in "test upstream" mode. 
+ +# Infrastructure + +systemd's GitHub project has webhooks that trigger autopkgtests on Ubuntu 18.04 LTS on four architectures: + +* i386: 32 bit x86, little endian, QEMU (OpenStack cloud instance) +* amd64: 64 bit x86, little endian, QEMU (OpenStack cloud instance) +* arm64: 64 bit ARM, little endian, QEMU (OpenStack cloud instance) +* s390x: 64 bit IBM z/Series, big endian, LXC (this architecture is not yet available in Canonical's OpenStack and thus skips some tests) + +Please see the [Ubuntu CI infrastructure](https://wiki.ubuntu.com/ProposedMigration/AutopkgtestInfrastructure) documentation for details about how this works. + +# Manually retrying/triggering tests on the infrastructure + +The current tests are fairly solid by now, but on rare occasions they fail due to infrastructure/network issues or race conditions. If you encounter these, please notify @iainlane in the GitHub PR for debugging/fixing those -- transient infrastructure issues are supposed to be detected automatically, and tests auto-retry on those; and flaky tests should of course be fixed properly. But sometimes it is useful to trigger tests on a different Ubuntu release too, for example to test a PR on a newer kernel or against current build/binary dependencies (cgroup changes, util-linux, gcc, etc.). + +This can be done using the generic [retry-github-test](https://git.launchpad.net/autopkgtest-cloud/tree/charms/focal/autopkgtest-cloud-worker/autopkgtest-cloud/tools/retry-github-test) script from [Ubuntu's autopkgtest infrastructure](https://git.launchpad.net/autopkgtest-cloud): you need the parameterized URL from the [configured webhooks](https://github.com/systemd/systemd/settings/hooks) and the shared secret (Ubuntu's CI needs to restrict access to avoid DoSing and misuse). + +You can use Martin Pitt's [retry-gh-systemd-test](https://piware.de/gitweb/?p=bin.git;a=blob;f=retry-gh-systemd-test) shell wrapper around retry-github-test for that. 
You need to adjust the path where you put retry-github-test and the file with the shared secret, then you can call it like this: + +```sh +$ retry-gh-systemd-test <#PR> <architecture> [release] +``` + +where `release` defaults to `bionic` (aka Ubuntu 18.04 LTS). For example: + +```sh +$ retry-gh-systemd-test 1234 amd64 +$ retry-gh-systemd-test 2345 s390x cosmic +``` + +Please make sure to not trigger unknown [releases](https://launchpad.net/ubuntu/+series) or architectures as they will cause a pending test on the PR which never gets finished. + +# Test the code from the PR locally + +As soon as a test on the infrastructure finishes, the "Details" link in the PR "checks" section will point to the `log.gz` log. You can download the individual test log, built .debs, and other artifacts that tests leave behind (some dump a complete journal or the udev database on failure) by replacing `/log.gz` with `/artifacts.tar.gz` in that URL. You can then unpack the tarball and use `sudo dpkg -iO binaries/*.deb` to install the debs from the PR into an Ubuntu VM of the same release/architecture for manually testing a PR. + +# Run autopkgtests locally + +Preparations: + +* Get autopkgtest: + ```sh + git clone https://salsa.debian.org/ci-team/autopkgtest.git + ``` + +* Install necessary dependencies; on Debian/Ubuntu you can simply run `sudo apt install autopkgtest` (instead of the above cloning), on Fedora do `yum install qemu-kvm dpkg-perl` + +* Build a test image based on Ubuntu cloud images for the desired release/arch: + ```sh + autopkgtest/tools/autopkgtest-buildvm-ubuntu-cloud -r bionic -a amd64 + ``` + + This will build `autopkgtest-bionic-amd64.img`. The image is normally used through the `autopkgtest` command (see below), but you can also boot it directly in QEMU (using `-snapshot` is highly recommended) to interactively poke around; this provides an easy throw-away test environment. 
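Booting the image by hand with `-snapshot`, as suggested above, might look like the following sketch. The wrapper name `boot_throwaway_vm`, the `QEMU` override variable, the memory size, and the use of KVM and a virtio drive are all assumptions chosen for illustration; adjust them to the host.

```sh
# boot_throwaway_vm IMAGE
# Boot IMAGE in QEMU with -snapshot, so that all disk writes go to a
# temporary overlay and the image file itself is left unmodified.
# Assumptions: an x86_64 host with KVM available, 2 GiB of guest memory.
# Set QEMU to use a different emulator binary.
boot_throwaway_vm() {
    image=$1
    "${QEMU:-qemu-system-x86_64}" -enable-kvm -m 2048 -snapshot \
        -drive "file=$image,if=virtio"
}

# Usage:
#   boot_throwaway_vm autopkgtest-bionic-amd64.img
```

Since nothing is written back to the image, the same file can be reused afterwards for regular `autopkgtest` runs.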
+ + +The most basic mode of operation is to run the tests for the current distro packages: + +```sh +autopkgtest/runner/autopkgtest systemd -- qemu autopkgtest-bionic-amd64.img +``` + +But autopkgtest allows lots of [different modes](https://salsa.debian.org/ci-team/autopkgtest/-/blob/master/doc/README.running-tests.rst) and [options](http://manpages.ubuntu.com/autopkgtest), like running a shell on failure (`-s`), running a single test only (`--test-name`), running the tests from a local checkout of the Debian source tree (possibly with modifications to the test) instead of from the distribution source, or running QEMU with more than one CPU (check the [autopkgtest-virt-qemu manpage](http://manpages.ubuntu.com/autopkgtest-virt-qemu)). + +A common use case is to check out the Debian packaging git for getting/modifying the tests locally: + +```sh +git clone https://salsa.debian.org/systemd-team/systemd.git /tmp/systemd-debian/ +``` + +and running these against the binaries from a PR (see above), running only the `logind` test, getting a shell on failure, showing the boot output, and running with 2 CPUs: + +```sh +autopkgtest/runner/autopkgtest --test-name logind /tmp/binaries/*.deb /tmp/systemd-debian/ -s -- \ + qemu --show-boot --cpus 2 /srv/vm/autopkgtest-bionic-amd64.img +``` + +# Contact + +For trouble with the infrastructure, please notify [iainlane](https://github.com/iainlane) in the affected PR. diff --git a/docs/BACKPORTS.md b/docs/BACKPORTS.md new file mode 100644 index 0000000..6fbb57d --- /dev/null +++ b/docs/BACKPORTS.md @@ -0,0 +1,25 @@ +--- +title: Backports +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Backports + +The upstream systemd git repo at [https://github.com/systemd/systemd](https://github.com/systemd/systemd) only contains the main systemd branch that progresses at a quick pace, continuously bringing both bugfixes and new features. 
Distributions usually prefer basing their releases on stabilized versions branched off from this, which receive the bugfixes but not the features. + +## Stable Branch Repository + +Stable branches are available from [https://github.com/systemd/systemd-stable](https://github.com/systemd/systemd-stable). + +Stable branches are started for certain releases of systemd and named after them, e.g. v208-stable. Stable branches are typically managed by distribution maintainers on an as-needed basis. For example v208 has been chosen for stable as several distributions are shipping this version and the official/upstream cycle of v208-v209 was a long one due to kdbus work. If you are using a particular version and find yourself backporting several patches, you may consider pushing a stable branch here for that version so others can benefit. Please contact us if you are interested. + +The following types of commits are cherry-picked onto those branches: + +* bugfixes +* documentation updates, when relevant to this version +* hardware database additions, especially the keymap updates +* small non-conflicting features deemed safe to add in a stable release + +Please try to ensure that anything backported to the stable repository is done with the `git cherry-pick -x` option such that text stating the original SHA1 is added into the commit message. This makes it easier to check where the code came from (as sometimes it is necessary to add small fixes as new code due to the upstream refactors that are deemed too invasive to backport as a stable patch). 
diff --git a/docs/BLOCK_DEVICE_LOCKING.md b/docs/BLOCK_DEVICE_LOCKING.md new file mode 100644 index 0000000..a6e3374 --- /dev/null +++ b/docs/BLOCK_DEVICE_LOCKING.md @@ -0,0 +1,243 @@ +--- +title: Locking Block Device Access +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Locking Block Device Access + +*TL;DR: Use BSD file locks +[(`flock(2)`)](https://man7.org/linux/man-pages/man2/flock.2.html) on block +device nodes to synchronize access for partitioning and file system formatting +tools.* + +`systemd-udevd` probes all block devices showing up for file system superblock +and partition table information (utilizing `libblkid`). If another program +concurrently modifies a superblock or partition table this probing might be +affected, which is bad in itself, but also might in turn result in undesired +effects in programs subscribing to `udev` events. + +Applications manipulating a block device can temporarily stop `systemd-udevd` +from processing rules on it — and thus bar it from probing the device — by +taking a BSD file lock on the block device node. Specifically, whenever +`systemd-udevd` starts processing a block device it takes a `LOCK_SH|LOCK_NB` +lock using [`flock(2)`](https://man7.org/linux/man-pages/man2/flock.2.html) on +the main block device (i.e. never on any partition block device, but on the +device the partition belongs to). If this lock cannot be taken (i.e. `flock()` +returns `EAGAIN`), it refrains from processing the device. If it manages to take +the lock it is kept for the entire time the device is processed. + +Note that `systemd-udevd` also watches all block device nodes it manages for +`inotify()` `IN_CLOSE_WRITE` events: whenever such an event is seen, this is +used as trigger to re-run the rule-set for the device. 
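As a rough illustration of the probing rule described above (take `LOCK_SH|LOCK_NB`, and skip the device if `flock()` fails with `EAGAIN`), the following C sketch tries a shared, non-blocking BSD lock on an already opened file descriptor. The helper name `try_shared_lock` is hypothetical, and this is not the code `systemd-udevd` actually runs; it only relies on the documented `flock(2)` behaviour.

```c
#include <assert.h>
#include <errno.h>
#include <sys/file.h>

/* Hypothetical helper (illustration only, not systemd's implementation).
 *
 * Tries to take a shared, non-blocking BSD lock on an open file descriptor.
 * Returns 1 if the lock was taken (the caller may process the device and
 * must later close() the fd or call flock(fd, LOCK_UN)), 0 if a conflicting
 * exclusive lock is held through another open file description, or a
 * negative errno value on any other error. */
int try_shared_lock(int fd) {
        assert(fd >= 0);

        if (flock(fd, LOCK_SH|LOCK_NB) >= 0)
                return 1;

        /* flock(2) reports contention as EWOULDBLOCK, which has the same
         * value as EAGAIN on Linux. */
        if (errno == EWOULDBLOCK)
                return 0;

        return -errno;
}
```

A caller that gets `0` back would simply leave the device alone, relying on the later `IN_CLOSE_WRITE` event to come back to it.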
+ +These two concepts allow tools such as disk partitioners or file system +formatting tools to safely and easily take exclusive ownership of a block +device while operating: before starting work on the block device, they should +take a `LOCK_EX` lock on it. This has two effects: first of all, in case +`systemd-udevd` is still processing the device the tool will wait for it to +finish. Second, after the lock is taken, it can be sure that `systemd-udevd` +will refrain from processing the block device, and thus all other client +applications subscribed to it won't get device notifications from potentially +half-written data either. After the operation is complete the +partitioner/formatter can simply close the device node. This has two effects: +it implicitly releases the lock, so that `systemd-udevd` can process events on +the device node again. Secondly, it results in an `IN_CLOSE_WRITE` event, which +causes `systemd-udevd` to immediately re-process the device (seeing all +changes the tool made) and notify subscribed clients about it. + +Ideally, `systemd-udevd` would explicitly watch block devices for `LOCK_EX` +locks being released. Such monitoring is not supported on Linux however, which +is why it watches for `IN_CLOSE_WRITE` instead, i.e. for `close()` calls to +writable file descriptors referring to the block device. In almost all cases, +the difference between these two events does not matter much, as any locks +taken are implicitly released by `close()`. However, it should be noted that if +an application unlocks a device after completing its work without closing it, +i.e. while keeping the file descriptor open for a further, longer time, then +`systemd-udevd` will not notice this and not retrigger and thus reprobe the +device. + +Besides synchronizing block device access between `systemd-udevd` and such +tools this scheme may also be used to synchronize access between those tools +themselves. However, do note that `flock()` locks are advisory only. 
This means +if one tool honours this scheme and another tool does not, they will of course +not be synchronized properly, and might interfere with each other's work. + +Note that the file locks follow the usual access semantics of BSD locks: since +`systemd-udevd` never writes to such block devices it only takes a `LOCK_SH` +*shared* lock. A program intending to make changes to the block device should +take a `LOCK_EX` *exclusive* lock instead. For further details, see the +`flock(2)` man page. + +And please keep in mind: BSD file locks (`flock()`) and POSIX file locks +(`lockf()`, `F_SETLK`, …) are different concepts, and in their effect +orthogonal. The scheme discussed above uses the former and not the latter, +because these types of locks more closely match the required semantics. + +If multiple devices are to be locked at the same time (for example in order to +format a RAID file system), the devices should be locked in the order of the +device nodes' major numbers (primary ordering key, ascending) and minor +numbers (secondary ordering key, ditto), in order to avoid ABBA locking issues +between subsystems. + +Note that the locks should only be taken while the device is repartitioned, +file systems formatted or `dd`'ed in, and similar cases that +apply/remove/change superblocks/partition information. They should not be held +during normal operation, i.e. while file systems on it are mounted for +application use. + +The [`udevadm +lock`](https://www.freedesktop.org/software/systemd/man/udevadm.html) command +is provided to lock block devices following this scheme from the command line, +for the use in scripts and similar. (Note though that it's typically preferable +to use native support for block device locking in tools where that's +available.) 
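The ordering rule for locking several devices at once (major number ascending, then minor number ascending) can be sketched in C as below. The function names `devnum_compare` and `sort_lock_order` are made up for this illustration, and the sketch assumes the caller has already collected the device numbers, for example from the `st_rdev` field returned by `stat(2)` on each device node.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical comparator: orders device numbers by major number first
 * (ascending), then by minor number (ascending). */
int devnum_compare(const void *a, const void *b) {
        dev_t x = *(const dev_t *) a, y = *(const dev_t *) b;

        if (major(x) != major(y))
                return major(x) < major(y) ? -1 : 1;
        if (minor(x) != minor(y))
                return minor(x) < minor(y) ? -1 : 1;
        return 0;
}

/* Hypothetical helper: sorts an array of device numbers into the order in
 * which the corresponding device nodes should be locked. */
void sort_lock_order(dev_t *devs, size_t n) {
        assert(devs || n == 0);

        if (n > 1)
                qsort(devs, n, sizeof(dev_t), devnum_compare);
}
```

A tool would then open and `flock()` the device nodes in the sorted order, so that two tools locking overlapping sets of devices always acquire them in the same sequence.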
+ +Summarizing: it is recommended to take `LOCK_EX` BSD file locks when +manipulating block devices in all tools that change file system block devices +(`mkfs`, `fsck`, …) or partition tables (`fdisk`, `parted`, …), right after +opening the node. + +# Example of Locking The Whole Disk + +The following example leverages the `libsystemd` infrastructure to find the whole disk of a block device and take a BSD lock on it. + +## Compile and Execute +**Note that this example requires `libsystemd` version 251 or newer.** + +Place the code in a source file, e.g. `take_BSD_lock.c` and run the following commands: +``` +$ gcc -o take_BSD_lock take_BSD_lock.c -lsystemd + +$ ./take_BSD_lock /dev/sda1 +Successfully took a BSD lock: /dev/sda + +$ flock -x /dev/sda ./take_BSD_lock /dev/sda1 +Failed to take a BSD lock on /dev/sda: Resource temporarily unavailable +``` + +## Code +```c +/* SPDX-License-Identifier: MIT-0 */ + +#include <errno.h> +#include <fcntl.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/file.h> +#include <systemd/sd-device.h> +#include <unistd.h> + +static inline void closep(int *fd) { + if (*fd >= 0) + close(*fd); +} + +/** + * lock_whole_disk_from_devname + * @devname: devname of a block device, e.g., /dev/sda or /dev/sda1 + * @open_flags: the flags to open the device, e.g., O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY + * @flock_operation: the operation to call flock, e.g., LOCK_EX|LOCK_NB + * + * given the devname of a block device, take a BSD lock of the whole disk + * + * Returns: negative errno value on error, or non-negative fd if the lock was taken successfully. 
+ **/ +int lock_whole_disk_from_devname(const char *devname, int open_flags, int flock_operation) { + __attribute__((cleanup(sd_device_unrefp))) sd_device *dev = NULL; + sd_device *whole_dev; + const char *whole_disk_devname, *subsystem, *devtype; + int r; + + // create a sd_device instance from devname + r = sd_device_new_from_devname(&dev, devname); + if (r < 0) { + errno = -r; + fprintf(stderr, "Failed to create sd_device: %m\n"); + return r; + } + + // if the subsystem of dev is block, but its devtype is not disk, find its parent + r = sd_device_get_subsystem(dev, &subsystem); + if (r < 0) { + errno = -r; + fprintf(stderr, "Failed to get the subsystem: %m\n"); + return r; + } + if (strcmp(subsystem, "block") != 0) { + fprintf(stderr, "%s is not a block device, refusing.\n", devname); + return -EINVAL; + } + + r = sd_device_get_devtype(dev, &devtype); + if (r < 0) { + errno = -r; + fprintf(stderr, "Failed to get the devtype: %m\n"); + return r; + } + if (strcmp(devtype, "disk") == 0) + whole_dev = dev; + else { + r = sd_device_get_parent_with_subsystem_devtype(dev, "block", "disk", &whole_dev); + if (r < 0) { + errno = -r; + fprintf(stderr, "Failed to get the parent device: %m\n"); + return r; + } + } + + // open the whole disk device node + __attribute__((cleanup(closep))) int fd = sd_device_open(whole_dev, open_flags); + if (fd < 0) { + errno = -fd; + fprintf(stderr, "Failed to open the device: %m\n"); + return fd; + } + + // get the whole disk devname + r = sd_device_get_devname(whole_dev, &whole_disk_devname); + if (r < 0) { + errno = -r; + fprintf(stderr, "Failed to get the whole disk name: %m\n"); + return r; + } + + // take a BSD lock of the whole disk device node + if (flock(fd, flock_operation) < 0) { + r = -errno; + fprintf(stderr, "Failed to take a BSD lock on %s: %m\n", whole_disk_devname); + return r; + } + + printf("Successfully took a BSD lock: %s\n", whole_disk_devname); + + // take the fd to avoid automatic cleanup + int ret_fd = fd; + fd = 
-EBADF; + return ret_fd; +} + +int main(int argc, char **argv) { + if (argc != 2) { + fprintf(stderr, "Invalid number of parameters.\n"); + return EXIT_FAILURE; + } + + // try to take an exclusive and nonblocking BSD lock + __attribute__((cleanup(closep))) int fd = + lock_whole_disk_from_devname( + argv[1], + O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY, + LOCK_EX|LOCK_NB); + + if (fd < 0) + return EXIT_FAILURE; + + /** + * The device is now locked until the return below. + * Now you can safely manipulate the block device. + **/ + + return EXIT_SUCCESS; +} +``` diff --git a/docs/BOOT.md b/docs/BOOT.md new file mode 100644 index 0000000..574cc08 --- /dev/null +++ b/docs/BOOT.md @@ -0,0 +1,111 @@ +--- +title: systemd-boot UEFI Boot Manager +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# systemd-boot UEFI Boot Manager + +systemd-boot is a UEFI boot manager which executes configured EFI images. The default entry is selected by a configured pattern (glob) or an on-screen menu. + +systemd-boot operates on the EFI System Partition (ESP) only. Configuration file fragments, kernels, initrds, other EFI images need to reside on the ESP. Linux kernels need to be built with CONFIG\_EFI\_STUB to be able to be directly executed as an EFI image. + +systemd-boot reads simple and entirely generic boot loader configuration files; one file per boot loader entry to select from. All files need to reside on the ESP. + +Pressing the Space key (or most other keys actually work too) during bootup will show an on-screen menu with all configured loader entries to select from. Pressing Enter on the selected entry loads and starts the EFI image. + +If no timeout is configured, which is the default setting, and no key pressed during bootup, the default entry is executed right away. + +![systemd-boot menu](/assets/systemd-boot-menu.png) + +All configuration files are expected to be 7-bit ASCII or valid UTF8. 
The loader configuration file understands the following keywords: + +| Config | +|---------|------------------------------------------------------------| +| default | pattern to select the default entry in the list of entries | +| timeout | timeout in seconds for how long to show the menu | + + +The entry configuration files understand the following keywords: + +| Entry | +|--------|------------------------------------------------------------| +| title | text to show in the menu | +| version | version string to append to the title when the title is not unique | +| machine-id | machine identifier to append to the title when the title is not unique | +| efi | executable EFI image | +| options | options to pass to the EFI image / kernel command line | +| linux | linux kernel image (systemd-boot still requires the kernel to have an EFI stub) | +| initrd | initramfs image (systemd-boot just adds this as option initrd=) | + + +Examples: +``` +/boot/loader/loader.conf +timeout 3 +default 6a9857a393724b7a981ebb5b8495b9ea-* + +/boot/loader/entries/6a9857a393724b7a981ebb5b8495b9ea-3.8.0-2.fc19.x86_64.conf +title Fedora 19 (Rawhide) +version 3.8.0-2.fc19.x86_64 +machine-id 6a9857a393724b7a981ebb5b8495b9ea +linux /6a9857a393724b7a981ebb5b8495b9ea/3.8.0-2.fc19.x86_64/linux +initrd /6a9857a393724b7a981ebb5b8495b9ea/3.8.0-2.fc19.x86_64/initrd +options root=UUID=f8f83f73-df71-445c-87f7-31f70263b83b quiet + +/boot/loader/entries/custom-kernel.conf +title My kernel +efi /bzImage +options root=PARTUUID=084917b7-8be2-4e86-838d-f771a9902e08 + +/boot/loader/entries/custom-kernel-initrd.conf +title My kernel with initrd +linux /bzImage +initrd /initrd.img +options root=PARTUUID=084917b7-8be2-4e86-838d-f771a9902e08 quiet +``` + + +While the menu is shown, the following keys are active: + +| Keys | +|--------|------------------------------------------------------------| +| Up/Down | select menu entry | +| Enter | boot the selected entry | +| d | select the default entry to boot (stored in 
a non-volatile EFI variable) | +| t/T | adjust the timeout (stored in a non-volatile EFI variable) | +| e | edit the option line (kernel command line) for this bootup to pass to the EFI image | +| Q | quit | +| v | show the systemd-boot and UEFI version | +| P | print the current configuration to the console | +| h | show key mapping | + +Hotkeys to select a specific entry in the menu, or when pressed during bootup to boot the entry right away: + + + +| Keys | +|--------|------------------------------------------------------------| +| l | Linux | +| w | Windows | +| a | OS X | +| s | EFI Shell | +| 1-9 | number of entry | + +Some EFI variables control the loader or export the loader's state to the started operating system. The vendor UUID `4a67b082-0a4c-41cf-b6c7-440b29bb8c4f` and the variable names are supposed to be shared across all loader implementations which follow this scheme of configuration: + +| EFI Variables | +|---------------|------------------------|-------------------------------| +| LoaderEntryDefault | entry identifier to select as default at bootup | non-volatile | +| LoaderConfigTimeout | timeout in seconds to show the menu | non-volatile | +| LoaderEntryOneShot | entry identifier to select at the next and only the next bootup | non-volatile | +| LoaderDeviceIdentifier | list of identifiers of the volume the loader was started from | volatile | +| LoaderDevicePartUUID | partition GPT UUID of the ESP systemd-boot was executed from | volatile | + + +Links: + +[https://github.com/systemd/systemd](https://github.com/systemd/systemd) + +[http://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/](http://www.freedesktop.org/wiki/Specifications/BootLoaderSpec/) diff --git a/docs/BOOT_LOADER_INTERFACE.md b/docs/BOOT_LOADER_INTERFACE.md new file mode 100644 index 0000000..a1f6b59 --- /dev/null +++ b/docs/BOOT_LOADER_INTERFACE.md @@ -0,0 +1,154 @@ +--- +title: Boot Loader Interface +category: Booting +layout: default +SPDX-License-Identifier: 
LGPL-2.1-or-later +--- + +# The Boot Loader Interface + +systemd can interface with the boot loader to receive performance data and +other information, and pass control information. This is only supported on EFI +systems. Data is transferred between the boot loader and systemd in EFI +variables. All EFI variables use the vendor UUID +`4a67b082-0a4c-41cf-b6c7-440b29bb8c4f`. + +* The EFI Variable `LoaderTimeInitUSec` contains the timestamp in microseconds + when the loader was initialized. This value is the time spent in the firmware + for initialization, it is formatted as numeric, NUL-terminated, decimal + string, in UTF-16. + +* The EFI Variable `LoaderTimeExecUSec` contains the timestamp in microseconds + when the loader finished its work and is about to execute the kernel. The + time spent in the loader is the difference between `LoaderTimeExecUSec` and + `LoaderTimeInitUSec`. This value is formatted the same way as + `LoaderTimeInitUSec`. + +* The EFI variable `LoaderDevicePartUUID` contains the partition GUID of the + ESP the boot loader was run from formatted as NUL-terminated UTF16 string, in + normal GUID syntax. + +* The EFI variable `LoaderConfigTimeout` contains the boot menu timeout + currently in use. It may be modified both by the boot loader and by the + host. The value should be formatted as numeric, NUL-terminated, decimal + string, in UTF-16. The time is specified in seconds. In addition some + non-numeric string values are also accepted. A value of `menu-force` + will disable the timeout and show the menu indefinitely. If set to `0` or + `menu-hidden` the default entry is booted immediately without showing a menu. + Unless a value of `menu-disabled` is set, the boot loader should provide a + way to interrupt this by for example listening for key presses for a brief + moment before booting. + +* Similarly, the EFI variable `LoaderConfigTimeoutOneShot` contains a boot menu + timeout for a single following boot. 
It is set by the OS in order to request + display of the boot menu on the following boot. When set overrides + `LoaderConfigTimeout`. It is removed automatically after being read by the + boot loader, to ensure it only takes effect a single time. This value is + formatted the same way as `LoaderConfigTimeout`. If set to `0` the boot menu + timeout is turned off, and the menu is shown indefinitely. + +* The EFI variable `LoaderEntries` may contain a series of boot loader entry + identifiers, one after the other, each individually NUL terminated. This may + be used to let the OS know which boot menu entries were discovered by the + boot loader. A boot loader entry identifier should be a short, non-empty + alphanumeric string (possibly containing `-`, too). The list should be in the + order the entries are shown on screen during boot. See below regarding a + recommended vocabulary for boot loader entry identifiers. + +* The EFI variable `LoaderEntryDefault` contains the default boot loader entry + to use. It contains a NUL-terminated boot loader entry identifier. + +* Similarly, the EFI variable `LoaderEntryOneShot` contains the default boot + loader entry to use for a single following boot. It is set by the OS in order + to request booting into a specific menu entry on the following boot. When set + overrides `LoaderEntryDefault`. It is removed automatically after being read + by the boot loader, to ensure it only takes effect a single time. This value + is formatted the same way as `LoaderEntryDefault`. + +* The EFI variable `LoaderEntrySelected` contains the boot loader entry + identifier that was booted. It is set by the boot loader and read by + the OS in order to identify which entry has been used for the current boot. + +* The EFI variable `LoaderFeatures` contains a 64-bit unsigned integer with a + number of flags bits that are set by the boot loader and passed to the OS and + indicate the features the boot loader supports. 
Specifically, the following + bits are defined: + + * `1 << 0` → The boot loader honours `LoaderConfigTimeout` when set. + * `1 << 1` → The boot loader honours `LoaderConfigTimeoutOneShot` when set. + * `1 << 2` → The boot loader honours `LoaderEntryDefault` when set. + * `1 << 3` → The boot loader honours `LoaderEntryOneShot` when set. + * `1 << 4` → The boot loader supports boot counting as described in [Automatic Boot Assessment](AUTOMATIC_BOOT_ASSESSMENT). + * `1 << 5` → The boot loader supports looking for boot menu entries in the Extended Boot Loader Partition. + * `1 << 6` → The boot loader supports passing a random seed to the OS. + * `1 << 13` → The boot loader honours `menu-disabled` option when set. + +* The EFI variable `LoaderSystemToken` contains binary random data, + persistently set by the OS installer. Boot loaders that support passing + random seeds to the OS should use this data and combine it with the random + seed file read from the ESP. By combining this random data with the random + seed read off the disk before generating a seed to pass to the OS and a new + seed to store in the ESP the boot loader can protect itself from situations + where "golden" OS images that include a random seed are replicated and used + on multiple systems. Since the EFI variable storage is usually independent + (i.e. in physical NVRAM) of the ESP file system storage, and only the latter + is part of "golden" OS images, this ensures that different systems still come + up with different random seeds. Note that the `LoaderSystemToken` is + generally only written once, by the OS installer, and is usually not touched + after that. + +If `LoaderTimeInitUSec` and `LoaderTimeExecUSec` are set, `systemd-analyze` +will include them in its boot-time analysis. If `LoaderDevicePartUUID` is set, +systemd will mount the ESP that was used for the boot to `/boot`, but only if +that directory is empty, and only if no other file systems are mounted +there. 
The `systemctl reboot --boot-loader-entry=…` and `systemctl reboot +--boot-loader-menu=…` commands rely on the `LoaderFeatures` , +`LoaderConfigTimeoutOneShot`, `LoaderEntries`, `LoaderEntryOneShot` +variables. + +## Boot Loader Entry Identifiers + +While boot loader entries may be named relatively freely, it's highly +recommended to follow the following rules when picking identifiers for the +entries, so that programs (and users) can derive basic context and meaning from +the identifiers as passed in `LoaderEntries`, `LoaderEntryDefault`, +`LoaderEntryOneShot`, `LoaderEntrySelected`, and possibly show nicely localized +names for them in UIs. + +1. When boot loader entries are defined through the + [Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification/) + files, the identifier should be derived directly from the file name, + but with the `.conf` (Type #1 snippets) or `.efi` (Type #2 images) + suffix removed. + +2. Entries automatically discovered by the boot loader (as opposed to being + configured in configuration files) should generally have an identifier + prefixed with `auto-`. + +3. Boot menu entries referring to Microsoft Windows installations should either + use the identifier `windows` or use the `windows-` prefix for the + identifier. If a menu entry is automatically discovered, it should be + prefixed with `auto-`, see above (Example: this means an automatically + discovered Windows installation might have the identifier `auto-windows` or + `auto-windows-10` or so.). + +4. Similarly, boot menu entries referring to Apple macOS installations should + use the identifier `osx` or one that is prefixed with `osx-`. If such an + entry is automatically discovered by the boot loader use `auto-osx` as + identifier, or `auto-osx-` as prefix for the identifier, see above. + +5. 
If a boot menu entry encapsulates the EFI shell program, it should use the + identifier `efi-shell` (or when automatically discovered: `auto-efi-shell`, + see above). + +6. If a boot menu entry encapsulates a reboot into EFI firmware setup feature, + it should use the identifier `reboot-to-firmware-setup` (or + `auto-reboot-to-firmware-setup` in case it is automatically discovered). + +## Links + +[Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification)<br> +[Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification)<br> +[`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html)<br> +[`bootctl(1)`](https://www.freedesktop.org/software/systemd/man/bootctl.html)<br> +[`systemd-gpt-auto-generator(8)`](https://www.freedesktop.org/software/systemd/man/systemd-gpt-auto-generator.html) diff --git a/docs/BOOT_LOADER_SPECIFICATION.md b/docs/BOOT_LOADER_SPECIFICATION.md new file mode 100644 index 0000000..33066b2 --- /dev/null +++ b/docs/BOOT_LOADER_SPECIFICATION.md @@ -0,0 +1 @@ +[This content has moved to the UAPI group website](https://uapi-group.org/specifications/specs/boot_loader_specification/) diff --git a/docs/BUILDING_IMAGES.md b/docs/BUILDING_IMAGES.md new file mode 100644 index 0000000..b11afa3 --- /dev/null +++ b/docs/BUILDING_IMAGES.md @@ -0,0 +1,275 @@ +--- +title: Safely Building Images +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Building Images Safely + +In many scenarios OS installations are shipped as pre-built images, that +require no further installation process beyond simple `dd`-ing the image to +disk and booting it up. When building such "golden" OS images for +`systemd`-based OSes a few points should be taken into account. 
+ +Most of the points described here are implemented by the +[`mkosi`](https://github.com/systemd/mkosi) OS image builder developed and +maintained by the systemd project. If you are using or working on another image +builder it's recommended to keep the following concepts and recommendations in +mind. + +## Resources to Reset + +Typically the same OS image shall be deployable in multiple instances, and each +instance should automatically acquire its own identifying credentials on first +boot. For that it's essential to: + +1. Remove the + [`/etc/machine-id`](https://www.freedesktop.org/software/systemd/man/machine-id.html) + file or write the string `uninitialized\n` into it. This file is supposed to + carry a 128-bit identifier unique to the system. Only when it is reset it + will be auto-generated on first boot and thus be truly unique. If this file + is not reset, and carries a valid ID every instance of the system will come + up with the same ID and that will likely lead to problems sooner or later, + as many network-visible identifiers are commonly derived from the machine + ID, for example, IPv6 addresses or transient MAC addresses. + +2. Remove the `/var/lib/systemd/random-seed` file (see + [`systemd-random-seed(8)`](https://www.freedesktop.org/software/systemd/man/systemd-random-seed.service.html)), + which is used to seed the kernel's random pool on boot. If this file is + shipped pre-initialized, every instance will seed its random pool with the + same random data that is included in the image, and thus possibly generate + random data that is more similar to other instances booted off the same + image than advisable. + +3. Remove the `/loader/random-seed` file (see + [`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html)) + from the UEFI System Partition (ESP), in case the `systemd-boot` boot loader + is used in the image. + +4. 
It might also make sense to remove + [`/etc/hostname`](https://www.freedesktop.org/software/systemd/man/hostname.html) + and + [`/etc/machine-info`](https://www.freedesktop.org/software/systemd/man/machine-info.html) + which carry additional identifying information about the OS image. + +5. Remove `/var/lib/systemd/credential.secret` which is used for protecting + service credentials, see + [`systemd.exec(5)`](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Credentials) + and + [`systemd-creds(1)`](https://www.freedesktop.org/software/systemd/man/systemd-creds.html) + for details. Note that by removing this file access to previously encrypted + credentials from this image is lost. The file is automatically generated if + a new credential is encrypted and the file does not exist yet. + +## Boot Menu Entry Identifiers + +The +[`kernel-install(8)`](https://www.freedesktop.org/software/systemd/man/kernel-install.html) +logic used to generate +[Boot Loader Specification Type #1](https://uapi-group.org/specifications/specs/boot_loader_specification/#type-1-boot-loader-specification-entries) +entries by default uses the machine ID as stored in `/etc/machine-id` for +naming boot menu entries and the directories in the ESP to place kernel images +in. This is done in order to allow multiple installations of the same OS on the +same system without conflicts. However, this is problematic if the machine ID +shall be generated automatically on first boot: if the ID is not known before +the first boot it cannot be used to name the most basic resources required for +the boot process to complete. + +Thus, for images that shall acquire their identity on first boot only, it is +required to use a different identifier for naming boot menu entries. To allow +this the `kernel-install` logic knows the generalized *entry* *token* concept, +which can be a freely chosen string to use for identifying the boot menu +resources of the OS. 
If not configured explicitly it defaults to the machine +ID. The file `/etc/kernel/entry-token` may be used to configure this string +explicitly. Thus, golden image builders should write a suitable identifier into +this file, for example, the `IMAGE_ID=` or `ID=` field from +[`/etc/os-release`](https://www.freedesktop.org/software/systemd/man/os-release.html) +(also see below). It is recommended to do this before the `kernel-install` +functionality is invoked (i.e. before the package manager is used to install +packages into the OS tree being prepared), so that the selected string is +automatically used for all entries to be generated. + +## Booting with Empty `/var/` and/or Empty Root File System + +`systemd` is designed to be able to come up safely and robustly if the `/var/` +file system or even the entire root file system (with exception of `/usr/`, +i.e. the vendor OS resources) is empty (i.e. "unpopulated"). With this in mind +it's relatively easy to build images that only ship a `/usr/` tree, and +otherwise carry no other data, populating the rest of the directory hierarchy +on first boot as needed. + +Specifically, the following mechanisms are in place: + +1. The `switch-root` logic in systemd, that is used to switch from the initrd + phase to the host will create the basic OS hierarchy skeleton if missing. It + will create a couple of directories strictly necessary to boot up + successfully, plus essential symlinks (such as those necessary for the + dynamic loader `ld.so` to function). + +2. PID 1 will initialize `/etc/machine-id` automatically if not initialized yet + (see above). + +3. The + [`nss-systemd(8)`](https://www.freedesktop.org/software/systemd/man/nss-systemd.html) + glibc NSS module ensures the `root` and `nobody` users and groups remain + resolvable, even without `/etc/passwd` and `/etc/group` around. + +4. 
The + [`systemd-sysusers(8)`](https://www.freedesktop.org/software/systemd/man/systemd-sysusers.service.html) + component will automatically populate `/etc/passwd` and `/etc/group` on + first boot with further necessary system users. + +5. The + [`systemd-tmpfiles(8)`](https://www.freedesktop.org/software/systemd/man/systemd-tmpfiles-setup.service.html) + component ensures that various files and directories below `/etc/`, `/var/` + and other places are created automatically at boot if missing. Unlike the + directories/symlinks created by the `switch-root` logic above this logic is + extensible by packages, and can adjust access modes, file ownership and + more. Among others this will also link `/etc/os-release` → + `/usr/lib/os-release`, ensuring that the OS release information is + unconditionally accessible through `/etc/os-release`. + +6. The + [`nss-myhostname(8)`](https://www.freedesktop.org/software/systemd/man/nss-myhostname.html) + glibc NSS module will ensure the local host name as well as `localhost` + remains resolvable, even without `/etc/hosts` around. + +With these mechanisms the hierarchies below `/var/` and `/etc/` can be safely +and robustly populated on first boot, so that the OS can safely boot up. Note +that some auxiliary packages are not prepared to operate correctly if their +configuration data in `/etc/` or their state directories in `/var/` are +missing. This can typically be addressed via `systemd-tmpfiles` lines that +ensure the required files and directories are created if missing. In particular, +configuration files that are necessary for operation can be automatically +copied or symlinked from the `/usr/share/factory/etc/` tree via the `C` or `L` +line types. That said, we recommend that all packages safely fall back to +internal defaults if their configuration is missing, making such additional +steps unnecessary.
+ +Note that while `systemd` itself explicitly supports booting up with entirely +unpopulated images (`/usr/` being the only required directory to be populated) +distributions might not be there yet: depending on your distribution further, +manual work might be required to make this scenario work. + +## Adapting OS Images to Storage + +Typically, if an image is `dd`-ed onto a target disk it will be minimal: +i.e. only consist of necessary vendor data, and lack "payload" data, that shall +be individual to the system, and dependent on host parameters. On first boot, +the OS should take possession of the backing storage as necessary, dynamically +using available space. Specifically: + +1. Additional partitions should be created, that make no sense to ship + pre-built in the image. For example, `/tmp/` or `/home/` partitions, or even + `/var/` or the root file system (see above). + +2. Additional partitions should be created that shall function as A/B + secondaries for partitions shipped in the original image. In other words: if + the `/usr/` file system shall be updated in an A/B fashion it typically + makes sense to ship the original A file system in the deployed image, but + create the B partition on first boot. + +3. Partitions covering only a part of the disk should be grown to the full + extent of the disk. + +4. File systems in uninitialized partitions should be formatted with a file + system of choice. + +5. File systems covering only a part of a partition should be grown to the full + extent of the partition. + +6. Partitions should be encrypted with cryptographic keys generated locally on + the machine the system is first booted on, ensuring these keys remain local + and are not shared with any other instance of the OS image. + +Or any combination of the above: i.e. first create a partition, then encrypt +it, then format it. + +`systemd` provides multiple tools to implement the above logic: + +1. 
The + [`systemd-repart(8)`](https://www.freedesktop.org/software/systemd/man/systemd-repart.service.html) + component may manipulate GPT partition tables automatically on boot, growing + partitions or adding in partitions taking the backing storage size into + account. It can also automatically encrypt the partitions it creates (and + even bind them to TPM2) and populate partitions from various sources. It + does this all in a robust fashion so that aborted invocations will not leave + incompletely set up partitions around. + +2. The + [`systemd-growfs@(8).service`](https://www.freedesktop.org/software/systemd/man/systemd-growfs.html) + tool can automatically grow a file system to the size of the partition it is + contained in. The `x-systemd.growfs` mount option in `/etc/fstab` is sufficient to + enable this logic for specific mounts. Alternatively, appropriately set up + partitions can set GPT partition flag 59 to request this behaviour, see the + [Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification) + for details. If the file system is already grown it executes no operation. + +3. Similarly, the `systemd-makefs@.service` and `systemd-makeswap@.service` + services can format file systems and swap spaces before first use, if they + carry no file system signature yet. The `x-systemd.makefs` mount option in + `/etc/fstab` may be used to request this functionality. + +## Provisioning Image Settings + +While a lot of work has gone into ensuring `systemd` systems can safely boot +with unpopulated `/etc/` trees, it sometimes is desirable to set a couple of +basic settings *after* `dd`-ing the image to disk, but *before* first boot. For +this the tool +[`systemd-firstboot(1)`](https://www.freedesktop.org/software/systemd/man/systemd-firstboot.html) +can be useful, with its `--image=` switch.
It may be used to set very basic +settings, such as the root password or hostname on an OS disk image or +installed block device. + +## Distinguishing First Boot + +For various purposes it's useful to be able to distinguish the first boot-up of +the system from later boot-ups (for example, to set up TPM hardware +specifically, or register a system somewhere). `systemd` provides mechanisms to +implement that. Specifically, the `ConditionFirstBoot=` and `AssertFirstBoot=` +settings may be used to conditionalize units to only run on first boot. See +[`systemd.unit(5)`](https://www.freedesktop.org/software/systemd/man/systemd.unit.html#ConditionFirstBoot=) +for details. + +A special target unit `first-boot-complete.target` may be used as milestone to +safely handle first boots where the system is powered off too early: if the +first boot process is aborted before this target is reached, the following boot +process will be considered a first boot, too. Once the target is reached, +subsequent boots will not be considered first boots anymore, even if the boot +process is aborted immediately after. Thus, services that must complete fully +before a system shall be considered fully past the first boot should be ordered +before this target unit. + +Whether a system will come up in first boot state or not is derived from the +initialization status of `/etc/machine-id`: if the file already carries a valid +ID the system is already past the first boot. If it is not initialized yet it +is still considered in the first boot state. For details see +[`machine-id(5)`](https://www.freedesktop.org/software/systemd/man/machine-id.html). + +## Image Metadata + +Typically, when operating with golden disk images it is useful to be able to +identify them and their version. For this the two fields `IMAGE_ID=` and +`IMAGE_VERSION=` have been defined in +[`os-release(5)`](https://www.freedesktop.org/software/systemd/man/os-release.html). 
These +fields may be accessed from unit files and similar via the `%M` and `%A` +specifiers. + +Depending on how the images are put together it might make sense to leave the +OS distribution's `os-release` file as is in `/usr/lib/os-release` but to +replace the usual `/etc/os-release` symlink with a regular file that extends +the distribution's file with one augmented with these two additional +fields. + +## Links + +[`machine-id(5)`](https://www.freedesktop.org/software/systemd/man/machine-id.html)<br> +[`systemd-random-seed(8)`](https://www.freedesktop.org/software/systemd/man/systemd-random-seed.service.html)<br> +[`os-release(5)`](https://www.freedesktop.org/software/systemd/man/os-release.html)<br> +[Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification)<br> +[Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification)<br> +[`mkosi`](https://github.com/systemd/mkosi)<br> +[`systemd-boot(7)`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html)<br> +[`systemd-repart(8)`](https://www.freedesktop.org/software/systemd/man/systemd-repart.service.html)<br> +[`systemd-growfs@(8).service`](https://www.freedesktop.org/software/systemd/man/systemd-growfs.html)<br> diff --git a/docs/CATALOG.md b/docs/CATALOG.md new file mode 100644 index 0000000..bcbf5b9 --- /dev/null +++ b/docs/CATALOG.md @@ -0,0 +1,67 @@ +--- +title: Journal Message Catalogs +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Journal Message Catalogs + +Starting with 196 systemd includes a message catalog system which allows augmentation on display of journal log messages with short explanation texts, keyed off the MESSAGE\_ID= field of the entry. Many important log messages generated by systemd itself have message catalog entries. External packages can easily provide catalog data for their own messages. 
+ +The message catalog has a number of purposes: + +* Provide the administrator, user or developer with further information about the issue at hand, beyond the actual message text +* Provide links to further documentation on the topic of the specific message +* Provide native language explanations for English language system messages +* Provide links for support forums, hotlines or other contacts + +## Format + +Message catalog source files are simple text files that follow an RFC822 inspired format. To get an understanding of the format [here's an example file](http://cgit.freedesktop.org/systemd/systemd/plain/catalog/systemd.catalog), which includes entries for many important messages systemd itself generates. On installation of a package that includes message catalogs all installed message catalog source files get compiled into a binary index, which is then used to look up catalog data. + +journalctl's `-x` command line parameter may be used to augment on display journal log messages with message catalog data when browsing. `journalctl --list-catalog` may be used to print a list of all known catalog entries. + +To register additional catalog entries, packages may drop (text) catalog files into /usr/lib/systemd/catalog/ with a suffix of .catalog. The files are not accessed directly when needed, but need to be built into a binary index file with `journalctl --update-catalog`. + +Here's an example how a single catalog entry looks like in the text source format. Multiple of these may be listed one after the other per catalog source file: + +``` +-- fc2e22bc6ee647b6b90729ab34a250b1 +Subject: Process @COREDUMP_PID@ (@COREDUMP_COMM@) dumped core +Defined-By: systemd +Support: http://lists.freedesktop.org/mailman/listinfo/systemd-devel +Documentation: man:core(5) +Documentation: http://www.freedesktop.org/wiki/Software/systemd/catalog/@MESSAGE_ID@ + +Process @COREDUMP_PID@ (@COREDUMP_COMM@) crashed and dumped core. 
+ +This usually indicates a programming error in the crashing program and +should be reported to its vendor as a bug. +``` + + +The text format of the .catalog files is as follows: + +* Simple, UTF-8 text files, with usual line breaks at 76 chars. URLs and suchlike where line-breaks are undesirable may use longer lines. As catalog files need to be usable on text consoles it is essential that the 76 char line break rule is otherwise followed for human readable text. +* Lines starting with `#` are ignored, and may be used for comments. +* The files consist of a series of entries. For each message ID (in combination with a locale) only a single entry may be defined. Every entry consists of: + * A separator line beginning with `-- `, followed by a hexadecimal message ID formatted as lower case ASCII string. Optionally, the message ID may be suffixed by a space and a locale identifier, such as `de` or `fr_FR`, if l10n is required. + * A series of entry headers, in RFC822-style but not supporting continuation lines. Some header fields may appear more than once per entry. The following header fields are currently known (but additional fields may be added later): + * Subject: A short, one-line human readable description of the message + * Defined-By: Who defined this message. Usually a package name or suchlike + * Support: A URI for getting further support. This can be a web URL or a telephone number in the tel:// namespace + * Documentation: URIs for further user, administrator or developer documentation on the log entry. URIs should be listed in order of relevance, the most relevant documentation first. + * An empty line + * The actual catalog entry payload, as human readable prose. Multiple paragraphs may be separated by empty lines. The prose should first describe the message and when it occurs, possibly followed by recommendations on how to deal with the message and (if it is an error message) correct the problem at hand.
This message text should be readable by users and administrators. Information for developers should be stored externally instead, and referenced via a `Documentation:` header field. +* When a catalog entry is printed on screen for a specific log entry simple variable replacements are applied. Journal field names enclosed in @ will be replaced by their values, if such a field is available in an entry. If such a field is not defined in an entry the enclosing @ will be dropped but the variable name is kept. See [systemd's own message catalog](http://cgit.freedesktop.org/systemd/systemd/plain/catalog/systemd.catalog) for a complete example of a catalog file. + +## Adding Message Catalog Support to Your Program + +Note that the message catalog is only available for messages generated with the MESSAGE\_ID= journal meta data field, as this is needed to find the right entry for a message. For more information on the MESSAGE\_ID= journal entry field see [systemd.journal-fields(7)](http://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html). + +To add message catalog entries for log messages your application generates, please follow these guidelines: + +* Use the [native Journal logging APIs](http://0pointer.de/blog/projects/journal-submit.html) to generate your messages, and define message IDs for all messages you want to add catalog entries for. You may use `journalctl --new-id128` to allocate new message IDs. +* Write a catalog entry file for your messages, ship it in your package, and install it to `/usr/lib/systemd/catalog/`. (If you package your software with RPM use `%_journalcatalogdir`.) +* Ensure that after installation of your application's RPM/DEB "`journalctl --update-catalog`" is executed, in order to update the binary catalog index. (If you package your software with RPM use the `%journal_catalog_update` macro to achieve that.)
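
The variable replacement rule from the format description can be illustrated with a small Python sketch. This is not how `journalctl` implements it; the function name `expand_catalog_text` and the regular expression (which assumes field names consist of upper-case letters, digits and underscores) are hypothetical and chosen only for illustration.

```python
import re

# Assumption: journal field names consist of upper-case letters, digits and
# underscores. The pattern is illustrative and not taken from systemd itself.
_FIELD_REF = re.compile(r"@([A-Z0-9_]+)@")


def expand_catalog_text(text, fields):
    """Replace @FIELD@ references using the journal fields of one log entry."""

    def substitute(match):
        name = match.group(1)
        # Known field: insert its value. Unknown field: drop the enclosing
        # "@" characters but keep the variable name.
        return fields.get(name, name)

    return _FIELD_REF.sub(substitute, text)


subject = "Process @COREDUMP_PID@ (@COREDUMP_COMM@) dumped core"
print(expand_catalog_text(subject, {"COREDUMP_PID": "4711"}))
```

With only `COREDUMP_PID` present in the entry, this prints `Process 4711 (COREDUMP_COMM) dumped core`: the known field is replaced by its value, and the unknown one keeps its name without the `@` markers.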
diff --git a/docs/CGROUP_DELEGATION.md b/docs/CGROUP_DELEGATION.md new file mode 100644 index 0000000..4210a75 --- /dev/null +++ b/docs/CGROUP_DELEGATION.md @@ -0,0 +1,502 @@ +--- +title: Control Group APIs and Delegation +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Control Group APIs and Delegation + +*Intended audience: hackers working on userspace subsystems that require direct +cgroup access, such as container managers and similar.* + +So you are wondering about resource management with systemd, you know Linux +control groups (cgroups) a bit and are trying to integrate your software with +what systemd has to offer there. Here's a bit of documentation about the +concepts and interfaces involved with this. + +What's described here has been part of systemd and documented since v205. +However, it has been updated and improved substantially, even +though the concepts stayed mostly the same. This is an attempt to provide more +comprehensive up-to-date information about all this, particularly in light of the +poor implementations of the systemd-facing components of current +container managers. + +Before you read on, please make sure you read the low-level kernel +documentation about the +[unified cgroup hierarchy](https://docs.kernel.org/admin-guide/cgroup-v2.html). +This document then adds in the higher-level view from systemd. + +This document augments the existing documentation we already have: + +* [The New Control Group Interfaces](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface) +* [Writing VM and Container Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers) + +These wiki documents are not as up to date as they should be, currently, but +the basic concepts still fully apply. You should read them too, if you do something +with cgroups and systemd, in particular as they shine more light on the various +D-Bus APIs provided.
(That said, sooner or later we should probably fold that +wiki documentation into this very document, too.) + +## Two Key Design Rules + +Much of the philosophy behind these concepts is based on a couple of basic +design ideas of cgroup v2 (which we however try to adapt as far as we can to +cgroup v1 too). Specifically two cgroup v2 rules are the most relevant: + +1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted +to have processes directly attached to a cgroup that also has child cgroups and +vice versa. A cgroup is either an inner node or a leaf node of the tree, and if +it's an inner node it may not contain processes directly, and if it's a leaf +node then it may not have child cgroups. (Note that there are some minor +exceptions to this rule, though. E.g. the root cgroup is special and allows +both processes and children — which is used in particular to maintain kernel +threads.) + +2. The **single-writer** rule: this means that each cgroup only has a single +writer, i.e. a single process managing it. It's OK if different cgroups have +different processes managing them. However, only a single process should own a +specific cgroup, and when it does that ownership is exclusive, and nothing else +should manipulate it at the same time. This rule ensures that various pieces of +software don't step on each other's toes constantly. + +These two rules have various effects. For example, one corollary of this is: if +your container manager creates and manages cgroups in the system's root cgroup +you violate rule #2, as the root cgroup is managed by systemd and hence off +limits to everybody else. + +Note that rule #1 is generally enforced by the kernel if cgroup v2 is used: as +soon as you add a process to a cgroup it is ensured the rule is not +violated. On cgroup v1 this rule didn't exist, and hence isn't enforced, even +though it's a good thing to follow it then too. 
Rule #2 is not enforced on +either cgroup v1 or cgroup v2 (this is UNIX after all, in the general case +root can do anything, modulo SELinux and friends), but if you ignore it you'll +be in constant pain as various pieces of software will fight over cgroup +ownership. + +Note that cgroup v1 is currently the most deployed implementation, even though +it's semantically broken in many ways, and in many cases doesn't actually do +what people think it does. cgroup v2 is where things are going, and most new +kernel features in this area are only added to cgroup v2, and not cgroup v1 +anymore. For example, cgroup v2 provides proper cgroup-empty notifications, has +support for all kinds of per-cgroup BPF magic, supports secure delegation of +cgroup trees to less privileged processes and so on, none of which are +available on cgroup v1. + +## Three Different Tree Setups 🌳 + +systemd supports three different modes for how cgroups are set up. Specifically: + +1. **Unified** — this is the simplest mode, and exposes a pure cgroup v2 +logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system +and all available controllers are exclusively exposed through it. + +2. **Legacy** — this is the traditional cgroup v1 mode. In this mode the +various controllers each get their own cgroup file system mounted to +`/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup +hierarchy for managing purposes as `/sys/fs/cgroup/systemd/`. + +3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set +up mostly like legacy, except that there's also an additional hierarchy +`/sys/fs/cgroup/unified/` that contains the cgroup v2 hierarchy. (Note that in +this mode the unified hierarchy won't have controllers attached, the +controllers are all mounted as separate hierarchies as in legacy mode, +i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroup v2 +functionality and not about resource management.)
In this mode compatibility +with cgroup v1 is retained while some cgroup v2 features are available +too. This mode is a stopgap. Don't bother with this too much unless you have +too much free time. + +To say this clearly, legacy and hybrid modes have no future. If you develop +software today and don't focus on the unified mode, then you are writing +software for yesterday, not tomorrow. These two modes are supported primarily for +compatibility reasons and will not receive new features. Sorry. + +Superficially, in legacy and hybrid modes it might appear that the parallel +cgroup hierarchies for each controller are orthogonal to each other. In +systemd they are not: the hierarchies of all controllers are always kept in +sync (at least mostly: sub-trees might be suppressed in certain hierarchies if +no controller usage is required for them). The fact that systemd keeps these +hierarchies in sync means that the legacy and hybrid hierarchies are +conceptually very close to the unified hierarchy. In particular this allows us +to talk of one specific cgroup and actually mean the same cgroup in all +available controller hierarchies. E.g. if we talk about the cgroup `/foo/bar/` +then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as +`/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on. +Note that in cgroup v2 the controller hierarchies aren't orthogonal, hence +thinking about them as orthogonal won't help you in the long run anyway. + +If you wonder how to detect which of these three modes is currently used, use +`statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its +`.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then +you are either in legacy or hybrid mode. To distinguish these two cases, run +`statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports +`CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise you are in legacy mode.
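
The `statfs()` based detection can be sketched in Python roughly as follows. This is a minimal sketch, not systemd's own implementation: the helper names `fs_magic`, `classify_cgroup_mode` and `detect_cgroup_mode` are hypothetical, the `"unknown"` result is an addition for file systems the text does not cover, and reading `f_type` through `ctypes` assumes it is the first member of `struct statfs` with the width of a C `long`, as on common Linux/glibc targets.

```python
import ctypes
import os

# Magic numbers as defined in the kernel's <linux/magic.h>.
CGROUP2_SUPER_MAGIC = 0x63677270
TMPFS_MAGIC = 0x01021994


def fs_magic(path):
    """Return the f_type that statfs(2) reports for path, or None on failure."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Assumption: f_type is the first member of struct statfs and is as wide
    # as a C long. The buffer is deliberately larger than the structure.
    buf = ctypes.create_string_buffer(512)
    if libc.statfs(os.fsencode(path), buf) != 0:
        return None
    return ctypes.c_long.from_buffer(buf).value


def classify_cgroup_mode(root_magic, unified_magic):
    """Map the two statfs() results to one of the three setups."""
    if root_magic == CGROUP2_SUPER_MAGIC:
        return "unified"
    if root_magic == TMPFS_MAGIC:
        if unified_magic == CGROUP2_SUPER_MAGIC:
            return "hybrid"
        return "legacy"
    # Not covered by the description above, e.g. cgroupfs is not mounted.
    return "unknown"


def detect_cgroup_mode():
    return classify_cgroup_mode(
        fs_magic("/sys/fs/cgroup/"),
        fs_magic("/sys/fs/cgroup/unified/"),
    )


if __name__ == "__main__":
    print(detect_cgroup_mode())
```

Keeping the classification separate from the system call makes the decision logic easy to check without access to a particular cgroup setup.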
+From a shell, you can check the `Type` in `stat -f /sys/fs/cgroup` and +`stat -f /sys/fs/cgroup/unified`. + +## systemd's Unit Types + +The low-level kernel cgroups feature is exposed in systemd in three different +"unit" types. Specifically: + +1. 💼 The `.service` unit type. This unit type is for units encapsulating + processes systemd itself starts. Units of these types have cgroups that are + the leaves of the cgroup tree the systemd instance manages (though possibly + they might contain a sub-tree of their own managed by something else, made + possible by the concept of delegation, see below). Service units are usually + instantiated based on a unit file on disk that describes the command line to + invoke and other properties of the service. However, service units may also + be declared and started programmatically at runtime through a D-Bus API + (which is called *transient* services). + +2. 👓 The `.scope` unit type. This is very similar to `.service`. The main + difference: the processes the units of this type encapsulate are forked off + by some unrelated manager process, and that manager asked systemd to expose + them as a unit. Unlike services, scopes can only be declared and started + programmatically, i.e. are always transient. That's because they encapsulate + processes forked off by something else, i.e. existing runtime objects, and + hence cannot really be defined fully in 'offline' concepts such as unit + files. + +3. 🔪 The `.slice` unit type. Units of this type do not directly contain any + processes. Units of this type are the inner nodes of part of the cgroup tree + the systemd instance manages. Much like services, slices can be defined + either on disk with unit files or programmatically as transient units. + +Slices expose the trunk and branches of a tree, and scopes and services are +attached to those branches as leaves. The idea is that scopes and services can +be moved around though, i.e. assigned to a different slice if needed. 
+ +The naming of slice units directly maps to the cgroup tree path. This is not +the case for service and scope units however. A slice named `foo-bar-baz.slice` +maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service +`quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the +cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`. + +By default systemd sets up four slice units: + +1. `-.slice` is the root slice. i.e. the parent of everything else. On the host + system it maps directly to the top-level directory of cgroup v2. + +2. `system.slice` is where system services are by default placed, unless + configured otherwise. + +3. `user.slice` is where user sessions are placed. Each user gets a slice of + its own below that. + +4. `machines.slice` is where VMs and containers are supposed to be + placed. `systemd-nspawn` makes use of this by default, and you're very welcome + to place your containers and VMs there too if you hack on managers for those. + +Users may define any amount of additional slices they like though, the four +above are just the defaults. + +## Delegation + +Container managers and suchlike often want to control cgroups directly using +the raw kernel APIs. That's entirely fine and supported, as long as proper +*delegation* is followed. Delegation is a concept we inherited from cgroup v2, +but we expose it on cgroup v1 too. Delegation means that some parts of the +cgroup tree may be managed by different managers than others. As long as it is +clear which manager manages which part of the tree each one can do within its +sub-graph of the tree whatever it wants. + +Only sub-trees can be delegated (though whoever decides to request a sub-tree +can delegate sub-sub-trees further to somebody else if they like). Delegation +takes place at a specific cgroup: in systemd there's a `Delegate=` property you +can set for a service or scope unit. 
If you do, it's the cut-off point for +systemd's cgroup management: the unit itself is managed by systemd, i.e. all +its attributes are managed exclusively by systemd, however your program may +create/remove sub-cgroups inside it freely, and those then become exclusive +property of your program, systemd won't touch them — all attributes of *those* +sub-cgroups can be manipulated freely and exclusively by your program. + +By turning on the `Delegate=` property for a scope or service you get a few +guarantees: + +1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't + change attributes of any cgroups below it, nor will it create or remove any + cgroups thereunder, nor migrate processes across the boundaries of that + sub-tree as it deems useful anymore. + +2. If your service makes use of the `User=` functionality, then the sub-tree + will be `chown()`ed to the indicated user so that it can correctly create + cgroups below it. Note however that systemd will do that only in the unified + hierarchy (in unified and hybrid mode) as well as on systemd's own private + hierarchy (in legacy and hybrid mode). It won't pass ownership of the legacy + controller hierarchies. Delegation to less privileged processes is not safe + in cgroup v1 (as a limitation of the kernel), hence systemd won't facilitate + access to it. + +3. Any BPF IP filter programs systemd installs will be installed with + `BPF_F_ALLOW_MULTI` so that your program can install additional ones. + +In unit files the `Delegate=` property is superficially exposed as +boolean. However, since v236 it optionally takes a list of controller names +instead. If so, delegation is requested for listed controllers +specifically. Note that this only encodes a request. Depending on various +parameters it might happen that your service actually will get fewer +controllers delegated (for example, because the controller is not available on +the current kernel or was turned off) or more. 
If no list is specified +(i.e. the property simply set to `yes`) then all available controllers are +delegated. + +Let's stress one thing: delegation is available on scope and service units +only. It's expressly not available on slice units. Why? Because slice units are +our *inner* nodes of the cgroup trees and we freely attach services and scopes +to them. If we'd allow delegation on slice units then this would mean that +both systemd and your own manager would create/delete cgroups below the slice +unit and that conflicts with the single-writer rule. + +So, if you want to do your own raw cgroups kernel level access, then allocate a +scope unit, or a service unit (or just use the service unit you already have +for your service code), and turn on delegation for it. + +The service manager sets the `user.delegate` extended attribute (readable via +`getxattr(2)` and related calls) to the character `1` on cgroup directories +where delegation is enabled (and removes it on those cgroups where it is +not). This may be used by service programs to determine whether a cgroup tree +was delegated to them. Note that this is only supported on kernels 5.6 and +newer in combination with systemd 251 and newer. + +(OK, here's one caveat: if you turn on delegation for a service, and that +service has `ExecStartPost=`, `ExecReload=`, `ExecStop=` or `ExecStopPost=` +set, then these commands will be executed within the `.control/` sub-cgroup of +your service's cgroup. This is necessary because by turning on delegation we +have to assume that the cgroup delegated to your service is now an *inner* +cgroup, which means that it may not directly contain any processes. Hence, if +your service has any of these four settings set, you must be prepared that a +`.control/` subcgroup might appear, managed by the service manager. 
This also +means that your service code should have moved itself further down the cgroup +tree by the time it notifies the service manager about start-up readiness, so +that the service's main cgroup is definitely an inner node by the time the +service manager might start `ExecStartPost=`. Starting with systemd 254 you may +also use `DelegateSubgroup=` to let the service manager put your initial +service process into a subgroup right away.) + +(Also note, if you intend to use "threaded" cgroups — as added in Linux 4.14 —, +then you should do that *two* levels down from the main service cgroup you +turned delegation on for. Why that? You need one level so that systemd can +properly create the `.control` subgroup, as described above. But that one +cannot be threaded, since that would mean `.control` has to be threaded too — +this is a requirement of threaded cgroups: either a cgroup and all its siblings +are threaded or none –, but systemd expects it to be a regular cgroup. Thus you +have to nest a second cgroup beneath it which then can be threaded.) + +## Three Scenarios + +Let's say you write a container manager, and you wonder what to do regarding +cgroups for it, as you want your manager to be able to run on systemd systems. + +You basically have three options: + +1. 😊 The *integration-is-good* option. For this, you register each container + you have either as a systemd service (i.e. let systemd invoke the executor + binary for you) or a systemd scope (i.e. your manager executes the binary + directly, but then tells systemd about it). In this mode the administrator + can use the usual systemd resource management and reporting commands + individually on those containers. By turning on `Delegate=` for these scopes + or services you make it possible to run cgroup-enabled programs in your + containers, for example a nested systemd instance. This option has two + sub-options: + + a. 
You transiently register the service or scope by directly contacting + systemd via D-Bus. In this case systemd will just manage the unit for you + and nothing else. + + b. Instead you register the service or scope through `systemd-machined` + (also via D-Bus). This mini-daemon is basically just a proxy for the same + operations as in a. The main benefit of this: this way you let the system + know that what you are registering is a container, and this opens up + certain additional integration points. For example, `journalctl -M` can + then be used to directly look into any container's journal logs (should + the container run systemd inside), or `systemctl -M` can be used to + directly invoke systemd operations inside the containers. Moreover tools + like "ps" can then show you to which container a process belongs (`ps -eo + pid,comm,machine`), and even gnome-system-monitor supports it. + +2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup tree, + and you want to have to do as little as possible with systemd and no + interest in integration with the rest of the system, then this is a valid + option. For this all you have to do is turn on `Delegate=` for your main + manager daemon. Then figure out the cgroup systemd placed your daemon in: + you can now freely create sub-cgroups beneath it. Don't forget the + *no-processes-in-inner-nodes* rule however: you have to move your main + daemon process out of that cgroup (and into a sub-cgroup) before you can + start further processes in any of your sub-cgroups. + +3. 🙁 The *i-like-continents* option. In this option you'd leave your manager + daemon where it is, and would not turn on delegation on its unit. 
However, + as you start your first managed process (a container, for example) you would + register a new scope unit with systemd, and that scope unit would have + `Delegate=` turned on, and it would contain the PID of this process; all + your managed processes subsequently created should also be moved into this + scope. From systemd's PoV there'd be two units: your manager service and the + big scope that contains all your managed processes in one. + +BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus +API, kthxbye", then options #1 and #3 are not available, as they generally +involve talking to systemd from your program code, via D-Bus. You still have +option #2 in that case however, as you can simply set `Delegate=` in your +service's unit file and you are done and have your own sub-tree. In fact, #2 is +the one option that allows you to completely ignore systemd's existence: you +can entirely generically follow the single rule that you just use the cgroup +you are started in, and everything below it, whatever that might be. That said, +maybe if you dislike D-Bus and systemd that much, the better approach might be +to work on that, and widen your horizon a bit. You are welcome. + +## Controller Support + +systemd supports a number of controllers (but not all). Specifically, supported +are: + +* on cgroup v1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids` +* on cgroup v2: `cpu`, `io`, `memory`, `pids` + +It is our intention to natively support all cgroup v2 controllers as they are +added to the kernel. However, regarding cgroup v1: at this point we will not +add support for any other controllers anymore. This means systemd currently +does not and will never manage the following controllers on cgroup v1: +`freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not? 
+Depending on the case, either their API semantics or implementations aren't +really usable, or it's very clear they have no future on cgroup v2, and we +won't add new code for stuff that clearly has no future. + +Effectively this means that all those mentioned cgroup v1 controllers are up +for grabs: systemd won't manage them, and hence won't delegate them to your +code (however, systemd will still mount their hierarchies, simply because it +mounts all controller hierarchies it finds available in the kernel). If you +decide to use them, then that's fine, but systemd won't help you with it (but +also not interfere with it). To be nice to other tenants it might be wise to +replicate the cgroup hierarchies of the other controllers in them too however, +but of course that's between you and those other tenants, and systemd won't +care. Replicating the cgroup hierarchies in those unsupported controllers would +mean replicating the full cgroup paths in them, and hence the prefixing +`.slice` components too, otherwise the hierarchies will start being orthogonal +after all, and that's not really desirable. One more thing: systemd will clean +up after you in the hierarchies it manages: if your daemon goes down, its +cgroups will be removed too. You basically get the guarantee that you start +with a pristine cgroup sub-tree for your service or scope whenever it is +started. This is not the case however in the hierarchies systemd doesn't +manage. This means that your programs should be ready to deal with left-over +cgroups in them — from previous runs, and be extra careful with them as they +might still carry settings that might not be valid anymore. + +Note a particular asymmetry here: if your systemd version doesn't support a +specific controller on cgroup v1 you can still make use of it for delegation, +by directly fiddling with its hierarchy and replicating the cgroup tree there +as necessary (as suggested above). 
However, on cgroup v2 this is different: +separately mounted hierarchies are not available, and delegation has always to +happen through systemd itself. This means: when you update your kernel and it +adds a new, so far unseen controller, and you want to use it for delegation, +then you also need to update systemd to a version that groks it. + +## systemd as Container Payload + +systemd can happily run as a container payload's PID 1. Note that systemd +unconditionally needs write access to the cgroup tree however, hence you need +to delegate a sub-tree to it. Note that there's nothing too special you have to +do beyond that: just invoke systemd as PID 1 inside the root of the delegated +cgroup sub-tree, and it will figure out the rest: it will determine the cgroup +it is running in and take possession of it. It won't interfere with any cgroup +outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence +optional (but of course wise). + +Note one particular asymmetry here though: systemd will try to take possession +of the root cgroup you pass to it *in* *full*, i.e. it will not only +create/remove child cgroups below it, it will also attempt to manage the +attributes of it. OTOH as mentioned above, when delegating a cgroup tree to +somebody else it only passes the rights to create/remove sub-cgroups, but will +insist on managing the delegated cgroup tree's top-level attributes. Or in +other words: systemd is *greedy* when accepting delegated cgroup trees and also +*greedy* when delegating them to others: it insists on managing attributes on +the specific cgroup in both cases. A container manager that is itself a payload +of a host systemd which wants to run a systemd as its own container payload +instead hence needs to insert an extra level in the hierarchy in between, so +that the systemd on the host and the one in the container won't fight for the +attributes. 
That said, you likely should do that anyway, due to the +no-processes-in-inner-cgroups rule, see below. + +When systemd runs as container payload it will make use of all hierarchies it +has write access to. For legacy mode you need to make at least +`/sys/fs/cgroup/systemd/` available, all other hierarchies are optional. For +hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully +unified you (of course, I guess) need to provide only `/sys/fs/cgroup/` itself. + +## Some Dos + +1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then + each of your containers will have its own systemd-managed unit and hence + cgroup with possibly further sub-cgroups below. Typically the first process + running in that unit will be some kind of executor program, which will in + turn fork off the payload processes of the container. In this case don't + forget that there are two levels of delegation involved: first, systemd + delegates a group sub-tree to your executor. And then your executor should + delegate a sub-tree further down to the container payload. Oh, and because + of the no-process-in-inner-nodes rule, your executor needs to migrate itself + to a sub-cgroup of the cgroup it got delegated, too. Most likely you hence + want a two-pronged approach: below the cgroup you got started in, you want + one cgroup maybe called `supervisor/` where your manager runs in and then + for each container a sibling cgroup of that maybe called `payload-xyz/`. + +2. ⚡ Don't forget that the cgroups you create have to have names that are + suitable as UNIX file names, and that they live in the same namespace as the + various kernel attribute files. 
Hence, when you want to allow the user + arbitrary naming, you might need to escape some of the names (for example, + you really don't want to create a cgroup named `tasks`, just because the + user created a container by that name, because `tasks` after all is a magic + attribute in cgroup v1, and your `mkdir()` will hence fail with `EEXIST`). In + systemd we do escaping by prefixing names that might collide with a kernel + attribute name with an underscore. You might want to do the same, but how + you do it is really up to you. Just do it, and be careful. + +## Some Don'ts + +1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e. + cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create your + own cgroups below the root cgroup 🔥. That's owned by systemd, and you will + step on systemd's toes if you ignore that, and systemd will step on + yours. Get your own delegated sub-tree, you may create as many cgroups there + as you like. Seriously, if you create cgroups directly in the cgroup root, + then all you do is ask for trouble. + +2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in + `-.slice`. It's not supported, and will generate an error. + +3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for + you. It's systemd's private property. You are welcome to manipulate the + attributes of cgroups you created in your own delegated sub-tree, but the + cgroup tree of systemd itself is off limits for you. It's fine to *read* + from any attribute you like however. That's totally OK and welcome. + +4. 🚫 When not using `CLONE_NEWCGROUP` when delegating a sub-tree to a + container payload running systemd, then don't get the idea that you can bind + mount only a sub-tree of the host's cgroup tree into the container. 
Part of + the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every + process, and hence any path below `/sys/fs/cgroup/` needs to match what + `/proc/$PID/cgroup` of the payload processes reports. What you can do safely + however, is mount the upper parts of the cgroup tree read-only (or even + replace the middle bits with an intermediary `tmpfs` — but be careful not to + break the `statfs()` detection logic discussed above), as long as the path + to the delegated sub-tree remains accessible as-is. + +5. ⚡ Currently, the algorithm for mapping between slice/scope/service unit + naming and their cgroup paths is not considered public API of systemd, and + may change in future versions. This means: it's best to avoid implementing a + local logic of translating cgroup paths to slice/scope/service names in your + program, or vice versa — it's likely going to break sooner or later. Use the + appropriate D-Bus API calls for that instead, so that systemd translates + this for you. (Specifically: each Unit object has a `ControlGroup` property + to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be + used to get the unit for a cgroup.) + +6. ⚡ Think twice before delegating cgroup v1 controllers to less privileged + containers. It's not safe, you basically allow your containers to freeze the + system with that and worse. Delegation is a strongpoint of cgroup v2 though, + and there it's safe to treat delegation boundaries as privilege boundaries. + +And that's it for now. If you have further questions, refer to the systemd +mailing list. + +— Berlin, 2018-04-20 diff --git a/docs/CNAME b/docs/CNAME new file mode 100644 index 0000000..cdcf4d9 --- /dev/null +++ b/docs/CNAME @@ -0,0 +1 @@ +systemd.io
\ No newline at end of file diff --git a/docs/CODE_OF_CONDUCT.md b/docs/CODE_OF_CONDUCT.md new file mode 100644 index 0000000..8e5455d --- /dev/null +++ b/docs/CODE_OF_CONDUCT.md @@ -0,0 +1,21 @@ +--- +title: systemd Community Conduct Guidelines +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# The systemd Community Conduct Guidelines + +This document provides community guidelines for a safe, respectful, productive, and collaborative place for any person who is willing to contribute to systemd. It applies to all “collaborative spaces”, which is defined as community communications channels (such as mailing lists, submitted patches, commit comments, etc.). + +- Participants will be tolerant of opposing views. +- Participants must ensure that their language and actions are free of personal attacks and disparaging personal remarks. +- When interpreting the words and actions of others, participants should always assume good intentions. +- Behaviour which can be reasonably considered harassment will not be tolerated. + +## Enforcement + +Instances of abusive, harassing, or otherwise unacceptable behavior may be reported by contacting the project team at systemd-conduct@googlegroups.com. This team currently consists of David Strauss <<systemd-conduct@davidstrauss.net>>, Ekaterina Gerasimova (Kat) <<Kittykat3756@gmail.com>>, and Zbigniew Jędrzejewski-Szmek <<zbyszek@in.waw.pl>>. In the unfortunate event that you wish to make a complaint against one of the members, you may instead contact any of the other members individually. + +All complaints will be reviewed and investigated and will result in a response that is deemed necessary and appropriate to the circumstances. The project team is obligated to maintain confidentiality with regard to the reporter of an incident. 
diff --git a/docs/CODE_QUALITY.md b/docs/CODE_QUALITY.md new file mode 100644 index 0000000..166b307 --- /dev/null +++ b/docs/CODE_QUALITY.md @@ -0,0 +1,85 @@ +--- +title: Code Quality Tools +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Code Quality Tools + +The systemd project has a number of code quality tools set up in the source +tree and on the GitHub infrastructure. Here's a non-exhaustive list of the +available functionality: + +1. Use `meson test -C build` to run the unit tests. Some tests are skipped if + no privileges are available, hence consider also running them with `sudo + meson test -C build`. A couple of unit tests are considered "unsafe" (as + they change system state); to run those too, build with `meson setup + -Dtests=unsafe`. Finally, some unit tests are considered to be very slow; + build them too with `meson setup -Dslow-tests=true`. (Note that there are a + couple of manual tests in addition to these unit tests.) (Also note: you can + change these flags for an already set up build tree, too, with "meson + configure -C build -D…".) + +2. Use `./test/run-integration-tests.sh` to run the full integration test + suite. This will build OS images with a number of integration tests and run + them using `systemd-nspawn` and `qemu`. Requires root. + +3. Use `./coccinelle/run-coccinelle.sh` to run all + [Coccinelle](http://coccinelle.lip6.fr/) semantic patch scripts we ship. The + output will show false positives, hence take it with a pinch of salt. + +4. Use `./tools/find-double-newline.sh recdiff` to find double newlines. Use + `./tools/find-double-newline.sh recpatch` to fix them. Take this with a grain + of salt, in particular as we generally leave foreign header files we include in + our tree unmodified, if possible. + +5. Similarly, use `./tools/find-tabs.sh recdiff` to find TABs, and + `./tools/find-tabs.sh recpatch` to fix them. 
(Again, grain of salt, foreign + headers should usually be left unmodified.) + +6. Use `ninja -C build check-api-docs` to compare the list of exported symbols + of `libsystemd.so` and `libudev.so` with the list of man pages. Symbols + lacking documentation are highlighted. + +7. Use `ninja -C build update-hwdb` and `ninja -C build update-hwdb-autosuspend` + to automatically download and import the PCI, USB, and OUI databases and the + autosuspend quirks into the hwdb. + +8. Use `ninja -C build update-man-rules` to update the meson rules for building + man pages automatically from the docbook XML files included in `man/`. + +9. There are multiple CI systems in use that run on every GitHub pull request + submission or update. + +10. [Coverity](https://scan.coverity.com/) analyzes the systemd `main` branch + at regular intervals. The reports are available + [online](https://scan.coverity.com/projects/systemd). + +11. [OSS-Fuzz](https://github.com/google/oss-fuzz) is continuously fuzzing the + codebase. Reports are available + [online](https://oss-fuzz.com/testcases?project=systemd&open=yes). + It also builds + [coverage reports](https://oss-fuzz.com/coverage-report/job/libfuzzer_asan_systemd/latest) + daily. + +12. Our tree includes `.editorconfig`, `.dir-locals.el` and `.vimrc` files, to + ensure that editors follow the right indentation styles automatically. + +13. When building systemd from a git checkout the build scripts will + automatically enable a git commit hook that ensures whitespace cleanliness. + +14. [CodeQL](https://codeql.github.com/) analyzes each PR and every commit + pushed to `main`. The list of active alerts can be found + [here](https://github.com/systemd/systemd/security/code-scanning). + +15. Each PR is automatically tested with [Address Sanitizer](https://clang.llvm.org/docs/AddressSanitizer.html) + and [Undefined Behavior Sanitizer](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html). 
+ See [Testing systemd using sanitizers](TESTING_WITH_SANITIZERS) + for more information. + +16. Fossies provides [source code misspelling reports](https://fossies.org/features.html#codespell). + The systemd report can be found [here](https://fossies.org/linux/misc/systemd/codespell.html). + +Access to Coverity and oss-fuzz reports is limited. Please reach out to the +maintainers if you need access. diff --git a/docs/CODING_STYLE.md b/docs/CODING_STYLE.md new file mode 100644 index 0000000..6d6e549 --- /dev/null +++ b/docs/CODING_STYLE.md @@ -0,0 +1,782 @@ +--- +title: Coding Style +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Coding Style + +## Formatting + +- 8ch indent, no tabs, except for files in `man/` which are 2ch indent, and + still no tabs, and shell scripts, which are 4ch indent, and no tabs either. + +- We prefer `/* comments */` over `// comments` in code you commit, + please. This way `// comments` are left for developers to use for local, + temporary commenting of code for debug purposes (i.e. uncommittable stuff), + making such comments easily discernible from explanatory, documenting code + comments (i.e. committable stuff). + +- Don't break code lines too eagerly. We do **not** force line breaks at 80ch, + all of today's screens should be much larger than that. But then again, don't + overdo it, ~109ch should be enough really. The `.editorconfig`, `.vimrc` and + `.dir-locals.el` files contained in the repository will set this limit up for + you automatically, if you let them (as well as a few other things). Please + note that emacs loads `.dir-locals.el` automatically, but vim needs to be + configured to load `.vimrc`, see that file for instructions. + +- If you break a function declaration over multiple lines, do it like this: + + ```c + void some_function( + int foo, + bool bar, + char baz) { + + int a, b, c; + ``` + + (i.e. use double indentation — 16 spaces — for the parameter list.) 
+ +- Try to write this: + + ```c + void foo() { + } + ``` + + instead of this: + + ```c + void foo() + { + } + ``` + +- Single-line `if` blocks should not be enclosed in `{}`. Write this: + + ```c + if (foobar) + waldo(); + ``` + + instead of this: + + ```c + if (foobar) { + waldo(); + } + ``` + +- Do not write `foo ()`, write `foo()`. + +- `else` blocks should generally start on the same line as the closing `}`: + ```c + if (foobar) { + find(); + waldo(); + } else + dont_find_waldo(); + ``` + +- Please define flags types like this: + + ```c + typedef enum FoobarFlags { + FOOBAR_QUUX = 1 << 0, + FOOBAR_WALDO = 1 << 1, + FOOBAR_XOXO = 1 << 2, + … + } FoobarFlags; + ``` + + i.e. use an enum for it, if possible. Indicate bit values via `1 <<` + expressions, and align them vertically. Define both an enum and a type for + it. + +- If you define (non-flags) enums, follow this template: + + ```c + typedef enum FoobarMode { + FOOBAR_AAA, + FOOBAR_BBB, + FOOBAR_CCC, + … + _FOOBAR_MAX, + _FOOBAR_INVALID = -EINVAL, + } FoobarMode; + ``` + + i.e. define a `_MAX` enum for the largest defined enum value, plus one. Since + this is not a regular enum value, prefix it with `_`. Also, define a special + "invalid" enum value, and set it to `-EINVAL`. That way the enum type can + safely be used to propagate conversion errors. + +- If you define an enum in a public API, be extra careful, as the size of the + enum might change when new values are added, which would break ABI + compatibility. Since we typically want to allow adding new enum values to an + existing enum type with later API versions, please use the + `_SD_ENUM_FORCE_S64()` macro in the enum definition, which forces the size of + the enum to be signed 64-bit wide. + +- Empty lines to separate code blocks are a good thing, please add them + abundantly. However, please stick to one at a time, i.e. multiple empty lines + immediately following each other are not OK. 
Also, we try to keep function + calls and their immediate error handling together. Hence: + + ```c + /* → empty line here is good */ + r = some_function(…); + /* → empty line here would be bad */ + if (r < 0) + return log_error_errno(r, "Some function failed: %m"); + /* → empty line here is good */ + ``` + +- In shell scripts, do not use whitespace after the redirection operator + (`>some/file` instead of `> some/file`, `<<EOF` instead of `<< EOF`). + +## Code Organization and Semantics + +- For our codebase we intend to use ISO C11 *with* GNU extensions (aka + "gnu11"). Public APIs (i.e. those we expose via `libsystemd.so` + i.e. `systemd/sd-*.h`) should only use ISO C89 however (with a very limited + set of conservative and common extensions, such as fixed size integer types + from `<inttypes.h>`), so that we don't force consuming programs into C11 + mode. (This discrepancy in particular means one thing: internally we use C99 + `bool` booleans, externally C89-compatible `int` booleans which generally + have different size in memory and slightly different semantics, also see + below.) Both for internal and external code it's OK to use even newer + features and GCC extension than "gnu11", as long as there's reasonable + fallback #ifdeffery in place to ensure compatibility is retained with older + compilers. + +- Please name structures in `PascalCase` (with exceptions, such as public API + structs), variables and functions in `snake_case`. + +- Avoid static variables, except for caches and very few other cases. Think + about thread-safety! While most of our code is never used in threaded + environments, at least the library code should make sure it works correctly + in them. Instead of doing a lot of locking for that, we tend to prefer using + TLS to do per-thread caching (which only works for small, fixed-size cache + objects), or we disable caching for any thread that is not the main + thread. 
Use `is_main_thread()` to detect whether the calling thread is the + main thread. + +- Do not write functions that clobber call-by-reference variables on + failure. Use temporary variables for these cases and change the passed in + variables only on success. The rule is: never clobber return parameters on + failure, always initialize return parameters on success. + +- Typically, function parameters fit into three categories: input parameters, + mutable objects, and call-by-reference return parameters. Input parameters + should always carry suitable "const" declarators if they are pointers, to + indicate they are input-only and not changed by the function. Return + parameters are best prefixed with "ret_", to clarify they are return + parameters. (Conversely, please do not prefix parameters that aren't + output-only with "ret_", in particular not mutable parameters that are both + input as well as output). Example: + + ```c + static int foobar_frobnicate( + Foobar* object, /* the associated mutable object */ + const char *input, /* immutable input parameter */ + char **ret_frobnicated) { /* return parameter */ + … + return 0; + } + ``` + +- The order in which header files are included doesn't matter too + much. systemd-internal headers must not rely on an include order, so it is + safe to include them in any order possible. However, to not clutter global + includes, and to make sure internal definitions will not affect global + headers, please always include the headers of external components first + (these are all headers enclosed in <>), followed by our own exported headers + (usually everything that's prefixed by `sd-`), and then followed by internal + headers. Furthermore, in all three groups, order all includes alphabetically + so duplicate includes can easily be detected. + +- Please avoid using global variables as much as you can. And if you do use + them make sure they are static at least, instead of exported. 
Especially in + library-like code it is important to avoid global variables. Why are global + variables bad? They usually hinder generic reusability of code (since they + break in threaded programs, and usually would require locking there), and as + the code using them has side-effects make programs non-transparent. That + said, there are many cases where they explicitly make a lot of sense, and are + OK to use. For example, the log level and target in `log.c` is stored in a + global variable, and that's OK and probably expected by most. Also in many + cases we cache data in global variables. If you add more caches like this, + please be careful however, and think about threading. Only use static + variables if you are sure that thread-safety doesn't matter in your + case. Alternatively, consider using TLS, which is pretty easy to use with + gcc's `thread_local` concept. It's also OK to store data that is inherently + global in global variables, for example, data parsed from command lines, see + below. + +- Our focus is on the GNU libc (glibc), not any other libcs. If other libcs are + incompatible with glibc it's on them. However, if there are equivalent POSIX + and Linux/GNU-specific APIs, we generally prefer the POSIX APIs. If there + aren't, we are happy to use GNU or Linux APIs, and expect non-GNU + implementations of libc to catch up with glibc. + +## Using C Constructs + +- Allocate local variables where it makes sense: at the top of the block, or at + the point where they can be initialized. Avoid huge variable declaration + lists at the top of the function. + + As an exception, `int r` is typically used for a local state variable, but + should almost always be declared as the last variable at the top of the + function. 
+ + ```c + { + uint64_t a; + int r; + + r = frobnicate(&a); + if (r < 0) + … + + uint64_t b = a + 1, c; + + r = foobarify(a, b, &c); + if (r < 0) + … + + const char *pretty = prettify(a, b, c); + … + } + ``` + +- Do not mix multiple variable definitions with function invocations or + complicated expressions: + + ```c + { + uint64_t x = 7; + int a; + + a = foobar(); + } + ``` + + instead of: + + ```c + { + int a = foobar(); + uint64_t x = 7; + } + ``` + +- Use `goto` for cleaning up, and only use it for that. I.e. you may only jump + to the end of a function, and little else. Never jump backwards! + +- To minimize strict aliasing violations, we prefer unions over casting. + +- Instead of using `memzero()`/`memset()` to initialize structs allocated on + the stack, please try to use c99 structure initializers. It's short, prettier + and actually even faster at execution. Hence: + + ```c + struct foobar t = { + .foo = 7, + .bar = "bazz", + }; + ``` + + instead of: + + ```c + struct foobar t; + zero(t); + t.foo = 7; + t.bar = "bazz"; + ``` + +- To implement an endless loop, use `for (;;)` rather than `while (1)`. The + latter is a bit ugly anyway, since you probably really meant `while + (true)`. To avoid the discussion what the right always-true expression for an + infinite while loop is, our recommendation is to simply write it without any + such expression by using `for (;;)`. + +- To determine the length of a constant string `"foo"`, don't bother with + `sizeof("foo")-1`, please use `strlen()` instead (both gcc and clang optimize + the call away for fixed strings). The only exception is when declaring an + array. In that case use `STRLEN()`, which evaluates to a static constant and + doesn't force the compiler to create a VLA. + +- Please use C's downgrade-to-bool feature only for expressions that are + actually booleans (or "boolean-like"), and not for variables that are really + numeric. 
Specifically, if you have an `int b` and it's only used in a boolean + sense, by all means check its state with `if (b) …` — but if `b` can actually + have more than two semantic values, and you want to compare for non-zero, + then please write that explicitly with `if (b != 0) …`. This helps readability + as the value range and semantical behaviour is directly clear from the + condition check. As a special addition: when dealing with pointers which you + want to check for non-NULL-ness, you may also use downgrade-to-bool feature. + +- Please do not use yoda comparisons, i.e. please prefer the more readable `if + (a == 7)` over the less readable `if (7 == a)`. + +## Destructors + +- The destructors always deregister the object from the next bigger object, not + the other way around. + +- For robustness reasons, destructors should be able to destruct + half-initialized objects, too. + +- When you define a destructor or `unref()` call for an object, please accept a + `NULL` object and simply treat this as NOP. This is similar to how libc + `free()` works, which accepts `NULL` pointers and becomes a NOP for them. By + following this scheme a lot of `if` checks can be removed before invoking + your destructor, which makes the code substantially more readable and robust. + +- Related to this: when you define a destructor or `unref()` call for an + object, please make it return the same type it takes and always return `NULL` + from it. This allows writing code like this: + + ```c + p = foobar_unref(p); + ``` + + which will always work regardless if `p` is initialized or not, and + guarantees that `p` is `NULL` afterwards, all in just one line. + +## Common Function Naming + +- Name destructor functions that destroy an object in full freeing all its + memory and associated resources (and thus invalidating the pointer to it) + `xyz_free()`. Example: `strv_free()`. 
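As a rough illustration of the destructor conventions described above, here is a minimal sketch in plain C. The `Foobar` type and the `foobar_new()`/`foobar_free()` names are hypothetical and not part of systemd. The sketch only assumes the rules stated in the text: accept `NULL`, cope with half-initialized objects, and always return `NULL`.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical object type, used only for illustration. */
typedef struct Foobar {
        char *name;
        int *values;
} Foobar;

/* Destroys the object in full. Accepts NULL and treats it as a NOP. Also
 * copes with half-initialized objects, since free(NULL) is a NOP as well.
 * Always returns NULL, so that callers can write "p = foobar_free(p);". */
Foobar* foobar_free(Foobar *f) {
        if (!f)
                return NULL;

        free(f->name);
        free(f->values);
        free(f);
        return NULL;
}

/* Constructor: returns NULL on OOM, which is acceptable for constructors. */
Foobar* foobar_new(const char *name) {
        Foobar *f;
        size_t n;

        f = calloc(1, sizeof(Foobar));
        if (!f)
                return NULL;

        n = strlen(name) + 1;
        f->name = malloc(n);
        if (!f->name)
                return foobar_free(f); /* f->values is still NULL at this point */

        memcpy(f->name, name, n);
        return f;
}
```

With this shape a caller can invoke `p = foobar_free(p);` unconditionally, including a second time on an already released pointer, and the error path of the constructor can reuse the destructor instead of duplicating clean-up code.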
+ +- Name destructor functions that destroy only the referenced content of an + object but leave the object itself allocated `xyz_done()`. If it resets all + fields so that the object can be reused later, call it `xyz_clear()`. + +- Functions that decrease the reference counter of an object by one should be + called `xyz_unref()`. Example: `json_variant_unref()`. Functions that + increase the reference counter by one should be called `xyz_ref()`. Example: + `json_variant_ref()`. + +## Error Handling + +- Error codes are returned as negative `Exxx`, e.g. `return -EINVAL`. There are + some exceptions: for constructors, it is OK to return `NULL` on OOM. For + lookup functions, `NULL` is fine too for "not found". + + Be strict with this. When you write a function that can fail due to more than + one cause, it *really* should have an `int` as the return value for the error + code. + +- libc system calls typically return -1 on error (with the error code in + `errno`), and >= 0 on success. Use the `RET_NERRNO()` helper if you are looking + for a simple way to convert this libc style error returning into systemd + style error returning, e.g.: + + ```c + … + r = RET_NERRNO(unlink(t)); + … + ``` + + or + + ```c + … + r = RET_NERRNO(open("/some/file", O_RDONLY|O_CLOEXEC)); + … + ``` + +- Do not bother with error checking whether writing to stdout/stderr worked. + +- Do not log errors from "library" code, only do so from "main program" + code. (With one exception: it is OK to log with DEBUG level from any code, + with the exception of maybe inner loops). + +- In public API calls, you **must** validate all your input arguments for + programming errors with `assert_return()` and return a sensible return + code. In all other calls, it is recommended to check for programming errors + with a more brutal `assert()`. We are more forgiving to public users than to + ourselves!
Note that `assert()` and `assert_return()` really only should be + used for detecting programming errors, not for runtime errors. `assert()` and + `assert_return()` by usage of `_likely_()` inform the compiler that it should + not expect these checks to fail, and they inform fellow programmers about the + expected validity and range of parameters. + +- When you invoke certain calls like `unlink()`, or `mkdir_p()` and you know it + is safe to ignore the error it might return (because a later call would + detect the failure anyway, or because the error is in an error path and you + thus couldn't do anything about it anyway), then make this clear by casting + the invocation explicitly to `(void)`. Code checks like Coverity understand + that, and will not complain about ignored error codes. Hence, please use + this: + + ```c + (void) unlink("/foo/bar/baz"); + ``` + + instead of just this: + + ```c + unlink("/foo/bar/baz"); + ``` + + When returning from a `void` function, you may also want to shorten the error + path boilerplate by returning a function invocation cast to `(void)` like so: + + ```c + if (condition_not_met) + return (void) log_tests_skipped("Cannot run ..."); + ``` + + Don't cast function calls to `(void)` that return no error + conditions. Specifically, the various `xyz_unref()` calls that return a + `NULL` object shouldn't be cast to `(void)`, since not using the return value + does not hide any errors. + +- When returning a return code from `main()`, please preferably use + `EXIT_FAILURE` and `EXIT_SUCCESS` as defined by libc. + +## Logging + +- For every function you add, think about whether it is a "logging" function or + a "non-logging" function. "Logging" functions do (non-debug) logging on their + own, "non-logging" functions never log on their own (except at debug level) + and expect their callers to log. All functions in "library" code, i.e. in + `src/shared/` and suchlike must be "non-logging". 
Every time a "logging" + function calls a "non-logging" function, it should log about the resulting + errors. If a "logging" function calls another "logging" function, then it + should not generate log messages, so that log messages are not generated + twice for the same errors. (Note that debug level logging — at syslog level + `LOG_DEBUG` — is not considered logging in this context, debug logging is + generally always fine and welcome.) + +- If possible, do a combined log & return operation: + + ```c + r = operation(...); + if (r < 0) + return log_(error|warning|notice|...)_errno(r, "Failed to ...: %m"); + ``` + + If the error value is "synthetic", i.e. it was not received from + the called function, use `SYNTHETIC_ERRNO` wrapper to tell the logging + system to not log the errno value, but still return it: + + ```c + n = read(..., s, sizeof s); + if (n != sizeof s) + return log_error_errno(SYNTHETIC_ERRNO(EIO), "Failed to read ..."); + ``` + +## Memory Allocation + +- Always check OOM. There is no excuse. In program code, you can use + `log_oom()` for then printing a short message, but not in "library" code. + +- Avoid fixed-size string buffers, unless you really know the maximum size and + that maximum size is small. It is often nicer to use dynamic memory, + `alloca_safe()` or VLAs. If you do allocate fixed-size strings on the stack, + then it is probably only OK if you either use a maximum size such as + `LINE_MAX`, or count in detail the maximum size a string can + have. (`DECIMAL_STR_MAX` and `DECIMAL_STR_WIDTH` macros are your friends for + this!) + + Or in other words, if you use `char buf[256]` then you are likely doing + something wrong! + +- Make use of `_cleanup_free_` and friends. It makes your code much nicer to + read (and shorter)! + +- Do not use `alloca()`, `strdupa()` or `strndupa()` directly. Use + `alloca_safe()`, `strdupa_safe()` or `strndupa_safe()` instead. 
(The + difference is that the latter include an assertion that the specified size is + below a safety threshold, so that the program rather aborts than runs into + possible stack overruns.) + +- Use `alloca_safe()`, but never forget that it is not OK to invoke + `alloca_safe()` within a loop or within function call + parameters. `alloca_safe()` memory is released at the end of a function, and + not at the end of a `{}` block. Thus, if you invoke it in a loop, you keep + increasing the stack pointer without ever releasing memory again. (VLAs have + better behavior in this case, so consider using them as an alternative.) + Regarding not using `alloca_safe()` within function parameters, see the BUGS + section of the `alloca(3)` man page. + +- If you want to concatenate two or more strings, consider using `strjoina()` + or `strjoin()` rather than `asprintf()`, as the latter is a lot slower. This + matters particularly in inner loops (but note that `strjoina()` cannot be + used there). + +## Runtime Behaviour + +- Avoid leaving long-running child processes around, i.e. `fork()`s that are + not followed quickly by an `execv()` in the child. Resource management is + unclear in this case, and memory CoW will result in unexpected penalties in + the parent much, much later on. + +- Don't block execution for arbitrary amounts of time using `usleep()` or a + similar call, unless you really know what you do. Just "giving something some + time", or so is a lazy excuse. Always wait for the proper event, instead of + doing time-based poll loops. + +- Whenever installing a signal handler, make sure to set `SA_RESTART` for it, + so that interrupted system calls are automatically restarted, and we minimize + hassles with handling `EINTR` (in particular as `EINTR` handling is pretty + broken on Linux). + +- When applying C-style unescaping as well as specifier expansion on the same + string, always apply the C-style unescaping first, followed by the specifier + expansion. 
When doing the reverse, make sure to escape `%` in specifier-style + first (i.e. `%` → `%%`), and then do C-style escaping where necessary. + +- Be exceptionally careful when formatting and parsing floating point + numbers. Their syntax is locale dependent (i.e. `5.000` in en_US is generally + understood as 5, while in de_DE as 5000.). + +- Make sure to enforce limits on every user controllable resource. If the user + can allocate resources in your code, your code must enforce some form of + limits after which it will refuse operation. It's fine if it is hard-coded + (at least initially), but it needs to be there. This is particularly + important for objects that unprivileged users may allocate, but also matters + for everything else any user may allocate. + +## Types + +- Think about the types you use. If a value cannot sensibly be negative, do not + use `int`, but use `unsigned`. We prefer `unsigned` form to `unsigned int`. + +- Use `char` only for actual characters. Use `uint8_t` or `int8_t` when you + actually mean a byte-sized signed or unsigned integers. When referring to a + generic byte, we generally prefer the unsigned variant `uint8_t`. Do not use + types based on `short`. They *never* make sense. Use `int`, `long`, `long + long`, all in unsigned and signed fashion, and the fixed-size types + `uint8_t`, `uint16_t`, `uint32_t`, `uint64_t`, `int8_t`, `int16_t`, `int32_t` + and so on, as well as `size_t`, but nothing else. Do not use kernel types + like `u32` and so on, leave that to the kernel. + +- Stay uniform. For example, always use `usec_t` for time values. Do not mix + `usec` and `msec`, and `usec` and whatnot. + +- Never use the `off_t` type, and particularly avoid it in public APIs. It's + really weirdly defined, as it usually is 64-bit and we don't support it any + other way, but it could in theory also be 32-bit. 
Which one it is depends on + a compiler switch chosen by the compiled program, which hence corrupts APIs + using it unless they can also follow the program's choice. Moreover, in + systemd we should parse values the same way on all architectures and cannot + expose `off_t` values over D-Bus. To avoid any confusion regarding conversion + and ABIs, simply always use `uint64_t` directly. + +- Unless you allocate an array, `double` is always a better choice than + `float`. Processors speak `double` natively anyway, so there is no speed + benefit, and on calls like `printf()` `float`s get promoted to `double`s + anyway, so there is no point. + +- Use the `bool` type for booleans, not integers. One exception: in public + headers (i.e. those in `src/systemd/sd-*.h`) use integers after all, as `bool` + is C99 and in our public APIs we try to stick to C89 (with a few extensions; + also see above). + +## Deadlocks + +- Do not issue NSS requests (that includes user name and hostname lookups) + from PID 1 as this might trigger deadlocks when those lookups involve + synchronously talking to services that we would need to start up. + +- Do not synchronously talk to any other service from PID 1, due to the risk of + deadlocks. + +## File Descriptors + +- When you allocate a file descriptor, it should be made `O_CLOEXEC` right from + the beginning, as none of our files should leak to forked binaries by + default. Hence, whenever you open a file, `O_CLOEXEC` must be specified, + right from the beginning. This also applies to sockets. Effectively, this + means that: + + - `open()` must get `O_CLOEXEC` passed, + - `socket()` and `socketpair()` must get `SOCK_CLOEXEC` passed, + - `recvmsg()` must get `MSG_CMSG_CLOEXEC` set, + - `F_DUPFD_CLOEXEC` should be used instead of `F_DUPFD`, and so on, + - invocations of `fopen()` should take `e`. + +- It's a good idea to use `O_NONBLOCK` when opening 'foreign' regular files, + i.e.
file system objects that are supposed to be regular files whose paths + were specified by the user and hence might actually refer to other types of + file system objects. This is a good idea so that we don't end up blocking on + 'strange' file nodes, for example, if the user pointed us to a FIFO or device + node which may block when opening. Moreover even for actual regular files + `O_NONBLOCK` has a benefit: it bypasses any mandatory lock that might be in + effect on the regular file. If in doubt consider turning off `O_NONBLOCK` + again after opening. + +- These days we generally prefer `openat()`-style file APIs, i.e. APIs that + accept a combination of file descriptor and path string, and where the path + (if not absolute) is considered relative to the specified file + descriptor. When implementing library calls in similar style, please make + sure to imply `AT_EMPTY_PATH` if an empty or `NULL` path argument is + specified (and convert that latter to an empty string). This differs from the + underlying kernel semantics, where `AT_EMPTY_PATH` must always be specified + explicitly, and `NULL` is not accepted as path. + +## Command Line + +- If you parse a command line, and want to store the parsed parameters in + global variables, please consider prefixing their names with `arg_`. We have + been following this naming rule in most of our tools, and we should continue + to do so, as it makes it easy to identify command line parameter variables, + and makes it clear why it is OK that they are global variables. + +- Command line option parsing: + - Do not print full `help()` on error, be specific about the error. + - Do not print messages to stdout on error. + - Do not POSIX_ME_HARDER unless necessary, i.e. avoid `+` in option string. + +## Exporting Symbols + +- Variables and functions **must** be static, unless they have a prototype, and + are supposed to be exported. + +- Public API calls (i.e. 
functions exported by our shared libraries) + must be marked `_public_` and need to be prefixed with `sd_`. No + other functions should be prefixed like that. + +- When exposing public C APIs, be careful what function parameters you make + `const`. For example, a parameter taking a context object should probably not + be `const`, even if you are writing an otherwise read-only accessor function + for it. The reason is that making it `const` fixates the contract that your + call won't alter the object ever, as part of the API. However, that's often + quite a promise, given that this even prohibits object-internal caching or + lazy initialization of object variables. Moreover, it's usually not too + useful for client applications. Hence, please be careful and avoid `const` on + object parameters, unless you are very sure `const` is appropriate. + +## Referencing Concepts + +- When referring to a configuration file option in the documentation and such, + please always suffix it with `=`, to indicate that it is a configuration file + setting. + +- When referring to a command line option in the documentation and such, please + always prefix with `--` or `-` (as appropriate), to indicate that it is a + command line option. + +- When referring to a file system path that is a directory, please always + suffix it with `/`, to indicate that it is a directory, not a regular file + (or other file system object). + +## Functions to Avoid + +- Use `memzero()` or even better `zero()` instead of `memset(..., 0, ...)` + +- Please use `streq()` and `strneq()` instead of `strcmp()`, `strncmp()` where + applicable (i.e. wherever you just care about equality/inequality, not about + the sorting order). + +- Never use `strtol()`, `atoi()` and similar calls. Use `safe_atoli()`, + `safe_atou32()` and suchlike instead. They are much nicer to use in most + cases and correctly check for parsing errors. + +- `htonl()`/`ntohl()` and `htons()`/`ntohs()` are weird. 
Please use `htobe32()` + and `htobe16()` instead. They are much more descriptive and actually say what + is really happening: after all, `htonl()` and `htons()` don't operate on + `long`s and `short`s as their names would suggest, but on `uint32_t` and + `uint16_t`. Also, "network byte order" is just a weird name for "big endian", + hence we might want to call it "big endian" right away. + +- Use `typesafe_inet_ntop()`, `typesafe_inet_ntop4()`, and + `typesafe_inet_ntop6()` instead of `inet_ntop()`. But better yet, use the + `IN_ADDR_TO_STRING()`, `IN4_ADDR_TO_STRING()`, and `IN6_ADDR_TO_STRING()` + macros which allocate an anonymous buffer internally. + +- Please never use `dup()`. Use `fcntl(fd, F_DUPFD_CLOEXEC, 3)` instead. For + two reasons: first, you want `O_CLOEXEC` set on the new `fd` (see + above). Second, `dup()` will happily duplicate your `fd` as 0, 1, 2, + i.e. stdin, stdout, stderr, should those `fd`s be closed. Given the special + semantics of those `fd`s, it's probably a good idea to avoid + them. `F_DUPFD_CLOEXEC` with `3` as parameter avoids them. + +- Don't use `fgets()`, it's too hard to properly handle errors such as overly + long lines. Use `read_line()` instead, which is our own function that handles + this much more nicely. + +- Don't invoke `exit()`, ever. It is not a replacement for proper error + handling. Please escalate errors up your call chain, and use normal `return` + to exit from the main function of a process. If you `fork()`ed off a child + process, please use `_exit()` instead of `exit()`, so that the exit handlers + are not run. + +- Do not use `basename()` or `dirname()`. The semantics in corner cases are + full of pitfalls, and the fact that there are two quite different versions of + `basename()` (one POSIX and one GNU, of which the latter is much more useful) + doesn't make it better either. Use `path_extract_filename()` and + `path_extract_directory()` instead. + +- Never use `FILENAME_MAX`.
Use `PATH_MAX` instead (for checking maximum size + of paths) and `NAME_MAX` (for checking maximum size of filenames). + `FILENAME_MAX` is not POSIX, and is a confusingly named alias for `PATH_MAX` + on Linux. Note that `NAME_MAX` does not include space for a trailing `NUL`, + but `PATH_MAX` does. UNIX FTW! + +## Committing to git + +- Commit message subject lines should be prefixed with an appropriate component + name of some kind. For example, "journal: ", "nspawn: " and so on. + +- Do not use "Signed-Off-By:" in your commit messages. That's a kernel thing we + don't do in the systemd project. + +## Commenting + +- The best place for code comments and explanations is in the code itself. Only + the second best is in git commit messages. The worst place is in the GitHub + PR cover letter. Hence, whenever you type a commit message consider for a + moment if what you are typing there wouldn't be a better fit for an in-code + comment. And if you type the cover letter of a PR, think hard if this + wouldn't be better as a commit message or even code comment. Comments are + supposed to be useful for somebody who reviews the code, and hence hiding + comments in git commits or PR cover letters makes reviews unnecessarily + hard. Moreover, while we rely heavily on GitHub's project management + infrastructure we'd like to keep everything that can reasonably be kept in + the git repository itself in the git repository, so that we can theoretically + move things elsewhere with the least effort possible. + +- It's OK to reference GitHub PRs, GitHub issues and git commits from code + comments. Cross-referencing code, issues, and documentation is a good thing. + +- Reasonable use of non-ASCII Unicode UTF-8 characters in code comments is + welcome. If your code comment contains an emoji or two this will certainly + brighten the day of the occasional reviewer of your code. Really! 😊 + +## Threading + +- We generally avoid using threads, to the level this is possible. 
In + particular in the service manager/PID 1 threads are not OK to use. This is + because you cannot mix memory allocation in threads with use of glibc's + `clone()` call, or manual `clone()`/`clone3()` system call wrappers. Only + glibc's own `fork()` call will properly synchronize the memory allocation + locks around the process clone operation. This means that if a process is + cloned via `clone()`/`clone3()` and another thread currently has the + `malloc()` lock taken, it will be cloned in locked state to the child, and + thus can never be acquired in the child, leading to deadlocks. Hence, when + using `clone()`/`clone3()` there are only two ways out: never use threads in the + parent, or never do memory allocation in the child. For our uses we need + `clone()`/`clone3()` and hence decided to avoid threads. Of course, sometimes the + concurrency threads allow is beneficial, however we suggest forking off + worker *processes* rather than worker *threads* for this purpose, ideally + even with an `execve()` to remove the CoW trap situation `fork()` easily + triggers. + +- A corollary of the above is: never use `clone()` where a `fork()` would do + too. Also consider using `posix_spawn()` which combines `clone()` + + `execve()` into one and has nice properties since it avoids becoming a CoW + trap by using `CLONE_VFORK` and `CLONE_VM` together. + +- While we avoid forking off threads on our own, writing thread-safe code is a + good idea where it might end up running inside of libsystemd.so or + similar. Hence, use TLS (i.e. `thread_local`) where appropriate, and maybe + the occasional `pthread_once()`. 
diff --git a/docs/CONTAINER_INTERFACE.md b/docs/CONTAINER_INTERFACE.md new file mode 100644 index 0000000..7fa8558 --- /dev/null +++ b/docs/CONTAINER_INTERFACE.md @@ -0,0 +1,397 @@ +--- +title: Container Interface +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# The Container Interface + +Also consult [Writing Virtual Machine or Container +Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers). + +systemd has a number of interfaces for interacting with container managers, +when systemd is used inside of an OS container. If you work on a container +manager, please consider supporting the following interfaces. + +## Execution Environment + +1. If the container manager wants to control the hostname for a container + running systemd it may just set it before invoking systemd, and systemd will + leave it unmodified when there is no hostname configured in `/etc/hostname` + (that file overrides whatever is pre-initialized by the container manager). + +2. Make sure to pre-mount `/proc/`, `/sys/`, and `/sys/fs/selinux/` before + invoking systemd, and mount `/sys/`, `/sys/fs/selinux/` and `/proc/sys/` + read-only (the latter via e.g. a read-only bind mount on itself) in order + to prevent the container from altering the host kernel's configuration + settings. (As a special exception, if your container has network namespaces + enabled, feel free to make `/proc/sys/net/` writable. If it also has user, ipc, + uts and pid namespaces enabled, the entire `/proc/sys` can be left writable). + systemd and various other subsystems (such as the SELinux userspace) have + been modified to behave accordingly when these file systems are read-only. + (It's OK to mount `/sys/` as `tmpfs` btw, and only mount a subset of its + sub-trees from the real `sysfs` to hide `/sys/firmware/`, `/sys/kernel/` and + so on. 
If you do that, make sure to mark `/sys/` read-only, as that + condition is what systemd looks for, and is what is considered to be the API + in this context.) + +3. Pre-mount `/dev/` as (container private) `tmpfs` for the container and bind + mount some suitable TTY to `/dev/console`. If this is a pty, make sure to + not close the controlling pty during systemd's lifetime. PID 1 will close + ttys, to avoid being killed by SAK. It only opens ttys for the time it + actually needs to print something. Also, make sure to create device nodes + for `/dev/null`, `/dev/zero`, `/dev/full`, `/dev/random`, `/dev/urandom`, + `/dev/tty`, `/dev/ptmx` in `/dev/`. It is not necessary to create `/dev/fd` + or `/dev/stdout`, as systemd will do that on its own. Make sure to set up a + `BPF_PROG_TYPE_CGROUP_DEVICE` BPF program — on cgroupv2 — or the `devices` + cgroup controller — on cgroupv1 — so that no other devices but these may be + created in the container. Note that many systemd services use + `PrivateDevices=`, which means that systemd will set up a private `/dev/` + for them for which it needs to be able to create these device nodes. + Dropping `CAP_MKNOD` for containers is hence generally not advisable, but + see below. + +4. `systemd-udevd` is not available in containers (and refuses to start), and + hence device dependencies are unavailable. The `systemd-udevd` unit files + will check for `/sys/` being read-only, as an indication whether device + management can work. Therefore make sure to mount `/sys/` read-only in the + container (see above). Various clients of `systemd-udevd` also check the + read-only state of `/sys/`, including PID 1 itself and `systemd-networkd`. + +5. If systemd detects it is run in a container it will spawn a single shell on + `/dev/console`, and not care about VTs or multiple gettys on VTs. (But see + `$container_ttys` below.) + +6. 
Either pre-mount all cgroup hierarchies in full into the container, or leave + that to systemd which will do so if they are missing. Note that it is + explicitly *not* OK to just mount a sub-hierarchy into the container as that + is incompatible with `/proc/$PID/cgroup` (which lists full paths). Also the + root-level cgroup directories tend to be quite different from inner + directories, and that distinction matters. It is OK, however, to mount the + "upper" parts of the hierarchies read-only, and only allow write-access to + the cgroup sub-tree the container runs in. It's also a good idea to mount + all controller hierarchies with the exception of `name=systemd` fully read-only + (this only applies to cgroupv1, of course), to protect the controllers from + alteration from inside the containers. Or to turn this around: only the + cgroup sub-tree of the container itself (on cgroupv2 in the unified + hierarchy, and on cgroupv1 in the `name=systemd` hierarchy) may be writable + to the container. + +7. Create the control group root of your container by either running your + container as a service (in case you have one container manager instance per + container instance) or creating one scope unit for each container instance + via systemd's transient unit API (in case you have one container manager + that manages all instances). Either way, make sure to set `Delegate=yes` in + it. This ensures that the unit you created will be part of all cgroup + controllers (or at least the ones systemd understands). The latter may also + be done via `systemd-machined`'s `CreateMachine()` API. Make sure to use the + cgroup path systemd put your process in for all operations of the container. + Do not add new cgroup directories to the top of the tree. This will not only + confuse systemd and the admin, but also prevent your implementation from + being "stackable". + +## Environment Variables + +1.
To allow systemd (and other programs) to identify that it is executed within + a container, please set the `$container` environment variable for PID 1 in + the container to a short lowercase string identifying your + implementation. With this in place the `ConditionVirtualization=` setting in + unit files will work properly. Example: `container=lxc-libvirt` + +2. systemd has special support for allowing container managers to initialize + the UUID for `/etc/machine-id` to some manager supplied value. This is only + enabled if `/etc/machine-id` is empty (i.e. not yet set) at boot time of the + container. The container manager should set `$container_uuid` as environment + variable for the container's PID 1 to the container UUID. (This is similar + to the effect of `qemu`'s `-uuid` switch). Note that you should pass only a + UUID here that is actually unique (i.e. only one running container should + have a specific UUID), and gets changed when a container gets duplicated. + Also note that systemd will try to persistently store the UUID in + `/etc/machine-id` (if writable) when this option is used, hence you should + always pass the same UUID here. Keeping the externally used UUID for a + container and the internal one in sync is hopefully useful to minimize + surprise for the administrator. + +3. systemd can automatically spawn login gettys on additional ptys. A container + manager can set the `$container_ttys` environment variable for the + container's PID 1 to tell it on which ptys to spawn gettys. The variable + should take a space separated list of pty names, without the leading `/dev/` + prefix, but with the `pts/` prefix included. Note that despite the + variable's name you may only specify ptys, and not other types of ttys. Also + you need to specify the pty itself, a symlink will not suffice. This is + implemented in + [systemd-getty-generator(8)](https://www.freedesktop.org/software/systemd/man/systemd-getty-generator.html). 
+ Note that this variable should not include the pty that `/dev/console` maps + to if it maps to one (see below). Example: if the container receives + `container_ttys=pts/7 pts/8 pts/14` it will spawn three additional login + gettys on ptys 7, 8, and 14. + +4. To allow applications to detect the OS version and other metadata of the host + running the container manager, if this is considered desirable, please parse + the host's `/etc/os-release` and set a `$container_host_<key>=<VALUE>` + environment variable for the ID fields described by the [os-release + interface](https://www.freedesktop.org/software/systemd/man/os-release.html), e.g.: + `$container_host_id=debian` + `$container_host_build_id=2020-06-15` + `$container_host_variant_id=server` + `$container_host_version_id=10` + +5. systemd supports passing immutable binary data blobs with limited size and + restricted access to services via the `ImportCredential=`, `LoadCredential=` + and `SetCredential=` settings. The same protocol may be used to pass credentials + from the container manager to systemd itself. The credential data should be + placed in some location (ideally a read-only and non-swappable file system, + like 'ramfs'), and the absolute path to this directory exported in the + `$CREDENTIALS_DIRECTORY` environment variable. If the container manager + does this, the credentials passed to the service manager can be propagated + to services via `LoadCredential=` or `ImportCredential=` (see ...). The + container manager can choose any path, but `/run/host/credentials` is + recommended. + +## Advanced Integration + +1. Consider syncing `/etc/localtime` from the host file system into the + container. Make it a relative symlink to the container's zoneinfo dir, as + usual. Tools rely on being able to determine the timezone setting from the + symlink value, and making it relative looks nice even if people list the + container's `/etc/` from the host. + +2.
Make the container journal available in the host, by automatically + symlinking the container journal directory into the host journal directory. + More precisely, link `/var/log/journal/<container-machine-id>` of the + container into the same dir of the host. Administrators can then + automatically browse all container journals (correctly interleaved) by + issuing `journalctl -m`. The container machine ID can be determined from + `/etc/machine-id` in the container. + +3. If the container manager wants to cleanly shut down the container, it might + be a good idea to send `SIGRTMIN+3` to its init process. systemd will then + do a clean shutdown. Note, however, that since only systemd understands + `SIGRTMIN+3` like this, this might confuse other init systems. + +4. To support [Socket Activated + Containers](https://0pointer.de/blog/projects/socket-activated-containers.html) + the container manager should be capable of being run as a systemd + service. It will then receive the sockets starting with FD 3, the number of + passed FDs in `$LISTEN_FDS` and its PID as `$LISTEN_PID`. It should take + these and pass them on to the container's init process, also setting + `$LISTEN_FDS` and `$LISTEN_PID` (basically, it can just leave the FDs and + `$LISTEN_FDS` untouched, but it needs to adjust `$LISTEN_PID` to the + container init process). That's all that's necessary to make socket + activation work. The protocol to hand sockets from systemd to services is + hence the same as from the container manager to the container systemd. For + further details see the explanations of + [sd_listen_fds(3)](https://0pointer.de/public/systemd-man/sd_listen_fds.html) + and the [blog story for service + developers](https://0pointer.de/blog/projects/socket-activation.html). + +5. Container managers should stay away from the cgroup hierarchy outside of the + unit they created for their container. That's private property of systemd, + and no other code should modify it. + +6.
systemd running inside the container can report when boot-up is complete + using the usual `sd_notify()` protocol that is also used when a service + wants to tell the service manager about readiness. A container manager can + set the `$NOTIFY_SOCKET` environment variable to a suitable socket path to + make use of this functionality. (Also see information about + `/run/host/notify` below.) + +## Networking + +1. Inside of a container, if a `veth` link is named `host0`, `systemd-networkd` + running inside of the container will by default run DHCPv4, DHCPv6, and + IPv4LL clients on it. It is thus recommended that container managers that + add a `veth` link to a container name it `host0`, to get an automatically + configured network, with no manual setup. + +2. Outside of a container, if a `veth` link is prefixed "ve-", `systemd-networkd` + will by default run DHCPv4 and DHCPv6 servers on it, as well as IPv4LL. It + is thus recommended that container managers that add a `veth` link to a + container name the external side `ve-` + the container name. + +3. It is recommended to configure stable MAC addresses for container `veth` + devices, for example, hashed out of the container names. That way it is more + likely that DHCP and IPv4LL will acquire stable addresses. + +## The `/run/host/` Hierarchy + +Container managers may place certain resources the manager wants to provide to +the container payload below the `/run/host/` hierarchy. This hierarchy should +be mostly immutable (possibly some subdirs might be writable, but the top-level +hierarchy — and probably most subdirs should be read-only to the +container). Note that this hierarchy is used by various container managers, and +care should be taken to avoid naming conflicts. `systemd` (and in particular +`systemd-nspawn`) use the hierarchy for the following resources: + +1. 
The `/run/host/incoming/` directory mount point is configured for `MS_SLAVE` + mount propagation with the host, and is used as an intermediary location for + mounts to establish in the container, for the implementation of `machinectl + bind`. Container payload should usually not directly interact with this + directory: it's used by code outside the container to insert mounts inside + it only, and is mostly an internal vehicle to achieve this. Other container + managers that want to implement similar functionality might consider using + the same directory. + +2. The `/run/host/inaccessible/` directory may be set up by the container + manager to include six file nodes: `reg`, `dir`, `fifo`, `sock`, `chr`, + `blk`. These nodes correspond with the six types of file nodes Linux knows + (with the exception of symlinks). Each node should be of the specific type + and have an all zero access mode, i.e. be inaccessible. The two device node + types should have major and minor of zero (which are unallocated devices on + Linux). These nodes are used as mount source for implementing the + `InaccessiblePaths=` setting of unit files, i.e. file nodes to mask this way + are overmounted with these "inaccessible" inodes, guaranteeing that the file + node type does not change this way but the nodes still become + inaccessible. Note that systemd when run as PID 1 in the container payload + will create these nodes on its own if not passed in by the container + manager. However, in that case it likely lacks the privileges to create the + character and block device nodes (there are fallbacks for this case). + +3. The `/run/host/notify` path is a good choice to place the `sd_notify()` + socket in, that may be used for the container's PID 1 to report to the + container manager when boot-up is complete. 
The path used for this doesn't + matter much as it is communicated via the `$NOTIFY_SOCKET` environment + variable, following the usual protocol for this, however it's a suitable, and + the recommended, place for this socket in case ready notification is desired. + +4. The `/run/host/os-release` file contains the `/etc/os-release` file of the + host, i.e. may be used by the container payload to gather limited + information about the host environment, on top of what `uname -a` reports. + +5. The `/run/host/container-manager` file may be used to pass the same + information as the `$container` environment variable (see above), i.e. a + short string identifying the container manager implementation. This file + should be newline terminated. Passing this information via this file has the + benefit that payload code can easily access it, even when running + unprivileged without access to the container PID 1's environment block. + +6. The `/run/host/container-uuid` file may be used to pass the same information + as the `$container_uuid` environment variable (see above). This file should + be newline terminated. + +7. The `/run/host/credentials/` directory is a good place to pass credentials + into the container, using the `$CREDENTIALS_DIRECTORY` protocol, see above. + +## What You Shouldn't Do + +1. Do not drop `CAP_MKNOD` from the container. `PrivateDevices=` is a commonly + used service setting that provides a service with its own, private, minimal + version of `/dev/`. To set this up systemd in the container needs this + capability. If you take away the capability, then all services that set this + flag will cease to work. Use `BPF_PROG_TYPE_CGROUP_DEVICE` BPF programs — on + cgroupv2 — or the `devices` controller — on cgroupv1 — to restrict what + device nodes the container can create instead of taking away the capability + wholesale. (Also see the section about fully unprivileged containers below.) + +2. Do not drop `CAP_SYS_ADMIN` from the container. 
A number of the most + commonly used file system namespacing related settings, such as + `PrivateDevices=`, `ProtectHome=`, `ProtectSystem=`, `MountFlags=`, + `PrivateTmp=`, `ReadWriteDirectories=`, `ReadOnlyDirectories=`, and + `InaccessibleDirectories=` need to be able to open new + mount namespaces and then mount certain file systems into them. You break all + services that make use of these options if you drop the capability. Also + note that logind mounts `XDG_RUNTIME_DIR` as `tmpfs` for all logged in users + and that won't work either if you take away the capability. (Also see + the section about fully unprivileged containers below.) + +3. Do not cross-link `/dev/kmsg` with `/dev/console`. They are different things, + you cannot link them to each other. + +4. Do not pretend that the real VTs are available in the container. The VT + subsystem consists of all the devices `/dev/tty[0-9]*`, `/dev/vcs*`, + `/dev/vcsa*` plus their `sysfs` counterparts. They speak specific `ioctl()`s + and understand specific escape sequences, that other ptys don't understand. + Hence, it is explicitly not OK to mount a pty to `/dev/tty1`, `/dev/tty2`, + `/dev/tty3`. This is explicitly not supported. + +5. Don't pretend that passing arbitrary devices to containers could really work + well. For example, do not pass device nodes for block devices to the + container. Device access (with the exception of network devices) is not + virtualized on Linux. Enumeration and probing of meta information from + `/sys/` and elsewhere is not possible to do correctly in a container. Simply + adding a specific device node to a container's `/dev/` is *not* *enough* to + do the job, as `systemd-udevd` and suchlike are not available at all, and no + devices will appear available or enumerable, inside the container. + +6. Don't mount only a sub-tree of the `cgroupfs` into the container. 
This will not + work as `/proc/$PID/cgroup` lists full paths and cannot be matched up with + the actual `cgroupfs` tree visible, then. (You may "prune" some branches + though, see above.) + +7. Do not make `/sys/` writable in the container. If you do, + `systemd-udevd.service` is started to manage your devices — inside the + container, but that will cause conflicts and errors given that the Linux + device model is not virtualized for containers on Linux and thus the + containers and the host would try to manage the same devices, fighting for + ownership. Multiple other subsystems of systemd similarly test for `/sys/` + being writable to decide whether to use `systemd-udevd` or assume that + device management is properly available on the instance. Among them + `systemd-networkd` and `systemd-logind`. The conditionalization on the + read-only state of `/sys/` enables a nice automatism: as soon as `/sys/` and + the Linux device model are changed to be virtualized properly the container + payload can make use of that, simply by marking `/sys/` writable. (Note that + as special exception, the devices in `/sys/class/net/` are virtualized + already, if network namespacing is used. Thus it is OK to mount the relevant + sub-directories of `/sys/` writable, but make sure to leave the root of + `/sys/` read-only.) + +8. Do not pass the `CAP_AUDIT_CONTROL`, `CAP_AUDIT_READ`, `CAP_AUDIT_WRITE` + capabilities to the container, in particular not to those making use of user + namespaces. The kernel's audit subsystem is still not virtualized for + containers, and passing these credentials is pointless hence, given the + actual attempt to make use of the audit subsystem will fail. Note that + systemd's audit support is partially conditioned on these capabilities, thus + by dropping them you ensure that you get an entirely clean boot, as systemd + will make no attempt to use it. 
If you pass the capabilities to the payload + systemd will assume that audit is available and works, and some components + will subsequently fail in various ways. Note that once the kernel learns + native support for container-virtualized audit, adding the capability to the + container description will automatically make the container payload use it. + +## Fully Unprivileged Container Payload + +First things first, to make this clear: Linux containers are not a security +technology right now. There are more holes in the model than in Swiss cheese. + +For example: if you do not use user namespacing, and share root and other users +between container and host, the `struct user` structures will be shared between +host and container, and hence `RLIMIT_NPROC` and so on of the container users +affect the host and other containers, and vice versa. This is a major security +hole, and actually is a real-life problem: since Avahi sets `RLIMIT_NPROC` of +its user to 2 (to effectively disallow `fork()`ing) you cannot run more than +one Avahi instance on the entire system... + +People have been asking to be able to run systemd without `CAP_SYS_ADMIN` and +`CAP_MKNOD` in the container. This is now supported to some level in +systemd, but we recommend against it (see above). If `CAP_SYS_ADMIN` and +`CAP_MKNOD` are missing from the container systemd will now gracefully turn +off `PrivateTmp=`, `PrivateNetwork=`, `ProtectHome=`, `ProtectSystem=` and +others, because those capabilities are required to implement these options. The +services using these settings (which include many of systemd's own) will hence +run in a different, less secure environment when the capabilities are missing +than with them around. + +With user namespacing in place things get much better. 
With user namespaces the +`struct user` issue described above goes away, and containers can keep +`CAP_SYS_ADMIN` safely for the user namespace, as capabilities are virtualized +and having capabilities inside a container doesn't mean one also has them +outside. + +## Final Words + +If you write software that wants to detect whether it is run in a container, +please check `/proc/1/environ` and look for the `container=` environment +variable. Do not assume the environment variable is inherited down the process +tree. It generally is not. Hence check the environment block of PID 1, not your +own. Note though that this file is only accessible to root. systemd hence early +on also copies the value into `/run/systemd/container`, which is readable for +everybody. However, that's a systemd-specific interface and other init systems +are unlikely to do the same. + +Note that it is our intention to make systemd systems work flawlessly and +out-of-the-box in containers. In fact, we are interested to ensure that the same +OS image can be booted on a bare system, in a VM and in a container, and behave +correctly each time. If you notice that some component in systemd does not work +in a container as it should, even though the container manager implements +everything documented above, please contact us. diff --git a/docs/CONTRIBUTING.md b/docs/CONTRIBUTING.md new file mode 100644 index 0000000..f599972 --- /dev/null +++ b/docs/CONTRIBUTING.md @@ -0,0 +1,112 @@ +--- +title: Contributing +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Contributing + +We welcome contributions from everyone. However, please follow the following guidelines when posting a GitHub Pull Request or filing a GitHub Issue on the systemd project: + +## Filing Issues + +* We use [GitHub Issues](https://github.com/systemd/systemd/issues) **exclusively** for tracking **bugs** and **feature** **requests** (RFEs) of systemd. 
If you are looking for help, please try the forums of your distribution first, or [systemd-devel mailing list](https://lists.freedesktop.org/mailman/listinfo/systemd-devel) for general questions about systemd. +* We only track bugs in the **two** **most** **recently** **released** (non-rc) **versions** of systemd in the GitHub Issue tracker. If you are using an older version of systemd, please contact your distribution's bug tracker instead (see below). See [GitHub Release Page](https://github.com/systemd/systemd/releases) for the list of most recent releases. +* When filing a feature request issue (RFE), please always check first if the newest upstream version of systemd already implements the feature, and whether there's already an issue filed for your feature by someone else. +* When filing an issue, specify the **systemd** **version** you are experiencing the issue with. Also, indicate which **distribution** you are using. +* Please include an explanation how to reproduce the issue you are pointing out. + +Following these guidelines makes it easier for us to process your issue, and ensures we won't close your issue right-away for being misfiled. + +### Older downstream versions + +For older versions that are still supported by your distribution please use respective downstream tracker: + +* **Fedora** - [bugzilla](https://bugzilla.redhat.com/enter_bug.cgi?product=Fedora&component=systemd) +* **RHEL/CentOS stream** - [Jira](https://issues.redhat.com/secure/CreateIssueDetails!init.jspa?pid=12332745&issuetype=1&components=12380515&priority=10300) or [contribute to systemd-rhel @GitHub](https://github.com/redhat-plumbers#systemd) +* **Debian** - [bugs.debian.org](https://bugs.debian.org/cgi-bin/pkgreport.cgi?pkg=systemd) + +## Security vulnerability reports + +See [reporting of security vulnerabilities](SECURITY). + +## Posting Pull Requests + +* Make sure to post PRs only relative to a recent tip of the `main` branch. 
+* Follow our [Coding Style](CODING_STYLE) when contributing code. This is a requirement for all code we merge. +* Please make sure to test your change before submitting the PR. See the [Hacking guide](HACKING) for details on how to do this. +* Make sure to run the test suite locally, before posting your PR. We use a CI system, meaning we don't even look at your PR if the build and tests don't pass. +* If you need to update the code in an existing PR, force-push into the same branch, overriding old commits with new versions. +* After you have pushed a new version, add a comment explaining the latest changes. If you are a member of the systemd project on GitHub, remove the `reviewed/needs-rework`/`ci-fails/needs-rework`/`needs-rebase` labels. +* If you are copying existing code from another source (eg: a compat header), please make sure the license is compatible with `LGPL-2.1-or-later`. If the license is not `LGPL-2.1-or-later`, please add a note to [`LICENSES/README.md`](https://github.com/systemd/systemd/blob/main/LICENSES/README.md). +* If the pull request stalls without review, post a ping in a comment after some time has passed. We are always short on reviewer time, and pull requests which haven't seen any recent activity can be easily forgotten. +* Github will automatically add the `please-review` label when a pull request is opened or updated. If you need +more information after a review, you can comment `/please-review` on the pull request to have Github add the +`please-review` label to the pull request. 
+ +## Reviewing Pull Requests + +* See [filtered list of pull requests](https://github.com/systemd/systemd/pulls?q=is%3Aopen+is%3Apr+-label%3A%22reviewed%2Fneeds-rework+%F0%9F%94%A8%22+-label%3Aneeds-rebase+-label%3Agood-to-merge%2Fwith-minor-suggestions+-label%3A%22good-to-merge%2Fwaiting-for-ci+%F0%9F%91%8D%22+-label%3Apostponed+-label%3A%22needs-reporter-feedback+%E2%9D%93%22+-label%3A%22dont-merge+%F0%9F%92%A3%22+-label%3A%22ci-fails%2Fneeds-rework+%F0%9F%94%A5%22+sort%3Aupdated-desc) for requests that are ready for review. +* After performing a review, set + + * `reviewed/needs-rework` if the pull request needs significant changes + * `ci-fails/needs-rework` if the automatic tests fail and the failure is relevant to the pull request + * `ci-failure-appears-unrelated` if the test failures seem irrelevant + * `needs-rebase` if the pull request needs a rebase because of conflicts + * `good-to-merge/waiting-for-ci` if the pull request should be merged without further review + * `good-to-merge/with-minor-suggestions` if the pull request should be merged after an update without going through another round of reviews + +Unfortunately only members of the `systemd` organization on github can change labels. +If your pull request is mislabeled, make a comment in the pull request and somebody will fix it. +Reviews from non-members are still welcome. + +## Final Words + +We'd like to apologize in advance if we are not able to process and reply to your issue or PR right-away. We have a lot of work to do, but we are trying our best! + +Thank you very much for your contributions! + +# Backward Compatibility And External Dependencies + +We strive to keep backward compatibility where possible and reasonable. The following are general guidelines, not hard +rules, and case-by-case exceptions might be applied at the discretion of the maintainers. 
The current set of build-time +and runtime dependencies are documented in the [README](https://github.com/systemd/systemd/blob/main/README). + +## New features + +It is fine for new features/functionality/tools/daemons to require bleeding edge external dependencies, provided there +are runtime and build-time graceful fallbacks (e.g.: a daemon will not be built, runtime functionality will be skipped with a clear log message). +In case a new feature is added to both `systemd` and one of its dependencies, we expect the corresponding feature code to +be merged upstream in the dependency before accepting our side of the implementation. +Making use of new kernel syscalls can be achieved through compat wrappers in our tree (see: `src/basic/missing_syscall_def.h`), +and does not need to wait for glibc support. + +## External Build/Runtime Dependencies + +It is often tempting to bump external dependencies' minimum versions to cut cruft, and in general it's an essential part +of the maintenance process. But as a general rule, existing dependencies should not be bumped without strong +reasons. When possible, we try to keep compatibility with the most recent LTS releases of each mainstream distribution +for optional components, and with all currently maintained (i.e.: not EOL) LTS releases for core components. When in +doubt, ask before committing time to work on contributions if it's not clear that cutting support would be obviously +acceptable. + +## Kernel Requirements + +Same principles as with other dependencies should be applied. It is fine to require newer kernel versions for additional +functionality or optional features, but very strong reasons should be required for breaking compatibility for existing +functionality, especially for core components. It is not uncommon, for example, for embedded systems to be stuck on older +kernel versions due to hardware requirements, so do not assume everybody is running with latest and greatest at all times. 
+In general, [currently maintained LTS branches](https://www.kernel.org/category/releases.html) should keep being supported +for existing functionality. + +## `libsystemd.so` + +`libsystemd.so` is a shared public library, so breaking ABI/API compatibility would create a lot of work for everyone, and is not allowed. Always add a new interface instead of modifying +the signature of an existing function. It is fine to mark an interface as deprecated to gently nudge users toward a newer one, +but support for the old one must be maintained. +Symbol versioning and the compiler's deprecated attribute should be used when managing the lifetime of a public interface. + +## `libudev.so` + +`libudev.so` is a shared public library, and is still maintained, but should not gain new symbols at this point. diff --git a/docs/CONTROL_GROUP_INTERFACE.md b/docs/CONTROL_GROUP_INTERFACE.md new file mode 100644 index 0000000..11dc6a3 --- /dev/null +++ b/docs/CONTROL_GROUP_INTERFACE.md @@ -0,0 +1,240 @@ +--- +title: New Control Group Interfaces +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# The New Control Group Interfaces + +> _aka "I want to make use of kernel cgroups, how do I do this in the new world order?"_ + +Starting with version 205 systemd provides a number of interfaces that may be used to create and manage labelled groups of processes for the purpose of monitoring and controlling them and their resource usage. This is built on top of the Linux kernel Control Groups ("cgroups") facility. Previously, the kernel's cgroups API was exposed directly as shared application API, following the rules of the [Pax Control Groups](http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups/) document. However, the kernel cgroup interface has been reworked into an API that requires that each individual cgroup is managed by a single writer only. 
With this change the main cgroup tree becomes private property of that userspace component and is no longer a shared resource. On systemd systems PID 1 takes this role and hence needs to provide APIs for clients to benefit from the control groups functionality of the kernel. Note that services running on systemd systems may manage their own subtrees of the cgroups tree, as long as they explicitly turn on delegation mode for them (see below). + +That means explicitly, that: + +1. The root control group may only be written to by systemd (PID 1). Services that create and manipulate control groups in the top level cgroup are in direct conflict with the kernel's requirement that each control group should have a single writer only. +2. Services must set `Delegate=yes` for the units they intend to manage subcgroups of. If they create and manipulate cgroups outside of units that have `Delegate=yes` set, they violate the access contract for control groups. + +For a more high-level background story, please have a look at this [Linux Foundation News Story](http://www.linuxfoundation.org/news-media/blogs/browse/2013/08/all-about-linux-kernel-cgroup%E2%80%99s-redesign). + +### Why this all again? + +- Objects placed in the same level of the cgroup tree frequently need to propagate properties from one to another. For example, when using the "cpu" controller for one object then all objects on the same level need to do the same, otherwise the entire cgroup of the first object will be scheduled against the individual processes of the others, thus giving the first object a drastic scheduling penalty if it uses many processes. +- Similarly, some properties also require propagation up the tree. +- The tree needs to be refreshed/built in scheduled steps as devices show up/go away as controllers like "blkio" or "devices" refer to devices via major/minor device node indexes, which are not fixed but determined only as a device appears. 
+- The tree also needs refreshing/rebuilding as new services are installed/started/instantiated/stopped/uninstalled. +- Many of the cgroup attributes are too low-level as API. For example, the major/minor device interface in order to be useful requires a userspace component for translating stable device paths into major/minor at the right time. +- By unifying the cgroup logic under a single arbiter it is possible to write tools that can manage all objects the system contains, including services, virtual machines, containers and whatever else applications register. +- By unifying the cgroup logic under a single arbiter a good default that encompasses all kinds of objects may be shipped, thus making manual configuration unnecessary to benefit from basic resource control. + +systemd through its "unit" concept already implements a dependency network between objects where propagation can take place and contains a powerful execution queue. Also, a major part of the objects for which resources need to be controlled are already systemd objects, most prominently the services systemd manages. + +### Why is this not managed by a component independent of systemd? + +Well, as mentioned above, a dependency network between objects, usable for propagation, combined with a powerful execution engine is basically what systemd _is_. Since cgroups management requires precisely this it is an obvious choice to simply implement this in systemd itself. + +Implementing a similar propagation/dependency network with execution scheduler outside of systemd in an independent "cgroup" daemon would basically mean reimplementing systemd a second time. Also, accessing such an external service from PID 1 for managing other services would result in cyclic dependencies between PID 1 which would need this functionality to manage the cgroup service which would only be available however after that service finished starting up. 
Such cyclic dependencies can certainly be worked around, but make such a design complex. + +### I don't use systemd, what does this mean for me? + +Nothing. This page is about systemd's cgroups APIs. If you don't use systemd then the kernel cgroup rework will probably affect you eventually, but a different component will be the single writer userspace daemon managing the cgroup tree, with different APIs. Note that the APIs described here expose a lot of systemd-specific concepts and hence are unlikely to be available outside of systemd systems. + +### I want to write cgroup code that should work on both systemd systems and others (such as Ubuntu), what should I do? + +On systemd systems use the systemd APIs as described below. At this time we are not aware of any component that would take the cgroup managing role on Upstart/sysvinit systems, so we cannot help you with this. Sorry. + +### What's the timeframe of this? Do I need to care now? + +In the short-term future writing directly to the control group tree from applications should still be OK, as long as the [Pax Control Groups](http://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups/) document is followed. In the medium-term future it will still be supported to alter/read individual attributes of cgroups directly, but no longer to create/delete cgroups without using the systemd API. In the longer-term future altering/reading attributes will also be unavailable to userspace applications, unless done via systemd's APIs (either D-Bus based IPC APIs or shared library APIs for _passive_ operations). + +It is recommended to use the new systemd APIs described below in any case. Note that the kernel cgroup interface is currently being reworked (available when the "sane_behaviour" kernel option is used). This will change the cgroupfs interface. By using systemd's APIs this change is abstracted away and invisible to applications. 
+ +## systemd's Resource Control Concepts + +Systemd provides three unit types that are useful for the purpose of resource control: + +- [_Services_](http://www.freedesktop.org/software/systemd/man/systemd.service.html) encapsulate a number of processes that are started and stopped by systemd based on configuration. Services are named in the style of `quux.service`. +- [_Scopes_](http://www.freedesktop.org/software/systemd/man/systemd.scope.html) encapsulate a number of processes that are started and stopped by arbitrary processes via fork(), and then registered at runtime with PID1. Scopes are named in the style of `wuff.scope`. +- [_Slices_](http://www.freedesktop.org/software/systemd/man/systemd.slice.html) may be used to group a number of services and scopes together in a hierarchical tree. Slices do not contain processes themselves, but the services and slices contained in them do. Slices are named in the style of `foobar-waldo.slice`, where the path to the location of the slice in the tree is encoded in the name with "-" as separator for the path components (`foobar-waldo.slice` is hence a subslice of `foobar.slice`). There's one special slice defined, `-.slice`, which is the root slice of all slices (`foobar.slice` is hence a subslice of `-.slice`). This is similar to how in regular file paths, "/" denotes the root directory. + +Service, scope and slice units directly map to objects in the cgroup tree. When these units are activated they each map directly (modulo some character escaping) to cgroup paths built from the unit names. For example, a service `quux.service` in a slice `foobar-waldo.slice` is found in the cgroup `foobar.slice/foobar-waldo.slice/quux.service/`. + +Services, scopes and slices may be created freely by the administrator or dynamically by programs. However by default the OS defines a number of built-in services that are necessary to start up the system. 
Also, there are four slices defined by default: first of all the root slice `-.slice` (as mentioned above), but also `system.slice`, `machine.slice`, `user.slice`. By default all system services are placed in `system.slice`, all virtual machines and containers in `machine.slice`, and user sessions in `user.slice`. However, this is just a default, and the administrator may freely define new slices and assign services and scopes to them. Also note that all login sessions automatically are placed in an individual scope unit, as are VM and container processes. Finally, all users logging in will also get an implicit slice of their own where all the session scopes are placed. + +Here's an example of what the cgroup tree could look like (as generated with `systemd-cgls(1)`, see below): + +``` +├─user.slice +│ └─user-1000.slice +│ ├─session-18.scope +│ │ ├─703 login -- lennart +│ │ └─773 -bash +│ ├─session-1.scope +│ │ ├─ 518 gdm-session-worker [pam/gdm-autologin] +│ │ ├─ 540 gnome-session +│ │ ├─ 552 dbus-launch --sh-syntax --exit-with-session +│ │ ├─ 553 /bin/dbus-daemon --fork --print-pid 4 --print-address 6 --session +│ │ ├─ 589 /usr/libexec/gvfsd +│ │ ├─ 593 /usr/libexec//gvfsd-fuse -f /run/user/1000/gvfs +│ │ ├─ 598 /usr/libexec/gnome-settings-daemon +│ │ ├─ 617 /usr/bin/gnome-keyring-daemon --start --components=gpg +│ │ ├─ 630 /usr/bin/pulseaudio --start +│ │ ├─ 726 /usr/bin/gnome-shell +│ │ ├─ 728 syndaemon -i 1.0 -t -K -R +│ │ ├─ 736 /usr/libexec/gsd-printer +│ │ ├─ 742 /usr/libexec/dconf-service +│ │ ├─ 798 /usr/libexec/mission-control-5 +│ │ ├─ 802 /usr/libexec/goa-daemon +│ │ ├─ 823 /usr/libexec/gvfsd-metadata +│ │ ├─ 866 /usr/libexec/gvfs-udisks2-volume-monitor +│ │ ├─ 880 /usr/libexec/gvfs-gphoto2-volume-monitor +│ │ ├─ 886 /usr/libexec/gvfs-afc-volume-monitor +│ │ ├─ 891 /usr/libexec/gvfs-mtp-volume-monitor +│ │ ├─ 895 /usr/libexec/gvfs-goa-volume-monitor +│ │ ├─ 999 /usr/libexec/telepathy-logger +│ │ ├─ 1076 /usr/libexec/gnome-terminal-server +│ │ ├─ 1079 
gnome-pty-helper +│ │ ├─ 1080 bash +│ │ ├─ 1134 ssh-agent +│ │ ├─ 1137 gedit l +│ │ ├─ 1160 gpg-agent --daemon --write-env-file +│ │ ├─ 1371 /usr/lib64/firefox/firefox +│ │ ├─ 1729 systemd-cgls +│ │ ├─ 1929 bash +│ │ ├─ 2057 emacs src/login/org.freedesktop.login1.policy.in +│ │ ├─ 2060 /usr/libexec/gconfd-2 +│ │ ├─29634 /usr/libexec/gvfsd-http --spawner :1.5 /org/gtk/gvfs/exec_spaw/0 +│ │ └─31416 bash +│ └─user@1000.service +│ ├─532 /usr/lib/systemd/systemd --user +│ └─541 (sd-pam) +└─system.slice + ├─1 /usr/lib/systemd/systemd --system --deserialize 22 + ├─sshd.service + │ └─29701 /usr/sbin/sshd -D + ├─udisks2.service + │ └─743 /usr/lib/udisks2/udisksd --no-debug + ├─colord.service + │ └─727 /usr/libexec/colord + ├─upower.service + │ └─633 /usr/libexec/upowerd + ├─wpa_supplicant.service + │ └─488 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplicant.log -c /etc/wpa_supplicant/wpa_supplicant.conf -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid + ├─bluetooth.service + │ └─463 /usr/sbin/bluetoothd -n + ├─polkit.service + │ └─443 /usr/lib/polkit-1/polkitd --no-debug + ├─alsa-state.service + │ └─408 /usr/sbin/alsactl -s -n 19 -c -E ALSA_CONFIG_PATH=/etc/alsa/alsactl.conf --initfile=/lib/alsa/init/00main rdaemon + ├─systemd-udevd.service + │ └─253 /usr/lib/systemd/systemd-udevd + ├─systemd-journald.service + │ └─240 /usr/lib/systemd/systemd-journald + ├─rtkit-daemon.service + │ └─419 /usr/libexec/rtkit-daemon + ├─rpcbind.service + │ └─475 /sbin/rpcbind -w + ├─cups.service + │ └─731 /usr/sbin/cupsd -f + ├─avahi-daemon.service + │ ├─417 avahi-daemon: running [delta.local] + │ └─424 avahi-daemon: chroot helper + ├─dbus.service + │ ├─418 /bin/dbus-daemon --system --address=systemd: --nofork --nopidfile --systemd-activation + │ └─462 /usr/sbin/modem-manager + ├─accounts-daemon.service + │ └─416 /usr/libexec/accounts-daemon + ├─systemd-ask-password-wall.service + │ └─434 /usr/bin/systemd-tty-ask-password-agent --wall + ├─systemd-logind.service + │ └─415 
/usr/lib/systemd/systemd-logind + ├─ntpd.service + │ └─429 /usr/sbin/ntpd -u ntp:ntp -g + ├─rngd.service + │ └─412 /sbin/rngd -f + ├─libvirtd.service + │ └─467 /usr/sbin/libvirtd + ├─irqbalance.service + │ └─411 /usr/sbin/irqbalance --foreground + ├─crond.service + │ └─421 /usr/sbin/crond -n + ├─NetworkManager.service + │ ├─ 410 /usr/sbin/NetworkManager --no-daemon + │ ├─1066 /sbin/dhclient -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-enp0s20u2.pid -lf /var/lib/NetworkManager/dhclient-35c8218b-9e45-4b1f-b79e-22334f687340-enp0s20u2.lease -cf /var/lib/NetworkManager/dhclient-enp0s20u2.conf enp0s20u2 + │ └─1070 /sbin/dhclient -d -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-enp0s26u1u4i2.pid -lf /var/lib/NetworkManager/dhclient-f404f1ca-ccfe-4957-aead-dec19c126dea-enp0s26u1u4i2.lease -cf /var/lib/NetworkManager/dhclient-enp0s26u1u4i2.conf enp0s26u1u4i2 + └─gdm.service + ├─420 /usr/sbin/gdm + ├─449 /usr/libexec/gdm-simple-slave --display-id /org/gnome/DisplayManager/Displays/_0 + └─476 /usr/bin/Xorg :0 -background none -verbose -auth /run/gdm/auth-for-gdm-pJjwsi/database -seat seat0 -nolisten tcp vt1 +``` + +As you can see, services and scopes contain processes and are placed in slices, and slices do not contain processes of their own. Also note that the special "-.slice" is not shown as it is implicitly identified with the root of the entire tree. + +Resource limits may be set on services, scopes and slices the same way. All active service, scope and slice units may easily be viewed with the "systemctl" command. The hierarchy of services and scopes in the slice tree may be viewed with the "systemd-cgls" command. + +Service and slice units may be configured via unit files on disk, or alternatively be created dynamically at runtime via API calls to PID 1. Scope units may only be created at runtime via API calls to PID 1, but not from unit files on disk.
Units that are created dynamically at runtime via API calls are called _transient_ units. Transient units exist only during runtime and are released automatically as soon as they finish or are deactivated, or the system is rebooted. + +If a service/slice is configured via unit files on disk the resource controls may be configured with the settings documented in [systemd.resource-control(5)](http://www.freedesktop.org/software/systemd/man/systemd.resource-control.html). While units are running, the resource controls of services/slices/scopes may be reconfigured (with changes applying instantly) with a command line such as: + +``` +# systemctl set-property httpd.service CPUShares=500 MemoryLimit=500M +``` + +This makes the changes persistent, so that after the next reboot they are automatically applied right when the services are first started. By passing the `--runtime` switch the changes can alternatively be made in a volatile way so that they are lost on the next reboot. + +Note that the number of cgroup attributes currently exposed as unit properties is limited. This will be extended later on, as their kernel interfaces are cleaned up. For example cpuset or freezer are currently not exposed at all due to the broken inheritance semantics of the kernel logic. Also, migrating units to a different slice at runtime is not supported (i.e. altering the Slice= property for running units) as the kernel currently lacks atomic cgroup subtree moves. + +(Note that the resource control settings are actually also available on mount, swap and socket units. This is because they may also involve processes run for them. However, normally it should not be necessary to alter resource control settings on these unit types.) + +## The APIs + +Most relevant APIs are exposed via D-Bus; however, some _passive_ interfaces are available as a shared library, bypassing IPC so that they are much cheaper to call.
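
The naming rule described in the concepts section, where the position of a slice in the tree is encoded in its name, can be sketched as a small shell function. This is an illustrative sketch only: `slice_to_cgroup_path` is a hypothetical helper, not part of systemd, and it ignores the character escaping that systemd applies when building real cgroup paths.

```sh
#!/bin/sh
# Hypothetical helper (not a systemd tool): expand a slice unit name into
# the relative cgroup path it maps to, e.g.
#   foobar-waldo.slice -> foobar.slice/foobar-waldo.slice
# Assumption: names contain no characters that need escaping or globbing.
slice_to_cgroup_path() {
    name=${1%.slice}            # strip the ".slice" suffix
    if [ "$name" = "-" ]; then  # the root slice is the root of the tree
        return 0
    fi
    path=
    prefix=
    old_ifs=$IFS
    IFS=-                       # split the name at each "-"
    for part in $name; do
        prefix=${prefix:+$prefix-}$part    # foobar, then foobar-waldo, ...
        path=${path:+$path/}$prefix.slice  # append one path component
    done
    IFS=$old_ifs
    printf '%s\n' "$path"
}
```

Appending `/quux.service` to the result for `foobar-waldo.slice` gives the cgroup of the service used as an example in the concepts section.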
+ +### Creating and Starting + +To create and start a transient (scope, service or slice) unit in the cgroup tree use the `StartTransientUnit()` method on the `Manager` object exposed by systemd's PID 1 on the bus; see the [Bus API Documentation](http://www.freedesktop.org/wiki/Software/systemd/dbus/) for details. This call takes four arguments. The first argument is the full unit name you want this unit to be known under. This unit name is the handle to the unit, and is shown in the "systemctl" output and elsewhere. This name must be unique during runtime of the unit. You should generate a descriptive name that helps the administrator make sense of it. The second parameter is the mode, and should usually be `replace` or `fail`. The third parameter contains an array of initial properties to set for the unit. It is an array of pairs of property names as string and values as variant. Note that this is an array and not a dictionary! This is done to match the properties array of the `SetProperties()` call (see below). The fourth parameter is currently not used and should be passed as an empty array. This call will first create the transient unit and then immediately queue a start job for it. This call returns an object path to a `Job` object for the start job of this unit. + +### Properties + +The properties array of `StartTransientUnit()` may take many of the settings that may also be configured in unit files. Not all parameters are currently accepted, but we plan to cover more properties in future releases. Currently you may set the `Description`, `Slice` and all dependency types of units, as well as `RemainAfterExit`, `ExecStart` for service units, `TimeoutStopUSec` and `PIDs` for scope units, and `CPUAccounting`, `CPUShares`, `BlockIOAccounting`, `BlockIOWeight`, `BlockIOReadBandwidth`, `BlockIOWriteBandwidth`, `BlockIODeviceWeight`, `MemoryAccounting`, `MemoryLimit`, `DevicePolicy`, `DeviceAllow` for services/scopes/slices.
These fields map directly to their counterparts in unit files and as normal D-Bus object properties. The exception here is the `PIDs` field of scope units which is used for construction of the scope only and specifies the initial PIDs to add to the scope object. + +To alter resource control properties at runtime use the `SetUnitProperties()` call on the `Manager` object or `SetProperties()` on the individual Unit objects. This also takes an array of properties to set, in the same format as `StartTransientUnit()` takes. Note again that this is not a dictionary, and allows properties to be set multiple times with a single invocation. This is useful for array properties: if a property is assigned the empty array it will be reset to the empty array itself, however if it is assigned a non-empty array then this array is appended to the previous array. This mimics the behaviour of array settings in unit files. Note that most settings may only be set during creation of units with `StartTransientUnit()`, and may not be altered later on. The exceptions here are the resource control settings, more specifically `CPUAccounting`, `CPUShares`, `BlockIOAccounting`, `BlockIOWeight`, `BlockIOReadBandwidth`, `BlockIOWriteBandwidth`, `BlockIODeviceWeight`, `MemoryAccounting`, `MemoryLimit`, `DevicePolicy`, `DeviceAllow` for services/scopes/slices. Note that the standard D-Bus `org.freedesktop.DBus.Properties.Set()` call is currently not supported by any of the unit objects to set these properties, but might eventually (note however, that it is substantially less useful as it only allows setting a single property at a time, resulting in races). + +The [`systemctl set-property`](http://www.freedesktop.org/software/systemd/man/systemctl.html) command internally is little more than a wrapper around `SetUnitProperties()`. The [`systemd-run`](http://www.freedesktop.org/software/systemd/man/systemd-run.html) tool is a wrapper around `StartTransientUnit()`.
It may be used to either run a process as a transient service in the background, where it is invoked from PID 1, or alternatively as a scope unit in the foreground, where it is run from the `systemd-run` process itself. + +### Enumeration + +To acquire a list of currently running units, use the `ListUnits()` call on the Manager bus object. To determine the scope/service unit and slice unit a process is running in, use [`sd_pid_get_unit()`](http://www.freedesktop.org/software/systemd/man/sd_pid_get_unit.html) and `sd_pid_get_slice()`. These two calls are implemented in `libsystemd-login.so`. These calls bypass the system bus (which they can because they are passive and do not require privileges) and are hence very efficient to invoke. + +### VM and Container Managers + +Use these APIs to register any kind of process workload with systemd to be placed in a resource controlled cgroup. Note however that for containers and virtual machines it is better to use the [`machined`](http://www.freedesktop.org/wiki/Software/systemd/machined/) interfaces since they provide integration with "ps" and similar tools beyond what mere cgroup registration provides. Also see [Writing VM and Container Managers](http://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers/) for details. + +### Reading Accounting Information + +Note that there's currently no systemd API to retrieve accounting information from cgroups. For now, if you need to retrieve this information use `/proc/$PID/cgroup` to determine the cgroup path for your process in the `cpuacct` controller (or whichever controller matters to you), and then read the attributes directly from the cgroup tree. + +If you want to collect the exit status and other runtime parameters of your transient scope or service unit after the processes in it have ended, set the `RemainAfterExit` boolean property when creating it.
This has the effect that the unit will stay around even after all processes in it have died, in the `SubState="exited"` state. Simply watch for state changes until this state is reached, then read the status details from the various properties you need, and finally terminate the unit via `StopUnit()` on the `Manager` object or `Stop()` on the `Unit` object itself. + +### Becoming a Controller + +Optionally, it is possible for a program that registers a scope unit (the "scope manager") for one or more of its child processes to hook into the shutdown logic of the scope unit. Normally, if this is not done, and the scope needs to be shut down (regardless of whether this happens during normal operation when the user invokes `systemctl stop` -- or something equivalent -- on the scope unit, or during system shutdown), then systemd will simply send SIGTERM to its processes. After a timeout this will be followed by SIGKILL unless the scope processes have exited by then. If a scope manager program wants to be involved in the shutdown logic of its scopes it may set the `Controller` property of the scope unit when creating it via `StartTransientUnit()`. It should be set to the bus name (either unique name or well-known name) of the scope manager program. If this is done then instead of SIGTERM to the scope processes systemd will send the `RequestStop()` bus signal to the specified name. If the name is gone by then it will automatically fall back to SIGTERM, in order to make this robust. As before, in either case this will be followed by SIGKILL to the scope unit processes after a timeout. + +Scope units implement a special `Abandon()` method call. This method call is useful for informing the system manager that the scope unit is no longer managed by any scope manager process. Among other things it is useful for manager daemons which terminate but want to leave the scopes they started running.
When a scope is abandoned its state will be set to "abandoned" which is shown in the usual systemctl output, as information to the user. Also, if a controller has been set for the scope, it will be unset. Note that there is no strict need to ever invoke the `Abandon()` method call, however it is recommended for cases like the ones explained above. + +### Delegation + +Service and scope units support a special `Delegate` boolean property. If set, then the processes inside the scope or service may control their own control group subtree (that means: create subcgroups directly via /sys/fs/cgroup). The effect of the property is that: + +1. All controllers systemd knows are enabled for that scope/service, if the scope/service runs privileged code. +2. Access to the cgroup directory of the scope/service is permitted, and files and directories are updated to get write access for the user specified in `User=` if the scope/service runs unprivileged. Note that in this case access to any controllers is not available. +3. systemd will refrain from moving processes across the "delegation" boundary. + +Generally, the `Delegate` property is only useful for services that need to manage their own cgroup subtrees, such as container managers. After creating a unit with this property set, they should use `/proc/$PID/cgroup` to figure out the cgroup subtree path they may manage (the one from the name=systemd hierarchy!). Managers should refrain from making any changes to the cgroup tree outside of the subtrees for units they created with the `Delegate` flag turned on. + +Note that scope units created by `machined`'s `CreateMachine()` call have this flag set. + +### Example + +Please see the [systemd-run sources](http://cgit.freedesktop.org/systemd/systemd/plain/src/run/run.c) for a relatively simple example of how to create scope or service units transiently and pass properties to them.
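
As a smaller, self-contained illustration, the `/proc/$PID/cgroup` lookup mentioned under "Reading Accounting Information" and "Delegation" can be sketched in shell. This is only a sketch under assumptions: `cgroup_path_for` is a hypothetical helper name, and it assumes the `ID:controllers:path` line format of the legacy hierarchy this document describes (on a unified hierarchy the controller field is empty, so nothing would match).

```sh
#!/bin/sh
# Hypothetical helper: print the cgroup path recorded for one hierarchy
# in a /proc/PID/cgroup-style file.
#   $1: controller or named hierarchy, e.g. "cpuacct" or "name=systemd"
#   $2: file to read, e.g. /proc/self/cgroup
cgroup_path_for() {
    awk -F: -v want="$1" '
        {
            # The second field is a comma-separated controller list.
            n = split($2, ctrl, ",")
            for (i = 1; i <= n; i++)
                if (ctrl[i] == want) {
                    # Rejoin the remaining fields in case the path contains ":".
                    path = $3
                    for (j = 4; j <= NF; j++)
                        path = path ":" $j
                    print path
                    exit
                }
        }' "$2"
}
```

A manager that created a delegated unit might call `cgroup_path_for name=systemd "/proc/$PID/cgroup"` for one of the unit's processes and restrict its own cgroup changes to the subtree below the printed path.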
diff --git a/docs/CONVERTING_TO_HOMED.md b/docs/CONVERTING_TO_HOMED.md new file mode 100644 index 0000000..5416a22 --- /dev/null +++ b/docs/CONVERTING_TO_HOMED.md @@ -0,0 +1,136 @@ +--- +title: Converting Existing Users to systemd-homed +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Converting Existing Users to systemd-homed managed Users + +Traditionally on most Linux distributions, regular (human) users are managed +via entries in `/etc/passwd`, `/etc/shadow`, `/etc/group` and +`/etc/gshadow`. With the advent of +[`systemd-homed`](https://www.freedesktop.org/software/systemd/man/systemd-homed.service.html) +it might be desirable to convert an existing, traditional user account to a +`systemd-homed` managed one. Below is a brief guide on how to do that. + +Before continuing, please read up on these basic concepts: + +* [Home Directories](HOME_DIRECTORY) +* [JSON User Records](USER_RECORD) +* [JSON Group Records](GROUP_RECORD) +* [User/Group Record Lookup API via Varlink](USER_GROUP_API) + +## Caveat + +This is a manual process, and possibly a bit fragile. Hence, do this at your +own risk, read up beforehand, and make a backup first. You know what's at +stake: your own home directory, i.e. all your personal data. + +## Step-By-Step + +Here's the step-by-step guide: + +0. Preparations: make sure you run a distribution that has `systemd-homed` + enabled and properly set up, including the necessary PAM and NSS + configuration updates. Make sure you have enough disk space in `/home/` for + a (temporary) second copy of your home directory. Make sure to back up your + home directory. Make sure to log out of your user account fully. Then log in + as root on the console. + +1. Rename your existing home directory to something safe. Let's say your user + name is `foobar`. Then do: + + ``` + mv /home/foobar /home/foobar.saved + ``` + +2.
Have a look at your existing user record, as stored in `/etc/passwd` and + related files. We want to use the same data for the new record, hence it's worth + looking at the old data. Use commands such as: + + ``` + getent passwd foobar + getent shadow foobar + ``` + + This will tell you the `/etc/passwd` and `/etc/shadow` entries for your + user. For details about the fields, see the respective man pages + [passwd(5)](https://man7.org/linux/man-pages/man5/passwd.5.html) and + [shadow(5)](https://man7.org/linux/man-pages/man5/shadow.5.html). + + The fourth field in the `getent passwd foobar` output tells you the GID of + your user's main group. Depending on your distribution it's a group private + to the user, or a group shared by most local, regular users. Let's say the + GID reported is 1000, let's then query its details: + + ``` + getent group 1000 + ``` + + This will tell you the name of that group. If the name is the same as your + user name your distribution apparently provided you with a private group for + your user. If it doesn't match (and is something like `users`) it apparently + didn't. Note that `systemd-homed` will always manage a private group for + each user under the same name, hence if your distribution is one of the + latter kind, then there's a (minor) mismatch in structure when converting. + + Save the information reported by these three commands somewhere, for later + reference. + +3. Now edit your `/etc/passwd` file and remove your existing record + (i.e. delete a single line, the one of your user's account, leaving all + other lines unmodified). Do the same for `/etc/shadow`, `/etc/group` (in case + you have a private group for your user) and `/etc/gshadow`. Most + distributions provide you with tools for that, which add safe + synchronization for these changes: `vipw`, `vipw -s`, `vigr` and `vigr -s`. + +4. At this point the old user account has vanished, while the home directory still + exists safely under the `/home/foobar.saved` name.
Let's now create a new + account with `systemd-homed`, using the same username and UID as before: + + ``` + homectl create foobar --uid=$UID --real-name="$GECOS" + ``` + + In this command line, replace `$UID` by the UID you previously used, + i.e. the third field of the `getent passwd foobar` output above. Similarly, + replace `$GECOS` by the GECOS field of your old account, i.e. the fifth field + of the old output. If your distribution traditionally does not assign a + private group to regular users, then consider adding `--member-of=` + with the group name to get a modicum of compatibility with the status quo + ante: this way your new user account will still not have the old primary + group as new primary group, but will have it as auxiliary group. + + Consider reading through the + [homectl(1)](https://www.freedesktop.org/software/systemd/man/homectl.html) + manual page at this point, maybe there are a couple of other settings you + want to set for your new account. In particular, look at `--storage=` and + `--disk-size=`, in order to change how your home directory shall be stored + (the default `luks` storage is recommended). + +5. Your new user account exists now, but it has an empty home directory. Let's + now migrate your old home directory into it. For that let's mount the new + home directory temporarily and copy the data in. + + ``` + homectl with foobar -- rsync -aHANUXv --remove-source-files /home/foobar.saved/ . + ``` + + This mounts the home directory of the user, and then runs the specified + `rsync` command which copies the contents of the old home directory into the + new. The new home directory is the working directory of the invoked `rsync` + process. We are invoking this command as root, hence the `rsync` runs as + root too. When the `rsync` command completes the home directory is + automatically unmounted again. Since we used `--remove-source-files` all files + copied are removed from the old home directory as the copy progresses.
After + the command completes the old home directory should be empty. Hence, let's + remove it: + + ``` + rmdir /home/foobar.saved + ``` + +And that's it, we are done already. You can log out now and should be able to +log in under your user account as usual, but now with `systemd-homed` managing +your home directory. diff --git a/docs/COREDUMP.md b/docs/COREDUMP.md new file mode 100644 index 0000000..c64579e --- /dev/null +++ b/docs/COREDUMP.md @@ -0,0 +1,147 @@ +--- +title: systemd Coredump Handling +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# systemd Coredump Handling + +## Support in the Service Manager (PID 1) + +The systemd service manager natively provides coredump handling functionality, +as implemented by the Linux kernel. Specifically, PID 1 provides the following +functionality: + +1. During very early boot it will raise the + [`RLIMIT_CORE`](https://man7.org/linux/man-pages/man2/getrlimit.2.html) + resource limit for itself to infinity (and thus implicitly also all its + children). This removes any limits on the size of generated coredumps, for + all invoked processes, from earliest boot on. (The Linux kernel sets the + limit to 0 by default.) + +2. At the same time it will turn off coredump handling in the kernel by writing + `|/bin/false` into `/proc/sys/kernel/core_pattern` (also known as the + "`kernel.core_pattern` sysctl"; see + [core(5)](https://man7.org/linux/man-pages/man5/core.5.html) for + details). This means that coredumps are not actually processed. (The Linux + kernel sets the pattern to `core` by default, so that coredumps are written + to the current working directory of the crashing process.) + +Net effect: after PID 1 has started and performed this setup coredumps are +disabled, but by means of the `kernel.core_pattern` sysctl rather than by +size limit.
This is generally preferable, since the pattern can be updated +trivially at the right time to enable coredumping once the system is ready, +taking comprehensive effect on all userspace. (Or to say this differently: +disabling coredumps via the size limit is problematic, since it cannot easily +be undone without iterating through all already running processes once the +system is ready for coredump handling.) + +Processing of core dumps may be enabled at the appropriate time by updating the +`kernel.core_pattern` sysctl. Only coredumps that happen later will be +processed. + +During the final shutdown phase the `kernel.core_pattern` sysctl is updated +again to `|/bin/false`, disabling coredump support again, should it have been +enabled in the meantime. + +This means coredump handling is generally not available during earliest boot +and latest shutdown, reflecting the fact that storage is typically not +available in these environments, and many other facilities are missing too that +are required to collect and process a coredump successfully. + +## `systemd-coredump` Handler + +The systemd suite provides a coredump handler +[`systemd-coredump`](https://www.freedesktop.org/software/systemd/man/systemd-coredump.html) +which can be enabled at build-time. It is activated during boot via the +`/usr/lib/sysctl.d/50-coredump.conf` drop-in file for +`systemd-sysctl.service`. It registers the `systemd-coredump` tool as +`kernel.core_pattern` sysctl. + +`systemd-coredump` is implemented as socket activated service: when the kernel +invokes the userspace coredump handler, the received coredump file descriptor +is immediately handed off to the socket activated service +`systemd-coredump@.service` via the `systemd-coredump.socket` socket unit. 
This +means the coredump handler runs for a very short time only, and the potentially +*heavy* and security sensitive coredump processing work is done as part of the +specified service unit, and thus can benefit from regular service resource +management and sandboxing. + +The `systemd-coredump` handler will extract a backtrace and [ELF packaging +metadata](https://systemd.io/ELF_PACKAGE_METADATA) from any coredumps it +receives and log both. The information about coredumps stored in the journal +can be enumerated and queried with the +[`coredumpctl`](https://www.freedesktop.org/software/systemd/man/coredumpctl.html) +tool, for example for directly invoking a debugger such as `gdb` on a collected +coredump. + +The handler writes coredump files to `/var/lib/systemd/coredump/`. Old files +are cleaned up periodically by +[`systemd-tmpfiles(8)`](https://www.freedesktop.org/software/systemd/man/systemd-tmpfiles.html). + +## User Experience + +With the above, any coredumps generated on the system are by default collected +and turned into logged events, except during very early boot and late +shutdown. Individual services, processes or users can opt out of coredump +collection by setting `RLIMIT_CORE` to 0 (or alternatively by invoking +[`PR_SET_DUMPABLE`](https://man7.org/linux/man-pages/man2/prctl.2.html)). The +resource limit can be set freely by daemons/processes/users to arbitrary +values, which the coredump handler will respect. The `coredumpctl` tool may be +used to further analyze/debug coredumps. + +## Alternative Coredump Handlers + +While we recommend usage of the `systemd-coredump` handler, it's fully +supported to use alternative coredump handlers instead. A similar +implementation pattern is recommended. Specifically: + +1. Use a `sysctl.d/` drop-in to register your handler with the kernel.
Make + sure to include the `%c` specifier in the pattern (which reflects the + crashing process' `RLIMIT_CORE`) and act on it: limit the stored coredump + file to the specified limit. + +2. Do not do heavy processing directly in the coredump handler. Instead, + quickly pass off the kernel's coredump file descriptor to an + auxiliary service running as service under the service manager, so that it + can be done under supervision, sandboxing and resource management. + +Note that at any given time only a single handler can be enabled, i.e. the +`kernel.core_pattern` sysctl cannot reference multiple executables. + +## Packaging + +It might make sense to split `systemd-coredump` into a separate distribution +package. If doing so, make sure that `/usr/lib/sysctl.d/50-coredump.conf` and +the associated service and socket units are also added to the split off package. + +Note that in a scenario where `systemd-coredump` is split out and not +installed, coredumping is turned off during the entire runtime of the system — +unless an alternative handler is installed, or behaviour is manually reverted +to legacy style handling (see below). + +## Restoring Legacy Coredump Handling + +The default policy of the kernel to write coredumps into the current working +directory of the crashing process is considered highly problematic by many, +including by the systemd maintainers. Nonetheless, if users locally want to +return to this behaviour, two changes must be made (followed by a reboot): + +```console +$ mkdir -p /etc/sysctl.d +$ cat >/etc/sysctl.d/50-coredump.conf <<EOF +# Party like it's 1995! 
+kernel.core_pattern=core +EOF +``` + +and + +```console +$ mkdir -p /etc/systemd/system.conf.d +$ cat >/etc/systemd/system.conf.d/50-coredump.conf <<EOF +[Manager] +DefaultLimitCORE=0:infinity +EOF +``` diff --git a/docs/COREDUMP_PACKAGE_METADATA.md b/docs/COREDUMP_PACKAGE_METADATA.md new file mode 100644 index 0000000..d84e269 --- /dev/null +++ b/docs/COREDUMP_PACKAGE_METADATA.md @@ -0,0 +1,4 @@ +--- +layout: forward +target: /ELF_PACKAGE_METADATA +--- diff --git a/docs/CREDENTIALS.md b/docs/CREDENTIALS.md new file mode 100644 index 0000000..ed30eac --- /dev/null +++ b/docs/CREDENTIALS.md @@ -0,0 +1,504 @@ +--- +title: Credentials +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# System and Service Credentials + +The `systemd` service manager supports a "credential" concept for securely +acquiring and passing credential data to systems and services. The precise +nature of the credential data is up to applications, but the concept is +intended to provide systems and services with potentially security sensitive +cryptographic keys, certificates, passwords, identity information and similar +types of information. It may also be used as generic infrastructure for +parameterizing systems and services. + +Traditionally, data of this nature has often been provided to services via +environment variables (which is problematic because by default they are +inherited down the process tree, have size limitations, and issues with binary +data) or simple, unencrypted files on disk. `systemd`'s system and service +credentials are supposed to provide a better alternative for this +purpose. Specifically, the following features are provided: + +1. Service credentials are acquired at the moment of service activation, and + released on service deactivation. They are immutable during the service + runtime. + +2. 
Service credentials are accessible to service code as regular files, the + path to access them is derived from the environment variable + `$CREDENTIALS_DIRECTORY`. + +3. Access to credentials is restricted to the service's user. Unlike + environment variables the credential data is not propagated down the process + tree. Instead each time a credential is accessed an access check is enforced + by the kernel. If the service is using file system namespacing the loaded + credential data is invisible to all other services. + +4. Service credentials may be acquired from files on disk, specified as literal + strings in unit files, acquired from another service dynamically via an + `AF_UNIX` socket, or inherited from the system credentials the system itself + received. + +5. Credentials may optionally be encrypted and authenticated, either with a key + derived from a local TPM2 chip, or one stored in `/var/`, or both. This + encryption is supposed to *just* *work*, and requires no manual setup. (That + is besides first encrypting relevant credentials with one simple command, + see below.) + +6. Service credentials are placed in non-swappable memory. (If permissions + allow it, via `ramfs`.) + +7. Credentials may be acquired from a hosting VM hypervisor (SMBIOS OEM strings + or qemu `fw_cfg`), a hosting container manager, the kernel command line, + from the initrd, or from the UEFI environment via the EFI System Partition + (via `systemd-stub`). Such system credentials may then be propagated into + individual services as needed. + +8. Credentials are an effective way to pass parameters into services that run + with `RootImage=` or `RootDirectory=` and thus cannot read these resources + directly from the host directory tree. + Specifically, [Portable Services](PORTABLE_SERVICES) may be + parameterized this way securely and robustly. + +9. Credentials can be binary and relatively large (though currently an overall + size limit of 1M per service is enforced). 
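
From the service side, the second feature in the list above (credentials exposed as regular files below `$CREDENTIALS_DIRECTORY`) could be consumed roughly as follows. This is a minimal sketch, not an official interface: `read_credential` is a hypothetical helper name, and the credential name `foobar` in the usage note is only an example.

```sh
#!/bin/sh
# Hypothetical helper: print the contents of the named credential.
# Fails if the service was started without any credentials, i.e. if
# $CREDENTIALS_DIRECTORY is unset or empty.
read_credential() {
    if [ -z "${CREDENTIALS_DIRECTORY:-}" ]; then
        echo "CREDENTIALS_DIRECTORY is not set" >&2
        return 1
    fi
    # Each configured credential is one regular file in the directory.
    cat -- "$CREDENTIALS_DIRECTORY/$1"
}
```

A service script might then run `read_credential foobar` and pipe the output to the program that needs it, rather than copying the secret into an environment variable that child processes would inherit.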
+ +## Configuring per-Service Credentials + +Within unit files, there are five settings to configure service credentials. + +1. `LoadCredential=` may be used to load a credential from disk, from an + `AF_UNIX` socket, or propagate it from a system credential. + +2. `ImportCredential=` may be used to load one or more (optionally encrypted) + credentials from disk or from the credential stores. + +3. `SetCredential=` may be used to set a credential to a literal string encoded + in the unit file. Because unit files are world-readable (both on disk and + via D-Bus), this should only be used for credentials that aren't sensitive, + e.g. public keys or certificates, but not private keys. + +4. `LoadCredentialEncrypted=` is similar to `LoadCredential=` but will load an + encrypted credential, and decrypt it before passing it to the service. For + details on credential encryption, see below. + +5. `SetCredentialEncrypted=` is similar to `SetCredential=` but expects an + encrypted credential to be specified literally. Unlike `SetCredential=` it + is thus safe to be used even for sensitive information, because even though + unit files are world-readable, the ciphertext included in them cannot be + decoded unless access to the TPM2/encryption key is available. + +Each credential configured with these options carries a short name (suitable +for inclusion in a filename) in the unit file, under which the invoked service +code can then retrieve it. Each name should only be specified once. + +For details about these five settings [see the man +page](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Credentials). + +It is a good idea to also enable mount namespacing for services that process +credentials configured this way. If so, the runtime credential directory of the +specific service is not visible to any other service. Use `PrivateMounts=` as +minimal option to enable such namespacing. Note that many other sandboxing +settings (e.g. 
`ProtectSystem=`, `ReadOnlyPaths=` and similar) imply +`PrivateMounts=`, hence oftentimes it's not necessary to set this option +explicitly. + +## Programming Interface from Service Code + +When a service is invoked with one or more credentials set it will have an +environment variable `$CREDENTIALS_DIRECTORY` set. It contains an absolute path +to a directory the credentials are placed in. In this directory for each +configured credential one file is placed. In addition to the +`$CREDENTIALS_DIRECTORY` environment variable passed to the service processes +the `%d` specifier in unit files resolves to the service's credential +directory. + +Example unit file: + +``` +… +[Service] +ExecStart=/usr/bin/myservice.sh +LoadCredential=foobar:/etc/myfoobarcredential.txt +Environment=FOOBARPATH=%d/foobar +… +``` + +Associated service shell script `/usr/bin/myservice.sh`: + +```sh +#!/bin/sh + +sha256sum $CREDENTIALS_DIRECTORY/foobar +sha256sum $FOOBARPATH + +``` + +A service defined like this will get the contents of the file +`/etc/myfoobarcredential.txt` passed as credential `foobar`, which is hence +accessible under `$CREDENTIALS_DIRECTORY/foobar`. Since we additionally pass +the path to it as environment variable `$FOOBARPATH` the credential is also +accessible as the path in that environment variable. When invoked, the service +will hence show the same SHA256 hash value of `/etc/myfoobarcredential.txt` +twice. + +In an ideal world, well-behaved service code would directly support credentials +passed this way, i.e. look for `$CREDENTIALS_DIRECTORY` and load the credential +data it needs from there. For daemons that do not support this but allow +passing credentials via a path supplied over the command line use +`${CREDENTIALS_DIRECTORY}` in the `ExecStart=` command line to reference the +credentials directory. 
For daemons that allow passing credentials via a path +supplied as an environment variable, use the `%d` specifier in the `Environment=` +setting to build valid paths to specific credentials. + +Encrypted credentials are automatically decrypted/authenticated during service +activation, so that service code only receives plaintext credentials. + +## Programming Interface from Generator Code + +[Generators](https://www.freedesktop.org/software/systemd/man/systemd.generator.html) +may generate native unit files from external configuration or system +parameters, such as system credentials. Note that they run outside of service +context, and hence will not receive encrypted credentials in plaintext +form. Specifically, credentials passed into the system in encrypted form will +be placed as they are in a directory referenced by the +`$ENCRYPTED_CREDENTIALS_DIRECTORY` environment variable, and those passed in +plaintext form will be placed in `$CREDENTIALS_DIRECTORY`. Use a command such +as `systemd-creds --system cat …` to access both forms of credentials, and +decrypt them if needed (see +[systemd-creds(1)](https://www.freedesktop.org/software/systemd/man/systemd-creds.html) +for details). + +Note that generators typically run very early during boot (similar to initrd +code), earlier than the `/var/` file system is necessarily mounted (which is +where the system's credential encryption secret is located). Thus it's a good +idea to encrypt credentials with `systemd-creds encrypt --with-key=auto-initrd` +if they shall be consumed by a generator, to ensure they are locked to the TPM2 +only, not the credentials secret stored below `/var/`. + +For further details about encrypted credentials, see below. + +## Tools + +The +[`systemd-creds`](https://www.freedesktop.org/software/systemd/man/systemd-creds.html) +tool is provided to work with system and service credentials. 
It may be used to +access and enumerate system and service credentials, or to encrypt/decrypt credentials +(for details about the latter, see below). + +When invoked from service context, `systemd-creds` passed without further +parameters will list passed credentials. The `systemd-creds cat xyz` command +may be used to write the contents of credential `xyz` to standard output. If +these calls are combined with the `--system` switch credentials passed to the +system as a whole are shown, instead of those passed to the service the +command is invoked from. + +Example use: + +```sh +systemd-run -P --wait -p LoadCredential=abc:/etc/hosts systemd-creds cat abc +``` + +This will invoke a transient service with a credential `abc` sourced from the +system's `/etc/hosts` file. This credential is then written to standard output +via `systemd-creds cat`. + +## Encryption + +Credentials are supposed to be useful for carrying sensitive information, such +as cryptographic key material. For this kind of data (symmetric) encryption and +authentication are provided to make storage of the data at rest safer. The data +may be encrypted and authenticated with AES256-GCM. The encryption key can +either be one derived from the local TPM2 device, or one stored in +`/var/lib/systemd/credential.secret`, or a combination of both. If a TPM2 +device is available and `/var/` resides on a persistent storage, the default +behaviour is to use the combination of both for encryption, thus ensuring that +credentials protected this way can only be decrypted and validated on the +local hardware and OS installation. Encrypted credentials stored on disk thus +cannot be decrypted without access to the TPM2 chip and the aforementioned key +file `/var/lib/systemd/credential.secret`. Moreover, credentials cannot be +prepared on a machine other than the local one. + +Decryption generally takes place at the moment of service activation. 
This +means credentials passed to the system can be either encrypted or plaintext and +remain that way all the way while they are propagated to their consumers, until +the moment of service activation when they are decrypted and authenticated, so +that the service only sees plaintext credentials. + +The `systemd-creds` tool provides the commands `encrypt` and `decrypt` to +encrypt and decrypt/authenticate credentials. Example: + +```sh +systemd-creds encrypt --name=foobar plaintext.txt ciphertext.cred +shred -u plaintext.txt +systemd-run -P --wait -p LoadCredentialEncrypted=foobar:$(pwd)/ciphertext.cred systemd-creds cat foobar +``` + +This will first create an encrypted copy of the file `plaintext.txt` in the +encrypted credential file `ciphertext.cred`. It then securely removes the +source file. It then runs a transient service, that reads the encrypted file +and passes it as decrypted credential `foobar` to the invoked service binary +(which here is the `systemd-creds` tool, which just writes the data +it received to standard output). + +Instead of storing the encrypted credential as a separate file on disk, it can +also be embedded in the unit file. Example: + +``` +systemd-creds encrypt -p --name=foobar plaintext.txt - +``` + +This will output a `SetCredentialEncrypted=` line that can directly be used in +a unit file. 
e.g.: + +``` +… +[Service] +ExecStart=/usr/bin/systemd-creds cat foobar +SetCredentialEncrypted=foobar: \ + k6iUCUh0RJCQyvL8k8q1UyAAAAABAAAADAAAABAAAAC1lFmbWAqWZ8dCCQkAAAAAgAAAA \ + AAAAAALACMA0AAAACAAAAAAfgAg9uNpGmj8LL2nHE0ixcycvM3XkpOCaf+9rwGscwmqRJ \ + cAEO24kB08FMtd/hfkZBX8PqoHd/yPTzRxJQBoBsvo9VqolKdy9Wkvih0HQnQ6NkTKEdP \ + HQ08+x8sv5sr+Mkv4ubp3YT1Jvv7CIPCbNhFtag1n5y9J7bTOKt2SQwBOAAgACwAAABIA \ + ID8H3RbsT7rIBH02CIgm/Gv1ukSXO3DMHmVQkDG0wEciABAAII6LvrmL60uEZcp5qnEkx \ + SuhUjsDoXrJs0rfSWX4QAx5PwfdFuxPusgEfTYIiCb8a/W6RJc7cMweZVCQMbTARyIAAA \ + AAJt7Q9F/Gz0pBv1Lc4Dpn1WpebyBBm+vQ5N/lSKW2XSm8cONwCopxpDc7wJjXg7OTR6r \ + xGCpIvGXLt3ibwJl81woLya2RRjIvc/R2zNm/yWzZAjiOLPih4SuHthqiX98ey8PUmZJB \ + VGXglCZFjBx+d7eCqTIdghtp5pkDGwMJT6pjw4FfyFK2nJPawFKPAqzw9DK2iYttFeXi5 \ + 19xCfLBH9NKS/idlYXrhp+XIEtsr26s4lx5y10Goyc3qDOR3RD2cuZj0gHwV35hhhhcCz \ + JaYytef1X/YL+7fYH5kuE4rxSksoUuA/LhtjszBeGbcbIT+O8SuvBJHLKTSHxPL8FTyk3 \ + L4FSkEHs0rYwUIkKmnGohDdsYrMJ2fjH3yDNBP16aD1+f/Nuh75cjhUnGsDLt9K4hGg== \ +… +``` + +## Inheritance from Container Managers, Hypervisors, Kernel Command Line, or the UEFI Boot Environment + +Sometimes it is useful to parameterize whole systems the same way as services, +via `systemd` credentials. In particular, it might make sense to boot a +system with a set of credentials that are then propagated to individual +services where they are ultimately consumed. + +`systemd` supports five ways to pass credentials to systems: + +1. A container manager may set the `$CREDENTIALS_DIRECTORY` environment + variable for systemd running as PID 1 in the container, the same way as + systemd would set it for a service it + invokes. [`systemd-nspawn(1)`](https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html#Credentials)'s + `--set-credential=` and `--load-credential=` switches implement this, in + order to pass arbitrary credentials from host to container payload. Also see + the [Container Interface](CONTAINER_INTERFACE) documentation. + +2. 
Quite similar, VMs can be passed credentials via SMBIOS OEM strings (example + qemu command line switch `-smbios + type=11,value=io.systemd.credential:foo=bar` or `-smbios + type=11,value=io.systemd.credential.binary:foo=YmFyCg==`, the latter taking + a Base64 encoded argument to permit binary credentials being passed + in). Alternatively, qemu VMs can be invoked with `-fw_cfg + name=opt/io.systemd.credentials/foo,string=bar` to pass credentials from + host through the hypervisor into the VM via qemu's `fw_cfg` mechanism. (All + three of these specific switches would set credential `foo` to `bar`.) + Passing credentials via the SMBIOS mechanism is typically preferable over + `fw_cfg` since it is faster and less specific to the chosen VMM + implementation. Moreover, `fw_cfg` has a 55 character limitation on names + passed that way. So some settings may not fit. + +3. Credentials may be passed from the initrd to the host during the initrd → + host transition. Provisioning systems that run in the initrd may use this to + install credentials on the system. All files placed in + `/run/credentials/@initrd/` are imported into the set of file system + credentials during the transition. The files (and their directory) are + removed once this is completed. + +4. Credentials may also be passed from the UEFI environment to userspace, if + the + [`systemd-stub`](https://www.freedesktop.org/software/systemd/man/systemd-stub.html) + UEFI kernel stub is used. This allows placing encrypted credentials in the + EFI System Partition, which are then picked up by `systemd-stub` and passed + to the kernel and ultimately userspace where systemd receives them. This is + useful to implement secure parameterization of vendor-built and signed + initrds, as userspace can place credentials next to these EFI kernels, and + be sure they can be accessed securely from initrd context. + +5. 
Credentials can also be passed into a system via the kernel command line, + via the `systemd.set_credential=` and `systemd.set_credential_binary=` + kernel command line options (the latter takes Base64 encoded binary + data). Note though that any data specified here is visible to all userspace + applications (even unprivileged ones) via `/proc/cmdline`. Typically, this + is hence not suitable for passing sensitive information, and should be avoided. + +Credentials passed to the system may be enumerated/displayed via `systemd-creds +--system`. They may also be propagated down to services, via the +`LoadCredential=` setting. Example: + +``` +systemd-nspawn --set-credential=mycred:supersecret -i test.raw -b +``` + +or + +``` +qemu-system-x86_64 \ + -machine type=q35,accel=kvm,smm=on \ + -smp 2 \ + -m 1G \ + -cpu host \ + -nographic \ + -nodefaults \ + -serial mon:stdio \ + -drive if=none,id=hd,file=test.raw,format=raw \ + -device virtio-scsi-pci,id=scsi \ + -device scsi-hd,drive=hd,bootindex=1 \ + -smbios type=11,value=io.systemd.credential:mycred=supersecret +``` + +Either of these lines will boot a disk image `test.raw`, once as container via +`systemd-nspawn`, and once as VM via `qemu`. In each case the credential +`mycred` is set to `supersecret`. + +Inside of the system invoked that way the credential may then be viewed: + +```sh +systemd-creds --system cat mycred +``` + +Or propagated to services further down: + +``` +systemd-run -p ImportCredential=mycred -P --wait systemd-creds cat mycred +``` + +## Well-Known Credentials + +Various services shipped with `systemd` consume credentials for tweaking behaviour: + +* [`systemd(1)`](https://www.freedesktop.org/software/systemd/man/systemd.html) + (i.e. PID 1, the system manager) will look for the credential `vmm.notify_socket` + and will use it to send a `READY=1` datagram when the system has finished + booting. 
This is useful for hypervisors/VMMs or other processes on the host + to receive a notification via VSOCK when a virtual machine has finished booting. + Note that in case the hypervisor does not support `SOCK_DGRAM` over `AF_VSOCK`, + `SOCK_SEQPACKET` will be tried instead. The credential payload should be in the + form: `vsock:<CID>:<PORT>`. Also note that this requires support for VHOST to be + built-in both the guest and the host kernels, and the kernel modules to be loaded. + +* [`systemd-sysusers(8)`](https://www.freedesktop.org/software/systemd/man/systemd-sysusers.html) + will look for the credentials `passwd.hashed-password.<username>`, + `passwd.plaintext-password.<username>` and `passwd.shell.<username>` to + configure the password (either in UNIX hashed form, or plaintext) or shell of + system users created. Replace `<username>` with the system user of your + choice, for example, `root`. + +* [`systemd-firstboot(1)`](https://www.freedesktop.org/software/systemd/man/systemd-firstboot.html) + will look for the credentials `firstboot.locale`, `firstboot.locale-messages`, + `firstboot.keymap`, `firstboot.timezone`, that configure locale, keymap or + timezone settings in case the data is not yet set in `/etc/`. + +* [`tmpfiles.d(5)`](https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html) + will look for the credentials `tmpfiles.extra` with arbitrary tmpfiles.d lines. + Can be encoded in base64 to allow easily passing it on the command line. + +* Further well-known credentials are documented in + [`systemd.system-credentials(7)`](https://www.freedesktop.org/software/systemd/man/systemd.system-credentials.html). + +In future more services are likely to gain support for consuming credentials. 
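
The Base64 ("binary") form of the SMBIOS credential strings described earlier can be generated rather than typed by hand. The following is a minimal sketch under stated assumptions: `smbios_credential` is a hypothetical helper name, the credential value is read from standard input, and a `base64` utility is assumed to be installed. Only the `type=11,value=io.systemd.credential.binary:NAME=BASE64` layout is taken from the text.

```sh
#!/bin/sh
# Sketch: print the argument for one qemu "-smbios" switch that carries a
# credential in Base64 form. smbios_credential is a hypothetical helper.
smbios_credential() {
    name=$1
    # Encode stdin and strip the line wrapping some base64 tools add.
    encoded=$(base64 | tr -d '\n')
    printf 'type=11,value=io.systemd.credential.binary:%s=%s\n' "$name" "$encoded"
}
```

For instance, `qemu-system-x86_64 … -smbios "$(printf 'bar\n' | smbios_credential foo)"` would pass the same `foo=YmFyCg==` string that the SMBIOS example earlier in this document uses.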
+ +Example: + +``` +systemd-nspawn -i test.raw \ + --set-credential=passwd.hashed-password.root:$(mkpasswd mysecret) \ + --set-credential=firstboot.locale:C.UTF-8 \ + -b +``` + +This boots the specified disk image as `systemd-nspawn` container, and passes +the root password `mysecret` and the default locale `C.UTF-8` to it. This +data is then propagated by default to `systemd-sysusers.service` and +`systemd-firstboot.service`, where it is applied. (Note that these services +will only do so if these settings in `/etc/` are so far unset, i.e. they only +have an effect on *unprovisioned* systems, and will never override data already +established in `/etc/`.) A similar line for qemu is: + +``` +qemu-system-x86_64 \ + -machine type=q35,accel=kvm,smm=on \ + -smp 2 \ + -m 1G \ + -cpu host \ + -nographic \ + -nodefaults \ + -serial mon:stdio \ + -drive if=none,id=hd,file=test.raw,format=raw \ + -device virtio-scsi-pci,id=scsi \ + -device scsi-hd,drive=hd,bootindex=1 \ + -smbios type=11,value=io.systemd.credential:passwd.hashed-password.root=$(mkpasswd mysecret) \ + -smbios type=11,value=io.systemd.credential:firstboot.locale=C.UTF-8 +``` + +This boots the specified disk image via qemu, provisions public key SSH access +for the root user from the caller's key, and sends a notification to a process +on the host when booting has finished: + +``` +qemu-system-x86_64 \ + -machine type=q35,accel=kvm,smm=on \ + -smp 2 \ + -m 1G \ + -cpu host \ + -nographic \ + -nodefaults \ + -serial mon:stdio \ + -drive if=none,id=hd,file=test.raw,format=raw \ + -device virtio-scsi-pci,id=scsi \ + -device scsi-hd,drive=hd,bootindex=1 \ + -device vhost-vsock-pci,id=vhost-vsock-pci0,guest-cid=42 \ + -smbios type=11,value=io.systemd.credential:vmm.notify_socket=vsock:2:1234 \ + -smbios type=11,value=io.systemd.credential.binary:tmpfiles.extra=$(echo "f~ /root/.ssh/authorized_keys 600 root root - $(ssh-add -L | base64 -w 0)" | base64 -w 0) +``` + +A process on the host can listen for the 
notification, for example: + +``` +$ socat - VSOCK-LISTEN:1234,socktype=5 +READY=1 +``` + +## Relevant Paths + +From *service* perspective the runtime path to find loaded credentials in is +provided in the `$CREDENTIALS_DIRECTORY` environment variable. For *system +services* the credential directory will be `/run/credentials/<unit name>`, but +hardcoding this path is discouraged, because it does not work for *user +services*. Packagers and system administrators may hardcode the credential path +as a last resort for software that does not yet search for credentials relative +to `$CREDENTIALS_DIRECTORY`. + +From *generator* perspective the runtime path to find credentials passed into +the system in plaintext form in is provided in `$CREDENTIALS_DIRECTORY`, and +those passed into the system in encrypted form is provided in +`$ENCRYPTED_CREDENTIALS_DIRECTORY`. + +At runtime, credentials passed to the *system* are placed in +`/run/credentials/@system/` (for regular credentials, such as those passed from +a container manager or via qemu) and `/run/credentials/@encrypted/` (for +credentials that must be decrypted/validated before use, such as those from +`systemd-stub`). + +The `ImportCredential=` setting (and the `LoadCredential=` and +`LoadCredentialEncrypted=` settings when configured with a relative source +path) will search for the source file to read the credential from +automatically. Primarily, these credentials are searched among the credentials +passed into the system. If not found there, they are searched in +`/etc/credstore/`, `/run/credstore/`, +`/usr/lib/credstore/`. `LoadCredentialEncrypted=` will also search +`/etc/credstore.encrypted/` and similar directories. `ImportCredential=` will +search both the non-encrypted and encrypted directories. These directories are +hence a great place to store credentials to load on the system. 
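
The lookup order described above for plaintext credentials can be approximated in a few lines of shell. This is only a sketch of the order as this section states it, not systemd's actual implementation: `find_credential` and `CRED_ROOT` are hypothetical names, `CRED_ROOT` exists purely so the function can be tried against a scratch directory, and the encrypted stores are deliberately left out.

```sh
#!/bin/sh
# Sketch: print the first location that holds a plaintext credential,
# following the search order given in the text. find_credential and
# CRED_ROOT are hypothetical; they are not provided by systemd.
find_credential() {
    name=$1
    root=${CRED_ROOT:-}
    # System credentials come first, then the credential stores.
    for dir in \
        /run/credentials/@system \
        /etc/credstore \
        /run/credstore \
        /usr/lib/credstore
    do
        if [ -e "$root$dir/$name" ]; then
            printf '%s\n' "$root$dir/$name"
            return 0
        fi
    done
    return 1
}
```

Running `find_credential mycred` on a live system prints the path that would win under this simplified order, or returns a non-zero status if no candidate exists.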
+ +## Conditionalizing Services + +Sometimes it makes sense to conditionalize system services and invoke them only +if the right system credential is passed to the system. Use the +`ConditionCredential=` and `AssertCredential=` unit file settings for that. diff --git a/docs/DAEMON_SOCKET_ACTIVATION.md b/docs/DAEMON_SOCKET_ACTIVATION.md new file mode 100644 index 0000000..1a027a3 --- /dev/null +++ b/docs/DAEMON_SOCKET_ACTIVATION.md @@ -0,0 +1,122 @@ +--- +title: Socket Activation with Popular Daemons +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +## nginx + +nginx includes an undocumented, internal socket-passing mechanism based on the `NGINX` environment variable. It uses this to perform reloads without having to close and reopen its sockets, but it's also useful for socket activation. + +**/etc/nginx/my-nginx.conf** + +``` +http { + server { + listen [::]:80 ipv6only=on; + listen 80; + } +} +``` + +**/etc/systemd/system/my-nginx.service** + +``` +[Service] +User=nginx +Group=nginx +Environment=NGINX=3:4; +ExecStart=/usr/sbin/nginx -c/etc/nginx/my-nginx.conf +PrivateNetwork=true +``` + + +**/etc/systemd/system/my-nginx.socket** + +``` +[Socket] +ListenStream=80 +ListenStream=0.0.0.0:80 +BindIPv6Only=ipv6-only +After=network.target +Requires=network.target + +[Install] +WantedBy=sockets.target +``` + + +## PHP-FPM + +Like nginx, PHP-FPM includes a socket-passing mechanism based on an environment variable. In PHP-FPM's case, it's `FPM_SOCKETS`. + +This configuration is possible with any web server that supports FastCGI (like Apache, Lighttpd, or nginx). The web server does not need to know anything special about the socket; use a normal PHP-FPM configuration. + +Paths are based on a Fedora 19 system. 
+ +### First, the configuration files + +**/etc/php-fpm.d/my-php-fpm-pool.conf** + +``` +[global] +pid = /run/my-php-fpm-pool.pid ; Not really used by anything with daemonize = no, but needs to be writable. +error_log = syslog ; Will aggregate to the service's systemd journal. +daemonize = no ; systemd handles the forking. + +[www] +listen = /var/run/my-php-fpm-pool.socket ; Must match systemd socket unit. +user = nginx ; Ignored but required. +group = nginx ; Ignored but required. +pm = static +pm.max_children = 10 +slowlog = syslog +``` + + +**/etc/systemd/system/my-php-fpm-pool.service** + +``` +[Service] +User=nginx +Group=nginx +Environment="FPM_SOCKETS=/var/run/my-php-fpm-pool.socket=3" +ExecStart=/usr/sbin/php-fpm --fpm-config=/etc/php-fpm.d/my-php-fpm-pool.conf +KillMode=process +``` + + +**/etc/systemd/system/my-php-fpm-pool.socket** + +``` +[Socket] +ListenStream=/var/run/my-php-fpm-pool.socket + +[Install] +WantedBy=sockets.target +``` + + +### Second, the setup commands + +```sh +sudo systemctl --system daemon-reload +sudo systemctl start my-php-fpm-pool.socket +sudo systemctl enable my-php-fpm-pool.socket +``` + + +After accessing the web server, the service should be running. + +```sh +sudo systemctl status my-php-fpm-pool.service +``` + + +It's possible to shut down the service and re-activate it using the web browser, too. It's necessary to stop and start the socket to reset some shutdown PHP-FPM does that otherwise breaks reactivation. 
+ +```sh +sudo systemctl stop my-php-fpm-pool.socket my-php-fpm-pool.service +sudo systemctl start my-php-fpm-pool.socket +``` diff --git a/docs/DEBUGGING.md b/docs/DEBUGGING.md new file mode 100644 index 0000000..dc1c874 --- /dev/null +++ b/docs/DEBUGGING.md @@ -0,0 +1,211 @@ +--- +title: Diagnosing Boot Problems +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Diagnosing Boot Problems + +If your machine gets stuck during boot, first check if the hang happens before or after control passes to systemd. + +Try to boot without `rhgb` and `quiet` on the kernel command line. If you see some messages like these: + +* Welcome to Fedora _VERSION_ (_codename_)! +* Starting _name_... +* \[ OK \] Started _name_. + +then systemd is running. (See an actual [screenshot](f17boot.png).) + +Debugging always gets easier if you can get a shell. If you do not get a login prompt, try switching to a different virtual terminal using CTRL+ALT+F\_\_. Problems with a display server startup may manifest themselves as a missing login on tty1, while other VTs keep working. + +If the boot stops without presenting you with a login on any virtual console, let it retry for _up to 5 minutes_ before declaring it definitely stuck. There is a chance that a service that has trouble starting will be killed after this timeout and the boot will continue normally. Another possibility is that a device for an important mountpoint will fail to appear and you will be presented with _emergency mode_. + +## If You Get No Shell + +If you get neither a normal login nor the emergency mode shell, you will need to take additional steps to get debugging information out of the machine. + +* Try CTRL+ALT+DEL to reboot. + * If it does not reboot, mention it in your bug report. Meanwhile force the reboot with [SysRq](http://fedoraproject.org/wiki/QA/Sysrq) or a hard reset. 
+* When booting the next time, you will have to add some kernel command line arguments depending on which of the debugging strategies you choose from the following options. + +### Debug Logging to a Serial Console + +If you have a hardware serial console available or if you are debugging in a virtual machine (e.g. using virt-manager you can switch your view to a serial console in the menu View -> Text Consoles or connect from the terminal using `virsh console MACHINE`), you can ask systemd to log lots of useful debugging information to it by booting with: + +```sh +systemd.log_level=debug systemd.log_target=console console=ttyS0,38400 console=tty1 +``` + + +The above is useful if pid 1 is failing, but if a later but critical boot service is broken (such as networking), you can configure journald to forward to the console by using: + +```sh +systemd.journald.forward_to_console=1 console=ttyS0,38400 console=tty1 +``` + +console= can be specified multiple times, systemd will output to all of them. + +### Booting into Rescue or Emergency Targets + +To boot directly into rescue target add `systemd.unit=rescue.target` or just `1` to the kernel command line. This target is useful if the problem occurs somewhere after the basic system is brought up, during the starting of "normal" services. If this is the case, you should be able to disable the bad service from here. If the rescue target will not boot either, the more minimal emergency target might. + +To boot directly into emergency shell add `systemd.unit=emergency.target` or `emergency` to the kernel command line. Note that in the emergency shell you will have to remount the root filesystem read-write by yourself before editing any files: + +```sh +mount -o remount,rw / +``` + +Common issues that can be resolved in the emergency shell are bad lines in **/etc/fstab**. After fixing **/etc/fstab**, run `systemctl daemon-reload` to let systemd refresh its view of it. 
+ +If not even the emergency target works, you can boot directly into a shell with `init=/bin/sh`. This may be necessary in case systemd itself or some libraries it depends on are damaged by filesystem corruption. You may need to reinstall working versions of the affected packages. + +If `init=/bin/sh` does not work, you must boot from another medium. + +### Early Debug Shell + +You can enable shell access to be available very early in the startup process to fall back on and diagnose systemd-related boot-up issues with various systemctl commands. Enable it using: + +```sh +systemctl enable debug-shell.service +``` + +or by specifying + +```sh +systemd.debug-shell=1 +``` + +on the kernel command line. + +**Tip**: If you find yourself in a situation where you cannot use systemctl to communicate with a running systemd (e.g. when setting this up from a different booted system), you can avoid communication with the manager by specifying `--root=`: + +```sh +systemctl --root=/ enable debug-shell.service +``` + +Once enabled, the next time you boot you will be able to switch to tty9 using CTRL+ALT+F9 and have a root shell there available from an early point in the booting process. You can use the shell for checking the status of services, reading logs, looking for stuck jobs with `systemctl list-jobs`, etc. + +**Warning:** Use this shell only for debugging! Do not forget to disable debug-shell.service after you've finished debugging your boot problems. Leaving the root shell always available would be a security risk. + +It is also possible to alias `kbrequest.target` to `debug-shell.service` to start the debug shell on demand. This has the same security implications, but avoids running the shell always. + +### Verify Prerequisites + +An (at least partly) populated `/dev` is required. Depending on your setup (e.g. on embedded systems), check that the Linux kernel config options `CONFIG_DEVTMPFS` and `CONFIG_DEVTMPFS_MOUNT` are set. 
Also support for cgroups and fanotify is recommended for flawless operation, so check that the Linux kernel config options `CONFIG_CGROUPS` and `CONFIG_FANOTIFY` are set. The message "Failed to get D-Bus connection: No connection to service manager." during various `systemctl` operations is an indicator that these are missing. + +## If You Can Get a Shell + +When you have systemd running to the extent that it can provide you with a shell, please use it to extract useful information for debugging. Boot with these parameters on the kernel command line: + +```sh +systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on +``` + +in order to increase the verbosity of systemd, to let systemd write its logs to the kernel log buffer, to increase the size of the kernel log buffer, and to prevent the kernel from discarding messages. After reaching the shell, look at the log: + +```sh +journalctl -b +``` + +When reporting a bug, pipe that to a file and attach it to the bug report. + +To check for possibly stuck jobs use: + +```sh +systemctl list-jobs +``` + +The jobs that are listed as "running" are the ones that must complete before the "waiting" ones will be allowed to start executing. + + +# Diagnosing Shutdown Problems + +Just like with boot problems, when you encounter a hang during shutdown, make sure you wait _at least 5 minutes_ to distinguish a permanent hang from a broken service that's just timing out. Then it's worth testing whether the system reacts to CTRL+ALT+DEL in any way. + +If shutdown (whether it be to reboot or power-off) of your system gets stuck, first test if the kernel itself is able to reboot or power off the machine forcibly using one of these commands: + +```sh +reboot -f +poweroff -f +``` + +If either one of the commands does not work, it's more likely to be a kernel bug than a systemd bug. 
+ +## Shutdown Completes Eventually + +If normal reboot or poweroff work, but take a suspiciously long time, then + +* boot with the debug options: + +```sh +systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on enforcing=0 +``` + +* save the following script as **/usr/lib/systemd/system-shutdown/debug.sh** and make it executable: + +```sh +#!/bin/sh +mount -o remount,rw / +dmesg > /shutdown-log.txt +mount -o remount,ro / +``` + +* reboot + + +Look for timeouts logged in the resulting file **shutdown-log.txt** and/or attach it to a bugreport. + +## Shutdown Never Finishes + +If normal reboot or poweroff never finish even after waiting a few minutes, the above method to create the shutdown log will not help and the log must be obtained using other methods. Two options that are useful for debugging boot problems can be used also for shutdown problems: + +* use a serial console +* use a debug shell - not only is it available from early boot, it also stays active until late shutdown. + + +# Status and Logs of Services + +When the start of a service fails, systemctl will give you a generic error message: + +```sh +# systemctl start foo.service +Job failed. See system journal and 'systemctl status' for details. +``` + +The service may have printed its own error message, but you do not see it, because services run by systemd are not related to your login session and their outputs are not connected to your terminal. That does not mean the output is lost though. By default the stdout, stderr of services are directed to the systemd _journal_ and the logs that services produce via `syslog(3)` go there too. systemd also stores the exit code of failed services. 
Let's check: + +```sh +# systemctl status foo.service +foo.service - mmm service + Loaded: loaded (/etc/systemd/system/foo.service; static) + Active: failed (Result: exit-code) since Fri, 11 May 2012 20:26:23 +0200; 4s ago + Process: 1329 ExecStart=/usr/local/bin/foo (code=exited, status=1/FAILURE) + CGroup: name=systemd:/system/foo.service + +May 11 20:26:23 scratch foo[1329]: Failed to parse config +``` + + +In this example the service ran as a process with PID 1329 and exited with error code 1. If you run systemctl status as root or as a user from the `adm` group, you will get a few lines from the journal that the service wrote. In the example the service produced just one error message. + +To list the journal, use the `journalctl` command. + +If you have a syslog service (such as rsyslog) running, the journal will also forward the messages to it, so you'll find them in **/var/log/messages** (depending on rsyslog's configuration). + + +# Reporting systemd Bugs + +Be prepared to include some information (logs) about your system as well. These should be complete (no snippets please), not in an archive, uncompressed. + +Please report bugs to your distribution's bug tracker first. If you are sure that you are encountering an upstream bug, then first check [for existing bug reports](https://github.com/systemd/systemd/issues/), and if your issue is not listed [file a new bug](https://github.com/systemd/systemd/issues/new). + +## Information to Attach to a Bug Report + +Whenever possible, the following should be mentioned and attached to your bug report: + +* The exact kernel command-line used. Typically from the bootloader configuration file (e.g. 
**/boot/grub2/grub.cfg**) or from **/proc/cmdline** +* The journal (the output of `journalctl -b > journal.txt`) + * ideally after booting with `systemd.log_level=debug systemd.log_target=kmsg log_buf_len=1M printk.devkmsg=on` +* The output of a systemd dump: `systemd-analyze dump > systemd-dump.txt` +* The output of `/usr/lib/systemd/systemd --test --system --log-level=debug > systemd-test.txt 2>&1` diff --git a/docs/DESKTOP_ENVIRONMENTS.md b/docs/DESKTOP_ENVIRONMENTS.md new file mode 100644 index 0000000..0a0eff6 --- /dev/null +++ b/docs/DESKTOP_ENVIRONMENTS.md @@ -0,0 +1,118 @@ +--- +title: Desktop Environment Integration +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Desktop Environments + +NOTE: This document is a work-in-progress. + +## Single Graphical Session + +systemd only supports running one graphical session per user at a time. +While this might not have always been the case historically, having multiple +sessions for one user running at the same time is problematic. +The DBus session bus is shared between all the logins, and services that are +started must be implicitly assigned to the user's current graphical session. + +In principle it is possible to run a single graphical session across multiple +logind seats, and this could be a way to use more than one display per user. +When a user logs in to a second seat, the seat resources could be assigned +to the existing session, allowing the graphical environment to present it +as a single seat. +Currently nothing like this is supported or even planned. + +## Pre-defined systemd units + +[`systemd.special(7)`](https://www.freedesktop.org/software/systemd/man/systemd.special.html) +defines the `graphical-session.target` and `graphical-session-pre.target` to +allow cross-desktop integration. Furthermore, systemd defines the three base +slices `background`, `app` and `session`.
+All units should be placed into one of these slices depending on their purposes: + + * `session.slice`: Contains only processes essential to run the user's graphical session + * `app.slice`: Contains all normal applications that the user is running + * `background.slice`: Useful for low-priority background tasks + +The purpose of this grouping is to assign different priorities to the +applications. +This could e.g. mean reserving memory to session processes, +preferentially killing background tasks in out-of-memory situations +or assigning different memory/CPU/IO priorities to ensure that the session +runs smoothly under load. + +TODO: Will there be a default to place units into e.g. `app.slice` by default +rather than the root slice? + +## XDG standardization for applications + +To ensure cross-desktop compatibility and encourage sharing of good practices, +desktop environments should adhere to the following conventions: + + * Application units should follow the scheme `app[-<launcher>]-<ApplicationID>[@<RANDOM>].service` + or `app[-<launcher>]-<ApplicationID>-<RANDOM>.scope` + e.g: + - `app-gnome-org.gnome.Evince@12345.service` + - `app-flatpak-org.telegram.desktop@12345.service` + - `app-KDE-org.kde.okular@12345.service` + - `app-org.kde.amarok.service` + - `app-org.gnome.Evince-12345.scope` + * Using `.service` units instead of `.scope` units, i.e. allowing systemd to + start the process on behalf of the caller, + instead of the caller starting the process and letting systemd know about it, + is encouraged. + * The RANDOM should be a string of random characters to ensure that multiple instances + of the application can be launched. + It can be omitted in the case of a non-transient application services which can ensure + multiple instances are not spawned, such as a DBus activated application. + * If no application ID is available, the launcher should generate a reasonable + name when possible (e.g. using `basename(argv[0])`). 
This name must not + contain a `-` character. + +This has the following advantages: + * Using the `app-<launcher>-` prefix means that the unit defaults can be + adjusted using desktop environment specific drop-in files. + * The application ID can be retrieved by stripping the prefix and postfix. + This in turn should map to the corresponding `.desktop` file when available + +TODO: Define the name of slices that should be used. +This could be `app-<launcher>-<ApplicationID>-<RANDOM>.slice`. + +TODO: Does it really make sense to insert the `<launcher>`? In GNOME I am +currently using a drop-in to configure `BindTo=graphical-session.target`, +`CollectMode=inactive-or-failed` and `TimeoutSec=5s`. I feel that such a +policy makes sense, but it may make much more sense to just define a +global default for all (graphical) applications. + + * Should application lifetime be bound to the session? + * May the user have applications that do not belong to the graphical session (e.g. launched from SSH)? + * Could we maybe add a default `app-.service.d` drop-in configuration? + +## XDG autostart integration + +To allow XDG autostart integration, systemd ships a cross-desktop generator +to create appropriate units for the autostart directory +(`systemd-xdg-autostart-generator`). +Desktop Environments can opt-in to using this by starting +`xdg-desktop-autostart.target`. The systemd generator correctly handles +`OnlyShowIn=` and `NotShowIn=`. It also handles the KDE and GNOME specific +`X-KDE-autostart-condition=` and `AutostartCondition=` by using desktop-environment-provided +binaries in an `ExecCondition=` line. + +However, this generator is somewhat limited in what it supports. For example, +all generated units will have `After=graphical-session.target` set on them, +and therefore may not be useful to start session services. + +Desktop files can be marked to be explicitly excluded from the generator using the line +`X-systemd-skip=true`. 
This should be set if an application provides its own +systemd service file for startup. + +## Startup and shutdown best practices + +Questions here are: + + * Are there strong opinions on how the session-leader process should watch the user's session units? + * Should systemd/logind/… provide an integrated way to define a session in terms of a running *user* unit? + * Is having `gnome-session-shutdown.target` that is run with `replace-irreversibly` considered a good practice? diff --git a/docs/DISCOVERABLE_PARTITIONS.md b/docs/DISCOVERABLE_PARTITIONS.md new file mode 100644 index 0000000..bc05b6c --- /dev/null +++ b/docs/DISCOVERABLE_PARTITIONS.md @@ -0,0 +1 @@ +[This content has moved to the UAPI group website](https://uapi-group.org/specifications/specs/discoverable_partitions_specification/) diff --git a/docs/DISTRO_PORTING.md b/docs/DISTRO_PORTING.md new file mode 100644 index 0000000..c95a829 --- /dev/null +++ b/docs/DISTRO_PORTING.md @@ -0,0 +1,94 @@ +--- +title: Porting systemd To New Distributions +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Porting systemd To New Distributions + +## HOWTO + +You need to make the following changes to adapt systemd to your +distribution: + +1. Find the right configure parameters for: + + * `-Dsysvinit-path=` + * `-Dsysvrcnd-path=` + * `-Drc-local=` + * `-Dloadkeys-path=` + * `-Dsetfont-path=` + * `-Dtty-gid=` + * `-Dntp-servers=` + * `-Ddns-servers=` + * `-Dsupport-url=` + +2. Try it out. + + Play around (as an ordinary user) with + `/usr/lib/systemd/systemd --test --system` for a test run + of systemd without booting. This will read the unit files and + print the initial transaction it would execute during boot-up. + This will also inform you about ordering loops and suchlike. + +## Compilation options + +The default configuration does not enable any optimization or hardening +options. This is suitable for development and testing, but not for end-user +installations.
+ +For deployment, optimization (`-O2` or `-O3` compiler options), link time +optimization (`-Db_lto=true` meson option), and hardening (e.g. +`-D_FORTIFY_SOURCE=2`, `-fstack-protector-strong`, `-fstack-clash-protection`, +`-fcf-protection`, `-pie` compiler options, and `-z relro`, `-z now`, +`--as-needed` linker options) are recommended. The most appropriate set of +options depends on the architecture and distribution specifics so no default is +provided. + +## NTP Pool + +By default, systemd-timesyncd uses the Google Public NTP servers +`time[1-4].google.com`, if no other NTP configuration is available. +They serve time that uses a +[leap second smear](https://developers.google.com/time/smear) +and can be up to .5s off from servers that use stepped leap seconds. + +If you prefer to use leap second steps, please register your own +vendor pool at ntp.org and make it the built-in default by +passing `-Dntp-servers=` to meson. Registering vendor +pools is [free](http://www.pool.ntp.org/en/vendors.html). + +Use `-Dntp-servers=` to direct systemd-timesyncd to different fallback +NTP servers. + +## DNS Servers + +By default, systemd-resolved uses Cloudflare and Google Public DNS servers +`1.1.1.1`, `8.8.8.8`, `1.0.0.1`, `8.8.4.4`, `2606:4700:4700::1111`, `2001:4860:4860::8888`, `2606:4700:4700::1001`, `2001:4860:4860::8844` +as fallback, if no other DNS configuration is available. + +Use `-Ddns-servers=` to direct systemd-resolved to different fallback +DNS servers. + +## PAM + +The default PAM config shipped by systemd is really bare bones. +It does not include many modules your distro might want to enable +to provide a more seamless experience. For example, limits set in +`/etc/security/limits.conf` will not be read unless you load `pam_limits`. +Make sure you add modules your distro expects from user services. + +Pass `-Dpamconfdir=no` to meson to avoid installing this file and +instead install your own. 
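As a rough illustration of how the meson options named in the DISTRO_PORTING sections above might be combined, the sketch below assembles a single `meson setup` command line and prints it for review instead of running it. The server names and addresses are hypothetical placeholders (documentation-range values), the `build` directory name is arbitrary, and the sketch assumes that `-Dntp-servers=` and `-Ddns-servers=` each take a space-separated list.

```sh
#!/bin/sh
# Sketch: assemble a meson command line from options discussed in the text.
# All values are hypothetical placeholders, not recommendations.

NTP_SERVERS="0.vendor.example.org 1.vendor.example.org"  # hypothetical vendor pool
DNS_SERVERS="192.0.2.53 2001:db8::53"                    # documentation-range addresses

# Print the command rather than executing it, so it can be reviewed first.
print_meson_cmd() {
    printf 'meson setup build'
    # The format string is reused for every argument, quoting each option.
    printf " '%s'" \
        "-Dntp-servers=$NTP_SERVERS" \
        "-Ddns-servers=$DNS_SERVERS" \
        "-Dpamconfdir=no" \
        "-Db_lto=true"
    printf '\n'
}

print_meson_cmd
```

Each option is quoted as one word because the server lists contain spaces. Compiler and linker hardening flags are left out on purpose, since the appropriate set depends on the architecture and the distribution.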
+ +## Contributing Upstream + +We generally no longer accept distribution-specific patches to +systemd upstream. If you have to make changes to systemd's source code +to make it work on your distribution, unless your code is generic +enough to be generally useful, we are unlikely to merge it. Please +always consider adopting the upstream defaults. If that is not +possible, please maintain the relevant patches downstream. + +Thank you for understanding. diff --git a/docs/ELF_PACKAGE_METADATA.md b/docs/ELF_PACKAGE_METADATA.md new file mode 100644 index 0000000..6cb3f78 --- /dev/null +++ b/docs/ELF_PACKAGE_METADATA.md @@ -0,0 +1,105 @@ +--- +title: Package Metadata for ELF Files +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Package Metadata for Core Files + +*Intended audience: hackers working on userspace subsystems that create ELF binaries +or parse ELF core files.* + +## Motivation + +ELF binaries get stamped with a unique, build-time generated hex string identifier called +`build-id`, [which gets embedded as an ELF note called `.note.gnu.build-id`](https://fedoraproject.org/wiki/Releases/FeatureBuildId). +In most cases, this allows to associate a stripped binary with its debugging information. +It is used, for example, to dynamically fetch DWARF symbols from a debuginfo server, or +to query the local package manager and find out the package metadata or, again, the DWARF +symbols or program sources. + +However, this usage of the `build-id` requires either local metadata, usually set up by +the package manager, or access to a remote server over the network. Both of those might +be unavailable or forbidden. + +Thus it becomes desirable to add additional metadata to a binary at build time, so that +`systemd-coredump` and other services analyzing core files are able to extract said +metadata simply from the core file itself, without external dependencies. 
+ +## Implementation + +This document will attempt to define a common metadata format specification, so that +multiple implementers might use it when building packages, or core file analyzers, and +so on. + +The metadata will be embedded in a single, new, 4-bytes-aligned, allocated, 0-padded, +read-only ELF header section, in a name-value JSON object format. Implementers working on parsing +core files should not assume a specific list of names, but parse anything that is included +in the section, and should look for the note using the `note type`. Implementers working on +build tools should strive to use the same names, for consistency. The most common will be +listed here. When corresponding to the content of os-release, the values should match, again for consistency. + +If available, the metadata should also include the debuginfod server URL that can provide +the original executable, debuginfo and sources, to further facilitate debugging. + +* Section header + +``` +SECTION: `.note.package` +note type: `0xcafe1a7e` +Owner: `FDO` (FreeDesktop.org) +Value: a single JSON object encoded as a zero-terminated UTF-8 string +``` + +* JSON payload + +```json +{ + "type":"rpm", # this provides a namespace for the package+package-version fields + "os":"fedora", + "osVersion":"33", + "name":"coreutils", + "version":"4711.0815.fc13", + "architecture":"arm32", + "osCpe": "cpe:/o:fedoraproject:fedora:33", # A CPE name for the operating system, `CPE_NAME` from os-release is a good default + "debugInfoUrl": "https://debuginfod.fedoraproject.org/" +} +``` + +The format is a single JSON object, encoded as a zero-terminated `UTF-8` string. +Each name in the object shall be unique as per recommendations of +[RFC8259](https://datatracker.ietf.org/doc/html/rfc8259#section-4). Strings shall +not contain any control character, nor use `\uXXX` escaping. 
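To make the payload format more concrete, here is a minimal sketch that prints such a JSON object from six positional values, using key names taken from the example payload above. The function name `package_note_json` is hypothetical, and the sketch assumes the values contain nothing that would need JSON escaping (no quotes, backslashes or control characters). Unlike the annotated example, the emitted object carries no `#` comments, since JSON itself has no comment syntax.

```sh
#!/bin/sh
# Sketch: emit a package-metadata JSON object on a single line.
# Assumes the values need no JSON escaping; the function name is hypothetical.

package_note_json() {
    # $1=type $2=os $3=osVersion $4=name $5=version $6=architecture
    printf '{"type":"%s","os":"%s","osVersion":"%s","name":"%s","version":"%s","architecture":"%s"}\n' \
        "$1" "$2" "$3" "$4" "$5" "$6"
}

# Values copied from the example payload in the text.
package_note_json rpm fedora 33 coreutils 4711.0815.fc13 arm32
```

A build system would more likely take `os` and `osVersion` from os-release than hard-code them; they are passed as arguments here only to keep the sketch self-contained.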
+ +When it comes to JSON numbers, this specification assumes that JSON parsers +processing this information are capable of reproducing the full signed 53bit +integer range (i.e. -2⁵³+1…+2⁵³-1) as well as the full 64-bit IEEE floating +point number range losslessly (with the exception of NaN/-inf/+inf, since JSON +cannot encode that), as per recommendations of +[RFC8259](https://datatracker.ietf.org/doc/html/rfc8259#page-8). Fields in +these JSON objects are thus permitted to encode numeric values from these +ranges as JSON numbers, and should not use numeric values not covered by these +types and ranges. + +Reference implementations of [packaging tools for .deb and .rpm](https://github.com/systemd/package-notes) +are available, and provide macros/helpers to include the note in binaries built +by the package build system. They make use of the new `--package-metadata` flag that +is available in the bfd, gold, mold and lld linkers (versions 2.39, 1.3.0 and 15.0 +respectively). This linker flag takes a JSON payload as parameter. + +## Well-known keys + +The metadata format is intentionally left open, so that vendors can add their own information. +A set of well-known keys is defined here, and hopefully shared among all vendors. 
+ +| Key name | Key description | Example value | +|--------------|--------------------------------------------------------------------------|---------------------------------------| +| type | The packaging type | rpm | +| os | The OS name, typically corresponding to ID in os-release | fedora | +| osVersion | The OS version, typically corresponding to VERSION_ID in os-release | 33 | +| name | The source package name | coreutils | +| version | The source package version | 4711.0815.fc13 | +| architecture | The binary package architecture | arm32 | +| osCpe | A CPE name for the OS, typically corresponding to CPE_NAME in os-release | cpe:/o:fedoraproject:fedora:33 | +| debugInfoUrl | The debuginfod server url, if available | https://debuginfod.fedoraproject.org/ | diff --git a/docs/ENVIRONMENT.md b/docs/ENVIRONMENT.md new file mode 100644 index 0000000..5e15b2b --- /dev/null +++ b/docs/ENVIRONMENT.md @@ -0,0 +1,597 @@ +--- +title: Known Environment Variables +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Known Environment Variables + +A number of systemd components take additional runtime parameters via +environment variables. Many of these environment variables are not supported at +the same level as command line switches and other interfaces are: we don't +document them in the man pages and we make no stability guarantees for +them. While they generally are unlikely to be dropped any time soon again, we +do not want to guarantee that they stay around for good either. + +Below is an (incomprehensive) list of the environment variables understood by +the various tools. Note that this list only covers environment variables not +documented in the proper man pages. + +All tools: + +* `$SYSTEMD_OFFLINE=[0|1]` — if set to `1`, then `systemctl` will refrain from + talking to PID 1; this has the same effect as the historical detection of + `chroot()`. 
Setting this variable to `0` instead has a similar effect as + `$SYSTEMD_IGNORE_CHROOT=1`; i.e. tools will try to communicate with PID 1 + even if a `chroot()` environment is detected. You almost certainly want to + set this to `1` if you maintain a package build system or similar and are + trying to use a modern container system and not plain `chroot()`. + +* `$SYSTEMD_IGNORE_CHROOT=1` — if set, don't check whether being invoked in a + `chroot()` environment. This is particularly relevant for systemctl, as it + will not alter its behaviour for `chroot()` environments if set. Normally it + refrains from talking to PID 1 in such a case; turning most operations such + as `start` into no-ops. If that's what's explicitly desired, you might + consider setting `$SYSTEMD_OFFLINE=1`. + +* `$SYSTEMD_FIRST_BOOT=0|1` — if set, assume "first boot" condition to be false + or true, instead of checking the flag file created by PID 1. + +* `$SD_EVENT_PROFILE_DELAYS=1` — if set, the sd-event event loop implementation + will print latency information at runtime. + +* `$SYSTEMD_PROC_CMDLINE` — if set, the contents are used as the kernel command + line instead of the actual one in `/proc/cmdline`. This is useful for + debugging, in order to test generators and other code against specific kernel + command lines. + +* `$SYSTEMD_OS_RELEASE` — if set, use this path instead of `/etc/os-release` or + `/usr/lib/os-release`. When operating under some root (e.g. `systemctl + --root=…`), the path is prefixed with the root. Only useful for debugging. + +* `$SYSTEMD_FSTAB` — if set, use this path instead of `/etc/fstab`. Only useful + for debugging. + +* `$SYSTEMD_SYSROOT_FSTAB` — if set, use this path instead of + `/sysroot/etc/fstab`. Only useful for debugging `systemd-fstab-generator`. + +* `$SYSTEMD_SYSFS_CHECK` — takes a boolean. If set, overrides sysfs container + detection that ignores `/dev/` entries in fstab. Only useful for debugging + `systemd-fstab-generator`. 
+ +* `$SYSTEMD_CRYPTTAB` — if set, use this path instead of `/etc/crypttab`. Only + useful for debugging. Currently only supported by + `systemd-cryptsetup-generator`. + +* `$SYSTEMD_INTEGRITYTAB` — if set, use this path instead of + `/etc/integritytab`. Only useful for debugging. Currently only supported by + `systemd-integritysetup-generator`. + +* `$SYSTEMD_VERITYTAB` — if set, use this path instead of + `/etc/veritytab`. Only useful for debugging. Currently only supported by + `systemd-veritysetup-generator`. + +* `$SYSTEMD_EFI_OPTIONS` — if set, used instead of the string in the + `SystemdOptions` EFI variable. Analogous to `$SYSTEMD_PROC_CMDLINE`. + +* `$SYSTEMD_DEFAULT_HOSTNAME` — override the compiled-in fallback hostname + (relevant in particular for the system manager and `systemd-hostnamed`). + Must be a valid hostname (either a single label or a FQDN). + +* `$SYSTEMD_IN_INITRD` — takes a boolean. If set, overrides initrd detection. + This is useful for debugging and testing initrd-only programs in the main + system. + +* `$SYSTEMD_BUS_TIMEOUT=SECS` — specifies the maximum time to wait for method call + completion. If no time unit is specified, assumes seconds. The usual other units + are understood, too (us, ms, s, min, h, d, w, month, y). If it is not set or set + to 0, then the built-in default is used. + +* `$SYSTEMD_MEMPOOL=0` — if set, the internal memory caching logic employed by + hash tables is turned off, and libc `malloc()` is used for all allocations. + +* `$SYSTEMD_UTF8=` — takes a boolean value, and overrides whether to generate + non-ASCII special glyphs at various places (i.e. "→" instead of + "->"). Usually this is determined automatically, based on `$LC_CTYPE`, but in + scenarios where locale definitions are not installed it might make sense to + override this check explicitly. + +* `$SYSTEMD_EMOJI=0` — if set, tools such as `systemd-analyze security` will + not output graphical smiley emojis, but ASCII alternatives instead. 
Note that + this only controls use of Unicode emoji glyphs, and has no effect on other + Unicode glyphs. + +* `$RUNTIME_DIRECTORY` — various tools use this variable to locate the + appropriate path under `/run/`. This variable is also set by the manager when + `RuntimeDirectory=` is used, see systemd.exec(5). + +* `$SYSTEMD_CRYPT_PREFIX` — if set configures the hash method prefix to use for + UNIX `crypt()` when generating passwords. By default the system's "preferred + method" is used, but this can be overridden with this environment variable. + Takes a prefix such as `$6$` or `$y$`. (Note that this is only honoured on + systems built with libxcrypt and is ignored on systems using glibc's + original, internal `crypt()` implementation.) + +* `$SYSTEMD_SECCOMP=0` — if set, seccomp filters will not be enforced, even if + support for it is compiled in and available in the kernel. + +* `$SYSTEMD_LOG_SECCOMP=1` — if set, system calls blocked by seccomp filtering, + for example in `systemd-nspawn`, will be logged to the audit log, if the + kernel supports this. + +* `$SYSTEMD_ENABLE_LOG_CONTEXT` — if set, extra fields will always be logged to + the journal instead of only when logging in debug mode. + +* `$SYSTEMD_NETLINK_DEFAULT_TIMEOUT` — specifies the default timeout of waiting + replies for netlink messages from the kernel. Defaults to 25 seconds. + +`systemctl`: + +* `$SYSTEMCTL_FORCE_BUS=1` — if set, do not connect to PID 1's private D-Bus + listener, and instead always connect through the dbus-daemon D-bus broker. + +* `$SYSTEMCTL_INSTALL_CLIENT_SIDE=1` — if set, enable or disable unit files on + the client side, instead of asking PID 1 to do this. + +* `$SYSTEMCTL_SKIP_SYSV=1` — if set, do not call SysV compatibility hooks. + +* `$SYSTEMCTL_SKIP_AUTO_KEXEC=1` — if set, do not automatically kexec instead of + reboot when a new kernel has been loaded. 
+ +* `$SYSTEMCTL_SKIP_AUTO_SOFT_REBOOT=1` — if set, do not automatically soft-reboot + instead of reboot when a new root file system has been loaded in + `/run/nextroot/`. + +`systemd-nspawn`: + +* `$SYSTEMD_NSPAWN_UNIFIED_HIERARCHY=1` — if set, force `systemd-nspawn` into + unified cgroup hierarchy mode. + +* `$SYSTEMD_NSPAWN_API_VFS_WRITABLE=1` — if set, make `/sys/`, `/proc/sys/`, + and friends writable in the container. If set to "network", leave only + `/proc/sys/net/` writable. + +* `$SYSTEMD_NSPAWN_CONTAINER_SERVICE=…` — override the "service" name nspawn + uses to register with machined. If unset defaults to "nspawn", but with this + variable may be set to any other value. + +* `$SYSTEMD_NSPAWN_USE_CGNS=0` — if set, do not use cgroup namespacing, even if + it is available. + +* `$SYSTEMD_NSPAWN_LOCK=0` — if set, do not lock container images when running. + +* `$SYSTEMD_NSPAWN_TMPFS_TMP=0` — if set, do not overmount `/tmp/` in the + container with a tmpfs, but leave the directory from the image in place. + +* `$SYSTEMD_NSPAWN_CHECK_OS_RELEASE=0` — if set, do not fail when trying to + boot an OS tree without an os-release file (useful when trying to boot a + container with empty `/etc/` and bind-mounted `/usr/`) + +* `$SYSTEMD_SUPPRESS_SYNC=1` — if set, all disk synchronization syscalls are + blocked to the container payload (e.g. `sync()`, `fsync()`, `syncfs()`, …) + and the `O_SYNC`/`O_DSYNC` flags are made unavailable to `open()` and + friends. This is equivalent to passing `--suppress-sync=yes` on the + `systemd-nspawn` command line. + +* `$SYSTEMD_NSPAWN_NETWORK_MAC=...` — if set, allows users to set a specific MAC + address for a container, ensuring that it uses the provided value instead of + generating a random one. It is effective when used with `--network-veth`. The + expected format is six groups of two hexadecimal digits separated by colons, + e.g. 
`SYSTEMD_NSPAWN_NETWORK_MAC=12:34:56:78:90:AB` + +`systemd-logind`: + +* `$SYSTEMD_BYPASS_HIBERNATION_MEMORY_CHECK=1` — if set, report that + hibernation is available even if the swap devices do not provide enough room + for it. + +* `$SYSTEMD_REBOOT_TO_FIRMWARE_SETUP` — if set, overrides `systemd-logind`'s + built-in EFI logic of requesting a reboot into the firmware. Takes a boolean. + If set to false, the functionality is turned off entirely. If set to true, + instead of requesting a reboot into the firmware setup UI through EFI a file, + `/run/systemd/reboot-to-firmware-setup` is created whenever this is + requested. This file may be checked for by services run during system + shutdown in order to request the appropriate operation from the firmware in + an alternative fashion. + +* `$SYSTEMD_REBOOT_TO_BOOT_LOADER_MENU` — similar to the above, allows + overriding of `systemd-logind`'s built-in EFI logic of requesting a reboot + into the boot loader menu. Takes a boolean. If set to false, the + functionality is turned off entirely. If set to true, instead of requesting a + reboot into the boot loader menu through EFI, the file + `/run/systemd/reboot-to-boot-loader-menu` is created whenever this is + requested. The file contains the requested boot loader menu timeout in µs, + formatted in ASCII decimals, or zero in case no timeout is requested. This + file may be checked for by services run during system shutdown in order to + request the appropriate operation from the boot loader in an alternative + fashion. + +* `$SYSTEMD_REBOOT_TO_BOOT_LOADER_ENTRY` — similar to the above, allows + overriding of `systemd-logind`'s built-in EFI logic of requesting a reboot + into a specific boot loader entry. Takes a boolean. If set to false, the + functionality is turned off entirely. If set to true, instead of requesting a + reboot into a specific boot loader entry through EFI, the file + `/run/systemd/reboot-to-boot-loader-entry` is created whenever this is + requested. 
The file contains the requested boot loader entry identifier. This + file may be checked for by services run during system shutdown in order to + request the appropriate operation from the boot loader in an alternative + fashion. Note that by default only boot loader entries which follow the + [Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification) + and are placed in the ESP or the Extended Boot Loader partition may be + selected this way. However, if a directory `/run/boot-loader-entries/` + exists, the entries are loaded from there instead. The directory should + contain the usual directory hierarchy mandated by the Boot Loader + Specification, i.e. the entry drop-ins should be placed in + `/run/boot-loader-entries/loader/entries/*.conf`, and the files referenced by + the drop-ins (including the kernels and initrds) somewhere else below + `/run/boot-loader-entries/`. Note that all these files may be (and are + supposed to be) symlinks. `systemd-logind` will load these files on-demand, + these files can hence be updated (ideally atomically) whenever the boot + loader configuration changes. A foreign boot loader installer script should + hence synthesize drop-in snippets and symlinks for all boot entries at boot + or whenever they change if it wants to integrate with `systemd-logind`'s + APIs. + +`systemd-udevd` and sd-device library: + +* `$NET_NAMING_SCHEME=` — if set, takes a network naming scheme (i.e. one of + "v238", "v239", "v240"…, or the special value "latest") as parameter. If + specified udev's `net_id` builtin will follow the specified naming scheme + when determining stable network interface names. This may be used to revert + to naming schemes of older udev versions, in order to provide more stable + naming across updates. 
This environment variable takes precedence over the + kernel command line option `net.naming-scheme=`, except if the value is + prefixed with `:` in which case the kernel command line option takes + precedence, if it is specified as well. + +* `$SYSTEMD_DEVICE_VERIFY_SYSFS` — if set to "0", disables verification that + devices sysfs path are actually backed by sysfs. Relaxing this verification + is useful for testing purposes. + +`nss-systemd`: + +* `$SYSTEMD_NSS_BYPASS_SYNTHETIC=1` — if set, `nss-systemd` won't synthesize + user/group records for the `root` and `nobody` users if they are missing from + `/etc/passwd`. + +* `$SYSTEMD_NSS_DYNAMIC_BYPASS=1` — if set, `nss-systemd` won't return + user/group records for dynamically registered service users (i.e. users + registered through `DynamicUser=1`). + +`systemd-timedated`: + +* `$SYSTEMD_TIMEDATED_NTP_SERVICES=…` — colon-separated list of unit names of + NTP client services. If set, `timedatectl set-ntp on` enables and starts the + first existing unit listed in the environment variable, and + `timedatectl set-ntp off` disables and stops all listed units. + +`systemd-sulogin-shell`: + +* `$SYSTEMD_SULOGIN_FORCE=1` — This skips asking for the root password if the + root password is not available (such as when the root account is locked). + See `sulogin(8)` for more details. + +`bootctl` and other tools that access the EFI System Partition (ESP): + +* `$SYSTEMD_RELAX_ESP_CHECKS=1` — if set, the ESP validation checks are + relaxed. Specifically, validation checks that ensure the specified ESP path + is a FAT file system are turned off, as are checks that the path is located + on a GPT partition with the correct type UUID. + +* `$SYSTEMD_ESP_PATH=…` — override the path to the EFI System Partition. This + may be used to override ESP path auto detection, and redirect any accesses to + the ESP to the specified directory. 
Note that unlike with `bootctl`'s + `--path=` switch only very superficial validation of the specified path is + done when this environment variable is used. + +* `$KERNEL_INSTALL_CONF_ROOT=…` — override the built in default configuration + directory /etc/kernel/ to read files like entry-token and install.conf from. + +`systemd` itself: + +* `$SYSTEMD_ACTIVATION_UNIT` — set for all NSS and PAM module invocations that + are done by the service manager on behalf of a specific unit, in child + processes that are later (after execve()) going to become unit + processes. Contains the full unit name (e.g. "foobar.service"). NSS and PAM + modules can use this information to determine in which context and on whose + behalf they are being called, which may be useful to avoid deadlocks, for + example to bypass IPC calls to the very service that is about to be + started. Note that NSS and PAM modules should be careful to only rely on this + data when invoked privileged, or possibly only when getppid() returns 1, as + setting environment variables is of course possible in any even unprivileged + contexts. + +* `$SYSTEMD_ACTIVATION_SCOPE` — closely related to `$SYSTEMD_ACTIVATION_UNIT`, + it is either set to `system` or `user` depending on whether the NSS/PAM + module is called by systemd in `--system` or `--user` mode. + +* `$SYSTEMD_SUPPORT_DEVICE`, `$SYSTEMD_SUPPORT_MOUNT`, `$SYSTEMD_SUPPORT_SWAP` - + can be set to `0` to mark respective unit type as unsupported. Generally, + having less units saves system resources so these options might be useful + for cases where we don't need to track given unit type, e.g. `--user` manager + often doesn't need to deal with device or swap units because they are + handled by the `--system` manager (PID 1). Note that setting certain unit + type as unsupported may not prevent loading some units of that type if they + are referenced by other units of another supported type. 
+ +* `$SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_BURST` — can be set to override the mount + units burst rate limit for parsing `/proc/self/mountinfo`. On a system with + few resources but many mounts the rate limit may be hit, which will cause the + processing of mount units to stall. The burst limit may be adjusted when the + default is not appropriate for a given system. Defaults to `5`, accepts + positive integers. + +`systemd-remount-fs`: + +* `$SYSTEMD_REMOUNT_ROOT_RW=1` — if set and no entry for the root directory + exists in `/etc/fstab` (this file always takes precedence), then the root + directory is remounted writable. This is primarily used by + `systemd-gpt-auto-generator` to ensure the root partition is mounted writable + in accordance to the GPT partition flags. + +`systemd-firstboot` and `localectl`: + +* `$SYSTEMD_LIST_NON_UTF8_LOCALES=1` — if set, non-UTF-8 locales are listed among + the installed ones. By default non-UTF-8 locales are suppressed from the + selection, since we are living in the 21st century. + +`systemd-resolved`: + +* `$SYSTEMD_RESOLVED_SYNTHESIZE_HOSTNAME` — if set to "0", `systemd-resolved` + won't synthesize system hostname on both regular and reverse lookups. + +`systemd-sysext`: + +* `$SYSTEMD_SYSEXT_HIERARCHIES` — this variable may be used to override which + hierarchies are managed by `systemd-sysext`. By default only `/usr/` and + `/opt/` are managed, and directories may be added or removed to that list by + setting this environment variable to a colon-separated list of absolute + paths. Only "real" file systems and directories that only contain "real" file + systems as submounts should be used. Do not specify API file systems such as + `/proc/` or `/sys/` here, or hierarchies that have them as submounts. In + particular, do not specify the root directory `/` here. Similarly, + `$SYSTEMD_CONFEXT_HIERARCHIES` works for confext images and supports the + systemd-confext multi-call functionality of sysext. 
+ +`systemd-tmpfiles`: + +* `$SYSTEMD_TMPFILES_FORCE_SUBVOL` — if unset, `v`/`q`/`Q` lines will create + subvolumes only if the OS itself is installed into a subvolume. If set to `1` + (or another value interpreted as true), these lines will always create + subvolumes if the backing filesystem supports them. If set to `0`, these + lines will always create directories. + +`systemd-sysusers` + +* `$SOURCE_DATE_EPOCH` — if unset, the field of the date of last password change + in `/etc/shadow` will be the number of days from Jan 1, 1970 00:00 UTC until + today. If `$SOURCE_DATE_EPOCH` is set to a valid UNIX epoch value in seconds, + then the field will be the number of days until that time instead. This is to + support creating bit-by-bit reproducible system images by choosing a + reproducible value for the field of the date of last password change in + `/etc/shadow`. See: https://reproducible-builds.org/specs/source-date-epoch/ + +`systemd-sysv-generator`: + +* `$SYSTEMD_SYSVINIT_PATH` — Controls where `systemd-sysv-generator` looks for + SysV init scripts. + +* `$SYSTEMD_SYSVRCND_PATH` — Controls where `systemd-sysv-generator` looks for + SysV init script runlevel link farms. + +systemd tests: + +* `$SYSTEMD_TEST_DATA` — override the location of test data. This is useful if + a test executable is moved to an arbitrary location. + +* `$SYSTEMD_TEST_NSS_BUFSIZE` — size of scratch buffers for "reentrant" + functions exported by the nss modules. + +* `$TESTFUNCS` – takes a colon separated list of test functions to invoke, + causes all non-matching test functions to be skipped. Only applies to tests + using our regular test boilerplate. + +fuzzers: + +* `$SYSTEMD_FUZZ_OUTPUT` — A boolean that specifies whether to write output to + stdout. Setting to true is useful in manual invocations, since all output is + suppressed by default. + +* `$SYSTEMD_FUZZ_RUNS` — The number of times execution should be repeated in + manual invocations. 
+ +Note that it may be also useful to set `$SYSTEMD_LOG_LEVEL`, since all logging +is suppressed by default. + +`systemd-importd`: + +* `$SYSTEMD_IMPORT_BTRFS_SUBVOL` — takes a boolean, which controls whether to + prefer creating btrfs subvolumes over plain directories for machine + images. Has no effect on non-btrfs file systems where subvolumes are not + available anyway. If not set, defaults to true. + +* `$SYSTEMD_IMPORT_BTRFS_QUOTA` — takes a boolean, which controls whether to set + up quota automatically for created btrfs subvolumes for machine images. If + not set, defaults to true. Has no effect if machines are placed in regular + directories, because btrfs subvolumes are not supported or disabled. If + enabled, the quota group of the subvolume is automatically added to a + combined quota group for all such machine subvolumes. + +* `$SYSTEMD_IMPORT_SYNC` — takes a boolean, which controls whether to + synchronize images to disk after installing them, before completing the + operation. If not set, defaults to true. If disabled installation of images + will be quicker, but not as safe. + +`systemd-dissect`, `systemd-nspawn` and all other tools that may operate on +disk images with `--image=` or similar: + +* `$SYSTEMD_DISSECT_VERITY_SIDECAR` — takes a boolean, which controls whether to + load "sidecar" Verity metadata files. If enabled (which is the default), + whenever a disk image is used, a set of files with the `.roothash`, + `.usrhash`, `.roothash.p7s`, `.usrhash.p7s`, `.verity` suffixes are searched + adjacent to disk image file, containing the Verity root hashes, their + signatures or the Verity data itself. If disabled this automatic discovery of + Verity metadata files is turned off. + +* `$SYSTEMD_DISSECT_VERITY_EMBEDDED` — takes a boolean, which controls whether + to load the embedded Verity signature data. 
If enabled (which is the + default), Verity root hash information and a suitable signature is + automatically acquired from a signature partition, following the + [Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification). + If disabled any such partition is ignored. Note that this only disables + discovery of the root hash and its signature, the Verity data partition + itself is still searched in the GPT image. + +* `$SYSTEMD_DISSECT_VERITY_SIGNATURE` — takes a boolean, which controls whether + to validate the signature of the Verity root hash if available. If enabled + (which is the default), the signature of suitable disk images is validated + against any of the certificates in `/etc/verity.d/*.crt` (and similar + directories in `/usr/lib/`, `/run`, …) or passed to the kernel for validation + against its built-in certificates. + +* `$SYSTEMD_DISSECT_VERITY_TIMEOUT_SEC=sec` — takes a timespan, which controls + the timeout waiting for the image to be configured. Defaults to 100 msec. + +* `$SYSTEMD_DISSECT_FILE_SYSTEMS=` — takes a colon-separated list of file + systems that may be mounted for automatically dissected disk images. If not + specified defaults to something like: `ext4:btrfs:xfs:vfat:erofs:squashfs` + +* `$SYSTEMD_LOOP_DIRECT_IO` – takes a boolean, which controls whether to enable + `LO_FLAGS_DIRECT_IO` (i.e. direct IO + asynchronous IO) on loopback block + devices when opening them. Defaults to on, set this to "0" to disable this + feature. + +`systemd-cryptsetup`: + +* `$SYSTEMD_CRYPTSETUP_USE_TOKEN_MODULE` – takes a boolean, which controls + whether to use the libcryptsetup "token" plugin module logic even when + activating via FIDO2, PKCS#11, TPM2, i.e. mechanisms natively supported by + `systemd-cryptsetup`. Defaults to enabled. + +* `$SYSTEMD_CRYPTSETUP_TOKEN_PATH` – takes a path to a directory in the file + system. 
If specified overrides where libcryptsetup will look for token + modules (.so). This is useful for debugging token modules: set this + environment variable to the build directory and you are set. This variable + is only supported when systemd is compiled in developer mode. + +Various tools that read passwords from the TTY, such as `systemd-cryptenroll` +and `homectl`: + +* `$PASSWORD` — takes a string: the literal password to use. If this + environment variable is set it is used as password instead of prompting the + user interactively. This exists primarily for debugging and testing + purposes. Do not use this for production code paths, since environment + variables are typically inherited down the process tree without restrictions + and should thus not be used for secrets. + +* `$NEWPASSWORD` — similar to `$PASSWORD` above, but is used when both a + current and a future password are required, for example if the password is to + be changed. In that case `$PASSWORD` shall carry the current (i.e. old) + password and `$NEWPASSWORD` the new. + +`systemd-homed`: + +* `$SYSTEMD_HOME_ROOT` – defines an absolute path where to look for home + directories/images. When unspecified defaults to `/home/`. This is useful for + debugging purposes in order to run a secondary `systemd-homed` instance that + operates on a different directory where home directories/images are placed. + +* `$SYSTEMD_HOME_RECORD_DIR` – defines an absolute path where to look for + fixated home records kept on the host. When unspecified defaults to + `/var/lib/systemd/home/`. Similar to `$SYSTEMD_HOME_ROOT` this is useful for + debugging purposes, in order to run a secondary `systemd-homed` instance that + operates on a record database entirely separate from the host's. + +* `$SYSTEMD_HOME_DEBUG_SUFFIX` – takes a short string that is suffixed to + `systemd-homed`'s D-Bus and Varlink service names/sockets. This is also + understood by `homectl`. 
This too is useful for running an additional copy of + `systemd-homed` that doesn't interfere with the host's main one. + +* `$SYSTEMD_HOMEWORK_PATH` – configures the path to the `systemd-homework` + binary to invoke. If not specified defaults to + `/usr/lib/systemd/systemd-homework`. + + Combining these four environment variables is pretty useful when + debugging/developing `systemd-homed`: +```sh +SYSTEMD_HOME_DEBUG_SUFFIX=foo \ + SYSTEMD_HOMEWORK_PATH=/home/lennart/projects/systemd/build/systemd-homework \ + SYSTEMD_HOME_ROOT=/home.foo/ \ + SYSTEMD_HOME_RECORD_DIR=/var/lib/systemd/home.foo/ \ + /home/lennart/projects/systemd/build/systemd-homed +``` + +* `$SYSTEMD_HOME_MOUNT_OPTIONS_BTRFS`, `$SYSTEMD_HOME_MOUNT_OPTIONS_EXT4`, + `$SYSTEMD_HOME_MOUNT_OPTIONS_XFS` – configure the default mount options to + use for LUKS home directories, overriding the built-in default mount + options. There's one variable for each of the supported file systems for the + LUKS home directory backend. + +* `$SYSTEMD_HOME_MKFS_OPTIONS_BTRFS`, `$SYSTEMD_HOME_MKFS_OPTIONS_EXT4`, + `$SYSTEMD_HOME_MKFS_OPTIONS_XFS` – configure additional arguments to use for + `mkfs` when formatting LUKS home directories. There's one variable for each + of the supported file systems for the LUKS home directory backend. + +`kernel-install`: + +* `$KERNEL_INSTALL_BYPASS` – If set to "1", execution of kernel-install is skipped + when kernel-install is invoked. This can be useful if kernel-install is invoked + unconditionally as a child process by another tool, such as package managers + running kernel-install in a postinstall script. + +`systemd-journald`, `journalctl`: + +* `$SYSTEMD_JOURNAL_COMPACT` – Takes a boolean. If enabled, journal files are written + in a more compact format that reduces the amount of disk space required by the + journal. Note that journal files in compact mode are limited to 4G to allow use of + 32-bit offsets. Enabled by default. 
+ +* `$SYSTEMD_JOURNAL_COMPRESS` – Takes a boolean, or one of the compression + algorithms "XZ", "LZ4", and "ZSTD". If enabled, the default compression + algorithm set at compile time will be used when opening a new journal file. + If disabled, the journal file compression will be disabled. Note that the + compression mode of existing journal files is not changed. To make the + specified algorithm take effect immediately, you need to explicitly run + `journalctl --rotate`. + +* `$SYSTEMD_CATALOG` – path to the compiled catalog database file to use for + `journalctl -x`, `journalctl --update-catalog`, `journalctl --list-catalog` + and related calls. + +* `$SYSTEMD_CATALOG_SOURCES` – path to the catalog database input source + directory to use for `journalctl --update-catalog`. + +`systemd-pcrextend`, `systemd-cryptsetup`: + +* `$SYSTEMD_FORCE_MEASURE=1` — If set, force measuring of resources (which are + marked for measurement) even if not booted on a kernel equipped with + systemd-stub. Normally, requested measurement of resources is conditionalized + on kernels that have booted with `systemd-stub`. With this environment + variable the test for that may be bypassed, for testing purposes. + +`systemd-repart`: + +* `$SYSTEMD_REPART_MKFS_OPTIONS_<FSTYPE>` – configure additional arguments to use for + `mkfs` when formatting partition file systems. There's one variable for each + of the supported file systems. + +* `$SYSTEMD_REPART_OVERRIDE_FSTYPE` – if set the value will override the file + system type specified in Format= lines in partition definition files. + +`systemd-nspawn`, `systemd-networkd`: + +* `$SYSTEMD_FIREWALL_BACKEND` – takes a string, either `iptables` or + `nftables`. Selects the firewall backend to use. If not specified tries to + use `nftables` and falls back to `iptables` if that's not available.
+ +`systemd-storagetm`: + +* `$SYSTEMD_NVME_MODEL`, `$SYSTEMD_NVME_FIRMWARE`, `$SYSTEMD_NVME_SERIAL`, + `$SYSTEMD_NVME_UUID` – these take a model string, firmware version string, + serial number string, and UUID formatted as string. If specified these + override the defaults exposed on the NVME subsystem and namespace, which are + derived from the underlying block device and system identity. Do not set the + latter two via the environment variable unless `systemd-storagetm` is invoked + to expose a single device only, since those identifiers better should be kept + unique. diff --git a/docs/FAQ.md b/docs/FAQ.md new file mode 100644 index 0000000..483645b --- /dev/null +++ b/docs/FAQ.md @@ -0,0 +1,114 @@ +--- +title: Frequently Asked Questions +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Frequently Asked Questions + +Also check out the [Tips & Tricks](../TIPS_AND_TRICKS)! + +**Q: How do I change the current runlevel?** + +A: In systemd runlevels are exposed via "target units". You can change them like this: + +```sh +# systemctl isolate runlevel5.target +``` + +Note however, that the concept of runlevels is a bit out of date, and it is usually nicer to use modern names for this. e.g.: + +```sh +# systemctl isolate graphical.target +``` + +This will only change the current runlevel, and has no effect on the next boot. + +**Q: How do I change the default runlevel to boot into?** + +A: The symlink /etc/systemd/system/default.target controls where we boot into by default. Link it to the target unit of your choice. For example, like this: + +```sh +# ln -sf /usr/lib/systemd/system/multi-user.target /etc/systemd/system/default.target +``` + +or + +```sh +# ln -sf /usr/lib/systemd/system/graphical.target /etc/systemd/system/default.target +``` + +**Q: How do I figure out the current runlevel?** + +A: Note that there might be more than one target active at the same time. 
So the question regarding _the_ runlevel might not always make sense. Here's how you would figure out all targets that are currently active: + +```sh +$ systemctl list-units --type=target +``` + +If you are just interested in a single number, you can use the venerable _runlevel_ command, but again, its output might be misleading. + +**Q: I want to change a service file, but rpm keeps overwriting it in /usr/lib/systemd/system all the time, how should I handle this?** + +A: The recommended way is to copy the service file from /usr/lib/systemd/system to /etc/systemd/system and edit it there. The latter directory takes precedence over the former, and rpm will never overwrite it. If you want to use the distributed service file again you can simply delete (or rename) the service file in /etc/systemd/system again. + +**Q: My service foo.service as distributed by my operating system vendor is only started when (a connection comes in or some hardware is plugged in). I want to have it started always on boot, too. What should I do?** + +A: Simply place a symlink from that service file in the multi-user.target.wants/ directory (which is where you should symlink everything you want to run in the old runlevel 3, i.e. the normal boot-up without graphical UI. It is pulled in by graphical.target too, so will be started for graphical boot-ups, too): + +```sh +# ln -sf /usr/lib/systemd/system/foobar.service /etc/systemd/system/multi-user.target.wants/foobar.service +# systemctl daemon-reload +``` + +**Q: I want to enable another getty, how would I do that?** + +A: Simply instantiate a new getty service for the port of your choice (internally, this places another symlink for instantiating another serial getty in the getty.target.wants/ directory). +```sh +# systemctl enable serial-getty@ttyS2.service +# systemctl start serial-getty@ttyS2.service +``` + +Note that gettys on the virtual console are started on demand. 
You can control how many you get via the NAutoVTs= setting in [logind.conf(7)](http://www.freedesktop.org/software/systemd/man/logind.html). Also see [this blog story](http://0pointer.de/blog/projects/serial-console.html). + +**Q: How do I figure out which service a process belongs to?** + +A: You may either use ps for that: + +```sh +$ alias psc='ps xawf -eo pid,user,cgroup,args' +$ psc +... +``` + +Or you can even check /proc/$PID/cgroup directly. Also see [this blog story](http://0pointer.de/blog/projects/systemd-for-admins-2.html). + +**Q: Why don't you use inotify to reload the unit files automatically on change?** + +A: Unfortunately that would be a racy operation. For an explanation why and how we tried to improve the situation, see [the bugzilla report about this](https://bugzilla.redhat.com/show_bug.cgi?id=615527). + +**Q: I have a native systemd service file and a SysV init script installed which share the same basename, e.g. /usr/lib/systemd/system/foobar.service vs. /etc/init.d/foobar -- which one wins?** + +A: If both files are available the native unit file always takes precedence and the SysV init script is ignored, regardless of whether either is enabled or disabled. Note that a SysV service that is enabled but overridden by a native service does not have the effect that the native service would be enabled, too. Enabling of native and SysV services is completely independent. Or in other words: you cannot enable a native service by enabling a SysV service by the same name, and if a SysV service is enabled but the respective native service is not, this will not have the effect that the SysV script is executed. + +**Q: How can I use journalctl to display full (= not truncated) messages even if less is not used?** + +A: Use: + +```sh +# journalctl --full +``` + + +**Q: Whenever my service tries to acquire RT scheduling for one of its threads this is refused with EPERM even though my service is running with full privileges.
This works fine on my non-systemd system!** + +A: By default, systemd places all systemd daemons in their own cgroup in the "cpu" hierarchy. Unfortunately, due to a kernel limitation, this has the effect of disallowing RT entirely for the service. See [My Service Can't Get Realtime!](../MY_SERVICE_CANT_GET_REATLIME) for a longer discussion and what to do about this. + +**Q: My service is ordered after `network.target` but at boot it is still called before the network is up. What's going on?** + +A: That's a long story, and that's why we have a wiki page of its own about this: [Running Services After the Network is up](../NETWORK_ONLINE) + +**Q: My systemd system always comes up with `/tmp` as a tiny `tmpfs`. How do I get rid of this?** + +A: That's also a long story, please have a look at [API File Systems](../API_FILE_SYSTEMS) diff --git a/docs/FILE_DESCRIPTOR_STORE.md b/docs/FILE_DESCRIPTOR_STORE.md new file mode 100644 index 0000000..206dda7 --- /dev/null +++ b/docs/FILE_DESCRIPTOR_STORE.md @@ -0,0 +1,213 @@ +--- +title: File Descriptor Store +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# The File Descriptor Store + +*TL;DR: The systemd service manager may optionally maintain a set of file +descriptors for each service. Those file descriptors are under control of the +service. Storing file descriptors in the manager makes it easier to restart +services without dropping connections or losing state.* + +Since its inception `systemd` has supported the *socket* *activation* +mechanism: the service manager creates and listens on some sockets (and similar +UNIX file descriptors) on behalf of a service, and then passes them to the +service during activation of the service via UNIX file descriptor (short: *fd*) +passing over `execve()`. This is primarily exposed in the +[.socket](https://www.freedesktop.org/software/systemd/man/systemd.socket.html) +unit type.
+ +The *file* *descriptor* *store* (short: *fdstore*) extends this concept, and +allows services to *upload* during runtime additional fds to the service +manager that it shall keep on its behalf. File descriptors are passed back to +the service on subsequent activations, the same way as any socket activation +fds are passed. + +If a service fd is passed to the fdstore logic of the service manager it only +maintains a duplicate of it (in the sense of UNIX +[`dup(2)`](https://man7.org/linux/man-pages/man2/dup.2.html)), the fd remains +also in possession of the service itself, and it may (and is expected to) +invoke any operations on it that it likes. + +The primary use-case of this logic is to permit services to restart seamlessly +(for example to update them to a newer version), without losing execution +context, dropping pinned resources, terminating established connections or even +just momentarily losing connectivity. In fact, as the file descriptors can be +uploaded freely at any time during the service runtime, this can even be used +to implement services that robustly handle abnormal termination and can recover +from that without losing pinned resources. + +Note that Linux supports the +[`memfd`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) concept +that allows associating a memory-backed fd with arbitrary data. This may +conveniently be used to serialize service state into and then place in the +fdstore, in order to implement service restarts with full service state being +passed over. + +## Basic Mechanism + +The fdstore is enabled per-service via the +[`FileDescriptorStoreMax=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStoreMax=) +service setting. It defaults to zero (which means the fdstore logic is turned +off), but can take an unsigned integer value that controls how many fds to +permit the service to upload to the service manager to keep simultaneously. 
+ +If set to values > 0, the fdstore is enabled. When invoked the service may now +(asynchronously) upload file descriptors to the fdstore via the +[`sd_pid_notify_with_fds()`](https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html) +API call (or an equivalent re-implementation). When uploading the fds it is +necessary to set the `FDSTORE=1` field in the message, to indicate what the fd +is intended for. It's recommended to also set the `FDNAME=…` field to any +string of choice, which may be used to identify the fd later. + +Whenever the service is restarted the fds in its fdstore will be passed to the +new instance following the same protocol as for socket activation fds, i.e. the +`$LISTEN_FDS`, `$LISTEN_PID`, `$LISTEN_FDNAMES` environment variables will be +set (the last of these will be populated from the `FDNAME=…` field mentioned +above). See +[`sd_listen_fds()`](https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html) +for details on receiving such fds in a service. (Note that the name set in +`FDNAME=…` does not need to be unique, which is useful when operating with +multiple fully equivalent sockets or similar, for example for a service that +both operates on IPv4 and IPv6 and treats both more or less the same.) + +And that's already the gist of it. + +## Seamless Service Restarts + +A system service that provides a client-facing interface that shall be able to +seamlessly restart can make use of this in a scheme like the following: +whenever a new connection comes in it uploads its fd immediately into its +fdstore. At appropriate times it also serializes its state into a memfd it +uploads to the service manager — either whenever the state changed +sufficiently, or simply right before it terminates.
(The latter of course means +that state only survives on *clean* restarts and abnormal termination implies the +state is lost completely — while the former would mean there's a good chance the +next restart after an abnormal termination could continue where it left off +with only some context lost.) + +Using the fdstore for such seamless service restarts is generally recommended +over implementations that attempt to leave a process from the old service +instance around until after the new instance already started, so that the old +then communicates with the new service instance, and passes the fds over +directly. Typically service restarts are a mechanism for implementing *code* +updates, hence leaving two versions of the service running at the same time is +generally problematic. It also collides with the systemd service manager's +general principle of guaranteeing a pristine execution environment, a pristine +security context, and a pristine resource management context for freshly +started services, without uncontrolled "leftovers" from previous runs. For +example: leaving processes from previous runs generally negatively affects +lifecycle management (i.e. `KillMode=none` must be set), which disables large +parts of the service manager's state tracking, resource management (as resource +counters cannot start at zero during service activation anymore, since the old +processes remaining skew them), security policies (as processes with possibly +out-of-date security policies – SELinux, AppArmor, any LSM, seccomp, BPF — in +effect remain), and similar. + +## File Descriptor Store Lifecycle + +By default any file descriptor stored in the fdstore for which a `POLLHUP` or +`POLLERR` is seen is automatically closed and removed from the fdstore. This +behavior can be turned off, by setting the `FDPOLL=0` field when uploading the +fd via `sd_pid_notify_with_fds()`.
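
For services that do not link against libsystemd, the upload can be re-implemented directly on top of the notification socket. The following Python sketch shows one way to do that, under stated assumptions: the helper names `build_fdstore_message()` and `push_to_fdstore()` and the `notify_path` parameter are invented for this example, and only `AF_UNIX` notification sockets (file system or abstract namespace) are handled. The message fields `FDSTORE=1`, `FDNAME=…` and `FDPOLL=0` are the ones described in this document.

```python
import os
import socket
from typing import Optional


def build_fdstore_message(fdname: str, poll: bool = True) -> bytes:
    """Build the newline-separated notification payload for an fdstore upload."""
    fields = ["FDSTORE=1", "FDNAME=" + fdname]
    if not poll:
        # Ask the manager to keep the fd even if POLLHUP/POLLERR is seen on it.
        fields.append("FDPOLL=0")
    return "\n".join(fields).encode("utf-8")


def push_to_fdstore(fd: int, fdname: str, poll: bool = True,
                    notify_path: Optional[str] = None) -> None:
    """Send `fd` plus the fdstore fields to the notification socket.

    `notify_path` is a hypothetical override for testing; by default the
    address is taken from $NOTIFY_SOCKET, as set by the service manager.
    """
    path = notify_path or os.environ["NOTIFY_SOCKET"]
    if path.startswith("@"):
        # A leading "@" denotes a socket in the abstract namespace.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(path)
        # socket.send_fds() (Python 3.9+) attaches the fd via SCM_RIGHTS.
        socket.send_fds(sock, [build_fdstore_message(fdname, poll)], [fd])
```

A service would typically call `push_to_fdstore(conn.fileno(), "client")` right after accepting a connection. Passing `poll=False` is mainly useful for fds such as memfds or listening sockets where a hangup condition should not cause removal.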
+ +The fdstore is automatically closed whenever the service is fully deactivated +and no jobs are queued for it anymore. This means that a restart job for a +service will leave the fdstore intact, but a separate stop and start job for +it — executed synchronously one after the other — will likely not. + +This behavior can be modified via the +[`FileDescriptorStorePreserve=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStorePreserve=) +setting in service unit files. If set to `yes` the fdstore will be kept as long +as the service definition is loaded into memory by the service manager, i.e. as +long as at least one other loaded unit has a reference to it. + +The `systemctl clean --what=fdstore …` command may be used to explicitly clear +the fdstore of a service. This is only allowed when the service is fully +deactivated, and is hence primarily useful in case +`FileDescriptorStorePreserve=yes` is set (because the fdstore is otherwise +fully closed anyway in this state). + +Individual file descriptors may be removed from the fdstore via the +`sd_notify()` mechanism, by sending an `FDSTOREREMOVE=1` message, accompanied +by an `FDNAME=…` string identifying the fds to remove. (The name does not have +to be unique, as mentioned, in which case *all* matching fds are +closed). Generally it's a good idea to send such messages to the service +manager during initialization of the service whenever an unrecognized fd is +received, to make the service robust for code updates: if an old version +uploaded an fd that the new version doesn't recognize anymore it's a good idea to +close it both in the service and in the fdstore. + +Note that storing a duplicate of an fd in the fdstore means the resource pinned +by the fd remains pinned even if the service closes its duplicate of the +fd.
This in particular means that peers on a connection socket uploaded this +way will not receive an automatic `POLLHUP` event anymore if the service code +issues `close()` on the socket. It must accompany it with an `FDSTOREREMOVE=1` +notification to the service manager, so that the fd is comprehensively closed. + +## Access Control + +Access to the fds in the file descriptor store is generally restricted to the +service code itself. Pushing fds into or removing fds from the fdstore is +subject to the access control restrictions of any other `sd_notify()` message, +which is controlled via +[`NotifyAccess=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#NotifyAccess=). + +By default only the main service process hence can push/remove fds, but by +setting `NotifyAccess=all` this may be relaxed to allow arbitrary service +child processes to do the same. + +## Soft Reboot + +The fdstore is particularly interesting in [soft +reboot](https://www.freedesktop.org/software/systemd/man/systemd-soft-reboot.service.html) +scenarios, as per `systemctl soft-reboot` (which restarts userspace like in a +real reboot, but leaves the kernel running). File descriptor stores that remain +loaded at the very end of the system cycle — just before the soft-reboot – are +passed over to the next system cycle, and propagated to services they originate +from there. This enables updating the full userspace of a system during +runtime, fully replacing all processes without losing pinning resources, +interrupting connectivity or established connections and similar. + +This mechanism can be enabled either by making sure the service survives until +the very end (i.e. by setting `DefaultDependencies=no` so that it keeps running +for the whole system lifetime without being regularly deactivated at shutdown) +or by setting `FileDescriptorStorePreserve=yes` (and referencing the unit +continuously). 
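
Whether the fds come back after a plain service restart or after a soft reboot, the service picks them up through the same `$LISTEN_FDS` protocol described in the Basic Mechanism section. The receiving side can be sketched in Python as follows. The function name `received_fds()` and its `environ`/`pid` parameters are assumptions made for this example (they exist to make the logic testable); the starting fd number 3 and the fallback name `unknown` follow the behavior documented for `sd_listen_fds()` and `sd_listen_fds_with_names()`.

```python
import os

# First fd number used for passed fds (SD_LISTEN_FDS_START in sd_listen_fds(3)).
SD_LISTEN_FDS_START = 3


def received_fds(environ=None, pid=None):
    """Map each fd name to the list of fd numbers passed under that name.

    Names need not be unique, so every name maps to a list.
    """
    environ = os.environ if environ is None else environ
    pid = os.getpid() if pid is None else pid
    # The variables are only meant for the process named in $LISTEN_PID.
    if environ.get("LISTEN_PID") != str(pid):
        return {}
    count = int(environ.get("LISTEN_FDS", "0"))
    raw_names = environ.get("LISTEN_FDNAMES", "")
    names = raw_names.split(":") if raw_names else []
    result = {}
    for i in range(count):
        # Fall back to "unknown" when no name was supplied for this fd.
        name = names[i] if i < len(names) else "unknown"
        result.setdefault(name, []).append(SD_LISTEN_FDS_START + i)
    return result
```

During initialization a service can iterate over the result, re-adopt the fds whose names it recognizes, and close the rest (sending `FDSTOREREMOVE=1` for them, as recommended in the lifecycle section).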
+ +For further details see [Resource +Pass-Through](https://www.freedesktop.org/software/systemd/man/systemd-soft-reboot.service.html#Resource%20Pass-Through). + +## Initrd Transitions + +The fdstore may also be used to pass file descriptors for resources from the +initrd context to the main system. Restarting all processes after the +transition is important as code running in the initrd should generally not +continue to run after the switch to the host file system, since that pins +backing files from the initrd, and the initrd might contain different versions +of programs than the host. + +Any service that still runs during the initrd→host transition will have its +fdstore passed over the transition, where it will be passed back to any queued +services of the same name. + +The soft reboot cycle transition and the initrd→host transition are +semantically very similar, hence similar rules apply, and in both cases it is +recommended to use the fdstore if pinned resources shall be passed over. + +## Debugging + +The +[`systemd-analyze`](https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20fdstore%20%5BUNIT...%5D) +tool may be used to list the current contents of the fdstore of any running +service. + +The +[`systemd-run`](https://www.freedesktop.org/software/systemd/man/systemd-run.html) +tool may be used to quickly start a testing binary or similar as a service. Use +`-p FileDescriptorStoreMax=4711` to enable the fdstore from `systemd-run`'s +command line. By using the `-t` switch you can even interactively communicate +with processes spawned that way, via the TTY.
diff --git a/docs/GROUP_RECORD.md b/docs/GROUP_RECORD.md new file mode 100644 index 0000000..f463b0a --- /dev/null +++ b/docs/GROUP_RECORD.md @@ -0,0 +1,163 @@ +--- +title: JSON Group Records +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# JSON Group Records + +Long story short: JSON Group Records are to `struct group` what +[JSON User Records](USER_RECORD) are to `struct passwd`. + +Conceptually, much of what applies to JSON user records also applies to JSON +group records. They also consist of seven sections, with similar properties and +they carry some identical (or at least very similar) fields. + +## Fields in the `regular` section + +`groupName` → A string with the UNIX group name. Matches the `gr_name` field of +UNIX/glibc NSS `struct group`, or the shadow structure `struct sgrp`'s +`sg_namp` field. + +`realm` → The "realm" the group belongs to, conceptually identical to the same +field of user records. A string in DNS domain name syntax. + +`description` → A descriptive string for the group. This is similar to the +`realName` field of user records, and accepts arbitrary strings, as long as +they follow the same GECOS syntax requirements as `realName`. + +`disposition` → The disposition of the group, conceptually identical to the +same field of user records. A string. + +`service` → A string, an identifier for the service managing this group record +(this field is typically in reverse domain name syntax.) + +`lastChangeUSec` → An unsigned 64-bit integer, a timestamp (in µs since the UNIX +epoch 1970) of the last time the group record has been modified. (Covers only +the `regular`, `perMachine` and `privileged` sections). + +`gid` → An unsigned integer in the range 0…4294967295: the numeric UNIX group +ID (GID) to use for the group. This corresponds to the `gr_gid` field of +`struct group`. + +`members` → An array of strings, listing user names that are members of this +group. 
Note that JSON user records also contain a `memberOf` field, or in other +words a group membership can either be denoted in the JSON user record or in +the JSON group record, or in both. The list of memberships should be determined +as the combination of both lists (plus optionally others). If a user is listed +as member of a group and doesn't exist it should be ignored. This field +corresponds to the `gr_mem` field of `struct group` and the `sg_mem` field of +`struct sgrp`. + +`administrators` → Similarly, an array of strings, listing user names that +shall be considered "administrators" of this group. This field corresponds to +the `sg_adm` field of `struct sgrp`. + +`privileged`/`perMachine`/`binding`/`status`/`signature`/`secret` → The +objects/arrays for the other six group record sections. These are organized the +same way as for the JSON user records, and have the same semantics. + +## Fields in the `privileged` section + +The following fields are defined: + +`hashedPassword` → An array of strings with UNIX hashed passwords; see the +matching field for user records for details. This field corresponds to the +`sg_passwd` field of `struct sgrp` (and `gr_passwd` of `struct group` in a +way). + +## Fields in the `perMachine` section + +`matchMachineId`/`matchHostname` → Strings, match expressions similar as for +user records, see the user record documentation for details. 
+ +The following fields are defined for the `perMachine` section and are defined +equivalent to the fields of the same name in the `regular` section, and +override those: + +`gid`, `members`, `administrators` + +## Fields in the `binding` section + +The following fields are defined for the `binding` section, and are equivalent +to the fields of the same name in the `regular` and `perMachine` sections: + +`gid` + +## Fields in the `status` section + +The following fields are defined in the `status` section, and are mostly +equivalent to the fields of the same name in the `regular` section, though with +slightly different conceptual semantics, see the same fields in the user record +documentation: + +`service` + +## Fields in the `signature` section + +The fields in this section are defined identically to those in the matching +section in the user record. + +## Fields in the `secret` section + +Currently no fields are defined in this section for group records. + +## Mapping to `struct group` and `struct sgrp` + +When mapping classic UNIX group records (i.e. 
`struct group` and `struct sgrp`) +to JSON group records the following mappings should be applied: + +| Structure | Field | Section | Field | Condition | +|----------------|-------------|--------------|------------------|----------------------------| +| `struct group` | `gr_name` | `regular` | `groupName` | | +| `struct group` | `gr_passwd` | `privileged` | `password` | (See notes below) | +| `struct group` | `gr_gid` | `regular` | `gid` | | +| `struct group` | `gr_mem` | `regular` | `members` | | +| `struct sgrp` | `sg_namp` | `regular` | `groupName` | | +| `struct sgrp` | `sg_passwd` | `privileged` | `password` | (See notes below) | +| `struct sgrp` | `sg_adm` | `regular` | `administrators` | | +| `struct sgrp` | `sg_mem` | `regular` | `members` | | + +At this time almost all Linux machines employ shadow passwords, thus the +`gr_passwd` field in `struct group` is set to `"x"`, and the actual password +is stored in the shadow entry `struct sgrp`'s field `sg_passwd`. + +## Extending These Records + +The same logic and recommendations apply as for JSON user records. + +## Examples + +A reasonable group record for a system group might look like this: + +```json +{ + "groupName" : "systemd-resolve", + "gid" : 193, + "status" : { + "6b18704270e94aa896b003b4340978f1" : { + "service" : "io.systemd.NameServiceSwitch" + } + } +} +``` + +And here's a more complete one for a regular group: + +```json +{ + "groupName" : "grobie", + "binding" : { + "6b18704270e94aa896b003b4340978f1" : { + "gid" : 60232 + } + }, + "disposition" : "regular", + "status" : { + "6b18704270e94aa896b003b4340978f1" : { + "service" : "io.systemd.Home" + } + } +} +``` diff --git a/docs/HACKING.md b/docs/HACKING.md new file mode 100644 index 0000000..aea25db --- /dev/null +++ b/docs/HACKING.md @@ -0,0 +1,339 @@ +--- +title: Hacking on systemd +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Hacking on systemd + +We welcome all contributions to systemd. 
If you notice a bug or a missing +feature, please feel invited to fix it, and submit your work as a +[GitHub Pull Request (PR)](https://github.com/systemd/systemd/pull/new). + +Please make sure to follow our [Coding Style](CODING_STYLE) when submitting +patches. Also have a look at our [Contribution Guidelines](CONTRIBUTING). + +When adding new functionality, tests should be added. For shared functionality +(in `src/basic/` and `src/shared/`) unit tests should be sufficient. The general +policy is to keep tests in matching files underneath `src/test/`, +e.g. `src/test/test-path-util.c` contains tests for any functions in +`src/basic/path-util.c`. If adding a new source file, consider adding a matching +test executable. For features at a higher level, tests in `src/test/` are very +strongly recommended. If that is not possible, integration tests in `test/` are +encouraged. + +Please also have a look at our list of [code quality tools](CODE_QUALITY) we +have setup for systemd, to ensure our codebase stays in good shape. + +Please always test your work before submitting a PR. For many of the components +of systemd testing is straightforward as you can simply compile systemd and +run the relevant tool from the build directory. + +For some components (most importantly, systemd/PID 1 itself) this is not +possible, however. In order to simplify testing for cases like this we provide +a set of `mkosi` build files directly in the source tree. +[mkosi](https://github.com/systemd/mkosi) is a tool for building clean OS images +from an upstream distribution in combination with a fresh build of the project +in the local working directory. To make use of this, please install `mkosi` v19 +or newer using your distribution's package manager or from the +[GitHub repository](https://github.com/systemd/mkosi). `mkosi` will build an +image for the host distro by default. First, run `mkosi genkey` to generate a key +and certificate to be used for secure boot and verity signing. 
After that is done, +it is sufficient to type `mkosi` in the systemd project directory to generate a disk +image you can boot either in `systemd-nspawn` or in a UEFI-capable VM: + +```sh +$ sudo mkosi boot # nspawn still needs sudo for now +``` + +or: + +```sh +$ mkosi qemu +``` + +Every time you rerun the `mkosi` command a fresh image is built, incorporating +all current changes you made to the project tree. + +Putting this all together, here's a series of commands for preparing a patch +for systemd: + +```sh +$ git clone https://github.com/systemd/mkosi.git # If mkosi v19 or newer is not packaged by your distribution +$ ln -s $PWD/mkosi/bin/mkosi /usr/local/bin/mkosi # If mkosi v19 or newer is not packaged by your distribution +$ git clone https://github.com/systemd/systemd.git +$ cd systemd +$ git checkout -b <BRANCH> # where BRANCH is the name of the branch +$ vim src/core/main.c # or wherever you'd like to make your changes +$ mkosi -f qemu # (re-)build and boot up the test image in qemu +$ git add -p # interactively put together your patch +$ git commit # commit it +$ git push -u <REMOTE> # where REMOTE is your "fork" on GitHub +``` + +And after that, head over to your repo on GitHub and click "Compare & pull request" + +If you want to do a local build without mkosi, most distributions also provide +very simple and convenient ways to install most development packages necessary +to build systemd: + +```sh +# Fedora +$ sudo dnf builddep systemd +# Debian/Ubuntu +$ sudo apt-get build-dep systemd +# Arch +$ sudo pacman -S devtools +$ pkgctl repo clone --protocol=https systemd +$ cd systemd +$ makepkg -seoc +``` + +After installing the development packages, systemd can be built from source as follows: + +```sh +$ meson setup build <options> +$ ninja -C build +$ meson test -C build +``` + +Happy hacking! + +## Templating engines in .in files + +Some source files are generated during build. 
We use two templating engines: +* meson's `configure_file()` directive uses syntax with `@VARIABLE@`. + + See the + [Meson docs for `configure_file()`](https://mesonbuild.com/Reference-manual.html#configure_file) + for details. + +{% raw %} +* most files are rendered using jinja2, with `{{VARIABLE}}` and `{% if … %}`, + `{% elif … %}`, `{% else … %}`, `{% endif … %}` blocks. `{# … #}` is a + jinja2 comment, i.e. that block will not be visible in the rendered + output. `{% raw %} … `{% endraw %}`{{ '{' }}{{ '% endraw %' }}}` creates a block + where jinja2 syntax is not interpreted. + + See the + [Jinja Template Designer Documentation](https://jinja2docs.readthedocs.io/en/stable/templates.html#synopsis) + for details. + +Please note that files for both template engines use the `.in` extension. + +## Developer and release modes + +In the default meson configuration (`-Dmode=developer`), certain checks are +enabled that are suitable when hacking on systemd (such as internal +documentation consistency checks). Those are not useful when compiling for +distribution and can be disabled by setting `-Dmode=release`. + +## Sanitizers in mkosi + +See [Testing systemd using sanitizers](TESTING_WITH_SANITIZERS) for more information +on how to build with sanitizers enabled in mkosi. + +## Fuzzers + +systemd includes fuzzers in `src/fuzz/` that use libFuzzer and are automatically +run by [OSS-Fuzz](https://github.com/google/oss-fuzz) with sanitizers. +To add a fuzz target, create a new `src/fuzz/fuzz-foo.c` file with a `LLVMFuzzerTestOneInput` +function and add it to the list in `src/fuzz/meson.build`. + +Whenever possible, a seed corpus and a dictionary should also be added with new +fuzz targets. The dictionary should be named `src/fuzz/fuzz-foo.dict` and the seed +corpus should be built and exported as `$OUT/fuzz-foo_seed_corpus.zip` in +`tools/oss-fuzz.sh`. 
+ +The fuzzers can be built locally, if you have libFuzzer installed, by running +`tools/oss-fuzz.sh`, or by running: + +``` +CC=clang CXX=clang++ \ +meson setup build-libfuzz -Dllvm-fuzz=true -Db_sanitize=address,undefined -Db_lundef=false \ + -Dc_args='-fno-omit-frame-pointer -DFUZZING_BUILD_MODE_UNSAFE_FOR_PRODUCTION' +ninja -C build-libfuzz fuzzers +``` + +Each fuzzer can then be run manually together with a directory containing +the initial corpus: + +``` +export UBSAN_OPTIONS=print_stacktrace=1:print_summary=1:halt_on_error=1 +build-libfuzz/fuzz-varlink-idl test/fuzz/fuzz-varlink-idl/ +``` + +Note: the `halt_on_error=1` UBSan option is especially important, otherwise +the fuzzer won't crash when undefined behavior is triggered. + +You should also confirm that the fuzzers can be built and run using +[the OSS-Fuzz toolchain](https://google.github.io/oss-fuzz/advanced-topics/reproducing/#building-using-docker): + +``` +path_to_systemd=... + +git clone --depth=1 https://github.com/google/oss-fuzz +cd oss-fuzz + +for sanitizer in address undefined memory; do + for engine in libfuzzer afl honggfuzz; do + ./infra/helper.py build_fuzzers --sanitizer "$sanitizer" --engine "$engine" \ + --clean systemd "$path_to_systemd" + + ./infra/helper.py check_build --sanitizer "$sanitizer" --engine "$engine" \ + -e ALLOWED_BROKEN_TARGETS_PERCENTAGE=0 systemd + done +done + +./infra/helper.py build_fuzzers --clean --architecture i386 systemd "$path_to_systemd" +./infra/helper.py check_build --architecture i386 -e ALLOWED_BROKEN_TARGETS_PERCENTAGE=0 systemd + +./infra/helper.py build_fuzzers --clean --sanitizer coverage systemd "$path_to_systemd" +./infra/helper.py coverage --no-corpus-download systemd +``` + +If you find a bug that impacts the security of systemd, please follow the +guidance in [CONTRIBUTING.md](CONTRIBUTING) on how to report a security vulnerability.
+ +For more details on building fuzzers and integrating with OSS-Fuzz, visit: + +- [Setting up a new project - OSS-Fuzz](https://google.github.io/oss-fuzz/getting-started/new-project-guide/) +- [Tutorials - OSS-Fuzz](https://google.github.io/oss-fuzz/reference/useful-links/#tutorials) + +## Debugging binaries that need to run as root in vscode + +When trying to debug binaries that need to run as root, we need to do some custom configuration in vscode to +have it try to run the applications as root and to ask the user for the root password when trying to start +the binary. To achieve this, we'll use a custom debugger path which points to a script that starts `gdb` as +root using `pkexec`. pkexec will prompt the user for their root password via a graphical interface. This +guide assumes the C/C++ extension is used for debugging. + +First, create a file `sgdb` in the root of the systemd repository with the following contents and make it +executable: + +``` +#!/bin/sh +exec pkexec gdb "$@" +``` + +Then, open launch.json in vscode, and set `miDebuggerPath` to `${workspaceFolder}/sgdb` for the corresponding +debug configuration. Now, whenever you try to debug the application, vscode will try to start gdb as root via +pkexec which will prompt you for your password via a graphical interface. After entering your password, +vscode should be able to start debugging the application. + +For more information on how to set up a debug configuration for C binaries, please refer to the official +vscode documentation [here](https://code.visualstudio.com/docs/cpp/launch-json-reference) + +## Debugging systemd with mkosi + vscode + +To simplify debugging systemd when testing changes using mkosi, we're going to show how to attach +[VSCode](https://code.visualstudio.com/)'s debugger to an instance of systemd running in a mkosi image using +QEMU. 
+ +To allow VSCode's debugger to attach to systemd running in a mkosi image, we have to make sure it can access +the virtual machine spawned by mkosi where systemd is running. mkosi makes this possible via a handy SSH +option that makes the generated image accessible via SSH when booted. Thus you must build the image with +`mkosi --ssh`. The easiest way to set the option is to create a file `mkosi.local.conf` in the root of the +repository and add the following contents: + +``` +[Host] +Ssh=yes +RuntimeTrees=. +``` + +Also make sure that the SSH agent is running on your system and that you've added your SSH key to it with +`ssh-add`. Also make sure that `virtiofsd` is installed. + +After rebuilding the image and booting it with `mkosi qemu`, you should now be able to connect to it by +running `mkosi ssh` from the same directory in another terminal window. + +Now we need to configure VSCode. First, make sure the C/C++ extension is installed. If you're already using +a different extension for code completion and other IDE features for C in VSCode, make sure to disable the +corresponding parts of the C/C++ extension in your VSCode user settings by adding the following entries: + +```json +"C_Cpp.formatting": "Disabled", +"C_Cpp.intelliSenseEngine": "Disabled", +"C_Cpp.enhancedColorization": "Disabled", +"C_Cpp.suggestSnippets": false, +``` + +With the extension set up, we can create the launch.json file in the .vscode/ directory to tell the VSCode +debugger how to attach to the systemd instance running in our mkosi container/VM. 
Create the file, and possibly +the directory, and add the following contents: + +```json +{ + "version": "0.2.0", + "configurations": [ + { + "type": "cppdbg", + "program": "/usr/lib/systemd/systemd", + "processId": "${command:pickRemoteProcess}", + "request": "attach", + "name": "systemd", + "pipeTransport": { + "pipeProgram": "mkosi", + "pipeArgs": [ + "-C", + "/path/to/systemd/repo/directory/on/host/system/", + "ssh" + ], + "debuggerPath": "/usr/bin/gdb" + }, + "MIMode": "gdb", + "sourceFileMap": { + "/root/src/systemd": { + "editorPath": "${workspaceFolder}", + "useForBreakpoints": false + }, + } + } + ] +} +``` + +Now that the debugger knows how to connect to our process in the container/VM and we've set up the necessary +source mappings, go to the "Run and Debug" window and run the "systemd" debug configuration. If everything +goes well, the debugger should now be attached to the systemd instance running in the container/VM. You can +set breakpoints from the editor and enjoy all the other features of VSCode's debugger. + +To debug systemd components other than PID 1, set "program" to the full path of the component you want to +debug and set "processId" to "${command:pickProcess}". Now, when starting the debugger, VSCode will ask you +for the PID of the process you want to debug. Run `systemctl show --property MainPID --value <component>` in the +container to figure out the PID and enter it when asked, and VSCode will attach to that process instead. + +## Debugging systemd-boot + +During boot, systemd-boot and the stub loader will output messages like +`systemd-boot@0x0A` and `systemd-stub@0x0B`, providing the base of the loaded +code. This location can then be used to attach to a QEMU session (provided it +was run with `-s`). See the `debug-sd-boot.sh` script in the tools folder, which +automates this process. + +If the debugger is too slow to attach to examine an early boot code passage, +the call to `DEFINE_EFI_MAIN_FUNCTION()` can be modified to enable waiting.
As +soon as the debugger has control, we can then run `set variable wait = 0` or +`return` to continue. Once the debugger has attached, setting breakpoints will +work like usual. + +To debug systemd-boot in an IDE such as VSCode we can use a launch configuration like this: +```json +{ + "name": "systemd-boot", + "type": "cppdbg", + "request": "launch", + "program": "${workspaceFolder}/build/src/boot/efi/systemd-bootx64.efi", + "cwd": "${workspaceFolder}", + "MIMode": "gdb", + "miDebuggerServerAddress": ":1234", + "setupCommands": [ + { "text": "shell mkfifo /tmp/sdboot.{in,out}" }, + { "text": "shell qemu-system-x86_64 [...] -s -serial pipe:/tmp/sdboot" }, + { "text": "shell ${workspaceFolder}/tools/debug-sd-boot.sh ${workspaceFolder}/build/src/boot/efi/systemd-bootx64.efi /tmp/sdboot.out systemd-boot.gdb" }, + { "text": "source /tmp/systemd-boot.gdb" }, + ] +} +``` diff --git a/docs/HOME_DIRECTORY.md b/docs/HOME_DIRECTORY.md new file mode 100644 index 0000000..f1b7faf --- /dev/null +++ b/docs/HOME_DIRECTORY.md @@ -0,0 +1,178 @@ +--- +title: Home Directories +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Home Directories + +[`systemd-homed.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-homed.service.html) +manages home directories of regular ("human") users. Each directory it manages +encapsulates both the data store and the user record of the user, so that it +comprehensively describes the user account, and is thus naturally portable +between systems without any further, external metadata. This document describes +the format used by these home directories, in the context of the storage +mechanism used. + +## General Structure + +Inside of the home directory a file `~/.identity` contains the JSON formatted +user record of the user. It follows the format defined in +[`JSON User Records`](USER_RECORD). It is recommended to bring the +record into 'normalized' form (i.e. 
all objects should contain their fields +sorted alphabetically by their key) before storing it there, though this is not +required nor enforced. Since the user record is cryptographically signed, the +user cannot make modifications to the file on their own (at least not without +corrupting it, or knowing the private key used for signing the record). Note +that user records are stored here without their `binding`, `status` and +`secret` sections, i.e. only with the sections included in the signature plus +the signature section itself. + +## Storage Mechanism: Plain Directory/`btrfs` Subvolume + +If the plain directory or `btrfs` subvolume storage mechanism of +`systemd-homed` is used (i.e. `--storage=directory` or `--storage=subvolume` on +the +[`homectl(1)`](https://www.freedesktop.org/software/systemd/man/homectl.html) +command line) the home directory requires no special setup besides including +the user record in the `~/.identity` file. + +It is recommended to name home directories managed this way by +`systemd-homed.service` by the user name, suffixed with `.homedir` (example: +`lennart.homedir` for a user `lennart`) but this is not enforced. When the user +is logged in, the directory is generally mounted to `/home/$USER` (in our +example: `/home/lennart`), thus dropping the suffix while the home directory is +active. `systemd-homed` will automatically discover home directories named this +way in `/home/*.homedir` and synthesize NSS user records for them as they show +up. + +## Storage Mechanism: `fscrypt` Directories + +This storage mechanism is mostly identical to the plain directory storage +mechanism, except that the home directory is encrypted using `fscrypt`. (Use +`--storage=fscrypt` on the `homectl` command line.) Key management is +implemented via extended attributes on the directory itself: for each password +an extended attribute `trusted.fscrypt_slot0`, `trusted.fscrypt_slot1`, +`trusted.fscrypt_slot2`, … is maintained. 
Its value contains a colon-separated +pair of Base64 encoded data fields. The first field contains a salt value, the +second field the encrypted volume key. The latter is encrypted using AES256 in +counter mode, using a key derived from the password via PBKDF2-HMAC-SHA512, +together with the salt value. The construction is similar to what LUKS does for +`dm-crypt` encrypted volumes. Note that extended attributes are not encrypted +by `fscrypt` and hence are suitable for carrying the key slots. Moreover, by +using extended attributes, the slots are directly attached to the directory and +an independent sidecar key database is not required. + +## Storage Mechanism: `cifs` Home Directories + +In this storage mechanism, the home directory is mounted from a CIFS server and +service at login, configured inside the user record. (Use `--storage=cifs` on +the `homectl` command line.) The local password of the user is used to log into +the CIFS service. The directory share needs to contain the user record in +`~/.identity` as well. Note that this means that the user record needs to be +registered locally before it can be mounted for the first time, since CIFS +domain and server information needs to be known *before* the mount. Note that +for all other storage mechanisms it is entirely sufficient if the directories +or storage artifacts are placed at the right locations — all information to +activate them can be derived automatically from their mere availability. + +## Storage Mechanism: `luks` Home Directories + +This is the most advanced and most secure storage mechanism and consists of a +Linux file system inside a LUKS2 volume inside a loopback file (or on removable +media). (Use `--storage=luks` on the `homectl` command line.) Specifically: + +* The image contains a GPT partition table. For now it should only contain a + single partition, and that partition must have the type UUID + `773f91ef-66d4-49b5-bd83-d683bf40ad16`. Its partition label must be the + user name. 
+ +* This partition must contain a LUKS2 volume, whose label must be the user + name. The LUKS2 volume must contain a LUKS2 token field of type + `systemd-homed`. The JSON data of this token must have a `record` field, + containing a string with base64-encoded data. This data is the JSON user + record, in the same serialization as in `~/.identity`, though encrypted. The + JSON data of this token must also have an `iv` field, which contains a + base64-encoded binary initialization vector for the encryption. The + encryption used is the same as the LUKS2 volume itself uses, unlocked by the + same volume key, but based on its own IV. + +* Inside of this LUKS2 volume must be a Linux file system, one of `ext4`, + `btrfs` and `xfs`. The file system label must be the user name. + +* This file system should contain a single directory named after the user. This + directory will become the home directory of the user when activated. It + contains a second copy of the user record in the `~/.identity` file, like in + the other storage mechanisms. + +The image file should reside in a directory `/home/` on the system, +named after the user, suffixed with `.home`. When activated, the container home +directory is mounted to the same path, though with the `.home` suffix dropped — +unless a different mount point is defined in the user record. (e.g.: the +loopback file `/home/waldo.home` is mounted to `/home/waldo` while activated.) +When the image is stored on removable media (such as a USB stick), the image +file can be directly `dd`'ed onto it; the format is unchanged. The GPT envelope +should ensure the image is properly recognizable as a home directory both when +used in a loopback file and on a removable USB stick. (Note that when mounting +a home directory from a USB stick, it too defaults to a directory in `/home/`, +named after the username, with no further suffix.) 
+ +Rationale for the GPT partition table envelope: this way the image is nicely +discoverable and recognizable already by partition managers as a home +directory. Moreover, when copied onto a USB stick the GPT envelope makes sure +the stick is properly recognizable as a portable home directory +medium. (Moreover, it allows embedding additional partitions later on, for +example on a multi-purpose USB stick that contains both a home +directory and a generic storage volume.) + +Rationale for including the encrypted user record in the LUKS2 header: +Linux kernel file system implementations are generally not robust towards +maliciously formatted file systems; there's a good chance that file system +images can be used as attack vectors, exploiting the kernel. Thus it is +necessary to validate the home directory image *before* mounting it and +establishing a minimal level of trust. Since the user record data is +cryptographically signed and user records not signed with a recognized private +key are not accepted, a minimal level of trust between the system and the home +directory image is established. + +Rationale for storing the home directory one level below the root directory of +the contained file system: this way special directories such as `lost+found/` +do not show up in the user's home directory. + +## Algorithm + +Regardless of the storage mechanism used, an activated home directory +necessarily involves a mount point to be established. In case of the +directory-based storage mechanisms (`directory`, `subvolume` and `fscrypt`) +this is a bind mount. In case of `cifs` this is a CIFS network mount, and in +case of the LUKS2 backend a regular block device mount of the file system +contained in the LUKS2 image.
By requiring a mount for all cases (even for +those that already are a directory), a clear logic is defined to distinguish +active and inactive home directories, so that the directories become +inaccessible under their regular path the instant they are +deactivated. Moreover, the `nosuid`, `nodev` and `noexec` flags configured in +the user record are applied when the bind mount is established. + +During activation, the user records retained on the host, the user record +stored in the LUKS2 header (in case of the LUKS2 storage mechanism) and the +user record stored inside the home directory in `~/.identity` are +compared. Activation is only permitted if they match the same user and are +signed by a recognized key. When the three instances differ in `lastChangeUSec` +field, the newest record wins, and is propagated to the other two locations. + +During activation, the file system checker (`fsck`) appropriate for the +selected file system is automatically invoked, ensuring the file system is in a +healthy state before it is mounted. + +If the UID assigned to a user does not match the owner of the home directory in +the file system, the home directory is automatically and recursively `chown()`ed +to the correct UID. + +Depending on the `luksDiscard` setting of the user record, either the backing +loopback file is `fallocate()`ed during activation, or the mounted file system +is `FITRIM`ed after mounting, to ensure the setting is correctly enforced. + +When deactivating a home directory, the file system or block device is trimmed +or extended as configured in the `luksOfflineDiscard` setting of the user +record. 
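The record reconciliation rule from the Algorithm section above (all copies must describe the same user, and the newest `lastChangeUSec` wins) might be sketched as follows. This is an illustration only, not the `systemd-homed` implementation: `pick_newest` is a hypothetical helper, signature verification is deliberately left out, and the records are assumed to be already-parsed JSON objects carrying the `userName` and `lastChangeUSec` fields of JSON user records.

```python
def pick_newest(records):
    """Return the most recently changed user record.

    `records` holds the copies found on the host, in the LUKS2 header and
    in ~/.identity; a copy that does not exist is passed as None.
    Hypothetical helper: signature checks are NOT performed here.
    """
    present = [r for r in records if r is not None]
    if not present:
        raise ValueError("no user record available")

    # Activation is only permitted if all copies describe the same user.
    names = {r["userName"] for r in present}
    if len(names) != 1:
        raise ValueError("records describe different users")

    # Assumption: a record without lastChangeUSec is treated as oldest.
    return max(present, key=lambda r: r.get("lastChangeUSec", 0))


host = {"userName": "waldo", "lastChangeUSec": 100}
header = {"userName": "waldo", "lastChangeUSec": 300}
identity = {"userName": "waldo", "lastChangeUSec": 200}

winner = pick_newest([host, header, identity])
# The winning record would then be propagated to the other two locations.
```

With the sample values above the copy from the LUKS2 header is selected, since it carries the largest `lastChangeUSec`.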
diff --git a/docs/INCOMPATIBILITIES.md b/docs/INCOMPATIBILITIES.md new file mode 100644 index 0000000..75b60b6 --- /dev/null +++ b/docs/INCOMPATIBILITIES.md @@ -0,0 +1,33 @@ +--- +title: Compatibility with SysV +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Compatibility with SysV + +systemd provides a fair degree of compatibility with the behavior exposed by the SysV init system as implemented by many distributions. Compatibility is provided both for the user experience and the SysV scripting APIs. However, there are some areas where compatibility is limited due to technical reasons or design decisions of systemd and the distributions. All of the following applies to SysV init scripts handled by systemd, however a number of them matter only on specific distributions. Many of the incompatibilities are specific to distribution-specific extensions of LSB/SysV init. + +* If your distribution removes SysV init scripts in favor of systemd unit files typing "/etc/init.d/foobar start" to start a service will not work, since the script will not be available. Use the more correct "/sbin/service foobar start" instead, and your command will be forwarded to systemd. Note that invoking the init script directly has always been suboptimal since too much of the caller's execution context (environment block, umask, resource limits, audit trails, ...) ended up being inherited by the service, and invocation via "/sbin/service" used to clean this up at least partially. Invocation via /sbin/service works on both SysV and systemd systems. Also, LSB only standardizes invocation via "/sbin/service" anyway. (Note that some distributions ship both systemd unit files and SysV scripts for the services. For these invoking the init scripts will work as expected and the request be forwarded to systemd in any case.) +* LSB header dependency information matters. 
The SysV implementations on many distributions did not use the dependency information encoded in LSB init script headers, or used it only in very limited ways. As a result, the headers are often incorrect or incomplete. systemd, however, fully interprets these headers and follows them closely at runtime (and not at installation time like some implementations). +* Timeouts apply to all init script operations in systemd. While on SysV systems a hanging init script could freeze the system, on systemd all init script operations are subject to a timeout of 5 min. +* Services are executed in completely clean execution contexts, no context of the invoking user session is inherited. Not even $HOME or similar are set. Init scripts depending on these will not work correctly. +* Services cannot read from stdin, as this will be connected to /dev/null. That means interactive init scripts are not supported (i.e. Debian's X-Interactive in the LSB header is not supported either.) Thankfully most distributions do not support interaction in init scripts anyway. If you need interaction to ask for disk or SSL passphrases please consider using the minimal password querying framework systemd supports. ([details](PASSWORD_AGENTS), [manual page](http://0pointer.de/public/systemd-man/systemd-ask-password.html)) +* Additional verbs for init scripts are not supported. If your init script traditionally supported additional verbs, simply move them to an auxiliary script. +* Additional parameters to the standard verbs (i.e. to "start", "stop" and "status") are not supported. This was an extension of SysV that never was standardized officially, and is not supported in systemd. +* Overriding the "restart" verb is not supported. This verb is always implemented by systemd itself, and consists of a "stop" followed by a "start". +* systemd only stops running services. 
On traditional SysV a K link installed for shutdown was executed when going down regardless of whether the service was started before or not. systemd is stricter here and does not stop services that weren't started in the first place. +* Note that neither S nor K links for runlevels 0 and 6 have any effect. Running services will be terminated anyway when shutting down, and no new SysV services are started at shutdown. +* If systemd doesn't know which PID is the main PID of a service, it will not be able to track its runtime, and hence a service exiting on its own will not make systemd consider it stopped. Use the Red Hat "pidfile:" syntax in the SysV script header comment block to let systemd know which PID file (and hence PID) belongs to your service. Note that systemd cannot know if a SysV service is one of the kind where the runtime is defined by a specific process or whether it is one where there is none, hence the requirement of explicit configuration of a PID file in order to make systemd track the process lifetime. (Note that the Red Hat "pidfile:" stanza may only appear once in init scripts.) +* Runlevels are supported in a limited fashion only. SysV runlevels are mapped to systemd target units, however not all systemd target units map back to SysV runlevels. This is due to the fact that systemd targets are a lot more flexible and expressive than SysV runlevels. That means that checks for the current runlevel (with /sbin/runlevel or so) may well return "N" (i.e. unknown runlevel) during normal operation. Scripts that rely on explicit runlevel checks are incompatible with many setups. Avoid runlevel checks like these. +* Tools like /sbin/chkconfig might return misleading information when used to list enablement status of services. First of all, the tool will only see SysV services, not native units. Secondly, it will only show runlevel-related information (which does not fully map to systemd targets). 
Finally, the information shown might be overridden by a native unit file. +* By default runlevels 2,3,4 are all aliases for "multi-user.target". If a service is enabled in one of these runlevels, it will be enabled in all of them. This is only a default however, and users can easily override the mappings, and split them up into individual runlevels if they want. However, we recommend moving on from runlevels and using the much more expressive target units of systemd. +* Early boot runlevels as they are used by some distributions are no longer supported. I.e. "fake", distribution-specific runlevels such as "S" or "b" cannot be used with systemd. +* On SysV systems changes to init scripts or any other files that define the boot process (such as /etc/fstab) usually had an immediate effect on everything started later. This is different on systemd-based systems where init script information and other boot-time configuration files are only reread when "systemctl daemon-reload" is issued. (Note that some commands, notably "systemctl enable"/"systemctl disable" do this implicitly however.) This is by design, and a safety feature, since it ensures that half-completed changes are not read at the wrong time. +* Multiple entries for the same mount path in /etc/fstab are not supported. In systemd there's only a single unit definition for each mount path read at any time. Also the listing order of mounts in /etc/fstab has no effect, mounts are executed in parallel and dependencies between them generated automatically depending on path prefixes and source paths. +* systemd's handling of the existing "nofail" mount option in /etc/fstab is stricter than it used to be on some sysvinit distributions: mount points that fail and are not listed as "nofail" will cause the boot to be stopped, for security reasons, as we should not permit unprivileged code to run without everything listed (and not expressly exempted through "nofail") being around. 
Hence, please mark all mounts where booting shall proceed regardless of whether they succeeded or not with "nofail". +* Some SysV systems support an "rc.local" script that is supposed to be called "last" during boot. In systemd, the script is supported, but the semantics are less strict, as there is simply no concept of "last service", as the boot process is event- and request-based, parallelized and compositive. In general, it's a good idea to write proper unit files with properly defined dependencies, and avoid making use of rc.local. +* systemd assumes that the UID boundary between system and regular users is a choice the distribution makes, and not the administrator. Hence it expects this setting as a compile-time option to be picked by the distribution. It will _not_ check /etc/login.defs during runtime. + +Note that there are some areas where systemd currently provides a certain amount of compatibility where we expect this compatibility to be removed eventually. diff --git a/docs/INHIBITOR_LOCKS.md b/docs/INHIBITOR_LOCKS.md new file mode 100644 index 0000000..61efdc2 --- /dev/null +++ b/docs/INHIBITOR_LOCKS.md @@ -0,0 +1,160 @@ +--- +title: Inhibitor Locks +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Inhibitor Locks + +systemd 183 and newer include logic to inhibit system shutdowns and sleep states. This is implemented as part of [systemd-logind.service(8)](http://www.freedesktop.org/software/systemd/man/systemd-logind.service.html). There are a couple of different use cases for this: + +- A CD burning application wants to ensure that the system is not turned off or suspended while the burn process is in progress. +- A package manager wants to ensure that the system is not turned off while a package upgrade is in progress. +- An office suite wants to be notified before system suspend in order to save all data to disk, and delay the suspend logic until all data is written. 
+- A web browser wants to be notified before system hibernation in order to free its cache to minimize the amount of memory that needs to be virtualized. +- A screen lock tool wants to bring up the screen lock right before suspend, and delay the suspend until that's complete. + +Applications which want to make use of the inhibition logic shall take an inhibitor lock via the [logind D-Bus API](http://www.freedesktop.org/wiki/Software/systemd/logind). + +Seven distinct inhibitor lock types may be taken, or a combination of them: + +1. _sleep_ inhibits system suspend and hibernation requested by (unprivileged) **users** +2. _shutdown_ inhibits high-level system power-off and reboot requested by (unprivileged) **users** +3. _idle_ inhibits the system from going into idle mode, which might otherwise result in **automatic** system suspend or shutdown depending on configuration. +4. _handle-power-key_ inhibits the low-level (i.e. logind-internal) handling of the system power **hardware** key, allowing (possibly unprivileged) external code to handle the event instead. +5. Similarly, _handle-suspend-key_ inhibits the low-level handling of the system **hardware** suspend key. +6. Similarly, _handle-hibernate-key_ inhibits the low-level handling of the system **hardware** hibernate key. +7. Similarly, _handle-lid-switch_ inhibits the low-level handling of the system **hardware** lid switch. + +Two different modes of locks are supported: + +1. _block_ inhibits operations entirely until the lock is released. If such a lock is taken the operation will fail (but still may be overridden if the user possesses the necessary privileges). +2. _delay_ inhibits operations only temporarily, either until the lock is released or up to a certain amount of time. The InhibitDelayMaxSec= setting in [logind.conf(5)](http://www.freedesktop.org/software/systemd/man/logind.conf.html) controls the timeout for this. 
This is intended to be used by applications which need a synchronous way to execute actions before system suspend but shall not be allowed to block suspend indefinitely. This mode is only available for _sleep_ and _shutdown_ locks. + +Inhibitor locks are taken via the Inhibit() D-Bus call on the logind Manager object: + +``` +$ gdbus introspect --system --dest org.freedesktop.login1 --object-path /org/freedesktop/login1 +node /org/freedesktop/login1 { + interface org.freedesktop.login1.Manager { + methods: + Inhibit(in s what, + in s who, + in s why, + in s mode, + out h fd); + ListInhibitors(out a(ssssuu) inhibitors); + ... + signals: + PrepareForShutdown(b active); + PrepareForSleep(b active); + ... + properties: + readonly s BlockInhibited = ''; + readonly s DelayInhibited = ''; + readonly t InhibitDelayMaxUSec = 5000000; + readonly b PreparingForShutdown = false; + readonly b PreparingForSleep = false; + ... + }; + ... +}; +``` + +**Inhibit()** is the only API necessary to take a lock. It takes four arguments: + +- _What_ is a colon-separated list of lock types, i.e. `shutdown`, `sleep`, `idle`, `handle-power-key`, `handle-suspend-key`, `handle-hibernate-key`, `handle-lid-switch`. Example: "shutdown:idle" +- _Who_ is a human-readable, descriptive string of who is taking the lock. Example: "Package Updater" +- _Why_ is a human-readable, descriptive string of why the lock is taken. Example: "Package Update in Progress" +- _Mode_ is one of `block` or `delay`, see above. Example: "block" + +Inhibit() returns a single value, a file descriptor that encapsulates the lock. As soon as the file descriptor is closed (and all its duplicates) the lock is automatically released. If the client dies while the lock is taken the kernel automatically closes the file descriptor so that the lock is automatically released. 
A delay lock taken this way should be released ASAP on reception of PrepareForShutdown(true) (see below), but of course only after execution of the actions the application wanted to delay the operation for in the first place. + +**ListInhibitors()** lists all currently active inhibitor locks. It returns an array of structs, each consisting of What, Who, Why, Mode as above, plus the UID and PID of the process that requested the lock. + +The **PrepareForShutdown()** and **PrepareForSleep()** signals are emitted when a system suspend or shutdown has been requested and is about to be executed, as well as after the suspend/shutdown was completed (or failed). The signals carry a boolean argument. If _True_ the shutdown/sleep has been requested, and the preparation phase for it begins, if _False_ the operation has completed (or failed). If _True_, this should be used as indication for applications to quickly execute the operations they wanted to execute before suspend/shutdown and then release any delay lock taken. If _False_ the suspend/shutdown operation is over, either successfully or unsuccessfully (of course, this signal will never be sent if a shutdown request was successful). The signal with _False_ is generally delivered only after the system comes back from suspend, the signal with _True_ possibly as well, for example when no delay lock was taken in the first place, and the system suspend hence executed without any delay. The signal with _False_ is usually the signal on which applications request a new delay lock in order to be synchronously notified about the next suspend/shutdown cycle. 
Note that watching PrepareForShutdown(true)/PrepareForSleep(true) without taking a delay lock is racy and should not be done, as any code that an application might want to execute on this signal might not actually finish before the suspend/shutdown cycle is executed. _Again_: if you watch PrepareForSleep(true), then you really should have taken a delay lock first. PrepareForSleep(false) may be subscribed to by applications which want to be notified about system resume events. Note that this will only be sent out for suspend/resume cycles done via logind, i.e. generally only for high-level user-induced suspend cycles, and not automatic, low-level kernel induced ones which might exist on certain devices with more aggressive power management. + +The **BlockInhibited** and **DelayInhibited** properties encode what types of locks are currently taken. These fields are a colon-separated list of `shutdown`, `sleep`, `idle`, `handle-power-key`, `handle-suspend-key`, `handle-hibernate-key`, `handle-lid-switch`. The list is basically the union of the What fields of all currently active locks of the specific mode. + +**InhibitDelayMaxUSec** contains the delay timeout value as configured in [logind.conf(5)](http://www.freedesktop.org/software/systemd/man/logind.conf.html). + +The **PreparingForShutdown** and **PreparingForSleep** boolean properties are true between the two PrepareForShutdown() and PrepareForSleep() signals, respectively, that are sent out. Note that these properties do not trigger PropertiesChanged signals. + +## Taking Blocking Locks + +Here's the basic scheme for applications which need blocking locks such as a package manager or CD burning application: + +1. Take the lock +2. Do your work you don't want to see interrupted by system sleep or shutdown +3. 
Release the lock + +Example pseudo code: + +``` +fd = Inhibit("shutdown:idle", "Package Manager", "Upgrade in progress...", "block"); +/* ... + do your work + ... */ +close(fd); +``` + +## Taking Delay Locks + +Here's the basic scheme for applications which need delay locks such as a web browser or office suite: + +1. As you open a document, take the delay lock +2. As soon as you see PrepareForSleep(true), save your data, then release the lock +3. As soon as you see PrepareForSleep(false), take the delay lock again, continue as before. + +Example pseudo code: + +``` +int fd = -1; + +takeLock() { + if (fd >= 0) + return; + + fd = Inhibit("sleep", "Word Processor", "Save any unsaved data in time...", "delay"); +} + +onDocumentOpen(void) { + takeLock(); +} + +onPrepareForSleep(bool b) { + if (b) { + saveData(); + if (fd >= 0) { + close(fd); + fd = -1; + } + } else + takeLock(); + +} + +``` + +## Taking Key Handling Locks + +By default logind will handle the power and sleep keys of the machine, as well as the lid switch in all states. This ensures that this basic system behavior is guaranteed to work in all circumstances, on text consoles as well as on all graphical environments. However, some DE might want to do their own handling of these keys, for example in order to show a pretty dialog box before executing the relevant operation, or to simply disable the action under certain conditions. For these cases the handle-power-key, handle-suspend-key, handle-hibernate-key and handle-lid-switch type inhibitor locks are available. When taken, these locks simply disable the low-level handling of the keys, they have no effect on system suspend/hibernate/poweroff executed with other mechanisms than the hardware keys (such as the user typing "systemctl suspend" in a shell). 
A DE intending to do its own handling of these keys should simply take the locks at login time, and release them on logout; alternatively it might make sense to take this lock only temporarily under certain circumstances (e.g. take the lid switch lock only when a second monitor is plugged in, in order to support the common setup where people close their laptops when they have the big screen connected). + +These locks need to be taken in the "block" mode, "delay" is not supported for them. + +If a DE wants to ensure the lock screen for the eventual resume is on the screen before the system enters suspend state, it should do this via a suspend delay inhibitor lock (see above). + +## Miscellanea + +Taking inhibitor locks is a privileged operation. Depending on the action, one of the following polkit actions is checked: _org.freedesktop.login1.inhibit-block-shutdown_, _org.freedesktop.login1.inhibit-delay-shutdown_, _org.freedesktop.login1.inhibit-block-sleep_, _org.freedesktop.login1.inhibit-delay-sleep_, _org.freedesktop.login1.inhibit-block-idle_, _org.freedesktop.login1.inhibit-handle-power-key_, _org.freedesktop.login1.inhibit-handle-suspend-key_, _org.freedesktop.login1.inhibit-handle-hibernate-key_, _org.freedesktop.login1.inhibit-handle-lid-switch_. In general it should be assumed that delay locks are easier to obtain than blocking locks, simply because their impact is much smaller. Note that the policy checks for Inhibit() are never interactive. + +Inhibitor locks should not be misused. For example taking idle blocking locks without a very good reason might cause mobile devices to never auto-suspend. This can be quite detrimental for the battery. + +If an application finds a lock denied it should not consider this much of an error and just continue its operation without the protecting lock. + +The tool [systemd-inhibit(1)](http://www.freedesktop.org/software/systemd/man/systemd-inhibit.html) may be used to take locks or list active locks from the command line. 
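The argument rules for Inhibit() described earlier (a colon-separated _What_ list, a _Mode_ of `block` or `delay`, and `delay` being available only for _sleep_ and _shutdown_) can be checked on the client side before the D-Bus call is made. The following is a minimal Python sketch of that check; `build_what()` and `parse_inhibited()` are hypothetical helper names, not part of logind or of any D-Bus binding, and the actual method call is deliberately left out.

```python
# Lock types and modes as listed in this document.
LOCK_TYPES = (
    "shutdown",
    "sleep",
    "idle",
    "handle-power-key",
    "handle-suspend-key",
    "handle-hibernate-key",
    "handle-lid-switch",
)
DELAY_CAPABLE = {"sleep", "shutdown"}


def build_what(types, mode):
    """Validate lock types against the mode and return the What string."""
    if mode not in ("block", "delay"):
        raise ValueError("mode must be 'block' or 'delay', got %r" % mode)
    types = list(types)
    if not types:
        raise ValueError("at least one lock type is required")
    for lock_type in types:
        if lock_type not in LOCK_TYPES:
            raise ValueError("unknown lock type %r" % lock_type)
        # Delay locks exist only for sleep and shutdown.
        if mode == "delay" and lock_type not in DELAY_CAPABLE:
            raise ValueError("%r cannot be taken in delay mode" % lock_type)
    return ":".join(types)


def parse_inhibited(value):
    """Split a BlockInhibited/DelayInhibited property value into a set."""
    return set(value.split(":")) if value else set()
```

For example, `build_what(["shutdown", "idle"], "block")` yields the string used in the blocking lock example of this document, while requesting `idle` in `delay` mode is rejected before any call reaches logind.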
+ +Note that gnome-session also provides an [inhibitor API](http://people.gnome.org/~mccann/gnome-session/docs/gnome-session.html#org.gnome.SessionManager.Inhibit), which is very similar to the one of systemd. Internally, locks taken on gnome-session's interface will be forwarded to logind, hence both APIs are supported. While both offer similar functionality they do differ in some regards. For obvious reasons gnome-session can offer logout locks and screensaver avoidance locks which logind lacks. logind's API OTOH supports delay locks in addition to block locks like GNOME. Also, logind is available to system components, and centralizes locks from all users, not just those of a specific one. In general: if in doubt it is probably advisable to stick to the GNOME locks, unless there is a good reason to use the logind APIs directly. When locks are to be enumerated it is better to use the logind APIs however, since they also include locks taken by system services and other users. diff --git a/docs/INITRD_INTERFACE.md b/docs/INITRD_INTERFACE.md new file mode 100644 index 0000000..0461ae2 --- /dev/null +++ b/docs/INITRD_INTERFACE.md @@ -0,0 +1,70 @@ +--- +title: Initrd Interface +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + + +# The initrd Interface of systemd + +The Linux initrd mechanism (short for "initial RAM disk", also known as +"initramfs") refers to a small file system archive that is unpacked by the +kernel and contains the first userspace code that runs. It typically finds and +transitions into the actual root file system to use. systemd supports both +initrd and initrd-less boots. If an initrd is used, it is a good idea to pass a +few bits of runtime information from the initrd to systemd in order to avoid +duplicate work and to provide performance data to the administrator. In this +page we attempt to roughly describe the interfaces that exist between the +initrd and systemd. 
These interfaces are currently used by +[mkosi](https://github.com/systemd/mkosi)-generated initrds, dracut and the +Arch Linux initrds. + +* The initrd should mount `/run/` as a tmpfs and pass it pre-mounted when + jumping into the main system when executing systemd. The mount options should + be `mode=0755,nodev,nosuid,strictatime`. + +* It's highly recommended that the initrd also mounts `/usr/` (if split off) as + appropriate and passes it pre-mounted to the main system, to avoid the + problems described in [Booting without /usr is + Broken](https://www.freedesktop.org/wiki/Software/systemd/separate-usr-is-broken). + +* If the executable `/run/initramfs/shutdown` exists systemd will use it to + jump back into the initrd on shutdown. `/run/initramfs/` should be a usable + initrd environment to which systemd will pivot back and the `shutdown` + executable in it should be able to detach all complex storage that for + example was needed to mount the root file system. It's the job of the initrd + to set up this directory and executable in the right way so that this works + correctly. The shutdown binary is invoked with the shutdown verb as `argv[1]`, + optionally followed (in `argv[2]`, `argv[3]`, …) by systemd's original command + line options, for example `--log-level=` and similar. + +* Storage daemons run from the initrd should follow the guide on + [systemd and Storage Daemons for the Root File System](ROOT_STORAGE_DAEMONS) + to survive properly from the boot initrd all the way to the point where + systemd jumps back into the initrd for shutdown. + +One last clarification: we use the term _initrd_ very generically here +describing any kind of early boot file system, regardless of whether that might be +implemented as an actual ramdisk, ramfs or tmpfs. We recommend using _initrd_ +in this sense as a term that is unrelated to the actual backing technologies +used. 
+ +## Using systemd inside an initrd + +It is also possible and recommended to implement the initrd itself based on +systemd. Here are a few terse notes: + +* Provide `/etc/initrd-release` in the initrd image. The idea is that it + follows the same format as the usual `/etc/os-release` but describes the + initrd implementation rather than the OS. systemd uses the existence of this + file as a flag whether to run in initrd mode, or not. + +* When run in initrd mode, systemd and its components will read a couple of + additional command line arguments, which are generally prefixed with `rd.` + +* To transition into the main system image invoke `systemctl switch-root`. + +* The switch-root operation will result in a killing spree of all running + processes. Some processes might need to be excluded from that, see the guide + on [systemd and Storage Daemons for the Root File System](ROOT_STORAGE_DAEMONS). diff --git a/docs/JOURNAL_EXPORT_FORMATS.md b/docs/JOURNAL_EXPORT_FORMATS.md new file mode 100644 index 0000000..e1eb0d3 --- /dev/null +++ b/docs/JOURNAL_EXPORT_FORMATS.md @@ -0,0 +1,158 @@ +--- +title: Journal Export Formats +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Journal Export Formats + +## Journal Export Format + +_Note that this document describes the binary serialization format of journals only, as used for transfer across the network. +For interfacing with web technologies there's the Journal JSON Format, described below. 
+The binary format on disk is documented as the [Journal File Format](JOURNAL_FILE_FORMAT)._ + +_Before reading on, please make sure you are aware of the [basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html), in particular realize that they may include binary non-text data (though usually don't), and the same field might have multiple values assigned within the same entry (though usually hasn't)._ + +When exporting journal data for other uses or transferring it via the network/local IPC the _journal export format_ is used. It's a simple serialization of journal entries, that is easy to read without any special tools, but still binary safe where necessary. The format is like this: + +* Two journal entries that follow each other are separated by a double newline. +* Journal fields consisting only of valid non-control UTF-8 codepoints are serialized as they are (i.e. the field name, followed by '=', followed by field data), followed by a newline as separator to the next field. Note that fields containing newlines cannot be formatted like this. Non-control UTF-8 codepoints are the codepoints with value at or above 32 (' '), or equal to 9 (TAB). +* Other journal fields are serialized in a special binary safe way: field name, followed by newline, followed by a binary 64-bit little endian size value, followed by the binary field data, followed by a newline as separator to the next field. +* Entry metadata that is not actually a field is serialized like it was a field, but beginning with two underscores. More specifically, `__CURSOR=`, `__REALTIME_TIMESTAMP=`, `__MONOTONIC_TIMESTAMP=`, `__SEQNUM=`, `__SEQNUM_ID` are introduced this way. Note that these meta-fields are only generated when actual journal files are serialized. They are omitted for entries that do not originate from a journal file (for example because they are transferred for the first time to be stored in one). 
Or in other words: if you are generating this format you shouldn't care about these special double-underscore fields. But you might find them usable when you deserialize the format generated by us. Additional fields prefixed with two underscores might be added later on, your parser should skip over the fields it does not know. +* The order in which fields appear in an entry is undefined and might be different for each entry that is serialized. +And that's already it. + +This format can be generated via `journalctl -o export`. + +Here's an example for two serialized entries which consist only of text data: + +``` +__CURSOR=s=739ad463348b4ceca5a9e69c95a3c93f;i=4ece7;b=6c7c6013a26343b29e964691ff25d04c;m=4fc72436e;t=4c508a72423d9;x=d3e5610681098c10;p=system.journal +__REALTIME_TIMESTAMP=1342540861416409 +__MONOTONIC_TIMESTAMP=21415215982 +_BOOT_ID=6c7c6013a26343b29e964691ff25d04c +_TRANSPORT=syslog +PRIORITY=4 +SYSLOG_FACILITY=3 +SYSLOG_IDENTIFIER=gdm-password] +SYSLOG_PID=587 +MESSAGE=AccountsService-DEBUG(+): ActUserManager: ignoring unspecified session '8' since it's not graphical: Success +_PID=587 +_UID=0 +_GID=500 +_COMM=gdm-session-wor +_EXE=/usr/libexec/gdm-session-worker +_CMDLINE=gdm-session-worker [pam/gdm-password] +_AUDIT_SESSION=2 +_AUDIT_LOGINUID=500 +_SYSTEMD_CGROUP=/user/lennart/2 +_SYSTEMD_SESSION=2 +_SELINUX_CONTEXT=system_u:system_r:xdm_t:s0-s0:c0.c1023 +_SOURCE_REALTIME_TIMESTAMP=1342540861413961 +_MACHINE_ID=a91663387a90b89f185d4e860000001a +_HOSTNAME=epsilon + +__CURSOR=s=739ad463348b4ceca5a9e69c95a3c93f;i=4ece8;b=6c7c6013a26343b29e964691ff25d04c;m=4fc72572f;t=4c508a7243799;x=68597058a89b7246;p=system.journal +__REALTIME_TIMESTAMP=1342540861421465 +__MONOTONIC_TIMESTAMP=21415221039 +_BOOT_ID=6c7c6013a26343b29e964691ff25d04c +_TRANSPORT=syslog +PRIORITY=6 +SYSLOG_FACILITY=9 +SYSLOG_IDENTIFIER=/USR/SBIN/CROND +SYSLOG_PID=8278 +MESSAGE=(root) CMD (run-parts /etc/cron.hourly) +_PID=8278 +_UID=0 +_GID=0 +_COMM=run-parts +_EXE=/usr/bin/bash 
+_CMDLINE=/bin/bash /bin/run-parts /etc/cron.hourly +_AUDIT_SESSION=8 +_AUDIT_LOGINUID=0 +_SYSTEMD_CGROUP=/user/root/8 +_SYSTEMD_SESSION=8 +_SELINUX_CONTEXT=system_u:system_r:crond_t:s0-s0:c0.c1023 +_SOURCE_REALTIME_TIMESTAMP=1342540861416351 +_MACHINE_ID=a91663387a90b89f185d4e860000001a +_HOSTNAME=epsilon + +``` + +A message with a binary field produced by +```bash +python3 -c 'from systemd import journal; journal.send("foo\nbar")' +journalctl -n1 -o export +``` + +``` +__CURSOR=s=bcce4fb8ffcb40e9a6e05eee8b7831bf;i=5ef603;b=ec25d6795f0645619ddac9afdef453ee;m=545242e7049;t=50f1202 +__REALTIME_TIMESTAMP=1423944916375353 +__MONOTONIC_TIMESTAMP=5794517905481 +_BOOT_ID=ec25d6795f0645619ddac9afdef453ee +_TRANSPORT=journal +_UID=1001 +_GID=1001 +_CAP_EFFECTIVE=0 +_SYSTEMD_OWNER_UID=1001 +_SYSTEMD_SLICE=user-1001.slice +_MACHINE_ID=5833158886a8445e801d437313d25eff +_HOSTNAME=bupkis +_AUDIT_LOGINUID=1001 +_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 +CODE_LINE=1 +CODE_FUNC=<module> +SYSLOG_IDENTIFIER=python3 +_COMM=python3 +_EXE=/usr/bin/python3.4 +_AUDIT_SESSION=35898 +_SYSTEMD_CGROUP=/user.slice/user-1001.slice/session-35898.scope +_SYSTEMD_SESSION=35898 +_SYSTEMD_UNIT=session-35898.scope +MESSAGE +^G^@^@^@^@^@^@^@foo +bar +CODE_FILE=<string> +_PID=16853 +_CMDLINE=python3 -c from systemd import journal; journal.send("foo\nbar") +_SOURCE_REALTIME_TIMESTAMP=1423944916372858 +``` + +## Journal JSON Format + +_Note that this section describes the JSON serialization format of the journal only, as used for interfacing with web technologies. +For binary transfer of journal data across the network there's the Journal Export Format described above. 
+The binary format on disk is documented as [Journal File Format](JOURNAL_FILE_FORMAT)._ + +_Before reading on, please make sure you are aware of the [basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html), in particular realize that they may include binary non-text data (though usually don't), and the same field might have multiple values assigned within the same entry (though usually hasn't)._ + +In most cases the Journal JSON serialization is the obvious mapping of the entry field names (as JSON strings) to the entry field values (also as JSON strings) encapsulated in one JSON object. However, there are a few special cases to handle: + +* A field that contains non-printable or non-UTF-8 data is serialized as a number array instead. This is necessary to handle binary data in a safe way without losing data, since JSON cannot embed binary data natively. Each byte of the binary field will be mapped to its numeric value in the range 0…255. +* The JSON serializer can optionally skip huge (as in larger than a specific threshold) data fields from the JSON object. If that is enabled and a data field is too large, the field name is still included in the JSON object but assigned _null_. +* Within the same entry, Journal fields may have multiple values assigned. This is not allowed in JSON. The serializer will hence create a single JSON field only for these cases, and assign it an array of values (which can be strings, _null_ or number arrays, see above). +* If the JSON data originates from a journal file it may include the special addressing fields `__CURSOR`, `__REALTIME_TIMESTAMP`, `__MONOTONIC_TIMESTAMP`, `__SEQNUM`, `__SEQNUM_ID`, which contain the cursor string of this entry as string, the realtime/monotonic timestamps of this entry as formatted numeric string of usec since the respective epoch, and the sequence number and associated sequence number ID, both formatted as strings. 
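A reader of this format has to undo the special cases listed above. The following is a minimal Python sketch of one way to do that, not a reference decoder: `decode_entry()` and `decode_value()` are hypothetical names, every field is normalized to a list of values, and an empty JSON array is assumed to be an empty binary value.

```python
import json


def decode_value(value):
    """Decode one serialized value into str, bytes, or None (skipped field)."""
    if value is None or isinstance(value, str):
        return value
    if isinstance(value, list) and all(type(byte) is int for byte in value):
        # Number array: binary data, one number per byte (0..255).
        return bytes(value)
    raise ValueError("unexpected JSON value: %r" % (value,))


def decode_entry(text):
    """Parse one JSON-serialized journal entry into {field: [values...]}."""
    entry = {}
    for name, value in json.loads(text).items():
        # A list that is not purely numeric holds multiple values of one field;
        # a purely numeric list is a single binary value.
        is_multi = isinstance(value, list) and not all(
            type(item) is int for item in value
        )
        values = value if is_multi else [value]
        entry[name] = [decode_value(item) for item in values]
    return entry
```

Normalizing every field to a list keeps callers from having to distinguish single-valued from multi-valued fields; a `None` element marks a value the serializer skipped for being too large.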
+ +Here's an example, illustrating all cases mentioned above. Consider this entry: + +``` +MESSAGE=Hello World +_UDEV_DEVNODE=/dev/waldo +_UDEV_DEVLINK=/dev/alias1 +_UDEV_DEVLINK=/dev/alias2 +BINARY=this is a binary value \a +LARGE=this is a super large value (let's pretend at least, for the sake of this example) +``` + +This translates into the following JSON Object: +```json +{ + "MESSAGE" : "Hello World", + "_UDEV_DEVNODE" : "/dev/waldo", + "_UDEV_DEVLINK" : [ "/dev/alias1", "/dev/alias2" ], + "BINARY" : [ 116, 104, 105, 115, 32, 105, 115, 32, 97, 32, 98, 105, 110, 97, 114, 121, 32, 118, 97, 108, 117, 101, 32, 7 ], + "LARGE" : null +} +``` diff --git a/docs/JOURNAL_FILE_FORMAT.md b/docs/JOURNAL_FILE_FORMAT.md new file mode 100644 index 0000000..e0737c5 --- /dev/null +++ b/docs/JOURNAL_FILE_FORMAT.md @@ -0,0 +1,755 @@ +--- +title: Journal File Format +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Journal File Format + +_Note that this document describes the binary on-disk format of journals only. +For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS.md#journal-json-format). +For transfer of journal data across the network there's the [Journal Export Format](JOURNAL_EXPORT_FORMATS.md#journal-export-format)._ + +The systemd journal stores log data in a binary format with several features: + +* Fully indexed by all fields +* Can store binary data, up to 2^64-1 in size +* Seekable +* Primarily append-based, hence robust to corruption +* Support for in-line compression +* Support for in-line Forward Secure Sealing + +This document explains the basic structure of the file format on disk. We are +making this available primarily to allow review and provide documentation. 
Note +that the actual implementation in the [systemd +codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/) is the +only ultimately authoritative description of the format, so if this document +and the code disagree, the code is right. That said we'll of course try hard to +keep this document up-to-date and accurate. + +Instead of implementing your own reader or writer for journal files we ask you +to use the [Journal's native C +API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) to access +these files. It provides you with full access to the files, and will not +withhold any data. If you find a limitation, please ping us and we might add +some additional interfaces for you. + +If you need access to the raw journal data in serialized stream form without C +API our recommendation is to make use of the [Journal Export +Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format), which you can +get via `journalctl -o export` or via `systemd-journal-gatewayd`. The export +format is much simpler to parse, but complete and accurate. Due to its +stream-based nature it is not indexed. + +_Or, to put this in other words: this low-level document is probably not what +you want to use as base of your project. You want our [C +API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead! +And if you really don't want the C API, then you want the +[Journal Export Format or Journal JSON Format](JOURNAL_EXPORT_FORMATS) +instead! This document is primarily for your entertainment and education. +Thank you!_ + +This document assumes you have a basic understanding of the journal concepts, +the properties of a journal entry and so on. If not, please go and read up, +then come back! 
This is a good opportunity to read about the [basic properties +of journal +entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html), +in particular realize that they may include binary non-text data (though +usually don't), and the same field might have multiple values assigned within +the same entry. + +This document describes the current format of systemd 246. The documented +format is compatible with the format used in the first versions of the journal, +but received various compatible and incompatible additions since. + +If you are wondering why the journal file format has been created in the first +place instead of adopting an existing database implementation, please have a +look [at this +thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html). + + +## Basics + +* All offsets, sizes, time values, hashes (and most other numeric values) are 32-bit/64-bit unsigned integers in LE format. +* Offsets are always relative to the beginning of the file. +* The 64-bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32-bit integer it returns as higher 32-bit part of the 64-bit value, and the second one uses as lower 32-bit part. +* All structures are aligned to 64-bit boundaries and padded to multiples of 64-bit +* The format is designed to be read and written via memory mapping using multiple mapped windows. +* All time values are stored in usec since the respective epoch. +* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`) +* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. 
(`CLOCK_MONOTONIC`) +* Randomized, unique 128-bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement. + +## General Rules + +If any kind of corruption is noticed by a writer it should immediately rotate +the file and start a new one. No further writes should be attempted to the +original file, but it should be left around so that as little data as possible +is lost. + +If any kind of corruption is noticed by a reader it should try hard to handle +this gracefully, such as skipping over the corrupted data, but allowing access +to as much data around it as possible. + +A reader should verify all offsets and other data as it reads it. This includes +checking for alignment and range of offsets in the file, especially before +trying to read it via a memory map. + +A reader must interleave rotated and corrupted files as good as possible and +present them as single stream to the user. + +All fields marked as "reserved" must be initialized with 0 when writing and be +ignored on reading. They are currently not used but might be used later on. + + +## Structure + +The file format's data structures are declared in +[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h). + +The file format begins with a header structure. After the header structure +object structures follow. Objects are appended to the end as time +progresses. Most data stored in these objects is not altered anymore after +having been written once, with the exception of records necessary for +indexing. When new data is appended to a file the writer first writes all new +objects to the end of the file, and then links them up at front after that's +done. 
Currently, seven different object types are known: + +```c +enum { + OBJECT_UNUSED, + OBJECT_DATA, + OBJECT_FIELD, + OBJECT_ENTRY, + OBJECT_DATA_HASH_TABLE, + OBJECT_FIELD_HASH_TABLE, + OBJECT_ENTRY_ARRAY, + OBJECT_TAG, + _OBJECT_TYPE_MAX +}; +``` + +* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=". +* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value. +* An **ENTRY** object, which binds several **DATA** objects together into a log entry. +* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects. +* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects. +* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search. +* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later). 
+ +## Header + +The Header struct defines, well, you guessed it, the file header: + +```c +_packed_ struct Header { + uint8_t signature[8]; /* "LPKSHHRH" */ + le32_t compatible_flags; + le32_t incompatible_flags; + uint8_t state; + uint8_t reserved[7]; + sd_id128_t file_id; + sd_id128_t machine_id; + sd_id128_t tail_entry_boot_id; + sd_id128_t seqnum_id; + le64_t header_size; + le64_t arena_size; + le64_t data_hash_table_offset; + le64_t data_hash_table_size; + le64_t field_hash_table_offset; + le64_t field_hash_table_size; + le64_t tail_object_offset; + le64_t n_objects; + le64_t n_entries; + le64_t tail_entry_seqnum; + le64_t head_entry_seqnum; + le64_t entry_array_offset; + le64_t head_entry_realtime; + le64_t tail_entry_realtime; + le64_t tail_entry_monotonic; + /* Added in 187 */ + le64_t n_data; + le64_t n_fields; + /* Added in 189 */ + le64_t n_tags; + le64_t n_entry_arrays; + /* Added in 246 */ + le64_t data_hash_chain_depth; + le64_t field_hash_chain_depth; + /* Added in 252 */ + le32_t tail_entry_array_offset; + le32_t tail_entry_array_n_entries; + /* Added in 254 */ + le64_t tail_entry_offset; +}; +``` + +The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`. + +If a writer finds that the **machine_id** of a file to write to does not match +the machine it is running on it should immediately rotate the file and start a +new one. + +When journal file is first created the **file_id** is randomly and uniquely +initialized. + +When a writer creates a file it shall initialize the **tail_entry_boot_id** to +the current boot ID of the system. When appending an entry it shall update the +field to the boot ID of that entry, so that it is guaranteed that the +**tail_entry_monotonic** field refers to a timestamp of the monotonic clock +associated with the boot with the ID indicated by the **tail_entry_boot_id** +field. 
(Compatibility note: in older versions of the journal, the field was +also supposed to be updated whenever the file was opened for any form of +writing, including when opened to mark it as archived. This behaviour has been +deemed problematic since without an associated boot ID the +**tail_entry_monotonic** field is useless. To indicate that the boot ID is +updated only on append, the HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID flag is set. If it +is not set, the **tail_entry_monotonic** field is not usable). + +The currently used part of the file is the **header_size** plus the +**arena_size** field of the header. If a writer needs to write to a file where +the actual file size on disk is smaller than the reported value it shall +immediately rotate the file and start a new one. If a writer is asked to write +to a file with a header that is shorter than its own definition of the struct +Header, it shall immediately rotate the file and start a new one. + +The **n_objects** field contains a counter for objects currently available in +this file. As objects are appended to the end of the file this counter is +increased. + +The first object in the file starts immediately after the header. The last +object in the file is at the offset **tail_object_offset**, which may be 0 if +no object is in the file yet. + +The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are +counters of the objects of the specific types. + +**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number +(see below) of the last or first entry in the file, respectively, or 0 if no +entry has been written yet. + +**tail_entry_realtime** and **head_entry_realtime** contain the wallclock +timestamp of the last or first entry in the file, respectively, or 0 if no +entry has been written yet. 
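The header checks described so far (the signature, and the used size versus the size on disk) can be sketched roughly as follows. This is only an illustration, not systemd code: `header_looks_sane()` and `read_le64()` are hypothetical names, and the byte offsets 88 and 96 are derived from the packed `struct Header` shown above under the assumption that it contains no padding.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets derived from the packed struct Header shown above:
 * 8 (signature) + 4 + 4 (flags) + 1 + 7 (state, reserved) + 4 * 16 (IDs) = 88.
 * These are assumptions of this sketch, not constants exported by systemd. */
#define HEADER_SIZE_OFFSET 88u
#define ARENA_SIZE_OFFSET  96u

/* Hypothetical helper: decode a little-endian 64-bit integer. */
uint64_t read_le64(const uint8_t *p) {
        uint64_t v = 0;

        for (int i = 7; i >= 0; i--)
                v = (v << 8) | p[i];
        return v;
}

/* Hypothetical check: buf holds the first buf_len bytes of the file,
 * file_size is the size on disk (for example st_size from fstat()). */
bool header_looks_sane(const uint8_t *buf, size_t buf_len, uint64_t file_size) {
        uint64_t header_size, arena_size;

        /* We need at least everything up to and including arena_size. */
        if (buf_len < ARENA_SIZE_OFFSET + 8)
                return false;

        /* The first 8 bytes must be the ASCII signature. */
        if (memcmp(buf, "LPKSHHRH", 8) != 0)
                return false;

        header_size = read_le64(buf + HEADER_SIZE_OFFSET);
        arena_size = read_le64(buf + ARENA_SIZE_OFFSET);

        /* Reject sums that would overflow, then compare with the disk size. */
        if (arena_size > UINT64_MAX - header_size)
                return false;
        return header_size + arena_size <= file_size;
}
```

A writer that gets `false` here would, following the rules above, rotate the file rather than append to it. A real implementation would presumably also compare **header_size** against its own `struct Header` definition.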
+ +**tail_entry_monotonic** is the monotonic timestamp of the last entry in the +file, referring to monotonic time of the boot identified by +**tail_entry_boot_id**, but only if the +HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID feature flag is set, see above. If it +is not set, this field might refer to a different boot than the one in the +**tail_entry_boot_id** field, for example when the file was ultimately +archived. + +**data_hash_chain_depth** is a counter of the deepest chain in the data hash +table, minus one. This is updated whenever a chain is found that is longer than +the previous deepest chain found. Note that the counter is updated during hash +table lookups, as the chains are traversed. This counter is used to determine +when it is a good time to rotate the journal file, because hash collisions +became too frequent. + +Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the +field hash table, minus one. + +**tail_entry_array_offset** and **tail_entry_array_n_entries** allow immediate +access to the last entry array in the global entry array chain. + +**tail_entry_offset** allows immediate access to the last entry in the journal +file. + +## Extensibility + +The format is supposed to be extensible in order to enable future additions of +features. Readers should simply skip objects of unknown types as they read +them. If a compatible feature extension is made a new bit is registered in the +header's **compatible_flags** field. If a feature extension is used that makes +the format incompatible a new bit is registered in the header's +**incompatible_flags** field. Readers should check these two bit fields: if +they find a flag they don't understand in **compatible_flags** they should continue +to read the file, but if they find one in **incompatible_flags** they should +fail, asking for an update of the software. Writers should refuse writing if +there's an unknown bit flag in either of these fields. 
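The reader and writer rules for the two flag fields can be expressed in a few lines. The following is a minimal sketch, assuming the implementation understands exactly the five incompatible and two compatible bits listed in this section. The function and macro names are hypothetical, and the caller is assumed to have already converted the little-endian header fields to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits this hypothetical implementation understands: bits 0..4 of
 * incompatible_flags and bits 0..1 of compatible_flags. */
#define KNOWN_INCOMPATIBLE_FLAGS UINT32_C(0x1f)
#define KNOWN_COMPATIBLE_FLAGS   UINT32_C(0x03)

/* Readers ignore unknown compatible bits but must fail on unknown
 * incompatible bits. */
bool reader_may_open(uint32_t compatible_flags, uint32_t incompatible_flags) {
        (void) compatible_flags;
        return (incompatible_flags & ~KNOWN_INCOMPATIBLE_FLAGS) == 0;
}

/* Writers refuse if there is an unknown bit in either field. */
bool writer_may_open(uint32_t compatible_flags, uint32_t incompatible_flags) {
        return (compatible_flags & ~KNOWN_COMPATIBLE_FLAGS) == 0 &&
               (incompatible_flags & ~KNOWN_INCOMPATIBLE_FLAGS) == 0;
}
```

Keeping the "known" masks in one place means a newly supported feature only needs a mask update, while the asymmetry between readers and writers stays explicit.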
+ +The file header may be extended as new features are added. The size of the file +header is stored in the header. All header fields before **n_data** are known to +unconditionally exist in all revisions of the file format; all fields starting +with **n_data** need to be explicitly checked for via a size check, since they +were additions after the initial release. + +Currently seven extensions flagged in the flags fields are known (five +incompatible, two compatible): + +```c +enum { + HEADER_INCOMPATIBLE_COMPRESSED_XZ = 1 << 0, + HEADER_INCOMPATIBLE_COMPRESSED_LZ4 = 1 << 1, + HEADER_INCOMPATIBLE_KEYED_HASH = 1 << 2, + HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3, + HEADER_INCOMPATIBLE_COMPACT = 1 << 4, +}; + +enum { + HEADER_COMPATIBLE_SEALED = 1 << 0, + HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID = 1 << 1, +}; +``` + +HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects +that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4 +indicates that the file includes DATA objects that are compressed with the LZ4 +algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are +objects compressed with ZSTD. + +HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins +hash function the keyed siphash24 hash function is used for the two hash +tables, see below. + +HEADER_INCOMPATIBLE_COMPACT indicates that the journal file uses the new binary +format that uses less space on disk compared to the original format. + +HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required +for Forward Secure Sealing. + +HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID indicates whether the +**tail_entry_boot_id** field is strictly updated on initial creation of the +file and whenever an entry is appended (in which case the flag is set), or also +when the file is archived (in which case it is unset). 
New files should always +set this flag (and thus not update the **tail_entry_boot_id** except when +creating the file and when appending an entry to it). + +## Dirty Detection + +```c +enum { + STATE_OFFLINE = 0, + STATE_ONLINE = 1, + STATE_ARCHIVED = 2, + _STATE_MAX +}; +``` + +If a file is opened for writing the **state** field should be set to +STATE_ONLINE. If a file is closed after writing the **state** field should be +set to STATE_OFFLINE. After a file has been rotated it should be set to +STATE_ARCHIVED. If a writer is asked to write to a file that is not in +STATE_OFFLINE it should immediately rotate the file and start a new one, +without changing the file. + +Before and after the state field is changed, `fdatasync()` should be executed on +the file to ensure the dirty state hits disk. + + +## Sequence Numbers + +All entries carry sequence numbers that are monotonically counted up for each +entry (starting at 1) and are unique among all files which carry the same +**seqnum_id** field. This field is randomly generated when the journal daemon +creates its first file. All files generated by the same journal daemon instance +should hence carry the same seqnum_id. This should guarantee a monotonic stream +of sequential numbers for easy interleaving even if entries are distributed +among several files, such as the system journal and many per-user journals. + + +## Concurrency + +The file format is designed to be usable in a simultaneous +single-writer/multiple-reader scenario. The synchronization model is very weak +in order to facilitate storage on the most basic of file systems (well, the +most basic ones that provide us with `mmap()` that is), and allow good +performance. No file locking is used. The only time where disk synchronization +via `fdatasync()` should be enforced is before and after changing the **state** +field in the file header (see above). 
It is recommended to execute a memory +barrier after appending and initializing new objects at the end of the file, +and before linking them up in the earlier objects. + +This weak synchronization model means that it is crucial that readers verify +the structural integrity of the file as they read it and handle invalid +structure gracefully. (Checking what you read is a pretty good idea out of +security considerations anyway.) This specifically includes checking offset +values, and that they point to valid objects, with valid sizes and of the type +and hash value expected. All code must be written with the fact in mind that a +file with inconsistent structure might just be inconsistent temporarily, and +might become consistent later on. Payload OTOH requires less scrutiny, as it +should only be linked up (and hence visible to readers) after it was +successfully written to memory (though not necessarily to disk). On non-local +file systems it is a good idea to verify the payload hashes when reading, in +order to avoid annoyances with `mmap()` inconsistencies. + +Clients intending to show a live view of the journal should use `inotify()` for +this to watch for files changes. Since file writes done via `mmap()` do not +result in `inotify()` writers shall truncate the file to its current size after +writing one or more entries, which results in inotify events being +generated. Note that this is not used as a transaction scheme (it doesn't +protect anything), but merely for triggering wakeups. + +Note that inotify will not work on network file systems if reader and writer +reside on different hosts. Readers which detect they are run on journal files +on a non-local file system should hence not rely on inotify for live views but +fall back to simple time based polling of the files (maybe recheck every 2s). 
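The structural checks that this section and the General Rules ask of readers (alignment and range of offsets, plausible sizes) might look roughly like the sketch below. It is an illustration under stated assumptions rather than the actual systemd logic: `object_range_is_valid()` is a hypothetical name, the 16-byte minimum follows from the `ObjectHeader` layout documented in the Objects section, and all values are assumed to be already converted to host byte order.

```c
#include <stdbool.h>
#include <stdint.h>

/* type (1) + flags (1) + reserved (6) + size (8), per the ObjectHeader
 * struct documented in the Objects section. */
#define OBJECT_HEADER_SIZE UINT64_C(16)

/* Hypothetical check to run before dereferencing an object through a
 * memory map. object_size is the size field read from the object header. */
bool object_range_is_valid(uint64_t offset, uint64_t object_size,
                           uint64_t header_size, uint64_t arena_size) {
        uint64_t end_of_data;

        if (arena_size > UINT64_MAX - header_size)
                return false;
        end_of_data = header_size + arena_size;

        /* All structures are aligned to 64-bit boundaries. */
        if (offset % 8 != 0)
                return false;

        /* Objects live after the file header. */
        if (offset < header_size)
                return false;

        /* The size covers the object header, so it cannot be smaller. */
        if (object_size < OBJECT_HEADER_SIZE)
                return false;

        /* The whole object must fit into the used part of the file. */
        if (offset > end_of_data || object_size > end_of_data - offset)
                return false;

        return true;
}
```

Since an inconsistent structure may only be temporary while a writer is active, a reader would typically treat a `false` result as "skip or retry later" rather than as a fatal error. Type and hash checks would come on top of this.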
+ + +## Objects + +All objects carry a common header: + +```c +enum { + OBJECT_COMPRESSED_XZ = 1 << 0, + OBJECT_COMPRESSED_LZ4 = 1 << 1, + OBJECT_COMPRESSED_ZSTD = 1 << 2, +}; + +_packed_ struct ObjectHeader { + uint8_t type; + uint8_t flags; + uint8_t reserved[6]; + le64_t size; + uint8_t payload[]; +}; +``` + +The **type** field is one of the object types listed above. The **flags** field +currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and +OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that +the data payload is compressed with XZ/LZ4/ZSTD. If one of the +OBJECT_COMPRESSED_* flags is set for an object then the matching +HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD +flag must be set for the file as well. At most one of these three bits may be +set. The **size** field encodes the size of the object including all its +headers and payload. + + +## Data Objects + +```c +_packed_ struct DataObject { + ObjectHeader object; + le64_t hash; + le64_t next_hash_offset; + le64_t next_field_offset; + le64_t entry_offset; /* the first array entry we store inline */ + le64_t entry_array_offset; + le64_t n_entries; + union { \ + struct { \ + uint8_t payload[] ; \ + } regular; \ + struct { \ + le32_t tail_entry_array_offset; \ + le32_t tail_entry_array_n_entries; \ + uint8_t payload[]; \ + } compact; \ + }; \ +}; +``` + +Data objects carry actual field data in the **payload[]** array, including a +field name, a `=` and the field data. Example: +`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the +payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header +this is the siphash24 hash value of the payload, keyed by the file ID as stored +in the **file_id** field of the file header. If the flag is not set it is the +non-keyed Jenkins hash of the payload instead. 
The keyed hash is preferred as +it makes the format more robust against attackers that want to trigger hash +collisions in the hash table. + +**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if +a hash collision happens (in a singly linked list, with an offset of 0 +indicating the end). **next_field_offset** is used to link up data objects with +the same field name from the FIELD object of the field used. + +**entry_offset** is an offset to the first ENTRY object referring to this DATA +object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with +offsets to other entries referencing this DATA object. Storing the offset to +the first ENTRY object in-line is an optimization given that many DATA objects +will be referenced from a single entry only (for example, `MESSAGE=` frequently +includes a practically unique string). **n_entries** is a counter of the total +number of ENTRY objects that reference this object, i.e. the sum of all +ENTRY_ARRAYS chained up from this object, plus 1. + +The **payload[]** field contains the field name and data unencoded, unless +OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the +`ObjectHeader`, in which case the payload is compressed with the indicated +compression algorithm. + +If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, two extra fields are stored to +allow immediate access to the tail entry array in the DATA object's entry array +chain. + +## Field Objects + +```c +_packed_ struct FieldObject { + ObjectHeader object; + le64_t hash; + le64_t next_hash_offset; + le64_t head_data_offset; + uint8_t payload[]; +}; +``` + +Field objects are used to enumerate all possible values a certain field name +can take in the entire journal file. + +The **payload[]** array contains the actual field name, without '=' or any +field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of +the payload. 
As for the DATA objects, this too is either the `.file_id` keyed +siphash24 hash of the payload, or the non-keyed Jenkins hash. + +**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE +if a hash collision happens (in a singly linked list, offset 0 indicating the +end). **head_data_offset** points to the first DATA object that shares this +field name. It is the head of a singly linked list using DATA's +**next_field_offset** offset. + + +## Entry Objects + +```c +_packed_ struct EntryObject { + ObjectHeader object; + le64_t seqnum; + le64_t realtime; + le64_t monotonic; + sd_id128_t boot_id; + le64_t xor_hash; + union { + struct { + le64_t object_offset; + le64_t hash; + } regular[]; + struct { + le32_t object_offset; + } compact[]; + } items; +}; +``` + +An ENTRY object binds several DATA objects together into one log entry, and +includes other metadata such as various timestamps. + +The **seqnum** field contains the sequence number of the entry, **realtime** +the realtime timestamp, and **monotonic** the monotonic timestamp for the boot +identified by **boot_id**. + +The **xor_hash** field contains a binary XOR of the hashes of the payload of +all DATA objects referenced by this ENTRY. This value is usable to check the +contents of the entry, being independent of the order of the DATA objects in +the array. Note that even for files that have the +`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus otherwise use siphash24 as +hash function) the hash function used for this field, as singular +exception, is the Jenkins lookup3 hash function. The XOR hash value is used to +quickly compare the contents of two entries, and to define a well-defined order +between two entries that otherwise have the same sequence numbers and +timestamps. + +The **items[]** array contains references to all DATA objects of this entry, +plus their respective hashes (which are calculated the same way as in the DATA +objects, i.e. 
keyed by the file ID). + +If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, DATA object offsets are stored +as 32-bit integers instead of 64-bit and the unused hash field per data object is +not stored anymore. + +In the file ENTRY objects are written ordered monotonically by sequence +number. For contiguous parts of the file written during the same boot +(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo +wallclock time jumps (due to incorrect clocks being corrected) the realtime +timestamps are monotonic too. + + +## Hash Table Objects + +```c +_packed_ struct HashItem { + le64_t head_hash_offset; + le64_t tail_hash_offset; +}; + +_packed_ struct HashTableObject { + ObjectHeader object; + HashItem items[]; +}; +``` + +The structure of DATA_HASH_TABLE and FIELD_HASH_TABLE objects is +identical. They implement a simple hash table, with each cell containing +offsets to the head and tail of the singly linked list of the DATA and FIELD +objects, respectively. DATA's and FIELD's next_hash_offset fields are used to +chain up the objects. Empty cells have both offsets set to 0. + +Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE +object. Their payload is directly referred to by the file header in the +**data_hash_table_offset**, **data_hash_table_size**, +**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do +_not_ point to the object headers but directly to the payloads. When a new +journal file is created the two hash table objects need to be created right +away as first two objects in the stream. + +If the hash table fill level rises above a certain threshold (learning +from Java's Hashtable for example: > 75%), the writer should rotate the file +and create a new one. + +The DATA_HASH_TABLE should be sized taking into account the maximum size the +file is expected to grow to, as configured by the administrator or disk space +considerations. 
The FIELD_HASH_TABLE should be sized to a fixed size; the +number of fields should be pretty static as it depends only on developers' +creativity rather than runtime parameters. + + +## Entry Array Objects + + +```c +_packed_ struct EntryArrayObject { + ObjectHeader object; + le64_t next_entry_array_offset; + union { + le64_t regular[]; + le32_t compact[]; + } items; +}; +``` + +Entry Arrays are used to store a sorted array of offsets to entries. Entry +arrays are strictly sorted by offsets on disk, and hence by their timestamps +and sequence numbers (with some restrictions, see above). + +If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, offsets are stored as 32-bit +integers instead of 64-bit. + +Entry Arrays are chained up. If one entry array is full another one is +allocated and the **next_entry_array_offset** field of the old one pointed to +it. An Entry Array with **next_entry_array_offset** set to 0 is the last in the +list. To optimize allocation and seeking, as entry arrays are appended to a +chain of entry arrays they should increase in size (double). + +Due to being monotonically ordered entry arrays may be searched with a binary +search (bisection). + +One chain of entry arrays links up all entries written to the journal. The +first entry array is referenced in the **entry_array_offset** field of the +header. + +Each DATA object also references an entry array chain listing all entries +referencing a specific DATA object. Since many DATA objects are only referenced +by a single ENTRY the first offset of the list is stored inside the DATA object +itself, an ENTRY_ARRAY object is only needed if it is referenced by more than +one ENTRY. + + +## Tag Object + +```c +#define TAG_LENGTH (256/8) + +_packed_ struct TagObject { + ObjectHeader object; + le64_t seqnum; + le64_t epoch; + uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */ +}; +``` + +Tag objects are used to seal off the journal for alteration. In regular +intervals a tag object is appended to the file. 
The tag object consists of a +SHA-256 HMAC tag that is calculated from the objects stored in the file since +the last tag was written, or from the beginning if no tag was written yet. The +key for the HMAC is calculated via the externally maintained FSPRG logic for +the epoch that is written into **epoch**. The sequence number **seqnum** is +increased with each tag. When calculating the HMAC of objects, header fields +that are volatile are excluded (skipped). More specifically all fields that +might validly be altered to maintain a consistent file structure (such as +offsets to objects added later for the purpose of linked lists and suchlike) +after an object has been written are not protected by the tag. This means a +verifier has to independently check these fields for consistency of +structure. For the fields excluded from the HMAC please consult the source code +directly. A verifier should read the file from the beginning to the end, always +calculating the HMAC for the objects it reads. Each time a tag object is +encountered the HMAC should be verified and restarted. The tag object sequence +numbers need to increase strictly monotonically. Tag objects themselves are +partially protected by the HMAC (i.e. seqnum and epoch are included, the tag +itself not). + + +## Algorithms + +### Reading + +Given an offset to an entry all data fields are easily found by following the +offsets in the data item array of the entry. + +Listing entries without filter is done by traversing the list of entry arrays +starting with the header's **entry_array_offset** field. + +Seeking to an entry by timestamp or sequence number (without any matches) is +done via binary search in the entry arrays starting with the header's +**entry_array_offset** field. Since these arrays double in size as more are +added the time cost of seeking is O(log(n)*log(n)) if n is the number of +entries in the file. 
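The bisection inside a single entry array can be sketched as follows. This is a simplified illustration, not the systemd implementation: `bisect_entry_array()`, `entry_key_fn` and `entry_seqnum_at()` are hypothetical names, the item offsets are assumed to be already decoded into a host-order array, and the example callback reads the sequence number at byte 16 of an ENTRY object (right after the 16-byte `ObjectHeader`) from an in-memory image without any bounds checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Callback that resolves an entry offset to the key being searched for
 * (a sequence number or a timestamp). */
typedef uint64_t (*entry_key_fn)(uint64_t entry_offset, void *userdata);

/* Returns the index of the first item whose key is >= needle, or n if
 * every entry in this array has a smaller key. */
size_t bisect_entry_array(const uint64_t *items, size_t n, uint64_t needle,
                          entry_key_fn get_key, void *userdata) {
        size_t lo = 0, hi = n;

        while (lo < hi) {
                size_t mid = lo + (hi - lo) / 2;

                if (get_key(items[mid], userdata) < needle)
                        lo = mid + 1;
                else
                        hi = mid;
        }
        return lo;
}

/* Example callback: userdata points to an in-memory image of the file.
 * Assumes seqnum directly follows the 16-byte object header and is stored
 * little-endian. A real reader must validate entry_offset first. */
uint64_t entry_seqnum_at(uint64_t entry_offset, void *userdata) {
        const uint8_t *p = (const uint8_t *) userdata + entry_offset + 16;
        uint64_t v = 0;

        for (int i = 7; i >= 0; i--)
                v = (v << 8) | p[i];
        return v;
}
```

A full seek would first walk the chain of entry arrays to find the one whose key range covers the needle and then bisect inside it. With arrays doubling in size, the chain has roughly log(n) links and each bisection takes log(n) steps.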
+ +When seeking or listing with one field match applied the DATA object of the +match is first identified, and then its data entry array chain is traversed. The +time cost is the same as for seeks/listings with no match. + +If multiple matches are applied, multiple chains of entry arrays should be +traversed in parallel. Since they all are strictly monotonically ordered by +offset of the entries, advancing in one can be directly applied to the others, +until an entry matching all matches is found. In the worst case seeking like +this is O(n) where n is the number of matching entries of the "loosest" match, +but in the common case should be much more efficient at least for the +well-known fields, where the set of possible field values tends to be closely +related. Checking whether an entry matches a number of matches is efficient +since the item array of the entry contains hashes of all data fields +referenced, and the number of data fields of an entry is generally small (< +30). + +When interleaving multiple journal files seeking tends to be a frequently used +operation, but in this case can be effectively suppressed by caching results +from previous entries. + +When listing all possible values a certain field can take it is sufficient to +look up the FIELD object and follow the chain of links to all DATA it includes. + +### Writing + +When an entry is appended to the journal, for each of its data fields the data +hash table should be checked. If the data field does not yet exist in the file, +it should be appended and added to the data hash table. When a data field's data +object is added, the field hash table should be checked for the field name of +the data field, and a field object be added if necessary. After all data fields +(and recursively all field names) of the new entry are appended and linked up +in the hashtables, the entry object should be appended and linked up too. 
+ +At regular intervals a tag object should be written if sealing is enabled (see +above). Before the file is closed a tag should be written too, to seal it off. + +Before writing an object, time and disk space limits should be checked and +rotation triggered if necessary. + + +## Optimizing Disk IO + +_A few general ideas to keep in mind:_ + +The hash tables for looking up fields and data should be quickly in the memory +cache and not hurt performance. All entries and entry arrays are ordered +strictly by time on disk, and hence should expose an OK access pattern on +rotating media, when read sequentially (which should be the most common case, +given the nature of log data). + +The disk access patterns of the binary search for entries needed for seeking +are problematic on rotating disks. This should not be a major issue though, +since seeking should not be a frequent operation. + +When reading, collecting data fields for presenting entries to the user is +problematic on rotating disks. In order to optimize these patterns the item +array of entry objects should be sorted by disk offset before +writing. Effectively, frequently used data objects should be in the memory +cache quickly. Non-frequently used data objects are likely to be located +between the previous and current entry when reading and hence should expose an +OK access pattern. Problematic are data objects that are neither frequently nor +infrequently referenced, which will cost seek time. + +And that's all there is to it. + +Thanks for your interest! 
diff --git a/docs/JOURNAL_NATIVE_PROTOCOL.md b/docs/JOURNAL_NATIVE_PROTOCOL.md new file mode 100644 index 0000000..ce00d7e --- /dev/null +++ b/docs/JOURNAL_NATIVE_PROTOCOL.md @@ -0,0 +1,191 @@ +--- +title: Native Journal Protocol +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Native Journal Protocol + +`systemd-journald.service` accepts log data via various protocols: + +* Classic RFC3164 BSD syslog via the `/dev/log` socket +* STDOUT/STDERR of programs via `StandardOutput=journal` + `StandardError=journal` in service files (both of which are default settings) +* Kernel log messages via the `/dev/kmsg` device node +* Audit records via the kernel's audit subsystem +* Structured log messages via `journald`'s native protocol + +The latter is what this document is about: if you are developing a program and +want to pass structured log data to `journald`, it's the Journal's native +protocol that you want to use. The systemd project provides the +[`sd_journal_print(3)`](https://www.freedesktop.org/software/systemd/man/sd_journal_print.html) +API that implements the client side of this protocol. This document explains +what this interface does behind the scenes, in case you'd like to implement a +client for it yourself, without linking to `libsystemd` — for example because +you work in a programming language other than C or otherwise want to avoid the +dependency. + +## Basics + +The native protocol of `journald` is spoken on the +`/run/systemd/journal/socket` `AF_UNIX`/`SOCK_DGRAM` socket on which +`systemd-journald.service` listens. Each datagram sent to this socket +encapsulates one journal entry that shall be written. Since datagrams are +subject to a size limit and we want to allow large journal entries, datagrams +sent over this socket may come in one of two formats: + +* A datagram with the literal journal entry data as payload, without + any file descriptors attached. 
+ +* A datagram with an empty payload, but with a single + [`memfd`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) + file descriptor that contains the literal journal entry data. + +Other combinations are not permitted, i.e. datagrams with both payload and file +descriptors, or datagrams with neither, or more than one file descriptor. Such +datagrams are ignored. The `memfd` file descriptor should be fully sealed. The +binary format in the datagram payload and in the `memfd` memory is +identical. Typically a client would attempt to first send the data as datagram +payload, but if this fails with an `EMSGSIZE` error it would immediately retry +via the `memfd` logic. + +A client probably should bump up the `SO_SNDBUF` socket option of its `AF_UNIX` +socket towards `journald` in order to delay blocking I/O as much as possible. + +## Data Format + +Each datagram should consist of a number of environment-like key/value +assignments. Unlike environment variable assignments the value may contain NUL +bytes however, as well as any other binary data. Keys may not include the `=` +or newline characters (or any other control characters or non-ASCII characters) +and may not be empty. + +Serialization into the datagram payload or `memfd` is straightforward: each +key/value pair is serialized via one of two methods: + +* The first method inserts a `=` character between key and value, and suffixes +the result with `\n` (i.e. the newline character, ASCII code 10). Example: a +key `FOO` with a value `BAR` is serialized `F`, `O`, `O`, `=`, `B`, `A`, `R`, +`\n`. + +* The second method should be used if the value of a field contains a `\n` +byte. In this case, the key name is serialized as is, followed by a `\n` +character, followed by a (non-aligned) little-endian unsigned 64-bit integer +encoding the size of the value, followed by the literal value data, followed by +`\n`. 
Example: a key `FOO` with a value `BAR` may be serialized using this +second method as: `F`, `O`, `O`, `\n`, `\003`, `\000`, `\000`, `\000`, `\000`, +`\000`, `\000`, `\000`, `B`, `A`, `R`, `\n`. + +If the value of a key/value pair contains a newline character (`\n`), it *must* +be serialized using the second method. If it does not, either method is +permitted. However, it is generally recommended to use the first method if +possible for all key/value pairs where applicable since the generated datagrams +are easily recognized and understood by the human eye this way, without any +manual binary decoding — which improves the debugging experience a lot, in +particular with tools such as `strace` that can show datagram content as text +dump. After all, log messages are highly relevant for debugging programs, hence +optimizing log traffic for readability without special tools is generally +desirable. + +Note that keys that begin with `_` have special semantics in `journald`: they +are *trusted* and implicitly appended by `journald` on the receiving +side. Clients should not send them — if they do anyway, they will be ignored. + +The most important key/value pair to send is `MESSAGE=`, as that contains the +actual log message text. Other relevant keys a client should send in most cases +are `PRIORITY=`, `CODE_FILE=`, `CODE_LINE=`, `CODE_FUNC=`, `ERRNO=`. It's +recommended to generate these fields implicitly on the client side. For further +information see the [relevant documentation of these +fields](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html). + +The order in which the fields are serialized within one datagram is undefined +and may be freely chosen by the client. The server side might or might not +retain or reorder it when writing it to the Journal. + +Some programs might generate multi-line log messages (e.g. a stack unwinder +generating log output about a stack trace, with one line for each stack +frame). 
It's highly recommended to send these as a single datagram, using a +single `MESSAGE=` field with embedded newline characters between the lines (the +second serialization method described above must hence be used for this +field). If possible do not split up individual events into multiple Journal +events that might then be processed and written into the Journal as separate +entries. The Journal toolchain is capable of handling multi-line log entries +just fine, and it's generally preferred to have a single set of metadata fields +associated with each multi-line message. + +Note that the same keys may be used multiple times within the same datagram, +with different values. The Journal supports this and will write such entries to +disk without complaining. This is useful for associating a single log entry +with multiple suitable objects of the same type at once. This should only be +used for specific Journal fields however, where this is expected. Do not use +this for Journal fields where this is not expected and where code reasonably +assumes per-event uniqueness of the keys. In most cases code that consumes and +displays log entries is likely to ignore such non-unique fields or only +consider the first of the specified values. Specifically, if a Journal entry +contains multiple `MESSAGE=` fields, likely only the first one is +displayed. Note that a well-written logging client library thus will not use a +plain dictionary for accepting structured log metadata, but rather a data +structure that allows non-unique keys, for example an array, or a dictionary +that optionally maps to a set of values instead of a single value. + +## Example Datagram + +Here's an encoded message, with various common fields, all encoded according to +the first serialization method, with the exception of one, where the value +contains a newline character, and thus the second method must be used. 
+ +``` +PRIORITY=3\n +SYSLOG_FACILITY=3\n +CODE_FILE=src/foobar.c\n +CODE_LINE=77\n +BINARY_BLOB\n +\004\000\000\000\000\000\000\000xx\nx\n +CODE_FUNC=some_func\n +SYSLOG_IDENTIFIER=footool\n +MESSAGE=Something happened.\n +``` + +(Lines are broken here after each `\n` to make things more readable. C-style +backslash escaping is used.) + +## Automatic Protocol Upgrading + +It might be wise to automatically upgrade to logging via the Journal's native +protocol in clients that previously used the BSD syslog protocol. Behaviour in +this case should be pretty obvious: try connecting a socket to +`/run/systemd/journal/socket` first (on success use the native Journal +protocol), and if that fails fall back to `/dev/log` (and use the BSD syslog +protocol). + +Programs normally logging to STDERR might also choose to upgrade to native +Journal logging in case they are invoked via systemd's service logic, where +STDOUT and STDERR are going to the Journal anyway. By preferring the native +protocol over STDERR-based logging, structured metadata can be passed along, +including priority information and more — which is not available on STDERR +based logging. If a program wants to detect automatically whether its STDERR is +connected to the Journal's stream transport, look for the `$JOURNAL_STREAM` +environment variable. The systemd service logic sets this variable to a +colon-separated pair of device and inode number (formatted in decimal ASCII) of +the STDERR file descriptor. If the `.st_dev` and `.st_ino` fields of the +`struct stat` data returned by `fstat(STDERR_FILENO, …)` match these values a +program can be sure its STDERR is connected to the Journal, and may then opt to +upgrade to the native Journal protocol via an `AF_UNIX` socket of its own, and +cease to use STDERR. + +Why bother with this environment variable check? 
A service program invoked by +systemd might employ shell-style I/O redirection on invoked subprograms, and +those should likely not upgrade to the native Journal protocol, but instead +continue to use the redirected file descriptors passed to them. Thus, by +comparing the device and inode number of the actual STDERR file descriptor with +the one the service manager passed, one can make sure that no I/O redirection +took place for the current program. + +## Alternative Implementations + +If you are looking for alternative implementations of this protocol (besides +systemd's own in `sd_journal_print()`), consider +[GLib's](https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gmessages.c) or +[`dbus-broker`'s](https://github.com/bus1/dbus-broker/blob/main/src/util/log.c). + +And that's already all there is to it. diff --git a/docs/MEMORY_PRESSURE.md b/docs/MEMORY_PRESSURE.md new file mode 100644 index 0000000..69c23ec --- /dev/null +++ b/docs/MEMORY_PRESSURE.md @@ -0,0 +1,238 @@ +--- +title: Memory Pressure Handling +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Memory Pressure Handling in systemd + +When the system is under memory pressure (i.e. some component of the OS +requires memory allocation but there is only very little or none available), +it can attempt various things to make more memory available again ("reclaim"): + +* The kernel can flush out memory pages backed by files on disk, under the + knowledge that it can reread them from disk when needed again. Candidate + pages are the many memory mapped executable files and shared libraries on + disk, among others. + +* The kernel can flush out memory pages not backed by files on disk + ("anonymous" memory, i.e. memory allocated via `malloc()` and similar calls, + or `tmpfs` file system contents) if there's swap to write it to. + +* Userspace can proactively release memory it allocated but doesn't immediately + require back to the kernel. 
This includes allocation caches, and other forms + of caches that are not required for normal operation to continue. + +The latter is what we want to focus on in this document: how to ensure +userspace processes can detect mounting memory pressure early and release memory +back to the kernel as it happens, relieving the memory pressure before it +becomes too critical. + +The effects of memory pressure during runtime generally are growing latencies +during operation: when a program requires memory but the system is busy writing +out memory to (relatively slow) disks in order to make some available, this +generally surfaces in scheduling latencies, and applications and services will +slow down until memory pressure is relieved. Hence, to ensure stable service +latencies it is essential to release unneeded memory back to the kernel early +on. + +On Linux the [Pressure Stall Information +(PSI)](https://docs.kernel.org/accounting/psi.html) Linux kernel interface is +the primary way to determine whether the system or a part of it is under memory +pressure. PSI makes available to userspace a `poll()`-able file descriptor that +gets notifications whenever memory pressure latencies for the system or a +control group grow beyond some level. + +`systemd` itself makes use of PSI, and helps applications to do so too. +Specifically: + +* Most of systemd's long running components watch for PSI memory pressure + events, and release allocation caches and other resources once seen. + +* systemd's service manager provides a protocol for asking services to monitor + PSI events and configure the appropriate pressure thresholds. + +* systemd's `sd-event` event loop API provides a high-level call + `sd_event_add_memory_pressure()` enabling programs using it to efficiently + hook into the PSI memory pressure protocol provided by the service manager, + with very few lines of code. 
+ +## Memory Pressure Service Protocol + +If memory pressure handling for a specific service is enabled via +`MemoryPressureWatch=` the memory pressure service protocol is used to tell the +service code about this. Specifically two environment variables are set by the +service manager, and typically consumed by the service: + +* The `$MEMORY_PRESSURE_WATCH` environment variable will contain an absolute + path in the file system to the file to watch for memory pressure events. This + will usually point to a PSI file such as the `memory.pressure` file of the + service's cgroup. In order to make debugging easier, and allow later + extension it is recommended for applications to also allow this path to refer + to an `AF_UNIX` stream socket in the file system or a FIFO inode in the file + system. Regardless which of the three types of inodes this absolute path + refers to, all three are `poll()`-able for memory pressure events. The + variable can also be set to the literal string `/dev/null`. If so the service + code should take this as indication that memory pressure monitoring is not + desired and should be turned off. + +* The `$MEMORY_PRESSURE_WRITE` environment variable is optional. If set by the + service manager it contains Base64 encoded data (that may contain arbitrary + binary values, including NUL bytes) that should be written into the path + provided via `$MEMORY_PRESSURE_WATCH` right after opening it. Typically, if + talking directly to a PSI kernel file this will contain information about the + threshold settings configurable in the service manager. + +When a service initializes it hence should look for +`$MEMORY_PRESSURE_WATCH`. If set, it should try to open the specified path. If +it detects the path to refer to a regular file it should assume it refers to a +PSI kernel file. 
If so, it should write the data from `$MEMORY_PRESSURE_WRITE` +into the file descriptor (after Base64-decoding it, and only if the variable is +set) and then watch for `POLLPRI` events on it. If it detects the path refers +to a FIFO inode, it should open it, write the `$MEMORY_PRESSURE_WRITE` data +into it (as above) and then watch for `POLLIN` events on it. Whenever `POLLIN` +is seen it should read and discard any data queued in the FIFO. If the path +refers to an `AF_UNIX` socket in the file system, the application should +`connect()` a stream socket to it, write `$MEMORY_PRESSURE_WRITE` into it (as +above) and watch for `POLLIN`, discarding any data it might receive. + +To summarize: + +* If `$MEMORY_PRESSURE_WATCH` points to a regular file: open and watch for + `POLLPRI`, never read from the file descriptor. + +* If `$MEMORY_PRESSURE_WATCH` points to a FIFO: open and watch for `POLLIN`, + read/discard any incoming data. + +* If `$MEMORY_PRESSURE_WATCH` points to an `AF_UNIX` socket: connect and watch + for `POLLIN`, read/discard any incoming data. + +* If `$MEMORY_PRESSURE_WATCH` contains the literal string `/dev/null`, turn off + memory pressure handling. + +(And in each case, immediately after opening/connecting to the path, write the +decoded `$MEMORY_PRESSURE_WRITE` data into it.) + +Whenever a `POLLPRI`/`POLLIN` event is seen the service is under memory +pressure. It should use this as hint to release suitable redundant resources, +for example: + +* glibc's memory allocation cache, via + [`malloc_trim()`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html). Similarly, + allocation caches implemented in the service itself. + +* Any other local caches, such as DNS caches, or web caches (in particular if + the service is a web browser). + +* Terminate any idle worker threads or processes. + +* Run a garbage collection (GC) cycle, if the runtime environment supports it. + +* Terminate the process if idle, and can be automatically started when + needed next. 
+ +Which actions precisely to take depends on the service in question. Note that +the notifications are delivered when memory allocation latency already degraded +beyond some point. Hence when discussing which resources to keep and which to +discard, keep in mind that latencies incurred +recovering discarded resources at a later point are typically acceptable, given that +latencies *already* are affected negatively. + +In case the path supplied via `$MEMORY_PRESSURE_WATCH` points to a PSI kernel +API file, or to an `AF_UNIX` socket, opening it multiple times is safe and reliable, +and should deliver notifications to each of the opened file descriptors. This +is specifically useful for services that consist of multiple processes, and +where each of them shall be able to release resources on memory pressure. + +The `POLLPRI`/`POLLIN` conditions will be triggered every time memory pressure +is detected, but not continuously. It is thus safe to keep `poll()`-ing on the +same file descriptor continuously, and executing resource release operations +whenever the file descriptor triggers without having to expect overloading the +process. + +(Currently, the protocol defined here only allows configuration of a single +"degree" of memory pressure, there's no distinction made on how strong the +pressure is. In future, if it becomes apparent that there's clear need to +extend this we might eventually add different degrees, most likely by adding +additional environment variables such as `$MEMORY_PRESSURE_WRITE_LOW` and +`$MEMORY_PRESSURE_WRITE_HIGH` or similar, which may contain different settings +for lower or higher memory pressure thresholds.) 
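For services that do not use `sd-event`, the client side of the protocol described in this section might look roughly like the following Python sketch. The function names `open_memory_pressure()` and `wait_for_pressure()` are hypothetical helpers invented for this illustration; only the two environment variables, the inode type checks and the `POLLPRI`/`POLLIN` distinction are taken from the text above. Error handling is deliberately minimal.

```python
import base64
import os
import select
import socket
import stat


def open_memory_pressure(environ=os.environ):
    """Return (fd, poll_events), or None if monitoring is unset or disabled.

    Hypothetical helper: follows the steps documented for
    $MEMORY_PRESSURE_WATCH and $MEMORY_PRESSURE_WRITE.
    """
    path = environ.get("MEMORY_PRESSURE_WATCH")
    if not path or path == "/dev/null":
        return None

    # The write data is optional and Base64 encoded.
    payload = base64.b64decode(environ.get("MEMORY_PRESSURE_WRITE", ""))

    mode = os.stat(path).st_mode
    if stat.S_ISSOCK(mode):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(path)
        fd, events = sock.detach(), select.POLLIN
    elif stat.S_ISFIFO(mode):
        fd, events = os.open(path, os.O_RDWR | os.O_NONBLOCK), select.POLLIN
    elif stat.S_ISREG(mode):
        # Assumed to be a PSI kernel file: watch POLLPRI, never read.
        fd, events = os.open(path, os.O_RDWR), select.POLLPRI
    else:
        raise ValueError("unsupported inode type for " + path)

    if payload:
        os.write(fd, payload)
    return fd, events


def wait_for_pressure(fd, events, timeout_ms=None):
    """Block until a pressure event arrives; return False on timeout."""
    poller = select.poll()
    poller.register(fd, events)
    ready = poller.poll(timeout_ms)
    if not any(revents & events for _, revents in ready):
        return False
    if events & select.POLLIN:
        os.read(fd, 4096)  # FIFO/socket case: read and discard queued data
    return True
```

A service would call `open_memory_pressure()` once at start-up and then, for as long as it runs, release caches each time `wait_for_pressure()` returns `True`. Integrating the file descriptor into an existing event loop instead of blocking in a dedicated loop is usually the better fit, but that depends on the program.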
+ +## Service Manager Settings + +The service manager provides two per-service settings that control the memory +pressure handling: + +* The + [`MemoryPressureWatch=`](https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryPressureWatch=) + setting controls whether to enable the memory pressure protocol for the + service in question. + +* The `MemoryPressureThresholdSec=` setting configures the threshold + at which memory pressure is signalled to the service. It takes a time value + (usually in the millisecond range) that defines a threshold per 1s time + window: if memory allocation latencies grow beyond this threshold + notifications are generated towards the service, requesting it to release + resources. + +The `/etc/systemd/system.conf` file provides two settings that may be used to +select the default values for the above settings. If the threshold is +configured via neither the per-service nor the system-wide option, it defaults to 100ms. + +When memory pressure monitoring is enabled for a service via +`MemoryPressureWatch=` this primarily does three things: + +* It enables cgroup memory accounting for the service (this is a requirement + for per-cgroup PSI) + +* It sets the aforementioned two environment variables for processes invoked + for the service, based on the control group of the service and provided + settings. + +* The `memory.pressure` PSI control group file associated with the service's + cgroup is delegated to the service (i.e. permissions are relaxed so that + unprivileged service payload code can open the file for writing). 
+ +## Memory Pressure Events in `sd-event` + +The +[`sd-event`](https://www.freedesktop.org/software/systemd/man/sd-event.html) +event loop library provides two API calls that encapsulate the +functionality described above: + +* The + [`sd_event_add_memory_pressure()`](https://www.freedesktop.org/software/systemd/man/sd_event_add_memory_pressure.html) + call implements the service-side of the memory pressure protocol and + integrates it with an `sd-event` event loop. It reads the two environment + variables, connects/opens the specified file, writes the specified data to it, + then watches it for events. + +* The `sd_event_trim_memory()` call may be called to trim the calling + process's memory. It's a wrapper around glibc's `malloc_trim()`, but first + releases allocation caches maintained by libsystemd internally. This function + serves as the default when a NULL callback is supplied to + `sd_event_add_memory_pressure()`. + +When implementing a service using `sd-event`, for automatic memory pressure +handling, it's typically sufficient to add a line such as: + +```c +(void) sd_event_add_memory_pressure(event, NULL, NULL, NULL); +``` + +– right after allocating the event loop object `event`. + +## Other APIs + +Other programming environments might have native APIs to watch memory +pressure/low memory events. Most notable is probably GLib's +[GMemoryMonitor](https://developer-old.gnome.org/gio/stable/GMemoryMonitor.html). It +currently uses the per-system Linux PSI interface as the backend, but operates +differently than the above: memory pressure events are picked up by a system +service, which then propagates this through D-Bus to the applications. This is +typically less than ideal, since this means each notification event has to +traverse three processes before being handled. This traversal creates +additional latencies at a time where the system is already experiencing adverse +latencies. 
Moreover, it focusses on system-wide PSI events, even though +service-local ones are generally the better approach. diff --git a/docs/MINIMAL_BUILDS.md b/docs/MINIMAL_BUILDS.md new file mode 100644 index 0000000..faa4f2d --- /dev/null +++ b/docs/MINIMAL_BUILDS.md @@ -0,0 +1,18 @@ +--- +title: Minimal Builds +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Minimal Builds + +systemd includes a variety of components. The core components are always built (which includes systemd itself, as well as udevd and journald). Many of the other components can be disabled at compile time with configure switches. + +For some uses the configure switches do not provide sufficient modularity. For example, they cannot be used to build only the man pages, or to build only the tmpfiles tool, only detect-virt or only udevd. If such modularity is required that goes beyond what we support in the configure script we can suggest you two options: + +1. Build systemd as usual, but pick only the built files you need from the result of "make install DESTDIR=<directory>", by using the file listing functionality of your packaging software. For example: if all you want is the tmpfiles tool, then build systemd normally, and list only /usr/bin/systemd-tmpfiles in the .spec file for your RPM package. This is simple to do, allows you to pick exactly what you need, but requires a larger number of build dependencies (but not runtime dependencies). +2. If you want to reduce the build time dependencies (though only dbus and libcap are needed as build time deps) and you know the specific component you are interested in doesn't need it, then create a dummy .pc file for that dependency (i.e. basically empty), and configure systemd with PKG_CONFIG_PATH set to the path of these dummy .pc files. Then, build only the few bits you need with "make foobar", where foobar is the file you need. 
+ We are open to merging patches for the build system that make more "fringe" components of systemd optional. However, please be aware that in order to keep the complexity of our build system small and its readability high, and to make our lives easier, we will not accept patches that make the minimal core components optional, i.e. systemd itself, journald and udevd. + +Note that the .pc file trick mentioned above currently doesn't work for libcap, since libcap doesn't provide a .pc file. We invite you to go ahead and post a patch to libcap upstream to get this corrected. We'll happily change our build system to look for that .pc file then. (a .pc file has been sent to upstream by Bryan Kadzban. It is also available at [http://kdzbn.homelinux.net/libcap-add-pkg-config.patch](http://kdzbn.homelinux.net/libcap-add-pkg-config.patch)). diff --git a/docs/MOUNT_REQUIREMENTS.md b/docs/MOUNT_REQUIREMENTS.md new file mode 100644 index 0000000..9ccbd08 --- /dev/null +++ b/docs/MOUNT_REQUIREMENTS.md @@ -0,0 +1,72 @@ +--- +title: Mount Requirements +category: Booting +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Mount Point Availability Requirements + +systemd makes various requirements on the time during boot where various parts +of the Linux file system hierarchy must be available and must be mounted. If +the file systems backing these mounts are located on external or remote media, +that require special drivers, infrastructure or networking to be set up, then +this implies that this functionality must be started and running at that point +already. + +Generally, there are three categories of requirements: + +1. 🌥️ *initrd*: File system mounts that must be established before the OS + transitions into the root file system. (i.e. that must be established from + the initrd before the initrd→host transition takes place.) + +2. 
🌤️ *early*: File system mounts that must be established during early boot, + after the initrd→host transition took place, but before regular services are + started. (i.e. before `local-fs.target` is reached.) + +3. ☀️ *regular*: File system mounts that can be mounted at any time during the + boot process – but which specific, individual services might require to be + established at the point they are started. (i.e. these mounts are typically + ordered before `remote-fs.target`.) + +Of course, mounts that fall into category 3 can also be mounted during the +initrd or in early boot. And those from category 2 can also be mounted already +from the initrd. + +Here's a table with relevant mounts and to which category they belong: + +| *Mount* | *Category* | +|---------------|------------| +| `/` (root fs) | 1 | +| `/usr/` | 1 | +| `/etc/` | 1 | +| `/var/` | 2 | +| `/var/tmp/` | 2 | +| `/tmp/` | 2 | +| `/home/` | 3 | +| `/srv/` | 3 | +| XBOOTLDR | 3 | +| ESP | 3 | + +Or in other words: the root file system (obviously…), `/usr/` and `/etc/` (if +these are split off) must be mounted at the moment the initrd transitions into +the host. Then, `/var/` (with `/var/tmp/`) and `/tmp/` (if split off) must be +mounted, before the host reaches `local-fs.target` (and then `basic.target`), +after which any remaining mounts may be established. + +If mounts such as `/var/` are not mounted during early boot (or from the +initrd), and require some late boot service (for example a network manager +implementation) to operate this will likely result in cyclic ordering +dependencies, and will result in various forms of boot failures. + +If you intend to use network-backed mounts (NFS, SMB, iSCSI, NVME-TCP and +similar, including anything you add the `_netdev` pseudo mount option to) for +any of the mounts from category 1 or 2, make sure to use a network managing +implementation that is capable of running from the initrd/during early +boot. 
[`systemd-networkd(8)`](https://www.freedesktop.org/software/systemd/man/latest/systemd-networkd.html) +for example works well in such scenarios. + +Note that +[`systemd-homed.service(8)`](https://www.freedesktop.org/software/systemd/man/latest/systemd-homed.html) +(which is a regular service, i.e. runs after `basic.target`) requires `/home/` +to be mounted. diff --git a/docs/MY_SERVICE_CANT_GET_REATLIME.md b/docs/MY_SERVICE_CANT_GET_REATLIME.md new file mode 100644 index 0000000..20d31fb --- /dev/null +++ b/docs/MY_SERVICE_CANT_GET_REATLIME.md @@ -0,0 +1,28 @@ +--- +title: My Service Can't Get Realtime! +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# My Service Can't Get Realtime! + +_So, you have a service that requires real-time scheduling. When you run this service on your systemd system it is unable to acquire real-time scheduling, even though it is full root and has all possible privileges. And now you are wondering what is going on and what you can do about it?_ + +## What is Going on? + +By default systemd places all system services into their own control groups in the "cpu" hierarchy. This has the benefit that the CPU usage of services with many worker threads or processes (think: Apache with all its gazillion CGIs and stuff) gets roughly the same amount of CPU as a service with very few worker threads (think: MySQL). Instead of evening out CPU _per process_ this will cause CPU to be evened out _per service_. + +Now, the "cpu" cgroup controller of the Linux kernel has one major shortcoming: if a cgroup is created it needs an explicit, absolute RT time budget assigned, or otherwise RT is not available to any process in the group, and an attempt to acquire it will fail with EPERM. 
systemd will not assign any RT time budgets to the "cpu" cgroups it creates, simply because there is no feasible way to do that, since the budget needs to be specified in absolute time units and comes from a fixed pool. Or in other words: we'd love to assign a budget, but there are no sane values we could use. Thus, in its default configuration RT scheduling is simply not available for any system services. + +## Working Around the Issue + +Of course, that's quite a limitation, so here's how you work around this: + +* One option is to simply globally turn off that systemd creates a "cpu" cgroup for each of the system services. For that, edit `/etc/systemd/system.conf` and set `DefaultControllers=` to the empty string, then reboot. (An alternative is to disable the "cpu" controller in your kernel, entirely. systemd will not attempt to make use of controllers that aren't available in the kernel.) +* Another option is to turn this off for the specific service only. For that, edit your service file, and add `ControlGroup=cpu:/` to its `[Service]` section. This overrides the default logic for this one service only, and places all its processes back in the root cgroup of the "cpu" hierarchy, which has the full RT budget assigned. +* A third option is to simply assign your service a realtime budget. For that use `ControlGroupAttribute=cpu.rt_runtime_us 500000` in its `[Service]` or suchlike. See [the kernel documentation](http://www.kernel.org/doc/Documentation/scheduler/sched-design-CFS.txt) for details. The latter two options are not available for System V services. A possible solution is to write a small wrapper service file that simply calls the SysV script's start verb in `ExecStart=` and the stop verb in `ExecStop=`. (It also needs to set `RemainAfterExit=1` and `Type=forking`!) + +Note that this all only applies to services. By default, user applications run in the root cgroup of the "cpu" hierarchy, which avoids these problems for normal user applications. 
+ +In the long run we hope that the kernel is fixed to not require an RT budget to be assigned for any cgroup created before a process can acquire RT (i.e. a process' RT budget should be derived from the nearest ancestor cgroup which has a budget assigned, rather than unconditionally its own uninitialized budget.) Ideally, we'd also like to create a per-user cgroup by default, so that users with many processes get roughly the same amount of CPU as users with very few. diff --git a/docs/NETWORK_ONLINE.md b/docs/NETWORK_ONLINE.md new file mode 100644 index 0000000..b249eb4 --- /dev/null +++ b/docs/NETWORK_ONLINE.md @@ -0,0 +1,263 @@ +--- +title: Running Services After the Network Is Up +category: Networking +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Network Configuration Synchronization Points + +systemd provides three target units related to network configuration: + +## Network pre-configuration: `network-pre.target` + +`network-pre.target` is used to order services before any network interfaces +start to be configured. Its primary purpose is for usage with firewall services +that want to establish a firewall *before* any network interface is up. + +`network-pre.target` is a passive unit: it cannot be started directly and it is +not pulled in by the network management service, but instead a service that +wants to run before it must pull it in. Network management services hence +should set `After=network-pre.target`, but not `Wants=network-pre.target` or +`Requires=network-pre.target`. Services that want to be run before the network +is configured should use `Before=network-pre.target` and +`Wants=network-pre.target`. This way, unless there's actually a service that +needs to be ordered before the network is up, this target is not pulled in, +avoiding an unnecessary synchronization point. + +## Network management services: `network.target` + +`network.target` indicates that the network management stack has been started. 
+Ordering after it has little meaning during start-up: whether any network +interfaces are already configured when it is reached is not defined. + +Its primary purpose is for ordering things properly at shutdown: since the +shutdown ordering of units in systemd is the reverse of the startup ordering, +any unit that has `After=network.target` can be sure that it is *stopped* +before the network is shut down when the system is going down. This allows +services to cleanly terminate connections before going down, instead of losing +ongoing connections leaving the other side in an undefined state. + +Note that `network.target` is a passive unit: you cannot start it directly and +it is not pulled in by any services that want to make use of the network. +Instead, it is pulled in by the network management services +themselves. Services using the network should hence simply place an +`After=network.target` stanza in their unit files, without +`Wants=network.target` or `Requires=network.target`. + +## Network connectivity has been established: `network-online.target` + +`network-online.target` is a target that actively waits until the network is +"up", where the definition of "up" is defined by the network management +software. Usually it indicates a configured, routable IP address of some +kind. Its primary purpose is to actively delay activation of services until the +network has been set up. + +It is an active target, meaning that it may be pulled in by the services +requiring the network to be up, but is not pulled in by the network management +service itself. By default all remote mounts defined in `/etc/fstab` make use +of this service, in order to make sure the network is up before attempts to +connect to a network share are made. Note that normally, if no service requires +it and if no remote mount point is configured, this target is not pulled into +the boot, thus avoiding any delays during boot should the network not be +available. 
It is strongly recommended not to make use of this target too +liberally: for example network server software should generally not pull this +in (since server software generally is happy to accept local connections even +before any routable network interface is up). Its primary purpose is network +client software that cannot operate without network. + +For more details about those targets, see the +[systemd.special(7)](https://www.freedesktop.org/software/systemd/man/systemd.special.html) +man page. + +## Compatibility with SysV init + +LSB defines a `$network` dependency for legacy init scripts. Whenever systemd +encounters a `$network` dependency in LSB headers of init scripts it will +translate this to `Wants=` and `After=` dependencies on +`network-online.target`, staying relatively close to traditional LSB behaviour. + +# Discussion + +The meaning of `$network` is defined [only very +imprecisely](http://refspecs.linuxbase.org/LSB_3.1.1/LSB-Core-generic/LSB-Core-generic/facilname.html) +and people tend to have different ideas of what it is supposed to mean. Here are a +couple of ideas people came up with so far: + +* The network management software is up. +* All "configured" network interfaces are up and an IP address has been assigned to each. +* All discovered local hardware interfaces that have a link beat have an IP address assigned, independently of whether there is actually any explicit local configuration for them. +* The network has been set up precisely to the level that a DNS server is reachable. +* Same, but some specific site-specific server is reachable. +* Same, but "the Internet" is reachable. +* All "configured" ethernet devices are up, but all "configured" PPP links which are supposed to also start at boot don't have to be yet. +* A certain "profile" is enabled and some condition of the above holds. If another "profile" is enabled a different condition would have to be checked.
+* Based on the location of the system a different set of configuration should be up or checked for. +* At least one global IPv4 address is configured. +* At least one global IPv6 address is configured. +* At least one global IPv4 or IPv6 address is configured. +* And so on and so on. + +All these are valid approaches to the question "When is the network up?", but +none of them is good as a generic default. + +Modern networking tends to be highly dynamic: machines are moved between +networks, network configuration changes, hardware is added and removed, virtual +networks are set up, reconfigured, and shut down again. Network connectivity is +not unconditionally and continuously available, and a machine is connected to +different networks at different times. This is particularly true for mobile +hardware such as handsets, tablets, and laptops, but also for embedded systems and +servers. Software that is written under the assumption that network +connectivity is available continuously and never changes is hence not +up-to-date with reality. Well-written software should be able to handle dynamic +configuration changes. It should react to changing network configuration and +make the best of it. If it cannot reach a server it must retry. If network +connectivity is lost it must not fail catastrophically. Reacting +to local network configuration changes in daemon code is not particularly +hard. In fact many well-known network-facing services running on Linux have +been doing this for decades. A service written like this is robust, can be +started at any time, and will always make the best of the circumstances it is +running in. + +`$network` / `network-online.target` is a mechanism that is required only to +deal with software that assumes continuous network connectivity is available (i.e. of the +simple not-well-written kind). Which facet of it such software requires is undefined. An +IMAP server might just require a certain IP to be assigned so that it can +listen on it.
OTOH a network file system client might need DNS up, and the +service to contact up, as well. What precisely is required is not obvious and +can be different things depending on local configuration. + +A robust system boots up independently of external services. More specifically, +if a network DHCP server does not react, this should not slow down boot on most +setups, but only for those where network connectivity is strictly needed (for +example, because the host actually boots from the network). + +# FAQ + +## How do I make sure that my service starts after the network is *really* online? + +That depends on your setup and the services you plan to run after it (see +above). If you need to delay your service until network connectivity has been +established, include + +```ini +After=network-online.target +Wants=network-online.target +``` + +in the `.service` file. + +This will delay boot until the network management software says the network is "up". +For details, see the next question. + +## What does "up" actually mean? + +The services that are ordered before `network-online.target` define its +meaning. *Usually* it means that all configured network devices are up and have an +IP address assigned, but details may vary. In particular, configuration may +affect which interfaces are taken into account. + +`network-online.target` will time out after 90s. Enabling this might +considerably delay your boot even if the timeout is not reached. + +The right "wait" service must be enabled: +`NetworkManager-wait-online.service` if `NetworkManager` is used to configure +the network, `systemd-networkd-wait-online.service` if `systemd-networkd` is +used, etc.
`systemd-networkd.service` has +`Also=systemd-networkd-wait-online.service` in its `[Install]` section, so when +`systemd-networkd.service` is enabled, `systemd-networkd-wait-online.service` +will be enabled too, which means that `network-online.target` will include +`systemd-networkd-wait-online.service` when and only when +`systemd-networkd.service` is enabled. `NetworkManager-wait-online.service` is +set up similarly. This means that the "wait" services do not need to be enabled +explicitly. They will be enabled automatically when the "main" service is +enabled, though they will not be *used* unless something else pulls in +`network-online.target`. + +To verify that the right service is enabled (usually only one should be): +```console +$ systemctl is-enabled NetworkManager-wait-online.service systemd-networkd-wait-online.service +disabled +enabled +``` + +## Should `network-online.target` be used? + +Please note that `network-online.target` means that the network connectivity +*has been* reached, not that it is currently available. By the very nature and +design of the network, connectivity may briefly or permanently disappear, so +for reasonable user experience, services need to handle temporary lack of +connectivity. + +If you are a developer, instead of wondering what to do about `network.target`, +please just fix your program to be friendly to dynamically changing network +configuration. That way you will make your users happy because things just +start to work, and you will get fewer bug reports. You also make the boot +faster by not delaying services until network connectivity has been +established. This is particularly important for folks with slow address +assignment replies from a DHCP server. + +Here are a couple of possible approaches: + +1. Watch rtnetlink and react to network configuration changes as they + happen. This is usually the nicest solution, but not always the easiest. +2. 
If you write a server: listen on `[::]`, `[::1]`, `0.0.0.0`, and `127.0.0.1` + only. These pseudo-addresses are unconditionally available. If you always + bind to these addresses you will have code that doesn't have to react to + network changes, as all you listen on is catch-all and private addresses. +3. If you write a server: if you want to listen on other, explicitly configured + addresses, consider using the `IP_FREEBIND` sockopt functionality of the + Linux kernel. This allows your code to bind to an address even if it is not + actually (yet or ever) configured locally. This also makes your code robust + towards network configuration changes. This is provided as `FreeBind=` + for systemd services, see + [systemd.socket(5)](https://www.freedesktop.org/software/systemd/man/systemd.socket.html). + +An exception to the above recommendations is services which require network +connectivity, but do not delay system startup. An example may be a service +which downloads package updates into a cache (to be used at some point in the +future by the package management software). Such a service may even start +during boot, and pull in and be ordered after `network-online.target`, but as +long as it is not ordered before any unit that is part of the default target, +it does not delay boot. It is usually easier to write such a service in a +"simplistic" way, where it doesn't try to wait for the network connectivity to +be (re-)established, but is instead started when the network has connectivity, +and if the network goes away, it fails and relies on the system manager to +restart it if appropriate. + +## Modifying the meaning of `network-online.target` + +As described above, the meaning of this target is defined first by which +implementing services are enabled (`NetworkManager-wait-online.service`, +`systemd-networkd-wait-online.service`, …), and second by the configuration +specific to those services.
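As an illustration of the second point, assuming `systemd-networkd` is the network manager in use, a `.network` file can exclude a link from the online check with `RequiredForOnline=` in its `[Link]` section. This is only a sketch: the file name and the interface name are made up for the example.

```ini
# /etc/systemd/network/50-backup-link.network (hypothetical file name)
[Match]
# Hypothetical interface name.
Name=eth1

[Network]
DHCP=yes

[Link]
# Do not let this link hold up systemd-networkd-wait-online.service,
# and therefore network-online.target.
RequiredForOnline=no
```

With such a file in place the link is still configured by `systemd-networkd`, but it is not one of the links the "wait" service waits for.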
+ +For example, `systemd-networkd-wait-online.service` will wait until all +interfaces that are present and managed by +[systemd-networkd.service(8)](https://www.freedesktop.org/software/systemd/man/systemd-networkd.service.html) +are fully configured or failed and at least one link is online; see +[systemd-networkd-wait-online.service(8)](https://www.freedesktop.org/software/systemd/man/systemd-networkd-wait-online.service.html) +for details. Those conditions are affected by the presence of configuration +that matches various links, but also by settings like +`Unmanaged=`, `RequiredForOnline=`, `RequiredFamilyForOnline=`; see +[systemd.network(5)](https://www.freedesktop.org/software/systemd/man/systemd.network.html). + +It is also possible to plug in additional checks for network state. For +example, to delay `network-online.target` until a specific host is +reachable (the name can be resolved over DNS and the appropriate route has been +established), the following simple service could be used: + +```ini +[Unit] +DefaultDependencies=no +After=nss-lookup.target +Before=network-online.target + +[Service] +Type=oneshot +RemainAfterExit=yes +ExecStart=sh -c 'until ping -c 1 example.com; do sleep 1; done' + +[Install] +WantedBy=network-online.target +``` diff --git a/docs/OPTIMIZATIONS.md b/docs/OPTIMIZATIONS.md new file mode 100644 index 0000000..3c8ac48 --- /dev/null +++ b/docs/OPTIMIZATIONS.md @@ -0,0 +1,52 @@ +--- +title: systemd Optimizations +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# systemd Optimizations + +_So you are working on a Linux distribution or appliance and need very fast boot-ups?_ + +systemd can already offer boot times of < 1s for the Core OS (userspace only, i.e. only the bits controlled by systemd) and < 2s for complete up-to-date desktop environments on simpler (but modern, i.e.
SSDs) laptops if configured properly (examples: [http://git.fenrus.org/tmp/bootchart-20120512-1036.svg](http://git.fenrus.org/tmp/bootchart-20120512-1036.svg)). In this page we want to suggest a couple of ideas how to achieve that, and if the resulting boot times do not suffice where we believe room for improvements are that we'd like to see implemented sooner or later. If you are interested in investing engineering manpower in systemd to get to even shorter boot times, this list hopefully includes a few good suggestions to start with. + +Of course, before optimizing you should instrument the boot to generate profiling data, so make sure you know your way around with systemd-bootchart, systemd-analyze and pytimechart! Optimizations without profiling are premature optimizations! + +Note that systemd's fast performance is a side effect of its design but wasn't the primary design goal. As it stands now systemd (and Fedora using it) has been optimized very little and still has a lot of room for improvements. There are still many low hanging fruits to pick! + +We are very interested in merging optimization work into systemd upstream. Note however that we are careful not to merge work that would drastically limit the general purpose usefulness or reliability of our code, or that would make systemd harder to maintain. So in case you work on optimizations for systemd, try to keep your stuff mainlineable. If in doubt, ask us. + +The distributions have adopted systemd to varying levels. While there are many compatibility scripts in the boot process on Debian for example, Fedora has much less (but still too many). For better performance consider disabling these scripts, or using a different distribution. + +It is our intention to optimize the upstream distributions by default (in particular Fedora) so that these optimizations won't be necessary. 
However, this will take some time, especially since making these changes is often not trivial when the general purpose usefulness cannot be compromised. + +What you can optimize (locally) without writing any code: + +1. Make sure not to use any fake block device storage technology such as LVM (as installed by default by various distributions, including Fedora) they result in the systemd-udev-settle.service unit to be pulled in. Settling device enumeration is slow, racy and mostly obsolete. Since LVM (still) hasn't been updated to handle Linux' event based design properly, settling device enumeration is still required for it, but it will slow down boot substantially. On Fedora, use "systemctl mask fedora-wait-storage.service fedora-storage-init-late.service fedora-storage-init.service" to get rid of all those storage technologies. Of course, don't try this if you actually installed your system with LVM. (The only fake block device storage technology that currently handles this all properly and doesn't require settling device enumerations is LUKS disk encryption.) +2. Consider bypassing the initrd, if you use one. On Fedora, make sure to install the OS on a plain disk without encryption, and without LVM/RAID/... (encrypted /home is fine) when doing this. Then, simply edit grub.conf and remove the initrd from your configuration, and change the root= kernel command line parameter so that it uses kernel device names instead of UUIDs, i.e. "root=sda5" or what is appropriate for your system. Also specify the root FS type with "rootfstype=ext4" (or as appropriate). Note that using kernel devices names is not really that nice if you have multiple hard disks, but if you are doing this for a laptop (i.e. with a single hdd), this should be fine. Note that you shouldn't need to rebuild your kernel in order to bypass the initrd. Distribution kernels (at least Fedora's) work fine with and without initrd, and systemd supports both ways to be started. +3. 
Consider disabling SELinux and auditing. We recommend leaving SELinux on, for security reasons, but truth be told you can save 100ms of your boot if you disable it. Use selinux=0 on the kernel cmdline. +4. Consider disabling Plymouth. If userspace boots in less than 1s, a boot splash is hardly useful, hence consider passing plymouth.enable=0 on the kernel command line. Plymouth is generally quite fast, but currently still forces settling device enumerations for graphics cards, which is slow. Disabling plymouth removes this bit of the boot. +5. Consider uninstalling syslog. The journal is used anyway on newer systemd systems, and is usually more than sufficient for desktops, and embedded, and even many servers. Just uninstall all syslog implementations and remember that "journalctl" will get you a pixel perfect copy of the classic /var/log/messages message log. To make journal logs persistent (i.e. so that they aren't lost at boot) make sure to run "mkdir -p /var/log/journal". +6. Consider masking a couple of redundant distribution boot scripts, that artificially slow down the boot. For example, on Fedora it's a good idea to mask fedora-autoswap.service fedora-configure.service fedora-loadmodules.service fedora-readonly.service. Also remove all LVM/RAID/FCOE/iSCSI related packages which slow down the boot substantially even if no storage of the specific kind is used (and if these RPMs can't be removed because some important packages require them, at least mask the respective services). +7. Console output is slow. So if you measure your boot times and ship your system, make sure to use "quiet" on the command line and disable systemd debug logging (if you enabled it before). +8. Consider removing cron from your system and use systemd timer units instead. Timer units currently have no support for calendar times (i.e. cannot be used to spawn things "at 6 am every Monday", but can do "run this every 7 days"), but for the usual /etc/cron.daily/, /etc/cron.weekly/, ... 
should be good enough, if the time of day of the execution doesn't matter (just add four small service and timer units for supporting these dirs. Eventually we might support these out of the box, but until then, just write your own scriptlets for this). +9. If you work on an appliance, consider disabling readahead collection in the shipped devices, but leave readahead replay enabled. +10. If you work on an appliance, make sure to build all drivers you need into the kernel, since module loading is slow. If you build a distribution at least build all the stuff 90% of all people need into your kernel, i.e. at least USB, AHCI and HDA! +11. If it works, use libahci.ignore_sss=1 when booting. +12. Use a modern desktop that doesn't pull in ConsoleKit anymore. For example GNOME 3.4. +13. Get rid of a local MTA, if you are building a desktop or appliance. I.e. on Fedora remove the sendmail RPMs which are (still!) installed by default. +14. If you build an appliance, don't forget that various components of systemd are optional and may be disabled during build time, see "./configure --help" for details. For example, get rid of the virtual console setup if you never have local console users (this is a major source of slowness, actually). In addition, if you never have local users at all, consider disabling logind. And there are more components that are frequently unnecessary on appliances. +15. This goes without saying: the boot-up gets faster if you start less stuff at boot. So run "systemctl" and check if there's stuff you don't need and disable it, or even remove its package. +16. Don't use debug kernels. Debug kernels are slow. Fedora exclusively uses debug kernels during the development phase of each release. If you care about boot performance, either recompile these kernels with debugging turned off or wait for the final distribution release. It's a drastic difference.
That also means that if you publish boot performance data of a Fedora pre-release distribution you are doing something wrong. ;-) So much about the basics of how to get a quick boot. Now, here's an incomprehensive list of things we'd like to see improved in systemd (and elsewhere) over short or long and need a bit of hacking (sometimes more, and sometimes less): +17. Get rid of systemd-cgroups-agent. Currently, whenever a systemd cgroup runs empty a tool "systemd-cgroups-agent" is invoked by the kernel which then notifies systemd about it. The need for this tool should really go away, which will save a number of forked processes at boot, and should make things faster (especially shutdown). This requires introduction of a new kernel interface to get notifications for cgroups running empty, for example via fanotify() on cgroupfs. +18. Make use of EXT4_IOC_MOVE_EXT in systemd's readahead implementation. This allows reordering/defragmentation of the files needed for boot. According to the data from [http://e4rat.sourceforge.net/](http://e4rat.sourceforge.net/) this might shorten the boot time to 40%. Implementation is not trivial, but given that we already support btrfs defragmentation and example code for this exists (e4rat as linked) should be fairly straightforward. +19. Compress readahead pack files with XZ or so. Since boot these days tends to be clearly IO bound (and not CPU bound) it might make sense to reduce the IO load for the pack file by compressing it. Since we already have a dependency on XZ we'd recommend using XZ for this. +20. Update the readahead logic to also precache directories (in addition to files). +21. Improve a couple of algorithms in the unit dependency graph calculation logic, as well as unit file loading. For example, right now when loading units we match them up with a subset of the other loaded units in order to add automatic dependencies between them where appropriate. 
Usually the set of units matched up is small, but the complexity is currently O(n^2), and this could be optimized. Since unit file loading and calculations in the dependency graphs is the only major, synchronous, computation-intensive bit of PID 1, and is executed before any services are started this should bring relevant improvements, especially on systems with big dependency graphs. +22. Add socket activation to X. Due to the special socket allocation semantics of X this is useful only for display :0. This should allow parallelization of X startup with its clients. +23. The usual housekeeping: get rid of shell-based services (i.e. SysV init scripts), replace them with unit files. Don't make use of Type=forking and ordering dependencies if possible, use socket activation with Type=simple instead. This allows drastically better parallelized start-up for your services. Also, if you cannot use socket activation, at least consider patching your services to support Type=notify in place of Type=forking. Consider making seldom used services activated on-demand (for example, printer services), and start frequently used services already at boot instead of delaying them until they are used. +24. Consider making use of systemd for the session as well, the way Tizen is doing this. This still needs some love in systemd upstream to be a smooth ride, but we definitely would like to go this way sooner or later, even for the normal desktops. +25. Add an option for service units to temporarily bump the CPU and IO priority of the startup code of important services. Note however, that we assume that this will not bring much and hence recommend looking into this only very late. Since boot-up tends to be IO bound, solutions such as readahead are probably more interesting than prioritizing service startup IO. 
Also, this would probably always require a certain amount of manual configuration since determining automatically which services are important is hard (if not impossible), because we cannot track properly which services other services wait for. +26. Same as the previous item, but temporarily lower the CPU/IO priority of the startups part of unimportant leaf services. This is probably more useful than 11 as it is easier to determine which processes don't matter. +27. Add a kernel sockopt for AF_UNIX to increase the maximum datagram queue length for SOCK_DGRAM sockets. This would allow us to queue substantially more logging datagrams in the syslog and journal sockets, and thus move the point where syslog/journal clients have to block before their message writes finish much later in the boot process. The current kernel default is rather low with 10. (As a temporary hack it is possible to increase /proc/sys/net/unix/max_dgram_qlen globally, but this has implications beyond systemd, and should probably be avoided.) The kernel patch to make this work is most likely trivial. In general, this should allow us to improve the level of parallelization between clients and servers for AF_UNIX sockets of type SOCK_DGRAM or SOCK_SEQPACKET. Again: the list above contains things we'd like to see in systemd anyway. We didn't do much profiling for these features, but we have enough indication to assume that these bits will bring some improvements. But yeah, if you work on this, keep your profiling tools ready at all times. diff --git a/docs/PASSWORD_AGENTS.md b/docs/PASSWORD_AGENTS.md new file mode 100644 index 0000000..29bd949 --- /dev/null +++ b/docs/PASSWORD_AGENTS.md @@ -0,0 +1,41 @@ +--- +title: Password Agents +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Password Agents + +systemd 12 and newer support lightweight password agents which can be used to query the user for system-level passwords or passphrases. 
These are passphrases that are not related to a specific user, but to some kind of hardware or service. Right now this is used exclusively for encrypted hard-disk passphrases but later on this is likely to be used to query passphrases of SSL certificates at Apache startup time as well. The basic idea is that a system component requesting a password entry can simply drop a simple .ini-style file into `/run/systemd/ask-password` which multiple different agents may watch via `inotify()`, and query the user as necessary. The answer is then sent back to the querier via an `AF_UNIX`/`SOCK_DGRAM` socket. Multiple agents might be running at the same time in which case they all should query the user and the agent which answers first wins. Right now systemd ships with the following passphrase agents: + +* A Plymouth agent used for querying passwords during boot-up +* A console agent used in similar situations if Plymouth is not available +* A GNOME agent which can be run as part of the normal user session which pops up a notification message and icon which when clicked receives the passphrase from the user. This is useful and necessary in case an encrypted system hard-disk is plugged in when the machine is already up. +* A [`wall(1)`](https://man7.org/linux/man-pages/man1/wall.1.html) agent which sends wall messages as soon as a password shall be entered. +* A simple tty agent which is built into "`systemctl start`" (and similar commands) and asks passwords to the user during manual startup of a service +* A simple tty agent which can be run manually to respond to all queued passwords + +It is easy to write additional agents. The basic algorithm to follow looks like this: + +* Create an inotify watch on /run/systemd/ask-password, watch for `IN_CLOSE_WRITE|IN_MOVED_TO` +* Ignore all events on files in that directory that do not start with "`ask.`" +* As soon as a file named "`ask.xxxx`" shows up, read it. It's a simple `.ini` file that may be parsed with the usual parsers. 
The `xxxx` suffix is randomized. +* Make sure to ignore unknown `.ini` file keys in those files, so that we can easily extend the format later on. +* You'll find the question to ask the user in the `Message=` field in the `[Ask]` section. It is a single-line string in UTF-8, which might be internationalized (by the party that originally asks the question, not by the agent). +* You'll find an icon name (following the XDG icon naming spec) to show next to the message in the `Icon=` field in the `[Ask]` section +* You'll find the PID of the client asking the question in the `PID=` field in the `[Ask]` section (Before asking your question use `kill(PID, 0)` and ignore the file if this returns `ESRCH`; there's no need to show the data of this field but if you want to you may) +* `Echo=` specifies whether the input should be obscured. If this field is missing or is `Echo=0`, the input should not be shown. +* The socket to send the response to is configured via `Socket=` in the `[Ask]` section. It is a `AF_UNIX`/`SOCK_DGRAM` socket in the file system. +* Ignore files where the time specified in the `NotAfter=` field in the `[Ask]` section is in the past. The time is specified in usecs, and refers to the `CLOCK_MONOTONIC` clock. If `NotAfter=` is `0`, no such check should take place. +* Make sure to hide a password query dialog as soon as a) the `ask.xxxx` file is deleted, watch this with inotify. b) the `NotAfter=` time elapses, if it is set `!= 0`. +* Access to the socket is restricted to privileged users. To acquire the necessary privileges to send the answer back, consider using PolicyKit. In fact, the GNOME agent we ship does that, and you may simply piggyback on that, by executing "`/usr/bin/pkexec /lib/systemd/systemd-reply-password 1 /path/to/socket`" or "`/usr/bin/pkexec /lib/systemd/systemd-reply-password 0 /path/to/socket`" and writing the password to its standard input. 
Use '`1`' as argument if a password was entered by the user, or '`0`' if the user canceled the request. +* If you do not want to use PK, make sure to acquire the necessary privileges in some other way and send a single datagram to the socket consisting of the password string either prefixed with "`+`" or with "`-`" depending on whether the password entry was successful or not. You may but don't have to include a final `NUL` byte in your message. + +Again, it is essential that you stop showing the password box/notification/status icon if the `ask.xxxx` file is removed or when `NotAfter=` elapses (if it is set `!= 0`)! + +It may happen that multiple password entries are pending at the same time. Your agent needs to be able to deal with that. Depending on your environment you may either choose to show all outstanding passwords at the same time or instead only one and as soon as the user has replied to that one go on to the next one. + +You may test this all by manually invoking the "`systemd-ask-password`" tool on the command line. Pass `--no-tty` to ensure the password is asked via the agent system. Note that only privileged users may use this tool (after all this is intended purely for system-level passwords). + +If you write a system level agent a smart way to activate it is using systemd `.path` units. This will ensure that systemd will watch the `/run/systemd/ask-password` directory and spawn the agent as soon as that directory becomes non-empty. In fact, the console, wall and Plymouth agents are started like this. If systemd is used to maintain user sessions as well you can use a similar scheme to automatically spawn your user password agent as well. (As of this moment we have not switched any DE over to use systemd for session management, however.)
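The `.path` activation described above might be sketched as follows. This is an illustration only: the unit names and the agent binary path are hypothetical, and the pair relies on a `.path` unit activating the `.service` unit of the same name by default.

```ini
# my-password-agent.path (hypothetical unit name)
[Unit]
Description=Watch for pending password requests (illustrative)

[Path]
# Trigger as soon as the request directory contains at least one file.
DirectoryNotEmpty=/run/systemd/ask-password
# Create the directory if it does not exist yet, so that it can be watched.
MakeDirectory=yes

[Install]
WantedBy=multi-user.target

# my-password-agent.service (hypothetical unit name, separate file)
[Unit]
Description=Custom password agent (illustrative)

[Service]
# Hypothetical agent binary implementing the algorithm listed above.
ExecStart=/usr/local/bin/my-password-agent
```

The two sections under each comment header belong in two separate unit files; enabling the `.path` unit is enough, since the service is started on demand.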
diff --git a/docs/PORTABILITY_AND_STABILITY.md b/docs/PORTABILITY_AND_STABILITY.md new file mode 100644 index 0000000..abdc3dc --- /dev/null +++ b/docs/PORTABILITY_AND_STABILITY.md @@ -0,0 +1,171 @@ +--- +title: Interface Portability and Stability +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Interface Portability and Stability Promise + +systemd provides various interfaces developers and programs might rely on. Starting with version 26 (the first version released with Fedora 15) we promise to keep a number of them stable and compatible for the future. + +The stable interfaces are: + +* **The unit configuration file format**. Unit files written now will stay compatible with future versions of systemd. Extensions to the file format will happen in a way that existing files remain compatible. + +* **The command line interface** of `systemd`, `systemctl`, `loginctl`, `journalctl`, and all other command line utilities installed in `$PATH` and documented in a man page. We will make sure that scripts invoking these commands will continue to work with future versions of systemd. Note however that the output generated by these commands is generally not included in the promise, unless it is documented in the man page. Example: the output of `systemctl status` is not stable, but that of `systemctl show` is, because the former is intended to be human readable and the latter computer readable, and this is documented in the man page. + +* **The protocol spoken on the socket referred to by `$NOTIFY_SOCKET`**, as documented in [sd_notify(3)](https://www.freedesktop.org/software/systemd/man/sd_notify.html). + +* Some of the **"special" unit names** and their semantics. To be precise the ones that are necessary for normal services, and not those required only for early boot and late shutdown, with very few exceptions. 
To list them here: `basic.target`, `shutdown.target`, `sockets.target`, `network.target`, `getty.target`, `graphical.target`, `multi-user.target`, `rescue.target`, `emergency.target`, `poweroff.target`, `reboot.target`, `halt.target`, `runlevel[1-5].target`. + +* **The D-Bus interfaces of the main service daemon and other daemons**. We try to always preserve backwards compatibility, and intentional breakage is never introduced. Nevertheless, when we find bugs that mean that the existing interface was not useful, or when the implementation did something different than stated by the documentation and the implemented behaviour is not useful, we will fix the implementation and thus introduce a change in behaviour. But the API (parameter counts and types) is never changed, and existing attributes and methods will not be removed. + +* For a more comprehensive and authoritative list, consult the chart below. + +The following interfaces will not necessarily be kept stable for now, but we will eventually make a stability promise for these interfaces too. In the meantime we will however try to keep breakage of these interfaces at a minimum: + +* **The set of states of the various state machines used in systemd**, e.g. the high-level unit states inactive, active, deactivating, and so on, as well (and in particular) the low-level per-unit states. + +* **All "special" units that aren't listed above**. + +The following interfaces are considered private to systemd, and are not and will not be covered by any stability promise: + +* **Undocumented switches** to `systemd`, `systemctl` and otherwise. + +* **The internal protocols** used on the various sockets such as the sockets `/run/systemd/shutdown`, `/run/systemd/private`. + +One of the main goals of systemd is to unify basic Linux configurations and service behaviors across all distributions. Systemd project does not contain any distribution-specific parts. 
Distributions are expected to convert over time their individual configurations to the systemd format, or they will need to carry and maintain patches in their package if they still decide to stay different. + +What does this mean for you? When developing with systemd, don't use any of the latter interfaces, or we will tell your mom, and she won't love you anymore. You are welcome to use the other interfaces listed here, but if you use any of the second kind (i.e. those where we don't yet make a stability promise), then make sure to subscribe to our mailing list, where we will announce API changes, and be prepared to update your program eventually. + +Note that this is a promise, not an eternal guarantee. These are our intentions, but if in the future there are very good reasons to change or get rid of an interface we have listed above as stable, then we might take the liberty to do so, despite this promise. However, if we do this, then we'll do our best to provide a smooth and reasonably long transition phase. + + +## Interface Portability And Stability Chart + +systemd provides a number of APIs to applications. Below you'll find a table detailing which APIs are considered stable and how portable they are. + +This list is intended to be useful for distribution and OS developers who are interested in maintaining a certain level of compatibility with the new interfaces systemd introduced, without relying on systemd itself. + +In general it is our intention to cooperate through interfaces and not code with other distributions and OSes. That means that the interfaces where this applies are best reimplemented in a compatible fashion on those other operating systems. To make this easy we provide detailed interface documentation where necessary. 
That said, it's all Open Source, hence you have the option to a) fork our code and maintain portable versions of the parts you are interested in independently for your OS, or b) build systemd for your distro, but leave out all components except the ones you are interested in and run them without the core of systemd involved. We will try not to make this any more difficult than necessary. Patches to allow systemd code to be more portable will be accepted on case-by-case basis (essentially, patches to follow well-established standards instead of e.g. glibc or linux extensions have a very high chance of being accepted, while patches which make the code ugly or exist solely to work around bugs in other projects have a low chance of being accepted). + +Many of these interfaces are already being used by applications and 3rd party code. If you are interested in compatibility with these applications, please consider supporting these interfaces in your distribution, where possible. + + +## General Portability of systemd and its Components + +**Portability to OSes:** systemd is not portable to non-Linux systems. It makes use of a large number of Linux-specific interfaces, including many that are used by its very core. We do not consider it feasible to port systemd to other Unixes (let alone non-Unix operating systems) and will not accept patches for systemd core implementing any such portability (but hey, it's git, so it's as easy as it can get to maintain your own fork...). APIs that are supposed to be used as library code are exempted from this: it is important to us that these compile nicely on non-Linux and even non-Unix platforms, even if they might just become NOPs. + +**Portability to Architectures:** It is important to us that systemd is portable to little endian as well as big endian systems. We will make sure to provide portability with all important architectures and hardware Linux runs on and are happy to accept patches for this. 
+ +**Portability to Distributions:** It is important to us that systemd is portable to all Linux distributions. However, the goal is to unify many of the needless differences between the distributions, and hence will not accept patches for certain distribution-specific work-arounds. Compatibility with the distribution's legacy should be maintained in the distribution's packaging, and not in the systemd source tree. + +**Compatibility with Specific Versions of Other packages:** We generally avoid adding compatibility kludges to systemd that work around bugs in certain versions of other software systemd interfaces with. We strongly encourage fixing bugs where they are, and if that's not systemd we rather not try to fix it there. (There are very few exceptions to this rule possible, and you need an exceptionally strong case for it). + + +## General Portability of systemd's APIs + +systemd's APIs are available everywhere where systemd is available. Some of the APIs we have defined are supposed to be generic enough to be implementable independently of systemd, thus allowing compatibility with systems systemd itself is not compatible with, i.e. other OSes, and distributions that are unwilling to fully adopt systemd. + +A number of systemd's APIs expose Linux or systemd-specific features that cannot sensibly be implemented elsewhere. Please consult the table below for information about which ones these are. + +Note that not all of these interfaces are our invention (but most), we just adopted them in systemd to make them more prominently implemented. For example, we adopted many Debian facilities in systemd to push it into the other distributions as well. 
+ + +--- + + +And now, here's the list of (hopefully) all APIs that we have introduced with systemd: + +| API | Type | Covered by Interface Stability Promise | Fully documented | Known External Consumers | Reimplementable Independently | Known Other Implementations | systemd Implementation portable to other OSes or non-systemd distributions | +| --- | ---- | ----------------------------------------------------------------------------------------- | ---------------- | ------------------------ | ----------------------------- | --------------------------- | -------------------------------------------------------------------------- | +| [hostnamed](https://www.freedesktop.org/software/systemd/man/org.freedesktop.hostname1.html) | D-Bus | yes | yes | GNOME | yes | [Ubuntu](https://launchpad.net/ubuntu/+source/ubuntu-system-service), [Gentoo](http://www.gentoo.org/proj/en/desktop/gnome/openrc-settingsd.xml), [BSD](http://uglyman.kremlin.cc/gitweb/gitweb.cgi?p=systembsd.git;a=summary) | partially | +| [localed](https://www.freedesktop.org/software/systemd/man/org.freedesktop.locale1.html) | D-Bus | yes | yes | GNOME | yes | [Ubuntu](https://launchpad.net/ubuntu/+source/ubuntu-system-service), [Gentoo](http://www.gentoo.org/proj/en/desktop/gnome/openrc-settingsd.xml), [BSD](http://uglyman.kremlin.cc/gitweb/gitweb.cgi?p=systembsd.git;a=summary) | partially | +| [timedated](https://www.freedesktop.org/software/systemd/man/org.freedesktop.timedate1.html) | D-Bus | yes | yes | GNOME | yes | [Gentoo](http://www.gentoo.org/proj/en/desktop/gnome/openrc-settingsd.xml), [BSD](http://uglyman.kremlin.cc/gitweb/gitweb.cgi?p=systembsd.git;a=summary) | partially | +| [initrd interface](INITRD_INTERFACE) | Environment, flag files | yes | yes | mkosi, dracut, ArchLinux | yes | ArchLinux | no | +| [Container interface](CONTAINER_INTERFACE) | Environment, Mounts | yes | yes | libvirt/LXC | yes | - | no | +| [Boot Loader interface](BOOT_LOADER_INTERFACE) | EFI variables | yes | yes | 
gummiboot | yes | - | no | +| [Service bus API](https://www.freedesktop.org/software/systemd/man/org.freedesktop.systemd1.html) | D-Bus | yes | yes | system-config-services | no | - | no | +| [logind](https://www.freedesktop.org/software/systemd/man/org.freedesktop.login1.html) | D-Bus | yes | yes | GNOME | no | - | no | +| [sd-bus.h API](https://www.freedesktop.org/software/systemd/man/sd-bus.html) | C Library | yes | yes | - | maybe | - | maybe | +| [sd-daemon.h API](https://www.freedesktop.org/software/systemd/man/sd-daemon.html) | C Library or Drop-in | yes | yes | numerous | yes | - | yes | +| [sd-device.h API](https://www.freedesktop.org/software/systemd/man/sd-device.html) | C Library | yes | no | numerous | yes | - | yes | +| [sd-event.h API](https://www.freedesktop.org/software/systemd/man/sd-event.html) | C Library | yes | yes | - | maybe | - | maybe | +| [sd-gpt.h API](https://www.freedesktop.org/software/systemd/man/sd-gpt.html) | Header Library | yes | no | - | yes | - | yes | +| [sd-hwdb.h API](https://www.freedesktop.org/software/systemd/man/sd-hwdb.html) | C Library | yes | yes | - | maybe | - | yes | +| [sd-id128.h API](https://www.freedesktop.org/software/systemd/man/sd-id128.html) | C Library | yes | yes | - | yes | - | yes | +| [sd-journal.h API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) | C Library | yes | yes | - | maybe | - | no | +| [sd-login.h API](https://www.freedesktop.org/software/systemd/man/sd-login.html) | C Library | yes | yes | GNOME, polkit, ... 
| no | - | no | +| [sd-messages.h API](https://www.freedesktop.org/software/systemd/man/sd-messages.html) | Header Library | yes | yes | - | yes | python-systemd | yes | +| [sd-path.h API](https://www.freedesktop.org/software/systemd/man/sd-path.html) | C Library | yes | no | - | maybe | - | maybe | +| [$XDG_RUNTIME_DIR](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html) | Environment | yes | yes | glib, GNOME | yes | - | no | +| [$LISTEN_FDS $LISTEN_PID FD Passing](https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html) | Environment | yes | yes | numerous (via sd-daemon.h) | yes | - | no | +| [$NOTIFY_SOCKET Daemon Notifications](https://www.freedesktop.org/software/systemd/man/sd_notify.html) | Environment | yes | yes | a few, including udev | yes | - | no | +| [argv[0][0]='@' Logic](ROOT_STORAGE_DAEMONS) | `/proc` marking | yes | yes | mdadm | yes | - | no | +| [Unit file format](https://www.freedesktop.org/software/systemd/man/systemd.unit.html) | File format | yes | yes | numerous | no | - | no | +| [Network](https://www.freedesktop.org/software/systemd/man/systemd.network.html) & [Netdev file format](https://www.freedesktop.org/software/systemd/man/systemd.netdev.html) | File format | yes | yes | no | no | - | no | +| [Link file format](https://www.freedesktop.org/software/systemd/man/systemd.link.html) | File format | yes | yes | no | no | - | no | +| [Journal File Format](JOURNAL_FILE_FORMAT) | File format | yes | yes | - | maybe | - | no | +| [Journal Export Format](JOURNAL_EXPORT_FORMATS.md#journal-export-format) | File format | yes | yes | - | yes | - | no | +| [Journal JSON Format](JOURNAL_EXPORT_FORMATS.md#journal-json-format) | File format | yes | yes | - | yes | - | no | +| [Cooperation in cgroup tree](https://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups) | Treaty | yes | yes | libvirt | yes | libvirt | no | +| [Password Agents](PASSWORD_AGENTS) | Socket+Files | yes | yes | - | yes | - | no | +| 
[udev multi-seat properties](https://www.freedesktop.org/software/systemd/man/sd-login.html) | udev Property | yes | yes | X11, gdm | no | - | no | +| udev session switch ACL properties | udev Property | no | no | - | no | - | no | +| [CLI of systemctl,...](https://www.freedesktop.org/software/systemd/man/systemctl.html) | CLI | yes | yes | numerous | no | - | no | +| [tmpfiles.d](https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html) | File format | yes | yes | numerous | yes | ArchLinux | partially | +| [sysusers.d](https://www.freedesktop.org/software/systemd/man/sysusers.d.html) | File format | yes | yes | unknown | yes | | partially | +| [/etc/machine-id](https://www.freedesktop.org/software/systemd/man/machine-id.html) | File format | yes | yes | D-Bus | yes | - | no | +| [binfmt.d](https://www.freedesktop.org/software/systemd/man/binfmt.d.html) | File format | yes | yes | numerous | yes | - | partially | +| [/etc/hostname](https://www.freedesktop.org/software/systemd/man/hostname.html) | File format | yes | yes | numerous (it's a Debian thing) | yes | Debian, ArchLinux | no | +| [/etc/locale.conf](https://www.freedesktop.org/software/systemd/man/locale.conf.html) | File format | yes | yes | - | yes | ArchLinux | partially | +| [/etc/machine-info](https://www.freedesktop.org/software/systemd/man/machine-info.html) | File format | yes | yes | - | yes | - | partially | +| [modules-load.d](https://www.freedesktop.org/software/systemd/man/modules-load.d.html) | File format | yes | yes | numerous | yes | - | partially | +| [/usr/lib/os-release](https://www.freedesktop.org/software/systemd/man/os-release.html) | File format | yes | yes | some | yes | Fedora, OpenSUSE, ArchLinux, Angstrom, Frugalware, others... 
| no | +| [sysctl.d](https://www.freedesktop.org/software/systemd/man/sysctl.d.html) | File format | yes | yes | some (it's a Debian thing) | yes | procps/Debian, ArchLinux | partially | +| [/etc/timezone](https://www.freedesktop.org/software/systemd/man/timezone.html) | File format | yes | yes | numerous (it's a Debian thing) | yes | Debian | partially | +| [/etc/vconsole.conf](https://www.freedesktop.org/software/systemd/man/vconsole.conf.html) | File format | yes | yes | - | yes | ArchLinux | partially | +| `/run` | File hierarchy change | yes | yes | numerous | yes | OpenSUSE, Debian, ArchLinux | no | +| [Generators](https://www.freedesktop.org/software/systemd/man/systemd.generator.html) | Subprocess | yes | yes | - | no | - | no | +| [System Updates](https://www.freedesktop.org/software/systemd/man/systemd.offline-updates.html) | System Mode | yes | yes | - | no | - | no | +| [Presets](https://www.freedesktop.org/software/systemd/man/systemd.preset.html) | File format | yes | yes | - | no | - | no | +| Udev rules | File format | yes | yes | numerous | no | no | partially | + + +### Explanations + +Items for which "systemd implementation portable to other OSes" is "partially" means that it is possible to run the respective tools that are included in the systemd tarball outside of systemd. Note however that this is not officially supported, so you are more or less on your own if you do this. If you are opting for this solution simply build systemd as you normally would but drop all files except those which you are interested in. + +Of course, it is our intention to eventually document all interfaces we defined. If we haven't documented them for now, this is usually because we want the flexibility to still change things, or don't want 3rd party applications to make use of these interfaces already. That said, our sources are quite readable and open source, so feel free to spelunk around in the sources if you want to know more. 
+ +If you decide to reimplement one of the APIs for which "Reimplementable independently" is "no", then we won't stop you, but you are on your own. + +This is not an attempt to comprehensively list all users of these APIs. We are just listing the most obvious/prominent ones which come to our mind. + +Of course, one last thing I can't make myself not ask you before we finish here, and before you start reimplementing these APIs in your distribution: are you sure it's time well spent if you work on reimplementing all this code instead of just spending it on adopting systemd on your distro as well? + +## Independent Operation of systemd Programs + +Some programs in the systemd suite are intended to operate independently of the +running init process (or even without an init process, for example when +creating system installation chroots). They can be safely called on systems with +a different init process or for example in package installation scriptlets. + +The following programs currently and in the future will support operation +without communicating with the `systemd` process: +`systemd-escape`, +`systemd-id128`, +`systemd-path`, +`systemd-tmpfiles`, +`systemd-sysctl`, +`systemd-sysusers`. + +Many other programs support operation without the system manager except when +the specific functionality requires such communication. For example, +`journalctl` operates almost independently, but will query the boot id when +`--boot` option is used; it also requires `systemd-journald` (and thus +`systemd`) to be running for options like `--flush` and `--sync`. +`systemd-journal-remote`, `systemd-journal-upload`, `systemd-journal-gatewayd`, +`coredumpctl`, `busctl`, `systemctl --root` also fall into this category of +mostly-independent programs. 
diff --git a/docs/PORTABLE_SERVICES.md b/docs/PORTABLE_SERVICES.md new file mode 100644 index 0000000..6f5ff11 --- /dev/null +++ b/docs/PORTABLE_SERVICES.md @@ -0,0 +1,384 @@ +--- +title: Portable Services Introduction +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Portable Services + +systemd (since version 239) supports a concept of "Portable Services". +"Portable Services" are a delivery method for system services that uses +two specific features of container management: + +1. Applications are bundled. I.e. multiple services, their binaries and all + their dependencies are packaged in an image, and are run directly from it. + +2. Stricter default security policies, i.e. sand-boxing of applications. + +The primary tool for interacting with Portable Services is `portablectl`, +and they are managed by the `systemd-portabled` service. + +Portable services don't bring anything inherently new to the table. All they do +is put together known concepts to cover a specific set of use-cases in a +slightly nicer way. + +## So, what *is* a "Portable Service"? + +A portable service is ultimately just an OS tree, either inside of a directory, +or inside a raw disk image containing a Linux file system. This tree is called +the "image". It can be "attached" or "detached" from the system. When +"attached", specific systemd units from the image are made available on the +host system, then behaving pretty much exactly like locally installed system +services. When "detached", these units are removed again from the host, leaving +no artifacts around (except maybe messages they might have logged). + +The OS tree/image can be created with any tool of your choice. For example, you +can use `dnf --installroot=` if you like, or `debootstrap`, the image format is +entirely generic, and doesn't have to carry any specific metadata beyond what +distribution images carry anyway. 
Or to say this differently: the image format +doesn't define any new metadata as unit files and OS tree directories or disk +images are already sufficient, and pretty universally available these days. One +particularly nice tool for creating suitable images is +[mkosi](https://github.com/systemd/mkosi), but many other existing tools will +do too. + +Portable services may also be constructed from layers, similarly to container +environments. See [Extension Images](#extension-images) below. + +If you so will, "Portable Services" are a nicer way to manage chroot() +environments, with better security, tooling and behavior. + +## Where's the difference to a "Container"? + +"Container" is a very vague term, after all it is used for +systemd-nspawn/LXC-type OS containers, for Docker/rkt-like micro service +containers, and even certain 'lightweight' VM runtimes. + +"Portable services" do not provide a fully isolated environment to the payload, +like containers mostly intend to. Instead, they are more like regular system +services, can be controlled with the same tools, are exposed the same way in +all infrastructure, and so on. The main difference is that they use a different +root directory than the rest of the system. Hence, the intent is not to run +code in a different, isolated environment from the host — like most containers +would — but to run it in the same environment, but with stricter access +controls on what the service can see and do. + +One point of differentiation: since programs running as "portable services" are +pretty much regular system services, they won't run as PID 1 (like they would +under Docker), but as normal processes. A corollary of that is that they aren't +supposed to manage anything in their own environment (such as the network) as +the execution environment is mostly shared with the rest of the system. 
+ +The primary focus use-case of "portable services" is to extend the host system +with encapsulated extensions, but provide almost full integration with the rest +of the system, though possibly restricted by security knobs. This focus +includes system extensions otherwise sometimes called "super-privileged +containers". + +Note that portable services are only available for system services, not for +user services (i.e. the functionality cannot be used for the stuff +bubblewrap/flatpak is focusing on). + +## Mode of Operation + +If you have a portable service image, maybe in a raw disk image called +`foobar_0.7.23.raw`, then attaching the services to the host is as easy as: + +``` +# portablectl attach foobar_0.7.23.raw +``` + +This command does the following: + +1. It dissects the image, checks and validates the `os-release` file of the + image, and looks for all included unit files. + +2. It copies out all unit files with a suffix of `.service`, `.socket`, + `.target`, `.timer` and `.path`, whose name begins with the image's name + (with `.raw` removed), truncated at the first underscore if there is one. + This prefix name generated from the image name must be followed by a ".", + "-" or "@" character in the unit name. Or in other words, given the image + name of `foobar_0.7.23.raw` all unit files matching + `foobar-*.{service|socket|target|timer|path}`, + `foobar@.{service|socket|target|timer|path}` as well as + `foobar.*.{service|socket|target|timer|path}` and + `foobar.{service|socket|target|timer|path}` are copied out. These unit files + are placed in `/etc/systemd/system.attached/` (which is part of the normal + unit file search path of PID 1, and thus loaded exactly like regular unit + files). Within the images the unit files are looked for at the usual + locations, i.e. in `/usr/lib/systemd/system/` and `/etc/systemd/system/` and + so on, relative to the image's root. + +3. For each such unit file a drop-in file is created. 
Let's say + `foobar-waldo.service` was one of the unit files copied to + `/etc/systemd/system.attached/`, then a drop-in file + `/etc/systemd/system.attached/foobar-waldo.service.d/20-portable.conf` is + created, containing a few lines of additional configuration: + + ``` + [Service] + RootImage=/path/to/foobar.raw + Environment=PORTABLE=foobar + LogExtraFields=PORTABLE=foobar + ``` + +4. For each such unit a "profile" drop-in is linked in. This "profile" drop-in + generally contains security options that lock down the service. By default + the `default` profile is used, which provides a medium level of security. + There's also `trusted`, which runs the service with no restrictions, i.e. in + the host file system root and with full privileges. The `strict` profile + comes with the toughest security restrictions. Finally, `nonetwork` is like + `default` but without network access. Users may define their own profiles + too (or modify the existing ones). + +And that's already it. + +Note that the images need to stay around (and in the same location) as long as the +portable service is attached. If an image is moved, the `RootImage=` line +written to the unit drop-in would point to a non-existent path, and break +access to the image. + +The `portablectl detach` command executes the reverse operation: it looks for +the drop-ins and the unit files associated with the image, and removes them. + +Note that `portablectl attach` won't enable or start any of the units it copies +out by default, but the `--enable` and `--now` parameters are available as shortcuts. +The same is true for the opposite `detach` operation. + +The `portablectl reattach` command combines a `detach` with an `attach`. It is +useful in case an image gets upgraded, as it allows performing a `restart` +operation on the units instead of `stop` plus `start`, thus providing lower +downtime and avoiding losing runtime state associated with the unit such as the +file descriptor store. 
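
The unit name matching rule from step 2 above can be illustrated with a small Python sketch. It only mirrors the rule as described in this text (strip `.raw`, truncate at the first underscore, require ".", "-" or "@" after the prefix, accept the five listed suffixes); the function names `image_prefix` and `unit_matches` are hypothetical and this is not the `portablectl` implementation.

```python
# Unit suffixes that are copied out of an image, as listed in step 2.
UNIT_SUFFIXES = (".service", ".socket", ".target", ".timer", ".path")


def image_prefix(image_name):
    """Derive the unit name prefix from an image file or directory name."""
    name = image_name.rstrip("/")
    if name.endswith(".raw"):
        name = name[: -len(".raw")]
    # Truncate at the first underscore, if there is one.
    return name.split("_", 1)[0]


def unit_matches(unit_name, prefix):
    """Check whether a unit file name would be picked up for the given prefix."""
    if not unit_name.endswith(UNIT_SUFFIXES):
        return False
    if not unit_name.startswith(prefix):
        return False
    rest = unit_name[len(prefix):]
    # The prefix must be followed by ".", "-" or "@".
    return rest[:1] in {".", "-", "@"}
```

With the image name `foobar_0.7.23.raw` this yields the prefix `foobar`, so `foobar.service`, `foobar-waldo.service` and `foobar@.service` match, while `foobarbaz.service` does not.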
+ +## Requirements on Images + +Note that portable services don't introduce any new image format, but most OS +images should just work the way they are. Specifically, the following +requirements are made for an image that can be attached/detached with +`portablectl`. + +1. It must contain an executable that shall be invoked, along with all its + dependencies. Any binary code needs to be compiled for an architecture + compatible with the host. + +2. The image must either be a plain sub-directory (or btrfs subvolume) + containing the binaries and its dependencies in a classic Linux OS tree, or + must be a raw disk image either containing only one, naked file system, or + an image with a partition table understood by the Linux kernel with only a + single partition defined, or alternatively, a GPT partition table with a set + of properly marked partitions following the + [Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification). + +3. The image must at least contain one matching unit file, with the right name + prefix and suffix (see above). The unit file is searched in the usual paths, + i.e. primarily /etc/systemd/system/ and /usr/lib/systemd/system/ within the + image. (The implementation will check a couple of other paths too, but it's + recommended to use these two paths.) + +4. The image must contain an os-release file, either in `/etc/os-release` or + `/usr/lib/os-release`. The file should follow the standard format. + +5. The image must contain the files `/etc/resolv.conf` and `/etc/machine-id` + (empty files are ok), they will be bind mounted from the host at runtime. + +6. The image must contain directories `/proc/`, `/sys/`, `/dev/`, `/run/`, + `/tmp/`, `/var/tmp/` that can be mounted over with the corresponding version + from the host. + +7. The OS might require other files or directories to be in place. 
For example, + if the image is built based on glibc, the dynamic loader needs to be + available in `/lib/ld-linux.so.2` or `/lib64/ld-linux-x86-64.so.2` (or + similar, depending on architecture), and if the distribution implements a + merged `/usr/` tree, this means `/lib` and/or `/lib64` need to be symlinks + to their respective counterparts below `/usr/`. For details see your + distribution's documentation. + +Note that images created by tools such as `debootstrap`, `dnf --installroot=` +or `mkosi` generally satisfy all of the above. If you wonder what the most +minimal image would be that complies with the requirements above, it could +consist of this: + +``` +/usr/bin/minimald # a statically compiled binary +/usr/lib/systemd/system/minimal-test.service # the unit file for the service, with ExecStart=/usr/bin/minimald +/usr/lib/os-release # an os-release file explaining what this is +/etc/resolv.conf # empty file to mount over with host's version +/etc/machine-id # ditto +/proc/ # empty directory to use as mount point for host's API fs +/sys/ # ditto +/dev/ # ditto +/run/ # ditto +/tmp/ # ditto +/var/tmp/ # ditto +``` + +And that's it. + +Note that qualifying images do not have to contain an init system of their +own. If they do, it's fine, it will be ignored by the portable service logic, +but they generally don't have to, and it might make sense to avoid any, to keep +images minimal. + +If the image is writable, and some of the files or directories that are +overmounted from the host do not exist yet, they will be automatically created. +On read-only, immutable images (e.g. `erofs` or `squashfs` images) all files +and directories to over-mount must exist already. + +Note that as no new image format or metadata is defined, it's very +straightforward to define images that can be made use of in a number of +different ways. For example, by using `mkosi -b` you can trivially build a +single, unified image that: + +1. 
Can be attached as portable service, to run any container services natively + on the host. + +2. Can be run as OS container, using `systemd-nspawn`, by booting the image + with `systemd-nspawn -i -b`. + +3. Can be booted directly as VM image, using a generic VM executor such as + `virtualbox`/`qemu`/`kvm` + +4. Can be booted directly on bare-metal systems. + +Of course, to facilitate 2, 3 and 4 you need to include an init system in the +image. To facilitate 3 and 4 you also need to include a boot loader in the +image. As mentioned, `mkosi -b` takes care of all of that for you, but any +other image generator should work too. + +The +[os-release(5)](https://www.freedesktop.org/software/systemd/man/os-release.html) +file may optionally be extended with a `PORTABLE_PREFIXES=` field listing all +supported portable service prefixes for the image (see above). This is useful +for informational purposes (as it allows recognizing portable service images +from their contents as such), but is also useful to protect the image from +being used under a wrong name and prefix. This is particularly relevant if the +images are cryptographically authenticated (via Verity or a similar mechanism) +as this way the (not necessarily authenticated) image file name can be +validated against the (authenticated) image contents. If the field is not +specified the image will work fine, but is not necessarily recognizable as +portable service image, and any set of units included in the image may be +attached, there are no restrictions enforced. + +## Extension Images + +Portable services can be delivered as one or multiple images that extend the base +image, and are combined with OverlayFS at runtime, when they are attached. This +enables a workflow that splits the base 'runtime' from the daemon, so that multiple +portable services can share the same 'runtime' image (libraries, tools) without +having to include everything each time, with the layering happening only at runtime. 
+The `--extension` parameter of `portablectl` can be used to specify as many upper +layers as desired. On top of the requirements listed in the previous section, the +following must also be observed: + +1. The base/OS image must contain an `os-release` file, either in `/etc/os-release` + or `/usr/lib/os-release`, in the standard format. + +2. The upper extension images must contain an extension-release file in + `/usr/lib/extension-release.d/`, with an `ID=` and `SYSEXT_LEVEL=`/`VERSION_ID=` + matching the base image for sysexts, or `/etc/extension-release.d/`, with an + `ID=` and `CONFEXT_LEVEL=`/`VERSION_ID=` matching the base image for confexts. + +3. The base/OS image does not need to have any unit files. + +4. The upper sysext images must contain at least one matching unit file each, + with the right name prefix and suffix (see above). Confext images do not have + to contain units. + +5. As with the base/OS image, each upper extension image must be a plain + sub-directory, btrfs subvolume, or a raw disk image. + +``` +# portablectl attach --extension foobar_0.7.23.raw debian-runtime_11.1.raw foobar +# portablectl attach --extension barbaz_7.0.23/ debian-runtime_11.1.raw barbaz +``` + +## Execution Environment + +Note that the code in portable service images is run exactly like regular +services. Hence there's no new execution environment to consider. And, unlike +Docker would do it, as these are regular system services they aren't run as PID +1 either, but with regular PID values. + +## Access to host resources + +If services shipped with this mechanism shall be able to access host resources +(such as files or AF_UNIX sockets for IPC), use the normal `BindPaths=` and +`BindReadOnlyPaths=` settings in unit files to mount them in. In fact, the +`default` profile mentioned above makes use of this to ensure +`/etc/resolv.conf`, the D-Bus system bus socket or write access to the logging +subsystem are available to the service. 
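+
+As a rough sketch of what such settings could look like, assuming a hypothetical unit `foobar.service` with made-up binary and host paths (none of these names come from a shipped profile):
+
+```
+# foobar.service (hypothetical unit shipped inside a portable image)
+[Service]
+# Hypothetical daemon binary inside the image
+ExecStart=/usr/bin/foobard
+# Host directory mounted writable into the service (hypothetical path)
+BindPaths=/var/lib/foobar-shared
+# Host file and socket directory mounted read-only (hypothetical paths)
+BindReadOnlyPaths=/etc/foobar.conf /run/foobar-ipc
+```
+
+Both settings take a space-separated list of paths, so one line can expose several host resources at once.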
+ +## Instantiation + +Sometimes it makes sense to instantiate the same set of services multiple +times. The portable service concept does not introduce new logic for this. It +is recommended to use the regular systemd unit templating for this, i.e. to +include template units such as `foobar@.service`, so that instantiation is as +simple as: + +``` +# portablectl attach foobar_0.7.23.raw +# systemctl enable --now foobar@instancea.service +# systemctl enable --now foobar@instanceb.service +… +``` + +The benefit of this approach is that templating works exactly the same for +units shipped with the OS itself as for attached portable services. + +## Immutable images with local data + +It's a good idea to keep portable service images read-only during normal +operation. In fact, all but the `trusted` profile will default to this kind of +behaviour, by setting the `ProtectSystem=strict` option. In this case writable +service data may be placed on the host file system. Use `StateDirectory=` in +the unit files to enable such behaviour and add a local data directory to the +services copied onto the host. + +## Logging + +Several fields are automatically added to log messages generated by a portable +service (or about a portable service, e.g.: start/stop logs from systemd). +The `PORTABLE=` field will refer to the name of the portable image where the unit +was loaded from. In case extensions are used, additionally there will be a +`PORTABLE_ROOT=` field, referring to the name of the image used as the base layer +(i.e.: `RootImage=` or `RootDirectory=`), and one `PORTABLE_EXTENSION=` field for +each extension image used. + +The `os-release` file from the portable image will be parsed and added as structured +metadata to the journal log entries. 
The parsed fields will be the first ID field which +is set from the set of `IMAGE_ID` and `ID` in this order of preference, and the first +version field which is set from a set of `IMAGE_VERSION`, `VERSION_ID`, and `BUILD_ID` +in this order of preference. The ID and version, if any, are concatenated with an +underscore (`_`) as separator. If only one of the two is found, it will be used by itself. +The field will be named `PORTABLE_NAME_AND_VERSION=`. + +In case extensions are used, the same fields in the same order, but prefixed by +`SYSEXT_`/`CONFEXT_`, are parsed from each `extension-release` file, and are appended +to the journal as log entries, using `PORTABLE_EXTENSION_NAME_AND_VERSION=` as the +field name. The base layer's field will be named `PORTABLE_ROOT_NAME_AND_VERSION=` +instead of `PORTABLE_NAME_AND_VERSION=` in this case. + +For example, a portable service `app0` using two extensions `app0.raw` and +`app1.raw` (with `SYSEXT_ID=app`, and `SYSEXT_VERSION_ID=` `0` and `1` in their +respective extension-releases), and a base layer `base.raw` (with `VERSION_ID=10` and +`ID=debian` in `os-release`), will create log entries with the following fields: + +``` +PORTABLE=app0.raw +PORTABLE_ROOT=base.raw +PORTABLE_ROOT_NAME_AND_VERSION=debian_10 +PORTABLE_EXTENSION=app0.raw +PORTABLE_EXTENSION_NAME_AND_VERSION=app_0 +PORTABLE_EXTENSION=app1.raw +PORTABLE_EXTENSION_NAME_AND_VERSION=app_1 +``` + +## Links + +[`portablectl(1)`](https://www.freedesktop.org/software/systemd/man/portablectl.html)<br> +[`systemd-portabled.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-portabled.service.html)<br> +[Walkthrough for Portable Services](https://0pointer.net/blog/walkthrough-for-portable-services.html)<br> +[Repo with examples](https://github.com/systemd/portable-walkthrough) diff --git a/docs/PORTING_TO_NEW_ARCHITECTURES.md 
b/docs/PORTING_TO_NEW_ARCHITECTURES.md @@ -0,0 +1,58 @@ +--- +title: Porting to New Architectures +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Porting systemd to New Architectures + +Here's a brief checklist of things to implement when porting systemd to a new +architecture. + +1. Patch + [src/basic/architecture.h](https://github.com/systemd/systemd/blob/main/src/basic/architecture.h) + and + [src/basic/architecture.c](https://github.com/systemd/systemd/blob/main/src/basic/architecture.c) + to make your architecture known to systemd. Besides an `ARCHITECTURE_XYZ` + enumeration entry you need to provide an implementation of + `native_architecture()` and `uname_architecture()`. + +2. Patch + [src/shared/gpt.h](https://github.com/systemd/systemd/blob/main/src/shared/gpt.h) + and + [src/shared/gpt.c](https://github.com/systemd/systemd/blob/main/src/shared/gpt.c) + and define a new set of GPT partition type UUIDs for the root file system, + `/usr/` file system, and the matching Verity and Verity signature + partitions. Use `systemd-id128 new -p` to generate new suitable UUIDs you + can use for this. Make sure to register your new types in the various + functions in `gpt.c`. Also make sure to update the tables in + [Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification) + and `man/systemd-gpt-auto-generator.xml` accordingly. + +3. If your architecture supports UEFI, make sure to update the `efi_arch` + variable logic in `meson.build` to be set to the right architecture string + as defined by the UEFI specification. (This ensures that `systemd-boot` will + be built as the appropriately named `BOOT<arch>.EFI` binary.) Also, if your + architecture uses a special boot protocol for the Linux kernel, make sure to + implement it in `src/boot/efi/linux*.c`, so that the `systemd-stub` EFI stub + can work. + +4. 
Make sure to register the right system call numbers for your architecture in + `src/basic/missing_syscall_def.h`. systemd uses various system calls the + Linux kernel provides that are currently not wrapped by glibc (or are only + in very new glibc), and we need to know the right numbers for them. It might + also be necessary to tweak `src/basic/raw-clone.h`. + +5. Make sure the code in `src/shared/seccomp-util.c` properly understands the + local architecture and its system call quirks. + +6. If your architecture uses a `/lib64/` library directory, then make sure that + the `BaseFilesystem` table in `src/shared/base-filesystem.c` has an entry + for it so that it can be set up automatically if missing. This is useful to + support booting into OS trees that have an empty root directory with only + `/usr/` mounted in. + +7. If your architecture supports VM virtualization and provides CPU opcodes + similar to x86' CPUID, consider adding native support for detecting VMs this + way to `src/basic/virt.c`. diff --git a/docs/PREDICTABLE_INTERFACE_NAMES.md b/docs/PREDICTABLE_INTERFACE_NAMES.md new file mode 100644 index 0000000..9d79f8f --- /dev/null +++ b/docs/PREDICTABLE_INTERFACE_NAMES.md @@ -0,0 +1,70 @@ +--- +title: Predictable Network Interface Names +category: Networking +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Predictable Network Interface Names + +Starting with v197 systemd/udev will automatically assign predictable, stable network interface names for all local Ethernet, WLAN and WWAN interfaces. This is a departure from the traditional interface naming scheme (`eth0`, `eth1`, `wlan0`, ...), but should fix real problems. + +## Why? + +The classic naming scheme for network interfaces applied by the kernel is to simply assign names beginning with `eth0`, `eth1`, ... to all interfaces as they are probed by the drivers. 
As the driver probing is generally not predictable for modern technology this means that as soon as multiple network interfaces are available the assignment of the names `eth0`, `eth1` and so on is generally not fixed anymore and it might very well happen that `eth0` on one boot ends up being `eth1` on the next. This can have serious security implications, for example in firewall rules which are coded for certain naming schemes, and which are hence very sensitive to unpredictable changing names. + +To fix this problem multiple solutions have been proposed and implemented. For a longer time udev shipped support for assigning permanent `ethX` names to certain interfaces based on their MAC addresses. This turned out to have a multitude of problems, among them: this required a writable root directory which is generally not available; the statelessness of the system is lost as booting an OS image on a system will result in changed configuration of the image; on many systems MAC addresses are not actually fixed, such as on a lot of embedded hardware and particularly on all kinds of virtualization solutions. The biggest of all however is that the userspace components trying to assign the interface name raced against the kernel assigning new names from the same `ethX` namespace, a race condition with all kinds of weird effects, among them that assignment of names sometimes failed. As a result support for this has been removed from systemd/udev a while back. + +Another solution that has been implemented is `biosdevname` which tries to find fixed slot topology information in certain firmware interfaces and uses them to assign fixed names to interfaces which incorporate their physical location on the mainboard. In a way this naming scheme is similar to what is already done natively in udev for various device nodes via `/dev/*/by-path/` symlinks. 
In many cases, biosdevname departs from the low-level kernel device identification schemes that udev generally uses for these symlinks, and instead invents its own enumeration schemes. + +Finally, many distributions support renaming interfaces to user-chosen names (think: `internet0`, `dmz0`, ...) keyed off their MAC addresses or physical locations as part of their networking scripts. This is a very good choice but does have the problem that it implies that the user is willing and capable of choosing and assigning these names. + +We believe it is a good default choice to generalize the scheme pioneered by `biosdevname`. Assigning fixed names based on firmware/topology/location information has the big advantage that the names are fully automatic, fully predictable, that they stay fixed even if hardware is added or removed (i.e. no reenumeration takes place) and that broken hardware can be replaced seamlessly. That said, they admittedly are sometimes harder to read than the `eth0` or `wlan0` everybody is used to. Example: `enp5s0` + + +## What precisely has changed in v197? + +With systemd 197 we have added native support for a number of different naming policies into systemd/udevd proper and made a scheme similar to biosdevname's (but generally more powerful, and closer to kernel-internal device identification schemes) the default. The following different naming schemes for network interfaces are now supported by udev natively: + +1. Names incorporating Firmware/BIOS provided index numbers for on-board devices (example: `eno1`) +1. Names incorporating Firmware/BIOS provided PCI Express hotplug slot index numbers (example: `ens1`) +1. Names incorporating physical/geographical location of the connector of the hardware (example: `enp2s0`) +1. Names incorporating the interface's MAC address (example: `enx78e7d1ea46da`) +1. 
Classic, unpredictable kernel-native ethX naming (example: `eth0`) + +By default, systemd v197 will now name interfaces following policy 1) if that information from the firmware is applicable and available, falling back to 2) if that information from the firmware is applicable and available, falling back to 3) if applicable, falling back to 5) in all other cases. Policy 4) is not used by default, but is available if the user chooses so. + +This combined policy is only applied as last resort. That means, if the system has biosdevname installed, it will take precedence. If the user has added udev rules which change the name of the kernel devices these will take precedence too. Also, any distribution specific naming schemes generally take precedence. + + +## Come again, what good does this do? + +With this new scheme you now get: + +* Stable interface names across reboots +* Stable interface names even when hardware is added or removed, i.e. no re-enumeration takes place (to the level the firmware permits this) +* Stable interface names when kernels or drivers are updated/changed +* Stable interface names even if you have to replace broken ethernet cards by new ones +* The names are automatically determined without user configuration, they just work +* The interface names are fully predictable, i.e. just by looking at lspci you can figure out what the interface is going to be called +* Fully stateless operation, changing the hardware configuration will not result in changes in `/etc` +* Compatibility with read-only root +* The network interface naming now follows more closely the scheme used for aliasing block device nodes and other device nodes in `/dev` via symlinks +* Applicability to both x86 and non-x86 machines +* The same on all distributions that adopted systemd/udev +* It's easy to opt out of the scheme (see below) + +Does this have any drawbacks? Yes, it does. 
Previously it was practically guaranteed that hosts equipped with a single ethernet card only had a single `eth0` interface. With this new scheme in place, an administrator now has to check first what the local interface name is before they can invoke commands on it, where previously they had a good chance that `eth0` was the right name. + + +## I don't like this, how do I disable this? + +You basically have three options: + +1. You disable the assignment of fixed names, so that the unpredictable kernel names are used again. For this, simply mask udev's .link file for the default policy: `ln -s /dev/null /etc/systemd/network/99-default.link` +1. You create your own manual naming scheme, for example by naming your interfaces `internet0`, `dmz0` or `lan0`. For that create your own `.link` files in `/etc/systemd/network/`, that choose an explicit name or a better naming scheme for one, some, or all of your interfaces. See [systemd.link(5)](https://www.freedesktop.org/software/systemd/man/systemd.link.html) for more information. +1. You pass `net.ifnames=0` on the kernel command line + +## What does the new naming scheme look like, precisely? + +That's documented in detail in the [systemd.net-naming-scheme(7)](https://www.freedesktop.org/software/systemd/man/systemd.net-naming-scheme.html) man page. Please refer to this in case you are wondering how to decode the new interface names. diff --git a/docs/PRESET.md b/docs/PRESET.md new file mode 100644 index 0000000..a2ae323 --- /dev/null +++ b/docs/PRESET.md @@ -0,0 +1,44 @@ +--- +title: Presets +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Presets + +## Why? + +Different **distributions** have different policies on which services shall be enabled by default when the package they are shipped in is installed. On Fedora all services stay off by default, so that installing a package will not cause a service to be enabled (with some exceptions). 
On Debian all services are immediately enabled by default, so that installing a package will cause its service(s) to be enabled right away. + +Different **spins** (flavours, remixes, whatever you might want to call them) of a distribution also have different policies on what services to enable, and what services to leave off. For example, the Fedora default will enable gdm as display manager by default, while the Fedora KDE spin will enable kdm instead. + +Different **sites** might also have different policies on what to turn on by default and what to turn off. For example, one administrator would prefer to enforce the policy of "ssh should be always on, but everything else off", while another one might say "snmp always on, and for everything else use the distribution policy defaults". + +## The Logic + +Traditionally, the policy about which services shall be enabled and which shall not has been decided globally by the distributions, and was enforced in each package individually. This made it cumbersome to implement different policies per spin or per site, or to create software packages that do the right thing on more than one distribution. The enablement _mechanism_ was also encoding the enablement _policy_. + +systemd 32 and newer support package "preset" policies. These encode which units shall be enabled by default when they are installed, and which units shall not be enabled. + +Preset files may be written for specific distributions, for specific spins or for specific sites, in order to enforce different policies as needed. Preset policies are stored in .preset files in /usr/lib/systemd/system-preset/. If no policy exists the default implied policy of "enable everything" is enforced, i.e. in Debian style. + +The policy encoded in preset files is applied to a unit by invoking "systemctl preset UNIT". It is recommended to use this command in all package post installation scriptlets. "systemctl preset UNIT" is identical to "systemctl enable UNIT" or "systemctl disable UNIT", depending on the policy. + +Preset files allow clean separation of enablement mechanism (inside the package scriptlets, by invoking "systemctl preset"), and enablement policy (centralized in the preset files). + +## Documentation + +Documentation for the preset policy file format is available here: [http://www.freedesktop.org/software/systemd/man/systemd.preset.html](http://www.freedesktop.org/software/systemd/man/systemd.preset.html) + +Documentation for "systemctl preset" is available here: [http://www.freedesktop.org/software/systemd/man/systemctl.html](http://www.freedesktop.org/software/systemd/man/systemctl.html) + +Documentation for the recommended package scriptlets is available here: [http://www.freedesktop.org/software/systemd/man/daemon.html](http://www.freedesktop.org/software/systemd/man/daemon.html) + +## How To + +For the preset logic to be useful, distributions need to implement a couple of steps: + +- The default distribution policy needs to be encoded in a preset file /usr/lib/systemd/system-preset/99-default.preset or suchlike, unless the implied policy of "enable everything" is the right choice. For a Fedora-like policy of "enable nothing" it is sufficient to include the single line "disable" into that file. The default preset file should be installed as part of one of the core packages of the distribution. +- All packages need to be updated to use "systemctl preset" in the post install scriptlets. +- (Optionally) spins/remixes/flavours should define their own preset file, either overriding or extending the default distribution preset policy. 
Also see the fedora feature page: [https://fedoraproject.org/wiki/Features/PackagePresets](https://fedoraproject.org/wiki/Features/PackagePresets) diff --git a/docs/RANDOM_SEEDS.md b/docs/RANDOM_SEEDS.md new file mode 100644 index 0000000..b2712ca --- /dev/null +++ b/docs/RANDOM_SEEDS.md @@ -0,0 +1,408 @@ +--- +title: Random Seeds +category: Concepts +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Random Seeds + +systemd can help in a number of ways with providing reliable, high quality +random numbers from early boot on. + +## Linux Kernel Entropy Pool + +Today's computer systems require random number generators for numerous +cryptographic and other purposes. On Linux systems, the kernel's entropy pool +is typically used as high-quality source of random numbers. The kernel's +entropy pool combines various entropy inputs together, mixes them and provides +an API to userspace as well as to internal kernel subsystems to retrieve +it. This entropy pool needs to be initialized with a minimal level of entropy +before it can provide high quality, cryptographic random numbers to +applications. Until the entropy pool is fully initialized application requests +for high-quality random numbers cannot be fulfilled. + +The Linux kernel provides three relevant userspace APIs to request random data +from the kernel's entropy pool: + +* The [`getrandom()`](https://man7.org/linux/man-pages/man2/getrandom.2.html) + system call with its `flags` parameter set to 0. If invoked, the calling + program will synchronously block until the random pool is fully initialized + and the requested bytes can be provided. + +* The `getrandom()` system call with its `flags` parameter set to + `GRND_NONBLOCK`. If invoked, the request for random bytes will fail if the + pool is not initialized yet. 
+ +* Reading from the + [`/dev/urandom`](https://man7.org/linux/man-pages/man4/urandom.4.html) + pseudo-device will always return random bytes immediately, even if the pool + is not initialized. The provided random bytes will be of low quality in this + case however. Moreover, the kernel will log about all programs using this + interface in this state, and which thus potentially rely on an uninitialized + entropy pool. + +(Strictly speaking, there are more APIs, for example `/dev/random`, but these +should not be used by almost any application and hence aren't mentioned here.) + +Note that the time it takes to initialize the random pool may differ between +systems. If local hardware random number generators are available, +initialization is likely quick, but particularly in embedded and virtualized +environments available entropy is small and thus random pool initialization +might take a long time (up to tens of minutes!). + +Modern hardware tends to come with a number of hardware random number +generators (hwrng), that may be used to relatively quickly fill up the entropy +pool. Specifically: + +* All recent Intel and AMD CPUs provide the CPU opcode + [RDRAND](https://en.wikipedia.org/wiki/RdRand) to acquire random bytes. Linux + includes random bytes generated this way in its entropy pool, but didn't use + to credit entropy for it (i.e. data from this source wasn't considered good + enough to consider the entropy pool properly filled even though it was + used). This has changed recently however, and most big distributions have + turned on the `CONFIG_RANDOM_TRUST_CPU=y` kernel compile time option. This + means systems with CPUs supporting this opcode will be able to very quickly + reach the "pool filled" state. + +* The TPM security chip that is available on all modern desktop systems has a + hwrng. It is also fed into the entropy pool, but generally not credited + entropy. 
You may use `rng_core.default_quality=1000` on the kernel command + line to change that, but note that this is a global setting affecting all + hwrngs. (Yeah, that's weird.) + +* Many Intel and AMD chipsets have hwrng chips. Their Linux drivers usually + don't credit entropy. (But there's `rng_core.default_quality=1000`, see + above.) + +* Various embedded boards have hwrng chips. Some drivers automatically credit + entropy, others do not. Some WiFi chips appear to have hwrng sources too, and + they usually do not credit entropy for them. + +* `virtio-rng` is used in virtualized environments and retrieves random data + from the VM host. It credits full entropy. + +* The EFI firmware typically provides a RNG API. When transitioning from UEFI + to kernel mode Linux will query some random data through it, and feed it into + the pool, but not credit entropy to it. What kind of random source is behind + the EFI RNG API is often not entirely clear, but it hopefully is some kind of + hardware source. + +If none of these are available (in fact, even if they are), Linux generates +entropy from various non-hwrng sources in various subsystems, all of which +ultimately are rooted in IRQ noise, a very "slow" source of entropy, in +particular in virtualized environments. + +## `systemd`'s Use of Random Numbers + +systemd is responsible for bringing up the OS. It generally runs as the first +userspace process the kernel invokes. Because of that it runs at a time when +the entropy pool is typically not yet initialized, and thus requests to acquire +random bytes will either be delayed, will fail or result in a noisy kernel log +message (see above). + +Various other components run during early boot that require random bytes. For +example, initrds nowadays communicate with encrypted networks or access +encrypted storage which might need random numbers. 
systemd itself requires +random numbers as well, including for the following uses: + +* systemd assigns 'invocation' UUIDs to all services it invokes that uniquely + identify each invocation. This is useful to retain a global handle on a specific + service invocation and relate it to other data. For example, log data + collected by the journal usually includes the invocation UUID and thus the + runtime context the service manager maintains can be neatly matched up with + the log data a specific service invocation generated. systemd also + initializes `/etc/machine-id` with a randomized UUID. (systemd also makes use + of the randomized "boot id" the kernel exposes in + `/proc/sys/kernel/random/boot_id`). These UUIDs are exclusively Type 4 UUIDs, + i.e. randomly generated ones. + +* systemd maintains various hash tables internally. In order to harden them + against [collision + attacks](https://www.cs.auckland.ac.nz/~mcw/Teaching/refs/misc/denial-of-service.pdf) + they are seeded with random numbers. + +* At various places systemd needs random bytes for temporary file name + generation, UID allocation randomization, and similar. + +* systemd-resolved and systemd-networkd use random number generators to harden + the protocols they implement against packet forgery. + +* systemd-udevd and systemd-nspawn can generate randomized MAC addresses for + network devices. + +Note that these cases generally do not require a cryptographic-grade random +number generator, as most of these utilize random numbers to minimize risk of +collision and not to generate secret key material. However, they usually do +require "medium-grade" random data. For example: systemd's hash-maps are +reseeded if they grow beyond certain thresholds (and thus collisions are more +likely). 
This means they are generally fine with low-quality (even constant) +random numbers initially as long as they get better with time, so that +collision attacks are eventually thwarted as better, non-guessable seeds are +acquired. + +## Keeping `systemd`'s Demand on the Kernel Entropy Pool Minimal + +Since most of systemd's own uses of random numbers do not require +cryptographic-grade RNGs, it tries to avoid blocking reads to the kernel's RNG, +opting instead for using `getrandom(GRND_INSECURE)`. After the pool is +initialized, this is identical to `getrandom(0)`, returning cryptographically +secure random numbers, but before it's initialized it has the nice effect of +not blocking system boot. + +## `systemd`'s Support for Filling the Kernel Entropy Pool + +systemd has various provisions to ensure the kernel entropy pool is filled up +quickly during boot. + +1. When systemd's PID 1 detects it runs in a virtualized environment providing + the `virtio-rng` interface it will load the necessary kernel modules to make + use of it during earliest boot, if possible, much earlier than regular + kernel module loading done by `systemd-udevd.service`. This should ensure + that in VM environments the entropy pool is quickly filled, even before + systemd invokes the first service process, as long as the VM environment + provides virtualized RNG hardware (and VM environments really should!). + +2. The + [`systemd-random-seed.service`](https://www.freedesktop.org/software/systemd/man/systemd-random-seed.service.html) + system service will load a random seed from `/var/lib/systemd/random-seed` + into the kernel entropy pool. By default it does not credit entropy for it + though, since the seed is, more often than not, not reset when 'golden' + master images of an OS are created, and thus replicated into every + installation. 
If OS image builders carefully reset the random seed file + before generating the image it should be safe to credit entropy, which can + be enabled by setting the `$SYSTEMD_RANDOM_SEED_CREDIT` environment variable + for the service to `1` (or even `force`, see man page). Note however, that + this service typically runs relatively late during early boot: long after + the initrd completed, and after the `/var/` file system became + writable. This is usually too late for many applications, it is hence not + advised to rely exclusively on this functionality to seed the kernel's + entropy pool. Also note that this service synchronously waits until the + kernel's entropy pool is initialized before completing start-up. It may thus + be used by other services as synchronization point to order against, if they + require an initialized entropy pool to operate correctly. + +3. The + [`systemd-boot`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html) + EFI boot loader included in systemd is able to maintain and provide a random + seed stored in the EFI System Partition (ESP) to the booted OS, which allows + booting up with a fully initialized entropy pool from earliest boot + on. During installation of the boot loader (or when invoking [`bootctl + random-seed`](https://www.freedesktop.org/software/systemd/man/bootctl.html#random-seed)) + a seed file with an initial seed is placed in a file `/loader/random-seed` + in the ESP. In addition, an identically sized randomized EFI variable called + the 'system token' is set, which is written to the machine's firmware NVRAM. + During boot, when `systemd-boot` finds both the random seed file and the + system token they are combined and hashed with SHA256 (in counter mode, to + generate sufficient data), to generate a new random seed file to store in + the ESP as well as a random seed to pass to the OS kernel. 
The new random + seed file for the ESP is then written to the ESP, ensuring this is completed + before the OS is invoked. + + The kernel then reads the random seed that the boot loader passes to it, via + the EFI configuration table entry, `LINUX_EFI_RANDOM_SEED_TABLE_GUID` + (1ce1e5bc-7ceb-42f2-81e5-8aadf180f57b), which is allocated with pool memory + of type `EfiACPIReclaimMemory`. Its contents have the form: + ``` + struct linux_efi_random_seed { + u32 size; // of the 'seed' array in bytes + u8 seed[]; + }; + ``` + The size field is generally set to 32 bytes, and the seed field includes a + hashed representation of any prior seed in `LINUX_EFI_RANDOM_SEED_TABLE_GUID` + together with the new seed. + + This mechanism is able to safely provide an initialized entropy pool before + userspace even starts and guarantees that different seeds are passed from + the boot loader to the OS on every boot (in a way that does not allow + regeneration of an old seed file from a new seed file). Moreover, when an OS + image is replicated between multiple images and the random seed is not + reset, this will still result in different random seeds being passed to the + OS, as the per-machine 'system token' is specific to the physical host, and + not included in OS disk images. If the 'system token' is properly + initialized and kept sufficiently secret it should not be possible to + regenerate the entropy pool of different machines, even if this seed is the + only source of entropy. + + Note that the writes to the ESP needed to maintain the random seed should be + minimal. Because the size of the random seed file is generally set to 32 bytes, + updating the random seed in the ESP should be doable safely with a single + sector write (since hard-disk sectors typically happen to be 512 bytes long, + too), which should be safe even with FAT file system drivers built into + low-quality EFI firmwares. + +4. 
A kernel command line option `systemd.random_seed=` may be used to pass in a + base64 encoded seed to initialize the kernel's entropy pool from during + early service manager initialization. This option is only safe in testing + environments, as the random seed passed this way is accessible to + unprivileged programs via `/proc/cmdline`. Using this option outside of + testing environments is a security problem since cryptographic key material + derived from the entropy pool initialized with a seed accessible to + unprivileged programs should not be considered secret. + +With the four mechanisms described above it should be possible to provide +early-boot entropy in most cases. Specifically: + +1. On EFI systems, `systemd-boot`'s random seed logic should make sure good + entropy is available during earliest boot — as long as `systemd-boot` is + used as boot loader, and outside of virtualized environments. + +2. On virtualized systems, the early `virtio-rng` hookup should ensure entropy + is available early on — as long as the VM environment provides virtualized + RNG devices, which they really should all do in 2019. Complain to your + hosting provider if they don't. For VMs used in testing environments, + `systemd.random_seed=` may be used as an alternative to a virtualized RNG. + +3. In general, systemd's own reliance on the kernel entropy pool is minimal + (due to the use of `GRND_INSECURE`). + +4. In all other cases, `systemd-random-seed.service` will help a bit, but — as + mentioned — is too late to help with early boot. + +This primarily leaves two kind of systems in the cold: + +1. Some embedded systems. Many embedded chipsets have hwrng functionality these + days. Consider using them while crediting + entropy. (i.e. `rng_core.default_quality=1000` on the kernel command line is + your friend). Or accept that the system might take a bit longer to + boot. 
Alternatively, consider implementing a solution similar to + systemd-boot's random seed concept in your platform's boot loader. + +2. Virtualized environments that lack both virtio-rng and RDRAND, outside of + test environments. Tough luck. Talk to your hosting provider, and ask them + to fix this. + +3. Also note: if you deploy an image without any random seed and/or without + installing any 'system token' in an EFI variable, as described above, this + means that on the first boot no seed can be passed to the OS + either. However, as the boot completes (with entropy acquired elsewhere), + systemd will automatically install both a random seed in the GPT and a + 'system token' in the EFI variable space, so that any future boots will have + entropy from earliest boot on — all provided `systemd-boot` is used. + +## Frequently Asked Questions + +1. *Why don't you just use getrandom()? That's all you need!* + + Did you read any of the above? getrandom() is hooked to the kernel entropy + pool, and during early boot it's not going to be filled yet, very likely. We + do use it in many cases, but not in all. Please read the above again! + +2. *Why don't you use + [getentropy()](https://man7.org/linux/man-pages/man3/getentropy.3.html)? That's + all you need!* + + Same story. That call is just a different name for `getrandom()` with + `flags` set to zero, and some additional limitations, and thus it also needs + the kernel's entropy pool to be initialized, which is the whole problem we + are trying to address here. + +3. *Why don't you generate your UUIDs with + [`uuidd`](https://man7.org/linux/man-pages/man8/uuidd.8.html)? That's all you + need!* + + First of all, that's a system service, i.e. something that runs as "payload" + of systemd, long after systemd is already up and hence can't provide us + UUIDs during earliest boot yet. Don't forget: to assign the invocation UUID + for the `uuidd.service` start we already need a UUID that the service is + supposed to provide us. 
More importantly though, `uuidd` needs state/a random + seed/a MAC address/host ID to operate, all of which are not available during + early boot. + +4. *Why don't you generate your UUIDs with `/proc/sys/kernel/random/uuid`? + That's all you need!* + + This is just a different, more limited interface to `/dev/urandom`. It gains + us nothing. + +5. *Why don't you use [`rngd`](https://github.com/nhorman/rng-tools), + [`haveged`](http://www.issihosts.com/haveged/), + [`egd`](http://egd.sourceforge.net/)? That's all you need!* + + Like `uuidd` above these are system services, hence come too late for our + use-case. In addition much of what `rngd` provides appears to be equivalent + to `CONFIG_RANDOM_TRUST_CPU=y` or `rng_core.default_quality=1000`, except + being more complex and involving userspace. These services partly measure + system behavior (such as scheduling effects) which the kernel either + already feeds into its pool anyway (and thus shouldn't be fed into it a + second time, crediting entropy for it a second time) or is at least + something the kernel could much better do on its own. Hence, if what these + daemons do is still desirable today, this would be much better implemented + in kernel (which would be very welcome of course, but wouldn't really help + us here in our specific problem, see above). + +6. *Why don't you use [`arc4random()`](https://man.openbsd.org/arc4random.3)? + That's all you need!* + + This doesn't solve the issue, since it requires a nonce to start from, and + it gets that from `getrandom()`, and thus we have to wait for random pool + initialization the same way as calling `getrandom()` + directly. `arc4random()` is nothing more than optimization, in fact it + implements similar algorithms that the kernel entropy pool implements + anyway, hence besides being able to provide random bytes with higher + throughput there's little it gets us over just using `getrandom()`. Also, + it's not supported by glibc. 
And as long as that's the case we are not keen + on using it, as we'd have to maintain that on our own, and we don't want to + maintain our own cryptographic primitives if we don't have to. Since + systemd's uses are not performance relevant (besides the pool initialization + delay, which this doesn't solve), there's hence little benefit for us to + call these functions. That said, if glibc learns these APIs one day, we'll + certainly make use of them where appropriate. + +7. *This is boring: NetBSD had [boot loader entropy seed + support](https://netbsd.gw.com/cgi-bin/man-cgi?boot+8) since ages!* + + Yes, NetBSD has that, and the above is inspired by that (note though: this + article is about a lot more than that). NetBSD's support is not really safe, + since it neither updates the random seed before using it, nor has any + safeguards against replicating the same disk image with its random seed on + multiple machines (which the 'system token' mentioned above is supposed to + address). This means reuse of the same random seed by the boot loader is + much more likely. + +8. *Why does PID 1 upload the boot loader provided random seed into kernel + instead of kernel doing that on its own?* + + That's a good question. Ideally the kernel would do that on its own, and we + wouldn't have to involve userspace in this. + +9. *What about non-EFI?* + + The boot loader random seed logic described above uses EFI variables to pass + the seed from the boot loader to the OS. Other systems might have similar + functionality though, and it shouldn't be too hard to implement something + similar for them. Ideally, we'd have an official way to pass such a seed as + part of the `struct boot_params` from the boot loader to the kernel, but + this is currently not available. + +10. *I use a different boot loader than `systemd-boot`, I'd like to use boot + loader random seeds too!* + + Well, consider just switching to `systemd-boot`, it's worth it. 
See + [systemd-boot(7)](https://www.freedesktop.org/software/systemd/man/systemd-boot.html) + for an introduction explaining why. That said, any boot loader can re-implement the + logic described above, and can pass a random seed that systemd as PID 1 + will then upload into the kernel's entropy pool. For details see the + [Boot Loader Interface](BOOT_LOADER_INTERFACE) documentation. + +11. *Why not pass the boot loader random seed via kernel command line instead + of as an EFI variable?* + + The kernel command line is accessible to unprivileged processes via + `/proc/cmdline`. It's not desirable if unprivileged processes can use this + information to possibly gain too much information about the current state + of the kernel's entropy pool. + + That said, we actually do implement this with the `systemd.random_seed=` + kernel command line option. Don't use this outside of testing environments, + however, for the aforementioned reasons. + +12. *Why doesn't `systemd-boot` rewrite the 'system token' too each time + when updating the random seed file stored in the ESP?* + + The system token is stored as a persistent EFI variable, i.e. in some form of + NVRAM. These memory chips tend to be of low quality in many machines, and + hence we shouldn't write to them too often. Writing to them once during + installation should generally be OK, but rewriting them on every single + boot would probably wear the chip out too much, and we shouldn't risk that. diff --git a/docs/RELEASE.md b/docs/RELEASE.md new file mode 100644 index 0000000..df04cb4 --- /dev/null +++ b/docs/RELEASE.md @@ -0,0 +1,28 @@ +--- +title: Steps to a Successful Release +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Steps to a Successful Release + +1. Add all items to NEWS +2. Update the contributors list in NEWS (`ninja -C build git-contrib`) +3. Update the time and place in NEWS +4. Update hwdb (`ninja -C build update-hwdb`, `ninja -C build update-hwdb-autosuspend`, commit separately). 
+5. Update syscall numbers (`ninja -C build update-syscall-tables update-syscall-header`). +6. [RC1] Update version and library numbers in `meson.build` +7. Check dbus docs with `ninja -C build update-dbus-docs` +8. Update translation strings (`cd build`, `meson compile systemd-pot`, `meson compile systemd-update-po`) - drop the header comments from `systemd.pot` + re-add SPDX before committing. If the only change in a file is the 'POT-Creation-Date' field, then ignore that file. +9. Tag the release: `version=vXXX-rcY && git tag -s "${version}" -m "systemd ${version}"` +10. Do `ninja -C build` +11. Make sure that the version string and package string match: `build/systemctl --version` +12. [FINAL] Close the github milestone and open a new one (https://github.com/systemd/systemd/milestones) +13. "Draft" a new release on github (https://github.com/systemd/systemd/releases/new), mark "This is a pre-release" if appropriate. +14. Check that announcement to systemd-devel, with a copy&paste from NEWS, was sent. This should happen automatically. +15. Update IRC topic (`/msg chanserv TOPIC #systemd Version NNN released | Online resources https://systemd.io/`) +16. Push commits to stable, create an empty -stable branch: `git push systemd-stable --atomic origin/main:main origin/main:refs/heads/${version}-stable`. +17. Build and upload the documentation (on the -stable branch): `ninja -C build doc-sync` +18. [FINAL] Change the default branch to latest release (https://github.com/systemd/systemd-stable/settings/branches). +19. 
[FINAL] Change the Github Pages branch in the stable repository to the newly created branch (https://github.com/systemd/systemd-stable/settings/pages) and set the 'Custom domain' to 'systemd.io' diff --git a/docs/RESOLVED-VPNS.md b/docs/RESOLVED-VPNS.md new file mode 100644 index 0000000..dbf43f9 --- /dev/null +++ b/docs/RESOLVED-VPNS.md @@ -0,0 +1,268 @@ +--- +title: systemd-resolved and VPNs +category: Networking +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# `systemd-resolved.service` and VPNs + +`systemd-resolved.service` supports routing lookups for specific domains to specific +interfaces. This is useful for hooking up VPN software with systemd-resolved +and making sure the exact right lookups end up on the VPN and on the other +interfaces. + +For a verbose explanation of `systemd-resolved.service`'s domain routing logic, +see its [man +page](https://www.freedesktop.org/software/systemd/man/systemd-resolved.service.html). This +document is supposed to provide examples to use the concepts for the specific +purpose of managing VPN DNS configuration. + +Let's first define two distinct VPN use-cases: + +1. *Corporate* VPNs, i.e. VPNs that open access to a specific set of additional + hosts. Only specific domains should be resolved via the VPN's DNS servers, + and everything that is not related to the company's domain names should go + to regular, non-VPN DNS instead. + +2. *Privacy* VPNs, i.e. VPNs that should be used for basically all DNS traffic, + once they are up. If this type of VPN is used, any regular, non-VPN DNS + servers should not get any traffic anymore. + +Then, let's briefly introduce three DNS routing concepts that software managing +a network interface may configure. + +1. Search domains: these are traditional DNS configuration parameters and are + used to suffix non-qualified domain names (i.e. single-label ones), to turn + them into fully qualified domain names. 
Traditionally (before + `systemd-resolved.service`), search domain names are attached to a system's + IP configuration as a whole, in `systemd-resolved.service` they are + associated to individual interfaces instead, since they are typically + acquired through some network associated concept, such as a DHCP, IPv6RA or + PPP lease. Most importantly though: in `systemd-resolved.service` they are + not just used to suffix single-label domain names, but also for routing + domain name lookups: if a network interface has a search domain `foo.com` + configured on it, then any lookups for names ending in `.foo.com` (or for + `foo.com` itself) are preferably routed to the DNS servers configured on the + same network interface. + +2. Routing domains: these are very similar to search domains, but are purely + about DNS domain name lookup routing — they are not used for qualifying + single-label domain names. When it comes to routing, assigning a routing + domain to a network interface is identical to assigning a search domain to + it. + + Why the need to have both concepts, i.e. search *and* routing domains? + Mostly because in many cases the qualifying of single-label names is not + desirable (as it has security implications), but needs to be supported for + specific use-cases. Routing domains are a concept `systemd-resolved.service` + introduced, while search domains are traditionally available and are part of + DHCP/IPv6RA/PPP leases and thus universally supported. In many cases routing + domains are probably the more appropriate concept, but not easily available, + since they are not part of DHCP/IPv6RA/PPP. + + Routing domains for `systemd-resolved.service` are usually presented along + with search domains in mostly the same way, but prefixed with `~` to + differentiate them. i.e. `~foo.com` is a configured routing domain, while + `foo.com` would be a configured search domain. + + One routing domain is particularly interesting: `~.` — the catch-all routing + domain. 
(The *dot* domain `.` is how DNS denotes the "root" domain, i.e. the + parent domain of all domains, but itself.) When used on an interface any DNS + traffic is preferably routed to its DNS servers. (A search domain – i.e. `.` + instead of `~.` — would have the same effect, but given that it's mostly + pointless to suffix an unqualified domain with `.`, we generally declare it + as a routing domain, not a search domain). + + Routing domains also have particular relevance when it comes to the reverse + lookup DNS domains `.in-addr.arpa` and `.ip6.arpa`. An interface that has + these (or sub-domains thereof) defined as routing domains, will be preferably + used for doing reverse IP to domain name lookups. e.g. declaring + `~168.192.in-addr.arpa` on an interface means that all lookups to find the + domain names for IPv4 addresses 192.168.x.y are preferably routed to it. + +3. The `default-route` boolean. This is a simple boolean value that may be set + on an interface. If true (the default), any DNS lookups for which no + matching routing or search domains are defined are routed to interfaces + marked like this. If false then the DNS servers on this interface are not + considered for routing lookups to except for the ones listed in the + search/routing domain list. An interface that has no search/routing domain + associated and also has this boolean off is not considered for *any* + lookups. + +One more thing to mention: in `systemd-resolved.service` if lookups match the +search/routing domains of multiple interfaces at once, then they are sent to +all of them in parallel, and the first positive reply used. If all lookups fail +the last negative reply is used. This means the DNS zones on the relevant +interfaces are "merged": domains existing on one but not the other will "just +work" and vice versa. 
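
The routing rules described so far, together with the longest-match refinement explained in the next paragraph, can be sketched in C. This is only an illustrative model under stated assumptions, not the code `systemd-resolved.service` actually uses: `struct link_config`, `count_labels()`, `match_labels()`, `link_match()` and `route_lookup()` are hypothetical names, and lookup names are assumed to be lower-case and written without a trailing dot.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-interface DNS routing configuration (not a systemd type). */
struct link_config {
    const char *name;            /* interface name, e.g. "wifi0" */
    const char *const *domains;  /* NULL-terminated; a leading '~' marks a routing domain */
    bool default_route;          /* the default-route boolean */
};

/* Number of labels in a domain; the root domain "." has zero. */
static int count_labels(const char *domain) {
    if (domain[0] == '\0' || strcmp(domain, ".") == 0)
        return 0;
    int n = 1;
    for (const char *p = domain; *p; p++)
        if (*p == '.')
            n++;
    return n;
}

/* Returns the number of matching suffix labels, or -1 if there is no match. */
static int match_labels(const char *name, const char *domain) {
    if (domain[0] == '~')   /* search and routing domains route identically */
        domain++;
    if (strcmp(domain, ".") == 0)
        return 0;           /* catch-all: matches every name with zero labels */
    size_t nl = strlen(name), dl = strlen(domain);
    if (dl == 0 || nl < dl || strcmp(name + (nl - dl), domain) != 0)
        return -1;
    if (nl > dl && name[nl - dl - 1] != '.')
        return -1;          /* the suffix must start at a label boundary */
    return count_labels(domain);
}

/* Best (longest) match of any domain configured on one interface. */
static int link_match(const struct link_config *l, const char *name) {
    int best = -1;
    for (const char *const *d = l->domains; d && *d; d++) {
        int m = match_labels(name, *d);
        if (m > best)
            best = m;
    }
    return best;
}

/* Marks in selected[] every interface the lookup is sent to and returns
 * how many were selected. Interfaces sharing the longest match are all
 * used in parallel; if nothing matches, default-route interfaces are used. */
size_t route_lookup(const struct link_config *links, size_t n_links,
                    const char *name, bool *selected) {
    int best = -1;
    for (size_t i = 0; i < n_links; i++) {
        int m = link_match(&links[i], name);
        if (m > best)
            best = m;
    }
    size_t count = 0;
    for (size_t i = 0; i < n_links; i++) {
        selected[i] = best >= 0 ? link_match(&links[i], name) == best
                                : links[i].default_route;
        if (selected[i])
            count++;
    }
    return count;
}
```

In this model an interface carrying `~.` matches every name with zero labels, so it beats the `default-route` fallback but loses to any interface with a more specific search or routing domain.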
+ +And one more note: the domain routing logic implemented is a tiny bit more +complex than what is described above: if two interfaces have search or routing +domains where one is a suffix of the other, and a name is looked up that matches +both, the interface with the longer match will win and get the lookup routed to +its DNS servers. Only if the matches have the same length will both be used in +parallel. Example: one interface has `~foo.example.com` as routing domain, and +another one has `example.com` as search domain. A lookup for +`waldo.foo.example.com` is then exclusively routed to the first interface's DNS +server, since it matches by three suffix labels instead of just two. The fact +that the matching length is taken into consideration for the routing decision +is particularly relevant if you have one interface with the `~.` routing domain +and another one with `~corp.company.example`: both suffixes match a lookup for +`foo.corp.company.example`, but the latter interface wins, since the match is +for three labels, while the other is for zero labels. + +## Putting it Together + +Let's discuss how the three DNS routing concepts above are best used for a +reasonably complex scenario consisting of: + +1. One VPN interface of the *corporate* kind, maybe called `company0`. It makes + available a bunch of servers, all in the domain `corp.company.example`. + +2. One VPN interface of the *privacy* kind, maybe called `privacy0`. When it is + up all DNS traffic shall preferably be routed to its DNS servers. + +3. One regular WiFi interface, maybe called `wifi0`. It has a regular DNS + server on it. + +Here's how to best configure this for `systemd-resolved.service`: + +1. `company0` should get a routing domain `~corp.company.example` + configured. (A search domain `corp.company.example` would work too, if + qualifying of single-label names is desired or the VPN lease information + does not provide for the concept of routing domains, but does support search + domains.) 
This interface should also set `default-route` to false, to ensure + that really only the DNS lookups for the company's servers are routed there + and nothing else. Finally, it might make sense to also configure a routing + domain `~2.0.192.in-addr.arpa` on the interface, ensuring that all IPv4 + addresses from the 192.0.2.x range are preferably resolved via the DNS + server on this interface (assuming that that's the IPv4 address range the + company uses internally). + +2. `privacy0` should get a routing domain `~.` configured. The setting of + `default-route` for this interface is then irrelevant. This means: once the + interface is up, all DNS traffic is preferably routed there. + +3. `wifi0` should not get any special settings, except possibly whatever the + local WiFi router considers suitable as search domain, for example + `fritz.box`. The default `true` setting for `default-route` is good too. + +With this configuration if only `wifi0` is up, all DNS traffic goes to its DNS +server, since there are no other interfaces with better matching DNS +configuration. If `privacy0` is then upped, all DNS traffic will exclusively go +to this interface now — with the exception of names below the `fritz.box` +domain, which will continue to go directly to `wifi0`, as the search domain +there says so. Now, if `company0` is also upped, it will receive DNS traffic +for the company's internal domain and internal IP subnet range, but nothing +else. If `privacy0` is then downed again, `wifi0` will get the regular DNS +traffic again, and `company0` will still get the company's internal domain and +IP subnet traffic and nothing else. Everything hence works as intended. + +## How to Implement this in Your VPN Software + +Most likely you want to expose a boolean in some way that declares whether a +specific VPN is of the *corporate* or the *privacy* kind: + +1. If managing a *corporate* VPN, you configure any search domains the user or + the VPN contact point provided. 
And you set `default-route` to false. If you + have IP subnet information for the VPN, it might make sense to insert + `~….in-addr.arpa` and `~….ip6.arpa` reverse lookup routing domains for it. + +2. If managing a *privacy* VPN, you include `~.` in the routing domains. The + value for `default-route` is then actually irrelevant, but I'd set it to true. No + need to configure any reverse lookup routing domains for it. + +(If you also manage regular WiFi/Ethernet devices, just configure them in the +traditional way, i.e. with any search domains as acquired. Do not set `~.` though, +and do not disable `default-route`.) + +## The APIs + +Now that we have determined how we want to configure things, how do you actually get +the configuration to `systemd-resolved.service`? There are three relevant +interfaces: + +1. Ideally, you use D-Bus and talk to [`systemd-resolved.service`'s D-Bus + API](https://www.freedesktop.org/software/systemd/man/org.freedesktop.resolve1.html) + directly. Use `SetLinkDomains()` to set the per-interface search and routing + domains on the interfaces you manage, and `SetLinkDefaultRoute()` to manage + the `default-route` boolean, all on the `org.freedesktop.resolve1.Manager` + interface of the `/org/freedesktop/resolve1` object. + +2. If that's not in the cards, you may shell out to + [`resolvectl`](https://www.freedesktop.org/software/systemd/man/resolvectl.html), + which is a thin wrapper around the D-Bus interface mentioned above. Use + `resolvectl domain <iface> …` to set the search/routing domains and + `resolvectl default-route <iface> …` to set the `default-route` boolean. + + Example use from a shell callout of your VPN software for a *corporate* VPN: + + resolvectl domain corporate0 '~corp.company.example' '~2.0.192.in-addr.arpa' + resolvectl default-route corporate0 false + resolvectl dns corporate0 192.0.2.1 + + Example use from a shell callout of your VPN software for a *privacy* VPN: + + resolvectl domain privacy0 '~.' 
+ resolvectl default-route privacy0 true + resolvectl dns privacy0 8.8.8.8 + +3. If you don't want to use any `systemd-resolved` commands, you may use the + `resolvconf` wrapper we provide. `resolvectl` is actually a multi-call + binary and may be symlinked to `resolvconf`, and when invoked like that + behaves in a way that is largely compatible with FreeBSD's and + Ubuntu's/Debian's + [`resolvconf(8)`](https://manpages.ubuntu.com/manpages/trusty/man8/resolvconf.8.html) + tool. When the `-x` switch is specified, the `~.` routing domain is + automatically appended to the domain list configured, as appropriate for a + *privacy* VPN. Note that the `resolvconf` interface only covers *privacy* + VPNs and regular network interfaces (such as WiFi or Ethernet) well. The + *corporate* kind of VPN is not well covered, since the interface cannot + propagate the `default-route` boolean, nor can it be used to configure the + `~….in-addr.arpa` or `~….ip6.arpa` routing domains. + +## Ordering + +When configuring per-interface DNS configuration settings it is wise to +configure everything *before* actually upping the interface. Once the interface +is up `systemd-resolved.service` might start using it, and hence it's important +to have everything configured properly (this is particularly relevant when +LLMNR or MulticastDNS is enabled, since that works without any explicit +DNS configuration). It is also wise to configure search/routing +domains and the `default-route` boolean *before* configuring the DNS servers, +as the former without the latter has no effect, but the latter without the +former may result in DNS traffic being generated in an undesirable +way, given that the routing information is not set yet. + +## Downgrading Search Domains to Routing Domains + +Many VPN implementations provide a way for VPN servers to inform VPN clients +about search domains to use. 
In some cases it might make sense to install those +as routing domains instead of search domains. Unqualified domain names usually +imply a context of locality: the same unqualified name typically is expected to +resolve to one system in one local network, and to another one in a different +network. Search domains thus generally come with security implications: they +might cause that unqualified domains are resolved in a different (possibly +remote) context, contradicting user expectations. Thus it might be wise to +downgrade *search domains* provided by VPN servers to *routing domains*, so +that local unqualified name resolution remains untouched and strictly maintains +its local focus — in particular in the aforementioned less trusted *corporate* +VPN scenario. + +To illustrate this further, here's an example for an attack scenario using +search domains: a user assumes the printer system they daily contact under the +unqualified name "printer" is the network printer in their basement (with the +fully qualified domain name "printer.home"). Sometimes the user joins the +corporate VPN of their employer, which comes with a search domain +"foocorp.example", so that the user's confidential documents (maybe a job +application to a competing company) might end up being printed on +"printer.foocorp.example" instead of "printer.home". If the local VPN software +had downgraded the VPN's search domain to a routing domain "~foocorp.example", +this mismapping would not have happened. + +When connecting to untrusted WiFi networks it might be wise to go one step +further even: suppress installation of search/routing domains by the network +entirely, to ensure that the local DNS information is only used for name +resolution of qualified names and only when no better DNS configuration is +available. 
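
The downgrade of search domains to routing domains described in this section amounts to prefixing each server-provided domain with `~` before handing it to `systemd-resolved.service`. A minimal sketch, assuming a hypothetical helper named `downgrade_to_routing_domain()` that is not part of any systemd API:

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: writes the routing-domain form of 'domain' into 'out'.
 * Returns 0 on success, or -1 if 'out' is too small to hold the result. */
int downgrade_to_routing_domain(const char *domain, char *out, size_t out_size) {
    int n;

    if (domain[0] == '~')
        /* Already a routing domain: copy it unchanged. */
        n = snprintf(out, out_size, "%s", domain);
    else
        /* Search domain: prefix it with '~' so it is used for routing only. */
        n = snprintf(out, out_size, "~%s", domain);

    return (n < 0 || (size_t) n >= out_size) ? -1 : 0;
}
```

For the attack scenario above, a VPN client would turn the server-provided `foocorp.example` into `~foocorp.example` and pass that to `resolvectl domain <iface> …` or `SetLinkDomains()`, so that the unqualified name "printer" keeps resolving locally.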
diff --git a/docs/ROOT_STORAGE_DAEMONS.md b/docs/ROOT_STORAGE_DAEMONS.md new file mode 100644 index 0000000..69812c9 --- /dev/null +++ b/docs/ROOT_STORAGE_DAEMONS.md @@ -0,0 +1,194 @@ +--- +title: Storage Daemons for the Root File System +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# systemd and Storage Daemons for the Root File System + +a.k.a. _Pax Cellae pro Radix Arbor_ + +(or something like that, my Latin is a bit rusty) + +A number of complex storage technologies on Linux (e.g. RAID, volume +management, networked storage) require user space services to run while the +storage is active and mountable. This requirement becomes tricky as soon as the +root file system of the Linux operating system is stored on such storage +technology. Previously no clear path to make this work was available. This text +tries to clear up the resulting confusion, and what is now supported and what +is not. + +## A Bit of Background + +When complex storage technologies are used as backing for the root file system +this needs to be set up by the initrd, i.e. on Fedora by Dracut. In newer +systemd versions tear-down of the root file system backing is also done by the +initrd: after terminating all remaining running processes and unmounting all +file systems it can (which means excluding the root file system) systemd will +jump back into the initrd code allowing it to unmount the final file systems +(and its storage backing) that could not be unmounted as long as the OS was +still running from the main root file system. The job of the initrd is to +detach/unmount the root file system, i.e. inverting the exact commands it used +to set them up in the first place. This is not only cleaner, but also allows +for the first time arbitrary complex stacks of storage technology. 
+ +Previous attempts to handle root file system setups with complex storage as +backing usually tried to maintain the root storage with program code stored on +the root storage itself, thus creating a number of dependency loops. Safely +detaching such a root file system becomes messy, since the program code on the +storage needs to stay around longer than the storage, which is technically +contradicting. + +## What's new? + +As a result, we hereby clarify that we do not support storage technology setups +where the storage daemons are being run from the storage they maintain +themselves. In other words: a storage daemon backing the root file system cannot +be stored on the root file system itself. + +What we do support instead is that these storage daemons are started from the +initrd, stay running all the time during normal operation and are terminated +only after we returned control back to the initrd and by the initrd. As such, +storage daemons involved with maintaining the root file system storage +conceptually are more like kernel threads than like normal system services: +from the perspective of the init system (i.e. systemd), these services have been +started before systemd was initialized and stay around until after systemd is +already gone. These daemons can only be updated by updating the initrd and +rebooting; a takeover from initrd-supplied services to replacements from the +root file system is not supported. + +## What does this mean? + +Near the end of system shutdown, systemd executes a small tool called +systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as +it entirely replaces the systemd init process) then iterates through the +mounted file systems and running processes (as well as a couple of other +resources) and tries to unmount/read-only mount/detach/kill them. It continues +to do this in a tight loop as long as this results in any effect. 
From this +killing spree a couple of processes are automatically excluded: PID 1 itself of +course, as well as all kernel threads. After the killing/unmounting spree +control is passed back to the initrd, whose job is then to unmount/detach +whatever might be remaining. + +The same killing spree logic (but not the unmount/detach/read-only logic) is +applied during the transition from the initrd to the main system (i.e. the +"`switch_root`" operation), so that no processes from the initrd survive to the +main system. + +To implement the supported logic proposed above (i.e. where storage daemons +needed for the root file system which are started by the initrd stay around +during normal operation and are only killed after control is passed back to the +initrd), we need to exclude these daemons from the shutdown/switch_root killing +spree. To accomplish this, the following logic is available starting with +systemd 38: + +Processes (run by the root user) whose first character of the zeroth command +line argument is `@` are excluded from the killing spree, much the same way as +kernel threads are excluded too. Thus, a daemon which wants to take advantage +of this logic needs to place the following at the top of its `main()` function: + +```c +... +argv[0][0] = '@'; +... +``` + +And that's already it. Note that this functionality is only to be used by +programs running from the initrd, and **not** for programs running from the +root file system itself. Programs which use this functionality and are running +from the root file system are considered buggy since they effectively prohibit +clean unmounting/detaching of the root file system and its backing storage. + +_Again: if your code is being run from the root file system, then this logic +suggested above is **NOT** for you. Sorry. 
Talk to us, we can probably help you +to find a different solution to your problem._ + +The recommended way to distinguish between run-from-initrd and run-from-rootfs +for a daemon is to check for `/etc/initrd-release` (which exists on all modern +initrd implementations, see the [initrd Interface](INITRD_INTERFACE) for +details) which when exists results in `argv[0][0]` being set to `@`, and +otherwise doesn't. Something like this: + +```c +#include <unistd.h> + +int main(int argc, char *argv[]) { + ... + if (access("/etc/initrd-release", F_OK) >= 0) + argv[0][0] = '@'; + ... + } +``` + +Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without +precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify +they are login shells. This logic is also very easy to implement. We have been +looking for other ways to mark processes for exclusion from the killing spree, +but could not find any that was equally simple to implement and quick to read +when traversing through `/proc/`. Also, as a side effect replacing the first +character of `argv[0]` with `@` also visually invalidates the path normally +stored in `argv[0]` (which usually starts with `/`) thus helping the +administrator to understand that your daemon is actually not originating from +the actual root file system, but from a path in a completely different +namespace (i.e. the initrd namespace). Other than that we just think that `@` +is a cool character which looks pretty in the ps output... 😎 + +Note that your code should only modify `argv[0][0]` and leave the comm name +(i.e. `/proc/self/comm`) of your process untouched. + +Since systemd v255, alternatively the `SurviveFinalKillSignal=yes` unit option +can be set, and provides the equivalent functionality to modifying `argv[0][0]`. + +## To which technologies does this apply? + +These recommendations apply to those storage daemons which need to stay around +until after the storage they maintain is unmounted. 
If your storage daemon is +fine with being shut down before its storage device is unmounted, you may ignore +the recommendations above. + +This all applies to storage technology only, not to daemons with any other +(non-storage related) purposes. + +## What else to keep in mind? + +If your daemon implements the logic pointed out above, it should work nicely +from initrd environments. In many cases it might be necessary to additionally +support storage daemons to be started from within the actual OS, for example +when complex storage setups are used for auxiliary file systems, i.e. not the +root file system, or created by the administrator during runtime. Here are a +few additional notes for supporting these setups: + +* If your storage daemon is run from the main OS (i.e. not the initrd) it will + also be terminated when the OS shuts down (i.e. before we pass control back + to the initrd). Your daemon needs to handle this properly. + +* It is not acceptable to spawn off background processes transparently from + user commands or udev rules. Whenever a process is forked off on Unix it + inherits a multitude of process attributes (ranging from the obvious to the + not-so-obvious such as security contexts or audit trails) from its parent + process. It is practically impossible to fully detach a service from the + process context of the spawning process. In particular, systemd tracks which + processes belong to a service or login sessions very closely, and by spawning + off your storage daemon from udev or an administrator command you thus make + it part of its service/login. Effectively this means that whenever udev is + shut down, your storage daemon is killed too, resp. whenever the login + session goes away your storage might be terminated as well. (Also note that + recent udev versions will automatically kill all long running background + processes forked off udev rules now.) 
So, in summary: double-forking off + processes from user commands or udev rules is **NOT** OK! + +* To automatically spawn storage daemons from udev rules or administrator + commands, the recommended technology is socket-based activation as + implemented by systemd. Transparently for your client code connecting to the + socket of your storage daemon will result in the storage to be started. For + that it is simply necessary to inform systemd about the socket you'd like it + to listen on behalf of your daemon and minimally modify the daemon to + receive the listening socket for its services from systemd instead of + creating it on its own. Such modifications can be minimal, and are easily + written in a way that does not negatively impact usability on non-systemd + systems. For more information on making use of socket activation in your + program consult this blog story: [Socket + Activation](https://0pointer.de/blog/projects/socket-activation.html) + +* Consider having a look at the [initrd Interface of systemd](INITRD_INTERFACE). diff --git a/docs/SECURITY.md b/docs/SECURITY.md new file mode 100644 index 0000000..a44b90d --- /dev/null +++ b/docs/SECURITY.md @@ -0,0 +1,14 @@ +--- +title: Reporting of Security Vulnerabilities +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Reporting of Security Vulnerabilities + +If you discover a security vulnerability, we'd appreciate a non-public disclosure. systemd developers can be contacted privately on the **[systemd-security@redhat.com](mailto:systemd-security@redhat.com) mailing list**. The disclosure will be coordinated with distributions. + +(The [issue tracker](https://github.com/systemd/systemd/issues) and [systemd-devel mailing list](https://lists.freedesktop.org/mailman/listinfo/systemd-devel) are fully public.) 
+ +Subscription to the systemd-security mailing list is open to **regular systemd contributors and people working in the security teams of various distributions**. Those conditions should be backed by publicly accessible information (ideally, a track of posts and commits from the mail address in question). If you fall into one of those categories and wish to be subscribed, submit a **[subscription request](https://www.redhat.com/mailman/listinfo/systemd-security)**. diff --git a/docs/SEPARATE_USR_IS_BROKEN.md b/docs/SEPARATE_USR_IS_BROKEN.md new file mode 100644 index 0000000..8e9390e --- /dev/null +++ b/docs/SEPARATE_USR_IS_BROKEN.md @@ -0,0 +1,40 @@ +--- +title: Booting Without /usr is Broken +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Booting Without /usr is Broken + +You probably discovered this page because your shiny new systemd system referred you here during boot time, when it warned you that booting without /usr pre-mounted wasn't supported anymore. And now you wonder what this all is about. Here's an attempt of an explanation: + +One thing in advance: systemd itself is actually mostly fine with /usr on a separate file system that is not pre-mounted at boot time. However, the common basic set of OS components of modern Linux machines is not, and has not been in quite some time. And it is unlikely that this is going to be fixed any time soon, or even ever. + +Most of the failures you will experience with /usr split off and not pre-mounted in the initramfs are graceful failures: they won't become directly visible, however certain features become unavailable due to these failures. Quite a number of programs these days hook themselves into the early boot process at various stages. A popular way to do this is for example via udev rules. 
The binaries called from these rules are sometimes located in /usr/bin, or link against libraries in /usr/lib, or use data files from /usr/share. If these rules fail udev will proceed with the next one, however later on applications will then not properly detect these udev devices or features of these devices. Here's a short, very incomplete list of software we are aware of that is currently not able to provide the full set of functionality when /usr is split off and not pre-mounted at boot: udev-pci-db/udev-usb-db and all rules depending on this (using the PCI/USB database in /usr/share), PulseAudio, NetworkManager, ModemManager, udisks, libatasmart, usb\_modeswitch, gnome-color-manager, usbmuxd, ALSA, D-Bus, CUPS, Plymouth, LVM, hplip, multipath, Argyll, VMware, the locale logic of most programs and a lot of other stuff. + +You don't believe us? Well, here's a command line that reveals a few obvious cases of udev rules that will silently fail to work if /usr is split off and not pre-mounted: `egrep 'usb-db|pci-db|FROM_DATABASE|/usr' /*/udev/rules.d/*` -- and you will find a lot more if you actually look for them. On my fresh Fedora 15 install that's 23 obvious cases. + +## The Status Quo + +Due to this, many upstream developers have decided to consider the problem of a separate /usr that is not mounted during early boot an outdated question, and started to close bugs regarding these issues as WONTFIX. We certainly cannot blame them, as the benefit of supporting this is questionable and brings a lot of additional work with it. + +And let's clarify a few things: + +1. **It isn't systemd's fault.** systemd mostly works fine with /usr on a separate file system that is not pre-mounted at boot. +2. **systemd is merely the messenger.** Don't shoot the messenger. +3. **There's no news in all of this.** The message you saw is just a statement of fact, describing the status quo. Things have been this way for a while. +4. 
**The message is merely a warning.** You can choose to ignore it. +5. **Don't blame us**, don't abuse us, it's not our fault. We have been working on the Linux userspace since quite some time, and simply have enough of the constant bug reports regarding these issues, since they are actually very hard to track down because the failures are mostly graceful. Hence we placed this warning into the early boot process of every systemd Linux system with a split off and not pre-mounted /usr, so that people understand what is going on. + +## Going Forward + +/usr on its own filesystem is useful in some custom setups. But instead of expecting the traditional Unix way to (sometimes mindlessly) distributing tools between /usr and /, and require more and more tools to move to /, we now just expect /usr to be pre-mounted from inside the initramfs, to be available before 'init' starts. The duty of the minimal boot system that consisted of /bin, /sbin and /lib on traditional Unix, has been taken over by the initramfs of modern Linux. An initramfs that supports mounting /usr on top of / before it starts 'init', makes all existing setups work properly. + +There is no way to reliably bring up a modern system with an empty /usr. There are two alternatives to fix it: move /usr back to the rootfs or use an initramfs which can hide the split-off from the system. + +On the Fedora distribution we have succeeded to clean up the situation and the confusion the current split between / and /usr has created. We have moved all tools that over time have been moved to / back to /usr (where they belong), and the root file system only contains compatibility symlinks for /bin and /sbin into /usr. All binaries of the system are exclusively located within the /usr hierarchy. 
+ +In this new definition of /usr, the directory can be mounted read-only by default, while the rootfs may be either read-write or read-only (for stateless systems) and contains only the empty mount point directories, compat-symlinks to /usr and the host-specific data like /etc, /root, /srv. In comparison to today's setups, the rootfs will be very small. The host-specific data will be properly separated from the installed operating system. The new /usr could also easily be shared read-only across several systems. Such a setup would be more efficient, can provide additional security, is more flexible to use, provides saner options for custom setups, and is much simpler to setup and maintain. + +For more information on this please continue to [The Case for the /usr Merge](../THE_CASE_FOR_THE_USR_MERGE). diff --git a/docs/SYSLOG.md b/docs/SYSLOG.md new file mode 100644 index 0000000..35c6225 --- /dev/null +++ b/docs/SYSLOG.md @@ -0,0 +1,50 @@ +--- +title: Writing syslog Daemons Which Cooperate Nicely With systemd +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Writing syslog Daemons Which Cooperate Nicely With systemd + +Here are a few notes on things to keep in mind when you work on a classic BSD syslog daemon for Linux, to ensure that your syslog daemon works nicely together with systemd. If your syslog implementation does not follow these rules, then it will not be compatible with systemd v38 and newer. + +A few notes in advance: systemd centralizes all log streams in the Journal daemon. Messages coming in via /dev/log, via the native protocol, via STDOUT/STDERR of all services and via the kernel are received in the journal daemon. The journal daemon then stores them to disk or in RAM (depending on the configuration of the Storage= option in journald.conf), and optionally forwards them to the console, the kernel log buffer, or to a classic BSD syslog daemon -- and that's where you come in. 
+ +Note that it is now the journal that listens on /dev/log, no longer the BSD syslog daemon directly. If your logging daemon wants to get access to all logging data then it should listen on /run/systemd/journal/syslog instead via the syslog.socket unit file that is shipped along with systemd. On a systemd system it is no longer OK to listen on /dev/log directly, and your daemon may not bind to the /run/systemd/journal/syslog socket on its own. If you do that then you will lose logging from STDOUT/STDERR of services (as well as other stuff). + +Your BSD compatible logging service should alias `syslog.service` to itself (i.e. symlink) when it is _enabled_. That way [syslog.socket](http://cgit.freedesktop.org/systemd/systemd/plain/units/syslog.socket) will activate your service when things are logged. Of course, only one implementation of BSD syslog can own that symlink, and hence only one implementation can be enabled at a time, but that's intended as there can only be one process listening on that socket. (see below for details how to manage this symlink.) Note that this means that syslog.socket as shipped with systemd is _shared_ among all implementations, and the implementation that is in control is configured with where syslog.service points to. + +Note that journald tries hard to forward to your BSD syslog daemon as much as it can. That means you will get more than you traditionally got on /dev/log, such as stuff all daemons log on STDOUT/STDERR and the messages that are logged natively to systemd. Also, we will send stuff like the original SCM_CREDENTIALS along if possible. + +(BTW, journald is smart enough not to forward the kernel messages it gets to you, you should read that on your own, directly from /proc/kmsg, as you always did. It's also smart enough never to forward kernel messages back to the kernel, but that probably shouldn't concern you too much...) 
+ +And here are the recommendations: + +- First of all, make sure your syslog daemon installs a native service unit file (SysV scripts are not sufficient!) and is socket activatable. Newer systemd versions (v35+) do not support non-socket-activated syslog daemons anymore and we no longer recommend ordering units after syslog.target. That means that unless your syslog implementation is socket activatable many services will not be able to log to your syslog implementation and early boot messages are lost entirely to your implementation. Note that your service should install only one unit file, and nothing else. Do not install socket unit files. +- Make sure that in your unit file you set StandardOutput=null in the [Service] block. This makes sure that regardless of what the global default for StandardOutput= is the output of your syslog implementation goes to /dev/null. This matters since the default StandardOutput= value for all units can be set to syslog and this should not create a feedback loop with your implementation where the messages your syslog implementation writes out are fed back to it. In other words: you need to explicitly opt out of the default standard output redirection we do for other services. (Also note that you do not need to set StandardError= explicitly, since that inherits the setting of StandardOutput= by default.) +- /proc/kmsg is your property, flush it to disk as soon as you start up. +- Name your service unit after your daemon (e.g. rsyslog.service or syslog-ng.service) and make sure to include Alias=syslog.service in your [Install] section in the unit file. This ensures that the symlink syslog.service is created if your service is enabled and that it points to your service. Also add WantedBy=multi-user.target so that your service gets started at boot, and add Requires=syslog.socket in [Unit] so that you pull in the socket unit. 
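The requirement above, that a syslog daemon is socket activatable and never binds the socket on its own, could be sketched in C roughly as follows. This is a minimal sketch under stated assumptions, not code from any existing syslog implementation: `count_passed_fds()`, `parse_decimal()` and `LISTEN_FDS_START` are hypothetical names chosen for illustration. It relies only on the socket activation convention that systemd sets `$LISTEN_PID` and `$LISTEN_FDS` for the activated process and passes the sockets starting at file descriptor 3. A real daemon would more likely call `sd_listen_fds()` from libsystemd, which implements the same check.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* First passed file descriptor; hypothetical local name for the value that
 * libsystemd exposes as SD_LISTEN_FDS_START. */
#define LISTEN_FDS_START 3

/* Hypothetical helper: parse a complete decimal string into *ret.
 * Returns 0 on success, -1 if the string is not a valid number. */
static int parse_decimal(const char *s, long *ret) {
        char *end;
        long v;

        assert(s);
        assert(ret);

        errno = 0;
        v = strtol(s, &end, 10);
        if (errno != 0 || end == s || *end != '\0')
                return -1;

        *ret = v;
        return 0;
}

/* Hypothetical helper: returns the number of sockets passed in by systemd,
 * 0 if the process was not socket activated (or the variables are meant for
 * a different process), and -1 if the environment is malformed. */
int count_passed_fds(void) {
        const char *e;
        long v;

        e = getenv("LISTEN_PID");
        if (!e)
                return 0;
        if (parse_decimal(e, &v) < 0)
                return -1;
        if (v != (long) getpid())
                return 0; /* variables were inherited, not meant for us */

        e = getenv("LISTEN_FDS");
        if (!e)
                return 0;
        if (parse_decimal(e, &v) < 0 || v < 0 || v > INT_MAX - LISTEN_FDS_START)
                return -1;

        return (int) v;
}
```

A daemon activated through syslog.socket would expect `count_passed_fds()` to return 1 and would then read log datagrams from descriptor `LISTEN_FDS_START`, instead of creating and binding a socket itself.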
+ +Here are a few other recommendations, that are not directly related to systemd: + +- Make sure to read the priority prefixes of the kmsg log messages the same way like from normal userspace syslog messages. When systemd writes to kmsg it will prefix all messages with valid priorities which include standard syslog facility values. OTOH for kernel messages the facility is always 0. If you need to know whether a message originated in the kernel rely on the facility value, not just on the fact that you read the message from /proc/kmsg! A number of userspace applications write messages to kmsg (systemd, udev, dracut, others), and they'll nowadays all set correct facility values. +- When you read a message from the socket use SCM_CREDENTIALS to get information about the client generating it, and possibly patch the message with this data in order to make it impossible for clients to fake identities. + +The unit file you install for your service should look something like this: + +``` +[Unit] +Description=System Logging Service +Requires=syslog.socket + +[Service] +ExecStart=/usr/sbin/syslog-ng -n +StandardOutput=null + +[Install] +Alias=syslog.service +WantedBy=multi-user.target +``` + +And remember: don't ship any socket unit for /dev/log or /run/systemd/journal/syslog (or even make your daemon bind directly to these sockets)! That's already shipped along with systemd for you. diff --git a/docs/SYSTEMD_FILE_HIERARCHY_REQUIREMENTS.md b/docs/SYSTEMD_FILE_HIERARCHY_REQUIREMENTS.md new file mode 100644 index 0000000..574df93 --- /dev/null +++ b/docs/SYSTEMD_FILE_HIERARCHY_REQUIREMENTS.md @@ -0,0 +1,20 @@ +--- +title: systemd File Hierarchy Requirements +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# systemd File Hierarchy Requirements + +There are various attempts to standardize the file system hierarchy of Linux systems. 
In systemd we leave much of the file system layout open to the operating system, but here's what systemd strictly requires: + +- /, /usr, /etc must be mounted when the host systemd is first invoked. This may be achieved either by using the kernel's built-in root disk mounting (in which case /, /usr and /etc need to be on the same file system), or via an initrd, which could mount the three directories from different sources. +- /bin, /sbin, /lib (and /lib64 if applicable) should reside on /, or be symlinks to the /usr file system (recommended). All of them must be available before the host systemd is first executed. +- /var does not have to be mounted when the host systemd is first invoked, however, it must be configured so that it is mounted writable before local-fs.target is reached (for example, by simply listing it in /etc/fstab). +- /tmp is recommended to be a tmpfs (default), but doesn't have to. If configured, it must be mounted before local-fs.target is reached (for example, by listing it in /etc/fstab). +- /dev must exist as an empty mount point and will automatically be mounted by systemd with a devtmpfs. Non-devtmpfs boots are not supported. +- /proc and /sys must exist as empty mount points and will automatically be mounted by systemd with procfs and sysfs. +- /run must exist as an empty mount point and will automatically be mounted by systemd with a tmpfs. + +The other directories usually found in the root directory (such as /home, /boot, /opt) are irrelevant to systemd. If they are defined they may be mounted from any source and at any time, though it is a good idea to mount them also before local-fs.target is reached. 
diff --git a/docs/TEMPORARY_DIRECTORIES.md b/docs/TEMPORARY_DIRECTORIES.md new file mode 100644 index 0000000..bc9cb7b --- /dev/null +++ b/docs/TEMPORARY_DIRECTORIES.md @@ -0,0 +1,220 @@ +--- +title: Using /tmp/ and /var/tmp/ Safely +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Using `/tmp/` and `/var/tmp/` Safely + +`/tmp/` and `/var/tmp/` are two world-writable directories Linux systems +provide for temporary files. The former is typically on `tmpfs` and thus +backed by RAM/swap, and flushed out on each reboot. The latter is typically a +proper, persistent file system, and thus backed by physical storage. This +means: + +1. `/tmp/` should be used for smaller, size-bounded files only; `/var/tmp/` + should be used for everything else. + +2. Data that shall survive a boot cycle shouldn't be placed in `/tmp/`. + +If the `$TMPDIR` environment variable is set, use that path, and neither use +`/tmp/` nor `/var/tmp/` directly. + +See +[file-hierarchy(7)](https://www.freedesktop.org/software/systemd/man/file-hierarchy.html) +for details about these two (and most other) directories of a Linux system. + +## Common Namespace + +Note that `/tmp/` and `/var/tmp/` each define a common namespace shared by all +local software. This means guessable file or directory names below either +directory directly translate into a 🚨 Denial-of-Service (DoS) 🚨 vulnerability +or worse: if some software creates a file or directory `/tmp/foo` then any +other software that wants to create the same file or directory `/tmp/foo` +either will fail (as the file already exists) or might be tricked into using +untrusted files. Hence: do not use guessable names in `/tmp/` or `/var/tmp/` — +if you do you open yourself up to a local DoS exploit or worse. (You can get +away with using guessable names, if you pre-create subdirectories below `/tmp/` +for them, like X11 does with `/tmp/.X11-unix/` through `tmpfiles.d/` +drop-ins. 
However this is not recommended, as it is fully safe only if these +directories are pre-created during early boot, and thus problematic if package +installation during runtime is permitted.) + +To protect yourself against these kinds of attacks Linux provides a couple of +APIs that help you avoiding guessable names. Specifically: + +1. Use [`mkstemp()`](https://man7.org/linux/man-pages/man3/mkstemp.3.html) + (POSIX), `mkostemp()` (glibc), + [`mkdtemp()`](https://man7.org/linux/man-pages/man3/mkdtemp.3.html) (POSIX), + [`tmpfile()`](https://man7.org/linux/man-pages/man3/tmpfile.3.html) (C89) + +2. Use [`open()`](https://man7.org/linux/man-pages/man2/open.2.html) with + `O_TMPFILE` (Linux) + +3. [`memfd_create()`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) + (Linux; this doesn't bother with `/tmp/` or `/var/tmp/` at all, but uses the + same RAM/swap backing as `tmpfs` uses, hence is very similar to `/tmp/` + semantics.) + +For system services systemd provides the `PrivateTmp=` boolean setting. If +turned on for a service (👍 which is highly recommended), `/tmp/` and +`/var/tmp/` are replaced by private sub-directories, implemented through Linux +file system namespacing and bind mounts. This means from the service's point of +view `/tmp/` and `/var/tmp/` look and behave like they normally do, but in +reality they are private sub-directories of the host's real `/tmp/` and +`/var/tmp/`, and thus not system-wide locations anymore, but service-specific +ones. This reduces the surface for local DoS attacks substantially. While it is +recommended to turn this option on, it's highly recommended for applications +not to rely on this solely to avoid DoS vulnerabilities, because this option is +not available in environments where file system namespaces are prohibited, for +example in certain container environments. This option is hence an extra line +of defense, but should not be used as an excuse to rely on guessable names in +`/tmp/` and `/var/tmp/`. 
When this option is used, the per-service temporary +directories are removed whenever the service shuts down, hence the lifecycle of +temporary files stored in it is substantially different from the case where +this option is not used. Also note that some applications use `/tmp/` and +`/var/tmp/` for sharing files and directories. If this option is turned on this +is not possible anymore as after all each service gets its own instances of +both directories. + +## Automatic Clean-Up + +By default, `systemd-tmpfiles` will apply a concept of ⚠️ "ageing" to all files +and directories stored in `/tmp/` and `/var/tmp/`. This means that files that +have neither been changed nor read within a specific time frame are +automatically removed in regular intervals. (This concept is not new to +`systemd-tmpfiles`, it's inherited from previous subsystems such as +`tmpwatch`.) By default files in `/tmp/` are cleaned up after 10 days, and +those in `/var/tmp` after 30 days. + +This automatic clean-up is important to ensure disk usage of these temporary +directories doesn't grow without bounds, even when programs abort unexpectedly +or otherwise don't clean up the temporary files/directories they create. On the +other hand it creates problems for long-running software that does not expect +temporary files it operates on to be suddenly removed. There are a couple of +strategies to avoid these issues: + +1. Make sure to always keep a file descriptor to the temporary files you + operate on open, and only access the files through them. This way it doesn't + matter whether the files have been unlinked from the file system: as long as + you have the file descriptor open you can still access the file for both + reading and writing. When operating this way it is recommended to delete the + files right after creating them to ensure that on unexpected program + termination the files or directories are implicitly released by the kernel. + +2. 🥇 Use `memfd_create()` or `O_TMPFILE`. 
This is an extension of the + suggestion above: files created this way are never linked under a filename + in the file system. This means they are not subject to ageing (as they come + unlinked out of the box), and there's no time window where a directory entry + for the file exists in the file system, and thus behaviour is fully robust + towards unexpected program termination as there are never files on disk that + need to be explicitly deleted. + +3. 🥇 Take an exclusive or shared BSD file lock ([`flock()`]( + https://man7.org/linux/man-pages/man2/flock.2.html)) on files and directories + you don't want to be removed. This is particularly interesting when operating + on more than a single file, or on file nodes that are not plain regular files, + for example when extracting a tarball to a temporary directory. The ageing + algorithm will skip all directories (and everything below them) and files that + are locked through a BSD file lock. As BSD file locks are automatically released + when the file descriptor they are taken on is closed, and all file + descriptors opened by a process are implicitly closed when it exits, this is + a robust mechanism that ensures all temporary files are subject to ageing + when the program that owns them dies, but not while it is still running. Use + this when decompressing tarballs that contain files with old + modification/access times, as extracted files are otherwise immediately + candidates for deletion by the ageing algorithm. The + [`flock`](https://man7.org/linux/man-pages/man1/flock.1.html) tool of the + `util-linux` packages makes this concept available to shell scripts. + +4. Keep the access time of all temporary files created current. In regular + intervals, use `utimensat()` or a related call to update the access time + ("atime") of all files that shall be kept around. 
Since the ageing algorithm + looks at the access time of files when deciding whether to delete them, it's + sufficient to update their access times in sufficiently frequent intervals to + ensure the files are not deleted. Since most applications (and tools such as + `ls`) primarily care about the modification time (rather than the access time) + using the access time for this purpose should be acceptable. + +5. Set the "sticky" bit on regular files. The ageing logic skips deletion of + all regular files that have the sticky bit (`chmod +t`) set. This is + honoured for regular files only however, and has no effect on directories as + the sticky bit has a different meaning for them. + +6. Don't use `/tmp/` or `/var/tmp/`, but use your own sub-directory under + `/run/` or `$XDG_RUNTIME_DIR` (the former if privileged, the latter if + unprivileged), or `/var/lib/` and `~/.config/` (similar, but with + persistency and suitable for larger data). The two temporary directories + `/tmp/` and `/var/tmp/` come with the implicit clean-up semantics described + above. When this is not desired, it's possible to create private per-package + runtime or state directories, and place all temporary files there. However, + do note that this means opting out of any kind of automatic clean-up, and it + is hence particularly essential that the program cleans up generated files + in these directories when they are no longer needed, in particular when the + program dies unexpectedly. Note: this strategy is only really suitable for + packages that operate in a "system wide singleton" fashion with "long" + persistence of their data or state, i.e. as opposed to programs that run in + multiple parallel or short-living instances. This is because a private + directory under `/run` (and the other mentioned directories) is itself a + system and package specific singleton with greater longevity. + +7. 
Exclude your temporary files from clean-ups via a `tmpfiles.d/` drop-in + (which includes drop-ins in the runtime-only directory + `/run/tmpfiles.d/`). The `x`/`X` line types may be used to exclude files + matching the specified globbing patterns from the ageing logic. If this is + used, automatic clean-up is not done for matching files and directory, and + much like with the previous option it's hence essential that the program + generating these temporary files carefully removes the temporary files it + creates again, and in particular so if it dies unexpectedly. + +🥇 The semantics of options 2 (in case you only deal with temporary files, not +directories) and 3 (in case you deal with both) in the list above are in most +cases the most preferable. It is thus recommended to stick to these two +options. + +While the ageing logic is very useful as a safety concept to ensure unused +files and directories are eventually removed a well written program avoids even +creating files that need such a clean-up. In particular: + +1. Use `memfd_create()` or `O_TMPFILE` when creating temporary files. + +2. `unlink()` temporary files right after creating them. This is very similar + to `O_TMPFILE` behaviour: consider deleting temporary files right after + creating them, while keeping open a file descriptor to them. Unlike + `O_TMPFILE` this method also works on older Linux systems and other OSes + that do not implement `O_TMPFILE`. + +## Disk Quota + +Generally, files allocated from `/tmp/` and `/var/tmp/` are allocated from a +pool shared by all local users. Moreover the space available in `/tmp/` is +generally more restricted than `/var/tmp/`. This means, that in particular in +`/tmp/` space should be considered scarce, and programs need to be prepared +that no space is available. Essential programs might require a fallback logic +using a different location for storing temporary files hence. 
Non-essential +programs at least need to be prepared for `ENOSPC` errors and generate useful, +actionable error messages. + +Some setups employ per-user quota on `/var/tmp/` and possibly `/tmp/`, to make +`ENOSPC` situations less likely, and harder to trigger from unprivileged +users. However, in the general case no such per-user quota is implemented +though, in particular not when `tmpfs` is used as backing file system, because +— even today — `tmpfs` still provides no native quota support in the kernel. + +## Early Boot Considerations + +Both `/tmp/` and `/var/tmp/` are not necessarily available during early boot, +or — if they are available early — are not writable. This means software that +is intended to run during early boot (i.e. before `basic.target` — or more +specifically `local-fs.target` — is up) should not attempt to make use of +either. Interfaces such as `memfd_create()` or files below a package-specific +directory in `/run/` are much better options in this case. (Note that some +packages instead use `/dev/shm/` for temporary files during early boot; this is +not advisable however, as it offers no benefits over a private directory in +`/run/` as both are backed by the same concept: `tmpfs`. The directory +`/dev/shm/` exists to back POSIX shared memory (see +[`shm_open()`](https://man7.org/linux/man-pages/man3/shm_open.3.html) and +related calls), and not as a place for temporary files. `/dev/shm` is +problematic as it is world-writable and there's no automatic clean-up logic in +place.) 
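The earlier advice to `unlink()` temporary files right after creating them can also be approximated from a shell script. The following is a minimal sketch, assuming bash and a mounted Linux `/proc`; the helper name `with_anonymous_tmpfile` and the use of file descriptor 3 are illustrative choices, not interfaces provided by systemd.

```bash
#!/usr/bin/env bash
# Sketch: create a temporary file, keep it open on fd 3, and unlink it at once.
# with_anonymous_tmpfile is a hypothetical helper name, not a systemd interface.
with_anonymous_tmpfile() {
    local path
    # mktemp honours $TMPDIR and falls back to /tmp.
    path=$(mktemp) || { echo "cannot create temporary file (no space left?)" >&2; return 1; }
    exec 3<>"$path"      # open the file read/write on fd 3
    rm -f -- "$path"     # unlink immediately; the data lives on until fd 3 is closed
    printf '%s\n' "$1" >&3
    # Re-open the unlinked file through /proc to read it from the start (Linux-specific).
    cat /proc/self/fd/3
    exec 3>&-            # closing the descriptor releases the storage
    [ ! -e "$path" ]     # nothing is left behind for the ageing logic to clean up
}

with_anonymous_tmpfile "scratch data"
```

Because the name is removed before any data is written, a crash of the script cannot leave a stale file behind, which is the property the list above asks for.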
diff --git a/docs/TESTING_WITH_SANITIZERS.md b/docs/TESTING_WITH_SANITIZERS.md
new file mode 100644
index 0000000..39920c6
--- /dev/null
+++ b/docs/TESTING_WITH_SANITIZERS.md
@@ -0,0 +1,106 @@
+---
+title: Testing systemd Using Sanitizers
+category: Contributing
+layout: default
+SPDX-License-Identifier: LGPL-2.1-or-later
+---
+
+# Testing systemd Using Sanitizers
+
+To catch the *nastier* kind of bugs, you can run your code with [Address Sanitizer](https://clang.llvm.org/docs/AddressSanitizer.html)
+and [Undefined Behavior Sanitizer](https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html).
+This is mostly done automagically by various CI systems for each PR, but you may
+want to do it locally as well. The process varies slightly depending on the
+compiler you want to use and which part of the test suite you want to run.
+
+## mkosi
+
+To build with sanitizers in mkosi, create a file `mkosi.local.conf` and add the following contents:
+
+```
+[Content]
+Environment=SANITIZERS=address,undefined
+```
+
+The value of `SANITIZERS` is passed directly to meson's `b_sanitize` option. See
+https://mesonbuild.com/Builtin-options.html#base-options for the format expected by the option. Currently,
+only the sanitizers supported by gcc can be used, which are `address` and `undefined`.
+
+Note that this will only work with a recent version of mkosi (>= 14 or by running mkosi directly from source).
+
+## gcc
+gcc compiles in sanitizer libraries dynamically by default, so you need to get
+the shared libraries first - on Fedora these are shipped as separate packages
+(`libasan` for Address Sanitizer and `libubsan` for Undefined Behavior Sanitizer).
+
+The compilation itself is then a matter of simply adding `-Db_sanitize=address,undefined`
+to `meson`. That's it - subsequent runs of `meson test` and of the integration tests
+under the `test/` subdirectory will run with sanitizers enabled.
However, to get
+truly useful results, you should tweak the runtime configuration of the respective
+sanitizers; e.g. in systemd we set the following environment variables:
+
+```bash
+ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1
+UBSAN_OPTIONS=print_stacktrace=1:print_summary=1:halt_on_error=1
+```
+## clang
+In case of clang things are somewhat different - the sanitizer libraries are
+compiled in statically by default. This is not an issue if you plan to run
+only the unit tests, but for integration tests you'll need to convince clang
+to use the dynamic versions of the sanitizer libraries.
+
+First of all, pass `-shared-libsan` to both `clang` and `clang++`:
+
+```bash
+CFLAGS=-shared-libsan
+CXXFLAGS=-shared-libsan
+```
+
+The `CXXFLAGS` are necessary for `src/libsystemd/sd-bus/test-bus-vtable-cc.c`. Compilation
+is then the same as in case of gcc: simply add `-Db_sanitize=address,undefined`
+to the `meson` call and use the same environment variables for runtime configuration.
+
+```bash
+ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1
+UBSAN_OPTIONS=print_stacktrace=1:print_summary=1:halt_on_error=1
+```
+
+After this, you'll probably notice that all compiled binaries complain about
+a missing `libclang_rt.asan*` library. To fix this, you have to install clang's
+runtime libraries, usually shipped in the `compiler-rt` package. As these libraries
+are installed in a non-standard location (non-standard for `ldconfig`), you'll
+need to manually direct binaries to the respective runtime libraries.
+ +``` +# Optionally locate the respective runtime DSO +$ ldd build/systemd | grep libclang_rt.asan + libclang_rt.asan-x86_64.so => not found + libclang_rt.asan-x86_64.so => not found +$ find /usr/lib* /usr/local/lib* -type f -name libclang_rt.asan-x86_64.so 2>/dev/null +/usr/lib64/clang/7.0.1/lib/libclang_rt.asan-x86_64.so + +# Set the LD_LIBRARY_PATH accordingly +export LD_LIBRARY_PATH=/usr/lib64/clang/7.0.1/lib/ + +# If the path is correct, the "not found" message should change to an actual path +$ ldd build/systemd | grep libclang_rt.asan + libclang_rt.asan-x86_64.so => /usr/lib64/clang/7.0.1/lib/libclang_rt.asan-x86_64.so (0x00007fa9752fc000) +``` + +This should help binaries to correctly find necessary sanitizer DSOs. + +Also, to make the reports useful, `llvm-symbolizer` tool is required (usually +part of the `llvm` package). + +## Background notes +The reason why you need to force dynamic linking in case of `clang` is that some +applications make use of `libsystemd`, which is compiled with sanitizers as well. +However, if a *standard* (uninstrumented) application loads an instrumented library, +it will immediately fail due to unresolved symbols. To fix/workaround this, you +need to pre-load the ASan DSO using `LD_PRELOAD=/path/to/asan/dso`, which will +make things work as expected in most cases. This will, obviously, not work with +statically linked sanitizer libraries. + +These shenanigans are performed automatically when running the integration test +suite (i.e. `test/TEST-??-*`) and are located in `test/test-functions` (mainly, +but not only, in the `create_asan_wrapper` function). 
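The pre-loading step described in the background notes can be sketched as a small wrapper. This is only an illustration under stated assumptions: `run_with_asan` is a hypothetical helper name, it is not the `create_asan_wrapper` function from `test/test-functions`, and the path of the ASan DSO has to be supplied by the caller.

```bash
#!/usr/bin/env bash
# Sketch only: run_with_asan is a hypothetical helper, not part of systemd's test suite.
# Usage: run_with_asan /path/to/asan/dso command [args...]
run_with_asan() {
    local dso=$1
    shift
    if [ ! -r "$dso" ]; then
        echo "ASan runtime not found: $dso" >&2
        return 1
    fi
    # Pre-load the sanitizer runtime so that an uninstrumented binary can load
    # instrumented libraries, and apply the runtime options listed earlier.
    LD_PRELOAD="$dso" \
    ASAN_OPTIONS=strict_string_checks=1:detect_stack_use_after_return=1:check_initialization_order=1:strict_init_order=1 \
    UBSAN_OPTIONS=print_stacktrace=1:print_summary=1:halt_on_error=1 \
        "$@"
}
```

A call might look like `run_with_asan /usr/lib64/clang/7.0.1/lib/libclang_rt.asan-x86_64.so some-tool`, where the DSO path is the one located with `find` above and `some-tool` stands for any uninstrumented program that loads `libsystemd`. The variables are set for that one command only, so the rest of the shell session stays unaffected.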
diff --git a/docs/THE_CASE_FOR_THE_USR_MERGE.md b/docs/THE_CASE_FOR_THE_USR_MERGE.md new file mode 100644 index 0000000..2cdb6db --- /dev/null +++ b/docs/THE_CASE_FOR_THE_USR_MERGE.md @@ -0,0 +1,115 @@ +--- +title: The Case for the /usr Merge +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# The Case for the /usr Merge + +**Why the /usr Merge Makes Sense for Compatibility Reasons** + +_This is based on the [Fedora feature](https://fedoraproject.org/wiki/Features/UsrMove) for the same topic, put together by Harald Hoyer and Kay Sievers. This feature has been implemented successfully in Fedora 17._ + +Note that this page discusses a topic that is actually independent of systemd. systemd supports both systems with split and with merged /usr, and the /usr merge also makes sense for systemd-less systems. That said we want to encourage distributions adopting systemd to also adopt the /usr merge. + +### What's Being Discussed Here? + +Fedora (and other distributions) have finished work on getting rid of the separation of /bin and /usr/bin, as well as /sbin and /usr/sbin, /lib and /usr/lib, and /lib64 and /usr/lib64. All files from the directories in / will be merged into their respective counterparts in /usr, and symlinks for the old directories will be created instead: + +``` +/bin → /usr/bin +/sbin → /usr/sbin +/lib → /usr/lib +/lib64 → /usr/lib64 +``` + +You are wondering why merging /bin, /sbin, /lib, /lib64 into their respective counterparts in /usr makes sense, and why distributions are pushing for it? You are wondering whether your own distribution should adopt the same change? Here are a few answers to these questions, with an emphasis on a compatibility point of view: + +### Compatibility: The Gist of It + +- Improved compatibility with other Unixes/Linuxes in _behavior_: After the /usr merge all binaries become available in both /bin and /usr/bin, resp. 
both /sbin and /usr/sbin (simply because /bin becomes a symlink to /usr/bin, resp. /sbin to /usr/sbin). That means scripts/programs written for other Unixes or other Linuxes and ported to your distribution will no longer need fixing for the file system paths of the binaries called, which is otherwise a major source of frustration. /usr/bin and /bin (resp. /usr/sbin and /sbin) become entirely equivalent. +- Improved compatibility with other Unixes (in particular Solaris) in _appearance_: The primary commercial Unix implementation is nowadays Oracle Solaris. Solaris has already completed the same /usr merge in Solaris 11. By making the same change in Linux we minimize the difference towards the primary Unix implementation, thus easing portability from Solaris. +- Improved compatibility with GNU build systems: The biggest part of Linux software is built with GNU autoconf/automake (i.e. GNU autotools), which are unaware of the Linux-specific /usr split. Maintaining the /usr split requires non-trivial project-specific handling in the upstream build system, and in your distribution's packages. With the /usr merge, this work becomes unnecessary and porting packages to Linux becomes simpler. +- Improved compatibility with current upstream development: In order to minimize the delta from your Linux distribution to upstream development the /usr merge is key. + +### Compatibility: The Longer Version + +A unified filesystem layout (as it results from the /usr merge) is more compatible with UNIX than Linux’ traditional split of /bin vs. /usr/bin. Unixes differ in where individual tools are installed, their locations in many cases are not defined at all and differ in the various Linux distributions. The /usr merge removes this difference in its entirety, and provides full compatibility with the locations of tools of any Unix via the symlink from /bin to /usr/bin. 
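Whether a given system already uses this layout can be checked from a shell. The following is a minimal sketch, assuming bash; `is_usr_merged` is a hypothetical helper name, and `/lib64` is left out because it does not exist on every architecture.

```bash
#!/usr/bin/env bash
# Sketch: report whether the tree below ROOT (default: /) uses the merged-/usr layout.
# is_usr_merged is a hypothetical helper name, not a systemd or distribution tool.
is_usr_merged() {
    local root=${1:-} dir
    for dir in bin sbin lib; do
        # After the merge each top-level directory is a symlink ...
        [ -L "$root/$dir" ] || return 1
        # ... that resolves to the same directory as its counterpart in /usr.
        [ "$root/$dir" -ef "$root/usr/$dir" ] || return 1
    done
}

if is_usr_merged; then
    echo "merged /usr: /bin/foo and /usr/bin/foo name the same file"
else
    echo "split /usr"
fi
```

The `-ef` test compares device and inode numbers after following symlinks, so it does not matter whether the symlinks are relative or absolute.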
+ +#### Example + +- /usr/bin/foo may be called by other tools either via /usr/bin/foo or /bin/foo, both paths become fully equivalent through the /usr merge. The operating system ends up executing exactly the same file, simply because the symlink /bin just redirects the invocation to /usr/bin. + +The historical justification for a /bin, /sbin and /lib separate from /usr no longer applies today. ([More on the historical justification for the split](http://lists.busybox.net/pipermail/busybox/2010-December/074114.html), by Rob Landley) They were split off to have selected tools on a faster hard disk (which was small, because it was more expensive) and to contain all the tools necessary to mount the slower /usr partition. Today, a separate /usr partition already must be mounted by the initramfs during early boot, thus making the justification for a split-off moot. In addition a lot of tools in /bin and /sbin in the status quo already lost the ability to run without a pre-mounted /usr. There is no valid reason anymore to have the operating system spread over multiple hierarchies, it lost its purpose. + +Solaris implemented the core part of the /usr merge 15 years ago already, and completed it with the introduction of Solaris 11. Solaris has /bin and /sbin only as symlinks in the root file system, the same way as you will have after the /usr merge: [Transitioning From Oracle Solaris 10 to Oracle Solaris 11 - User Environment Feature Changes](http://docs.oracle.com/cd/E23824_01/html/E24456/userenv-1.html). + +Not implementing the /usr merge in your distribution will isolate it from upstream development. It will make porting of packages needlessly difficult, because packagers need to split up installed files into multiple directories and hard code different locations for tools; both will cause unnecessary incompatibilities. Several Linux distributions are agreeing with the benefits of the /usr merge and are already in the process to implement the /usr merge. 
This means that upstream projects will adapt quickly to the change, thus making portability to your distribution harder.
+
+### Beyond Compatibility
+
+One major benefit of the /usr merge is the reduction of complexity of our system: the new file system hierarchy becomes much simpler, and the separation between (read-only, potentially even immutable) vendor-supplied OS resources and user resources becomes much cleaner. As a result of the reduced complexity of the hierarchy, packaging becomes much simpler too, since the problems of handling the split in the .spec files go away.
+
+The merged directory /usr, containing almost all of the vendor-supplied operating system resources, offers us a number of new features regarding OS snapshotting and options for enterprise environments for network sharing or running multiple guests on one host. Static vendor-supplied OS resources are consolidated in a single location that can be made read-only easily, either for the whole system or individually for each service. Most of this is much harder to accomplish, or even impossible, with the current arbitrary split of tools across multiple directories.
+
+_With all vendor-supplied OS resources in a single directory /usr they may be shared atomically, snapshots of them become atomic, and the file system may be made read-only as a single unit._
+
+#### Example: /usr Network Share
+
+- With the merged /usr directory we can offer a read-only export of the vendor-supplied OS to the network, which will contain almost all of the operating system resources. The client hosts will then only need a minimal host-specific root filesystem with symlinks pointing into the shared /usr filesystem. From a maintenance perspective this is the first time that sharing the operating system over the network starts to make sense.
Without the merged /usr directory (like in traditional Linux) we can only share parts of the OS at a time, but not the core components of it that are located in the root file system. The host-specific root filesystem hence traditionally needs to be much larger, cannot be shared among client hosts and needs to be updated individually and often. Vendor-supplied OS resources traditionally ended up "leaking" into the host-specific root file system. + +#### Example: Multiple Guest Operating Systems on the Same Host + +- With the merged /usr directory, we can offer to share /usr read-only with multiple guest operating systems, which will shrink down the guest file system to a couple of MB. The ratio of the per-guest host-only part vs. the shared operating system becomes nearly optimal. + In the long run the maintenance burden resulting of the split-up tools in your distribution, and hard-coded deviating installation locations to distribute binaries and other packaged files into multiple hierarchies will very likely cause more trouble than the /usr merge itself will cause. + +## Myths and Facts + +**Myth #1**: Fedora is the first OS to implement the /usr merge + +**Fact**: Oracle Solaris has implemented the /usr merge in parts 15 years ago, and completed it in Solaris 11. Fedora is following suit here, it is not the pioneer. + +**Myth #2**: Fedora is the only Linux distribution to implement the /usr merge + +**Fact**: Multiple other Linux distributions have been working in a similar direction. + +**Myth #3**: The /usr merge decreases compatibility with other Unixes/Linuxes + +**Fact**: By providing all binary tools in /usr/bin as well as in /bin (resp. /usr/sbin + /sbin) compatibility with hard coded binary paths in scripts is increased. When a distro A installs a tool “foo” in /usr/bin, and distro B installs it in /bin, then we’ll provide it in both, thus creating compatibility with both A and B. 
+
+**Myth #4**: The /usr merge’s only purpose is to look pretty, and has no other benefits
+
+**Fact**: The /usr merge makes sharing the vendor-supplied OS resources between a host and networked clients as well as a host and local light-weight containers easier and atomic. Snapshotting the OS becomes a viable option. The /usr merge also allows making the entire vendor-supplied OS resources read-only for increased security and robustness.
+
+**Myth #5**: Adopting the /usr merge in your distribution means additional work for your distribution's package maintainers
+
+**Fact**: When the merge is implemented in other distributions and upstream, not adopting the /usr merge in your distribution means more work; adopting it is cheap.
+
+**Myth #6**: A split /usr is Unix “standard”, and a merged /usr would be Linux-specific
+
+**Fact**: On SysV Unix /bin traditionally has been a symlink to /usr/bin. A non-symlinked version of that directory is specific to non-SysV Unix and Linux.
+
+**Myth #7**: After the /usr merge one can no longer mount /usr read-only, as is common practice in many areas.
+
+**Fact**: Au contraire! One of the reasons we are actually doing this is to make a read-only /usr more thorough: the entire vendor-supplied OS resources can be made read-only, i.e. all of what traditionally was stored in /bin, /sbin, /lib on top of what is already in /usr.
+
+**Myth #8**: The /usr merge will break my old installation which has /usr on a separate partition.
+
+**Fact**: This is perfectly well supported, and one of the reasons we are actually doing this is to make placing /usr on a separate partition more thorough. What changes is simply that you need to boot with an initrd that mounts /usr before jumping into the root file system. Most distributions rely on initrds anyway, so effectively little changes.
+
+**Myth #9**: The /usr split is useful to have a minimal rescue system on the root file system, and the rest of the OS on /usr.
+ +**Fact**: On Fedora the root directory contains ~450MB already. This hasn't been minimal since a long time, and due to today's complex storage and networking technologies it's unrealistic to ever reduce this again. In fact, since the introduction of initrds to Linux the initrd took over the role as minimal rescue system that requires only a working boot loader to be started, but not a full file system. + +**Myth #10**: The status quo of a split /usr with mounting it without initrd is perfectly well supported right now and works. + +**Fact**: A split /usr without involvement of an initrd mounting it before jumping into the root file system [hasn't worked correctly since a long time](http://freedesktop.org/wiki/Software/systemd/separate-usr-is-broken). + +**Myth #11**: Instead of merging / into /usr it would make a lot more sense to merge /usr into /. + +**Fact**: This would make the separation between vendor-supplied OS resources and machine-specific even worse, thus making OS snapshots and network/container sharing of it much harder and non-atomic, and clutter the root file system with a multitude of new directories. + +--- + +If this page didn't answer your questions you may continue reading [on the Fedora feature page](https://fedoraproject.org/wiki/Features/UsrMove) and this [mail from Lennart](http://thread.gmane.org/gmane.linux.redhat.fedora.devel/155511/focus=155792). diff --git a/docs/TIPS_AND_TRICKS.md b/docs/TIPS_AND_TRICKS.md new file mode 100644 index 0000000..f181f12 --- /dev/null +++ b/docs/TIPS_AND_TRICKS.md @@ -0,0 +1,185 @@ +--- +title: Tips And Tricks +category: Manuals and Documentation for Users and Administrators +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Tips & Tricks + +Also check out the [Frequently Asked Questions](http://www.freedesktop.org/wiki/Software/systemd/FrequentlyAskedQuestions)! 
+ +## Listing running services + +```sh +$ systemctl +UNIT LOAD ACTIVE SUB JOB DESCRIPTION +accounts-daemon.service loaded active running Accounts Service +atd.service loaded active running Job spooling tools +avahi-daemon.service loaded active running Avahi mDNS/DNS-SD Stack +bluetooth.service loaded active running Bluetooth Manager +colord-sane.service loaded active running Daemon for monitoring attached scanners and registering them with colord +colord.service loaded active running Manage, Install and Generate Color Profiles +crond.service loaded active running Command Scheduler +cups.service loaded active running CUPS Printing Service +dbus.service loaded active running D-Bus System Message Bus +... +``` + +## Showing runtime status + +```sh +$ systemctl status udisks2.service +udisks2.service - Storage Daemon + Loaded: loaded (/usr/lib/systemd/system/udisks2.service; static) + Active: active (running) since Wed, 27 Jun 2012 20:49:25 +0200; 1 day and 1h ago + Main PID: 615 (udisksd) + CGroup: name=systemd:/system/udisks2.service + └ 615 /usr/lib/udisks2/udisksd --no-debug + +Jun 27 20:49:25 epsilon udisksd[615]: udisks daemon version 1.94.0 starting +Jun 27 20:49:25 epsilon udisksd[615]: Acquired the name org.freedesktop.UDisks2 on the system message bus +``` + +## cgroup tree + +```sh +$ systemd-cgls +└ system +├ 1 /usr/lib/systemd/systemd --system --deserialize 18 +├ ntpd.service +│ └ 8471 /usr/sbin/ntpd -u ntp:ntp -g +├ upower.service +│ └ 798 /usr/libexec/upowerd +├ wpa_supplicant.service +│ └ 751 /usr/sbin/wpa_supplicant -u -f /var/log/wpa_supplicant.log -c /etc/wpa_supplicant/wpa_supplicant.conf -u -f /var/log/wpa_supplicant.log -P /var/run/wpa_supplicant.pid +├ nfs-idmap.service +│ └ 731 /usr/sbin/rpc.idmapd +├ nfs-rquotad.service +│ └ 753 /usr/sbin/rpc.rquotad +├ nfs-mountd.service +│ └ 732 /usr/sbin/rpc.mountd +├ nfs-lock.service +│ └ 704 /sbin/rpc.statd +├ rpcbind.service +│ └ 680 /sbin/rpcbind -w +├ postfix.service +│ ├ 859 
/usr/libexec/postfix/master +│ ├ 877 qmgr -l -t fifo -u +│ └ 32271 pickup -l -t fifo -u +├ colord-sane.service +│ └ 647 /usr/libexec/colord-sane +├ udisks2.service +│ └ 615 /usr/lib/udisks2/udisksd --no-debug +├ colord.service +│ └ 607 /usr/libexec/colord +├ prefdm.service +│ ├ 567 /usr/sbin/gdm-binary -nodaemon +│ ├ 602 /usr/libexec/gdm-simple-slave --display-id /org/gnome/DisplayManager/Display1 +│ ├ 612 /usr/bin/Xorg :0 -br -verbose -auth /var/run/gdm/auth-for-gdm-O00GPA/database -seat seat0 -nolisten tcp +│ └ 905 gdm-session-worker [pam/gdm-password] +├ systemd-ask-password-wall.service +│ └ 645 /usr/bin/systemd-tty-ask-password-agent --wall +├ atd.service +│ └ 544 /usr/sbin/atd -f +├ ksmtuned.service +│ ├ 548 /bin/bash /usr/sbin/ksmtuned +│ └ 1092 sleep 60 +├ dbus.service +│ ├ 586 /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation +│ ├ 601 /usr/libexec/polkit-1/polkitd --no-debug +│ └ 657 /usr/sbin/modem-manager +├ cups.service +│ └ 508 /usr/sbin/cupsd -f +├ avahi-daemon.service +│ ├ 506 avahi-daemon: running [epsilon.local] +│ └ 516 avahi-daemon: chroot helper +├ system-setup-keyboard.service +│ └ 504 /usr/bin/system-setup-keyboard +├ accounts-daemon.service +│ └ 502 /usr/libexec/accounts-daemon +├ systemd-logind.service +│ └ 498 /usr/lib/systemd/systemd-logind +├ crond.service +│ └ 486 /usr/sbin/crond -n +├ NetworkManager.service +│ ├ 484 /usr/sbin/NetworkManager --no-daemon +│ └ 8437 /sbin/dhclient -d -4 -sf /usr/libexec/nm-dhcp-client.action -pf /var/run/dhclient-wlan0.pid -lf /var/lib/dhclient/dhclient-903b6f6aa7a1-46c8-82a9-7f637dfbb3e4-wlan0.lease -cf /var/run/nm-d... +├ libvirtd.service +│ ├ 480 /usr/sbin/libvirtd +│ └ 571 /sbin/dnsmasq --strict-order --bind-interfaces --pid-file=/var/run/libvirt/network/default.pid --conf-file= --except-interface lo --listenaddress 192.168.122.1 --dhcp-range 192.168.122.2,1... 
+├ bluetooth.service +│ └ 479 /usr/sbin/bluetoothd -n +├ systemd-udev.service +│ └ 287 /usr/lib/systemd/systemd-udevd +└ systemd-journald.service +└ 280 /usr/lib/systemd/systemd-journald +``` + +### ps with cgroups + +```sh +$ alias psc='ps xawf -eo pid,user,cgroup,args' +$ psc + PID USER CGROUP COMMAND +... + 1 root name=systemd:/systemd-1 /bin/systemd systemd.log_target=kmsg systemd.log_level=debug selinux=0 + 415 root name=systemd:/systemd-1/sysinit.service /sbin/udevd -d + 928 root name=systemd:/systemd-1/atd.service /usr/sbin/atd -f + 930 root name=systemd:/systemd-1/ntpd.service /usr/sbin/ntpd -n + 932 root name=systemd:/systemd-1/crond.service /usr/sbin/crond -n + 935 root name=systemd:/systemd-1/auditd.service /sbin/auditd -n + 943 root name=systemd:/systemd-1/auditd.service \_ /sbin/audispd + 964 root name=systemd:/systemd-1/auditd.service \_ /usr/sbin/sedispatch + 937 root name=systemd:/systemd-1/acpid.service /usr/sbin/acpid -f + 941 rpc name=systemd:/systemd-1/rpcbind.service /sbin/rpcbind -f + 944 root name=systemd:/systemd-1/rsyslog.service /sbin/rsyslogd -n -c 4 + 947 root name=systemd:/systemd-1/systemd-logger.service /lib/systemd/systemd-logger + 950 root name=systemd:/systemd-1/cups.service /usr/sbin/cupsd -f + 955 dbus name=systemd:/systemd-1/messagebus.service /bin/dbus-daemon --system --address=systemd: --nofork --systemd-activation + 969 root name=systemd:/systemd-1/getty@.service/tty6 /sbin/mingetty tty6 + 970 root name=systemd:/systemd-1/getty@.service/tty5 /sbin/mingetty tty5 + 971 root name=systemd:/systemd-1/getty@.service/tty1 /sbin/mingetty tty1 + 973 root name=systemd:/systemd-1/getty@.service/tty4 /sbin/mingetty tty4 + 974 root name=systemd:/user/lennart/2 login -- lennart + 1824 lennart name=systemd:/user/lennart/2 \_ -bash + 975 root name=systemd:/systemd-1/getty@.service/tty3 /sbin/mingetty tty3 + 988 root name=systemd:/systemd-1/polkitd.service /usr/libexec/polkit-1/polkitd + 994 rtkit name=systemd:/systemd-1/rtkit-daemon.service 
/usr/libexec/rtkit-daemon +... +``` + +## Changing the Default Boot Target + +```sh +$ ln -sf /usr/lib/systemd/system/multi-user.target /etc/systemd/system/default.target +``` + +This line makes the multi user target (i.e. full system, but no graphical UI) the default target to boot into. This is kinda equivalent to setting runlevel 3 as the default runlevel on Fedora/sysvinit systems. + +```sh +$ ln -sf /usr/lib/systemd/system/graphical.target /etc/systemd/system/default.target +``` + +This line makes the graphical target (i.e. full system, including graphical UI) the default target to boot into. Kinda equivalent to runlevel 5 on fedora/sysvinit systems. This is how things are shipped by default. + +## What other units does a unit depend on? + +For example, if you want to figure out which services a target like multi-user.target pulls in, use something like this: + +```sh +$ systemctl show -p "Wants" multi-user.target +Wants=rc-local.service avahi-daemon.service rpcbind.service NetworkManager.service acpid.service dbus.service atd.service crond.service auditd.service ntpd.service udisks.service bluetooth.service cups.service wpa_supplicant.service getty.target modem-manager.service portreserve.service abrtd.service yum-updatesd.service upowerd.service test-first.service pcscd.service rsyslog.service haldaemon.service remote-fs.target plymouth-quit.service systemd-update-utmp-runlevel.service sendmail.service lvm2-monitor.service cpuspeed.service udev-post.service mdmonitor.service iscsid.service livesys.service livesys-late.service irqbalance.service iscsi.service netfs.service +``` + +Instead of "Wants" you might also try "WantedBy", "Requires", "RequiredBy", "Conflicts", "ConflictedBy", "Before", "After" for the respective types of dependencies and their inverse. + +## What would get started if I booted into a specific target? 
+ +If you want systemd to calculate the "initial" transaction it would execute on boot, try something like this: + +```sh +$ systemd --test --system --unit=foobar.target +``` + +for a boot target foobar.target. Note that this is mostly a debugging tool that actually does a lot more than just calculate the initial transaction, so don't build scripts based on this. diff --git a/docs/TPM2_PCR_MEASUREMENTS.md b/docs/TPM2_PCR_MEASUREMENTS.md new file mode 100644 index 0000000..462a86b --- /dev/null +++ b/docs/TPM2_PCR_MEASUREMENTS.md @@ -0,0 +1,192 @@ +--- +title: TPM2 PCR Measurements Made by systemd +category: Booting +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# TPM2 PCR Measurements Made by systemd + +Various systemd components issue TPM2 PCR measurements during the boot process, +both in UEFI mode and from userspace. The following lists all measurements +done, and describes (in case done before `ExitBootServices()`) how they appear +in the TPM2 Event Log, maintained by the PC firmware. Note that the userspace +measurements listed below are (by default) only done if a system is booted with +`systemd-stub` — or in other words: systemd's userspace measurements are linked +to systemd's UEFI-mode measurements, and if the latter are not done the former +aren't made either. + +systemd will measure to PCRs 5 (`boot-loader-config`), 11 (`kernel-boot`), +12 (`kernel-config`), 13 (`sysexts`), 15 (`system-identity`). 
+
+Currently, four components will issue TPM2 PCR measurements:
+
+* The [`systemd-boot`](https://www.freedesktop.org/software/systemd/man/systemd-boot.html) boot menu (UEFI)
+* The [`systemd-stub`](https://www.freedesktop.org/software/systemd/man/systemd-stub.html) boot stub (UEFI)
+* The [`systemd-pcrextend`](https://www.freedesktop.org/software/systemd/man/systemd-pcrphase.service.html) measurement tool (userspace)
+* The [`systemd-cryptsetup`](https://www.freedesktop.org/software/systemd/man/systemd-cryptsetup@.service.html) disk encryption tool (userspace)
+
+A userspace measurement event log in a format close to TCG CEL-JSON is
+maintained in `/run/log/systemd/tpm2-measure.log`.
+
+## Measurements Added in Future
+
+We expect that we'll add further PCR extensions in the future (both in firmware and
+user mode), which will also be documented here. When executed from firmware
+mode, future additions are expected to be recorded as `EV_EVENT_TAG`
+measurements in the event log, in order to make them robustly
+recognizable. Measurements currently recorded as `EV_IPL` will continue to be
+recorded as `EV_IPL`, for compatibility reasons. However, `EV_IPL` will not be
+used for new, additional measurements.
+
+## PCR Measurements Made by `systemd-boot` (UEFI)
+
+### PCR 5, `EV_EVENT_TAG`, "loader.conf"
+
+The content of `systemd-boot`'s configuration file, `loader/loader.conf`, is
+measured as a tagged event.
+
+→ **Event Tag** `0xf5bc582a`
+
+→ **Description** in the event log record is the file name, `loader.conf`.
+
+→ **Measured hash** covers the content of `loader.conf` as it is read from the ESP.
+
+### PCR 12, `EV_IPL`, "Kernel Command Line"
+
+If the kernel command line was specified explicitly (by the user or in a Boot
+Loader Specification Type #1 file), the kernel command line passed to the
+invoked kernel is measured before it is executed.
(In case a UKI/Boot Loader
+Specification Type #2 entry is booted, the built-in kernel command line is
+implicitly measured as part of the PE sections, because it is embedded in the
+`.cmdline` PE section, hence doesn't need to be measured by `systemd-boot`; see
+below for details on PE section measurements done by `systemd-stub`.)
+
+→ **Description** in the event log record is the literal kernel command line in
+UTF-16.
+
+→ **Measured hash** covers the literal kernel command line in UTF-16 (without any
+trailing NUL bytes).
+
+## PCR Measurements Made by `systemd-stub` (UEFI)
+
+### PCR 11, `EV_IPL`, "PE Section Name"
+
+A measurement is made for each PE section of the UKI that is defined by the
+[UKI
+specification](https://uapi-group.org/specifications/specs/unified_kernel_image/),
+in the canonical order described in the specification.
+
+Happens once for each UKI-defined PE section of the UKI, in the canonical UKI
+PE section order, as per the UKI specification. For each section a pair of
+records is written: first one that covers the PE section name (described here),
+and then one that covers the PE section data (described below), so that
+both types of records appear interleaved in the event log.
+
+→ **Description** in the event log record is the PE section name in UTF-16.
+
+→ **Measured hash** covers the PE section name in ASCII (*including* a trailing NUL byte!).
+
+### PCR 11, `EV_IPL`, "PE Section Data"
+
+Happens once for each UKI-defined PE section of the UKI, in the canonical UKI
+PE section order, as per the UKI specification, see above.
+
+→ **Description** in the event log record is the PE section name in UTF-16.
+
+→ **Measured hash** covers the (binary) PE section contents.
+
+### PCR 12, `EV_IPL`, "Kernel Command Line"
+
+Might happen up to three times, for kernel command lines from:
+
+ 1. Passed cmdline
+ 2. System and per-UKI cmdline add-ons (one measurement covering all add-ons combined)
+ 3. 
SMBIOS cmdline + +→ **Description** in the event log record is the literal kernel command line in +UTF-16. + +→ **Measured hash** covers the literal kernel command line in UTF-16 (without any +trailing NUL bytes). + +### PCR 12, `EV_EVENT_TAG`, "Devicetrees" + +Devicetree addons are measured individually as a tagged event. + +→ **Event Tag** `0x6c46f751` + +→ **Description** the addon filename. + +→ **Measured hash** covers the content of the Devicetree. + +### PCR 12, `EV_IPL`, "Per-UKI Credentials initrd" + +→ **Description** in the event log record is the constant string "Credentials +initrd" in UTF-16. + +→ **Measured hash** covers the per-UKI credentials cpio archive (which is generated + on-the-fly by `systemd-stub`). + +### PCR 12, `EV_IPL`, "Global Credentials initrd" + +→ **Description** in the event log record is the constant string "Global +credentials initrd" in UTF-16. + +→ **Measured hash** covers the global credentials cpio archive (which is generated +on-the-fly by `systemd-stub`). + +### PCR 13, `EV_IPL`, "sysext initrd" + +→ **Description** in the event log record is the constant string "System extension +initrd" in UTF-16. + +→ **Measured hash** covers the per-UKI sysext cpio archive (which is generated +on-the-fly by `systemd-stub`). + +## PCR Measurements Made by `systemd-pcrextend` (Userspace) + +### PCR 11, "Boot Phases" + +The `systemd-pcrphase.service`, `systemd-pcrphase-initrd.service`, +`systemd-pcrphase-sysinit.service` services will measure the boot phase reached +during various times of the boot process. Specifically, the strings +"enter-initrd", "leave-initrd", "sysinit", "ready", "shutdown", "final" are +measured, in this order. (These are regular units, and administrators may +choose to define additional/different phases.) + +→ **Measured hash** covers the phase string (in UTF-8, without trailing NUL +bytes). 
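As a rough illustration of how the boot phase strings accumulate in a PCR, the sketch below simulates the TPM2 extend operation for the SHA-256 bank in Python. It is only a model under stated assumptions: the simulated PCR starts at all zeroes, which is not true for PCR 11 on a real system (`systemd-stub` has already measured the UKI's PE sections into it by then), so the printed value will not match a real TPM. The helper names `pcr_extend` and `simulate_phases` are made up for this example and are not part of any systemd API.

```python
import hashlib

# Boot phase strings, in the order listed in the text above.
BOOT_PHASES = ["enter-initrd", "leave-initrd", "sysinit", "ready", "shutdown", "final"]


def pcr_extend(pcr: bytes, data: bytes) -> bytes:
    """Model a TPM2 extend for the SHA-256 bank: new = H(old || H(data))."""
    digest = hashlib.sha256(data).digest()
    return hashlib.sha256(pcr + digest).digest()


def simulate_phases(phases, initial=bytes(32)):
    """Extend a simulated PCR with each phase string (UTF-8, no trailing NUL)."""
    pcr = initial
    for phase in phases:
        pcr = pcr_extend(pcr, phase.encode("utf-8"))
    return pcr


if __name__ == "__main__":
    # Hypothetical value: assumes an all-zero starting PCR (see the note above).
    print(simulate_phases(BOOT_PHASES).hex())
```

Because each extend hashes the previous PCR value together with the new digest, the final value depends on both the set of phases reached and their order.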
+ +### PCR 15, "Machine ID" + +The `systemd-pcrmachine.service` service will measure the machine ID (as read +from `/etc/machine-id`) during boot. + +→ **Measured hash** covers the string "machine-id:" suffixed by the machine ID +formatted in hexadecimal lowercase characters (in UTF-8, without trailing NUL +bytes). + +### PCR 15, "File System" + +The `systemd-pcrfs-root.service` and `systemd-pcrfs@.service` services will +measure a string identifying a specific file system, typically covering the +root file system and `/var/` (if it is its own file system). + +→ **Measured hash** covers the string "file-system:" suffixed by a series of six +colon-separated strings, identifying the file system type, UUID, label as well +as the GPT partition entry UUID, entry type UUID and entry label (in UTF-8, +without trailing NUL bytes). + +## PCR Measurements Made by `systemd-cryptsetup` (Userspace) + +### PCR 15, "Volume Key" + +The `systemd-cryptsetup@.service` service will measure a key derived from the +LUKS volume key of a specific encrypted volume, typically covering the backing +encryption device of the root file system and `/var/` (if it is its own file +system). + +→ **Measured hash** covers the (binary) result of the HMAC(V,S) calculation where V +is the LUKS volume key, and S is the string "cryptsetup:" followed by the LUKS +volume name and the UUID of the LUKS superblock. diff --git a/docs/TRANSIENT-SETTINGS.md b/docs/TRANSIENT-SETTINGS.md new file mode 100644 index 0000000..15f1cbc --- /dev/null +++ b/docs/TRANSIENT-SETTINGS.md @@ -0,0 +1,511 @@ +--- +title: What Settings Are Currently Available For Transient Units? +category: Interfaces +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# What Settings Are Currently Available For Transient Units? + +Our intention is to make all settings that are available as unit file settings +also available for transient units, through the D-Bus API. 
At the moment, +device, swap, and target units are not supported at all as transient units, but +others are pretty well supported. + +The lists below contain all settings currently available in unit files. The +ones currently available in transient units are prefixed with `✓`. + +## Generic Unit Settings + +Most generic unit settings are available for transient units. + +``` +✓ Description= +✓ Documentation= +✓ SourcePath= +✓ Requires= +✓ Requisite= +✓ Wants= +✓ BindsTo= +✓ Conflicts= +✓ Before= +✓ After= +✓ OnFailure= +✓ PropagatesReloadTo= +✓ ReloadPropagatedFrom= +✓ PartOf= +✓ Upholds= +✓ JoinsNamespaceOf= +✓ RequiresMountsFor= +✓ StopWhenUnneeded= +✓ RefuseManualStart= +✓ RefuseManualStop= +✓ AllowIsolate= +✓ DefaultDependencies= +✓ OnFailureJobMode= +✓ IgnoreOnIsolate= +✓ JobTimeoutSec= +✓ JobRunningTimeoutSec= +✓ JobTimeoutAction= +✓ JobTimeoutRebootArgument= +✓ StartLimitIntervalSec= +✓ StartLimitBurst= +✓ StartLimitAction= +✓ FailureAction= +✓ SuccessAction= +✓ FailureActionExitStatus= +✓ SuccessActionExitStatus= +✓ RebootArgument= +✓ ConditionPathExists= +✓ ConditionPathExistsGlob= +✓ ConditionPathIsDirectory= +✓ ConditionPathIsSymbolicLink= +✓ ConditionPathIsMountPoint= +✓ ConditionPathIsReadWrite= +✓ ConditionDirectoryNotEmpty= +✓ ConditionFileNotEmpty= +✓ ConditionFileIsExecutable= +✓ ConditionNeedsUpdate= +✓ ConditionFirstBoot= +✓ ConditionKernelCommandLine= +✓ ConditionKernelVersion= +✓ ConditionArchitecture= +✓ ConditionFirmware= +✓ ConditionVirtualization= +✓ ConditionSecurity= +✓ ConditionCapability= +✓ ConditionHost= +✓ ConditionACPower= +✓ ConditionUser= +✓ ConditionGroup= +✓ ConditionControlGroupController= +✓ AssertPathExists= +✓ AssertPathExistsGlob= +✓ AssertPathIsDirectory= +✓ AssertPathIsSymbolicLink= +✓ AssertPathIsMountPoint= +✓ AssertPathIsReadWrite= +✓ AssertDirectoryNotEmpty= +✓ AssertFileNotEmpty= +✓ AssertFileIsExecutable= +✓ AssertNeedsUpdate= +✓ AssertFirstBoot= +✓ AssertKernelCommandLine= +✓ AssertKernelVersion= +✓ 
AssertArchitecture= +✓ AssertVirtualization= +✓ AssertSecurity= +✓ AssertCapability= +✓ AssertHost= +✓ AssertACPower= +✓ AssertUser= +✓ AssertGroup= +✓ AssertControlGroupController= +✓ CollectMode= +``` + +## Execution-Related Settings + +All execution-related settings are available for transient units. + +``` +✓ WorkingDirectory= +✓ RootDirectory= +✓ RootImage= +✓ User= +✓ Group= +✓ SupplementaryGroups= +✓ Nice= +✓ OOMScoreAdjust= +✓ CoredumpFilter= +✓ IOSchedulingClass= +✓ IOSchedulingPriority= +✓ CPUSchedulingPolicy= +✓ CPUSchedulingPriority= +✓ CPUSchedulingResetOnFork= +✓ CPUAffinity= +✓ UMask= +✓ Environment= +✓ EnvironmentFile= +✓ PassEnvironment= +✓ UnsetEnvironment= +✓ DynamicUser= +✓ RemoveIPC= +✓ StandardInput= +✓ StandardOutput= +✓ StandardError= +✓ StandardInputText= +✓ StandardInputData= +✓ TTYPath= +✓ TTYReset= +✓ TTYVHangup= +✓ TTYVTDisallocate= +✓ TTYRows= +✓ TTYColumns= +✓ SyslogIdentifier= +✓ SyslogFacility= +✓ SyslogLevel= +✓ SyslogLevelPrefix= +✓ LogLevelMax= +✓ LogExtraFields= +✓ LogFilterPatterns= +✓ LogRateLimitIntervalSec= +✓ LogRateLimitBurst= +✓ SecureBits= +✓ CapabilityBoundingSet= +✓ AmbientCapabilities= +✓ TimerSlackNSec= +✓ NoNewPrivileges= +✓ KeyringMode= +✓ ProtectProc= +✓ ProcSubset= +✓ SystemCallFilter= +✓ SystemCallArchitectures= +✓ SystemCallErrorNumber= +✓ SystemCallLog= +✓ MemoryDenyWriteExecute= +✓ RestrictNamespaces= +✓ RestrictRealtime= +✓ RestrictSUIDSGID= +✓ RestrictAddressFamilies= +✓ RootHash= +✓ RootHashSignature= +✓ RootVerity= +✓ LockPersonality= +✓ LimitCPU= +✓ LimitFSIZE= +✓ LimitDATA= +✓ LimitSTACK= +✓ LimitCORE= +✓ LimitRSS= +✓ LimitNOFILE= +✓ LimitAS= +✓ LimitNPROC= +✓ LimitMEMLOCK= +✓ LimitLOCKS= +✓ LimitSIGPENDING= +✓ LimitMSGQUEUE= +✓ LimitNICE= +✓ LimitRTPRIO= +✓ LimitRTTIME= +✓ ReadWritePaths= +✓ ReadOnlyPaths= +✓ InaccessiblePaths= +✓ BindPaths= +✓ BindReadOnlyPaths= +✓ TemporaryFileSystem= +✓ PrivateTmp= +✓ PrivateDevices= +✓ PrivateMounts= +✓ ProtectKernelTunables= +✓ ProtectKernelModules= +✓ 
ProtectKernelLogs= +✓ ProtectControlGroups= +✓ PrivateNetwork= +✓ PrivateUsers= +✓ ProtectSystem= +✓ ProtectHome= +✓ ProtectClock= +✓ MountFlags= +✓ MountAPIVFS= +✓ Personality= +✓ RuntimeDirectoryPreserve= +✓ RuntimeDirectoryMode= +✓ RuntimeDirectory= +✓ StateDirectoryMode= +✓ StateDirectory= +✓ CacheDirectoryMode= +✓ CacheDirectory= +✓ LogsDirectoryMode= +✓ LogsDirectory= +✓ ConfigurationDirectoryMode= +✓ ConfigurationDirectory= +✓ PAMName= +✓ IgnoreSIGPIPE= +✓ UtmpIdentifier= +✓ UtmpMode= +✓ SELinuxContext= +✓ SmackProcessLabel= +✓ AppArmorProfile= +✓ Slice= +``` + +## Resource Control Settings + +All cgroup/resource control settings are available for transient units: + +``` +✓ CPUAccounting= +✓ CPUWeight= +✓ StartupCPUWeight= +✓ CPUShares= +✓ StartupCPUShares= +✓ CPUQuota= +✓ CPUQuotaPeriodSec= +✓ AllowedCPUs= +✓ StartupAllowedCPUs= +✓ AllowedMemoryNodes= +✓ StartupAllowedMemoryNodes= +✓ MemoryAccounting= +✓ DefaultMemoryMin= +✓ MemoryMin= +✓ DefaultMemoryLow= +✓ MemoryLow= +✓ MemoryHigh= +✓ MemoryMax= +✓ MemorySwapMax= +✓ MemoryLimit= +✓ DeviceAllow= +✓ DevicePolicy= +✓ IOAccounting= +✓ IOWeight= +✓ StartupIOWeight= +✓ IODeviceWeight= +✓ IOReadBandwidthMax= +✓ IOWriteBandwidthMax= +✓ IOReadIOPSMax= +✓ IOWriteIOPSMax= +✓ BlockIOAccounting= +✓ BlockIOWeight= +✓ StartupBlockIOWeight= +✓ BlockIODeviceWeight= +✓ BlockIOReadBandwidth= +✓ BlockIOWriteBandwidth= +✓ TasksAccounting= +✓ TasksMax= +✓ Delegate= +✓ DisableControllers= +✓ IPAccounting= +✓ IPAddressAllow= +✓ IPAddressDeny= +✓ ManagedOOMSwap= +✓ ManagedOOMMemoryPressure= +✓ ManagedOOMMemoryPressureLimit= +✓ ManagedOOMPreference= +✓ CoredumpReceive= +``` + +## Process Killing Settings + +All process killing settings are available for transient units: + +``` +✓ SendSIGKILL= +✓ SendSIGHUP= +✓ KillMode= +✓ KillSignal= +✓ RestartKillSignal= +✓ FinalKillSignal= +✓ WatchdogSignal= +``` + +## Service Unit Settings + +Most service unit settings are available for transient units.
+ +``` +✓ BusName= +✓ ExecCondition= +✓ ExecReload= +✓ ExecStart= +✓ ExecStartPost= +✓ ExecStartPre= +✓ ExecStop= +✓ ExecStopPost= +✓ ExitType= +✓ FileDescriptorStoreMax= +✓ GuessMainPID= +✓ NonBlocking= +✓ NotifyAccess= +✓ OOMPolicy= +✓ PIDFile= +✓ RemainAfterExit= +✓ Restart= +✓ RestartForceExitStatus= +✓ RestartPreventExitStatus= +✓ RestartSec= +✓ RootDirectoryStartOnly= +✓ RuntimeMaxSec= +✓ RuntimeRandomizedExtraSec= + Sockets= +✓ SuccessExitStatus= +✓ TimeoutAbortSec= +✓ TimeoutSec= +✓ TimeoutStartFailureMode= +✓ TimeoutStartSec= +✓ TimeoutStopFailureMode= +✓ TimeoutStopSec= +✓ Type= +✓ USBFunctionDescriptors= +✓ USBFunctionStrings= +✓ WatchdogSec= +``` + +## Mount Unit Settings + +All mount unit settings are available to transient units: + +``` +✓ What= +✓ Where= +✓ Options= +✓ Type= +✓ TimeoutSec= +✓ DirectoryMode= +✓ SloppyOptions= +✓ LazyUnmount= +✓ ForceUnmount= +✓ ReadWriteOnly= +``` + +## Automount Unit Settings + +All automount unit settings are available to transient units: + +``` +✓ Where= +✓ DirectoryMode= +✓ TimeoutIdleSec= +``` + +## Timer Unit Settings + +Most timer unit settings are available to transient units. + +``` +✓ OnActiveSec= +✓ OnBootSec= +✓ OnCalendar= +✓ OnClockChange= +✓ OnStartupSec= +✓ OnTimezoneChange= +✓ OnUnitActiveSec= +✓ OnUnitInactiveSec= +✓ Persistent= +✓ WakeSystem= +✓ RemainAfterElapse= +✓ AccuracySec= +✓ RandomizedDelaySec= +✓ FixedRandomDelay= + Unit= +``` + +## Slice Unit Settings + +Slice units are fully supported as transient units, but they have no settings +of their own beyond the generic unit and resource control settings. + +## Scope Unit Settings + +Scope units are fully supported as transient units (in fact they only exist as +such). + +``` +✓ RuntimeMaxSec= +✓ RuntimeRandomizedExtraSec= +✓ TimeoutStopSec= +``` + +## Socket Unit Settings + +Most socket unit settings are available to transient units.
+ +``` +✓ ListenStream= +✓ ListenDatagram= +✓ ListenSequentialPacket= +✓ ListenFIFO= +✓ ListenNetlink= +✓ ListenSpecial= +✓ ListenMessageQueue= +✓ ListenUSBFunction= +✓ SocketProtocol= +✓ BindIPv6Only= +✓ Backlog= +✓ BindToDevice= +✓ ExecStartPre= +✓ ExecStartPost= +✓ ExecStopPre= +✓ ExecStopPost= +✓ TimeoutSec= +✓ SocketUser= +✓ SocketGroup= +✓ SocketMode= +✓ DirectoryMode= +✓ Accept= +✓ FlushPending= +✓ Writable= +✓ MaxConnections= +✓ MaxConnectionsPerSource= +✓ KeepAlive= +✓ KeepAliveTimeSec= +✓ KeepAliveIntervalSec= +✓ KeepAliveProbes= +✓ DeferAcceptSec= +✓ NoDelay= +✓ Priority= +✓ ReceiveBuffer= +✓ SendBuffer= +✓ IPTOS= +✓ IPTTL= +✓ Mark= +✓ PipeSize= +✓ FreeBind= +✓ Transparent= +✓ Broadcast= +✓ PassCredentials= +✓ PassSecurity= +✓ PassPacketInfo= +✓ TCPCongestion= +✓ ReusePort= +✓ MessageQueueMaxMessages= +✓ MessageQueueMessageSize= +✓ RemoveOnStop= +✓ Symlinks= +✓ FileDescriptorName= + Service= +✓ TriggerLimitIntervalSec= +✓ TriggerLimitBurst= +✓ SmackLabel= +✓ SmackLabelIPIn= +✓ SmackLabelIPOut= +✓ SELinuxContextFromNet= +``` + +## Swap Unit Settings + +Swap units are currently not available at all as transient units: + +``` + What= + Priority= + Options= + TimeoutSec= +``` + +## Path Unit Settings + +Most path unit settings are available to transient units. + +``` +✓ PathExists= +✓ PathExistsGlob= +✓ PathChanged= +✓ PathModified= +✓ DirectoryNotEmpty= + Unit= +✓ MakeDirectory= +✓ DirectoryMode= +``` + +## Install Section + +The `[Install]` section is currently not available at all for transient units, and it probably doesn't even make sense. 
+ +``` + Alias= + WantedBy= + RequiredBy= + Also= + DefaultInstance= +``` diff --git a/docs/TRANSLATORS.md b/docs/TRANSLATORS.md new file mode 100644 index 0000000..2f578cc --- /dev/null +++ b/docs/TRANSLATORS.md @@ -0,0 +1,81 @@ +--- +title: Notes for Translators +category: Contributing +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Notes for Translators + +systemd depends on the `gettext` package for multilingual support. + +You'll find the i18n files in the `po/` directory. + +The build system (meson/ninja) can be used to generate a template (`*.pot`), +which can be used to create new translations. + +It can also merge the template into the existing translations (`*.po`), to pick +up new strings in need of translation. + +Finally, it is able to compile the translations (to `*.gmo` files), so that +they can be used by systemd software. (This step is also useful to confirm the +syntax of the `*.po` files is correct.) + +## Creating a New Translation + +To create a translation to a language not yet available, start by creating the +initial template: + +``` +$ ninja -C build/ systemd-pot +``` + +This will generate file `po/systemd.pot` in the source tree. + +Then simply copy it to a new `${lang_code}.po` file, where +`${lang_code}` is the two-letter code for a language +(possibly followed by a two-letter uppercase country code), according to the +ISO 639 standard. + +In short: + +``` +$ cp po/systemd.pot po/${lang_code}.po +``` + +Then edit the new `po/${lang_code}.po` file (for example, +using the `poedit` GUI editor.) + +## Updating an Existing Translation + +Start by updating the `*.po` files from the latest template: + +``` +$ ninja -C build/ systemd-update-po +``` + +This will touch all the `*.po` files, so you'll want to pay attention when +creating a git commit from this change, to only include the one translation +you're actually updating. 
+ +Edit the `*.po` file, looking for empty translations and translations marked as +"fuzzy" (which means the merger found a similar message that needs to be +reviewed as it's expected not to match exactly.) + +You can use any text editor to update the `*.po` files, but a good choice is +the `poedit` editor, a graphical application specifically designed for this +purpose. + +Once you're done, create a git commit for the update of the `po/*.po` file you +touched. Remember to undo the changes to the other `*.po` files (for instance, +using `git checkout -- po/` after you commit the changes you do want to keep.) + +## Recompiling Translations + +You can recompile the `*.po` files using the following command: + +``` +$ ninja -C build/ systemd-gmo +``` + +The resulting files will be saved in the `build/po/` directory. diff --git a/docs/UIDS-GIDS.md b/docs/UIDS-GIDS.md new file mode 100644 index 0000000..e84f037 --- /dev/null +++ b/docs/UIDS-GIDS.md @@ -0,0 +1,326 @@ +--- +title: Users, Groups, UIDs and GIDs on systemd Systems +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Users, Groups, UIDs and GIDs on systemd Systems + +Here's a summary of the requirements `systemd` (and Linux) make on UID/GID +assignments and their ranges. + +Note that while in theory UIDs and GIDs are orthogonal concepts they really +aren't IRL. With that in mind, when we discuss UIDs below it should be assumed +that whatever we say about UIDs applies to GIDs in mostly the same way, and all +the special assignments and ranges for UIDs always have mostly the same +validity for GIDs too. + +## Special Linux UIDs + +In theory, the range of the C type `uid_t` is 32-bit wide on Linux, +i.e. 0…4294967295. However, four UIDs are special on Linux: + +1. 0 → The `root` super-user. + +2. 65534 → The `nobody` UID, also called the "overflow" UID or similar. 
It's + where various subsystems map unmappable users to, for example file systems + only supporting 16-bit UIDs, NFS or user namespacing. (The latter can be + changed with a sysctl during runtime, but that's not supported on + `systemd`. If you do change it you void your warranty.) Because Fedora is a + bit confused the `nobody` user is called `nfsnobody` there (and they have a + different `nobody` user at UID 99). I hope this will be corrected eventually + though. (Also, some distributions call the `nobody` group `nogroup`. I wish + they didn't.) + +3. 4294967295, aka "32-bit `(uid_t) -1`" → This UID is not a valid user ID, as + `setresuid()`, `chown()` and friends treat -1 as a special request to not + change the UID of the process/file. This UID is hence not available for + assignment to users in the user database. + +4. 65535, aka "16-bit `(uid_t) -1`" → Before Linux kernel 2.4 `uid_t` used to be + 16-bit, and programs compiled for that would hence assume that `(uid_t) -1` + is 65535. This UID is hence not usable either. + +The `nss-systemd` glibc NSS module will synthesize user database records for +the UIDs 0 and 65534 if the system user database doesn't list them. This means +that any system where this module is enabled works to some minimal level +without `/etc/passwd`. + +## Special Distribution UID ranges + +Distributions generally split the available UID range in two: + +1. 1…999 → System users. These are users that do not map to actual "human" + users, but are used as security identities for system daemons, to implement + privilege separation and run system daemons with minimal privileges. + +2. 1000…65533 and 65536…4294967294 → Everything else, i.e. regular (human) users. + +Some older systems placed the boundary at 499/500, or even 99/100, +and some distributions allow the boundary between system and regular users to be changed +via local configuration. 
+In `systemd`, the boundary is configurable during compilation time +and is also queried from `/etc/login.defs` at runtime, +if the `-Dcompat-mutable-uid-boundaries=true` compile-time setting is used. +We strongly discourage downstreams from changing the boundary from the upstream default of 999/1000. + +Also note that programs such as `adduser` tend to allocate from a subset of the +available regular user range only, usually 1000..60000. +This range can also be configured using `/etc/login.defs`. + +Note that systemd requires that system users and groups are resolvable without +network — a requirement that is not made for regular users. This +means regular users may be stored in remote LDAP or NIS databases, but system +users may not (except when there's a consistent local cache kept, that is +available during earliest boot, including in the initrd). + +## Special `systemd` GIDs + +`systemd` defines no special UIDs beyond what Linux already defines (see +above). However, it does define some special group/GID assignments, which are +primarily used for `systemd-udevd`'s device management. The precise list of the +currently defined groups is found in this `sysusers.d` snippet: +[basic.conf](https://raw.githubusercontent.com/systemd/systemd/main/sysusers.d/basic.conf.in) + +It's strongly recommended that downstream distributions include these groups in +their default group databases. + +Note that the actual GID numbers assigned to these groups do not have to be +constant beyond a specific system. There's one exception however: the `tty` +group must have the GID 5. That's because it must be encoded in the `devpts` +mount parameters during earliest boot, at a time where NSS lookups are not +possible. (Note that the actual GID can be changed during `systemd` build time, +but downstreams are strongly advised against doing that.) + +## Special `systemd` UID ranges + +`systemd` defines a number of special UID ranges: + +1. 
60001…60513 → UIDs for home directories managed by + [`systemd-homed.service(8)`](https://www.freedesktop.org/software/systemd/man/systemd-homed.service.html). UIDs + from this range are automatically assigned to any home directory discovered, + and persisted locally on first login. On different systems the same user + might get different UIDs assigned in case of conflict, though it is + attempted to make UID assignments stable, by deriving them from a hash of + the user name. + +2. 61184…65519 → UIDs for dynamic users are allocated from this range (see the + `DynamicUser=` documentation in + [`systemd.exec(5)`](https://www.freedesktop.org/software/systemd/man/systemd.exec.html)). This + range has been chosen so that it is below the 16-bit boundary (i.e. below + 65535), in order to provide compatibility with container environments that + assign a 64K range of UIDs to containers using user namespacing. This range + is above the 60000 boundary, so that its allocations are unlikely to be + affected by `adduser` allocations (see above). And we leave some room + upwards for other purposes. (And if you wonder why precisely these numbers: + if you write them in hexadecimal, they might make more sense: 0xEF00 and + 0xFFEF). The `nss-systemd` module will synthesize user records implicitly + for all currently allocated dynamic users from this range. Thus, NSS-based + user record resolving works correctly without those users being in + `/etc/passwd`. + +3. 524288…1879048191 → UID range for `systemd-nspawn`'s automatic allocation of + per-container UID ranges. When the `--private-users=pick` switch is used (or + `-U`) then it will automatically find a so far unused 16-bit subrange of this + range and assign it to the container. The range is picked so that the upper + 16-bit of the 32-bit UIDs are constant for all users of the container, while + the lower 16-bit directly encode the 65536 UIDs assigned to the + container. 
This mode of allocation means that the upper 16-bit of any UID + assigned to a container are kind of a "container ID", while the lower 16-bit + directly expose the container's own UID numbers. If you wonder why precisely + these numbers, consider them in hexadecimal: 0x00080000…0x6FFFFFFF. This + range is above the 16-bit boundary. Moreover it's below the 31-bit boundary, + as some broken code (specifically: the kernel's `devpts` file system) + erroneously considers UIDs signed integers, and hence can't deal with values + above 2^31. The `systemd-machined.service` service will synthesize user + database records for all UIDs assigned to a running container from this + range. + +Note for both allocation ranges: when a UID allocation takes place NSS is +checked for collisions first, and a different UID is picked if an entry is +found. Thus, the user database is used as synchronization mechanism to ensure +exclusive ownership of UIDs and UID ranges. To ensure compatibility with other +subsystems allocating from the same ranges it is hence essential that they +ensure that whatever they pick shows up in the user/group databases, either by +providing an NSS module, or by adding entries directly to `/etc/passwd` and +`/etc/group`. For performance reasons, do note that `systemd-nspawn` will only +do an NSS check for the first UID of the range it allocates, not all 65536 of +them. Also note that while the allocation logic is operating, the glibc +`lckpwdf()` user database lock is taken, in order to make this logic race-free. 
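The ranges described in this section can be sketched in Python as a small classifier, together with the split of a container UID into its "container ID" part and the UID seen inside the container. The constants are the ones given in the text and assume the upstream compile-time defaults; the function names `classify_uid` and `split_container_uid` are hypothetical and not part of any systemd interface.

```python
# Range boundaries as given in the text (assuming upstream compile-time defaults).
HOMED_UID_MIN, HOMED_UID_MAX = 60001, 60513
DYNAMIC_UID_MIN, DYNAMIC_UID_MAX = 0xEF00, 0xFFEF              # 61184…65519
CONTAINER_UID_MIN, CONTAINER_UID_MAX = 0x00080000, 0x6FFFFFFF  # 524288…1879048191


def classify_uid(uid):
    """Name the special systemd UID range a UID falls into, if any."""
    if HOMED_UID_MIN <= uid <= HOMED_UID_MAX:
        return "homed"
    if DYNAMIC_UID_MIN <= uid <= DYNAMIC_UID_MAX:
        return "dynamic"
    if CONTAINER_UID_MIN <= uid <= CONTAINER_UID_MAX:
        return "container"
    return "other"


def split_container_uid(uid):
    """Split a container UID into (container base UID, UID inside the container)."""
    if classify_uid(uid) != "container":
        raise ValueError(f"{uid} is not in the container UID range")
    # Upper 16 bits identify the container, lower 16 bits are the internal UID.
    return uid & 0xFFFF0000, uid & 0x0000FFFF


if __name__ == "__main__":
    print(classify_uid(61184))
    print(split_container_uid(524288 + 1000))
```

Keeping the boundaries as hexadecimal constants makes the alignment on 16-bit blocks visible at a glance.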
+ +## Figuring out the system's UID boundaries + +The most important boundaries of the local system may be queried with +`pkg-config`: + +``` +$ pkg-config --variable=system_uid_max systemd +999 +$ pkg-config --variable=dynamic_uid_min systemd +61184 +$ pkg-config --variable=dynamic_uid_max systemd +65519 +$ pkg-config --variable=container_uid_base_min systemd +524288 +$ pkg-config --variable=container_uid_base_max systemd +1878982656 +``` + +(Note that the latter encodes the maximum UID *base* `systemd-nspawn` might +pick; given that 64K UIDs are assigned to each container according to this +allocation logic, the maximum UID used for this range is hence +1878982656+65535=1879048191.) + +systemd has compile-time defaults for these boundaries. Using those defaults is +recommended. It will nevertheless query `/etc/login.defs` at runtime, when +compiled with `-Dcompat-mutable-uid-boundaries=true` and that file is present. +Support for this is considered only a compatibility feature and should not be +used except when upgrading systems which were created with different defaults. + +## Considerations for container managers + +If you hack on a container manager, and wonder how and how many UIDs best to +assign to your containers, here are a few recommendations: + +1. Definitely, don't assign fewer than 65536 UIDs/GIDs. After all the `nobody` +user has magic properties, and hence should be available in your container, and +given that it's assigned the UID 65534, you should really cover the full 16-bit +range in your container. Note that systemd will, as mentioned, synthesize +user records for the `nobody` user, and assumes its availability in various +other parts of its codebase, too, hence assigning fewer users means you lose +compatibility with running systemd code inside your container. And most likely +other packages make similar restrictions. + +2.
While it's fine to assign more than 65536 UIDs/GIDs to a container, there's +most likely not much value in doing so, as Linux distributions won't use the +higher ranges by default (as mentioned neither `adduser` nor `systemd`'s +dynamic user concept allocate from above the 16-bit range). Unless you actively +care for nested containers, it's hence probably a good idea to allocate exactly +65536 UIDs per container, and neither less nor more. A pretty side-effect is +that by doing so, you expose the same number of UIDs per container as Linux 2.2 +supported for the whole system, back in the days. + +3. Consider allocating UID ranges for containers so that the first UID you +assign has the lower 16-bits all set to zero. That way, the upper 16-bits become +a container ID of some kind, while the lower 16-bits directly encode the +internal container UID. This is the way `systemd-nspawn` allocates UID ranges +(see above). Following this allocation logic ensures best compatibility with +`systemd-nspawn` and all other container managers following the scheme, as it +is sufficient then to check NSS for the first UID you pick regarding conflicts, +as that's what they do, too. Moreover, it makes `chown()`ing container file +system trees nicely robust to interruptions: as the external UID encodes the +internal UID in a fixed way, it's very easy to adjust the container's base UID +without the need to know the original base UID: to change the container base, +just mask away the upper 16-bit, and insert the upper 16-bit of the new container +base instead. Here are the easy conversions to derive the internal UID, the +external UID, and the container base UID from each other: + + ``` + INTERNAL_UID = EXTERNAL_UID & 0x0000FFFF + CONTAINER_BASE_UID = EXTERNAL_UID & 0xFFFF0000 + EXTERNAL_UID = INTERNAL_UID | CONTAINER_BASE_UID + ``` + +4. 
When picking a UID range for containers, make sure to check NSS first, with +a simple `getpwuid()` call: if there's already a user record for the first UID +you want to pick, then it's already in use: pick a different one. Wrap that +call in a `lckpwdf()` + `ulckpwdf()` pair, to make allocation +race-free. Provide an NSS module that makes all UIDs you end up taking show up +in the user database, and make sure that the NSS module returns up-to-date +information before you release the lock, so that other system components can +safely use the NSS user database as allocation check, too. Note that if you +follow this scheme no changes to `/etc/passwd` need to be made, thus minimizing +the artifacts the container manager persistently leaves in the system. + +5. `systemd-homed` by default mounts the home directories it manages with UID +mapping applied. It will map four UID ranges into that uidmap, and leave +everything else unmapped: the range from 0…60000, the user's own UID, the range +60514…65534, and the container range 524288…1879048191. This means +files/directories in home directories managed by `systemd-homed` cannot be +owned by UIDs/GIDs outside of these four ranges (attempts to `chown()` files to +UIDs outside of these ranges will fail). Thus, if container trees are to be +placed within a home directory managed by `systemd-homed` they should take +these ranges into consideration and either place the trees at base UID 0 (and +then map them to a higher UID range for use in user namespacing via another +level of UID mapped mounts, at *runtime*) or at a base UID from the container +UID range. 
That said, placing container trees (and in fact any +files/directories not owned by the home directory's user) in home directories +is generally a questionable idea (regardless of whether `systemd-homed` is used +or not), given this typically breaks quota assumptions, makes it impossible for +users to properly manage all files in their own home directory due to +permission problems, introduces security issues around SETUID and severely +restricts compatibility with networked home directories. Typically, it's a much +better idea to place container images outside of the home directory, +i.e. somewhere below `/var/` or similar. + +## Summary + +| UID/GID | Purpose | Defined By | Listed in | +|-----------------------|-----------------------|---------------|-------------------------------| +| 0 | `root` user | Linux | `/etc/passwd` + `nss-systemd` | +| 1…4 | System users | Distributions | `/etc/passwd` | +| 5 | `tty` group | `systemd` | `/etc/passwd` | +| 6…999 | System users | Distributions | `/etc/passwd` | +| 1000…60000 | Regular users | Distributions | `/etc/passwd` + LDAP/NIS/… | +| 60001…60513 | Human users (homed) | `systemd` | `nss-systemd` | +| 60514…60577 | Host users mapped into containers | `systemd` | `systemd-nspawn` | +| 60578…61183 | Unused | | | +| 61184…65519 | Dynamic service users | `systemd` | `nss-systemd` | +| 65520…65533 | Unused | | | +| 65534 | `nobody` user | Linux | `/etc/passwd` + `nss-systemd` | +| 65535 | 16-bit `(uid_t) -1` | Linux | | +| 65536…524287 | Unused | | | +| 524288…1879048191 | Container UID ranges | `systemd` | `nss-systemd` | +| 1879048192…2147483647 | Unused | | | +| 2147483648…4294967294 | HIC SVNT LEONES | | | +| 4294967295 | 32-bit `(uid_t) -1` | Linux | | + +Note that "Unused" in the table above doesn't mean that these ranges are +really unused. It just means that these ranges have no well-established +pre-defined purposes between Linux, generic low-level distributions and +`systemd`. 
There might very well be other packages that allocate from these +ranges. + +Note that the range 2147483648…4294967294 (i.e. 2^31…2^32-2) should be handled +with care. Various programs (including kernel file systems, see `devpts`, or +even kernel syscalls, see `setfsuid()`) have trouble with UIDs outside of the +signed 32-bit range, i.e. any UIDs equal to or above 2147483648. It is thus +strongly recommended to stay away from this range in order to avoid +complications. This range should be considered reserved for future, special +purposes. + +## Notes on resolvability of user and group names + +User names, UIDs, group names and GIDs don't have to be resolvable using NSS +(i.e. `getpwuid()` and `getpwnam()` and friends) all the time. However, systemd +makes the following requirements: + +System users generally have to be resolvable during early boot already. This +means they should not be provided by any networked service (as those usually +become available during late boot only), except if a local cache is kept that +makes them available during early boot too (i.e. before networking is +up). Specifically, system users need to be resolvable at least before +`systemd-udevd.service` and `systemd-tmpfiles-setup.service` are started, as both +need to resolve system users, but note that there might be more services +requiring full resolvability of system users than just these two. + +Regular users do not need to be resolvable during early boot; it is sufficient +if they become resolvable during late boot. Specifically, regular users need to +be resolvable at the point in time the `nss-user-lookup.target` unit is +reached. This target unit is generally used as a synchronization point between +providers of the user database and consumers of it.
Services that require that +the user database is fully available (for example, the login service +`systemd-logind.service`) are ordered *after* it, while services that provide +parts of the user database (for example an LDAP user database client) are +ordered *before* it. Note that `nss-user-lookup.target` is a *passive* unit: in +order to minimize synchronization points on systems that don't need it the unit +is pulled into the initial transaction only if there's at least one service +that really needs it, and that means only if there's a service providing the +local user database somehow through IPC or suchlike. Or in other words: if you +hack on some networked user database project, then make sure you order your +service `Before=nss-user-lookup.target` and that you pull it in with +`Wants=nss-user-lookup.target`. However, if you hack on some project that needs +the user database to be up in full, then order your service +`After=nss-user-lookup.target`, but do *not* pull it in via a `Wants=` +dependency. diff --git a/docs/USERDB_AND_DESKTOPS.md b/docs/USERDB_AND_DESKTOPS.md new file mode 100644 index 0000000..3a3da13 --- /dev/null +++ b/docs/USERDB_AND_DESKTOPS.md @@ -0,0 +1,169 @@ +--- +title: systemd-homed and JSON User/Group Record Support in Desktop Environments +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# `systemd-homed` and JSON User/Group Record Support in Desktop Environments + +Starting with version 245, systemd supports a new subsystem +[`systemd-homed.service`](https://www.freedesktop.org/software/systemd/man/systemd-homed.service.html) +for managing regular ("human") users and their home directories. Along with it +a new concept `userdb` got merged that brings rich, extensible JSON user/group +records, extending the classic UNIX/glibc NSS `struct passwd`/`struct group` +structures. 
Both additions are added in a fully backwards compatible way, +accessible through `getpwnam()`/`getgrnam()`/… (i.e. libc NSS) and PAM as +usual, meaning that for basic support no changes in the upper layers of the +stack (in particular desktop environments, such as GNOME or KDE) have to be +made. However, for better support a number of changes to desktop environments +are recommended. A few areas where that applies are discussed below. + +Before reading on, please read up on the basic concepts, specifically: + +* [Home Directories](HOME_DIRECTORY) +* [JSON User Records](USER_RECORD) +* [JSON Group Records](GROUP_RECORD) +* [User/Group Record Lookup API via Varlink](USER_GROUP_API) + +## Support for Suspending Home Directory Access during System Suspend + +One key feature of `systemd-homed` managed encrypted home directories is the +ability that access to them can be suspended automatically during system sleep, +removing any cryptographic key material from memory while doing so. This is +important in a world where most laptop users seldom shut down their computers +but most of the time just suspend them instead. Previously, the encryption keys +for the home directories remained in memory during system suspend, so that +sufficiently equipped attackers could read them from there and gain full access +to the device. By removing the key material from memory before suspend, and +re-requesting it on resume this attack vector can be closed down effectively. + +Supporting this mechanism requires support in the desktop environment, since +the encryption keys (i.e. the user's login password) need to be reacquired on +system resume, from a lock screen or similar. This lock screen must run in +system context, and cannot run in the user's own context, since otherwise it +might end up accessing the home directory of the user even though access to it +is temporarily suspended and thus will hang if attempted. 
+ +It is suggested that desktop environments that implement lock screens run them +from system context, for example by switching back to the display manager, and +only revert back to the session after re-authentication via this system lock +screen (re-authentication in this case refers to passing the user's login +credentials to the usual PAM authentication hooks). Or in other words, when +going into system suspend it is recommended that GNOME Shell switches back to +the GNOME Display Manager login screen which now should double as screen lock, +and only switches back to the shell's UI after the user re-authenticated there. + +Note that this change in behavior is a good idea in any case, and does not +create any dependencies on `systemd-homed` or systemd-specific APIs. It's +simply a change of behavior regarding use of existing APIs, not a suggested +hook-up to any new APIs. + +A display manager which supports this kind of out-of-context screen lock +operation needs to inform systemd-homed about this so that systemd-homed knows +that it is safe to suspend the user's home directory on suspend. This is done +via the `suspend=` argument to the +[`pam_systemd_home`](https://www.freedesktop.org/software/systemd/man/pam_systemd_home.html) +PAM module. A display manager should hence change its PAM stack configuration +to set this parameter to on. `systemd-homed` will not suspend home directories +if there's at least one active session of the user that does not support +suspending, as communicated via this parameter. + +## User Management UIs + +The rich user/group records `userdb` and `systemd-homed` support carry various +fields of relevance to UIs that manage the local user database or parts +thereof. In particular, most of the metadata `accounts-daemon` (also see below) +supports is directly available in these JSON records. Hence it makes sense for +any user management UI to expose them directly. 
+ +`systemd-homed` exposes APIs to add, remove and make changes to local users via +D-Bus, with full [polkit](https://www.freedesktop.org/software/polkit/docs/latest/) +hook-up. On the command line this is exposed via the +`homectl` command. A graphical UI that exposes similar functionality would be +very useful, exposing the various new account settings, and in particular +providing a stream-lined UI for enrolling new-style authentication tokens such +as PKCS#11/YubiKey-style devices. (Ideally, if the user plugs in an +uninitialized YubiKey during operation it might be nice if the Desktop would +automatically ask if a key pair shall be written to it and the local account be +bound to it, `systemd-homed` provides enough YubiKey/PKCS#11 support to make +this a reality today; except that it will not take care of token +initialization). + +A strong point of `systemd-homed` is per-user resource management. In +particular disk space assignments are something that most likely should be +exposed in a user management UI. Various metadata fields are supplied allowing +exposure of disk space assignment "slider" UI. Note however that the file system +back-ends of `systemd-homed.service` have different feature sets. Specifically, +only btrfs has online file system shrinking support, ext4 only offline file +system shrinking support, and xfs no shrinking support at all (all three file +systems support online file system growing however). This means if the LUKS +back-end is used, disk space assignment cannot be instant for logged in users, +unless btrfs is used. + +Note that only `systemd-homed` provides an API for modifying/creating/deleting +users. The generic `userdb` subsystem (which might have other back-ends, besides +`systemd-homed`, for example LDAP or Windows) exclusively provides a read-only +interface. (This is unlikely to change, as the other back-ends might have very +different concepts of adding or modifying users, i.e. 
might not even have any +local concept for that at all). This means any user management UI that intends +to change (and not just view) user accounts should talk directly to +`systemd-homed` to make use of its features; there's no abstraction available +to support other back-ends under the same API. + +Unfortunately there's currently no documentation for the `systemd-homed` D-Bus +API. Consider using the `homectl` sources as guidelines for implementing a user +management UI. The JSON user/group records are well documented however, see above, +and the D-Bus API provides limited introspection. + +## Relationship to `accounts-daemon` + +For a long time `accounts-daemon` has been included in Linux distributions +providing richer user accounts. The functionality of this daemon overlaps in +many areas with the functionality of `systemd-homed` or `userdb`, but there are +systematic differences, which means that `systemd-homed` cannot replace +`accounts-daemon` fully. Most importantly: `accounts-daemon` provides +"side-car" metadata for *any* type of user account, while `systemd-homed` only +provides additional metadata for the users it defines itself. In other words: +`accounts-daemon` will augment foreign accounts; `systemd-homed` cannot be used +to augment users defined elsewhere, for example in LDAP or as classic +`/etc/passwd` records. + +This means that for the time being, a user management UI (or other UI) +that wants to support rich user records while staying compatible with the status quo +ante should probably talk to both `systemd-homed` and `accounts-daemon` at the +same time, and ignore `accounts-daemon`'s records if `systemd-homed` defines +them. While I (Lennart) personally believe in the long run `systemd-homed` is +the way to go for rich user records, any UI that wants to manage and support +rich records for classic accounts has to support `accounts-daemon` in parallel +for the time being. 
+ +In the short term, it might make sense to also expose the `userdb` provided +records via `accounts-daemon`, so that clients of the latter can consume them +without changes. However, I think in the long run `accounts-daemon` should +probably be removed from the general stack, hence this sounds like a temporary +solution only. + +In case you wonder, there's no automatic mechanism for converting existing +users registered in `/etc/passwd` or LDAP to users managed by +`systemd-homed`. There's documentation for doing this manually though, see +[Converting Existing Users to systemd-homed managed Users](CONVERTING_TO_HOMED). + +## Future Additions + +JSON user/group records are extensible, hence we can easily add any additional +fields desktop environments require. For example, pattern-based authentication +is likely very useful on touch-based devices, and the user records should hence +learn them natively. Fields for other authentication mechanisms, such as +fingerprint authentication should be provided as well, eventually. + +It is planned to extend the `userdb` Varlink API to support look-ups by partial +user name and real name (GECOS) data, so that log-in screens can optionally +implement simple complete-as-you-type login screens. + +It is planned to extend the `systemd-homed` D-Bus API to instantly inform clients +about hardware associated with a specific user being plugged in, to which login +screens can listen in order to initiate authentication. Specifically, any +YubiKey-like security token plugged in that is associated with a local user +record should initiate authentication for that user, making typing in of the +username unnecessary. 
diff --git a/docs/USER_GROUP_API.md b/docs/USER_GROUP_API.md new file mode 100644 index 0000000..567b817 --- /dev/null +++ b/docs/USER_GROUP_API.md @@ -0,0 +1,285 @@ +--- +title: User/Group Record Lookup API via Varlink +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# User/Group Record Lookup API via Varlink + +JSON User/Group Records (as described in the [JSON User Records](USER_RECORD) +and [JSON Group Records](GROUP_RECORD) documents) that are defined on the +local system may be queried with a [Varlink](https://varlink.org/) API. This +API takes both the role of what +[`getpwnam(3)`](https://man7.org/linux/man-pages/man3/getpwnam.3.html) and +related calls are for `struct passwd`, as well as the interfaces modules +implementing the [glibc Name Service Switch +(NSS)](https://www.gnu.org/software/libc/manual/html_node/Name-Service-Switch.html) +expose. Or in other words, it both allows applications to efficiently query +user/group records from local services, and allows local subsystems to provide +user/group records efficiently to local applications. + +The concepts described here define an IPC interface. Alternatively, user/group +records may be dropped in a number of drop-in directories as files where they are +picked up in addition to the users/groups defined by this IPC logic. See +[`nss-systemd(8)`](https://www.freedesktop.org/software/systemd/man/nss-systemd.html) +for details. + +This simple API exposes only three method calls, and requires only a small +subset of the Varlink functionality. + +## Why Varlink? + +The API described in this document is based on a simple subset of the +mechanisms described by [Varlink](https://varlink.org/). The choice of +preferring Varlink over D-Bus and other IPCs in this context was made for three +reasons: + +1. User/Group record resolution should work during early boot and late shutdown + without special handling. 
This is very hard to do with D-Bus, as the broker + service for D-Bus generally runs as regular system daemon and is hence only + available at the latest boot stage. + +2. The JSON user/group records are native JSON data, hence picking an IPC + system that natively operates with JSON data is natural and clean. + +3. IPC systems such as D-Bus do not provide flow control and are thus unusable + for streaming data. They are useful to pass around short control messages, + but as soon as potentially many and large objects shall be transferred, + D-Bus is not suitable, as any such streaming of messages would be considered + flooding in D-Bus' logic, and thus possibly result in termination of + communication. Since the APIs defined in this document need to support + enumerating potentially large numbers of users and groups, D-Bus is simply + not an appropriate option. + +## Concepts + +Each subsystem that needs to define users and groups on the local system is +supposed to implement this API, and offer its interfaces on a Varlink +`AF_UNIX`/`SOCK_STREAM` file system socket bound into the +`/run/systemd/userdb/` directory. When a client wants to look up a user or +group record, it contacts all sockets bound in this directory in parallel, and +enqueues the same query to each. The first positive reply is then returned to +the application, or if all fail the last seen error is returned +instead. (Alternatively a special Varlink service is available, +`io.systemd.Multiplexer` which acts as frontend and will do the parallel +queries on behalf of the client, drastically simplifying client +development. This service is not available during earliest boot and final +shutdown phases.) + +Unlike with glibc NSS there's no order or programmatic expression language +defined in which queries are issued to the various services. 
Instead, all +queries are always enqueued in parallel to all defined services, in order to +make look-ups efficient, and the simple rule of "first successful lookup wins" +is unconditionally followed for user and group look-ups (though not for +membership lookups, see below). + +This simple scheme only works safely as long as every service providing +user/group records carefully makes sure not to answer with conflicting +records. This API does not define any mechanisms for dealing with user/group +name/ID collisions during look-up nor during record registration. It assumes +the various subsystems that want to offer user and group records to the rest of +the system have made sufficiently sure in advance that their definitions do not +collide with those of other services. Clients are not expected to merge +multiple definitions for the same user or group, and will also not be able to +detect conflicts and suppress such conflicting records. + +It is recommended to name the sockets in the directory in reverse domain name +notation, but this is neither required nor enforced. + +## Well-Known Services + +Any subsystem that wants to provide user/group records can do so, simply by +binding a socket in the aforementioned directory. By default two +services are listening there, that have special relevance: + +1. `io.systemd.NameServiceSwitch` → This service makes the classic UNIX/glibc + NSS user/group records available as JSON User/Group records. Any such + records are automatically converted as needed, and possibly augmented with + information from the shadow databases. + +2. `io.systemd.Multiplexer` → This service multiplexes client queries to all + other running services. It's supposed to simplify client development: in + order to look up or enumerate user/group records it's sufficient to talk to + one service instead of all of them in parallel. 
Note that it is not available + during earliest boot and final shutdown phases, hence for programs running + in that context it is preferable to implement the parallel lookup + themselves. + +Both these services are implemented by the same daemon +`systemd-userdbd.service`. + +Note that these services currently implement a subset of Varlink only. For +example, introspection is not available, and the resolver logic is not used. + +## Other Services + +The `systemd` project provides three other services implementing this +interface. Specifically: + +1. `io.systemd.DynamicUser` → This service is implemented by the service + manager itself, and provides records for the users and groups synthesized + via `DynamicUser=` in unit files. + +2. `io.systemd.Home` → This service is implemented by `systemd-homed.service` + and provides records for the users and groups defined by the home + directories it manages. + +3. `io.systemd.Machine` → This service is implemented by + `systemd-machined.service` and provides records for the users and groups used + by local containers that use user namespacing. + +Other projects are invited to implement these services too. For example, it +would make sense for LDAP/ActiveDirectory projects to implement these +interfaces, which would provide them a way to do per-user resource management +enforced by systemd and defined directly in LDAP directories. + +## Compatibility with NSS + +Two-way compatibility with classic UNIX/glibc NSS user/group records is +provided. When using the Varlink API, lookups into databases provided only via +NSS (and not natively via Varlink) are handled by the +`io.systemd.NameServiceSwitch` service (see above). When using the NSS API +(i.e. `getpwnam()` and friends) the `nss-systemd` module will automatically +synthesize NSS records for users/groups natively defined via a Varlink +API. Special care is taken to avoid recursion between these two compatibility +mechanisms. 
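
The NSS direction of this compatibility can be illustrated with a minimal sketch. It assumes that `nss-systemd` is enabled in `/etc/nsswitch.conf`; in that case an ordinary libc lookup should also return users that are defined natively via Varlink, without the caller doing anything special. The snippet uses Python's standard `pwd` module, which wraps `getpwnam()`/`getpwuid()`. The helper name `lookup_user` and the dictionary keys are illustrative only and are not part of any systemd API.

```python
import pwd


def lookup_user(name_or_uid):
    """Resolve a user through libc NSS (getpwnam()/getpwuid()).

    Returns a small dict for a known user, or None if NSS has no record.
    Which back-end answered (files, nss-systemd, LDAP, ...) is not visible here.
    """
    try:
        if isinstance(name_or_uid, int):
            entry = pwd.getpwuid(name_or_uid)
        else:
            entry = pwd.getpwnam(name_or_uid)
    except KeyError:
        # The pwd module raises KeyError when no matching record exists.
        return None
    return {
        "name": entry.pw_name,
        "uid": entry.pw_uid,
        "gid": entry.pw_gid,
        "home": entry.pw_dir,
        "shell": entry.pw_shell,
    }


if __name__ == "__main__":
    print(lookup_user(0))
```

Because the lookup goes through NSS, the caller only ever sees the classic `struct passwd` fields; the richer JSON fields are available through the Varlink API only.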
+ +Subsystems that shall provide user/group records to the system may choose +between offering them via an NSS module or via this Varlink API; either way +all records are accessible via both APIs, due to the bidirectional +forwarding. It is also possible to provide the same records via both APIs +directly, but in that case the compatibility logic must be turned off. There +are mechanisms in place for this, please contact the systemd project for +details, as these are currently not documented. + +## Caching of User Records + +This API defines no concepts for caching records. If caching is desired it +should be implemented in the subsystems that provide the user records, not in +the clients consuming them. + +## Method Calls + +``` +interface io.systemd.UserDatabase + +method GetUserRecord( + uid : ?int, + userName : ?string, + service : string +) -> ( + record : object, + incomplete : bool +) + +method GetGroupRecord( + gid : ?int, + groupName : ?string, + service : string +) -> ( + record : object, + incomplete : bool +) + +method GetMemberships( + userName : ?string, + groupName : ?string, + service : string +) -> ( + userName : string, + groupName : string +) + +error NoRecordFound() +error BadService() +error ServiceNotAvailable() +error ConflictingRecordFound() +error EnumerationNotSupported() +``` + +The `GetUserRecord` method looks up or enumerates a user record. If the `uid` +parameter is set it specifies the numeric UNIX UID to search for. If the +`userName` parameter is set it specifies the name of the user to search +for. Typically, only one of the two parameters is set, depending on whether a +look-up by UID or by name is desired. However, clients may also specify both +parameters, in which case a record matching both will be returned, and if only +one exists that matches one of the two parameters but not the other an error of +`ConflictingRecordFound` is returned. If neither of the two parameters is set +the whole user database is enumerated. 
In this case the method call needs to be +made with `more` set, so that multiple method call replies may be generated as +effect, each carrying one user record. + +The `service` parameter is mandatory and should be set to the service name +being talked to (i.e. to the same name as the `AF_UNIX` socket path, with the +`/run/systemd/userdb/` prefix removed). This is useful to allow implementation +of multiple services on the same socket (which is used by +`systemd-userdbd.service`). + +The method call returns one or more user records, depending which type of query is +used (see above). The record is returned in the `record` field. The +`incomplete` field indicates whether the record is complete. Services providing +user record lookup should only pass the `privileged` section of user records to +clients that either match the user the record is about or to sufficiently +privileged clients, for all others the section must be removed so that no +sensitive data is leaked this way. The `incomplete` parameter should indicate +whether the record has been modified like this or not (i.e. it is `true` if a +`privileged` section existed in the user record and was removed, and `false` if +no `privileged` section existed or one existed but hasn't been removed). + +If no user record matching the specified UID or name is known the error +`NoRecordFound` is returned (this is also returned if neither UID nor name are +specified, and hence enumeration requested but the subsystem currently has no +users defined). + +If a method call with an incorrectly set `service` field is received +(i.e. either not set at all, or not to the service's own name) a `BadService` +error is generated. Finally, `ServiceNotAvailable` should be returned when the +backing subsystem is not operational for some reason and hence no information +about existence or non-existence of a record can be returned nor any user +record at all. 
(The `service` field is defined in order to allow implementation +of daemons that provide multiple distinct user/group services over the same +`AF_UNIX` socket: in order to correctly determine which service a client wants +to talk to, the client needs to provide the name in each request.) + +The `GetGroupRecord` method call works analogously but for groups. + +The `GetMemberships` method call may be used to inquire about group +memberships. The `userName` and `groupName` arguments take what the name +suggests. If one of the two is specified all matching memberships are returned, +if neither is specified all known memberships of any user and any group are +returned. The return value is a pair of user name and group name, where the +user is a member of the group. If both arguments are specified the specified +membership will be tested for, but no others, and the pair is returned if it is +defined. Unless both arguments are specified the method call needs to be made +with `more` set, so that multiple replies can be returned (since typically +there are multiple members per group and also multiple groups a user is +member of). As with `GetUserRecord` and `GetGroupRecord` the `service` +parameter needs to contain the name of the service being talked to, in order to +allow implementation of multiple services within the same IPC socket. In case no +matching membership is known `NoRecordFound` is returned. The other two errors +are also generated in the same cases as for `GetUserRecord` and +`GetGroupRecord`. + +Unlike with `GetUserRecord` and `GetGroupRecord` the lists of memberships +returned by services are always combined. Thus unlike the other two calls a +membership lookup query has to wait for the last simultaneous query to complete +before the complete list is acquired. + +Note that only the `GetMemberships` call is authoritative about memberships of +users in groups. i.e. 
it should not be considered sufficient to check the +`memberOf` field of user records and the `members` field of group records to +acquire the full list of memberships. The full list can only be determined by +`GetMemberships`, and as mentioned requires merging of these lists of all local +services. Result of this is that it can be one service that defines a user A, +and another service that defines a group B, and a third service that declares +that A is a member of B. + +Looking up explicit users/groups by their name or UID/GID, or querying +user/group memberships must be supported by all services implementing these +interfaces. However, supporting enumeration (i.e. user/group lookups that may +result in more than one reply, because neither UID/GID nor name is specified) +is optional. Services which are asked for enumeration may return the +`EnumerationNotSupported` error in this case. + +And that's really all there is to it. diff --git a/docs/USER_NAMES.md b/docs/USER_NAMES.md new file mode 100644 index 0000000..74c24b5 --- /dev/null +++ b/docs/USER_NAMES.md @@ -0,0 +1,170 @@ +--- +title: User/Group Name Syntax +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# User/Group Name Syntax + +The precise set of allowed user and group names on Linux systems is weakly +defined. Depending on the distribution a different set of requirements and +restrictions on the syntax of user/group names are enforced — on some +distributions the accepted syntax is even configurable by the administrator. In +the interest of interoperability systemd enforces different rules when +processing users/group defined by other subsystems and when defining users/groups +itself, following the principle of "Be conservative in what you send, be +liberal in what you accept". Also in the interest of interoperability systemd +will enforce the same rules everywhere and not make them configurable or +distribution dependent. 
The precise rules are described below. + +Generally, the same rules apply to user names as to group names. + +## Other Systems + +* On POSIX the set of [valid user + names](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_437) + is defined as [lower and upper case ASCII letters, digits, period, + underscore, and + hyphen](https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap03.html#tag_03_282), + with the restriction that hyphen is not allowed as first character of the + user name. Interestingly no size limit is declared, i.e. in neither + direction, meaning that strictly speaking, according to POSIX, both the empty + string and a string of gigabytes in length are valid user names. + +* Debian/Ubuntu based systems enforce the regular expression + `^[a-z][-a-z0-9]*$`, i.e. only lower case ASCII letters, digits and + hyphens. As first character only lowercase ASCII letters are allowed. This + regular expression is configurable by the administrator at runtime + though. This rule enforces a minimum length of one character but no maximum + length. + +* Upstream shadow-utils enforces the regular expression + `^[a-z_][a-z0-9_-]*[$]?$`, i.e. is similar to the Debian/Ubuntu rule, but + additionally allows underscores (also as first character). Hyphens are still + not allowed as first character. Also, + an optional trailing dollar character is permitted. + +* Fedora/Red Hat based systems enforce the regular expression of + `^[a-zA-Z0-9_.][a-zA-Z0-9_.-]{0,30}[a-zA-Z0-9_.$-]?$`, i.e. a size limit of + 32 characters, with upper and lower case letters, digits, underscores, + hyphens and periods. No hyphen as first character though, and the last + character may be a dollar character. On top of that, `.` and `..` are not + allowed as user/group names. + +* sssd is known to generate user names with embedded `@` and white-space + characters, as well as non-ASCII (i.e. UTF-8) user/group names. 
+ +* winbindd is known to generate user/group names with embedded `\` and + white-space characters, as well as non-ASCII (i.e. UTF-8) user/group names. + +Other operating systems enforce different rules; in this documentation we'll +focus on Linux systems only however, hence those are out of scope. That said, +software like Samba is frequently deployed on Linux for providing compatibility +with Windows systems; on such systems it might be wise to stick to user/group +names also valid according to Windows rules. + +## Rules systemd enforces + +Distilled from the above, below are the rules systemd enforces on user/group +names. An additional, common rule between both modes listed below is that empty +strings are not valid user/group names. + +Philosophically, the strict mode described below enforces an allow list of +what's allowed and prohibits everything else, while the relaxed mode described +below implements a deny list of what's not allowed and permits everything else. + +### Strict mode + +Strict user/group name syntax is enforced whenever a systemd component is used +to register a user or group in the system, for example a system user/group +using +[`systemd-sysusers.service`](https://www.freedesktop.org/software/systemd/man/systemd-sysusers.html) +or a regular user with +[`systemd-homed.service`](https://www.freedesktop.org/software/systemd/man/systemd-homed.html). + +In strict mode, only uppercase and lowercase characters are allowed, as well as +digits, underscores and hyphens. The first character may not be a digit or +hyphen. A size limit is enforced: the minimum of `sysconf(_SC_LOGIN_NAME_MAX)` +(typically 256 on Linux; rationale: this is how POSIX suggests to detect the +limit), `UT_NAMESIZE-1` (typically 31 on Linux; rationale: names longer than +this cannot correctly appear in `utmp`/`wtmp` and create ambiguity with login +accounting) and `NAME_MAX` (255 on Linux; rationale: user names typically +appear in directory names, i.e. 
the home directory), thus MIN(256, 31, 255) = +31. + +Note that these rules are both more strict and more relaxed than all of the +rules enforced by other systems listed above. A user/group name conforming to +systemd's strict rules will not necessarily pass a test by the rules enforced +by these other subsystems. + +Written as a regular expression the above is: `^[a-zA-Z_][a-zA-Z0-9_-]{0,30}$` + +### Relaxed mode + +Relaxed user/group name syntax is enforced whenever a systemd component accepts +and makes use of user/group names registered by other (non-systemd) +components of the system, for example in +[`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.html). + +Relaxed syntax is also enforced by the `User=` setting in service unit files, +i.e. for system users used for running services. Since these users may be +registered by a variety of tools relaxed mode is used, but since the primary +purpose of these users is to run a system service and thus a job for systemd a +warning is shown if the specified user name does not qualify by the strict +rules above. + +Relaxed mode enforces the following rules: + +* No embedded NUL bytes (rationale: handling in C must be possible and + straightforward) + +* No names consisting fully of digits (rationale: avoid confusion with numeric + UID/GID specifications) + +* Similarly, no names consisting of an initial hyphen and otherwise entirely made + up of digits (rationale: avoid confusion with negative, numeric UID/GID + specifications, e.g. `-1`) + +* No strings that do not qualify as valid UTF-8 (rationale: we want to be able + to embed these strings in JSON, which permits only valid UTF-8 in its strings; + user names using other character sets, such as JIS/Shift-JIS will cause + validation errors) + +* No control characters (i.e. 
characters in ASCII range 1…31; rationale: they + tend to have special meaning when output on a terminal in other contexts, + moreover the newline character — as a specific control character — is used as + record separator in `/etc/passwd`, and hence it's crucial to avoid + ambiguities here) + +* No colon characters (rationale: it is used as field separator in `/etc/passwd`) + +* The two strings `.` and `..` are not permitted, as these have special meaning + in file system paths, and user names are frequently included in file system + paths, in particular for the purpose of home directories. + +* Similar, no slashes, as these have special meaning in file system paths + +* No leading or trailing white-space is permitted; and hence no user/group names + consisting of white-space only either (rationale: this typically indicates + parsing errors, and creates confusion since not visible on screen) + +Note that these relaxed rules are implied by the strict rules above, i.e. all +user/group names accepted by the strict rules are also accepted by the relaxed +rules, but not vice versa. + +Note that this relaxed mode does not refuse a couple of very questionable +syntaxes. For example, it permits a leading or embedded period. A leading period +is problematic because the matching home directory would typically be hidden +from the user's/administrator's view. An embedded period is problematic since +it creates ambiguity in traditional `chown` syntax (which is still accepted +today) that uses it to separate user and group names in the command's +parameter: without consulting the user/group databases it is not possible to +determine if a `chown` invocation would change just the owning user or both the +owning user and group. It also allows embedding `@` (which is confusing to +MTAs). 
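
The strict regular expression and the relaxed rule list above can be sketched in Python as follows. This is only an illustration of the rules as written in this document, not systemd's implementation: the function names `is_strict_name()` and `is_relaxed_name()` are made up for this example, the input is assumed to be a Python `str`, and the white-space check relies on `str.strip()`, which may differ in detail from what systemd treats as white-space.

```python
import re

# Strict mode: the regular expression given in the text above.
STRICT_RE = re.compile(r"[a-zA-Z_][a-zA-Z0-9_-]{0,30}")


def is_strict_name(name: str) -> bool:
    """Return True if `name` matches the strict-mode regular expression."""
    return STRICT_RE.fullmatch(name) is not None


def is_relaxed_name(name: str) -> bool:
    """Return True if `name` violates none of the relaxed-mode rules listed above."""
    if name == "":                        # empty names are invalid in both modes
        return False
    try:
        name.encode("utf-8")              # rejects strings not representable as UTF-8
    except UnicodeEncodeError:
        return False
    if re.fullmatch(r"-?[0-9]+", name):   # all digits, or a hyphen followed by digits
        return False
    if any(ord(c) <= 31 for c in name):   # NUL byte and control characters 1...31
        return False
    if ":" in name or "/" in name:        # /etc/passwd field separator, path separator
        return False
    if name in (".", ".."):               # special meaning in file system paths
        return False
    if name != name.strip():              # leading or trailing white-space
        return False
    return True
```

With these definitions a name such as `foo.bar` passes the relaxed check but fails the strict one, which mirrors the "questionable syntaxes" discussed above.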
+ +## Common Core + +Combining all rules listed above, user/group names that shall be considered +valid in all systemd contexts and on all Linux systems should match the +following regular expression (at least according to our understanding): + +`^[a-z][a-z0-9-]{0,30}$` diff --git a/docs/USER_RECORD.md b/docs/USER_RECORD.md new file mode 100644 index 0000000..8cfb053 --- /dev/null +++ b/docs/USER_RECORD.md @@ -0,0 +1,1149 @@ +--- +title: JSON User Records +category: Users, Groups and Home Directories +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# JSON User Records + +systemd optionally processes user records that go beyond the classic UNIX (or +glibc NSS) `struct passwd`. Various components of systemd are able to provide +and consume records in a more extensible format of a dictionary of key/value +pairs, encoded as JSON. Specifically: + +1. [`systemd-homed.service`](https://www.freedesktop.org/software/systemd/man/systemd-homed.service.html) + manages `human` user home directories and embeds these JSON records + directly in the home directory images + (see [Home Directories](HOME_DIRECTORY) for details). + +2. [`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) + processes these JSON records for users that log in, and applies various + settings to the activated session, including environment variables, nice + levels and more. + +3. [`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html) + processes these JSON records of users that log in, and applies various + resource management settings to the per-user slice units it manages. This + allows setting global limits on resource consumption by a specific user. + +4. 
[`nss-systemd`](https://www.freedesktop.org/software/systemd/man/nss-systemd.html) + is a glibc NSS module that synthesizes classic NSS records from these JSON + records, providing full backwards compatibility with the classic UNIX APIs + both for look-up and enumeration. + +5. The service manager (PID 1) exposes dynamic users (i.e. users synthesized as + effect of `DynamicUser=` in service unit files) as these advanced JSON + records, making them discoverable to the rest of the system. + +6. [`systemd-userdbd.service`](https://www.freedesktop.org/software/systemd/man/systemd-userdbd.service.html) + is a small service that can translate UNIX/glibc NSS records to these JSON + user records. It also provides a unified [Varlink](https://varlink.org/) API + for querying and enumerating records of this type, optionally acquiring them + from various other services. + +JSON user records may contain various fields that are not available in `struct +passwd`, and are extensible for other applications. For example, the record may +contain information about: + +1. Additional security credentials (PKCS#11 security token information, + biometrical authentication information, SSH public key information) + +2. Additional user metadata, such as a picture, email address, location string, + preferred language or timezone + +3. Resource Management settings (such as CPU/IO weights, memory and tasks + limits, classic UNIX resource limits or nice levels) + +4. Runtime parameters such as environment variables or the `nodev`, `noexec`, + `nosuid` flags to use for the home directory + +5. Information about where to mount the home directory from + +And various other things. The record is intended to be extensible, for example +the following extensions are envisioned: + +1. Windows network credential information + +2. Information about default IMAP, SMTP servers to use for this user + +3. Parental control information to enforce on this user + +4. 
Default parameters for backup applications and similar + +Similar to JSON User Records there are also +[JSON Group Records](GROUP_RECORD) that encapsulate UNIX groups. + +JSON User Records may be transferred or written to disk in various protocols +and formats. To inquire about such records defined on the local system use the +[User/Group Lookup API via Varlink](USER_GROUP_API). User/group records may +also be dropped in a number of drop-in directories as files. See +[`nss-systemd(8)`](https://www.freedesktop.org/software/systemd/man/nss-systemd.html) +for details. + +## Why JSON? + +JSON is nicely extensible and widely used. In particular it's easy to +synthesize and process with numerous programming languages. It's particularly +popular in the web communities, which hopefully should make it easy to link +user credential data from the web and from local systems more closely together. + +Please note that this specification assumes that JSON numbers may cover the full +integer range of -2^63 … 2^64-1 without loss of precision (i.e. INT64_MIN … +UINT64_MAX). Please read, write and process user records as defined by this +specification only with JSON implementations that provide this number range. + +## General Structure + +The JSON user records generated and processed by systemd follow a general +structure, consisting of seven distinct "sections". Specifically: + +1. Various fields are placed at the top level of the user record (the `regular` + section). These are generally fields that shall apply unconditionally to the + user in all contexts, are portable and not security sensitive. + +2. A number of fields are located in the `privileged` section (a sub-object of + the user record). Fields contained in this object are security sensitive, + i.e. contain information that the user and the administrator should be able + to see, but other users should not. In many ways this matches the data + stored in `/etc/shadow` in classic Linux user accounts, i.e. 
includes + password hashes and more. Algorithmically, when a user record is passed to + an untrusted client, by monopolizing such sensitive records in a single + object field we can easily remove it from view. + +3. A number of fields are located in objects inside the `perMachine` section + (an array field of the user record). Primarily these are resource + management-related fields, as those tend to make sense on a specific system + only, e.g. limiting a user's memory use to 1G only makes sense on a specific + system that has more than 1G of memory. Each object inside the `perMachine` + array comes with a `matchMachineId` or `matchHostname` field which indicate + which systems to apply the listed settings to. Note that many fields + accepted in the `perMachine` section can also be set at the top level (the + `regular` section), where they define the fallback if no matching object in + `perMachine` is found. + +4. Various fields are located in the `binding` section (a sub-sub-object of the + user record; an intermediary object is inserted which is keyed by the + machine ID of the host). Fields included in this section "bind" the object + to a specific system. They generally include non-portable information about + paths or UID assignments, that are true on a specific system, but not + necessarily on others, and which are managed automatically by some user + record manager (such as `systemd-homed`). Data in this section is considered + part of the user record only in the local context, and is generally not + ported to other systems. Due to that it is not included in the reduced user + record the cryptographic signature defined in the `signature` section is + calculated on. In `systemd-homed` this section is also removed when the + user's record is stored in the `~/.identity` file in the home directory, so + that every system with access to the home directory can manage these + `binding` fields individually. 
Typically, the binding section is persisted + to the local disk. + +5. Various fields are located in the `status` section (a sub-sub-object of the + user record, also with an intermediary object between that is keyed by the + machine ID, similar to the way the `binding` section is organized). This + section is augmented during runtime only, and never persisted to disk. The + idea is that this section contains information about current runtime + resource usage (for example: currently used disk space of the user), that + changes dynamically but is otherwise immediately associated with the user + record and for many purposes should be considered to be part of the user + record. + +6. The `signature` section contains one or more cryptographic signatures of a + reduced version of the user record. This is used to ensure that only user + records defined by a specific source are accepted on a system, by validating + the signature against the set of locally accepted signature public keys. The + signature is calculated from the JSON user record with all sections removed, + except for `regular`, `privileged`, `perMachine`. Specifically, `binding`, + `status`, `signature` itself and `secret` are removed first and thus not + covered by the signature. This section is optional, and is only used when + cryptographic validation of user records is required (as it is by + `systemd-homed.service` for example). + +7. The `secret` section contains secret user credentials, such as password or + PIN information. This data is never persisted, and never returned when user + records are inquired by a client, privileged or not. This data should only + be included in a user record very briefly, for example when certain very + specific operations are executed. For example, in tools such as + `systemd-homed` this section may be included in user records, when creating + a new home directory, as passwords and similar credentials need to be + provided to encrypt the home directory with. 
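
The section handling described in the list above can be illustrated with a short Python sketch. It assumes that each non-regular section appears as a top-level key of the record named after the section (`privileged`, `binding`, `status`, `signature`, `secret`); the helper names and the placeholder fields inside the example record are hypothetical and only serve the illustration.

```python
import copy

# Sections that are removed before the signature is calculated, per the list above.
UNSIGNED_SECTIONS = ("binding", "status", "signature", "secret")


def reduced_record_for_signing(record: dict) -> dict:
    """Return a deep copy without the sections the signature does not cover."""
    reduced = copy.deepcopy(record)
    for key in UNSIGNED_SECTIONS:
        reduced.pop(key, None)
    return reduced


def record_for_untrusted_client(record: dict) -> dict:
    """Return a deep copy with the `privileged` and `secret` sections removed."""
    public = copy.deepcopy(record)
    public.pop("privileged", None)
    public.pop("secret", None)
    return public


# Hypothetical record; the nested field names are placeholders, not real fields.
example = {
    "userName": "grobie",
    "privileged": {"placeholder": "sensitive"},
    "binding": {"<machine-id>": {"placeholder": "host-specific"}},
    "status": {"<machine-id>": {"placeholder": "runtime"}},
}
print(sorted(reduced_record_for_signing(example)))  # → ['privileged', 'userName']
```

Both helpers work on copies, so the caller's record keeps all of its sections.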
+ +Here's a tabular overview of the sections and their properties: + +| Section | Included in Signature | Persistent | Security Sensitive | Contains Host-Specific Data | +|------------|-----------------------|------------|--------------------|-----------------------------| +| regular | yes | yes | no | no | +| privileged | yes | yes | yes | no | +| perMachine | yes | yes | no | yes | +| binding | no | yes | no | yes | +| status | no | no | no | yes | +| signature | no | yes | no | no | +| secret | no | no | yes | no | + +Note that services providing user records to the local system are free to +manage only a subset of these sections and never include the others in +them. For example, a service that has no concept of signed records (for example +because the records it manages are inherently trusted anyway) does not have to +bother with the `signature` section. A service that only defines records in a +strictly local context and without signatures doesn't have to deal with the +`perMachine` or `binding` sections and can include its data exclusively in the +regular section. A service that uses a separate, private channel for +authenticating users (or that doesn't have a concept of authentication at all) +does not need to be concerned with the `secret` section of user records, as +the fields included therein are only useful when executing authentication +operations natively against JSON user records. + +The `systemd-homed` manager uses all seven sections for various +purposes. Inside the home directories (and if the LUKS2 backend is used, also +in the LUKS2 header) a user record containing the `regular`, `privileged`, +`perMachine` and `signature` sections is stored. `systemd-homed` also stores a +version of the record on the host, with the same four sections and augmented +with an additional, fifth `binding` section. 
When a local client enquires about +a user record managed by `systemd-homed` the service will add in some +additional information about the user and home directory in the `status` +section — this version is only transferred via IPC and never written to +disk. Finally the `secret` section is used during authentication operations via +IPC to transfer the user record along with its authentication tokens in one go. + +## Fields in the `regular` section + +As mentioned, the `regular` section's fields are placed at the top level +object. The following fields are currently defined: + +`userName` → The UNIX user name for this record. Takes a string with a valid +UNIX user name. This field is the only mandatory field, all others are +optional. Corresponds with the `pw_name` field of `struct passwd` and the +`sp_namp` field of `struct spwd` (i.e. the shadow user record stored in +`/etc/shadow`). See [User/Group Name Syntax](USER_NAMES) for +the (relaxed) rules the various systemd components enforce on user/group names. + +`realm` → The "realm" a user is defined in. This concept allows distinguishing +users with the same name that originate in different organizations or +installations. This should take a string in DNS domain syntax, but doesn't have +to refer to an actual DNS domain (though it is recommended to use one for +this). The idea is that the user `lpoetter` in the `redhat.com` realm might be +distinct from the same user in the `poettering.hq` realm. User records for the +same user name that have different realm fields are considered referring to +different users. When updating a user record it is required that any new +version has to match in both `userName` and `realm` field. This field is +optional, when unset the user should not be considered part of any realm. A +user record with a realm set is never compatible (for the purpose of updates, +see above) with a user record without one set, even if the `userName` field matches. 
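
As a small illustration of the `userName`/`realm` matching rule just described, the following Python sketch compares two records. The function name `same_user()` is hypothetical, and the sketch assumes an unset realm is represented by the field being absent.

```python
def same_user(a: dict, b: dict) -> bool:
    """Return True if two records refer to the same user per the userName/realm rule."""
    # dict.get() yields None for an unset realm, so a record with a realm never
    # matches a record without one, even if the user names are equal.
    return a["userName"] == b["userName"] and a.get("realm") == b.get("realm")
```

An update would only be accepted if `same_user(old_record, new_record)` holds.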
+ +`realName` → The real name of the user, a string. This should contain the +user's real ("human") name, and corresponds loosely to the GECOS field of +classic UNIX user records. When converting a `struct passwd` to a JSON user +record this field is initialized from GECOS (i.e. the `pw_gecos` field), and +vice versa when converting back. That said, unlike GECOS this field is supposed +to contain only the real name and no other information. This field must not +contain control characters (such as `\n`) or colons (`:`), since those are used +as record separators in classic `/etc/passwd` files and similar formats. + +`emailAddress` → The email address of the user, formatted as +string. [`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +initializes the `$EMAIL` environment variable from this value for all login +sessions. + +`iconName` → The name of an icon picked by the user, for example for the +purpose of an avatar. This must be a string, and should follow the semantics +defined in the [Icon Naming +Specification](https://standards.freedesktop.org/icon-naming-spec/icon-naming-spec-latest.html). + +`location` → A free-form location string describing the location of the user, +if that is applicable. It's probably wise to use a location string processable +by geo-location subsystems, but this is not enforced nor required. Example: +`Berlin, Germany` or `Basement, Room 3a`. + +`disposition` → A string, one of `intrinsic`, `system`, `dynamic`, `regular`, +`container`, `reserved`. If specified clarifies the disposition of the user, +i.e. the context it is defined in. For regular, "human" users this should be +`regular`, for system users (i.e. users that system services run under, and +similar) this should be `system`. The `intrinsic` disposition should be used +only for the two users that have special meaning to the OS kernel itself, +i.e. the `root` and `nobody` users. 
The `container` string should be used for +users that are used by an OS container, and hence will show up in `ps` listings +and such, but are only defined in container context. Finally `reserved` should +be used for any users outside of these use-cases. Note that this property is +entirely optional and applications are assumed to be able to derive the +disposition of a user automatically from a record even in absence of this +field, based on other fields, for example the numeric UID. By setting this +field explicitly applications can override this default determination. + +`lastChangeUSec` → An unsigned 64-bit integer value, referring to a timestamp in µs +since the epoch 1970, indicating when the user record (specifically, any of the +`regular`, `privileged`, `perMachine` sections) was last changed. This field is +used when comparing two records of the same user to identify the newer one, and +is used for example for automatic updating of user records, where appropriate. + +`lastPasswordChangeUSec` → Similar, also an unsigned 64-bit integer value, +indicating the point in time the password (or any authentication token) of the +user was last changed. This corresponds to the `sp_lstchg` field of `struct +spwd`, i.e. the matching field in the user shadow database `/etc/shadow`, +though provides finer resolution. + +`shell` → A string, referring to the shell binary to use for terminal logins of +this user. This corresponds with the `pw_shell` field of `struct passwd`, and +should contain an absolute file system path. For system users not suitable for +terminal log-in this field should not be set. + +`umask` → The `umask` to set for the user's login sessions. Takes an +integer. Note that usually on UNIX the umask is noted in octal, but JSON's +integers are generally written in decimal, hence in this context we denote it +umask in decimal too. The specified value should be in the valid range for +umasks, i.e. 
0000…0777 (in octal as typical in UNIX), or 0…511 (in decimal, how +it actually appears in the JSON record). This `umask` is automatically set by +[`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +for all login sessions of the user. + +`environment` → An array of strings, each containing an environment variable +and its value to set for the user's login session, in a format compatible with +[`putenv()`](https://man7.org/linux/man-pages/man3/putenv.3.html). Any +environment variable listed here is automatically set by +[`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +for all login sessions of the user. + +`timeZone` → A string indicating a preferred timezone to use for the user. When +logging in +[`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +will automatically initialize the `$TZ` environment variable from this +string. The string should be a `tzdata` compatible location string, for +example: `Europe/Berlin`. + +`preferredLanguage` → A string indicating the preferred language/locale for the +user. When logging in +[`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +will automatically initialize the `$LANG` environment variable from this +string. The string hence should be in a format compatible with this environment +variable, for example: `de_DE.UTF8`. + +`niceLevel` → An integer value in the range -20…19. When logging in +[`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +will automatically initialize the login process' nice level to this value, +which is then inherited by all the user's processes, see +[`setpriority()`](https://man7.org/linux/man-pages/man2/setpriority.2.html) for +more information. + +`resourceLimits` → An object, where each key refers to a Linux resource limit +(such as `RLIMIT_NOFILE` and similar). 
Their values should be an object with +two keys `cur` and `max` for the soft and hard resource limit. When logging in +[`pam_systemd`](https://www.freedesktop.org/software/systemd/man/pam_systemd.html) +will automatically initialize the login process' resource limits to these +values, which is then inherited by all the user's processes, see +[`setrlimit()`](https://man7.org/linux/man-pages/man2/setrlimit.2.html) for more +information. + +`locked` → A boolean value. If true, the user account is locked, the user may +not log in. If this field is missing it should be assumed to be false, +i.e. logins are permitted. This field corresponds to the `sp_expire` field of +`struct spwd` (i.e. the `/etc/shadow` data for a user) being set to zero or +one. + +`notBeforeUSec` → An unsigned 64-bit integer value, indicating a time in µs since +the UNIX epoch (1970) before which the record should be considered invalid for +the purpose of logging in. + +`notAfterUSec` → Similar, but indicates the point in time *after* which logins +shall not be permitted anymore. This corresponds to the `sp_expire` field of +`struct spwd`, when it is set to a value larger than one, but provides finer +granularity. + +`storage` → A string, one of `classic`, `luks`, `directory`, `subvolume`, +`fscrypt`, `cifs`. Indicates the storage mechanism for the user's home +directory. If `classic` the home directory is a plain directory as in classic +UNIX. When `directory`, the home directory is a regular directory, but the +`~/.identity` file in it contains the user's user record, so that the directory +is self-contained. Similar, `subvolume` is a `btrfs` subvolume that also +contains a `~/.identity` user record; `fscrypt` is an `fscrypt`-encrypted +directory, also containing the `~/.identity` user record; `luks` is a per-user +LUKS volume that is mounted as home directory, and `cifs` a home directory +mounted from a Windows File Share. 
The five latter types are primarily used by +`systemd-homed` when managing home directories, but may be used if other +managers are used too. If this is not set, `classic` is the implied default. + +`diskSize` → An unsigned 64-bit integer, indicating the intended home directory +disk space in bytes to assign to the user. Depending on the selected storage +type this might be implemented differently: for `luks` this is the intended size +of the file system and LUKS volume, while for the others this likely translates +to classic file system quota settings. + +`diskSizeRelative` → Similar to `diskSize` but takes a relative value, but +specifies a fraction of the available disk space on the selected storage medium +to assign to the user. This unsigned integer value is normalized to 2^32 = +100%. + +`skeletonDirectory` → Takes a string with the absolute path to the skeleton +directory to populate a new home directory from. This is only used when a home +directory is first created, and defaults to `/etc/skel` if not defined. + +`accessMode` → Takes an unsigned integer in the range 0…511 indicating the UNIX +access mask for the home directory when it is first created. + +`tasksMax` → Takes an unsigned 64-bit integer indicating the maximum number of +tasks the user may start in parallel during system runtime. This counts +all tasks (i.e. threads, where each process is at least one thread) the user starts or that are +forked from these processes even if the user identity is changed (for example +by setuid binaries/`su`/`sudo` and similar). +[`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html) +enforces this by setting the `TasksMax` slice property for the user's slice +`user-$UID.slice`. + +`memoryHigh`/`memoryMax` → These take unsigned 64-bit integers indicating upper +memory limits for all processes of the user (plus all processes forked off them +that might have changed user identity), in bytes. 
Enforced by +[`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html), +similar to `tasksMax`. + +`cpuWeight`/`ioWeight` → These take unsigned integers in the range 1…10000 +(defaults to 100) and configure the CPU and IO scheduling weights for the +user's processes as a whole. Also enforced by +[`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html), +similar to `tasksMax`, `memoryHigh` and `memoryMax`. + +`mountNoDevices`/`mountNoSuid`/`mountNoExecute` → Three booleans that control +the `nodev`, `nosuid`, `noexec` mount flags of the user's home +directories. Note that these booleans are only honored if the home directory +is managed by a subsystem such as `systemd-homed.service` that automatically +mounts home directories on login. + +`cifsDomain` → A string indicating the Windows File Sharing domain (CIFS) to +use. This is generally useful, but particularly when `cifs` is used as storage +mechanism for the user's home directory, see above. + +`cifsUserName` → A string indicating the Windows File Sharing user name (CIFS) +to associate this user record with. This is generally useful, but particularly +useful when `cifs` is used as storage mechanism for the user's home directory, +see above. + +`cifsService` → A string indicating the Windows File Share service (CIFS) to +mount as home directory of the user on login. Should be in format +`//<host>/<service>/<directory/…>`. The directory part is optional. If missing +the top-level directory of the CIFS share is used. + +`cifsExtraMountOptions` → A string with additional mount options to pass to +`mount.cifs` when mounting the home directory CIFS share. + +`imagePath` → A string with an absolute file system path to the file, directory +or block device to use for storage backing the home directory. If the `luks` +storage is used, this refers to the loopback file or block device node to store +the LUKS volume on. 
For `fscrypt`, `directory`, `subvolume` this refers to the +directory to bind mount as home directory on login. Not defined for `classic` +or `cifs`. + +`homeDirectory` → A string with an absolute file system path to the home +directory. This is where the image indicated in `imagePath` is mounted to on +login and thus indicates the application facing home directory while the home +directory is active, and is what the user's `$HOME` environment variable is set +to during log-in. It corresponds to the `pw_dir` field of `struct passwd`. + +`uid` → An unsigned integer in the range 0…4294967295: the numeric UNIX user ID (UID) to +use for the user. This corresponds to the `pw_uid` field of `struct passwd`. + +`gid` → An unsigned integer in the range 0…4294967295: the numeric UNIX group +ID (GID) to use for the user. This corresponds to the `pw_gid` field of +`struct passwd`. + +`memberOf` → An array of strings, each indicating a UNIX group this user shall +be a member of. The listed strings must be valid group names, but it is not +required that all groups listed exist in all contexts: any entry for which no +group exists should be silently ignored. + +`fileSystemType` → A string, one of `ext4`, `xfs`, `btrfs` (possibly others) to +use as file system for the user's home directory. This is primarily relevant +when the storage mechanism used is `luks` as a file system to use inside the +LUKS container must be selected. + +`partitionUuid` → A string containing a lower-case, text-formatted UUID, referencing +the GPT partition UUID the home directory is located in. This is primarily +relevant when the storage mechanism used is `luks`. + +`luksUuid` → A string containing a lower-case, text-formatted UUID, referencing +the LUKS volume UUID the home directory is located in. This is primarily +relevant when the storage mechanism used is `luks`. 
+ +`fileSystemUuid` → A string containing a lower-case, text-formatted UUID, +referencing the file system UUID the home directory is located in. This is +primarily relevant when the storage mechanism used is `luks`. + +`luksDiscard` → A boolean. If true and `luks` storage is used, controls whether +the loopback block devices, LUKS and the file system on top shall be used in +`discard` mode, i.e. erased sectors should always be returned to the underlying +storage. If false and `luks` storage is used turns this behavior off. In +addition, depending on this setting an `FITRIM` or `fallocate()` operation is +executed to make sure the image matches the selected option. + +`luksOfflineDiscard` → A boolean. Similar to `luksDiscard`, it controls whether +to trim/allocate the file system/backing file when deactivating the home +directory. + +`luksExtraMountOptions` → A string with additional mount options to append to +the default mount options for the file system in the LUKS volume. + +`luksCipher` → A string, indicating the cipher to use for the LUKS storage mechanism. + +`luksCipherMode` → A string, selecting the cipher mode to use for the LUKS storage mechanism. + +`luksVolumeKeySize` → An unsigned integer, indicating the volume key length in +bytes to use for the LUKS storage mechanism. + +`luksPbkdfHashAlgorithm` → A string, selecting the hash algorithm to use for +the PBKDF operation for the LUKS storage mechanism. + +`luksPbkdfType` → A string, indicating the PBKDF type to use for the LUKS storage mechanism. + +`luksPbkdfForceIterations` → An unsigned 64-bit integer, indicating the intended +number of iterations for the PBKDF operation, when LUKS storage is used. + +`luksPbkdfTimeCostUSec` → An unsigned 64-bit integer, indicating the intended +time cost for the PBKDF operation, when the LUKS storage mechanism is used, in +µs. Ignored when `luksPbkdfForceIterations` is set. 
+ +`luksPbkdfMemoryCost` → An unsigned 64-bit integer, indicating the intended +memory cost for the PBKDF operation, when LUKS storage is used, in bytes. + +`luksPbkdfParallelThreads` → An unsigned 64-bit integer, indicating the intended +number of parallel threads for the PBKDF operation, when LUKS storage is used. + +`luksSectorSize` → An unsigned 64-bit integer, indicating the sector size to +use for the LUKS storage mechanism, in bytes. Must be a power of two between +512 and 4096. + +`autoResizeMode` → A string, one of `off`, `grow`, `shrink-and-grow`. Unless +set to `off`, controls whether the home area shall be grown automatically to +the size configured in `diskSize` at login time. If set to +`shrink-and-grow` the home area is also shrunk to the minimal size possible +(as dictated by used disk space and file system constraints) on logout. + +`rebalanceWeight` → An unsigned integer, `null` or a boolean. Configures the +free disk space rebalancing weight for the home area. The integer must be in +the range 1…10000 to configure an explicit weight. If unset, or set to `null` +or `true` the default weight of 100 is implied. If set to 0 or `false` +rebalancing is turned off for this home area. + +`service` → A string declaring the service that defines or manages this user +record. It is recommended to use reverse domain name notation for this. For +example, if `systemd-homed` manages a user a string of `io.systemd.Home` is +used for this. + +`rateLimitIntervalUSec` → An unsigned 64-bit integer that configures the +authentication rate limiting enforced on the user account. This specifies a +timer interval (in µs) within which to count authentication attempts. When the +counter goes above the value configured in `rateLimitIntervalBurst` log-ins are +temporarily refused until the interval passes. 
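
To make the interplay of `rateLimitIntervalUSec` and `rateLimitIntervalBurst` more concrete, here is a minimal Python sketch of a fixed-window attempt counter. The class `AuthRateLimit` is hypothetical, and the fixed-window accounting is an assumption made for this example; the actual accounting used by systemd components may differ in detail.

```python
class AuthRateLimit:
    """Counts authentication attempts within a fixed interval (hypothetical helper)."""

    def __init__(self, interval_usec: int, burst: int) -> None:
        self.interval_usec = interval_usec  # value of rateLimitIntervalUSec
        self.burst = burst                  # value of rateLimitIntervalBurst
        self.window_start_usec = None       # start of the current interval
        self.count = 0                      # attempts seen in the current interval

    def attempt_allowed(self, now_usec: int) -> bool:
        """Register one attempt at `now_usec`; return False if it must be refused."""
        if (self.window_start_usec is None
                or now_usec - self.window_start_usec >= self.interval_usec):
            # The previous interval has passed: start a new one.
            self.window_start_usec = now_usec
            self.count = 0
        self.count += 1
        # Attempts are refused once the counter goes above the burst value.
        return self.count <= self.burst
```

With an interval of one second and a burst of two, a third attempt inside the same second is refused, and attempts are accepted again once the interval has passed.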
+ +`rateLimitBurst` → An unsigned 64-bit integer, closely related to +`rateLimitIntervalUSec`, that puts a limit on authentication attempts within +the configured time interval. + +`enforcePasswordPolicy` → A boolean. Configures whether to enforce the system's +password policy when creating the home directory for the user or changing the +user's password. By default the policy is enforced, but if this field is false +it is bypassed. + +`autoLogin` → A boolean. If true the user record is marked as suitable for +auto-login. Systems are supposed to automatically log in a user marked this way +during boot, if there's exactly one user on it defined this way. + +`stopDelayUSec` → An unsigned 64-bit integer, indicating the time in µs the +per-user service manager is kept around after the user fully logged out. This +value is honored by +[`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html). If +set to zero the per-user service manager is immediately terminated when the +user logs out, and longer values optimize high-frequency log-ins as the +necessary work to set up and tear down a log-in is reduced if the service +manager stays running. + +`killProcesses` → A boolean. If true all processes of the user are +automatically killed when the user logs out. This is enforced by +[`systemd-logind.service`](https://www.freedesktop.org/software/systemd/man/systemd-logind.service.html). If +false any processes left around when the user logs out are left running. + +`passwordChangeMinUSec`/`passwordChangeMaxUSec` → An unsigned 64-bit integer, +encoding how much time has to pass at least/at most between password changes of +the user. This corresponds with the `sp_min` and `sp_max` fields of `struct +spwd` (i.e. the `/etc/shadow` entries of the user), but offers finer +granularity. + +`passwordChangeWarnUSec` → An unsigned 64-bit integer, encoding how much time to +warn the user before their password expires, in µs.
This corresponds with the +`sp_warn` field of `struct spwd`. + +`passwordChangeInactiveUSec` → An unsigned 64-bit integer, encoding how much +time has to pass after the password expired that the account is +deactivated. This corresponds with the `sp_inact` field of `struct spwd`. + +`passwordChangeNow` → A boolean. If true the user has to change their password +on next login. This corresponds with the `sp_lstchg` field of `struct spwd` +being set to zero. + +`pkcs11TokenUri` → An array of strings, each with an RFC 7512 compliant PKCS#11 +URI referring to security token (or smart card) of some form, that shall be +associated with the user and may be used for authentication. The URI is used to +search for an X.509 certificate and associated private key that may be used to +decrypt an encrypted secret key that is used to unlock the user's account (see +below). It's undefined how precise the URI is: during log-in it is tested +against all plugged in security tokens and if there's exactly one matching +private key found with it it is used. + +`fido2HmacCredential` → An array of strings, each with a Base64-encoded FIDO2 +credential ID that shall be used for authentication with FIDO2 devices that +implement the `hmac-secret` extension. The salt to pass to the FIDO2 device is +found in `fido2HmacSalt`. + +`recoveryKeyType` → An array of strings, each indicating the type of one +recovery key. The only supported recovery key type at the moment is `modhex64`, +for details see the description of `recoveryKey` below. An account may have any +number of recovery keys defined, and the array should have one entry for each. + +`privileged` → An object, which contains the fields of the `privileged` section +of the user record, see below. + +`perMachine` → An array of objects, which contain the `perMachine` section of +the user record, and thus fields to apply on specific systems only, see below. 
+ +`binding` → An object, keyed by machine IDs formatted as strings, pointing +to objects that contain the `binding` section of the user record, +i.e. additional fields that bind the user record to a specific machine, see +below. + +`status` → An object, keyed by machine IDs formatted as strings, pointing to +objects that contain the `status` section of the user record, i.e. additional +runtime fields that expose the current status of the user record on a specific +system, see below. + +`signature` → An array of objects, which contain cryptographic signatures of +the user record, i.e. the fields of the `signature` section of the user record, +see below. + +`secret` → An object, which contains the fields of the `secret` section of the +user record, see below. + +## Fields in the `privileged` section + +As mentioned, the `privileged` section is encoded in a sub-object of the user +record top-level object, in the `privileged` field. Any data included in this +object shall only be visible to the administrator and the user themselves, and +be suppressed implicitly when other users get access to a user record. It thus +takes the role of the `/etc/shadow` records for each user, which has similarly +restrictive access semantics. The following fields are currently defined: + +`passwordHint` → A user-selected password hint in free-form text. This should +be a string like "What's the name of your first pet?", but is entirely for the +user to choose. + +`hashedPassword` → An array of strings, each containing a hashed UNIX password +string, in the format +[`crypt(3)`](https://man7.org/linux/man-pages/man3/crypt.3.html) generates. This +corresponds with `sp_pwdp` field of `struct spwd` (and in a way the `pw_passwd` +field of `struct passwd`). + +`sshAuthorizedKeys` → An array of strings, each listing an SSH public key that +is authorized to access the account. The strings should follow the same format +as the lines in the traditional `~/.ssh/authorized_keys` file. 
+ +`pkcs11EncryptedKey` → An array of objects. Each element of the array should be +an object consisting of three string fields: `uri` shall contain a PKCS#11 +security token URI, `data` shall contain a Base64-encoded encrypted key and +`hashedPassword` shall contain a UNIX password hash to test the key +against. Authenticating with a security token against this account shall work +as follows: the encrypted secret key is converted from its Base64 +representation into binary, then decrypted with the PKCS#11 `C_Decrypt()` +function of the PKCS#11 module referenced by the specified URI, using the +private key found on the same token. The resulting decrypted key is then +Base64-encoded and tested against the specified UNIX hashed password. The +Base64-encoded decrypted key may also be used to unlock further resources +during log-in, for example the LUKS or `fscrypt` storage backend. It is +generally recommended that for each entry in `pkcs11EncryptedKey` there's also +a matching one in `pkcs11TokenUri` and vice versa, with the same URI, appearing +in the same order, but this should not be required by applications processing +user records. + +`fido2HmacSalt` → An array of objects, implementing authentication support with +FIDO2 devices that implement the `hmac-secret` extension. Each element of the +array should be an object consisting of three string fields: `credential`, +`salt`, `hashedPassword`, and three boolean fields: `up`, `uv` and +`clientPin`. The first two string fields shall contain Base64-encoded binary +data: the FIDO2 credential ID and the salt value to pass to the FIDO2 +device. During authentication this salt along with the credential ID is sent to +the FIDO2 token, which will HMAC hash the salt with its internal secret key and +return the result. This resulting binary key should then be Base64-encoded and +used as string password for the further layers of the stack. 
The +`hashedPassword` field of the `fido2HmacSalt` field shall be a UNIX password +hash to test this derived secret key against for authentication. The `up`, `uv` +and `clientPin` booleans map to the FIDO2 concepts of the same name and encode +whether the `uv`/`up` options are enabled during the authentication, and +whether a PIN shall be required. It is generally recommended that for each +entry in `fido2HmacSalt` there's also a matching one in `fido2HmacCredential`, +and vice versa, with the same credential ID, appearing in the same order, but +this should not be required by applications processing user records. + +`recoveryKey`→ An array of objects, each defining a recovery key. The object +has two mandatory fields: `type` indicates the type of recovery key. The only +currently permitted value is the string `modhex64`. The `hashedPassword` field +contains a UNIX password hash of the normalized recovery key. Recovery keys are +in most ways similar to regular passwords, except that they are generated by +the computer, not chosen by the user, and are longer. Currently, the only +supported recovery key format is `modhex64`, which consists of 64 +[modhex](https://developers.yubico.com/yubico-c/Manuals/modhex.1.html) +characters (i.e. 256bit of information), in groups of 8 chars separated by +dashes, +e.g. `lhkbicdj-trbuftjv-tviijfck-dfvbknrh-uiulbhui-higltier-kecfhkbk-egrirkui`. Recovery +keys should be accepted wherever regular passwords are. The `recoveryKey` field +should always be accompanied by a `recoveryKeyType` field (see above), and each +entry in either should map 1:1 to an entry in the other, in the same order and +matching the type. When accepting a recovery key it should be brought +automatically into normalized form, i.e. the dashes inserted when missing, and +converted into lowercase before tested against the UNIX password hash, so that +recovery keys are effectively case-insensitive. 
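The normalization step described for `modhex64` recovery keys can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code `systemd-homed` uses: the function name `normalize_recovery_key` is hypothetical, and the sketch simplifies "dashes inserted when missing" by dropping all dashes and re-inserting them after every 8 characters.

```python
# Hypothetical helper, not part of systemd: bring a modhex64 recovery key
# into the normalized form that is tested against the UNIX password hash.

MODHEX_ALPHABET = set("cbdefghijklnrtuv")  # the 16 modhex characters


def normalize_recovery_key(text: str) -> str:
    """Lowercase the key and group it into 8 blocks of 8 characters."""
    # Simplification (assumption): remove every dash, then re-insert them.
    chars = text.strip().lower().replace("-", "")
    if len(chars) != 64:
        raise ValueError("a modhex64 recovery key has 64 characters")
    if not set(chars) <= MODHEX_ALPHABET:
        raise ValueError("recovery key contains non-modhex characters")
    return "-".join(chars[i:i + 8] for i in range(0, 64, 8))
```

With such a helper, a key typed in upper case and without dashes normalizes to the same string as the canonical form, which is what makes recovery keys effectively case-insensitive.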
+ +## Fields in the `perMachine` section + +As mentioned, the `perMachine` section contains settings that shall apply to +specific systems only. This is primarily interesting for resource management +properties as they tend to require a per-system focus, however they may be used +for other purposes too. + +The `perMachine` field in the top-level object is an array of objects. When +processing the user record first the various fields on the top-level object +should be parsed. Then, the `perMachine` array should be iterated in order, and +the various settings within each contained object should be applied that match +either the indicated machine ID or host name, overriding any corresponding +settings previously parsed from the top-level object. There may be multiple +array entries that match a specific system, in which case all settings should +be applied. If the same option is set in the top-level object as in a +per-machine object then the per-machine setting wins and entirely undoes the +setting in the top-level object (i.e. no merging of properties that are arrays +is done). If the same option is set in multiple per-machine objects the one +specified later in the array wins (and here too no merging of individual fields +is done, the later field always wins in full). To summarize, the order of +application is (last one wins): + +1. Settings in the top-level object +2. Settings in the first matching `perMachine` array entry +3. Settings in the second matching `perMachine` array entry +4. … +5. Settings in the last matching `perMachine` array entry + +The following fields are defined in this section: + +`matchMachineId` → An array of strings that are formatted 128-bit IDs in +hex. If any of the specified IDs match the system's local machine ID +(i.e. matches `/etc/machine-id`) the fields in this object are honored. (As a +special case, if only a single machine ID is listed this field may be a single +string rather than an array of strings.) 
+ +`matchHostname` → An array of strings that are valid hostnames. If any of the +specified hostnames match the system's local hostname, the fields in this +object are honored. If both `matchHostname` and `matchMachineId` are used +within the same array entry, the object is honored when either match succeeds, +i.e. the two match types are combined in OR, not in AND. (As a special case, if +only a single hostname is listed this field may be a single string rather +than an array of strings.) + +These two are the only two fields specific to this section. All other fields +that may be used in this section are identical to the equally named ones in the +`regular` section (i.e. at the top-level object). Specifically, these are: + +`iconName`, `location`, `shell`, `umask`, `environment`, `timeZone`, +`preferredLanguage`, `niceLevel`, `resourceLimits`, `locked`, `notBeforeUSec`, +`notAfterUSec`, `storage`, `diskSize`, `diskSizeRelative`, `skeletonDirectory`, +`accessMode`, `tasksMax`, `memoryHigh`, `memoryMax`, `cpuWeight`, `ioWeight`, +`mountNoDevices`, `mountNoSuid`, `mountNoExecute`, `cifsDomain`, +`cifsUserName`, `cifsService`, `cifsExtraMountOptions`, `imagePath`, `uid`, +`gid`, `memberOf`, `fileSystemType`, `partitionUuid`, `luksUuid`, +`fileSystemUuid`, `luksDiscard`, `luksOfflineDiscard`, `luksCipher`, +`luksCipherMode`, `luksVolumeKeySize`, `luksPbkdfHashAlgorithm`, +`luksPbkdfType`, `luksPbkdfForceIterations`, `luksPbkdfTimeCostUSec`, `luksPbkdfMemoryCost`, +`luksPbkdfParallelThreads`, `luksSectorSize`, `autoResizeMode`, `rebalanceWeight`, +`rateLimitIntervalUSec`, `rateLimitBurst`, `enforcePasswordPolicy`, +`autoLogin`, `stopDelayUSec`, `killProcesses`, `passwordChangeMinUSec`, +`passwordChangeMaxUSec`, `passwordChangeWarnUSec`, +`passwordChangeInactiveUSec`, `passwordChangeNow`, `pkcs11TokenUri`, +`fido2HmacCredential`.
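The order of application described in this section (top-level fields first, then each matching `perMachine` entry in array order, last one wins, no merging of arrays) might be sketched as follows. This is a minimal illustration in Python, not how any systemd component is implemented; the function name `apply_per_machine` and the helper `_as_list` are hypothetical.

```python
import copy

# Fields that only select whether an entry applies; they are not settings.
MATCH_FIELDS = ("matchMachineId", "matchHostname")


def _as_list(value):
    # A single string is permitted as shorthand for a one-element array.
    if value is None:
        return []
    return [value] if isinstance(value, str) else list(value)


def apply_per_machine(record: dict, machine_id: str, hostname: str) -> dict:
    """Return the top-level fields with matching perMachine overrides applied."""
    effective = {k: copy.deepcopy(v) for k, v in record.items() if k != "perMachine"}
    for entry in record.get("perMachine", []):
        ids = _as_list(entry.get("matchMachineId"))
        hosts = _as_list(entry.get("matchHostname"))
        # The two match types are combined with OR.
        if machine_id in ids or hostname in hosts:
            for key, value in entry.items():
                if key not in MATCH_FIELDS:
                    # The later value replaces the earlier one in full.
                    effective[key] = copy.deepcopy(value)
    return effective
```

Because values are replaced rather than merged, a `memberOf` array in a matching entry fully replaces the top-level `memberOf` array instead of extending it.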
+ +## Fields in the `binding` section + +As mentioned, the `binding` section contains additional fields about the user +record, that bind it to the local system. These fields are generally used by a +local user manager (such as `systemd-homed.service`) to add in fields that make +sense in a local context but not necessarily in a global one. For example, a +user record that contains no `uid` field in the regular section is likely +extended with one in the `binding` section to assign a local UID if no global +UID is defined. + +All fields in the `binding` section only make sense in a local context and are +suppressed when the user record is ported between systems. The `binding` section +is generally persisted on the system but not in the home directories themselves +and the home directory is supposed to be fully portable and thus not contain +the information that `binding` is supposed to contain that binds the portable +record to a specific system. + +The `binding` sub-object on the top-level user record object is keyed by the +machine ID the binding is intended for, which point to an object with the +fields of the bindings. These fields generally match fields that may also be +defined in the `regular` and `perMachine` sections, however override +both. Usually, the `binding` value should not contain settings different from +those set via `regular` or `perMachine`, however this might happen if some +settings are not supported locally (think: `fscrypt` is recorded as intended +storage mechanism in the `regular` section, but the local kernel does not +support `fscrypt`, hence `directory` was chosen as implicit fallback), or have +been changed in the `regular` section through updates (e.g. a home directory +was created with `luks` as storage mechanism but later the user record was +updated to prefer `subvolume`, which however doesn't change the actual storage +used already which is pinned in the `binding` section). 
+ +The following fields are defined in the `binding` section. They all have an +identical format and override their equally named counterparts in the `regular` +and `perMachine` sections: + +`imagePath`, `homeDirectory`, `partitionUuid`, `luksUuid`, `fileSystemUuid`, +`uid`, `gid`, `storage`, `fileSystemType`, `luksCipher`, `luksCipherMode`, +`luksVolumeKeySize`. + +## Fields in the `status` section + +As mentioned, the `status` section contains additional fields about the user +record that are exclusively acquired during runtime, and that expose runtime +metrics of the user and similar metadata that shall not be persisted but are +only acquired "on-the-fly" when requested. + +This section is arranged similarly to the `binding` section: the `status` +sub-object of the top-level user record object is keyed by the machine ID, +which points to the object with the fields defined here. The following fields +are defined: + +`diskUsage` → An unsigned 64-bit integer. The currently used disk space of the +home directory in bytes. This value might be determined in different ways, +depending on the selected storage mechanism. For LUKS storage this is the file +size of the loopback file or block device size. For the +directory/subvolume/fscrypt storage this is the current disk space used as +reported by the file system quota subsystem. + +`diskFree` → An unsigned 64-bit integer, denoting the number of "free" bytes in +the disk space allotment, i.e. usually the difference between the disk size as +reported by `diskSize` and the space already used as reported in `diskUsage`, but +possibly skewed by metadata sizes, disk compression and similar. + +`diskSize` → An unsigned 64-bit integer, denoting the disk space currently +allotted to the user, in bytes. Depending on the storage mechanism this can mean +different things (see above).
In contrast to the top-level field of the same name +(or the one in the `perMachine` section), this field reports the current size +allotted to the user, not the intended one. The values may differ when user +records are updated without the home directory being re-sized. + +`diskCeiling`/`diskFloor` → Unsigned 64-bit integers indicating upper and lower +bounds when changing the `diskSize` value, in bytes. These values are typically +derived from the underlying data storage, and indicate the range within which the home +directory may be re-sized, i.e. in which sensible range the `diskSize` value +should be kept. + +`state` → A string indicating the current state of the home directory. The +precise set of values exposed here is up to the service managing the home +directory to define (i.e. is up to the service identified with the `service` +field below). However, it is recommended to stick to a basic vocabulary here: +`inactive` for a home directory currently not mounted, `absent` for a home +directory that cannot be mounted currently because it does not exist on the +local system, `active` for a home directory that is currently mounted and +accessible. + +`service` → A string identifying the service that manages this user record. For +example `systemd-homed.service` sets this to `io.systemd.Home` for all user +records it manages. This is particularly relevant to define clearly the context +in which `state` lives, see above. Note that this field also exists on the +top-level object (i.e. in the `regular` section), which it overrides. The +`regular` field should be used if conceptually the user record can only be +managed by the specified service, and this `status` field if it can +conceptually be managed by different managers, but currently is managed by the +specified one. + +`signedLocally` → A boolean. If true indicates that the user record is signed +by a public key for which the private key is available locally.
This means that +the user record may be modified locally as it can be re-signed with the private +key. If false indicates that the user record is signed by a public key +recognized by the local manager but whose private key is not available +locally. This means the user record cannot be modified locally as it couldn't +be signed afterwards. + +`goodAuthenticationCounter` → An unsigned 64-bit integer. This counter is +increased by one on every successful authentication attempt, i.e. an +authentication attempt where a security token of some form was presented and it +was correct. + +`badAuthenticationCounter` → An unsigned 64-bit integer. This counter is +increased by one on every unsuccessful authentication attempt, i.e. an +authentication attempt where a security token of some form was presented and it +was incorrect. + +`lastGoodAuthenticationUSec` → An unsigned 64-bit integer, indicating the time +of the last successful authentication attempt in µs since the UNIX epoch (1970). + +`lastBadAuthenticationUSec` → Similar, but the timestamp of the last +unsuccessful authentication attempt. + +`rateLimitBeginUSec` → An unsigned 64-bit integer: the µs timestamp since the +UNIX epoch (1970) where the most recent rate limiting interval has been +started, as configured with `rateLimitIntervalUSec`. + +`rateLimitCount` → An unsigned 64-bit integer, counting the authentication +attempts in the current rate limiting interval, see above. If this counter +grows beyond the value configured in `rateLimitBurst` authentication attempts +are temporarily refused. + +`removable` → A boolean value. If true the manager of this user record +determined the home directory being on removable media. If false it was +determined the home directory is on internal built-in media.
(This is used by +`systemd-logind.service` to automatically pick the right default value for +`stopDelayUSec` if the field is not explicitly specified: for home directories +on removable media the delay is selected very low to minimize the chance the +home directory remains in unclean state if the storage device is removed from +the system by the user). + +`accessMode` → The access mode currently in effect for the home directory +itself. + +`fileSystemType` → The file system type backing the home directory: a short +string, such as "btrfs", "ext4", "xfs". + +## Fields in the `signature` section + +As mentioned, the `signature` section of the user record may contain one or +more cryptographic signatures of the user record. Like all others, this section +is optional, and only used when cryptographic validation of user records shall +be used. Specifically, all user records managed by `systemd-homed.service` will +carry such signatures and the service refuses managing user records that come +without signature or with signatures not recognized by any locally defined +public key. + +The `signature` field in the top-level user record object is an array of +objects. Each object encapsulates one signature and has two fields: `data` and +`key` (both are strings). The `data` field contains the actual signature, +encoded in Base64, the `key` field contains a copy of the public key whose +private key was used to make the signature, in PEM format. Currently only +signatures with Ed25519 keys are defined. + +Before signing, the user record should be brought into "normalized" form, +i.e. the keys in all objects should be sorted alphabetically. All redundant +white-space and newlines should be removed and the JSON text then signed. + +The signatures only cover the `regular`, `perMachine` and `privileged` sections +of the user records, all other sections (including `signature` itself), are +removed before the signature is calculated.
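The normalization before signing might be approximated as in the following Python sketch. It only illustrates the two steps named above (removing the sections not covered by the signature, then emitting compact JSON with sorted keys). The function name `normalized_for_signing` is hypothetical, and it is an assumption that Python's `json.dumps` yields the same bytes as the serializer used by `systemd-homed`; details such as number formatting or string escaping may differ, so the output should not be expected to verify against real signatures.

```python
import json

# Sections that are not covered by the signature. The `regular` section is
# the top-level object itself, so only these sub-objects are dropped.
UNSIGNED_SECTIONS = ("binding", "status", "signature", "secret")


def normalized_for_signing(record: dict) -> str:
    """Return compact JSON text with sorted keys and unsigned sections removed."""
    covered = {k: v for k, v in record.items() if k not in UNSIGNED_SECTIONS}
    # sort_keys sorts keys in nested objects too; the separators drop
    # all redundant white-space.
    return json.dumps(covered, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False)
```

Sorting the keys makes the text independent of the order in which fields were written, so two records with the same content produce the same input for the signature.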
+ +Rationale for signing and threat model: while a multi-user operating system +like Linux strives for being sufficiently secure even after a user acquired a +local login session, reality tells us this is not the case. Hence it is +essential to restrict carefully which users may gain access to a system and +which ones shall not. A minimal level of trust must be established between +system, user record and the user themselves before a log-in request may be +permitted. In particular if the home directory is provided in its own LUKS2 +encapsulated file system it is essential this trust is established before the +user logs in (and hence the file system mounted), since file system +implementations on Linux are well known to be relatively vulnerable to rogue +disk images. User records and home directories in many contexts are expected to +be something shareable between multiple systems, and the transfer between them +might not happen via exclusively trusted channels. Hence it's essential that +the user record is not manipulated between uses. Finally, resource management +(which may be done by the various fields of the user record) is security +sensitive, since it should forcefully lock the user into the assigned resource +usage and not allow them to use more. The requirement of being able to trust +the user record data combined with the potential transfer over untrusted +channels suggests a cryptographic signature mechanism where only user records +signed by a recognized key are permitted to log in locally. + +Note that other mechanisms for establishing sufficient trust exist too, and are +perfectly valid as well. For example, systems like LDAP/ActiveDirectory +generally insist on user record transfer from trusted servers via encrypted TLS +channels only. Or traditional UNIX users created locally in `/etc/passwd` never +exist outside of the local trusted system, hence transfer and trust in the +source are not an issue.
The major benefit of operating with signed user +records is that they are self-sufficiently trusted, not relying on a secure +channel for transfer, and thus being compatible with a more distributed model +of home directory transfer, including on USB sticks and such. + +## Fields in the `secret` section + +As mentioned, the `secret` section of the user record should never be persisted +nor transferred across machines. It is only defined in short-lived operations, +for example when a user record is first created or registered, as the secret +key data needs to be available to derive encryption keys from and similar. + +The `secret` field of the top-level user record contains the following fields: + +`password` → an array of strings, each containing a plain text password. + +`tokenPin` → an array of strings, each containing a plain text PIN, suitable +for unlocking security tokens that require that. (The field `pkcs11Pin` should +be considered a compatibility alias for this field, and merged with `tokenPin` +in case both are set.) + +`pkcs11ProtectedAuthenticationPathPermitted` → a boolean. If set to true allows +the receiver to use the PKCS#11 "protected authentication path" (i.e. a +physical button/touch element on the security token) for authenticating the +user. If false or unset, authentication this way shall not be attempted. + +`fido2UserPresencePermitted` → a boolean. If set to true allows the receiver to +use the FIDO2 "user presence" flag. This is similar to the concept of +`pkcs11ProtectedAuthenticationPathPermitted`, but exposes the FIDO2 "up" +concept behind it. If false or unset authentication this way shall not be +attempted. + +`fido2UserVerificationPermitted` → a boolean. If set to true allows the +receiver to use the FIDO2 "user verification" flag. This is similar to the +concept of `pkcs11ProtectedAuthenticationPathPermitted`, but exposes the FIDO2 +"uv" concept behind it. If false or unset authentication this way shall not be +attempted. 
+ +## Mapping to `struct passwd` and `struct spwd` + +When mapping classic UNIX user records (i.e. `struct passwd` and `struct spwd`) +to JSON user records the following mappings should be applied: + +| Structure | Field | Section | Field | Condition | +|-----------------|-------------|--------------|------------------------------|----------------------------| +| `struct passwd` | `pw_name` | `regular` | `userName` | | +| `struct passwd` | `pw_passwd` | `privileged` | `hashedPassword` | (See notes below) | +| `struct passwd` | `pw_uid` | `regular` | `uid` | | +| `struct passwd` | `pw_gid` | `regular` | `gid` | | +| `struct passwd` | `pw_gecos` | `regular` | `realName` | | +| `struct passwd` | `pw_dir` | `regular` | `homeDirectory` | | +| `struct passwd` | `pw_shell` | `regular` | `shell` | | +| `struct spwd` | `sp_namp` | `regular` | `userName` | | +| `struct spwd` | `sp_pwdp` | `privileged` | `hashedPassword` | (See notes below) | +| `struct spwd` | `sp_lstchg` | `regular` | `lastPasswordChangeUSec` | (if `sp_lstchg` > 0) | +| `struct spwd` | `sp_lstchg` | `regular` | `passwordChangeNow` | (if `sp_lstchg` == 0) | +| `struct spwd` | `sp_min` | `regular` | `passwordChangeMinUSec` | | +| `struct spwd` | `sp_max` | `regular` | `passwordChangeMaxUSec` | | +| `struct spwd` | `sp_warn` | `regular` | `passwordChangeWarnUSec` | | +| `struct spwd` | `sp_inact` | `regular` | `passwordChangeInactiveUSec` | | +| `struct spwd` | `sp_expire` | `regular` | `locked` | (if `sp_expire` in [0, 1]) | +| `struct spwd` | `sp_expire` | `regular` | `notAfterUSec` | (if `sp_expire` > 1) | + +At this time almost all Linux machines employ shadow passwords, thus the +`pw_passwd` field in `struct passwd` is set to `"x"`, and the actual password +is stored in the shadow entry `struct spwd`'s field `sp_pwdp`. + +## Extending These Records + +User records following this specification are supposed to be extendable for +various applications.
In general, subsystems are free to introduce their own +keys, as long as: + +* Care should be taken to place the keys in the right section, i.e. the most + appropriate for the data field. + +* Care should be taken to avoid namespace clashes. Please prefix your fields + with a short identifier of your project to avoid ambiguities and + incompatibilities. + +* This specification is supposed to be a living specification. If you need + additional fields, please consider submitting them upstream for inclusion in + this specification. If they are reasonably universally useful, it would be + best to list them here. + +## Examples + +The shortest valid user record looks like this: + +```json +{ + "userName" : "u" +} +``` + +A reasonable user record for a system user might look like this: + +```json +{ + "userName" : "httpd", + "uid" : 473, + "gid" : 473, + "disposition" : "system", + "locked" : true +} +``` + +A fully featured user record associated with a home directory managed by +`systemd-homed.service` might look like this: + +```json +{ + "autoLogin" : true, + "binding" : { + "15e19cf24e004b949ddaac60c74aa165" : { + "fileSystemType" : "ext4", + "fileSystemUuid" : "758e88c8-5851-4a2a-b88f-e7474279c111", + "gid" : 60232, + "homeDirectory" : "/home/grobie", + "imagePath" : "/home/grobie.home", + "luksCipher" : "aes", + "luksCipherMode" : "xts-plain64", + "luksUuid" : "e63581ba-79fb-4226-b9de-1888393f7573", + "luksVolumeKeySize" : 32, + "partitionUuid" : "41f9ce04-c827-4b74-a981-c669f93eb4dc", + "storage" : "luks", + "uid" : 60232 + } + }, + "disposition" : "regular", + "enforcePasswordPolicy" : false, + "lastChangeUSec" : 1565950024279735, + "memberOf" : [ + "wheel" + ], + "privileged" : { + "hashedPassword" : [ + "$6$WHBKvAFFT9jKPA4k$OPY4D4TczKN/jOnJzy54DDuOOagCcvxxybrwMbe1SVdm.Bbr.zOmBdATp.QrwZmvqyr8/SafbbQu.QZ2rRvDs/" + ] + }, + "signature" : [ + { + "data" : "LU/HeVrPZSzi3MJ0PVHwD5m/xf51XDYCrSpbDRNBdtF4fDVhrN0t2I2OqH/1yXiBidXlV0ptMuQVq8KVICdEDw==", + "key" : 
"-----BEGIN PUBLIC KEY-----\nMCowBQYDK2VwAyEA/QT6kQWOAMhDJf56jBmszEQQpJHqDsGDMZOdiptBgRk=\n-----END PUBLIC KEY-----\n" + } + ], + "userName" : "grobie", + "status" : { + "15e19cf24e004b949ddaac60c74aa165" : { + "goodAuthenticationCounter" : 16, + "lastGoodAuthenticationUSec" : 1566309343044322, + "rateLimitBeginUSec" : 1566309342340723, + "rateLimitCount" : 1, + "state" : "inactive", + "service" : "io.systemd.Home", + "diskSize" : 161118667776, + "diskCeiling" : 190371729408, + "diskFloor" : 5242880, + "signedLocally" : true + } + } +} +``` + +When `systemd-homed.service` manages a home directory it will also include a +version of the user record in the home directory itself in the `~/.identity` +file. This version lacks the `binding` and `status` sections which are used for +local management of the user, but are not intended to be portable between +systems. It would hence look like this: + +```json +{ + "autoLogin" : true, + "disposition" : "regular", + "enforcePasswordPolicy" : false, + "lastChangeUSec" : 1565950024279735, + "memberOf" : [ + "wheel" + ], + "privileged" : { + "hashedPassword" : [ + "$6$WHBKvAFFT9jKPA4k$OPY4D4TczKN/jOnJzy54DDuOOagCcvxxybrwMbe1SVdm.Bbr.zOmBdATp.QrwZmvqyr8/SafbbQu.QZ2rRvDs/" + ] + }, + "signature" : [ + { + "data" : "LU/HeVrPZSzi3MJ0PVHwD5m/xf51XDYCrSpbDRNBdtF4fDVhrN0t2I2OqH/1yXiBidXlV0ptMuQVq8KVICdEDw==", + "key" : "-----BEGIN PUBLIC KEY-----\nMCowBQYDK2VwAyEA/QT6kQWOAMhDJf56jBmszEQQpJHqDsGDMZOdiptBgRk=\n-----END PUBLIC KEY-----\n" + } + ], + "userName" : "grobie", +} +``` diff --git a/docs/VIRTUALIZED_TESTING.md b/docs/VIRTUALIZED_TESTING.md new file mode 100644 index 0000000..94a5606 --- /dev/null +++ b/docs/VIRTUALIZED_TESTING.md @@ -0,0 +1,78 @@ +--- +title: Testing systemd during Development in Virtualization +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Testing systemd during Development in Virtualization + +For quickly testing systemd during development it us 
useful to boot it up in a container and in a QEMU VM. + +## Testing in a VM + +Here's a nice hack if you regularly build and test-boot systemd, are gutsy enough to install it into your host, but too afraid or too lazy to continuously reboot your host. + +Create a shell script like this: + +``` +#!/bin/sh + +sudo sync +sudo /bin/sh -c 'echo 3 > /proc/sys/vm/drop_caches' +sudo umount / +sudo modprobe kvm-intel + +exec sudo qemu-kvm -smp 2 -m 512 -snapshot /dev/sda +``` + +This will boot your local host system as a throw-away VM guest. It will take your main harddisk, boot from it in the VM, allow changes to it, but these changes are all just buffered in memory and never hit the real disk. Any changes made in this VM will be lost when the VM terminates. I have called this script "q", and hence for test booting my own system all I do is type the following command in my systemd source tree and I can see if it worked. + +``` +$ make -j10 && sudo make install && q + +``` + +The first three lines are necessary to ensure that the kernel's disk caches are all synced to disk before qemu takes the snapshot of it. Yes, invoking "umount /" will sync your file system to disk as a side effect, even though it will actually fail. When the machine boots up the file system will still be marked dirty (and hence you will get an fsck, usually), but it will work fine nonetheless in virtually all cases. + +Of course, if the host's hard disk changes while the VM is running this will be visible to the VM, and might confuse it. If you use this little hack you should keep changes on the host at a minimum, hence. Yeah this all is a hack, but a really useful and neat one. + +YMMV if you use LVM or btrfs. + +## Testing in a Container + +Test-booting systemd in a container has the benefit of being much easier to debug/instrument from the outside. + +**Important**: As preparation it is essential to turn off auditing entirely on your system. 
Auditing is broken with containers, and will trigger all kinds of errors in containers if not turned off. Use `audit=0` on the host's kernel command line to turn it off. + +Then, as the first step, I install Fedora into a container tree: + +``` +$ sudo yum -y --releasever=20 --installroot=$HOME/fedora-tree --disablerepo='*' --enablerepo=fedora install systemd passwd yum fedora-release vim-minimal + +``` + +You can do something similar with debootstrap on a Debian system. Now, we need to set a root password in order to be able to log in: + +``` +$ sudo systemd-nspawn -D ~/fedora-tree/ +# passwd +... +# ^D +``` + +As the next step we can already boot the container: + +``` +$ sudo systemd-nspawn -bD ~/fedora-tree/ 3 + +``` + +To test systemd in the container I then run this from my source tree on the host: + +``` +$ make -j10 && sudo DESTDIR=$HOME/fedora-tree make install && sudo systemd-nspawn -bD ~/fedora-tree/ 3 + +``` + +And that's already it. diff --git a/docs/WRITING_DESKTOP_ENVIRONMENTS.md b/docs/WRITING_DESKTOP_ENVIRONMENTS.md new file mode 100644 index 0000000..b50c857 --- /dev/null +++ b/docs/WRITING_DESKTOP_ENVIRONMENTS.md @@ -0,0 +1,30 @@ +--- +title: Writing Desktop Environments +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Writing Desktop Environments + +_Or: how to hook up your favorite desktop environment with logind_ + +systemd's logind service obsoletes ConsoleKit, which was previously widely used on Linux distributions. This provides a number of new features, but also requires updating of the Desktop Environment running on it, in a few ways. + +This document should be read together with [Writing Display Managers](http://www.freedesktop.org/wiki/Software/systemd/writing-display-managers) which focuses on the porting work necessary for display managers. 
+ +If required it is possible to implement ConsoleKit and systemd-logind support in the same desktop environment code, detecting at runtime which interface is needed. The [sd_booted()](http://www.freedesktop.org/software/systemd/man/sd_booted.html) call may be used to determine at runtime whether systemd is used. + +To a certain extent ConsoleKit and systemd-logind may be used side-by-side, but a number of features are not available if ConsoleKit is used. + +Please have a look at the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) and the C API as documented in [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html). (Also see below) + +Here are the suggested changes: + +- Your session manager should listen to "Lock" and "Unlock" messages that are emitted from the session object logind exposes for your DE session, on the system bus. If "Lock" is received the screen lock should be activated, if "Unlock" is received it should be deactivated. This can easily be tested with "loginctl lock-sessions". See the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) for further details. +- Whenever the session gets idle the DE should invoke the SetIdleHint(True) call on the respective session object on the system bus. This is necessary for the system to implement auto-suspend when all sessions are idle. If the session gets used again it should call SetIdleHint(False). A session should be considered idle if it didn't receive user input (mouse movements, keyboard) in a while. See the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) for further details. +- To reboot/power-off/suspend/hibernate the machine from the DE use logind's bus calls Reboot(), PowerOff(), Suspend(), Hibernate(), HybridSleep(). For further details see [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind). 
+- If your session manager handles the special power, suspend, hibernate hardware keys or the laptop lid switch on its own it is welcome to do so, but needs to disable logind's built-in handling of these events. Take one or more of the _handle-power-key_, _handle-suspend-key_, _handle-hibernate-key_, _handle-lid-switch_ inhibitor locks for that. See [Inhibitor Locks](http://www.freedesktop.org/wiki/Software/systemd/inhibit) for further details on this. +- Before rebooting/powering-off/suspending/hibernating and when the operation is triggered by the user by clicking on some UI elements (or suchlike) it is recommended to show the list of currently active inhibitors for the operation, and ask the user to acknowledge the operation. Note that PK often allows the user to execute the operation ignoring the inhibitors. Use logind's ListInhibitors() call to get a list of these inhibitors. See [Inhibitor Locks](http://www.freedesktop.org/wiki/Software/systemd/inhibit) for further details on this. +- If your DE contains a process viewer of some kind ("system monitor") it's a good idea to show session, service and seat information for each process. Use sd_pid_get_session(), sd_pid_get_unit(), sd_session_get_seat() to determine these. For details see [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html). + And that's all! Thank you! diff --git a/docs/WRITING_DISPLAY_MANAGERS.md b/docs/WRITING_DISPLAY_MANAGERS.md new file mode 100644 index 0000000..efdbccc --- /dev/null +++ b/docs/WRITING_DISPLAY_MANAGERS.md @@ -0,0 +1,39 @@ +--- +title: Writing Display Managers +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Writing Display Managers + +_Or: How to hook up your favorite X11 display manager with systemd_ + +systemd's logind service obsoletes ConsoleKit which was previously widely used on Linux distributions. 
For X11 display managers the switch to logind requires a minimal amount of porting, however brings a couple of new features: true automatic multi-seat support, proper tracking of session processes, (optional) automatic killing of user processes on logout, a synchronous low-level C API and much simplification. + +This document should be read together with [Writing Desktop Environments](http://www.freedesktop.org/wiki/Software/systemd/writing-desktop-environments) which focuses on the porting work necessary for desktop environments. + +If required it is possible to implement ConsoleKit and systemd-logind support in the same display manager, detecting at runtime which interface is needed. The [sd_booted()](http://www.freedesktop.org/software/systemd/man/sd_booted.html) call may be used to determine at runtime whether systemd is used. + +To a certain level ConsoleKit and systemd-logind may be used side-by-side, but a number of features are not available if ConsoleKit is used, for example automatic multi-seat support. + +Please have a look at the [Bus API of logind](http://www.freedesktop.org/wiki/Software/systemd/logind) and the C API as documented in [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html). (Also see below) + +Minimal porting (without multi-seat) requires the following: + +1. Remove/disable all code responsible for registering your service with ConsoleKit. +2. Make sure to register your greeter session via the PAM session stack, and make sure the PAM session modules include pam_systemd. Also, make sure to set the session class to "greeter." This may be done by setting the environment variable XDG_SESSION_CLASS to "greeter" with pam_misc_setenv() or setting the "class=greeter" option in the pam_systemd module, in order to allow applications to filter out greeter sessions from normal login sessions. +3. Make sure to register your logged in session via the PAM session stack as well, also including pam_systemd in it. +4. 
Optionally, use pam_misc_setenv() to set the environment variables XDG_SEAT and XDG_VTNR. The former should contain "seat0", the latter the VT number your session runs on. pam_systemd can determine these values automatically but it's nice to pass these variables anyway. + In summary: porting a display manager from ConsoleKit to systemd primarily means removing code, not necessarily adding any new code. Here, a cheers to simplicity! + +Complete porting (with multi-seat) requires the following (Before you continue, make sure to read up on [Multi-Seat on Linux](http://www.freedesktop.org/wiki/Software/systemd/multiseat) first.): + +1. Subscribe to seats showing up and going away, via the systemd-logind D-Bus interface's SeatAdded and SeatRemoved signals. Take possession of each seat by spawning your greeter on it. However, do so exclusively for seats where the boolean CanGraphical property is true. Note that there are seats that cannot do graphical, and there are seats that are text-only first, and gain graphical support later on. Most prominently this is actually seat0 which comes up in text mode, and where the graphics driver is then loaded and probed during boot. This means display managers must watch PropertyChanged events on all seats, to see if they gain (or lose) the CanGraphical field. +2. Use ListSeats() on the D-Bus interface to acquire a list of already available seats and also take possession of them. +3. For each seat you spawn a greeter/user session on use the XDG_SEAT and XDG_VTNR PAM environment variables to inform pam_systemd about the seat name, resp. VT number you start them on. Note that only the special seat "seat0" actually knows kernel VTs, so you shouldn't pass the VT number on any but the main seat, since it doesn't make any sense there. +4. Pass the seat name to the X server you start via the -seat parameter. +5. At this time X interprets the -seat parameter natively only for input devices, not for graphics devices. 
To work around this limitation we provide a tiny wrapper /lib/systemd/systemd-multi-seat-x which emulates the enumeration for graphics devices too. This wrapper will eventually go away, as soon as X learns udev-based graphics device enumeration natively, instead of the current PCI based one. Hence it is a good idea to fall back to the real X when this wrapper is not found. You may use this wrapper exactly like the real X server, and internally it will just exec() it after putting together a minimal multi-seat configuration. + And that's already it. + +While most information about seats, sessions and users is available on systemd-logind's D-Bus interface, this is not the only API. The synchronous [sd-login(7)](http://www.freedesktop.org/software/systemd/man/sd-login.html) C interface is often easier to use and much faster too. In fact it is possible to implement the scheme above entirely without D-Bus relying only on this API. Note however, that this C API is purely passive, and if you want to execute an actually state changing operation you need to use the bus interface (for example, to switch sessions, or to kill sessions and suchlike). Also have a look at the [logind Bus API](http://www.freedesktop.org/wiki/Software/systemd/logind). diff --git a/docs/WRITING_NETWORK_CONFIGURATION_MANAGERS.md b/docs/WRITING_NETWORK_CONFIGURATION_MANAGERS.md new file mode 100644 index 0000000..3a02c3a --- /dev/null +++ b/docs/WRITING_NETWORK_CONFIGURATION_MANAGERS.md @@ -0,0 +1,64 @@ +--- +title: Writing Network Configuration Managers +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Writing Network Configuration Managers + +_Or: How to hook up your favourite network configuration manager's DNS logic with `systemd-resolved`_ + +_(This is a longer explanation how to use some parts of `systemd-resolved` bus API. 
If you are just looking for an API reference, consult the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/) instead.)_ + +Since systemd 229 `systemd-resolved` offers a powerful bus API that may be used by network configuration managers (e.g. NetworkManager, connman, …, but also lower level DHCP, VPN or PPP daemons managing specific interfaces) to pass DNS server and DNSSEC configuration directly to `systemd-resolved`. Note that `systemd-resolved` also reads the DNS configuration data in `/etc/resolv.conf`, for compatibility. However, by passing the DNS configuration directly to `systemd-resolved` via the bus a couple of benefits are available: + +1. `systemd-resolved` maintains DNS configuration per-interface, instead of simply system-wide, and is capable of sending DNS requests to servers on multiple different network interfaces simultaneously, returning the first positive response (or if all fail, the last negative one). This allows effective "merging" of DNS views on different interfaces, which makes private DNS zones on multi-homed hosts a lot nicer to use. For example, if you are connected to a LAN and a VPN, and both have private DNS zones, then you will be able to resolve both, as long as they don't clash in names. By using the bus API to configure DNS settings, the per-interface configuration is opened up. +2. Per-link configuration of DNSSEC is available. This is particularly interesting for network configuration managers that implement captive portal detection: as long as a verified connection to the Internet is not found DNSSEC should be turned off (as some captive portal systems alter the DNS in order to redirect clients to their internal pages). +3. Per-link configuration of LLMNR and MulticastDNS is available. +4. In contrast to changes to `/etc/resolv.conf` all changes made via the bus take effect immediately for all future lookups. +5. 
Statistical data about executed DNS transactions is available, as well as information about whether DNSSEC is supported on the chosen DNS server. + +Note that `systemd-networkd` is already hooked up with `systemd-resolved`, exposing this functionality in full. + +## Suggested Mode of Operation + +Whenever a network configuration manager sets up an interface for operation, it should pass the DNS configuration information for the interface to `systemd-resolved`. It's recommended to do that after the Linux network interface index ("ifindex") has been allocated, but before the interface has been upped (i.e. `IFF_UP` turned on). That way, `systemd-resolved` will be able to use the configuration the moment the network interface is available. (Note that `systemd-resolved` watches the kernel interfaces come and go, and will make use of them as soon as they are suitable to be used, which among other factors requires `IFF_UP` to be set). That said it is OK to change DNS configuration dynamically any time: simply pass the new data to resolved, and it is happy to use it. + +In order to pass the DNS configuration information to resolved, use the following methods of the `org.freedesktop.resolve1.Manager` interface of the `/org/freedesktop/resolve1` object, on the `org.freedesktop.resolve1` service: + +1. To set the DNS server IP addresses for a network interface, use `SetLinkDNS()` +2. To set DNS search and routing domains for a network interface, use `SetLinkDomains()` +3. To configure the DNSSEC mode for a network interface, use `SetLinkDNSSEC()` +4. To configure DNSSEC Negative Trust Anchors (NTAs, i.e. domains for which not to do DNSSEC validation), use `SetLinkDNSSECNegativeTrustAnchors()` +5. To configure the LLMNR and MulticastDNS mode, use `SetLinkLLMNR()` and `SetLinkMulticastDNS()` + +For details about these calls see the [full resolved bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/). 
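`SetLinkDNS()` takes each server address as an address family followed by the raw address bytes in network byte order. The sketch below shows one way to produce that encoding from a textual address with `inet_pton()`. It is an illustration only: `struct dns_addr` and `encode_dns_address()` are hypothetical names and not part of any systemd API.

```c
#include <arpa/inet.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper type: one DNS server address in the form the bus call
 * expects, i.e. an address family plus 4 (IPv4) or 16 (IPv6) raw bytes. */
struct dns_addr {
        int family;        /* AF_INET or AF_INET6 */
        uint8_t bytes[16]; /* address in network byte order */
        size_t length;     /* 4 for AF_INET, 16 for AF_INET6 */
};

/* Parses a textual IPv4 or IPv6 address. Returns 0 on success, -1 if the
 * string is not a valid address of either family. */
static int encode_dns_address(const char *text, struct dns_addr *out) {
        memset(out, 0, sizeof(*out));

        if (inet_pton(AF_INET, text, out->bytes) == 1) {
                out->family = AF_INET;
                out->length = 4;
                return 0;
        }
        if (inet_pton(AF_INET6, text, out->bytes) == 1) {
                out->family = AF_INET6;
                out->length = 16;
                return 0;
        }
        return -1;
}
```

With sd-bus, each encoded entry could then be appended to the method call message as an integer (the family) followed by a byte array, for example with `sd_bus_message_append()` and `sd_bus_message_append_array()`. Consult the bus API documentation for the exact message signature.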
+ +The calls should be pretty obvious to use: they simply take an interface index and the parameters to set. IP addresses are encoded as an address family specifier (an integer that takes the usual `AF_INET` and `AF_INET6` constants), followed by a 4- or 16-byte array with the address in network byte order. + +`systemd-resolved` distinguishes between "search" and "routing" domains. Routing domains are used to route DNS requests of specific domains to particular interfaces. For example, requests for a hostname `foo.bar.com` will be routed to any interface that has `bar.com` as routing domain. The same routing domain may be defined on multiple interfaces, in which case the request is routed to all of them in parallel. Resolver requests for hostnames that do not end in any defined routing domain of any interface will be routed to all suitable interfaces. Search domains work like routing domains, but are also used to qualify single-label domain names. They hence are identical to the traditional search domain logic on UNIX. The `SetLinkDomains()` call may be used to define both search and routing domains. + +The most basic support of `systemd-resolved` in a network configuration manager would be to simply invoke `SetLinkDNS()` and `SetLinkDomains()` for the specific interface index with the data traditionally written to `/etc/resolv.conf`. More advanced integration could mean the network configuration manager also makes the DNSSEC mode, the DNSSEC NTAs and the LLMNR/MulticastDNS modes available for configuration. + +It is strongly recommended for network configuration managers that implement captive portal detection to turn off DNSSEC validation during the detection phase, so that captive portals that modify DNS do not cause all DNSSEC look-ups to fail. + +If a network configuration manager wants to reset specific settings to the defaults (such as the DNSSEC, LLMNR or MulticastDNS mode), it may simply call the function with an empty argument. 
To reset all per-link changes it made, it may call `RevertLink()`. + +To read back the various settings made, use `GetLink()` to get an `org.freedesktop.resolve1.Link` object for a specific network interface. It exposes the current settings in its bus properties. See the [full bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/) for details on this. + +In order to translate a network interface name to an interface index, use the usual glibc `if_nametoindex()` call. + +If the network configuration UI shall expose information about whether the selected DNS server supports DNSSEC, check the `DNSSECSupported` property on the link object. + +Note that it is fully OK if multiple different daemons push DNS configuration data into `systemd-resolved` as long as they do this only for the network interfaces they own and manage. + +## Handling of `/etc/resolv.conf` + +`systemd-resolved` receives DNS configuration from a number of sources, via the bus, as well as directly from `systemd-networkd` or user configuration. It uses this data to write a file that is compatible with the traditional Linux `/etc/resolv.conf` file. This file is stored in `/run/systemd/resolve/resolv.conf`. It is recommended to symlink `/etc/resolv.conf` to this file, in order to provide compatibility with programs reading the file directly and not going via the NSS and thus `systemd-resolved`. + +For network configuration managers it is recommended to rely on this resolved-provided mechanism to update `resolv.conf`. Specifically, the network configuration manager should stop modifying `/etc/resolv.conf` directly if it notices it being a symlink to `/run/systemd/resolve/resolv.conf`. + +If a network configuration manager desires to be compatible both with systems that use `systemd-resolved` and those which do not, it is recommended to first push any discovered DNS configuration into `systemd-resolved`, and deal gracefully with `systemd-resolved` not being available on the bus. 
If `/etc/resolv.conf` is not a symlink to `/run/systemd/resolve/resolv.conf` the manager may then proceed and also update `/etc/resolv.conf`. With this mode of operation optimal compatibility is provided, as `systemd-resolved` is used for `/etc/resolv.conf` management when this is configured, but transparent compatibility with non-`systemd-resolved` systems is maintained. Note that `systemd-resolved` is part of systemd, and hence likely to be pretty universally available on Linux systems soon. + +By allowing `systemd-resolved` to manage `/etc/resolv.conf`, ownership issues regarding different programs overwriting each other's DNS configuration are effectively removed. diff --git a/docs/WRITING_RESOLVER_CLIENTS.md b/docs/WRITING_RESOLVER_CLIENTS.md new file mode 100644 index 0000000..88a873a --- /dev/null +++ b/docs/WRITING_RESOLVER_CLIENTS.md @@ -0,0 +1,290 @@ +--- +title: Writing Resolver Clients +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# Writing Resolver Clients + +_Or: How to look up hostnames and arbitrary DNS Resource Records via `systemd-resolved`'s bus APIs_ + +_(This is a longer explanation of how to use some parts of the `systemd-resolved` bus API. If you are just looking for an API reference, consult the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/) instead.)_ + +_systemd-resolved_ provides a set of APIs on the bus for resolving DNS resource records. These are: + +1. _ResolveHostname()_ for resolving hostnames to acquire their IP addresses +2. _ResolveAddress()_ for the reverse operation: acquire the hostname for an IP address +3. _ResolveService()_ for resolving a DNS-SD or SRV service +4. _ResolveRecord()_ for resolving arbitrary resource records. + +Below you'll find examples for two of these calls, to show how to use them. Note that glibc offers similar (and more portable) calls in _getaddrinfo()_, _getnameinfo()_ and _res_query()_. 
Of these _getaddrinfo()_ and _getnameinfo()_ are directed to the calls above via the _nss-resolve_ NSS module, but _res_query()_ is not. There are a number of reasons why it might be preferable to invoke _systemd-resolved_'s bus calls rather than the glibc APIs: + +1. Bus APIs are naturally asynchronous, which the glibc APIs generally are not. +2. The bus calls above pass back substantially more information about the resolved data, including where and how the data was found (i.e. which protocol was used: DNS, LLMNR, MulticastDNS, and on which network interface), and most importantly, whether the data could be authenticated via DNSSEC. This in particular makes these APIs useful for retrieving certificate data from the DNS, in order to implement DANE, SSHFP, OPENPGPKEY and IPSECKEY clients. +3. _ResolveService()_ has no counterpart in glibc, and has the benefit of being a single call that collects all data necessary to connect to a DNS-SD or pure SRV service in one step. +4. _ResolveRecord()_ in contrast to _res_query()_ supports LLMNR and MulticastDNS as protocols in addition to DNS, and makes use of _systemd-resolved_'s local DNS record cache. The processing of the request is done in the sandboxed _systemd-resolved_ process rather than in the local process, and all packets are pre-validated. Because this relies on _systemd-resolved_ the per-interface DNS zone handling is supported. + +Of course, by using _systemd-resolved_ you lose some portability, but this could be handled via an automatic fallback to the glibc counterparts. + +Note that the various resolver calls provided by _systemd-resolved_ will consult _/etc/hosts_ and synthesize resource records for these entries in order to ensure that this file is honoured fully. + +The examples below use the _sd-bus_ D-Bus client implementation, which is part of _libsystemd_. Any other D-Bus library, including the original _libdbus_ or _GDBus_, may be used too. 
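The glibc fallback mentioned earlier could look roughly like the following sketch. Unlike the bus-based examples in this document it uses only `getaddrinfo()` and `getnameinfo()`, so it returns no DNSSEC or interface information. The helper name `resolve_first_address()` is hypothetical and chosen for illustration; a real program would call something like it only after the bus call failed, for instance because _systemd-resolved_ is not running.

```c
#include <netdb.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Hypothetical fallback helper: resolves `name` with getaddrinfo() and writes
 * the first address found, in textual form, into `buf`. `ai_flags` is passed
 * through to getaddrinfo() (e.g. 0, or AI_NUMERICHOST).
 * Returns 0 on success, or a non-zero EAI_* error code. */
static int resolve_first_address(const char *name, int ai_flags, char *buf, size_t size) {
        struct addrinfo hints, *result = NULL;
        int r;

        memset(&hints, 0, sizeof(hints));
        hints.ai_family = AF_UNSPEC;     /* accept both IPv4 and IPv6 */
        hints.ai_socktype = SOCK_STREAM; /* avoid one entry per socket type */
        hints.ai_flags = ai_flags;

        r = getaddrinfo(name, NULL, &hints, &result);
        if (r != 0)
                return r;

        /* Format the binary socket address as a numeric string. */
        r = getnameinfo(result->ai_addr, result->ai_addrlen, buf, size, NULL, 0, NI_NUMERICHOST);
        freeaddrinfo(result);
        return r;
}
```

Error codes returned by this helper can be turned into messages with `gai_strerror()`.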
+ +## Resolving a Hostname + +To resolve a hostname use the _ResolveHostname()_ call. For details on the function parameters see the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/). + +This example specifies _AF_UNSPEC_ as address family for the requested address. This means both an _AF_INET_ (A) and an _AF_INET6_ (AAAA) record are looked for, depending on whether the local system has configured IPv4 and/or IPv6 connectivity. It is generally recommended to request _AF_UNSPEC_ addresses for best compatibility with both protocols, in particular on dual-stack systems. + +The example specifies a network interface index of "0", i.e. does not specify any at all, so that the request may be sent on any interface. Note that the interface index is primarily relevant for LLMNR and MulticastDNS lookups, which distinguish different scopes for each network interface index. + +This example uses neither the input flags parameter nor the output flags parameter. See the _ResolveRecord()_ example below for information on how to make use of the _SD_RESOLVED_AUTHENTICATED_ bit in the returned flags parameter. + +``` +#include <arpa/inet.h> +#include <netinet/in.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <sys/socket.h> +#include <systemd/sd-bus.h> + +int main(int argc, char*argv[]) { + sd_bus_error error = SD_BUS_ERROR_NULL; + sd_bus_message *reply = NULL; + const char *canonical; + sd_bus *bus = NULL; + uint64_t flags; + int r; + + r = sd_bus_open_system(&bus); + if (r < 0) { + fprintf(stderr, "Failed to open system bus: %s\n", strerror(-r)); + goto finish; + } + + r = sd_bus_call_method(bus, + "org.freedesktop.resolve1", + "/org/freedesktop/resolve1", + "org.freedesktop.resolve1.Manager", + "ResolveHostname", + &error, + &reply, + "isit", + 0, /* Network interface index where to look (0 means any) */ + argc >= 2 ? 
argv[1] : "www.github.com", /* Hostname */ + AF_UNSPEC, /* Which address family to look for */ + UINT64_C(0)); /* Input flags parameter */ + if (r < 0) { + fprintf(stderr, "Failed to resolve hostname: %s\n", error.message); + sd_bus_error_free(&error); + goto finish; + } + + r = sd_bus_message_enter_container(reply, 'a', "(iiay)"); + if (r < 0) + goto parse_failure; + + for (;;) { + char buf[INET6_ADDRSTRLEN]; + int ifindex, family; + const void *data; + size_t length; + + r = sd_bus_message_enter_container(reply, 'r', "iiay"); + if (r < 0) + goto parse_failure; + if (r == 0) /* Reached end of array */ + break; + r = sd_bus_message_read(reply, "ii", &ifindex, &family); + if (r < 0) + goto parse_failure; + r = sd_bus_message_read_array(reply, 'y', &data, &length); + if (r < 0) + goto parse_failure; + r = sd_bus_message_exit_container(reply); + if (r < 0) + goto parse_failure; + + printf("Found IP address %s on interface %i.\n", inet_ntop(family, data, buf, sizeof(buf)), ifindex); + } + + r = sd_bus_message_exit_container(reply); + if (r < 0) + goto parse_failure; + r = sd_bus_message_read(reply, "st", &canonical, &flags); + if (r < 0) + goto parse_failure; + + printf("Canonical name is %s\n", canonical); + goto finish; + +parse_failure: + fprintf(stderr, "Parse failure: %s\n", strerror(-r)); + +finish: + sd_bus_message_unref(reply); + sd_bus_flush_close_unref(bus); + return r < 0 ? EXIT_FAILURE : EXIT_SUCCESS; +} +``` + +Compile this with a command line like the following (under the assumption you save the sources above as `addrtest.c`): + +``` +gcc addrtest.c -o addrtest -Wall `pkg-config --cflags --libs libsystemd` +``` + +## Resolving an Arbitrary DNS Resource Record + +Use `ResolveRecord()` in order to resolve arbitrary resource records. The call will return the binary RRset data. This call is useful to acquire resource records for which no high-level calls such as ResolveHostname(), ResolveAddress() and ResolveService() exist. 
In particular RRs such as MX, SSHFP, TLSA, CERT, OPENPGPKEY or IPSECKEY may be requested via this API. + +This example also shows how to determine whether the acquired data has been authenticated via DNSSEC (or another means) by checking the `SD_RESOLVED_AUTHENTICATED` bit in the +returned `flags` parameter. + +This example contains a simple MX record parser. Note that the data comes pre-validated from `systemd-resolved`, hence we allow the example to parse the record slightly sloppily, to keep the example brief. For details on the MX RR binary format, see [RFC 1035](https://www.rfc-editor.org/rfc/rfc1035.txt). + +For details on the function parameters see the [bus API documentation](https://wiki.freedesktop.org/www/Software/systemd/resolved/). + +``` +#include <assert.h> +#include <endian.h> +#include <stdbool.h> +#include <stdio.h> +#include <stdlib.h> +#include <string.h> +#include <systemd/sd-bus.h> + +#define DNS_CLASS_IN 1U +#define DNS_TYPE_MX 15U + +#define SD_RESOLVED_AUTHENTICATED (UINT64_C(1) << 9) + +static const uint8_t* print_name(const uint8_t* p) { + bool dot = false; + for (;;) { + if (*p == 0) + return p + 1; + if (dot) + putchar('.'); + else + dot = true; + printf("%.*s", (int) *p, (const char*) p + 1); + p += *p + 1; + } +} + +static void process_mx(const void *rr, size_t sz) { + uint16_t class, type, rdlength, preference; + const uint8_t *p = rr; + uint32_t ttl; + + fputs("Found MX: ", stdout); + p = print_name(p); + + memcpy(&type, p, sizeof(uint16_t)); + p += sizeof(uint16_t); + memcpy(&class, p, sizeof(uint16_t)); + p += sizeof(uint16_t); + memcpy(&ttl, p, sizeof(uint32_t)); + p += sizeof(uint32_t); + memcpy(&rdlength, p, sizeof(uint16_t)); + p += sizeof(uint16_t); + memcpy(&preference, p, sizeof(uint16_t)); + p += sizeof(uint16_t); + + assert(be16toh(type) == DNS_TYPE_MX); + assert(be16toh(class) == DNS_CLASS_IN); + printf(" preference=%u ", be16toh(preference)); + + p = print_name(p); + putchar('\n'); + + assert(p == (const uint8_t*) rr + 
sz); +} + +int main(int argc, char*argv[]) { + sd_bus_error error = SD_BUS_ERROR_NULL; + sd_bus_message *reply = NULL; + sd_bus *bus = NULL; + uint64_t flags; + int r; + + r = sd_bus_open_system(&bus); + if (r < 0) { + fprintf(stderr, "Failed to open system bus: %s\n", strerror(-r)); + goto finish; + } + + r = sd_bus_call_method(bus, + "org.freedesktop.resolve1", + "/org/freedesktop/resolve1", + "org.freedesktop.resolve1.Manager", + "ResolveRecord", + &error, + &reply, + "isqqt", + 0, /* Network interface index where to look (0 means any) */ + argc >= 2 ? argv[1] : "gmail.com", /* Domain name */ + DNS_CLASS_IN, /* DNS RR class */ + DNS_TYPE_MX, /* DNS RR type */ + UINT64_C(0)); /* Input flags parameter */ + if (r < 0) { + fprintf(stderr, "Failed to resolve record: %s\n", error.message); + sd_bus_error_free(&error); + goto finish; + } + + r = sd_bus_message_enter_container(reply, 'a', "(iqqay)"); + if (r < 0) + goto parse_failure; + + for (;;) { + uint16_t class, type; + const void *data; + size_t length; + int ifindex; + + r = sd_bus_message_enter_container(reply, 'r', "iqqay"); + if (r < 0) + goto parse_failure; + if (r == 0) /* Reached end of array */ + break; + r = sd_bus_message_read(reply, "iqq", &ifindex, &class, &type); + if (r < 0) + goto parse_failure; + r = sd_bus_message_read_array(reply, 'y', &data, &length); + if (r < 0) + goto parse_failure; + r = sd_bus_message_exit_container(reply); + if (r < 0) + goto parse_failure; + + process_mx(data, length); + } + + r = sd_bus_message_exit_container(reply); + if (r < 0) + goto parse_failure; + r = sd_bus_message_read(reply, "t", &flags); + if (r < 0) + goto parse_failure; + + printf("Response is authenticated: %s\n", flags & SD_RESOLVED_AUTHENTICATED ? "yes" : "no"); + goto finish; + +parse_failure: + fprintf(stderr, "Parse failure: %s\n", strerror(-r)); + +finish: + sd_bus_message_unref(reply); + sd_bus_flush_close_unref(bus); + return r < 0 ? 
EXIT_FAILURE : EXIT_SUCCESS; + } +``` + +Compile this with a command line like the following (under the assumption you save the sources above as `rrtest.c`): + +``` +gcc rrtest.c -o rrtest -Wall `pkg-config --cflags --libs libsystemd` +``` diff --git a/docs/WRITING_VM_AND_CONTAINER_MANAGERS.md b/docs/WRITING_VM_AND_CONTAINER_MANAGERS.md new file mode 100644 index 0000000..4d1b649 --- /dev/null +++ b/docs/WRITING_VM_AND_CONTAINER_MANAGERS.md @@ -0,0 +1,29 @@ +--- +title: Writing VM and Container Managers +category: Documentation for Developers +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + + +# Writing VM and Container Managers + +_Or: How to hook up your favorite VM or container manager with systemd_ + +Nomenclature: a _Virtual Machine_ shall refer to a system running on virtualized hardware consisting of a full OS with its own kernel. A _Container_ shall refer to a system running on the same shared kernel of the host, but running a mostly complete OS with its own init system. Both kinds of virtualized systems shall collectively be called "machines". + +systemd provides a number of integration points with virtual machine and container managers, such as libvirt, LXC or systemd-nspawn. On one hand there are integration points of the VM/container manager towards the host OS it is running on, and on the other there are integration points for container managers towards the guest OS it is managing. + +Note that this document does not cover lightweight containers for the purpose of application sandboxes, i.e. containers that do _not_ run an init system of their own. + +## Host OS Integration + +All virtual machines and containers should be registered with the [machined](http://www.freedesktop.org/wiki/Software/systemd/machined) mini service that is part of systemd. This provides integration into the core OS at various points. 
For example, tools like ps, cgls, gnome-system-monitor use this registration information to show machine information for running processes, as each of the VM's/container's processes can reliably be attributed to a registered machine. The various systemd tools (like systemctl, journalctl, loginctl, systemd-run, ...) all support a -M switch that operates on machines registered with machined. "machinectl" may be used to execute operations on any such machine. When a machine is registered via machined its processes will automatically be placed in a systemd scope unit (that is located in the machines.slice slice) and thus appear in "systemctl" and similar commands. The scope unit name is based on the machine meta information passed to machined at registration. + +For more details on the APIs provided by machined consult [the bus API interface documentation](http://www.freedesktop.org/wiki/Software/systemd/machined). + +## Guest OS Integration + +As container virtualization is much less comprehensive, and the guest is less isolated from the host, there are a number of interfaces defined for how the container manager can set up the environment for systemd running inside a container. These interfaces are documented in [Container Interface of systemd](http://www.freedesktop.org/wiki/Software/systemd/ContainerInterface). + +VM virtualization is more comprehensive and fewer integration APIs are available. In fact there's only one: a VM manager may initialize the SMBIOS DMI field "Product UUID" to a UUID uniquely identifying this virtual machine instance. This is read in the guest via /sys/class/dmi/id/product_uuid, and used as configuration source for /etc/machine-id in the guest, if that file is not initialized yet. Note that this is currently only supported for KVM hosts, but may be extended to other managers as well. 
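The guest-side step described above (turning the SMBIOS "Product UUID" into a machine ID) can be illustrated with a minimal C sketch. It assumes that the UUID is exposed as a single line of hex digits grouped by dashes and that the machine ID is the same 32 hex digits in lowercase without separators. The helper names `uuid_to_machine_id()` and `read_product_uuid()` are hypothetical and not part of any systemd API; this is not how systemd itself implements the step.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not a systemd API): convert a UUID string such as
 * "4C4C4544-0042-3510-8051-B4C04F4D5231" into the 32-character lowercase
 * hex form. Returns 0 on success, -1 on malformed input. */
int uuid_to_machine_id(const char *uuid, char out[33]) {
        size_t n = 0;

        assert(uuid);
        assert(out);

        for (const char *p = uuid; *p && *p != '\n'; p++) {
                if (*p == '-')
                        continue;       /* skip the UUID group separators */
                if (!isxdigit((unsigned char) *p) || n >= 32)
                        return -1;      /* not a hex digit, or too many digits */
                out[n++] = (char) tolower((unsigned char) *p);
        }
        if (n != 32)
                return -1;              /* too few digits */
        out[n] = '\0';
        return 0;
}

/* Hypothetical helper: read the first line of `path` (in a guest this would
 * normally be /sys/class/dmi/id/product_uuid) and convert it. */
int read_product_uuid(const char *path, char out[33]) {
        char line[64];
        FILE *f = fopen(path, "r");

        if (!f)
                return -1;              /* file missing or not readable */
        if (!fgets(line, sizeof line, f)) {
                fclose(f);
                return -1;
        }
        fclose(f);
        return uuid_to_machine_id(line, out);
}
```

A caller would pass `/sys/class/dmi/id/product_uuid` as `path` and a 33-byte buffer as `out`. Reading that file typically requires root privileges, and a real implementation would also have to check that /etc/machine-id is not initialized yet before using the result.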
diff --git a/docs/_config.yml b/docs/_config.yml new file mode 100644 index 0000000..412db1f --- /dev/null +++ b/docs/_config.yml @@ -0,0 +1,10 @@ +# Site settings +# SPDX-License-Identifier: LGPL-2.1-or-later +title: systemd +baseurl: "" # the subpath of your site, e.g. /blog/ +url: "https://systemd.io" # the base hostname & protocol for your site + +permalink: /:title/ + +# Build settings +markdown: kramdown diff --git a/docs/_data/extra_pages.json b/docs/_data/extra_pages.json new file mode 100644 index 0000000..d24e301 --- /dev/null +++ b/docs/_data/extra_pages.json @@ -0,0 +1,517 @@ +[ + { + "category": "Project", + "title": "mkosi Project - Build Bespoke OS Images", + "url": "https://mkosi.systemd.io/" + }, + { + "category": "Project", + "title": "Brand", + "url": "https://brand.systemd.io/" + }, + { + "category": "Project", + "title": "Mailing List", + "url": "https://lists.freedesktop.org/mailman/listinfo/systemd-devel" + }, + { + "category": "Project", + "title": "Mastodon", + "url": "https://mastodon.social/@pid_eins" + }, + { + "category": "Project", + "title": "Releases", + "url": "https://github.com/systemd/systemd/releases" + }, + { + "category": "Project", + "title": "GitHub Project Page", + "url": "https://github.com/systemd/systemd" + }, + { + "category": "Project", + "title": "Issues", + "url": "https://github.com/systemd/systemd/issues" + }, + { + "category": "Project", + "title": "Pull Requests", + "url": "https://github.com/systemd/systemd/pulls" + }, + { + "category": "Manual Pages", + "title": "Index", + "url": "https://www.freedesktop.org/software/systemd/man/" + }, + { + "category": "Manual Pages", + "title": "Directives", + "url": "https://www.freedesktop.org/software/systemd/man/systemd.directives.html" + }, + { + "category": "Publications", + "title": "Article in The H", + "url": "http://www.h-online.com/open/features/Control-Centre-The-systemd-Linux-init-system-1565543.html" + }, + { + "category": "Publications", + "title": "Article in 
The H, Part 2", + "url": "http://www.h-online.com/open/features/Booting-up-Tools-and-tips-for-systemd-1570630.html" + }, + { + "category": "Publications", + "title": "Bê-á-bá do systemd #1 (Brazilian Portuguese)", + "url": "https://community.ibm.com/community/user/legacy" + }, + { + "category": "Publications", + "title": "Bê-á-bá do systemd #2 (Brazilian Portuguese)", + "url": "https://community.ibm.com/community/user/legacy" + }, + { + "category": "Publications", + "title": "Bê-á-bá do systemd #3 (Brazilian Portuguese)", + "url": "https://community.ibm.com/community/user/legacy" + }, + { + "category": "Publications", + "title": "Bê-á-bá do systemd #4 (Brazilian Portuguese)", + "url": "https://community.ibm.com/community/user/legacy" + }, + { + "category": "Publications", + "title": "Bê-á-bá do systemd #5 (Brazilian Portuguese)", + "url": "https://community.ibm.com/community/user/legacy" + }, + { + "category": "Publications", + "title": "Bê-á-bá do systemd #6 (Brazilian Portuguese)", + "url": "https://community.ibm.com/community/user/legacy" + }, + { + "category": "Publications", + "title": "Évolutions techniques de systemd (French)", + "url": "https://linuxfr.org/news/evolutions-techniques-de-systemd" + }, + { + "category": "Publications", + "title": "RHEL7 docs", + "url": "https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/system_administrators_guide/chap-managing_services_with_systemd" + }, + { + "category": "Publications", + "title": "SUSE White Paper on systemd", + "url": "https://www.suse.com/media/white-paper/systemd_in_suse_linux_enterprise_12_white_paper.pdf" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about kdbus at linux.conf.au 2014", + "url": "https://mirror.linux.org.au/pub/linux.conf.au/2014/Friday/104-D-Bus_in_the_kernel_-_Lennart_Poettering.mp4" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at the Red Hat Summit 2013", + 
"url": "https://access.redhat.com/videos/403833" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about the journal at Devconf 2013", + "url": "https://www.youtube.com/watch?v=i4CACB7paLc" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about recent developments at Devconf 2013", + "url": "https://www.youtube.com/watch?v=_rrpjYD373A" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at FOSDEM 2013", + "url": "https://ftp.fau.de/fosdem/2013/maintracks/Janson/systemd,_Two_Years_Later.webm" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at FOSDEM 2013 (Slides)", + "url": "https://0pointer.de/public/systemd-fosdem2013.pdf" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at FOSS.in 2012", + "url": "https://www.youtube.com/watch?v=_2aa34Uzr3c" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at OSEC Barcamp 2012", + "url": "https://www.youtube.com/watch?v=9UnEV9SPuw8" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at FOSDEM 2011", + "url": "https://www.youtube.com/watch?v=TyMLi8QF6sw" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at linux.conf.au 2011", + "url": "http://linuxconfau.blip.tv/file/4696791/" + }, + { + "category": "Videos for Users and Administrators", + "title": "Presentation about systemd at linux.conf.au 2011 (Slides)", + "url": "https://0pointer.de/public/systemd-lca2011.pdf" + }, + { + "category": "Videos for Users and Administrators", + "title": "Interview about systemd at golem.de (German)", + "url": "https://video.golem.de/oss/4823/interview-mit-lennart-poettering-entwickler-systemd.html" + }, + { + "category": "Videos for Users and Administrators", + "title": 
"Presentation about systemd at OSworld 2014 (systemd cheat-sheet) (Polish)", + "url": "https://www.youtube.com/watch?v=tU3HJVUPMyw" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#01: Verifying Bootup", + "url": "https://0pointer.de/blog/projects/systemd-for-admins-1.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#02: Which Service Owns Which Processes?", + "url": "https://0pointer.de/blog/projects/systemd-for-admins-2.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#03: How Do I Convert A SysV Init Script Into A systemd Service File?", + "url": "https://0pointer.de/blog/projects/systemd-for-admins-3.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#04: Killing Services", + "url": "https://0pointer.de/blog/projects/systemd-for-admins-4.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#05: The Three Levels of \"Off\"", + "url": "https://0pointer.de/blog/projects/three-levels-of-off" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#06: Changing Roots", + "url": "https://0pointer.de/blog/projects/changing-roots.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#07: The Blame Game", + "url": "https://0pointer.de/blog/projects/blame-game.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#08: The New Configuration Files", + "url": "https://0pointer.de/blog/projects/the-new-configuration-files" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#09: On /etc/sysconfig and /etc/default", + "url": "https://0pointer.de/blog/projects/on-etc-sysinit.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#10: Instantiated Services", + "url": "https://0pointer.de/blog/projects/instances.html" + }, + { + "category": "The systemd 
for Administrators Blog Series", + "title": "#11: Converting inetd Services", + "url": "https://0pointer.de/blog/projects/inetd.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#12: Securing Your Services", + "url": "https://0pointer.de/blog/projects/security.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#13: Log and Service Status", + "url": "https://0pointer.de/blog/projects/systemctl-journal.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#14: The Self-Explanatory Boot", + "url": "https://0pointer.de/blog/projects/self-documented-boot.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#15: Watchdogs", + "url": "https://0pointer.de/blog/projects/watchdog.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#16: Gettys on Serial Consoles (and Elsewhere)", + "url": "https://0pointer.de/blog/projects/serial-console.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#17: Using the Journal", + "url": "https://0pointer.de/blog/projects/journalctl.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#18: Managing Resources", + "url": "https://0pointer.de/blog/projects/resources.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#19: Detecting Virtualization", + "url": "https://0pointer.de/blog/projects/detect-virt.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#20: Socket Activated Internet Services and OS Containers", + "url": "https://0pointer.de/blog/projects/socket-activated-containers.html" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "#21: Container Integration", + "url": "https://0pointer.net/blog/systemd-for-administrators-part-xxi.html" + }, + { + "category": "The systemd for Administrators Blog 
Series", + "title": "A Russian translation", + "url": "https://wiki.opennet.ru/Systemd_%D0%B4%D0%BB%D1%8F_%D0%B0%D0%B4%D0%BC%D0%B8%D0%BD%D0%B8%D1%81%D1%82%D1%80%D0%B0%D1%82%D0%BE%D1%80%D0%BE%D0%B2" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "A more complete Russian translation (PDF)", + "url": "http://www2.kangran.su/~nnz/pub/s4a/s4a_latest.pdf" + }, + { + "category": "The systemd for Administrators Blog Series", + "title": "A Vietnamese translation", + "url": "https://archlinuxvn.org/doc/systemd/#lp" + }, + { + "category": "The systemd for Developers Series", + "title": "#1: Socket Activation", + "url": "https://0pointer.de/blog/projects/socket-activation.html" + }, + { + "category": "The systemd for Developers Series", + "title": "#2: Socket Activation (Part 2)", + "url": "https://0pointer.de/blog/projects/socket-activation2.html" + }, + { + "category": "The systemd for Developers Series", + "title": "#3: Logging to the Journal", + "url": "https://0pointer.de/blog/projects/journal-submit.html" + }, + { + "category": "Related Packages", + "title": "Go Bindings for the Journal API, socket activation and DBUS", + "url": "https://github.com/coreos/go-systemd" + }, + { + "category": "Related Packages", + "title": "PHP Bindings for the Journal APIs", + "url": "https://github.com/systemd/php-systemd" + }, + { + "category": "Related Packages", + "title": "Lua Bindings for systemd APIs", + "url": "https://github.com/daurnimator/lua-systemd" + }, + { + "category": "Related Packages", + "title": "Node.JS bindings for the Journal APIs", + "url": "https://www.fourkitchens.com/blog/2012/09/25/nodejs-extension-systemd" + }, + { + "category": "Related Packages", + "title": "Node.JS support for systemd Socket Activation", + "url": "https://www.npmjs.com/package/systemd" + }, + { + "category": "Related Packages", + "title": "Node.JS wrapper for sd_notify", + "url": "https://www.npmjs.com/package/sd-notify" + }, + { + "category": "Related 
Packages", + "title": "Node.JS wrapper for sd_notify (repo)", + "url": "https://github.com/systemd/node-sd-notify" + }, + { + "category": "Related Packages", + "title": "Experimental Qt bindings", + "url": "https://github.com/ilpianista/libsystemd-qt" + }, + { + "category": "Related Packages", + "title": "Haskell socket activation", + "url": "https://hackage.haskell.org/package/socket-activation" + }, + { + "category": "Related Packages", + "title": "Haskell Journal API", + "url": "https://hackage.haskell.org/package/libsystemd-journal" + }, + { + "category": "Related Packages", + "title": "Ruby bindings for the Journal APIs", + "url": "https://github.com/ledbettj/systemd-journal" + }, + { + "category": "Related Packages", + "title": "Ruby bindings for the systemd D-Bus APIs", + "url": "https://github.com/nathwill/ruby-dbus-systemd" + }, + { + "category": "Related Packages", + "title": "Erlang bindings for the Journal APIs", + "url": "https://github.com/systemd/ejournald" + }, + { + "category": "Related Packages", + "title": "Erlang journald backend for Lager", + "url": "https://github.com/travelping/lager_journald_backend" + }, + { + "category": "Related Packages", + "title": "Perl bindings for the Journal APIs", + "url": "https://metacpan.org/release/LKUNDRAK/Log-Journald-0.10" + }, + { + "category": "Related Packages", + "title": "GLib bindings", + "url": "https://github.com/tcbrindle/systemd-glib" + }, + { + "category": "Related Packages", + "title": "python-systemd", + "url": "https://www.freedesktop.org/software/systemd/python-systemd/index.html" + }, + { + "category": "Related Packages", + "title": "pystemd", + "url": "https://github.com/systemd/pystemd" + }, + { + "category": "Related Packages", + "title": "C++ bindings for sd-bus", + "url": "https://github.com/Kistler-Group/sdbus-cpp/" + }, + { + "category": "Documentation for Developers - external links", + "title": "On /etc/os-release", + "url": "http://0pointer.de/blog/projects/os-release.html" + }, + { 
+ "category": "Documentation for Developers - external links", + "title": "Control Groups vs. Control Groups", + "url": "http://0pointer.de/blog/projects/cgroups-vs-cgroups.html" + }, + { + "category": "Documentation for Developers - external links", + "title": "The 30 Biggest Myths about systemd", + "url": "http://0pointer.de/blog/projects/the-biggest-myths.html" + }, + { + "category": "Documentation for Developers - external links", + "title": "Introduction to systemd in French", + "url": "http://lea-linux.org/documentations/Systemd" + }, + { + "category": "The various distributions", + "title": "Fedora packages", + "url": "https://packages.fedoraproject.org/pkgs/systemd/systemd/" + }, + { + "category": "The various distributions", + "title": "Fedora sources", + "url": "https://src.fedoraproject.org/rpms/systemd" + }, + { + "category": "The various distributions", + "title": "Fedora bugtracker", + "url": "https://bugzilla.redhat.com/buglist.cgi?list_id=565273&classification=Fedora&query_format=advanced&bug_status=NEW&bug_status=ASSIGNED&bug_status=MODIFIED&bug_status=ON_DEV&bug_status=ON_QA&bug_status=VERIFIED&bug_status=RELEASE_PENDING&bug_status=POST&component=systemd&product=Fedora" + }, + { + "category": "The various distributions", + "title": "openSUSE packages", + "url": "https://build.opensuse.org/package/show/Base:System/systemd" + }, + { + "category": "The various distributions", + "title": "openSUSE instructions", + "url": "http://en.opensuse.org/SDB:Systemd" + }, + { + "category": "The various distributions", + "title": "openSUSE bugtracker", + "url": 
"https://bugzilla.novell.com/buglist.cgi?short_desc=systemd&field0-0-0=product&type0-0-1=substring&field0-0-1=component&classification=openSUSE&value0-0-2=systemd&query_based_on=systemd&query_format=advanced&type0-0-3=substring&field0-0-3=status_whiteboard&value0-0-3=systemd&bug_status=UNCONFIRMED&bug_status=NEW&bug_status=ASSIGNED&bug_status=NEEDINFO&bug_status=REOPENED&short_desc_type=allwordssubstr&field0-0-2=short_desc&value0-0-1=systemd&type0-0-0=substring&value0-0-0=systemd&type0-0-2=substring&known_name=systemd" + }, + { + "category": "The various distributions", + "title": "Arch Linux packages", + "url": "https://www.archlinux.org/packages/core/x86_64/systemd/" + }, + { + "category": "The various distributions", + "title": "Arch Linux wiki", + "url": "https://wiki.archlinux.org/index.php/Systemd" + }, + { + "category": "The various distributions", + "title": "Arch Linux bugtracker", + "url": "https://gitlab.archlinux.org/archlinux/packaging/packages/systemd/-/issues" + }, + { + "category": "The various distributions", + "title": "Debian packages", + "url": "http://packages.debian.org/systemd" + }, + { + "category": "The various distributions", + "title": "Debian wiki", + "url": "http://wiki.debian.org/systemd" + }, + { + "category": "The various distributions", + "title": "Debian bugtracker", + "url": "http://bugs.debian.org/cgi-bin/pkgreport.cgi?ordering=normal;archive=0;src=systemd" + }, + { + "category": "The various distributions", + "title": "Ubuntu packages", + "url": "https://launchpad.net/ubuntu/+source/systemd" + }, + { + "category": "The various distributions", + "title": "Ubuntu wiki", + "url": "https://wiki.ubuntu.com/systemd" + }, + { + "category": "The various distributions", + "title": "Mageia packages", + "url": "http://svnweb.mageia.org/packages/cauldron/systemd/current/" + }, + { + "category": "The various distributions", + "title": "Mageia bugtracker", + "url": 
"https://bugs.mageia.org/buglist.cgi?field0-0-0=cf_rpmpkg&query_format=advanced&bug_status=NEW&bug_status=UNCONFIRMED&bug_status=ASSIGNED&bug_status=REOPENED&type0-0-0=substring&value0-0-0=systemd&component=RPM%20Packages&product=Mageia" + }, + { + "category": "The various distributions", + "title": "Gentoo packages", + "url": "http://packages.gentoo.org/package/sys-apps/systemd" + }, + { + "category": "The various distributions", + "title": "Gentoo wiki", + "url": "http://wiki.gentoo.org/wiki/Systemd" + }, + { + "category": "The various distributions", + "title": "Gentoo bugtracker", + "url": "https://bugs.gentoo.org/buglist.cgi?quicksearch=systemd" + } +] diff --git a/docs/_includes/footer.html b/docs/_includes/footer.html new file mode 100644 index 0000000..3e5214e --- /dev/null +++ b/docs/_includes/footer.html @@ -0,0 +1,7 @@ +<!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + +<footer class="site-footer"> + <p>© systemd, 2023</p> + + <p><a href="https://github.com/systemd/systemd/tree/main/docs">Website source</a></p> +</footer> diff --git a/docs/_includes/head.html b/docs/_includes/head.html new file mode 100644 index 0000000..ae39a3c --- /dev/null +++ b/docs/_includes/head.html @@ -0,0 +1,16 @@ +<!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + +<head> + <meta charset="utf-8"> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="theme-color" content="#201A26"> + + <title>{% if page.title %}{{ page.title }}{% else %}{{ site.title }}{% endif %}</title> + + <link rel="canonical" href="{{ page.url | replace:'index.html','' | prepend: site.baseurl | prepend: site.url }}"> + + <link rel="stylesheet" href="{{ "/style.css" | prepend: site.baseurl }}"> + + <link rel="icon" type="image/png" href="/favicon.png" /> +</head> diff --git a/docs/_includes/header.html b/docs/_includes/header.html new file mode 100644 index 0000000..4f2f73b --- /dev/null +++ 
b/docs/_includes/header.html @@ -0,0 +1,15 @@ +<!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + +<header class="site-header"> + + <div class="wrapper"> + + <a class="page-logo" href="{{ site.baseurl }}/"> + <svg width="202" height="26" viewBox="0 0 202 26"> + <use href="/assets/systemd-logo.svg#systemd-logo"/> + </svg> + </a> + + </div> + +</header> diff --git a/docs/_layouts/default.html b/docs/_layouts/default.html new file mode 100644 index 0000000..9e3b1b0 --- /dev/null +++ b/docs/_layouts/default.html @@ -0,0 +1,20 @@ +<!DOCTYPE html> +<!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + +<html lang="en"> + + {% include head.html %} + + <body> + + {% include header.html %} + + <div class="container"> + {{ content }} + </div> + + {% include footer.html %} + + </body> + +</html> diff --git a/docs/_layouts/forward.html b/docs/_layouts/forward.html new file mode 100644 index 0000000..5d3799b --- /dev/null +++ b/docs/_layouts/forward.html @@ -0,0 +1,26 @@ +<!DOCTYPE html> +<!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + +<html lang="en"> + <head> + <meta charset="utf-8"> + <meta http-equiv="refresh" content="0;url={{ page.target }}"/> + <meta http-equiv="X-UA-Compatible" content="IE=edge"> + <meta name="viewport" content="width=device-width, initial-scale=1"> + <meta name="theme-color" content="#201A26"> + <meta name="robots" content="noindex,follow"> + <link rel="stylesheet" href="{{ "/style.css" | prepend: site.baseurl }}"> + <link rel="icon" type="image/png" href="/favicon.png" /> + <link rel="canonical" href="{{ page.target }}"/> + <title>Redirecting to {{ page.target }}</title> + </head> + <body> + {% include header.html %} + <div class="container"> + <p> + This document has moved.<br> + Redirecting to <a href="{{ page.target }}">{{ page.target }}</a>. 
+ </p> + </div> + </body> +</html> diff --git a/docs/assets/systemd-boot-menu.png b/docs/assets/systemd-boot-menu.png Binary files differnew file mode 100644 index 0000000..25f3bed --- /dev/null +++ b/docs/assets/systemd-boot-menu.png diff --git a/docs/assets/systemd-logo.svg b/docs/assets/systemd-logo.svg new file mode 100644 index 0000000..a8af438 --- /dev/null +++ b/docs/assets/systemd-logo.svg @@ -0,0 +1,7 @@ +<svg xmlns="http://www.w3.org/2000/svg" width="202" height="26" viewBox="0 0 202 26" id="systemd-logo"> + <!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + <path d="M0 0v26h10v-4H4V4h6V0zm76 0v4h6v18h-6v4h10V0z" fill="currentColor"/> + <path d="M113.498 14.926q-4.5-.96-4.5-3.878 0-1.079.609-1.981.621-.902 1.781-1.441 1.16-.54 2.707-.54 1.63 0 2.848.528 1.219.516 1.875 1.453.656.926.656 2.121h-3.539q0-.762-.457-1.183-.457-.434-1.394-.434-.774 0-1.243.363-.457.364-.457.938 0 .55.516.89.527.34 1.781.575 1.5.28 2.543.738 1.043.445 1.653 1.242.62.797.62 2.027 0 1.114-.667 2.004-.657.88-1.887 1.383-1.219.504-2.836.504-1.711 0-2.965-.621-1.242-.633-1.898-1.617-.645-.985-.645-2.051h3.34q.036.914.656 1.36.621.433 1.594.433.902 0 1.383-.34.492-.351.492-.937 0-.364-.223-.61-.21-.258-.773-.48-.55-.223-1.57-.446zm19.384-7.606l-5.086 14.58q-.293.831-.726 1.523-.434.703-1.266 1.195-.832.504-2.098.504-.457 0-.75-.048-.281-.046-.785-.176v-2.672q.176.02.527.02.95 0 1.418-.293.47-.293.715-.961l.352-.926-4.43-12.738h3.797l2.262 7.687 2.285-7.687zm5.884 7.606q-4.5-.96-4.5-3.878 0-1.079.61-1.981.62-.902 1.781-1.441 1.16-.54 2.707-.54 1.629 0 2.848.528 1.218.516 1.875 1.453.656.926.656 2.121h-3.539q0-.762-.457-1.183-.457-.434-1.395-.434-.773 0-1.242.363-.457.364-.457.938 0 .55.516.89.527.34 1.781.575 1.5.28 2.543.738 1.043.445 1.652 1.242.621.797.621 2.027 0 1.114-.668 2.004-.656.88-1.886 1.383-1.219.504-2.836.504-1.711 0-2.965-.621-1.242-.633-1.899-1.617-.644-.985-.644-2.051h3.34q.036.914.656 1.36.621.433 1.594.433.902 0 1.383-.34.492-.351.492-.937 
0-.364-.223-.61-.21-.258-.773-.48-.551-.223-1.57-.446zm13.983 2.403q.574 0 .984-.082v2.66q-.914.328-2.086.328-3.727 0-3.727-3.797V9.899h-1.793V7.321h1.793v-3.14h3.54v3.14h2.132v2.578h-2.133v6.129q0 .75.293 1.031.293.27.997.27zm14.228-2.519h-8.016q.2 1.183.985 1.886.785.691 2.015.691.914 0 1.688-.34.785-.351 1.336-1.042l1.699 1.957q-.668.96-1.957 1.617-1.278.656-3 .656-1.946 0-3.387-.82-1.43-.82-2.203-2.227-.762-1.406-.762-3.105v-.446q0-1.898.715-3.386.715-1.489 2.063-2.32 1.347-.844 3.187-.844 1.793 0 3.059.761 1.265.762 1.922 2.168.656 1.395.656 3.293zm-3.469-2.65q-.024-1.03-.574-1.628-.54-.598-1.617-.598-1.008 0-1.582.668-.563.668-.739 1.84h4.512zm19.923-5.073q1.934 0 2.989 1.148 1.054 1.148 1.054 3.727v8.039h-3.539V11.95q0-.797-.21-1.23-.212-.446-.61-.61-.387-.164-.984-.164-.715 0-1.219.352-.504.34-.797.972.02.082.02.27V20h-3.54v-8.015q0-.797-.21-1.242-.211-.445-.61-.621-.386-.176-.996-.176-.68 0-1.183.304-.492.293-.797.844V20h-3.539V7.32h3.316l.118 1.419q.633-.797 1.547-1.22.926-.433 2.086-.433 1.172 0 2.016.48.855.47 1.312 1.442.633-.926 1.582-1.418.961-.504 2.203-.504zM201.398 2v18h-3.187l-.176-1.359q-1.243 1.594-3.212 1.594-1.535 0-2.66-.82-1.113-.832-1.699-2.285-.574-1.454-.574-3.317v-.246q0-1.934.574-3.398.586-1.465 1.7-2.274 1.124-.808 2.683-.808 1.805 0 3.012 1.37V2.001zm-5.672 15.376q1.488 0 2.133-1.266v-4.898q-.61-1.266-2.11-1.266-1.207 0-1.77.984-.55.985-.55 2.637v.246q0 1.629.54 2.602.55.96 1.757.96z" fill="currentColor"/> + <path d="M45 13L63 3v20z" fill="#30d475"/> + <circle cx="30" cy="13" r="9" fill="#30d475"/> +</svg> diff --git a/docs/favicon.png b/docs/favicon.png Binary files differnew file mode 100644 index 0000000..f4b5cc1 --- /dev/null +++ b/docs/favicon.png diff --git a/docs/favicon.svg b/docs/favicon.svg new file mode 100644 index 0000000..37985c3 --- /dev/null +++ b/docs/favicon.svg @@ -0,0 +1,11 @@ +<svg xmlns="http://www.w3.org/2000/svg" width="16" height="16"> + <!-- SPDX-License-Identifier: LGPL-2.1-or-later --> + <g 
transform="translate(380 -506.52)"> + <rect ry="16.875" rx="16.875" y="2409.281" x="4128.568" height="90" width="90" fill="#201a26" transform="matrix(.17778 0 0 .17778 -1113.968 78.203)" stroke-width="5.625"/> + <g fill="none" stroke="#fff" stroke-width="2"> + <path d="M-377 513.02h-1.5v3h1.5M-367 513.02h1.5v3h-1.5" stroke-width="1"/> + </g> + <path d="M-368 516.77v-4.5l-3 1.25-1 1 1 1z" fill="#30d475"/> + <circle cx="-374.25" cy="514.52" r="1.75" fill="#30d475"/> + </g> +</svg> diff --git a/docs/fonts/heebo-bold.woff b/docs/fonts/heebo-bold.woff Binary files differnew file mode 100644 index 0000000..1e45115 --- /dev/null +++ b/docs/fonts/heebo-bold.woff diff --git a/docs/fonts/heebo-regular.woff b/docs/fonts/heebo-regular.woff Binary files differnew file mode 100644 index 0000000..484eae4 --- /dev/null +++ b/docs/fonts/heebo-regular.woff diff --git a/docs/index.md b/docs/index.md new file mode 100644 index 0000000..3c05c93 --- /dev/null +++ b/docs/index.md @@ -0,0 +1,98 @@ +--- +layout: default +SPDX-License-Identifier: LGPL-2.1-or-later +--- + +# System and Service Manager + +systemd is a suite of basic building blocks for a Linux system. It provides a system and service manager that runs as PID 1 and starts the rest of the system. + +systemd provides aggressive parallelization capabilities, uses socket and D-Bus activation for starting services, offers on-demand starting of daemons, keeps track of processes using Linux control groups, maintains mount and automount points, and implements an elaborate transactional dependency-based service control logic. systemd supports SysV and LSB init scripts and works as a replacement for sysvinit. 
+ +Other parts include a logging daemon, utilities to control basic system configuration like the hostname, date, locale, maintain a list of logged-in users and running containers and virtual machines, system accounts, runtime directories and settings, and daemons to manage simple network configuration, network time synchronization, log forwarding, and name resolution. + +--- + +{% assign by_category = site.pages | group_by:"category" %} +{% assign extra_pages = site.data.extra_pages | group_by:"category" %} +{% assign merged = by_category | concat: extra_pages | sort:"name" %} + +{% for pair in merged %} + {% if pair.name != "" %} +## {{ pair.name }} +{% assign sorted = pair.items | sort:"title" %}{% for page in sorted %} +* [{{ page.title }}]({{ page.url | relative_url }}){% endfor %} + {% endif %} +{% endfor %} + +## See also + +* [Introductory blog story](https://0pointer.de/blog/projects/systemd.html) +* [Three](https://0pointer.de/blog/projects/systemd-update.html) [status](https://0pointer.de/blog/projects/systemd-update-2.html) [updates](https://0pointer.de/blog/projects/systemd-update-3.html) +* The [Wikipedia article](https://en.wikipedia.org/wiki/systemd) + +--- + +<pre class="intro-code-block"> +Welcome to <span class="color-blue">Fedora 20 (Heisenbug)</span>! + +[ <span class="color-green">OK</span> ] Reached target Remote File Systems. +[ <span class="color-green">OK</span> ] Listening on Delayed Shutdown Socket. +[ <span class="color-green">OK</span> ] Listening on /dev/initctl Compatibility Named Pipe. +[ <span class="color-green">OK</span> ] Reached target Paths. +[ <span class="color-green">OK</span> ] Reached target Encrypted Volumes. +[ <span class="color-green">OK</span> ] Listening on Journal Socket. + Mounting Huge Pages File System... + Mounting POSIX Message Queue File System... + Mounting Debug File System... + Starting Journal Service... +[ <span class="color-green">OK</span> ] Started Journal Service. 
+ Mounting Configuration File System... + Mounting FUSE Control File System... +[ <span class="color-green">OK</span> ] Created slice Root Slice. +[ <span class="color-green">OK</span> ] Created slice User and Session Slice. +[ <span class="color-green">OK</span> ] Created slice System Slice. +[ <span class="color-green">OK</span> ] Reached target Slices. +[ <span class="color-green">OK</span> ] Reached target Swap. + Mounting Temporary Directory... +[ <span class="color-green">OK</span> ] Reached target Local File Systems (Pre). + Starting Load Random Seed... + Starting Load/Save Random Seed... +[ <span class="color-green">OK</span> ] Mounted Huge Pages File System. +[ <span class="color-green">OK</span> ] Mounted POSIX Message Queue File System. +[ <span class="color-green">OK</span> ] Mounted Debug File System. +[ <span class="color-green">OK</span> ] Mounted Configuration File System. +[ <span class="color-green">OK</span> ] Mounted FUSE Control File System. +[ <span class="color-green">OK</span> ] Mounted Temporary Directory. +[ <span class="color-green">OK</span> ] Started Load Random Seed. +[ <span class="color-green">OK</span> ] Started Load/Save Random Seed. +[ <span class="color-green">OK</span> ] Reached target Local File Systems. + Starting Recreate Volatile Files and Directories... + Starting Trigger Flushing of Journal to Persistent Storage... +[ <span class="color-green">OK</span> ] Started Recreate Volatile Files and Directories. + Starting Record System Reboot/Shutdown in UTMP... +[ <span class="color-green">OK</span> ] Started Trigger Flushing of Journal to Persistent Storage. +[ <span class="color-green">OK</span> ] Started Record System Reboot/Shutdown in UTMP. +[ <span class="color-green">OK</span> ] Reached target System Initialization. +[ <span class="color-green">OK</span> ] Reached target Timers. +[ <span class="color-green">OK</span> ] Listening on D-Bus System Message Bus Socket. 
+[ <span class="color-green">OK</span> ] Reached target Sockets. +[ <span class="color-green">OK</span> ] Reached target Basic System. + Starting Permit User Sessions... + Starting D-Bus System Message Bus... +[ <span class="color-green">OK</span> ] Started D-Bus System Message Bus. + Starting Login Service... + Starting Cleanup of Temporary Directories... +[ <span class="color-green">OK</span> ] Started Permit User Sessions. +[ <span class="color-green">OK</span> ] Started Cleanup of Temporary Directories. + Starting Console Getty... +[ <span class="color-green">OK</span> ] Started Console Getty. +[ <span class="color-green">OK</span> ] Reached target Login Prompts. +[ <span class="color-green">OK</span> ] Started Login Service. +[ <span class="color-green">OK</span> ] Reached target Multi-User System. + +Fedora release 20 (Heisenbug) +Kernel 3.9.2-200.fc18.x86_64 on an x86_64 (console) + +fedora login: +</pre> diff --git a/docs/style.css b/docs/style.css new file mode 100644 index 0000000..ee0fc7f --- /dev/null +++ b/docs/style.css @@ -0,0 +1,574 @@ +/* SPDX-License-Identifier: LGPL-2.1-or-later */ + +@font-face { + font-family: 'Heebo'; + src: url('fonts/heebo-regular.woff'); + font-weight: 400; +} + +@font-face { + font-family: 'Heebo'; + src: url('fonts/heebo-bold.woff'); + font-weight: 600; +} + +/* Variables */ +:root { + --sd-brand-black: hsl(270, 19%, 13%); /* #201A26; */ + --sd-brand-green: hsl(145, 66%, 51%); /* #30D475; */ + --sd-brand-white: #fff; + + --sd-black: hsl(270, 7%, 13%); + --sd-green: hsl(145, 66%, 43%); /* #26b763 */ + --sd-gray-extralight: hsl(30, 10%, 96%); /* #f6f5f4 */ + --sd-gray-light: hsl(30, 10%, 92%); + --sd-gray: hsl(30, 10%, 85%); + --sd-gray-dark: hsl(257, 23%, 20%); + --sd-gray-extradark: hsl(257, 23%, 16%); /* #241f31 */ + --sd-blue: hsl(200, 66%, 55%); + + --sd-highlight-bg-light: rgba(255, 255, 255, 1); + --sd-highlight-bg-dark: rgba(0, 0, 0, .6); + --sd-highlight-inline-bg-light: rgba(0, 0, 0, 0.07); + 
--sd-highlight-inline-bg-dark: rgba(255, 255, 255, 0.1); + + --sd-font-weight-normal: 400; + --sd-font-weight-bold: 600; + + /* Light mode variables */ + --sd-foreground-color: var(--sd-gray-extradark); + --sd-background-color: var(--sd-gray-extralight); + --sd-logo-color: var(--sd-brand-black); + --sd-link-color: var(--sd-green); + --sd-small-color: var(--sd-gray-dark); + --sd-highlight-bg: var(--sd-highlight-bg-light); + --sd-highlight-inline-bg: var(--sd-highlight-inline-bg-light); + --sd-link-font-weight: var(--sd-font-weight-bold); + --sd-table-row-bg: var(--sd-highlight-inline-bg-light); + --sd-table-row-hover-bg: var(--sd-gray); +} + +@media (prefers-color-scheme: dark) { + :root { + color-scheme: dark; + --sd-foreground-color: var(--sd-gray); + --sd-background-color: var(--sd-black); + --sd-logo-color: var(--sd-brand-white); + --sd-link-color: var(--sd-brand-green); + --sd-small-color: var(--sd-gray); + --sd-highlight-bg: var(--sd-highlight-bg-dark); + --sd-highlight-inline-bg: var(--sd-highlight-inline-bg-dark); + --sd-link-font-weight: var(--sd-font-weight-normal); + --sd-table-row-bg: var(--sd-highlight-inline-bg-dark); + --sd-table-row-hover-bg: var(--sd-highlight-bg-dark); + } +} + +/* Typography */ +* { + -moz-box-sizing: border-box; + -webkit-box-sizing: border-box; + box-sizing: border-box; +} +html, body { + margin: 0; + padding: 0; + font-size: 1rem; + font-family: "Heebo", sans-serif; + font-weight: 400; + line-height: 1.6; +} +body { + color: var(--sd-foreground-color); + background-color: var(--sd-background-color); +} +h1, h2, h3, h4, h5, h6 { + margin: 1rem 0 0.625rem; + font-weight: 600; + line-height: 1.25; +} +h1 { + text-align: center; + font-size: 1.87rem; + font-weight: 400; + font-style: normal; + margin-bottom: 2rem; +} +@media screen and (min-width: 650px) { + h1 { + font-size: 2.375em; + } +} +h2 { + font-size: 1.25rem; + margin-top: 2.5em; +} +h3 { + font-size: 1.15rem; +} +a { + font-weight: var(--sd-link-font-weight); + 
text-decoration: none; + color: var(--sd-link-color); + cursor: pointer; +} +a:hover { + text-decoration: underline; +} +b { + font-weight: 600; +} +small { + color: var(--sd-small-color); +} +hr { + margin: 3rem auto 4rem; + width: 40%; + opacity: 40%; +} + +/* Layout */ +.container > * { + width: 80%; + margin-left: auto; + margin-right: auto; + max-width: 720px; +} + +.container > table { + max-width: 1600px; +} + +.container > h1 { + max-width: 530px; +} + +/* Tables */ +table { + width: auto !important; + border-collapse: separate; + border-spacing: 0; + margin-top: 2em; + margin-bottom: 3em; + overflow-x: auto; + display: block; /* required for overflow-x auto on tables */ +} +@media screen and (min-width: 768px) { + table { + display: table; + border-left: 1rem solid transparent; + border-right: 1rem solid transparent; + } +} + +thead tr, +tbody:first-child tr:nth-child(odd), +thead + tbody tr:nth-child(even) { + background-color: var(--sd-table-row-bg); +} + +tbody tr:hover { + background-color: var(--sd-table-row-hover-bg) !important; +} + +th, td { + vertical-align: top; + text-align: left; + padding: .5rem; +} + +th:first-child, td:first-child { + padding-left: 0.75rem; +} + +th:last-child, td:last-child { + padding-right: 0.75rem; +} + +/* Custom content */ +.intro-code-block { + background-color: var(--sd-brand-black); + color: var(--sd-brand-white); + font-size: 0.875rem; + padding: 1em; + overflow-x: auto; +} +@media (prefers-color-scheme: dark) { + .intro-code-block { + background-color: var(--sd-highlight-bg); + } +} + +/* Singletons */ +.page-logo { + display: block; + padding: 5rem 0 3rem; + color: var(--sd-logo-color); +} +.page-logo > svg { + display: block; + width: 12.625em; + height: auto; + margin: 0 auto; +} + +.brand-white { + background-color: var(--sd-brand-white); +} + +.brand-green { + background-color: var(--sd-brand-green); +} + +.brand-black { + background-color: var(--sd-brand-black); + color: var(--sd-brand-white); +} + 
+.color-green { + color: var(--sd-brand-green); +} + +.color-blue { + color: var(--sd-blue); +} + +.page-link::after { + content: " ➜"; +} + + +/* Footer */ +footer { + text-align: center; + padding: 3em 0 3em; + font-size: 1em; + margin-top: 4rem; +} + +/* Make tables vertically aligned to the top */ +tbody td { + vertical-align: top; +} + +/* Rouge Code Highlight, github style */ +/* Generated with: rougify style github | sed '/background-color: #f8f8f8/d' */ +.highlight table td { padding: 5px; } +.highlight table pre { margin: 0; } + +@media (prefers-color-scheme: light) { + .highlight .cm { + color: #999988; + font-style: italic; + } + .highlight .cp { + color: #999999; + font-weight: bold; + } + .highlight .c1 { + color: #999988; + font-style: italic; + } + .highlight .cs { + color: #999999; + font-weight: bold; + font-style: italic; + } + .highlight .c, .highlight .ch, .highlight .cd, .highlight .cpf { + color: #999988; + font-style: italic; + } + .highlight .err { + color: #a61717; + background-color: #e3d2d2; + } + .highlight .gd { + color: #000000; + background-color: #ffdddd; + } + .highlight .ge { + color: #000000; + font-style: italic; + } + .highlight .gr { + color: #aa0000; + } + .highlight .gh { + color: #999999; + } + .highlight .gi { + color: #000000; + background-color: #ddffdd; + } + .highlight .go { + color: #888888; + } + .highlight .gp { + color: #555555; + } + .highlight .gs { + font-weight: bold; + } + .highlight .gu { + color: #aaaaaa; + } + .highlight .gt { + color: #aa0000; + } + .highlight .kc { + color: #000000; + font-weight: bold; + } + .highlight .kd { + color: #000000; + font-weight: bold; + } + .highlight .kn { + color: #000000; + font-weight: bold; + } + .highlight .kp { + color: #000000; + font-weight: bold; + } + .highlight .kr { + color: #000000; + font-weight: bold; + } + .highlight .kt { + color: #445588; + font-weight: bold; + } + .highlight .k, .highlight .kv { + color: #000000; + font-weight: bold; + } + .highlight .mf { 
+ color: #009999; + } + .highlight .mh { + color: #009999; + } + .highlight .il { + color: #009999; + } + .highlight .mi { + color: #009999; + } + .highlight .mo { + color: #009999; + } + .highlight .m, .highlight .mb, .highlight .mx { + color: #009999; + } + .highlight .sa { + color: #000000; + font-weight: bold; + } + .highlight .sb { + color: #d14; + } + .highlight .sc { + color: #d14; + } + .highlight .sd { + color: #d14; + } + .highlight .s2 { + color: #d14; + } + .highlight .se { + color: #d14; + } + .highlight .sh { + color: #d14; + } + .highlight .si { + color: #d14; + } + .highlight .sx { + color: #d14; + } + .highlight .sr { + color: #009926; + } + .highlight .s1 { + color: #d14; + } + .highlight .ss { + color: #990073; + } + .highlight .s, .highlight .dl { + color: #d14; + } + .highlight .na { + color: #008080; + } + .highlight .bp { + color: #999999; + } + .highlight .nb { + color: #0086B3; + } + .highlight .nc { + color: #445588; + font-weight: bold; + } + .highlight .no { + color: #008080; + } + .highlight .nd { + color: #3c5d5d; + font-weight: bold; + } + .highlight .ni { + color: #800080; + } + .highlight .ne { + color: #990000; + font-weight: bold; + } + .highlight .nf, .highlight .fm { + color: #990000; + font-weight: bold; + } + .highlight .nl { + color: #990000; + font-weight: bold; + } + .highlight .nn { + color: #555555; + } + .highlight .nt { + color: #000080; + } + .highlight .vc { + color: #008080; + } + .highlight .vg { + color: #008080; + } + .highlight .vi { + color: #008080; + } + .highlight .nv, .highlight .vm { + color: #008080; + } + .highlight .ow { + color: #000000; + font-weight: bold; + } + .highlight .o { + color: #000000; + font-weight: bold; + } + .highlight .w { + color: #bbbbbb; + } +} + +@media (prefers-color-scheme: dark) { + /* rouge "base16.dark" code highlight */ + /* generated with: rougify style base16.dark | sed '/background-color: #151515/d' */ + .highlight, .highlight .w { + color: #d0d0d0; + } + .highlight .err { 
+ color: #151515; + background-color: #ac4142; + } + .highlight .c, .highlight .ch, .highlight .cd, .highlight .cm, .highlight .cpf, .highlight .c1, .highlight .cs { + color: #505050; + } + .highlight .cp { + color: #f4bf75; + } + .highlight .nt { + color: #f4bf75; + } + .highlight .o, .highlight .ow { + color: #d0d0d0; + } + .highlight .p, .highlight .pi { + color: #d0d0d0; + } + .highlight .gi { + color: #90a959; + } + .highlight .gd { + color: #ac4142; + } + .highlight .gh { + color: #6a9fb5; + font-weight: bold; + } + .highlight .k, .highlight .kn, .highlight .kp, .highlight .kr, .highlight .kv { + color: #aa759f; + } + .highlight .kc { + color: #d28445; + } + .highlight .kt { + color: #d28445; + } + .highlight .kd { + color: #d28445; + } + .highlight .s, .highlight .sb, .highlight .sc, .highlight .dl, .highlight .sd, .highlight .s2, .highlight .sh, .highlight .sx, .highlight .s1 { + color: #90a959; + } + .highlight .sa { + color: #aa759f; + } + .highlight .sr { + color: #75b5aa; + } + .highlight .si { + color: #8f5536; + } + .highlight .se { + color: #8f5536; + } + .highlight .nn { + color: #f4bf75; + } + .highlight .nc { + color: #f4bf75; + } + .highlight .no { + color: #f4bf75; + } + .highlight .na { + color: #6a9fb5; + } + .highlight .m, .highlight .mb, .highlight .mf, .highlight .mh, .highlight .mi, .highlight .il, .highlight .mo, .highlight .mx { + color: #90a959; + } + .highlight .ss { + color: #90a959; + } +} + +/* Code Blocks */ +.highlighter-rouge { + padding: 2px 1rem; + border-radius: 5px; + color: var(--sd-foreground-color); + background-color: var(--sd-highlight-bg); + + overflow: auto; +} +.highlighter-rouge .highlight .err { + background: transparent !important; + color: inherit !important; +} + +/* Inline Code */ +code.highlighter-rouge { + padding: 2px 6px; + background-color: var(--sd-highlight-inline-bg); +} + +a code.highlighter-rouge { + color: inherit; +} diff --git a/docs/sysvinit/README.in b/docs/sysvinit/README.in new file mode 100644 
index 0000000..89effc8 --- /dev/null +++ b/docs/sysvinit/README.in @@ -0,0 +1,27 @@ +You are looking for the traditional init scripts in {{ SYSTEM_SYSVINIT_PATH }}, +and they are gone? + +Here's an explanation on what's going on: + +You are running a systemd-based OS where traditional init scripts have +been replaced by native systemd services files. Service files provide +very similar functionality to init scripts. To make use of service +files simply invoke "systemctl", which will output a list of all +currently running services (and other units). Use "systemctl +list-unit-files" to get a listing of all known unit files, including +stopped, disabled and masked ones. Use "systemctl start +foobar.service" and "systemctl stop foobar.service" to start or stop a +service, respectively. For further details, please refer to +systemctl(1). + +Note that traditional init scripts continue to function on a systemd +system. An init script {{ SYSTEM_SYSVINIT_PATH }}/foobar is implicitly mapped +into a service unit foobar.service during system initialization. + +Thank you! + +Further reading: + man:systemctl(1) + man:systemd(1) + https://0pointer.de/blog/projects/systemd-for-admins-3.html + https://www.freedesktop.org/wiki/Software/systemd/Incompatibilities diff --git a/docs/sysvinit/meson.build b/docs/sysvinit/meson.build new file mode 100644 index 0000000..64476a5 --- /dev/null +++ b/docs/sysvinit/meson.build @@ -0,0 +1,9 @@ +# SPDX-License-Identifier: LGPL-2.1-or-later + +custom_target( + 'README', + input : 'README.in', + output : 'README', + command : [jinja2_cmdline, '@INPUT@', '@OUTPUT@'], + install : conf.get('HAVE_SYSV_COMPAT') == 1, + install_dir : sysvinit_path) diff --git a/docs/var-log/README.logs b/docs/var-log/README.logs new file mode 100644 index 0000000..3a39ce1 --- /dev/null +++ b/docs/var-log/README.logs @@ -0,0 +1,25 @@ +You are looking for the traditional text log files in /var/log, and they are +gone? 
+ +Here's an explanation on what's going on: + +You are running a systemd-based OS where traditional syslog has been replaced +with the Journal. The journal stores the same (and more) information as classic +syslog. To make use of the journal and access the collected log data simply +invoke "journalctl", which will output the logs in the identical text-based +format the syslog files in /var/log used to be. For further details, please +refer to journalctl(1). + +Alternatively, consider installing one of the traditional syslog +implementations available for your distribution, which will generate the +classic log files for you. Syslog implementations such as syslog-ng or rsyslog +may be installed side-by-side with the journal and will continue to function +the way they always did. + +Thank you! + +Further reading: + man:journalctl(1) + man:systemd-journald.service(8) + man:journald.conf(5) + https://0pointer.de/blog/projects/the-journal.html diff --git a/docs/var-log/meson.build b/docs/var-log/meson.build new file mode 100644 index 0000000..35f756c --- /dev/null +++ b/docs/var-log/meson.build @@ -0,0 +1,6 @@ +# SPDX-License-Identifier: LGPL-2.1-or-later + +if conf.get('HAVE_SYSV_COMPAT') == 1 and get_option('create-log-dirs') + install_data('README.logs', + install_dir : docdir) +endif |