.. ischooklib:: libdhcp_ha.so

.. _hooks-high-availability:

``libdhcp_ha.so``: High Availability Outage Resilience for Kea Servers
======================================================================

This hook library can be loaded on a pair of DHCPv4 or DHCPv6 servers, to
increase the reliability of the DHCP service in the event of an outage on one
server.

.. note::

    :ischooklib:`libdhcp_ha.so` is part of the open source code and is
    available to every Kea user. It was previously available only to ISC
    customers with a paid support contract.

.. note::

    This library can only be loaded by the :iscman:`kea-dhcp4` or
    :iscman:`kea-dhcp6` process.

High Availability (HA) of the DHCP service is provided by running multiple
cooperating server instances. If any of these instances becomes unavailable
for any reason (DHCP software crash, Control Agent software crash, power
outage, hardware failure), a surviving server instance can continue providing
reliable service to clients. Many DHCP server implementations include the
"DHCP Failover" protocol, whose most significant features are communication
between the servers, partner failure detection, and lease synchronization
between the servers. However, the DHCPv4 failover standardization process was
never completed by the IETF. The DHCPv6 failover standard (RFC 8156) was
published, but it is complex and difficult to use, has significant operational
constraints, and is different from its v4 counterpart. Although it may be
useful to use a "standard" failover protocol, most Kea users are simply
interested in a working solution which guarantees high availability of the
DHCP service. Therefore, the Kea HA hook library derives major concepts from
the DHCP failover protocol but uses its own solutions for communication and
configuration. It offers its own state machine, which greatly simplifies its
implementation and generally fits better into Kea, and it provides the same
features in both DHCPv4 and DHCPv6.
This document intentionally uses the term "high availability" rather than
"failover" to emphasize that it is not the failover protocol implementation.

The following sections describe the configuration and operation of the Kea HA
hook library.

.. _ha-supported-configurations:

Supported Configurations
~~~~~~~~~~~~~~~~~~~~~~~~

The Kea HA hook library supports three configurations, also known as HA modes:
``load-balancing``, ``hot-standby``, and ``passive-backup``. In the
``load-balancing`` mode, two servers respond to DHCP requests. The
``load-balancing`` function is implemented as described in RFC 3074, with each
server responding to half the received DHCP queries. When one of the servers
allocates a lease for a client, it notifies the partner server over the
control channel (via the RESTful API), so the partner can save the lease
information in its own database. If the communication with the partner is
unsuccessful, the DHCP query is dropped and the response is not returned to
the DHCP client. If the lease update is successful, the response is returned
to the DHCP client by the server which has allocated the lease. By exchanging
lease updates, both servers get a copy of all leases allocated by the entire
HA setup, and either server can be switched to handle the entire DHCP traffic
if its partner becomes unavailable.

In the ``load-balancing`` configuration, one of the servers must be designated
as ``primary`` and the other as ``secondary``. Functionally, there is no
difference between the two during normal operation. However, this distinction
is required when the two servers are started at (nearly) the same time and
have to synchronize their lease databases. The primary server synchronizes the
database first. The secondary server waits for the primary server to complete
the lease database synchronization before it starts the synchronization.

In the ``hot-standby`` configuration, one of the servers is designated as
``primary`` and the other as ``standby``.
During normal operation, the primary server is the only one that responds to
DHCP requests. The standby server receives lease updates from the primary over
the control channel; however, it does not respond to any DHCP queries as long
as the primary is running or, more accurately, until the standby considers the
primary to be offline. If the standby server detects the failure of the
primary, it starts responding to all DHCP queries.

.. note::

    Operators often wonder whether to use ``load-balancing`` or
    ``hot-standby`` mode. The ``load-balancing`` mode has the benefit of
    splitting the DHCP load between two instances, reducing the traffic
    processed by each of them. However, it is not always clear to operators
    that using the ``load-balancing`` mode requires manually splitting the
    address pools between two Kea instances using client classification, to
    preclude both servers from allocating the same address to different
    clients. Such a split is not needed in the ``hot-standby`` mode. Thus, the
    benefit of using ``hot-standby`` over ``load-balancing`` is that the
    former has a simpler configuration. Conversely, ``load-balancing`` has
    higher performance potential at the cost of more complex configuration.
    See :ref:`ha-load-balancing-config` for details on how to split the pools
    using client classification.

In the configurations described above, both the primary and secondary/standby
are referred to as ``active`` servers, because they receive lease updates and
can automatically react to the partner's failures by responding to the DHCP
queries which would normally be handled by the partner. The HA hook library
supports another server type/role: ``backup``. The use of a backup server is
optional, and can be implemented in both ``load-balancing`` and
``hot-standby`` setups, in addition to the active servers.
There is no limit on the number of backup servers in the HA setup; however,
the presence of backup servers may increase the latency of DHCP responses,
because not only do active servers send lease updates to each other, but also
to the backup servers. The active servers do not expect acknowledgments from
the backup servers before responding to the DHCP clients, so the overhead of
sending lease updates to the backup servers is minimized.

In the last supported configuration, ``passive-backup``, there is only one
active server and typically one or more backup servers. A ``passive-backup``
configuration with no backup servers is also accepted, but it is no different
than running a single server with no HA function at all.

The ``passive-backup`` configuration is used in situations when an
administrator wants to take advantage of the backup server(s) as an additional
storage for leases without running the full-blown failover setup. In this
case, if the primary server fails, the DHCP service is lost; it requires the
administrator to manually restart the primary to resume DHCP service. The
administrator may also configure one of the backup servers to provide DHCP
service to the clients, as these servers should have accurate or nearly
accurate information about the allocated leases. The major advantage of the
``passive-backup`` mode is that it provides some redundancy of the lease
information but with better performance of the primary server responding to
the DHCP queries. The primary server does not have to wait for acknowledgments
to the lease updates from the backup servers before it sends a response to the
DHCP client. This reduces the response time compared to the ``load-balancing``
and ``hot-standby`` cases, in which the server responding to the DHCP query
has to wait for the acknowledgment from the other active server before it can
respond to the client.

.. note::

    An interesting use case for a single active server running in the
    ``passive-backup`` mode is a notification service, in which software
    pretending to be a backup server receives live notifications about
    allocated and deleted leases from the primary server and can display them
    on a monitoring screen, trigger alerts, etc.

Clocks on Active Servers
~~~~~~~~~~~~~~~~~~~~~~~~

Synchronized clocks are essential for the HA setup to operate reliably. The
servers share lease information - via lease updates and during synchronization
of the databases - including the time when the lease was allocated and when it
expires. Some clock skew between the servers participating in the HA setup
usually exists; this is acceptable as long as the clock skew is relatively
low, compared to the lease lifetimes. However, if the clock skew becomes too
high, the different lease expiration times on different servers may cause the
HA system to malfunction. For example, one server may consider a lease to be
expired when it is actually still valid. The lease reclamation process may
remove a name associated with this lease from the DNS, causing problems when
the client later attempts to renew the lease.

Each active server monitors the clock skew by comparing its current time with
the time returned by its partner in response to the :isccmd:`ha-heartbeat`
command. This gives a good approximation of the clock skew, although it does
not take into account the time between the partner sending the response and
the receipt of this response by the server which sent the
:isccmd:`ha-heartbeat` command. If the clock skew exceeds 30 seconds, a
warning log message is issued. The administrator may correct this problem by
synchronizing the clocks (e.g. using NTP); the servers should notice the clock
skew correction and stop issuing the warning.

If the clock skew is not corrected and exceeds 60 seconds, the HA service on
each of the servers is terminated, i.e. the state machine enters the
``terminated`` state.
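
The two thresholds can be summarized in a short Python sketch. It only
illustrates the decision rule described above and is not the hook library's
implementation: the function name ``classify_clock_skew`` and the constant
names are hypothetical, and the skew is assumed to be the absolute difference
between the local time and the time reported in the partner's heartbeat
response.

```python
from datetime import datetime, timedelta

# Thresholds taken from the text above; the constant names are illustrative.
WARN_SKEW = timedelta(seconds=30)
TERMINATE_SKEW = timedelta(seconds=60)


def classify_clock_skew(local_time: datetime, partner_time: datetime) -> str:
    """Return "ok", "warn", or "terminate" for the observed clock skew."""
    # The sign of the difference does not matter, only its magnitude.
    skew = abs(local_time - partner_time)
    if skew > TERMINATE_SKEW:
        return "terminate"
    if skew > WARN_SKEW:
        return "warn"
    return "ok"


now = datetime(2024, 1, 1, 12, 0, 0)
print(classify_clock_skew(now, now + timedelta(seconds=5)))   # ok
print(classify_clock_skew(now, now - timedelta(seconds=45)))  # warn
print(classify_clock_skew(now, now + timedelta(seconds=90)))  # terminate
```

Taking the absolute value treats a partner clock that runs ahead the same way
as one that runs behind.
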
Servers in the ``terminated`` state continue to respond to DHCP clients (as in
the ``load-balancing`` or ``hot-standby`` mode), but exchange neither lease
updates nor heartbeats, so their lease databases diverge. In this case, the
administrator should synchronize the clocks and restart the servers.

.. note::

    It is possible to restart the servers one at a time, in no particular
    order. The clocks must be in sync before restarting the servers.

.. note::

    The clock skew is only assessed between two active servers, and only the
    active servers enter the ``terminated`` state if the skew is too high. The
    clock skew between active and backup servers is not assessed, because
    active servers do not exchange heartbeat messages with backup servers.

.. _ha-https-support:

HTTPS Support
~~~~~~~~~~~~~

Since Kea 1.9.7, the High Availability hook library supports HTTPS via TLS, as
described in :ref:`tls`. The HTTPS configuration parameters are:

-  ``trust-anchor`` - specifies the name of a file or directory where the
   certification authority certificate of a Control Agent can be found.

-  ``cert-file`` - specifies the name of the file containing the end-entity
   certificate to use.

-  ``key-file`` - specifies the private key of the end-entity certificate to
   use.

These parameters can be configured at the global and peer levels. When
configured at both levels the peer value is used, allowing common values to be
shared.

The three parameters must be either all not specified (HTTPS disabled) or all
specified (HTTPS enabled). Specification of the empty string is considered not
specified; this can be used, for instance, to disable HTTPS for a particular
peer when it is enabled at the global level.

As the High Availability hook library is an HTTPS client, there is no
``cert-required`` parameter in this hook configuration. This parameter can be
set in the Control Agent to require and verify a client certificate in
client-server communication.
It does not affect communication between HA peers at the client side; see
below for information on the server side.

Before Kea 2.1.7, using HTTPS in the HA setup required use of the Control
Agent on all peers. (See :ref:`tls` for Control Agent TLS configuration.)

Since Kea 2.1.7, the HTTPS server side is supported:

-  the peer entry for the server name is used for the TLS setting.

-  the new ``require-client-certs`` parameter specifies whether client
   certificates are required and verified, i.e. like ``cert-required``. It
   defaults to ``true`` and is an HA config (vs. peer config) parameter.

Kea 2.1.7 added a new security feature with the ``restrict-commands`` HA
config parameter: when set to ``true``, commands which are not used by the
hook are rejected. The default is ``false``.

The following is an example of an HA server pair and Control Agent
configuration for ``hot-standby`` with TLS.

Server 1:

.. code-block:: json

    {
        "Dhcp4": {
            "hooks-libraries": [{
                "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            }, {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [{
                        "this-server-name": "server1",
                        "trust-anchor": "/usr/lib/kea/CA.pem",
                        "cert-file": "/usr/lib/kea/server1_cert.pem",
                        "key-file": "/usr/lib/kea/server1_key.pem",
                        "mode": "hot-standby",
                        "heartbeat-delay": 10000,
                        "max-response-delay": 60000,
                        "max-ack-delay": 5000,
                        "max-unacked-clients": 5,
                        "peers": [{
                            "name": "server1",
                            "url": "http://192.168.56.33:8000/",
                            "role": "primary",
                            "auto-failover": true
                        }, {
                            "name": "server2",
                            "url": "http://192.168.56.66:8000/",
                            "role": "standby",
                            "auto-failover": true
                        }]
                    }]
                }
            }],
            "subnet4": [{
                "id": 1,
                "subnet": "192.0.3.0/24",
                "pools": [{
                    "pool": "192.0.3.100 - 192.0.3.250"
                }]
            }]
        }
    }

Server 2:

.. code-block:: json

    {
        "Dhcp4": {
            "hooks-libraries": [{
                "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                "parameters": { }
            }, {
                "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                "parameters": {
                    "high-availability": [{
                        "this-server-name": "server2",
                        "trust-anchor": "/usr/lib/kea/CA.pem",
                        "cert-file": "/usr/lib/kea/server2_cert.pem",
                        "key-file": "/usr/lib/kea/server2_key.pem",
                        "mode": "hot-standby",
                        "heartbeat-delay": 10000,
                        "max-response-delay": 60000,
                        "max-ack-delay": 5000,
                        "max-unacked-clients": 5,
                        "peers": [{
                            "name": "server1",
                            "url": "http://192.168.56.33:8000/",
                            "role": "primary",
                            "auto-failover": true
                        }, {
                            "name": "server2",
                            "url": "http://192.168.56.66:8000/",
                            "role": "standby",
                            "auto-failover": true
                        }]
                    }]
                }
            }],
            "subnet4": [{
                "id": 1,
                "subnet": "192.0.3.0/24",
                "pools": [{
                    "pool": "192.0.3.100 - 192.0.3.250"
                }]
            }]
        }
    }

Control Agent on Server 1:

::

    {
        "Control-agent": {
            "http-host": "192.168.56.33",
            "http-port": 8000,
            "control-sockets": {
                "dhcp4": {
                    "socket-type": "unix",
                    "socket-name": "/var/run/kea/control_socket"
                }
            },
            "trust-anchor": "/var/lib/kea/CA.pem",
            "cert-file": "/var/lib/kea/server1_cert.pem",
            "key-file": "/var/lib/kea/server1_key.pem",
            "cert-required": true
        }
    }

Control Agent on Server 2:

::

    {
        "Control-agent": {
            "http-host": "192.168.56.66",
            "http-port": 8000,
            "control-sockets": {
                "dhcp4": {
                    "socket-type": "unix",
                    "socket-name": "/var/run/kea/control_socket"
                }
            },
            "trust-anchor": "/var/lib/kea/CA.pem",
            "cert-file": "/var/lib/kea/server2_cert.pem",
            "key-file": "/var/lib/kea/server2_key.pem",
            "cert-required": true
        }
    }

.. _ha-server-states:

Server States
~~~~~~~~~~~~~

A DHCP server operating within an HA setup runs a state machine, and the state
of the server can be retrieved by its peers using the :isccmd:`ha-heartbeat`
command sent over the RESTful API.
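
As an illustration, the heartbeat exchange can be sketched in Python. The
command envelope (``command`` plus a ``service`` list) follows the general Kea
control-channel format. The helper names and the sample response are
assumptions made for this example only, and the exact set of fields returned
by a given Kea version may differ.

```python
import json


def build_heartbeat_command(service: str = "dhcp4") -> str:
    """Serialize an ha-heartbeat command addressed to one DHCP service."""
    return json.dumps({"command": "ha-heartbeat", "service": [service]})


def partner_state(response_text: str):
    """Return the partner's HA state, or None if the command failed."""
    response = json.loads(response_text)
    # Assumption: an answer may arrive wrapped in a list, one entry per service.
    if isinstance(response, list):
        response = response[0]
    if response.get("result") != 0:
        return None
    return response.get("arguments", {}).get("state")


# Hypothetical response, shaped like a successful heartbeat answer.
sample = json.dumps([{
    "result": 0,
    "text": "sample status text",
    "arguments": {
        "state": "hot-standby",
        "date-time": "Mon, 01 Jan 2024 12:00:00 GMT",
    },
}])

print(build_heartbeat_command())
print(partner_state(sample))
```

The sketch performs no network I/O. In a real deployment the serialized
command would be posted to the partner's ``url`` from the ``peers`` list.
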
If the partner server does not respond to the :isccmd:`ha-heartbeat` command
within the specified amount of time, the communication is considered
interrupted and the server may, depending on the configuration, use additional
measures (described later in this document) to verify that the partner is
still operating. If it finds that the partner is not operating, the server
transitions to the ``partner-down`` state to handle all the DHCP traffic
directed to the system.

In this case, the surviving server continues to send the
:isccmd:`ha-heartbeat` command to detect when the partner wakes up. At that
time, the partner synchronizes the lease database. When the partner is again
ready to operate, the surviving server returns to normal operation, i.e. the
``load-balancing`` or ``hot-standby`` state.

The following is the list of all possible server states:

-  ``backup`` - normal operation of the backup server. In this state it
   receives lease updates from the active server(s).

-  ``communication-recovery`` - an active server running in ``load-balancing``
   mode may transition to this state when it experiences communication issues
   with a partner server over the control channel. This is an intermediate
   state between the ``load-balancing`` and ``partner-down`` states. In this
   state the server continues to respond to DHCP queries but does not send
   lease updates to the partner; lease updates are queued and are sent when
   normal communication is resumed. If communication does not resume within
   the time specified, the primary server then transitions to the
   ``partner-down`` state. The ``communication-recovery`` state was introduced
   to ensure reliable DHCP service when both active servers remain operational
   but the communication between them is interrupted for a prolonged period of
   time. Either server can be configured to never enter this state by setting
   the ``delayed-updates-limit`` to 0 (please refer to
   :ref:`ha-load-balancing-config`, later in this chapter, for details on this
   parameter). Disabling entry into the ``communication-recovery`` state
   causes the server to begin testing for the ``partner-down`` state as soon
   as the server is unable to communicate with its partner.

   .. note::

      In Kea 1.9.4, with the introduction of ``delayed-updates-limit``, the
      server's default behavior in ``load-balancing`` mode changed. When a
      server experiences communication issues with its partner, it now enters
      the ``communication-recovery`` state and queues lease updates until
      communication is resumed. Prior to Kea 1.9.4, a server that could not
      communicate with its partner in ``load-balancing`` mode would
      immediately begin the transition to the ``partner-down`` state.

-  ``hot-standby`` - normal operation of the active server running in the
   ``hot-standby`` mode; both the primary and the standby server are in this
   state during their normal operation. The primary server responds to DHCP
   queries and sends lease updates to the standby server and to any backup
   servers that are present.

-  ``load-balancing`` - normal operation of the active server running in the
   ``load-balancing`` mode; both the primary and the secondary server are in
   this state during their normal operation. Both servers respond to DHCP
   queries and send lease updates to each other and to any backup servers that
   are present.

-  ``in-maintenance`` - an active server transitions to this state as a result
   of being notified by its partner that the administrator requested
   maintenance of the HA setup. The administrator requests the maintenance by
   sending the :isccmd:`ha-maintenance-start` command to the server which is
   supposed to take over the responsibility for responding to the DHCP clients
   while the other server is taken offline for maintenance. If the server is
   in the ``in-maintenance`` state it can be safely shut down. The partner
   transitions to the ``partner-down`` state immediately after discovering
   that the server in maintenance has been shut down.

-  ``partner-down`` - an active server transitions to this state after
   detecting that its partner (another active server) is offline. The server
   does not transition to this state if only a backup server is unavailable.
   In the ``partner-down`` state the active server responds to all DHCP
   queries, including those queries which are normally handled by the server
   that is now unavailable.

-  ``partner-in-maintenance`` - an active server transitions to this state
   after receiving a :isccmd:`ha-maintenance-start` command from the
   administrator. The server in this state becomes responsible for responding
   to all DHCP requests. The server sends a :isccmd:`ha-maintenance-notify`
   command to the partner, which should enter the ``in-maintenance`` state.
   The server remaining in the ``partner-in-maintenance`` state keeps sending
   lease updates to the partner until it finds that the partner has stopped
   responding to those lease updates, heartbeats, or any other commands. In
   this case, the server in the ``partner-in-maintenance`` state transitions
   to the ``partner-down`` state and keeps responding to the queries, but no
   longer sends lease updates.

-  ``passive-backup`` - a primary server running in the ``passive-backup`` HA
   mode transitions to this state immediately after it boots up. The primary
   server in this state responds to all DHCP traffic and sends lease updates
   to the backup servers it is connected to. By default, the primary server
   does not wait for acknowledgments from the backup servers and responds to a
   DHCP query right after sending lease updates to all backup servers. If any
   of the lease updates fail, a backup server misses the lease update but the
   DHCP client is still provisioned. This default configuration can be changed
   by setting the ``wait-backup-ack`` configuration parameter to ``true``, in
   which case the primary server always waits for the acknowledgments and
   drops the DHCP query if sending any of the corresponding lease updates
   fails. This improves lease database consistency between the primary and the
   backup servers. However, if a communication failure between the active
   server and any of the backups occurs, it effectively causes the failure of
   the DHCP service from the DHCP clients' perspective.

-  ``ready`` - an active server transitions to this state after synchronizing
   its lease database with an active partner. This state indicates to the
   partner (which may be in the ``partner-down`` state) that it should return
   to normal operation. If and when it does, the server in the ``ready`` state
   also starts normal operation.

-  ``syncing`` - an active server transitions to this state to fetch leases
   from the active partner and update the local lease database. When in this
   state, the server issues the :isccmd:`dhcp-disable` command to disable the
   DHCP service of the partner from which the leases are fetched. The DHCP
   service is disabled for a maximum time of 60 seconds, after which it is
   automatically re-enabled, in case the syncing partner was unable to
   re-enable the service. If the synchronization completes successfully, the
   synchronizing server issues the :isccmd:`ha-sync-complete-notify` command
   to notify the partner. In most states, the partner re-enables its DHCP
   service to continue responding to the DHCP queries. In the ``partner-down``
   state, the partner first ensures that communication between the servers is
   re-established before enabling the DHCP service. The syncing operation is
   synchronous; the server waits for an answer from the partner and does
   nothing else while the lease synchronization takes place. A server that is
   configured not to synchronize the lease database with its partner, i.e.
   when the ``sync-leases`` configuration parameter is set to ``false``, will
   never transition to this state. Instead, it transitions directly from the
   ``waiting`` state to the ``ready`` state.

-  ``terminated`` - an active server transitions to this state when the High
   Availability hook library is unable to further provide reliable service and
   a manual intervention of the administrator is required to correct the
   problem. Various issues with the HA setup may cause the server to
   transition to this state. While in this state, the server continues
   responding to DHCP clients based on the HA mode selected
   (``load-balancing`` or ``hot-standby``), but lease updates are not
   exchanged and heartbeats are not sent. Once a server has entered the
   ``terminated`` state, it remains in this state until it is restarted. The
   administrator must correct the issue which caused this situation prior to
   restarting the server (e.g. synchronize the clocks); otherwise, the server
   will return to the ``terminated`` state once it finds that the issue
   persists.

-  ``waiting`` - each started server instance enters this state. A backup
   server transitions directly from this state to the ``backup`` state. An
   active server sends a heartbeat to its partner to check its state; if the
   partner appears to be unavailable, the server transitions to the
   ``partner-down`` state. If the partner is available, the server transitions
   to the ``syncing`` or ``ready`` state, depending on the setting of the
   ``sync-leases`` configuration parameter. If both servers appear to be in
   the ``waiting`` state (concurrent startup), the primary server transitions
   to the next state first. The secondary or standby server remains in the
   ``waiting`` state until the primary transitions to the ``ready`` state.

.. note::

    Currently, restarting the HA service from the ``terminated`` state
    requires restarting the DHCP server or reloading its configuration.

Whether the server responds to DHCP queries and which queries it responds to
is a matter of the server's state, if no administrative action is performed to
configure the server otherwise. The following table provides the default
behavior for various states.
The ``DHCP Service Scopes`` denote which group of received DHCP queries the
server responds to in the given state. The HA configuration must specify a
unique name for each server within the HA setup. This document uses the
following convention within the provided examples: "server1" for a primary
server, "server2" for the secondary or standby server, and "server3" for the
backup server. In real life any names can be used as long as they remain
unique.

An in-depth explanation of the scopes can be found below.

.. table:: Default behavior of the server in various HA states

   +------------------------+-----------------+-----------------+----------------+
   | State                  | Server Type     | DHCP Service    | DHCP Service   |
   |                        |                 |                 | Scopes         |
   +========================+=================+=================+================+
   | backup                 | backup server   | disabled        | none           |
   +------------------------+-----------------+-----------------+----------------+
   | communication-recovery | primary or      | enabled         | "HA_server1"   |
   |                        | secondary       |                 | or             |
   |                        | (load-balancing |                 | "HA_server2"   |
   |                        | mode only)      |                 |                |
   +------------------------+-----------------+-----------------+----------------+
   | hot-standby            | primary or      | enabled         | "HA_server1"   |
   |                        | standby         |                 | if primary,    |
   |                        | (hot-standby    |                 | none otherwise |
   |                        | mode)           |                 |                |
   +------------------------+-----------------+-----------------+----------------+
   | load-balancing         | primary or      | enabled         | "HA_server1"   |
   |                        | secondary       |                 | or             |
   |                        | (load-balancing |                 | "HA_server2"   |
   |                        | mode)           |                 |                |
   +------------------------+-----------------+-----------------+----------------+
   | in-maintenance         | active server   | disabled        | none           |
   +------------------------+-----------------+-----------------+----------------+
   | partner-down           | active server   | enabled         | all scopes     |
   +------------------------+-----------------+-----------------+----------------+
   | partner-in-maintenance | active server   | enabled         | all scopes     |
   +------------------------+-----------------+-----------------+----------------+
   | passive-backup         | active server   | enabled         | all scopes     |
   +------------------------+-----------------+-----------------+----------------+
   | ready                  | active server   | disabled        | none           |
   +------------------------+-----------------+-----------------+----------------+
   | syncing                | active server   | disabled        | none           |
   +------------------------+-----------------+-----------------+----------------+
   | terminated             | active server   | enabled         | same as in the |
   |                        |                 |                 | load-balancing |
   |                        |                 |                 | or hot-standby |
   |                        |                 |                 | state          |
   +------------------------+-----------------+-----------------+----------------+
   | waiting                | any server      | disabled        | none           |
   +------------------------+-----------------+-----------------+----------------+

In the ``load-balancing`` mode there are two scopes specified for the active
servers: "HA_server1" and "HA_server2". The DHCP queries load-balanced to
``server1`` belong to the "HA_server1" scope and the queries load-balanced to
``server2`` belong to the "HA_server2" scope. If either server is in the
``partner-down`` state, the active partner is responsible for serving both
scopes.

In the ``hot-standby`` mode, there is only one scope - "HA_server1" - because
only ``server1`` is responding to DHCP queries. If that server becomes
unavailable, ``server2`` becomes responsible for this scope.

The backup servers do not have their own scopes. In some cases they can be
used to respond to queries belonging to the scopes of the active servers.
Also, a server which is neither in the ``partner-down`` state nor in normal
operation serves no scopes.

The scope names can be used to associate pools, subnets, and networks with
certain servers, so that only these servers can allocate addresses or prefixes
from those pools, subnets, or networks. This is done via the client
classification mechanism (see :ref:`ha-load-balancing-advanced-config` for
more details).

.. _ha-scope-transition:

Scope Transition in a Partner-Down Case
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

When one of the servers finds that its partner is unavailable, it starts
serving clients from both its own scope and the scope of the unavailable
partner. This is straightforward for new clients, i.e. those sending
DHCPDISCOVER (DHCPv4) or Solicit (DHCPv6), because those requests are not sent
to any particular server. The available server responds to all such queries
when it is in the ``partner-down`` state.

When a client renews a lease, it sends its DHCPREQUEST (DHCPv4) or Renew
(DHCPv6) message directly to the server which has allocated the lease being
renewed. If this server is no longer available, the client will get no
response. In that case, the client continues to use its lease and attempts to
renew until the rebind timer (T2) elapses. The client then enters the
rebinding phase, in which it sends a DHCPREQUEST (DHCPv4) or Rebind (DHCPv6)
message to any available server. The surviving server receives the rebinding
request and typically extends the lifetime of the lease. The client then
continues to contact that new server to renew its lease as appropriate.

If and when the other server once again becomes available, both active servers
will eventually transition to the ``load-balancing`` or ``hot-standby`` state,
in which they will again be responsible for their own scopes. Some clients
belonging to the scope of the restarted server will try to renew their leases
via the surviving server, but this server will no longer respond to them; the
client will eventually transition back to the correct server via the rebinding
mechanism.

.. _ha-load-balancing-config:

Load-Balancing Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The following is the configuration snippet to enable high availability on the
primary server within the ``load-balancing`` configuration.
The same configuration should be applied on the secondary and backup servers, with the only difference that ``this-server-name`` should be set to "server2" and "server3" on those servers, respectively. .. note:: Remember that ``load-balancing`` mode requires the address pools and delegated prefix pools to be split between the active servers. During normal operation, the servers use non-overlapping pools to avoid allocating the same lease to different clients by both instances. A server only uses the pool fragments owned by the partner when the partner is not running. See the notes in :ref:`ha-supported-configurations` highlighting differences between the ``load-balancing`` and ``hot-standby`` modes. The semantics of pool partitioning is explained further in this section. The :ref:`ha-load-balancing-advanced-config` section provides advanced pool-partitioning examples. :: "Dhcp4": { "hooks-libraries": [{ "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [{ "this-server-name": "server1", "mode": "load-balancing", "heartbeat-delay": 10000, "max-response-delay": 60000, "max-ack-delay": 5000, "max-unacked-clients": 5, "max-rejected-lease-updates": 10, "delayed-updates-limit": 100, "peers": [{ "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary", "auto-failover": true }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "secondary", "auto-failover": true }, { "name": "server3", "url": "http://192.168.56.99:8000/", "role": "backup", "basic-auth-user": "foo", "basic-auth-password": "bar", "auto-failover": false }] }] } }], "subnet4": [{ "id": 1, "subnet": "192.0.3.0/24", "pools": [{ "pool": "192.0.3.100 - 192.0.3.150", "client-class": "HA_server1" }, { "pool": "192.0.3.200 - 192.0.3.250", "client-class": "HA_server2" }], "option-data": [{ "name": "routers", "data": "192.0.3.1" }], "relay": { "ip-address": "10.1.2.3" } }] } Two 
hook libraries must be loaded to enable HA: :ischooklib:`libdhcp_lease_cmds.so` and :ischooklib:`libdhcp_ha.so`. The latter implements the HA feature, while the former enables control commands required by HA to fetch and manipulate leases on the remote servers. In the example provided above, it is assumed that Kea libraries are installed in the ``/usr/lib`` directory. If Kea is not installed in the ``/usr`` directory, the hook libraries' locations must be updated accordingly.

The HA configuration is specified within the scope of :ischooklib:`libdhcp_ha.so`. Note that while the top-level parameter ``high-availability`` is a list, only a single entry is currently supported.

The following are the global parameters which control the server's behavior with respect to HA:

- ``this-server-name`` - is a unique identifier of the server within this HA setup. It must match one of the servers specified within the ``peers`` list.

- ``mode`` - specifies an HA mode of operation. The currently supported modes are ``load-balancing``, ``hot-standby``, and ``passive-backup``.

- ``heartbeat-delay`` - specifies a duration in milliseconds between sending the last heartbeat (or other command sent to the partner) and the next heartbeat. Heartbeats are sent periodically to gather the status of the partner and to verify whether the partner is still operating. The default value of this parameter is 10000 ms.

- ``max-response-delay`` - specifies a duration in milliseconds since the last successful communication with the partner, after which the server assumes that communication with the partner is interrupted. This duration should be greater than the ``heartbeat-delay``; typically it should be a multiple of ``heartbeat-delay``. When the server detects that communication is interrupted, it may transition to the ``partner-down`` state (when ``max-unacked-clients`` is 0) or trigger the failure-detection procedure using the values of the two parameters below. The default value of this parameter is 60000 ms.

- ``max-ack-delay`` - is one of the parameters controlling partner failure-detection. When communication with the partner is interrupted, the server examines the values of the "secs" field (DHCPv4) or "elapsed time" option (DHCPv6), which denote how long the DHCP client has been trying to communicate with the DHCP server. This parameter specifies the maximum time in milliseconds for the client to try to communicate with the DHCP server, after which this server assumes that the client failed to communicate with the DHCP server (is unacknowledged or "unacked"). The default value of this parameter is 10000 ms.

- ``max-unacked-clients`` - specifies how many "unacked" clients are allowed (see ``max-ack-delay``) before this server assumes that the partner is offline and transitions to the ``partner-down`` state. The special value of 0 is allowed for this parameter, which disables the failure-detection mechanism. In this case, a server that cannot communicate with its partner over the control channel assumes that the partner server is down and transitions to the ``partner-down`` state immediately. The default value of this parameter is 10.

- ``max-rejected-lease-updates`` - specifies how many lease updates for distinct clients can fail, due to a conflict between the lease and the partner configuration or state, before the server transitions to the ``terminated`` state. Conflict can be a sign of a misconfiguration; usually, a small number of conflicted leases are acceptable because they affect only a few devices. However, if the conflicts occur for many devices (e.g., an entire subnet), the HA service becomes unreliable and should be terminated, and the problem must be manually corrected by an administrator. It is up to the administrator to select the highest acceptable value of ``max-rejected-lease-updates``. The default value is 10. The special value of 0 configures the server to never terminate the HA service due to lease conflicts.
If the value is 1, the server transitions to the ``terminated`` state when the first conflict occurs. This parameter does not pertain to conflicting lease updates sent to the backup servers. - ``delayed-updates-limit`` - specifies the maximum number of lease updates which can be queued while the server is in the ``communication-recovery`` state. This parameter was introduced in Kea 1.9.4. The special value of 0 configures the server to never transition to the ``communication-recovery`` state and the server behaves as in earlier Kea versions, i.e. if the server cannot reach its partner, it goes straight into the ``partner-down`` state. The default value of this parameter is 100. .. note:: The ``max-rejected-lease-updates`` parameter was introduced in Kea 2.3.1. Previously, the server did not differentiate between a lease update failure due to a non-functioning partner and a failure due to a conflict (e.g., configuration issues). As a result, the server could sometimes transition to the ``partner-down`` state even though the partner was operating normally, but only certain leases had issues. Conflicts should no longer cause such a transition. However, depending on the ``max-rejected-lease-updates`` setting, too many conflicts can lead to termination of the High Availability service. In that case, both servers continue to respond to DHCP queries but no longer send lease updates. The values of ``max-ack-delay`` and ``max-unacked-clients`` must be selected carefully, taking into account the specifics of the network in which the DHCP servers are operating. The server in question may not respond to some DHCP clients following administrative policy, or the server may drop malformed queries from clients. Therefore, selecting too low a value for the ``max-unacked-clients`` parameter may result in a transition to the ``partner-down`` state even though the partner is still operating. 
On the other hand, selecting too high a value may result in never transitioning to the ``partner-down`` state if the DHCP traffic in the network is very low (e.g. at night), because the number of distinct clients trying to communicate with the server could be lower than the ``max-unacked-clients`` setting. In some cases it may be useful to disable the failure-detection mechanism altogether, if the servers are located very close to each other and network partitioning is unlikely, i.e. failure to respond to heartbeats is only possible when the partner is offline. In such cases, set ``max-unacked-clients`` to 0. The ``delayed-updates-limit`` parameter is used to enable or disable the ``communication-recovery`` procedure, and controls the server's behavior in the ``communication-recovery`` state. This parameter can only be used in the ``load-balancing`` mode. If a server in the ``load-balancing`` state experiences communication issues with its partner (a heartbeat or lease-update failure), the server transitions to the ``communication-recovery`` state. In this state, the server keeps responding to DHCP queries but does not send lease updates to the partner. The lease updates are queued until communication is re-established, to ensure that DHCP service remains available even in the event of the communication loss between the partners. There may appear to be communication loss when either one of the servers has terminated, or when both servers remain available but cannot communicate with each other. In the former case, the surviving server will follow the normal procedure and should eventually transition to the ``partner-down`` state. In the latter case, both servers should transition to the ``communication-recovery`` state and should never transition to the ``partner-down`` state (if ``max-unacked-clients`` is set to a non-zero value), because all DHCP queries are answered and neither server would see any unacked DHCP queries. 
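
As an illustration of the parameters discussed above, the fragment below is a minimal sketch of the ``high-availability`` entry for two servers located close to each other, where network partitioning is assumed to be unlikely. Setting ``max-unacked-clients`` to 0 disables the failure-detection mechanism, while ``delayed-updates-limit`` is left at its default of 100. The server names and URLs are reused from the earlier example; the remaining values are assumptions and should be adapted to the deployment.

::

   "high-availability": [{
       "this-server-name": "server1",
       "mode": "load-balancing",
       "heartbeat-delay": 10000,
       "max-response-delay": 60000,
       // Assumed for illustration: 0 disables failure detection, so loss of
       // communication alone is treated as a partner failure.
       "max-unacked-clients": 0,
       // Default value, shown explicitly; 0 would disable the
       // communication-recovery state.
       "delayed-updates-limit": 100,
       "peers": [{
           "name": "server1",
           "url": "http://192.168.56.33:8000/",
           "role": "primary"
       }, {
           "name": "server2",
           "url": "http://192.168.56.66:8000/",
           "role": "secondary"
       }]
   }]

In a network where the servers can lose contact with each other while both remain online, a non-zero ``max-unacked-clients`` is the safer choice, for the reasons given above.
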
Introduction of the ``communication-recovery`` procedure was motivated by issues which may appear when two servers remain online but the communication between them remains interrupted for a period of time. In earlier Kea versions, the servers having communication issues used to drop DHCP packets before transitioning to the ``partner-down`` state. In some cases they both transitioned to the ``partner-down`` state, which could potentially result in allocations of the same IP addresses or delegated prefixes to different clients by both servers. By entering the intermediate ``communication-recovery`` state, these problems are avoided. If a server in the ``communication-recovery`` state re-establishes communication with its partner, it tries to send the partner all of the outstanding lease updates it has queued. This is done synchronously and may take a considerable amount of time before the server transitions to the ``load-balancing`` state and resumes normal operation. The maximum number of lease updates which can be queued in the ``communication-recovery`` state is controlled by ``delayed-updates-limit``. If the limit is exceeded, the server stops queuing lease updates and performs a full database synchronization after re-establishing the connection with the partner, instead of sending outstanding lease updates before transitioning to the ``load-balancing`` state. Even if the limit is exceeded, the server in the ``communication-recovery`` state remains responsive to DHCP clients. It may be preferable to set higher values of ``delayed-updates-limit`` when there is a risk of prolonged communication interruption between the servers and when the lease database is large, to avoid costly lease-database synchronization. On the other hand, if the lease database is small, the time required to send outstanding lease updates may be longer than the lease-database synchronization. In such cases it may be better to use a lower value, e.g. 10. 
The default value of 100 is a reasonable compromise and should work well in most deployments with moderate traffic. .. note:: This parameter is new and values for it that work well in some environments may not work well in others. Feedback from users will help us build a better working set of recommendations. The ``peers`` parameter contains a list of servers within this HA setup. This configuration must contain at least one primary and one secondary server. It may also contain an unlimited number of backup servers. In this example, there is one backup server which receives lease updates from the active servers. Since Kea version 1.9.0, basic HTTP authentication is available to protect the Kea control agent against local attackers. These are the parameters specified for each of the peers within this list: - ``name`` - specifies a unique name for the server. - ``url`` - specifies the URL to be used to contact this server over the control channel. Other servers use this URL to send control commands to that server. - ``basic-auth-user`` - specifies the user ID for basic HTTP authentication. If not specified or specified as an empty string, no authentication header is added to HTTP transactions. It must not contain the colon (:) character. - ``basic-auth-password`` - specifies the password for basic HTTP authentication. This parameter is ignored when the user ID is not specified or is empty. The password is optional; if not specified, an empty password is used. - ``basic-auth-password-file`` - is an alternative to ``basic-auth-password``: instead of presenting the password in the configuration file it is specified in the file indicated by this parameter. - ``role`` - denotes the role of the server in the HA setup. The following roles are supported in the ``load-balancing`` configuration: ``primary``, ``secondary``, and ``backup``. There must be exactly one primary and one secondary server in the ``load-balancing`` setup. 
- ``auto-failover`` - a boolean value which denotes whether a server detecting a partner's failure should automatically start serving the partner's clients. The default value of this parameter is ``true``.

In our example configuration above, both active servers can allocate leases from the subnet "192.0.3.0/24". This subnet contains two address pools: "192.0.3.100 - 192.0.3.150" and "192.0.3.200 - 192.0.3.250", which are associated with HA server scopes using client classification. When ``server1`` processes a DHCP query, it uses the first pool for lease allocation. Conversely, when ``server2`` processes a DHCP query it uses the second pool. If either of the servers is in the ``partner-down`` state, the other can serve leases from both pools; it selects the pool which is appropriate for the received query. In other words, if the query would normally be processed by ``server2`` but this server is not available, ``server1`` allocates the lease from the pool of "192.0.3.200 - 192.0.3.250". The Kea control agent in front of ``server3`` requires basic HTTP authentication, and authorizes the user ID "foo" with the password "bar".

.. note::

   The ``url`` scheme can be ``http`` or ``https``, but since Kea version 1.9.6 the ``https`` scheme requires a TLS setup. The host part must be an IPv4 address or an IPv6 address between square brackets, e.g. ``http://[2001:db8::1]:8080/``. Hostnames are not accepted.

.. _ha-load-balancing-advanced-config:

Load Balancing With Advanced Classification
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In the previous section, we provided an example of a ``load-balancing`` configuration with client classification limited to the "HA_server1" and "HA_server2" classes, which are dynamically assigned to the received DHCP queries. In many cases, HA is needed in deployments which already use some other client classification.
Suppose there is a system which classifies devices into two groups: "phones" and "laptops", based on some classification criteria specified in the Kea configuration file. Both types of devices are allocated leases from different address pools. Introducing HA in ``load-balancing`` mode results in a further split of each of those pools, as each server allocates leases for some phones and some laptops. This requires each of the existing pools to be split between "HA_server1" and "HA_server2", so we end up with the following classes: - "phones_server1" - "laptops_server1" - "phones_server2" - "laptops_server2" The corresponding server configuration, using advanced classification (and the ``member`` expression), is provided below. For brevity's sake, the HA hook library configuration has been removed from this example. .. code-block:: json { "Dhcp4": { "client-classes": [{ "name": "phones", "test": "substring(option[60].hex,0,6) == 'Aastra'" }, { "name": "laptops", "test": "not member('phones')" }, { "name": "phones_server1", "test": "member('phones') and member('HA_server1')" }, { "name": "phones_server2", "test": "member('phones') and member('HA_server2')" }, { "name": "laptops_server1", "test": "member('laptops') and member('HA_server1')" }, { "name": "laptops_server2", "test": "member('laptops') and member('HA_server2')" }], "hooks-libraries": [{ "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [{ }] } }], "subnet4": [{ "id": 1, "subnet": "192.0.3.0/24", "pools": [{ "pool": "192.0.3.100 - 192.0.3.125", "client-class": "phones_server1" }, { "pool": "192.0.3.126 - 192.0.3.150", "client-class": "laptops_server1" }, { "pool": "192.0.3.200 - 192.0.3.225", "client-class": "phones_server2" }, { "pool": "192.0.3.226 - 192.0.3.250", "client-class": "laptops_server2" }], "option-data": [{ "name": "routers", "data": "192.0.3.1" }], "relay": { "ip-address": 
"10.1.2.3" } }] } }

The configuration provided above splits the address range into four pools: two pools dedicated to "HA_server1" and two to "HA_server2". Each server can assign leases to both phones and laptops. Both groups of devices are assigned addresses from different pools. The "HA_server1" and "HA_server2" classes are built-in (see :ref:`built-in-client-classes`) and do not need to be declared. They are assigned dynamically by the HA hook library as a result of the ``load-balancing`` algorithm. "phones_*" and "laptops_*" evaluate to ``true`` when the query belongs to a given combination of other classes, e.g. "HA_server1" and "phones". The pool is selected accordingly as a result of such an evaluation.

Consult :ref:`classify` for details on how to use the ``member`` expression and class dependencies.

.. _ha-hot-standby-config:

Hot-Standby Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~

The following is an example configuration of the primary server in a ``hot-standby`` configuration:

::

   "Dhcp4": { "hooks-libraries": [{ "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [{ "this-server-name": "server1", "mode": "hot-standby", "heartbeat-delay": 10000, "max-response-delay": 60000, "max-ack-delay": 5000, "max-unacked-clients": 5, "max-rejected-lease-updates": 10, "peers": [{ "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary", "auto-failover": true }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "standby", "auto-failover": true }, { "name": "server3", "url": "http://192.168.56.99:8000/", "basic-auth-user": "foo", "basic-auth-password": "bar", "role": "backup", "auto-failover": false }] }] } }], "subnet4": [{ "id": 1, "subnet": "192.0.3.0/24", "pools": [{ "pool": "192.0.3.100 - 192.0.3.250", "client-class": "HA_server1" }], "option-data": [{ "name": "routers", "data": "192.0.3.1" }], "relay": { "ip-address": "10.1.2.3"
} }] } This configuration is very similar to the ``load-balancing`` configuration described in :ref:`ha-load-balancing-config`, with a few notable differences. The ``mode`` is now set to ``hot-standby``, in which only one server responds to DHCP clients. If the primary server is online, it responds to all DHCP queries. The ``standby`` server takes over all DHCP traffic only if it discovers that the primary is unavailable. In this mode, the non-primary active server is called ``standby`` and that is its role. Finally, because there is always only one server responding to DHCP queries, there is only one scope - "HA_server1" - in use within pool definitions. In fact, the ``client-class`` parameter could be removed from this configuration without harm, because there can be no conflicts in lease allocations by different servers as they do not allocate leases concurrently. The ``client-class`` remains in this example mostly for demonstration purposes, to highlight the differences between the ``hot-standby`` and ``load-balancing`` modes of operation. .. _ha-passive-backup-config: Passive-Backup Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The following is an example configuration file for the primary server in a ``passive-backup`` configuration: .. 
code-block:: json { "Dhcp4": { "hooks-libraries": [{ "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [{ "this-server-name": "server1", "mode": "passive-backup", "wait-backup-ack": false, "peers": [{ "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary" }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "backup" }, { "name": "server3", "url": "http://192.168.56.99:8000/", "basic-auth-user": "foo", "basic-auth-password": "bar", "role": "backup" }] }] } }], "subnet4": [{ "id": 1, "subnet": "192.0.3.0/24", "pools": [{ "pool": "192.0.3.100 - 192.0.3.250" }], "option-data": [{ "name": "routers", "data": "192.0.3.1" }], "relay": { "ip-address": "10.1.2.3" } }] } } The configurations of three peers are included: one for the primary and two for the backup servers. Many of the parameters present in the ``load-balancing`` and ``hot-standby`` configuration examples are not relevant in the ``passive-backup`` mode, thus they are not specified here. For example: ``heartbeat-delay``, ``max-unacked-clients``, ``max-rejected-lease-updates``, and others related to the failover mechanism should not be specified in the ``passive-backup`` mode. The ``wait-backup-ack`` is a boolean parameter not present in previous examples. It defaults to ``false`` and must not be modified in the ``load-balancing`` and ``hot-standby`` modes. In the ``passive-backup`` mode this parameter can be set to ``true``, which causes the primary server to expect acknowledgments to the lease updates from the backup servers prior to responding to the DHCP client. It ensures that the lease has propagated to all servers before the client is given the lease, but it poses a risk of losing a DHCP service if there is a communication problem with one of the backup servers. 
This setting also increases the latency of the DHCP response, because of the time that the primary spends waiting for the acknowledgements. We recommend that the ``wait-backup-ack`` setting be left at its default value (``false``) if the DHCP service reliability is more important than consistency of the lease information between the primary and the backups, and in all cases when the DHCP service latency should be minimal. .. note:: Currently, active servers place lease updates to be sent to peers onto internal queues (one queue per peer/URL). In ``passive-backup`` mode, active servers do not wait for lease updates to be acknowledged; thus during times of heavy client traffic it is possible for the number of lease updates queued for transmission to accumulate faster than they can be delivered. As client traffic lessens the queues begin to empty. Since Kea 2.0.0, active servers monitor the size of these queues and emit periodic warnings (see HTTP_CLIENT_QUEUE_SIZE_GROWING in :ref:`kea-messages`) if they perceive a queue as growing too quickly. The warnings cease once the queue size begins to shrink. These messages are intended as a bellwether and seeing them sporadically during times of heavy traffic load does not necessarily indicate a problem. If, however, they occur continually during times of routine traffic load, they likely indicate potential mismatches in server capabilities and/or configuration; this should be investigated, as the size of the queues may eventually impair an active server's ability to respond to clients in a timely manner. .. _ha-sharing-lease-info: Lease Information Sharing ~~~~~~~~~~~~~~~~~~~~~~~~~ An HA-enabled server informs its active partner about allocated or renewed leases by sending appropriate control commands, and the partner updates the lease information in its own database. When the server starts up for the first time or recovers after a failure, it synchronizes its lease database with its partner. 
These two mechanisms guarantee consistency of the lease information between the servers and allow the designation of one of the servers to handle the entire DHCP traffic load if the other server becomes unavailable. In some cases, though, it is desirable to disable lease updates and/or database synchronization between the active servers, if the exchange of information about the allocated leases is performed using some other mechanism. Kea supports various database types that can be used to store leases, including MySQL and PostgreSQL. Those databases include built-in solutions for data replication which are often used by Kea administrators to provide redundancy. The HA hook library supports such scenarios by disabling lease updates over the control channel and/or lease-database synchronization, leaving the server to rely on the database replication mechanism. This is controlled by the two boolean parameters ``send-lease-updates`` and ``sync-leases``, whose values default to ``true``: :: "Dhcp4": { "hooks-libraries": [ { "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [ { "this-server-name": "server1", "mode": "load-balancing", "send-lease-updates": false, "sync-leases": false, "peers": [ { "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary" }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "secondary" } ] } ] } } ], ... } In the most typical use case, both parameters are set to the same value, i.e. both are ``false`` if database replication is in use, or both are ``true`` otherwise. 
Introducing two separate parameters to control lease updates and lease-database synchronization is aimed at possible special use cases; for example, when synchronization is performed by copying a lease file (therefore ``sync-leases`` is set to ``false``), but lease updates should be conducted as usual (``send-lease-updates`` is set to ``true``). It should be noted that Kea does not natively support such use cases, but users may develop their own scripts and tools around Kea to provide such mechanisms. The HA hook library configuration is designed to maximize flexibility of administration.

.. _ha-syncing-page-limit:

Controlling Lease-Page Size Limit
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

An HA-enabled server initiates synchronization of the lease database after downtime or upon receiving the :isccmd:`ha-sync` command. The server uses the :isccmd:`lease4-get-page` and :isccmd:`lease6-get-page` commands to fetch leases from its partner server (lease queries). The size of the results page (the maximum number of leases to be returned in a single response to one of these commands) can be controlled via configuration of the HA hook library. Increasing the page size decreases the number of lease queries sent to the partner server, but it causes the partner server to generate larger responses, which lengthens transmission time as well as increases memory and CPU utilization on both servers. Decreasing the page size helps to decrease resource utilization, but requires more lease queries to be issued to fetch the entire lease database.

The default value of the ``sync-page-limit`` parameter controlling the page size is 10000. This means that the entire lease database can be fetched with a single command if the database contains 10000 leases or fewer.

.. _ha-syncing-timeouts:

Timeouts
~~~~~~~~

In deployments with a large number of clients connected to the network, lease-database synchronization after a server failure may be a time-consuming operation.
The synchronizing server must gather all leases from its partner, which yields a large response over the RESTful interface. The server receives leases using the paging mechanism described in :ref:`ha-syncing-page-limit`. Before the page of leases is fetched, the synchronizing server sends a :isccmd:`dhcp-disable` command to disable the DHCP service on the partner server. If the service is already disabled, this command resets the timeout for the DHCP service being disabled, which by default is set to 60 seconds. If fetching a single page of leases takes longer than the specified time, the partner server assumes that the synchronizing server has died and resumes its DHCP service. The connection of the synchronizing server with its partner is also protected by the timeout. If the synchronization of a single page of leases takes longer than the specified time, the synchronizing server terminates the connection and the synchronization fails. Both timeout values are controlled by a single configuration parameter, ``sync-timeout``. The following configuration snippet demonstrates how to modify the timeout for automatic re-enabling of the DHCP service on the partner server and how to increase the timeout for fetching a single page of leases from 60 seconds to 90 seconds: :: "Dhcp4": { "hooks-libraries": [ { "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [ { "this-server-name": "server1", "mode": "load-balancing", "sync-timeout": 90000, "peers": [ { "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary" }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "secondary" } ] } ] } } ], ... } It is important to note that extending this ``sync-timeout`` value may sometimes be insufficient to prevent issues with timeouts during lease-database synchronization. 
The control commands travel via the Control Agent, which also monitors incoming (with a synchronizing server) and outgoing (with a DHCP server) connections for timeouts. The DHCP server also monitors the connection from the Control Agent for timeouts. Those timeouts cannot currently be modified via configuration; extending these timeouts is only possible by modifying them in the Kea code and recompiling the server. The relevant constants are located in the Kea source at: ``src/lib/config/timeouts.h``. .. _ha-pause-state-machine: Pausing the HA State Machine ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ The ``high-availability`` state machine includes many different states described in detail in :ref:`ha-server-states`. The server enters each state when certain conditions are met, most often taking into account the partner server's state. In some states the server performs specific actions, e.g. synchronization of the lease database in the ``syncing`` state, or responding to DHCP queries according to the configured mode of operation in the ``load-balancing`` and ``hot-standby`` states. By default, transitions between the states are performed automatically and the server administrator has no direct control over when the transitions take place; in most cases, the administrator does not need such control. In some situations, however, the administrator may want to "pause" the HA state machine in a selected state to perform some additional administrative actions before the server transitions to the next state. Consider a server failure which results in the loss of the entire lease database. Typically, the server rebuilds its lease database when it enters the ``syncing`` state by querying the partner server for leases, but it is possible that the partner was also experiencing a failure and lacks lease information. In this case, it may be required to reconstruct lease databases on both servers from some external source, e.g. a backup server. 
If the lease database is to be reconstructed via the RESTful API, the servers should be started in the initial, i.e. ``waiting``, state and remain in this state while leases are being added. In particular, the servers should not attempt to synchronize their lease databases nor start serving DHCP clients. The HA hook library provides configuration parameters and a command to control pausing and resuming the HA state machine. The following configuration causes the HA state machine to pause in the ``waiting`` state after server startup. :: "Dhcp4": { "hooks-libraries": [ { "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [ { "this-server-name": "server1", "mode": "load-balancing", "peers": [ { "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary" }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "secondary" } ], "state-machine": { "states": [ { "state": "waiting", "pause": "once" } ] } } ] } } ], ... } The ``pause`` parameter value ``once`` denotes that the state machine should be paused upon the first transition to the ``waiting`` state; later transitions to this state will not cause the state machine to pause. Two other supported values of the ``pause`` parameter are ``always`` and ``never``. The latter is the default value for each state, which instructs the server never to pause the state machine. In order to "unpause" the state machine, the :isccmd:`ha-continue` command must be sent to the paused server. This command does not take any arguments. See :ref:`ha-control-commands` for details about commands specific to :ischooklib:`libdhcp_ha.so`. It is possible to configure the state machine to pause in more than one state. 
Consider the following configuration: :: "Dhcp4": { "hooks-libraries": [ { "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [ { "this-server-name": "server1", "mode": "load-balancing", "peers": [ { "name": "server1", "url": "http://192.168.56.33:8000/", "role": "primary" }, { "name": "server2", "url": "http://192.168.56.66:8000/", "role": "secondary" } ], "state-machine": { "states": [ { "state": "ready", "pause": "always" }, { "state": "partner-down", "pause": "once" } ] } } ] } } ], ... } This configuration instructs the server to pause the state machine every time it transitions to the ``ready`` state and upon the first transition to the ``partner-down`` state. Refer to :ref:`ha-server-states` for a complete list of server states. The state machine can be paused in any of the supported states; however, it is not practical to pause in the ``backup`` or ``terminated`` states because the server never transitions out of these states anyway. .. note:: In the ``syncing`` state the server is paused before it makes an attempt to synchronize the lease database with a partner. To pause the state machine after lease-database synchronization, use the ``ready`` state instead. .. note:: The state of the HA state machine depends on the state of the cooperating server. Therefore, pausing the state machine of one server may affect the operation of the partner server. For example: if the primary server is paused in the ``waiting`` state, the partner server will also remain in the ``waiting`` state until the state machine of the primary server is resumed and that server transitions to the ``ready`` state. .. _ha-ctrl-agent-config: Control Agent Configuration ~~~~~~~~~~~~~~~~~~~~~~~~~~~ The :ref:`kea-ctrl-agent` describes in detail the Kea daemon, which provides a RESTful interface to control the Kea servers. 
The same functionality is used by the High Availability hook library to
establish communication between the HA peers. Therefore, the HA library
requires that the Control Agent (CA) be started for each DHCP instance within
the HA setup. If the Control Agent is not started, the peers cannot communicate
with a particular DHCP server (even if the DHCP server itself is online) and
may eventually consider this server to be offline.

The following is an example configuration for the CA running on the same
machine as the primary server. This configuration is valid for both the
``load-balancing`` and the ``hot-standby`` cases presented in previous
sections.

::

   {
   "Control-agent": {
       "http-host": "192.168.56.33",

       // Port 8000 is used here by the CA. When the HA hook library is used
       // with multi-threading, make sure the port used by its dedicated
       // HTTP listener (e.g. 8001) is different from the one used by the CA.
       // Note that commands should still be sent via the CA. The dedicated
       // listener is specifically for HA updates only.
       "http-port": 8000,

       "control-sockets": {
           "dhcp4": {
               "socket-type": "unix",
               "socket-name": "/tmp/kea-dhcp4-ctrl.sock"
           },
           "dhcp6": {
               "socket-type": "unix",
               "socket-name": "/tmp/kea-dhcp6-ctrl.sock"
           }
       }
   }
   }

Since Kea 1.9.0, basic HTTP authentication is supported.

.. _ha-mt-config:

Multi-Threaded Configuration (HA+MT)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

HA peer communication consists of specialized API commands sent between HA
peers. Prior to Kea 1.9.7, each peer had to be paired with a local instance of
:iscman:`kea-ctrl-agent` in order to exchange commands. The agent received HA
commands via HTTP, communicated via UNIX socket with the local peer to carry
out the command, and then sent the response back to the requesting peer via
HTTP. To send HA commands, each peer opened its own HTTP client connection to
the URL of each of its peers.

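The agent-based exchange described above can be pictured from the client side.
The following Python sketch only builds the HTTP request that would be POSTed
to a Control Agent; it does not send anything. It assumes the agent listens at
``http://192.168.56.33:8000/`` as in the CA example, and the helper name
``build_ca_request`` is hypothetical, not part of Kea.

```python
import json
import urllib.request


def build_ca_request(url, command, service, arguments=None):
    """Build (but do not send) an HTTP POST carrying one Kea command."""
    payload = {"command": command, "service": [service]}
    if arguments is not None:
        payload["arguments"] = arguments
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Illustrative values: the CA address from the example configuration.
request = build_ca_request("http://192.168.56.33:8000/", "ha-heartbeat", "dhcp4")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Passing the object to ``urllib.request.urlopen()`` would actually send it; that
step is left out so the sketch runs without a live server.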
In Kea 1.9.7 and newer, it is possible to configure HA to use direct multi-threaded communication between peers. We refer to this mode as HA+MT. With HA+MT enabled, each peer runs its own dedicated, internal HTTP listener (i.e. server) which receives and responds to commands directly, thus eliminating the need for an agent to carry out HA protocol between peers. In addition, both the listener and client components use multi-threading to support multiple, concurrent connections between peers. By eliminating the agent and executing multiple command exchanges in parallel, HA throughput between peers should improve considerably in most situations. The following parameters have been added to the HA configuration, to support HA+MT operation: - ``enable-multi-threading`` - enables or disables multi-threading HA peer communication (HA+MT). Kea core multi-threading must be enabled for HA+MT to operate. When ``false``, the server relies on :iscman:`kea-ctrl-agent` for communication with its peer, and uses single-threaded HTTP client processing. The default is ``true``. - ``http-dedicated-listener`` - enables or disables the creation of a dedicated, internal HTTP listener through which the server receives HA messages from its peers. The internal listener replaces the role of :iscman:`kea-ctrl-agent` traffic, allowing peers to send their HA commands directly to each other. The listener listens on the peer's ``url``. When ``false``, the server relies on :iscman:`kea-ctrl-agent`. This parameter has been provided largely for flexibility and testing; running HA+MT without dedicated listeners enabled will substantially limit HA throughput. The default is ``true``. - ``http-listener-threads`` - indicates the maximum number of threads the dedicated listener should use. A value of ``0`` instructs the server to use the same number of threads that the Kea core is using for DHCP multi-threading. The default is ``0``. 
- ``http-client-threads`` - indicates the maximum number of threads that should
  be used to send HA messages to its peers. A value of ``0`` instructs the
  server to use the same number of threads that the Kea core is using for DHCP
  multi-threading. The default is ``0``.

These parameters are grouped together under a map element, ``multi-threading``,
as illustrated below:

::

   "Dhcp4": {
       "hooks-libraries": [
           {
               "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
               "parameters": { }
           },
           {
               "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
               "parameters": {
                   "high-availability": [ {
                       "this-server-name": "server1",
                       "multi-threading": {
                           "enable-multi-threading": true,
                           "http-dedicated-listener": true,
                           "http-listener-threads": 4,
                           "http-client-threads": 4
                       },
                       "peers": [
                           // This is the configuration of this server instance.
                           {
                               "name": "server1",
                               // This specifies the URL of our server instance.
                               // Since HA+MT uses a direct connection, the
                               // DHCPv4 server opens its own socket. Note that
                               // its port must be different from the one used
                               // by the CA (typically 8000). In this example,
                               // 8001 is used.
                               "url": "http://192.0.2.1:8001/",
                               // This server is primary. The other one must be
                               // secondary.
                               "role": "primary"
                           },
                           // This is the configuration of our HA peer.
                           {
                               "name": "server2",
                               // This specifies the URL of the partner server.
                               // With HA+MT it points at the partner's
                               // dedicated listener, so the port must be
                               // different from the one used by the
                               // partner's CA (typically 8000). In this
                               // example, 8001 is used.
                               "url": "http://192.0.2.2:8001/",
                               // The partner is a secondary. This server is a
                               // primary as specified in the previous "peers"
                               // entry and in "this-server-name" before that.
                               "role": "secondary"
                           },
                           ...
                       ],
                       ...
                   },
                   ...
                   ]
               }
           },
           ...
       ],
       ...
   }

In the example above, HA+MT is enabled with four threads for the listener and
four threads for the client.

.. note::

   It is essential to configure the ports correctly. One common mistake is to
   configure CA to listen on port 8000 and also configure dedicated listeners
   on port 8000. In such a configuration, the communication will still work
   over CA, but it will be slow and the DHCP server will fail to bind sockets.
   Administrators should ensure that dedicated listeners use a different port
   (8001 is a suggested alternative); if ports are misconfigured or the ports
   dedicated to CA are used, the performance bottlenecks caused by the
   single-threaded nature of CA and the sequential nature of the UNIX socket
   that connects CA to DHCP servers will nullify any performance gains offered
   by HA+MT.

.. _ha-parked-packet-limit:

Parked-Packet Limit
~~~~~~~~~~~~~~~~~~~

Kea servers contain a mechanism by which the response to a client packet may
be held, pending completion of hook library work. We refer to this as "parking"
the packet. The HA hook library makes use of this mechanism. When an HA server
needs to send a lease update to its peer(s) to notify them of a change to the
lease, it will "park" the client response until the peer acknowledges the lease
update. At that point, the server will "unpark" the response and send it to the
client. This applies to client queries which cause lease changes, such as
DHCPREQUEST for DHCPv4 and Request, Renew, and Rebind for DHCPv6. It does not
apply to DHCPDISCOVERs (v4) or Solicits (v6).

There is a global parameter, ``parked-packet-limit``, that may be used to limit
the number of responses that may be parked at any given time. This acts as a
form of congestion handling and protects the server from being swamped when the
volume of client queries is outpacing the server's ability to respond. Once the
limit is reached, the server emits a log message and drops any new responses
until parking spaces are available.

In general, smaller values for the parking lot limit are likely to cause more
drops but with shorter response times. Larger values are likely to result in
fewer drops but with longer response times.
Currently, the default value for ``parked-packet-limit`` is 256. .. warning:: Using too small a value may result in an unnecessarily high drop rate, while using too large a value may lead to response times that are simply too long to be useful. A value of 0, while allowed, disables the limit altogether, but this is highly discouraged as it may lead to Kea servers becoming unresponsive to clients. Choosing the best value is very site-specific; we recommend users initially leave it at the default value of 256 and observe how the system behaves over time with varying load conditions. :: "Dhcp6": { // Limit the number of concurrently parked packets to 128. "parked-packet-limit": 128, "hooks-libraries": [ { "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so", "parameters": { } }, { "library": "/usr/lib/kea/hooks/libdhcp_ha.so", "parameters": { "high-availability": [ { "this-server-name": "server1", ... } ] } }, ... ], ... } .. note:: While ``parked-packet-limit`` is not specifically tied to HA, currently HA is the only ISC hook that employs packet parking. .. _ha-maintenance: Controlled Shutdown and Maintenance of DHCP Servers ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Having a pair of servers providing High Availability allows for controlled shutdown and maintenance of those servers without disrupting the DHCP service. For example, an administrator can perform an upgrade of one of the servers while the other one continues to respond to DHCP queries. When the first server is upgraded and back online, the upgrade can be performed for the second server. A typical problem reported with early versions of the High Availability hook library was that the administrator did not have direct control over the state of the DHCP server. 
Shutting down one of the servers for maintenance did not necessarily cause the
other server to start responding to all DHCP queries, because the
failure-detection algorithm described in :ref:`ha-scope-transition` requires
that the partner not respond for a configured period of time and, depending on
the configuration, may also require that a number of DHCP requests not be
responded to for a specified period of time. The maintenance procedure,
however, requires that the administrator be able to instruct one of the servers
to instantly start serving all DHCP clients, and the other server to instantly
stop serving any DHCP clients, so it can be safely shut down.

The maintenance feature of the High Availability hook library addresses this
situation. The :isccmd:`ha-maintenance-start` command was introduced to allow
the administrator to put the pair of active servers in a state in which one of
them is responding to all DHCP queries and the other one is awaiting shutdown.

Suppose that the HA setup includes two active servers, ``server1`` and
``server2``, and the latter needs to be shut down for maintenance. The
administrator can send the :isccmd:`ha-maintenance-start` command to
``server1``, as this is the server which is going to handle the DHCP traffic
while the other one is offline. ``server1`` responds with an error if its state
or the partner's state does not allow for a maintenance shutdown: for example,
if maintenance is not supported for the backup server or if the server is in
the ``terminated`` state. Also, an error is returned if the
:isccmd:`ha-maintenance-start` request was already sent to the other server.

Upon receiving the :isccmd:`ha-maintenance-start` command, ``server1`` sends
the :isccmd:`ha-maintenance-notify` command to ``server2`` to put it in the
``in-maintenance`` state. If ``server2`` confirms, ``server1`` transitions to
the ``partner-in-maintenance`` state.
This is similar to the ``partner-down`` state, except that in the
``partner-in-maintenance`` state ``server1`` continues to send lease updates to
``server2`` until the administrator shuts down ``server2``. ``server1`` now
responds to all DHCP queries.

The administrator can now safely shut down ``server2`` in the
``in-maintenance`` state and perform any necessary maintenance actions. While
``server2`` is offline, ``server1`` will not be able to communicate with its
partner, so it will immediately transition to the ``partner-down`` state; it
will continue to respond to all DHCP queries but will no longer send lease
updates to ``server2``. Restarting ``server2`` after the maintenance will
trigger normal state negotiation, lease-database synchronization, and,
ultimately, a transition to the normal ``load-balancing`` or ``hot-standby``
state. Maintenance can then be performed on ``server1``, after sending the
:isccmd:`ha-maintenance-start` command to ``server2``.

If the :isccmd:`ha-maintenance-start` command was sent to the server and the
server has transitioned to the ``partner-in-maintenance`` state, it is possible
to transition both it and its partner back to their previous states to resume
the normal operation of the HA pair. This is achieved by sending the
:isccmd:`ha-maintenance-cancel` command to the server that is in the
``partner-in-maintenance`` state. However, if the server has already
transitioned to the ``partner-down`` state as a result of detecting that the
partner is offline, canceling the maintenance is no longer possible. In that
case, it is necessary to restart the other server and allow it to complete its
normal state negotiation process.

If the server has many relationships with different partners, the
``ha-maintenance-start`` command attempts to transition all of the
relationships into the ``partner-in-maintenance`` state by sending the
``ha-maintenance-notify`` command to all partner servers.
If this step fails for any server an error is returned. In that case, send the ``ha-maintenance-cancel`` command to resume normal operation and fix the issue. Upgrading From Older HA Versions ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ To upgrade from an older HA hook library to the current version, the administrator must shut down one of the servers and rely on the failover mechanism to force the online server to transition to the ``partner-down`` state, where it starts serving all DHCP clients. Once the hook library on the first server is upgraded to a current version, the :isccmd:`ha-maintenance-start` command can be used to upgrade the second server. In such a case, shut down the server running the old version. Next, send the :isccmd:`ha-maintenance-start` command to the server that has been upgraded. This server should immediately transition to the ``partner-down`` state as it cannot communicate with its offline partner. In the ``partner-down`` state the first (upgraded) server will respond to all DHCP requests, allowing the administrator to perform the upgrade on the second server. .. note:: Do not send the :isccmd:`ha-maintenance-start` command while the server running the old hook library is still online. The server receiving this command will return an error. .. _ha-control-commands: Control Commands for High Availability ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Even though the HA hook library is designed to automatically resolve issues with DHCP service interruptions by redirecting the DHCP traffic to a surviving server and synchronizing the lease database as needed, it may be useful for the administrator to have more control over both servers' behavior. In particular, it may be useful to be able to trigger lease-database synchronization on demand, or to manually set the HA scopes that are being served. The backup server can sometimes be used to handle DHCP traffic if both active servers are down. 
The backup server does not perform the failover function automatically; thus,
in order to use the backup server to respond to DHCP queries, the server
administrator must enable this function manually.

The following sections describe commands supported by the HA hook library
which are available for the administrator.

.. isccmd:: ha-sync
.. _command-ha-sync:

The ``ha-sync`` Command
-----------------------

The :isccmd:`ha-sync` command instructs the server to synchronize its local
lease database with the selected peer. The server fetches all leases from the
peer and updates any locally stored leases which are older than those fetched.
It also creates new leases when any of those fetched do not exist in the local
database. All leases that are not returned by the peer but are in the local
database are preserved.

The database synchronization is unidirectional; only the database on the
server to which the command has been sent is updated. To synchronize the
peer's database, a separate :isccmd:`ha-sync` command must be issued to that
peer.

Database synchronization may be triggered for both active and backup server
types. The :isccmd:`ha-sync` command has the following structure (in a DHCPv4
example):

::

   {
       "command": "ha-sync",
       "service": [ "dhcp4" ],
       "arguments": {
           "server-name": "server2",
           "max-period": 60
       }
   }

When the server receives this command it first disables the DHCP service of
the server from which it will be fetching leases, by sending the
:isccmd:`dhcp-disable` command to that server. The ``max-period`` parameter
specifies the maximum duration (in seconds) for which the DHCP service should
be disabled. If the DHCP service is successfully disabled, the synchronizing
server fetches leases from the remote server by issuing one or more
:isccmd:`lease4-get-page` commands. When the lease-database synchronization is
complete, the synchronizing server sends the :isccmd:`dhcp-enable` command to
the peer to re-enable its DHCP service.

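The order of commands in one ``ha-sync`` run can be modelled in a few lines of
Python. This is a simplified, network-free sketch rather than Kea's
implementation: the function name ``plan_ha_sync`` and the in-memory list
standing in for the partner's lease database are hypothetical, and the rule
that a short page ends the paging loop is an assumption of the sketch.

```python
def plan_ha_sync(peer_leases, page_limit=2, max_period=60):
    """Model the command sequence of one ha-sync run (no network I/O).

    peer_leases is a list of lease addresses standing in for the
    partner's lease database.
    """
    if page_limit < 1:
        raise ValueError("page_limit must be at least 1")

    # Step 1: ask the partner to stop serving clients for at most
    # max_period seconds.
    sent = [("dhcp-disable", {"max-period": max_period})]

    # Step 2: fetch the partner's leases page by page.
    fetched = []
    start = "start"  # the first page begins at the start of the database
    position = 0
    while True:
        sent.append(("lease4-get-page", {"from": start, "limit": page_limit}))
        page = peer_leases[position:position + page_limit]
        fetched.extend(page)
        position += len(page)
        # Assumption of this sketch: a short page means nothing is left.
        if len(page) < page_limit:
            break
        start = page[-1]  # the next page begins after the last lease seen

    # Step 3: let the partner resume its DHCP service.
    sent.append(("dhcp-enable", {}))
    return sent, fetched


commands, leases = plan_ha_sync(["192.0.2.10", "192.0.2.11", "192.0.2.12"])
for name, arguments in commands:
    print(name, arguments)
```

The paging loop is the part whose duration ``max-period`` has to cover, since
the partner's DHCP service stays disabled from the first command to the last.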
The ``max-period`` value should be sufficiently long to guarantee that it does
not elapse before the synchronization is completed. Otherwise, the DHCP server
will automatically enable its DHCP function while the synchronization is still
in progress. If the DHCP server subsequently allocates any leases during the
synchronization, those new (or updated) leases will not be fetched by the
synchronizing server, leading to database inconsistencies.

.. isccmd:: ha-scopes
.. _command-ha-scopes:

The ``ha-scopes`` Command
-------------------------

This command allows an administrator to modify the HA scopes being served.
Consult :ref:`ha-load-balancing-config` and :ref:`ha-hot-standby-config` to
learn which scopes are available for the different HA modes of operation. The
:isccmd:`ha-scopes` command has the following structure (in a DHCPv4 example):

::

   {
       "command": "ha-scopes",
       "service": [ "dhcp4" ],
       "arguments": {
           "scopes": [ "HA_server1", "HA_server2" ],
           "server-name": "server2"
       }
   }

This command configures the server to handle traffic from both the
"HA_server1" and "HA_server2" scopes. To disable all scopes, specify an empty
list:

::

   {
       "command": "ha-scopes",
       "service": [ "dhcp4" ],
       "arguments": {
           "scopes": [ ],
           "server-name": "server2"
       }
   }

The optional ``server-name`` parameter specifies the name of one of the
partners belonging to the HA relationship this command pertains to. This
parameter can be omitted if the server receiving this command has only one HA
relationship in the configuration.

.. isccmd:: ha-continue
.. _command-ha-continue:

The ``ha-continue`` Command
---------------------------

This command is used to resume the operation of the paused HA state machine, as
described in :ref:`ha-pause-state-machine`.
It takes no arguments other than the optional ``server-name``, so the command
structure is simply:

::

   {
       "command": "ha-continue",
       "service": [ "dhcp4" ],
       "arguments": {
           "server-name": "server2"
       }
   }

The optional ``server-name`` parameter specifies the name of one of the
partners belonging to the HA relationship this command pertains to. This
parameter can be omitted if the server receiving this command has only one HA
relationship in the configuration.

.. isccmd:: ha-heartbeat
.. _command-ha-heartbeat:

The ``ha-heartbeat`` Command
----------------------------

The :ref:`ha-server-states` section describes how the :isccmd:`ha-heartbeat`
command is used by a pair of active HA servers to detect one partner's failure.
This command, however, can also be sent by the system administrator to one or
both servers to check their HA state. This allows a monitoring system to be
deployed on the HA-enabled servers to periodically check whether they are
operational or whether any manual intervention is required. The
:isccmd:`ha-heartbeat` command takes no arguments other than the optional
``server-name``:

::

   {
       "command": "ha-heartbeat",
       "service": [ "dhcp4" ],
       "arguments": {
           "server-name": "server2"
       }
   }

The optional ``server-name`` parameter specifies the name of one of the
partners belonging to the HA relationship this command pertains to. This
parameter can be omitted if the server receiving this command has only one HA
relationship in the configuration.

Upon successful communication with the server, a response similar to this
should be returned:

::

   {
       "result": 0,
       "text": "HA peer status returned.",
       "arguments": {
           "state": "partner-down",
           "date-time": "Thu, 07 Nov 2019 08:49:37 GMT",
           "scopes": [ "server1" ],
           "unsent-update-count": 123
       }
   }

The returned ``state`` value should be one of the values listed in
:ref:`ha-server-states`.
In the example above, the ``partner-down`` state is returned, which indicates that the server which responded to the command believes that its partner is offline; thus, it is serving all DHCP requests sent to the servers. To ensure that the partner is indeed offline, the administrator should send the :isccmd:`ha-heartbeat` command to the second server. If sending the command fails, e.g. due to an inability to establish a TCP connection to the Control Agent, or if the Control Agent reports issues with communication with the DHCP server, it is very likely that the server is not running. The ``date-time`` parameter conveys the server's notion of time. The ``unsent-update-count`` value is a cumulative count of all unsent lease updates since the server was booted; its value is set to 0 when the server is started. It is never reset to 0 during the server's operation, even after the partner synchronizes the database. It is incremented by the partner sending the heartbeat response when it cannot send the lease update. For example, suppose the failure is a result of a temporary communication interruption. In that case, the partner receiving the ``partner-down`` heartbeat response tracks the value changes and can determine, once communication is reestablished, whether there are any new lease updates that it did not receive. If the values on both servers do not match, it is an indication that the partner should synchronize its lease database. A non-zero value itself is not an indication of any present issues with lease updates, but a constantly incrementing value is. The typical response returned by one server when both are operational is: :: { "result": 0, "text": "HA peer status returned.", "arguments": { "state": "load-balancing", "date-time": "Thu, 07 Nov 2019 08:49:37 GMT", "scopes": [ "server1" ], "unsent-update-count": 0 } } In most cases, the :isccmd:`ha-heartbeat` command should be sent to both HA-enabled servers to verify the state of the entire HA setup. 
In particular, if one of the servers indicates that it is in the
``load-balancing`` state, it means that this server is operating as if its
partner is functional. When a partner goes down, it takes some time for the
surviving server to realize it. The :ref:`ha-scope-transition` section
describes the algorithm which the surviving server follows before it
transitions to the ``partner-down`` state. If the :isccmd:`ha-heartbeat`
command is sent during the time window between the failure of one of the
servers and the transition of the surviving server to the ``partner-down``
state, the response from the surviving server does not reflect the failure.
Resending the command detects the failure once the surviving server has entered
the ``partner-down`` state.

.. note::

   Always send the :isccmd:`ha-heartbeat` command to both active HA servers to
   check the state of the entire HA setup. Sending it to only one of the
   servers may not reflect issues that just began with one of the servers.

.. isccmd:: ha-status-get
.. _command-ha-status-get:

The ``status-get`` Command
--------------------------

:isccmd:`status-get` is a general-purpose command supported by several Kea
daemons, not only the DHCP servers. However, when sent to a DHCP server with HA
enabled, it can be used to get insight into the details of the HA-specific
server status. Not only does the response contain the status information of the
server receiving this command, but also the information about its partner if it
is available.

The following is an example response to the :isccmd:`status-get` command,
including the HA status of two ``load-balancing`` servers:

.. code-block:: json

   {
       "result": 0,
       "text": "",
       "arguments": {
           "pid": 1234,
           "uptime": 3024,
           "reload": 1111,
           "high-availability": [
               {
                   "ha-mode": "load-balancing",
                   "ha-servers": {
                       "local": {
                           "role": "primary",
                           "scopes": [ "server1" ],
                           "state": "load-balancing",
                           "server-name": "server1"
                       },
                       "remote": {
                           "age": 10,
                           "in-touch": true,
                           "role": "secondary",
                           "last-scopes": [ "server2" ],
                           "last-state": "load-balancing",
                           "communication-interrupted": true,
                           "connecting-clients": 2,
                           "unacked-clients": 1,
                           "unacked-clients-left": 2,
                           "analyzed-packets": 8,
                           "server-name": "server2"
                       }
                   }
               }
           ],
           "multi-threading-enabled": true,
           "thread-pool-size": 4,
           "packet-queue-size": 64,
           "packet-queue-statistics": [ 0.2, 0.1, 0.1 ],
           "sockets": {
               "status": "ready"
           }
       }
   }

The ``high-availability`` argument is a list which currently comprises only one
element.

The ``ha-servers`` map contains two structures: ``local`` and ``remote``. The
former contains the status information of the server which received the
command, while the latter contains the status information known to the local
server about the partner. The ``role`` of the partner server is gathered from
the local configuration file, and thus should always be available. The
remaining status information, such as ``last-scopes`` and ``last-state``, is
not available until the local server communicates with the remote by
successfully sending the :isccmd:`ha-heartbeat` command. If at least one such
communication has taken place, the returned value of the ``in-touch`` parameter
is set to ``true``. By examining this value, the command's sender can determine
whether the information about the remote server is reliable.

The ``last-scopes`` and ``last-state`` parameters contain information about the
HA scopes served by the partner and its state. This information is gathered
during the :isccmd:`ha-heartbeat` command exchange, so it may not be accurate
if a communication problem occurs between the partners and this status
information is not refreshed.
In such a case, it may be useful to send the :isccmd:`status-get` command to the partner server directly to check its current state. The ``age`` parameter specifies the age of the information from the partner, in seconds. The ``communication-interrupted`` boolean value indicates whether the server receiving the :isccmd:`status-get` command (the local server) has been unable to communicate with the partner longer than the duration specified as ``max-response-delay``. In such a situation, the active servers are considered to be in the ``communication-interrupted`` state. At this point, the local server may start monitoring the DHCP traffic directed to the partner to see if the partner is responding to this traffic. More about the failover procedure can be found in :ref:`ha-load-balancing-config`. The ``connecting-clients``, ``unacked-clients``, ``unacked-clients-left``, and ``analyzed-packets`` parameters were introduced along with the ``communication-interrupted`` parameter and they convey useful information about the state of the DHCP traffic monitoring in the ``communication-interrupted`` state. Once the server leaves the ``communication-interrupted`` state, these parameters are all reset to 0. These parameters have the following meaning in the ``communication-interrupted`` state: - ``connecting-clients`` - this is the number of different clients which have attempted to get a lease from the remote server. These clients are differentiated by their MAC address and client identifier (in DHCPv4) or DUID (in DHCPv6). This number includes "unacked" clients (for which the "secs" field or "elapsed time" value exceeded the ``max-response-delay``). - ``unacked-clients`` - this is the number of different clients which have been considered "unacked", i.e. the clients which have been trying to get the lease longer than the value of the "secs" field, or for which the "elapsed time" exceeded the ``max-response-delay`` setting. 
-  ``unacked-clients-left`` - this indicates the number of additional
   clients which have to be considered "unacked" before the server enters
   the ``partner-down`` state. This value decreases when the
   ``unacked-clients`` value increases. The local server enters the
   ``partner-down`` state when this value decreases to 0.

-  ``analyzed-packets`` - this is the total number of packets directed to
   the partner server and analyzed by the local server since entering the
   ``communication-interrupted`` state. It includes retransmissions from
   the same clients.

Monitoring these values helps to predict when the local server will enter
the ``partner-down`` state or to understand why the server has not yet
entered this state.

The ``ha-mode`` parameter returns the HA mode of operation selected using
the ``mode`` parameter in the configuration file. It can hold one of the
following values: ``load-balancing``, ``hot-standby``, or
``passive-backup``.

The :isccmd:`status-get` response has the format described above only in
the ``load-balancing`` and ``hot-standby`` modes. In the
``passive-backup`` mode, the ``remote`` map is not included in the
response because in this mode there is only one active server (local).
The response includes no information about the status of the backup
servers.

.. isccmd:: ha-maintenance-start
.. _command-ha-maintenance-start:

The ``ha-maintenance-start`` Command
------------------------------------

This command is used to initiate the transition of the server's partners
into the ``in-maintenance`` state and the transition of the server
receiving the command into the ``partner-in-maintenance`` state in each
HA relationship. See the :ref:`ha-maintenance` section for details.

::

   {
       "command": "ha-maintenance-start",
       "service": [ "dhcp4" ]
   }

.. isccmd:: ha-maintenance-cancel
.. _command-ha-maintenance-cancel:

The ``ha-maintenance-cancel`` Command
-------------------------------------

This command is used to cancel the maintenance previously initiated using
the :isccmd:`ha-maintenance-start` command. The server receiving this
command first sends :isccmd:`ha-maintenance-notify`, with the ``cancel``
flag set to ``true``, to its partners. Next, the server reverts from the
``partner-in-maintenance`` state to its previous state. See the
:ref:`ha-maintenance` section for details.

::

   {
       "command": "ha-maintenance-cancel",
       "service": [ "dhcp4" ]
   }

.. isccmd:: ha-maintenance-notify
.. _command-ha-maintenance-notify:

The ``ha-maintenance-notify`` Command
-------------------------------------

This command is sent by the server receiving the
:isccmd:`ha-maintenance-start` or the :isccmd:`ha-maintenance-cancel`
command to its partner, to cause the partner to transition to the
``in-maintenance`` state or to revert from this state to a previous
state. See the :ref:`ha-maintenance` section for details.

::

   {
       "command": "ha-maintenance-notify",
       "service": [ "dhcp4" ],
       "arguments": {
           "cancel": false,
           "server-name": "server2"
       }
   }

The optional ``server-name`` parameter specifies the name of one of the
partners belonging to the HA relationship this command pertains to. This
parameter can be omitted if the server receiving this command has only
one HA relationship in the configuration.

.. warning::

   The :isccmd:`ha-maintenance-notify` command is not meant to be used by
   system administrators. It is used for internal communication between a
   pair of HA-enabled DHCP servers. Direct use of this command is not
   supported and may produce unintended consequences.

.. isccmd:: ha-reset
.. _command-ha-reset:

The ``ha-reset`` Command
------------------------

This command causes the server to reset its High Availability state
machine by transitioning it to the ``waiting`` state.
A partner in the ``communication-recovery`` state may send this command
to cause the server to synchronize its lease database. Database
synchronization is required when the partner has failed to send all lease
database updates after re-establishing the connection following a
temporary connection failure. It is also required when the
``delayed-updates-limit`` is exceeded while the server is in the
``communication-recovery`` state. A server administrator may send this
command to reset a misbehaving state machine.

::

   {
       "command": "ha-reset",
       "service": [ "dhcp4" ],
       "arguments": {
           "server-name": "server2"
       }
   }

The optional ``server-name`` parameter specifies the name of one of the
partners belonging to the HA relationship this command pertains to. This
parameter can be omitted if the server receiving this command has only
one HA relationship in the configuration.

The command elicits the response:

::

   {
       "result": 0,
       "text": "HA state machine reset."
   }

If the server receiving this command is already in the ``waiting`` state,
the command has no effect.

.. isccmd:: ha-sync-complete-notify
.. _command-ha-sync-complete-notify:

The ``ha-sync-complete-notify`` Command
---------------------------------------

A server sends this command to its partner to signal that it has
completed lease-database synchronization. The partner may enable its DHCP
service if it can allocate new leases in its current state. The partner
does not enable the DHCP service in the ``partner-down`` state until it
sends a successful :isccmd:`ha-heartbeat` test to its partner server. If
the connection is still unavailable, the server in the ``partner-down``
state enables its own DHCP service to continue responding to clients.

::

   {
       "command": "ha-sync-complete-notify",
       "service": [ "dhcp4" ],
       "arguments": {
           "origin": 2000,
           "server-name": "server2"
       }
   }

The optional ``server-name`` parameter specifies the name of one of the
partners belonging to the HA relationship this command pertains to.
This parameter can be omitted if the server receiving this command has
only one HA relationship in the configuration. The ``origin`` parameter
is used to select the HA service for which the receiving server should
enable the DHCP service when it receives this notification. This is the
same origin the sending server used previously to disable the DHCP
service before synchronization.

The command elicits the response:

::

   {
       "result": 0,
       "text": "Server successfully notified about the synchronization completion."
   }

.. warning::

   The :isccmd:`ha-sync-complete-notify` command is not meant to be used
   by system administrators. It is used for internal communication
   between a pair of HA-enabled DHCP servers. Direct use of this command
   is not supported and may produce unintended consequences.

.. _ha-hub-and-spoke:

Hub and Spoke Configuration
~~~~~~~~~~~~~~~~~~~~~~~~~~~

Hub-and-spoke is a common arrangement of DHCP servers for resiliency. It
contains one central server and multiple branch servers. The branch
servers are the primary servers in the ``hot-standby`` mode and respond
to the local DHCP traffic in their respective locations. The central
server acts as a standby server for each branch server. It maintains
independent state machines with the branch servers, called relationships.
If one of the branch servers experiences a failure, the central server
can take over its DHCP traffic. In this case, we say that one of the
central server's relationships is in the ``partner-down`` state. The
remaining relationships may still be in the ``hot-standby`` state, in
which the central server does not actively respond to DHCP traffic. When
the branch server becomes active again, it synchronizes the lease
database with the central server, and the central server becomes fully
passive again. In rare cases, when multiple branch servers stop, the
central server takes responsibility for all their traffic (possibly the
entire DHCP traffic in the network when all branch servers are down).
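
The independence of the relationships can be illustrated with a small
model. The following Python sketch is not part of Kea: the
``served_branches`` function and the ``SERVING_STATES`` set are
hypothetical names, and the model covers only the two states mentioned
above, whereas the real HA state machine has more.

.. code-block:: python

   # Hypothetical model: states in which the central (standby) server
   # answers DHCP traffic on behalf of a branch server.
   SERVING_STATES = {"partner-down"}


   def served_branches(relationship_states):
       """Return the branch servers whose traffic the central server handles.

       ``relationship_states`` maps a branch server name to the state of
       the central server's relationship with that branch.
       """
       return sorted(
           branch
           for branch, state in relationship_states.items()
           if state in SERVING_STATES
       )


   # Branch "server1" has failed; branch "server3" is operating normally.
   print(served_branches({"server1": "partner-down", "server3": "hot-standby"}))

Each relationship is evaluated on its own, so a failure of one branch
server does not change how the central server treats the other branches.
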

A simple hub-and-spoke arrangement consisting of two branch servers and
one central server is shown below.

::

                               +----- Central Server ------+
                               |                           |
   +----------+ relationship 1 |  +----------+----------+  | relationship 2 +----------+
   | Server 1 |===================| Server 2 | Server 4 |===================| Server 3 |
   +----------+                |  +----------+----------+  |                +----------+
                               |                           |
                               +---------------------------+

Each branch server's configuration comprises a set of subnets appropriate
for the branch server. Different branch servers serve different subnets.
The central server's configuration comprises all subnets of the branch
servers so that it can respond to the DHCP traffic directed to any of the
failing branch servers. The subnets in the central server must be grouped
into relationships like in the snippet below:

.. code-block:: json

   {
       "Dhcp6": {
           "interfaces-config": {
               "interfaces": [ "enp0s8", "enp0s9" ]
           },
           "hooks-libraries": [
               {
                   "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                   "parameters": {}
               },
               {
                   "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                   "parameters": {
                       "high-availability": [
                           {
                               "this-server-name": "server2",
                               "mode": "hot-standby",
                               "multi-threading": {
                                   "enable-multi-threading": true,
                                   "http-dedicated-listener": true,
                                   "http-listener-threads": 4,
                                   "http-client-threads": 4
                               },
                               "peers": [
                                   {
                                       "name": "server1",
                                       "url": "http://192.168.56.66:8000/",
                                       "role": "primary",
                                       "auto-failover": true
                                   },
                                   {
                                       "name": "server2",
                                       "url": "http://192.168.56.33:8000/",
                                       "role": "standby",
                                       "auto-failover": true
                                   }
                               ]
                           },
                           {
                               "this-server-name": "server4",
                               "mode": "hot-standby",
                               "multi-threading": {
                                   "enable-multi-threading": true,
                                   "http-dedicated-listener": true,
                                   "http-listener-threads": 4,
                                   "http-client-threads": 4
                               },
                               "peers": [
                                   {
                                       "name": "server3",
                                       "url": "http://192.168.57.99:8000/",
                                       "role": "primary",
                                       "auto-failover": true
                                   },
                                   {
                                       "name": "server4",
                                       "url": "http://192.168.57.33:8000/",
                                       "role": "standby",
                                       "auto-failover": true
                                   }
                               ]
                           }
                       ]
                   }
               }
           ],
           "subnet6": [
               {
                   "id": 1,
                   "subnet": "2001:db8:1::/64",
                   "pools": [
                       {
                           "pool": "2001:db8:1::/80"
                       }
                   ],
                   "interface": "enp0s8",
                   "user-context": {
                       "ha-server-name": "server2"
                   }
               },
               {
                   "id": 2,
                   "subnet": "2001:db8:2::/64",
                   "pools": [
                       {
                           "pool": "2001:db8:2::/80"
                       }
                   ],
                   "interface": "enp0s9",
                   "user-context": {
                       "ha-server-name": "server4"
                   }
               }
           ]
       }
   }

The peer names in the relationships must be unique. The user context for
each subnet contains the ``ha-server-name`` parameter associating a
subnet with a relationship. The ``ha-server-name`` can be any of the peer
names in the relationship. Suppose a relationship contains peer names
``server1`` and ``server2``. It doesn't matter whether the
``ha-server-name`` is ``server1`` or ``server2``. In both cases, it
associates a subnet with that relationship.

It is not required to specify the ``ha-server-name`` in the branch
servers, assuming that the branch servers only contain the subnets they
serve. Consider the following configuration for branch ``server3``:

.. code-block:: json

   {
       "Dhcp6": {
           "interfaces-config": {
               "interfaces": [ "enp0s8" ]
           },
           "hooks-libraries": [
               {
                   "library": "/usr/lib/kea/hooks/libdhcp_lease_cmds.so",
                   "parameters": {}
               },
               {
                   "library": "/usr/lib/kea/hooks/libdhcp_ha.so",
                   "parameters": {
                       "high-availability": [
                           {
                               "this-server-name": "server3",
                               "mode": "hot-standby",
                               "multi-threading": {
                                   "enable-multi-threading": true,
                                   "http-dedicated-listener": true,
                                   "http-listener-threads": 4,
                                   "http-client-threads": 4
                               },
                               "peers": [
                                   {
                                       "name": "server3",
                                       "url": "http://192.168.57.99:8000/",
                                       "role": "primary",
                                       "auto-failover": true
                                   },
                                   {
                                       "name": "server4",
                                       "url": "http://192.168.57.33:8000/",
                                       "role": "standby",
                                       "auto-failover": true
                                   }
                               ]
                           }
                       ]
                   }
               }
           ],
           "subnet6": [
               {
                   "id": 2,
                   "subnet": "2001:db8:2::/64",
                   "pools": [
                       {
                           "pool": "2001:db8:2::/80"
                       }
                   ],
                   "interface": "enp0s8",
                   "user-context": {
                       "ha-server-name": "server3"
                   }
               }
           ]
       }
   }

.. note::

   Even though it is not required to include the ``ha-server-name`` user
   context parameters in the branch servers, we recommend including them.
   The servers fetch all leases from the partners during the database
   synchronization. If the subnets are not explicitly associated with the
   relationship, the branch server inserts all fetched leases from the
   central server (including those from other relationships) into its
   database. Specifying the ``ha-server-name`` parameter for each
   configured subnet in the branch server guarantees that only the leases
   belonging to its relationship are inserted into the branch server's
   database.

.. note::

   The peer names in the branch servers must match the peer names in the
   respective central server's relationships because these names are used
   for signaling between the HA partners.
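
The naming rules described in this section lend themselves to a simple
consistency check before a central server configuration is deployed. The
following Python sketch is a hypothetical helper, not a tool shipped with
Kea: the ``index_peers`` and ``map_subnets`` functions are assumptions
made for illustration, and the input is a trimmed-down version of the
central server configuration from this section.

.. code-block:: python

   def index_peers(relationships):
       """Map each peer name to the index of the relationship containing it.

       Raises ValueError if a peer name is used more than once.
       """
       peer_to_relationship = {}
       for index, relationship in enumerate(relationships):
           for peer in relationship["peers"]:
               name = peer["name"]
               if name in peer_to_relationship:
                   raise ValueError(f"peer name {name!r} is not unique")
               peer_to_relationship[name] = index
       return peer_to_relationship


   def map_subnets(subnets, peer_to_relationship):
       """Map each subnet ID to a relationship index via ``ha-server-name``.

       Raises ValueError if a subnet names no peer or an unknown peer.
       """
       subnet_to_relationship = {}
       for subnet in subnets:
           name = subnet.get("user-context", {}).get("ha-server-name")
           if name not in peer_to_relationship:
               raise ValueError(
                   f"subnet {subnet['id']}: unknown ha-server-name {name!r}"
               )
           subnet_to_relationship[subnet["id"]] = peer_to_relationship[name]
       return subnet_to_relationship


   # Only the fields relevant to the check are kept.
   relationships = [
       {"this-server-name": "server2",
        "peers": [{"name": "server1"}, {"name": "server2"}]},
       {"this-server-name": "server4",
        "peers": [{"name": "server3"}, {"name": "server4"}]},
   ]
   subnets = [
       {"id": 1, "user-context": {"ha-server-name": "server2"}},
       {"id": 2, "user-context": {"ha-server-name": "server4"}},
   ]

   peers = index_peers(relationships)
   print(map_subnets(subnets, peers))

Because the lookup goes through peer names, using ``server1`` instead of
``server2`` for the first subnet yields the same relationship, which
mirrors the rule that any peer name in a relationship identifies it.
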