Networkd

The xcp-networkd daemon (hereafter simply called “networkd”) is a component in the xapi toolstack that is responsible for configuring network interfaces and virtual switches (bridges) on a host.

The code is in ocaml/networkd.

Principles

  1. Distro-agnostic. Networkd is meant to work on at least CentOS/RHEL as well as Debian/Ubuntu-based distros. It therefore should not use any network configuration features specific to those distros.

  2. Stateless. By default, networkd should not maintain any state. If you ask networkd anything about a network interface or bridge, or any other network sub-system property (e.g. an IP address), it will always query the underlying system rather than returning any cached state. However, if you want networkd to configure networking at host boot time, then you can ask it to remember the configuration you have set for any interface or bridge you choose.

  3. Idempotent. It should be possible to call any networkd function multiple times without breaking things. For example, calling a function to set an IP address on an interface twice in a row should have the same outcome as calling it just once.

  4. Do no harm. Networkd should only configure what you ask it to configure. This means that it can co-exist with other network managers.

Usage

Networkd is a daemon that is typically started at host-boot time. In the same way as the other daemons in the xapi toolstack, it is controlled by RPC requests. It typically receives requests from the xapi daemon, on behalf of which it configures host networking.

Networkd’s RPC API is fully described by the network_interface.ml file. The API has two main namespaces: Interface and Bridge, which are implemented in two modules in network_server.ml.

In line with other xapi daemons, all API functions take an argument of type debug_info (a string) as their first argument. The debug string appears in any log lines that are produced as a side effect of calling the function.

Network Interface API

The Interface API has functions to query and configure properties of Linux network devices, such as IP addresses, and to bring them up or down. Most Interface functions take a name string as their second argument, which is expected to be the name of the Linux network device. There is also a special function, called Interface.make_config, that is able to configure a number of interfaces at once. It takes an argument called config of type (iface * interface_config_t) list, where iface is an interface name, and interface_config_t is a compound type containing the full configuration for an interface (as far as networkd is able to configure them), currently defined as follows:

type interface_config_t = {
	ipv4_conf: ipv4;
	ipv4_gateway: Unix.inet_addr option;
	ipv6_conf: ipv6;
	ipv6_gateway: Unix.inet_addr option;
	ipv4_routes: (Unix.inet_addr * int * Unix.inet_addr) list;
	dns: Unix.inet_addr list * string list;
	mtu: int;
	ethtool_settings: (string * string) list;
	ethtool_offload: (string * string) list;
	persistent_i: bool;
}

When the function returns, it should have completely configured the interface, and have brought it up. The idempotency principle applies to this function, which means that it can be used to successively modify interface properties; any property that has not changed will effectively be ignored. In fact, Interface.make_config is the main function that xapi uses to configure interfaces, e.g. as a result of a PIF.plug or a PIF.reconfigure_ip call.

Also note the persistent_i property in the interface config. When an interface is made “persistent”, any configuration that is set on it is remembered by networkd, and the interface config is written to disk. When networkd is started, it will read the persistent config and call Interface.make_config on it in order to apply it (see Configuration on Startup below).

The full networkd API should be documented separately somewhere on this site.

Bridge API

The Bridge API functions are all about the management of virtual switches, also known as “bridges”. The shape of the Bridge API roughly follows that of the Open vSwitch in that it treats a bridge as a collection of “ports”, where a port can contain one or more “interfaces”.

NIC bonding and VLANs are all configured on the Bridge level. There are functions for creating and destroying bridges, adding and removing ports, and configuring bonds and VLANs. Like interfaces, bridges and ports are addressed by name in the Bridge functions. Analogous to the Interface function with the same name, there is a Bridge.make_config function, and bridges can be made persistent.

type port_config_t = {
	interfaces: iface list;
	bond_properties: (string * string) list;
	bond_mac: string option;
}
type bridge_config_t = {
	ports: (port * port_config_t) list;
	vlan: (bridge * int) option;
	bridge_mac: string option;
	other_config: (string * string) list;
	persistent_b: bool;
}
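
As an illustration of how these two types fit together, the following sketch builds a configuration for a bonded bridge and a VLAN bridge on top of it. It assumes that iface, port and bridge are plain name strings, and all device names and the bond property shown are made up for the example. The final list has the shape that Bridge.make_config would presumably take, by analogy with Interface.make_config.

```ocaml
(* Assumption: iface, port and bridge are plain name strings. *)
type iface = string
type port = string
type bridge = string

type port_config_t = {
  interfaces: iface list;
  bond_properties: (string * string) list;
  bond_mac: string option;
}

type bridge_config_t = {
  ports: (port * port_config_t) list;
  vlan: (bridge * int) option;
  bridge_mac: string option;
  other_config: (string * string) list;
  persistent_b: bool;
}

(* A bridge with a single port that bonds two NICs.
   The names and the bond property are illustrative only. *)
let xenbr0 : bridge_config_t = {
  ports = [
    ("bond0", {
      interfaces = ["eth0"; "eth1"];
      bond_properties = [("mode", "active-backup")];
      bond_mac = None;
    });
  ];
  vlan = None;
  bridge_mac = None;
  other_config = [];
  persistent_b = true;
}

(* A VLAN bridge for tag 100 whose parent is xenbr0. *)
let xapi1 : bridge_config_t = {
  ports = [];
  vlan = Some ("xenbr0", 100);
  bridge_mac = None;
  other_config = [];
  persistent_b = false;
}

(* A list of named bridge configurations. *)
let config : (bridge * bridge_config_t) list =
  [("xenbr0", xenbr0); ("xapi1", xapi1)]
```

A port with more than one interface models a bond, while a port with exactly one interface models a plain NIC attached to the bridge.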

Backends

Networkd currently has two different backends: the “Linux bridge” backend and the “Open vSwitch” backend. The former is the “classic” backend based on the bridge module that is available in the Linux kernel, plus additional standard Linux functionality for NIC bonding and VLANs. The latter backend is newer and uses the Open vSwitch (OVS) for bridging as well as other functionality. Which backend is currently in use is defined by the file /etc/xensource/network.conf, which is read by networkd when it starts. The choice of backend (currently) only affects the Bridge API: every function in it has a separate implementation for each backend.
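
A minimal sketch of how the backend choice could be read at startup is shown below. The text above does not specify the file format, so the sketch assumes that the file holds a single keyword, and the keywords "bridge" and "openvswitch" are assumptions as well. The type and function names are hypothetical and are not taken from the networkd sources.

```ocaml
type backend = Linux_bridge | Open_vswitch

(* Assumption: the file contains one keyword, possibly surrounded by
   whitespace. The accepted keywords are assumed, not documented above. *)
let backend_of_string (contents : string) : backend option =
  match String.lowercase_ascii (String.trim contents) with
  | "bridge" -> Some Linux_bridge
  | "openvswitch" -> Some Open_vswitch
  | _ -> None

(* Read the whole file and interpret it. Returns None when the file
   cannot be opened or its contents are not recognised. *)
let read_backend (path : string) : backend option =
  match open_in path with
  | exception Sys_error _ -> None
  | ic ->
    let contents =
      Fun.protect
        ~finally:(fun () -> close_in_noerr ic)
        (fun () -> really_input_string ic (in_channel_length ic))
    in
    backend_of_string contents
```

Returning an option leaves it to the caller to decide what to do when the file is missing or unreadable, for example falling back to a default backend.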

Low-level Interfaces

Networkd uses standard networking commands and interfaces that are available in most modern Linux distros, rather than relying on any distro-specific network tools (see the distro-agnostic principle). These are tools such as ip (iproute2), dhclient and brctl, as well as the sysfs file system and netlink sockets. To control the OVS, the ovs-* command line tools are used. All low-level functions are called from network_utils.ml.

Configuration on Startup

Networkd, periodically as well as on shutdown, writes the current configuration of all bridges and interfaces (see above) in a JSON format to a file called networkd.db (currently in /var/lib/xcp). The contents of the file are completely described by the following type:

type config_t = {
	interface_config: (iface * interface_config_t) list;
	bridge_config: (bridge * bridge_config_t) list;
	gateway_interface: iface option;
	dns_interface: iface option;
}

The gateway_interface and dns_interface in the config are global host-level options to define from which interfaces the default gateway and DNS configuration is taken. This is especially important when multiple interfaces are configured by DHCP.

When networkd starts up, it first reads network.conf to determine the network backend. It subsequently attempts to parse networkd.db, and tries to call Bridge.make_config and Interface.make_config on it, with a special option to only apply the config for persistent bridges and interfaces, as well as bridges related to those (for example, if a VLAN bridge is configured, then its parent bridge must be configured as well).
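
The selection of bridges to apply at startup can be sketched as follows. This is only an illustration of the rule described above, not the actual networkd code: the record is reduced to the two fields that matter here, bridge is assumed to be a name string, and the function name bridges_to_apply is hypothetical. Putting parent bridges before VLAN bridges is a choice made in this sketch.

```ocaml
(* Assumption: a bridge is identified by its name. *)
type bridge = string

(* Reduced to the two fields relevant for this sketch. *)
type bridge_config_t = {
  vlan: (bridge * int) option;  (* parent bridge and VLAN tag *)
  persistent_b: bool;
}

(* Keep the persistent bridges, plus the parent of every persistent
   VLAN bridge. Non-VLAN bridges are returned first so that a parent
   is configured before the VLAN bridges that depend on it. *)
let bridges_to_apply (config : (bridge * bridge_config_t) list) =
  let persistent = List.filter (fun (_, c) -> c.persistent_b) config in
  let parents =
    List.filter_map
      (fun (_, c) ->
         match c.vlan with
         | Some (parent, _) -> Some parent
         | None -> None)
      persistent
  in
  let wanted name c = c.persistent_b || List.mem name parents in
  let selected = List.filter (fun (name, c) -> wanted name c) config in
  let plain, vlans = List.partition (fun (_, c) -> c.vlan = None) selected in
  plain @ vlans
```

For example, given a persistent VLAN bridge xapi1 on a non-persistent parent xenbr0, and an unrelated non-persistent bridge xenbr1, the function returns xenbr0 followed by xapi1.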

Networkd also supports upgrades from older versions of XenServer that used a network configuration script called interface-configure. If networkd.db is not found on startup, then networkd attempts to call this tool (via the /etc/init.d/management-interface script) in order to set up networking at boot time. This is normally followed immediately by a call from xapi instructing networkd to take over.

Finally, if no network config (old or new) is found on disk at all, networkd looks for a XenServer “firstboot” data file, which is written by XenServer’s host installer, and tries to apply it to set up the management interface.

Monitoring

Besides the ability to configure bridges and network interfaces, networkd has facilities for monitoring interfaces and bonds. When networkd starts, a monitor thread is started, which does several things (see network_monitor_thread.ml):

  • Every 5 seconds, it gathers send/receive counters and link state of all network interfaces. It then writes these stats to a shared-memory file, to be picked up by other components such as xcp-rrdd and xapi (see documentation about “xenostats” elsewhere).
  • It monitors NIC bonds, and sends alerts through xapi in case of link state changes within a bond.
  • It uses ip monitor address to watch for IP address changes. When one occurs, it calls xapi (Host.signal_networking_change) so that xapi can update the IP addresses of the PIFs in its database that were configured by DHCP.

Subsections of Networkd

Host Network Device Ordering on Networkd

Purpose

One of the Toolstack’s functions is to maintain a pool of hosts. A pool can be constructed by joining a host into an existing pool. One challenge in this process is determining which pool-wide network a network device on the joining host should connect to.

At first glance, this could be resolved by specifying a mapping between an individual network device and a pool-wide network. However, this approach would be burdensome for administrators when managing many hosts. It would be more efficient if the Toolstack could determine this automatically.

To achieve this, the Toolstack components on two hosts need to independently work out consistent identifications for the host network devices and connect the network devices with the same identification to the same pool-wide network. The identifications on a host can be considered as an order, with each network device assigned a unique position in the order as its identification. Network devices with the same position will connect to the same network.

The assumption

Why can the Toolstack components on two hosts independently work out an expected order without any communication? This is possible only under the assumption that the hosts have identical hardware, firmware, and software, and that network devices are plugged into them in the same way. For example, an administrator will always plug the network devices into the same PCI slot position on multiple hosts if they want these network devices to connect to the same network.

The ordering is considered consistent if the positions of such network devices (plugged into the same PCI slot position) in the generated orders are the same.

The biosdevname

Particularly, when the assumption above holds, a consistent initial order can be worked out on multiple hosts independently with the help of biosdevname. The “all_ethN” policy of the biosdevname utility can generate a device order based on whether the device is embedded or not, PCI cards in ascending slot order, and ports in ascending PCI bus/device/function order breadth-first. Since the hosts are identical, the orders generated by the biosdevname are consistent across the hosts.

The following is an example of biosdevname’s output. The initial order can be derived from the BIOS device field.

# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth1
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...

However, the BIOS device of a particular network device may change with the addition or removal of devices. For example:

# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp4s0
Permanent MAC: EC:F4:BB:E6:D7:BB
Assigned MAC : EC:F4:BB:E6:D7:BB
Bus Info: 0000:04:00.0
...

BIOS device: eth1
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth2
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...

Therefore, the order derived from these values is used solely for determining the initial order and the order of newly added devices.
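
As a rough sketch, the output shown above could be turned into an initial order as follows. The sketch assumes that every relevant line has the form "Key: value", that each device block starts with a "BIOS device" line, and that BIOS device names have the form ethN. The record and function names are hypothetical.

```ocaml
type dev = {
  bios_device : string;
  kernel_name : string;
  mac : string;
  bus_info : string;
}

(* Split a line at its first colon. Values such as MAC addresses contain
   further colons, which stay in the value. *)
let parse_line (line : string) : (string * string) option =
  match String.index_opt line ':' with
  | None -> None
  | Some i ->
    let key = String.trim (String.sub line 0 i) in
    let value =
      String.trim (String.sub line (i + 1) (String.length line - i - 1))
    in
    Some (key, value)

(* Build a device from the key/value pairs of one block, if complete. *)
let dev_of_fields fields =
  match
    ( List.assoc_opt "BIOS device" fields,
      List.assoc_opt "Kernel name" fields,
      List.assoc_opt "Permanent MAC" fields,
      List.assoc_opt "Bus Info" fields )
  with
  | Some bios_device, Some kernel_name, Some mac, Some bus_info ->
    Some { bios_device; kernel_name; mac; bus_info }
  | _ -> None

(* Collect the blocks; a "BIOS device" line starts a new block. *)
let parse (output : string) : dev list =
  let flush current acc =
    match dev_of_fields current with Some d -> d :: acc | None -> acc
  in
  let current, acc =
    List.fold_left
      (fun (current, acc) line ->
         match parse_line line with
         | Some (("BIOS device", _) as kv) -> ([kv], flush current acc)
         | Some kv -> (kv :: current, acc)
         | None -> (current, acc))
      ([], [])
      (String.split_on_char '\n' output)
  in
  List.rev (flush current acc)

(* "eth12" gives Some 12; anything else gives None. *)
let eth_index (name : string) : int option =
  if String.length name > 3 && String.sub name 0 3 = "eth" then
    int_of_string_opt (String.sub name 3 (String.length name - 3))
  else None

(* Sort by the ethN index and return (mac, pci, position) triples. *)
let initial_order (devs : dev list) : (string * string * int) list =
  devs
  |> List.filter_map (fun d ->
         Option.map (fun i -> (i, d)) (eth_index d.bios_device))
  |> List.sort (fun (i, _) (j, _) -> compare i j)
  |> List.map (fun (i, d) -> (d.mac, d.bus_info, i))
```

Devices whose BIOS device name does not match ethN are dropped by this sketch; a real implementation would need to decide how to handle them.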

Principles

  • Initially, the order is aligned with PCI slots. This is to make the connection between cabling and order predictable: The network devices in identical PCI slots have the same position. The rationale is that PCI slots are more predictable than MAC addresses and correspond to physical locations.

  • Once an order has been established, it should be kept as stable as possible despite changes to MAC addresses or PCI addresses. The rationale is that the assumption becomes less likely to hold as the hosts undergo updates and maintenance. Therefore, keeping the order stable is the best choice for automatic ordering.

Notation

mac:pci:position
!mac:pci:position

A network device is characterised by

  • MAC address, which is unique.
  • PCI slot, which is not unique: multiple network devices can share a PCI slot. PCI addresses correspond to hardware PCI slots and thus are physically observable.
  • position, the position assigned to this network device by xcp-networkd. At any given time, no position is assigned twice but the sequence of positions may have holes.
  • The !mac:pci:position notation indicates that this position was previously used but is currently free because the device it was assigned to was removed.

On a Linux system, MAC and PCI addresses have specific formats. However, for simplicity, symbolic names are used here: MAC addresses use lowercase letters, PCI addresses use uppercase letters, and positions use numbers.

Scenarios

The initial order

As mentioned above, the biosdevname can be used to generate consistent orders for the network devices on multiple hosts.

current input: a:A   b:D   c:C
initial order: a:A:0 c:C:1 b:D:2

This only works if the assumption of identical hardware, firmware, software, and network device placement holds. The assumption is expected to hold for the majority of use cases.

Otherwise, the order can be generated from a user’s configuration. The user can specify the order explicitly for individual hosts. However, administrators would prefer to avoid this as much as possible when managing many hosts.

user spec:     a::0  c::1  b::2
current input: a:A   b:D   c:C
initial order: a:A:0 c:C:1 b:D:2
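
The user-specified case can be sketched as a lookup from MAC address to position. The representation of the user spec as an association list and the function name order_from_spec are assumptions made for this example.

```ocaml
type dev = { mac : string; pci : string }

(* user_spec maps a MAC address to the position requested by the user.
   Devices that are not mentioned in the spec are left out here; they
   would have to be ordered by some other means. *)
let order_from_spec (user_spec : (string * int) list) (devs : dev list)
  : (dev * int) list =
  devs
  |> List.filter_map (fun d ->
         Option.map (fun pos -> (d, pos)) (List.assoc_opt d.mac user_spec))
  |> List.sort (fun (_, p) (_, q) -> compare p q)
```

With the spec and input from the example above, the result lists a at position 0, c at position 1 and b at position 2, each paired with its current PCI address.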

Keep the order as stable as possible

Once an initial order is created on an individual host, it should be kept as stable as possible across host boot-ups and at runtime. For example, unless there are hardware changes, the position of a network device in the initial order should remain the same regardless of how many times the host is rebooted.

To achieve this, the initial order should be saved persistently on the host’s local storage so it can be referenced in subsequent orderings. When performing another ordering after the initial order has been saved, the position of a currently unordered network device should be determined by finding its position in the last saved order. The MAC address of the network device is a reliable attribute for this purpose, as it is considered unique for each network device globally.

Therefore, the network devices in the saved order should have their MAC addresses saved together, effectively mapping each position to a MAC address. When performing an ordering, the stable position can be found by searching the last saved order using the MAC address.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:D    c:C
new order:     a:A:0  c:C:1  b:D:2

Name labels of the network devices are not considered reliable enough to identify particular devices. For example, if the name labels are determined by the PCI address via systemd, and a firmware update changes the PCI addresses of the network devices, the name labels will also change.

The PCI addresses are not considered reliable either. They may change due to firmware updates, firmware setting changes, or even plugging or unplugging other devices.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:B    c:E
new order:     a:A:0  c:E:1  b:B:2
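
The MAC-based lookup can be sketched as below. The types and the function name reorder_by_mac are hypothetical; the sketch only covers devices whose MAC address is found in the last saved order.

```ocaml
type dev = { mac : string; pci : string }
type saved = { device : dev; position : int }

(* For every current device whose MAC address appears in the last saved
   order, keep the saved position and record the current PCI address. *)
let reorder_by_mac (last : saved list) (current : dev list) : saved list =
  current
  |> List.filter_map (fun d ->
         match List.find_opt (fun s -> s.device.mac = d.mac) last with
         | Some s -> Some { device = d; position = s.position }
         | None -> None)
  |> List.sort (fun x y -> compare x.position y.position)
```

Because only the MAC address is used as the key, a change of PCI address leaves the position untouched, as in the example above.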

Replacement

However, what happens when the MAC address of an unordered network device cannot be found in the last saved order? There are two possible scenarios:

  1. It’s a newly added network device since the last ordering.
  2. It’s a new device that replaces an existing network device.

Replacement is a supported scenario, as an administrator might replace a broken network device with a new one.

This can be recognized by comparing the PCI address where the network device is located. Therefore, the PCI address of each network device should also be saved in the order. In this case, searching the PCI address in the order results in one of the following:

  1. Not found: This means the PCI address was not occupied during the last ordering, indicating a newly added network device.
  2. Found with a MAC address, but another device with this MAC address is still present in the system: This suggests that the PCI address of an existing network device (with the same MAC address) has changed since the last ordering. This may be caused by a device move or by something else, such as a firmware update. In this case, the current unordered network device is considered newly added.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:B    c:C    d:D
new order:     a:A:0  c:C:1  b:B:2  d:D:3

  3. Found with a MAC address, and no current devices have this MAC address: This indicates that a new network device has replaced the old one in the same PCI slot. The replacing network device should be assigned the same position as the replaced one.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    c:C    d:D
new order:     a:A:0  c:C:1  d:D:2
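
The decision described in this section can be sketched as a classification function. The names below are hypothetical, and the sketch ignores the case of several devices sharing one PCI address (it simply takes the first saved entry for that address).

```ocaml
type dev = { mac : string; pci : string }
type saved = { device : dev; position : int }

type verdict =
  | Known of int        (* MAC found in the last order: keep the position *)
  | Replacement of int  (* old occupant of the PCI slot is gone: inherit it *)
  | Newly_added         (* needs a fresh position *)

let classify (last : saved list) (current : dev list) (d : dev) : verdict =
  match List.find_opt (fun s -> s.device.mac = d.mac) last with
  | Some s -> Known s.position
  | None ->
    match List.find_opt (fun s -> s.device.pci = d.pci) last with
    | None -> Newly_added
    | Some s ->
      (* The slot was occupied before. Check whether its old occupant
         is still present somewhere in the system. *)
      let still_present =
        List.exists (fun c -> c.mac = s.device.mac) current
      in
      if still_present then Newly_added else Replacement s.position
```

In the two examples above, device d at PCI address D is classified as Newly_added when b is still present at another address, and as Replacement 2 when b is no longer present.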

Removed devices

A network device may have been removed or unplugged since the last ordering. Its position, MAC address, and PCI address are saved for future reference, and its position will be reserved. This means there may be a gap in the order: a position that was previously assigned to a network device is now vacant because the device has been removed.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:D
new order:     a:A:0  !c:C:1 b:D:2
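
Reserving the position of a removed device can be sketched by keeping its entry and flagging it as absent. The present flag and the function names are assumptions made for this example; show prints entries in the notation used on this page.

```ocaml
type dev = { mac : string; pci : string }
type entry = { device : dev; position : int; present : bool }

(* Keep every saved entry, but flag those whose MAC address is no
   longer found among the current devices. *)
let mark_removed (last : entry list) (current : dev list) : entry list =
  List.map
    (fun e ->
       let present = List.exists (fun d -> d.mac = e.device.mac) current in
       { e with present })
    last

(* Render an entry as mac:pci:position, with a leading '!' if removed. *)
let show (e : entry) : string =
  Printf.sprintf "%s%s:%s:%d"
    (if e.present then "" else "!")
    e.device.mac e.device.pci e.position
```

Keeping the removed entry in the saved order is what allows its position to stay reserved in later orderings.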

Newly added devices

As long as the assumption holds, newly added devices since the last ordering can be assigned positions consistently across multiple hosts. Newly added devices will not be assigned the positions reserved for removed devices.

last order:    a:A:0 !c:C:1  d:D:2
current input: a:A           d:D    e:E
new order:     a:A:0 !c:C:1  d:D:2  e:E:3
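
One simple way to avoid reusing reserved positions is to always assign positions above the highest one in the saved order, whether that position belongs to a present or a removed device. The sketch below takes this approach as an assumption; it also assumes that the new devices are passed in the relative order derived from biosdevname.

```ocaml
type entry = { mac : string; pci : string; position : int; present : bool }

(* Append new devices, given as (mac, pci) pairs, after the highest
   position ever recorded, so that reserved positions are never reused. *)
let append_new (saved : entry list) (new_devs : (string * string) list)
  : entry list =
  let next = 1 + List.fold_left (fun m e -> max m e.position) (-1) saved in
  saved
  @ List.mapi
      (fun i (mac, pci) -> { mac; pci; position = next + i; present = true })
      new_devs
```

With the saved order from the example above, a new device e at PCI address E receives position 3, and the reserved position 1 stays vacant.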

Removed and then added back

It is a supported scenario for a removed device to be plugged back in, regardless of whether it is in the same PCI slot or not. This can be recognized by searching for the device in the saved removed devices using its MAC address. The reserved position will be reassigned to the device when it is added back.

last order:    a:A:0 !c:C:1 d:D:2
current input: a:A   c:F    d:D   e:E
new order:     a:A:0 c:F:1  d:D:2 e:E:3

Multinic functions

The multinic function is a special kind of network device. When this type of physical device is plugged into a PCI slot, multiple network devices are reported at a single PCI address. Additionally, the number of reported network devices may change due to driver updates.

current input: a:A b:A c:A d:A
initial order: a:A:0 b:A:1 c:A:2 d:A:3

As long as the assumption holds, the initial order of these devices can be generated automatically and kept stable by using MAC addresses to identify individual devices. However, biosdevname cannot reliably generate an order for all devices reported at one PCI address. For devices located at the same PCI address, their MAC addresses are used to generate the initial order.

last order:    a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5
current input: a:A   b:A   c:A   d:A   e:A   f:A   m:M   n:N
new order:     a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 e:A:6 f:A:7

For example, suppose biosdevname generates an order for a multinic function and other non-multinic devices. Within this order, the N devices of the multinic function with MAC addresses mac[1], …, mac[N] are assigned positions pos[1], …, pos[N] correspondingly. biosdevname cannot ensure that the device with mac[1] is always assigned position pos[1]. Instead, it ensures that the entire set of positions pos[1], …, pos[N] remains stable for the devices of the multinic function. Therefore, to ensure the order follows the MAC address order, the devices of the multinic function need to be sorted by their MAC addresses within the set of positions.

last order:    a:A:0 b:A:1 c:A:2 d:A:3 m:M:4
current input: e:A   f:A   g:A   h:A   m:M
new order:     e:A:0 f:A:1 g:A:2 h:A:3 m:M:4
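
Sorting the devices of a multinic function by MAC address within their set of positions can be sketched as follows. The function name is hypothetical, and MAC addresses are compared as plain strings, which assumes they are all written in the same format and letter case.

```ocaml
type ordered = { mac : string; pci : string; position : int }

(* For each PCI address, take the set of positions given to its devices
   and hand them out again in ascending MAC address order. Devices that
   are alone at their PCI address are unaffected. *)
let sort_multinic (devs : ordered list) : ordered list =
  let pcis = List.sort_uniq compare (List.map (fun d -> d.pci) devs) in
  pcis
  |> List.concat_map (fun pci ->
         let group = List.filter (fun d -> d.pci = pci) devs in
         let macs = List.sort compare (List.map (fun d -> d.mac) group) in
         let positions =
           List.sort compare (List.map (fun d -> d.position) group)
         in
         List.map2 (fun mac position -> { mac; pci; position }) macs positions)
  |> List.sort (fun x y -> compare x.position y.position)
```

The set of positions occupied by the multinic function is preserved; only the assignment of MAC addresses to positions within that set changes.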

Rare cases that cannot be handled automatically

In summary, to keep the order stable, the auto-generated order needs to be saved for the next ordering. When performing an automatic ordering for the current network devices, either the MAC address or the PCI address is used to recognize the device that was assigned the same position in the last ordering. If neither the MAC address nor the PCI address can be used to find a position from the last ordering, the device is considered newly added and is assigned a new position.

However, following this sorting logic, the ordering result may not always be as expected. In practice, this can be caused by various rare cases, such as switching an existing network device to connect to another network, performing firmware updates, changing firmware settings, or plugging/unplugging network devices. It is not worth complicating the entire function for these rare cases. Instead, a user’s configuration, as described for the initial order, can be used to handle these rare scenarios.