Host Network Device Ordering on Networkd
Purpose
One of the Toolstack’s functions is to maintain a pool of hosts. A pool can be constructed by joining a host into an existing pool. One challenge in this process is determining which pool-wide network a network device on the joining host should connect to.
At first glance, this could be resolved by specifying a mapping between an individual network device and a pool-wide network. However, this approach would be burdensome for administrators when managing many hosts. It would be more efficient if the Toolstack could determine this automatically.
To achieve this, the Toolstack components on two hosts need to independently work out consistent identifications for the host network devices and connect the network devices with the same identification to the same pool-wide network. The identifications on a host can be considered as an order, with each network device assigned a unique position in the order as its identification. Network devices with the same position will connect to the same network.
The assumption
Why can the Toolstack components on two hosts independently work out an expected order without any communication? This is possible only under the assumption that the hosts have identical hardware, firmware, and software, and that network devices are plugged into them in the same way. For example, an administrator will always plug the network devices into the same PCI slot position on multiple hosts if they want these network devices to connect to the same network.
The ordering is considered consistent if the positions of such network devices (plugged into the same PCI slot position) in the generated orders are the same.
The biosdevname
In particular, when the assumption above holds, a consistent initial order can be
worked out on multiple hosts independently with the help of biosdevname. The
“all_ethN” policy of the biosdevname utility generates a device order with
embedded devices first, then PCI cards in ascending slot order, and
ports in ascending PCI bus/device/function order breadth-first. Since the hosts
are identical, the orders generated by biosdevname are consistent across
the hosts.
An example of biosdevname’s output is shown below. The initial order can
be derived from the BIOS device field.
# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...
BIOS device: eth1
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...
However, the BIOS device of a particular network device may change with the
addition or removal of devices. For example:
# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp4s0
Permanent MAC: EC:F4:BB:E6:D7:BB
Assigned MAC : EC:F4:BB:E6:D7:BB
Bus Info: 0000:04:00.0
...
BIOS device: eth1
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...
BIOS device: eth2
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...
Therefore, the order derived from these values is used solely for determining the initial order and the order of newly added devices.
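As an illustration, the output shown above could be turned into an ordered device list roughly as follows. This is a minimal sketch, not the xcp-networkd implementation: the function names `parse_biosdevname` and `bios_order` and the dictionary keys are hypothetical, and only the fields visible in the sample output are parsed.

```python
import re

# Map the field labels seen in the sample output to (hypothetical) dict keys.
WANTED = {
    "BIOS device": "bios_device",
    "Kernel name": "kernel_name",
    "Permanent MAC": "mac",
    "Bus Info": "pci",
}

def parse_biosdevname(text):
    """Split the text output into one dict per device record."""
    devices = []
    current = None
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if not sep:
            continue  # lines such as "..." carry no field
        key, value = key.strip(), value.strip()
        if key == "BIOS device":
            # A "BIOS device" line starts a new record.
            current = {}
            devices.append(current)
        if current is not None and key in WANTED:
            current[WANTED[key]] = value
    return devices

def bios_order(devices):
    """Sort records by the N in the ethN name (assumes the all_ethN policy)."""
    def index(dev):
        return int(re.fullmatch(r"eth(\d+)", dev["bios_device"]).group(1))
    return sorted(devices, key=index)
```

Sorting on the numeric suffix rather than on the string avoids placing eth10 before eth2.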
Principles
Initially, the order is aligned with PCI slots. This is to make the connection between cabling and order predictable: The network devices in identical PCI slots have the same position. The rationale is that PCI slots are more predictable than MAC addresses and correspond to physical locations.
Once an order has been established, it should be kept as stable as possible despite changes to MAC addresses or PCI addresses. The rationale is that the assumption becomes less likely to hold as hosts undergo updates and maintenance. Therefore, keeping the order stable is the best choice for automatic ordering.
Notation
mac:pci:position
!mac:pci:position
A network device is characterised by:
- MAC address, which is unique.
- PCI slot, which is not unique: multiple network devices can share a PCI slot. PCI addresses correspond to hardware PCI slots and thus are physically observable.
- position, the position assigned to this network device by xcp-networkd. At any given time, no position is assigned twice but the sequence of positions may have holes.
- The !mac:pci:position notation indicates that this position was previously used but is currently free because the device it was assigned to was removed.
On a Linux system, MAC and PCI addresses have specific formats. However, for simplicity, symbolic names are used here: MAC addresses use lowercase letters, PCI addresses use uppercase letters, and positions use numbers.
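For experimenting with the scenarios that follow, the symbolic notation can be read into a small data structure. The sketch below is illustrative only: the `Device` type and the helper names are hypothetical, and the parser handles only the symbolic names used in this document (real MAC and PCI addresses contain colons and would need a different separator).

```python
from typing import NamedTuple, Optional

class Device(NamedTuple):
    mac: str
    pci: str
    position: Optional[int]  # None when no position is assigned yet
    present: bool            # False for the "!" (removed) form

def parse_device(token):
    """Parse one token such as 'a:A', 'a:A:0', '!c:C:1' or 'a::0'."""
    present = not token.startswith("!")
    fields = token.lstrip("!").split(":")
    mac, pci = fields[0], fields[1]
    has_position = len(fields) > 2 and fields[2] != ""
    position = int(fields[2]) if has_position else None
    return Device(mac, pci, position, present)

def parse_order(text):
    """Parse a whitespace-separated list of tokens."""
    return [parse_device(token) for token in text.split()]

def format_device(dev):
    """Render a Device back into the notation."""
    prefix = "" if dev.present else "!"
    suffix = "" if dev.position is None else f":{dev.position}"
    return f"{prefix}{dev.mac}:{dev.pci}{suffix}"
```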
Scenarios
The initial order
As mentioned above, biosdevname can be used to generate consistent orders
for the network devices on multiple hosts.
current input: a:A b:D c:C
initial order: a:A:0 c:C:1 b:D:2
This only works if the assumption of identical hardware, firmware, software, and network device placement holds. The assumption is expected to hold for the majority of use cases.
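In the symbolic notation, the automatic initial order can be reproduced by a simple sort. The sketch below is an assumption-laden stand-in: it sorts on the symbolic PCI name in place of the real biosdevname order, and breaks ties between devices at the same PCI address by MAC address. The function name `initial_order` is hypothetical.

```python
def initial_order(devices):
    """devices: list of (mac, pci) pairs.

    Returns a list of (mac, pci, position) with positions 0, 1, 2, ...
    The symbolic PCI name stands in for the biosdevname order here;
    devices sharing a PCI address are ranked by MAC address.
    """
    ranked = sorted(devices, key=lambda dev: (dev[1], dev[0]))
    return [(mac, pci, position) for position, (mac, pci) in enumerate(ranked)]
```

Applied to the input `a:A b:D c:C`, this yields `a:A:0 c:C:1 b:D:2`.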
If the assumption does not hold, the order can be generated from a user’s configuration. The user can specify the order explicitly for individual hosts. However, administrators would prefer to avoid this as much as possible when managing many hosts.
user spec: a::0 c::1 b::2
current input: a:A b:D c:C
initial order: a:A:0 c:C:1 b:D:2
Keep the order as stable as possible
Once an initial order is created on an individual host, it should be kept as stable as possible across host boot-ups and at runtime. For example, unless there are hardware changes, the position of a network device in the initial order should remain the same regardless of how many times the host is rebooted.
To achieve this, the initial order should be saved persistently on the host’s local storage so it can be referenced in subsequent orderings. When performing another ordering after the initial order has been saved, the position of a currently unordered network device should be determined by finding its position in the last saved order. The MAC address of the network device is a reliable attribute for this purpose, as it is considered unique for each network device globally.
Therefore, the network devices in the saved order should have their MAC addresses saved together, effectively mapping each position to a MAC address. When performing an ordering, the stable position can be found by searching the last saved order using the MAC address.
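The MAC-based lookup described above can be sketched as follows. The tuple layout and the name `reorder_by_mac` are illustrative assumptions; devices whose MAC address is not in the saved order are only collected here, not placed.

```python
def reorder_by_mac(last_order, current):
    """Reuse saved positions for devices whose MAC address is known.

    last_order: list of (mac, pci, position) from the previous ordering.
    current:    list of (mac, pci) for the devices present now.
    Returns (ordered, unmatched): ordered holds (mac, pci, position) with
    the device's current PCI address; unmatched holds devices with an
    unknown MAC address.
    """
    position_by_mac = {mac: position for mac, _pci, position in last_order}
    ordered, unmatched = [], []
    for mac, pci in current:
        if mac in position_by_mac:
            ordered.append((mac, pci, position_by_mac[mac]))
        else:
            unmatched.append((mac, pci))
    ordered.sort(key=lambda dev: dev[2])
    return ordered, unmatched
```

Because only the MAC address is used as the key, a device keeps its position even when its PCI address changes between orderings.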
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:D c:C
new order: a:A:0 c:C:1 b:D:2
Name labels of the network devices are not considered reliable enough to identify particular devices. For example, if the name labels are determined by the PCI address via systemd, and a firmware update changes the PCI addresses of the network devices, the name labels will also change.
The PCI addresses are not considered reliable either. They may change due to firmware updates, firmware setting changes, or even plugging/unplugging other devices.
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:B c:E
new order: a:A:0 c:E:1 b:B:2
Replacement
However, what happens when the MAC address of an unordered network device cannot be found in the last saved order? There are two possible scenarios:
- It’s a newly added network device since the last ordering.
- It’s a new device that replaces an existing network device.
Replacement is a supported scenario, as an administrator might replace a broken network device with a new one.
This can be recognized by comparing the PCI address where the network device is located. Therefore, the PCI address of each network device should also be saved in the order. In this case, searching the PCI address in the order results in one of the following:
- Not found: This means the PCI address was not occupied during the last ordering, indicating a newly added network device.
- Found with a MAC address, but another device with this MAC address is still present in the system: This suggests that the PCI address of an existing network device (with the same MAC address) has changed since the last ordering. This may be caused by moving the device or by other changes such as a firmware update. In this case, the current unordered network device is considered newly added.
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:B c:C d:D
new order: a:A:0 c:C:1 b:B:2 d:D:3
- Found with a MAC address, and no current devices have this MAC address: This indicates that a new network device has replaced the old one in the same PCI slot. The replacing network device should be assigned the same position as the replaced one.
last order: a:A:0 c:C:1 b:D:2
current input: a:A c:C d:D
new order: a:A:0 c:C:1 d:D:2
Removed devices
A network device may have been removed or unplugged since the last ordering. Its position, MAC address, and PCI address are saved for future reference, and its position will be reserved. This means there may be a gap in the order: a position that was previously assigned to a network device is now vacant because the device has been removed.
last order: a:A:0 c:C:1 b:D:2
current input: a:A b:D
new order: a:A:0 !c:C:1 b:D:2
Newly added devices
As long as the assumption holds, newly added devices since the last ordering
can be assigned positions consistently across multiple hosts. Newly added
devices will not be assigned the positions reserved for removed devices.
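One way to guarantee that reserved positions are never reused is to start numbering new devices after the highest saved position, whether that position belongs to a present or a removed device. This is a sketch under that assumption; the name `assign_new_positions` is hypothetical, and the (PCI, MAC) sort again stands in for the biosdevname order of the new devices.

```python
def assign_new_positions(order, new_devices):
    """Append newly added devices after every saved position.

    order:       list of (mac, pci, position, present); entries with
                 present=False are removed devices holding a reservation.
    new_devices: list of (mac, pci) not matched to any saved entry.
    Returns a new list containing the old entries plus the new devices.
    """
    # Reserved positions count too, so holes are never handed out again.
    next_position = max((entry[2] for entry in order), default=-1) + 1
    result = list(order)
    for mac, pci in sorted(new_devices, key=lambda dev: (dev[1], dev[0])):
        result.append((mac, pci, next_position, True))
        next_position += 1
    return result
```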
last order: a:A:0 !c:C:1 d:D:2
current input: a:A d:D e:E
new order: a:A:0 !c:C:1 d:D:2 e:E:3
Removed and then added back
It is a supported scenario for a removed device to be plugged back in, regardless of whether it is in the same PCI slot or not. This can be recognized by searching for the device in the saved removed devices using its MAC address. The reserved position will be reassigned to the device when it is added back.
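Removal and re-addition can both be expressed as an update of a presence flag on the saved entries, keyed by MAC address. The following is a minimal sketch with hypothetical names; devices that are not in the saved order at all are outside its scope.

```python
def refresh_presence(saved, current):
    """Update saved entries against the devices present now.

    saved:   list of (mac, pci, position, present).
    current: dict mapping MAC address to current PCI address.
    A saved device found in current keeps its position, takes its current
    PCI address, and is marked present (this also restores a removed
    device). A saved device missing from current keeps its last known
    PCI address and position and is marked removed.
    """
    refreshed = []
    for mac, pci, position, _present in saved:
        if mac in current:
            refreshed.append((mac, current[mac], position, True))
        else:
            refreshed.append((mac, pci, position, False))
    return refreshed
```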
last order: a:A:0 !c:C:1 d:D:2
current input: a:A c:F d:D e:E
new order: a:A:0 c:F:1 d:D:2 e:E:3
Multinic functions
The multinic function is a special kind of network device. When this type of physical device is plugged into a PCI slot, multiple network devices are reported at a single PCI address. Additionally, the number of reported network devices may change due to driver updates.
current input: a:A b:A c:A d:A
initial order: a:A:0 b:A:1 c:A:2 d:A:3
As long as the assumption holds, the initial order of these devices can be
generated automatically and kept stable by using MAC addresses to identify
individual devices. However, biosdevname cannot reliably generate an order for
all devices reported at one PCI address. For devices located at the same PCI
address, their MAC addresses are used to generate the initial order.
last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5
current input: a:A b:A c:A d:A e:A f:A m:M n:N
new order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 e:A:6 f:A:7
For example, suppose biosdevname generates an order for a multinic function
and other non-multinic devices. Within this order, the N devices of the
multinic function with MAC addresses mac[1], …, mac[N] are assigned positions
pos[1], …, pos[N] correspondingly. biosdevname cannot ensure that the device
with mac[1] is always assigned position pos[1]. Instead, it ensures that the
entire set of positions pos[1], …, pos[N] remains stable for the devices of
the multinic function. Therefore, to ensure the order follows the MAC address
order, the devices of the multinic function need to be sorted by their MAC
addresses within the set of positions.
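The sorting step can be sketched as follows: for each PCI address, the set of positions is kept, and the devices at that address are redistributed over those positions in ascending MAC order. The function name and tuple layout are illustrative assumptions, and MAC addresses are compared as plain strings, which presumes a consistent textual format.

```python
from collections import defaultdict

def sort_multinic_by_mac(order):
    """order: list of (mac, pci, position).

    Within each group of devices sharing a PCI address, reassign the
    group's own positions so that they ascend with the MAC address.
    Devices alone at their PCI address are left unchanged.
    """
    groups = defaultdict(list)
    for mac, pci, position in order:
        groups[pci].append((mac, position))
    result = []
    for pci, members in groups.items():
        macs = sorted(mac for mac, _position in members)
        positions = sorted(position for _mac, position in members)
        # Pair the smallest MAC with the smallest position, and so on.
        result.extend((mac, pci, position)
                      for mac, position in zip(macs, positions))
    return sorted(result, key=lambda dev: dev[2])
```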
last order: a:A:0 b:A:1 c:A:2 d:A:3 m:M:4
current input: e:A f:A g:A h:A m:M
new order: e:A:0 f:A:1 g:A:2 h:A:3 m:M:4
Rare cases that can not be handled automatically
In summary, to keep the order stable, the auto-generated order needs to be saved for the next ordering. When performing an automatic ordering for the current network devices, either the MAC address or the PCI address is used to recognize the device that was assigned the same position in the last ordering. If neither the MAC address nor the PCI address can be used to find a position from the last ordering, the device is considered newly added and is assigned a new position.
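The summary above can be condensed into one function that tries the MAC address first, then the PCI address, and finally falls back to a new position. This is a hedged sketch, not the actual xcp-networkd code: the name `reorder`, the tuple layout, and the (PCI, MAC) tie-break for unmatched devices (standing in for the biosdevname order) are all assumptions.

```python
def reorder(last_order, current):
    """last_order: list of (mac, pci, position), including removed devices.
    current: list of (mac, pci) for the devices present now.
    Returns (mac, pci, position) for the present devices, by position.
    """
    current_macs = {mac for mac, _pci in current}
    position_by_mac = {mac: position for mac, _pci, position in last_order}

    # Positions whose saved device is gone, grouped by saved PCI address.
    # These are the candidates for the replacement rule.
    vacated = {}
    for mac, pci, position in sorted(last_order, key=lambda dev: dev[2]):
        if mac not in current_macs:
            vacated.setdefault(pci, []).append(position)

    next_position = max(position_by_mac.values(), default=-1) + 1
    result, pending = [], []
    for mac, pci in current:
        if mac in position_by_mac:
            # Rule 1: known MAC address keeps its position.
            result.append((mac, pci, position_by_mac[mac]))
        else:
            pending.append((mac, pci))

    for mac, pci in sorted(pending, key=lambda dev: (dev[1], dev[0])):
        if vacated.get(pci):
            # Rule 2: same PCI address as a vanished device, so replace it.
            result.append((mac, pci, vacated[pci].pop(0)))
        else:
            # Rule 3: neither MAC nor PCI matched, so it is newly added.
            result.append((mac, pci, next_position))
            next_position += 1
    return sorted(result, key=lambda dev: dev[2])
```

Each vacated position is consumed at most once, so several devices replacing a multinic function at one PCI address receive distinct positions in MAC order.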
However, following this sorting logic, the ordering result may not always be as expected. In practice, this can be caused by various rare cases, such as switching an existing network device to connect to another network, performing firmware updates, changing firmware settings, or plugging/unplugging network devices. It is not worth complicating the entire function for these rare cases. Instead, the user’s configuration of the initial order can be used to handle these rare scenarios.