Xenopsd

Xenopsd is the VM manager of the XAPI Toolstack. Xenopsd is responsible for:

  • Starting, stopping, rebooting, suspending, resuming, migrating VMs.
  • (Hot-)plugging and unplugging devices such as VBDs, VIFs, vGPUs and PCI devices.
  • Setting up VM consoles.
  • Running bootloaders.
  • Setting QoS parameters.
  • Configuring SMBIOS tables.
  • Handling crashes.
  • etc.

Check out the full features list.

The code is in ocaml/xenopsd.

Principles

  1. Do no harm: Xenopsd should never touch domains/VMs which it hasn’t been asked to manage. This means that it can co-exist with other VM managers such as ‘xl’ and ‘libvirt’.
  2. Be independent: Xenopsd should be able to work in isolation. In particular the loss of some other component (e.g. the network) should not by itself prevent VMs being managed locally (including shutdown and reboot).
  3. Asynchronous by default: Xenopsd exposes task monitoring and offers cancellation for all operations. Xenopsd ensures that the system is always in a manageable state after an operation has been cancelled.
  4. Avoid state duplication: where another component owns some state, Xenopsd will always defer to it. We will avoid creating out-of-sync caches of this state.
  5. Be debuggable: Xenopsd will expose diagnostic APIs and tools to allow its internal state to be inspected and modified.

Subsections of Xenopsd

Xenopsd Architecture

Xenopsd instances run on a host and manage VMs on behalf of clients. This picture shows 3 different Xenopsd instances: 2 named “xenopsd-xc” and 1 named “xenopsd-xenlight”.

Where xenopsd fits on a host

Each instance is responsible for managing a disjoint set of VMs. Clients should never ask more than one Xenopsd to manage the same VM. Managing a VM means:

  • handling start/shutdown/suspend/resume/migrate/reboot
  • allowing devices (disks, nics, PCI cards, vCPUs etc) to be manipulated
  • providing updates to clients when things change (reboots, console becomes available, guest agent says something etc).

For a full list of features, consult the feature list.

Each Xenopsd instance has a unique name on the host. Typical names are

  • org.xen.xcp.xenops.classic
  • org.xen.xcp.xenops.xenlight

A higher-level tool, such as xapi, will associate VMs with individual Xenopsd names.

Running multiple Xenopsd instances is necessary because

  • The virtual hardware supported by different technologies (libxc, libxl, qemu) is expected to be different. We can guarantee the virtual hardware is stable across a rolling upgrade by running the VM on the old Xenopsd. We can then switch Xenopsds later over a VM reboot when the VM admin is happy with it. If the VM admin is unhappy then we can reboot back to the original Xenopsd again.
  • The suspend/resume/migrate image formats will differ across technologies (again libxc vs libxl) and it will be more reliable to avoid switching technology over a migrate.
  • In the future different security domains may have different Xenopsd instances providing even stronger isolation guarantees between domains than is possible today.

Communication with Xenopsd is handled through a Xapi-global library: xcp-idl. This library supports

  • message framing: by default using HTTP but a binary framing format is available
  • message encoding: by default we use JSON but XML is also available
  • RPCs over Unix domain sockets and persistent queues.

This library allows the communication details to be changed without having to change all the Xapi clients and servers.

Xenopsd has a number of “backends” which perform the low-level VM operations such as (on Xen) “create domain”, “hotplug disk” and “destroy domain”. These backends contain all the hypervisor-specific code, including

  • connecting to Xenstore
  • opening the libxc /proc/xen/privcmd interface
  • initialising libxl contexts

The following diagram shows the internal structure of Xenopsd:

Inside xenopsd

At the top of the diagram two client RPCs have been sent: one to start a VM and the other to fetch the latest events. The RPCs are all defined in xcp-idl/xen/xenops_interface.ml. The RPCs are received by the Xenops_server module and decomposed into “micro-ops” (labelled “μ op”). These micro-ops represent actions like

  • create a Xen domain (recall a Xen domain is an empty shell with no memory)
  • build a Xen domain: this is where the kernel or hvmloader is copied in
  • launch a device model: this is where a qemu instance is started (if one is required)
  • hotplug a device: this involves writing the frontend and backend trees to Xenstore
  • unpause a domain (recall a Xen domain is created in the paused state)

Each of these micro-ops is represented by a function call in a “backend plugin” interface. The micro-ops are enqueued in queues, one queue per VM. There is a thread pool (whose size can be changed dynamically by the admin) which pulls micro-ops from the VM queues and calls the corresponding backend function.
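
The following is a minimal sketch of this queueing idea, using hypothetical type and function names (the real implementation is the Redirector module inside Xenops_server): one FIFO of micro-ops per VM, with a worker-pool thread popping the next micro-op for a VM and calling the corresponding backend function.

	(* Sketch only: one queue of micro-ops per VM id; at most one worker
	   thread owns a given VM's queue at a time. *)
	type micro_op = Create | Build | Plug_vbd of string | Unpause

	let queues : (string, micro_op Queue.t) Hashtbl.t = Hashtbl.create 16

	let push vm_id op =
	  let q =
	    match Hashtbl.find_opt queues vm_id with
	    | Some q -> q
	    | None -> let q = Queue.create () in Hashtbl.add queues vm_id q ; q
	  in
	  Queue.push op q

	(* a worker thread would pop the next op for a VM and call the backend *)
	let pop_next vm_id =
	  match Hashtbl.find_opt queues vm_id with
	  | Some q when not (Queue.is_empty q) -> Some (Queue.pop q)
	  | _ -> None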

The active backend (there can only be one backend per Xenopsd instance) executes the micro-ops. The Xenops_server_xen backend in the picture above talks to libxc, libxl and qemu to create and destroy domains. The backend also talks to other Xapi services, in particular

  • it registers datasources with xcp-rrdd, telling xcp-rrdd to measure I/O throughput and vCPU utilisation
  • it reserves memory for new domains by talking to squeezed
  • it makes disks available by calling SMAPIv2 VDI.{at,de}tach, VDI.{,de}activate
  • it launches subprocesses by talking to forkexecd (avoiding problems with accidental fd capture)

Xenopsd backends are also responsible for monitoring running VMs. In the Xenops_server_xen backend this is done by watching Xenstore for

  • @releaseDomain watch events
  • device hotplug status changes

When such an event happens (for example: @releaseDomain sent when a domain requests a reboot) the corresponding operation does not happen inline. Instead the event is rebroadcast upwards to Xenops_server as a signal (for example: “VM id needs some attention”) and a “VM_stat” micro-op is queued in the appropriate queue. Xenopsd does not allow operations to run on the same VM in parallel and enforces this by:

  • pushing all operations pertaining to a VM to the same queue
  • associating each VM queue to at-most-one worker pool thread

The event takes the form “VM id needs some attention” and not “VM id needs to be rebooted” because, by the time the queue is flushed, the VM may well now be in a different state. Perhaps rather than being rebooted it now needs to be shutdown; or perhaps the domain is now in a good state because the reboot has already happened. The signals sent by the backend to the Xenops_server are a bit like event channel notifications in the Xen ring protocols: they are requests to ask someone to perform work, they don’t themselves describe the work that needs to be done.

An implication of this design is that it should always be possible to answer the question, “what operation should be performed to get the VM into a valid state?”. If an operation is cancelled half-way through or if Xenopsd is suddenly restarted, it will ask this question about all the VMs and perform the necessary operations. The operations must be designed carefully to make this work. For example, if Xenopsd is restarted half-way through starting a VM, it must be obvious on restart that the VM should either be forcibly shut down or rebooted to return it to a valid state. Note: we don’t demand that operations are performed as transactions; we only demand that the state in which they leave the system be “sensible” in the sense that the admin will recognise it and be able to continue their work.

Sometimes this can be achieved through careful ordering of side-effects within the operations, taking advantage of artifacts of the system such as:

  • a domain which has not been fully created will have total vCPU time = 0 and will be paused. If we see one of these we should reboot it because it may not be fully intact.

In the absence of “tells” from the system, operations are expected to journal their intentions and support restart after failure.

There are three categories of metadata associated with VMs:

  1. system metadata: this is created as a side-effect of starting VMs. This includes all the information about active disks and nics stored in Xenstore and the list of running domains according to Xen.
  2. VM: this is the configuration to use when the VM is started or rebooted. This is like a “config file” for the VM.
  3. VmExtra: this is the runtime configuration of the VM. When VM configuration is changed it often cannot be applied immediately; instead the VM continues to run with the previous configuration. We need to track the runtime configuration of the VM in order for suspend/resume and migrate to work. It is also useful to be able to tell a client, “on next reboot this value will be x but currently it is x-1”.

VM and VmExtra metadata is stored by Xenopsd in the domain 0 filesystem, in a simple directory hierarchy.
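
As an illustration only (the field names here are assumptions, not the real Xenops_interface definitions), the split between the three categories can be pictured as:

	(* 2. VM: static configuration, like a "config file" for the VM *)
	type vm = {
	  id : string ;
	  vcpus : int ;
	  memory_dynamic_max : int64 ;
	}

	(* 3. VmExtra: the configuration the running domain was actually built
	   with, needed for suspend/resume/migrate and for reporting
	   "on next reboot this value will change" *)
	type vm_extra = {
	  last_booted_vcpus : int ;
	  last_booted_memory : int64 ;
	}

	(* 1. system metadata is not stored by Xenopsd itself: it lives in
	   Xenstore and in Xen's own list of running domains *)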

Subsections of Design

Hooks

There are a number of hook points at which xenopsd may execute certain scripts. These scripts are found in hook-specific directories of the form /etc/xapi.d/<hookname>/. All executable scripts in these directories are run with the following arguments:

<script.sh> -reason <reason> -vmuuid <uuid of VM>

The scripts are executed in filename-order. By convention, the filenames are usually of the form 10resetvdis.
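
The execution model can be sketched as follows (a hypothetical function, not the actual Xenops_hooks code): enumerate the hook directory, sort by filename and run each script with the -reason and -vmuuid arguments.

	(* Sketch: run every script in /etc/xapi.d/<hook>/ in filename order
	   (the executable-bit check is omitted for brevity) *)
	let run_hook ~hook ~reason ~vm_uuid =
	  let dir = Filename.concat "/etc/xapi.d" hook in
	  if Sys.file_exists dir && Sys.is_directory dir then
	    Sys.readdir dir
	    |> Array.to_list
	    |> List.sort compare
	    |> List.iter (fun script ->
	           let path = Filename.concat dir script in
	           Printf.sprintf "%s -reason %s -vmuuid %s"
	             (Filename.quote path) (Filename.quote reason)
	             (Filename.quote vm_uuid)
	           |> Sys.command
	           |> ignore)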

The hook points are:

vm-pre-shutdown
vm-pre-migrate
vm-post-migrate (Dundee only)
vm-pre-start
vm-pre-reboot
vm-pre-resume
vm-post-resume (Dundee only)
vm-post-destroy

and the reason codes are:

clean-shutdown
hard-shutdown
clean-reboot
hard-reboot
suspend
source -- passed to pre-migrate hook on source host
destination -- passed to post-migrate hook on destination (Dundee only)
none

For example, in order to execute a script on VM shutdown, it would be sufficient to create the script in the post-destroy hook point:

/etc/xapi.d/vm-post-destroy/01myscript.sh

containing

#!/bin/bash
echo I was passed $@ > /tmp/output

And when, for example, VM e30d0050-8f15-e10d-7613-cb2d045c8505 is shut down, the script is executed:

[vagrant@localhost ~]$ sudo xe vm-shutdown --force uuid=e30d0050-8f15-e10d-7613-cb2d045c8505
[vagrant@localhost ~]$ cat /tmp/output
I was passed -vmuuid e30d0050-8f15-e10d-7613-cb2d045c8505 -reason hard-shutdown

PVS Proxy OVS Rules

Rule Design

The Open vSwitch (OVS) daemon implements a programmable switch. XenServer uses it to re-direct traffic between three entities:

  • PVS server - identified by its IP address
  • a local VM - identified by its MAC address
  • a local Proxy - identified by its MAC address

The VM and the PVS server are unaware of the Proxy; xapi configures OVS to redirect traffic between PVS and VM so that it passes through the proxy.

OVS uses rules that match packets. Rules are organised in sets called tables. A rule can match a packet and re-submit it into another rule set/table so that the packet can be matched again.

Furthermore, a rule can set registers associated with a packet, which can then be matched in subsequent rules. In that way, a packet can be tagged such that it will only match specific rules downstream that match the tag.

Xapi configures 3 rule sets:

Table 0 - Entry Rules

Rules match UDP traffic between VM/PVS, Proxy/VM, and PVS/VM where the PVS server is identified by its IP and all other components by their MAC address. All packets are tagged with the direction they are going and re-submitted into Table 101 which handles ports.

Table 101 - Port Rules

Rules match UDP traffic going to a specific port of the PVS server and re-submit it into Table 102.

Table 102 - Exit Rules

These rules implement the redirection:

  • Rules matching packets coming from VM to PVS are directed to the Proxy.
  • Rules matching packets coming from PVS to VM are directed to the Proxy.
  • Packets coming from the Proxy are already addressed properly (to the VM) and are handled normally.

Requirements for suspend image framing

We are currently (Dec 2013) undergoing a transition from the ‘classic’ xenopsd backend (built upon calls to libxc) to the ‘xenlight’ backend built on top of the officially supported libxl API.

During this work, we have come across an incompatibility between the suspend images created using the ‘classic’ backend and those created using the new libxl-based backend. This needed to be fixed to enable rolling pool upgrade (RPU) to any new version of XenServer.

Historic ‘classic’ stack

Prior to this work, xenopsd was involved in the construction of the suspend image and we ended up with an image with the following format:

+-----------------------------+
| "XenSavedDomain\n"          |  <-- added by xenopsd-classic
|-----------------------------|
|  Memory image dump          |  <-- libxc
|-----------------------------|
| "QemuDeviceModelRecord\n"   |
|  <size of following record> |  <-- added by xenopsd-classic
|  (a 32-bit big-endian int)  |
|-----------------------------|
| "QEVM"                      |  <-- libxc/qemu
|  Qemu device record         |
+-----------------------------+

We have also been carrying a patch in the Xen patchqueue against xc_domain_restore. This patch (revert_qemu_tail.patch) stopped xc_domain_restore from attempting to read past the memory image dump, at which point xenopsd-classic would take over and restore what it had put there.

Requirements for new stack

For xenopsd-xenlight to work, we need to operate without the revert_qemu_tail.patch since libxl assumes it is operating on top of an upstream libxc.

We need the following relationship between backends: a suspend image created on one backend must be restorable on another, where the backends are old-classic (OC), new-classic (NC) and xenlight (XL). Obviously all suspend images created on any backend must be able to be restored on the same backend:

                OC _______ NC _______ XL
                 \  >>>>>      >>>>>  /
                  \__________________/
                    >>>>>>>>>>>>>>>>

It turns out this was not so simple. After removing the patch against xc_domain_restore and allowing libxc to restore the hvm_buffer_tail, we found that suspend images created with OC (detailed in the previous section) are not of a valid format for two reasons:

i. The "XenSavedDomain\n" was extraneous;

ii. The Qemu signature section (prior to the record) is not of valid form.

It turns out that the section with the Qemu signature can be one of the following:

a. "QemuDeviceModelRecord" (NB. no newline) followed by the record to EOF;
b. "DeviceModelRecord0002" then a uint32_t length followed by record;
c. "RemusDeviceModelState" then a uint32_t length followed by record;

The old-classic (OC) backend not only uses an invalid signature (since it contains a trailing newline) but it also includes a length, and that length is big-endian whereas the uint32_t lengths above are little-endian.

We considered creating a proxy for the fd in the incompatible cases, but since this would need to be a byte-by-byte proxy with 22 bytes of lookahead, this was deemed impractical. Instead we have patched libxc with a much simpler change to understand this legacy format.

Because peek-ahead is not possible on pipes, the patch for (ii) needed to be applied at a point where the hvm tail had been read completely. We piggy-backed on the point after (a) had been detected. At this point the remainder of the fd is buffered (only around 7k) and the magic “QEVM” is expected at the head of this buffer. So we simply added a check: if there was a pesky newline and buffer[5:8] was “QEVM”, we could discard the first 5 bytes:

                              0    1    2    3    4    5   6   7   8
Legacy format from OC:  [...| \n | \x | \x | \x | \x | Q | E | V | M |...]

Required at this point: [...|  Q |  E |  V |  M |...]
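
The check can be illustrated with the following sketch (in OCaml for readability; the real fix is a C patch inside libxc): if the buffered tail starts with a newline and the “QEVM” magic appears at offset 5, the first 5 bytes (newline plus big-endian length) are discarded.

	(* Sketch of the legacy-prefix check described above *)
	let strip_legacy_prefix (buf : bytes) : bytes =
	  let len = Bytes.length buf in
	  if len >= 9
	     && Bytes.get buf 0 = '\n'
	     && Bytes.sub_string buf 5 4 = "QEVM"
	  then Bytes.sub buf 5 (len - 5) (* drop newline + 4-byte length *)
	  else buf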

Changes made

To make the above use-cases work, we have made the following changes:

1. Make new-classic (NC) not restore Qemu tail (let libxc do it)
    xenopsd.git:ef3bf4b

2. Make new-classic use valid signature (b) for future restore images
    xenopsd.git:9ccef3e

3. Make xc_domain_restore in libxc understand legacy xenopsd (OC) format
    xen-4.3.pq.hg:libxc-restore-legacy-image.patch

4. Remove revert-qemu-tail.patch from Xen patchqueue
    xen-4.3.pq.hg:3f0e16f2141e

5. Make xenlight (XL) use "XenSavedDomain\n" start-of-image signature
    xenopsd.git:dcda545

This has made the required use-cases work as follows:

                OC __134__ NC __245__ XL
                 \  >>>>>      >>>>>  /
                  \_______345________/
                    >>>>>>>>>>>>>>>>

And the suspend-resume on same backends work by virtue of:

OC --> OC : Just works
NC --> NC : By 1,2,4
XL --> XL : By 4 (5 is used but not required)

New components

The outputs of the changes above are:

  • A new xenops-xc binary for NC
  • A new xenops-xl binary for XL
  • A new libxenguest.4.3 for both of NC and XL

Future considerations

This should serve as a useful reference when considering making changes to the suspend image in any way.

Suspend image framing format

Example suspend image layout:

+----------------------------+
| 1. Suspend image signature |
+============================+
| 2.0 Xenops header          |
| 2.1 Xenops record          |
+============================+
| 3.0 Libxc header           |
| 3.1 Libxc record           |
+============================+
| 4.0 Qemu header            |
| 4.1 Qemu save record       |
+============================+
| 5.0 End_of_image footer    |
+----------------------------+

A suspend image is now constructed as a series of header-record pairs. The initial signature (1.) is used to determine whether we are dealing with the unstructured, “legacy” suspend image or the new, structured format.

Each header is two 64-bit integers: the first identifies the header type and the second is the length of the record that follows in bytes. The following types have been defined (the ones marked with a (*) have yet to be implemented):

* Xenops       : Metadata for the suspend image
* Libxc        : The result of a xc_domain_save
* Libxl*       : Not implemented
* Libxc_legacy : Marked as a libxc record saved using pre-Xen-4.5
* Qemu_trad    : The qemu save file for the Qemu used in XenServer
* Qemu_xen*    : Not implemented
* Demu*        : Not implemented
* End_of_image : A footer marker to denote the end of the suspend image

Some of the above types do not have the notion of a length, since their lengths cannot be known upfront before saving and the records are delegated to other layers of the stack on restore. Specifically these are the memory image sections, libxc and libxl.
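
Reading one header can be sketched as follows (a sketch only: the byte order and helper names are assumptions, not the actual xenopsd marshalling code):

	(* Each header is two 64-bit integers: a type tag and the length in
	   bytes of the record that follows (little-endian assumed here). *)
	let read_header (ic : in_channel) : int64 * int64 =
	  let b = Bytes.create 16 in
	  really_input ic b 0 16 ;
	  let header_type = Bytes.get_int64_le b 0 in
	  let record_length = Bytes.get_int64_le b 8 in
	  (header_type, record_length)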

Tasks

Some operations performed by Xenopsd are blocking, for example:

  • suspend/resume/migration
  • attaching disks (where the SMAPI VDI.attach/activate calls can perform network I/O)

We want to be able to

  • present the user with an idea of progress (perhaps via a “progress bar”)
  • allow the user to cancel a blocked operation that is taking too long
  • associate logging with the user/client-initiated actions that spawned them

Principles

  • all operations which may block (the vast majority) should be written in an asynchronous style i.e. the operations should immediately return a Task id
  • all operations should guarantee to respond to a cancellation request in a bounded amount of time (30s)
  • when cancelled, the system should always be left in a valid state
  • clients are responsible for destroying Tasks when they are finished with the results

Types

A task has a state, which may be Pending, Completed or Failed:

	type async_result = unit

	type completion_t = {
		duration : float;
		result : async_result option
	}

	type state =
		| Pending of float
		| Completed of completion_t
		| Failed of Rpc.t

When a task is Failed, we associate it with a marshalled exception (a value of type Rpc.t). This exception must be one from the set defined in the Xenops_interface. To see how they are marshalled, see Xenops_server.

From the point of view of a client, a Task has the following immutable type (which can be queried with Task.stat):

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
	}

where

  • id is a unique (integer) id generated by Xenopsd. This is how a Task is represented to clients
  • dbg is a client-provided debug key which will be used in log lines, allowing lines from the same Task to be associated together
  • ctime is the creation time
  • state is the current state (Pending/Completed/Failed)
  • subtasks lists logical internal sub-operations for debugging
  • debug_info includes miscellaneous key/value pairs used for debugging

Internally, Xenopsd uses a mutable record type to track Task state. This is broadly similar to the interface type except

  • the state is mutable: this allows Tasks to complete
  • the task contains a “do this now” thunk
  • there is a “cancelling” boolean which is toggled to request a cancellation.
  • there is a list of cancel callbacks
  • there are some fields related to “cancel points”

Persistence

The Tasks are intended to represent activities associated with in-memory queues and threads. Therefore the active Tasks are kept in memory in a map, and will be lost over a process restart. This is desirable since we will also lose the queued items and the threads, so there is no need to resync on start.

Note that every operation must ensure that the state of the system is recoverable on restart by not leaving it in an invalid state. It is not necessary to either guarantee to complete or roll-back a Task. Tasks are not expected to be transactional.

Lifecycle of a Task

All Tasks returned by API functions are created as part of the enqueue functions: queue_operation_*. Even operations which are performed internally are normally wrapped in Tasks by the function immediate_operation.

A queued operation will be processed by one of the queue worker threads. It will

  • set the thread-local debug key to the Task.dbg
  • call task.Xenops_task.run, taking care to catch exceptions and update the task.Xenops_task.state
  • unset the thread-local debug key
  • generate an event on the Task to provoke clients to query the current state.

Task implementations must update their progress as they work. For the common case of a compound operation like VM_start, which is decomposed into multiple “micro-ops” (e.g. VM_create, VM_build), there is a useful helper function perform_atomics which divides the progress ‘bar’ into sections, where each “micro-op” can have a different size (weight). A progress callback function is passed into each Xenopsd backend function so it can be updated with fine granularity. For example, note the arguments to B.VM.save.
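
The weighting idea behind perform_atomics can be sketched like this (a hypothetical helper, not the real implementation): each micro-op owns a share of the progress bar proportional to its weight.

	(* Run a list of (weight, micro-op) pairs, reporting overall progress
	   as the fraction of total weight completed so far. *)
	let run_with_progress ~set_progress weighted_ops =
	  let total = List.fold_left (fun acc (w, _) -> acc +. w) 0. weighted_ops in
	  List.fold_left
	    (fun completed (weight, run_op) ->
	      run_op () ;
	      let completed = completed +. weight in
	      set_progress (completed /. total) ;
	      completed)
	    0. weighted_ops
	  |> ignore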

Clients are expected to destroy Tasks they are responsible for creating. Xenopsd cannot do this on their behalf because it does not know if they have successfully queried the Task status/result.

When Xenopsd is a client of itself, it will take care to destroy the Task properly, for example see immediate_operation.

Cancellation

The goal of cancellation is to unstick a blocked operation and to return the system to some valid state, not to any particular valid state. Xenopsd does not treat operations as transactions; when an operation is cancelled it may

  • fully complete (e.g. if it was about to do this anyway)
  • fully abort (e.g. if it had made no progress)
  • enter some other valid state (e.g. if it had gotten half way through)

Xenopsd will never leave the system in an invalid state after cancellation.

Every Xenopsd operation should unblock and return the system to a valid state within a reasonable amount of time after a cancel request. This should be as quick as possible, but up to 30s may be acceptable. Bear in mind that a human is probably impatiently watching a UI that says “please wait” and has no notion of progress itself. Keep it quick!

Cancellation is triggered by TASK.cancel which calls cancel. This

  • sets the cancelling boolean
  • calls all registered cancel callbacks

Implementations respond to cancellation via the registered cancel callbacks.

Xenopsd’s libxc backend can block in 2 different ways, and therefore has 2 different types of cancel callback:

  1. cancellable Xenstore watches
  2. cancellable subprocesses

Xenstore watches are used for device hotplug and unplug. Xenopsd has to wait for the backend or for a udev script to do something. If that blocks, we need a way to cancel the watch. The easiest way to cancel a watch is to watch an additional path (a “cancel path”) and delete it, see cancellable_watch. The “cancel paths” are placed within the VM’s Xenstore directory to ensure that cleanup code which does xenstore-rm will automatically “cancel” all outstanding watches. Note that we trigger a cancel by deleting the path rather than creating it, to avoid racing with the cleanup delete and leaving orphaned Xenstore entries.

Subprocesses are used for suspend/resume/migrate. Xenopsd hands file descriptors to libxenguest by running a subprocess and passing the fds to it. Xenopsd therefore gets the process id and can send it a signal to cancel it. See Cancellable_subprocess.run.
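
The subprocess case can be sketched as follows (a hypothetical wrapper; the real code is Cancellable_subprocess.run): start the helper, register a cancel callback that signals it, then wait for it to exit.

	let run_cancellable ~register_cancel_callback prog args =
	  let pid =
	    Unix.create_process prog
	      (Array.of_list (prog :: args))
	      Unix.stdin Unix.stdout Unix.stderr
	  in
	  (* a cancel request simply signals the helper process *)
	  register_cancel_callback (fun () -> Unix.kill pid Sys.sigterm) ;
	  let _, status = Unix.waitpid [] pid in
	  status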

Testing with cancel points

Cancellation is difficult to test, as it is completely asynchronous. Therefore Xenopsd has some built-in cancellation testing infrastructure known as “cancel points”. A “cancel point” is a point in the code where a Cancelled exception could be thrown, either by checking the cancelling boolean or as a side-effect of a cancel callback. The check_cancelling function increments a counter every time it passes one of these points, and this value is returned to clients in the Task.debug_info.

A test harness runs a series of operations. Each operation is first run all the way through to completion to discover the total number of cancel points. The operation is then re-run with a request to cancel at a particular point. The test then waits for the system to stabilise and verifies that it appears to be in a valid state.
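
The harness logic can be sketched as follows (hypothetical helpers passed in as arguments, not the real harness):

	(* Run once to count the cancel points, then re-run the operation with
	   a cancellation injected at each point in turn and check the system
	   ends up in a valid state. *)
	let test_operation ~run ~run_with_cancel_at ~system_is_valid =
	  let total_cancel_points = run () in
	  for point = 1 to total_cancel_points do
	    run_with_cancel_at point ;
	    assert (system_is_valid ())
	  done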

Preventing Tasks leaking

The client who creates a Task must destroy it when the Task is finished, and they have processed the result. What if a client like xapi is restarted while a Task is running?

We assume that, if xapi is talking to a xenopsd, then xapi completely owns it. Therefore xapi should destroy any completed tasks that it doesn’t recognise.

If a user wishes to manage VMs with xenopsd in parallel with xapi, the user should run a separate xenopsd.

Features

General

  • Pluggable backends including
    • xc: drives Xen via libxc and xenguest
    • simulator: simulates operations for component-testing
  • Supports running multiple instances and backends on the same host, looking after different sets of VMs
  • Extensive configuration via command-line (see manpage) and config file
  • Command-line tool for easy VM administration and troubleshooting
  • User-settable degree of concurrency to get VMs started quickly

VMs

  • VM start/shutdown/reboot
  • VM suspend/resume/checkpoint/migrate
  • VM pause/unpause
  • VM s3suspend/s3resume
  • customisable SMBIOS tables for OEM-locked VMs
  • hooks for 3rd party extensions:
    • pre-start
    • pre-destroy
    • post-destroy
    • pre-reboot
  • per-VM xenguest replacement
  • suppression of VM reboot loops
  • live vCPU hotplug and unplug
  • vCPU to pCPU affinity setting
  • vCPU QoS settings (weight and cap for the Xen credit2 scheduler)
  • DMC memory-ballooning support
  • support for storage driver domains
  • live update of VM shadow memory
  • guest-initiated disk/nic hotunplug
  • guest-initiated disk eject
  • force disk/nic unplug
  • support for ‘surprise-removable’ devices
  • disk QoS configuration
  • nic QoS configuration
  • persistent RTC
  • two-way guest agent communication for monitoring and control
  • network carrier configuration
  • port-locking for nics
  • text and VNC consoles over TCP and Unix domain sockets
  • PV kernel and ramdisk whitelisting
  • configurable VM videoram
  • programmable action-after-crash behaviour including: shutting down the VM, taking a crash dump or leaving the domain paused for inspection
  • ability to move nics between bridges/switches
  • advertises the VM memory footprints
  • PCI passthrough
  • support for discrete emulators (e.g. ‘demu’)
  • PV keyboard and mouse
  • qemu stub domains
  • cirrus and stdvga graphics cards
  • HVM serial console (useful for debugging)
  • support for vGPU
  • workaround for ‘spurious page faults’ kernel bug
  • workaround for ‘machine address size’ kernel bug

Hosts

  • CPUid masking for heterogeneous pools: reports true features and current features
  • Host console reading
  • Hypervisor version and capabilities reporting
  • Host CPU querying

APIs

  • versioned JSON-RPC API with feature advertisements
  • clients can disconnect, reconnect and easily resync with the latest VM state without losing updates
  • all operations have task control including
    • asynchronous cancellation: for both subprocesses and xenstore watches
    • progress updates
    • subtasks
    • per-task debug logs
  • asynchronous event watching API
  • advertises VM metrics
    • memory usage
    • balloon driver co-operativeness
    • shadow memory usage
    • domain ids
  • channel passing (via sendmsg(2)) for efficient memory image copying

Operation Walk-Throughs

Let’s trace through interesting operations to see how the whole system works.

  • Starting a VM

    Complete walkthrough of starting a VM, from receiving the request to unpause.

    • Building a VM

      After VM_create, VM_build builds the core of the domain (vCPUs, memory)

      • VM_build μ-op

        Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).

      • Domain.build

        Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.

      • xenguest

        Perform building VMs: Allocate and populate the domain's system memory.

    • Migrating a VM

      Walkthrough of migrating a VM from one host to another.

      • Live Migration

        Sequence diagram of the process of Live Migration.

        Inspiration for other walk-throughs:

        • Shutting down a VM and waiting for it to happen
        • A VM wants to reboot itself
        • A disk is hotplugged
        • A disk refuses to hotunplug
        • A VM is suspended

        Subsections of Walk-throughs

        Walkthrough: Starting a VM

        A Xenopsd client wishes to start a VM. They must first tell Xenopsd the VM configuration to use. A VM configuration is broken down into objects:

        • VM: A device-less Virtual Machine
        • VBD: A virtual block device for a VM
        • VIF: A virtual network interface for a VM
        • PCI: A virtual PCI device for a VM

        Treating devices as first-class objects is convenient because we wish to expose operations on the devices such as hotplug, unplug, eject (for removable media), carrier manipulation (for network interfaces) etc.

        The “add” functions in the Xenopsd interface cause Xenopsd to create the objects:

        In the case of xapi, there is a set of functions which convert between the XenAPI objects and the Xenopsd objects. The two interfaces are slightly different because they have different expected users:

        • the XenAPI has many clients which are updated on long release cycles. The main property needed is backwards compatibility, so that new releases of xapi remain compatible with these older clients. Quite often, we will choose to “grandfather in” some poorly designed interface simply because we wish to avoid imposing churn on 3rd parties.
        • the Xenopsd API clients are all open-source and are part of the xapi-project. These clients can be updated as the API is changed. The main property needed is to keep the interface clean, so that it properly hides the complexity of dealing with Xen from other components.

        The Xenopsd “VM.add” function has code like this:

        	let add' x =
        		debug "VM.add %s" (Jsonrpc.to_string (rpc_of_t x));
        		DB.write x.id x;
        		let module B = (val get_backend () : S) in
        		B.VM.add x;
        		x.id

        This function does 2 things:

        • it stores the VM configuration in the “database”
        • it tells the “backend” that the VM exists

        The Xenopsd database is really a set of config files in the filesystem. All objects belonging to a VM (recall we only have VMs, VBDs, VIFs, PCIs and not stand-alone entities like disks) are placed into a subdirectory named after the VM e.g.:

        # ls /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701
        config	vbd.xvda  vbd.xvdb
        # cat /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701/config
        {"id": "7b719ce6-0b17-9733-e8ee-dbc1e6e7b701", "name": "fedora",
         ...
        }

        Xenopsd doesn’t have as persistent a notion of a VM as xapi: it is expected that all objects are deleted when the host is rebooted. However the objects should be persisted over a simple Xenopsd restart, which is why the objects are stored in the filesystem.

        Aside: it would probably be more appropriate to store the metadata in Xenstore since this has the exact object lifetime we need. This will require a more performant Xenstore to realise.

        Every running Xenopsd process is linked with a single backend. Currently backends exist for:

        • Xen via libxc, libxenguest and xenstore
        • Xen via libxl, libxc and xenstore
        • Xen via libvirt
        • KVM by direct invocation of qemu
        • Simulation for testing

        From here we shall assume the use of the “Xen via libxc, libxenguest and xenstore” (a.k.a. “Xenopsd classic”) backend.

        The backend VM.add function checks whether the VM we have to manage already exists – and if it does then it ensures the Xenstore configuration is intact. This Xenstore configuration is important because at any time a client can query the state of a VM with VM.stat and this relies on certain Xenstore keys being present.

        Once the VM metadata has been registered with Xenopsd, the client can call VM.start. Like all potentially-blocking Xenopsd APIs, this function returns a Task id. Please refer to the Task handling design for a general overview of how tasks are handled.

        Clients can poll the state of a task by calling TASK.stat but most clients will prefer to use the event system instead. Please refer to the Event handling design for a general overview of how events are handled.

        The event model is similar to the XenAPI: clients call a blocking UPDATES.get passing in a token which represents the point in time when the last UPDATES.get returned. The call blocks until some objects have changed state, and these object ids are returned (NB in the XenAPI the current object states are returned). The client must then call the relevant “stat” function, in this case TASK.stat.

        The client will be able to see the task make progress and use this to – for example – populate a progress bar in a UI. If the client needs to cancel the task then it can call TASK.cancel; again see the Task handling design to understand how this is implemented.
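
        A client event loop following this model might look like the sketch below (the helpers updates_get, task_stat and on_update are placeholders standing in for the real UPDATES.get and TASK.stat calls; the exact client signatures are not shown here):

        (* Block on UPDATES.get with the last token, re-stat our Task when
           its id appears in the changed set, and loop with the new token. *)
        let rec watch_task ~updates_get ~task_stat ~on_update ~task_id token =
          let changed_ids, next_token = updates_get token in
          let finished =
            if List.mem task_id changed_ids then on_update (task_stat task_id)
            else false
          in
          if not finished then
            watch_task ~updates_get ~task_stat ~on_update ~task_id next_token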

        When the Task has completed successfully, then calls to *.stat will show:

        • the power state is Paused
        • exactly one valid Xen domain id
        • all VBDs have active = plugged = true
        • all VIFs have active = plugged = true
        • all PCI devices have plugged = true
        • at least one active console
        • a valid start time
        • valid “targets” for memory and vCPU

        Note: before a Task completes, calls to *.stat will show partial updates. E.g. the power state may be Paused, but no disk may have been plugged. UI clients must choose whether they are happy displaying this in-between state or whether they wish to hide it and pretend the whole operation has happened transactionally. In particular, when a client wishes to perform side-effects in response to Xenopsd state changes (for example, to clean up an external resource when a VIF becomes unplugged), it must be very careful to avoid responding to these in-between states. Generally, it is safest to passively report these values without driving things directly from them.

        Note: the Xenopsd implementation guarantees that, if it is restarted at any point during the start operation, on restart the VM state shall be “fixed” by either (i) shutting down the VM; or (ii) ensuring the VM is intact and running.

        In the case of xapi, every Xenopsd Task id is bound one-to-one with a XenAPI task by the function sync_with_task. The function update_task is called when xapi receives a notification that a Xenopsd Task has changed state, and updates the corresponding XenAPI task. Xapi launches exactly one thread per Xenopsd instance (“queue”) to monitor for background events via the function events_watch, while each thread performing a XenAPI call waits for its specific Task to complete via the function event_wait.

        It is the responsibility of the client to call TASK.destroy when the Task is no longer needed. Xenopsd won’t destroy the task because it contains the success/failure result of the operation which is needed by the client.

        What happens when a Xenopsd receives a VM.start request?

        When Xenopsd receives the request it adds it to the appropriate per-VM queue via the function queue_operation. To understand this and other internal details of Xenopsd, consult the architecture description. The queue_operation_int function looks like this:

        let queue_operation_int dbg id op =
        	let task = Xenops_task.add tasks dbg (fun t -> perform op t; None) in
        	Redirector.push id (op, task);
        	task

        The “task” is a record containing Task metadata plus a “do it now” function which will be executed by a thread from the thread pool. The module Redirector takes care of:

        • pushing operations to the right queue
        • ensuring at most one worker thread is working on a VM’s operations
        • reducing the queue size by coalescing items together
        • providing a diagnostics interface

        Once a thread from the worker pool becomes free, it will execute the “do it now” function. In the example above this is perform op t where op is VM_start vm and t is the Task. The function perform_exn has fragments like this:

          | VM_start (id, force) -> (
              debug "VM.start %s (force=%b)" id force ;
              let power = (B.VM.get_state (VM_DB.read_exn id)).Vm.power_state in
              match power with
              | Running ->
                  info "VM %s is already running" id
              | _ ->
                  perform_atomics (atomics_of_operation op) t ;
                  VM_DB.signal id
            )

        Each “operation” (e.g. VM_start vm) is decomposed into “micro-ops” by the function atomics_of_operation where the micro-ops are small building-block actions common to the higher-level operations. Each operation corresponds to a list of “micro-ops”, where there is no if/then/else. Some of the “micro-ops” may be a no-op depending on the VM configuration (for example a PV domain may not need a qemu). In the case of VM_start vm the Xenopsd server starts by calling the functions that decompose the VM_hook_script, VM_create and VM_build micro-ops:

                dequarantine_ops vgpus
              ; [
                  VM_hook_script
                    (id, Xenops_hooks.VM_pre_start, Xenops_hooks.reason__none)
                ; VM_create (id, None, None, no_sharept)
                ; VM_build (id, force)
                ]

        This is the complete sequence of micro-ops:

        1. run the “VM_pre_start” scripts

        The VM_hook_script micro-op runs the corresponding “hook” scripts. The code is all in the Xenops_hooks module and looks for scripts in the hardcoded path /etc/xapi.d.

        2. create a Xen domain

        The VM_create micro-op calls the VM.create function in the backend. In the classic Xenopsd backend, the VM.create_exn function must

        1. check if we’re creating a domain for a fresh VM or resuming an existing one: if it’s a resume then the domain configuration stored in the VmExtra database table must be used
        2. ask squeezed to create a memory “reservation” big enough to hold the VM memory. Unfortunately the domain cannot be created until the memory is free because domain create often fails in low-memory conditions. This means the “reservation” is associated with our “session” with squeezed; if Xenopsd crashes and restarts the reservation will be freed automatically.
        3. create the Domain via the libxc hypercall Xenctrl.domain_create
        4. call generate_create_info() to store the platform data (vCPUs, etc) in the domain’s Xenstore tree. xenguest then uses this in the build phase (see below) to build the domain.
        5. “transfer” the squeezed reservation to the domain such that squeezed will free the memory if the domain is destroyed later
        6. compute and set an initial balloon target depending on the amount of memory reserved (recall we ask for a range between dynamic_min and dynamic_max)
        7. apply the “suppress spurious page faults” workaround if requested
        8. set the “machine address size”
        9. “hotplug” the vCPUs. This operates a lot like memory ballooning – Xen creates lots of vCPUs and then the guest is asked to only use some of them. Every VM therefore starts with the “VCPUs_max” setting and co-operative hotplug is used to reduce the number. Note there is no enforcement mechanism: a VM which cheats and uses too many vCPUs would have to be caught by looking at the performance statistics.

        3. build the domain

        The build phase waits, if necessary, for the Xen memory scrubber to catch up reclaiming memory, runs NUMA placement, sets vCPU affinity and invokes the xenguest to build the system memory layout of the domain. See the walk-through of the VM_build μ-op for details.

        4. mark each VBD as “active”

        VBDs and VIFs are said to be “active” when they are intended to be used by a particular VM, even if the backend/frontend connection hasn’t been established, or has been closed. If someone calls VBD.stat or VIF.stat then the result includes both “active” and “plugged”, where “plugged” is true if the frontend/backend connection is established. For example xapi will set VBD.currently_attached to “active || plugged”. The “active” flag is conceptually very similar to the traditional “online” flag (which is not documented in the upstream Xen tree as of Oct/2014 but really should be) except that on unplug, one would set the “online” key to “0” (false) first before initiating the hotunplug. By contrast the “active” flag is set to false after the unplug i.e. “set_active” calls bracket plug/unplug. If the “active” flag were set to false before the unplug attempt then as soon as the frontend/backend connection is removed clients would see the VBD as completely dissociated from the VM – this would be misleading because Xenopsd will not have had time to use the storage API to release locks on the disks. By cleaning up before setting “active” to false, clients can be assured that the disks are now free to be reassigned.

        5. handle non-persistent disks

        A non-persistent disk is one which is reset to a known-good state on every VM start. The VBD_epoch_begin is the signal to perform any necessary reset.

        6. plug VBDs

        The VBD_plug micro-op will plug the VBD into the VM. Every VBD is plugged in a carefully-chosen order. Generally, plug order is important for all types of devices. For VBDs, we must work around the deficiency in the storage interface where a VDI, once attached read/only, cannot be attached read/write. Since it is legal to attach the same VDI with multiple VBDs, we must plug them in such that the read/write VBDs come first. From the guest’s point of view the order we plug them doesn’t matter because they are indexed by the Xenstore device id (e.g. 51712 = xvda).
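
        The ordering constraint can be sketched as follows (illustrative types, not the real Xenopsd ones): sort the VBDs so that read/write attachments are plugged before read-only ones.

        type mode = ReadWrite | ReadOnly

        let plug_order vbds =
          (* read/write VBDs must come first; keep the relative order otherwise *)
          List.stable_sort
            (fun (_, m1) (_, m2) ->
              match (m1, m2) with
              | ReadWrite, ReadOnly -> -1
              | ReadOnly, ReadWrite -> 1
              | _ -> 0)
            vbds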

        The function VBD.plug will

        • call VDI.attach and VDI.activate in the storage API to make the devices ready (start the tapdisk processes etc)
        • add the Xenstore frontend/backend directories containing the block device info
        • add the extra xenstore keys returned by the VDI.attach call that are needed for SCSIid passthrough which is needed to support VSS
        • write the VBD information to the Xenopsd database so that future calls to VBD.stat can be told about the associated disk (this is needed so clients like xapi can cope with CD insert/eject etc)
        • if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

        The Xenstore keys are written by the functions Device.Vbd.add_async and Device.Vbd.add_wait. In a Linux domain (such as dom0) when the backend directory is created, the kernel creates a “backend device”. Creating any device will cause a kernel UEVENT to fire which is picked up by udev. The udev rules run a script whose only job is to stat(2) the device (from the “params” key in the backend) and write the major and minor number to Xenstore for blkback to pick up. (Aside: FreeBSD doesn’t do any of this, instead the FreeBSD kernel module simply opens the device in the “params” key). The script also writes the backend key “hotplug-status=connected”. We currently wait for this key to be written so that later calls to VBD.stat will return with “plugged=true”. If the call returns before this key is written then sometimes we receive an event, call VBD.stat and conclude erroneously that a spontaneous VBD unplug occurred.

        7. mark each VIF as “active”

        This is for the same reason as VBDs are marked “active”.

        8. plug VIFs

        Again, the order matters. Unlike VBDs, there is no read/write read/only constraint and the devices have unique indices (0, 1, 2, …) but Linux kernels have often (always?) ignored the actual index and instead relied on the order of results from the xenstore-ls listing. The order that xenstored returns the items happens to be the order the nodes were created so this means that (i) xenstored must continue to store directories as ordered lists rather than maps (which would be more efficient); and (ii) Xenopsd must make sure to plug the vifs in the same order. Note that relying on ethX device numbering has always been a bad idea but is still common. I bet if you change this, many tests will suddenly start to fail!

        The function VIF.plug_exn will

        • compute the port locking configuration required and write this to a well-known location in the filesystem where it can be read from the udev scripts. This really should be written to Xenstore instead, since this scheme doesn’t work with driver domains.
        • add the Xenstore frontend/backend directories containing the network device info
        • write the VIF information to the Xenopsd database so that future calls to VIF.stat can be told about the associated network
        • if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

        Similarly to the VBD case, the function Device.Vif.add will write the Xenstore keys and wait for the “hotplug-status=connected” key. We do this because we cannot apply the port locking rules until the backend device has been created, and we cannot know the rules have been applied until after the udev script has written the key. If we didn’t wait for it then the VM might execute without all the port locking properly configured.

        9. create the device model

        The VM_create_device_model micro-op will create a qemu device model if

        • the VM is HVM; or
        • the VM uses a PV keyboard or mouse (since only qemu currently has backend support for these devices).

        The function VM.create_device_model_exn will

        • (if using a qemu stubdom) create and build the qemu domain
        • compute the necessary qemu arguments and launch it.

        Note that qemu (aka the “device model”) is created after the VIFs and VBDs have been plugged but before the PCI devices have been plugged. Unfortunately qemu traditional infers the needed emulated hardware by inspecting the Xenstore VBD and VIF configuration and assuming that we want one emulated device per PV device, up to the natural limits of the emulated buses (i.e. there can be at most 4 IDE devices: {primary,secondary}{master,slave}). Not only does this create an ordering dependency that needn’t exist – and which impacts migration downtime – but it also completely ignores the plain fact that, on a Xen system, qemu can be in a different domain than the backend disk and network devices. This hack only works because we currently run everything in the same domain. There is an option (off by default) to list the emulated devices explicitly on the qemu command-line. If we switch to this by default then we ought to be able to start up qemu early, as soon as the domain has been created (qemu will need to know the domain id so it can map the I/O request ring).

        10. plug PCI devices

        PCI devices are treated differently to VBDs and VIFs. If we are attaching the device to an HVM guest then instead of relying on the traditional Xenstore frontend/backend state machine we instead send RPCs to qemu requesting they be hotplugged. Note the domain is paused at this point, but qemu still supports PCI hotplug/unplug. The reasons why this doesn’t follow the standard Xenstore model are known only to the people who contributed this support to qemu. Again the order matters because it determines the position of the virtual device in the VM.

        Note that Xenopsd doesn’t know anything about the PCI devices; concepts such as “GPU groups” belong to higher layers, such as xapi.

        11. mark the domain as alive

        A design principle of Xenopsd is that it should tolerate failures such as being suddenly restarted. It guarantees to always leave the system in a valid state, in particular there should never be any “half-created VMs”. We achieve this for VM start by exploiting the mechanism which is necessary for reboot. When a VM wishes to reboot it causes the domain to exit (via SCHEDOP_shutdown) with a “reason code” of “reboot”. When Xenopsd sees this event, a VM_check_state operation is queued. This operation calls VM.get_domain_action_request to ask the question, “what needs to be done to make this VM happy now?”. The implementation checks the domain state for shutdown codes and also checks a special Xenopsd Xenstore key. When Xenopsd creates a Xen domain it sets this key to “reboot” (meaning “please reboot me if you see me”) and when Xenopsd finishes starting the VM it clears this key. This means that if Xenopsd crashes while starting a VM, the new Xenopsd will conclude that the VM needs to be rebooted and will clean up the current domain and create a fresh one.
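
        The decision logic can be sketched as follows (hypothetical and much simplified compared to the real VM.get_domain_action_request):

        type action = Reboot | Shutdown | Nothing

        let domain_action_request ~shutdown_code ~xenstore_mark =
          match (shutdown_code, xenstore_mark) with
          | Some `Reboot, _ -> Reboot
          | Some `Poweroff, _ -> Shutdown
          | None, Some "reboot" -> Reboot (* VM.start never finished: recreate *)
          | _ -> Nothing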

        12. unpause the domain

        A Xenopsd VM.start will always leave the domain paused, so strictly speaking this is a separate “operation” queued by the client (such as xapi) after the VM.start has completed. The function VM.unpause is reassuringly simple:

        		if di.Xenctrl.total_memory_pages = 0n then raise (Domain_not_built);
        		Domain.unpause ~xc di.Xenctrl.domid;
        		Opt.iter
        			(fun stubdom_domid ->
        				Domain.unpause ~xc stubdom_domid
        			) (get_stubdom ~xs di.Xenctrl.domid)

        Building a VM

        flowchart
        subgraph xenopsd VM_build[xenopsd:&nbsp;VM_build&nbsp;micro#8209;op]
        direction LR
        VM_build --> VM.build
        VM.build --> VM.build_domain
        VM.build_domain --> VM.build_domain_exn
        VM.build_domain_exn --> Domain.build
        click VM_build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
        click VM.build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
        click VM.build_domain "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
        click VM.build_domain_exn "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
        click Domain.build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        end

        Walk-through documents for the VM_build phase:

        • VM_build μ-op

          Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).

        • Domain.build

          Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.

        • xenguest

          Perform building VMs: Allocate and populate the domain's system memory.

        Subsections of Building a VM

        VM_build micro-op

        Overview

        On Xen, Xenctrl.domain_create creates an empty domain and returns the domain ID (domid) of the new domain to xenopsd.

        In the build phase, the xenguest program is called to create the system memory layout of the domain, set vCPU affinity and a lot more.

        The VM_build micro-op collects the VM build parameters and calls VM.build, which calls VM.build_domain, which calls VM.build_domain_exn which calls Domain.build:

        flowchart
        subgraph xenopsd VM_build[xenopsd:&nbsp;VM_build&nbsp;micro#8209;op]
        direction LR
        VM_build --> VM.build
        VM.build --> VM.build_domain
        VM.build_domain --> VM.build_domain_exn
        VM.build_domain_exn --> Domain.build
        click VM_build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
        click VM.build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
        click VM.build_domain "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
        click VM.build_domain_exn "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
        click Domain.build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        end

        The function VM.build_domain_exn must:

        1. Run pygrub (or eliloader) to extract the kernel and initrd, if necessary

        2. Call Domain.build to

          • optionally run NUMA placement and
          • invoke xenguest to set up the domain memory.

          See the walk-through of the Domain.build function for more details on this phase.

        3. Apply the cpuid configuration

        4. Store the current domain configuration on disk – it’s important to know the difference between the configuration you started with and the configuration you would use after a reboot because some properties (such as maximum memory and vCPUs) are fixed on create.

        Domain.build

        Overview

        flowchart LR
        subgraph xenopsd VM_build[
          xenopsd&nbsp;thread&nbsp;pool&nbsp;with&nbsp;two&nbsp;VM_build&nbsp;micro#8209;ops:
          During&nbsp;parallel&nbsp;VM_start,&nbsp;Many&nbsp;threads&nbsp;run&nbsp;this&nbsp;in&nbsp;parallel!
        ]
        direction LR
        build_domain_exn[
          VM.build_domain_exn
          from thread pool Thread #1
        ]  --> Domain.build
        Domain.build --> build_pre
        build_pre --> wait_xen_free_mem
        build_pre -->|if NUMA/Best_effort| numa_placement
        Domain.build --> xenguest[Invoke xenguest]
        click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
        click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
        click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
        click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
        click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
        
        build_domain_exn2[
          VM.build_domain_exn
          from thread pool Thread #2]  --> Domain.build2[Domain.build]
        Domain.build2 --> build_pre2[build_pre]
        build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem]
        build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement]
        Domain.build2 --> xenguest2[Invoke xenguest]
        click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
        click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
        click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
        click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
        click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
        end

VM.build_domain_exn calls Domain.build, which in turn calls:

• build_pre to prepare the build of the VM. If the xe configuration option numa_placement is set to Best_effort, build_pre invokes the NUMA placement algorithm.
• xenguest to invoke the xenguest program that sets up the domain's system memory.

        build_pre: Prepare building the VM

        Domain.build calls build_pre (which is also used for VM restore) to:

1. Call wait_xen_free_mem to wait, if necessary, for the Xen memory scrubber to catch up reclaiming memory (a self-contained sketch of this loop follows after this list). It

  1. calls Xenctrl.physinfo, which returns:
    • free_pages - the free and already scrubbed pages (available)
    • scrub_pages - the not yet scrubbed pages (not yet available)
  2. repeats this, until a timeout, as long as free_pages is lower than the number of required pages
    • unless scrub_pages is 0 (no scrubbing left to do)

  Note: free_pages counts system-wide memory, not memory specific to a NUMA node. Because this check is not NUMA-aware, in case of a temporary node-specific memory shortage it is not sufficient to prevent the VM from being spread over all NUMA nodes. It is planned to resolve this issue by claiming NUMA node memory during NUMA placement.

        2. Call the hypercall to set the timer mode

        3. Call the hypercall to set the number of vCPUs

        4. Call the numa_placement function as described in the NUMA feature description when the xe configuration option numa_placement is set to Best_effort (except when the VM has a hard CPU affinity).

          match !Xenops_server.numa_placement with
          | Any ->
              ()
          | Best_effort ->
              log_reraise (Printf.sprintf "NUMA placement") (fun () ->
                  if has_hard_affinity then
                    D.debug "VM has hard affinity set, skipping NUMA optimization"
                  else
                    numa_placement domid ~vcpus
                      ~memory:(Int64.mul memory.xen_max_mib 1048576L)
              )
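
For illustration, here is a self-contained sketch of the polling loop performed by wait_xen_free_mem. The physinfo_sketch record and stub stand in for Xenctrl.physinfo; only the control flow mirrors the real function in ocaml/xenopsd/xc/domain.ml:

  (* Hypothetical stand-in for the result of Xenctrl.physinfo *)
  type physinfo_sketch = {free_pages: int64; scrub_pages: int64}

  let physinfo_sketch () = {free_pages = 0L; scrub_pages = 0L}

  let wait_xen_free_mem_sketch ~required_pages ~timeout_s =
    let deadline = Unix.gettimeofday () +. timeout_s in
    let rec loop () =
      let info = physinfo_sketch () in
      if info.free_pages >= required_pages then
        true (* enough scrubbed memory is available *)
      else if info.scrub_pages = 0L then
        false (* nothing left to scrub: waiting longer will not help *)
      else if Unix.gettimeofday () > deadline then
        false (* timed out while the scrubber was still working *)
      else (
        Unix.sleepf 0.25 ; (* give the scrubber time to catch up *)
        loop ()
      )
    in
    loop ()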

        NUMA placement

build_pre passes the domid, the number of vCPUs and xen_max_mib to the numa_placement function, which runs the algorithm to find the best NUMA placement.

When the algorithm returns a NUMA node to use, numa_placement calls the Xen hypercalls to set the vCPU affinity to this NUMA node:

          let vm = NUMARequest.make ~memory ~vcpus in
          let nodea =
            match !numa_resources with
            | None ->
                Array.of_list nodes
            | Some a ->
                Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
          in
          numa_resources := Some nodea ;
          Softaffinity.plan ~vm host nodea

With Xen's auto_node_affinity feature enabled (the default), setting the vCPU affinity causes the Xen hypervisor to set the NUMA node affinity for memory allocations to match the vCPU affinity of the domain.

        Summary: This passes the information to the hypervisor that memory allocation for this domain should preferably be done from this NUMA node.
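
To illustrate what setting the vCPU affinity to the chosen NUMA node amounts to, the sketch below turns the list of pCPUs of that node into an affinity mask and applies it to every vCPU. The set_affinity parameter is a hypothetical stand-in for the hypercall binding used by xenopsd:

  (* Hedged sketch: apply the chosen node's pCPUs as the affinity mask of
     every vCPU; [set_affinity] is a hypothetical stand-in for the real
     hypercall binding. *)
  let apply_node_affinity ~set_affinity ~vcpus ~node_cpus =
    let max_cpu = List.fold_left max 0 node_cpus in
    let mask = Array.make (max_cpu + 1) false in
    List.iter (fun cpu -> mask.(cpu) <- true) node_cpus ;
    for vcpu = 0 to vcpus - 1 do
      set_affinity vcpu mask (* one call per vCPU of the domain *)
    done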

        Invoke the xenguest program

        With the preparation in build_pre completed, Domain.build calls the xenguest function to invoke the xenguest program to build the domain.

        Notes on future design improvements

        The Xen domain feature flag domain->auto_node_affinity can be disabled by calling xc_domain_node_setaffinity() to set a specific NUMA node affinity in special cases:

        This can be used, for example, when there might not be enough memory on the preferred NUMA node, and there are other NUMA nodes (in the same CPU package) to use (reference).

        xenguest

As part of starting a new domain in VM_build, xenopsd calls xenguest. When multiple domain build threads run in parallel, multiple instances of xenguest also run in parallel:

        flowchart
        subgraph xenopsd VM_build[xenopsd&nbsp;VM_build&nbsp;micro#8209;ops]
        direction LR
        xenopsd1[Domain.build - Thread #1] --> xenguest1[xenguest #1]
        xenopsd2[Domain.build - Thread #2] --> xenguest2[xenguest #2]
        xenguest1 --> libxenguest
        xenguest2 --> libxenguest2[libxenguest]
        click xenopsd1 "../Domain.build/index.html"
        click xenopsd2 "../Domain.build/index.html"
        click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
        click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
        click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
        click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
        libxenguest --> Xen[Xen<br>Hypervisor]
        libxenguest2 --> Xen
        end

        About xenguest

        xenguest is called by the xenopsd Domain.build function to perform the build phase for new VMs, which is part of the xenopsd VM.start operation.

        xenguest was created as a separate program due to issues with libxenguest:

• It wasn't thread-safe: this has been fixed, but it still uses a per-call global struct.
• It had an incompatible licence: it is now licensed under the LGPL.

Although these issues have been addressed, we still shell out to xenguest. The program is currently carried in the patch queue for the Xen hypervisor packages, but could become an individual package once the planned changes to the Xen hypercalls are stabilised.

        Over time, xenguest has evolved to build more of the initial domain state.

        Interface to xenguest

        flowchart
        subgraph xenopsd VM_build[xenopsd&nbsp;VM_build&nbsp;micro#8209;op]
        direction TB
        mode
        domid
        memmax
        Xenstore
        end
mode[--mode hvm_build] --> xenguest
        domid --> xenguest
        memmax --> xenguest
        Xenstore[Xenstore platform data] --> xenguest

        xenopsd must pass this information to xenguest to build a VM:

• The domain type to build for (HVM, PVH or PV).
  • It is passed using the command-line option --mode (e.g. --mode hvm_build for HVM domains).
        • The domid of the created empty domain,
        • The amount of system memory of the domain,
        • A number of other parameters that are domain-specific.

        xenopsd uses the Xenstore to provide platform data:

        • the vCPU affinity
        • the vCPU credit2 weight/cap parameters
        • whether the NX bit is exposed
        • whether the viridian CPUID leaf is exposed
        • whether the system has PAE or not
        • whether the system has ACPI or not
        • whether the system has nested HVM or not
        • whether the system has an HPET or not

        When called to build a domain, xenguest reads those and builds the VM accordingly.
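
For illustration only, the sketch below assembles such a xenguest command line from the parameters listed above. Apart from --mode and the mode value hvm_build, the option names used here are assumptions, not the actual xenguest interface:

  (* Illustrative only: option names other than --mode/hvm_build are
     hypothetical, not the real xenguest command-line interface. *)
  let xenguest_argv ~domid ~memmax_mib =
    [ "xenguest"
    ; "--mode"; "hvm_build"
    ; "--domid"; string_of_int domid
    ; "--memmax_mib"; Int64.to_string memmax_mib
    ]

  let () =
    (* e.g. build domain 5 with 2048 MiB of system memory *)
    print_endline (String.concat " " (xenguest_argv ~domid:5 ~memmax_mib:2048L))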

        Walkthrough of the xenguest build mode

        flowchart
        subgraph xenguest[xenguest&nbsp;#8209;#8209;mode&nbsp;hvm_build&nbsp;domid]
        direction LR
        stub_xc_hvm_build[stub_xc_hvm_build#40;#41;] --> get_flags[
            get_flags#40;#41;&nbsp;<#8209;&nbsp;Xenstore&nbsp;platform&nbsp;data
        ]
        stub_xc_hvm_build --> configure_vcpus[
            configure_vcpus#40;#41;&nbsp;#8209;>&nbsp;Xen&nbsp;hypercall
        ]
        stub_xc_hvm_build --> setup_mem[
            setup_mem#40;#41;&nbsp;#8209;>&nbsp;Xen&nbsp;hypercalls&nbsp;to&nbsp;setup&nbsp;domain&nbsp;memory
        ]
        end

Based on the given domain type, the xenguest program calls a dedicated function for its build process.

        These are:

        • stub_xc_hvm_build() for HVM,
        • stub_xc_pvh_build() for PVH, and
        • stub_xc_pv_build() for PV domains.
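
The dispatch on the domain type can be pictured with the following OCaml sketch. It is illustrative only: the real functions are C code carried in the xenguest patch, so the parameters here are mere labels:

  (* Illustrative sketch of the per-domain-type dispatch; the real functions
     are C (stub_xc_hvm_build and friends). *)
  type domain_type = HVM | PVH | PV

  let build_for ~hvm_build ~pvh_build ~pv_build = function
    | HVM -> hvm_build () (* stub_xc_hvm_build() *)
    | PVH -> pvh_build () (* stub_xc_pvh_build() *)
    | PV -> pv_build ()   (* stub_xc_pv_build() *)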

Each of these domain build functions calls the following functions:

        1. get_flags() to get the platform data from the Xenstore
        2. configure_vcpus() which uses the platform data from the Xenstore to configure vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap (max % pCPU time for throttling)
        3. The setup_mem function for the given VM type.

        The function hvm_build_setup_mem()

For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory layout of the new domain, and for allocating and populating the domain's system memory. It must:

        1. Derive the e820 memory layout of the system memory of the domain including memory holes depending on PCI passthrough and vGPU flags.
        2. Load the BIOS/UEFI firmware images
        3. Store the final MMIO hole parameters in the Xenstore
        4. Call the libxenguest function xc_dom_boot_mem_init() (see below)
        5. Call construct_cpuid_policy() to apply the CPUID featureset policy

        The function xc_dom_boot_mem_init()

        flowchart LR
        subgraph xenguest
        hvm_build_setup_mem[hvm_build_setup_mem#40;#41;]
        end
        subgraph libxenguest
        hvm_build_setup_mem --> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;]
xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meminit_hvm#40;#41;]
        click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank
        click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank
        end

        hvm_build_setup_mem() calls xc_dom_boot_mem_init() to allocate and populate the domain’s system memory.

It calls meminit_hvm(), which loops over the vmemranges of the domain to map the system RAM of the guest from the Xen hypervisor heap. Its goals are:

        • Attempt to allocate 1GB superpages when possible
        • Fall back to 2MB pages when 1GB allocation failed
        • Fall back to 4k pages when both failed
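
This fallback order can be sketched as follows, where try_alloc is a hypothetical stand-in for a populate attempt at a given page order; the real logic is C code in meminit_hvm():

  (* Simplified sketch of the page-size fallback; [try_alloc] is a
     hypothetical stand-in for a populate attempt at the given page order. *)
  type page_order = Gib1 | Mib2 | Kib4

  let allocate_with_fallback ~try_alloc =
    match try_alloc Gib1 with
    | Ok n -> Ok n                  (* 1 GiB superpages worked *)
    | Error _ -> (
        match try_alloc Mib2 with
        | Ok n -> Ok n              (* fall back to 2 MiB pages *)
        | Error _ -> try_alloc Kib4 (* last resort: 4 KiB pages *)
      )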

        It uses the hypercall XENMEM_populate_physmap to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.

        https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071

        XENMEM_populate_physmap:

        1. Uses construct_memop_from_reservation to convert the arguments for allocating a page from struct xen_memory_reservation to struct memop_args.
        2. Sets flags and calls functions according to the arguments
        3. Allocates the requested page at the most suitable place
          • depending on passed flags, allocate on a specific NUMA node
          • else, if the domain has node affinity, on the affine nodes
          • also in the most suitable memory zone within the NUMA node
        4. Falls back to less desirable places if this fails
• or fails for “exact” allocation requests
        5. When no pages of the requested size are free, it splits larger superpages into pages of the requested size.

For more details on the VM build step involving xenguest and the Xen side, see: https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest

        Walkthrough: Migrating a VM

        At the end of this walkthrough, a sequence diagram of the overall process is included.

        Invocation

The command to migrate the VM is dispatched by the autogenerated dispatch_call function from xapi/server.ml. For more information about the generated functions, have a look at the XAPI IDL model.

The command triggers the operation VM_migrate, which is composed of many low-level atomic operations; these are described in the sections that follow.

The migrate command has several parameters, such as:

• whether it should be started asynchronously,
• whether it should be forwarded to another host,
• how the arguments should be marshalled, and so on.

        A new thread is created by xapi/server_helpers.ml to handle the command asynchronously. The helper thread checks if the command should be passed to the message forwarding layer in order to be executed on another host (the destination) or locally (if it is already at the destination host).

The call finally reaches xapi/api_server.ml, which posts a command to the message broker, the message switch. The command is a JSON-RPC HTTP request sent over a Unix domain socket, which is how the XAPI daemons communicate with each other. In the case of migration, the message sent by XAPI is consumed by the xenopsd daemon, which does the job of migrating the VM.

        Overview

        The migration is an asynchronous task and a thread is created to handle this task. The task reference is returned to the client, which can then check its status until completion.

        As shown in the introduction, xenopsd fetches the VM_migrate operation from the message broker.

        All tasks specific to libxenctrl, xenguest and Xenstore are handled by the xenopsd xc backend.

        The entities that need to be migrated are: VDI, VIF, VGPU and PCI components.

During the migration process, the destination domain will be built with the same UUID as the original VM, except that the last part of the UUID will be XXXXXXXX-XXXX-XXXX-XXXX-000000000001. The original domain will be removed under XXXXXXXX-XXXX-XXXX-XXXX-000000000000.
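
As a small illustration of this naming scheme (not the actual xenopsd code), the helper below replaces the final UUID segment with a given suffix:

  (* Illustration of the temporary-UUID naming scheme only; not the actual
     xenopsd implementation. *)
  let with_final_segment uuid suffix =
    match String.rindex_opt uuid '-' with
    | Some i -> String.sub uuid 0 (i + 1) ^ suffix
    | None -> uuid

  let () =
    let original = "12345678-1234-1234-1234-000000000000" in
    (* the destination domain built during migration: *)
    print_endline (with_final_segment original "000000000001")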

        Preparing VM migration

At specific places, xenopsd can execute hooks to run scripts. If a pre-migrate script is in place, a command to run this script is sent to the original domain.

Likewise, a command is sent to Qemu using the Qemu Machine Protocol (QMP) to check that the domain can be suspended (see xenopsd/xc/device_common.ml). After checking with Qemu that the VM can be suspended, the migration can begin.

        Importing metadata

As with hooks, commands to the source domain are sent using stunnel, a daemon used as a wrapper to manage SSL-encrypted communication between two hosts in the same pool. To import the metadata, an XML-RPC command is sent to the original domain.

Once imported, this yields a reference ID that allows building the new domain on the destination using the temporary VM UUID XXXXXXXX-XXXX-XXXX-XXXX-000000000001, where XXX... is the reference ID of the original VM.

        Memory setup

One of the first steps is the setup of the VM's memory: the backend checks that no ballooning operation is in progress; if one is, the migration could fail.

Once the memory has been checked, the daemon gets the state of the VM (running, halted, …) and the backend retrieves the domain's platform data (memory, vCPUs, etc.) from the Xenstore.

Once this is complete, we can restore the VIFs and create the domain.

The memory setup is the first synchronisation point, and everything is now ready for the VM migration.

        Destination VM setup

After receiving the memory, we can set up the destination domain. If the VM has a vGPU, we need to kick off its migration process and wait for the acknowledgement that the GPU entry has been successfully initialised before starting the main VM migration.

        The receiver informs the sender using a handshake protocol that everything is set up and ready for save/restore.

        Destination VM restore

VM restore is the low-level atomic operation VM.restore. This operation is represented by a function call to the backend. It uses xenguest, a low-level utility of the XAPI toolstack, to interact with the Xen hypervisor and libxc, and to send a migration request to the emu-manager.

After sending the request, the results coming from emu-manager are collected by the main thread, which blocks until they are received.

During the live migration, emu-manager helps ensure the correct state transitions for the devices and handles the message passing for the VM as it is moved between hosts. This includes making sure that the state of the VM's virtual devices, like disks or network interfaces, is correctly moved over.

        Destination VM rename

Once all operations are done, xenopsd renames the target VM from its temporary name to its real UUID. This is the low-level atomic operation VM.rename, which takes care of updating the Xenstore on the destination host.

        Restoring devices

Restoring devices starts by activating the VBDs using the low-level atomic operation VBD.set_active, which is an update of the Xenstore. VBDs that are read-write must be plugged before read-only ones (see the sketch below). Once activated, the low-level atomic operation VBD.plug is called and the VDIs are attached and activated.
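
The plug ordering can be sketched as follows; the vbd type is a simplified stand-in for the real xenopsd types:

  (* Simplified stand-in types: only the ordering logic is the point here. *)
  type vbd_mode = ReadOnly | ReadWrite
  type vbd = {id: string; mode: vbd_mode}

  let plug_order vbds =
    let rw, ro = List.partition (fun v -> v.mode = ReadWrite) vbds in
    rw @ ro (* read/write VBDs are plugged before read-only ones *)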

The next devices are the VIFs, which are set as active (VIF.set_active) and plugged (VIF.plug). If there are vGPUs, they are set as active now using the atomic operation VGPU.set_active.

        Creating the device model

create_device_model configures qemu-dm and starts it. This allows PCI devices to be managed.

        PCI plug

PCI.plug is executed by the backend. It plugs a PCI device and advertises it to QEMU if this option is set, as is the case for NVIDIA SR-IOV vGPUs.

        Unpause

        The libxenctrl call xc_domain_unpause() unpauses the domain, and it starts running.

        Cleanup

1. VM_set_domain_action_request marks the domain as alive, so that if xenopsd restarts, it does not reboot the VM. See the chapter on marking domains as alive for more information.

        2. If a post-migrate script is in place, it is executed by the Xenops_hooks.VM_post_migrate hook.

3. The final step is a handshake to seal the success of the migration, after which the old VM can be cleaned up.

Synchronisation point 4 has been reached; the migration is complete.

Live migration sequence diagram

This sequence diagram gives a visual representation of the VM migration workflow:

        sequenceDiagram
        autonumber
        participant tx as sender
        participant rx0 as receiver thread 0
        participant rx1 as receiver thread 1
        participant rx2 as receiver thread 2
        
        activate tx
        tx->>rx0: VM.import_metadata
        tx->>tx: Squash memory to dynamic-min
        
        tx->>rx1: HTTP /migrate/vm
        activate rx1
        rx1->>rx1: VM_receive_memory<br/>VM_create (00000001)<br/>VM_restore_vifs
        rx1->>tx: handshake (control channel)<br/>Synchronisation point 1
        
        tx->>rx2: HTTP /migrate/mem
        activate rx2
        rx2->>tx: handshake (memory channel)<br/>Synchronisation point 1-mem
        
        tx->>rx1: handshake (control channel)<br/>Synchronisation point 1-mem ACK
        
        rx2->>rx1: memory fd
        
        tx->>rx1: VM_save/VM_restore<br/>Synchronisation point 2
        tx->>tx: VM_rename
        rx1->>rx2: exit
        deactivate rx2
        
        tx->>rx1: handshake (control channel)<br/>Synchronisation point 3
        
        rx1->>rx1: VM_rename<br/>VM_restore_devices<br/>VM_unpause<br/>VM_set_domain_action_request
        
        rx1->>tx: handshake (control channel)<br/>Synchronisation point 4
        
        deactivate rx1
        
        tx->>tx: VM_shutdown<br/>VM_remove
        deactivate tx
