Xenopsd

Xenopsd is the VM manager of the XAPI Toolstack. Xenopsd is responsible for:

  • Starting, stopping, rebooting, suspending, resuming, migrating VMs.
  • (Hot-)plugging and unplugging devices such as VBDs, VIFs, vGPUs and PCI devices.
  • Setting up VM consoles.
  • Running bootloaders.
  • Setting QoS parameters.
  • Configuring SMBIOS tables.
  • Handling crashes.
  • etc.

Check out the full features list.

The code is in ocaml/xenopsd.

Principles

  1. Do no harm: Xenopsd should never touch domains/VMs which it hasn’t been asked to manage. This means that it can co-exist with other VM managers such as ‘xl’ and ‘libvirt’.
  2. Be independent: Xenopsd should be able to work in isolation. In particular the loss of some other component (e.g. the network) should not by itself prevent VMs being managed locally (including shutdown and reboot).
  3. Asynchronous by default: Xenopsd exposes task monitoring and offers cancellation for all operations. Xenopsd ensures that the system is always in a manageable state after an operation has been cancelled.
  4. Avoid state duplication: where another component owns some state, Xenopsd will always defer to it. We will avoid creating out-of-sync caches of this state.
  5. Be debuggable: Xenopsd will expose diagnostic APIs and tools to allow its internal state to be inspected and modified.

Subsections of Xenopsd

Architecture

Xenopsd instances run on a host and manage VMs on behalf of clients. This picture shows 3 different Xenopsd instances: 2 named “xenopsd-xc” and 1 named “xenopsd-xenlight”.

Where xenopsd fits on a host

Each instance is responsible for managing a disjoint set of VMs. Clients should never ask more than one Xenopsd to manage the same VM. Managing a VM means:

  • handling start/shutdown/suspend/resume/migrate/reboot
  • allowing devices (disks, nics, PCI cards, vCPUs etc) to be manipulated
  • providing updates to clients when things change (reboots, console becomes available, guest agent says something etc).

For a full list of features, consult the features list.

Each Xenopsd instance has a unique name on the host. Typical names are

  • org.xen.xcp.xenops.classic
  • org.xen.xcp.xenops.xenlight

A higher-level tool, such as xapi, will associate VMs with individual Xenopsd names.

Running multiple Xenopsds is necessary because

  • The virtual hardware supported by different technologies (libxc, libxl, qemu) is expected to be different. We can guarantee the virtual hardware is stable across a rolling upgrade by running the VM on the old Xenopsd. We can then switch Xenopsds later over a VM reboot when the VM admin is happy with it. If the VM admin is unhappy then we can reboot back to the original Xenopsd again.
  • The suspend/resume/migrate image formats will differ across technologies (again libxc vs libxl) and it will be more reliable to avoid switching technology over a migrate.
  • In the future different security domains may have different Xenopsd instances providing even stronger isolation guarantees between domains than is possible today.

Communication with Xenopsd is handled through a Xapi-global library: xcp-idl. This library supports

  • message framing: by default using HTTP but a binary framing format is available
  • message encoding: by default we use JSON but XML is also available
  • RPCs over Unix domain sockets and persistent queues.

This library allows the communication details to be changed without having to change all the Xapi clients and servers.
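
As a sketch of what such an RPC might look like, here is a minimal Python illustration of building a JSON-RPC request body. This is illustrative only: the real framing, version fields and method names are defined by xcp-idl, not by this sketch.

```python
import json

def make_jsonrpc_call(method, params, call_id=1):
    """Build a JSON-RPC request body of the general kind xcp-idl sends.
    Illustrative only: the real wire format is defined by xcp-idl."""
    return json.dumps({
        "method": method,
        "params": params,
        "id": call_id,
    })

# A hypothetical call addressed to one named Xenopsd instance:
body = make_jsonrpc_call("VM.stat", [{"dbg": "client-1"}, "vm-uuid"])
```

On the wire, this body would be carried either in an HTTP request or in the binary framing format, over a Unix domain socket or a persistent queue.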

Xenopsd has a number of “backends” which perform the low-level VM operations such as (on Xen) “create domain”, “hotplug disk” and “destroy domain”. These backends contain all the hypervisor-specific code including

  • connecting to Xenstore
  • opening the libxc /proc/xen/privcmd interface
  • initialising libxl contexts

The following diagram shows the internal structure of Xenopsd:

Inside xenopsd

At the top of the diagram two client RPCs have been sent: one to start a VM and the other to fetch the latest events. The RPCs are all defined in xcp-idl/xen/xenops_interface.ml. The RPCs are received by the Xenops_server module and decomposed into “micro-ops” (labelled “μ op”). These micro-ops represent actions like

  • create a Xen domain (recall a Xen domain is an empty shell with no memory)
  • build a Xen domain: this is where the kernel or hvmloader is copied in
  • launch a device model: this is where a qemu instance is started (if one is required)
  • hotplug a device: this involves writing the frontend and backend trees to Xenstore
  • unpause a domain (recall a Xen domain is created in the paused state)

Each of these micro-ops is represented by a function call in a “backend plugin” interface. The micro-ops are enqueued in queues, one queue per VM. There is a thread pool (whose size can be changed dynamically by the admin) which pulls micro-ops from the VM queues and calls the corresponding backend function.
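
This queueing discipline can be sketched in Python (an illustrative model, not the real OCaml implementation): micro-ops go into one FIFO per VM, a shared worker pool pulls from them, and at most one worker owns a VM at any time, so operations on the same VM never run in parallel.

```python
import queue
import threading

class PerVMScheduler:
    """Sketch of xenopsd-style scheduling: one FIFO of micro-ops per VM,
    a shared worker pool, and at most one worker per VM at a time."""

    def __init__(self, nworkers=4):
        self.lock = threading.Lock()
        self.vm_queues = {}          # vm id -> list of pending micro-ops
        self.busy = set()            # vm ids currently owned by a worker
        self.ready = queue.Queue()   # vm ids which may have runnable work
        for _ in range(nworkers):
            threading.Thread(target=self._worker, daemon=True).start()

    def enqueue(self, vm, micro_op):
        with self.lock:
            self.vm_queues.setdefault(vm, []).append(micro_op)
        self.ready.put(vm)

    def _worker(self):
        while True:
            vm = self.ready.get()
            with self.lock:
                # Skip if another worker already owns this VM, or if the
                # token is stale and the queue is empty.
                if vm in self.busy or not self.vm_queues.get(vm):
                    continue
                self.busy.add(vm)
                op = self.vm_queues[vm].pop(0)
            try:
                op()  # call the corresponding backend function
            except Exception:
                pass  # a real implementation would mark the Task as Failed
            finally:
                with self.lock:
                    self.busy.discard(vm)
                    if self.vm_queues.get(vm):
                        self.ready.put(vm)
```

Because a VM's micro-ops are popped one at a time by whichever worker holds the VM, per-VM ordering is preserved while different VMs still proceed concurrently.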

The active backend (there can only be one backend per Xenopsd instance) executes the micro-ops. The Xenops_server_xen backend in the picture above talks to libxc, libxl and qemu to create and destroy domains. The backend also talks to other Xapi services, in particular

  • it registers datasources with xcp-rrdd, telling xcp-rrdd to measure I/O throughput and vCPU utilisation
  • it reserves memory for new domains by talking to squeezed
  • it makes disks available by calling SMAPIv2 VDI.{at,de}tach, VDI.{,de}activate
  • it launches subprocesses by talking to forkexecd (avoiding problems with accidental fd capture)

Xenopsd backends are also responsible for monitoring running VMs. In the Xenops_server_xen backend this is done by watching Xenstore for

  • @releaseDomain watch events
  • device hotplug status changes

When such an event happens (for example: @releaseDomain sent when a domain requests a reboot) the corresponding operation does not happen inline. Instead the event is rebroadcast upwards to Xenops_server as a signal (for example: “VM id needs some attention”) and a “VM_stat” micro-op is queued in the appropriate queue. Xenopsd does not allow operations to run on the same VM in parallel and enforces this by:

  • pushing all operations pertaining to a VM to the same queue
  • associating each VM queue to at-most-one worker pool thread

The event takes the form “VM id needs some attention” and not “VM id needs to be rebooted” because, by the time the queue is flushed, the VM may well now be in a different state. Perhaps rather than being rebooted it now needs to be shutdown; or perhaps the domain is now in a good state because the reboot has already happened. The signals sent by the backend to the Xenops_server are a bit like event channel notifications in the Xen ring protocols: they are requests to ask someone to perform work, they don’t themselves describe the work that needs to be done.
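
The coalescing behaviour can be sketched as follows (illustrative Python, not the real Xenops_server code): duplicate “needs attention” signals for the same VM collapse into one, just like event channel notifications, and the consumer re-reads the actual VM state to decide what work is needed now.

```python
import threading

class NeedsAttention:
    """Sketch of the backend-to-server signalling: signals carry only a
    VM id ('VM x needs attention') and duplicates coalesce."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = set()

    def signal(self, vm):
        with self.lock:
            self.dirty.add(vm)  # many signals for one VM collapse into one

    def pop_all(self):
        """Take the current set of VMs needing attention; the caller then
        inspects each VM's real state and queues the right micro-op."""
        with self.lock:
            vms, self.dirty = self.dirty, set()
            return vms
```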

An implication of this design is that it should always be possible to answer the question, “what operation should be performed to get the VM into a valid state?”. If an operation is cancelled half-way through, or if Xenopsd is suddenly restarted, it will ask this question about all the VMs and perform the necessary operations. The operations must be designed carefully to make this work. For example, if Xenopsd is restarted half-way through starting a VM, it must be obvious on restart that the VM should either be forcibly shut down or rebooted to return it to a valid state. Note: we don’t demand that operations are performed as transactions; we only demand that the state they leave the system in be “sensible”, in the sense that the admin will recognise it and be able to continue their work.

Sometimes this can be achieved through careful ordering of side-effects within the operations, taking advantage of artifacts of the system such as:

  • a domain which has not been fully created will have total vCPU time = 0 and will be paused. If we see one of these we should reboot it because it may not be fully intact.

In the absence of “tells” from the system, operations are expected to journal their intentions and support restart after failure.

There are three categories of metadata associated with VMs:

  1. system metadata: this is created as a side-effect of starting VMs. This includes all the information about active disks and nics stored in Xenstore and the list of running domains according to Xen.
  2. VM: this is the configuration to use when the VM is started or rebooted. This is like a “config file” for the VM.
  3. VmExtra: this is the runtime configuration of the VM. When VM configuration is changed it often cannot be applied immediately; instead the VM continues to run with the previous configuration. We need to track the runtime configuration of the VM in order for suspend/resume and migrate to work. It is also useful to be able to tell a client, “on next reboot this value will be x but currently it is x-1”.

VM and VmExtra metadata is stored by Xenopsd in the domain 0 filesystem, in a simple directory hierarchy.
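
A minimal sketch of such a directory hierarchy (illustrative Python; the actual root path, file names and serialisation format used by Xenopsd are not shown here, so treat these as hypothetical):

```python
import json
import os

def save_vm_metadata(root, uuid, vm, vmextra):
    """Store VM and VmExtra metadata in a per-VM directory under root.
    The layout and file names are hypothetical, for illustration only."""
    d = os.path.join(root, uuid)
    os.makedirs(d, exist_ok=True)
    for name, data in (("vm", vm), ("vmextra", vmextra)):
        with open(os.path.join(d, name), "w") as f:
            json.dump(data, f)

def load_vm_metadata(root, uuid):
    d = os.path.join(root, uuid)
    out = {}
    for name in ("vm", "vmextra"):
        with open(os.path.join(d, name)) as f:
            out[name] = json.load(f)
    return out
```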

Design

Subsections of Design

Hooks

There are a number of hook points at which xenopsd may execute certain scripts. These scripts are found in hook-specific directories of the form /etc/xapi.d/<hookname>/. All executable scripts in these directories are run with the following arguments:

<script.sh> -reason <reason> -vmuuid <uuid of VM>

The scripts are executed in filename-order. By convention, the filenames are usually of the form 10resetvdis.
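
The convention above can be sketched in Python (illustrative; the real implementation is OCaml inside xenopsd): enumerate the hook directory, keep only executable files, sort by filename, and run each with the `-reason`/`-vmuuid` arguments.

```python
import os
import subprocess

def run_hooks(hookname, reason, vmuuid, basedir="/etc/xapi.d"):
    """Run every executable file in <basedir>/<hookname>/ in filename
    order with: <script> -reason <reason> -vmuuid <uuid>."""
    d = os.path.join(basedir, hookname)
    if not os.path.isdir(d):
        return
    for name in sorted(os.listdir(d)):
        path = os.path.join(d, name)
        if os.path.isfile(path) and os.access(path, os.X_OK):
            subprocess.check_call(
                [path, "-reason", reason, "-vmuuid", vmuuid])
```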

The hook points are:

vm-pre-shutdown
vm-pre-migrate
vm-post-migrate (Dundee only)
vm-pre-start
vm-pre-reboot
vm-pre-resume
vm-post-resume (Dundee only)
vm-post-destroy

and the reason codes are:

clean-shutdown
hard-shutdown
clean-reboot
hard-reboot
suspend
source -- passed to pre-migrate hook on source host
destination -- passed to post-migrate hook on destination (Dundee only)
none

For example, in order to execute a script on VM shutdown, it would be sufficient to create the script in the post-destroy hook point:

/etc/xapi.d/vm-post-destroy/01myscript.sh

containing

#!/bin/bash
echo I was passed $@ > /tmp/output

And when, for example, VM e30d0050-8f15-e10d-7613-cb2d045c8505 is shut-down, the script is executed:

[vagrant@localhost ~]$ sudo xe vm-shutdown --force uuid=e30d0050-8f15-e10d-7613-cb2d045c8505
[vagrant@localhost ~]$ cat /tmp/output
I was passed -vmuuid e30d0050-8f15-e10d-7613-cb2d045c8505 -reason hard-shutdown

PVS Proxy OVS Rules

Rule Design

The Open vSwitch (OVS) daemon implements a programmable switch. XenServer uses it to re-direct traffic between three entities:

  • PVS server - identified by its IP address
  • a local VM - identified by its MAC address
  • a local Proxy - identified by its MAC address

VM and PVS server are unaware of the Proxy; xapi configures OVS to redirect traffic between PVS and VM to pass through the proxy.

OVS uses rules that match packets. Rules are organised in sets called tables. A rule can be used to match a packet and to inject it into another rule set/table such that a packet can be matched again.

Furthermore, a rule can set registers associated with a packet which can be matched in subsequent rules. In that way, a packet can be tagged such that it will only match specific rules downstream that match the tag.

Xapi configures 3 rule sets:

Table 0 - Entry Rules

Rules match UDP traffic between VM/PVS, Proxy/VM, and PVS/VM where the PVS server is identified by its IP and all other components by their MAC address. All packets are tagged with the direction they are going and re-submitted into Table 101 which handles ports.

Table 101 - Port Rules

Rules match UDP traffic going to a specific port of the PVS server and re-submit it into Table 102.

Table 102 - Exit Rules

These rules implement the redirection:

  • Rules matching packets coming from VM to PVS are directed to the Proxy.
  • Rules matching packets coming from PVS to VM are directed to the Proxy.
  • Rules matching packets coming from the Proxy, which are already addressed properly (to the VM), are handled normally.
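
As a toy model of the three tables (illustrative Python; the real rules are OVS flow entries installed by xapi, and the field names here are assumptions for illustration):

```python
def classify(pkt, pvs_ip, vm_mac, proxy_mac, pvs_port):
    """Toy model of the three rule sets: returns where the packet goes."""
    # Table 0 (entry): tag the packet with its direction, resubmit to 101.
    if pkt.get("src_mac") == vm_mac and pkt.get("dst_ip") == pvs_ip:
        direction = "vm-to-pvs"
    elif pkt.get("src_ip") == pvs_ip and pkt.get("dst_mac") == vm_mac:
        direction = "pvs-to-vm"
    elif pkt.get("src_mac") == proxy_mac:
        direction = "from-proxy"
    else:
        return "normal"
    # Table 101 (ports): only traffic for a PVS port is considered further.
    if direction != "from-proxy" and pkt.get("port") != pvs_port:
        return "normal"
    # Table 102 (exit): both VM->PVS and PVS->VM are redirected to the
    # proxy; proxy output is already correctly addressed and goes normally.
    return "proxy" if direction != "from-proxy" else "normal"
```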

Requirements for suspend image framing

We are currently (Dec 2013) undergoing a transition from the ‘classic’ xenopsd backend (built upon calls to libxc) to the ‘xenlight’ backend built on top of the officially supported libxl API.

During this work, we have come across an incompatibility between the suspend images created using the ‘classic’ backend and those created using the new libxl-based backend. This needed to be fixed to enable RPU to any new version of XenServer.

Historic ‘classic’ stack

Prior to this work, xenopsd was involved in the construction of the suspend image and we ended up with an image with the following format:

+-----------------------------+
| "XenSavedDomain\n"          |  <-- added by xenopsd-classic
|-----------------------------|
|  Memory image dump          |  <-- libxc
|-----------------------------|
| "QemuDeviceModelRecord\n"   |
|  <size of following record> |  <-- added by xenopsd-classic
|  (a 32-bit big-endian int)  |
|-----------------------------|
| "QEVM"                      |  <-- libxc/qemu
|  Qemu device record         |
+-----------------------------+

We have also been carrying a patch in the Xen patchqueue against xc_domain_restore. This patch (revert_qemu_tail.patch) stopped xc_domain_restore from attempting to read past the memory image dump, at which point xenopsd-classic would take over and restore what it had put there.
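
For illustration, the legacy qemu section header shown in the diagram above can be parsed like this (a Python sketch of the on-disk format, not the real code, which lives in C/OCaml):

```python
import struct

# The legacy signature written by xenopsd-classic, trailing newline and all.
LEGACY_SIG = b"QemuDeviceModelRecord\n"

def read_legacy_qemu_header(buf):
    """Parse the legacy qemu section: signature, then a 32-bit big-endian
    record length, then the record itself."""
    if not buf.startswith(LEGACY_SIG):
        raise ValueError("not a legacy qemu record")
    off = len(LEGACY_SIG)
    (length,) = struct.unpack(">I", buf[off:off + 4])
    record = buf[off + 4:off + 4 + length]
    return length, record
```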

Requirements for new stack

For xenopsd-xenlight to work, we need to operate without the revert_qemu_tail.patch since libxl assumes it is operating on top of an upstream libxc.

We need the following relationships, where a suspend image created on one backend is able to be restored on another backend, and the backends are old-classic (OC), new-classic (NC) and xenlight (XL). Obviously, all suspend images created on any backend must be able to be restored on the same backend:

                OC _______ NC _______ XL
                 \  >>>>>      >>>>>  /
                  \__________________/
                    >>>>>>>>>>>>>>>>

It turns out this was not so simple. After removing the patch against xc_domain_restore and allowing libxc to restore the hvm_buffer_tail, we found that suspend images created with OC (detailed in the previous section) are not of a valid format for two reasons:

i. The "XenSavedDomain\n" was extraneous;

ii. The Qemu signature section (prior to the record) is not of valid form.

It turns out that the section with the Qemu signature can be one of the following:

a. "QemuDeviceModelRecord" (NB. no newline) followed by the record to EOF;
b. "DeviceModelRecord0002" then a uint32_t length followed by record;
c. "RemusDeviceModelState" then a uint32_t length followed by record;

The old-classic (OC) backend not only uses an invalid signature (since it contains a trailing newline) but it also includes a length, and the length is in big-endian whereas the uint32_t is expected to be little-endian.

We considered creating a proxy for the fd in the incompatible cases but, since this would need to be a 22-byte-lookahead, byte-by-byte proxy, this was deemed impractical. Instead we have patched libxc with a much simpler change to understand this legacy format.

Because peek-ahead is not possible on pipes, the patch for (ii) needed to be applied at a point where the hvm tail had been read completely. We piggy-backed on the point after (a) had been detected. At this point the remainder of the fd is buffered (only around 7k) and the magic “QEVM” is expected at the head of this buffer. So we simply added a patch to check whether there was a stray newline and bytes 5–8 of the buffer were “QEVM”, and if so we could discard the first 5 bytes:

                              0    1    2    3    4    5   6   7   8
Legacy format from OC:  [...| \n | \x | \x | \x | \x | Q | E | V | M |...]

Required at this point: [...|  Q |  E |  V |  M |...]
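
The fix-up described above can be sketched as (illustrative Python; the real patch is in C inside libxc):

```python
def fixup_legacy_buffer(buf):
    """If the buffered tail starts with a stray newline followed by 4
    length bytes before the 'QEVM' magic, drop the first 5 bytes;
    otherwise leave the buffer untouched."""
    if buf[:1] == b"\n" and buf[5:9] == b"QEVM":
        return buf[5:]
    return buf
```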

Changes made

To make the above use-cases work, we have made the following changes:

1. Make new-classic (NC) not restore Qemu tail (let libxc do it)
    xenopsd.git:ef3bf4b

2. Make new-classic use valid signature (b) for future restore images
    xenopsd.git:9ccef3e

3. Make xc_domain_restore in libxc understand legacy xenopsd (OC) format
    xen-4.3.pq.hg:libxc-restore-legacy-image.patch

4. Remove revert-qemu-tail.patch from Xen patchqueue
    xen-4.3.pq.hg:3f0e16f2141e

5. Make xenlight (XL) use "XenSavedDomain\n" start-of-image signature
    xenopsd.git:dcda545

This has made the required use-cases work as follows:

                OC __134__ NC __245__ XL
                 \  >>>>>      >>>>>  /
                  \_______345________/
                    >>>>>>>>>>>>>>>>

And the suspend-resume on same backends work by virtue of:

OC --> OC : Just works
NC --> NC : By 1,2,4
XL --> XL : By 4 (5 is used but not required)

New components

The outputs of the changes above are:

  • A new xenops-xc binary for NC
  • A new xenops-xl binary for XL
  • A new libxenguest.4.3 for both of NC and XL

Future considerations

This should serve as a useful reference when considering making changes to the suspend image in any way.

Suspend image framing format

Example suspend image layout:

+----------------------------+
| 1. Suspend image signature |
+============================+
| 2.0 Xenops header          |
| 2.1 Xenops record          |
+============================+
| 3.0 Libxc header           |
| 3.1 Libxc record           |
+============================+
| 4.0 Qemu header            |
| 4.1 Qemu save record       |
+============================+
| 5.0 End_of_image footer    |
+----------------------------+

A suspend image is now constructed as a series of header-record pairs. The initial signature (1.) is used to determine whether we are dealing with the unstructured, “legacy” suspend image or the new, structured format.

Each header is two 64-bit integers: the first identifies the header type and the second is the length of the record that follows in bytes. The following types have been defined (the ones marked with a (*) have yet to be implemented):

* Xenops       : Metadata for the suspend image
* Libxc        : The result of a xc_domain_save
* Libxl*       : Not implemented
* Libxc_legacy : Marked as a libxc record saved using pre-Xen-4.5
* Qemu_trad    : The qemu save file for the Qemu used in XenServer
* Qemu_xen*    : Not implemented
* Demu*        : Not implemented
* End_of_image : A footer marker to denote the end of the suspend image

Some of the above types do not use the length field, since the length cannot be known upfront before saving, and restoring these records is delegated to other layers of the stack. Specifically these are the memory image sections, libxc and libxl.
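
A sketch of reading and writing such a header (Python for illustration; the numeric type codes and the endianness chosen here are assumptions, the real definitions live in xenopsd's suspend image code):

```python
import struct

# Hypothetical numeric codes for the header types listed above.
XENOPS, LIBXC, END_OF_IMAGE = 0, 1, 6

def write_header(htype, length):
    """Each header is two 64-bit integers: the record type, then the
    length in bytes of the record that follows."""
    return struct.pack("<QQ", htype, length)

def read_header(buf):
    htype, length = struct.unpack("<QQ", buf[:16])
    return htype, length
```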

Tasks

Some operations performed by Xenopsd are blocking, for example:

  • suspend/resume/migration
  • attaching disks (where the SMAPI VDI.attach/activate calls can perform network I/O)

We want to be able to

  • present the user with an idea of progress (perhaps via a “progress bar”)
  • allow the user to cancel a blocked operation that is taking too long
  • associate logging with the user/client-initiated actions that spawned them

Principles

  • all operations which may block (the vast majority) should be written in an asynchronous style i.e. the operations should immediately return a Task id
  • all operations should guarantee to respond to a cancellation request in a bounded amount of time (30s)
  • when cancelled, the system should always be left in a valid state
  • clients are responsible for destroying Tasks when they are finished with the results

Types

A task has a state, which may be Pending, Completed or Failed:

	type async_result = unit

	type completion_t = {
		duration : float;
		result : async_result option
	}

	type state =
		| Pending of float
		| Completed of completion_t
		| Failed of Rpc.t

When a task is Failed, we associate it with a marshalled exception (a value of type Rpc.t). This exception must be one from the set defined in the Xenops_interface. To see how they are marshalled, see Xenops_server.

From the point of view of a client, a Task has the immutable type (which can be queried with a Task.stat):

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
	}

where

  • id is a unique (integer) id generated by Xenopsd. This is how a Task is represented to clients
  • dbg is a client-provided debug key which will be used in log lines, allowing lines from the same Task to be associated together
  • ctime is the creation time
  • state is the current state (Pending/Completed/Failed)
  • subtasks lists logical internal sub-operations for debugging
  • debug_info includes miscellaneous key/value pairs used for debugging

Internally, Xenopsd uses a mutable record type to track Task state. This is broadly similar to the interface type except

  • the state is mutable: this allows Tasks to complete
  • the task contains a “do this now” thunk
  • there is a “cancelling” boolean which is toggled to request a cancellation.
  • there is a list of cancel callbacks
  • there are some fields related to “cancel points”

Persistence

The Tasks are intended to represent activities associated with in-memory queues and threads. Therefore the active Tasks are kept in memory in a map, and will be lost over a process restart. This is desirable since we will also lose the queued items and the threads, so there is no need to resync on start.

Note that every operation must ensure that the state of the system is recoverable on restart by not leaving it in an invalid state. It is not necessary to either guarantee to complete or roll-back a Task. Tasks are not expected to be transactional.

Lifecycle of a Task

All Tasks returned by API functions are created as part of the enqueue functions: queue_operation_*. Even operations which are performed internally are normally wrapped in Tasks by the function immediate_operation.

A queued operation will be processed by one of the queue worker threads. It will

  • set the thread-local debug key to the Task.dbg
  • call task.Xenops_task.run, taking care to catch exceptions and update the task.Xenops_task.state
  • unset the thread-local debug key
  • generate an event on the Task to provoke clients to query the current state.

Task implementations must update their progress as they work. For the common case of a compound operation like VM_start, which is decomposed into multiple “micro-ops” (e.g. VM_create, VM_build), there is a useful helper function perform_atomics which divides the progress ‘bar’ into sections, where each “micro-op” can have a different size (weight). A progress callback function is passed into each Xenopsd backend function so it can be updated with fine granularity. For example, note the arguments to B.VM.save.
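
The weighted-progress idea can be sketched as (illustrative Python; perform_atomics itself is OCaml):

```python
def weighted_progress(weights, done, frac_of_current=0.0):
    """Overall progress in [0, 1] when `done` micro-ops out of
    len(weights) have completed and the current one is `frac_of_current`
    of the way through. Each micro-op owns a section of the bar
    proportional to its weight."""
    total = float(sum(weights))
    completed = sum(weights[:done])
    current = weights[done] * frac_of_current if done < len(weights) else 0.0
    return (completed + current) / total
```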

Clients are expected to destroy Tasks they are responsible for creating. Xenopsd cannot do this on their behalf because it does not know if they have successfully queried the Task status/result.

When Xenopsd is a client of itself, it will take care to destroy the Task properly, for example see immediate_operation.

Cancellation

The goal of cancellation is to unstick a blocked operation and to return the system to some valid state, not any valid state in particular. Xenopsd does not treat operations as transactions; when an operation is cancelled it may

  • fully complete (e.g. if it was about to do this anyway)
  • fully abort (e.g. if it had made no progress)
  • enter some other valid state (e.g. if it had gotten half way through)

Xenopsd will never leave the system in an invalid state after cancellation.

Every Xenopsd operation should unblock and return the system to a valid state within a reasonable amount of time after a cancel request. This should be as quick as possible but up to 30s may be acceptable. Bear in mind that a human is probably impatiently watching a UI say “please wait” and which doesn’t have any notion of progress itself. Keep it quick!

Cancellation is triggered by TASK.cancel which calls cancel. This

  • sets the cancelling boolean
  • calls all registered cancel callbacks
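
This flow can be sketched in Python (illustrative; the real Xenops_task is OCaml): cancel sets the boolean and runs any registered callbacks, and a “cancel point” raises if a cancellation has been requested.

```python
import threading

class Task:
    """Sketch of the cancellation flow: cancel() sets a boolean and runs
    registered cancel callbacks (e.g. something that wakes a blocked
    Xenstore watch or signals a subprocess)."""

    def __init__(self):
        self.lock = threading.Lock()
        self.cancelling = False
        self.cancel_callbacks = []
        self.cancel_points_seen = 0

    def with_cancel(self, callback, f):
        """Run f with a cancel callback registered for its duration."""
        with self.lock:
            self.cancel_callbacks.append(callback)
        try:
            return f()
        finally:
            with self.lock:
                self.cancel_callbacks.remove(callback)

    def cancel(self):
        with self.lock:
            self.cancelling = True
            callbacks = list(self.cancel_callbacks)
        for cb in callbacks:
            cb()

    def check_cancelling(self):
        """A 'cancel point': count it, and raise if cancellation was
        requested (the count is reported in Task.debug_info)."""
        self.cancel_points_seen += 1
        if self.cancelling:
            raise RuntimeError("Cancelled")
```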

Implementations respond to cancellation in different ways depending on how they block. Xenopsd’s libxc backend can block in 2 different ways, and therefore has 2 different types of cancel callback:

  1. cancellable Xenstore watches
  2. cancellable subprocesses

Xenstore watches are used for device hotplug and unplug. Xenopsd has to wait for the backend or for a udev script to do something. If that blocks then we need a way to cancel the watch. The easiest way to cancel a watch is to watch an additional path (a “cancel path”) and delete it, see cancellable_watch. The “cancel paths” are placed within the VM’s Xenstore directory to ensure that cleanup code which does xenstore-rm will automatically “cancel” all outstanding watches. Note that we trigger a cancel by deleting rather than creating, to avoid racing with delete and creating orphaned Xenstore entries.

Subprocesses are used for suspend/resume/migrate. Xenopsd hands file descriptors to libxenguest by running a subprocess and passing the fds to it. Xenopsd therefore gets the process id and can send it a signal to cancel it. See Cancellable_subprocess.run.

Testing with cancel points

Cancellation is difficult to test, as it is completely asynchronous. Therefore Xenopsd has some built-in cancellation testing infrastructure known as “cancel points”. A “cancel point” is a point in the code where a Cancelled exception could be thrown, either by checking the cancelling boolean or as a side-effect of a cancel callback. The check_cancelling function increments a counter every time it passes one of these points, and this value is returned to clients in the Task.debug_info.

A test harness runs a series of operations. Each operation is first run all the way through to completion to discover the total number of cancel points. The operation is then re-run with a request to cancel at a particular point. The test then waits for the system to stabilise and verifies that it appears to be in a valid state.

Preventing Tasks leaking

The client who creates a Task must destroy it when the Task is finished, and they have processed the result. What if a client like xapi is restarted while a Task is running?

We assume that, if xapi is talking to a xenopsd, then xapi completely owns it. Therefore xapi should destroy any completed tasks that it doesn’t recognise.

If a user wishes to manage VMs with xenopsd in parallel with xapi, the user should run a separate xenopsd.

Features

General

  • Pluggable backends including
    • xc: drives Xen via libxc and xenguest
    • simulator: simulates operations for component-testing
  • Supports running multiple instances and backends on the same host, looking after different sets of VMs
  • Extensive configuration via command-line (see manpage) and config file
  • Command-line tool for easy VM administration and troubleshooting
  • User-settable degree of concurrency to get VMs started quickly

VMs

  • VM start/shutdown/reboot
  • VM suspend/resume/checkpoint/migrate
  • VM pause/unpause
  • VM s3suspend/s3resume
  • customisable SMBIOS tables for OEM-locked VMs
  • hooks for 3rd party extensions:
    • pre-start
    • pre-destroy
    • post-destroy
    • pre-reboot
  • per-VM xenguest replacement
  • suppression of VM reboot loops
  • live vCPU hotplug and unplug
  • vCPU to pCPU affinity setting
  • vCPU QoS settings (weight and cap for the Xen credit2 scheduler)
  • DMC memory-ballooning support
  • support for storage driver domains
  • live update of VM shadow memory
  • guest-initiated disk/nic hotunplug
  • guest-initiated disk eject
  • force disk/nic unplug
  • support for ‘surprise-removable’ devices
  • disk QoS configuration
  • nic QoS configuration
  • persistent RTC
  • two-way guest agent communication for monitoring and control
  • network carrier configuration
  • port-locking for nics
  • text and VNC consoles over TCP and Unix domain sockets
  • PV kernel and ramdisk whitelisting
  • configurable VM videoram
  • programmable action-after-crash behaviour including: shutting down the VM, taking a crash dump or leaving the domain paused for inspection
  • ability to move nics between bridges/switches
  • advertises the VM memory footprints
  • PCI passthrough
  • support for discrete emulators (e.g. ‘demu’)
  • PV keyboard and mouse
  • qemu stub domains
  • cirrus and stdvga graphics cards
  • HVM serial console (useful for debugging)
  • support for vGPU
  • workaround for ‘spurious page faults’ kernel bug
  • workaround for ‘machine address size’ kernel bug

Hosts

  • CPUid masking for heterogeneous pools: reports true features and current features
  • Host console reading
  • Hypervisor version and capabilities reporting
  • Host CPU querying

APIs

  • versioned json-rpc API with feature advertisements
  • clients can disconnect, reconnect and easily resync with the latest VM state without losing updates
  • all operations have task control including
    • asynchronous cancellation: for both subprocesses and xenstore watches
    • progress updates
    • subtasks
    • per-task debug logs
  • asynchronous event watching API
  • advertises VM metrics
    • memory usage
    • balloon driver co-operativeness
    • shadow memory usage
    • domain ids
  • channel passing (via sendmsg(2)) for efficient memory image copying

Operation Walk-Throughs

Let’s trace through interesting operations to see how the whole system works.

  • Starting a VM
  • Migrating a VM
  • Shutting down a VM and waiting for it to happen
  • A VM wants to reboot itself
  • A disk is hotplugged
  • A disk refuses to hotunplug
  • A VM is suspended

Subsections of Operation Walk-Throughs

Live Migration Sequence Diagram

```mermaid
sequenceDiagram
    autonumber
    participant tx as sender
    participant rx0 as receiver thread 0
    participant rx1 as receiver thread 1
    participant rx2 as receiver thread 2
    activate tx
    tx->>rx0: VM.import_metadata
    tx->>tx: Squash memory to dynamic-min
    tx->>rx1: HTTP /migrate/vm
    activate rx1
    rx1->>rx1: VM_receive_memory<br/>VM_create (00000001)<br/>VM_restore_vifs
    rx1->>tx: handshake (control channel)<br/>Synchronisation point 1
    tx->>rx2: HTTP /migrate/mem
    activate rx2
    rx2->>tx: handshake (memory channel)<br/>Synchronisation point 1-mem
    tx->>rx1: handshake (control channel)<br/>Synchronisation point 1-mem ACK
    rx2->>rx1: memory fd
    tx->>rx1: VM_save/VM_restore<br/>Synchronisation point 2
    tx->>tx: VM_rename
    rx1->>rx2: exit
    deactivate rx2
    tx->>rx1: handshake (control channel)<br/>Synchronisation point 3
    rx1->>rx1: VM_rename<br/>VM_restore_devices<br/>VM_unpause<br/>VM_set_domain_action_request
    rx1->>tx: handshake (control channel)<br/>Synchronisation point 4
    deactivate rx1
    tx->>tx: VM_shutdown<br/>VM_remove
    deactivate tx
```

Walkthrough: Migrating a VM

A XenAPI client wishes to migrate a VM from one host to another within the same pool.

The client will issue a command to migrate the VM and it will be dispatched by the autogenerated dispatch_call function from xapi/server.ml. For more information about the generated functions you can have a look at the XAPI IDL model.

The command will trigger the operation VM_migrate, whose low-level operations are performed by the backend. The atomic operations that we will describe in this documentation are:

  • VM.restore
  • VM.rename
  • VBD.set_active
  • VBD.plug
  • VIF.set_active
  • VGPU.set_active
  • VM.create_device_model
  • PCI.plug
  • VM.set_domain_action_request

The command has several parameters, such as: whether it should be run asynchronously, whether it should be forwarded to another host, how its arguments should be marshalled, and so on. A new thread is created by xapi/server_helpers.ml to handle the command asynchronously. At this point the helper also checks whether the command should be passed to the message forwarding layer in order to be executed on another host (the destination), or locally if we are already at the right place.

The call finally reaches xapi/api_server.ml, which posts a command to the message broker, message-switch. This is a JSON-RPC HTTP request sent over a Unix socket, the mechanism used for communication between XAPI daemons. In the case of migration, the message sent by XAPI is consumed by the xenopsd daemon, which does the job of migrating the VM.

The migration of the VM

The migration is an asynchronous task, and a thread is created to handle it. The task's reference is returned to the client, which can then check its status until completion.

As described in the introduction, the xenopsd daemon pops the VM_migrate operation from the message broker.

Only one backend is currently available: the xc backend, which interacts with libxc, libxenguest and xenstore.

The entities that need to be migrated are: VDI, VIF, VGPU and PCI components.

During the migration process the destination domain is built with the same UUID as the original VM, except that the last part of the UUID is XXXXXXXX-XXXX-XXXX-XXXX-000000000001. The original domain, identified by XXXXXXXX-XXXX-XXXX-XXXX-000000000000, is removed at the end of the migration.
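
The suffix convention above can be sketched as a small helper. The function names here are illustrative, not the real xenopsd API:

```ocaml
(* Sketch of the temporary-UUID convention used during migration.
   A UUID is 36 characters; the final segment is the last 12. *)
let with_suffix uuid suffix = String.sub uuid 0 24 ^ suffix

(* The incoming domain is built with suffix 000000000001; the
   original keeps 000000000000 until it is removed. *)
let incoming_uuid uuid = with_suffix uuid "000000000001"
let original_uuid uuid = with_suffix uuid "000000000000"
```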

There are certain points, called hooks, at which xenopsd can execute a script. Before starting a migration, a command is sent to the original domain to execute a pre-migrate script, if one exists.

Before starting the migration, a command is sent to QEMU using the QEMU Machine Protocol (QMP) to check that the domain can be suspended (see xenopsd/xc/device_common.ml). After QEMU confirms that the VM can be suspended, the migration can start.

Importing metadata

As with hooks, commands to the source domain are sent using stunnel, a daemon used as a wrapper to manage SSL-encrypted communication between two hosts in the same pool. To import the metadata, an XML-RPC command is sent to the original domain.

Once imported, this yields a reference id and allows the new domain to be built on the destination using the temporary VM UUID XXXXXXXX-XXXX-XXXX-XXXX-000000000001, where XXX... is the reference id of the original VM.

Setting memory

One of the first things to do is to set up the memory. The backend checks that no ballooning operation is in progress; the migration can fail at this point if a ballooning operation is in progress and takes too long.

Once the memory has been checked, the daemon gets the state of the VM (running, halted, …), and the backend retrieves information about the VM from xenstore, such as the maximum memory the domain can consume and its quotas.

Once this is complete, we can restore the VIFs and create the domain.

The synchronisation of the memory is the first synchronisation point; after it, everything is ready for the VM migration.

VM Migration

After receiving the memory we can set up the destination domain. If there is a vGPU, we need to kick off its migration process and wait for the acknowledgement indicating that the entry for the GPU has been properly initialised, before starting the main VM migration.

There is a handshake mechanism for synchronising the source and the destination. Using the handshake protocol, the receiver informs the sender that everything has been set up and is ready for save/restore.
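
A single synchronisation point might be sketched as follows, where `send` and `recv` are hypothetical stand-ins for the real control-channel operations:

```ocaml
(* Sketch of a synchronisation point: the sender announces the point
   on the control channel and blocks until the receiver acknowledges.
   [send] and [recv] are illustrative, not the real wire protocol. *)
let handshake ~send ~recv point =
  send point;
  match recv () with
  | "ok" -> ()
  | err -> failwith ("handshake failed at " ^ point ^ ": " ^ err)
```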

VM restore

VM restore is a low-level atomic operation, VM.restore, represented by a function call to the backend. It uses xenguest, a low-level utility from the XAPI toolstack, to interact with the Xen hypervisor and libxc, and to send a migration request to the emu-manager.

After the request is sent, the results coming from emu-manager are collected by the main thread, which blocks until the results are received.

During the live migration, emu-manager helps in ensuring the correct state transitions for the devices and handling the message passing for the VM as it’s moved between hosts. This includes making sure that the state of the VM’s virtual devices, like disks or network interfaces, is correctly moved over.

VM renaming

Once all operations are done, we can rename the VM on the target from its temporary name to its real UUID. This is another low-level atomic operation, VM.rename, which takes care of updating xenstore on the destination.

The next steps are to restore the devices and unpause the domain.

Restoring remaining devices

Restoring devices starts by activating the VBDs using the low-level atomic operation VBD.set_active, which updates xenstore. VBDs that are read-write must be plugged before read-only ones. Once activated, the low-level atomic operation VBD.plug is called, and the VDIs are attached and activated.

The next devices are the VIFs, which are set as active (VIF.set_active) and plugged (VIF.plug). If there are vGPUs, they are set as active now using the atomic operation VGPU.set_active.

We are almost done. The next step is to create the device model.

Create device model

The device model is created using the atomic operation VM.create_device_model, which configures and starts qemu-dm. This allows PCI devices to be managed.

PCI plug

PCI.plug is executed by the backend. It plugs a PCI device and advertises it to QEMU if that option is set, as is the case for NVIDIA SR-IOV vGPUs.

At this point the devices have been restored. The new domain is considered survivable. We can unpause the domain and perform the last actions.

Unpause and done

Unpausing is done by managing the state of the domain using bindings to xenctrl. Once the hypervisor has unpaused the domain, some actions can be requested using VM.set_domain_action_request, which is a path in xenstore. By default no action is requested, but a reboot, for example, can be initiated.

Earlier we described hooks, points at which xenopsd can execute a script; there is also a hook to run a post-migrate script. After that script has been executed (if one exists), the migration is almost done. The last step is a handshake to seal the success of the migration, after which the old VM can be cleaned up.

Links

Some of these links are old, but even though many changes have occurred since, they remain relevant for a global understanding of the XAPI toolstack.

Walkthrough: Starting a VM

A Xenopsd client wishes to start a VM. They must first tell Xenopsd the VM configuration to use. A VM configuration is broken down into objects:

  • VM: A device-less Virtual Machine
  • VBD: A virtual block device for a VM
  • VIF: A virtual network interface for a VM
  • PCI: A virtual PCI device for a VM

Treating devices as first-class objects is convenient because we wish to expose operations on the devices such as hotplug, unplug, eject (for removable media), carrier manipulation (for network interfaces) etc.

The “add” functions in the Xenopsd interface cause Xenopsd to create the objects:

In the case of xapi, there is a set of functions which convert between the XenAPI objects and the Xenopsd objects. The two interfaces are slightly different because they have different expected users:

  • the XenAPI has many clients which are updated on long release cycles. The main property needed is backwards compatibility, so that new releases of xapi remain compatible with these older clients. Quite often we will choose to “grandfather in” some poorly designed interface simply because we wish to avoid imposing churn on 3rd parties.
  • the Xenopsd API clients are all open-source and are part of the xapi-project. These clients can be updated as the API is changed. The main property needed is to keep the interface clean, so that it properly hides the complexity of dealing with Xen from other components.

The Xenopsd “VM.add” function has code like this:

	let add' x =
		debug "VM.add %s" (Jsonrpc.to_string (rpc_of_t x));
		DB.write x.id x;
		let module B = (val get_backend () : S) in
		B.VM.add x;
		x.id

This function does 2 things:

  • it stores the VM configuration in the “database”
  • it tells the “backend” that the VM exists

The Xenopsd database is really a set of config files in the filesystem. All objects belonging to a VM (recall we only have VMs, VBDs, VIFs, PCIs and not stand-alone entities like disks) are placed into a subdirectory named after the VM e.g.:

# ls /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701
config	vbd.xvda  vbd.xvdb
# cat /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701/config
{"id": "7b719ce6-0b17-9733-e8ee-dbc1e6e7b701", "name": "fedora",
 ...
}

Xenopsd doesn’t have as persistent a notion of a VM as xapi, it is expected that all objects are deleted when the host is rebooted. However the objects should be persisted over a simple Xenopsd restart, which is why the objects are stored in the filesystem.
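
The filesystem layout shown above can be captured with a couple of path helpers. This is a sketch only; the root directory depends on the backend (the xenlight example is used here):

```ocaml
(* Sketch: path helpers matching the on-disk "database" layout.
   The root varies per backend; this assumes the xenlight example. *)
let root = "/run/nonpersistent/xenopsd/xenlight"
let vm_dir id = Filename.concat (Filename.concat root "VM") id
let vm_config id = Filename.concat (vm_dir id) "config"
```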

Aside: it would probably be more appropriate to store the metadata in Xenstore since this has the exact object lifetime we need. This will require a more performant Xenstore to realise.

Every running Xenopsd process is linked with a single backend. Currently backends exist for:

  • Xen via libxc, libxenguest and xenstore
  • Xen via libxl, libxc and xenstore
  • Xen via libvirt
  • KVM by direct invocation of qemu
  • Simulation for testing

From here we shall assume the use of the “Xen via libxc, libxenguest and xenstore” (a.k.a. “Xenopsd classic”) backend.

The backend VM.add function checks whether the VM we have to manage already exists – and if it does then it ensures the Xenstore configuration is intact. This Xenstore configuration is important because at any time a client can query the state of a VM with VM.stat and this relies on certain Xenstore keys being present.

Once the VM metadata has been registered with Xenopsd, the client can call VM.start. Like all potentially-blocking Xenopsd APIs, this function returns a Task id. Please refer to the Task handling design for a general overview of how tasks are handled.

Clients can poll the state of a task by calling TASK.stat but most clients will prefer to use the event system instead. Please refer to the Event handling design for a general overview of how events are handled.

The event model is similar to the XenAPI's: clients call a blocking UPDATES.get, passing in a token which represents the point in time when the last UPDATES.get returned. The call blocks until some objects have changed state, and these object ids are returned (NB in the XenAPI the current object states are returned). The client must then call the relevant “stat” function, in this case TASK.stat.
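
One round of this event loop can be sketched as follows, where `updates_get` and `stat` are stand-ins for the real UPDATES.get and TASK.stat RPCs:

```ocaml
(* Sketch of one round of the client-side event loop: UPDATES.get
   blocks, returns the ids of changed objects plus a new token, and
   the client then stats each returned id. *)
let poll_once ~updates_get ~stat token =
  let ids, token' = updates_get token in
  List.iter stat ids;
  token'  (* passed into the next blocking UPDATES.get *)
```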

The client will be able to see the task make progress and use this to, for example, populate a progress bar in a UI. If the client needs to cancel the task it can call TASK.cancel; again, see the Task handling design to understand how this is implemented.

When the Task has completed successfully, then calls to *.stat will show:

  • the power state is Paused
  • exactly one valid Xen domain id
  • all VBDs have active = plugged = true
  • all VIFs have active = plugged = true
  • all PCI devices have plugged = true
  • at least one active console
  • a valid start time
  • valid “targets” for memory and vCPU

Note: before a Task completes, calls to *.stat will show partial updates e.g. the power state may be Paused but none of the disks may have become plugged. UI clients must choose whether they are happy displaying this in-between state or whether they wish to hide it and pretend the whole operation has happened transactionally. If a particular client wishes to perform side-effects in response to Xenopsd state changes – for example to clean up an external resource when a VIF becomes unplugged – then it must be very careful to avoid responding to these in-between states. Generally it is safest to passively report these values without driving things directly from them. Think of them as status lights on the front panel of a PC: fine to look at but it’s not a good idea to wire them up to actuators which actually do things.

Note: the Xenopsd implementation guarantees that, if it is restarted at any point during the start operation, on restart the VM state shall be “fixed” by either (i) shutting down the VM; or (ii) ensuring the VM is intact and running.

In the case of xapi, every Xenopsd Task id is bound one-to-one with a XenAPI task by the function sync_with_task. The function update_task is called when xapi receives a notification that a Xenopsd Task has changed state, and it updates the corresponding XenAPI task. Xapi launches exactly one thread per Xenopsd instance (“queue”) to monitor for background events via the function events_watch, while each thread performing a XenAPI call waits for its specific Task to complete via the function event_wait.

It is the responsibility of the client to call TASK.destroy when the Task is no longer needed. Xenopsd won’t destroy the task because it contains the success/failure result of the operation, which is needed by the client.

What happens when a Xenopsd receives a VM.start request?

When Xenopsd receives the request it adds it to the appropriate per-VM queue via the function queue_operation. To understand this and other internal details of Xenopsd, consult the architecture description. The queue_operation_int function looks like this:

let queue_operation_int dbg id op =
	let task = Xenops_task.add tasks dbg (fun t -> perform op t; None) in
	Redirector.push id (op, task);
	task

The “task” is a record containing Task metadata plus a “do it now” function which will be executed by a thread from the thread pool. The module Redirector takes care of:

  • pushing operations to the right queue
  • ensuring at most one worker thread is working on a VM’s operations
  • reducing the queue size by coalescing items together
  • providing a diagnostics interface
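
A much-simplified, single-process model of this queueing discipline (not the real Redirector module) might look like:

```ocaml
(* Sketch: operations are queued per VM id, and a worker may only
   pop from a VM's queue when no other worker is handling that VM. *)
let queues : (string, string Queue.t) Hashtbl.t = Hashtbl.create 16
let busy : (string, unit) Hashtbl.t = Hashtbl.create 16

let push vm_id op =
  let q =
    match Hashtbl.find_opt queues vm_id with
    | Some q -> q
    | None ->
      let q = Queue.create () in
      Hashtbl.add queues vm_id q;
      q
  in
  Queue.push op q

let pop vm_id =
  if Hashtbl.mem busy vm_id then None (* another worker owns this VM *)
  else
    match Hashtbl.find_opt queues vm_id with
    | Some q when not (Queue.is_empty q) ->
      Hashtbl.replace busy vm_id ();
      Some (Queue.pop q)
    | _ -> None

let finished vm_id = Hashtbl.remove busy vm_id
```

The `busy` table is what ensures at most one worker thread operates on a given VM's operations at a time.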

Once a thread from the worker pool becomes free, it will execute the “do it now” function. In the example above this is perform op t where op is VM_start vm and t is the Task. The function perform has fragments like this:

		| VM_start id ->
			debug "VM.start %s" id;
			perform_atomics (atomics_of_operation op) t;
			VM_DB.signal id

Each “operation” (e.g. VM_start vm) is decomposed into “micro-ops” by the function atomics_of_operation where the micro-ops are small building-block actions common to the higher-level operations. Each operation corresponds to a list of “micro-ops”, where there is no if/then/else. Some of the “micro-ops” may be a no-op depending on the VM configuration (for example a PV domain may not need a qemu). In the case of VM_start vm this decomposes into the sequence:

1. run the “VM_pre_start” scripts

The VM_hook_script micro-op runs the corresponding “hook” scripts. The code is all in the Xenops_hooks module and looks for scripts in the hardcoded path /etc/xapi.d.

2. create a Xen domain

The VM_create micro-op calls the VM.create function in the backend. In the classic Xenopsd backend the VM.create_exn function must

  1. check if we’re creating a domain for a fresh VM or resuming an existing one: if it’s a resume then the domain configuration stored in the VmExtra database table must be used
  2. ask squeezed to create a memory “reservation” big enough to hold the VM memory. Unfortunately the domain cannot be created until the memory is free because domain create often fails in low-memory conditions. This means the “reservation” is associated with our “session” with squeezed; if Xenopsd crashes and restarts the reservation will be freed automatically.
  3. create the Domain via the libxc hypercall
  4. “transfer” the squeezed reservation to the domain such that squeezed will free the memory if the domain is destroyed later
  5. compute and set an initial balloon target depending on the amount of memory reserved (recall we ask for a range between dynamic_min and dynamic_max)
  6. apply the “suppress spurious page faults” workaround if requested
  7. set the “machine address size”
  8. “hotplug” the vCPUs. This operates a lot like memory ballooning – Xen creates lots of vCPUs and then the guest is asked to only use some of them. Every VM therefore starts with the “VCPUs_max” setting and co-operative hotplug is used to reduce the number. Note there is no enforcement mechanism: a VM which cheats and uses too many vCPUs would have to be caught by looking at the performance statistics.
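
Steps 2-4 of the list above can be sketched as follows; all three functions are hypothetical stand-ins for the squeezed and libxc calls:

```ocaml
(* Sketch of the squeezed reservation lifecycle around domain create.
   [reserve], [create_domain] and [transfer] are illustrative. *)
let create_with_reservation ~reserve ~create_domain ~transfer kib =
  let reservation = reserve kib in (* freed automatically if we crash *)
  let domid = create_domain () in  (* can fail in low-memory conditions *)
  transfer reservation domid;      (* squeezed now ties it to the domain *)
  domid
```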

3. build the domain

On a Xen system a domain is created empty, and memory is actually allocated from the host in the “build” phase via functions in libxenguest. The VM.build_domain_exn function must

  1. run pygrub (or eliloader) to extract the kernel and initrd, if necessary
  2. invoke the xenguest binary to interact with libxenguest.
  3. apply the cpuid configuration
  4. store the current domain configuration on disk – it’s important to know the difference between the configuration you started with and the configuration you would use after a reboot because some properties (such as maximum memory and vCPUs) are fixed on create.

The xenguest binary was originally a separate binary for two reasons: (i) the libxenguest functions weren’t threadsafe since they used lots of global variables; and (ii) the libxenguest functions used to have a different, incompatible license, which prevented us from linking. Both these problems have been resolved but we still shell out to the xenguest binary.

The xenguest binary has also evolved to configure more of the initial domain state. It also reads Xenstore and configures

  • the vCPU affinity
  • the vCPU credit2 weight/cap parameters
  • whether the NX bit is exposed
  • whether the viridian CPUID leaf is exposed
  • whether the system has PAE or not
  • whether the system has ACPI or not
  • whether the system has nested HVM or not
  • whether the system has an HPET or not

4. mark each VBD as “active”

VBDs and VIFs are said to be “active” when they are intended to be used by a particular VM, even if the backend/frontend connection hasn’t been established, or has been closed. If someone calls VBD.stat or VIF.stat then the result includes both “active” and “plugged”, where “plugged” is true if the frontend/backend connection is established. For example xapi will set VBD.currently_attached to “active || plugged”. The “active” flag is conceptually very similar to the traditional “online” flag (which is not documented in the upstream Xen tree as of Oct/2014 but really should be) except that on unplug, one would set the “online” key to “0” (false) first before initiating the hotunplug. By contrast the “active” flag is set to false after the unplug i.e. “set_active” calls bracket plug/unplug. If the “active” flag was set before the unplug attempt then as soon as the frontend/backend connection is removed clients would see the VBD as completely dissociated from the VM – this would be misleading because Xenopsd will not have had time to use the storage API to release locks on the disks. By doing all the cleanup before setting “active” to false, clients can be assured that the disks are now free to be reassigned.
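
For example, the derivation xapi uses for VBD.currently_attached can be written as a trivial sketch (the record type here is illustrative):

```ocaml
(* Sketch: how a client like xapi derives currently_attached from the
   two flags returned by VBD.stat, as described above. *)
type state = { active : bool; plugged : bool }

let currently_attached s = s.active || s.plugged
```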

5. handle non-persistent disks

A non-persistent disk is one which is reset to a known-good state on every VM start. The VBD_epoch_begin is the signal to perform any necessary reset.

6. plug VBDs

The VBD_plug micro-op will plug the VBD into the VM. Every VBD is plugged in a carefully-chosen order. Generally, plug order is important for all types of devices. For VBDs, we must work around the deficiency in the storage interface where a VDI, once attached read/only, cannot be attached read/write. Since it is legal to attach the same VDI with multiple VBDs, we must plug them in such that the read/write VBDs come first. From the guest’s point of view the order we plug them doesn’t matter because they are indexed by the Xenstore device id (e.g. 51712 = xvda).
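
The ordering constraint can be sketched as a simple partition (the types are illustrative, not the real xenopsd ones):

```ocaml
(* Sketch of the plug-order constraint: read/write VBDs come first,
   so a shared VDI is never attached read-only before a read-write
   attach of the same VDI. *)
type mode = ReadWrite | ReadOnly
type vbd = { id : string; mode : mode }

let plug_order vbds =
  let rw, ro = List.partition (fun v -> v.mode = ReadWrite) vbds in
  rw @ ro
```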

The function VBD.plug will

  • call VDI.attach and VDI.activate in the storage API to make the devices ready (start the tapdisk processes etc)
  • add the Xenstore frontend/backend directories containing the block device info
  • add the extra xenstore keys returned by the VDI.attach call that are needed for SCSIid passthrough which is needed to support VSS
  • write the VBD information to the Xenopsd database so that future calls to VBD.stat can be told about the associated disk (this is needed so clients like xapi can cope with CD insert/eject etc)
  • if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

The Xenstore keys are written by the functions Device.Vbd.add_async and Device.Vbd.add_wait. In a Linux domain (such as dom0) when the backend directory is created, the kernel creates a “backend device”. Creating any device will cause a kernel UEVENT to fire which is picked up by udev. The udev rules run a script whose only job is to stat(2) the device (from the “params” key in the backend) and write the major and minor number to Xenstore for blkback to pick up. (Aside: FreeBSD doesn’t do any of this, instead the FreeBSD kernel module simply opens the device in the “params” key). The script also writes the backend key “hotplug-status=connected”. We currently wait for this key to be written so that later calls to VBD.stat will return with “plugged=true”. If the call returns before this key is written then sometimes we receive an event, call VBD.stat and conclude erroneously that a spontaneous VBD unplug occurred.
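
The wait for the “hotplug-status=connected” key can be sketched as follows. Here `read` stands in for a real xenstore read, and the bounded retry loop stands in for what is really a xenstore watch:

```ocaml
(* Sketch: block until the udev script writes hotplug-status=connected
   under the backend path, so later VBD.stat calls report plugged=true. *)
let rec wait_connected ~read ~path retries =
  match read (path ^ "/hotplug-status") with
  | Some "connected" -> true
  | _ -> retries > 0 && wait_connected ~read ~path (retries - 1)
```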

7. mark each VIF as “active”

This is for the same reason as VBDs are marked “active”.

8. plug VIFs

Again, the order matters. Unlike VBDs, there is no read/write read/only constraint and the devices have unique indices (0, 1, 2, …) but Linux kernels have often (always?) ignored the actual index and instead relied on the order of results from the xenstore-ls listing. The order that xenstored returns the items happens to be the order the nodes were created so this means that (i) xenstored must continue to store directories as ordered lists rather than maps (which would be more efficient); and (ii) Xenopsd must make sure to plug the vifs in the same order. Note that relying on ethX device numbering has always been a bad idea but is still common. I bet if you change this lots of tests will suddenly start to fail!

The function VIF.plug_exn will

  • compute the port locking configuration required and write this to a well-known location in the filesystem where it can be read from the udev scripts. This really should be written to Xenstore instead, since this scheme doesn’t work with driver domains.
  • add the Xenstore frontend/backend directories containing the network device info
  • write the VIF information to the Xenopsd database so that future calls to VIF.stat can be told about the associated network
  • if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

Similarly to the VBD case, the function Device.Vif.add will write the Xenstore keys and wait for the “hotplug-status=connected” key. We do this because we cannot apply the port locking rules until the backend device has been created, and we cannot know the rules have been applied until after the udev script has written the key. If we didn’t wait for it then the VM might execute without all the port locking properly configured.

9. create the device model

The VM_create_device_model micro-op will create a qemu device model if

  • the VM is HVM; or
  • the VM uses a PV keyboard or mouse (since only qemu currently has backend support for these devices).

The function VM.create_device_model_exn will

  • (if using a qemu stubdom) it will create and build the qemu domain
  • compute the necessary qemu arguments and launch it.

Note that qemu (aka the “device model”) is created after the VIFs and VBDs have been plugged but before the PCI devices have been plugged. Unfortunately qemu traditional infers the needed emulated hardware by inspecting the Xenstore VBD and VIF configuration and assuming that we want one emulated device per PV device, up to the natural limits of the emulated buses (i.e. there can be at most 4 IDE devices: {primary,secondary}{master,slave}). Not only does this create an ordering dependency that needn’t exist – and which impacts migration downtime – but it also completely ignores the plain fact that, on a Xen system, qemu can be in a different domain than the backend disk and network devices. This hack only works because we currently run everything in the same domain. There is an option (off by default) to list the emulated devices explicitly on the qemu command-line. If we switch to this by default then we ought to be able to start up qemu early, as soon as the domain has been created (qemu will need to know the domain id so it can map the I/O request ring).

10. plug PCI devices

PCI devices are treated differently to VBDs and VIFs. If we are attaching the device to an HVM guest then instead of relying on the traditional Xenstore frontend/backend state machine we instead send RPCs to qemu requesting they be hotplugged. Note the domain is paused at this point, but qemu still supports PCI hotplug/unplug. The reasons why this doesn’t follow the standard Xenstore model are known only to the people who contributed this support to qemu. Again the order matters because it determines the position of the virtual device in the VM.

Note that Xenopsd doesn’t know anything about the PCI devices; concepts such as “GPU groups” belong to higher layers, such as xapi.

11. mark the domain as alive

A design principle of Xenopsd is that it should tolerate failures such as being suddenly restarted. It guarantees to always leave the system in a valid state, in particular there should never be any “half-created VMs”. We achieve this for VM start by exploiting the mechanism which is necessary for reboot. When a VM wishes to reboot it causes the domain to exit (via SCHEDOP_shutdown) with a “reason code” of “reboot”. When Xenopsd sees this event VM_check_state operation is queued. This operation calls VM.get_domain_action_request to ask the question, “what needs to be done to make this VM happy now?”. The implementation checks the domain state for shutdown codes and also checks a special Xenopsd Xenstore key. When Xenopsd creates a Xen domain it sets this key to “reboot” (meaning “please reboot me if you see me”) and when Xenopsd finishes starting the VM it clears this key. This means that if Xenopsd crashes while starting a VM, the new Xenopsd will conclude that the VM needs to be rebooted and will clean up the current domain and create a fresh one.
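
A minimal sketch of this “reboot key” mechanism, using an in-memory table in place of the real Xenstore key (all names are illustrative):

```ocaml
(* Sketch of the crash-safety trick: the action-request key is set to
   "reboot" when the domain is created and cleared only once start
   completes. A restarted xenopsd that still sees "reboot" knows the
   start never finished, and cleans up and recreates the domain. *)
let action_request : (string, string) Hashtbl.t = Hashtbl.create 4

let on_domain_create vm = Hashtbl.replace action_request vm "reboot"
let on_start_complete vm = Hashtbl.remove action_request vm
let get_domain_action_request vm = Hashtbl.find_opt action_request vm
```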

12. unpause the domain

A Xenopsd VM.start will always leave the domain paused, so strictly speaking this is a separate “operation” queued by the client (such as xapi) after the VM.start has completed. The function VM.unpause is reassuringly simple:

		if di.Xenctrl.total_memory_pages = 0n then raise (Domain_not_built);
		Domain.unpause ~xc di.Xenctrl.domid;
		Opt.iter
			(fun stubdom_domid ->
				Domain.unpause ~xc stubdom_domid
			) (get_stubdom ~xs di.Xenctrl.domid)
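
The flat decomposition performed by atomics_of_operation, described at the start of this sequence, can be sketched like this. The constructor names mirror the walkthrough but are illustrative, and the list is abbreviated (hooks, epochs and the alive-marking step are simplified):

```ocaml
(* Sketch: an operation decomposes into a flat list of micro-ops with
   no if/then/else; conditional behaviour (e.g. "no qemu for a PV
   guest") lives inside each micro-op, not in the list itself. *)
type micro_op =
  | VM_hook_script of string
  | VM_create of string
  | VM_build of string
  | VBD_set_active of string * bool
  | VBD_epoch_begin of string
  | VBD_plug of string
  | VIF_set_active of string * bool
  | VIF_plug of string
  | VM_create_device_model of string
  | PCI_plug of string

let atomics_of_start vm ~vbds ~vifs ~pcis =
  [ VM_hook_script "VM_pre_start"; VM_create vm; VM_build vm ]
  @ List.map (fun d -> VBD_set_active (d, true)) vbds
  @ List.map (fun d -> VBD_epoch_begin d) vbds
  @ List.map (fun d -> VBD_plug d) vbds
  @ List.map (fun d -> VIF_set_active (d, true)) vifs
  @ List.map (fun d -> VIF_plug d) vifs
  @ [ VM_create_device_model vm ]
  @ List.map (fun d -> PCI_plug d) pcis
```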