Design

Hooks

There are a number of hook points at which xenopsd may execute certain scripts. These scripts are found in hook-specific directories of the form /etc/xapi.d/<hookname>/. All executable scripts in these directories are run with the following arguments:

<script.sh> -reason <reason> -vmuuid <uuid of VM>

The scripts are executed in filename-order. By convention, the filenames are usually of the form 10resetvdis.
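
As a rough illustration only (this is not the actual xenopsd code; run_hook and its argument handling are invented), the behaviour described above amounts to:

	(* Hypothetical sketch: run the scripts in a hook directory in filename
	   order, passing the documented arguments. The real xenopsd only runs
	   files that are executable; that check is omitted here for brevity. *)
	let run_hook ~hookname ~reason ~vm_uuid =
	  let dir = Printf.sprintf "/etc/xapi.d/%s" hookname in
	  if Sys.file_exists dir && Sys.is_directory dir then
	    Sys.readdir dir
	    |> Array.to_list
	    |> List.sort compare                 (* filename order, e.g. 10resetvdis *)
	    |> List.iter (fun script ->
	           let path = Filename.concat dir script in
	           let cmd =
	             Printf.sprintf "%s -reason %s -vmuuid %s"
	               (Filename.quote path) reason vm_uuid
	           in
	           ignore (Sys.command cmd))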

The hook points are:

vm-pre-shutdown
vm-pre-migrate
vm-post-migrate (Dundee only)
vm-pre-start
vm-pre-reboot
vm-pre-resume
vm-post-resume (Dundee only)
vm-post-destroy

and the reason codes are:

clean-shutdown
hard-shutdown
clean-reboot
hard-reboot
suspend
source -- passed to pre-migrate hook on source host
destination -- passed to post-migrate hook on destination (Dundee only)
none

For example, in order to execute a script on VM shutdown, it would be sufficient to create the script in the vm-post-destroy hook point:

/etc/xapi.d/vm-post-destroy/01myscript.sh

containing

#!/bin/bash
echo "I was passed $@" > /tmp/output

And when, for example, VM e30d0050-8f15-e10d-7613-cb2d045c8505 is shut down, the script is executed:

[vagrant@localhost ~]$ sudo xe vm-shutdown --force uuid=e30d0050-8f15-e10d-7613-cb2d045c8505
[vagrant@localhost ~]$ cat /tmp/output
I was passed -vmuuid e30d0050-8f15-e10d-7613-cb2d045c8505 -reason hard-shutdown

PVS Proxy OVS Rules

Rule Design

The Open vSwitch (OVS) daemon implements a programmable switch. XenServer uses it to redirect traffic between three entities:

  • PVS server - identified by its IP address
  • a local VM - identified by its MAC address
  • a local Proxy - identified by its MAC address

The VM and the PVS server are unaware of the Proxy; xapi configures OVS to redirect the traffic between them so that it passes through the Proxy.

OVS uses rules that match packets. Rules are organised in sets called tables. A rule can match a packet and inject it into another rule set (table), such that the packet can be matched again.

Furthermore, a rule can set registers associated with a packet, which can then be matched in subsequent rules. In this way, a packet can be tagged such that it only matches specific downstream rules that look for the tag.

Xapi configures 3 rule sets:

Table 0 - Entry Rules

Rules match UDP traffic flowing from VM to PVS, from Proxy to VM, and from PVS to VM, where the PVS server is identified by its IP address and all other components by their MAC addresses. All matched packets are tagged with the direction in which they are travelling and re-submitted into Table 101, which handles ports.

Table 101 - Port Rules

Rules match UDP traffic going to a specific port of the PVS server and re-submit it into Table 102.

Table 102 - Exit Rules

These rules implement the redirection:

  • Packets travelling from the VM to the PVS server are directed to the Proxy.
  • Packets travelling from the PVS server to the VM are directed to the Proxy.
  • Packets coming from the Proxy are already addressed properly (to the VM) and are handled normally.
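
To make the three-table flow concrete, here is an illustrative model only; it is not the code xapi uses to program OVS, and all names, match strings and actions below are invented:

	(* Hypothetical model of the three tables above. Register reg0 carries a
	   direction tag set in Table 0 and matched again in Table 102. *)
	type rule = {
	  table : int;             (* 0 = entry, 101 = port, 102 = exit *)
	  matches : string list;   (* human-readable match conditions *)
	  action : string;         (* what OVS should do with a matching packet *)
	}

	let pvs_proxy_rules ~pvs_ip ~vm_mac ~proxy_mac ~pvs_port =
	  [ (* Table 0: tag the direction and resubmit to the port table *)
	    { table = 0;
	      matches = [ "udp"; "src_mac=" ^ vm_mac; "dst_ip=" ^ pvs_ip ];
	      action = "reg0 := vm_to_pvs; resubmit(101)" };
	    { table = 0;
	      matches = [ "udp"; "src_ip=" ^ pvs_ip; "dst_mac=" ^ vm_mac ];
	      action = "reg0 := pvs_to_vm; resubmit(101)" };
	    { table = 0;
	      matches = [ "udp"; "src_mac=" ^ proxy_mac ];
	      action = "reg0 := from_proxy; resubmit(101)" };
	    (* Table 101: only traffic for the PVS port of interest goes further *)
	    { table = 101;
	      matches = [ "udp"; "dst_port=" ^ string_of_int pvs_port ];
	      action = "resubmit(102)" };
	    (* Table 102: redirect VM<->PVS traffic to the proxy; proxy traffic is
	       already addressed to the VM and is forwarded normally *)
	    { table = 102; matches = [ "reg0=vm_to_pvs" ]; action = "output to proxy" };
	    { table = 102; matches = [ "reg0=pvs_to_vm" ]; action = "output to proxy" };
	    { table = 102; matches = [ "reg0=from_proxy" ]; action = "normal forwarding" } ]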

Requirements for suspend image framing

We are currently (Dec 2013) undergoing a transition from the ‘classic’ xenopsd backend (built upon calls to libxc) to the ‘xenlight’ backend built on top of the officially supported libxl API.

During this work, we have come across an incompatibility between the suspend images created using the ‘classic’ backend and those created using the new libxl-based backend. This needed to be fixed to enable Rolling Pool Upgrade (RPU) to any new version of XenServer.

Historic ‘classic’ stack

Prior to this work, xenopsd was involved in the construction of the suspend image and we ended up with an image with the following format:

+-----------------------------+
| "XenSavedDomain\n"          |  <-- added by xenopsd-classic
|-----------------------------|
|  Memory image dump          |  <-- libxc
|-----------------------------|
| "QemuDeviceModelRecord\n"   |
|  <size of following record> |  <-- added by xenopsd-classic
|  (a 32-bit big-endian int)  |
|-----------------------------|
| "QEVM"                      |  <-- libxc/qemu
|  Qemu device record         |
+-----------------------------+

We have also been carrying a patch in the Xen patchqueue against xc_domain_restore. This patch (revert_qemu_tail.patch) stopped xc_domain_restore from attempting to read past the memory image dump, at which point xenopsd-classic would just take over and restore what it had put there.

Requirements for new stack

For xenopsd-xenlight to work, we need to operate without the revert_qemu_tail.patch since libxl assumes it is operating on top of an upstream libxc.

We need the following relationships between backends, such that a suspend image created on one can be restored on the other, where the backends are old-classic (OC), new-classic (NC) and xenlight (XL). Obviously, all suspend images created on any backend must also be restorable on that same backend:

                OC _______ NC _______ XL
                 \  >>>>>      >>>>>  /
                  \__________________/
                    >>>>>>>>>>>>>>>>

It turns out this was not so simple. After removing the patch against xc_domain_restore and allowing libxc to restore the hvm_buffer_tail, we found that suspend images created with OC (detailed in the previous section) are not of a valid format, for two reasons:

i. The "XenSavedDomain\n" was extraneous;

ii. The Qemu signature section (prior to the record) is not of a valid form.

It turns out that the section with the Qemu signature can be one of the following:

a. "QemuDeviceModelRecord" (NB. no newline) followed by the record to EOF;
b. "DeviceModelRecord0002" then a uint32_t length followed by record;
c. "RemusDeviceModelState" then a uint32_t length followed by record;

The old-classic (OC) backend not only uses an invalid signature (since it contains a trailing newline) but also includes a length field, and that length is big-endian whereas the uint32_t in the valid formats is little-endian.

We considered creating a proxy for the fd in the incompatible cases, but since this would need to be a byte-by-byte proxy with a 22-byte lookahead, this was deemed impractical. Instead we have patched libxc with a much simpler patch to understand this legacy format.

Because peek-ahead is not possible on pipes, the patch for (ii) needed to be applied at a point where the hvm tail had been read completely. We piggy-backed on the point after (a) had been detected. At this point the remainder of the fd is buffered (only around 7k) and the magic “QEVM” is expected at the head of this buffer. So we simply added a patch to check whether there was a pesky newline and whether buffer[5:8] was “QEVM”; if so, we discard the first 5 bytes:

                              0    1    2    3    4    5   6   7   8
Legacy format from OC:  [...| \n | \x | \x | \x | \x | Q | E | V | M |...]

Required at this point: [...|  Q |  E |  V |  M |...]
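
The check amounts to something like the following sketch (the real fix is the C patch libxc-restore-legacy-image.patch inside libxc; this OCaml version is purely illustrative):

	(* Illustrative only: [buf] is the buffered remainder of the fd described
	   above, with "QEVM" expected at its head in the required format. *)
	let strip_legacy_prefix (buf : bytes) : bytes =
	  let len = Bytes.length buf in
	  if len >= 9
	     && Bytes.get buf 0 = '\n'
	     && Bytes.sub_string buf 5 4 = "QEVM"
	  then Bytes.sub buf 5 (len - 5)   (* drop the newline and the 4-byte length *)
	  else buf                         (* already in the required format *)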

Changes made

To make the above use-cases work, we have made the following changes:

1. Make new-classic (NC) not restore Qemu tail (let libxc do it)
    xenopsd.git:ef3bf4b

2. Make new-classic use valid signature (b) for future restore images
    xenopsd.git:9ccef3e

3. Make xc_domain_restore in libxc understand legacy xenopsd (OC) format
    xen-4.3.pq.hg:libxc-restore-legacy-image.patch

4. Remove revert-qemu-tail.patch from Xen patchqueue
    xen-4.3.pq.hg:3f0e16f2141e

5. Make xenlight (XL) use "XenSavedDomain\n" start-of-image signature
    xenopsd.git:dcda545

This has made the required use-cases work as follows:

                OC __134__ NC __245__ XL
                 \  >>>>>      >>>>>  /
                  \_______345________/
                    >>>>>>>>>>>>>>>>

And suspend-resume on the same backend works by virtue of:

OC --> OC : Just works
NC --> NC : By 1,2,4
XL --> XL : By 4 (5 is used but not required)

New components

The outputs of the changes above are:

  • A new xenops-xc binary for NC
  • A new xenops-xl binary for XL
  • A new libxenguest.4.3 for both NC and XL

Future considerations

This should serve as a useful reference when considering making changes to the suspend image in any way.

Suspend image framing format

Example suspend image layout:

+----------------------------+
| 1. Suspend image signature |
+============================+
| 2.0 Xenops header          |
| 2.1 Xenops record          |
+============================+
| 3.0 Libxc header           |
| 3.1 Libxc record           |
+============================+
| 4.0 Qemu header            |
| 4.1 Qemu save record       |
+============================+
| 5.0 End_of_image footer    |
+----------------------------+

A suspend image is now constructed as a series of header-record pairs. The initial signature (1.) is used to determine whether we are dealing with the unstructured, “legacy” suspend image or the new, structured format.

Each header is two 64-bit integers: the first identifies the header type and the second is the length of the record that follows in bytes. The following types have been defined (the ones marked with a (*) have yet to be implemented):

* Xenops       : Metadata for the suspend image
* Libxc        : The result of a xc_domain_save
* Libxl*       : Not implemented
* Libxc_legacy : Marked as a libxc record saved using pre-Xen-4.5
* Qemu_trad    : The qemu save file for the Qemu used in XenServer
* Qemu_xen*    : Not implemented
* Demu*        : Not implemented
* End_of_image : A footer marker to denote the end of the suspend image

Some of the above types do not have the notion of a length, since their lengths cannot be known upfront before saving and restoring them is delegated to other layers of the stack. Specifically, these are the memory image sections, libxc and libxl.
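
As a rough, illustrative sketch of this framing (not the xenopsd implementation; the numeric tag values and big-endian encoding below are assumptions made for the example), a header could be read as follows:

	(* Hypothetical framing sketch: a header is two 64-bit integers, a type
	   tag followed by the record length in bytes. *)
	type header_type =
	  | Xenops | Libxc | Libxl | Libxc_legacy
	  | Qemu_trad | Qemu_xen | Demu | End_of_image

	type header = { htype : header_type; length : int64 }

	let header_type_of_int64 = function
	  | 0L -> Xenops       | 1L -> Libxc
	  | 2L -> Libxl        | 3L -> Libxc_legacy
	  | 4L -> Qemu_trad    | 5L -> Qemu_xen
	  | 6L -> Demu         | 7L -> End_of_image
	  | n -> failwith (Printf.sprintf "unknown header type %Ld" n)

	(* Read the two 64-bit integers that make up a header. The integer
	   encoding (big-endian) and the tag values are assumptions. *)
	let read_header (ic : in_channel) : header =
	  let read_int64 ic =
	    let b = Bytes.create 8 in
	    really_input ic b 0 8;
	    Bytes.get_int64_be b 0
	  in
	  let tag = read_int64 ic in
	  let length = read_int64 ic in
	  { htype = header_type_of_int64 tag; length }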

Tasks

Some operations performed by Xenopsd are blocking, for example:

  • suspend/resume/migration
  • attaching disks (where the SMAPI VDI.attach/activate calls can perform network I/O)

We want to be able to

  • present the user with an idea of progress (perhaps via a “progress bar”)
  • allow the user to cancel a blocked operation that is taking too long
  • associate logging with the user/client-initiated actions that spawned them

Principles

  • all operations which may block (the vast majority) should be written in an asynchronous style i.e. the operations should immediately return a Task id
  • all operations should guarantee to respond to a cancellation request in a bounded amount of time (30s)
  • when cancelled, the system should always be left in a valid state
  • clients are responsible for destroying Tasks when they are finished with the results

Types

A task has a state, which may be Pending, Completed or Failed:

	type async_result = unit

	type completion_t = {
		duration : float;
		result : async_result option
	}

	type state =
		| Pending of float
		| Completed of completion_t
		| Failed of Rpc.t

When a task is Failed, we associate it with a marshalled exception (a value of type Rpc.t). This exception must be one from the set defined in the Xenops_interface. To see how they are marshalled, see Xenops_server.

From the point of view of a client, a Task has the following immutable type (which can be queried with Task.stat):

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
	}

where

  • id is a unique (integer) id generated by Xenopsd. This is how a Task is represented to clients
  • dbg is a client-provided debug key which will be used in log lines, allowing lines from the same Task to be associated together
  • ctime is the creation time
  • state is the current state (Pending/Completed/Failed)
  • subtasks lists logical internal sub-operations for debugging
  • debug_info includes miscellaneous key/value pairs used for debugging

Internally, Xenopsd uses a mutable record type to track Task state. This is broadly similar to the interface type except

  • the state is mutable: this allows Tasks to complete
  • the task contains a “do this now” thunk
  • there is a “cancelling” boolean which is toggled to request a cancellation.
  • there is a list of cancel callbacks
  • there are some fields related to “cancel points”
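
A hypothetical sketch of such a record, with field names invented to mirror the bullets above (the real definition lives in Xenops_task):

	type task_handle = {
	  id : id;                               (* same id type as the interface above *)
	  dbg : string;
	  ctime : float;
	  mutable state : state;                 (* mutable: allows the Task to complete *)
	  f : task_handle -> unit;               (* the "do this now" thunk *)
	  mutable cancelling : bool;             (* toggled to request cancellation *)
	  mutable cancel_callbacks : (unit -> unit) list;
	  mutable cancel_points_seen : int;      (* see "cancel points" below *)
	}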

Persistence

The Tasks are intended to represent activities associated with in-memory queues and threads. Therefore the active Tasks are kept in memory in a map, and will be lost over a process restart. This is desirable since we will also lose the queued items and the threads, so there is no need to resync on start.

Note that every operation must ensure that the state of the system is recoverable on restart by not leaving it in an invalid state. It is not necessary to either guarantee to complete or roll-back a Task. Tasks are not expected to be transactional.

Lifecycle of a Task

All Tasks returned by API functions are created as part of the enqueue functions: queue_operation_*. Even operations which are performed internally are normally wrapped in Tasks by the function immediate_operation.

A queued operation will be processed by one of the queue worker threads. It will

  • set the thread-local debug key to the Task.dbg
  • call task.Xenops_task.run, taking care to catch exceptions and update the task.Xenops_task.state
  • unset the thread-local debug key
  • generate an event on the Task to provoke clients to query the current state.

Task implementations must update their progress as they work. For the common case of a compound operation like VM_start, which is decomposed into multiple “micro-ops” (e.g. VM_create, VM_build), there is a useful helper function perform_atomics which divides the progress ‘bar’ into sections, where each “micro-op” can have a different size (weight). A progress callback function is passed into each Xenopsd backend function so that it can be updated with fine granularity. For example, note the arguments to B.VM.save.

Clients are expected to destroy Tasks they are responsible for creating. Xenopsd cannot do this on their behalf because it does not know if they have successfully queried the Task status/result.

When Xenopsd is a client of itself, it will take care to destroy the Task properly, for example see immediate_operation.
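
To make the client-side contract concrete, here is a hypothetical polling client. The names start_vm, task_stat and task_destroy are invented stand-ins for the real enqueue and TASK calls, and a real client would wait for Task events rather than poll:

	(* Enqueue an operation, wait for its Task to leave Pending, consume the
	   result, then destroy the Task (the client created it, so the client
	   must destroy it). *)
	let run_to_completion ~dbg ~vm_id
	    ~(start_vm : dbg:string -> string -> string)      (* returns a Task id *)
	    ~(task_stat : dbg:string -> string -> state)
	    ~(task_destroy : dbg:string -> string -> unit) =
	  let task_id = start_vm ~dbg vm_id in
	  let rec wait () =
	    match task_stat ~dbg task_id with
	    | Pending _ -> Unix.sleepf 0.1; wait ()   (* a real client would wait for the event *)
	    | Completed _ -> Ok ()
	    | Failed err -> Error err
	  in
	  let result = wait () in
	  task_destroy ~dbg task_id;
	  result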

Cancellation

The goal of cancellation is to unstick a blocked operation and to return the system to some valid state, not any valid state in particular. Xenopsd does not treat operations as transactions; when an operation is cancelled it may

  • fully complete (e.g. if it was about to do this anyway)
  • fully abort (e.g. if it had made no progress)
  • enter some other valid state (e.g. if it had gotten half way through)

Xenopsd will never leave the system in an invalid state after cancellation.

Every Xenopsd operation should unblock and return the system to a valid state within a reasonable amount of time after a cancel request. This should be as quick as possible, but up to 30s may be acceptable. Bear in mind that a human is probably impatiently watching a UI which says “please wait” and has no notion of progress itself. Keep it quick!

Cancellation is triggered by TASK.cancel which calls cancel. This

  • sets the cancelling boolean
  • calls all registered cancel callbacks
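
In terms of the task record sketched earlier, a minimal version of this might look like the following (not the exact Xenops_task code):

	let cancel (t : task_handle) =
	  t.cancelling <- true;                            (* set the cancelling boolean *)
	  List.iter (fun cb -> cb ()) t.cancel_callbacks   (* call all registered cancel callbacks *)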

Implementations respond to cancellation either by checking the cancelling boolean or via the registered cancel callbacks. Xenopsd’s libxc backend can block in 2 different ways, and therefore has 2 different types of cancel callback:

  1. cancellable Xenstore watches
  2. cancellable subprocesses

Xenstore watches are used for device hotplug and unplug. Xenopsd has to wait for the backend or for a udev script to do something. If that blocks then we need a way to cancel the watch. The easiest way to cancel a watch is to watch an additional path (a “cancel path”) and delete it, see cancellable_watch. The “cancel paths” are placed within the VM’s Xenstore directory to ensure that cleanup code which does xenstore-rm will automatically “cancel” all outstanding watches. Note that we trigger a cancel by deleting the path rather than creating it, to avoid racing with the cleanup’s delete and creating orphaned Xenstore entries.
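
The following is a hypothetical sketch of this idea; xs_rm and wait_for_watch are invented stand-ins for the real Xenstore client calls, and the task record is the one sketched earlier:

	exception Cancelled_by_watch

	(* Watch both the path of interest and a "cancel path" under the VM's
	   Xenstore directory; the registered callback deletes the cancel path,
	   which fires the watch and unblocks the waiting thread. *)
	let cancellable_watch t ~xs_rm ~wait_for_watch ~path ~cancel_path =
	  t.cancel_callbacks <- (fun () -> xs_rm cancel_path) :: t.cancel_callbacks;
	  let fired = wait_for_watch [ path; cancel_path ] in
	  if fired = cancel_path then raise Cancelled_by_watch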

Subprocesses are used for suspend/resume/migrate. Xenopsd hands file descriptors to libxenguest by running a subprocess and passing the fds to it. Xenopsd therefore gets the process id and can send it a signal to cancel it. See Cancellable_subprocess.run.
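
A corresponding sketch for subprocesses (the real code is Cancellable_subprocess.run; the signal choice here is an assumption):

	(* Register a cancel callback that signals the helper subprocess. *)
	let register_subprocess_cancel (t : task_handle) ~(pid : int) =
	  t.cancel_callbacks <-
	    (fun () -> try Unix.kill pid Sys.sigterm with Unix.Unix_error _ -> ())
	    :: t.cancel_callbacks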

Testing with cancel points

Cancellation is difficult to test, as it is completely asynchronous. Therefore Xenopsd has some built-in cancellation testing infrastructure known as “cancel points”. A “cancel point” is a point in the code where a Cancelled exception could be thrown, either by checking the cancelling boolean or as a side-effect of a cancel callback. The check_cancelling function increments a counter every time it passes one of these points, and this value is returned to clients in the Task.debug_info.
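
Assuming the task record sketched earlier, a cancel point amounts to something like the following (the real function is check_cancelling; the exception payload is an assumption):

	exception Cancelled of string

	(* Count the cancel point, then raise if a cancel has been requested.
	   The counter ends up in Task.debug_info for the test harness to read. *)
	let check_cancelling (t : task_handle) =
	  t.cancel_points_seen <- t.cancel_points_seen + 1;
	  if t.cancelling then raise (Cancelled t.dbg)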

A test harness runs a series of operations. Each operation is first run all the way through to completion to discover the total number of cancel points. The operation is then re-run with a request to cancel at a particular point. The test then waits for the system to stabilise and verifies that it appears to be in a valid state.

Preventing Tasks leaking

The client which creates a Task must destroy it when the Task is finished and the result has been processed. What if a client like xapi is restarted while a Task is running?

We assume that, if xapi is talking to a xenopsd, then xapi completely owns it. Therefore xapi should destroy any completed tasks that it doesn’t recognise.

If a user wishes to manage VMs with xenopsd in parallel with xapi, the user should run a separate xenopsd.