XAPI Toolstack Developer Guide

The XAPI Toolstack:

Forms the control plane of both XenServer and xcp-ng,
manages clusters of Xen hosts with shared storage and networking,
has a full-featured API, used by clients such as XenCenter and Xen Orchestra.

The XAPI Toolstack is an open-source project developed by the xapi project, a sub-project of the Linux Foundation Xen Project.

The source code is available on GitHub under the xapi-project organisation. The main repository is xen-api.

This developer guide documents the internals of the Toolstack to help developers understand the code, fix bugs and add new features. It is a work-in-progress, with new documents added when ready and updated whenever needed.

The XAPI Toolstack

Responsibilities

The XAPI Toolstack forms the main control plane of a pool of XenServer hosts. It allows the administrator to:

Configure the hardware resources of XenServer hosts: storage, networking, graphics, memory.
Create, configure and destroy VMs and their virtual resources.
Control the lifecycle of VMs.
Monitor the status of hosts, VMs and related resources.

To do this, the Toolstack:

Exposes an API that can be accessed by external clients over HTTP(S).
Exposes a CLI.
Ensures that physical resources are configured when needed, and VMs receive the resources they require.
Implements various features to help the administrator manage their systems.
Monitors running VMs.
Records metrics about physical and virtual resources.

High-level architecture

The XAPI Toolstack manages a cluster of hosts, network switches and storage on behalf of clients such as XenCenter and Xen Orchestra.

The most fundamental concept is of a Resource pool: the whole cluster managed as a single entity. The following diagram shows a cluster of hosts running xapi, all sharing some storage:

A Resource Pool

At any time, at most one host is known as the pool coordinator (formerly known as “master”) and is responsible for coordination and locking resources within the pool. When a pool is first created, a coordinator host is chosen. The coordinator role can be transferred:

on user request in an orderly fashion (xe pool-designate-new-master)
on user request in an emergency (xe pool-emergency-transition-to-master)
automatically if HA is enabled on the cluster.

All hosts expose an HTTP, XML-RPC and JSON-RPC interface running on port 80 and with TLS on port 443, but control operations will only be processed on the coordinator host. Attempts to send a control operation to another host will result in a XenAPI redirect error message. For efficiency the following operations are permitted on non-coordinator hosts:

querying performance counters (and their history)
connecting to VNC consoles
import/export (particularly when disks are on local storage)
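A client that sends a control operation to a non-coordinator host can follow the redirect itself. The sketch below assumes that the redirect arrives as a XenAPI failure whose first detail is the error code HOST_IS_SLAVE and whose second detail is the coordinator's address. That error code is not named in this guide, and the helper names coordinator_from_failure and url_for are hypothetical.

```python
from typing import Optional, Sequence

# Assumed error code for the XenAPI redirect; it is not named in this guide.
REDIRECT_CODE = "HOST_IS_SLAVE"


def coordinator_from_failure(details: Sequence[str]) -> Optional[str]:
    """Return the coordinator address carried by a redirect error, if any.

    `details` is the list of strings attached to a XenAPI failure.
    """
    if len(details) >= 2 and details[0] == REDIRECT_CODE:
        return details[1]
    return None


def url_for(address: str, use_tls: bool = True) -> str:
    """Build the base URL for a host: TLS on port 443, plain HTTP on port 80."""
    scheme = "https" if use_tls else "http"
    return f"{scheme}://{address}"
```

A client would typically catch the failure raised at login, call coordinator_from_failure() on its details and, if an address comes back, open a new session against url_for() of that address.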

Since the coordinator host acts as coordinator and lock manager, the other hosts will often talk to the coordinator. Non-coordinator hosts will also talk to each other (over the same HTTP and RPC channels) to:

transfer VM memory images (VM migration)
mirror disks (storage migration)

Note that some types of shared storage (in particular all those using VHD) require coordination for disk GC and coalesce. This coordination is currently done by xapi and hence it is not possible to share this kind of storage between resource pools.

The following diagram shows the software running on a single host. Note that all hosts run the same software (although not necessarily the same version, if we are in the middle of a rolling update).

A Host

The XAPI Toolstack expects the host to be running Xen on x86. The Xen hypervisor partitions the host into Domains, some of which can have privileged hardware access, and the rest are unprivileged guests. The XAPI Toolstack normally runs all of its components in the privileged initial domain, Domain 0, also known as “the control domain”. However there is experimental code which supports “driver domains” allowing storage and networking drivers to be isolated in their own domains.

Environment

The Toolstack runs in an environment on a server (host) that has:

Physical hardware.
The Xen hypervisor.
The control domain (domain 0): the privileged domain that the Toolstack runs in.
Other, mostly unprivileged domains, usually for guests (VMs).

The Toolstack relies on various bits of software inside the control domain, and directly communicates with most of these:

Linux kernel including drivers for hardware and Xen paravirtualised devices (e.g. netback and blkback).
- Interacts through /sys and /proc, udev scripts, xenstore, …
CentOS distribution including userspace tools and libraries.
- systemd, networking tools, …
Xen-specific libraries, especially libxenctrl (a.k.a. libxc)
xenstored: a key-value pair configuration database
- Accessible from all domains on a host, which makes it useful for inter-domain communication.
- The control domain has access to the entire xenstore database, while other domains only see sub-trees that are specific to that domain.
- Used for connecting VM disks and network interfaces, and other VM configuration options.
- Used for VM status reporting, e.g. the capabilities of the PV drivers (if installed), the IP address, etc.
SM: Storage Manager plugins which connect xapi’s internal storage interfaces to the control APIs of external storage systems.
stunnel: a daemon which decodes TLS and forwards traffic to xapi (and the other way around).
Open vSwitch (OVS): a virtual network switch, used to connect VMs to network interfaces. The OVS offers several networking features that xapi takes advantage of.
QEMU: emulation of various bits of hardware
DEMU: emulation of Nvidia vGPUs
xenguest
emu-manager
pvsproxy
xenconsoled: allows access to guest consoles. This is common to all Xen hosts.

The Toolstack also interacts with software that runs inside the guests:

PV drivers
The guest agent

Daemons

The Toolstack consists of a set of co-operating daemons:

xapi

manages clusters of hosts, co-ordinating access to shared storage and networking.

xenopsd

a low-level “domain manager” which takes care of creating, suspending, resuming, migrating, rebooting domains by interacting with Xen via libxc and libxl.

xcp-rrdd

a performance counter monitoring daemon which aggregates “datasources” defined via a plugin API and records history for each. There are various rrdd-plugin daemons:

xcp-rrdd-gpumon
xcp-rrdd-iostat
xcp-rrdd-squeezed
xcp-rrdd-xenpm
xcp-rrdd-dcmi
xcp-rrdd-netdev
xcp-rrdd-cpu

xcp-networkd

a host network manager which takes care of configuring interfaces, bridges and Open vSwitch instances

squeezed

a daemon in charge of VM memory management

xapi-storage-script

for storage manipulation over SMAPIv3

message-switch

exchanges messages between the daemons on a host

xapi-guard

forwards UEFI and vTPM persistence calls from domains to xapi

v6d

controls which features are enabled.

forkexecd

a helper daemon that assists the above daemons with executing binaries and scripts

xhad

The High-Availability daemon

perfmon

a daemon which monitors performance counters and sends “alerts” if values exceed some pre-defined threshold

mpathalert

a daemon which monitors “storage paths” and sends “alerts” if paths fail and need repair

wsproxy

handles access to VM consoles

Interfaces

Communication between the Toolstack daemons is built upon libraries from a component called xapi-idl.

Abstracts communication between daemons over the message-switch using JSON-RPC.
Contains the definition of the interfaces exposed by the daemons (except xapi).

Features

Disaster Recovery

The HA feature will restart VMs after hosts have failed, but what happens if a whole site (e.g. datacenter) is lost? A disaster recovery configuration is shown in the following diagram:

Disaster recovery maintaining a secondary site

We rely on the storage array’s built-in mirroring to replicate (synchronously or asynchronously: the admin’s choice) between the primary and the secondary site. When DR is enabled the VM disk data and VM metadata are written to the storage server and mirrored. The secondary site contains the other side of the data mirror and a set of hosts, which may be powered off.

In normal operation, the DR feature allows a “dry-run” recovery where a host on the secondary site checks that it can indeed see all the VM disk data and metadata. This should be done regularly, so that admins are familiar with the process.

After a disaster, the admin breaks the mirror on the secondary site and triggers a remote power-on of the offline hosts (either using an out-of-band tool or the built-in host power-on feature of xapi). The pool master on the secondary site can connect to the storage and extract all the VM metadata. Finally the VMs can all be restarted.

When the primary site is fully recovered, the mirror can be re-synchronised and the VMs can be moved back.

Event handling in the Control Plane - Xapi, Xenopsd and Xenstore

Introduction

Xapi, xenopsd and xenstore use a number of different events to obtain indications that some state changed in dom0 or in the guests. The events are used as an efficient alternative to polling all these states periodically.

xenstore provides a very configurable approach in which any key can be watched individually by a xenstore client. Once the value of a watched key changes, xenstore will indicate to the client that the value for that key has changed. An OCaml xenstore client library provides a way for OCaml programs such as xenopsd, message-cli and rrdd to provide high-level OCaml callback functions to watch specific keys. It’s very common, for instance, for xenopsd to watch specific keys in the xenstore keyspace of a guest and then, after receiving events for some or all of them, read other keys or subkeys in xenstored to update its internal state mirroring the state of guests and their devices (for instance, whether the guest has PV drivers and whether specific frontend devices have established connections with the backend devices in dom0).
xapi also provides a very configurable event mechanism in which the XenAPI can be used to provide events whenever a xapi object (for instance, a VM, a VBD etc.) changes state. This event mechanism is very reliable and is extensively used by XenCenter to provide real-time updates in the XenCenter GUI.
xenopsd provides a somewhat less configurable event mechanism, where it always provides signals for all objects (VBDs, VMs etc.) whose state changed (so it’s not possible to select a subset of objects to watch, as in xenstore or in xapi). It’s up to the xenopsd client (e.g. xapi) to receive these events and then filter out or act on each received signal by calling back xenopsd and asking it for information about the specific signalled object. The main use in xapi for the xenopsd signals is to update xapi’s database of the current state of each object controlled by xenopsd (VBDs, VMs etc.).

Given a choice between polling states and receiving events when the state changes, we should in general opt for receiving events in the code, in order to avoid adding bottlenecks in dom0 that would limit the scalability of XenServer to many VMs and virtual devices.

Connection of events between XAPI, xenopsd and xenstore, with main functions and data structures responsible for receiving and sending them

Xapi

Sending events from the xenapi

A xenapi user client, such as XenCenter, the xe-cli or a python script, can register to receive events from XAPI for specific objects in the XAPI DB. XAPI will generate events for those registered clients whenever the corresponding XAPI DB object changes.

This small Python script shows how to register a simple event watch loop for XAPI:

import XenAPI

session = XenAPI.Session("http://xshost")
session.login_with_password("username", "password")
# Register for events on the VM and pool classes.
session.xenapi.event.register(["VM", "pool"])
while True:
    try:
        # Block until an event on a xapi DB object is available.
        events = session.xenapi.event.next()
        for event in events:
            print("received event op=%s class=%s ref=%s"
                  % (event["operation"], event["class"], event["ref"]))
            if event["class"] == "vm" and event["operation"] == "mod":
                vm = event["snapshot"]
                print("xapi-event on vm: vm_uuid=%s, vm_name_label=%s, "
                      "power_state=%s, current_operations=%s"
                      % (vm["uuid"], vm["name_label"], vm["power_state"],
                         list(vm["current_operations"].values())))
    except XenAPI.Failure as e:
        if len(e.details) > 0 and e.details[0] == "EVENTS_LOST":
            # Events were dropped: re-register to resynchronise.
            session.xenapi.event.unregister(["VM", "pool"])
            session.xenapi.event.register(["VM", "pool"])
        else:
            # Any other failure is unexpected here; do not loop silently.
            raise

Receiving events from xenopsd

Xapi receives all events from xenopsd via the function xapi_xenops.events_watch() in its own independent thread. This is a single-threaded function that is responsible for handling all of the signals sent by xenopsd. In some situations with lots of VMs and virtual devices such as VBDs, this loop may saturate a single dom0 vcpu, which will slow down handling all of the xenopsd events and may cause the xenopsd signals to accumulate unboundedly in the worst case in the updates queue in xenopsd (see Figure 1).

The function xapi_xenops.events_watch() calls xenops_client.UPDATES.get() to obtain a list of (barrier, barrier_events), and then processes each of the barrier_events, which can be one of the following events:

Vm id: something changed in this VM, run xapi_xenops.update_vm() to query xenopsd about its state. The function update_vm() will update power_state, allowed_operations, console and guest_agent state in the xapi DB.
Vbd id: something changed in this VBD, run xapi_xenops.update_vbd() to query xenopsd about its state. The function update_vbd() will update currently_attached and connected in the xapi DB.
Vif id: something changed in this VIF, run xapi_xenops.update_vif() to query xenopsd about its state. The function update_vif() will update the activate and plugged state in the xapi DB.
Pci id: something changed in this PCI device, run xapi_xenops.update_pci() to query xenopsd about its state.
Vgpu id: something changed in this vGPU, run xapi_xenops.update_vgpu() to query xenopsd about its state.
Task id: something changed in this task, run xapi_xenops.update_task() to query xenopsd about its state. The function update_task() will update the progress of the task in the xapi DB using the information of the task in xenopsd.

All the xapi_xenops.update_X() functions above will call Xenopsd_client.X.stat() functions to obtain the current state of X from xenopsd:

Obtaining current state

There are a couple of optimisations while processing the events in xapi_xenops.events_watch():

if an event X=(vm_id,dev_id) (e.g. Vbd dev_id) has already been processed in a barrier_events, it’s not processed again. A typical value for X is e.g. “<vm_uuid>.xvda” for a VBD.
if Events_from_xenopsd.are_supressed X, then this event is ignored. Events are suppressed if VM X.vm_id is migrating away from the host.
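The two optimisations above can be sketched as a small filter. This is an illustration in Python rather than the OCaml used by xapi: the function name process_barrier_events, the tuple layout of an event and the suppressed_vms set are all hypothetical stand-ins for the real data structures.

```python
from typing import Callable, Iterable, List, Set, Tuple

# An event is modelled as (kind, vm_id, dev_id); dev_id is "" for VM events.
Event = Tuple[str, str, str]


def process_barrier_events(
    events: Iterable[Event],
    suppressed_vms: Set[str],
    update: Callable[[Event], None],
) -> List[Event]:
    """Call `update` once per distinct event, skipping suppressed VMs."""
    seen: Set[Event] = set()
    handled: List[Event] = []
    for event in events:
        _kind, vm_id, _dev_id = event
        if event in seen:
            continue  # already processed in this batch
        seen.add(event)
        if vm_id in suppressed_vms:
            continue  # e.g. the VM is migrating away from this host
        update(event)
        handled.append(event)
    return handled
```

The point of the first check is that a burst of identical signals costs only one round-trip to xenopsd, because each update queries the current state anyway.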

Barriers

When xapi executes an operation (such as VM.start or VM.shutdown) that consists of many xenopsd sub-operations, it needs to wait for events indicating completion. The sub-operations include, for example, VM.start, which makes xenopsd change the VM power_state, and VM.stat, VBD.stat, VIF.stat etc., which make the xapi DB catch up with the new xenopsd state of these objects. To wait, xapi sends a barrier to the xenopsd input queue, then blocks and only continues execution of the barred operation when xenopsd returns the barrier. The barrier should only be returned when xenopsd has finished the execution of all the operations requested by xapi (such as VBD.stat and VM.stat, which update the state of the VM in the xapi database after a VM.start has been issued to xenopsd).
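The barrier idea can be illustrated with a single FIFO queue and one worker thread. This is a deliberately simplified sketch, not how xenopsd is implemented: the names worker and run_with_barrier are hypothetical, and a real xenopsd has several queues and worker threads.

```python
import queue
import threading
from typing import List

BARRIER = "barrier"


def worker(inbox: "queue.Queue", outbox: "queue.Queue", log: List[str]) -> None:
    """Process requests in FIFO order; echo a barrier back once reached."""
    while True:
        item = inbox.get()
        if item is None:  # shutdown sentinel
            break
        if isinstance(item, tuple) and item[0] == BARRIER:
            outbox.put(item)  # everything queued before the barrier is done
        else:
            log.append(item)  # stand-in for executing the operation


def run_with_barrier(operations: List[str], barrier_id: int) -> List[str]:
    """Queue operations followed by a barrier and block until it returns."""
    inbox: queue.Queue = queue.Queue()
    outbox: queue.Queue = queue.Queue()
    log: List[str] = []
    thread = threading.Thread(target=worker, args=(inbox, outbox, log))
    thread.start()
    for op in operations:
        inbox.put(op)
    inbox.put((BARRIER, barrier_id))
    returned = outbox.get(timeout=5)  # the caller blocks here
    assert returned == (BARRIER, barrier_id)
    done = list(log)  # snapshot taken only after the barrier came back
    inbox.put(None)
    thread.join()
    return done
```

Because the worker handles requests strictly in order, receiving the barrier back guarantees that every operation queued before it has been executed.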

A recent problem has been detected in the xapi_xenops.events_watch() function: when it needs to process many VM_check_state events, the processing of barriers associated with a VM.start may be delayed, which in turn delays xapi in reporting (via a xapi event) that the VM state in the xapi DB has reached the running power_state. This needs further debugging, and is probably one of the reasons in CA-87377 why, in some conditions, a xapi event reporting that the VM power_state is running (causing it to go from yellow to green state in XenCenter) takes so long to be returned, well after the VM is already running.

Xenopsd

Xenopsd has a few queues that are used by xapi to store commands to be executed (e.g. VBD.stat) and update events to be picked up by xapi. The main ones, easily seen at runtime by running the following command in dom0, are:

# xenops-cli diagnostics --queue=org.xen.xapi.xenops.classic
{
   queues: [  # XENOPSD INPUT QUEUE
            ... stuff that still needs to be processed by xenopsd
            VM.stat
            VBD.stat
            VM.start
            VM.shutdown
            VIF.plug
            etc
           ]
   workers: [ # XENOPSD WORKER THREADS
            ... which stuff each worker thread is processing
   ]
   updates: {
     updates: [ # XENOPSD OUTPUT QUEUE
            ... signals from xenopsd that need to be picked up by xapi
               VM_check_state
               VBD_check_state
               etc
        ]
   }
   tasks: [ # XENOPSD TASKS
            ... state of each known task, before they are manually deleted after completion of the task
   ]
}

Sending events to xapi

Whenever xenopsd changes the state of a XenServer object such as a VBD or VM, or when it receives an event from xenstore indicating that the states of these objects have changed (perhaps because either a guest or the dom0 backend changed the state of a virtual device), it creates a signal for the corresponding object (VM_check_state, VBD_check_state etc.) and sends it up to xapi. Xapi will then process this event in its xapi_xenops.events_watch() function.

These signals may need to wait a long time to be processed if the single-threaded xapi_xenops.events_watch() function is taking a long time to process previous signals in the UPDATES queue from xenopsd.

Receiving events from xenstore

Xenopsd watches a number of keys in xenstore, both in dom0 and in each guest. Xenstore is responsible for sending watch events to xenopsd whenever the watched keys change state. Xenopsd uses a xenstore client library to make it easier to create a callback function that is called whenever xenstore sends these events.

Xenopsd sometimes also needs to complement these watch events with polling of some values. An example is the @introduceDomain event in xenstore (handled in xenopsd/xc/xenstore_watch.ml), which indicates that a new VM has been created. This event unfortunately does not indicate the domid of the VM, so xenopsd needs to query Xen (via libxc) for the domains that are now available on the host and compare them with the previous list of known domains, in order to figure out the domid of the newly introduced domain.
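The comparison step amounts to a set difference between the previously known domids and the current listing. A minimal sketch follows; the function name diff_domains is hypothetical, and plain integer lists stand in for the domain listing that xenopsd obtains from Xen.

```python
from typing import Iterable, Set, Tuple


def diff_domains(
    known: Iterable[int], current: Iterable[int]
) -> Tuple[Set[int], Set[int]]:
    """Return (introduced, released) domids between two listings."""
    known_set, current_set = set(known), set(current)
    return current_set - known_set, known_set - current_set
```

For example, with known domids 0, 1 and 2 and a current listing of 0, 2 and 5, the introduced set contains only 5 and the released set contains only 1.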

It is not good practice to poll xenstore for changes of values. This will add a large overhead to both xenstore and xenopsd, and decrease the scalability of XenServer in terms of number of VMs/host and virtual devices per VM. A much better approach is to rely on the watch events of xenstore to indicate when a specific value has changed in xenstore.

Xenstore

Sending events to xenstore clients

If a xenstore client has created watch events for a key, then xenstore will send events to this client whenever this key changes state.

Receiving events from xenstore clients

Xenstore clients indicate to xenstore that some state changed by writing to some xenstore key. This may or may not cause xenstore to create watch events for the corresponding key, depending on whether other xenstore clients have watches on this key.
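The relationship between writes and watch events can be modelled with a toy in-memory store. This is only an illustration of the idea, not the xenstore protocol or any real client library: the class ToyStore and its methods are hypothetical, and the toy matches exact keys only, which is a simplification.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Callback = Callable[[str, str], None]


class ToyStore:
    """In-memory stand-in for a key-value store with per-key watches."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._watches: Dict[str, List[Callback]] = defaultdict(list)

    def watch(self, key: str, callback: Callback) -> None:
        """Register `callback` to be called whenever `key` is written."""
        self._watches[key].append(callback)

    def write(self, key: str, value: str) -> int:
        """Store the value and notify watchers; return how many were notified."""
        self._data[key] = value
        callbacks = self._watches.get(key, [])
        for callback in callbacks:
            callback(key, value)
        return len(callbacks)
```

A write to a key that nobody watches still updates the stored value but generates no events, which is why watching scales better than having every client poll.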

High-Availability

High-Availability (HA) tries to keep VMs running, even when there are hardware failures in the resource pool, when the admin is not present. Without HA the following may happen:

during the night someone spills a cup of coffee over an FC switch; then
VMs running on the affected hosts will lose access to their storage; then
business-critical services will go down; then
monitoring software will send a text message to an off-duty admin; then
the admin will travel to the office and fix the problem by restarting the VMs elsewhere.

With HA the following will happen:

during the night someone spills a cup of coffee over an FC switch; then
VMs running on the affected hosts will lose access to their storage; then
business-critical services will go down; then
the HA software will determine which hosts are affected and shut them down; then
the HA software will restart the VMs on unaffected hosts; then
services are restored; then on the next working day
the admin can arrange for the faulty switch to be replaced.

HA is designed to handle an emergency and allow the admin time to fix failures properly.

Example

The following diagram shows an HA-enabled pool, before and after a network link between two hosts fails.

High-Availability in action

When HA is enabled, all hosts in the pool

exchange periodic heartbeat messages over the network
send heartbeats to a shared storage device.
attempt to acquire a “master lock” on the shared storage.

HA is designed to recover as much as possible of the pool after a single failure i.e. it removes single points of failure. When some subset of the pool suffers a failure then the remaining pool members

figure out whether they are in the largest fully-connected set (the “liveset”);
- if they are not in the largest set then they “fence” themselves (i.e. force reboot via the hypervisor watchdog)
elect a master using the “master lock”
restart all lost VMs.

After HA has recovered a pool, it is important that the original failure is addressed because the remaining pool members may not be able to cope with any more failures.

Design

HA must never violate the following safety rules:

there must be at most one master at all times. This is because the master holds the VM and disk locks.
there must be at most one instance of a particular VM at all times. This is because starting the same VM twice will result in severe filesystem corruption.

However to be useful HA must:

detect failures quickly;
minimise the number of false-positives in the failure detector; and
make the failure handling logic as robust as possible.

The implementation difficulty arises when trying to be both useful and safe at the same time.

Terminology

We use the following terminology:

fencing: also known as I/O fencing, refers to the act of isolating a host from network and storage. Once a host has been fenced, any VMs running there cannot generate side-effects observable to a third party. This means it is safe to restart the running VMs on another node without violating the safety-rule and running the same VM simultaneously in two locations.
heartbeating: exchanging status updates with other hosts at regular pre-arranged intervals. Heartbeat messages reveal that hosts are alive and that I/O paths are working.
statefile: a shared disk (also known as a “quorum disk”) on the “Heartbeat” SR which is mapped as a block device into every host’s domain 0. The shared disk acts both as a channel for heartbeat messages and also as a building block of a Pool master lock, to prevent multiple hosts becoming masters in violation of the safety-rule (a dangerous situation also known as “split-brain”).
management network: the network over which the XenAPI XML/RPC requests flow and also used to send heartbeat messages.
liveset: a per-Host view containing a subset of the Hosts in the Pool which are considered by that Host to be alive i.e. responding to XenAPI commands and running the VMs marked as resident_on there. When a Host b leaves the liveset as seen by Host a it is safe for Host a to assume that Host b has been fenced and to take recovery actions (e.g. restarting VMs), without violating either of the safety-rules.
properly shared SR: an SR which has field shared=true; and which has a PBD connecting it to every enabled Host in the Pool; and where each of these PBDs has field currently_attached set to true. A VM whose disks are in a properly shared SR could be restarted on any enabled Host, memory and network permitting.
properly shared Network: a Network which has a PIF connecting it to every enabled Host in the Pool; and where each of these PIFs has field currently_attached set to true. A VM whose VIFs connect to properly shared Networks could be restarted on any enabled Host, memory and storage permitting.
agile: a VM is said to be agile if all disks are in properly shared SRs and all network interfaces connect to properly shared Networks.
unprotected: an unprotected VM has field ha_always_run set to false and will never be restarted automatically on failure or have reconfiguration actions blocked by the HA overcommit protection.
best-effort: a best-effort VM has fields ha_always_run set to true and ha_restart_priority set to best-effort. A best-effort VM will only be restarted if (i) the failure is directly observed; and (ii) capacity exists for an immediate restart. No more than one restart attempt will ever be made.
protected: a VM is said to be protected if it will be restarted by HA i.e. has field ha_always_run set to true and field ha_restart_priority not set to best-effort.
survival rule 1: describes the situation where hosts survive because they are in the largest network partition with statefile access. This is the normal state of the xhad daemon.
survival rule 2: describes the situation where all hosts have lost access to the statefile but remain alive while they can all see each other on the network. In this state any further failure will cause all nodes to self-fence. This state is intended to cope with the system-wide temporary loss of the storage service underlying the statefile.
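The two survival rules can be sketched as a decision over network partitions. This is a rough model under stated assumptions, not the xhad algorithm: the function name survivors is hypothetical, partition sizes count only hosts with statefile access, and ties between equally sized partitions are broken by the smallest host name, which is an arbitrary choice made for the example.

```python
from typing import Dict, FrozenSet, List, Set


def survivors(
    partitions: List[FrozenSet[str]],
    statefile_access: Dict[str, bool],
) -> Set[str]:
    """Return the hosts that stay alive; all other hosts self-fence."""
    all_hosts: Set[str] = set().union(*partitions)

    def has_access(host: str) -> bool:
        return statefile_access.get(host, False)

    if not any(has_access(h) for h in all_hosts):
        # Survival rule 2: nobody sees the statefile, so hosts survive only
        # while every host can still see every other host on the network.
        return all_hosts if len(partitions) == 1 else set()

    # Survival rule 1: the largest partition with statefile access survives.
    # Assumption: only hosts that can reach the statefile are counted.
    candidates = [{h for h in p if has_access(h)} for p in partitions]
    candidates = [c for c in candidates if c]
    # Assumption: ties are broken by the smallest host name.
    candidates.sort(key=lambda c: (-len(c), min(c)))
    return candidates[0]
```

In this model a single intact partition with no statefile access keeps every host alive, while any network split in that state makes the function return an empty set, matching the statement that a further failure causes all nodes to self-fence.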

Assumptions

We assume:

All I/O used for monitoring the health of hosts (i.e. both storage and network-based heartbeating) is along redundant paths, so that it survives a single hardware failure (e.g. a broken switch or an accidentally-unplugged cable). It is up to the admin to ensure their environment is set up correctly.
The hypervisor watchdog mechanism will be able to guarantee the isolation of nodes, once communication has been lost, within a pre-arranged time period. Therefore no active power fencing equipment is required.
VMs may only be marked as protected if they are fully agile i.e. able to run on any host, memory permitting. No additional constraints of any kind may be specified e.g. it is not possible to make “CPU reservations”.
Pools are assumed to be homogeneous with respect to CPU type and presence of VT/SVM support (also known as “HVM”). If a Pool is created with non-homogeneous hosts using the --force flag then the additional constraints will not be noticed by the VM failover planner, resulting in runtime failures while trying to execute the failover plans.
No attempt will ever be made to shut down or suspend “lower” priority VMs to guarantee the survival of “higher” priority VMs.
Once HA is enabled it is not possible to reconfigure the management network or the SR used for storage heartbeating.
VMs marked as protected are considered to have failed if they are offline i.e. the VM failure handling code is level-sensitive rather than edge-sensitive.
VMs marked as best-effort are considered to have failed only when the host where they are resident is declared offline i.e. the best-effort VM failure handling code is edge-sensitive rather than level-sensitive. A single restart is attempted and if this fails no further start is attempted.
HA can only be enabled if all Pool hosts are online and actively responding to requests.
When HA is enabled the database is configured to write all updates to the “Heartbeat” SR, guaranteeing that VM configuration changes are not lost when a host fails.
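The VM categories from the terminology, together with the assumption that protected VMs must be agile, can be expressed as a small classifier. This is an illustrative sketch only: the class VmHaSettings and the function ha_class are hypothetical, and the agile flag stands in for the check that all disks and VIFs are on properly shared SRs and Networks.

```python
from dataclasses import dataclass


@dataclass
class VmHaSettings:
    ha_always_run: bool
    ha_restart_priority: str
    agile: bool = True  # all disks and VIFs on properly shared SRs/Networks


def ha_class(vm: VmHaSettings) -> str:
    """Classify a VM as 'unprotected', 'best-effort' or 'protected'."""
    if not vm.ha_always_run:
        return "unprotected"
    if vm.ha_restart_priority == "best-effort":
        return "best-effort"
    if not vm.agile:
        # Protected VMs must be agile; reject the combination.
        raise ValueError("a protected VM must be agile")
    return "protected"
```

Rejecting the non-agile case up front mirrors the assumption that the failover planner may place a protected VM on any host, memory permitting.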

Components

The implementation is split across the following components:

xhad: the cluster membership daemon maintains a quorum of hosts through network and storage heartbeats
xapi: used to configure the HA policy i.e. which network and storage to use for heartbeating and which VMs to restart after a failure.
xen: the Xen watchdog is used to reliably fence the host when the host has been (partially or totally) isolated from the cluster

To avoid a “split-brain”, the cluster membership daemon must “fence” (i.e. isolate) nodes when they are not part of the cluster. In general there are 2 approaches:

cut the power of remote hosts which you can’t talk to on the network any more. This is the approach taken by most open-source clustering software since it is simpler. However it has the downside of requiring the customer to buy more hardware and set it up correctly.
rely on the remote hosts using a watchdog to cut their own power (i.e. halt or reboot) after a timeout. This relies on the watchdog being reliable. Most other people don’t trust the Linux watchdog; after all the Linux kernel is highly threaded, performs a lot of (useful) functions and kernel bugs which result in deadlocks do happen. We use the Xen watchdog because we believe that the Xen hypervisor is simple enough to reliably fence the host (via triggering a reboot of domain 0 which then triggers a host reboot).
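The second approach depends on a countdown that a healthy host keeps refreshing. The sketch below models only that timing logic with an injected clock; the class Watchdog is hypothetical and does not represent the Xen watchdog interface.

```python
class Watchdog:
    """A countdown that fires unless it is refreshed within `timeout` seconds."""

    def __init__(self, timeout: float, now: float = 0.0) -> None:
        self.timeout = timeout
        self.deadline = now + timeout

    def refresh(self, now: float) -> None:
        """Called by the healthy host to push the deadline back."""
        self.deadline = now + self.timeout

    def expired(self, now: float) -> bool:
        """True once the deadline has passed, i.e. the host must be fenced."""
        return now >= self.deadline
```

The safety argument is that other hosts only need to wait for the timeout to elapse before they can assume an unreachable host has stopped, without any external power-fencing hardware.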

xhad

xhad is the cluster membership daemon: it exchanges heartbeats with the other nodes to determine which nodes are still in the cluster (the “live set”) and which nodes have definitely failed (through watchdog fencing). When a host has definitely failed, xapi will unlock all the disks and restart the VMs according to the HA policy.

Since Xapi is a critical part of the system, the xhad also acts as a Xapi watchdog. It polls Xapi every few seconds and checks if Xapi can respond. If Xapi seems to have failed then xhad will restart it. If restarts continue to fail then xhad will consider the host to have failed and self-fence.
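One round of this watchdog behaviour can be sketched as follows. The function check_xapi and its parameters are hypothetical, and the health check and restart actions are passed in as callables so that the logic can be exercised without a running xapi.

```python
from typing import Callable


def check_xapi(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    max_restarts: int,
) -> str:
    """Return 'ok', 'restarted' or 'self-fence' for one health-check round."""
    if is_healthy():
        return "ok"
    for _ in range(max_restarts):
        restart()
        if is_healthy():
            return "restarted"
    return "self-fence"
```

A caller would run this every few seconds and treat the 'self-fence' result as the signal that the host should be considered failed.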

xhad is configured via a simple config file written on each host in /etc/xensource/xhad.conf. The file must be identical on each host in the cluster. To make changes to the file, HA must be disabled and then re-enabled afterwards. Note it may not be possible to re-enable HA depending on the configuration change (e.g. if a host has been added but that host has a broken network configuration then this will block HA enable).

The xhad.conf file is written in XML and contains

pool-wide configuration: this includes a list of all hosts which should be in the liveset and global timeout information
local host configuration: this identifies the local host and describes which local network interface and block device to use for heartbeating.

The following is an example xhad.conf file:

<?xml version="1.0" encoding="utf-8"?>
<xhad-config version="1.0">

  <!--pool-wide configuration-->
  <common-config>
    <GenerationUUID>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</GenerationUUID>
    <UDPport>694</UDPport>

    <!--for each host, specify host UUID, and IP address-->
    <host>
      <HostID>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</HostID>
      <IPaddress>xxx.xxx.xxx.xx1</IPaddress>
    </host>

    <host>
      <HostID>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</HostID>
      <IPaddress>xxx.xxx.xxx.xx2</IPaddress>
    </host>

    <host>
      <HostID>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx</HostID>
      <IPaddress>xxx.xxx.xxx.xx3</IPaddress>
    </host>

    <!--optional parameters [sec] -->
    <parameters>
      <HeartbeatInterval>4</HeartbeatInterval>
      <HeartbeatTimeout>30</HeartbeatTimeout>
      <StateFileInterval>4</StateFileInterval>
      <StateFileTimeout>30</StateFileTimeout>
      <HeartbeatWatchdogTimeout>30</HeartbeatWatchdogTimeout>
      <StateFileWatchdogTimeout>45</StateFileWatchdogTimeout>
      <BootJoinTimeout>90</BootJoinTimeout>
      <EnableJoinTimeout>90</EnableJoinTimeout>
      <XapiHealthCheckInterval>60</XapiHealthCheckInterval>
      <XapiHealthCheckTimeout>10</XapiHealthCheckTimeout>
      <XapiRestartAttempts>1</XapiRestartAttempts>
      <XapiRestartTimeout>30</XapiRestartTimeout>
      <XapiLicenseCheckTimeout>30</XapiLicenseCheckTimeout>
    </parameters>
  </common-config>

  <!--local host configuration-->
  <local-config>
    <localhost>
      <HostID>xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxx2</HostID>
      <HeartbeatInterface>xapi1</HeartbeatInterface>
      <HeartbeatPhysicalInterface>bond0</HeartbeatPhysicalInterface>
      <StateFile>/dev/statefiledevicename</StateFile>
    </localhost>
  </local-config>

</xhad-config>
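As an illustration of the file layout above, the following is a minimal Python sketch that reads an xhad.conf document into a dictionary. It is not how xhad itself parses the file; the function name parse_xhad_conf and the keys of the returned dictionary are assumptions made for this example. Only the element names are taken from the sample file.

```python
import xml.etree.ElementTree as ET


def parse_xhad_conf(text):
    """Read xhad.conf XML text into a plain dictionary (illustrative only)."""
    root = ET.fromstring(text)
    common = root.find("common-config")
    local = root.find("local-config/localhost")

    # One (HostID, IPaddress) pair per <host> element.
    hosts = [
        (h.findtext("HostID").strip(), h.findtext("IPaddress").strip())
        for h in common.findall("host")
    ]

    # The <parameters> block is optional; all of its values are in seconds
    # or plain counts, so they are read as integers.
    params_el = common.find("parameters")
    params = {}
    if params_el is not None:
        params = {p.tag: int(p.text) for p in params_el}

    return {
        "generation_uuid": common.findtext("GenerationUUID").strip(),
        "udp_port": int(common.findtext("UDPport")),
        "hosts": hosts,
        "parameters": params,
        # Values are stripped, so stray whitespace inside an element is ignored.
        "local": {child.tag: child.text.strip() for child in local},
    }
```

The "local" entry then holds HostID, HeartbeatInterface, HeartbeatPhysicalInterface and StateFile for the host the file was written on.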

The fields have the following meaning:

GenerationUUID: a UUID generated each time HA is reconfigured. This allows xhad to tell an old host (one which failed, was removed from the configuration, and was then repaired and restarted) that the world has changed while it was away.
UDPport: the port number to use for network heartbeats. It’s important to allow this traffic through the firewall and to make sure the same port number is free on all hosts (beware of portmap services occasionally binding to it).
HostID: a UUID identifying a host in the pool. We would normally use xapi’s notion of a host uuid.
IPaddress: any IP address on the remote host. We would normally use xapi’s notion of a management network.
HeartbeatTimeout: if a heartbeat packet is not received for this many seconds, then xhad considers the heartbeat to have failed. This is the user-supplied “HA timeout” value, represented below as T. T must be bigger than 10; we would normally use 60s.
StateFileTimeout: if a storage update is not seen for a host for this many seconds, then xhad considers the storage heartbeat to have failed. We would normally use the same value as the HeartbeatTimeout T.
HeartbeatInterval: interval between heartbeat packets sent. We would normally use a value 2 <= t <= 6, derived from the user-supplied HA timeout via t = (T + 10) / 10.
StateFileInterval: interval between storage updates (also known as “statefile updates”). This would normally be set to the same value as HeartbeatInterval.
HeartbeatWatchdogTimeout: If the host does not send a heartbeat for this amount of time then the host self-fences via the Xen watchdog. We normally set this to T.
StateFileWatchdogTimeout: If the host does not update the statefile for this amount of time then the host self-fences via the Xen watchdog. We normally set this to T+15.
BootJoinTimeout: When the host is booting and joining the liveset (i.e. the cluster), consider the join a failure if it takes longer than this amount of time. We would normally set this to T+60.
EnableJoinTimeout: When the host is enabling HA for the first time, consider the enable a failure if it takes longer than this amount of time. We would normally set this to T+60.
XapiHealthCheckInterval: Interval between “health checks” where we run a script to check whether Xapi is responding or not.
XapiHealthCheckTimeout: Number of seconds to wait before assuming that Xapi has deadlocked during a “health check”.
XapiRestartAttempts: Number of Xapi restarts to attempt before concluding Xapi has permanently failed.
XapiRestartTimeout: Number of seconds to wait for a Xapi restart to complete before concluding it has failed.
XapiLicenseCheckTimeout: Number of seconds to wait for a Xapi license check to complete before concluding that xhad should terminate.
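The relationships between the timing parameters and the HA timeout T described in the list above can be collected into a small Python sketch. The function name derive_xhad_timeouts is hypothetical. Two details are assumptions of this sketch, not statements from the text: the formula t = (T + 10) / 10 is evaluated with integer division, and the result is clamped into the range 2 <= t <= 6.

```python
def derive_xhad_timeouts(ha_timeout):
    """Derive xhad timing parameters (in seconds) from the HA timeout T."""
    if ha_timeout <= 10:
        raise ValueError("T must be bigger than 10")

    # t = (T + 10) / 10; integer division and clamping to [2, 6] are
    # assumptions of this sketch.
    interval = max(2, min(6, (ha_timeout + 10) // 10))

    return {
        "HeartbeatInterval": interval,
        "HeartbeatTimeout": ha_timeout,
        "StateFileInterval": interval,           # same as HeartbeatInterval
        "StateFileTimeout": ha_timeout,          # same as HeartbeatTimeout
        "HeartbeatWatchdogTimeout": ha_timeout,  # T
        "StateFileWatchdogTimeout": ha_timeout + 15,
        "BootJoinTimeout": ha_timeout + 60,
        "EnableJoinTimeout": ha_timeout + 60,
    }
```

With T = 30 this yields the same values as the example xhad.conf shown earlier (intervals of 4, timeouts of 30, 45 and 90).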

In addition to the config file, xhad provides a simple control API in the form of scripts:

ha_set_pool_state (Init | Invalid): sets the global pool state to “Init” (before starting HA) or “Invalid” (causing all other daemons which can see the statefile to shut down).
ha_start_daemon: if the pool state is “Init” then the daemon will attempt to contact other daemons and enable HA. If the pool state is “Active” then the host will attempt to join the existing liveset.
ha_query_liveset: returns the current state of the cluster.
ha_propose_master: returns whether the current node has been elected pool master.
ha_stop_daemon: shuts down the xhad on the local host. Note this will not disarm the Xen watchdog by itself.
ha_disarm_fencing: disables fencing on the local host.
ha_set_excluded: when a host is being shut down cleanly, record the fact that the VMs have all been shut down so that this host can be ignored in future cluster membership calculations.

Fencing

Xhad continuously monitors whether the host should remain alive, or if it should self-fence. There are two “survival rules” which will keep a host alive; if neither rule applies (or if xhad crashes or deadlocks) then the host will fence. The rules are:

Xapi is running; the storage heartbeats are visible; this host is a member of the “best” partition (as seen through the storage heartbeats)
Xapi is running; the storage is inaccessible; all hosts which should be running (i.e. not those “excluded” by being cleanly shutdown) are online and have also lost storage access (as seen through the network heartbeats).

where the “best” partition is the largest one if that is unique, or if there are multiple partitions of the same size then the one containing the lowest host uuid is considered best.

The first survival rule is the “normal” case. The second rule exists only to prevent the storage from becoming a single point of failure: all hosts can remain alive until the storage is repaired. Note that if a host has failed and has not yet been repaired, then the storage becomes a single point of failure for the degraded pool. HA removes single points of failure, but multiple failures can still cause problems. It is important to fix failures properly after HA has worked around them.
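The two survival rules and the definition of the “best” partition can be written down as a short Python sketch. This is a model of the decision only, not xhad's code: the function names, the parameter names and the representation of partitions as sets of host UUID strings are all assumptions made for illustration. UUIDs are compared as plain strings.

```python
def best_partition(partitions):
    """Pick the largest partition; ties go to the one with the lowest UUID.

    partitions: list of disjoint, non-empty sets of host UUID strings.
    """
    return min(partitions, key=lambda p: (-len(p), min(p)))


def should_survive(xapi_running, storage_visible, my_uuid, partitions,
                   expected_hosts, online_without_storage):
    """Return True if either survival rule keeps this host alive.

    expected_hosts: hosts which should be running (i.e. not "excluded").
    online_without_storage: hosts seen through the network heartbeats which
    have also lost storage access.
    """
    if not xapi_running:
        return False
    if storage_visible:
        # Rule 1: this host is a member of the best partition.
        return my_uuid in best_partition(partitions)
    # Rule 2: storage is inaccessible, but every expected host is online
    # and has lost storage access too.
    return set(expected_hosts) <= set(online_without_storage)
```

For example, with partitions {"b", "c"} and {"a"} the first is best because it is larger; with {"b"} and {"a"} the second is best because it contains the lowest UUID.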

xapi

Xapi is responsible for

exposing an interface for setting HA policy
creating VDIs (disks) on shared storage for heartbeating and storing the pool database
arranging for these disks to be attached on host boot, before the “SRmaster” is online
configuring and managing the xhad heartbeating daemon

The HA policy APIs include

methods to determine whether a VM is agile i.e. can be restarted in principle on any host after a failure
planning for a user-specified number of host failures and enforcing access control
restarting failed protected VMs in policy order

The HA policy settings are stored in the Pool database which is written (synchronously) to a VDI in the same SR that’s being used for heartbeating. This ensures that the database can be recovered after a host fails and the VMs are recovered.

Xapi stores 2 settings in its local database:

ha_disable_failover_actions: this is set to false when we want nodes to be able to recover VMs – this is the normal case. It is set to true during the HA disable process to prevent a split-brain forming while HA is only partially enabled.
ha_armed: this is set to true to tell Xapi to start Xhad during host startup and wait to join the liveset.

Disks on shared storage

The regular disk APIs for creating, destroying, attaching, detaching (etc) disks need the SRmaster (usually but not always the Pool master) to be online to allow the disks to be locked. The SRmaster cannot be brought online until the host has joined the liveset. Therefore we have a cyclic dependency: joining the liveset needs the statefile disk to be attached but attaching a disk requires being a member of the liveset already.

The dependency is broken by adding an explicit “unlocked” attach storage API called VDI_ATTACH_FROM_CONFIG. Xapi uses the VDI_GENERATE_CONFIG API during the HA enable operation and stores away the result. When the system boots the VDI_ATTACH_FROM_CONFIG is able to attach the disk without the SRmaster.

The role of Host.enabled

The Host.enabled flag is used to mean, “this host is ready to start VMs and should be included in failure planning”. The VM restart planner assumes for simplicity that all protected VMs can be started anywhere; therefore all involved networks and storage must be properly shared. If a host with an unplugged PBD were to become enabled then the corresponding SR would cease to be properly shared, all the VMs would cease to be agile and the VM restart logic would fail.

To ensure the VM restart logic always works, great care is taken to make sure that Hosts may only become enabled when their networks and storage are properly configured. This is achieved by:

when the master boots and initialises its database it sets all Hosts to dead and disabled and then signals the HA background thread (signal_database_state_valid) to wake up from sleep and start processing liveset information (and potentially setting hosts to live)
when a slave calls Pool.hello (i.e. after the slave has rebooted), the master sets it to disabled, allowing it a grace period to plug in its storage;
when a host (master or slave) successfully plugs in its networking and storage it calls consider_enabling_host which checks that the preconditions are met and then sets the host to enabled; and
when a slave notices its database connection to the master restart (i.e. after the master xapi has just restarted) it calls consider_enabling_host.

The steady-state

When HA is enabled and all hosts are running normally then each calls ha_query_liveset every 10s.

Slaves check to see if the host they believe is the master is alive and has the master lock. If another node has become master then the slave will rewrite its pool.conf and restart. If no node is the master then the slave will call on_master_failure, proposing itself and, if it is rejected, checking the liveset to see which node acquired the lock.

The master monitors the liveset and updates the Host_metrics.live flag of every host to reflect the liveset value. For every host which is not in the liveset (i.e. has fenced) it enumerates all resident VMs and marks them as Halted. For each protected VM which is not running, the master computes a VM restart plan and attempts to execute it. If the plan fails then a best-effort VM.start call is attempted. Finally an alert is generated if the VM could not be restarted.

Note that XenAPI heartbeats are still sent when HA is enabled, even though they are not used to drive the values of the Host_metrics.live field. Note further that, when a host is being shut down, the host is immediately marked as dead and its host reference is added to a list used to prevent the Host_metrics.live being accidentally reset back to live again by the asynchronous liveset query. The Host reference is removed from the list when the host restarts and calls Pool.hello.

Planning and overcommit

The VM failover planning code is sub-divided into two pieces, stored in separate files:

binpack.ml: contains two algorithms for packing items of different sizes (i.e. VMs) into bins of different sizes (i.e. Hosts); and
xapi_ha_vm_failover.ml: interfaces between the Pool database and the binpacker; also performs counterfactual reasoning for overcommit protection.

The inputs to the binpacking algorithms are configuration values which represent an abstract view of the Pool:

type ('a, 'b) configuration = {
  hosts:        ('a * int64) list; (** a list of live hosts and free memory *)
  vms:          ('b * int64) list; (** a list of VMs and their memory requirements *)
  placement:    ('b * 'a) list;    (** current VM locations *)
  total_hosts:  int;               (** total number of hosts in the pool 'n' *)
  num_failures: int;               (** number of failures to tolerate 'r' *)
}

Note that:

the memory required by the VMs listed in placement has already been subtracted from the free memory of the hosts; it doesn’t need to be subtracted again.
the free memory of each host has already had per-host miscellaneous overheads subtracted from it, including that used by unprotected VMs, which do not appear in the VM list.
the total number of hosts in the pool (total_hosts) is a constant for any particular invocation of HA.
the number of failures to tolerate (num_failures) is the user-settable value from the XenAPI Pool.ha_host_failures_to_tolerate.

There are two algorithms which satisfy the interface:

sig
  plan_always_possible: ('a, 'b) configuration -> bool;
  get_specific_plan: ('a, 'b) configuration -> 'b list -> ('b * 'a) list
end

The function get_specific_plan takes a configuration and a list of VMs (those whose resident hosts have failed). It returns a VM restart plan represented as a VM to Host association list. This is the function called by the background HA VM restart thread on the master.

The function plan_always_possible returns true if every sequence of Host failures of length num_failures (irrespective of whether all hosts failed at once, or in multiple separate episodes) would result in calls to get_specific_plan which would allow all protected VMs to be restarted. This function is heavily used by the overcommit protection logic as well as code in XenCenter which aims to maximise failover capacity using the counterfactual reasoning APIs:

Pool.ha_compute_max_host_failures_to_tolerate
Pool.ha_compute_hypothetical_max_host_failures_to_tolerate

There are two binpacking algorithms: the more detailed but expensive algorithm is used for smaller/less complicated pool configurations while the less detailed, cheaper algorithm is used for the rest. The choice between algorithms is based only on total_hosts (n) and num_failures (r). Note that the choice of algorithm will only change if the number of Pool hosts is varied (requiring HA to be disabled and then enabled) or if the user requests a new num_failures target to plan for.

The expensive algorithm uses an exhaustive search with a “biggest-fit-decreasing” strategy that takes the biggest VMs first and allocates them to the biggest remaining Host. The implementation keeps the VMs and Hosts as sorted lists throughout. There are a number of transformations to the input configuration which are guaranteed to preserve the existence of a VM to host allocation (even if the actual allocation is different). The safe transformations are:

VMs may be removed from the list
VMs may have their memory requirements reduced
Hosts may be added
Hosts may have additional memory added.
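To make the “biggest-fit-decreasing” idea concrete, here is a simplified Python sketch. It is not the code in binpack.ml: it performs a single greedy pass instead of an exhaustive search, it considers only simultaneous failures of exactly num_failures hosts, and the function names greedy_plan and always_possible_by_enumeration are hypothetical. The inputs mirror the hosts, vms and placement fields of the configuration type, using plain Python lists of pairs.

```python
from itertools import combinations


def greedy_plan(hosts, failed_vms):
    """Place VMs biggest-first onto the host with the most free memory.

    hosts: list of (host, free_memory); failed_vms: list of (vm, memory).
    Returns a list of (vm, host) pairs, or None if this greedy pass fails
    (a real exhaustive search might still find a plan).
    """
    free = dict(hosts)
    plan = []
    for vm, needed in sorted(failed_vms, key=lambda item: item[1], reverse=True):
        if not free:
            return None
        host = max(free, key=free.get)  # biggest remaining host
        if free[host] < needed:
            return None                 # no host can fit this VM
        free[host] -= needed
        plan.append((vm, host))
    return plan


def always_possible_by_enumeration(hosts, vms, placement, num_failures):
    """Check every set of num_failures simultaneous host failures."""
    memory = dict(vms)
    names = [host for host, _ in hosts]
    for failed in combinations(names, min(num_failures, len(names))):
        survivors = [(h, free) for h, free in hosts if h not in failed]
        failed_vms = [(vm, memory[vm]) for vm, h in placement if h in failed]
        if greedy_plan(survivors, failed_vms) is None:
            return False
    return True
```

The number of host combinations grows quickly with the pool size, which is consistent with the text reserving the detailed algorithm for smaller pools.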

The cheaper algorithm is used for larger Pools where the state space to search is too large. It uses the same “biggest-fit-decreasing” strategy with the following simplifying approximations:

every VM that fails is as big as the biggest
the number of VMs which fail due to a single Host failure is always the maximum possible (even if these are all very small VMs)
the largest and most capable Hosts fail

An informal argument that these approximations are safe is as follows: if the maximum number of VMs fail, each of which is the size of the largest, and we can find a restart plan using only the smaller hosts, then any real failure:

can never result in the failure of more VMs;
can never result in the failure of bigger VMs; and
can never result in less host capacity remaining.

Therefore we can take this almost-certainly-worse-than-worst-case failure plan and:

replace the remaining hosts in the worst case plan with the real remaining hosts, which will be the same size or larger; and
replace the failed VMs in the worst case plan with the real failed VMs, which will be fewer or the same in number and smaller or the same in size.

Note that this strategy will perform best when each host has the same number of VMs on it and when all VMs are approximately the same size. If one very big VM exists and a lot of smaller VMs then it will probably fail to find a plan. It is more tolerant of differing amounts of free host memory.
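The three simplifying approximations can be illustrated with a short Python sketch. This is an interpretation of the description above and not the code in binpack.ml; the function name conservative_plan_possible is hypothetical. Because every failed VM is assumed to be as big as the biggest, packing reduces to counting how many VMs of that size fit into each surviving host.

```python
from collections import Counter


def conservative_plan_possible(hosts, vms, placement, num_failures):
    """Pessimistic check: biggest hosts fail, each taking the maximum
    number of VMs with it, and every failed VM is as big as the biggest.

    hosts: list of (host, free_memory); vms: list of (vm, memory);
    placement: list of (vm, host).
    """
    biggest_vm = max((mem for _, mem in vms), default=0)
    if biggest_vm == 0:
        return True  # nothing to restart, or VMs need no memory

    # The most VMs found on any single host.
    per_host = Counter(host for _, host in placement)
    max_vms_per_host = max(per_host.values(), default=0)
    failed_vm_count = num_failures * max_vms_per_host

    # Assume the hosts with the most free memory are the ones that fail.
    survivors = sorted((free for _, free in hosts), reverse=True)[num_failures:]

    # All failed VMs have the same (maximal) size, so count available slots.
    slots = sum(free // biggest_vm for free in survivors)
    return slots >= failed_vm_count
```

For instance, with three hosts of 10 units free, one VM of size 9 on the first host and three VMs of size 1 on the second, the check fails for a single host failure even though a real plan exists. This is the “one very big VM” case mentioned above.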

Overcommit protection

Overcommit protection blocks operations which would prevent the Pool being able to restart protected VMs after host failure. The Pool may become unable to restart protected VMs in two general ways: (i) by running out of resource i.e. host memory; and (ii) by altering host configuration in such a way that VMs cannot be started (or the planner thinks that VMs cannot be started).

API calls which would change the amount of host memory currently in use (VM.start, VM.resume, VM.migrate etc) have been modified to call the planning functions supplying special “configuration change” parameters. Configuration change values represent the proposed operation and have type

type configuration_change = {
  (** existing VMs which are leaving *)
  old_vms_leaving: (API.ref_host * (API.ref_VM * API.vM_t)) list;
  (** existing VMs which are arriving *)
  old_vms_arriving: (API.ref_host * (API.ref_VM * API.vM_t)) list;  
  (** hosts to pretend to disable *)
  hosts_to_disable: API.ref_host list;
  (** new number of failures to consider *)
  num_failures: int option;
  (** new VMs to restart *)  
  new_vms_to_protect: API.ref_VM list;
}

A VM migration will be represented by saying the VM is “leaving” one host and “arriving” at another. A VM start or resume will be represented by saying the VM is “arriving” on a host.

Note that no attempt is made to integrate the overcommit protection with the general VM.start host chooser as this would be quite expensive.

Note that the overcommit protection calls are written as asserts called within the message forwarder in the master, holding the main forwarding lock.

API calls which would change the system configuration in such a way as to prevent the HA restart planner being able to guarantee to restart protected VMs are also blocked. These calls include:

VBD.create: where the disk is not in a properly shared SR
VBD.insert: where the CDROM is local to a host
VIF.create: where the network is not properly shared
PIF.unplug: when the network would cease to be properly shared
PBD.unplug: when the storage would cease to be properly shared
Host.enable: when some network or storage would cease to be properly shared (e.g. if this host had a broken storage configuration)

xen

The Xen hypervisor has per-domain watchdog counters which, when enabled, decrement as time passes and can be reset by a hypercall from the domain. If the domain fails to make the hypercall and the timer reaches zero then the domain is immediately shut down with reason reboot. We configure Xen to reboot the host when domain 0 enters this state.
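The behaviour of such a counter can be modelled in a few lines of Python. This is a toy model for illustration only, not Xen's implementation; the class name WatchdogCounter and its methods are hypothetical, with kick standing in for the hypercall that resets the counter.

```python
class WatchdogCounter:
    """Toy model of a per-domain watchdog counter (not Xen's implementation)."""

    def __init__(self, timeout):
        self.timeout = timeout
        self.remaining = timeout
        self.expired = False

    def kick(self):
        """Model the hypercall that resets the counter."""
        if not self.expired:
            self.remaining = self.timeout

    def tick(self, seconds=1):
        """Advance time; return True once the counter has reached zero."""
        if not self.expired:
            self.remaining = max(0, self.remaining - seconds)
            self.expired = self.remaining == 0
        return self.expired
```

Once the counter has expired, further kicks have no effect in this model, mirroring the fact that the domain is shut down at that point.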

High-level operations

Enabling HA

Before HA can be enabled the admin must take care to configure the environment properly. In particular:

NIC bonds should be available for network heartbeats;
multipath should be configured for the storage heartbeats;
all hosts should be online and fully-booted.

The XenAPI client can request a specific shared SR to be used for storage heartbeats, otherwise Xapi will use the Pool’s default SR. Xapi will use VDI_GENERATE_CONFIG to ensure the disk will be attached automatically on system boot before the liveset has been joined.

Note that extra effort is made to re-use any existing heartbeat VDIs so that

if HA is disabled with some hosts offline, when they are rebooted they stand a higher chance of seeing a well-formed statefile with an explicit invalid state. If the VDIs were destroyed on HA disable then hosts which boot up later would fail to attach the disk and it would be harder to distinguish between a temporary storage failure and a permanent HA disable.
the heartbeat SR can be created on expensive low-latency high-reliability storage and made as small as possible (to minimise infrastructure cost), safe in the knowledge that if HA enables successfully once, it won’t run out of space and fail to enable in the future.

The Xapi-to-Xapi communication looks as follows:

Configuring HA around the Pool

The Xapi Pool master calls Host.ha_join_liveset on all hosts in the pool simultaneously. Each host runs the ha_start_daemon script which starts Xhad. Each Xhad starts exchanging heartbeats over the network and storage defined in the xhad.conf.

Joining a liveset

Starting up a host

The Xhad instances exchange heartbeats and decide which hosts are in the “liveset” and which have been fenced.

After joining the liveset, each host clears the “excluded” flag which would have been set if the host had been shut down cleanly before; this is only needed when a host is shut down cleanly and then restarted.

Xapi periodically queries the state of xhad via the ha_query_liveset command. The state will be Starting until the liveset is fully formed at which point the state will be Online.

When the ha_start_daemon script returns then Xapi will decide whether to stand for master election or not. Initially when HA is being enabled and there is a master already, this node will be expected to stand unopposed. Later when HA notices that the master host has been fenced, all remaining hosts will stand for election and one of them will be chosen.

Shutting down a host

When a host is to be shut down cleanly, it can be safely “excluded” from the pool such that a future failure of the storage heartbeat will not cause all pool hosts to self-fence (see survival rule 2 above). When a host is “excluded” all other hosts know that the host does not consider itself a master and has no resources locked i.e. no VMs are running on it. An excluded host will never allow itself to form part of a “split brain”.

Once a host has given up its master role and shut down any VMs, it is safe to disable fencing with ha_disarm_fencing and stop xhad with ha_stop_daemon. Once the daemon has been stopped the “excluded” bit can be set in the statefile via ha_set_excluded and the host safely rebooted.

Restarting a host

When a host restarts after a failure Xapi notices that ha_armed is set in the local database. Xapi

runs the attach-static-vdis script to attach the statefile and database VDIs. This can fail if the storage is inaccessible; Xapi will retry until it succeeds.
runs the ha_start_daemon to join the liveset, or determine that HA has been cleanly disabled (via setting the state to Invalid).

In the special case where Xhad fails to access the statefile and the host used to be a slave then Xapi will try to contact the previous master and find out

who the new master is;
whether HA is enabled on the Pool or not.

If Xapi can confirm that HA was disabled then it will disarm itself and join the new master. Otherwise it will keep waiting for the statefile to recover.

In the special case where the statefile has been destroyed and cannot be recovered, there is an emergency HA disable API the admin can use to assert that HA really has been disabled, and it’s not simply a connectivity problem. Obviously this API should only be used if the admin is totally sure that HA has been disabled.

Disabling HA

There are 2 methods of disabling HA: one for the “normal” case when the statefile is available; and the other for the “emergency” case when the statefile has failed and can’t be recovered.

Disabling HA cleanly

HA can be shut down cleanly when the statefile is working i.e. when hosts are alive because of survival rule 1. First the master Xapi tells the local Xhad to mark the pool state as “invalid” using ha_set_pool_state. Every xhad instance will notice this state change the next time it performs a storage heartbeat. The Xhad instances will shut down and Xapi will notice that HA has been disabled the next time it attempts to query the liveset.

If a host loses access to the statefile (or if none of the hosts have access to the statefile) then HA can be disabled uncleanly.

Disabling HA uncleanly

The Xapi master first calls Host.ha_disable_failover_actions on each host which sets ha_disable_failover_decisions in the local database. This prevents the node rebooting, gaining statefile access, acquiring the master lock and restarting VMs when other hosts have disabled their fencing (i.e. a “split brain”).

Disabling HA uncleanly

Once the master is sure that no host will suddenly start recovering VMs it is safe to call Host.ha_disarm_fencing which runs the script ha_disarm_fencing and then shuts down the Xhad with ha_stop_daemon.

Add a host to the pool

We assume that adding a host to the pool is an operation the admin will perform manually, so it is acceptable to disable HA for the duration and to re-enable it afterwards. If a failure happens during this operation then the admin will take care of it by hand.

Multi-version drivers

Linux loads device drivers on boot and every device driver exists in one version. XAPI extends this scheme such that device drivers may exist in multiple variants plus a mechanism to select the variant being loaded on boot. Such a driver is called a multi-version driver and we expect only a small subset of drivers, built and distributed by XenServer, to have this property. The following covers the background, API, and CLI for multi-version drivers in XAPI.

Variant vs. Version

A driver comes in several variants, each of which has a version. A variant may be updated to a later version while retaining its identity. This makes variants and versions somewhat synonymous and is admittedly confusing.

Device Drivers in Linux and XAPI

Drivers that are not compiled into the kernel are loaded dynamically from the file system. They are loaded from the hierarchy

/lib/modules/<kernel-version>/

and we are particularly interested in the hierarchy

/lib/modules/<kernel-version>/updates/

where vendor-supplied (“driver disk”) drivers are located and where we want to support multiple versions. A driver file typically has the extension .ko (kernel object).

The presence of a driver in the file system does not mean that it is loaded, as loading happens only on demand. The drivers actually loaded (or modules, in Linux parlance) can be observed from

/proc/modules

netlink_diag 16384 0 - Live 0x0000000000000000
udp_diag 16384 0 - Live 0x0000000000000000
tcp_diag 16384 0 - Live 0x0000000000000000

which includes dependencies between modules (the - means no dependencies).
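A listing in this format can be turned into a lookup table with a few lines of Python. The sketch below is illustrative only and is not part of XAPI; the function name parse_proc_modules and the dictionary keys are hypothetical. It assumes the first five whitespace-separated fields are the name, size, reference count, dependency field and state, as in the sample above.

```python
def parse_proc_modules(text):
    """Parse /proc/modules content into {name: details} (illustrative)."""
    modules = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 5:
            continue  # skip blank or malformed lines
        name, size, refcount, deps, state = fields[:5]
        modules[name] = {
            "size": int(size),
            "refcount": int(refcount),
            # "-" means no entries; otherwise a comma-separated list.
            "deps": [] if deps == "-" else [d for d in deps.split(",") if d],
            "state": state,
        }
    return modules
```

A driver is then loaded, in the sense used here, if its name appears as a key; on a real host the text would come from reading /proc/modules.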

Driver Properties

A driver name is unique and a driver can be loaded only once. The fact that kernel object files are located in a file system hierarchy means that a driver may exist multiple times and in different versions in the file system. From the kernel’s perspective a driver has a unique name and is loaded at most once. We thus can talk about a driver using its name and acknowledge it may exist in different versions in the file system.
A driver that is loaded by the kernel we call active.
A driver file (name.ko) that is in a hierarchy searched by the kernel is called selected. If the kernel needs the driver of that name, it would load this object file.

For a driver (name.ko) selection and activation are independent properties:

inactive, deselected: not loaded now and won’t be loaded on next boot.
active, deselected: currently loaded but won’t be loaded on next boot.
inactive, selected: not loaded now but will be loaded on demand.
active, selected: currently loaded and will be loaded on demand after a reboot.

For a driver to be selected it needs to be in the hierarchy searched by the kernel. By removing a driver from the hierarchy it can be de-selected. This is possible even for drivers that are already loaded. Hence, activation and selection are independent.

Multi-Version Drivers

To support multi-version drivers, XenServer introduces a new hierarchy in Dom0. This is mostly technical background because a lower-level tool deals with this and not XAPI directly.

/lib/modules/<kernel-version>/updates/ is searched by the kernel for drivers.
The hierarchy is expected to contain symbolic links to the file actually containing the driver: /lib/modules/<kernel-version>/xenserver/<driver>/<version>/<name>.ko

The xenserver hierarchy provides drivers in several versions. To select a particular version, we expect a symbolic link from updates/<name>.ko to <driver>/<version>/<name>.ko. At the next boot, the kernel will search the updates/ entries and load the linked driver, which will become active.

Example filesystem hierarchy:

/lib/
└── modules
    └── 4.19.0+1 ->
        ├── updates
        │   ├── aacraid.ko
        │   ├── bnx2fc.ko -> ../xenserver/bnx2fc/2.12.13/bnx2fc.ko
        │   ├── bnx2i.ko
        │   ├── cxgb4i.ko
        │   ├── cxgb4.ko
        │   ├── dell_laptop.ko -> ../xenserver/dell_laptop/1.2.3/dell_laptop.ko
        │   ├── e1000e.ko
        │   ├── i40e.ko
        │   ├── ice.ko -> ../xenserver/intel-ice/1.11.17.1/ice.ko
        │   ├── igb.ko
        │   ├── smartpqi.ko
        │   └── tcm_qla2xxx.ko
        └── xenserver
            ├── bnx2fc
            │   ├── 2.12.13
            │   │   └── bnx2fc.ko
            │   └── 2.12.20-dell
            │       └── bnx2fc.ko
            ├── dell_laptop
            │   └── 1.2.3
            │       └── dell_laptop.ko
            └── intel-ice
                ├── 1.11.17.1
                │   └── ice.ko
                └── 1.6.4
                    └── ice.ko

Selection of a driver is synonymous with creating a symbolic link to the desired version.
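The following Python sketch illustrates selection as symbolic-link manipulation, using the directory layout from the example hierarchy. It is an illustration only: in practice a lower-level tool performs this step, not XAPI, and the function names select_variant and selected_version as well as the modules_root parameter are hypothetical. Replacing the link via a temporary name is a design choice of this sketch, so that an existing link is swapped in one rename.

```python
import os


def select_variant(modules_root, driver, version, name):
    """Point updates/<name>.ko at xenserver/<driver>/<version>/<name>.ko.

    modules_root stands for /lib/modules/<kernel-version>.
    """
    updates = os.path.join(modules_root, "updates")
    target = os.path.join("..", "xenserver", driver, version, name + ".ko")
    if not os.path.isfile(os.path.join(updates, target)):
        raise FileNotFoundError(target)

    link = os.path.join(updates, name + ".ko")
    tmp = link + ".tmp"
    if os.path.lexists(tmp):
        os.remove(tmp)
    os.symlink(target, tmp)  # relative link, as in the example hierarchy
    os.replace(tmp, link)    # rename over any existing entry
    return link


def selected_version(modules_root, name):
    """Return the version directory the link points into, or None."""
    link = os.path.join(modules_root, "updates", name + ".ko")
    if not os.path.islink(link):
        return None  # missing, or a plain file that is not multi-version
    return os.readlink(link).split(os.sep)[-2]
```

Deselecting a driver would correspond to removing the link from updates/, which does not affect a driver that is already loaded.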

Versions

The version of a driver is encoded in the path to its object file but not in the name itself: for xenserver/intel-ice/1.11.17.1/ice.ko the driver name is ice and only its location hints at the version.

The kernel does not reveal the location from where it loaded an active driver. Hence the name is not sufficient to observe the currently active version. For this, we use ELF notes.

The driver file (name.ko) is in ELF linker format and may contain custom ELF notes. These are binary annotations that can be compiled into the file. The kernel reveals these details for loaded drivers (i.e., modules) in:

/sys/module/<name>/notes/

The directory contains files like

/sys/module/xfs/notes/.note.gnu.build-id

with a specific name (.note.xenserver) for our purpose. Such a file contains in binary encoding a sequence of records, each containing:

A null-terminated name (string)
A type (integer)
A desc (see below)

The format of the description is vendor specific and is used for a null-terminated string holding the version. The name is fixed to “XenServer”. The exact format is described in ELF notes.

A note with the name “XenServer” and a particular type then has the version as a null-terminated string in the desc field. Additional “XenServer” notes of a different type may be present.
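As a sketch of how such a file could be decoded, the Python below walks the standard ELF note layout: three 32-bit integers (name size, desc size, type) followed by the name and the desc, each padded to a 4-byte boundary. The function names are hypothetical, little-endian byte order is assumed, and the numeric type that marks the version note is not given in this text, so it is passed in as the parameter version_type.

```python
import struct


def parse_elf_notes(data, byteorder="<"):
    """Split a notes blob into (name, type, desc) records."""
    notes = []
    offset = 0
    while offset + 12 <= len(data):
        namesz, descsz, ntype = struct.unpack_from(byteorder + "III", data, offset)
        offset += 12
        name = data[offset:offset + namesz].rstrip(b"\0").decode("ascii", "replace")
        offset += (namesz + 3) & ~3  # name is padded to 4 bytes
        desc = data[offset:offset + descsz]
        offset += (descsz + 3) & ~3  # desc is padded to 4 bytes
        notes.append((name, ntype, desc))
    return notes


def xenserver_version(data, version_type):
    """Return the version string from the matching "XenServer" note, if any.

    version_type is a placeholder for the note type holding the version.
    """
    for name, ntype, desc in parse_elf_notes(data):
        if name == "XenServer" and ntype == version_type:
            return desc.split(b"\0", 1)[0].decode("utf-8")
    return None
```

On a host, the input would be the bytes read from /sys/module/<name>/notes/.note.xenserver.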

API

XAPI has capabilities to inspect and select multi-version drivers.

The API uses the terminology introduced above:

A driver is specific to a host.
A driver has a unique name; however, for API purposes a driver is identified by a UUID (on the CLI) and reference (programmatically).
A driver has multiple variants; each variant has a version. Programmatically, variants are represented as objects (referenced by UUID and a reference) but this is mostly hidden in the CLI for convenience.
A driver variant is active if it is currently used by the kernel (loaded).
A driver variant is selected if it will be considered by the kernel (on next boot or when loading on demand).
Only one variant can be active, and only one variant can be selected.

Inspection and selection of drivers is facilitated by a tool (“drivertool”) that is called by xapi. Hence, XAPI does not by itself manipulate the file system that implements driver selection.

An example interaction with the API through xe:

[root@lcy2-dt110 log]# xe hostdriver-list uuid=c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0 params=all
uuid ( RO)                   : c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0
                   name ( RO): cisco-fnic
                   type ( RO): network
            description ( RO): cisco-fnic
                   info ( RO): cisco-fnic
              host-uuid ( RO): 6de288e7-0f82-4563-b071-bcdc083b0ffd
         active-variant ( RO): <none>
       selected-variant ( RO): <none>
               variants ( RO): generic/1.2
    variants-dev-status ( RO): generic=beta
          variants-uuid ( RO): generic=abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
    variants-hw-present ( RO):

A variant is selected by name (which is unique per driver); this variant would become active after a reboot.

[root@lcy2-dt110 log]# xe hostdriver-select variant-name=generic uuid=c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0
[root@lcy2-dt110 log]# xe hostdriver-list uuid=c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0 params=all
uuid ( RO)                   : c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0
                   name ( RO): cisco-fnic
                   type ( RO): network
            description ( RO): cisco-fnic
                   info ( RO): cisco-fnic
              host-uuid ( RO): 6de288e7-0f82-4563-b071-bcdc083b0ffd
         active-variant ( RO): <none>
       selected-variant ( RO): generic
               variants ( RO): generic/1.2
    variants-dev-status ( RO): generic=beta
          variants-uuid ( RO): generic=abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
    variants-hw-present ( RO):

The variant can be inspected, too, using its UUID.

[root@lcy2-dt110 log]# xe hostdriver-variant-list uuid=abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
uuid ( RO)           : abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
           name ( RO): generic
        version ( RO): 1.2
         status ( RO): beta
         active ( RO): false
       selected ( RO): true
    driver-uuid ( RO): c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0
    driver-name ( RO): cisco-fnic
      host-uuid ( RO): 6de288e7-0f82-4563-b071-bcdc083b0ffd
     hw-present ( RO): false
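When scripting around the CLI, the record output shown above can be turned into a dictionary. The sketch below is one way to do that; the regular expression is an assumption based only on the sample output in this section, and `pending_activation` is a hypothetical helper name, not part of any XAPI tooling.

```python
import re

# Matches lines such as "   name ( RO): generic" or "uuid ( RO)     : abf5...".
FIELD = re.compile(r"^\s*(?P<key>[\w-]+)\s*\(\s*(?P<mode>[A-Z]+)\)\s*:(?P<value>.*)$")

SAMPLE = """\
uuid ( RO)           : abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
           name ( RO): generic
        version ( RO): 1.2
         status ( RO): beta
         active ( RO): false
       selected ( RO): true
    driver-uuid ( RO): c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0
    driver-name ( RO): cisco-fnic
      host-uuid ( RO): 6de288e7-0f82-4563-b071-bcdc083b0ffd
     hw-present ( RO): false
"""


def parse_record(text):
    """Parse one xe record listing into a {field: value} dictionary."""
    record = {}
    for line in text.splitlines():
        match = FIELD.match(line)
        if match:
            record[match.group("key")] = match.group("value").strip()
    return record


def pending_activation(variant):
    """True if the variant is selected but not yet loaded by the kernel."""
    return variant["selected"] == "true" and variant["active"] == "false"


variant = parse_record(SAMPLE)
needs_reboot = pending_activation(variant)
```

For the sample above, the variant is selected but not active, which matches the earlier remark that a selected variant becomes active after a reboot.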

Class Host_driver

Class Host_driver represents an instance of a multi-version driver on a host. It references Driver_variant objects for the details of the available and active variants. A variant has a version.

Fields

All fields are read-only and can’t be set directly. Be aware that names in the CLI and the API may differ.

host: reference to the host where the driver is installed.
name: string; name of the driver without “.ko” extension.
variants: string set; set of variants available on the host for this driver. The name of each variant of a driver is unique and used in the CLI for selecting it.
selected_variant: variant, possibly empty. Variant that is selected, i.e. the variant of the driver that will be considered by the kernel when loading the driver the next time. May be null when none is selected.
active_variant: variant, possibly empty. Variant that is currently loaded by the kernel.
type, info, description: strings providing background information.

The CLI uses hostdriver and a dash instead of an underscore. The CLI also offers convenience fields. Whenever selected and active variant are not the same, a reboot is required to activate the selected driver/variant combination.

(We are not using host-driver in the CLI to avoid the impression that this is part of a host object.)

Methods

All method invocations require Pool_Operator rights. “The Pool Operator role manages host- and pool-wide resources, including setting up storage, creating resource pools and managing patches, high availability (HA) and workload balancing (WLB)”
select (self, variant): select the given variant (a reference) of the existing driver self.
deselect(self): this driver can’t be loaded next time the kernel is looking for a driver. This is a potentially dangerous operation, so it’s protected in the CLI with a --force flag.
rescan (host): scan the host and update its driver information. Called on toolstack restart and may be invoked from the CLI for development.

Class Driver_variant

An object of this class represents a variant of a driver on a host, i.e., it is specific to both.

name: unique name
driver: what host driver this belongs to
version: string; a driver variant has a version
status: string: development status, like “beta”
hardware_present: boolean, true if the host has the hardware installed supported by this driver

The only method available is select(self) to select a variant. It has the same effect as the select method on the Host_driver class.

The CLI comes with corresponding xe hostdriver-variant-* commands to list and select a variant.

[root@lcy2-dt110 log]# xe hostdriver-variant-list uuid=abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
uuid ( RO)           : abf5997b-f2ad-c0ef-b27f-3f8a37bf58a6
           name ( RO): generic
        version ( RO): 1.2
         status ( RO): beta
         active ( RO): false
       selected ( RO): true
    driver-uuid ( RO): c0fe459d-5f8a-3fb1-3fe5-3c602fafecc0
    driver-name ( RO): cisco-fnic
      host-uuid ( RO): 6de288e7-0f82-4563-b071-bcdc083b0ffd
     hw-present ( RO): false

Database

Each Host_driver and Driver_variant object is represented in the database and data is persisted over reboots. This means this data will be part of data collected in a xen-bugtool invocation.

Scan and Rescan

On XAPI start-up, XAPI updates the Host_driver objects belonging to the host to reflect the actual situation. This can be initiated from the CLI, too, mostly for development.

NUMA

NUMA in a nutshell

Systems that contain more than one CPU socket are typically built on a Non-Uniform Memory Architecture (NUMA) ¹². In a NUMA system each node has fast, lower latency access to local memory.

hwloc

In the diagram ³ above we have 4 NUMA nodes:

2 of those are due to 2 separate physical packages (sockets)
a further 2 are due to Sub-NUMA-Clustering (aka Nodes Per Socket for AMD) where the L3 cache is split

The L3 cache is shared among multiple cores, but cores 0-5 have lower latency access to one part of it than cores 6-11. This is also reflected by splitting memory addresses into four 31GiB ranges in total.

In the diagram the closer the memory is to the core, the lower the access latency:

per-core caches: L1, L2
per-package shared cache: L3 (local part), L3 (remote part)
local NUMA node (to a group of cores, e.g. L#0 P#0), node 0
remote NUMA node in same package (L#1 P#2), node 1
remote NUMA node in other packages (L#2 P#1 and L#3 P#3), node 2 and 3

The NUMA distance matrix

Accessing a remote NUMA node in the other package has to go through a shared interconnect, which has lower bandwidth than the direct connections and becomes a bottleneck if both cores have to access remote memory: the bandwidth available to a single core is then effectively at most half.

This is reflected in the NUMA distance/latency matrix. The units are arbitrary, and by convention access latency to the local NUMA node is given distance ‘10’.

Relative latency matrix by logical indexes:

index	0	2	1	3
0	10	21	11	21
2	21	10	21	11
1	11	21	10	21
3	21	11	21	10

This follows the latencies described previously:

fast access to local NUMA node memory (by definition), node 0, cost 10
slightly slower access latency to the other NUMA node in same package, node 1, cost 11
twice as slow access latency to remote NUMA memory in the other physical package (socket): nodes 2 and 3, cost 21
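The matrix can be loaded into a small lookup structure to make the three cost classes explicit. This is only a sketch: the row and column labels are taken literally from the table above, and the helper names are made up for this example.

```python
# Row/column order exactly as printed in the table above.
ORDER = [0, 2, 1, 3]
ROWS = [
    [10, 21, 11, 21],
    [21, 10, 21, 11],
    [11, 21, 10, 21],
    [21, 11, 21, 10],
]

# DISTANCE[src][dst] is the relative access cost from node src to node dst.
DISTANCE = {src: dict(zip(ORDER, row)) for src, row in zip(ORDER, ROWS)}

LOCAL_DISTANCE = 10  # by convention, the cost of local access


def relative_cost(src, dst):
    """Access cost relative to local memory (1.0 means local)."""
    return DISTANCE[src][dst] / LOCAL_DISTANCE


def nearest_first(src):
    """All nodes sorted by increasing distance from src (ties by node index)."""
    return sorted(DISTANCE[src], key=lambda dst: (DISTANCE[src][dst], dst))
```

For node 0, `nearest_first(0)` lists the local node first, then the other node in the same package, then the two nodes in the other package.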

There is also I/O NUMA where a cost is similarly associated to where a PCIe is plugged in, but exploring that is future work (it requires exposing NUMA topology to the Dom0 kernel to benefit from it), and for simplicity the diagram above does not show it.

Advantages of NUMA

NUMA does have advantages though: if each node accesses only its local memory, then each node can independently achieve maximum throughput.

For best performance, we should:

minimize the amount of interconnect bandwidth we are using
run code that accesses memory allocated on the closest NUMA node
maximize the number of NUMA nodes that we use in the system as a whole

If a VM’s memory and vCPUs can entirely fit within a single NUMA node then we should tell Xen to prefer to allocate memory from and run the vCPUs on a single NUMA node.

Xen vCPU soft-affinity

The Xen scheduler supports 2 kinds of constraints:

hard pinning: a vCPU may only run on the specified set of pCPUs and nowhere else
soft pinning: a vCPU is preferably run on the specified set of pCPUs, but if they are all busy then it may run elsewhere

Hard pinning can be used to partition the system. But, it can potentially leave part of the system idle while another part is bottlenecked by many vCPUs competing for the same limited set of pCPUs.

Xen does not migrate workloads between NUMA nodes on its own (the Linux kernel can), although it is possible to achieve a similar effect with explicit migration. However, migration introduces additional delays and is best avoided for entire VMs.

Therefore, soft pinning is preferred: Running on a potentially suboptimal pCPU that uses remote memory could still be better than not running it at all until a pCPU is free to run it.

Xen will also allocate memory for the VM according to the vCPU (soft) pinning: If the vCPUs are pinned to NUMA nodes A and B, Xen allocates memory from NUMA nodes A and B in a round-robin way, resulting in interleaving.

Current default: No vCPU pinning

By default, when no vCPU pinning is used, Xen interleaves memory from all NUMA nodes. This averages the memory performance, but individual tasks’ performance may be significantly higher or lower depending on which NUMA node the application may have “landed” on. As a result, restarting processes will speed them up or slow them down as address space randomization picks different memory regions inside a VM.

This uses the memory bandwidth of all memory controllers and distributes the load across all nodes. However, the memory latency is higher as the NUMA interconnects are used for most memory accesses and vCPU synchronization within the Domains.

Note that this is not the worst case: the worst case would be for memory to be allocated on one NUMA node, but the vCPU always running on the furthest away NUMA node.

Best effort NUMA-aware memory allocation for VMs

Summary

The best-effort mode attempts to fit Domains into NUMA nodes and to balance memory usage. It soft-pins Domains on the NUMA node with the most available memory when adding the Domain. Memory is currently allocated when booting the VM (or while constructing the resuming VM).

Parallel boot issue: Memory is not pre-allocated on creation, but allocated during boot. The result is that parallel VM creation and boot can exhaust the memory of NUMA nodes.

Goals

By default, Xen stripes the VM’s memory across all NUMA nodes of the host, which means that every VM has to go through all the interconnects. The goal here is to find a better allocation than the default, not necessarily an optimal allocation. An optimal allocation would require knowing what VMs you would start/create in the future, and planning across hosts. This allows the host to use all NUMA nodes to take advantage of the full memory bandwidth available on the pool hosts.

Overall, we want to balance the VMs across NUMA nodes, such that we use all NUMA nodes to take advantage of the maximum memory bandwidth available on the system. For now this proposed balancing will be done only by balancing memory usage: always heuristically allocating VMs on the NUMA node that has the most available memory. For now, this allocation has a race condition: This happens when multiple VMs are booted in parallel, because we don’t wait until Xen has constructed the domain for each one (that’d serialize domain construction, which is currently parallel). This may be improved in the future by having an API to query Xen where it has allocated the memory, and to explicitly ask it to place memory on a given NUMA node (instead of best_effort).

If a VM doesn’t fit into a single node then it is not so clear what the best approach is. One criterion to consider is minimizing the NUMA distance between the nodes chosen for the VM. Large NUMA systems may not be fully connected in a mesh, requiring multiple hops to reach a node, or may even have asymmetric links, or links with different bandwidth. The specific NUMA topology is provided by the ACPI SLIT table as the matrix of distances between nodes. It is possible that 3 NUMA nodes have a smaller average/maximum distance than 2, so we need to consider all possibilities.

For N nodes there would be 2^N possibilities, so [Topology.NUMA.candidates] limits the number of choices to 65520+N (full set of 2^N possibilities for 16 NUMA nodes, and a reduced set of choices for larger systems).

Implementation

[Topology.NUMA.candidates] is a sorted sequence of node sets, in ascending order of maximum/average distances. Once we’ve eliminated the candidates not suitable for this VM (that do not have enough total memory/pCPUs) we are left with a monotonically increasing sequence of nodes. There are still multiple possibilities with the same average distance. This is where we consider our second criterion, balancing, and pick the node with the most available free memory.

Once a suitable set of NUMA nodes is picked we compute the CPU soft affinity as the union of the CPUs from all these NUMA nodes. If we didn’t find a solution then we let Xen use its default allocation.
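The selection steps above can be sketched in a few lines. This is a simplified illustration and not the xenopsd code: it enumerates every node subset (the real candidate sequence is capped), uses the average pairwise distance as the only distance metric, and breaks ties by the total free memory of the candidate set. The host layout in `NODES` (CPU ranges and free memory in GiB) is invented for the example.

```python
from itertools import combinations

# Distances for the 4-node example system used earlier in this document.
DISTANCE = {
    0: {0: 10, 1: 11, 2: 21, 3: 21},
    1: {0: 11, 1: 10, 2: 21, 3: 21},
    2: {0: 21, 1: 21, 2: 10, 3: 11},
    3: {0: 21, 1: 21, 2: 11, 3: 10},
}

# Hypothetical host state: pCPUs and free memory (GiB) per NUMA node.
NODES = {
    0: {"cpus": set(range(0, 6)), "free": 8},
    1: {"cpus": set(range(6, 12)), "free": 20},
    2: {"cpus": set(range(12, 18)), "free": 31},
    3: {"cpus": set(range(18, 24)), "free": 15},
}


def average_distance(subset, distance):
    """Mean distance over all ordered pairs of nodes, including self-pairs."""
    total = sum(distance[a][b] for a in subset for b in subset)
    return total / (len(subset) ** 2)


def plan(nodes, distance, vcpus, memory):
    """Return (node set, soft-affinity CPU set), or None to use Xen's default."""
    candidates = []
    for size in range(1, len(nodes) + 1):
        for subset in combinations(sorted(nodes), size):
            cpus = set().union(*(nodes[n]["cpus"] for n in subset))
            free = sum(nodes[n]["free"] for n in subset)
            # Eliminate candidates without enough pCPUs or memory.
            if len(cpus) >= vcpus and free >= memory:
                candidates.append((average_distance(subset, distance), -free, subset, cpus))
    if not candidates:
        return None
    # Smallest distance first; among equals, the most free memory.
    _, _, subset, cpus = min(candidates, key=lambda c: c[:3])
    return subset, cpus
```

With this host state, a request for 4 vCPUs and 16 GiB lands on the single node with the most free memory, while a request for 8 vCPUs needs two nodes and picks a pair from the same package.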

The “distances” between NUMA nodes may not all be equal, e.g. some nodes may have shorter links to some remote NUMA nodes, while others may have to go through multiple hops to reach it. See page 13 in ⁴ for a diagram of an AMD Opteron 6272 system.

Limitations and tradeoffs

Booting multiple VMs in parallel will result in potentially allocating both on the same NUMA node (race condition)
When we’re about to run out of host memory we’ll fall back to striping memory again, but the soft affinity mask won’t reflect that (this needs an API to query Xen on where it has actually placed the VM, so we can fix up the mask accordingly)
XAPI is not aware of NUMA balancing across a pool. Xenopsd chooses NUMA nodes purely based on amount of free memory on the NUMA nodes of the host, even if a better NUMA placement could be found on another host
Very large (>16 NUMA nodes) systems may only explore a limited number of choices (fit into a single node vs fallback to full interleaving)
The exact VM placement is not yet controllable
Microbenchmarks with a single VM on a host show both performance improvements and regressions on memory bandwidth usage: previously a single VM may have been able to take advantage of the bandwidth of both NUMA nodes if it happened to allocate memory from the right places, whereas now it’ll be forced to use just a single node. As soon as you have more than 1 VM that is busy on a system enabling NUMA balancing should almost always be an improvement though.
It is not supported to combine hard vCPU masks with soft affinity: if hard affinities are used, then no NUMA scheduling is done by the toolstack, and we obey exactly what the user has asked for with hard affinities. This shouldn’t affect other VMs since the memory used by hard-pinned VMs will still be reflected in overall less memory available on individual NUMA nodes.
Corner case: the ACPI standard allows certain NUMA nodes to be unreachable (distance 0xFF = -1 in the Xen bindings). This is not supported and will cause an exception to be raised. If this is an issue in practice the NUMA matrix could be pre-filtered to contain only reachable nodes. NUMA nodes with 0 CPUs are accepted (it can result from hard affinity pinning)
NUMA balancing is not considered during HA planning
Dom0 is a single VM that needs to communicate with all other VMs, so NUMA balancing is not applied to it (we’d need to expose NUMA topology to the Dom0 kernel, so it can better allocate processes)
IO NUMA is out of scope for now

XAPI datamodel design

New API field: Host.numa_affinity_policy.
Choices: default_policy, any, best_effort.
On upgrade the field is set to default_policy
Changes in the field only affect newly (re)booted VMs; for changes to take effect on existing VMs, a host evacuation or reboot is needed

There may be more choices in the future (e.g. strict, which requires both Xen and toolstack changes).

Meaning of the policy:

any: the Xen default where it allocates memory by striping across NUMA nodes
best_effort: the algorithm described in this document, where soft pinning is used to achieve better balancing and lower latency
default_policy: when the admin hasn’t expressed a preference
Currently, default_policy is treated as any, but the admin can change it, and then the system will remember that change across upgrades. If we didn’t have a default_policy then changing the “default” policy on an upgrade would be tricky: we either risk overriding an explicit choice of the admin, or existing installs cannot take advantage of the improved performance from best_effort
Future XAPI versions may change default_policy to mean best_effort. Admins can still override it to any if they wish on a host by host basis.

It is not expected that users would have to change best_effort, unless they run very specific workloads, so a pool level control is not provided at this moment.

There is also no separate feature flag: this host flag acts as a feature flag that can be set through the API without restarting the toolstack, although only newly (re)booted VMs will benefit.

Debugging the allocator is done by running xl vcpu-list and investigating the soft pinning masks, and by analyzing xensource.log.

Xenopsd implementation

See the documentation in [softaffinity.mli] and [topology.mli].

[Softaffinity.plan] returns a [CPUSet] given a host’s NUMA allocation state and a VM’s NUMA allocation request.
[Topology.CPUSet] provides helpers for operating on a set of CPU indexes.
[Topology.NUMAResource] is a [CPUSet] and the free memory available on a NUMA node.
[Topology.NUMARequest] is a request for a given number of vCPUs and memory in bytes.
[Topology.NUMA] represents a host’s NUMA allocation state.
[Topology.NUMA.candidates] are groups of nodes ordered by minimum average distance. The sequence is limited to [N+65520], where [N] is the number of NUMA nodes. This avoids exponential state space explosion on very large systems (>16 NUMA nodes).
[Topology.NUMA.choose] will choose one NUMA node deterministically, while trying to keep overall NUMA node usage balanced.
[Domain.numa_placement] builds a [NUMARequest] and uses the above [Topology] and [Softaffinity] functions to compute and apply a plan.

We used to have a xenopsd.conf configuration option to enable NUMA placement. For backwards compatibility this is still supported, but only if the admin hasn’t set an explicit policy on the Host. It is best to remove the experimental xenopsd.conf entry though, as a future version may completely drop it.
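The interaction between the host policy and the legacy configuration option can be summarised as a small decision function. This is a sketch of the rules as described in this document; the function and parameter names are hypothetical and do not correspond to identifiers in xapi or xenopsd.

```python
POLICIES = ("default_policy", "any", "best_effort")


def effective_policy(host_policy, legacy_conf_enabled=False):
    """Resolve the policy that applies to newly (re)booted VMs.

    host_policy: value of Host.numa_affinity_policy.
    legacy_conf_enabled: whether the old xenopsd.conf option is set
    (hypothetical parameter name).
    """
    if host_policy not in POLICIES:
        raise ValueError(f"unknown policy: {host_policy}")
    if host_policy != "default_policy":
        # An explicit choice by the admin always wins.
        return host_policy
    # No explicit choice: honour the legacy option, else behave like "any".
    return "best_effort" if legacy_conf_enabled else "any"
```

Keeping `default_policy` distinct from `any` is what lets a future version change the default without overriding an explicit admin choice.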

Tests are in [test_topology.ml], which checks balancing properties and whether the plan has improved best/worst/average-case access times in a simulated test based on 2 predefined NUMA distance matrices (one from an Intel and one from an AMD system).

Future work

Enable ‘best_effort’ mode by default once more testing has been done
Add an API to query Xen for the NUMA node memory placement (where it has actually allocated the VM’s memory). Currently, only the xl debug-keys interface exists which is not supported in production as it can result in killing the host via the watchdog, and is not a proper API, but a textual debug output with no stability guarantees.
More host policies, e.g. strict. Requires the XAPI pool scheduler to be NUMA aware and consider it as part of choosing hosts.
VM level policy that can set a NUMA affinity index, mapped to a NUMA node modulo NUMA nodes available on the system (this is needed so that after migration we don’t end up trying to allocate vCPUs to a non-existent NUMA node)
VM level anti-affinity rules for NUMA placement (can be achieved by setting unique NUMA affinity indexes)

¹ Xen on NUMA Machines
² What is NUMA?
³ Created with lstopo-no-graphics --no-io --of svg --vert=L3 >hwloc.svg on a bare metal Linux
⁴ Lepers, Baptiste. “Improving performance on NUMA systems.” PhD diss., Université de Grenoble, 2014.

Snapshots

Snapshots represent the state of a VM, or a disk (VDI) at a point in time. They can be used for:

backups (hourly, daily, weekly etc)
experiments (take snapshot, try something, revert back again)
golden images (install OS, get it just right, clone it 1000s of times)

Disk snapshots

Disks are represented in the XenAPI as VDI objects. Disk snapshots are represented as VDI objects with the flag is_a_snapshot set to true. Snapshots are always considered read-only, and should only be used for backup or cloning into new disks. Disk snapshots have a lifetime independent of the disk they are a snapshot of i.e. if someone deletes the original disk, the snapshots remain. This contrasts with some storage arrays in which snapshots are “second class” objects which are automatically deleted when the original disk is deleted.

Disks are implemented in Xapi via “Storage Manager” (SM) plugins. The SM plugins conform to an API (the SMAPI) which has operations including:

vdi_create: make a fresh disk, full of zeroes
vdi_snapshot: create a snapshot of a disk

File-based vhd implementation

The existing “EXT” and “NFS” file-based Xapi SM plugins store disk data in trees of .vhd files as in the following diagram:

Relationship between VDIs and vhd files

From the XenAPI point of view, we have one current VDI and a set of snapshots, each taken at a different point in time. These VDIs correspond to leaf vhds in a tree stored on disk, where the non-leaf nodes contain all the shared blocks.

The vhd files are always thinly-provisioned which means they only allocate new blocks on an as-needed basis. The snapshot leaf vhd files only contain vhd metadata and therefore are very small (a few KiB). The parent nodes containing the shared blocks only contain the shared blocks. The current leaf initially contains only the vhd metadata and therefore is very small (a few KiB) and will only grow when the VM writes blocks.

File-based vhd implementations are a good choice if a “gold image” snapshot is going to be cloned lots of times.

Block-based vhd implementation

The existing “LVM”, “LVMoISCSI” and “LVMoHBA” block-based Xapi SM plugins store disk data in trees of .vhd files contained within LVM logical volumes:

Relationship between VDIs and LVs containing vhd data

Non-snapshot VDIs are always stored full size (a.k.a. thickly-provisioned). When parent nodes are created they are automatically shrunk to the minimum size needed to store the shared blocks. The LVs corresponding with snapshot VDIs only contain vhd metadata and by default consume 8MiB. Note: this is different to VDI.clones which are stored full size.

Block-based vhd implementations are not a good choice if a “gold image” snapshot is going to be cloned lots of times, since each clone will be stored full size.

Hypothetical LUN implementation

A hypothetical Xapi SM plugin could use LUNs on an iSCSI storage array as VDIs, and the array’s custom control interface to implement the “snapshot” operation:

Relationship between VDIs and LUNs on a hypothetical storage target

From the XenAPI point of view, we have one current VDI and a set of snapshots, each taken at a different point in time. These VDIs correspond to LUNs on the same iSCSI target, and internally within the target these LUNs are comprised of blocks from a large shared copy-on-write pool with support for dedup.

Reverting disk snapshots

There is no current way to revert in-place a disk to a snapshot, but it is possible to create a writable disk by “cloning” a snapshot.

VM snapshots

Let’s say we have a VM, “VM1” that has 2 disks. Concentrating only on the VM, VBDs and VDIs, we have the following structure:

VM objects

When we take a snapshot, we first ask the storage backends to snapshot all of the VDIs associated with the VM, producing new VDI objects. Then we copy all of the metadata, producing a new ‘snapshot’ VM object, complete with its own VBDs copied from the original, but now pointing at the snapshot VDIs. We also copy the VIFs and VGPUs but for now we will ignore those.

This process leads to a set of objects that look like this:

VM and snapshot objects

We have fields that help navigate the new objects: VM.snapshot_of, and VDI.snapshot_of. These, like you would expect, point to the relevant other objects.
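The object relationships can be modelled with a few plain classes. The sketch below only mirrors the fields named in this section (snapshot_of, is_a_snapshot) and ignores VIFs and VGPUs; the class layout and the `-snap` naming are illustrative and not how Xapi stores these records.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class VDI:
    name: str
    is_a_snapshot: bool = False
    snapshot_of: Optional["VDI"] = None


@dataclass
class VBD:
    device: str
    vdi: VDI


@dataclass
class VM:
    name: str
    vbds: List[VBD] = field(default_factory=list)
    snapshot_of: Optional["VM"] = None


def vdi_snapshot(vdi):
    """Stand-in for the storage backend's snapshot operation."""
    return VDI(name=vdi.name + "-snap", is_a_snapshot=True, snapshot_of=vdi)


def vm_snapshot(vm):
    """Snapshot every disk first, then copy the VM metadata and its VBDs."""
    snapshot_vbds = [VBD(device=vbd.device, vdi=vdi_snapshot(vbd.vdi)) for vbd in vm.vbds]
    return VM(name=vm.name + "-snap", vbds=snapshot_vbds, snapshot_of=vm)


vm1 = VM("VM1", [VBD("xvda", VDI("disk0")), VBD("xvdb", VDI("disk1"))])
snap = vm_snapshot(vm1)
```

Each VBD of the snapshot VM points at a snapshot VDI, and both the VM and the VDIs link back to their originals through snapshot_of.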

Deleting VM snapshots

When a snapshot is deleted Xapi calls the SM API vdi_delete. The Xapi SM plugins which use vhd format data do not reclaim space immediately; instead they mark the corresponding vhd leaf node as “hidden” and, at some point later, run a garbage collector process.

The garbage collector will first determine whether a “coalesce” should happen i.e. whether any parent nodes have only one child i.e. the “shared” blocks are only “shared” with one other node. In the following example the snapshot delete leaves such a parent node and the coalesce process copies blocks from the redundant parent’s only child into the parent:

We coalesce parent blocks into grand parent nodes

Note that if the vhd data is being stored in LVM, then the parent node will have had to be expanded to full size to accommodate the writes. Unfortunately this means the act of reclaiming space actually consumes space itself, which means it is important to never completely run out of space in such an SR.

Once the blocks have been copied, we can now cut one of the parents out of the tree by relinking its children into their grandparent:

Relink children into grand parent

Finally the garbage collector can remove unused vhd files / LVM LVs:

Clean up
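The three garbage collection steps (coalesce, relink, clean up) can be illustrated on an in-memory tree. This is a toy model under simplifying assumptions: nodes are named by strings, each node carries a dictionary of block index to data, and only non-leaf children are coalesced into their parent. It is not the SM garbage collector.

```python
def children_of(parent_of, node):
    """Names of the nodes whose parent is `node`."""
    return [n for n, p in parent_of.items() if p == node]


def garbage_collect(parent_of, blocks, hidden):
    """Return a new (parent_of, blocks) pair after removing hidden leaves
    and coalescing any parent that is left with a single non-leaf child."""
    parent_of = dict(parent_of)
    blocks = {n: dict(b) for n, b in blocks.items()}

    # Deleted snapshots were only marked hidden; drop those leaves now.
    for node in sorted(hidden):
        if not children_of(parent_of, node):
            del parent_of[node]
            del blocks[node]

    changed = True
    while changed:
        changed = False
        for child, parent in list(parent_of.items()):
            if parent is None:
                continue
            is_internal = bool(children_of(parent_of, child))
            is_only_child = children_of(parent_of, parent) == [child]
            if is_internal and is_only_child:
                # Coalesce: the child's newer blocks overwrite the parent's.
                blocks[parent].update(blocks[child])
                # Relink the grandchildren into the surviving parent.
                for grandchild in children_of(parent_of, child):
                    parent_of[grandchild] = parent
                # Clean up the now redundant node.
                del parent_of[child]
                del blocks[child]
                changed = True
                break
    return parent_of, blocks
```

For a tree `P0 -> {snap1, P1}` and `P1 -> {snap2, current}`, deleting `snap1` leaves `P0` with the single child `P1`, so `P1` is folded into `P0` and `snap2` and `current` become children of `P0`.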

Reverting VM snapshots

The XenAPI call VM.revert overwrites the VM metadata with the snapshot VM metadata, deletes the current VDIs and replaces them with clones of the snapshot VDIs. Note there is no “vdi_revert” in the SMAPI.

Revert implementation details

This is the process by which we revert a VM to a snapshot. The first thing to notice is that there is some logic that is called from message_forwarding.ml, which uses some low-level database magic to turn the current VM record into one that looks like the snapshot object. We then go to the rest of the implementation in xapi_vm_snapshot.ml. First, we shut down the VM if it is currently running. Then, we revert all of the VBDs, VIFs and VGPUs. To revert the VBDs, we need to deal with the VDIs underneath them. In order to create space, the first thing we do is delete all of the VDIs currently attached via VBDs to the VM. We then clone the disks from the snapshot. Note that there is no SMAPI operation ‘revert’ currently - we simply clone from the snapshot VDI. It’s important to note that cloning creates a new VDI object: the one we started with is gone.

Tracing

Tracing is a powerful tool for observing system behavior across multiple components, making it especially useful for debugging and performance analysis in complex environments.

By integrating OpenTelemetry (a standard that unifies OpenTracing and OpenCensus) and the Zipkin v2 protocol, XAPI enables efficient tracking and visualization of operations across internal and external systems. This facilitates detailed analysis and improves collaboration between teams.

Tracing is commonly used in high-level applications such as web services. As a result, less widely-used or non-web-oriented languages may lack dedicated libraries for distributed tracing; an OCaml implementation has been developed specifically for XenAPI.

How tracing works in XAPI

Spans and Trace Context

A span is the core unit of a trace, representing a single operation with a defined start and end time. Spans can contain sub-spans that represent child tasks. This helps identify bottlenecks or areas that can be parallelized.
A span can contain several contextual elements such as tags (key-value pairs), events (time-based data), and errors.
The TraceContext HTTP standard defines how trace IDs and span contexts are propagated across systems, enabling full traceability of operations.

This data enables the creation of relationships between tasks and supports visualizations such as architecture diagrams or execution flows. These help in identifying root causes of issues and bottlenecks, and also assist newcomers in onboarding to the project.
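To make the span concept concrete, the sketch below models spans with parent links and exports them as dictionaries using Zipkin v2 JSON field names (traceId, id, parentId, name, timestamp, duration, tags, with times in microseconds). The span names, IDs and tag keys are invented for the example, and the exact fields XAPI emits may differ.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    name: str
    start_us: int
    end_us: int
    parent_id: Optional[str] = None
    tags: Dict[str, str] = field(default_factory=dict)

    @property
    def duration_us(self):
        return self.end_us - self.start_us

    def to_zipkin(self):
        """Export using Zipkin v2 JSON field names."""
        doc = {
            "traceId": self.trace_id,
            "id": self.span_id,
            "name": self.name,
            "timestamp": self.start_us,
            "duration": self.duration_us,
            "tags": dict(self.tags),
        }
        if self.parent_id is not None:
            doc["parentId"] = self.parent_id
        return doc


def slowest_child(spans, parent):
    """Return the longest-running direct child of `parent`, or None."""
    children = [s for s in spans if s.parent_id == parent.span_id]
    return max(children, key=lambda s: s.duration_us, default=None)


TRACE = "4bf92f3577b34da6a3ce929d0e0e4736"
root = Span(TRACE, "00f067aa0ba902b7", "VM.start", 0, 900, tags={"component": "xapi"})
spans = [
    root,
    Span(TRACE, "1111111111111111", "allocate memory", 10, 210, root.span_id),
    Span(TRACE, "2222222222222222", "build domain", 220, 880, root.span_id),
]
bottleneck = slowest_child(spans, root)
```

Looking at the longest child span of an operation is a simple way to locate a bottleneck within a trace.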

Configuration

To enable tracing, you need to create an Observer object in XAPI. This can be done using the xe CLI:

xe observer-create \
  name-label=<name> \
  enabled=true \
  components=xapi,xenopsd

By default, if you don’t specify enabled=true, the observer will be disabled.

To add an HTTP endpoint, make sure the server is up and running, then run:
```
xe observer-param-set uuid=<OBSERVER_UUID> endpoints=bugtool,http://<jaeger-ip>:9411/api/v2/spans
```
If you specify an invalid or unreachable HTTP endpoint, the configuration will fail.

components: Specify which internal components (e.g., xapi, xenopsd) should be traced. Additional components are expected to be supported in future releases. An experimental smapi component is also available and requires additional configuration (explained below).
endpoints: The observer can collect traces locally in /var/log/dt or forward them to external visualization tools such as Jaeger. Currently, only HTTP/S endpoints are supported, and they require additional configuration steps (see next section).

To disable tracing you just need to set enabled to false:

xe observer-param-set uuid=<OBSERVER_UUID> enabled=false

Enabling smapi component

The smapi component is currently considered experimental and is filtered by default. To enable it, you must explicitly configure the following in xapi.conf:
```
observer-experimental-components=""
```
This tells XAPI that no components are considered experimental, thereby allowing smapi to be traced. A modification to xapi.conf requires a restart of the XAPI toolstack.

Enabling HTTP/S endpoints

By default HTTP and HTTPS endpoints are disabled. To enable them, add the following lines to xapi.conf:
```
observer-endpoint-http-enabled=true
observer-endpoint-https-enabled=true
```
As with enabling the smapi component, modifying xapi.conf requires a restart of the XAPI toolstack. Note: HTTPS endpoint support is available but not tested and may not work.

Sending local trace to endpoint

By default, traces are generated locally in the /var/log/dt directory. You can copy or forward these traces to another location or endpoint using the xs-trace tool. For example, if you have a Jaeger server running locally, you can run:

xs-trace /var/log/dt/ http://127.0.0.1:9411/api/v2/spans

You will then be able to visualize the traces in Jaeger.

Tagging Trace Sessions for Easier Search

Specific attributes

To make trace logs easier to locate and analyze, it can be helpful to add custom attributes around the execution of specific commands. For example:

# xe observer-param-set uuid=<OBSERVER_UUID> attributes:custom.random=1234
# xe vm-start ...
# xe observer-param-clear uuid=<OBSERVER_UUID> param-name=attributes param-key=custom.random

This technique adds a temporary attribute, custom.random=1234, which will appear in the generated trace spans, making it easier to search for specific activity in trace visualisation tools. It may also be possible to achieve similar tagging using baggage parameters directly in individual xe commands, but this approach is currently undocumented.

Baggage

Baggage, contextual information that resides alongside the context, is supported. This means you can run the following command:

BAGGAGE="mybaggage=apples" xe vm-list

You will be able to search for tags mybaggage=apples.

Traceparent

Another way to assist in trace searching is to use the TRACEPARENT HTTP header, a header field that identifies the incoming request. It has a specific format and is supported by XAPI. Once you have generated a value, you can run a command such as:

TRACEPARENT="00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01" xe vm-list

And you will be able to look for trace 4bf92f3577b34da6a3ce929d0e0e4736.
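The value used above follows the W3C Trace Context layout: a 2-digit version, a 32-digit trace ID, a 16-digit parent (span) ID and 2-digit flags, all lowercase hexadecimal and separated by dashes. A minimal sketch for generating and checking such values, with helper names chosen for this example:

```python
import re
import secrets

TRACEPARENT = re.compile(
    r"^(?P<version>[0-9a-f]{2})-"
    r"(?P<trace_id>[0-9a-f]{32})-"
    r"(?P<parent_id>[0-9a-f]{16})-"
    r"(?P<flags>[0-9a-f]{2})$"
)


def new_traceparent(sampled=True):
    """Generate a version-00 traceparent value with random IDs."""
    trace_id = secrets.token_hex(16)   # 16 random bytes -> 32 hex digits
    parent_id = secrets.token_hex(8)   # 8 random bytes -> 16 hex digits
    flags = "01" if sampled else "00"
    return f"00-{trace_id}-{parent_id}-{flags}"


def parse_traceparent(value):
    """Split a traceparent value into its four fields."""
    match = TRACEPARENT.match(value)
    if match is None:
        raise ValueError(f"malformed traceparent: {value!r}")
    return match.groupdict()
```

The generated string can be exported as the TRACEPARENT environment variable before running an xe command, and the `trace_id` field is the value to search for in the visualisation tool.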

vGPU

XenServer has supported passthrough for GPU devices since XenServer 6.0. Since the advent of NVIDIA’s vGPU-capable GRID K1/K2 cards it has been possible to carve up a GPU into smaller pieces yielding a more scalable solution to boosting graphics performance within virtual machines.

The K1 has four GK104 GPUs and the K2 two GK107 GPUs. Each of these will be exposed through Xapi so a host with a single K1 card will have access to four independent PGPUs.

Each of the GPUs can then be subdivided into vGPUs. For each type of PGPU, there are a few options of vGPU type which consume different amounts of the PGPU. For example, K1 and K2 cards can currently be configured in the following ways:

Possible VGX configurations

Note: this diagram is not to scale. The PGPU resource required by each vGPU type is as follows:

vGPU type	PGPU kind	vGPUs / PGPU
k100	GK104	8
k140Q	GK104	4
k200	GK107	8
k240Q	GK107	4
k260Q	GK107	2

Currently each physical GPU (PGPU) only supports homogeneous vGPU configurations, but different configurations are supported on different PGPUs across a single K1/K2 card. This means that, for example, a host with a K1 card can run 32 VMs with k100 vGPUs (8 per PGPU across its four PGPUs).
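
The capacity of a card follows from multiplying its number of PGPUs by the vGPUs-per-PGPU figure in the table. The sketch below encodes the figures given in this section and computes the maximum for a card whose PGPUs all run the same vGPU type. The names cards, vgpu_types and max_vgpus are illustrative and do not come from Xapi.

```ocaml
(* PGPU kind and number of PGPUs per card, as stated in the text above. *)
let cards = [("K1", ("GK104", 4)); ("K2", ("GK107", 2))]

(* PGPU kind and vGPUs per PGPU for each vGPU type, copied from the table. *)
let vgpu_types =
  [ ("k100", ("GK104", 8))
  ; ("k140Q", ("GK104", 4))
  ; ("k200", ("GK107", 8))
  ; ("k240Q", ("GK107", 4))
  ; ("k260Q", ("GK107", 2)) ]

(* Maximum number of vGPUs of one type on one card. Returns None if the card
   or type is unknown, or if the type does not run on that kind of PGPU. *)
let max_vgpus ~card ~vgpu_type =
  match (List.assoc_opt card cards, List.assoc_opt vgpu_type vgpu_types) with
  | Some (card_kind, pgpus), Some (type_kind, per_pgpu)
    when card_kind = type_kind ->
      Some (pgpus * per_pgpu)
  | _ -> None

let () =
  List.iter
    (fun (card, vgpu_type) ->
      match max_vgpus ~card ~vgpu_type with
      | Some n -> Printf.printf "%s with %s: up to %d vGPUs\n" card vgpu_type n
      | None -> Printf.printf "%s cannot host %s\n" card vgpu_type)
    [("K1", "k140Q"); ("K2", "k260Q"); ("K1", "k260Q")]
```

Because different PGPUs on one card may run different types, a mixed card's capacity is the sum of the per-PGPU figures rather than this single product.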

XenServer’s vGPU architecture

A new display type has been added to the device model:

@@ -4519,6 +4522,7 @@ static const QEMUOption qemu_options[] =

     /* Xen tree options: */
     { "std-vga", 0, QEMU_OPTION_std_vga },
+    { "vgpu", 0, QEMU_OPTION_vgpu },
     { "videoram", HAS_ARG, QEMU_OPTION_videoram },
     { "d", HAS_ARG, QEMU_OPTION_domid }, /* deprecated; for xend compatibility */
     { "domid", HAS_ARG, QEMU_OPTION_domid },

With this in place, qemu can now be started using a new option that enables it to communicate with a new display emulator, vgpu, to expose the graphics device to the guest. The vgpu binary is responsible for handling the VGX-capable GPU and, once the device has been successfully passed through, the in-guest drivers can be installed in the same way as when the guest detects new hardware.

The diagram below shows the relevant parts of the architecture for this project.

XenServer’s vGPU architecture

Relevant code

In Xenopsd: Xenops_server_xen is where Xenopsd gets the vGPU information from the values passed from Xapi;
In Xenopsd: Device.__start is where the vgpu process is started, if necessary, before Qemu.

Xapi’s API and data model

A lot of work has gone into the toolstack to handle the creation and management of VMs with vGPUs. We revised our data model, introducing a semantic link between VGPU and PGPU objects to help with utilisation tracking; we maintained the GPU_group concept as a pool-wide abstraction of PGPUs available for VMs; and we added VGPU_types which are configurations for VGPU objects.

Xapi’s vGPU datamodel

Aside: The VGPU type in Xapi’s data model predates this feature and was synonymous with GPU-passthrough. A VGPU is simply a display device assigned to a VM which may be a vGPU (this feature) or a whole GPU (a VGPU of type passthrough).

VGPU_types can be enabled/disabled on a per-PGPU basis allowing for reservation of particular PGPUs for certain workloads. VGPUs are allocated on PGPUs within their GPU group in either a depth-first or breadth-first manner, which is configurable on a per-group basis.
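
The two allocation modes can be illustrated with a small sketch. It assumes that depth-first means packing new VGPUs onto the PGPU with the least remaining capacity, and that breadth-first means spreading them onto the PGPU with the most remaining capacity. The types and the choose_pgpu function are hypothetical; the real logic lives in Xapi_pgpu_helpers and may differ in detail.

```ocaml
(* Hypothetical model of a PGPU: capacity counted in vGPUs of a single type. *)
type pgpu = {name : string; capacity : int; used : int}

type mode = Depth_first | Breadth_first

let free p = p.capacity - p.used

(* Pick a PGPU with spare capacity according to the allocation mode. *)
let choose_pgpu mode pgpus =
  let candidates = List.filter (fun p -> free p > 0) pgpus in
  let by_free a b = compare (free a) (free b) in
  let sorted =
    match mode with
    | Depth_first ->
        (* least free capacity first: fill one PGPU before the next *)
        List.stable_sort by_free candidates
    | Breadth_first ->
        (* most free capacity first: spread the load *)
        List.stable_sort (fun a b -> by_free b a) candidates
  in
  match sorted with [] -> None | p :: _ -> Some p

let () =
  let pgpus =
    [ {name = "pgpu-a"; capacity = 4; used = 2}
    ; {name = "pgpu-b"; capacity = 4; used = 0} ]
  in
  let show mode label =
    match choose_pgpu mode pgpus with
    | Some p -> Printf.printf "%s picks %s\n" label p.name
    | None -> Printf.printf "%s: no capacity left\n" label
  in
  show Depth_first "depth-first";
  show Breadth_first "breadth-first"
```

Under these assumptions, depth-first keeps whole PGPUs free for other vGPU types or for passthrough, while breadth-first gives each VGPU the least contended hardware.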

VGPU_types are created by xapi at startup depending on the available hardware and config files present in dom0. They exist in the pool database, and a primary key is used to avoid duplication. In XenServer 6.x the tuple of (vendor_name, model_name) was used as the primary key; however, this was not ideal as these values are subject to change. XenServer 7.0 switched to a new primary key generated from static metadata, falling back to the old method for backwards compatibility.

A VGPU_type will be garbage collected when there is no VGPU of that type and there is no hardware which supports that type. On VM import, all VGPUs and VGPU_types will be created if necessary - if this results in the creation of a new VGPU_type then the VM will not be usable until the required hardware and drivers are installed.

Relevant code

In Xapi: Xapi_vgpu_type contains the type definitions and parsing logic for vGPUs;
In Xapi: Xapi_pgpu_helpers defines the functions used to allocate vGPUs on PGPUs.

Xapi <-> Xenopsd interface

In XenServer 6.x, all VGPU config was added to the VM’s platform field at startup, and this information was used by xenopsd to start the display emulator. See the relevant code in ocaml/xapi/vgpuops.ml.

In XenServer 7.0, to facilitate support of VGPU on Intel hardware in parallel with the existing NVIDIA support, VGPUs were made first-class objects in the xapi-xenopsd interface. The interface is described in the design document on the GPU support evolution.

VM startup

On the pool master:

Assuming no WLB, all VM.start tasks pass through Xapi_vm_helpers.choose_host_for_vm_no_wlb. If the VM has a vGPU, the list of all hosts in the pool is split into a list of lists, where the first list is the most optimal in terms of the GPU group’s allocation mode and the PGPU availability on each host.
Each list of hosts in turn is passed to Xapi_vm_placement.select_host, which checks storage, network and memory availability, until a suitable host is found.
Once a host has been chosen, allocate_vm_to_host will set the VM.scheduled_to_be_resident_on and VGPU.scheduled_to_be_resident_on fields.
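
The "list of lists" search described above can be sketched as follows. Host groups are tried in order of preference, and within each group the first host that passes the resource checks is chosen. The function names and the suitable predicate are hypothetical stand-ins for the checks in Xapi_vm_placement.select_host, which may choose among suitable hosts differently.

```ocaml
(* Return the first host in [hosts] that satisfies [suitable], if any. *)
let rec first_suitable suitable hosts =
  match hosts with
  | [] -> None
  | host :: rest ->
      if suitable host then Some host else first_suitable suitable rest

(* Try each group in turn, most preferred group first. *)
let rec choose_host suitable host_groups =
  match host_groups with
  | [] -> None
  | group :: rest -> (
    match first_suitable suitable group with
    | Some host -> Some host
    | None -> choose_host suitable rest )

let () =
  (* Hypothetical pool: host1 and host2 are best for the GPU group, but only
     host3 has enough free memory. *)
  let groups = [["host1"; "host2"]; ["host3"]] in
  let has_enough_memory host = host = "host3" in
  match choose_host has_enough_memory groups with
  | Some host -> Printf.printf "VM will start on %s\n" host
  | None -> print_endline "no suitable host"
```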

The task is then ready to be forwarded to the host on which the VM will start:

If the VM has a VGPU, the startup task is wrapped in Xapi_gpumon.with_gpumon_stopped. This makes sure that the NVIDIA driver is not in use, so that it can be loaded or unloaded from physical GPUs as required.
The VM metadata, including VGPU metadata, is passed to xenopsd. The creation of the VGPU metadata is done by vgpus_of_vm. Note that at this point passthrough VGPUs are represented by the PCI device type, and metadata is generated by pcis_of_vm.
As part of starting up the VM, xenopsd should report a VGPU event or a PCI event, which xapi will use to indicate that the xapi VGPU object can be marked as currently_attached.

Usage

To create a VGPU of a given type you can use vgpu-create:

$ xe vgpu-create vm-uuid=... gpu-group-uuid=... vgpu-type-uuid=...

To see a list of VGPU types available for use on your XenServer, run the following command. Note: these will only be populated if you have installed the relevant NVIDIA RPMs and if there is hardware installed on that host that supports each type. Using params=all will display more information such as the maximum number of heads supported by that VGPU type and which PGPUs have this type enabled and supported.

$ xe vgpu-type-list [params=all]

To access the new and relevant parameters on a PGPU (i.e. supported_VGPU_types, enabled_VGPU_types, resident_VGPUs) you can use pgpu-param-get with param-name=supported-vgpu-types, param-name=enabled-vgpu-types and param-name=resident-vgpus respectively. Alternatively, you can use the following command to list all the parameters for a given PGPU, including the types it supports and has enabled:

$ xe pgpu-list uuid=... params=all

Xapi Storage Migration

Xapi Storage Migration (XSM), also known as “Storage Motion”, allows:

a running VM to be migrated within a pool, between different hosts and different storage simultaneously;
a running VM to be migrated to another pool;
a disk attached to a running VM to be moved to another SR.

The following diagram shows how XSM works at a high level:

Xapi Storage Migration

The slowest part of a storage migration is migrating the storage, since virtual disks can be very large. Xapi starts by taking a snapshot and copying that to the destination as a background task. Before the datapath connecting the VM to the disk is re-established, xapi tells tapdisk to start mirroring all writes to a remote tapdisk over NBD. From this point on all VM disk writes are written to both the old and the new disk. When the background snapshot copy is complete, xapi can migrate the VM memory across. Once the VM memory image has been received, the destination VM is complete and the original can be safely destroyed.
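
To see why mirroring plus a background snapshot copy leaves the destination identical to the source, consider the toy model below. It is only a sketch: disks are arrays of strings, and the destination is modelled as a base that receives the snapshot plus an overlay that receives mirrored writes. That split is an assumption made so that the background copy cannot overwrite newer mirrored data; it is not a description of how tapdisk stores blocks.

```ocaml
let blocks = 4

(* The disk the running VM is using. *)
let source = Array.make blocks "old"

(* Destination: [dest_base] receives the snapshot copy, [dest_overlay]
   receives writes mirrored while the copy is in progress. *)
let dest_base = Array.make blocks ""
let dest_overlay : string option array = Array.make blocks None

(* Snapshot taken before the VM resumes writing. *)
let snapshot = Array.copy source

(* Once mirroring is on, every VM write goes to both disks. *)
let vm_write i data =
  source.(i) <- data;
  dest_overlay.(i) <- Some data

(* Background task: copy the snapshot to the destination base. *)
let copy_snapshot () = Array.blit snapshot 0 dest_base 0 blocks

(* Reading the destination prefers mirrored data over snapshot data. *)
let read_dest i =
  match dest_overlay.(i) with Some data -> data | None -> dest_base.(i)

let () =
  vm_write 1 "new";   (* write that lands before the copy finishes *)
  copy_snapshot ();
  vm_write 3 "newer"; (* write that lands after the copy finishes *)
  for i = 0 to blocks - 1 do
    assert (read_dest i = source.(i))
  done;
  print_endline "destination matches source"
```

Writes are captured by the mirror whether they arrive before or after the snapshot copy, which is why the VM can keep running throughout.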

Xapi

Xapi is the xapi-project host and cluster manager.

Xapi is responsible for:

providing a stable interface (the XenAPI)
allowing one client to manage multiple hosts
hosting the “xe” CLI
authenticating users and applying role-based access control
locking resources (in particular disks)
allowing storage to be managed through plugins
planning and coping with host failures (“High Availability”)
storing VM and host configuration
generating alerts
managing software patching

Principles

The XenAPI interface must remain backwards compatible, allowing older clients to continue working
Xapi delegates all Xenstore/libxc/libxl access to Xenopsd, so Xapi could be run in an unprivileged helper domain
Xapi delegates the low-level storage manipulation to SM plugins.
Xapi delegates setting up host networking to xcp-networkd.
Xapi delegates monitoring performance counters to xcp-rrdd.

Overview

The following diagram shows the internals of Xapi:

Internals of xapi

The top of the diagram shows the XenAPI clients: XenCenter, XenOrchestra, OpenStack and CloudStack using XenAPI and HTTP GET/PUT over ports 80 and 443 to talk to xapi. These XenAPI calls (JSON-RPC or XML-RPC over HTTP POST) and HTTP GET/PUT requests are always authenticated using either PAM (by default using the local passwd and group files) or through Active Directory.

The APIs are classified into categories:

coordinator-only: these are the majority of current APIs. The coordinator should be called and relied upon to forward the call to the right place with the right locks held.
normally-local: these are performance special cases such as disk import/export and console connection which are sent directly to hosts which have the most efficient access to the data.
emergency: these deal with scenarios where the coordinator is offline

If the incoming API call should be resent to the coordinator, then a XenAPI HOST_IS_SLAVE error message containing the coordinator’s IP is sent to the client.
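
A client can react to HOST_IS_SLAVE by retrying the call against the address carried in the error. The sketch below models that with a hypothetical response type and a caller-supplied send function; it performs no real RPC, and none of these names come from the XenAPI client libraries.

```ocaml
(* Hypothetical result of one API call attempt. *)
type 'a response =
  | Ok_result of 'a
  | Host_is_slave of string (* address of the coordinator *)

(* Send the call to [host]; if it answers HOST_IS_SLAVE, retry once against
   the coordinator it names. *)
let call_with_redirect ~(send : string -> 'a response) ~host =
  match send host with
  | Ok_result r -> Ok r
  | Host_is_slave coordinator -> (
    match send coordinator with
    | Ok_result r -> Ok r
    | Host_is_slave _ -> Error "coordinator changed during the call" )

(* Fake transport for illustration: only 10.0.0.1 is the coordinator. *)
let fake_send host =
  if host = "10.0.0.1" then Ok_result ("answered by " ^ host)
  else Host_is_slave "10.0.0.1"

let () =
  match call_with_redirect ~send:fake_send ~host:"10.0.0.2" with
  | Ok r -> print_endline r
  | Error e -> prerr_endline e
```

Retrying only once keeps the sketch from looping if the coordinator role is changing hands; a real client would decide its own retry policy.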

Once past the initial checks, API calls enter the “message forwarding” layer which

locks resources (via the current_operations mechanism)
decides which host should execute the request.

If the request should run locally then a direct function call is used; otherwise the message forwarding code makes a synchronous API call to a specific other host. Note: Xapi currently employs a “thread per request” model which causes one full POSIX thread to be created for every request. Even when a request is forwarded the full thread persists, blocking for the result to become available.

If the XenAPI call is a VM lifecycle operation then it is converted into a Xenopsd API call and forwarded over a Unix domain socket. Xapi and Xenopsd have similar notions of cancellable asynchronous “tasks”, so the current Xapi task (all operations run in the context of a task) is bound to the Xenopsd task, so cancellation is passed through and progress updates are received.

If the XenAPI call is a storage operation then the “storage access” layer

verifies that the storage objects are in the correct state (SR attached/detached; VDI attached/activated read-only/read-write)
invokes the relevant operation in the Storage Manager API (SMAPI) v2 interface;
depending on the type of SR:
- uses the SMAPIv2 to SMAPIv1 converter to generate the necessary command-line to talk to the SMAPIv1 plugin (EXT, NFS, LVM etc) and to execute it
- uses the SMAPIv2 to SMAPIv3 converter daemon xapi-storage-script to execute the necessary SMAPIv3 command (GFS2)
persists the state of the storage objects (including the result of a VDI.attach call) to persistent storage

Internally the SMAPIv1 plugins use privileged access to the Xapi database to directly set fields (e.g. VDI.virtual_size) that would be considered read-only to other clients. The SMAPIv1 plugins also rely on Xapi for

knowledge of all hosts which may access the storage
locking of disks within the resource pool
safely executing code on other hosts via the “Xapi plugin” mechanism

The Xapi database contains Host and VM metadata and is shared pool-wide. The coordinator keeps a copy in memory, and all other nodes send remote queries to the coordinator. The database associates each object with a generation count which is used to implement the XenAPI event.next and event.from APIs. The database is routinely asynchronously flushed to disk in XML format. If the “redo-log” is enabled then all database writes are made synchronously as deltas to a shared block device. Without the redo-log, recent updates may be lost if Xapi is killed before a flush.
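
The role of the generation count can be shown with a toy model. Every write stamps the object with a new, increasing generation, and a client asks for everything that changed after the generation it last saw. This is a sketch under simplifying assumptions: the obj type, touch and events_from are hypothetical, and the real event.from call takes more parameters than a single integer token.

```ocaml
(* Hypothetical database object carrying the generation of its last write. *)
type obj = {reference : string; mutable generation : int}

let current_generation = ref 0
let objects : obj list ref = ref []

(* Record a write to [reference], stamping it with a fresh generation. *)
let touch reference =
  incr current_generation;
  match List.find_opt (fun o -> o.reference = reference) !objects with
  | Some o -> o.generation <- !current_generation
  | None ->
      objects := {reference; generation = !current_generation} :: !objects

(* Return the objects modified after [token] and the token to use next. *)
let events_from token =
  let changed =
    List.filter (fun o -> o.generation > token) !objects
    |> List.map (fun o -> o.reference)
    |> List.sort compare
  in
  (changed, !current_generation)

let () =
  touch "VM:a";
  touch "host:h";
  let _, token = events_from 0 in
  touch "VM:a";
  let changed, _ = events_from token in
  List.iter print_endline changed
```

Because the client only keeps a token, the server does not need to remember which events each client has already seen.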

High-Availability refers to planning for host failure, monitoring host liveness and then following-through on the plans. Xapi defers to an external host liveness monitor called xhad. When xhad confirms that a host has failed – and has been isolated from the storage – then Xapi will restart any VMs which have failed and which have been marked as “protected” by HA. Xapi can also impose admission control to prevent the pool becoming too overloaded to cope with n arbitrary host failures.

The xe CLI is implemented in terms of the XenAPI, but for efficiency the implementation is linked directly into Xapi. The xe program remotes its command-line to Xapi, and Xapi sends back a series of simple commands (prompt for input; print line; fetch file; exit etc).

Guides

Helpful guides for xapi developers.

How to add....

Adding a Class to the API

This document describes how to add a new class to the data model that defines the XenServer API. It complements two other documents that describe how to extend an existing class: Adding a field to the API and Adding a function to the API.

As a running example, we will use the addition of a class that is part of the design for the PVS Direct feature. PVS Direct introduces proxies that serve VMs with disk images. This class was added via commit CP-16939 to Xen API.

Example: PVS_server

In the world of XenServer, each important concept like a virtual machine, interface, or user is represented by a class in the data model. A class defines methods and instance variables. At runtime, all class instances are held in an in-memory database. For example, part of PVS Direct is a class PVS_server, representing a resource that provides block-level data for virtual machines. The design document defines it to have the following important properties:

Fields

(string set) addresses (RO/constructor) IPv4 addresses of the server.
(int) first_port (RO/constructor) First UDP port accepted by the server.
(int) last_port (RO/constructor) Last UDP port accepted by the server.
(PVS_farm ref) farm (RO/constructor) Link to the farm that this server is included in. A PVS_server object must always have a valid farm reference; the PVS_server will be automatically GC’ed by xapi if the associated PVS_farm object is removed.
(string) uuid (RO/runtime) Unique identifier/object reference. Allocated by the server.

Methods (or Functions)

(PVS_server ref) introduce (string set addresses, int first_port, int last_port, PVS_farm ref farm) Introduce a new PVS server into the farm. Allowed at any time, even when proxies are in use. The proxies will be updated automatically.
(void) forget (PVS_server ref self) Remove a PVS server from the farm. Allowed at any time, even when proxies are in use. The proxies will be updated automatically.

Implementation Overview

The implementation of a class is distributed over several files:

ocaml/idl/datamodel.ml – central class definition
ocaml/idl/datamodel_types.ml – definition of releases
ocaml/xapi/cli_frontend.ml – declaration of CLI operations
ocaml/xapi/cli_operations.ml – implementation of CLI operations
ocaml/xapi/records.ml – getters and setters
ocaml/xapi/OMakefile – refers to xapi_pvs_farm.ml
ocaml/xapi/api_server.ml – refers to xapi_pvs_farm.ml
ocaml/xapi/message_forwarding.ml
ocaml/xapi/xapi_pvs_farm.ml – implementation of methods, new file

Data Model

The data model ocaml/idl/datamodel.ml defines the class. To keep the name space tidy, most helper functions are grouped into an internal module:

(* datamodel.ml *)

let schema_minor_vsn = 103 (* line 21 -- increment this *)
let _pvs_farm = "PVS_farm" (* line 153 *)

module PVS_farm = struct (* line 8658 *)
  let lifecycle = [Prototyped, rel_dundee_plus, ""]

  let introduce = call
    ~name:"introduce"
    ~doc:"Introduce new PVS farm"
    ~result:(Ref _pvs_farm, "the new PVS farm")
    ~params:
    [ String,"name","name of the PVS farm"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let forget = call
    ~name:"forget"
    ~doc:"Remove a farm's meta data"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ]
    ~errs:[
      Api_errors.pvs_farm_contains_running_proxies;
      Api_errors.pvs_farm_contains_servers;
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()


  let set_name = call
    ~name:"set_name"
    ~doc:"Update the name of the PVS farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; String, "value", "name to be used"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let add_cache_storage = call
    ~name:"add_cache_storage"
    ~doc:"Add a cache SR for the proxies on the farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; Ref _sr, "value", "SR to be used"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let remove_cache_storage = call
    ~name:"remove_cache_storage"
    ~doc:"Remove a cache SR for the proxies on the farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; Ref _sr, "value", "SR to be removed"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let obj =
    let null_str = Some (VString "") in
    let null_set = Some (VSet []) in
    create_obj (* <---- creates class *)
    ~name: _pvs_farm
    ~descr:"machines serving blocks of data for provisioning VMs"
    ~doccomments:[]
    ~gen_constructor_destructor:false
    ~gen_events:true
    ~in_db:true
    ~lifecycle
    ~persist:PersistEverything
    ~in_oss_since:None
    ~messages_default_allowed_roles:_R_POOL_OP
    ~contents:
    [ uid     _pvs_farm ~lifecycle

    ; field   ~qualifier:StaticRO ~lifecycle
              ~ty:String "name" ~default_value:null_str
              "Name of the PVS farm. Must match name configured in PVS"

    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _sr)) "cache_storage" ~default_value:null_set
              ~ignore_foreign_key:true
              "The SR used by PVS proxy for the cache"

    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _pvs_server)) "servers"
              "The set of PVS servers in the farm"


    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _pvs_proxy)) "proxies"
              "The set of proxies associated with the farm"
    ]
    ~messages:
    [ introduce
    ; forget
    ; set_name
    ; add_cache_storage
    ; remove_cache_storage
    ]
    ()
end
let pvs_farm = PVS_farm.obj

The class is defined by a call to create_obj, which defines the fields and messages (methods) belonging to the class. Each field has a name, a type, and some meta information. Likewise, each message (or method) is created by the call function, which describes its parameters.

The PVS_farm has additional getter and setter methods for accessing its fields. These are not declared here as part of the messages but are automatically generated.

To make sure the new class is actually used, it is important to enter it into two lists:

(* datamodel.ml *)
let all_system = (* line 8917 *)
  [
    ...
    vgpu_type;
    pvs_farm;
    ...
  ]

let expose_get_all_messages_for = [ (* line 9097 *)
  ...
  _pvs_farm;
  _pvs_server;
  _pvs_proxy;

When a field refers to another object that itself refers back to it, these two need to be entered into the all_relations list. For example, _pvs_server refers to a _pvs_farm value via "farm", which, in turn, refers to the _pvs_server value via its "servers" field.

let all_relations =
  [
    (* ... *)
    (_sr, "introduced_by"), (_dr_task, "introduced_SRs");
    (_pvs_server, "farm"), (_pvs_farm, "servers");
    (_pvs_proxy,  "farm"), (_pvs_farm, "proxies");
  ]

CLI Conventions

The CLI provides access to objects from the command line. The following conventions exist for naming fields:

A field in the data model uses an underscore (_) but a hyphen (-) in the CLI: what is cache_storage in the data model becomes cache-storage in the CLI.
When a field contains one or more references, like proxies, it becomes proxy-uuids in the CLI because references are always referred to by their UUID.
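
As a small illustration of these conventions, the helpers below derive a CLI name from a data model field name. They are hypothetical and not part of Xapi, where the CLI names are written by hand in records.ml. The singular form of a reference field is passed in explicitly because the convention does not define how to derive it.

```ocaml
(* Replace every underscore with a hyphen. *)
let cli_name field = String.map (fun c -> if c = '_' then '-' else c) field

(* CLI name for a field holding a set of references. [singular] is chosen by
   hand, for example "proxy" for the field "proxies". *)
let cli_name_of_refs ~singular = cli_name singular ^ "-uuids"

let () =
  print_endline (cli_name "cache_storage");
  print_endline (cli_name_of_refs ~singular:"proxy")
```

The conventions are not applied mechanically: the record for PVS_farm exposes cache_storage, a set of SR references, as cache-storage rather than with a -uuids suffix.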

CLI Getters and Setters

All fields can be read from the CLI and some fields can also be set via the CLI. These getters and setters are mostly generated automatically and need to be connected to the CLI through a function in ocaml/xapi/records.ml. Note that field names here use the naming convention for the CLI:

(* ocaml/xapi/records.ml *)
let pvs_farm_record rpc session_id pvs_farm =
  let _ref = ref pvs_farm in
  let empty_record =
    ToGet (fun () -> Client.PVS_farm.get_record rpc session_id !_ref) in
  let record = ref empty_record in
  let x () = lzy_get record in
    { setref    = (fun r -> _ref := r ; record := empty_record)
    ; setrefrec = (fun (a,b) -> _ref := a; record := Got b)
    ; record    = x
    ; getref    = (fun () -> !_ref)
    ; fields=
      [ make_field ~name:"uuid"
        ~get:(fun () -> (x ()).API.pVS_farm_uuid) ()
      ; make_field ~name:"name"
        ~get:(fun () -> (x ()).API.pVS_farm_name)
        ~set:(fun name ->
          Client.PVS_farm.set_name rpc session_id !_ref name) ()
      ; make_field ~name:"cache-storage"
        ~get:(fun () -> (x ()).API.pVS_farm_cache_storage
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~add_to_set:(fun sr_uuid ->
          let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in
          Client.PVS_farm.add_cache_storage rpc session_id !_ref sr)
        ~remove_from_set:(fun sr_uuid ->
          let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in
          Client.PVS_farm.remove_cache_storage rpc session_id !_ref sr)
        ()
      ; make_field ~name:"server-uuids"
        ~get:(fun () -> (x ()).API.pVS_farm_servers
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~get_set:(fun () -> (x ()).API.pVS_farm_servers
          |> List.map get_uuid_from_ref)
        ()
      ; make_field ~name:"proxy-uuids"
        ~get:(fun () -> (x ()).API.pVS_farm_proxies
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~get_set:(fun () -> (x ()).API.pVS_farm_proxies
          |> List.map get_uuid_from_ref)
        ()
      ]
    }

CLI Interface to Methods

Methods accessible from the CLI are declared in ocaml/xapi/cli_frontend.ml. Each declaration refers to the real implementation of the method, like Cli_operations.PVS_farm.introduce:

(* cli_frontend.ml *)
let rec cmdtable_data : (string*cmd_spec) list =
  (* ... *)
  "pvs-farm-introduce",
  {
    reqd=["name"];
    optn=[];
    help="Introduce new PVS farm";
    implementation=No_fd Cli_operations.PVS_farm.introduce;
    flags=[];
  };
  "pvs-farm-forget",
  {
    reqd=["uuid"];
    optn=[];
    help="Forget a PVS farm";
    implementation=No_fd Cli_operations.PVS_farm.forget;
    flags=[];
  };

CLI Implementation of Methods

Each CLI operation that is not a getter or setter has an implementation in cli_operations.ml, written in terms of the corresponding XenAPI client call:

(* cli_operations.ml *)
module PVS_farm = struct
  let introduce printer rpc session_id params =
    let name  = List.assoc "name" params in
    let ref   = Client.PVS_farm.introduce ~rpc ~session_id ~name in
    let uuid  = Client.PVS_farm.get_uuid rpc session_id ref in
    printer (Cli_printer.PList [uuid])

  let forget printer rpc session_id params =
    let uuid  = List.assoc "uuid" params in
    let ref   = Client.PVS_farm.get_by_uuid ~rpc ~session_id ~uuid in
    Client.PVS_farm.forget rpc session_id ref
end

Fields that should show up in the CLI interface by default are declared in the gen_cmds value:

(* cli_operations.ml *)
let gen_cmds rpc session_id =
  let mk = make_param_funs in
  List.concat
  [ (*...*)
  ; Client.Pool.(mk get_all get_all_records_where
    get_by_uuid pool_record "pool" []
    ["uuid";"name-label";"name-description";"master"
    ;"default-SR"] rpc session_id)
  ; Client.PVS_farm.(mk get_all get_all_records_where
    get_by_uuid pvs_farm_record "pvs-farm" []
    ["uuid";"name";"cache-storage";"server-uuids"] rpc session_id)

Error messages

Error messages used by an implementation are introduced in two files:

(* ocaml/xapi-consts/api_errors.ml *)
let pvs_farm_contains_running_proxies = "PVS_FARM_CONTAINS_RUNNING_PROXIES"
let pvs_farm_contains_servers = "PVS_FARM_CONTAINS_SERVERS"
let pvs_farm_sr_already_added = "PVS_FARM_SR_ALREADY_ADDED"
let pvs_farm_sr_is_in_use = "PVS_FARM_SR_IS_IN_USE"
let sr_not_in_pvs_farm = "SR_NOT_IN_PVS_FARM"
let pvs_farm_cant_set_name = "PVS_FARM_CANT_SET_NAME"

(* ocaml/idl/datamodel.ml *)
  (* PVS errors *)
  error Api_errors.pvs_farm_contains_running_proxies ["proxies"]
    ~doc:"The PVS farm contains running proxies and cannot be forgotten." ();

  error Api_errors.pvs_farm_contains_servers ["servers"]
    ~doc:"The PVS farm contains servers and cannot be forgotten."
    ();

  error Api_errors.pvs_farm_sr_already_added ["farm"; "SR"]
    ~doc:"Trying to add a cache SR that is already associated with the farm"
    ();

  error Api_errors.sr_not_in_pvs_farm ["farm"; "SR"]
    ~doc:"The SR is not associated with the farm."
    ();

  error Api_errors.pvs_farm_sr_is_in_use ["farm"; "SR"]
    ~doc:"The SR is in use by the farm and cannot be removed."
    ();

  error Api_errors.pvs_farm_cant_set_name ["farm"]
    ~doc:"The name of the farm can't be set while proxies are active."
    ()

Method Implementation

The implementation of methods lives in a module in ocaml/xapi:

(* ocaml/xapi/api_server.ml *)
  module PVS_farm = Xapi_pvs_farm

The file below is typically a new file and needs to be added to ocaml/xapi/OMakefile.

(* ocaml/xapi/xapi_pvs_farm.ml *)
module D = Debug.Make(struct let name = "xapi_pvs_farm" end)
module E = Api_errors

let api_error msg xs = raise (E.Server_error (msg, xs))

let introduce ~__context ~name =
  let pvs_farm = Ref.make () in
  let uuid = Uuid.to_string (Uuid.make_uuid ()) in
  Db.PVS_farm.create ~__context
    ~ref:pvs_farm ~uuid ~name ~cache_storage:[];
  pvs_farm

(* ... *)

Messages received on a slave host may or may not be executed there. In the simple case, each method executes locally:

(* ocaml/xapi/message_forwarding.ml *)
module PVS_farm = struct
  let introduce ~__context ~name =
    info "PVS_farm.introduce %s" name;
    Local.PVS_farm.introduce ~__context ~name

  let forget ~__context ~self =
    info "PVS_farm.forget";
    Local.PVS_farm.forget ~__context ~self

  let set_name ~__context ~self ~value =
    info "PVS_farm.set_name %s" value;
    Local.PVS_farm.set_name ~__context ~self ~value

  let add_cache_storage ~__context ~self ~value =
    info "PVS_farm.add_cache_storage";
    Local.PVS_farm.add_cache_storage ~__context ~self ~value

  let remove_cache_storage ~__context ~self ~value =
    info "PVS_farm.remove_cache_storage";
    Local.PVS_farm.remove_cache_storage ~__context ~self ~value
end

Adding a field to the API

This page describes how to add a field to XenAPI. A field is a parameter of a class that can be used in functions and read from the API.

Bumping the database schema version

Whenever a field is added to or removed from the API, the database schema version needs to be increased. XAPI needs this fundamental procedure in order to be able to detect that an automatic database upgrade is necessary or to find out that the new schema is incompatible with the existing database. If the schema version is not bumped, XAPI will start failing in unpredictable ways. Note that bumping the version is not necessary when adding functions, only when adding fields.

The current version number is kept at the top of the file ocaml/idl/datamodel_common.ml in the variables schema_major_vsn and schema_minor_vsn, of which only the latter should be incremented (the major version only exists for historical reasons). When moving to a new XenServer release, also update the variable last_release_schema_minor_vsn to the schema version of the last release. To keep track of the schema versions of recent XenServer releases, the file contains variables for these, such as miami_release_schema_minor_vsn. After starting a new version of Xapi on an existing server, the database is automatically upgraded if the schema version of the existing database matches the value of last_release_schema_*_vsn in the new Xapi.
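
The upgrade rule in the last sentence can be sketched as a three-way decision. This is an illustration of the rule as described here, not Xapi's actual code: the vsn and decision types and the decide function are hypothetical, and the real implementation may handle more cases.

```ocaml
(* A schema version as a (major, minor) pair. *)
type vsn = {major : int; minor : int}

type decision =
  | Up_to_date   (* database already uses the current schema *)
  | Upgrade      (* database is from the last release: upgrade it *)
  | Incompatible (* anything else cannot be upgraded automatically *)

(* Compare the version stored in the database with the versions compiled
   into the new Xapi. *)
let decide ~db ~current ~last_release =
  if db = current then Up_to_date
  else if db = last_release then Upgrade
  else Incompatible

let () =
  let current = {major = 5; minor = 56} in
  let last_release = {major = 5; minor = 55} in
  let describe db =
    match decide ~db ~current ~last_release with
    | Up_to_date -> "up to date"
    | Upgrade -> "attempting upgrade"
    | Incompatible -> "incompatible"
  in
  List.iter
    (fun db -> Printf.printf "%d.%d: %s\n" db.major db.minor (describe db))
    [{major = 5; minor = 56}; {major = 5; minor = 55}; {major = 5; minor = 35}]
```

This is why last_release_schema_minor_vsn must be updated when a release is made: a database from that release is only recognised as upgradable if its version matches that value.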

As an example, the patch below shows how the schema version was bumped when the new API fields used for ActiveDirectory integration were added:

--- a/ocaml/idl/datamodel.ml  Tue Nov 11 16:17:48 2008 +0000
+++ b/ocaml/idl/datamodel.ml  Tue Nov 11 15:53:29 2008 +0000
@@ -15,17 +15,20 @@ open Datamodel_types
  open Datamodel_types

  (* IMPORTANT: Please bump schema vsn if you change/add/remove a _field_.
     You do not have to dump vsn if you change/add/remove a message *)

  let schema_major_vsn = 5
 -let schema_minor_vsn = 55
 +let schema_minor_vsn = 56

  (* Historical schema versions just in case this is useful later *)
  let rio_schema_major_vsn = 5
  let rio_schema_minor_vsn = 19

 +let miami_release_schema_major_vsn = 5
 +let miami_release_schema_minor_vsn = 35
 +
  (* the schema vsn of the last release: used to determine whether we can
     upgrade or not.. *)
  let last_release_schema_major_vsn = 5
 -let last_release_schema_minor_vsn = 35
 +let last_release_schema_minor_vsn = 55

Setting the schema hash

The file ocaml/idl/schematest.ml contains the value last_known_schema_hash. After the schema version has been bumped, this needs to be updated to the new hash. Get the new hash by running make test: the correct hash is reported in the error message.

Adding the new field to some existing class

ocaml/idl/datamodel.ml

Add a new “field” line to the class in the file ocaml/idl/datamodel.ml or ocaml/idl/datamodel_[class].ml. The new field might require a suitable default value. This default value is used in case the user does not provide a value for the field.

A field has a number of parameters:

The lifecycle parameter, which shows how the field has evolved over time.
The qualifier parameter, which controls access to the field. The following values are possible:

Value	Meaning
StaticRO	Field is set statically at install-time.
DynamicRO	Field is computed dynamically at run time.
RW	Field is read/write.

The ty parameter for the type of the field.
The default_value parameter.
The name of the field.
A documentation string.

Example of a field in the pool class:

field ~lifecycle:[Published, rel_orlando, "Controls whether HA is enabled"]
      ~qualifier:DynamicRO ~ty:Bool
      ~default_value:(Some (VBool false)) "ha_enabled" "true if HA is enabled on the pool, false otherwise";

See datamodel_types.ml for information about other parameters.

Changing Constructors

Adding a field would change the constructors for the class – functions Db.*.create – and therefore, any references to these in the code need to be updated. In the example, the argument ~ha_enabled:false should be added to any call to Db.Pool.create.

These calls can be found, for example, in ocaml/tests/common/test_common.ml and ocaml/xapi/xapi_[class].ml.

CLI Records

If you want this field to show up in the CLI (which you probably do), you will also need to modify the Records module, in the file ocaml/xapi-cli-server/records.ml. Find the record function for the class which you have modified and add a new entry to the fields list using make_field, which is defined in the same file.

The only required parameters are name and get (and unit, of course). If your field is a map or set, then you will need to pass in get_{map,set}, and optionally set_{map,set}, if it is a RW field. The hidden parameter is useful if you don’t want this field to show up in a *_params_list call. As an example, here is a field that we’ve just added to the SM class:

make_field ~name:"versioned-capabilities"
           ~get:(fun () -> get_from_map (x ()).API.sM_versioned_capabilities)
           ~get_map:(fun () -> (x ()).API.sM_versioned_capabilities)
           ~hidden:true ();

Testing

The new fields can be tested by copying the newly compiled xapi binary to a test box. After the new xapi service is started, the file /var/log/xensource.log in the test box should contain a few lines reporting the successful upgrade of the metadata schema in the test box:

[...|xapi] Db has schema major_vsn=5, minor_vsn=57 (current is 5 58) (last is 5 57)
[...|xapi] Database schema version is that of last release: attempting upgrade
[...|sql] attempting to restore database from /var/xapi/state.db
[...|sql] finished parsing xml
[...|sql] writing db as xml to file '/var/xapi/state.db'.
[...|xapi] Database upgrade complete, restarting to use new db

Making this field accessible as a CLI attribute

XenAPI functions to get and set the value of the new field are generated automatically. It requires some extra work, however, to enable such operations in the CLI.

The CLI has commands such as host-param-list and host-param-get. To make a new field accessible by these commands, the file xapi-cli-server/records.ml needs to be edited. For the pool.ha-enabled field, the pool_record function in this file contains the following (note the convention to replace underscores by hyphens in the CLI):

let pool_record rpc session_id pool =
  ...
[
  ...
  make_field ~name:"ha-enabled" ~get:(fun () -> string_of_bool (x ()).API.pool_ha_enabled) ();
  ...
]}

NB: the ~get parameter must return a string, so include a suitable function to convert the type of the field into a string, e.g. string_of_bool.

See xapi-cli-server/records.ml for examples of handling field types other than Bool.
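As a rough illustration of the string-conversion requirement, the sketch below uses a hypothetical, cut-down make_field that models only ~name and ~get. The field names, sample values and separators are assumptions chosen for the example; they are not taken from records.ml.

```ocaml
(* Hypothetical, simplified stand-in for the record built by make_field:
   only the name and a string-returning getter are modelled. *)
type field = { name : string; get : unit -> string }

let make_field ~name ~get () = { name; get }

(* Illustrative converters for set-like and map-like values. *)
let string_of_string_set xs = String.concat "; " xs

let string_of_string_map kvs =
  String.concat "; " (List.map (fun (k, v) -> k ^ ": " ^ v) kvs)

(* Each getter converts its (made-up) value to a string. *)
let example_fields =
  [ make_field ~name:"ha-enabled" ~get:(fun () -> string_of_bool false) ()
  ; make_field ~name:"memory-free" ~get:(fun () -> Int64.to_string 1024L) ()
  ; make_field ~name:"tags"
      ~get:(fun () -> string_of_string_set ["a"; "b"]) ()
  ; make_field ~name:"other-config"
      ~get:(fun () -> string_of_string_map [("k", "v")]) ()
  ]
```

Whatever the underlying type, the getter's result is a plain string, which is all the CLI record layer needs to display it.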

Adding a function to the API

This page describes how to add a function to XenAPI.

Add message to API

The file idl/datamodel.ml is a description of the API, from which the marshalling and handler code is generated.

In this file, the create_obj function is used to define a class which may contain fields and support operations (known as “messages”). For example, the identifier host is defined using create_obj to encapsulate the operations which can be performed on a host.

In order to add a function to the API, we need to add a message to an existing class. This entails adding a function in idl/datamodel.ml or one of the other datamodel files to describe the new message and adding it to the class’s list of messages. In this example, we are adding to idl/datamodel_host.ml.

The function to describe the new message will look something like the following:

let host_price_of = call ~flags:[`Session]
    ~name:"price_of"
    ~in_oss_since:None
    ~lifecycle:[]
    ~params:[(Ref _host, "host", "The host containing the price information");
             (String, "item", "The item whose price is queried")]
    ~result:(Float, "The price of the item")
    ~doc:"Returns the price of a named item."
    ~allowed_roles:_R_POOL_OP
    ()

By convention, the name of the function is formed from the name of the class and the name of the message: host and price_of, in the example. An entry for host_price_of is added to the messages of the host class:

let host =
    create_obj ...
        ~messages: [...
                    host_price_of;
                   ]
...

The parameters passed to call are all optional (except ~name and ~lifecycle).

The ~flags parameter is used to set conditions for the use of the message. For example, `Session is used to indicate that the call must be made in the presence of an existing session.
The value of the ~lifecycle parameter should be [] in new code, with dune automatically generating appropriate values (see datamodel_lifecycle.ml).
The ~params parameter describes a list of the formal parameters of the message. Each parameter is described by a triple. The first component of the triple is the type (from type ty in idl/datamodel_types.ml); the second is the name of the parameter, and the third is a human-readable description of the parameter. The first triple in the list is conventionally the instance of the class on which the message will operate. In the example, this is a reference to the host.
Similarly, the ~result describes the message’s return type, although this is permitted to merely be a single value rather than a list of values. If no ~result is specified, the default is unit.
The ~doc parameter describes what the message is doing.
The bool ~hide_from_docs parameter prevents the message from being included in the documentation when generated.
The bool ~pool_internal parameter is used to indicate if the message should be callable by external systems or only internal hosts.
The ~errs parameter is a list of possible exceptions that the message can raise.
The ~lifecycle parameter takes a list of (Status, version, doc) triples that describe the lifecycle of the message. It supersedes ~in_oss_since, which indicated the release in which the message was introduced. NOTE: Leave this list empty; it will be populated on build.
The ~allowed_roles parameter is used for access control (see below).

Compiling xen-api.(hg|git) will cause the code corresponding to this message to be generated and output in ocaml/xapi/server.ml. In the example above, a section handling an incoming call host.price_of appeared in ocaml/xapi/server.ml. However, after this was generated, the rest of the build failed because this call expects a price_of function in the Host object.

Update expose_get_all_messages_for list

If you are adding a new class, do not forget to add your new class _name to the expose_get_all_messages_for list, at the bottom of datamodel.ml, in order to have automatically generated get_all and get_all_records functions attached to it.

Update the RBAC field containing the roles expected to use the new API call

After the RBAC integration, Xapi provides by default a set of static roles associated to the most common subject tasks.

The api calls associated with each role are defined by a new ~allowed_roles parameter in each api call, which specifies the list of static roles that should be able to execute the call. The possible roles for this list are the following names, defined in datamodel.ml:

role_pool_admin
role_pool_operator
role_vm_power_admin
role_vm_admin
role_vm_operator
role_read_only

So, for instance,

~allowed_roles:[role_pool_admin; role_pool_operator] (* this is not the recommended usage, see example below *)

would be a valid list (though it is not the recommended way of using allowed_roles, see below), meaning that subjects belonging to either role_pool_admin or role_pool_operator can execute the api call.

The RBAC requirements define a policy where the roles in the list above are supposed to be totally-ordered by the set of api-calls associated with each of them. That means that any api-call allowed to role_pool_operator should also be in role_pool_admin; any api-call allowed to role_vm_power_admin should also be in role_pool_operator and also in role_pool_admin; and so on. datamodel.ml provides shortcuts that express this totally-ordered role policy for each api-call:

_R_POOL_ADMIN, equivalent to [role_pool_admin]
_R_POOL_OP, equivalent to [role_pool_admin; role_pool_operator]
_R_VM_POWER_ADMIN, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin]
_R_VM_ADMIN, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin]
_R_VM_OP, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin; role_vm_operator]
_R_READ_ONLY, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin; role_vm_operator; role_read_only]

The ~allowed_roles parameter should use one of the shortcuts in the list above, instead of directly using a list of roles, because the shortcuts guarantee that the roles in the list are in a total order with respect to their api-call permission sets. Creating an api-call with e.g. ~allowed_roles:[role_pool_admin; role_vm_admin] would be wrong, because that would mean that a pool_operator cannot execute an api-call that a vm_admin can, breaking the total-order policy expected in the RBAC 1.0 implementation. In the future, this requirement might be relaxed.

So, the example above should instead be used as:

~allowed_roles:_R_POOL_OP  (* recommended usage via pre-defined totally-ordered role lists *)

and so on.
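The total-order rule can be sketched as a prefix check. The code below is illustrative only: roles are modelled as plain strings named after the identifiers in the list above, and roles_ge and respects_total_order are hypothetical helpers, not functions from datamodel.ml.

```ocaml
(* Role names from the list above, ordered from most to least privileged.
   Modelling them as strings is an assumption made for this sketch. *)
let roles_all =
  [ "role_pool_admin"; "role_pool_operator"; "role_vm_power_admin"
  ; "role_vm_admin"; "role_vm_operator"; "role_read_only" ]

(* [roles_ge role] returns [role] together with every more privileged role,
   i.e. the prefix of [roles_all] ending at [role].
   Raises Not_found for an unknown role. *)
let roles_ge role =
  let rec go acc = function
    | [] -> raise Not_found
    | r :: _ when r = role -> List.rev (r :: acc)
    | r :: rest -> go (r :: acc) rest
  in
  go [] roles_all

(* A role list respects the total order iff it is a non-empty prefix of
   [roles_all]. *)
let respects_total_order roles =
  let rec is_prefix xs ys =
    match (xs, ys) with
    | [], _ -> true
    | x :: xs', y :: ys' -> x = y && is_prefix xs' ys'
    | _ :: _, [] -> false
  in
  roles <> [] && is_prefix roles roles_all
```

Under this model, roles_ge "role_pool_operator" plays the part of _R_POOL_OP, while a hand-written list that skips role_pool_operator fails the check.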

How to determine the correct role of a new api-call:

if only xapi should execute the api-call, i.e. it is an internal call: _R_POOL_ADMIN
if it is related to subject, role, external-authentication: _R_POOL_ADMIN
if it is related to accessing Dom0 (via console, ssh, whatever): _R_POOL_ADMIN
if it is related to the pool object: _R_POOL_OP
if it is related to the host object, licenses, backups, physical devices: _R_POOL_OP
if it is related to managing VM memory, snapshot/checkpoint, migration: _R_VM_POWER_ADMIN
if it is related to creating, destroying, cloning, importing/exporting VMs: _R_VM_ADMIN
if it is related to starting, stopping, pausing etc VMs or otherwise accessing/manipulating VMs: _R_VM_OP
if it is related to being able to login, manipulate own tasks and read values only: _R_READ_ONLY

Update message forwarding

The “message forwarding” layer describes the policy of whether an incoming API call should be forwarded to another host (such as another member of the pool) or processed on the host which receives the call. This policy may be non-trivial to describe and so cannot be auto-generated from the data model.

In xapi/message_forwarding.ml, add a function to the relevant module to describe this policy. In the running example, we add the following function to the Host module:

let price_of ~__context ~host ~item =
    info "Host.price_of for item %s" item;
    let local_fn = Local.Host.price_of ~host ~item in
    let remote_fn = Client.Host.price_of ~host ~item in
    do_op_on ~local_fn ~__context ~host ~remote_fn

After the ~__context parameter, the parameters of this new function should match the parameters we specified for the message. In this case, that is the host and the item to query the price of.

The do_op_on function takes a function to execute locally and a function to execute remotely and performs one of these operations depending on whether the given host is the local host.

The local function references Local.Host.price_of, which is a function we will write in the next step.
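The forwarding decision can be modelled in a few lines. This is a deliberately simplified sketch: hosts are plain strings, ~this_host is a hypothetical parameter standing in for xapi's knowledge of the local host, and the two functions return strings instead of performing real work. The real do_op_on also takes ~__context and performs the remote call over the network.

```ocaml
(* Run [local_fn] when the target host is this host, otherwise hand the
   target to [remote_fn]. *)
let do_op_on ~this_host ~host ~local_fn ~remote_fn =
  if host = this_host then local_fn () else remote_fn host

(* Toy version of the forwarding policy for price_of: the results only
   record which path was taken. *)
let price_of ~this_host ~host ~item =
  let local_fn () = Printf.sprintf "local lookup of %s" item in
  let remote_fn h = Printf.sprintf "forwarded %s to %s" item h in
  do_op_on ~this_host ~host ~local_fn ~remote_fn
```

For example, price_of ~this_host:"host-a" ~host:"host-b" ~item:"fish" takes the remote path, because the target differs from the local host.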

Implement the function

Now we write the function to perform the logic behind the new API call. For a host-based call, this will reside in xapi/xapi_host.ml. For other classes, other files with similar names are used.

We add the following function to xapi/xapi_host.ml:

let price_of ~__context ~host ~item =
    if item = "fish" then 3.14 else 0.00

We also need to add the function to the interface xapi/xapi_host.mli:

val price_of :
    __context:Context.t -> host:API.ref_host -> item:string -> float

Congratulations, you’ve added a function to the API!

Add the operation to the CLI

Edit xapi-cli-server/cli_frontend.ml. Add a block to the definition of cmdtable_data as in the following example:

"host-price-of",
{
  reqd=["host-uuid"; "item"];
  optn=[];
  help="Find out the price of an item on a certain host.";
  implementation= No_fd Cli_operations.host_price_of;
  flags=[];
};

Include here the following:

The names of required (reqd) and optional (optn) parameters.
A description to be displayed when calling xe help <cmd> in the help field.
The implementation should use With_fd if any communication with the client is necessary (for example, showing the user a warning, sending the contents of a file, etc.) Otherwise, No_fd can be used as above.
The flags field can be used to set special options:
- Vm_selectors: adds a “vm” parameter for the name of a VM (rather than a UUID)
- Host_selectors: adds a “host” parameter for the name of a host (rather than a UUID)
- Standard: includes the command in the list of common commands displayed by xe help
- Neverforward:
- Hidden:
- Deprecated of string list:

Now we must implement Cli_operations.host_price_of. This is done in xapi-cli-server/cli_operations.ml. This function typically extracts the parameters and forwards them to the internal implementation of the function. Other arbitrary code is permitted. For example:

let host_price_of printer rpc session_id params =
  let host = Client.Host.get_by_uuid rpc session_id (List.assoc "host-uuid" params) in
  let item = List.assoc "item" params in
  let price = string_of_float (Client.Host.price_of ~rpc ~session_id ~host ~item) in
  printer (Cli_printer.PList [price])

Tab Completion in the CLI

The CLI features tab completion for many of its commands’ parameters. Tab completion is implemented in the file ocaml/xe-cli/bash-completion, which is installed on the host as /etc/bash_completion.d/cli, and is done on a parameter-name rather than on a command-name basis. The main portion of the bash-completion file is a case statement that contains a section for each of the parameters that benefit from completion. There is also an entry that catches all parameter names ending in -uuid, and performs an automatic lookup of suitable UUIDs. The host-uuid parameter of our new host-price-of command therefore automatically gains completion capabilities.

Executing the CLI operation

Recompile xapi with the changes described above and install it on a test machine.

Execute the following command to see if the function exists:

xe help host-price-of

Invoke the function itself with the following command:

xe host-price-of host-uuid=<tab> item=fish

and you should find out the price of fish.

Adding a XenAPI extension

A XenAPI extension is a new RPC which is implemented as a separate executable (i.e. it is not part of xapi) but which still benefits from xapi parameter type-checking, multi-language stub generation, documentation generation, authentication etc. An extension can be backported to previous versions by simply adding the implementation, without having to recompile xapi itself.

A XenAPI extension is in two parts:

a declaration in the xapi datamodel. This must use the ~forward_to:(Extension "filename") parameter. The filename must be unique, and should be the same as the XenAPI call name.
an implementation executable in the dom0 filesystem with path /etc/xapi.d/extensions/filename

To define an extension

First write the declaration in the datamodel. The act of specifying the types and writing the documentation will help clarify the intended meaning of the call.

Second create a prototype of your implementation and put an executable file in /etc/xapi.d/extensions/filename. The calling convention is:

the file must be executable
xapi will parse the XMLRPC call arguments received over the network and check the session_id is valid
xapi will execute the named executable
the XMLRPC call arguments will be sent to the executable on stdin and stdin will be closed afterwards
the executable will run and print an XMLRPC response on stdout
xapi will read the response and return it to the client.
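The calling convention above could be satisfied by a program shaped like the following sketch. It is hypothetical: it ignores the content of the request, and the Status/Value struct in the response follows the usual XenAPI XML-RPC result layout, which is assumed here rather than taken from this guide.

```ocaml
(* Build an XML-RPC methodResponse carrying a XenAPI-style success struct.
   The Status/Value layout is an assumption; [value] is not XML-escaped. *)
let success_response value =
  String.concat ""
    [ "<?xml version=\"1.0\"?><methodResponse><params><param><value><struct>"
    ; "<member><name>Status</name><value>Success</value></member>"
    ; "<member><name>Value</name><value>"; value; "</value></member>"
    ; "</struct></value></param></params></methodResponse>" ]

(* Read everything from [ic] until end of file (xapi closes stdin after
   sending the call arguments). *)
let read_all ic =
  let buf = Buffer.create 4096 in
  (try
     while true do
       Buffer.add_channel buf ic 1
     done
   with End_of_file -> ()) ;
  Buffer.contents buf

(* Consume the call from [ic] and print the response on [oc].
   A real extension would parse the request instead of discarding it. *)
let run ic oc =
  let _request = read_all ic in
  output_string oc (success_response "hello") ;
  flush oc

(* In the executable itself one would end with: let () = run stdin stdout *)
```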

See the basic example.

Third make a pull request containing only the datamodel definitions (it is not necessary to include the prototype too). This will attract review comments which will help you improve your API further. Once the pull request is merged, then the API call name and extension are officially yours and you may use them on any xapi version which supports the extension mechanism.

Packaging your extension

Your extension /etc/xapi.d/extensions/filename (and dependencies) should be packaged for your target distribution (for XenServer dom0 this would be a CentOS RPM). Once the package is unpacked on the target machine, the extension should be immediately callable via the XenAPI, provided the xapi version supports the extension mechanism. Note the xapi version does not need to know about the specific extension in advance: it will always look in /etc/xapi.d/extensions/ for all RPC calls whose name it does not recognise.

Limitations

On type-checking

if the xapi version is new enough to know about your specific extension: xapi will type-check the call arguments for you
if the xapi version is too old to know about your specific extension: the extension will still be callable but the arguments will not be type-checked.

On access control

if the xapi version is new enough to know about your specific extension: you can declare that a user must have a particular role (e.g. ‘VM admin’)
if the xapi version is too old to know about your specific extension: the extension will still be callable but the client must have the ‘Pool admin’ role.

Since a xapi which knows about your specific extension is stricter than an older xapi, it’s a good idea to develop against the new xapi and then test older xapi versions later.

XE CLI architecture

Info

The links in this page point to the source files of xapi v1.132.0, not to the latest source code. Meanwhile, the CLI server code in xapi has been moved to a library separate from the main xapi binary, and has its own subdirectory ocaml/xapi-cli-server.

Architecture

The actual CLI is a very lightweight binary in ocaml/xe-cli
- It is just a dumb client, that does everything that xapi tells it to do
- This is a security issue
  - We must trust the xenserver that we connect to, because it can tell xe to read local files, download files, …
- When it is first called, it takes the few command-line arguments it needs, and then passes the rest to xapi in an HTTP POST request
  - Each argument is in a separate line
- Then it loops, doing what xapi tells it to do, until xapi tells it to exit or an exception happens
The protocol description is in ocaml/xapi-cli-protocol/cli_protocol.ml
- The CLI has such a protocol that one binary can talk to multiple versions of xapi as long as their CLI protocol versions are compatible
- and the CLI can be changed without updating the xe binary
- and also for performance reasons, it is more efficient this way than by having a CLI that makes XenAPI calls
Xapi
- The HTTP POST request is sent to the /cli URL
- In Xapi.server_init, xapi registers the appropriate function to handle these requests, defined in common_http_handlers in the same file: Xapi_cli.handler
- The relevant code is in ocaml/xapi/records.ml, ocaml/xapi/cli_*.ml
  - CLI object definitions are in records.ml, command definitions in cli_frontend.ml (in cmdtable_data), implementations of commands in cli_operations.ml
- When a command is received, it is parsed into a command name and a parameter list of key-value pairs
  - and the command table is populated lazily from the commands defined in cmdtable_data in cli_frontend.ml, and automatically generated low-level parameter commands (the ones defined in section A.3.2 of the XenServer Administrator’s Guide) are also added for a list of standard classes
  - the command table maps command names to records that contain the implementation of the command, among other things
- Then the command name is looked up in the command table, and the corresponding operation is executed with the parsed key-value parameter list passed to it

Walk-through: CLI handler in xapi (external calls)

Definitions for the HTTP handler

Constants.cli_uri = "/cli"

Datamodel.http_actions = [...;
  ("post_cli", (Post, Constants.cli_uri, false, [], _R_READ_ONLY, []));
...]

(* these public http actions will NOT be checked by RBAC *)
(* they are meant to be used in exceptional cases where RBAC is already *)
(* checked inside them, such as in the XMLRPC (API) calls *)
Datamodel.public_http_actions_with_no_rbac_check = [...
  "post_cli";  (* CLI commands -> calls XMLRPC *)
...]

Xapi.common_http_handlers = [...;
  ("post_cli", (Http_svr.BufIO Xapi_cli.handler));
...]

Xapi.server_init () =
  ...
  "Registering http handlers", [], (fun () -> List.iter Xapi_http.add_handler common_http_handlers);
  ...

Due to these definitions, Xapi_http.add_handler does not perform RBAC checks for post_cli. This means that the CLI handler does not use Xapi_http.assert_credentials_ok when a request comes in, as most other handlers do. The reason is that RBAC checking is delegated to the actual XenAPI calls that are being done by the commands in Cli_operations.

This means that the Xapi_http.add_handler call resolves to simply:

Http_svr.Server.add_handler server Http.Post "/cli" (Http_svr.BufIO Xapi_cli.handler)

…which means that the function Xapi_cli.handler is called directly when an HTTP POST request with path /cli comes in.

High-level request processing

Xapi_cli.handler:

Reads the body of the HTTP request, limited to Xapi_globs.http_limit_max_cli_size = 200 * 1024 characters.
Sends a protocol version string to the client: "XenSource thin CLI protocol" plus the binary-encoded major (0) and minor (2) version numbers.
Reads the protocol version from the client and exits with an error if it does not match the above.
Calls Xapi_cli.parse_session_and_args with the request’s body to extract the session ref, if there.
Calls Cli_frontend.parse_commandline to parse the rest of the command line from the body.
Calls Xapi_cli.exec_command to execute the command.
On error, calls exception_handler.

Xapi_cli.parse_session_and_args:

Is passed the request body and reads it line by line. Each line is considered an argument.
Removes any CR chars from the end of each argument.
If the first arg starts with session_id=, the part after this prefix is considered to be a session reference.
Returns the session ref (if there) and (remaining) list of args.
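The steps above can be sketched with the standard library alone. This is not the code from xapi_cli.ml: the session reference is kept as a plain string, and dropping empty lines is an assumption made so that a trailing newline does not produce an empty argument.

```ocaml
(* Remove any carriage returns from the end of an argument. *)
let rec strip_cr s =
  let n = String.length s in
  if n > 0 && s.[n - 1] = '\r' then strip_cr (String.sub s 0 (n - 1)) else s

(* Split the body into arguments (one per line) and peel off a leading
   "session_id=<ref>" argument, if present. *)
let parse_session_and_args body =
  let prefix = "session_id=" in
  let plen = String.length prefix in
  let args =
    String.split_on_char '\n' body
    |> List.map strip_cr
    |> List.filter (fun a -> a <> "")
  in
  match args with
  | first :: rest
    when String.length first >= plen && String.sub first 0 plen = prefix ->
      (Some (String.sub first plen (String.length first - plen)), rest)
  | _ -> (None, args)
```

The result is a pair: an optional session reference and the remaining arguments, which are then handed to the command-line parser.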

Cli_frontend.parse_commandline:

Returns the command name and assoc list of param names and values. It handles --name and -flag arguments by turning them into key/value string pairs.

Xapi_cli.exec_command:

Finds username/password params.
Gets the rpc function: this is the so-called “fake_rpc callback”, which does not use the network or HTTP at all, but goes straight to Api_server.callback1 (the XenAPI RPC entry point). This function is used by the CLI handler to do loopback XenAPI calls.
Logs the parsed xe command, omitting sensitive data.
Continues as Xapi_cli.do_rpcs.
Looks up the command name in the command table from Cli_frontend (raises an error if not found).
Checks if all required params have been supplied (raises an error if not).
Checks that the host is a pool master (raises an error if not).
Depending on the command, a session.login_with_password or session.slave_local_login_with_password XenAPI call is made with the supplied username and password. If the authentication passes, then a session reference is returned for the RBAC role that belongs to the user. This session is used to do further XenAPI calls.
Next, the implementation of the command in Cli_operations is executed.
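The lookup and required-parameter checks can be illustrated with a toy command table. The record type below is a hypothetical reduction of the entries in cmdtable_data: it keeps only reqd and an implementation returning a string, and it reports problems through a result value where xapi raises errors.

```ocaml
(* Hypothetical, cut-down command record: only the required parameters and
   an implementation that returns its output as a string. *)
type cmd = { reqd : string list; impl : (string * string) list -> string }

let cmdtable : (string, cmd) Hashtbl.t = Hashtbl.create 16

(* Register one example command. *)
let () =
  Hashtbl.replace cmdtable "host-price-of"
    { reqd = ["host-uuid"; "item"]
    ; impl = (fun params -> "price of " ^ List.assoc "item" params) }

(* Look the command up, then verify that every required parameter is present
   before running the implementation. *)
let exec_command name params =
  match Hashtbl.find_opt cmdtable name with
  | None -> Error ("unknown command: " ^ name)
  | Some cmd -> (
    match List.filter (fun p -> not (List.mem_assoc p params)) cmd.reqd with
    | [] -> Ok (cmd.impl params)
    | missing -> Error ("missing parameters: " ^ String.concat ", " missing) )
```

Authentication and the pool-master check are left out of this sketch; it covers only the table lookup and parameter validation.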

Command implementations

The various commands are implemented in cli_operations.ml. These functions are only called after user authentication has passed (see above). However, RBAC restrictions are only enforced inside any XenAPI calls that are made, and not on any of the other code in cli_operations.ml.

The type of each command implementation function is as follows (see cli_cmdtable.ml):

type op =
  Cli_printer.print_fn ->
  (Rpc.call -> Rpc.response) ->
  API.ref_session -> ((string*string) list) -> unit

So each function receives a printer for sending text output to the xe client, an rpc function and session reference for doing XenAPI calls, and a key/value pair param list. Here is a typical example:

let bond_create printer rpc session_id params =
  let network = List.assoc "network-uuid" params in
  let mac = List.assoc_default "mac" params "" in
  let network = Client.Network.get_by_uuid rpc session_id network in
  let pifs = List.assoc "pif-uuids" params in
  let uuids = String.split ',' pifs in
  let pifs = List.map (fun uuid -> Client.PIF.get_by_uuid rpc session_id uuid) uuids in
  let mode = Record_util.bond_mode_of_string (List.assoc_default "mode" params "") in
  let properties = read_map_params "properties" params in
  let bond = Client.Bond.create rpc session_id network pifs mac mode properties in
  let uuid = Client.Bond.get_uuid rpc session_id bond in
  printer (Cli_printer.PList [ uuid])

The necessary parameters are looked up in params using List.assoc or similar.
UUIDs are translated into references by get_by_uuid XenAPI calls (note that the Client module is the XenAPI client, and functions in there require the rpc function and session reference).
Then the main API call is made (Client.Bond.create in this case).
Further API calls may be made to output data for the client, and passed to the printer.
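The parameter handling in bond_create can be tried in isolation. In the sketch below, assoc_default is a standard-library approximation of the List.assoc_default helper used above, and bond_inputs is a hypothetical function that only gathers the inputs without making any API calls.

```ocaml
(* Standard-library counterpart of List.assoc_default: same argument order
   (key, params, default). *)
let assoc_default key params default =
  match List.assoc_opt key params with Some v -> v | None -> default

(* Gather the inputs of a bond-create-like command. Required parameters are
   read with List.assoc, which raises Not_found when they are absent. *)
let bond_inputs params =
  let network = List.assoc "network-uuid" params in
  let mac = assoc_default "mac" params "" in
  let pif_uuids = String.split_on_char ',' (List.assoc "pif-uuids" params) in
  (network, mac, pif_uuids)
```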

This is the common case for CLI operations: they do API calls based on the parameters that were passed in.

However, other commands are more complicated, for example vm_import/export and vm_migrate. These contain a lot more logic in the CLI commands, and also send commands to the client to instruct it to read or write files and/or do HTTP calls.

Yet other commands do not actually do any XenAPI calls, but instead get “helpful” information from other places. Example: diagnostic_gc_stats, which displays statistics from xapi’s OCaml GC.

Tutorials

The following tutorials show how to extend the CLI (and XenAPI):

Database

Metadata-on-LUN

In the present version of XenServer, metadata changes resulting in writes to the database are not persisted in non-volatile storage. Hence, in case of failure, up to five minutes’ worth of metadata changes could be lost. The Metadata-on-LUN feature addresses the issue by ensuring that all database writes are retained. This will be used to improve recovery from failure by storing incremental deltas which can be re-applied to an old version of the database to bring it more up-to-date. An implication of this is that clients will no longer be required to perform a ‘pool-sync-database’ to protect critical writes, because all writes will be implicitly protected.

This is implemented by saving descriptions of all persistent database writes to a LUN when HA is active. Upon xapi restart after failure, such as on master fail-over, these descriptions are read and parsed to restore the latest version of the database.

Layout on block device

It is useful to store the database on the block device as well as the deltas, so that it is unambiguous on recovery which version of the database the deltas apply to.

The content of the block device will be structured as shown in the table below. It consists of a header; the rest of the device is split into two halves.

Section	Length (bytes)	Description
Header	16	Magic identifier
	1	ASCII NUL
	1	Validity byte
First half database	36	UUID as ASCII string
	16	Length of database as decimal ASCII
	(as specified)	Database (binary data)
	16	Generation count as decimal ASCII
	36	UUID as ASCII string
First half deltas	16	Length of database delta as decimal ASCII
	(as specified)	Database delta (binary data)
	16	Generation count as decimal ASCII
	36	UUID as ASCII string
Second half database	36	UUID as ASCII string
	16	Length of database as decimal ASCII
	(as specified)	Database (binary data)
	16	Generation count as decimal ASCII
	36	UUID as ASCII string
Second half deltas	16	Length of database delta as decimal ASCII
	(as specified)	Database delta (binary data)
	16	Generation count as decimal ASCII
	36	UUID as ASCII string

After the header, one or both halves may be devoid of content. In a half which contains a database, there may be zero or more deltas (repetitions of the last three entries in each half).

The structure of the device is split into two halves to provide double-buffering. In case of failure during write to one half, the other half remains intact.

The magic identifier at the start of the file protects against attempting to treat a different device as a redo log.

The validity byte is a single ASCII character indicating the state of the two halves. It can take the following values:

Byte	Description
`0`	Neither half is valid
`1`	First half is valid
`2`	Second half is valid

The use of lengths preceding data sections permits convenient reading. The constant repetitions of the UUIDs act as nonces to protect against reading in invalid data in the case of an incomplete or corrupt write.
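The record layout in the table can be made concrete with a small encoder. This is a sketch, not the redo-log implementation: the function names are hypothetical, and padding the 16-digit decimal fields with leading zeros is an assumption.

```ocaml
(* 16 digits of decimal ASCII; zero-padding on the left is an assumption. *)
let ascii16 n = Printf.sprintf "%016d" n

(* Database record of one half: UUID, length, data, generation count, UUID. *)
let encode_database ~uuid ~generation data =
  assert (String.length uuid = 36) ;
  String.concat ""
    [uuid; ascii16 (String.length data); data; ascii16 generation; uuid]

(* Delta record: length, data, generation count, UUID. *)
let encode_delta ~uuid ~generation data =
  assert (String.length uuid = 36) ;
  String.concat ""
    [ascii16 (String.length data); data; ascii16 generation; uuid]
```

Because every variable-length section is preceded by its length, a reader can step through a half record by record without scanning for delimiters.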

Architecture

The I/O to and from the block device may involve long delays. For example, if there is a network problem, or the iSCSI device disappears, the I/O calls may block indefinitely. It is important to isolate this from xapi. Hence, I/O with the block device will occur in a separate process.

Xapi will communicate with the I/O process via a UNIX domain socket using a simple text-based protocol described below. The I/O process is designed to always accept xapi’s requests with a guaranteed upper limit on the delay. Xapi can therefore communicate with the process using blocking I/O.

Xapi will interact with the I/O process in a best-effort fashion. If it cannot communicate with the process, or the process indicates that it has not carried out the requested command, xapi will continue execution regardless. Redo-log entries are idempotent (modulo the raising of exceptions in some cases) so it is of little consequence if a particular entry cannot be written but others can. If xapi notices that the process has died, it will attempt to restart it.

The I/O process keeps track of a pointer for each half indicating the position at which the next delta will be written in that half.

Protocol

Upon connection to the control socket, the I/O process will attempt to connect to the block device. Depending on whether this is successful or unsuccessful, one of two responses will be sent to the client.

connect|ack_ if it is successful; or
connect|nack|<length>|<message> if it is unsuccessful, perhaps because the block device does not exist or cannot be read from. The <message> is a description of the error; the <length> of the message is expressed using 16 digits of decimal ASCII.

The former message indicates that the I/O process is ready to receive commands. The latter message indicates that commands cannot be sent to the I/O process.

There are four commands which xapi can send to the I/O process. These are described below, with a high level description of the operational semantics of the I/O process’ actions, and the corresponding responses. For ease of parsing, each command is ten bytes in length.

Write database

Xapi requests that a new database is written to the block device, and sends its content using the data socket.

Command:

writedb___|<uuid>|<generation-count>|<length>

The UUID is expressed as 36 ASCII characters. The length of the data and the generation-count are expressed using 16 digits of decimal ASCII.

Semantics:

Read the validity byte.
If one half is valid, we will use the other half. If no halves are valid, we will use the first half.
Read the data from the data socket and write it into the chosen half.
Set the pointer for the chosen half to point to the position after the data.
Set the validity byte to indicate the chosen half is valid.
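The choice of half can be written as a small pure function over the validity byte values listed earlier. The type and function names are hypothetical; only the mapping itself comes from the semantics above.

```ocaml
(* Validity byte states: neither half, the first half, or the second half. *)
type validity = Neither | First | Second

let validity_of_char = function
  | '0' -> Some Neither
  | '1' -> Some First
  | '2' -> Some Second
  | _ -> None

let char_of_validity = function Neither -> '0' | First -> '1' | Second -> '2'

(* Half to write a new database into: the half that is not currently valid,
   or the first half when neither is valid. The result is also the value the
   validity byte takes once the write has completed. *)
let half_for_writedb = function
  | First -> Second
  | Second | Neither -> First
```

Writing into the half that is not currently valid is what provides the double-buffering: the valid half stays untouched until the validity byte is flipped.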

Response:

writedb|ack_ in case of successful write; or
writedb|nack|<length>|<message> otherwise.

For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Write database delta

Xapi sends a description of a database delta to append to the block device.

Command:

writedelta|<uuid>|<generation-count>|<length>|<data>

The UUID is expressed as 36 ASCII characters. The length of the data and the generation-count are expressed using 16 digits of decimal ASCII.

Semantics:

Read the validity byte to establish which half is valid. If neither half is valid, return with a nack.
If the half’s pointer is set, seek to that position. Otherwise, scan through the half and stop at the position after the last write.
Write the entry.
Update the half’s pointer to point to the position after the entry.

Response:

writedelta|ack_ in case of successful append; or
writedelta|nack|<length>|<message> otherwise.

For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.
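A client-side view of this exchange might look like the sketch below. The helper names are hypothetical, zero-padding of the 16-digit fields is an assumption, and the parser expects the complete response to be available as a single string.

```ocaml
(* 16 digits of decimal ASCII; zero-padding on the left is an assumption. *)
let ascii16 n = Printf.sprintf "%016d" n

(* Assemble writedelta|<uuid>|<generation-count>|<length>|<data>. *)
let writedelta_command ~uuid ~generation data =
  String.concat "|"
    ["writedelta"; uuid; ascii16 generation; ascii16 (String.length data); data]

(* Parse "<cmd>|ack_" or "<cmd>|nack|<length>|<message>".
   Returns Ok () for an ack and Error message for a nack. *)
let parse_response ~cmd resp =
  let ack = cmd ^ "|ack_" and nack = cmd ^ "|nack|" in
  let nlen = String.length nack in
  let malformed = Error "malformed response" in
  if resp = ack then Ok ()
  else if String.length resp < nlen + 17 || String.sub resp 0 nlen <> nack then
    malformed
  else
    match int_of_string_opt (String.sub resp nlen 16) with
    | Some len
      when resp.[nlen + 16] = '|' && String.length resp = nlen + 17 + len ->
        Error (String.sub resp (nlen + 17) len)
    | _ -> malformed
```

The same parse_response shape applies to any command that answers with the ack/nack pair, since only the command prefix differs.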

Read log

Xapi requests the contents of the log.

Command:

read______

Semantics:

1. Read the validity byte to establish which half is valid. If neither half is valid, return with an end.
2. Attempt to read the database from the current half.
3. If this is successful, continue in that half reading entries up to the position of the half’s pointer. If the pointer is not set, read until a record of length zero is found or the end of the half is reached. Otherwise, if the attempt to read the database was not successful, switch to using the other half and try again from step 2.
4. Finally output an end.

Response:

read|nack_|<length>|<message> in case of error; or
read|db___|<generation-count>|<length>|<data> for a database record, then a sequence of zero or more
read|delta|<generation-count>|<length>|<data> for each delta record, then
read|end__
For each record, and for error messages, the length of the data or message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.
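To show why fixed-width tags and length fields make the read response easy to consume, here is a minimal OCaml sketch of a parser for the record formats listed above. It assumes that the | characters are literal separators, that records follow one another back to back in a single string, and that the generation count fits in an int; these are assumptions made for the example. The names response, digits16 and parse_record are hypothetical.

```ocaml
type response =
  | Db of int * string (* generation count, data *)
  | Delta of int * string (* generation count, data *)
  | Nack of string (* error message *)
  | End

(* Read a 16-digit decimal field starting at [pos], rejecting non-digits. *)
let digits16 s pos =
  let field = String.sub s pos 16 in
  String.iter
    (fun c -> if c < '0' || c > '9' then invalid_arg "digits16: not a digit")
    field ;
  int_of_string field

(* Parse one record starting at [pos]; return it with the next position.
   Layout assumed: 10-byte tag, then '|'-separated 16-digit fields. *)
let parse_record s pos =
  match String.sub s pos 10 with
  | "read|end__" ->
      (End, pos + 10)
  | "read|nack_" ->
      let len = digits16 s (pos + 11) in
      (Nack (String.sub s (pos + 28) len), pos + 28 + len)
  | ("read|db___" | "read|delta") as tag ->
      let gen = digits16 s (pos + 11) in
      let len = digits16 s (pos + 28) in
      let data = String.sub s (pos + 45) len in
      let r = if tag = "read|db___" then Db (gen, data) else Delta (gen, data) in
      (r, pos + 45 + len)
  | _ ->
      invalid_arg "parse_record: unknown tag"
```

A caller would invoke parse_record repeatedly, feeding the returned position back in, until it receives End or Nack.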

Re-initialise log

Xapi requests that the block device is re-initialised with a fresh redo-log.

Command:

empty_____

Semantics:

Set the validity byte to indicate that neither half is valid.

Response:

empty|ack_ in case of successful re-initialisation; or
empty|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Impact on xapi performance

The implementation of the feature causes a slow-down in xapi of around 6% in the general case. However, if the LUN becomes inaccessible this can cause a slow-down of up to 25% in the worst case.

The figure below shows the result of testing four configurations, counting the number of database writes effected through a command-line ‘xe pool-param-set’ call.

The first and second configurations are xapi without the Metadata-on-LUN feature, with HA disabled and enabled respectively.
The third configuration shows xapi with the Metadata-on-LUN feature using a healthy LUN to which all database writes can be successfully flushed.
The fourth configuration shows xapi with the Metadata-on-LUN feature using an inaccessible LUN for which all database writes fail.

Figure: Impact of feature on xapi database-writing performance. (Green points represent individual samples; red bars are the arithmetic means of samples.)

Testing strategy

The section above shows how xapi performance is affected by this feature. The sections below describe the dev-testing which has already been undertaken, and propose how this feature will impact on regression testing.

Dev-testing performed

A variety of informal tests have been performed as part of the development process:

Enable HA: confirm that the LUN starts being used to persist database writes.
Enable HA, disable HA: confirm that the LUN stops being used.
Enable HA, kill xapi on master, restart xapi on master: confirm that the last database write before the kill is successfully restored on restart.
Repeatedly enable and disable HA: confirm that no file descriptors are leaked (verified by counting the number of descriptors in /proc/pid/fd/).
Enable HA, reboot the master: due to HA, a slave becomes the master (or this can be forced using ‘xe pool-emergency-transition-to-master’). Confirm that the new master is able to restore the database from the LUN from the point the old master left off, and begins to write new changes to the LUN.
Enable HA, disable the iSCSI volume: confirm that xapi continues to make progress, although database writes are not persisted.
Enable HA, disable and enable the iSCSI volume: confirm that xapi begins to use the LUN when the iSCSI volume is re-enabled and subsequent writes are persisted.

These tests have been undertaken using an iSCSI target VM and a real iSCSI volume on lannik. In these scenarios, disabling the iSCSI volume consists of stopping the VM and unmapping the LUN, respectively.

Proposed new regression test

A new regression test is proposed to confirm that all database writes are persisted across failure.

There are three types of database modification to test: row creation, field-write and row deletion. Although these three kinds of write could be tested in separate tests, the means of setting up the pre-conditions for a field-write and a row deletion require a row creation, so it is convenient to test them all in a single test.

1. Start a pool containing three hosts.
2. Issue a CLI command on the master to create a row in the database, e.g.
   xe network-create name-label=a.
3. Forcefully power-cycle the master.
4. On fail-over, issue a CLI command on the new master to check that the row creation persisted:
   xe network-list name-label=a,
   confirming that the returned string is non-empty.
5. Issue a CLI command on the master to modify a field in the new row in the database:
   xe network-param-set uuid=<uuid> name-description=abcd,
   where <uuid> is the UUID returned from step 2.
6. Forcefully power-cycle the master.
7. On fail-over, issue a CLI command on the new master to check that the field-write persisted:
   xe network-param-get uuid=<uuid> param-name=name-description,
   where <uuid> is the UUID returned from step 2. The returned string should contain abcd.
8. Issue a CLI command on the master to delete the row from the database:
   xe network-destroy uuid=<uuid>,
   where <uuid> is the UUID returned from step 2.
9. Forcefully power-cycle the master.
10. On fail-over, issue a CLI command on the new master to check that the row does not exist:
    xe network-list name-label=a,
    confirming that the returned string is empty.

Impact on existing regression tests

The Metadata-on-LUN feature should mean that there is no need to perform an ‘xe pool-sync-database’ operation in existing HA regression tests to ensure that database state persists on xapi failure.

XAPI's Internals

The articles provided under this sub-section intend to act as a developer resource for toolstack engineers.

Certificates and PEM Files

Xapi uses certificates for secure communication within a pool and with external clients. These certificates use the PEM file format and reside in the Dom0 file system. This document explains the purpose of these files.

Design Documents

Paths

Below are paths used by Xapi for certificates; additional certificates may be installed but they are not fundamental for Xapi’s operation.

/etc/xensource/xapi-ssl.pem
/etc/xensource/xapi-pool-tls.pem
/etc/stunnel/certs-pool/1c111a1f-412e-47c0-9003-60789b839bc3.pem
/etc/stunnel/certs-pool/960abfff-6017-4d97-bd56-0a8f1a43e51a.pem
/etc/stunnel/xapi-stunnel-ca-bundle.pem
/etc/stunnel/certs/
/etc/stunnel/xapi-pool-ca-bundle.pem

Fundamental Certificates

Certificates that identify a host. These certificates are comprised of both a private and a public key. The public key may be distributed to other hosts.

xapi-ssl.pem

This certificate identifies a host for extra-pool clients.

This is the certificate used by the API HTTPS server that clients like XenCenter or CVAD connect to. It is auto-generated on installation of XenServer but can be updated by a user using the API. This is the most important certificate for a user who wants to establish an HTTPS connection to a pool or host in order to use the API.

/etc/xensource/xapi-ssl.pem
contains private and public key for this host
Host.get_server_certificate API call
referenced by /etc/stunnel/xapi.conf
xe host-server-certificate-install XE command to replace the certificate.
See below for xapi-stunnel-ca-bundle for additional certificates that can be added to a pool in support of a user-supplied host certificate.
xe reset-server-certificate creates a new self-signed certificate.

`xapi-pool-tls.pem`

This certificate identifies a host inside a pool. It is auto generated and used for all intra-pool HTTPS connections. It needs to be distributed inside a pool to establish trust. The distribution of the public part of the certificate is performed by the API and must not be done manually.

/etc/xensource/xapi-pool-tls.pem
contains private and public key for this host
referenced by /etc/stunnel/xapi.conf
This certificate can be re-generated using the API or XE
Host.refresh_server_certificate
xe host-refresh-server-certificate

Certificate Bundles

Certificate bundles are used by stunnel. They are a collection of public keys from hosts and certificates provided by a user. Knowing a host’s public key facilitates stunnel connecting to the host.

Bundles by themselves are a technicality as they organise a set of certificates in a single file but don’t add new certificates.

`xapi-pool-ca-bundle.pem` and `certs-pool/*.pem`

Collection of public keys from xapi-pool-tls.pem across the pool. The public keys are collected in the certs-pool directory: each is named after the UUID of its host and the bundle is constructed from them.

bundle of public keys from hosts’ xapi-pool-tls.pem
constructed from PEM files in certs-pool/
/opt/xensource/bin/update-ca-bundle.sh generates the bundle from PEM files

`xapi-stunnel-ca-bundle.pem` and `certs/*.pem`

User-supplied certificates; they are not essential for the operation of a pool from Xapi’s perspective. They make stunnel aware of certificates used by clients when using HTTPS for API calls.

in a plain pool installation, these are empty; PEMs supplied by a user are stored here and bundled into the xapi-stunnel-ca-bundle.pem.
bundle of public keys supplied by a user
constructed from PEM files in certs/
/opt/xensource/bin/update-ca-bundle.sh generates the bundle from PEM files
Updated by a user using xe pool-install-ca-certificate
Pool.install_ca_certificate
Pool.uninstall_ca_certificate
xe pool-certificate-sync explicitly distributes these certificates in the pool.
User-provided certificates can be used to let xapi connect to WLB.

Generated Parts of Xapi

Introduction

Many parts of xapi are auto-generated during the build process. This article aims to document some of these modules and how they relate to each other. The intention of this article is to serve as a developer resource and, as such, its contents are prone to change (or become inaccurate) as the codebase evolves. The ultimate source of truth remains the codebase itself.

Interface Description

All of XenAPI’s data model is described within the ocaml/idl subdirectory. The data model itself describes the classes that make up the API. The classes themselves comprise fields and messages (methods), relating functionality together.

API Types (`aPI.ml`)

The internal representation of each object is a record whose type is specified within the generated API module, as part of building xapi-types. For example, the task object’s structure is defined by the type task_t. Similarly, API includes the internal type representation used to represent fields. For example, a type such as (string -> vdi_operations) map, used by the data model, is defined as (string * vdi_operations) list within API (where vdi_operations itself is a polymorphic variant also defined by API).

Note that all the type definitions within API are annotated with [@@deriving rpc]. This ensures that the final module, after preprocessing, also contains functions to marshal each type to/from Rpc.t values (which is important for Server - described later - which receives that format as input).

Database Actions (`db_actions.ml`)

The majority of XenAPI consists of methods used to read and modify the fields of objects in the database. These methods are automatically generated and their implementations are placed within the generated Db_actions module.

The Db_actions module consists of various, related, parts: type definitions for a subset of objects’ fields, marshallers for converting API types to/from strings, and database action handlers. Briefly, the role of each of these is described below:

Type definitions for XenAPI objects are redefined in Db_actions in order to exclude internal fields. If a field is marked as “internal only” within the data model, then it should only exist within internal representations (those defined by API, described above). For example, task_t (describing a task object) - as described by Db_actions - notably omits the field task_session (of type session ref) so as to not leak sensitive information to clients (in this case, the reference to the session that created the task object).
Fields of the database are internally stored as strings and must be marshalled to typed values for use in OCaml code using DB actions. To this end, submodules String_to_DM and DM_to_String are generated to include code for doing these conversions. These submodules consist of the inverse operations of the other. For example, for the data model type Observer ref set, the function ref_Observer_set exists in both String_to_DM (as string -> [`Observer] API.Ref.t list) and in DM_to_String (inversely, as [`Observer] API.Ref.t list -> string).
Handlers for actions that read/write fields of database objects are implemented by Db_actions. Each handler uses the relevant marshallers to marshal inputs and outputs. Note that Db_actions generates two variants of get_record for each class: a normal get_record which returns the public class representation as described by the types defined in Db_actions, and a get_record_internal, which returns the full class representation (including internal fields) as described by the API module.
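The relationship between the two marshalling submodules described above can be sketched as follows. This is a simplified illustration that only covers two primitive types using standard-library conversions; the generated modules cover many more types, and the string encodings they use for references, sets and maps are not shown here and should not be inferred from this example.

```ocaml
(* Typed value -> string, as stored in a database field. *)
module DM_to_String = struct
  let int64 (x : int64) : string = Int64.to_string x

  let bool (x : bool) : string = string_of_bool x
end

(* String read from a database field -> typed value. Each function shares
   its name with, and is the inverse of, the one in DM_to_String. *)
module String_to_DM = struct
  let int64 (s : string) : int64 = Int64.of_string s

  let bool (s : string) : bool = bool_of_string s
end

(* Round trip: marshalling and then unmarshalling gives back the value. *)
let round_trip_int64 x = String_to_DM.int64 (DM_to_String.int64 x)
```

Giving the two directions the same function name in differently named modules means a generated handler can pick the right converter from the field's type alone, and select the direction by choosing the module.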

Registering for Snapshots

The Db_actions module also generates modules that register callbacks for Xapi’s event mechanisms. These modules are named in the format Class_init and consist only of top-level code, evaluated for its effect.

As each event must provide a snapshot of the related object, the event mechanism must be able to read records from the database. To do this, Eventgen exposes an API that Db_actions uses to register callbacks (one for each type of object in the database).

For example, within Db_actions, there is a module VM_init, consisting of:

module VM_init = struct
  let _  =
    Hashtbl.add Eventgen.get_record_table "VM"
      (fun ~__context ~self -> (fun () -> API.rpc_of_vM_t (VM.get_record ~__context ~self:(Ref.of_string self))))
end

As snapshots are served to external clients, the functions use the public get_record functions - returning types defined by API - which omits internal fields.

Notice that the type of values being mapped to is __context:Context.t -> self:string -> unit -> Rpc.t.

The presence of unit in the type is to permit partial application (of __context and self) to create thunks. The type unit -> Rpc.t is lossy: it says nothing about the context or object reference being used to fetch the snapshot; these details are captured by the closure arising from partial application. This means that code can arbitrarily delay the fetching of a snapshot and then “force” it (on demand) later. In practice, these snapshots are not delayed for long; see Eventgen for more information.
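The thunk pattern described above can be demonstrated in isolation. The sketch below uses plain strings in place of Context.t and Rpc.t and a local table in place of Eventgen.get_record_table; these stand-ins, and the counter fetches, are assumptions made so that the example is self-contained.

```ocaml
(* Hypothetical stand-in for the callback table: contexts and snapshots are
   plain strings here. *)
let get_record_table :
    (string, context:string -> self:string -> unit -> string) Hashtbl.t =
  Hashtbl.create 16

(* Counts how many times a snapshot has actually been fetched. *)
let fetches = ref 0

let () =
  Hashtbl.add get_record_table "VM" (fun ~context ~self () ->
      incr fetches ;
      Printf.sprintf "snapshot of %s in %s" self context
  )

(* Partial application: supplying context and self yields a thunk of type
   unit -> string. Nothing has been fetched at this point. *)
let snapshot : unit -> string =
  (Hashtbl.find get_record_table "VM") ~context:"ctx" ~self:"OpaqueRef:vm1"
```

Calling snapshot () later performs the fetch; until then, fetches stays at zero, which is the delayed evaluation that the unit argument makes possible.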

Custom Actions (`custom_actions.ml`)

The API operations that require a custom implementation (i.e. are not automatically generated) are grouped together into a signature called CUSTOM_ACTIONS within custom_actions.ml.

open API

module type CUSTOM_ACTIONS = sig

  module Session : sig
    val login_with_password : __context:Context.t -> uname:string -> pwd:string -> version:string -> originator:string -> ref_session
    val logout : __context:Context.t -> unit
(* ... *)

The isolation of these methods into their own signature is important for ensuring implementations exist for them.

The Actions submodule of Api_server_common can be ascribed this signature. The purpose of the Actions sub-module is to group custom implementations together whilst renaming them using module aliases. For example, the custom implementations of messages associated with the task class exist within xapi_task.ml - the module arising from this file is aliased, within Actions, to rename it and satisfy the CUSTOM_ACTIONS signature (e.g. module Task = Xapi_task).

The signature is not explicitly ascribed to Actions at its definition site, but is used by functors which are parameterised by modules satisfying CUSTOM_ACTIONS. The important modules of note are Actions itself (which comprises the concrete implementations of custom messages in Xapi) and the Forwarder sub-module (of Api_server_common) which uses the Message_forwarding.Forward functor to potentially override (via shadowing) the implementations of custom actions, in order to define policies around forwarding (e.g. whether the coordinator should handle a custom action by appealing to a subordinate host, usually via the Client module - described later).
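The way a signature can be enforced at functor application rather than at the definition site is shown in the reduced sketch below. It keeps the names CUSTOM_ACTIONS, Actions, Xapi_task and Make from the text, but everything else is hypothetical: the single cancel message, its string-based types and the dispatch function are simplifications invented for the example.

```ocaml
module type CUSTOM_ACTIONS = sig
  module Task : sig
    val cancel : task:string -> string
  end
end

(* Stand-in for xapi_task.ml; it may export more than the signature needs. *)
module Xapi_task = struct
  let cancel ~task = "cancelled " ^ task

  let unrelated_helper = ()
end

(* No signature is ascribed here: Actions only renames implementations. *)
module Actions = struct
  module Task = Xapi_task
end

(* The signature is checked when a module is passed to this functor. If
   Actions lacked Task.cancel, the application below would not compile. *)
module Make (Custom : CUSTOM_ACTIONS) = struct
  let dispatch name arg =
    match name with
    | "task.cancel" | "task_cancel" ->
        Custom.Task.cancel ~task:arg
    | _ ->
        failwith ("unknown message: " ^ name)
end

module Server = Make (Actions)
```

With these definitions, Server.dispatch "task.cancel" "t1" calls through to Xapi_task.cancel, and removing cancel from Xapi_task turns a missing implementation into a compile-time error at the Make (Actions) line.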

Server (`server.ml`)

The Server module contains the logic to handle incoming calls from Xapi’s HTTP server. At the top level, it contains a functor (parameterised by Custom and Forward - both satisfying the CUSTOM_ACTIONS signature), that contains a large pattern match on the name of the supplied message (e.g. Host.get_record). Then, dependent on the semantics of the message itself, the body of each handler differs in slight ways when doing dispatch.

The top-level `dispatch_call` function

The top-level dispatch function, dispatch_call, has the following header:

let dispatch_call (http_req: Http.Request.t) (fd: Unix.file_descr) (call: Rpc.call) =

The incoming HTTP request (http_req) and the file descriptor of the related socket (fd, of type Unix.file_descr) are forwarded - within handler code - to Server_helpers.do_dispatch. The HTTP request is important because task and tracing-related metadata can be propagated using fields within the request header. The file descriptor can be used to determine the origin of the request (whether it’s local or not) but also can permit flexibility in upgrading protocols (as is done in other parts of Xapi, such as the /cli handler, where the connection starts off as HTTP but continues as something else).

The Anatomy of a Handler

A typical handler, within dispatch_call, looks like the following:

    | "task.get_name_label" | "task_get_name_label" ->
        begin match __params with
        | [session_id_rpc; self_rpc] ->
            (* has no side-effect; should be handled by DB action *)
            (* has no asynchronous mode *)
            let session_id = ref_session_of_rpc session_id_rpc in
            let self = ref_task_of_rpc self_rpc in
            Session_check.check ~intra_pool_only:false ~session_id ~action:"task.get_name_label";
            let arg_names_values = [("session_id", session_id_rpc); ("self", self_rpc)] in
            let key_names = [] in
            let rbac __context fn = Rbac.check session_id __call ~args:arg_names_values ~keys:key_names ~__context ~fn in
            let marshaller = (fun x -> rpc_of_string x) in
            let local_op = fun ~__context ->(rbac __context (fun()->(Db_actions.DB_Action.Task.get_name_label ~__context:(Context.check_for_foreign_database ~__context)  ~self))) in
            let supports_async = false in
            let generate_task_for = false in
            ApiLogRead.debug "task.get_name_label";
            let resp = Server_helpers.do_dispatch ~session_id  supports_async __call local_op marshaller fd http_req __label __sync_ty generate_task_for in
            resp
        | _ ->
            Server_helpers.parameter_count_mismatch_failure __call "1" (string_of_int ((List.length __params) - 1))
        end

The start of each handler contains calls to unmarshal arguments from their Rpc.t representation to that defined by the API module. These functions are automatically generated during preprocessing of the aPI.ml file (API modules from xapi-types, described above). The conversion from the incoming XML-RPC (or JSON-RPC) to the Rpc.t encoding is handled by Api_server before it calls dispatch_call.

In the example above, the “local” operation (local_op) uses handlers generated within Db_actions (described above). This is typical of handlers for DB-related actions (the most common type of action): they have no forwarding logic (thus, no entry in the CUSTOM_ACTIONS signature) as they can only be carried out on the coordinator host (which maintains the database). If a subordinate host wishes to change the database, it must use a custom endpoint and protocol (not described here).

To see more about how the CUSTOM_ACTIONS signature is used in practice, you can look at the “local” and “forward” operations for a message with custom handling. For example, in Pool_patch.apply:

(* ... *)
let local_op = fun ~__context ->(rbac __context (fun()->(Custom.Pool_patch.apply ~__context:(Context.check_for_foreign_database ~__context)  ~self ~host))) in
(* ...  *)
let forward_op = fun ~local_fn ~__context -> (rbac __context (fun()-> (Forward.Pool_patch.apply ~__context:(Context.check_for_foreign_database ~__context)  ~self ~host) )) in
(* ...  *)

As mentioned above, the Custom and Forward modules are both inputs to Server’s Make functor. The difference lies in how they are instantiated: Api_server ensures that Custom is referring to local implementations (such as that arising from modules defined by files named xapi_*.ml) and Forward is referring to the module derived by Message_forwarding (but shadowed with implementations that may apply different handling to the call).

RBAC Checking and Auditing

In order to implement RBAC (Role Based Access Control) checking for individual messages, each handler contains logic that wraps an action (as a callback) within code that calls into the Rbac module (specifically, the Rbac.check function).

In the typical case, Rbac.check compares the name of a call against the list of RBAC permissions granted to the role associated with the originator’s session/context. There is more involved logic for key-related RBAC checks (explained later).

For an accessible listing of each (static) RBAC permission, Xapi auto-generates a CSV file containing this information in a tabular format (within rbac_static.csv). The information in that file is consistent with the auto-generated Rbac_static module described in this document.

Along with providing authorisation checking, the Rbac.check function also appends to an audit log which contains a (sanitised) list of actions (alongside their RBAC check outcome).

RBAC Checking of Keys

In auto-generated handlers for add_to and remove_from messages (e.g. pool.add_to_other_config), the RBAC check may cite a list of key descriptors. For example:

(* pool.add_to_other_config ...  *)
let arg_names_values = [("session_id", session_id_rpc); ("self", self_rpc); ("key", key_rpc); ("value", value_rpc)] in
let key_names = ["folder"; "XenCenter.CustomFields.*"; "EMPTY_FOLDERS"] in
let rbac __context fn = Rbac.check session_id __call ~args:arg_names_values ~keys:key_names ~__context ~fn in
(* ... *)

These keys are specified within the data model as being tied to specific roles, in order to apply role-based exclusions to specific keys. The usual situation is that the setter for such a (string -> string) map field (e.g. pool.set_other_config) requires a more privileged role than the roles specified for individual keys.

The mechanism that enforces this check is somewhat brittle at present: the Rbac.check function is provided the list of key descriptors and the (association) list of (unmarshalled) arguments. If the key descriptor list is non-empty, it will consult the argument listing for the cited key (i.e. the key name mapped to by “key” in the argument listing) and then attempt to match that against a descriptor. If there is a match, it will check the current session against the list of RBAC permissions. The key-related RBAC permissions are encoded in the format action/key:key (all lowercase) - for example, pool.add_to_other_config/key:xencenter.customfields.*.
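The key-matching step described above might look roughly like the following sketch. It is not the implementation in the Rbac module: the treatment of a trailing * as a prefix wildcard and the case-sensitive comparison of key against descriptor are assumptions, and the function names matches and key_permission are hypothetical. Only the resulting permission format (action/key:key, all lowercase) comes from the text.

```ocaml
(* Assumption: a descriptor ending in '*' matches any key sharing the prefix
   before the '*'; any other descriptor must equal the key exactly. *)
let matches ~descriptor ~key =
  let n = String.length descriptor in
  if n > 0 && descriptor.[n - 1] = '*' then
    String.length key >= n - 1
    && String.sub key 0 (n - 1) = String.sub descriptor 0 (n - 1)
  else
    descriptor = key

(* Find the first descriptor matching [key] and build the name of the
   permission that the session would then be checked against. *)
let key_permission ~action ~descriptors ~key =
  match List.find_opt (fun descriptor -> matches ~descriptor ~key) descriptors with
  | Some descriptor ->
      Some (String.lowercase_ascii (action ^ "/key:" ^ descriptor))
  | None ->
      None
```

Using the descriptors from the pool.add_to_other_config handler, a key such as XenCenter.CustomFields.owner would map to the permission pool.add_to_other_config/key:xencenter.customfields.*, while a key matching no descriptor yields None and falls outside the key-specific check.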

Alternative Wire Names

In order to support languages that have keywords that collide with message names within Xapi, dispatch_call also matches on an alternative wire name for each message.

| "task.get_name_label" | "task_get_name_label" ->
(* ... *)

For example, Python uses the keyword from to handle imports and, so, an API call (using xmlrpc.client) - rendered as event.from - is a syntactic error. To get around this, the API permits an underscore to be substituted in place of the period (.) that separates the class name from the message name (e.g. event_from).

This apparent duplication of cases does not amount to a concrete duplication of matching code within the compiled module (due to how OCaml special cases the compilation of pattern matching over constant strings). However, in future, we could avoid casing on both of them by normalising the name of the incoming call (i.e. transform event.from to event_from prior to matching).
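The normalisation suggested above could be as small as the following sketch. The function normalise_wire_name is hypothetical and does not exist in Xapi today; it assumes that any async qualifier has already been stripped from the name, so that the first period is the one separating the class name from the message name.

```ocaml
(* Replace the first '.' (class/message separator) with '_'; names that are
   already in the underscore form are returned unchanged. *)
let normalise_wire_name name =
  match String.index_opt name '.' with
  | None ->
      name
  | Some i ->
      String.mapi (fun j c -> if j = i then '_' else c) name
```

With such a step in front of the pattern match, each handler would only need a single case (for example "event_from") instead of listing both spellings.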

Client (`client.ml`)

The Client module serves as the main module of the xapi-client library. The primary consumer of this library is Xapi itself, for use when a host may call into another host (or itself).

For example, when defining a message forwarding policy, the implementation of a handler may use the Client module to invoke a function on another host. For instance, the message forwarding of Pool_patch.apply (from xapi/message_forwarding.ml):

let apply ~__context ~self ~host =
  info "Pool_patch.apply: pool patch = '%s'; host = '%s'"
    (pool_patch_uuid ~__context self)
    (host_uuid ~__context host) ;
  let local_fn = Local.Pool_patch.apply ~self ~host in
  do_op_on ~local_fn ~__context ~host (fun session_id rpc ->
    Client.Pool_patch.apply ~rpc ~session_id ~self ~host
  )

The do_op_on machinery provides the rpc transport (Rpc.call -> Rpc.response) to the callback which passes it to Client’s implementation (which just performs the relevant marshalling). The RPC transport itself is XML-RPC over HTTP (as implemented by the internal http-lib library - ocaml/libs/http-lib).

Client Internals

Internally, the Client module contains a few functors. The top-level functor, ClientF is parameterised by a signature describing an arbitrary monad. The intention is to permit users to instantiate clients defined in terms of an RPC transport that may be asynchronous (for example, within the context of a program using Lwt or Async for its networking).

There is also a sub-functor AsyncF (within ClientF) that is parameterised by a module that provides a qualifier string to be prepended to calls’ method names. A few messages in Xapi can be qualified with async qualifiers (in particular, Async and InternalAsync). The AsyncF functor provides handling of those calls and is used to define the sub-modules (within ClientF) Async and InternalAsync (the former prepending Async and the latter further prepending Internal). Code at the top-level of Server’s dispatch_call function is used to parse (and remove) this async qualifier from the provided message name.

module ClientF = functor(X : IO) ->struct
  (* ... *)
  module AsyncF = functor(AQ: AsyncQualifier) ->struct
    (* handling of messages with asynchronous modes *)
    module Session = struct
      let create_from_db_file ~rpc ~session_id ~filename =
        let session_id = rpc_of_ref_session session_id in
        let filename = rpc_of_string filename in
        rpc_wrapper rpc (Printf.sprintf "%sAsync.session.create_from_db_file" AQ.async_qualifier) [ session_id; filename ] >>= fun x -> return (ref_task_of_rpc  x)
        (* ... *)
    end
    (* ...  *)
  end
  (* handling of messages with synchronous modes; similar to above, but without prefixing of "Async" *)

  module Async = AsyncF(struct let async_qualifier = "" end)
  module InternalAsync = AsyncF(struct let async_qualifier = "Internal" end)
end
(* instantiate Client with the identity monad *)
module Client = ClientF(Id)

The usual Client module used by users of xapi-client is the Client sub-module, defined in terms of the identity monad (which simply applies the given continuation as its sequencing logic and performs no wrapping):

module Id = struct
  type 'a t = 'a
  let bind x f = f x 
  let return x = x
end

module Client = ClientF(Id)

This results in synchronous semantics, whereby any code within Xapi that uses it would block as it waits for a response via the RPC transport. This is not an issue in practice, as each call is given its own thread during the dispatch logic.

Note that the RPC transport itself is defined in terms of the provided monad. In the identity case, it’s a simple alias, and so the type of rpc is rendered Rpc.call -> Rpc.response. However, if you were to provide a monad defined, for example, in terms of Lwt.t (i.e. type 'a t = 'a Lwt.t), the expected type of the transport would reflect that: Rpc.call -> Rpc.response Lwt.t.
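The effect of the monad parameter on the transport type can be seen in a cut-down version of the functor. In this sketch, calls and responses are plain strings rather than Rpc.call and Rpc.response, get_name_label is a simplified stand-in for a generated client function, and Deferred is a toy monad (a suspended computation) used in place of Lwt so that the example needs no external libraries.

```ocaml
module type IO = sig
  type 'a t

  val bind : 'a t -> ('a -> 'b t) -> 'b t

  val return : 'a -> 'a t
end

module ClientF (X : IO) = struct
  let ( >>= ) = X.bind

  (* [rpc] is the transport; its result type is wrapped in the monad. *)
  let get_name_label ~(rpc : string -> string X.t) ~self =
    rpc ("task.get_name_label:" ^ self) >>= fun response ->
    X.return ("label=" ^ response)
end

(* Identity monad: the transport is simply string -> string. *)
module Id = struct
  type 'a t = 'a

  let bind x f = f x

  let return x = x
end

(* Toy deferred monad: the transport becomes string -> (unit -> string). *)
module Deferred = struct
  type 'a t = unit -> 'a

  let bind x f = fun () -> f (x ()) ()

  let return x = fun () -> x
end

module Client = ClientF (Id)
module Deferred_client = ClientF (Deferred)
```

Client.get_name_label returns a string directly, whereas Deferred_client.get_name_label returns a suspended computation that performs the call only when applied to (); the body of get_name_label is identical in both cases.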

Rbac_static

The data model assigns specific roles to messages and fields. In order to permit RBAC (Role Based Access Control) checking for the related actions, Xapi must be able to determine the required role(s) for a given action. To this end, Rbac_static is generated to contain entries that encode this information.

The format of the entries in Rbac_static is rather peculiar. For example, for the action Pool_patch.apply, we find permission_pool_patch_apply defined at the top-level:

let permission_pool_patch_apply = 
  { (* 311/2196 *)
  role_uuid = "d4385002-b920-5412-4c57-b010f451fa81";
  role_name_label = "pool_patch.apply";
  role_name_description = permission_description;
  role_subroles = []; (* permission cannot have any subroles *)
  role_is_internal = true;
  }

This record is of type role_t (as defined by Db_actions). This record is later incorporated into role-specific lists of permissions (for each statically known role).

The reason that Rbac_static defines permissions in a format defined by Db_actions is because, to avoid flooding the database with thousands of entries, Rbac_static acts as its own database. In Xapi_role, functions are defined that mirror the functionality of functions within Db_actions (e.g. get_by_uuid).

The get_by_uuid function (within Xapi_role) illustrates the bypassing of the database clearly:

let get_by_uuid ~__context ~uuid =
  match find_role_by_uuid uuid with
  | Some static_record ->
      ref_of_role ~role:static_record
  | None ->
      (* pass-through to Db *)
      Db.Role.get_by_uuid ~__context ~uuid

If a role can be found (by UUID) statically (within Rbac_static), then that is used. Otherwise, the database is queried. Using the database as a fallback is important because there is still a dynamic component to the RBAC checking in Xapi: users can define their own roles that incorporate other roles as sub-roles - it’s just that the statically-known roles won’t be stored in the database. Precluding static roles from the database helps to avoid making the database larger and prevents users from deleting static roles from the database.

Host memory accounting

Memory is used for many things:

the hypervisor code: this is the Xen executable itself
the hypervisor heap: this is needed for per-domain structures and per-vCPU structures
the crash kernel: this is needed to collect information after a host crash
domain RAM: this is the memory the VM believes it has
shadow memory: for HVM guests running on hosts without hardware assisted paging (HAP) Xen uses shadow to optimise page table updates. For all guests shadow is used during live migration for tracking the memory transfer.
video RAM for the virtual graphics card

Some of these are constants (e.g. hypervisor code) while some depend on the VM configuration (e.g. domain RAM). Xapi calls the constants “host overhead” and the variables due to VM configuration as “VM overhead”. There is no low-level API to query this information, therefore xapi will sample the host overheads at system boot time and model the per-VM overheads.

Host overhead

The host overhead is not managed by xapi, instead it is sampled. After the host boots and before any VMs start, xapi asks Xen how much memory the host has in total, and how much memory is currently free. Xapi subtracts the free from the total and stores this as the host overhead.
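As a worked example of the subtraction described above, the sketch below computes the host overhead from two sampled values. The function name host_overhead and the figures used (a 64 GiB host with 62 GiB free at boot) are purely illustrative and are not measurements from a real system.

```ocaml
(* One GiB in bytes. *)
let gib = Int64.shift_left 1L 30

(* Host overhead = total host memory - memory free before any VM starts. *)
let host_overhead ~total ~free = Int64.sub total free

(* Illustrative figures only: 64 GiB total, 62 GiB free at boot. *)
let example = host_overhead ~total:(Int64.mul 64L gib) ~free:(Int64.mul 62L gib)
```

Here example evaluates to 2 GiB, which xapi would treat as a constant for the host when deciding how much memory remains available for VMs.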

VM overhead

The inputs to the model are

VM.memory_static_max: the maximum amount of RAM the domain will be able to use
VM.HVM_shadow_multiplier: allows the shadow memory to be increased
VM.VCPUs_max: the maximum number of vCPUs the domain will be able to use

First the shadow memory is calculated, in MiB

Shadow memory in MiB

Second the VM overhead is calculated, in MiB

Memory overhead in MiB

Memory required to start a VM

If ballooning is disabled, the memory required to start a VM is the same as the VM overhead above.

If ballooning is enabled then the memory calculation above is modified to use the VM.memory_dynamic_max rather than the VM.memory_static_max.
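The choice between the two memory fields can be sketched as below. The overhead formulas themselves are not reproduced here, so the sketch takes them as a caller-supplied `model` function; `memory_to_start` and the record layout are illustrative names, not xapi code.

```ocaml
(* Inputs to the overhead model; field names mirror the VM fields above. *)
type vm = {
    memory_static_max: int64
  ; memory_dynamic_max: int64
  ; hvm_shadow_multiplier: float
  ; vcpus_max: int
}

(* [model ram multiplier vcpus] stands in for the shadow and overhead
   calculations. With ballooning enabled the model is fed
   memory_dynamic_max, otherwise memory_static_max. *)
let memory_to_start ~ballooning ~model vm =
  let ram =
    if ballooning then vm.memory_dynamic_max else vm.memory_static_max
  in
  model ram vm.hvm_shadow_multiplier vm.vcpus_max
```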

Memory required to migrate a VM

If ballooning is disabled, the memory required to receive a migrating VM is the same as the VM overhead above.

If ballooning is enabled, then the VM will first be ballooned down to VM.memory_dynamic_min and then it will be migrated across. If the VM fails to balloon all the way down, then correspondingly more memory will be required on the receiving side.

XAPI's Storage Layers

Info

The links in this page point to the source files of xapi v25.11.0.

Xapi directly communicates only with the SMAPIv2 layer. There are no plugins directly implementing the SMAPIv2 interface, but the plugins in other layers are accessed through it:

graph TD
A[xapi] --> B[SMAPIv2 interface]
B --> C[SMAPIv2 <-> SMAPIv1 state machine: storage_smapiv1_wrapper.ml]
C --> G[SMAPIv2 <-> SMAPIv1 translation: storage_smapiv1.ml]
B --> D[SMAPIv2 <-> SMAPIv3 translation: xapi-storage-script]
G --> E[SMAPIv1 plugins]
D --> F[SMAPIv3 plugins]

SMAPIv1

These are the files related to SMAPIv1 in /ocaml/xapi/:

sm.ml: OCaml “bindings” for the SMAPIv1 Python “drivers” (SM)
sm_exec.ml: support for implementing the above “bindings”. The parameters are converted to XML-RPC, passed to the relevant python script (“driver”), and then the standard output of the program is parsed as an XML-RPC response (we use ocaml/libs/http-lib/xMLRPC.ml for parsing XML-RPC). When adding new functionality, we can modify type call to add parameters, but when we don’t add any common ones, we should just pass the new parameters in the args record.
smint.ml: Contains types, exceptions, … for the SMAPIv1 OCaml interface.
storage_smapiv1_wrapper.ml: The Wrapper module wraps a SMAPIv2 server (Server_impl) and takes care of locking and datapaths (in case of multiple connections (=datapaths) from VMs to the same VDI), using a state machine for SMAPIv1 operations. It uses the superstate computed by vdi_automaton.ml in xapi-idl to compute the actions required to reach the desired state from the current one. It also implements some functionality, like the DP module, that is not implemented in lower layers.
storage_smapiv1.ml: a SMAPIv2 server that translates SMAPIv2 calls to SMAPIv1 ones, by calling ocaml/xapi/sm.ml. It passes the XML-RPC request as the first command-line argument to the corresponding Python script, which returns an XML-RPC response on standard output.

SMAPIv2

These are the files related to SMAPIv2, which need to be modified to implement new calls:

ocaml/xapi-idl/storage/storage_interface.ml: Contains the SMAPIv2 interface
ocaml/xapi-idl/storage/storage_skeleton.ml: A stub SMAPIv2 storage server implementation that matches the SMAPIv2 storage server interface (this is verified by storage_skeleton_test.ml). Each of its functions just raises a Storage_interface.Unimplemented error. This skeleton is used to automatically fill the unimplemented methods of the storage servers below to satisfy the interface.
ocaml/xapi/storage_mux.ml: A SMAPIv2 server, which multiplexes between other servers. A different SMAPIv2 server can be registered for each SR. Then it forwards the calls for each SR to the “storage plugin” registered for that SR.
ocaml/xapi/storage_smapiv1_wrapper.ml: Implements a state machine to compute SMAPIv1 actions needed to reach the desired state, see SMAPIv1.
ocaml/xapi/storage_smapiv1.ml: Translates the SMAPIv2 calls to SMAPIv1, see SMAPIv1.

How SMAPIv2 works:

We use message-switch under the hood for RPC communication between xapi-idl components. The main Storage_mux.Server (basically Storage_impl.Wrapper(Mux)) is registered to listen on the “org.xen.xapi.storage” queue during xapi’s startup, and this is the main entry point for incoming SMAPIv2 function calls. Storage_mux does not really multiplex between different plugins right now: earlier during xapi’s startup, the same SMAPIv1 storage server module is registered on the various “org.xen.xapi.storage.<sr type>” queues for each supported SR type. (This will change with SMAPIv3, which is accessed via a SMAPIv2 plugin outside of xapi that translates between SMAPIv2 and SMAPIv3.) Then, in Storage_access.create_sr, which is called during SR.create, and also during PBD.plug, the relevant “org.xen.xapi.storage.<sr type>” queue needed for that PBD is registered with Storage_mux in Storage_access.bind for the SR of that PBD.
So basically what happens is that xapi registers itself as a SMAPIv2 server, and forwards incoming function calls to itself through message-switch, using its Storage_mux module. These calls are forwarded to xapi’s SMAPIv1 module doing SMAPIv2 -> SMAPIv1 translation.
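The routing described above can be pictured as a table from SR to message-switch queue name. The following toy model only illustrates that idea; `bind` and `queue_for` are hypothetical helpers and not the actual Storage_mux or Storage_access API.

```ocaml
(* Toy routing table: SR identifier -> queue name of its storage plugin. *)
let plugins : (string, string) Hashtbl.t = Hashtbl.create 8

(* On SR.create / PBD.plug: remember which queue serves this SR. *)
let bind ~sr ~sr_type =
  Hashtbl.replace plugins sr ("org.xen.xapi.storage." ^ sr_type)

(* Look up where a call for [sr] should be forwarded, if the SR is bound. *)
let queue_for ~sr = Hashtbl.find_opt plugins sr

let () =
  bind ~sr:"sr-1" ~sr_type:"lvm" ;
  match queue_for ~sr:"sr-1" with
  | Some q ->
      Printf.printf "VDI.attach2 for sr-1 is forwarded to %s\n" q
  | None ->
      print_endline "sr-1 is not bound"
```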

Registration of the various storage servers

sequenceDiagram
participant q as message-switch
participant v1 as Storage_smapiv1.SMAPIv1
participant svr as Storage_mux.Server

Note over q, svr: xapi startup, "Starting SMAPIv1 proxies"
q ->> v1:org.xen.xapi.storage.sr_type_1
q ->> v1:org.xen.xapi.storage.sr_type_2
q ->> v1:org.xen.xapi.storage.sr_type_3

Note over q, svr: xapi startup, "Starting SM service"
q ->> svr:org.xen.xapi.storage 

Note over q, svr: SR.create, PBD.plug
svr ->> q:org.xen.xapi.storage.sr_type_2

What happens when a SMAPIv2 “function” is called

graph TD

call[SMAPIv2 call] --VDI.attach2--> org.xen.xapi.storage

subgraph message-switch
org.xen.xapi.storage
org.xen.xapi.storage.SR_type_x
end

org.xen.xapi.storage --VDI.attach2--> Storage_smapiv1_wrapper.Wrapper

subgraph xapi
subgraph Storage_mux.server
Storage_smapiv1_wrapper.Wrapper --> Storage_mux.mux
end
Storage_smapiv1.SMAPIv1
end

Storage_mux.mux --VDI.attach2--> org.xen.xapi.storage.SR_type_x
org.xen.xapi.storage.SR_type_x --VDI.attach2--> Storage_smapiv1.SMAPIv1

subgraph SMAPIv1
driver_x[SMAPIv1 driver for SR_type_x]
end

Storage_smapiv1.SMAPIv1 --vdi_attach--> driver_x

Interface Changes, Backward Compatibility, & SXM

During SXM, xapi calls SMAPIv2 functions on a remote xapi. It is therefore important to keep all SMAPIv2 functions that we call remotely (e.g. Remote.VDI.attach) backward-compatible; otherwise SXM from an older to a newer xapi will break.

Functionality implemented in SMAPIv2 layers

The layer between SMAPIv2 and SMAPIv1 is much fatter than the one between SMAPIv2 and SMAPIv3. The latter does not do much, apart from simple translation. However, the former has large portions of code in its intermediate layers, in addition to the basic SMAPIv2 <-> SMAPIv1 translation in storage_access.ml.

These are the two files in xapi that implement the SMAPIv2 storage interface, from higher to lower level:

Functionality implemented by higher layers is not implemented by the layers below it.

Extra functionality in `storage_task.ml`

storage_smapiv1_wrapper.ml also implements the UPDATES and TASK SMAPIv2 APIs. These are backed by the Updates, Task_server, and Scheduler modules from xcp-idl, instantiated in xapi’s Storage_task module. Migration code in Storage_mux will interact with these to update task progress. There is also an event loop in xapi that keeps calling UPDATES.get to keep the tasks in xapi’s database in sync with the storage manager’s tasks.

storage_smapiv1_wrapper.ml also implements the legacy VDI.attach call by simply calling the newer VDI.attach2 call in the same module. In general, this is a good place to implement a compatibility layer for deprecated functionality removed from other layers, because this is the first module that intercepts a SMAPIv2 call.

Extra functionality in `storage_mux.ml`

Storage_mux redirects all storage motion (SXM) code to storage_migrate.ml, and the multiplexing for it is managed by storage_migrate.ml. The main implementation resides in the DATA and DATA.MIRROR modules. Migration code will use the Storage_task module to run the operations and update the task’s progress.

It also implements the Policy module from the SMAPIv2 interface.

SMAPIv3

SMAPIv3 has a slightly different interface from SMAPIv2. The xapi-storage-script daemon is a SMAPIv2 plugin separate from xapi that is doing the SMAPIv2 ↔ SMAPIv3 translation. It keeps the plugins registered with xapi-idl (their message-switch queues) up to date as their files appear or disappear from the relevant directory.

SMAPIv3 Interface

The SMAPIv3 interface is defined using an OCaml-based IDL from the ocaml-rpc library, and is located at xen-api/ocaml/xapi-storage.

From this interface we generate

OCaml RPC client bindings used in xapi-storage-script
The SMAPIv3 API reference
Python bindings, used by the SM scripts that implement the SMAPIv3 interface.
- These bindings are built by running make at the root level, and appear in the _build/default/ocaml/xapi-storage/python/xapi/storage/api/v5/ directory.
- On a XenServer host, they are stored in the /usr/lib/python3.6/site-packages/xapi/storage/api/v5/ directory

SMAPIv3 Plugins

For SMAPIv3 we have volume plugins to manipulate SRs and volumes (=VDIs) in them, and datapath plugins for connecting to the volumes. Volume plugins tell us which datapath plugins we can use with each volume, and what to pass to the plugin. Both volume and datapath plugins implement some common functionality: the SMAPIv3 plugin interface.

How SMAPIv3 works:

The xapi-storage-script daemon detects volume and datapath plugins stored in subdirectories of the /usr/libexec/xapi-storage-script/volume/ and /usr/libexec/xapi-storage-script/datapath/ directories, respectively. When it finds a new datapath plugin, it adds the plugin to a lookup table and uses it the next time that datapath is required. When it finds a new volume plugin, it binds a new message-switch queue named after the plugin’s subdirectory to a new server instance that uses these volume scripts.

To invoke a SMAPIv3 method, it executes a program named <Interface name>.<function name> in the plugin’s directory, for example /usr/libexec/xapi-storage-script/volume/org.xen.xapi.storage.gfs2/SR.ls. The inputs to each script can be passed as command-line arguments and are type-checked using the generated Python bindings, and so are the outputs. The URIs of the SRs that xapi-storage-script knows about are stored in the /var/run/nonpersistent/xapi-storage-script/state.db file; these URIs can be used on the command line when an sr argument is expected.
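The naming convention for plugin programs can be sketched as a small path builder. `script_path` is a hypothetical helper written for illustration; it is not a function from xapi-storage-script.

```ocaml
(* Root directory for volume plugins, as given in the text above. *)
let volume_root = "/usr/libexec/xapi-storage-script/volume"

(* Path of the program implementing <interface>.<fn> for a given plugin
   subdirectory. *)
let script_path ~plugin ~interface ~fn =
  Filename.concat (Filename.concat volume_root plugin) (interface ^ "." ^ fn)

let () =
  print_endline
    (script_path ~plugin:"org.xen.xapi.storage.gfs2" ~interface:"SR" ~fn:"ls")
```

On a Unix-like host, the example call yields the SR.ls path quoted above.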

Registration of the various SMAPIv3 plugins

sequenceDiagram
participant q as message-switch
participant v1 as (Storage_access.SMAPIv1)
participant svr as Storage_mux.Server
participant vol_dir as /../volume/
participant dp_dir as /../datapath/
participant script as xapi-storage-script

Note over script, vol_dir: xapi-storage-script startup
script ->> vol_dir: new subdir org.xen.xapi.storage.sr_type_4
q ->> script: org.xen.xapi.storage.sr_type_4
script ->> dp_dir: new subdir sr_type_4_dp

Note over q, svr: xapi startup, "Starting SMAPIv1 proxies"
q -->> v1:org.xen.xapi.storage.sr_type_1
q -->> v1:org.xen.xapi.storage.sr_type_2
q -->> v1:org.xen.xapi.storage.sr_type_3

Note over q, svr: xapi startup, "Starting SM service"
q ->> svr:org.xen.xapi.storage 

Note over q, svr: SR.create, PBD.plug
svr ->> q:org.xen.xapi.storage.sr_type_4

What happens when a SMAPIv3 “function” is called

graph TD

call[SMAPIv2 call] --VDI.attach2--> org.xen.xapi.storage

subgraph message-switch
org.xen.xapi.storage
org.xen.xapi.storage.SR_type_x
end

org.xen.xapi.storage --VDI.attach2--> Storage_impl.Wrapper

subgraph xapi
subgraph Storage_mux.server
Storage_impl.Wrapper --> Storage_mux.mux
end
Storage_access.SMAPIv1
end

Storage_mux.mux --VDI.attach2--> org.xen.xapi.storage.SR_type_x

org.xen.xapi.storage.SR_type_x -."VDI.attach2".-> Storage_access.SMAPIv1

subgraph SMAPIv1
driver_x[SMAPIv1 driver for SR_type_x]
end

Storage_access.SMAPIv1 -.vdi_attach.-> driver_x

subgraph SMAPIv3
xapi-storage-script --Datapath.attach--> v3_dp_plugin_x
subgraph SMAPIv3 plugins
v3_vol_plugin_x[volume plugin for SR_type_x]
v3_dp_plugin_x[datapath plugin for SR_type_x]
end
end

org.xen.xapi.storage.SR_type_x --VDI.attach2-->xapi-storage-script

Error reporting

In our SMAPIv1 OCaml “bindings” in xapi (xen-api/ocaml/xapi/sm_exec.ml), when we inspect the error codes returned from a call to SM, we translate some of the SMAPIv1/SM error codes to XenAPI errors, and for others, we just construct an error code of the form SR_BACKEND_FAILURE_<SM error number>.

The file xcp-idl/storage/storage_interface.ml defines a number of SMAPIv2 errors. Ultimately, all errors from the various SMAPIv2 storage servers in xapi will be returned as one of these. Most of the errors aren’t converted into a specific exception in Storage_interface, but are simply wrapped with Storage_interface.Backend_error.

The Storage_utils.transform_storage_exn function is used by the client code in xapi to translate the SMAPIv2 errors into XenAPI errors again. This unwraps the errors wrapped with Storage_interface.Backend_error.
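The wrap-and-unwrap round trip can be illustrated with stand-in exceptions. This is a simplified sketch: the two exceptions below are local stand-ins for Storage_interface.Backend_error and a XenAPI error, and the function body is an assumption about the general shape, not a copy of Storage_utils.transform_storage_exn.

```ocaml
(* Local stand-ins for the SMAPIv2 wrapper error and a XenAPI error. *)
exception Backend_error of string * string list

exception Api_error of string * string list

(* SM error numbers without a dedicated XenAPI error get a generic code. *)
let generic_code sm_errno = Printf.sprintf "SR_BACKEND_FAILURE_%d" sm_errno

(* Client side: run [f] and turn a wrapped backend error back into an
   API-level error carrying the same code and parameters. *)
let transform_storage_exn f =
  try f ()
  with Backend_error (code, params) -> raise (Api_error (code, params))
```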

Message Forwarding

In the message forwarding layer, we first check the validity of VDI operations using mark_vdi and mark_sr. These first check that the operation is valid, using Xapi_vdi.check_operation_error for mark_vdi, which also inspects the current operations of the VDI. If the operation is valid, it is added to the VDI’s current operations, and update_allowed_operations is called. Then we forward the VDI operation to a suitable host that has a PBD plugged for the VDI’s SR.

Checking that the SR is attached

For the VDI operations, we check at two different places whether the SR is attached: first, at the Xapi level, in Xapi_vdi.check_operation_error, for the resize operation, and then, at the SMAPIv1 level, in Sm.assert_pbd_is_plugged. Sm.assert_pbd_is_plugged performs the same checks, plus it checks that the PBD is attached to the localhost, unlike Xapi_vdi.check_operation_error. This behaviour is correct, because Xapi_vdi.check_operation_error is called from the message forwarding layer, which forwards the call to a host that has the SR attached.

VDI Identifiers and Storage Motion

VDI “location”: this is the VDI identifier used by the SM backend. It is usually the UUID of the VDI, but for ISO SRs it is the name of the ISO.
VDI “content_id”: this is used for storage motion, to reduce the amount of data copied. When we copy over a VDI, the content_id will initially be the same. However, when we attach a VDI as read-write and then detach it, we blank its content_id (set it to a random UUID), because we may have written to it, so the content could be different.
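The content_id rule can be sketched as follows. The record type and the `detach` function are hypothetical, and the random identifier is produced by a caller-supplied `fresh_id` function so that the sketch has no external dependencies.

```ocaml
type vdi = {location: string; mutable content_id: string}

(* After a read-write attach the contents may have changed, so the
   content_id is replaced on detach; a read-only attach keeps it. *)
let detach ~fresh_id ~read_write vdi =
  if read_write then vdi.content_id <- fresh_id ()

let () =
  let v = {location= "example-location"; content_id= "copied-id"} in
  detach ~fresh_id:(fun () -> "new-random-id") ~read_write:true v ;
  Printf.printf "%s now has content_id %s\n" v.location v.content_id
```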

Storage migration

Overview

The core idea of storage migration is surprisingly simple: We have VDIs attached to a VM, and we wish to migrate these VDIs from one SR to another. This necessarily requires us to copy the data stored in these VDIs over to the new SR, which can be a long-running process if there are gigabytes or even terabytes of them. We wish to minimise the down time of this process to allow the VM to keep running as much as possible.

At a very high level, the SXM process generally consists of only two stages: preparation and mirroring. The preparation is about getting the receiving host ready for the mirroring operation, while the mirroring itself can be further divided into two operations: 1. sending new writes to both sides; 2. copying existing data from source to destination. The exact details of how to set up a mirror differ significantly between SMAPIv1 and SMAPIv3, but both of them have to perform the above two operations. Once the mirroring is established, it is a matter of checking the status of the mirroring and carrying on with the following VM migration.

The reality is more complex than what we had hoped for. For example, in SMAPIv1, the mirror establishment is quite an involved process and is itself divided into several stages, which will be discussed in more detail later on.

SXM Multiplexing

This section is about the design idea behind the additional layer of multiplexing specifically for Storage Xen Motion (SXM) from SRs using SMAPIv3. It is recommended that you read the introduction doc for the storage layer first to understand how storage multiplexing is done between SMAPIv2 and SMAPI{v1, v3} before reading this.

Motivation

The existing SXM code was designed to work only with SMAPIv1 SRs, and therefore does not take into account the dramatic difference in the ways SXM is done between SMAPIv1 and SMAPIv3. The exact difference will be covered later on in this doc; for this section it is sufficient to assume that they have two different ways of doing migration. Therefore, we need different code paths for migration from SMAPIv1 and SMAPIv3.

But we have storage_mux.ml

Indeed, storage_mux.ml is responsible for multiplexing and forwarding requests to the correct storage backend, based on the SR type that the caller specifies. And in fact, for inbound SXM to SMAPIv3 (i.e. migrating into a SMAPIv3 SR, GFS2 for example), storage_mux is doing the heavy lifting of multiplexing between different storage backends. Every time a Remote. call is invoked, this will go through the SMAPIv2 layer to the remote host and get multiplexed on the destination host, based on whether we are migrating into a SMAPIv1 or SMAPIv3 SR (see the diagram below). And the inbound SXM is implemented by implementing the existing SMAPIv2 -> SMAPIv3 calls (see import_activate for example) which may not have been implemented before.

mux for inbound

While this works fine for inbound SXM, it does not work for outbound SXM. An SXM involves a source SR type (v1/v3) and a destination SR type (v1/v3), and any of the four combinations is possible. We have already covered the destination multiplexing (v1/v3) by utilising storage_mux, and at this point we have run out of multiplexers for multiplexing on the source. In other words, we can only multiplex once for each SMAPIv2 call, and we can use that chance for either the source or the destination, and we have already used it for the latter.

Thought experiments on an alternative design

To make it even more concrete, let us consider an example: the mirroring logic in SXM is different based on the source SR type of the SXM call. You might imagine defining a function like MIRROR.start v3_sr v1_sr that will be multiplexed by the storage_mux based on the source SR type, and forwarded to storage_smapiv3_migrate, or even just xapi-storage-script, which is indeed quite possible. Now at this point we have already done the multiplexing, but we still wish to multiplex operations on destination SRs. For example, we might want to attach a VDI belonging to a SMAPIv1 SR on the remote host. But as we have already done the multiplexing and are now inside xapi-storage-script, we have lost any chance of doing any further multiplexing :(

Design

The idea of this new design is to introduce an additional multiplexing layer that is specific for multiplexing calls based on the source SR type. For example, in the diagram below, send_start src_sr dest_sr takes both the source SR and the destination SR as parameters. Since the mirroring logic differs between types of source SR (i.e. SMAPIv1 or SMAPIv3), the storage migration code must choose the right code path based on the source SR type, and this is exactly what this additional multiplexing layer does. The respective logic for doing {v1,v3}-specific mirroring, for example, will stay in storage_smapi{v1,v3}_migrate.ml.

mux for outbound

Note that later on storage_smapi{v1,v3}_migrate.ml will still have the flexibility to call remote SMAPIv2 functions, such as Remote.VDI.attach dest_sr vdi, and it will be handled just as before.
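A minimal sketch of such a source-side multiplexer is shown below, using first-class modules. The module type, the two backend modules and the string results are illustrative assumptions; they only mimic the roles of storage_smapi{v1,v3}_migrate.ml and do not reproduce their interfaces.

```ocaml
type smapi_version = V1 | V3

(* Each source backend supplies its own mirroring logic. *)
module type MIGRATE = sig
  val send_start : src_sr:string -> dest_sr:string -> string
end

module V1_migrate : MIGRATE = struct
  let send_start ~src_sr ~dest_sr =
    Printf.sprintf "v1 mirror %s -> %s" src_sr dest_sr
end

module V3_migrate : MIGRATE = struct
  let send_start ~src_sr ~dest_sr =
    Printf.sprintf "v3 mirror %s -> %s" src_sr dest_sr
end

(* The extra multiplexing layer: pick the code path from the source SR type. *)
let choose_backend = function
  | V1 -> (module V1_migrate : MIGRATE)
  | V3 -> (module V3_migrate : MIGRATE)

let send_start ~src_version ~src_sr ~dest_sr =
  let module M = (val choose_backend src_version : MIGRATE) in
  M.send_start ~src_sr ~dest_sr
```

The destination SR is passed through untouched, so any later remote SMAPIv2 call on it can still be multiplexed on the destination host.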

SMAPIv1 migration

This section is about migration from SMAPIv1 SRs to SMAPIv1 or SMAPIv3 SRs. Since the migration is driven by the source host, it is usually the source host that determines most of the logic during a storage migration.

First we take a look at an overview diagram of what happens during SMAPIv1 SXM: the diagram is labelled with S1, S2 … which indicates different stages of the migration. We will talk about each stage in more detail below.

overview-v1

Preparation

Before we can start the migration process, a number of preparations are needed for the mirror that follows. For SMAPIv1 this involves:

Create a new VDI (called leaf) that will be used as the receiving VDI for all the new writes
Create a dummy snapshot of the VDI above to make sure it is a differencing disk and can be composed later on
Create a VDI (called parent) that will be used to receive the existing content of the disk (the snapshot)

Note that the leaf VDI needs to be attached and activated on the destination host (to a non-existent mirror_vm) since it will later on accept writes to mirror what is written on the source host.

The parent VDI may be created in two different ways: 1. If there is a “similar VDI”, clone it on the destination host and use it as the parent VDI; 2. If there is no such VDI, create a new blank VDI. The similarity here is defined by the distances between different VDIs in the VHD tree, which is exploiting the internal representation of the storage layer, hence we will not go into too much detail about this here.

Once these preparations are done, a mirror_receive_result data structure is then passed back to the source host that will contain all the necessary information about these new VDIs, etc.
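The choice of parent VDI can be sketched as a simple decision. The `parent_source` type and `choose_parent` function are hypothetical; how the list of similar VDIs is computed is deliberately left out.

```ocaml
(* Where the parent VDI on the destination comes from. *)
type parent_source = Clone_of of string | Blank

(* [similar] lists candidate VDIs on the destination, nearest first. If
   there is one, it is cloned; otherwise a new blank VDI is created. *)
let choose_parent ~similar =
  match similar with vdi :: _ -> Clone_of vdi | [] -> Blank
```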

Establishing mirror

At a high level, mirror establishment for SMAPIv1 works as follows:

Take a snapshot of a VDI that is attached to VM1. This gives us an immutable copy of the current state of the VDI, with all the data up until the point we took the snapshot. This is illustrated in the diagram as a VDI and its snapshot connecting to a shared parent, which stores the shared content for the snapshot and the writable VDI from which we took the snapshot (snapshot)
Mirror the writable VDI to the server host: this means that all writes that go to the client VDI will also be written to the mirrored VDI on the remote host (mirror)
Copy the immutable snapshot from our local host to the remote (copy)
Compose the mirror and the snapshot to form a single VDI
Destroy the snapshot on the local host (cleanup)

Mirror

The mirroring process for SMAPIv1 is rather unconventional, so it is worth documenting how this works. Instead of a conventional client server architecture, where the source client connects to the destination server directly through the NBD protocol in tapdisk, the connection is established in xapi and then passed onto tapdisk. It was done in this rather unusual way mainly due to authentication issues. Because it is xapi that is creating the connection, tapdisk does not need to be concerned about authentication of the connection, thus simplifying the storage component. This is reasonable as the storage component should focus on handling storage requests rather than worrying about network security.

The diagram below illustrates this process. First, xapi on the source host will initiate an https request to the remote xapi. This request contains the necessary information about the VDI to be mirrored, and the SR that contains it, etc. This information is then passed onto the https handler on the destination host (called nbd_handler) which then processes this information. Now the unusual step is that both the source and the destination xapi will pass this connection onto tapdisk, by sending the fd representing the socket connection to the tapdisk process. On the source this would be the nbd client process of tapdisk, and on the destination this would be the nbd server process of tapdisk. After this step, we can consider a client-server connection to be established between the two tapdisks on the client and server, as if the tapdisk on the source host had made a request to the tapdisk on the destination host and initiated the connection. On the diagram, this is indicated by the dashed lines between the tapdisk processes. Logically, we can view this as xapi creating the connection and then passing it down into tapdisk.

mirror

Snapshot

The next step is to create a snapshot of the VDI. This is easily done as a VDI.snapshot operation. If the VDI is in VHD format, then internally this creates two children: one for the snapshot, which only contains the metadata information and tends to be small, and the other for the writable VDI, where all the new writes will go. The shared base copy contains the shared blocks.

snapshot

Copy and compose

Once the snapshot is created, we can then copy the snapshot from the source to the destination. This step is done by sparse_dd using the nbd protocol. This is also the step that takes the most time to complete.

sparse_dd is a process forked by xapi that does the copying of the disk blocks. sparse_dd supports a number of protocols, including nbd. In this case, sparse_dd will initiate an https put request to the destination host, with a url of the form <address>/services/SM/nbdproxy/<sr>/<vdi>. This https request then gets handled by the https handler on the destination host B, which will then spawn a handler thread. This handler will find the “generic” nbd server¹ of either tapdisk or qemu-dp, depending on the destination SR type, and then start proxying data between the https connection socket and the socket connected to the nbd server.
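The URL form used for the copy can be sketched with a small formatter. `nbdproxy_url` is a hypothetical helper, and the address, SR and VDI values in the example are placeholders.

```ocaml
(* Build a URL of the form <address>/services/SM/nbdproxy/<sr>/<vdi>. *)
let nbdproxy_url ~address ~sr ~vdi =
  Printf.sprintf "%s/services/SM/nbdproxy/%s/%s" address sr vdi

let () =
  print_endline
    (nbdproxy_url ~address:"https://dest.example" ~sr:"sr-uuid" ~vdi:"vdi-uuid")
```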

sxm new copy

Once copying is done, the snapshot and mirrored VDI can be then composed into a single VDI.

Finish

At this point the VDI is synchronised to the new host! The mirror is still running at this point, though, because it will not be destroyed until the VM itself has been migrated as well. Some cleanup is done at this point, such as deleting the snapshot that was taken on the source, destroying the mirror datapath, etc.

The end result looks like the following. Note that VM2 is drawn with a dashed line as it has not been created yet. The next step would be to migrate VM1 itself to the destination as well, but this is part of the VM migration process and will not be covered here.

final

SMAPIv3 migration

This section covers the mechanism of migrations from SRs using SMAPIv3 (to SMAPIv1 or SMAPIv3). Although the core ideas are the same, SMAPIv3 has a rather different mechanism for mirroring: 1. it no longer requires xapi to take a snapshot of the VDI, since the mirror itself will take care of replicating the existing data to the destination; 2. there is no fd passing for connection establishment anymore; instead, proxies are used for connection setup.

Preparation

The preparation work for SMAPIv3 is greatly simplified by the fact that the mirror at the storage layer will copy the existing data in the VDI to the destination. This means that snapshot of the source VDI is not required anymore. So we are left with only one thing:

Create a VDI used for mirroring the data of the source VDI

For this reason, the implementation logic for SMAPIv3 preparation is also shorter, as the complexity is now handled by the storage layer, which is where it is supposed to be handled.

Establishing mirror

The other significant difference is that the storage backend for SMAPIv3 qemu-dp SRs no longer accepts fds, so xapi needs to proxy the data between the nbd client and the nbd server.

SMAPIv3 provides the Data.mirror uri domain remote call, which takes three parameters: uri for accessing the local disk, domain for the domain slice on which mirroring should happen, and, most importantly for this design, remote, a URL which represents the remote nbd server to which the blocks of data can be sent.

This function, when called by xapi and forwarded to the storage layer’s qemu-dp nbd client, will initiate a nbd connection to the nbd server pointed to by remote. This works fine when the storage migration happens entirely within a local host, where qemu-dp’s nbd client and nbd server can communicate over unix domain sockets. However, it does not work for inter-host migrations as qemu-dp’s nbd server is not exposed publicly over the network (just like tapdisk’s nbd server). Therefore a proxying service on the source host is needed for forwarding the nbd connection from the source host to the destination host, and it is the responsibility of xapi to manage this proxy service.

The following diagram illustrates the mirroring process of a single VDI:

sxm mirror

The first step for xapi is then to set up a nbd proxy thread that will be listening on a local unix domain socket with path /var/run/nbdproxy/export/<domain>, where domain is the domain parameter mentioned above in Data.mirror. The nbd proxy thread will accept nbd connections (or rather any connections, as it does not speak or care about the nbd protocol at all) and send an https put request to the remote xapi. The proxy itself will then forward the data exactly as it is to the remote side through the https connection.

Once the proxy is set up, xapi will call Data.mirror, which will be forwarded to the xapi-storage-script and is further forwarded to the qemu-dp. This call contains, among other parameters, the destination NBD server url (remote) to be connected. In this case the destination nbd server is exactly the domain socket to which the proxy thread is listening. Therefore the remote parameter will be of the form nbd+unix:///<export>?socket=<socket> where the export is provided by the destination nbd server that represents the VDI prepared on the destination host, and the socket will be the path of the unix domain socket where the proxy thread (which we just created) is listening at.
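How the socket path and the remote parameter fit together can be sketched as below. Both helpers are hypothetical and the example values are placeholders; only the two string formats are taken from the text above.

```ocaml
(* Socket on which the proxy thread listens for a given domain. *)
let proxy_socket ~domain = Filename.concat "/var/run/nbdproxy/export" domain

(* The remote parameter passed to Data.mirror: the export names the VDI on
   the destination nbd server, the socket is the local proxy socket. *)
let remote_uri ~export ~domain =
  Printf.sprintf "nbd+unix:///%s?socket=%s" export (proxy_socket ~domain)

let () = print_endline (remote_uri ~export:"example-export" ~domain:"5")
```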

When this connection is set up, the proxy process will talk to the remote xapi via https requests, and on the remote side, an https handler will proxy this request to the appropriate nbd server of either tapdisk or qemu-dp, using exactly the same import proxy as mentioned before.

Note that this proxying service is tightly integrated with outbound SXM of SMAPIv3 SRs. This is to make it simple to focus on the migration itself.

Although there is no need to explicitly copy the VDI anymore, we still need to transfer the data and wait for it to finish. For this we use the Data.stat call provided by the storage backend to query the status of the mirror, and wait for it to finish as needed.
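Waiting for the mirror can be sketched as a bounded polling loop. The `mirror_status` record is an assumed shape, not the actual result type of Data.stat, and `stat` and `sleep` are supplied by the caller so that the sketch stays self-contained.

```ocaml
(* Assumed status shape for illustration only. *)
type mirror_status = {complete: bool; failed: bool}

(* Poll [stat] until the mirror completes or fails, giving up after
   [max_polls] attempts. [sleep] is called between attempts. *)
let rec wait_for_mirror ~stat ~sleep ~max_polls =
  if max_polls <= 0 then
    Error "timed out"
  else
    match stat () with
    | {failed= true; _} ->
        Error "mirror failed"
    | {complete= true; _} ->
        Ok ()
    | _ ->
        sleep () ;
        wait_for_mirror ~stat ~sleep ~max_polls:(max_polls - 1)
```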

Limitations

This way of establishing the connection simplifies the implementation of the migration for SMAPIv3, but it also has limitations:

One proxy per live VDI migration is needed, which can potentially consume lots of resources in dom0. We should measure the impact of this before switching to more resource-efficient approaches, such as WireGuard, which allows establishing a single connection between multiple hosts.

Finish

As there is no need to copy a VDI, there is also no need to compose or delete the snapshot. The cleanup procedure therefore just involves destroying the datapath that was used for receiving writes for the mirrored VDI.

Error Handling

Storage migration is a long-running process, and is prone to failures in each step. Hence it is important to specify what errors could be raised at each step and their significance. This is beneficial both for the user and for triaging.

There are two general cleanup functions in SXM: MIRROR.receive_cancel and MIRROR.stop. The former is for cleaning up whatever has been created by MIRROR.receive_start on the destination host (such as VDIs for receiving mirrored data). The latter is a more comprehensive function that attempts to “undo” all the side effects that were produced during the SXM, and also calls receive_cancel as part of its operations.

Currently, error handling is done by building up a list of cleanup functions in the on_fail list ref as the function executes. For example, once receive_start has completed successfully, receive_cancel is added to the list of cleanup functions. Whenever an exception is encountered, whatever has been added to the on_fail list ref is executed. This is convenient, but it entangles all the error handling logic with the core SXM logic itself, making the code rather hard to understand and maintain.
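
As a rough illustration of the on_fail pattern, here is a minimal, self-contained sketch. The step functions (receive_start, receive_cancel, start_mirror, stop_mirror) and the log are hypothetical stand-ins that only record their names; they are not the real SXM functions or their signatures.

```ocaml
(* Hypothetical stand-ins for SXM steps: they only record that they ran. *)
let log : string list ref = ref []

let record msg = log := msg :: !log

let receive_start () = record "receive_start"

let receive_cancel () = record "receive_cancel"

let start_mirror ~fail =
  if fail then failwith "mirror failed" else record "start_mirror"

let stop_mirror () = record "stop_mirror"

(* Run the registered cleanups, most recently added first. A cleanup that
   raises must not prevent the remaining ones from running. *)
let perform_cleanup on_fail =
  List.iter (fun f -> try f () with _ -> record "cleanup failed") !on_fail

let migrate ~fail =
  let on_fail : (unit -> unit) list ref = ref [] in
  try
    receive_start () ;
    (* From here on, a failure must undo receive_start. *)
    on_fail := receive_cancel :: !on_fail ;
    start_mirror ~fail ;
    on_fail := stop_mirror :: !on_fail ;
    Ok ()
  with e ->
    perform_cleanup on_fail ;
    Error e
```

Note how the bookkeeping lines (on_fail := ...) are interleaved with the actual work, which is the entanglement this section describes.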

The idea to fix this is to introduce explicit “stages” during the SXM and define explicitly what error handling should be done if it fails at a certain stage. This helps separate the error handling logic into the with part of a try with block, which is where it is supposed to be. Since we need to accommodate the existing SMAPIv1 migration (which has more stages than SMAPIv3), the following stages are introduced: preparation (v1, v3), snapshot (v1), mirror (v1, v3), copy (v1). Note that each stage also roughly corresponds to a helper function that is called within Storage_migrate.start, which is the wrapper function that initiates storage migration. Each helper function also has its own error handling logic as needed (e.g. see Storage_smapiv1_migrate.receive_start) to deal with exceptions that happen within that helper function.
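
The stage-based approach could be sketched roughly as follows. This is an illustrative sketch only: the stage type, the Stage_failed exception and the string labels for cleanup actions are assumptions made for the example, and do not reflect the actual definitions in storage_migrate.ml.

```ocaml
(* Hypothetical stage type, following the stages named in the text. *)
type stage = Preparation | Snapshot | Mirror | Copy

(* Hypothetical exception that tags a failure with the stage it occurred in. *)
exception Stage_failed of stage * exn

(* The cleanup is chosen purely from the failing stage. The strings are
   labels for illustration, not real function calls. *)
let cleanup_for = function
  | Preparation ->
      [] (* receive_start cleans up after itself *)
  | Snapshot | Mirror ->
      ["receive_cancel"]
  | Copy ->
      ["stop"]

let run_stage stage f = try f () with e -> raise (Stage_failed (stage, e))

(* Runs the steps in order and returns the cleanup actions that would be
   needed: the empty list on success. *)
let start ~steps =
  try
    List.iter (fun (stage, f) -> run_stage stage f) steps ;
    []
  with Stage_failed (stage, _) -> cleanup_for stage
```

With this shape, the core logic in the try part contains no cleanup bookkeeping, and the with part is the only place that decides what to undo.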

Preparation (SMAPIv1 and SMAPIv3)

The preparation stage generally corresponds to what is done in receive_start, and this function itself will handle exceptions when there are partial failures within the function itself, such as an exception after the receiving VDI is created. It will use the old-style on_fail function but only with a limited scope.

There is nothing to be done at a higher level (i.e. within MIRROR.start, which calls receive_start) if preparation has failed.

Snapshot and mirror failure (SMAPIv1)

For SMAPIv1, the mirror is set up in a somewhat cumbersome way. The end goal is to establish a connection between two tapdisk processes on the source and destination hosts. To achieve this goal, xapi does two main jobs: 1. create a connection between the two hosts and pass the connection to tapdisk; 2. create a snapshot as a starting point of the mirroring process.

Therefore the handling of failures at these two stages is similar: clean up what was done in the preparation stage by calling receive_cancel, and that is almost it. Again, we will leave whatever is needed for partial failure handling within those functions themselves and only clean up at stage level in storage_migrate.ml.

Note that receive_cancel is a multiplexed function for SMAPIv1 and SMAPIv3, which means different clean up logic will be executed depending on what type of SR we are migrating from.

Mirror failure (SMAPIv3)

The Data.stat call in SMAPIv3 returns a data structure that includes the current progress of the mirror job, whether it has completed syncing the existing data and whether the mirror has failed. Similar to how it is done in SMAPIv1, once we issue the Data.mirror call we wait for the sync to complete by repeatedly polling the status of the mirror using the Data.stat call. During this process, the status of the mirror is also checked. If a failure is detected, a Migration_mirror_failure is raised and then handled by the code in storage_migrate.ml by calling Storage_smapiv3_migrate.receive_cancel2, which will clean up the mirror datapath and destroy the mirror VDI, similar to what is done in SMAPIv1.
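
The polling loop can be pictured with the following sketch. The mirror_status record, its field names and the argument of the exception are assumptions made for the example; the real structure returned by Data.stat may differ. The stat and sleep functions are passed in as parameters so that the sketch does not depend on any particular backend or threading library.

```ocaml
(* Hypothetical, simplified view of what a Data.stat call reports. *)
type mirror_status = {failed: bool; complete: bool; progress: float option}

(* The exception name comes from the text; its argument is an assumption. *)
exception Migration_mirror_failure of string

(* Poll until the mirror has finished syncing, raising as soon as a
   failure is reported. *)
let rec wait_for_sync ~stat ~sleep =
  let st = stat () in
  if st.failed then
    raise (Migration_mirror_failure "mirror failed while syncing")
  else if st.complete then ()
  else (
    sleep () ;
    wait_for_sync ~stat ~sleep
  )
```

A caller would wrap wait_for_sync in a try with block and perform the stage-level cleanup in the handler.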

Copy failure (SMAPIv1)

The final step of storage migration for SMAPIv1 is to copy the snapshot from the source to the destination. At this stage, most of the side-effectful work has been done, so we do need to call MIRROR.stop to clean things up if we experience a failure during copying.

SMAPIv1 Migration implementation detail

Info

The following documentation refers to versions of xapi before 24.37. After that point the code structure underwent many changes as part of adding support for SMAPIv3 SXM, so the following walkthrough might be less relevant in terms of implementation detail, although the general principles should remain the same.

sequenceDiagram
participant local_tapdisk as local tapdisk
participant local_smapiv2 as local SMAPIv2
participant xapi
participant remote_xapi as remote xapi
participant remote_smapiv2 as remote SMAPIv2 (might redirect)
participant remote_tapdisk as remote tapdisk

Note over xapi: Sort VDIs increasingly by size and then age

loop VM's & snapshots' VDIs & suspend images
  xapi->>remote_xapi: plug dest SR to dest host and pool master
  alt VDI is not mirrored
    Note over xapi: We don't mirror RO VDIs & VDIs of snapshots
    xapi->>local_smapiv2: DATA.copy remote_sm_url

    activate local_smapiv2
    local_smapiv2-->>local_smapiv2: SR.scan
    local_smapiv2-->>local_smapiv2: VDI.similar_content
    local_smapiv2-->>remote_smapiv2: SR.scan
    Note over local_smapiv2: Find nearest smaller remote VDI remote_base, if any
    alt remote_base
      local_smapiv2-->>remote_smapiv2: VDI.clone
      local_smapiv2-->>remote_smapiv2: VDI.resize
    else no remote_base
      local_smapiv2-->>remote_smapiv2: VDI.create
    end

    Note over local_smapiv2: call copy'
    activate local_smapiv2
    local_smapiv2-->>remote_smapiv2: SR.list
    local_smapiv2-->>remote_smapiv2: SR.scan
    Note over local_smapiv2: create new datapaths remote_dp, base_dp, leaf_dp
    Note over local_smapiv2: find local base_vdi with same content_id as dest, if any
    local_smapiv2-->>remote_smapiv2: VDI.attach2 remote_dp dest
    local_smapiv2-->>remote_smapiv2: VDI.activate remote_dp dest
    opt base_vdi
      local_smapiv2-->>local_smapiv2: VDI.attach2 base_dp base_vdi
      local_smapiv2-->>local_smapiv2: VDI.activate base_dp base_vdi
    end
    local_smapiv2-->>local_smapiv2: VDI.attach2 leaf_dp vdi
    local_smapiv2-->>local_smapiv2: VDI.activate leaf_dp vdi
    local_smapiv2-->>remote_xapi: sparse_dd base_vdi vdi dest [NBD URI for dest & remote_dp]
    Note over remote_xapi: HTTP handler verifies credentials
    remote_xapi-->>remote_tapdisk: then passes connection to tapdisk's NBD server
    local_smapiv2-->>local_smapiv2: VDI.deactivate leaf_dp vdi
    local_smapiv2-->>local_smapiv2: VDI.detach leaf_dp vdi
    opt base_vdi
      local_smapiv2-->>local_smapiv2: VDI.deactivate base_dp base_vdi
      local_smapiv2-->>local_smapiv2: VDI.detach base_dp base_vdi
    end
    local_smapiv2-->>remote_smapiv2: DP.destroy remote_dp
    deactivate local_smapiv2

    local_smapiv2-->>remote_smapiv2: VDI.snapshot remote_copy
    local_smapiv2-->>remote_smapiv2: VDI.destroy remote_copy
    local_smapiv2->>xapi: task(snapshot)
    deactivate local_smapiv2

  else VDI is mirrored
    Note over xapi: We mirror RW VDIs of the VM
    Note over xapi: create new datapath dp
    xapi->>local_smapiv2: VDI.attach2 dp
    xapi->>local_smapiv2: VDI.activate dp
    xapi->>local_smapiv2: DATA.MIRROR.start dp remote_sm_url

    activate local_smapiv2
    Note over local_smapiv2: copy disk data & mirror local writes
    local_smapiv2-->>local_smapiv2: SR.scan
    local_smapiv2-->>local_smapiv2: VDI.similar_content
    local_smapiv2-->>remote_smapiv2: DATA.MIRROR.receive_start similars
    activate remote_smapiv2
    remote_smapiv2-->>local_smapiv2: mirror_vdi,mirror_dp,copy_diffs_from,copy_diffs_to,dummy_vdi
    deactivate remote_smapiv2
    local_smapiv2-->>local_smapiv2: DP.attach_info dp
    local_smapiv2-->>remote_xapi: connect to [NBD URI for mirror_vdi & mirror_dp]
    Note over remote_xapi: HTTP handler verifies credentials
    remote_xapi-->>remote_tapdisk: then passes connection to tapdisk's NBD server
    local_smapiv2-->>local_tapdisk: pass socket & dp to tapdisk of dp
    local_smapiv2-->>local_smapiv2: VDI.snapshot local_vdi [mirror:dp]
    local_smapiv2-->>local_tapdisk: [Python] unpause disk, pass dp
    local_tapdisk-->>remote_tapdisk: mirror new writes via NBD to socket
    Note over local_smapiv2: call copy' snapshot copy_diffs_to
    local_smapiv2-->>remote_smapiv2: VDI.compose copy_diffs_to mirror_vdi
    local_smapiv2-->>remote_smapiv2: VDI.remove_from_sm_config mirror_vdi base_mirror
    local_smapiv2-->>remote_smapiv2: VDI.destroy dummy_vdi
    local_smapiv2-->>local_smapiv2: VDI.destroy snapshot
    local_smapiv2->>xapi: task(mirror ID)
    deactivate local_smapiv2

    xapi->>local_smapiv2: DATA.MIRROR.stat
    activate local_smapiv2
    local_smapiv2->>xapi: dest_vdi
    deactivate local_smapiv2
  end

  loop until task finished
    xapi->>local_smapiv2: UPDATES.get
    xapi->>local_smapiv2: TASK.stat
  end
  xapi->>local_smapiv2: TASK.stat
  xapi->>local_smapiv2: TASK.destroy
end
opt for snapshot VDIs
  xapi->>local_smapiv2: SR.update_snapshot_info_src remote_sm_url
  activate local_smapiv2
  local_smapiv2-->>remote_smapiv2: SR.update_snapshot_info_dest
  deactivate local_smapiv2
end
Note over xapi: ...
Note over xapi: reserve resources for the new VM in dest host
loop all VDIs
  opt VDI is mirrored
    xapi->>local_smapiv2: DP.destroy dp
  end
end
opt post_detach_hook
  opt active local mirror
    local_smapiv2-->>remote_smapiv2: DATA.MIRROR.receive_finalize [mirror ID]
    Note over remote_smapiv2: destroy mirror dp
  end
end
Note over xapi: memory image migration by xenopsd
Note over xapi: destroy the VM record

Receiving SXM

These are the remote calls in the above diagram sent from the remote host to the receiving end of storage motion:

Remote SMAPIv2 -> local SMAPIv2 RPC calls:
- SR.list
- SR.scan
- SR.update_snapshot_info_dest
- VDI.attach2
- VDI.activate
- VDI.snapshot
- VDI.destroy
- For copying:
  - For copying from base:
    - VDI.clone
    - VDI.resize
  - For copying without base:
    - VDI.create
- For mirroring:
  - DATA.MIRROR.receive_start
  - VDI.compose
  - VDI.remove_from_sm_config
  - DATA.MIRROR.receive_finalize
HTTP requests to xapi:
- Connecting to NBD URI via xapi’s HTTP handler

This is how xapi coordinates storage migration. We’ll do it as a code walkthrough through the two layers: xapi and storage-in-xapi (SMAPIv2).

Xapi code

The entry point is the function migrate_send' in xapi_vm_migration.ml.

The function takes several arguments:

a vm reference (vm)
a dictionary of (string * string) key-value pairs about the destination (dest). This is the result of a previous call to the destination pool, Host.migrate_receive
live, a boolean of whether we should live-migrate or suspend-resume,
vdi_map, a mapping of VDI references to destination SR references,
vif_map, a mapping of VIF references to destination network references,
vgpu_map, similar for VGPUs
options, another dictionary of options

let migrate_send'  ~__context ~vm ~dest ~live ~vdi_map ~vif_map ~vgpu_map ~options =
  SMPERF.debug "vm.migrate_send called vm:%s" (Db.VM.get_uuid ~__context ~self:vm);

  let open Xapi_xenops in

  let localhost = Helpers.get_localhost ~__context in
  let remote = remote_of_dest dest in

  (* Copy mode means we don't destroy the VM on the source host. We also don't
     	   copy over the RRDs/messages *)
  let copy = try bool_of_string (List.assoc "copy" options) with _ -> false in

It begins by getting the local host reference, deciding whether we’re copying or moving, and converting the input dest parameter from an untyped string association list to a typed record, remote, which is declared further up the file:

type remote = {
  rpc : Rpc.call -> Rpc.response;
  session : API.ref_session;
  sm_url : string;
  xenops_url : string;
  master_url : string;
  remote_ip : string; (* IP address *)
  remote_master_ip : string; (* IP address *)
  dest_host : API.ref_host;
}

Among other things, this contains:

A function, rpc, for calling XenAPI RPCs on the destination
A session valid on the destination
A sm_url on which SMAPIv2 APIs can be called on the destination
A master_url on which XenAPI commands can be called (not currently used)
The IP address, remote_ip, of the destination host
The IP address, remote_master_ip, of the master of the destination pool

Next, we determine which VDIs to copy:

  (* The first thing to do is to create mirrors of all the disks on the remote.
     We look through the VM's VBDs and all of those of the snapshots. We then
     compile a list of all of the associated VDIs, whether we mirror them or not
     (mirroring means we believe the VDI to be active and new writes should be
     mirrored to the destination - otherwise we just copy it)
     We look at the VDIs of the VM, the VDIs of all of the snapshots, and any
     suspend-image VDIs. *)

  let vm_uuid = Db.VM.get_uuid ~__context ~self:vm in
  let vbds = Db.VM.get_VBDs ~__context ~self:vm in
  let vifs = Db.VM.get_VIFs ~__context ~self:vm in
  let snapshots = Db.VM.get_snapshots ~__context ~self:vm in
  let vm_and_snapshots = vm :: snapshots in
  let snapshots_vbds = List.concat_map (fun self -> Db.VM.get_VBDs ~__context ~self) snapshots in
  let snapshot_vifs = List.concat_map (fun self -> Db.VM.get_VIFs ~__context ~self) snapshots in

We now decide whether we’re intra-pool or not, and if we’re intra-pool whether we’re migrating onto the same host (localhost migrate). Intra-pool is decided by trying to look up the destination host reference in our local database: the lookup only succeeds if the destination host belongs to our own pool.

  let is_intra_pool = try ignore(Db.Host.get_uuid ~__context ~self:remote.dest_host); true with _ -> false in
  let is_same_host = is_intra_pool && remote.dest_host == localhost in

  if copy && is_intra_pool then raise (Api_errors.Server_error(Api_errors.operation_not_allowed, [ "Copy mode is disallowed on intra pool storage migration, try efficient alternatives e.g. VM.copy/clone."]));

Having got all of the VBDs of the VM, we now need to find the associated VDIs, filtering out empty CDs, and decide whether we’re going to copy them or mirror them - read-only VDIs can be copied but RW VDIs must be mirrored.

  let vms_vdis = List.filter_map (vdi_filter __context true) vbds in

where vdi_filter is defined earlier:

(* We ignore empty or CD VBDs - nothing to do there. Possible redundancy here:
   I don't think any VBDs other than CD VBDs can be 'empty' *)
let vdi_filter __context allow_mirror vbd =
  if Db.VBD.get_empty ~__context ~self:vbd || Db.VBD.get_type ~__context ~self:vbd = `CD
  then None
  else
    let do_mirror = allow_mirror && (Db.VBD.get_mode ~__context ~self:vbd = `RW) in
    let vm = Db.VBD.get_VM ~__context ~self:vbd in
    let vdi = Db.VBD.get_VDI ~__context ~self:vbd in
    Some (get_vdi_mirror __context vm vdi do_mirror)

This in turn calls get_vdi_mirror which gathers together some important info:

let get_vdi_mirror __context vm vdi do_mirror =
  let snapshot_of = Db.VDI.get_snapshot_of ~__context ~self:vdi in
  let size = Db.VDI.get_virtual_size ~__context ~self:vdi in
  let xenops_locator = Xapi_xenops.xenops_vdi_locator ~__context ~self:vdi in
  let location = Db.VDI.get_location ~__context ~self:vdi in
  let dp = Storage_access.presentative_datapath_of_vbd ~__context ~vm ~vdi in
  let sr = Db.SR.get_uuid ~__context ~self:(Db.VDI.get_SR ~__context ~self:vdi) in
  {vdi; dp; location; sr; xenops_locator; size; snapshot_of; do_mirror}

The record is helpfully commented above:

type vdi_mirror = {
  vdi : [ `VDI ] API.Ref.t;           (* The API reference of the local VDI *)
  dp : string;                        (* The datapath the VDI will be using if the VM is running *)
  location : string;                  (* The location of the VDI in the current SR *)
  sr : string;                        (* The VDI's current SR uuid *)
  xenops_locator : string;            (* The 'locator' xenops uses to refer to the VDI on the current host *)
  size : Int64.t;                     (* Size of the VDI *)
  snapshot_of : [ `VDI ] API.Ref.t;   (* API's snapshot_of reference *)
  do_mirror : bool;                   (* Whether we should mirror or just copy the VDI *)
}

xenops_locator is <sr uuid>/<vdi uuid>, and dp is vbd/<domid>/<device> if the VM is running and vbd/<vm_uuid>/<vdi_uuid> if not.
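
The two naming conventions can be written down as a small sketch. The vm_state type and the function names below are hypothetical and only restate the formats given above; the real values are computed by Xapi_xenops.xenops_vdi_locator and Storage_access.presentative_datapath_of_vbd.

```ocaml
(* Hypothetical description of whether the VM currently has a domain. *)
type vm_state = Running of {domid: int; device: string} | Not_running

(* <sr uuid>/<vdi uuid> *)
let xenops_locator ~sr_uuid ~vdi_uuid = Printf.sprintf "%s/%s" sr_uuid vdi_uuid

(* vbd/<domid>/<device> when running, vbd/<vm_uuid>/<vdi_uuid> otherwise *)
let datapath ~vm_uuid ~vdi_uuid state =
  match state with
  | Running {domid; device} ->
      Printf.sprintf "vbd/%d/%s" domid device
  | Not_running ->
      Printf.sprintf "vbd/%s/%s" vm_uuid vdi_uuid
```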

So now we have a list of these records for all VDIs attached to the VM. For these we check explicitly that they’re all defined in the vdi_map, the mapping of VDI references to their destination SR references.

  check_vdi_map ~__context vms_vdis vdi_map;

We then figure out the VIF map:

 let vif_map =
    if is_intra_pool then vif_map
    else infer_vif_map ~__context (vifs @ snapshot_vifs) vif_map
  in

More sanity checks: We can’t do a storage migration if any of the VDIs is a reset-on-boot one - since the state will be lost on the destination when it’s attached:

(* Block SXM when VM has a VDI with on_boot=reset *)
  List.(iter (fun vconf ->
      let vdi = vconf.vdi in
      if (Db.VDI.get_on_boot ~__context ~self:vdi ==`reset) then
        raise (Api_errors.Server_error(Api_errors.vdi_on_boot_mode_incompatible_with_operation, [Ref.string_of vdi]))) vms_vdis) ;

We now consider all of the VDIs associated with the snapshots. As for the VM’s VBDs above, we end up with a vdi_mirror list. Note we pass false to the allow_mirror parameter of the vdi_filter function as none of these snapshot VDIs will ever require mirroring.

  let snapshots_vdis = List.filter_map (vdi_filter __context false) snapshots_vbds in

Finally we get all of the suspend-image VDIs from all snapshots as well as the actual VM, since it might be suspended itself:

  let suspends_vdis =
    List.fold_left
      (fun acc vm ->
         if Db.VM.get_power_state ~__context ~self:vm = `Suspended
         then
           let vdi = Db.VM.get_suspend_VDI ~__context ~self:vm in
           let sr = Db.VDI.get_SR ~__context ~self:vdi in
           if is_intra_pool && Helpers.host_has_pbd_for_sr ~__context ~host:remote.dest_host ~sr
           then acc
           else (get_vdi_mirror __context vm vdi false):: acc
         else acc)
      [] vm_and_snapshots in

Sanity check that we can see all of the suspend-image VDIs on this host:

 (* Double check that all of the suspend VDIs are all visible on the source *)
  List.iter (fun vdi_mirror ->
      let sr = Db.VDI.get_SR ~__context ~self:vdi_mirror.vdi in
      if not (Helpers.host_has_pbd_for_sr ~__context ~host:localhost ~sr)
      then raise (Api_errors.Server_error (Api_errors.suspend_image_not_accessible, [ Ref.string_of vdi_mirror.vdi ]))) suspends_vdis;

Next is a fairly complex piece that determines the destination SR for all of these VDIs. We don’t require API users to decide destinations for all of the VDIs on snapshots and hence we have to make some decisions here:

  let dest_pool = List.hd (XenAPI.Pool.get_all remote.rpc remote.session) in
  let default_sr_ref =
    XenAPI.Pool.get_default_SR remote.rpc remote.session dest_pool in
  let suspend_sr_ref =
    let pool_suspend_SR = XenAPI.Pool.get_suspend_image_SR remote.rpc remote.session dest_pool
    and host_suspend_SR = XenAPI.Host.get_suspend_image_sr remote.rpc remote.session remote.dest_host in
    if pool_suspend_SR <> Ref.null then pool_suspend_SR else host_suspend_SR in

  (* Resolve placement of unspecified VDIs here - unspecified VDIs that
            are 'snapshot_of' a specified VDI go to the same place. suspend VDIs
            that are unspecified go to the suspend_sr_ref defined above *)

  let extra_vdis = suspends_vdis @ snapshots_vdis in

  let extra_vdi_map =
    List.map
      (fun vconf ->
         let dest_sr_ref =
           let is_mapped = List.mem_assoc vconf.vdi vdi_map
           and snapshot_of_is_mapped = List.mem_assoc vconf.snapshot_of vdi_map
           and is_suspend_vdi = List.mem vconf suspends_vdis
           and remote_has_suspend_sr = suspend_sr_ref <> Ref.null
           and remote_has_default_sr = default_sr_ref <> Ref.null in
           let log_prefix =
             Printf.sprintf "Resolving VDI->SR map for VDI %s:" (Db.VDI.get_uuid ~__context ~self:vconf.vdi) in
           if is_mapped then begin
             debug "%s VDI has been specified in the map" log_prefix;
             List.assoc vconf.vdi vdi_map
           end else if snapshot_of_is_mapped then begin
             debug "%s Snapshot VDI has entry in map for it's snapshot_of link" log_prefix;
             List.assoc vconf.snapshot_of vdi_map
           end else if is_suspend_vdi && remote_has_suspend_sr then begin
             debug "%s Mapping suspend VDI to remote suspend SR" log_prefix;
             suspend_sr_ref
           end else if is_suspend_vdi && remote_has_default_sr then begin
             debug "%s Remote suspend SR not set, mapping suspend VDI to remote default SR" log_prefix;
             default_sr_ref
           end else if remote_has_default_sr then begin
             debug "%s Mapping unspecified VDI to remote default SR" log_prefix;
             default_sr_ref
           end else begin
             error "%s VDI not in VDI->SR map and no remote default SR is set" log_prefix;
             raise (Api_errors.Server_error(Api_errors.vdi_not_in_map, [ Ref.string_of vconf.vdi ]))
           end in
         (vconf.vdi, dest_sr_ref))
      extra_vdis in

At the end of this we’ve got all of the VDIs that need to be copied and destinations for all of them:

  let vdi_map = vdi_map @ extra_vdi_map in
  let all_vdis = vms_vdis @ extra_vdis in

  (* The vdi_map should be complete at this point - it should include all the
     VDIs in the all_vdis list. *)

Now we gather some final information together:

  assert_no_cbt_enabled_vdi_migrated ~__context ~vdi_map;

  let dbg = Context.string_of_task __context in
  let open Xapi_xenops_queue in
  let queue_name = queue_of_vm ~__context ~self:vm in
  let module XenopsAPI = (val make_client queue_name : XENOPS) in

  let remote_vdis = ref [] in

  let ha_always_run_reset = not is_intra_pool && Db.VM.get_ha_always_run ~__context ~self:vm in

  let cd_vbds = find_cds_to_eject __context vdi_map vbds in
  eject_cds __context cd_vbds;

Here we check there’s no CBT (we can’t currently migrate the CBT metadata), make our client to talk to Xenopsd, make a mutable list of remote VDIs (which I think is redundant right now), decide whether we need to do anything for HA (we disable HA protection for this VM on the destination until it’s fully migrated) and eject any CDs from the VM.

Up until now this has mostly been gathering info (aside from the ejecting CDs bit), but now we’ll start to do some actions, so we begin a try-catch block:

try

but we’ve still got a bit of thinking to do: we sort the VDIs to copy based on age/size:

    (* Sort VDIs by size in principle and then age secondly. This gives better
       chances that similar but smaller VDIs would arrive comparatively
       earlier, which can serve as base for incremental copying the larger
       ones. *)
    let compare_fun v1 v2 =
      let r = Int64.compare v1.size v2.size in
      if r = 0 then
        let t1 = Date.to_unix_time (Db.VDI.get_snapshot_time ~__context ~self:v1.vdi) in
        let t2 = Date.to_unix_time (Db.VDI.get_snapshot_time ~__context ~self:v2.vdi) in
        compare t1 t2
      else r in
    let all_vdis = all_vdis |> List.sort compare_fun in

    let total_size = List.fold_left (fun acc vconf -> Int64.add acc vconf.size) 0L all_vdis in
    let so_far = ref 0L in
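
The effect of this ordering can be seen in a small standalone sketch. The vdi record and the sample values are made up for the example; only the comparison logic (size first, then snapshot time) follows the code above.

```ocaml
(* Hypothetical, minimal VDI description for the example. *)
type vdi = {name: string; size: int64; snapshot_time: float}

(* Smaller VDIs first; among equal sizes, older snapshots first. *)
let compare_vdi v1 v2 =
  match Int64.compare v1.size v2.size with
  | 0 ->
      Float.compare v1.snapshot_time v2.snapshot_time
  | r ->
      r

let example =
  [ {name= "a"; size= 20L; snapshot_time= 5.}
  ; {name= "b"; size= 10L; snapshot_time= 9.}
  ; {name= "c"; size= 10L; snapshot_time= 2.} ]
  |> List.sort compare_vdi
  |> List.map (fun v -> v.name)
(* example = ["c"; "b"; "a"] *)
```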

OK, let’s copy/mirror:

    with_many (vdi_copy_fun __context dbg vdi_map remote is_intra_pool remote_vdis so_far total_size copy) all_vdis @@ fun all_map ->

The copy functions are written such that they take continuations. This is to make the error handling simpler: each individual component function can perform its setup and execute the continuation. In the event of an exception coming from the continuation it can then unroll its bit of state and rethrow the exception for the next layer to handle.

with_many is a simple helper function for nesting invocations of functions that take continuations. It has the delightful type:

('a -> ('b -> 'c) -> 'c) -> 'a list -> ('b list -> 'c) -> 'c

(* Helper function to apply a 'with_x' function to a list *)
let rec with_many withfn many fn =
  let rec inner l acc =
    match l with
    | [] -> fn acc
    | x::xs -> withfn x (fun y -> inner xs (y::acc))
  in inner many []

As an example of its operation, imagine our withfn is as follows:

let withfn x c =
  Printf.printf "Starting withfn: x=%d\n" x;
  try
    c (string_of_int x)
  with e ->
    Printf.printf "Handling exception for x=%d\n" x;
    raise e;;

applying this gives the output:

utop # with_many withfn [1;2;3;4] (String.concat ",");;
Starting withfn: x=1
Starting withfn: x=2
Starting withfn: x=3
Starting withfn: x=4
- : string = "4,3,2,1"

whereas raising an exception in the continuation results in the following:

utop # with_many withfn [1;2;3;4] (fun _ -> failwith "error");;
Starting withfn: x=1
Starting withfn: x=2
Starting withfn: x=3
Starting withfn: x=4
Handling exception for x=4
Handling exception for x=3
Handling exception for x=2
Handling exception for x=1
Exception: Failure "error".

All the real action is in vdi_copy_fun, which copies or mirrors a single VDI:

let vdi_copy_fun __context dbg vdi_map remote is_intra_pool remote_vdis so_far total_size copy vconf continuation =
  TaskHelper.exn_if_cancelling ~__context;
  let open Storage_access in
  let dest_sr_ref = List.assoc vconf.vdi vdi_map in
  let dest_sr_uuid = XenAPI.SR.get_uuid remote.rpc remote.session dest_sr_ref in

  (* Plug the destination shared SR into destination host and pool master if unplugged.
     Plug the local SR into destination host only if unplugged *)
  let dest_pool = List.hd (XenAPI.Pool.get_all remote.rpc remote.session) in
  let master_host = XenAPI.Pool.get_master remote.rpc remote.session dest_pool in
  let pbds = XenAPI.SR.get_PBDs remote.rpc remote.session dest_sr_ref in
  let pbd_host_pair = List.map (fun pbd -> (pbd, XenAPI.PBD.get_host remote.rpc remote.session pbd)) pbds in
  let hosts_to_be_attached = [master_host; remote.dest_host] in
  let pbds_to_be_plugged = List.filter (fun (_, host) ->
      (List.mem host hosts_to_be_attached) && (XenAPI.Host.get_enabled remote.rpc remote.session host)) pbd_host_pair in
  List.iter (fun (pbd, _) ->
      if not (XenAPI.PBD.get_currently_attached remote.rpc remote.session pbd) then
        XenAPI.PBD.plug remote.rpc remote.session pbd) pbds_to_be_plugged;

It begins by attempting to ensure the SRs we require are definitely attached on the destination host and on the destination pool master.
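
The selection of PBDs to plug can be expressed as a pure filter. In this sketch the pbd record and its fields are hypothetical simplifications; the real code obtains each value through a XenAPI call on the destination pool.

```ocaml
(* Hypothetical flattened view of a PBD and the host it belongs to. *)
type pbd = {host: string; host_enabled: bool; currently_attached: bool}

(* Keep the PBDs of the pool master and the destination host, provided the
   host is enabled, and of those only the ones not yet attached. *)
let pbds_to_plug ~master ~dest_host pbds =
  pbds
  |> List.filter (fun p ->
         List.mem p.host [master; dest_host] && p.host_enabled )
  |> List.filter (fun p -> not p.currently_attached)
```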

There’s now a little logic to support the case where we have cross-pool SRs and the VDI is already visible to the destination pool. Since this is outside our normal support envelope there is a key in xapi_globs that has to be set (via xapi.conf) to enable this:

  let rec dest_vdi_exists_on_sr vdi_uuid sr_ref retry =
    try
      let dest_vdi_ref = XenAPI.VDI.get_by_uuid remote.rpc remote.session vdi_uuid in
      let dest_vdi_sr_ref = XenAPI.VDI.get_SR remote.rpc remote.session dest_vdi_ref in
      if dest_vdi_sr_ref = sr_ref then
        true
      else
        false
    with _ ->
      if retry then
        begin
          XenAPI.SR.scan remote.rpc remote.session sr_ref;
          dest_vdi_exists_on_sr vdi_uuid sr_ref false
        end
      else
        false
  in

  (* CP-4498 added an unsupported mode to use cross-pool shared SRs - the initial
     use case is for a shared raw iSCSI SR (same uuid, same VDI uuid) *)
  let vdi_uuid = Db.VDI.get_uuid ~__context ~self:vconf.vdi in
  let mirror = if !Xapi_globs.relax_xsm_sr_check then
      if (dest_sr_uuid = vconf.sr) then
        begin
          (* Check if the VDI uuid already exists in the target SR *)
          if (dest_vdi_exists_on_sr vdi_uuid dest_sr_ref true) then
            false
          else
            failwith ("SR UUID matches on destination but VDI does not exist")
        end
      else
        true
    else
      (not is_intra_pool) || (dest_sr_uuid <> vconf.sr)
  in

The check also covers the case where we’re doing an intra-pool migration and not copying all of the disks, in which case we don’t need to do anything for that disk.
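
The decision encoded in the mirror flag can be restated as a standalone function. This is a sketch under assumptions: the function and parameter names are invented for the example, and the remote lookup is modelled as a thunk because the real code only performs it in one branch.

```ocaml
(* Returns true when the VDI's data must be transferred to the destination,
   false when the VDI is already usable where it is. *)
let needs_transfer ~relax_check ~is_intra_pool ~same_sr_uuid ~vdi_exists_on_dest
    =
  if relax_check then
    if same_sr_uuid then
      if vdi_exists_on_dest () then false
      else failwith "SR UUID matches on destination but VDI does not exist"
    else true
  else (not is_intra_pool) || not same_sr_uuid
```

In the default configuration (relax_check is false), the only case that needs no transfer is an intra-pool migration where the destination SR is the VDI's current SR.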

We now have a wrapper function that creates a new datapath and passes it to a continuation function. On error it handles the destruction of the datapath:

let with_new_dp cont =
    let dp = Printf.sprintf (if vconf.do_mirror then "mirror_%s" else "copy_%s") vconf.dp in
    try cont dp
    with e ->
      (try SMAPI.DP.destroy ~dbg ~dp ~allow_leak:false with _ -> info "Failed to cleanup datapath: %s" dp);
      raise e in

and now a helper that, given a remote VDI uuid, looks up the reference on the remote host and gives it to a continuation function. On failure of the continuation it will destroy the remote VDI:

  let with_remote_vdi remote_vdi cont =
    debug "Executing remote scan to ensure VDI is known to xapi";
    XenAPI.SR.scan remote.rpc remote.session dest_sr_ref;
    let query = Printf.sprintf "(field \"location\"=\"%s\") and (field \"SR\"=\"%s\")" remote_vdi (Ref.string_of dest_sr_ref) in
    let vdis = XenAPI.VDI.get_all_records_where remote.rpc remote.session query in
    let remote_vdi_ref = match vdis with
      | [] -> raise (Api_errors.Server_error(Api_errors.vdi_location_missing, [Ref.string_of dest_sr_ref; remote_vdi]))
      | h :: [] -> debug "Found remote vdi reference: %s" (Ref.string_of (fst h)); fst h
      | _ -> raise (Api_errors.Server_error(Api_errors.location_not_unique, [Ref.string_of dest_sr_ref; remote_vdi])) in
    try cont remote_vdi_ref
    with e ->
      (try XenAPI.VDI.destroy remote.rpc remote.session remote_vdi_ref with _ -> error "Failed to destroy remote VDI");
      raise e in

another helper to gather together info about a mirrored VDI:

let get_mirror_record ?new_dp remote_vdi remote_vdi_reference =
    { mr_dp = new_dp;
      mr_mirrored = mirror;
      mr_local_sr = vconf.sr;
      mr_local_vdi = vconf.location;
      mr_remote_sr = dest_sr_uuid;
      mr_remote_vdi = remote_vdi;
      mr_local_xenops_locator = vconf.xenops_locator;
      mr_remote_xenops_locator = Xapi_xenops.xenops_vdi_locator_of_strings dest_sr_uuid remote_vdi;
      mr_local_vdi_reference = vconf.vdi;
      mr_remote_vdi_reference = remote_vdi_reference } in

and finally the really important function:

let mirror_to_remote new_dp =
    let task =
      if not vconf.do_mirror then
        SMAPI.DATA.copy ~dbg ~sr:vconf.sr ~vdi:vconf.location ~dp:new_dp ~url:remote.sm_url ~dest:dest_sr_uuid
      else begin
        (* Though we have no intention of "write", here we use the same mode as the
           associated VBD on a mirrored VDIs (i.e. always RW). This avoids problem
           when we need to start/stop the VM along the migration. *)
        let read_write = true in
        (* DP set up is only essential for MIRROR.start/stop due to their open ended pattern.
           It's not necessary for copy which will take care of that itself. *)
        ignore(SMAPI.VDI.attach ~dbg ~dp:new_dp ~sr:vconf.sr ~vdi:vconf.location ~read_write);
        SMAPI.VDI.activate ~dbg ~dp:new_dp ~sr:vconf.sr ~vdi:vconf.location;
        ignore(Storage_access.register_mirror __context vconf.location);
        SMAPI.DATA.MIRROR.start ~dbg ~sr:vconf.sr ~vdi:vconf.location ~dp:new_dp ~url:remote.sm_url ~dest:dest_sr_uuid
      end in

    let mapfn x =
      let total = Int64.to_float total_size in
      let done_ = Int64.to_float !so_far /. total in
      let remaining = Int64.to_float vconf.size /. total in
      done_ +. x *. remaining in

    let open Storage_access in

    let task_result =
      task |> register_task __context
      |> add_to_progress_map mapfn
      |> wait_for_task dbg
      |> remove_from_progress_map
      |> unregister_task __context
      |> success_task dbg in

    let mirror_id, remote_vdi =
      if not vconf.do_mirror then
        let vdi = task_result |> vdi_of_task dbg in
        remote_vdis := vdi.vdi :: !remote_vdis;
        None, vdi.vdi
      else
        let mirrorid = task_result |> mirror_of_task dbg in
        let m = SMAPI.DATA.MIRROR.stat ~dbg ~id:mirrorid in
        Some mirrorid, m.Mirror.dest_vdi in

    so_far := Int64.add !so_far vconf.size;
    debug "Local VDI %s %s to %s" vconf.location (if vconf.do_mirror then "mirrored" else "copied") remote_vdi;
    mirror_id, remote_vdi in

This is the bit that actually starts the mirroring or copying. Before the call to MIRROR.start we call VDI.attach and VDI.activate locally, so that if the VM is shut down, the detach/deactivate performed there doesn’t kill the mirroring process.

Note the parameters to the SMAPI call: sr and vdi, which locate the local VDI and SM backend; new_dp, the datapath we’re using for the mirroring; url, the remote URL on which SMAPI calls are made; and dest, the destination SR uuid. These are also the arguments to copy above.

There’s a little function to calculate the overall progress of the task, and the function waits until the completion of the task before it continues. The function success_task will raise an exception if the task failed. For DATA.MIRROR.start, completion implies both that the disk data has been copied to the destination and that all local writes are being mirrored to the destination. Hence more cleanup must be done on cancellation. In contrast, if the DATA.copy path had been taken then the operation at this point has completely finished.
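The progress calculation can be looked at in isolation. The following is a minimal sketch rather than the xapi code itself: the function name overall_progress and its labelled arguments are made up for the example, and stand in for the total_size, so_far and vconf.size values that mapfn captures from its environment.

```ocaml
(* Hypothetical standalone version of mapfn: maps the progress [x]
   (0.0 to 1.0) of the current VDI's task onto the progress of the
   whole migration. *)
let overall_progress ~total_size ~so_far ~vdi_size x =
  let total = Int64.to_float total_size in
  (* Fraction already finished by previously processed VDIs *)
  let done_ = Int64.to_float so_far /. total in
  (* Fraction of the whole that the current VDI accounts for *)
  let share = Int64.to_float vdi_size /. total in
  done_ +. (x *. share)

let () =
  (* 25 of 100 bytes already transferred; the current 50-byte VDI is half done *)
  Printf.printf "%.2f\n"
    (overall_progress ~total_size:100L ~so_far:25L ~vdi_size:50L 0.5)
```

Each VDI therefore contributes to the overall progress in proportion to its size, and the progress reported for one VDI starts where the previous one left off.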

The result of this function is an optional mirror id and the remote VDI uuid.

Next, there is a post_mirror function:

  let post_mirror mirror_id mirror_record =
    try
      let result = continuation mirror_record in
      (match mirror_id with
       | Some mid -> ignore(Storage_access.unregister_mirror mid);
       | None -> ());
      if mirror && not (Xapi_fist.storage_motion_keep_vdi () || copy) then
        Helpers.call_api_functions ~__context (fun rpc session_id ->
            XenAPI.VDI.destroy rpc session_id vconf.vdi);
      result
    with e ->
      let mirror_failed =
        match mirror_id with
        | Some mid ->
          ignore(Storage_access.unregister_mirror mid);
          let m = SMAPI.DATA.MIRROR.stat ~dbg ~id:mid in
          (try SMAPI.DATA.MIRROR.stop ~dbg ~id:mid with _ -> ());
          m.Mirror.failed
        | None -> false in
      if mirror_failed then raise (Api_errors.Server_error(Api_errors.mirror_failed,[Ref.string_of vconf.vdi]))
      else raise e in

This is poorly named - it is post mirror and copy. The aim of this function is to destroy the source VDIs on successful completion of the continuation function, which will have migrated the VM to the destination. In its exception handler it will stop the mirroring, but before doing so it will check to see if the mirroring process it was looking after has itself failed, and raise mirror_failed if so. This is because a failed mirror can result in a range of actual errors, and we decide here that the failed mirror was probably the root cause.
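The "prefer the mirror failure as the root cause" logic can be sketched on its own. This is an illustration under assumptions, not the xapi implementation: Mirror_failed is a hypothetical stand-in for the Api_errors.mirror_failed server error, and mirror_has_failed is a hypothetical callback standing in for stopping the mirror and reading m.Mirror.failed.

```ocaml
(* Hypothetical stand-in for the Api_errors.mirror_failed server error *)
exception Mirror_failed of string

(* [mirror_has_failed] is an assumed callback that reports whether the
   mirror itself failed; [None] models the plain copy case, where there
   is no mirror to check. *)
let run_with_mirror_check ~vdi ~mirror_has_failed continuation =
  try continuation ()
  with e ->
    let failed =
      match mirror_has_failed with
      | Some check -> check ()
      | None -> false
    in
    (* A failed mirror is assumed to be the root cause, so it takes
       precedence over whatever exception the continuation raised. *)
    if failed then raise (Mirror_failed vdi) else raise e
```

If the mirror is healthy, the original exception is passed through unchanged, so unrelated failures are not misreported as mirror failures.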

These functions are assembled together at the end of the vdi_copy_fun function:

   if mirror then
    with_new_dp (fun new_dp ->
        let mirror_id, remote_vdi = mirror_to_remote new_dp in
        with_remote_vdi remote_vdi (fun remote_vdi_ref ->
            let mirror_record = get_mirror_record ~new_dp remote_vdi remote_vdi_ref in
            post_mirror mirror_id mirror_record))
  else
    let mirror_record = get_mirror_record vconf.location (XenAPI.VDI.get_by_uuid remote.rpc remote.session vdi_uuid) in
    continuation mirror_record

Again, mirror here is poorly named, and means mirror or copy.

Once all of the disks have been mirrored or copied, we jump back to the body of migrate_send. We split apart the mirror records according to the source of the VDI:

      let was_from vmap = List.exists (fun vconf -> vconf.vdi = vmap.mr_local_vdi_reference) in

      let suspends_map, snapshots_map, vdi_map = List.fold_left (fun (suspends, snapshots, vdis) vmap ->
          if was_from vmap suspends_vdis then  vmap :: suspends, snapshots, vdis
          else if was_from vmap snapshots_vdis then suspends, vmap :: snapshots, vdis
          else suspends, snapshots, vmap :: vdis
        ) ([],[],[]) all_map in

Then we reassemble all_map from this, for some reason:

    let all_map = List.concat [suspends_map; snapshots_map; vdi_map] in
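The split-and-reassemble step can be tried out on plain data. The sketch below is a simplification under assumptions: the origin type and the classify argument are hypothetical, and play the role of the was_from tests against suspends_vdis and snapshots_vdis.

```ocaml
(* Hypothetical classification of a mirror record by the origin of its VDI *)
type origin = Suspend | Snapshot | Disk

(* Split [records] into three lists in a single pass. [classify] is an
   assumed helper that decides which group a record belongs to. *)
let split_by_origin classify records =
  List.fold_left
    (fun (suspends, snapshots, disks) r ->
      match classify r with
      | Suspend -> (r :: suspends, snapshots, disks)
      | Snapshot -> (suspends, r :: snapshots, disks)
      | Disk -> (suspends, snapshots, r :: disks))
    ([], [], []) records

(* Reassemble a single list, grouped by origin *)
let regroup (suspends, snapshots, disks) =
  List.concat [suspends; snapshots; disks]
```

In this sketch, prepending with :: leaves each group in reverse order relative to the input, and the concatenation yields a list grouped by origin. Whether the real code depends on that ordering is not something the walkthrough establishes.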

Now we need to update the snapshot-of links:

     (* All the disks and snapshots have been created in the remote SR(s),
       * so update the snapshot links if there are any snapshots. *)
      if snapshots_map <> [] then
        update_snapshot_info ~__context ~dbg ~url:remote.sm_url ~vdi_map ~snapshots_map;

I’m not entirely sure why this is done in this layer as opposed to in the storage layer.

A little housekeeping:

     let xenops_vdi_map = List.map (fun mirror_record -> (mirror_record.mr_local_xenops_locator, mirror_record.mr_remote_xenops_locator)) all_map in

      (* Wait for delay fist to disappear *)
      wait_for_fist __context Xapi_fist.pause_storage_migrate "pause_storage_migrate";

      TaskHelper.exn_if_cancelling ~__context;

The fist point here simply allows tests to insert a delay at this specific point.

We also check the task to see if we’ve been cancelled and raise an exception if so.

The VM metadata is now imported into the remote pool, with all the XenAPI level objects remapped:

let new_vm =
        if is_intra_pool
        then vm
        else
          (* Make sure HA replanning cycle won't occur right during the import process or immediately after *)
          let () = if ha_always_run_reset then XenAPI.Pool.ha_prevent_restarts_for ~rpc:remote.rpc ~session_id:remote.session ~seconds:(Int64.of_float !Xapi_globs.ha_monitor_interval) in
          (* Move the xapi VM metadata to the remote pool. *)
          let vms =
            let vdi_map =
              List.map (fun mirror_record -> {
                    local_vdi_reference = mirror_record.mr_local_vdi_reference;
                    remote_vdi_reference = Some mirror_record.mr_remote_vdi_reference;
                  })
                all_map in
            let vif_map =
              List.map (fun (vif, network) -> {
                    local_vif_reference = vif;
                    remote_network_reference = network;
                  })
                vif_map in
            let vgpu_map =
              List.map (fun (vgpu, gpu_group) -> {
                    local_vgpu_reference = vgpu;
                    remote_gpu_group_reference = gpu_group;
                  })
                vgpu_map
            in
            inter_pool_metadata_transfer ~__context ~remote ~vm ~vdi_map
              ~vif_map ~vgpu_map ~dry_run:false ~live:true ~copy
          in
          let vm = List.hd vms in
          let () = if ha_always_run_reset then XenAPI.VM.set_ha_always_run ~rpc:remote.rpc ~session_id:remote.session ~self:vm ~value:false in
          (* Reserve resources for the new VM on the destination pool's host *)
          let () = XenAPI.Host.allocate_resources_for_vm remote.rpc remote.session remote.dest_host vm true in
          vm in

More waiting for fist points:

     wait_for_fist __context Xapi_fist.pause_storage_migrate2 "pause_storage_migrate2";

      (* Attach networks on remote *)
      XenAPI.Network.attach_for_vm ~rpc:remote.rpc ~session_id:remote.session ~host:remote.dest_host ~vm:new_vm;

We also make sure all the networks are plugged for the VM on the destination. Next we create the xenopsd-level vif map, equivalent to the vdi_map above:

  (* Create the vif-map for xenops, linking VIF devices to bridge names on the remote *)
      let xenops_vif_map =
        let vifs = XenAPI.VM.get_VIFs ~rpc:remote.rpc ~session_id:remote.session ~self:new_vm in
        List.map (fun vif ->
            let vifr = XenAPI.VIF.get_record ~rpc:remote.rpc ~session_id:remote.session ~self:vif in
            let bridge = Xenops_interface.Network.Local
                (XenAPI.Network.get_bridge ~rpc:remote.rpc ~session_id:remote.session ~self:vifr.API.vIF_network) in
            vifr.API.vIF_device, bridge
          ) vifs
      in

Now we destroy any extra mirror datapaths we set up previously:

     (* Destroy the local datapaths - this allows the VDIs to properly detach, invoking the migrate_finalize calls *)
      List.iter (fun mirror_record ->
          if mirror_record.mr_mirrored
          then match mirror_record.mr_dp with | Some dp ->  SMAPI.DP.destroy ~dbg ~dp ~allow_leak:false | None -> ()) all_map;

More housekeeping:

    SMPERF.debug "vm.migrate_send: migration initiated vm:%s" vm_uuid;

      (* In case when we do SXM on the same host (mostly likely a VDI
         migration), the VM's metadata in xenopsd will be in-place updated
         as soon as the domain migration starts. For these case, there
         will be no (clean) way back from this point. So we disable task
         cancellation for them here.
       *)
      if is_same_host then (TaskHelper.exn_if_cancelling ~__context; TaskHelper.set_not_cancellable ~__context);

Finally we get to the memory-image part of the migration:

      (* It's acceptable for the VM not to exist at this point; shutdown commutes with storage migrate *)
      begin
        try
          Xapi_xenops.Events_from_xenopsd.with_suppressed queue_name dbg vm_uuid
            (fun () ->
               let xenops_vgpu_map = (* can raise VGPU_mapping *)
                 infer_vgpu_map ~__context ~remote new_vm in
               migrate_with_retry
                 ~__context queue_name dbg vm_uuid xenops_vdi_map
                 xenops_vif_map xenops_vgpu_map remote.xenops_url;
               Xapi_xenops.Xenopsd_metadata.delete ~__context vm_uuid)
        with
        | Xenops_interface.Does_not_exist ("VM",_)
        | Xenops_interface.Does_not_exist ("extra",_) ->
          info "%s: VM %s stopped being live during migration"
            "vm_migrate_send" vm_uuid
        | VGPU_mapping(msg) ->
          info "%s: VM %s - can't infer vGPU map: %s"
            "vm_migrate_send" vm_uuid msg;
          raise Api_errors.
                  (Server_error
                     (vm_migrate_failed,
                      ([ vm_uuid
                       ; Helpers.get_localhost_uuid ()
                       ; Db.Host.get_uuid ~__context ~self:remote.dest_host
                       ; "The VM changed its power state during migration"
                       ])))
      end;

      debug "Migration complete";
      SMPERF.debug "vm.migrate_send: migration complete vm:%s" vm_uuid;

Now we tidy up after ourselves:

      (* So far the main body of migration is completed, and the rests are
         updates, config or cleanup on the source and destination. There will
         be no (clean) way back from this point, due to these destructive
         changes, so we don't want user intervention e.g. task cancellation.
       *)
      TaskHelper.exn_if_cancelling ~__context;
      TaskHelper.set_not_cancellable ~__context;
      XenAPI.VM.pool_migrate_complete remote.rpc remote.session new_vm remote.dest_host;

      detach_local_network_for_vm ~__context ~vm ~destination:remote.dest_host;
      Xapi_xenops.refresh_vm ~__context ~self:vm;

The function pool_migrate_complete is called on the destination host, and consists of a few things that ordinarily would be set up during VM.start or the like:

let pool_migrate_complete ~__context ~vm ~host =
  let id = Db.VM.get_uuid ~__context ~self:vm in
  debug "VM.pool_migrate_complete %s" id;
  let dbg = Context.string_of_task __context in
  let queue_name = Xapi_xenops_queue.queue_of_vm ~__context ~self:vm in
  if Xapi_xenops.vm_exists_in_xenopsd queue_name dbg id then begin
    Cpuid_helpers.update_cpu_flags ~__context ~vm ~host;
    Xapi_xenops.set_resident_on ~__context ~self:vm;
    Xapi_xenops.add_caches id;
    Xapi_xenops.refresh_vm ~__context ~self:vm;
    Monitor_dbcalls_cache.clear_cache_for_vm ~vm_uuid:id
  end

More tidying up, remapping some remaining VBDs and clearing state on the sender:

      (* Those disks that were attached at the point the migration happened will have been
         remapped by the Events_from_xenopsd logic. We need to remap any other disks at
         this point here *)

      if is_intra_pool
      then
        List.iter
          (fun vm' ->
             intra_pool_vdi_remap ~__context vm' all_map;
             intra_pool_fix_suspend_sr ~__context remote.dest_host vm')
          vm_and_snapshots;

      (* If it's an inter-pool migrate, the VBDs will still be 'currently-attached=true'
         because we suppressed the events coming from xenopsd. Destroy them, so that the
         VDIs can be destroyed *)
      if not is_intra_pool && not copy
      then List.iter (fun vbd -> Db.VBD.destroy ~__context ~self:vbd) (vbds @ snapshots_vbds);

      new_vm
    in

The remark about the Events_from_xenopsd is that we have a thread watching for events that are emitted by xenopsd, and we resynchronise xapi’s state according to xenopsd’s state for several fields for which xenopsd is considered the canonical source of truth. One of these is the exact VDI the VBD is associated with.

The suspend_SR field of the VM is set to the source’s value, so we reset that.

Now we move the RRDs:

  if not copy then begin
      Rrdd_proxy.migrate_rrd ~__context ~remote_address:remote.remote_ip ~session_id:(Ref.string_of remote.session)
        ~vm_uuid:vm_uuid ~host_uuid:(Ref.string_of remote.dest_host) ()
    end;

This can be done for intra- and inter-pool migrates in the same way, simplifying the logic.

However, messages and blobs only need to be migrated for inter-pool migrations:

   if not is_intra_pool && not copy then begin
      (* Replicate HA runtime flag if necessary *)
      if ha_always_run_reset then XenAPI.VM.set_ha_always_run ~rpc:remote.rpc ~session_id:remote.session ~self:new_vm ~value:true;
      (* Send non-database metadata *)
      Xapi_message.send_messages ~__context ~cls:`VM ~obj_uuid:vm_uuid
        ~session_id:remote.session ~remote_address:remote.remote_master_ip;
      Xapi_blob.migrate_push ~__context ~rpc:remote.rpc
        ~remote_address:remote.remote_master_ip ~session_id:remote.session ~old_vm:vm ~new_vm ;
      (* Signal the remote pool that we're done *)
    end;

Lastly, we destroy the VM record on the source:

    Helpers.call_api_functions ~__context (fun rpc session_id ->
        if not is_intra_pool && not copy then begin
          info "Destroying VM ref=%s uuid=%s" (Ref.string_of vm) vm_uuid;
          Xapi_vm_lifecycle.force_state_reset ~__context ~self:vm ~value:`Halted;
          List.iter (fun self -> Db.VM.destroy ~__context ~self) vm_and_snapshots
        end);
    SMPERF.debug "vm.migrate_send exiting vm:%s" vm_uuid;
    new_vm

The exception handler still has to clean some state, but mostly things are handled in the CPS functions declared above:

with e ->
    error "Caught %s: cleaning up" (Printexc.to_string e);

    (* We do our best to tidy up the state left behind *)
    Events_from_xenopsd.with_suppressed queue_name dbg vm_uuid (fun () ->
        try
          let _, state = XenopsAPI.VM.stat dbg vm_uuid in
          if Xenops_interface.(state.Vm.power_state = Suspended) then begin
            debug "xenops: %s: shutting down suspended VM" vm_uuid;
            Xapi_xenops.shutdown ~__context ~self:vm None;
          end;
        with _ -> ());

    if not is_intra_pool && Db.is_valid_ref __context vm then begin
      List.map (fun self -> Db.VM.get_uuid ~__context ~self) vm_and_snapshots
      |> List.iter (fun self ->
          try
            let vm_ref = XenAPI.VM.get_by_uuid remote.rpc remote.session self in
            info "Destroying stale VM uuid=%s on destination host" self;
            XenAPI.VM.destroy remote.rpc remote.session vm_ref
          with e -> error "Caught %s while destroying VM uuid=%s on destination host" (Printexc.to_string e) self)
    end;

    let task = Context.get_task_id __context in
    let oc = Db.Task.get_other_config ~__context ~self:task in
    if List.mem_assoc "mirror_failed" oc then begin
      let failed_vdi = List.assoc "mirror_failed" oc in
      let vconf = List.find (fun vconf -> vconf.location=failed_vdi) vms_vdis in
      debug "Mirror failed for VDI: %s" failed_vdi;
      raise (Api_errors.Server_error(Api_errors.mirror_failed,[Ref.string_of vconf.vdi]))
    end;
    TaskHelper.exn_if_cancelling ~__context;
    begin match e with
      | Storage_interface.Backend_error(code, params) -> raise (Api_errors.Server_error(code, params))
      | Storage_interface.Unimplemented(code) -> raise (Api_errors.Server_error(Api_errors.unimplemented_in_sm_backend, [code]))
      | Xenops_interface.Cancelled _ -> TaskHelper.raise_cancelled ~__context
      | _ -> raise e
    end

Failures during the migration can result in the VM being in a suspended state. There’s no point leaving it like this since there’s nothing that can be done to resume it, so we force shut it down.

We also try to remove the VM record from the destination if we managed to send it there.

Finally we check for mirror failure in the task - this is set by the events thread watching for events from the storage layer, in storage_access.ml.

Storage code

The part of the code that is conceptually in the storage layer, but physically in xapi, is located in storage_migrate.ml. There are logically a few separate parts to this file:

A stateful module for persisting state across xapi restarts.
Some general helper functions
Some quite specific helper functions related to actions to be taken on deactivate/detach
An NBD handler
The implementations of the SMAPIv2 mirroring APIs

Let’s start by considering the way the storage APIs are intended to be used.

Copying a VDI

DATA.copy takes several parameters:

dbg - a debug string
sr - the source SR (a uuid)
vdi - the source VDI (a uuid)
dp - unused
url - a URL on which SMAPIv2 API calls can be made
dest - the destination SR (a uuid) to which the VDI should be copied

and returns a value of type Task.id. The API call is intended to be called in an asynchronous fashion - i.e., the caller makes the call, receives the task ID back and polls or uses the event mechanism to wait until the task has completed. The task may be cancelled via the TASK.cancel API call. The result of the operation is obtained by calling TASK.stat, which returns a record:

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
		backtrace: string;
	}

Where the state field contains the result once the task has completed:

type async_result_t =
	| Vdi_info of vdi_info
	| Mirror_id of Mirror.id

type completion_t = {
	duration : float;
	result : async_result_t option
}

type state =
	| Pending of float
	| Completed of completion_t
	| Failed of Rpc.t

Once the result has been obtained from the task, the task should be destroyed via the TASK.destroy API call.
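The calling pattern described above (start the call, wait for the task to leave the Pending state, read the result, destroy the task) can be sketched as follows. This is a simplified model under assumptions, not the real client: the types are cut-down copies of the ones above, and stat, destroy and wait are hypothetical callbacks standing in for TASK.stat, TASK.destroy and whatever polling or event mechanism the caller uses. Destroying the task on failure as well as on success is also an assumption of the sketch.

```ocaml
(* Simplified copies of the types above: vdi_info and Mirror.id are
   reduced to strings, and Rpc.t to a plain error string. *)
type async_result_t = Vdi_info of string | Mirror_id of string

type completion_t = {duration: float; result: async_result_t option}

type state = Pending of float | Completed of completion_t | Failed of string

(* [stat] and [destroy] are hypothetical callbacks standing in for
   TASK.stat and TASK.destroy; [wait] stands in for sleeping or for
   blocking on an event. *)
let rec await_task ~stat ~destroy ~wait id =
  match stat id with
  | Pending _ ->
      wait () ;
      await_task ~stat ~destroy ~wait id
  | Completed c ->
      (* Read the result first, then release the task *)
      destroy id ;
      Ok c.result
  | Failed msg ->
      destroy id ;
      Error msg
```

A caller would typically pattern match on the returned value: Ok (Some (Vdi_info _)) for a copy, Ok (Some (Mirror_id _)) for a mirror.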

The implementation uses the url parameter to make SMAPIv2 calls to the destination SR. This is used, for example, to invoke a VDI.create call if necessary. The URL contains an authentication token within it (valid for the duration of the XenAPI call that caused this DATA.copy API call).

The implementation tries to minimize the amount of data copied by looking for related VDIs on the destination SR. See below for more details.

Mirroring a VDI

DATA.MIRROR.start takes a similar set of parameters to that of copy:

dbg - a debug string
sr - the source SR (a uuid)
vdi - the source VDI (a uuid)
dp - the datapath on which the VDI has been attached
url - a URL on which SMAPIv2 API calls can be made
dest - the destination SR (a uuid) to which the VDI should be mirrored

Similar to copy above, this returns a task id. The task ‘completes’ once the mirror has been set up - that is, at any point afterwards we can detach the disk and the destination disk will be identical to the source. Unlike for copy the operation is ongoing after the API call completes, since new writes need to be mirrored to the destination. Therefore the completion type of the mirror operation is Mirror_id which contains a handle on which further API calls related to the mirror call can be made. For example MIRROR.stat whose signature is:

MIRROR.stat: dbg:debug_info -> id:Mirror.id -> Mirror.t

The return type of this call is a record containing information about the mirror:

type state =
	| Receiving
	| Sending
	| Copying

type t = {
	source_vdi : vdi;
	dest_vdi : vdi;
	state : state list;
	failed : bool;
}

Note that state is a list since the initial phase of the operation requires both copying and mirroring.
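Because state is a list, a client inspects it with membership tests rather than a single comparison. The helpers below are hypothetical and only illustrate that point. They assume that Copying is dropped from the list once the initial bulk copy has finished, which the text above implies but does not state outright, and they reduce the vdi type to a string.

```ocaml
type state = Receiving | Sending | Copying

type t = {
  source_vdi : string;  (* vdi is simplified to a string here *)
  dest_vdi : string;
  state : state list;
  failed : bool;
}

(* Hypothetical helper: true while the initial bulk copy is still running
   (assumes Copying is removed from the list when the copy finishes) *)
let is_copying m = List.mem Copying m.state

(* Hypothetical helper: true once only write mirroring remains and
   nothing has failed *)
let is_synchronised m = (not m.failed) && not (is_copying m)
```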

Additionally the mirror can be cancelled using the MIRROR.stop API call.

Code walkthrough

Let’s go through the implementation of copy:

DATA.copy

let copy ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
  debug "copy sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in

Here we are constructing a module Remote on which we can do SMAPIv2 calls directly on the destination.

try

Wrap the whole function in an exception handler.

    (* Find the local VDI *)
    let vdis = Local.SR.scan ~dbg ~sr in
    let local_vdi =
      try List.find (fun x -> x.vdi = vdi) vdis
      with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

We first find the metadata for our source VDI by doing a local SMAPIv2 call SR.scan. This returns a list of VDI metadata, out of which we extract the VDI we’re interested in.

try

Another exception handler. This looks redundant to me right now.

      let similar_vdis = Local.VDI.similar_content ~dbg ~sr ~vdi in
      let similars = List.map (fun vdi -> vdi.content_id) similar_vdis in
      debug "Similar VDIs to %s = [ %s ]" vdi (String.concat "; " (List.map (fun x -> Printf.sprintf "(vdi=%s,content_id=%s)" x.vdi x.content_id) similar_vdis));

Here we look for related VDIs locally using the VDI.similar_content SMAPIv2 API call. This searches for related VDIs and returns an ordered list where the most similar is first in the list. It returns both clones and snapshots, and hence is more general than simply following snapshot_of links.

      let remote_vdis = Remote.SR.scan ~dbg ~sr:dest in
      (** We drop cbt_metadata VDIs that do not have any actual data *)
      let remote_vdis = List.filter (fun vdi -> vdi.ty <> "cbt_metadata") remote_vdis in

      let nearest = List.fold_left
          (fun acc content_id -> match acc with
             | Some x -> acc
             | None ->
               try Some (List.find (fun vdi -> vdi.content_id = content_id && vdi.virtual_size <= local_vdi.virtual_size) remote_vdis)
               with Not_found -> None) None similars in

      debug "Nearest VDI: content_id=%s vdi=%s"
        (Opt.default "None" (Opt.map (fun x -> x.content_id) nearest))
        (Opt.default "None" (Opt.map (fun x -> x.vdi) nearest));

Here we look for VDIs on the destination with the same content_id as one of the locally similar VDIs. We will use this as a base image and only copy deltas to the destination. This is done by cloning the VDI on the destination and then using sparse_dd to find the deltas from our local disk to our local copy of the content_id disk and streaming these to the destination. Note that we need to ensure the remote VDI is no larger than the one we want to copy, since we can’t resize disks downwards in size.
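The search can be expressed without the fold. The following sketch is equivalent in intent but is not the code from storage_migrate.ml: the vdi_info record is cut down to the three fields the search uses, and the function name and labelled arguments are chosen for the example.

```ocaml
(* Reduced VDI metadata: only the fields the search needs *)
type vdi_info = {vdi: string; content_id: string; virtual_size: int64}

(* The content_id list is ordered most-similar first. Return the first
   remote VDI whose content_id matches one of them and which is no
   larger than the local VDI. *)
let rec nearest ~local_size ~remote_vdis = function
  | [] -> None
  | content_id :: rest -> (
      let usable v =
        v.content_id = content_id && v.virtual_size <= local_size
      in
      match List.find_opt usable remote_vdis with
      | Some _ as found -> found
      | None -> nearest ~local_size ~remote_vdis rest)
```

Because the list of similar content IDs is walked in order, a usable match for a more similar VDI always wins over a match for a less similar one.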

      let remote_base = match nearest with
        | Some vdi ->
          debug "Cloning VDI %s" vdi.vdi;
          let vdi_clone = Remote.VDI.clone ~dbg ~sr:dest ~vdi_info:vdi in
          if vdi_clone.virtual_size <> local_vdi.virtual_size then begin
            let new_size = Remote.VDI.resize ~dbg ~sr:dest ~vdi:vdi_clone.vdi ~new_size:local_vdi.virtual_size in
            debug "Resize remote VDI %s to %Ld: result %Ld" vdi_clone.vdi local_vdi.virtual_size new_size;
          end;
          vdi_clone
        | None ->
          debug "Creating a blank remote VDI";
          Remote.VDI.create ~dbg ~sr:dest ~vdi_info:{ local_vdi with sm_config = [] }  in

If we’ve found a base VDI we clone it and resize it immediately. If there’s nothing on the destination already we can use, we just create a new VDI. Note that the calls to create and clone may well fail if the destination host is not the SRmaster. This is handled purely in the rpc function:

let rec rpc ~srcstr ~dststr url call =
  let result = XMLRPC_protocol.rpc ~transport:(transport_of_url url)
      ~srcstr ~dststr ~http:(xmlrpc ~version:"1.0" ?auth:(Http.Url.auth_of url) ~query:(Http.Url.get_query_params url) (Http.Url.get_uri url)) call
  in
  if not result.Rpc.success then begin
    debug "Got failure: checking for redirect";
    debug "Call was: %s" (Rpc.string_of_call call);
    debug "result.contents: %s" (Jsonrpc.to_string result.Rpc.contents);
    match Storage_interface.Exception.exnty_of_rpc result.Rpc.contents with
    | Storage_interface.Exception.Redirect (Some ip) ->
      let open Http.Url in
      let newurl =
        match url with
        | (Http h, d) ->
          (Http {h with host=ip}, d)
        | _ ->
          remote_url ip in
      debug "Redirecting to ip: %s" ip;
      let r = rpc ~srcstr ~dststr newurl call in
      debug "Successfully redirected. Returning";
      r
    | _ ->
      debug "Not a redirect";
      result
  end
  else result

Back to the copy function:

      let remote_copy = copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi:remote_base.vdi |> vdi_info in

This calls the actual data copy part. See below for more on that.

      let snapshot = Remote.VDI.snapshot ~dbg ~sr:dest ~vdi_info:remote_copy in
      Remote.VDI.destroy ~dbg ~sr:dest ~vdi:remote_copy.vdi;
      Some (Vdi_info snapshot)

Finally we snapshot the remote VDI to ensure we’ve got a VDI of type ‘snapshot’ on the destination, and we delete the non-snapshot VDI.

    with e ->
      error "Caught %s: copying snapshots vdi" (Printexc.to_string e);
      raise (Internal_error (Printexc.to_string e))
  with
  | Backend_error(code, params)
  | Api_errors.Server_error(code, params) ->
    raise (Backend_error(code, params))
  | e ->
    raise (Internal_error(Printexc.to_string e))

The exception handler does nothing - so we leak remote VDIs if the exception happens after we’ve done our cloning :-(

DATA.copy_into

Let’s now look at the data-copying part. This is common code shared between DATA.copy, DATA.copy_into and MIRROR.start and hence has some duplication of the calls made above.

let copy_into ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi =
  copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi

copy_into is a stub and just calls copy'.

let copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi =
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in
  debug "copy local=%s/%s url=%s remote=%s/%s" sr vdi url dest dest_vdi;

This call takes roughly the same parameters as the DATA.copy call above, except it specifies the destination VDI. Once again we construct a module to do remote SMAPIv2 calls.

  (* Check the remote SR exists *)
  let srs = Remote.SR.list ~dbg in
  if not(List.mem dest srs)
  then failwith (Printf.sprintf "Remote SR %s not found" dest);

Sanity check.

  let vdis = Remote.SR.scan ~dbg ~sr:dest in
  let remote_vdi =
    try List.find (fun x -> x.vdi = dest_vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Remote VDI %s not found" dest_vdi)
  in

Find the metadata of the destination VDI.

  let dest_content_id = remote_vdi.content_id in

If we’ve got a local VDI with the same content_id as the destination, we only need to copy the deltas, so we make a note of the destination content ID here.

  (* Find the local VDI *)
  let vdis = Local.SR.scan ~dbg ~sr in
  let local_vdi =
    try List.find (fun x -> x.vdi = vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

  debug "copy local=%s/%s content_id=%s" sr vdi local_vdi.content_id;
  debug "copy remote=%s/%s content_id=%s" dest dest_vdi remote_vdi.content_id;

Find the source VDI metadata.

  if local_vdi.virtual_size > remote_vdi.virtual_size then begin
    (* This should never happen provided the higher-level logic is working properly *)
    error "copy local=%s/%s virtual_size=%Ld > remote=%s/%s virtual_size = %Ld" sr vdi local_vdi.virtual_size dest dest_vdi remote_vdi.virtual_size;
    failwith "local VDI is larger than the remote VDI";
  end;

Sanity check - the remote VDI can’t be smaller than the source.

  let on_fail : (unit -> unit) list ref = ref [] in

We do some ugly error handling here by keeping a mutable list of operations to perform in the event of a failure.
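The pattern of a mutable list of cleanup actions can be sketched as follows. This shows how such a list is typically used, not how copy' uses it: with_cleanup and register are hypothetical names, and the body given here for perform_cleanup_actions is an assumption about its behaviour (best effort, ignoring failures of individual actions).

```ocaml
(* Run every registered cleanup action, ignoring failures of individual
   actions so that one broken cleanup does not prevent the others.
   This best-effort behaviour is an assumption of the sketch. *)
let perform_cleanup_actions actions =
  List.iter (fun f -> try f () with _ -> ()) actions

(* Hypothetical wrapper: [body] receives a [register] function for adding
   cleanup actions, which are run (most recent first) only if [body]
   raises. *)
let with_cleanup body =
  let on_fail : (unit -> unit) list ref = ref [] in
  let register f = on_fail := f :: !on_fail in
  try body register
  with e ->
    perform_cleanup_actions !on_fail ;
    raise e
```

Each step that creates something on the remote side would register the matching undo action straight after it succeeds, so that a later failure unwinds exactly the steps that were completed.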

  let base_vdi =
    try
      let x = (List.find (fun x -> x.content_id = dest_content_id) vdis).vdi in
      debug "local VDI %s has content_id = %s; we will perform an incremental copy" x dest_content_id;
      Some x
    with _ ->
      debug "no local VDI has content_id = %s; we will perform a full copy" dest_content_id;
      None
  in

See if we can identify a local VDI with the same content_id as the destination. If not, no problem.

  try
    let remote_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in
    let base_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in
    let leaf_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in

Construct some datapaths - named reasons why the VDI is attached - that we will pass to VDI.attach/activate.

    let dest_vdi_url = Http.Url.set_uri remote_url (Printf.sprintf "%s/nbd/%s/%s/%s" (Http.Url.get_uri remote_url) dest dest_vdi remote_dp) |> Http.Url.to_string in

    debug "copy remote=%s/%s NBD URL = %s" dest dest_vdi dest_vdi_url;

Here we are constructing a URI that we use to connect to the destination xapi. The handler for this particular path will verify the credentials and then pass the connection on to tapdisk, which will behave as an NBD server. The VDI has to be attached and activated for this to work, unlike the new NBD handler in xapi-nbd, which is smarter. The handler for this URI is declared in this file.
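The shape of the resulting path is easier to see with the URL handling stripped away. In this sketch nbd_uri is a hypothetical helper that only performs the string formatting; the real code also goes through Http.Url to keep the rest of the remote URL intact. The argument values in the usage line are made-up placeholders.

```ocaml
(* Hypothetical helper: append the NBD path components (destination SR,
   destination VDI and remote datapath) to the URI of the remote
   SMAPIv2 URL. *)
let nbd_uri ~base_uri ~dest_sr ~dest_vdi ~remote_dp =
  Printf.sprintf "%s/nbd/%s/%s/%s" base_uri dest_sr dest_vdi remote_dp

let () =
  (* Placeholder values, for illustration only *)
  print_endline
    (nbd_uri ~base_uri:"/base" ~dest_sr:"sr1" ~dest_vdi:"vdi1" ~remote_dp:"dp1")
```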

    let id=State.copy_id_of (sr,vdi) in
    debug "Persisting state for copy (id=%s)" id;
    State.add id State.(Copy_op Copy_state.({
        base_dp; leaf_dp; remote_dp; dest_sr=dest; copy_vdi=remote_vdi.vdi; remote_url=url}));

Since we’re about to perform a long-running operation that is stateful, we persist the state here so that if xapi is restarted we can cancel the operation and not leak VDI attaches. Normally in xapi code we would be doing VBD.plug operations to persist the state in the xapi db, but this is storage code so we have to use a different mechanism.

    SMPERF.debug "mirror.copy: copy initiated local_vdi:%s dest_vdi:%s" vdi dest_vdi;

    Pervasiveext.finally (fun () ->
        debug "activating RW datapath %s on remote=%s/%s" remote_dp dest dest_vdi;
        ignore(Remote.VDI.attach ~dbg ~sr:dest ~vdi:dest_vdi ~dp:remote_dp ~read_write:true);
        Remote.VDI.activate ~dbg ~dp:remote_dp ~sr:dest ~vdi:dest_vdi;

        with_activated_disk ~dbg ~sr ~vdi:base_vdi ~dp:base_dp
          (fun base_path ->
             with_activated_disk ~dbg ~sr ~vdi:(Some vdi) ~dp:leaf_dp
               (fun src ->
                  let dd = Sparse_dd_wrapper.start ~progress_cb:(progress_callback 0.05 0.9 task) ?base:base_path true (Opt.unbox src)
                      dest_vdi_url remote_vdi.virtual_size in
                  Storage_task.with_cancel task
                    (fun () -> Sparse_dd_wrapper.cancel dd)
                    (fun () ->
                       try Sparse_dd_wrapper.wait dd
                       with Sparse_dd_wrapper.Cancelled -> Storage_task.raise_cancelled task)
               )
          );
      )
      (fun () ->
         Remote.DP.destroy ~dbg ~dp:remote_dp ~allow_leak:false;
         State.remove_copy id
      );

In this chunk of code we attach and activate the disk on the remote SR via the SMAPI, then locally attach and activate both the VDI we’re copying and the base image we’re copying deltas from (if we’ve got one). We then call sparse_dd to copy the data to the remote NBD URL. There is some logic to update progress indicators and to cancel the operation if the SMAPIv2 call TASK.cancel is called.

Once the operation has terminated (whether through success, error or cancellation), we remove the local attach and activations in the with_activated_disk function, and the remote attach and activation by destroying the datapath on the remote SR. We then remove the persistent state relating to the copy.
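The cleanup guarantee described here can be sketched with the standard library's Fun.protect (OCaml 4.08 or later), which plays a role similar to Pervasiveext.finally. This is only an illustration: the recorded steps are hypothetical stand-ins for the SMAPIv2 calls, and copy_with_cleanup is not a function from storage_migrate.ml.

```ocaml
(* Hypothetical stand-ins: record each step instead of calling SMAPIv2. *)
let log : string list ref = ref []
let record s = log := s :: !log

let copy_with_cleanup ~(copy : unit -> unit) =
  Fun.protect
    ~finally:(fun () ->
      (* Runs on success, error and cancellation alike. *)
      record "destroy remote datapath";
      record "remove persisted copy state")
    (fun () ->
      record "attach+activate remote";
      copy ())

let () =
  (* Simulate a failing copy: the cleanup still runs, then the error
     reaches the caller. *)
  (try copy_with_cleanup ~copy:(fun () -> failwith "sparse_dd failed")
   with Failure _ -> record "error propagated");
  List.iter print_endline (List.rev !log)
```

The cleanup steps are recorded before the error reaches the caller, which is the ordering the real code relies on to avoid leaking the remote attach.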

    SMPERF.debug "mirror.copy: copy complete local_vdi:%s dest_vdi:%s" vdi dest_vdi;

    debug "setting remote=%s/%s content_id <- %s" dest dest_vdi local_vdi.content_id;
    Remote.VDI.set_content_id ~dbg ~sr:dest ~vdi:dest_vdi ~content_id:local_vdi.content_id;
    (* PR-1255: XXX: this is useful because we don't have content_ids by default *)
    debug "setting local=%s/%s content_id <- %s" sr local_vdi.vdi local_vdi.content_id;
    Local.VDI.set_content_id ~dbg ~sr ~vdi:local_vdi.vdi ~content_id:local_vdi.content_id;
    Some (Vdi_info remote_vdi)

The last thing we do is to set the local and remote content_id. The local set_content_id is there because, if the content_id of a VDI is unset, the storage_access.ml module of xapi (still part of the storage layer) constructs it from the location.

  with e ->
    error "Caught %s: performing cleanup actions" (Printexc.to_string e);
    perform_cleanup_actions !on_fail;
    raise e

Here we perform the list of cleanup operations. Theoretically. It seems we don’t ever actually set this to anything, so this is dead code.

DATA.MIRROR.start

let start' ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
  debug "Mirror.start sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  SMPERF.debug "mirror.start called sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in

  (* Find the local VDI *)
  let vdis = Local.SR.scan ~dbg ~sr in
  let local_vdi =
    try List.find (fun x -> x.vdi = vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

As with the previous calls, we make a remote module for SMAPIv2 calls on the destination, and we find the local VDI metadata via SR.scan.

  let id = State.mirror_id_of (sr,local_vdi.vdi) in

Mirror ids are deterministically constructed.
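A minimal sketch of what "deterministically constructed" can mean in practice. It assumes the id is simply the SR and VDI joined by a slash; the real State.mirror_id_of may use a different format, and of_mirror_id is included only to show that such an id can be taken apart again.

```ocaml
(* Assumption: ids are the SR and VDI joined by '/', so the same pair
   always yields the same id. The real State.mirror_id_of may differ. *)
let mirror_id_of (sr, vdi) = Printf.sprintf "%s/%s" sr vdi

(* Inverse, assuming the VDI part contains no '/'. *)
let of_mirror_id id =
  match String.rindex_opt id '/' with
  | Some i ->
    (String.sub id 0 i, String.sub id (i + 1) (String.length id - i - 1))
  | None -> invalid_arg "of_mirror_id"
```

Because the id depends only on the (sr, vdi) pair, any code that knows the pair can recompute it without storing it, which is what the detach and deactivate hooks do with State.mirror_id_of.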

  (* A list of cleanup actions to perform if the operation should fail. *)
  let on_fail : (unit -> unit) list ref = ref [] in

This on_fail list is actually used.

  try
    let similar_vdis = Local.VDI.similar_content ~dbg ~sr ~vdi in
    let similars = List.filter (fun x -> x <> "") (List.map (fun vdi -> vdi.content_id) similar_vdis) in
    debug "Similar VDIs to %s = [ %s ]" vdi (String.concat "; " (List.map (fun x -> Printf.sprintf "(vdi=%s,content_id=%s)" x.vdi x.content_id) similar_vdis));

As with copy we look locally for similar VDIs. However, rather than use that here we actually pass this information on to the destination SR via the receive_start internal SMAPIv2 call:

    let result_ty = Remote.DATA.MIRROR.receive_start ~dbg ~sr:dest ~vdi_info:local_vdi ~id ~similar:similars in
    let result = match result_ty with
        Mirror.Vhd_mirror x -> x
    in

This gives the destination SR a chance to say what sort of migration it can support. We only support Vhd_mirror style migrations which require the destination to support the compose SMAPIv2 operation. The type of x is a record:

type mirror_receive_result_vhd_t = {
	mirror_vdi : vdi_info;
	mirror_datapath : dp;
	copy_diffs_from : content_id option;
	copy_diffs_to : vdi;
	dummy_vdi : vdi;
}

Field descriptions:

mirror_vdi is the VDI to which new writes should be mirrored.
mirror_datapath is the remote datapath on which the VDI has been attached and activated. This is required to construct the remote NBD URL.
copy_diffs_from represents the source base VDI to be used for the non-mirrored data copy.
copy_diffs_to is the remote VDI to copy those diffs to.
dummy_vdi exists to prevent leaf-coalesce on the mirror_vdi.

    (* Enable mirroring on the local machine *)
    let mirror_dp = result.Mirror.mirror_datapath in

    let uri = (Printf.sprintf "/services/SM/nbd/%s/%s/%s" dest result.Mirror.mirror_vdi.vdi mirror_dp) in
    let dest_url = Http.Url.set_uri remote_url uri in
    let request = Http.Request.make ~query:(Http.Url.get_query_params dest_url) ~version:"1.0" ~user_agent:"smapiv2" Http.Put uri in
    let transport = Xmlrpc_client.transport_of_url dest_url in

This is where we connect to the NBD server on the destination.

    debug "Searching for data path: %s" dp;
    let attach_info = Local.DP.attach_info ~dbg:"nbd" ~sr ~vdi ~dp in
    debug "Got it!";

We need the local attach_info to find the local tapdisk so we can send it the connected NBD socket.

    on_fail := (fun () -> Remote.DATA.MIRROR.receive_cancel ~dbg ~id) :: !on_fail;

This cleanup action should probably be registered directly after the call to receive_start.

    let tapdev = match tapdisk_of_attach_info attach_info with
      | Some tapdev ->
        debug "Got tapdev";
        let pid = Tapctl.get_tapdisk_pid tapdev in
        let path = Printf.sprintf "/var/run/blktap-control/nbdclient%d" pid in
        with_transport transport (with_http request (fun (response, s) ->
            debug "Here inside the with_transport";
            let control_fd = Unix.socket Unix.PF_UNIX Unix.SOCK_STREAM 0 in
            finally
              (fun () ->
                 debug "Connecting to path: %s" path;
                 Unix.connect control_fd (Unix.ADDR_UNIX path);
                 let msg = dp in
                 let len = String.length msg in
                 let written = Unixext.send_fd control_fd msg 0 len [] s in
                 debug "Sent fd";
                 if written <> len then begin
                   error "Failed to transfer fd to %s" path;
                   failwith "foo"
                 end)
              (fun () ->
                 Unix.close control_fd)));
        tapdev
      | None ->
        failwith "Not attached"
    in

Here we connect to the remote NBD server, then pass that connected fd to the local tapdisk that is using the disk. This fd is passed with a name that is later used to tell tapdisk to start using it - we use the datapath name for this.

    debug "Adding to active local mirrors: id=%s" id;
    let alm = State.Send_state.({
        url;
        dest_sr=dest;
        remote_dp=mirror_dp;
        local_dp=dp;
        mirror_vdi=result.Mirror.mirror_vdi.vdi;
        remote_url=url;
        tapdev;
        failed=false;
        watchdog=None}) in
    State.add id (State.Send_op alm);
    debug "Added";

As with copy, we persist some state to disk to say that we’re doing a mirror so we can undo any state changes after a toolstack restart.

    debug "About to snapshot VDI = %s" (string_of_vdi_info local_vdi);
    let local_vdi = add_to_sm_config local_vdi "mirror" ("nbd:" ^ dp) in
    let local_vdi = add_to_sm_config local_vdi "base_mirror" id in
    let snapshot =
    try
      Local.VDI.snapshot ~dbg ~sr ~vdi_info:local_vdi
    with
    | Storage_interface.Backend_error(code, _) when code = "SR_BACKEND_FAILURE_44" ->
      raise (Api_errors.Server_error(Api_errors.sr_source_space_insufficient, [ sr ]))
    | e ->
      raise e
    in
    debug "Done!";

    SMPERF.debug "mirror.start: snapshot created, mirror initiated vdi:%s snapshot_of:%s"
      snapshot.vdi local_vdi.vdi ;

    on_fail := (fun () -> Local.VDI.destroy ~dbg ~sr ~vdi:snapshot.vdi) :: !on_fail;

This bit inserts into sm_config the name of the fd we passed earlier to do mirroring. This is interpreted by the Python SM backends and passed on in the tap-ctl invocation that unpauses the disk. This causes all new writes to be mirrored via NBD to the file descriptor passed earlier.

    begin
      let rec inner () =
        debug "tapdisk watchdog";
        let alm_opt = State.find_active_local_mirror id in
        match alm_opt with
        | Some alm ->
          let stats = Tapctl.stats (Tapctl.create ()) tapdev in
          if stats.Tapctl.Stats.nbd_mirror_failed = 1 then
            Updates.add (Dynamic.Mirror id) updates;
          alm.State.Send_state.watchdog <- Some (Scheduler.one_shot scheduler (Scheduler.Delta 5) "tapdisk_watchdog" inner)
        | None -> ()
      in inner ()
    end;

This is the watchdog that runs tap-ctl stats every 5 seconds, watching nbd_mirror_failed for evidence of a failure in the mirroring code. If it detects one, the only thing it does is notify that the state of the mirroring has changed. This will be picked up by the thread in xapi that is monitoring the state of the mirror. It will then issue a MIRROR.stat call, which will return the state of the mirror, including the information that it has failed.
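The self-rescheduling shape of the watchdog can be sketched as follows. The scheduler here is a plain queue of pending jobs and the stats source is faked, so every name in the block is hypothetical; the real code uses Scheduler.one_shot with a 5 second delay and reads Tapctl.stats.

```ocaml
(* Simulated scheduler: a queue of pending one-shot jobs (an assumption;
   the real code uses Scheduler.one_shot with a 5 second delay). *)
let pending : (unit -> unit) Queue.t = Queue.create ()
let one_shot f = Queue.push f pending

let notifications = ref 0
let mirror_active = ref true

(* Hypothetical stats source: reports failure from the third poll on. *)
let polls = ref 0
let mirror_failed () =
  incr polls;
  !polls >= 3

let rec watchdog () =
  if !mirror_active then begin
    (* On failure the watchdog only notifies; it takes no other action. *)
    if mirror_failed () then incr notifications;
    (* Re-arm for the next tick. *)
    one_shot watchdog
  end

let () =
  watchdog ();
  (* Run four more ticks, then drop the mirror so the watchdog stops. *)
  for _i = 1 to 4 do (Queue.pop pending) () done;
  mirror_active := false;
  (Queue.pop pending) ();
  Printf.printf "polls=%d notifications=%d pending=%d\n"
    !polls !notifications (Queue.length pending)
```

Once the mirror is no longer found, the job does not re-arm itself, which mirrors the None branch of the real watchdog.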

    on_fail := (fun () -> stop ~dbg ~id) :: !on_fail;
    (* Copy the snapshot to the remote *)
    let new_parent = Storage_task.with_subtask task "copy" (fun () ->
        copy' ~task ~dbg ~sr ~vdi:snapshot.vdi ~url ~dest ~dest_vdi:result.Mirror.copy_diffs_to) |> vdi_info in
    debug "Local VDI %s == remote VDI %s" snapshot.vdi new_parent.vdi;

This is where we copy the VDI returned by the snapshot invocation to the remote VDI called copy_diffs_to. We only copy deltas, but we rely on copy' to figure out which disk the deltas should be taken from, which it does via the content_id field.

    Remote.VDI.compose ~dbg ~sr:dest ~vdi1:result.Mirror.copy_diffs_to ~vdi2:result.Mirror.mirror_vdi.vdi;
    Remote.VDI.remove_from_sm_config ~dbg ~sr:dest ~vdi:result.Mirror.mirror_vdi.vdi ~key:"base_mirror";
    debug "Local VDI %s now mirrored to remote VDI: %s" local_vdi.vdi result.Mirror.mirror_vdi.vdi;

Once the copy has finished we invoke the compose SMAPIv2 call that composes the diffs from the mirror with the base image copied from the snapshot.

    debug "Destroying dummy VDI %s on remote" result.Mirror.dummy_vdi;
    Remote.VDI.destroy ~dbg ~sr:dest ~vdi:result.Mirror.dummy_vdi;
    debug "Destroying snapshot %s on src" snapshot.vdi;
    Local.VDI.destroy ~dbg ~sr ~vdi:snapshot.vdi;

    Some (Mirror_id id)

We can now destroy the dummy VDI on the remote (which will cause a leaf-coalesce in due course), and we destroy the local snapshot here (which will also cause a leaf-coalesce in due course, providing we don’t destroy it first). The return value from the function is the mirror_id that we can use to monitor the state or cancel the mirror.

  with
  | Sr_not_attached(sr_uuid) ->
    error " Caught exception %s:%s. Performing cleanup." Api_errors.sr_not_attached sr_uuid;
    perform_cleanup_actions !on_fail;
    raise (Api_errors.Server_error(Api_errors.sr_not_attached,[sr_uuid]))
  | e ->
    error "Caught %s: performing cleanup actions" (Api_errors.to_string e);
    perform_cleanup_actions !on_fail;
    raise e

The exception handler just cleans up afterwards.
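The on_fail list works as a stack of undo actions: each step that succeeds pushes its own undo, and the handler runs whatever has accumulated. The sketch below assumes that a failing cleanup step is ignored so the remaining ones still run; that is an assumption about perform_cleanup_actions, not a copy of it, and the action names are only labels.

```ocaml
(* Sketch: run every registered action, most recently registered first,
   and keep going if one of them raises. *)
let perform_cleanup_actions (actions : (unit -> unit) list) =
  List.iter (fun f -> try f () with _ -> ()) actions

let trace : string list ref = ref []
let on_fail : (unit -> unit) list ref = ref []

(* Hypothetical helper: register an action that just records its name. *)
let register name =
  on_fail := (fun () -> trace := name :: !trace) :: !on_fail

let () =
  register "receive_cancel";
  register "destroy snapshot";
  on_fail := (fun () -> failwith "cleanup step failed") :: !on_fail;
  register "stop mirror";
  perform_cleanup_actions !on_fail;
  List.iter print_endline (List.rev !trace)
```

Because new actions are consed onto the front of the list, cleanup happens in the reverse order of setup.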

This is not the end of the story, since we need to detach the remote datapath being used for mirroring when we detach this end. The hook function is in storage_migrate.ml:

let post_detach_hook ~sr ~vdi ~dp =
  let open State.Send_state in
  let id = State.mirror_id_of (sr,vdi) in
  State.find_active_local_mirror id |>
  Opt.iter (fun r ->
      let remote_url = Http.Url.of_string r.url in
      let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in
      let t = Thread.create (fun () ->
          debug "Calling receive_finalize";
          log_and_ignore_exn
            (fun () -> Remote.DATA.MIRROR.receive_finalize ~dbg:"Mirror-cleanup" ~id);
          debug "Finished calling receive_finalize";
          State.remove_local_mirror id;
          debug "Removed active local mirror: %s" id
        ) () in
      Opt.iter (fun id -> Scheduler.cancel scheduler id) r.watchdog;
      debug "Created thread %d to call receive finalize and dp destroy" (Thread.id t))

This removes the persistent state and calls receive_finalize on the destination. The body of that function is:

let receive_finalize ~dbg ~id =
  let recv_state = State.find_active_receive_mirror id in
  let open State.Receive_state in Opt.iter (fun r -> Local.DP.destroy ~dbg ~dp:r.leaf_dp ~allow_leak:false) recv_state;
  State.remove_receive_mirror id

which removes the persistent state on the destination and destroys the datapath associated with the mirror.

There is also a pre-deactivate hook. The rationale for this is that we want to detect any failures to write that occur right at the end of the SXM process. So if there is a mirror operation going on, before we deactivate we wait for tapdisk to flush its queue of outstanding requests, then we query whether there has been a mirror failure. The code is just above the detach hook in storage_migrate.ml:

let pre_deactivate_hook ~dbg ~dp ~sr ~vdi =
  let open State.Send_state in
  let id = State.mirror_id_of (sr,vdi) in
  let start = Mtime_clock.counter () in
  let get_delta () = Mtime_clock.count start |> Mtime.Span.to_s in
  State.find_active_local_mirror id |>
  Opt.iter (fun s ->
      try
        (* We used to pause here and then check the nbd_mirror_failed key. Now, we poll
           until the number of outstanding requests has gone to zero, then check the
           status. This avoids confusing the backend (CA-128460) *)
        let open Tapctl in
        let ctx = create () in
        let rec wait () =
          if get_delta () > reqs_outstanding_timeout then raise Timeout;
          let st = stats ctx s.tapdev in
          if st.Stats.reqs_outstanding > 0
          then (Thread.delay 1.0; wait ())
          else st
        in
        let st = wait () in
        debug "Got final stats after waiting %f seconds" (get_delta ());
        if st.Stats.nbd_mirror_failed = 1
        then begin
          error "tapdisk reports mirroring failed";
          s.failed <- true
        end;
      with
      | Timeout ->
        error "Timeout out after %f seconds waiting for tapdisk to complete all outstanding requests" (get_delta ());
        s.failed <- true
      | e ->
        error "Caught exception while finally checking mirror state: %s"
          (Printexc.to_string e);
        s.failed <- true
    )
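The polling logic of the hook can be isolated into a small sketch. To keep it deterministic, time is counted in polls rather than wall-clock seconds, and the stats source is passed in as a function; wait_for_drain and failed_after are hypothetical names, not part of storage_migrate.ml.

```ocaml
exception Timeout

(* Poll [outstanding] until it reports zero requests or more than
   [timeout] polls have elapsed. Returns the number of polls waited. *)
let wait_for_drain ~timeout ~(outstanding : unit -> int) =
  let rec wait elapsed =
    if elapsed > timeout then raise Timeout;
    if outstanding () > 0 then wait (elapsed + 1) else elapsed
  in
  wait 0

(* The mirror counts as failed if tapdisk says so after draining, or if
   the queue never drains within the timeout. *)
let failed_after ~timeout ~outstanding ~(mirror_failed : unit -> bool) =
  match wait_for_drain ~timeout ~outstanding with
  | _ -> mirror_failed ()
  | exception Timeout -> true
```

As in the hook itself, a timeout is treated the same way as an explicit mirror failure: the safe assumption is that some writes may not have reached the destination.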

The server is generic because it does not accept fd passing; I call servers that do accept it “special” NBD servers/fd receivers.

XAPI requests walk-throughs

Let’s walk through how XAPI handles an XML-RPC request. The first document uses VM migration as an example of such a request.

How does a migration request go through the Xen API?

From RPC migration request to xapi internals

Overview

In this document we will use the VM.pool_migrate request to illustrate the interaction between various components within the XAPI toolstack during migration. However, the same scheme applies to other requests as well.

Not all parts of the Xapi toolstack are shown here, as not all are involved in the migration process. For instance, you won’t see squeezed or mpathalert, two daemons that belong to the toolstack but don’t participate in the migration of a VM.

Anatomy of a VM migration

Migration is initiated by a Xapi client that sends VM.pool_migrate, an XML-RPC request.
The Xen API server handles this request and dispatches it to the server.
The server is generated using the XAPI IDL, and requests are wrapped within a context, either to be forwarded to a host or executed locally. Broadly, the context follows RBAC rules. The executed function is related to the message of the request (refer to the XenAPI Reference).
In the case of migration, you can refer to ocaml/idl/datamodel_vm.ml.
The server dispatches the operation to the server helpers, which execute the operation synchronously or asynchronously and return the RPC answer.
Message forwarding decides whether the operation must be executed by another host of the pool, in which case it forwards the call, or whether it is executed locally.
When executed locally, the high-level migration operation is sent to the Xenopsd daemon by posting a message on a known queue on the message switch.
Xenopsd gets the command and splits it into several atomic operations that are run by the xenopsd backend.
Xenopsd, with its backend, can then access xenstore or execute hypercalls to interact with Xen and serve the micro-operation.

A diagram is worth a thousand words

flowchart TD

    %% First we are starting by a XAPI client that is sending an XML-RPC request
    client((Xapi client)) -. sends RPC XML request .->
        xapi_server{"`Dispatch RPC
                    **api_server.ml**`"}
    style client stroke:#CAFEEE,stroke-width:4px

    %% XAPI Toolstack internals
    subgraph "Xapi Toolstack (master of the pool)"
        style server stroke:#BAFA00,stroke-width:4px,stroke-dasharray: 5 5

            xapi_server --dispatch call (ie VM.pool_migrate)--> server("`Auto generated using *IDL*
                    **server.ml**`")

            server --do_dispatch (ie VM.pool_migrate)--> server_helpers["`server helpers
            **server_helpers.ml**`"]

            server_helpers -- call management (ie xapi_vm_migrate.ml)--> message_forwarding["`check where to run the call **message_forwarding.ml**`"]

            message_forwarding -- execute locally --> vm_management["`VM Mgmt
            like **xapi_vm_migrate.ml**`"]

            vm_management -- Call --> xapi_xenops["`Transform xenops
            see (**xapi_xenops.ml**)`"]
                xapi_xenops <-- Post following IDL model (see xenops_interface.ml) --> msg_switch


        subgraph "Message Switch Daemon"
            msg_switch[["Queues"]]
        end

        subgraph "Xenopsd Daemon"
            msg_switch <-- Push/Pop on org.xen.xapi.xenopsd.classic --> xenopsd_server

            xenopsd_server["`Xenopsd *frontend*
            get & split high level operation into atomics`"]  o-- linked at compile time --o xenopsd_backend
        end
    end

    %% Xenopsd backend is accessing xen and xenstore
    xenopsd_backend["`Xenopsd *backend*
    Backend XC (libxenctrl)`"] -. access to .-> xen_hypervisor["Xen hypervisor & xenstore"]
    style xen_hypervisor stroke:#BEEF00,stroke-width:2px

    %% Can send request to the host where call must be executed
    message_forwarding -.forward call to .-> elected_host["Host where call must be executed"]
    style elected_host stroke:#B0A,stroke-width:4px

Xenopsd

Xenopsd is the VM manager of the XAPI Toolstack. Xenopsd is responsible for:

Starting, stopping, rebooting, suspending, resuming, migrating VMs.
(Hot-)plugging and unplugging devices such as VBDs, VIFs, vGPUs and PCI devices.
Setting up VM consoles.
Running bootloaders.
Setting QoS parameters.
Configuring SMBIOS tables.
Handling crashes.
etc.

Check out the full features list.

The code is in ocaml/xenopsd.

Principles

Do no harm: Xenopsd should never touch domains/VMs which it hasn’t been asked to manage. This means that it can co-exist with other VM managers such as ‘xl’ and ‘libvirt’.
Be independent: Xenopsd should be able to work in isolation. In particular the loss of some other component (e.g. the network) should not by itself prevent VMs being managed locally (including shutdown and reboot).
Asynchronous by default: Xenopsd exposes task monitoring and offers cancellation for all operations. Xenopsd ensures that the system is always in a manageable state after an operation has been cancelled.
Avoid state duplication: where another component owns some state, Xenopsd will always defer to it. We will avoid creating out-of-sync caches of this state.
Be debuggable: Xenopsd will expose diagnostic APIs and tools to allow its internal state to be inspected and modified.

Xenopsd Architecture

Xenopsd instances run on a host and manage VMs on behalf of clients. This picture shows 3 different Xenopsd instances: 2 named “xenopsd-xc” and 1 named “xenopsd-xenlight”.

Where xenopsd fits on a host

Each instance is responsible for managing a disjoint set of VMs. Clients should never ask more than one Xenopsd to manage the same VM. Managing a VM means:

handling start/shutdown/suspend/resume/migrate/reboot
allowing devices (disks, nics, PCI cards, vCPUs etc) to be manipulated
providing updates to clients when things change (reboots, console becomes available, guest agent says something etc).

For a full list of features, consult the feature list.

Each Xenopsd instance has a unique name on the host. Typical names are:

org.xen.xcp.xenops.classic
org.xen.xcp.xenops.xenlight

A higher-level tool, such as xapi, will associate VMs with individual Xenopsd names.

Running multiple Xenopsds is necessary because

The virtual hardware supported by different technologies (libxc, libxl, qemu) is expected to be different. We can guarantee the virtual hardware is stable across a rolling upgrade by running the VM on the old Xenopsd. We can then switch Xenopsds later over a VM reboot when the VM admin is happy with it. If the VM admin is unhappy then we can reboot back to the original Xenopsd again.
The suspend/resume/migrate image formats will differ across technologies (again libxc vs libxl) and it will be more reliable to avoid switching technology over a migrate.
In the future different security domains may have different Xenopsd instances providing even stronger isolation guarantees between domains than is possible today.

Communication with Xenopsd is handled through a Xapi-global library: xcp-idl. This library supports

message framing: by default using HTTP but a binary framing format is available
message encoding: by default we use JSON but XML is also available
RPCs over Unix domain sockets and persistent queues.

This library allows the communication details to be changed without having to change all the Xapi clients and servers.

Xenopsd has a number of “backends” which perform the low-level VM operations such as (on Xen) “create domain” “hotplug disk” “destroy domain”. These backends contain all the hypervisor-specific code including

connecting to Xenstore
opening the libxc /proc/xen/privcmd interface
initialising libxl contexts

The following diagram shows the internal structure of Xenopsd:

Inside xenopsd

At the top of the diagram two client RPCs have been sent: one to start a VM and the other to fetch the latest events. The RPCs are all defined in xcp-idl/xen/xenops_interface.ml. The RPCs are received by the Xenops_server module and decomposed into “micro-ops” (labelled “μ op”). These micro-ops represent actions like:

create a Xen domain (recall a Xen domain is an empty shell with no memory)
build a Xen domain: this is where the kernel or hvmloader is copied in
launch a device model: this is where a qemu instance is started (if one is required)
hotplug a device: this involves writing the frontend and backend trees to Xenstore
unpause a domain (recall a Xen domain is created in the paused state)

Each of these micro-ops is represented by a function call in a “backend plugin” interface. The micro-ops are enqueued in queues, one queue per VM. There is a thread pool (whose size can be changed dynamically by the admin) which pulls micro-ops from the VM queues and calls the corresponding backend function.

The active backend (there can only be one backend per Xenopsd instance) executes the micro-ops. The Xenops_server_xen backend in the picture above talks to libxc, libxl and qemu to create and destroy domains. The backend also talks to other Xapi services, in particular

it registers datasources with xcp-rrdd, telling xcp-rrdd to measure I/O throughput and vCPU utilisation
it reserves memory for new domains by talking to squeezed
it makes disks available by calling SMAPIv2 VDI.{at,de}tach, VDI.{,de}activate
it launches subprocesses by talking to forkexecd (avoiding problems with accidental fd capture)

Xenopsd backends are also responsible for monitoring running VMs. In the Xenops_server_xen backend this is done by watching Xenstore for

@releaseDomain watch events
device hotplug status changes

When such an event happens (for example: @releaseDomain sent when a domain requests a reboot) the corresponding operation does not happen inline. Instead the event is rebroadcast upwards to Xenops_server as a signal (for example: “VM id needs some attention”) and a “VM_stat” micro-op is queued in the appropriate queue. Xenopsd does not allow operations to run on the same VM in parallel and enforces this by:

pushing all operations pertaining to a VM to the same queue
associating each VM queue to at-most-one worker pool thread
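The per-VM serialisation can be modelled in a few lines. This is a single-threaded sketch with hypothetical names (queues, enqueue, drain); the real Xenopsd uses a worker thread pool, but the ordering property is the same: one queue per VM, drained by at most one worker at a time.

```ocaml
(* One FIFO queue per VM id; a hypothetical, single-threaded model of
   the dispatch described above. *)
let queues : (string, string Queue.t) Hashtbl.t = Hashtbl.create 8

let enqueue vm op =
  let q =
    match Hashtbl.find_opt queues vm with
    | Some q -> q
    | None ->
      let q = Queue.create () in
      Hashtbl.add queues vm q;
      q
  in
  Queue.push op q

(* A worker drains one VM's queue at a time, so operations on the same
   VM never overlap and always run in submission order. Returns the
   operations in the order they were executed. *)
let drain vm =
  match Hashtbl.find_opt queues vm with
  | None -> []
  | Some q ->
    let rec go acc =
      if Queue.is_empty q then List.rev acc else go (Queue.pop q :: acc)
    in
    go []
```

Operations for different VMs land in different queues, so they can be interleaved freely without affecting the order seen by any single VM.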

The event takes the form “VM id needs some attention” and not “VM id needs to be rebooted” because, by the time the queue is flushed, the VM may well now be in a different state. Perhaps rather than being rebooted it now needs to be shutdown; or perhaps the domain is now in a good state because the reboot has already happened. The signals sent by the backend to the Xenops_server are a bit like event channel notifications in the Xen ring protocols: they are requests to ask someone to perform work, they don’t themselves describe the work that needs to be done.
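A small sketch of why "needs attention" signals are safe to coalesce. Signals carry only a VM id, and the action is decided from the VM's state at the time the signal is handled. The state type and the action strings are hypothetical and exist only for illustration.

```ocaml
module S = Set.Make (String)

(* Pending signals carry only VM ids, so repeats collapse into one. *)
let pending = ref S.empty
let signal vm = pending := S.add vm !pending

(* Hypothetical current state, looked up when the signal is handled. *)
type state = Running | Needs_reboot | Halted
let current : (string, state) Hashtbl.t = Hashtbl.create 8

(* Decide what to do from the state as it is now, not as it was when
   the signal was raised. *)
let handle vm =
  match Hashtbl.find_opt current vm with
  | Some Needs_reboot -> "reboot " ^ vm
  | Some Halted -> "clean up " ^ vm
  | Some Running | None -> "nothing to do for " ^ vm

let flush () =
  let vms = S.elements !pending in
  pending := S.empty;
  List.map handle vms
```

If a VM is signalled twice and changes state in between, a single action is taken and it reflects the latest state, which is the behaviour the paragraph above argues for.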

An implication of this design is that it should always be possible to answer the question, “what operation should be performed to get the VM into a valid state?”. If an operation is cancelled half-way through or if Xenopsd is suddenly restarted, it will ask the question about all the VMs and perform the necessary operations. The operations must be designed carefully to make this work. For example, if Xenopsd is restarted half-way through starting a VM, it must be obvious on restart that the VM should either be forcibly shut down or rebooted to return it to a valid state. Note: we don’t demand that operations are performed as transactions; we only demand that the state in which they leave the system be “sensible” in the sense that the admin will recognise it and be able to continue their work.

Sometimes this can be achieved through careful ordering of side-effects within the operations, taking advantage of artifacts of the system such as:

a domain which has not been fully created will have total vCPU time = 0 and will be paused. If we see one of these we should reboot it because it may not be fully intact.

In the absence of “tells” from the system, operations are expected to journal their intentions and support restart after failure.

There are three categories of metadata associated with VMs:

system metadata: this is created as a side-effect of starting VMs. This includes all the information about active disks and nics stored in Xenstore and the list of running domains according to Xen.
VM: this is the configuration to use when the VM is started or rebooted. This is like a “config file” for the VM.
VmExtra: this is the runtime configuration of the VM. When VM configuration is changed it often cannot be applied immediately; instead the VM continues to run with the previous configuration. We need to track the runtime configuration of the VM in order for suspend/resume and migrate to work. It is also useful to be able to tell a client, “on next reboot this value will be x but currently it is x-1”.

VM and VmExtra metadata is stored by Xenopsd in the domain 0 filesystem, in a simple directory hierarchy.

Design

Design documents for xenopsd:

Events

ids rather than data; inherently coalescable
blocking poll + async operations implies a client needs 2 connections
coarse granularity
similarity and differences with: XenAPI, event channels, xenstore watches

https://github.com/xapi-project/xen-api/blob/30cc9a72e8726d1e7501cd01ddb27ced6d53b9be/ocaml/xapi/xapi_xenops.ml#L1467

Hooks

There are a number of hook points at which xenopsd may execute certain scripts. These scripts are found in hook-specific directories of the form /etc/xapi.d/<hookname>/. All executable scripts in these directories are run with the following arguments:

<script.sh> -reason <reason> -vmuuid <uuid of VM>

The scripts are executed in filename-order. By convention, the filenames are usually of the form 10resetvdis.
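The invocation rule can be sketched as a pure function that orders the script names and builds each argument list. It assumes "filename-order" means plain string comparison, which is why a numeric prefix such as 10 controls the position; hook_commands is a hypothetical helper, not a xenopsd function.

```ocaml
(* Sort script names the way "filename-order" suggests (plain string
   comparison is an assumption) and build each command line. *)
let hook_commands ~dir ~reason ~vm_uuid (scripts : string list) =
  List.sort String.compare scripts
  |> List.map (fun s ->
         [Filename.concat dir s; "-reason"; reason; "-vmuuid"; vm_uuid])
```

With string comparison, 10resetvdis sorts before 20other, so a two-digit prefix is a simple way to leave room for scripts that must run earlier or later.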

The hook points are:

vm-pre-shutdown
vm-pre-migrate
vm-post-migrate (Dundee only)
vm-pre-start
vm-pre-reboot
vm-pre-resume
vm-post-resume (Dundee only)
vm-post-destroy

and the reason codes are:

clean-shutdown
hard-shutdown
clean-reboot
hard-reboot
suspend
source -- passed to pre-migrate hook on source host
destination -- passed to post-migrate hook on destination (Dundee only)
none

For example, in order to execute a script on VM shutdown, it would be sufficient to create the script in the post-destroy hook point:

/etc/xapi.d/vm-post-destroy/01myscript.sh

containing

#!/bin/bash
echo I was passed $@ > /tmp/output

And when, for example, VM e30d0050-8f15-e10d-7613-cb2d045c8505 is shut-down, the script is executed:

[vagrant@localhost ~]$ sudo xe vm-shutdown --force uuid=e30d0050-8f15-e10d-7613-cb2d045c8505
[vagrant@localhost ~]$ cat /tmp/output
I was passed -vmuuid e30d0050-8f15-e10d-7613-cb2d045c8505 -reason hard-shutdown

PVS Proxy OVS Rules

Rule Design

The Open vSwitch (OVS) daemon implements a programmable switch. XenServer uses it to re-direct traffic between three entities:

PVS server - identified by its IP address
a local VM - identified by its MAC address
a local Proxy - identified by its MAC address

VM and PVS server are unaware of the Proxy; xapi configures OVS to redirect traffic between PVS and VM to pass through the proxy.

OVS uses rules that match packets. Rules are organised in sets called tables. A rule can be used to match a packet and to inject it into another rule set/table such that a packet can be matched again.

Furthermore, a rule can set registers associated with a packet that can be matched in subsequent rules. In that way, a packet can be tagged such that it will only match specific rules downstream that match the tag.

Xapi configures 3 rule sets:

Table 0 - Entry Rules

Rules match UDP traffic between VM/PVS, Proxy/VM, and PVS/VM where the PVS server is identified by its IP and all other components by their MAC address. All packets are tagged with the direction they are going and re-submitted into Table 101 which handles ports.

Table 101 - Port Rules

Rules match UDP traffic going to a specific port of the PVS server and re-submit it into Table 102.

Table 102 - Exit Rules

These rules implement the redirection:

Rules matching packets coming from VM to PVS direct them to the Proxy.
Rules matching packets coming from PVS to VM direct them to the Proxy.
Packets coming from the Proxy are already addressed properly (to the VM) and are handled normally.
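The three tables can be read as a small decision pipeline. The sketch below models it as a function; it is not OVS syntax, the types are hypothetical, and treating non-matching traffic as handled normally is an assumption of the sketch rather than something stated above.

```ocaml
type source = From_vm | From_pvs | From_proxy
type action = To_proxy | Normal

(* Table 0 selects UDP traffic and tags its direction, table 101 checks
   the PVS port, and table 102 picks the output. *)
let route ~is_udp ~pvs_port_matches source =
  if not is_udp then Normal                   (* not matched in table 0 *)
  else if not pvs_port_matches then Normal    (* not matched in table 101 *)
  else
    match source with                         (* table 102 *)
    | From_vm | From_pvs -> To_proxy
    | From_proxy -> Normal
```

The direction tag set in table 0 is what lets table 102 tell packets leaving the Proxy apart from those that still need to be redirected to it.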

Requirements for suspend image framing

We are currently (Dec 2013) undergoing a transition from the ‘classic’ xenopsd backend (built upon calls to libxc) to the ‘xenlight’ backend built on top of the officially supported libxl API.

During this work, we have come across an incompatibility between the suspend images created using the ‘classic’ backend and those created using the new libxl-based backend. This needed to be fixed to enable RPU to any new version of XenServer.

Historic ‘classic’ stack

Prior to this work, xenopsd was involved in the construction of the suspend image and we ended up with an image with the following format:

+-----------------------------+
| "XenSavedDomain\n"          |  <-- added by xenopsd-classic
|-----------------------------|
|  Memory image dump          |  <-- libxc
|-----------------------------|
| "QemuDeviceModelRecord\n"   |
|  <size of following record> |  <-- added by xenopsd-classic
|  (a 32-bit big-endian int)  |
|-----------------------------|
| "QEVM"                      |  <-- libxc/qemu
|  Qemu device record         |
+-----------------------------+

We have also been carrying a patch in the Xen patchqueue against xc_domain_restore. This patch (revert_qemu_tail.patch) stopped xc_domain_restore from attempting to read past the memory image dump, at which point xenopsd-classic would just take over and restore what it had put there.

Requirements for new stack

For xenopsd-xenlight to work, we need to operate without the revert_qemu_tail.patch since libxl assumes it is operating on top of an upstream libxc.

We need the following relationship between backends, where suspend images created on one backend can be restored on another. The backends are old-classic (OC), new-classic (NC) and xenlight (XL). Obviously, all suspend images created on any backend must also be restorable on the same backend:

                OC _______ NC _______ XL
                 \  >>>>>      >>>>>  /
                  \__________________/
                    >>>>>>>>>>>>>>>>

It turns out this was not so simple. After removing the patch against xc_domain_restore and allowing libxc to restore the hvm_buffer_tail, we found that suspend images created with OC (detailed in the previous section) are not of a valid format for two reasons:

i. The "XenSavedDomain\n" was extraneous;

ii. The Qemu signature section (prior to the record) is not of valid form.

It turns out that the section with the Qemu signature can be one of the following:

a. "QemuDeviceModelRecord" (NB. no newline) followed by the record to EOF;
b. "DeviceModelRecord0002" then a uint32_t length followed by record;
c. "RemusDeviceModelState" then a uint32_t length followed by record;

The old-classic (OC) backend not only uses an invalid signature (since it contains a trailing newline) but it also includes a length, and the length is in big-endian when the uint32_t is seen to be little-endian.
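The three signature forms, and the byte-order mismatch in the length field, can be illustrated with a minimal OCaml sketch. The type and function names here are hypothetical and are not taken from libxc or xenopsd; the sketch only mirrors the rules listed above and assumes a 64-bit platform.

```ocaml
(* Hypothetical names: a sketch of the signature rules, not libxc code. *)
type qemu_signature =
  | Record_to_eof (* (a) "QemuDeviceModelRecord", record runs to EOF *)
  | Length_prefixed (* (b) "DeviceModelRecord0002" + uint32_t length *)
  | Remus_length_prefixed (* (c) "RemusDeviceModelState" + uint32_t length *)

(* All three valid signatures are exactly 21 bytes long. *)
let signature_length = 21

(* Classify the first 21 bytes of [buf]; None if no signature matches. *)
let classify_signature (buf : string) : qemu_signature option =
  if String.length buf < signature_length then None
  else
    match String.sub buf 0 signature_length with
    | "QemuDeviceModelRecord" -> Some Record_to_eof
    | "DeviceModelRecord0002" -> Some Length_prefixed
    | "RemusDeviceModelState" -> Some Remus_length_prefixed
    | _ -> None

let byte (s : string) (i : int) : int = Char.code s.[i]

(* Read four bytes at [off] as a little-endian unsigned 32-bit value. *)
let uint32_le s off =
  byte s off
  lor (byte s (off + 1) lsl 8)
  lor (byte s (off + 2) lsl 16)
  lor (byte s (off + 3) lsl 24)

(* Read the same four bytes as a big-endian unsigned 32-bit value. *)
let uint32_be s off =
  (byte s off lsl 24)
  lor (byte s (off + 1) lsl 16)
  lor (byte s (off + 2) lsl 8)
  lor byte s (off + 3)
```

Note that the OC signature "QemuDeviceModelRecord\n" still matches form (a) on its first 21 bytes, so a reader following these rules would treat the newline and the big-endian length as the start of the record.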

We considered creating a proxy for the fd in the incompatible cases but since this would need to be a 22-lookahead byte-by-byte proxy this was deemed impractical. Instead we have patched libxc with a much simpler patch to understand this legacy format.

Because peek-ahead is not possible on pipes, the patch for (ii) needed to be applied at a point where the hvm tail had been read completely. We piggy-backed on the point after (a) had been detected. At this point the remainder of the fd is buffered (only around 7k) and the magic “QEVM” is expected at the head of this buffer. So we simply added a patch to check if there was a pesky newline and the buffer[5:8] was “QEVM” and if it was we could discard the first 5 bytes:

                              0    1    2    3    4    5   6   7   8
Legacy format from OC:  [...| \n | \x | \x | \x | \x | Q | E | V | M |...]

Required at this point: [...|  Q |  E |  V |  M |...]
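The check can be sketched in OCaml as follows. This is only an illustration of the logic: the real patch is C code inside libxc, and strip_legacy_prefix is a hypothetical name.

```ocaml
(* Sketch of the legacy-format check; the real patch lives in libxc (C).
   [buf] stands for the buffered remainder of the fd after the
   21-byte signature has been consumed. *)
let strip_legacy_prefix (buf : string) : string =
  let n = String.length buf in
  if n >= 9 && buf.[0] = '\n' && String.sub buf 5 4 = "QEVM" then
    (* Legacy OC image: drop the stray newline and the 4-byte length. *)
    String.sub buf 5 (n - 5)
  else
    (* Already in the expected form: leave untouched. *)
    buf
```

A buffer that already starts with "QEVM" is returned unchanged, so the same code path can serve both legacy and valid images.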

Changes made

To make the above use-cases work, we have made the following changes:

1. Make new-classic (NC) not restore Qemu tail (let libxc do it)
    xenopsd.git:ef3bf4b

2. Make new-classic use valid signature (b) for future restore images
    xenopsd.git:9ccef3e

3. Make xc_domain_restore in libxc understand legacy xenopsd (OC) format
    xen-4.3.pq.hg:libxc-restore-legacy-image.patch

4. Remove revert-qemu-tail.patch from Xen patchqueue
    xen-4.3.pq.hg:3f0e16f2141e

5. Make xenlight (XL) use "XenSavedDomain\n" start-of-image signature
    xenopsd.git:dcda545

This has made the required use-cases work as follows:

                OC __134__ NC __245__ XL
                 \  >>>>>      >>>>>  /
                  \_______345________/
                    >>>>>>>>>>>>>>>>

Suspend-resume on the same backend works by virtue of:

OC --> OC : Just works
NC --> NC : By 1,2,4
XL --> XL : By 4 (5 is used but not required)

New components

The outputs of the changes above are:

A new xenops-xc binary for NC
A new xenops-xl binary for XL
A new libxenguest.4.3 for both of NC and XL

Future considerations

This should serve as a useful reference when considering making changes to the suspend image in any way.

Suspend image framing format

Example suspend image layout:

+----------------------------+
| 1. Suspend image signature |
+============================+
| 2.0 Xenops header          |
| 2.1 Xenops record          |
+============================+
| 3.0 Libxc header           |
| 3.1 Libxc record           |
+============================+
| 4.0 Qemu header            |
| 4.1 Qemu save record       |
+============================+
| 5.0 End_of_image footer    |
+----------------------------+

A suspend image is now constructed as a series of header-record pairs. The initial signature (1.) is used to determine whether we are dealing with the unstructured, “legacy” suspend image or the new, structured format.

Each header is two 64-bit integers: the first identifies the header type and the second is the length of the record that follows in bytes. The following types have been defined (the ones marked with a (*) have yet to be implemented):

* Xenops       : Metadata for the suspend image
* Libxc        : The result of a xc_domain_save
* Libxl*       : Not implemented
* Libxc_legacy : Marked as a libxc record saved using pre-Xen-4.5
* Qemu_trad    : The qemu save file for the Qemu used in XenServer
* Qemu_xen*    : Not implemented
* Demu*        : Not implemented
* End_of_image : A footer marker to denote the end of the suspend image

Some of the above types do not have the notion of a length, since the length cannot be known before saving and restoring is delegated to other layers of the stack. Specifically these are the memory image sections, libxc and libxl.
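As an illustration of the header layout, here is a minimal OCaml sketch that writes and reads a 16-byte header. The numeric type ids and the little-endian byte order are assumptions made for the example; the text above does not specify either, and the function names are hypothetical.

```ocaml
(* Only the implemented header types from the list above. *)
type header_type = Xenops | Libxc | Libxc_legacy | Qemu_trad | End_of_image

(* Hypothetical numeric ids: the real values are not given in this text. *)
let id_of_type = function
  | Xenops -> 1L
  | Libxc -> 2L
  | Libxc_legacy -> 3L
  | Qemu_trad -> 4L
  | End_of_image -> 5L

let type_of_id = function
  | 1L -> Some Xenops
  | 2L -> Some Libxc
  | 3L -> Some Libxc_legacy
  | 4L -> Some Qemu_trad
  | 5L -> Some End_of_image
  | _ -> None

(* A header is two 64-bit integers: type, then record length in bytes.
   Little-endian encoding is an assumption of this sketch. *)
let write_header (ty : header_type) (len : int64) : bytes =
  let b = Bytes.create 16 in
  Bytes.set_int64_le b 0 (id_of_type ty);
  Bytes.set_int64_le b 8 len;
  b

(* Returns None for a short buffer or an unknown type id. *)
let read_header (b : bytes) : (header_type * int64) option =
  if Bytes.length b < 16 then None
  else
    match type_of_id (Bytes.get_int64_le b 0) with
    | None -> None
    | Some ty -> Some (ty, Bytes.get_int64_le b 8)
```

A reader would loop over header-record pairs in this way until it meets the End_of_image footer.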

Tasks

Some operations performed by Xenopsd are blocking, for example:

suspend/resume/migration
attaching disks (where the SMAPI VDI.attach/activate calls can perform network I/O)

We want to be able to

present the user with an idea of progress (perhaps via a “progress bar”)
allow the user to cancel a blocked operation that is taking too long
associate logging with the user/client-initiated actions that spawned them

Principles

all operations which may block (the vast majority) should be written in an asynchronous style i.e. the operations should immediately return a Task id
all operations should guarantee to respond to a cancellation request in a bounded amount of time (30s)
when cancelled, the system should always be left in a valid state
clients are responsible for destroying Tasks when they are finished with the results

Types

A task has a state, which may be Pending, Completed or Failed:

	type async_result = unit

	type completion_t = {
		duration : float;
		result : async_result option
	}

	type state =
		| Pending of float
		| Completed of completion_t
		| Failed of Rpc.t

When a task is Failed, we associate it with a marshalled exception (a value of type Rpc.t). This exception must be one from the set defined in the Xenops_interface. To see how they are marshalled, see Xenops_server.

From the point of view of a client, a Task has the immutable type (which can be queried with a Task.stat):

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
	}

where

id is a unique (integer) id generated by Xenopsd. This is how a Task is represented to clients
dbg is a client-provided debug key which will be used in log lines, allowing lines from the same Task to be associated together
ctime is the creation time
state is the current state (Pending/Completed/Failed)
subtasks lists logical internal sub-operations for debugging
debug_info includes miscellaneous key/value pairs used for debugging
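A client that receives such a record typically pattern-matches on the state. The sketch below shows one way to turn a state into a human-readable summary. The rpc type is a stand-in for Rpc.t so that the example is self-contained, and treating the float carried by Pending as a progress fraction between 0 and 1 is an assumption of this sketch.

```ocaml
(* Stand-in for Rpc.t, which comes from an external library. *)
type rpc = Rpc_string of string

type async_result = unit

type completion_t = { duration : float; result : async_result option }

type state = Pending of float | Completed of completion_t | Failed of rpc

(* Assumption: the float in Pending is a progress fraction in [0, 1]. *)
let describe_state = function
  | Pending progress -> Printf.sprintf "pending (%.0f%%)" (progress *. 100.)
  | Completed c -> Printf.sprintf "completed in %.1fs" c.duration
  | Failed (Rpc_string msg) -> "failed: " ^ msg
```

Because state is a closed variant, the compiler warns if a client forgets to handle one of the three cases.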

Internally, Xenopsd uses a mutable record type to track Task state. This is broadly similar to the interface type except

the state is mutable: this allows Tasks to complete
the task contains a “do this now” thunk
there is a “cancelling” boolean which is toggled to request a cancellation.
there is a list of cancel callbacks
there are some fields related to “cancel points”

Persistence

The Tasks are intended to represent activities associated with in-memory queues and threads. Therefore the active Tasks are kept in memory in a map, and will be lost over a process restart. This is desirable since we will also lose the queued items and the threads, so there is no need to resync on start.

Note that every operation must ensure that the state of the system is recoverable on restart by not leaving it in an invalid state. It is not necessary to either guarantee to complete or roll-back a Task. Tasks are not expected to be transactional.

Lifecycle of a Task

All Tasks returned by API functions are created as part of the enqueue functions: queue_operation_*. Even operations which are performed internally are normally wrapped in Tasks by the function immediate_operation.

A queued operation will be processed by one of the queue worker threads. It will

set the thread-local debug key to the Task.dbg
call task.Xenops_task.run, taking care to catch exceptions and update the task.Xenops_task.state
unset the thread-local debug key
generate an event on the Task to provoke clients to query the current state.

Task implementations must update their progress as they work. For the common case of a compound operation like VM_start, which is decomposed into multiple “micro-ops” (e.g. VM_create, VM_build), there is a useful helper function perform_atomics which divides the progress ‘bar’ into sections, where each “micro-op” can have a different size (weight). A progress callback function is passed into each Xenopsd backend function so it can be updated with fine granularity. For example, note the arguments to B.VM.save.
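The idea of dividing the progress bar by weight can be sketched as follows. This is not the perform_atomics implementation; progress_ranges and overall_progress are hypothetical helpers, and weights are assumed to be positive.

```ocaml
(* Split the overall 0..1 progress range into one slice per micro-op,
   proportional to its weight. Weights are assumed to be positive. *)
let progress_ranges (weights : float list) : (float * float) list =
  let total = List.fold_left ( +. ) 0. weights in
  let _, ranges =
    List.fold_left
      (fun (start, acc) w ->
        let stop = start +. (w /. total) in
        (stop, (start, stop) :: acc))
      (0., []) weights
  in
  List.rev ranges

(* Map a micro-op's local progress (0..1) into its slice of the bar. *)
let overall_progress ((start, stop) : float * float) (local : float) : float =
  start +. (local *. (stop -. start))
```

With weights 1 and 3, the first micro-op owns the first quarter of the bar and the second owns the rest, so a second micro-op that is half done reports 0.625 overall.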

Clients are expected to destroy Tasks they are responsible for creating. Xenopsd cannot do this on their behalf because it does not know if they have successfully queried the Task status/result.

When Xenopsd is a client of itself, it will take care to destroy the Task properly, for example see immediate_operation.

Cancellation

The goal of cancellation is to unstick a blocked operation and to return the system to some valid state, not any valid state in particular. Xenopsd does not treat operations as transactions; when an operation is cancelled it may

fully complete (e.g. if it was about to do this anyway)
fully abort (e.g. if it had made no progress)
enter some other valid state (e.g. if it had gotten half way through)

Xenopsd will never leave the system in an invalid state after cancellation.

Every Xenopsd operation should unblock and return the system to a valid state within a reasonable amount of time after a cancel request. This should be as quick as possible but up to 30s may be acceptable. Bear in mind that a human is probably impatiently watching a UI say “please wait” and which doesn’t have any notion of progress itself. Keep it quick!

Cancellation is triggered by TASK.cancel which calls cancel. This

sets the cancelling boolean
calls all registered cancel callbacks

Implementations respond to cancellation by

if running: periodically call check_cancelling
if about to block: register a suitable cancel callback safely with with_cancel.
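The two mechanisms can be sketched with a simplified, single-threaded task record. The names cancel, check_cancelling and with_cancel follow the text above, but the bodies are illustrative assumptions and omit the locking that a real multi-threaded implementation needs.

```ocaml
exception Cancelled

(* Simplified task: only the fields relevant to cancellation. *)
type task = {
  mutable cancelling : bool;
  mutable cancel_callbacks : (unit -> unit) list;
}

let make_task () = { cancelling = false; cancel_callbacks = [] }

(* Request cancellation: set the flag, then run every callback. *)
let cancel (t : task) : unit =
  t.cancelling <- true;
  List.iter (fun f -> f ()) t.cancel_callbacks

(* Called periodically by running code. *)
let check_cancelling (t : task) : unit = if t.cancelling then raise Cancelled

(* Register [on_cancel] only for the duration of the blocking call [f]. *)
let with_cancel (t : task) (on_cancel : unit -> unit) (f : unit -> 'a) : 'a =
  t.cancel_callbacks <- on_cancel :: t.cancel_callbacks;
  Fun.protect
    ~finally:(fun () ->
      (* Physical comparison: remove exactly the callback we added. *)
      t.cancel_callbacks <-
        List.filter (fun g -> g != on_cancel) t.cancel_callbacks)
    f
```

The callback is removed whether the blocking call returns or raises, so a later cancel request cannot fire a callback for an operation that has already finished.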

Xenopsd’s libxc backend can block in 2 different ways, and therefore has 2 different types of cancel callback:

cancellable Xenstore watches
cancellable subprocesses

Xenstore watches are used for device hotplug and unplug. Xenopsd has to wait for the backend or for a udev script to do something. If that blocks, we need a way to cancel the watch. The easiest way to cancel a watch is to watch an additional path (a “cancel path”) and delete it, see cancellable_watch. The “cancel paths” are placed within the VM’s Xenstore directory to ensure that cleanup code which does xenstore-rm will automatically “cancel” all outstanding watches. Note that we trigger a cancel by deleting rather than creating, to avoid racing with delete and creating orphaned Xenstore entries.

Subprocesses are used for suspend/resume/migrate. Xenopsd hands file descriptors to libxenguest by running a subprocess and passing the fds to it. Xenopsd therefore gets the process id and can send it a signal to cancel it. See Cancellable_subprocess.run.

Testing with cancel points

Cancellation is difficult to test, as it is completely asynchronous. Therefore Xenopsd has some built-in cancellation testing infrastructure known as “cancel points”. A “cancel point” is a point in the code where a Cancelled exception could be thrown, either by checking the cancelling boolean or as a side-effect of a cancel callback. The check_cancelling function increments a counter every time it passes one of these points, and this value is returned to clients in the Task.debug_info.

A test harness runs a series of operations. Each operation is first run all the way through to completion to discover the total number of cancel points. The operation is then re-run with a request to cancel at a particular point. The test then waits for the system to stabilise and verifies that it appears to be in a valid state.
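The harness logic can be sketched as below. The ctx record, cancel_point and test_all_cancel_points are hypothetical names for this example; in Xenopsd the counter is maintained by check_cancelling and reported through Task.debug_info.

```ocaml
exception Cancelled

(* [cancel_at = Some n] asks the operation to be cancelled at its
   n-th cancel point; None lets it run to completion. *)
type ctx = { mutable points_passed : int; cancel_at : int option }

let cancel_point (ctx : ctx) : unit =
  ctx.points_passed <- ctx.points_passed + 1;
  if ctx.cancel_at = Some ctx.points_passed then raise Cancelled

(* Run once to count cancel points, then once per point with a
   cancellation injected there, checking validity after each run. *)
let test_all_cancel_points ~(operation : ctx -> unit)
    ~(is_valid : unit -> bool) : bool =
  let dry = { points_passed = 0; cancel_at = None } in
  operation dry;
  let total = dry.points_passed in
  let ok = ref (is_valid ()) in
  for n = 1 to total do
    let ctx = { points_passed = 0; cancel_at = Some n } in
    (try operation ctx with Cancelled -> ());
    if not (is_valid ()) then ok := false
  done;
  !ok
```

An operation with three cancel points is therefore executed four times: one complete run plus one run per injected cancellation.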

Preventing Tasks leaking

The client who creates a Task must destroy it when the Task is finished, and they have processed the result. What if a client like xapi is restarted while a Task is running?

We assume that, if xapi is talking to a xenopsd, then xapi completely owns it. Therefore xapi should destroy any completed tasks that it doesn’t recognise.

If a user wishes to manage VMs with xenopsd in parallel with xapi, the user should run a separate xenopsd.

Features

General

Pluggable backends including
- xc: drives Xen via libxc and xenguest
- simulator: simulates operations for component-testing
Supports running multiple instances and backends on the same host, looking after different sets of VMs
Extensive configuration via command-line (see manpage) and config file
Command-line tool for easy VM administration and troubleshooting
User-settable degree of concurrency to get VMs started quickly

VMs

VM start/shutdown/reboot
VM suspend/resume/checkpoint/migrate
VM pause/unpause
VM s3suspend/s3resume
customisable SMBIOS tables for OEM-locked VMs
hooks for 3rd party extensions:
- pre-start
- pre-destroy
- post-destroy
- pre-reboot
per-VM xenguest replacement
suppression of VM reboot loops
live vCPU hotplug and unplug
vCPU to pCPU affinity setting
vCPU QoS settings (weight and cap for the Xen credit2 scheduler)
DMC memory-ballooning support
support for storage driver domains
live update of VM shadow memory
guest-initiated disk/nic hotunplug
guest-initiated disk eject
force disk/nic unplug
support for ‘surprise-removable’ devices
disk QoS configuration
nic QoS configuration
persistent RTC
two-way guest agent communication for monitoring and control
network carrier configuration
port-locking for nics
text and VNC consoles over TCP and Unix domain sockets
PV kernel and ramdisk whitelisting
configurable VM videoram
programmable action-after-crash behaviour including: shutting down the VM, taking a crash dump or leaving the domain paused for inspection
ability to move nics between bridges/switches
advertises the VM memory footprints
PCI passthrough
support for discrete emulators (e.g. ‘demu’)
PV keyboard and mouse
qemu stub domains
cirrus and stdvga graphics cards
HVM serial console (useful for debugging)
support for vGPU
workaround for ‘spurious page faults’ kernel bug
workaround for ‘machine address size’ kernel bug

Hosts

CPUid masking for heterogeneous pools: reports true features and current features
Host console reading
Hypervisor version and capabilities reporting
Host CPU querying

APIs

versioned JSON-RPC API with feature advertisements
clients can disconnect, reconnect and easily resync with the latest VM state without losing updates
all operations have task control including
- asynchronous cancellation: for both subprocesses and xenstore watches
- progress updates
- subtasks
- per-task debug logs
asynchronous event watching API
advertises VM metrics
- memory usage
- balloon driver co-operativeness
- shadow memory usage
- domain ids
channel passing (via sendmsg(2)) for efficient memory image copying

Operation Walk-Throughs

Let’s trace through interesting operations to see how the whole system works.

Starting a VM
Complete walkthrough of starting a VM, from receiving the request to unpause.
Building a VM
After VM_create, VM_build builds the core of the domain (vCPUs, memory)
- VM_build μ-op
  Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).
- Domain.build
  Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.
- xenguest
  Perform building VMs: Allocate and populate the domain's system memory.
Migrating a VM
Walkthrough of migrating a VM from one host to another.
Live Migration
Sequence diagram of the process of Live Migration.

Inspiration for other walk-throughs:

Shutting down a VM and waiting for it to happen
A VM wants to reboot itself
A disk is hotplugged
A disk refuses to hotunplug
A VM is suspended

Walkthrough: Starting a VM

A Xenopsd client wishes to start a VM. They must first tell Xenopsd the VM configuration to use. A VM configuration is broken down into objects:

VM: A device-less Virtual Machine
VBD: A virtual block device for a VM
VIF: A virtual network interface for a VM
PCI: A virtual PCI device for a VM

Treating devices as first-class objects is convenient because we wish to expose operations on the devices such as hotplug, unplug, eject (for removable media), carrier manipulation (for network interfaces) etc.

The “add” functions in the Xenopsd interface cause Xenopsd to create the objects:

In the case of xapi, there are a set of functions which convert between the XenAPI objects and the Xenopsd objects. The two interfaces are slightly different because they have different expected users:

the XenAPI has many clients which are updated on long release cycles. The main property needed is backwards compatibility, so that new releases of xapi remain compatible with these older clients. Quite often, we will choose to “grandfather in” some poorly designed interface simply because we wish to avoid imposing churn on 3rd parties.
the Xenopsd API clients are all open-source and are part of the xapi-project. These clients can be updated as the API is changed. The main property needed is to keep the interface clean, so that it properly hides the complexity of dealing with Xen from other components.

The Xenopsd “VM.add” function has code like this:

	let add' x =
		debug "VM.add %s" (Jsonrpc.to_string (rpc_of_t x));
		DB.write x.id x;
		let module B = (val get_backend () : S) in
		B.VM.add x;
		x.id

This function does 2 things:

it stores the VM configuration in the “database”
it tells the “backend” that the VM exists

The Xenopsd database is really a set of config files in the filesystem. All objects belonging to a VM (recall we only have VMs, VBDs, VIFs, PCIs and not stand-alone entities like disks) are placed into a subdirectory named after the VM e.g.:

# ls /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701
config	vbd.xvda  vbd.xvdb
# cat /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701/config
{"id": "7b719ce6-0b17-9733-e8ee-dbc1e6e7b701", "name": "fedora",
 ...
}

Xenopsd doesn’t have as persistent a notion of a VM as xapi: it is expected that all objects are deleted when the host is rebooted. However the objects should be persisted over a simple Xenopsd restart, which is why the objects are stored in the filesystem.

Aside: it would probably be more appropriate to store the metadata in Xenstore since this has the exact object lifetime we need. This will require a more performant Xenstore to realise.

Every running Xenopsd process is linked with a single backend. Currently backends exist for:

Xen via libxc, libxenguest and xenstore
Xen via libxl, libxc and xenstore
Xen via libvirt
KVM by direct invocation of qemu
Simulation for testing

From here we shall assume the use of the “Xen via libxc, libxenguest and xenstore” (a.k.a. “Xenopsd classic”) backend.

The backend VM.add function checks whether the VM we have to manage already exists – and if it does then it ensures the Xenstore configuration is intact. This Xenstore configuration is important because at any time a client can query the state of a VM with VM.stat and this relies on certain Xenstore keys being present.

Once the VM metadata has been registered with Xenopsd, the client can call VM.start. Like all potentially-blocking Xenopsd APIs, this function returns a Task id. Please refer to the Task handling design for a general overview of how tasks are handled.

Clients can poll the state of a task by calling TASK.stat but most clients will prefer to use the event system instead. Please refer to the Event handling design for a general overview of how events are handled.

The event model is similar to the XenAPI: clients call a blocking UPDATES.get passing in a token which represents the point in time when the last UPDATES.get returned. The call blocks until some objects have changed state, and these object ids are returned (NB in the XenAPI the current object states are returned). The client must then call the relevant “stat” function, in this case TASK.stat.
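The token-based model can be illustrated with a small in-memory sketch. It is a non-blocking simplification: the real UPDATES.get blocks until something changes, and the record and function names below are hypothetical.

```ocaml
(* Each change is stamped with an increasing sequence number; a client's
   token is the last sequence number it has seen. *)
type updates = { mutable next : int; mutable log : (int * string) list }

let create () = { next = 1; log = [] }

(* Record that object [id] changed, keeping only its latest entry. *)
let record (u : updates) (id : string) : unit =
  u.log <- (u.next, id) :: List.filter (fun (_, id') -> id' <> id) u.log;
  u.next <- u.next + 1

(* Non-blocking "get": ids changed since [token], oldest first,
   plus the token to pass to the next call. *)
let get (u : updates) (token : int) : string list * int =
  let changed = List.filter (fun (seq, _) -> seq > token) u.log in
  (List.rev_map snd changed, u.next - 1)
```

Because only ids are returned, a client that falls behind sees each changed object once and then fetches its current state with the relevant stat call, rather than replaying every intermediate state.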

The client will be able to see the task make progress and use this to – for example – populate a progress bar in a UI. If the client needs to cancel the task then it can call the TASK.cancel; again see the Task handling design to understand how this is implemented.

When the Task has completed successfully, then calls to *.stat will show:

the power state is Paused
exactly one valid Xen domain id
all VBDs have active = plugged = true
all VIFs have active = plugged = true
all PCI devices have plugged = true
at least one active console
a valid start time
valid “targets” for memory and vCPU

Note: before a Task completes, calls to *.stat will show partial updates. E.g. the power state may be paused, but no disk may have been plugged. UI clients must choose whether they are happy displaying this in-between state or whether they wish to hide it and pretend the whole operation has happened transactionally. In particular, when a client wishes to perform side-effects in response to xenopsd state changes (for example, to clean up an external resource when a VIF becomes unplugged), it must be very careful to avoid responding to these in-between states. Generally, it is safest to passively report these values without driving things directly from them.

Note: the Xenopsd implementation guarantees that, if it is restarted at any point during the start operation, on restart the VM state shall be “fixed” by either (i) shutting down the VM; or (ii) ensuring the VM is intact and running.

In the case of xapi, every Xenopsd Task id is bound one-to-one with a XenAPI task by the function sync_with_task. The function update_task is called when xapi receives a notification that a Xenopsd Task has changed state, and updates the corresponding XenAPI task. Xapi launches exactly one thread per Xenopsd instance (“queue”) to monitor for background events via the function events_watch while each thread performing a XenAPI call waits for its specific Task to complete via the function event_wait.

It is the responsibility of the client to call TASK.destroy when the Task is no longer needed. Xenopsd won’t destroy the task because it contains the success/failure result of the operation which is needed by the client.

What happens when Xenopsd receives a VM.start request?

When Xenopsd receives the request it adds it to the appropriate per-VM queue via the function queue_operation. To understand this and other internal details of Xenopsd, consult the architecture description. The queue_operation_int function looks like this:

let queue_operation_int dbg id op =
	let task = Xenops_task.add tasks dbg (fun t -> perform op t; None) in
	Redirector.push id (op, task);
	task

The “task” is a record containing Task metadata plus a “do it now” function which will be executed by a thread from the thread pool. The module Redirector takes care of:

pushing operations to the right queue
ensuring at most one worker thread is working on a VM’s operations
reducing the queue size by coalescing items together
providing a diagnostics interface
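The first two responsibilities can be illustrated with a single-threaded sketch. This is not the Redirector implementation: it omits coalescing, diagnostics and real threads, and the record fields and the finished function are hypothetical.

```ocaml
(* One FIFO queue per VM id, plus the set of VMs that currently have
   a worker attached. *)
type 'op redirector = {
  queues : (string, 'op Queue.t) Hashtbl.t;
  busy : (string, unit) Hashtbl.t;
}

let create () = { queues = Hashtbl.create 16; busy = Hashtbl.create 16 }

(* Push an operation onto the queue of the VM it concerns. *)
let push r (vm : string) op =
  let q =
    match Hashtbl.find_opt r.queues vm with
    | Some q -> q
    | None ->
        let q = Queue.create () in
        Hashtbl.replace r.queues vm q;
        q
  in
  Queue.push op q

(* A free worker asks for work: it only receives an operation for a VM
   that no other worker is currently handling. *)
let pop r : (string * 'op) option =
  Hashtbl.fold
    (fun vm q acc ->
      match acc with
      | Some _ -> acc
      | None ->
          if Hashtbl.mem r.busy vm || Queue.is_empty q then None
          else (
            Hashtbl.replace r.busy vm ();
            Some (vm, Queue.pop q)))
    r.queues None

(* The worker reports that it has finished with this VM. *)
let finished r (vm : string) : unit = Hashtbl.remove r.busy vm
```

Serialising work per VM in this way means two operations on the same VM never run concurrently, while operations on different VMs can proceed in parallel.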

Once a thread from the worker pool becomes free, it will execute the “do it now” function. In the example above this is perform op t where op is VM_start vm and t is the Task. The function perform_exn has fragments like this:

  | VM_start (id, force) -> (
      debug "VM.start %s (force=%b)" id force ;
      let power = (B.VM.get_state (VM_DB.read_exn id)).Vm.power_state in
      match power with
      | Running ->
          info "VM %s is already running" id
      | _ ->
          perform_atomics (atomics_of_operation op) t ;
          VM_DB.signal id
    )

Each “operation” (e.g. VM_start vm) is decomposed into “micro-ops” by the function atomics_of_operation where the micro-ops are small building-block actions common to the higher-level operations. Each operation corresponds to a list of “micro-ops”, where there is no if/then/else. Some of the “micro-ops” may be a no-op depending on the VM configuration (for example a PV domain may not need a qemu). In the case of VM_start vm the Xenopsd server starts by calling the functions that decompose the VM_hook_script, VM_create and VM_build micro-ops:

        dequarantine_ops vgpus
      ; [
          VM_hook_script
            (id, Xenops_hooks.VM_pre_start, Xenops_hooks.reason__none)
        ; VM_create (id, None, None, no_sharept)
        ; VM_build (id, force)
        ]

This is the complete sequence of micro-ops:

1. run the “VM_pre_start” scripts

The VM_hook_script micro-op runs the corresponding “hook” scripts. The code is all in the Xenops_hooks module and looks for scripts in the hardcoded path /etc/xapi.d.

2. create a Xen domain

The VM_create micro-op calls the VM.create function in the backend. In the classic Xenopsd backend, the VM.create_exn function must

check if we’re creating a domain for a fresh VM or resuming an existing one: if it’s a resume then the domain configuration stored in the VmExtra database table must be used
ask squeezed to create a memory “reservation” big enough to hold the VM memory. Unfortunately the domain cannot be created until the memory is free because domain create often fails in low-memory conditions. This means the “reservation” is associated with our “session” with squeezed; if Xenopsd crashes and restarts the reservation will be freed automatically.
create the Domain via the libxc hypercall Xenctrl.domain_create
call generate_create_info() for storing the platform data (vCPUs, etc) in the domain’s Xenstore tree. xenguest then uses this in the build phase (see below) to build the domain.
“transfer” the squeezed reservation to the domain such that squeezed will free the memory if the domain is destroyed later
compute and set an initial balloon target depending on the amount of memory reserved (recall we ask for a range between dynamic_min and dynamic_max)
apply the “suppress spurious page faults” workaround if requested
set the “machine address size”
“hotplug” the vCPUs. This operates a lot like memory ballooning – Xen creates lots of vCPUs and then the guest is asked to only use some of them. Every VM therefore starts with the “VCPUs_max” setting and co-operative hotplug is used to reduce the number. Note there is no enforcement mechanism: a VM which cheats and uses too many vCPUs would have to be caught by looking at the performance statistics.

3. build the domain

The build phase waits, if necessary, for the Xen memory scrubber to catch up reclaiming memory, runs NUMA placement, sets vCPU affinity and invokes the xenguest to build the system memory layout of the domain. See the walk-through of the VM_build μ-op for details.

4. mark each VBD as “active”

VBDs and VIFs are said to be “active” when they are intended to be used by a particular VM, even if the backend/frontend connection hasn’t been established, or has been closed. If someone calls VBD.stat or VIF.stat then the result includes both “active” and “plugged”, where “plugged” is true if the frontend/backend connection is established. For example xapi will set VBD.currently_attached to “active || plugged”.

The “active” flag is conceptually very similar to the traditional “online” flag (which is not documented in the upstream Xen tree as of Oct/2014 but really should be) except that on unplug, one would set the “online” key to “0” (false) first before initiating the hotunplug. By contrast the “active” flag is set to false after the unplug, i.e. the “set_active” calls bracket plug/unplug. If the “active” flag were set to false before the unplug attempt then, as soon as the frontend/backend connection is removed, clients would see the VBD as completely dissociated from the VM. This would be misleading because Xenopsd will not have had time to use the storage API to release locks on the disks. By cleaning up before setting “active” to false, clients can be assured that the disks are now free to be reassigned.
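The bracketing of plug/unplug by the “active” flag can be sketched as follows. The record and the release_storage callback are hypothetical simplifications; only the ordering of the steps is the point of the example.

```ocaml
type vbd = { mutable active : bool; mutable plugged : bool }

(* What xapi would report, per the text: "active || plugged". *)
let currently_attached (v : vbd) : bool = v.active || v.plugged

(* "active" is set before the connection is established... *)
let plug (v : vbd) : unit =
  v.active <- true;
  v.plugged <- true

(* ...and cleared only after the connection is gone and the storage
   has been released. *)
let unplug ~(release_storage : unit -> unit) (v : vbd) : unit =
  v.plugged <- false;
  release_storage ();
  v.active <- false
```

While release_storage runs, currently_attached still returns true, so a client cannot conclude that the disk is free before the locks have actually been released.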

5. handle non-persistent disks

A non-persistent disk is one which is reset to a known-good state on every VM start. The VBD_epoch_begin is the signal to perform any necessary reset.

6. plug VBDs

The VBD_plug micro-op will plug the VBD into the VM. Every VBD is plugged in a carefully-chosen order. Generally, plug order is important for all types of devices. For VBDs, we must work around the deficiency in the storage interface where a VDI, once attached read/only, cannot be attached read/write. Since it is legal to attach the same VDI with multiple VBDs, we must plug them in such that the read/write VBDs come first. From the guest’s point of view the order we plug them doesn’t matter because they are indexed by the Xenstore device id (e.g. 51712 = xvda).
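The ordering constraint amounts to a stable partition that places read/write VBDs first. The types and the plug_order function below are hypothetical and only illustrate the rule.

```ocaml
type mode = ReadOnly | ReadWrite

type vbd = { device : string; vdi : string; mode : mode }

(* Read/write VBDs first, read-only VBDs after; the relative order
   within each group is preserved. *)
let plug_order (vbds : vbd list) : vbd list =
  let rw, ro = List.partition (fun v -> v.mode = ReadWrite) vbds in
  rw @ ro
```

If the same VDI appears behind both a read-only and a read/write VBD, the read/write attach therefore always happens first.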

The function VBD.plug will

call VDI.attach and VDI.activate in the storage API to make the devices ready (start the tapdisk processes etc)
add the Xenstore frontend/backend directories containing the block device info
add the extra xenstore keys returned by the VDI.attach call that are needed for SCSIid passthrough which is needed to support VSS
write the VBD information to the Xenopsd database so that future calls to VBD.stat can be told about the associated disk (this is needed so clients like xapi can cope with CD insert/eject etc)
if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

The Xenstore keys are written by the functions Device.Vbd.add_async and Device.Vbd.add_wait. In a Linux domain (such as dom0) when the backend directory is created, the kernel creates a “backend device”. Creating any device will cause a kernel UEVENT to fire which is picked up by udev. The udev rules run a script whose only job is to stat(2) the device (from the “params” key in the backend) and write the major and minor number to Xenstore for blkback to pick up. (Aside: FreeBSD doesn’t do any of this, instead the FreeBSD kernel module simply opens the device in the “params” key). The script also writes the backend key “hotplug-status=connected”. We currently wait for this key to be written so that later calls to VBD.stat will return with “plugged=true”. If the call returns before this key is written then sometimes we receive an event, call VBD.stat and conclude erroneously that a spontaneous VBD unplug occurred.

7. mark each VIF as “active”

This is for the same reason as VBDs are marked “active”.

8. plug VIFs

Again, the order matters. Unlike VBDs, there is no read/write read/only constraint and the devices have unique indices (0, 1, 2, …) but Linux kernels have often (always?) ignored the actual index and instead relied on the order of results from the xenstore-ls listing. The order that xenstored returns the items happens to be the order the nodes were created so this means that (i) xenstored must continue to store directories as ordered lists rather than maps (which would be more efficient); and (ii) Xenopsd must make sure to plug the vifs in the same order. Note that relying on ethX device numbering has always been a bad idea but is still common. I bet if you change this, many tests will suddenly start to fail!

The function VIF.plug_exn will

compute the port locking configuration required and write this to a well-known location in the filesystem where it can be read from the udev scripts. This really should be written to Xenstore instead, since this scheme doesn’t work with driver domains.
add the Xenstore frontend/backend directories containing the network device info
write the VIF information to the Xenopsd database so that future calls to VIF.stat can be told about the associated network
if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

Similarly to the VBD case, the function Device.Vif.add will write the Xenstore keys and wait for the “hotplug-status=connected” key. We do this because we cannot apply the port locking rules until the backend device has been created, and we cannot know the rules have been applied until after the udev script has written the key. If we didn’t wait for it then the VM might execute without all the port locking properly configured.

9. create the device model

The VM_create_device_model micro-op will create a qemu device model if

the VM is HVM; or
the VM uses a PV keyboard or mouse (since only qemu currently has backend support for these devices).

The function VM.create_device_model_exn will

(if using a qemu stubdom) it will create and build the qemu domain
compute the necessary qemu arguments and launch it.

Note that qemu (aka the “device model”) is created after the VIFs and VBDs have been plugged but before the PCI devices have been plugged. Unfortunately qemu traditional infers the needed emulated hardware by inspecting the Xenstore VBD and VIF configuration and assuming that we want one emulated device per PV device, up to the natural limits of the emulated buses (i.e. there can be at most 4 IDE devices: {primary,secondary}{master,slave}). Not only does this create an ordering dependency that needn’t exist – and which impacts migration downtime – but it also completely ignores the plain fact that, on a Xen system, qemu can be in a different domain than the backend disk and network devices. This hack only works because we currently run everything in the same domain. There is an option (off by default) to list the emulated devices explicitly on the qemu command-line. If we switch to this by default then we ought to be able to start up qemu early, as soon as the domain has been created (qemu will need to know the domain id so it can map the I/O request ring).

10. plug PCI devices

PCI devices are treated differently to VBDs and VIFs. If we are attaching the device to an HVM guest then instead of relying on the traditional Xenstore frontend/backend state machine we instead send RPCs to qemu requesting they be hotplugged. Note the domain is paused at this point, but qemu still supports PCI hotplug/unplug. The reasons why this doesn’t follow the standard Xenstore model are known only to the people who contributed this support to qemu. Again the order matters because it determines the position of the virtual device in the VM.

Note that Xenopsd doesn’t know anything about the PCI devices; concepts such as “GPU groups” belong to higher layers, such as xapi.

11. mark the domain as alive

A design principle of Xenopsd is that it should tolerate failures such as being suddenly restarted. It guarantees to always leave the system in a valid state; in particular, there should never be any “half-created VMs”. We achieve this for VM start by exploiting the mechanism which is necessary for reboot. When a VM wishes to reboot it causes the domain to exit (via SCHEDOP_shutdown) with a “reason code” of “reboot”. When Xenopsd sees this event, a VM_check_state operation is queued. This operation calls VM.get_domain_action_request to ask the question, “what needs to be done to make this VM happy now?”. The implementation checks the domain state for shutdown codes and also checks a special Xenopsd Xenstore key. When Xenopsd creates a Xen domain it sets this key to “reboot” (meaning “please reboot me if you see me”) and when Xenopsd finishes starting the VM it clears this key. This means that if Xenopsd crashes while starting a VM, the new Xenopsd will conclude that the VM needs to be rebooted and will clean up the current domain and create a fresh one.
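
The decision described above can be sketched as a small pure function. This is only an illustration, not the code of VM.get_domain_action_request: the type names, the constructor names and the representation of the Xenstore key as a `string option` are assumptions made for the example, and the real implementation handles more shutdown codes.

```ocaml
(* Hypothetical model: shutdown code reported for a domain that has exited. *)
type shutdown_reason = Reboot | Poweroff

(* Hypothetical model: what Xenopsd should do next for this VM. *)
type action = Needs_reboot | Needs_poweroff | Nothing_to_do

(* Combine the two inputs named in the text: the domain's shutdown code
   and the special Xenopsd key (None once VM start has completed). *)
let domain_action_request ~(shutdown : shutdown_reason option)
    ~(action_key : string option) : action =
  match (shutdown, action_key) with
  | Some Reboot, _ -> Needs_reboot
  | Some Poweroff, _ -> Needs_poweroff
  (* Domain still running but the key was never cleared:
     Xenopsd was interrupted while starting the VM. *)
  | None, Some "reboot" -> Needs_reboot
  | None, _ -> Nothing_to_do
```

With this model, a domain whose key still reads “reboot” after a Xenopsd restart yields `Needs_reboot`, while a fully started domain (key cleared, no shutdown code) yields `Nothing_to_do`.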

12. unpause the domain

A Xenopsd VM.start will always leave the domain paused, so strictly speaking this is a separate “operation” queued by the client (such as xapi) after the VM.start has completed. The function VM.unpause is reassuringly simple:

		if di.Xenctrl.total_memory_pages = 0n then raise (Domain_not_built);
		Domain.unpause ~xc di.Xenctrl.domid;
		Opt.iter
			(fun stubdom_domid ->
				Domain.unpause ~xc stubdom_domid
			) (get_stubdom ~xs di.Xenctrl.domid)

Building a VM

flowchart
subgraph xenopsd VM_build[xenopsd:&nbsp;VM_build&nbsp;micro#8209;op]
direction LR
VM_build --> VM.build
VM.build --> VM.build_domain
VM.build_domain --> VM.build_domain_exn
VM.build_domain_exn --> Domain.build
click VM_build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
click VM.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
click VM.build_domain "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
click VM.build_domain_exn "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
click Domain.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
end

Walk-through documents for the VM_build phase:

VM_build μ-op
Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).
Domain.build
Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.
xenguest
Build the VM: Allocate and populate the domain's system memory.

VM_build micro-op

Overview

On Xen, Xenctrl.domain_create creates an empty domain and returns the domain ID (domid) of the new domain to xenopsd.

In the build phase, the xenguest program is called to create the system memory layout of the domain, set vCPU affinity and a lot more.

The VM_build micro-op collects the VM build parameters and calls VM.build, which calls VM.build_domain, which calls VM.build_domain_exn which calls Domain.build:

flowchart
subgraph xenopsd VM_build[xenopsd:&nbsp;VM_build&nbsp;micro#8209;op]
direction LR
VM_build --> VM.build
VM.build --> VM.build_domain
VM.build_domain --> VM.build_domain_exn
VM.build_domain_exn --> Domain.build
click VM_build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
click VM.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
click VM.build_domain "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
click VM.build_domain_exn "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
click Domain.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
end

The function VM.build_domain_exn must:

Run pygrub (or eliloader) to extract the kernel and initrd, if necessary
Call Domain.build to
- optionally run NUMA placement and
- invoke xenguest to set up the domain memory.
See the walk-through of the Domain.build function for more details on this phase.
Apply the cpuid configuration
Store the current domain configuration on disk – it’s important to know the difference between the configuration you started with and the configuration you would use after a reboot because some properties (such as maximum memory and vCPUs) are fixed on create.

Domain.build

Overview

flowchart LR
subgraph xenopsd VM_build[
  xenopsd&nbsp;thread&nbsp;pool&nbsp;with&nbsp;two&nbsp;VM_build&nbsp;micro#8209;ops:
  During&nbsp;parallel&nbsp;VM_start,&nbsp;Many&nbsp;threads&nbsp;run&nbsp;this&nbsp;in&nbsp;parallel!
]
direction LR
build_domain_exn[
  VM.build_domain_exn
  from thread pool Thread #1
]  --> Domain.build
Domain.build --> build_pre
build_pre --> wait_xen_free_mem
build_pre -->|if NUMA/Best_effort| numa_placement
Domain.build --> xenguest[Invoke xenguest]
click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank

build_domain_exn2[
  VM.build_domain_exn
  from thread pool Thread #2]  --> Domain.build2[Domain.build]
Domain.build2 --> build_pre2[build_pre]
build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem]
build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement]
Domain.build2 --> xenguest2[Invoke xenguest]
click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
end

VM.build_domain_exn calls Domain.build, which calls:

build_pre to prepare the build of a VM:
- Wait for the Xen memory scrubber, then set the timer mode and the number of vCPUs.
- If the xe config numa_placement is set to Best_effort, invoke the NUMA placement algorithm.
xenguest to invoke the xenguest program to set up the domain’s system memory.

build_pre: Prepare building the VM

Domain.build calls build_pre (which is also used for VM restore) to:

Call wait_xen_free_mem to wait (if necessary) for the Xen memory scrubber to catch up with reclaiming memory. It
1. calls Xenctrl.physinfo which returns:
  - hostinfo.free_pages - the free and already scrubbed pages (available)
  - host.scrub_pages - the not yet scrubbed pages (not yet available)
2. repeats this until a timeout as long as free_pages is lower than the required pages
  - unless scrub_pages is 0 (no scrubbing left to do)
Note: free_pages is system-wide memory, not memory specific to a NUMA node. Because this is not NUMA-aware, in case of temporary node-specific memory shortage, this check is not sufficient to prevent the VM from being spread over all NUMA nodes. It is planned to resolve this issue by claiming NUMA node memory during NUMA placement.
Call the hypercall to set the timer mode
Call the hypercall to set the number of vCPUs

Call the numa_placement function as described in the NUMA feature description when the xe configuration option numa_placement is set to Best_effort (except when the VM has a hard CPU affinity).

match !Xenops_server.numa_placement with
| Any ->
    ()
| Best_effort ->
    log_reraise (Printf.sprintf "NUMA placement") (fun () ->
        if has_hard_affinity then
          D.debug "VM has hard affinity set, skipping NUMA optimization"
        else
          numa_placement domid ~vcpus
            ~memory:(Int64.mul memory.xen_max_mib 1048576L)
    )
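
The waiting loop of wait_xen_free_mem described earlier in this section can be sketched as follows. This is a simplified model, not the code in domain.ml: the `physinfo` record is a stand-in for the result of Xenctrl.physinfo, `poll` and `max_polls` are hypothetical parameters replacing the real hypercall and timeout, and the sleep between polls is omitted.

```ocaml
(* Simplified stand-in for the two fields of interest from Xenctrl.physinfo. *)
type physinfo = {free_pages: int; scrub_pages: int}

(* Poll until [required] pages are free. Returns true when enough memory is
   available, false when scrubbing has finished without freeing enough
   memory or when [max_polls] (our stand-in for the timeout) is reached. *)
let wait_free ~(poll : unit -> physinfo) ~required ~max_polls =
  let rec loop n =
    let p = poll () in
    if p.free_pages >= required then true
    else if p.scrub_pages = 0 then false (* no scrubbing left to do *)
    else if n >= max_polls then false (* gave up waiting *)
    else loop (n + 1)
  in
  loop 0

(* Test helper: replay a list of states, repeating the last one forever. *)
let replay states =
  let remaining = ref states in
  fun () ->
    match !remaining with
    | [] -> invalid_arg "replay: empty list"
    | [last] -> last
    | next :: rest ->
        remaining := rest ;
        next
```

For example, `wait_free ~poll:(replay [{free_pages= 10; scrub_pages= 90}; {free_pages= 100; scrub_pages= 0}]) ~required:100 ~max_polls:10` succeeds on the second poll. As the note above explains, such a check is system-wide and says nothing about memory on a particular NUMA node.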

NUMA placement

build_pre passes the domid, the number of vCPUs and xen_max_mib to the numa_placement function to run the algorithm to find the best NUMA placement.

When it returns a NUMA node to use, it calls the Xen hypercalls to set the vCPU affinity to this NUMA node:

  let vm = NUMARequest.make ~memory ~vcpus in
  let nodea =
    match !numa_resources with
    | None ->
        Array.of_list nodes
    | Some a ->
        Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
  in
  numa_resources := Some nodea ;
  Softaffinity.plan ~vm host nodea

By using the default auto_node_affinity feature of Xen, setting the vCPU affinity causes the Xen hypervisor to activate NUMA node affinity for memory allocations to be aligned with the vCPU affinity of the domain.

Summary: This passes the information to the hypervisor that memory allocation for this domain should preferably be done from this NUMA node.
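
To make the inputs and the output of the placement step concrete, here is a deliberately naive sketch. It is not the algorithm behind Softaffinity.plan: the `node` record, `pick_node` and the "most free memory wins" rule are assumptions chosen for illustration only. The MiB-to-bytes conversion mirrors the `Int64.mul memory.xen_max_mib 1048576L` expression shown above.

```ocaml
(* Hypothetical view of a NUMA node: its id and its free memory in bytes. *)
type node = {id: int; free_bytes: int64}

(* Same conversion as in build_pre: MiB to bytes. *)
let bytes_of_mib mib = Int64.mul mib 1048576L

(* Naive placement: among the nodes that can hold [memory] bytes,
   return the one with the most free memory, or None if none fits. *)
let pick_node ~memory nodes =
  nodes
  |> List.filter (fun n -> Int64.compare n.free_bytes memory >= 0)
  |> List.fold_left
       (fun best n ->
         match best with
         | Some b when Int64.compare b.free_bytes n.free_bytes >= 0 -> best
         | _ -> Some n)
       None
```

When no node can hold the whole VM, this sketch returns `None`. This corresponds to the case where no single-node affinity is set and memory may be spread over several nodes.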

Invoke the xenguest program

With the preparation in build_pre completed, Domain.build calls the xenguest function to invoke the xenguest program to build the domain.

Notes on future design improvements

The Xen domain feature flag domain->auto_node_affinity can be disabled by calling xc_domain_node_setaffinity() to set a specific NUMA node affinity in special cases:

This can be used, for example, when there might not be enough memory on the preferred NUMA node, and there are other NUMA nodes (in the same CPU package) to use (reference).

xenguest

As part of starting a new domain in VM_build, xenopsd calls xenguest. When multiple domain build threads run in parallel, multiple instances of xenguest also run in parallel:

flowchart
subgraph xenopsd VM_build[xenopsd&nbsp;VM_build&nbsp;micro#8209;ops]
direction LR
xenopsd1[Domain.build - Thread #1] --> xenguest1[xenguest #1]
xenopsd2[Domain.build - Thread #2] --> xenguest2[xenguest #2]
xenguest1 --> libxenguest
xenguest2 --> libxenguest2[libxenguest]
click xenopsd1 "../Domain.build/index.html"
click xenopsd2 "../Domain.build/index.html"
click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
libxenguest --> Xen[Xen<br>Hypervisor]
libxenguest2 --> Xen
end

About xenguest

xenguest is called by the xenopsd Domain.build function to perform the build phase for new VMs, which is part of the xenopsd VM.start operation.

xenguest was created as a separate program due to issues with libxenguest:

It wasn’t threadsafe: this was fixed, but it still uses a per-call global struct.
It had an incompatible licence, but it is now licensed under the LGPL.

Those issues were fixed, but we still shell out to xenguest, which is currently carried in the patch queue for the Xen hypervisor packages, but could become an individual package once planned changes to the Xen hypercalls are stabilised.

Over time, xenguest has evolved to build more of the initial domain state.

Interface to xenguest

flowchart
subgraph xenopsd VM_build[xenopsd&nbsp;VM_build&nbsp;micro#8209;op]
direction TB
mode
domid
memmax
Xenstore
end
mode[--mode hvm_build] --> xenguest
domid --> xenguest
memmax --> xenguest
Xenstore[Xenstore platform data] --> xenguest

xenopsd must pass this information to xenguest to build a VM:

The domain type to build for (HVM, PVH or PV).
- It is passed using the command line option --mode hvm_build.
The domid of the created empty domain,
The amount of system memory of the domain,
A number of other parameters that are domain-specific.

xenopsd uses the Xenstore to provide platform data:

the vCPU affinity
the vCPU credit2 weight/cap parameters
whether the NX bit is exposed
whether the viridian CPUID leaf is exposed
whether the system has PAE or not
whether the system has ACPI or not
whether the system has nested HVM or not
whether the system has an HPET or not

When called to build a domain, xenguest reads those and builds the VM accordingly.
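
As an illustration of how such platform data could be turned into typed flags, the sketch below parses a list of key/value pairs. It is written in OCaml for consistency with the rest of this guide and does not reflect the xenguest implementation; the key names (such as `platform/nx`), the accepted value spellings and the defaults are all assumptions made for the example.

```ocaml
(* Hypothetical typed view of the boolean platform flags listed above. *)
type platform_flags = {
    nx: bool
  ; viridian: bool
  ; pae: bool
  ; acpi: bool
  ; nested_hvm: bool
  ; hpet: bool
}

(* Xenstore values are strings. Here "1"/"true" enable a flag and
   "0"/"false" disable it; anything else falls back to [default]. *)
let flag_of ~default kvs key =
  match List.assoc_opt key kvs with
  | Some ("1" | "true") -> true
  | Some ("0" | "false") -> false
  | Some _ | None -> default

(* [kvs] stands in for the key/value pairs read from Xenstore.
   Key names and defaults are illustrative assumptions. *)
let get_flags kvs =
  {
    nx= flag_of ~default:false kvs "platform/nx"
  ; viridian= flag_of ~default:false kvs "platform/viridian"
  ; pae= flag_of ~default:false kvs "platform/pae"
  ; acpi= flag_of ~default:false kvs "platform/acpi"
  ; nested_hvm= flag_of ~default:false kvs "platform/nested-hvm"
  ; hpet= flag_of ~default:false kvs "platform/hpet"
  }
```

Passing the data through Xenstore rather than the command line keeps the xenguest invocation short: only the mode, the domid and the memory size need to be given as arguments.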

Walkthrough of the xenguest build mode

flowchart
subgraph xenguest[xenguest&nbsp;#8209;#8209;mode&nbsp;hvm_build&nbsp;domid]
direction LR
stub_xc_hvm_build[stub_xc_hvm_build#40;#41;] --> get_flags[
    get_flags#40;#41;&nbsp;<#8209;&nbsp;Xenstore&nbsp;platform&nbsp;data
]
stub_xc_hvm_build --> configure_vcpus[
    configure_vcpus#40;#41;&nbsp;#8209;>&nbsp;Xen&nbsp;hypercall
]
stub_xc_hvm_build --> setup_mem[
    setup_mem#40;#41;&nbsp;#8209;>&nbsp;Xen&nbsp;hypercalls&nbsp;to&nbsp;setup&nbsp;domain&nbsp;memory
]
end

Based on the given domain type, the xenguest program calls dedicated functions for the build process of the given domain type.

These are:

stub_xc_hvm_build() for HVM,
stub_xc_pvh_build() for PVH, and
stub_xc_pv_build() for PV domains.

These domain build functions call these functions:

get_flags() to get the platform data from the Xenstore
configure_vcpus() which uses the platform data from the Xenstore to configure vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap (max % pCPU time for throttling)
The setup_mem function for the given VM type.

The function hvm_build_setup_mem()

For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory layout of the new domain, allocating the required memory and populating it for the new domain. It must:

Derive the e820 memory layout of the system memory of the domain including memory holes depending on PCI passthrough and vGPU flags.
Load the BIOS/UEFI firmware images
Store the final MMIO hole parameters in the Xenstore
Call the libxenguest function xc_dom_boot_mem_init() (see below)
Call construct_cpuid_policy() to apply the CPUID featureset policy

The function xc_dom_boot_mem_init()

flowchart LR
subgraph xenguest
hvm_build_setup_mem[hvm_build_setup_mem#40;#41;]
end
subgraph libxenguest
hvm_build_setup_mem --> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;]
xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meminit_hvm#40;#41;]
click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank
click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank
end

hvm_build_setup_mem() calls xc_dom_boot_mem_init() to allocate and populate the domain’s system memory.

It calls meminit_hvm() to loop over the vmemranges of the domain for mapping the system RAM of the guest from the Xen hypervisor heap. Its goals are:

Attempt to allocate 1GB superpages when possible
Fall back to 2MB pages when 1GB allocation failed
Fall back to 4k pages when both failed

It uses the hypercall XENMEM_populate_physmap to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.
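
The fallback strategy listed above can be sketched as a loop over page orders. This is a simplified OCaml model, not the C code of meminit_hvm(): the `alloc` callback is a hypothetical stand-in for the XENMEM_populate_physmap hypercall, and alignment constraints and memory holes are ignored.

```ocaml
(* Page orders in units of 4k pages: 2^18 pages = 1GB, 2^9 pages = 2MB. *)
let order_1g = 18
let order_2m = 9
let order_4k = 0

(* Cover [pages] 4k pages, preferring the largest extent that still fits.
   [alloc order] returns true when an extent of that order was allocated.
   Returns the orders used, or the number of pages left on failure. *)
let populate ~(alloc : int -> bool) ~pages =
  let rec go remaining acc =
    if remaining = 0 then Ok (List.rev acc)
    else
      let try_order order = remaining >= 1 lsl order && alloc order in
      if try_order order_1g then
        go (remaining - (1 lsl order_1g)) (order_1g :: acc)
      else if try_order order_2m then
        go (remaining - (1 lsl order_2m)) (order_2m :: acc)
      else if alloc order_4k then go (remaining - 1) (order_4k :: acc)
      else Error remaining
  in
  go pages []
```

With an allocator that refuses 1GB extents, for example `populate ~alloc:(fun order -> order < 18) ~pages:1025`, the result consists of two 2MB extents followed by a single 4k page.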

https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071

XENMEM_populate_physmap:

Uses construct_memop_from_reservation to convert the arguments for allocating a page from struct xen_memory_reservation to struct memop_args.
Sets flags and calls functions according to the arguments
Allocates the requested page at the most suitable place
- depending on passed flags, allocate on a specific NUMA node
- else, if the domain has node affinity, on the affine nodes
- also in the most suitable memory zone within the NUMA node
Falls back to less desirable places if this fails
- or fail for “exact” allocation requests
When no pages of the requested size are free, it splits larger superpages into pages of the requested size.

For more details on the VM build step involving xenguest and Xen side see: https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest

Walkthrough: Migrating a VM

At the end of this walkthrough, a sequence diagram of the overall process is included.

Invocation

The command to migrate the VM is dispatched by the autogenerated dispatch_call function from xapi/server.ml. For more information about the generated functions, have a look at the XAPI IDL model.

The command triggers the operation VM_migrate, which uses many low-level atomic operations. These are described in the sections below.

The migrate command has several parameters such as:

Should it be started asynchronously,
Should it be forwarded to another host,
How arguments should be marshalled, and so on.

A new thread is created by xapi/server_helpers.ml to handle the command asynchronously. The helper thread checks if the command should be passed to the message forwarding layer in order to be executed on another host (the destination) or locally (if it is already at the destination host).

It will finally reach xapi/api_server.ml, which posts a command to the message broker, message-switch. This is a JSON-RPC HTTP request sent over a Unix socket, used for communication between some of the XAPI daemons. In the case of migration, the message sent by XAPI is consumed by the xenopsd daemon, which does the job of migrating the VM.

Overview

The migration is an asynchronous task and a thread is created to handle this task. The task reference is returned to the client, which can then check its status until completion.

As shown in the introduction, xenopsd fetches the VM_migrate operation from the message broker.

All tasks specific to libxenctrl, xenguest and Xenstore are handled by the xenopsd xc backend.

The entities that need to be migrated are: VDI, VIF, VGPU and PCI components.

During the migration process, the destination domain will be built with the same UUID as the original VM, except that the last part of the UUID will be XXXXXXXX-XXXX-XXXX-XXXX-000000000001. The original domain will be removed using XXXXXXXX-XXXX-XXXX-XXXX-000000000000.
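
The naming scheme can be illustrated with a few lines of string handling. This is a sketch only: the function names are hypothetical and do not correspond to functions in xenopsd. It assumes the canonical 36-character textual UUID form, whose last group has 12 characters.

```ocaml
(* Length of "XXXXXXXX-XXXX-XXXX-XXXX-", the part of the UUID that is kept. *)
let prefix_len = 24

(* Replace the last group of a canonical 36-character UUID by [suffix]. *)
let with_suffix uuid suffix =
  if String.length uuid <> 36 || String.length suffix <> 12 then
    invalid_arg "with_suffix: expected a 36-char UUID and a 12-char suffix"
  else String.sub uuid 0 prefix_len ^ suffix

(* Temporary UUID under which the destination domain is built. *)
let temporary_uuid uuid = with_suffix uuid "000000000001"

(* UUID under which the original domain is removed. *)
let removal_uuid uuid = with_suffix uuid "000000000000"
```

For the VM UUID `01234567-89ab-cdef-0123-456789abcdef`, `temporary_uuid` gives `01234567-89ab-cdef-0123-000000000001`.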

Preparing VM migration

At specific places, xenopsd can execute hooks to run scripts. In case a pre-migrate script is in place, a command to run this script is sent to the original domain.

Likewise, a command is sent to Qemu using the Qemu Machine Protocol (QMP) to check that the domain can be suspended (see xenopsd/xc/device_common.ml). After checking with Qemu that the VM can be suspended, the migration can begin.

Importing metadata

As for hooks, commands to the source domain are sent using stunnel, a daemon which is used as a wrapper to manage SSL-encrypted communication between two hosts in the same pool. To import the metadata, an XML-RPC command is sent to the original domain.

Once imported, it will give us a reference id and will allow building the new domain on the destination using the temporary VM uuid XXXXXXXX-XXXX-XXXX-XXXX-000000000001 where XXX... is the reference id of the original VM.

Memory setup

One of the first steps is the setup of the VM’s memory: the backend checks that there is no ballooning operation in progress. If there is one, the migration could fail.

Once memory has been checked, the daemon gets the state of the VM (running, halted, …) and the backend retrieves the domain’s platform data (memory, vCPUs, etc.) from the Xenstore.

Once this is complete, we can restore VIF and create the domain.

The synchronisation of the memory is the first point of synchronisation and everything is ready for VM migration.

Destination VM setup

After receiving memory we can set up the destination domain. If we have a vGPU we need to kick off its migration process. We will need to wait for the acknowledgement that the GPU entry has been successfully initialized before starting the main VM migration.

The receiver informs the sender using a handshake protocol that everything is set up and ready for save/restore.

Destination VM restore

VM restore is a low-level atomic operation, VM.restore. This operation is represented by a function call to the backend. It uses Xenguest, a low-level utility from the XAPI toolstack, to interact with the Xen hypervisor and libxc for sending a migration request to the emu-manager.

After sending the request, the results coming from emu-manager are collected by the main thread. It blocks until results are received.

During the live migration, emu-manager helps in ensuring the correct state transitions for the devices and handling the message passing for the VM as it’s moved between hosts. This includes making sure that the state of the VM’s virtual devices, like disks or network interfaces, is correctly moved over.

Destination VM rename

Once all operations are done, xenopsd renames the target VM from its temporary name to its real UUID. This operation is a low-level atomic VM.rename which takes care of updating the Xenstore on the destination host.

Restoring devices

Restoring devices starts by activating VBDs using the low-level atomic operation VBD.set_active, which is an update of Xenstore. VBDs that are read-write must be plugged before read-only ones. Once activated, the low-level atomic operation VBD.plug is called. VDIs are attached and activated.

The next devices are VIFs, which are set as active with VIF.set_active and plugged with VIF.plug. If there are VGPUs, we set them as active now using the atomic VGPU.set_active.

Creating the device model

create_device_model configures qemu-dm and starts it. This makes it possible to manage PCI devices.

PCI plug

PCI.plug is executed by the backend. It plugs a PCI device and advertises it to QEMU if this option is set. It is the case for NVIDIA SR-IOV vGPUs.

Unpause

The libxenctrl call xc_domain_unpause() unpauses the domain, and it starts running.

Cleanup

VM_set_domain_action_request marks the domain as alive: In case xenopsd restarts, it no longer reboots the VM. See the chapter on marking domains as alive for more information.
If a post-migrate script is in place, it is executed by the Xenops_hooks.VM_post_migrate hook.
The final step is a handshake to seal the success of the migration and the old VM can now be cleaned up.

Synchronisation point 4 has been reached, the migration is complete.

Live migration flowchart

This flowchart gives a visual representation of the VM migration workflow:

sequenceDiagram
autonumber
participant tx as sender
participant rx0 as receiver thread 0
participant rx1 as receiver thread 1
participant rx2 as receiver thread 2

activate tx
tx->>rx0: VM.import_metadata
tx->>tx: Squash memory to dynamic-min

tx->>rx1: HTTP /migrate/vm
activate rx1
rx1->>rx1: VM_receive_memory<br/>VM_create (00000001)<br/>VM_restore_vifs
rx1->>tx: handshake (control channel)<br/>Synchronisation point 1

tx->>rx2: HTTP /migrate/mem
activate rx2
rx2->>tx: handshake (memory channel)<br/>Synchronisation point 1-mem

tx->>rx1: handshake (control channel)<br/>Synchronisation point 1-mem ACK

rx2->>rx1: memory fd

tx->>rx1: VM_save/VM_restore<br/>Synchronisation point 2
tx->>tx: VM_rename
rx1->>rx2: exit
deactivate rx2

tx->>rx1: handshake (control channel)<br/>Synchronisation point 3

rx1->>rx1: VM_rename<br/>VM_restore_devices<br/>VM_unpause<br/>VM_set_domain_action_request

rx1->>tx: handshake (control channel)<br/>Synchronisation point 4

deactivate rx1

tx->>tx: VM_shutdown<br/>VM_remove
deactivate tx

References

These pages might help for a better understanding of the XAPI toolstack:

See the XAPI architecture for the overall architecture of Xapi
See the XAPI dispatcher for service dispatch and message forwarding
See the Xenopsd architecture for the overall architecture of Xenopsd
See How Xen suspend and resume works for very similar operations in more detail.

Networkd

The xcp-networkd daemon (hereafter simply called “networkd”) is a component in the xapi toolstack that is responsible for configuring network interfaces and virtual switches (bridges) on a host.

The code is in ocaml/networkd.

Principles

Distro-agnostic. Networkd is meant to work on at least CentOS/RHEL as well as Debian/Ubuntu based distros. It therefore should not use any network configuration features specific to those distros.
Stateless. By default, networkd should not maintain any state. If you ask networkd anything about a network interface or bridge, or any other network sub-system property, it will always query the underlying system (e.g. an IP address), rather than returning any cached state. However, if you want networkd to configure networking at host boot time, then you can ask it to remember the configuration you have set for any interface or bridge you choose.
Idempotent. It should be possible to call any networkd function multiple times without breaking things. For example, calling a function to set an IP address on an interface twice in a row should have the same outcome as calling it just once.
Do no harm. Networkd should only configure what you ask it to configure. This means that it can co-exist with other network managers.

Usage

Networkd is a daemon that is typically started at host-boot time. In the same way as the other daemons in the xapi toolstack, it is controlled by RPC requests. It typically receives requests from the xapi daemon, on behalf of which it configures host networking.

Networkd’s RPC API is fully described by the network_interface.ml file. The API has two main namespaces: Interface and Bridge, which are implemented in two modules in network_server.ml.

In line with other xapi daemons, all API functions take an argument of type debug_info (a string) as their first argument. The debug string appears in any log lines that are produced as a side effect of calling the function.

Network Interface API

The Interface API has functions to query and configure properties of Linux network devices, such as IP addresses, and bringing them up or down. Most Interface functions take a name string as a reference to a network interface as their second argument, which is expected to be the name of the Linux network device. There is also a special function, called Interface.make_config, that is able to configure a number of interfaces at once. It takes an argument called config of type (iface * interface_config_t) list, where iface is an interface name, and interface_config_t is a compound type containing the full configuration for an interface (as far as networkd is able to configure them), currently defined as follows:

type interface_config_t = {
	ipv4_conf: ipv4;
	ipv4_gateway: Unix.inet_addr option;
	ipv6_conf: ipv6;
	ipv6_gateway: Unix.inet_addr option;
	ipv4_routes: (Unix.inet_addr * int * Unix.inet_addr) list;
	dns: Unix.inet_addr list * string list;
	mtu: int;
	ethtool_settings: (string * string) list;
	ethtool_offload: (string * string) list;
	persistent_i: bool;
}

When the function returns, it should have completely configured the interface, and have brought it up. The idempotency principle applies to this function, which means that it can be used to successively modify interface properties; any property that has not changed will effectively be ignored. In fact, Interface.make_config is the main function that xapi uses to configure interfaces, e.g. as a result of a PIF.plug or a PIF.reconfigure_ip call.
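
The idempotency of such a call can be illustrated with a toy model. The sketch below is not networkd code: the `state` record covers only two properties (MTU and link state), and `apply` and the `action` type are hypothetical names. It shows the idea that only properties which differ from the current state lead to any work.

```ocaml
(* Toy model of an interface: only the MTU and whether the link is up. *)
type state = {mtu: int; up: bool}

(* Hypothetical low-level actions that would be carried out. *)
type action = Set_mtu of int | Bring_up

(* Apply a wanted MTU to [current]: return the resulting state together
   with the actions that were actually needed to reach it. *)
let apply current wanted_mtu =
  let mtu_actions =
    if current.mtu <> wanted_mtu then [Set_mtu wanted_mtu] else []
  in
  let up_actions = if current.up then [] else [Bring_up] in
  ({mtu= wanted_mtu; up= true}, mtu_actions @ up_actions)
```

Applying the same configuration a second time yields the same state and an empty action list, which is the behaviour the idempotency principle asks for.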

Also note the persistent property in the interface config. When an interface is made “persistent”, this means that any configuration that is set on it is remembered by networkd, and the interface config is written to disk. When networkd is started, it will read the persistent config and call Interface.make_config on it in order to apply it (see Startup below).

The full networkd API should be documented separately somewhere on this site.

Bridge API

The Bridge API functions are all about the management of virtual switches, also known as “bridges”. The shape of the Bridge API roughly follows that of the Open vSwitch in that it treats a bridge as a collection of “ports”, where a port can contain one or more “interfaces”.

NIC bonding and VLANs are all configured on the Bridge level. There are functions for creating and destroying bridges, adding and removing ports, and configuring bonds and VLANs. Like interfaces, bridges and ports are addressed by name in the Bridge functions. Analogous to the Interface function with the same name, there is a Bridge.make_config function, and bridges can be made persistent.

type port_config_t = {
	interfaces: iface list;
	bond_properties: (string * string) list;
	bond_mac: string option;
}
type bridge_config_t = {
	ports: (port * port_config_t) list;
	vlan: (bridge * int) option;
	bridge_mac: string option;
	other_config: (string * string) list;
	persistent_b: bool;
}

Backends

Networkd currently has two different backends: the “Linux bridge” backend and the “Open vSwitch” backend. The former is the “classic” backend based on the bridge module that is available in the Linux kernel, plus additional standard Linux functionality for NIC bonding and VLANs. The latter backend is newer and uses the Open vSwitch (OVS) for bridging as well as other functionality. Which backend is currently in use is defined by the file /etc/xensource/network.conf, which is read by networkd when it starts. The choice of backend (currently) only affects the Bridge API: every function in it has a separate implementation for each backend.

Low-level Interfaces

Networkd uses standard networking commands and interfaces that are available in most modern Linux distros, rather than relying on any distro-specific network tools (see the distro-agnostic principle). These are tools such as ip (iproute2), dhclient and brctl, as well as the sysfs file system and netlink sockets. To control the OVS, the ovs-* command line tools are used. All low-level functions are called from network_utils.ml.

Configuration on Startup

Networkd, periodically as well as on shutdown, writes the current configuration of all bridges and interfaces (see above) in a JSON format to a file called networkd.db (currently in /var/lib/xcp). The contents of the file are completely described by the following type:

type config_t = {
	interface_config: (iface * interface_config_t) list;
	bridge_config: (bridge * bridge_config_t) list;
	gateway_interface: iface option;
	dns_interface: iface option;
}

The gateway_interface and dns_interface in the config are global host-level options to define from which interfaces the default gateway and DNS configuration is taken. This is especially important when multiple interfaces are configured by DHCP.

When networkd starts up, it first reads network.conf to determine the network backend. It subsequently attempts to parse networkd.db, and tries to call Bridge.make_config and Interface.make_config on it, with a special option to only apply the config for persistent bridges and interfaces, as well as bridges related to those (for example, if a VLAN bridge is configured, then its parent bridge must also be configured).
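The selection rule for bridges at startup can be sketched as follows. This is a minimal illustration under stated assumptions, not networkd's actual code: the record bridge_cfg is a hypothetical stand-in that keeps only the vlan and persistent_b fields of bridge_config_t, the function name bridges_to_apply is invented for this example, and the bridge names are purely illustrative.

```ocaml
(* Hypothetical, simplified stand-in for bridge_config_t. *)
type bridge_cfg = {
  vlan : (string * int) option; (* (parent bridge, VLAN tag) *)
  persistent_b : bool;
}

(* Names of the bridges to configure at startup: every persistent
   bridge, plus the parent bridge of any persistent VLAN bridge. *)
let bridges_to_apply (config : (string * bridge_cfg) list) : string list =
  let persistent = List.filter (fun (_, c) -> c.persistent_b) config in
  let parents =
    List.filter_map (fun (_, c) -> Option.map fst c.vlan) persistent
  in
  List.map fst persistent @ parents |> List.sort_uniq String.compare

(* Illustrative data: only the VLAN bridge is persistent, so its
   parent is selected too, while the unrelated bridge is skipped. *)
let example =
  bridges_to_apply
    [
      ("xenbr0", { vlan = None; persistent_b = false });
      ("xapi1", { vlan = Some ("xenbr0", 100); persistent_b = true });
      ("xenbr1", { vlan = None; persistent_b = false });
    ]
```

The sketch only follows one level of VLAN parentage, which matches the example given in the text.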

Networkd also supports upgrades from older versions of XenServer that used a network configuration script called interface-configure. If networkd.db is not found on startup, then networkd attempts to call this tool (via the /etc/init.d/management-interface script) in order to set up networking at boot time. This is normally followed immediately by a call from xapi instructing networkd to take over.

Finally, if no network config (old or new) is found on disk at all, networkd looks for a XenServer “firstboot” data file, which is written by XenServer’s host installer, and tries to apply it to set up the management interface.

Monitoring

Besides the ability to configure bridges and network interfaces, networkd has facilities for monitoring interfaces and bonds. When networkd starts, a monitor thread is started, which does several things (see network_monitor_thread.ml):

Every 5 seconds, it gathers send/receive counters and link state of all network interfaces. It then writes these stats to a shared-memory file, to be picked up by other components such as xcp-rrdd and xapi (see documentation about “xenostats” elsewhere).
It monitors NIC bonds, and sends alerts through xapi in case of link state changes within a bond.
It uses ip monitor address to watch for IP address changes, and when one occurs, it calls xapi (Host.signal_networking_change) so that xapi updates the IP addresses of the PIFs in its database that were configured by DHCP.

Host Network Device Ordering on Networkd

Purpose

One of the Toolstack’s functions is to maintain a pool of hosts. A pool can be constructed by joining a host into an existing pool. One challenge in this process is determining which pool-wide network a network device on the joining host should connect to.

At first glance, this could be resolved by specifying a mapping between an individual network device and a pool-wide network. However, this approach would be burdensome for administrators when managing many hosts. It would be more efficient if the Toolstack could determine this automatically.

To achieve this, the Toolstack components on two hosts need to independently work out consistent identifications for the host network devices and connect the network devices with the same identification to the same pool-wide network. The identifications on a host can be considered as an order, with each network device assigned a unique position in the order as its identification. Network devices with the same position will connect to the same network.

The assumption

Why can the Toolstack components on two hosts independently work out an expected order without any communication? This is possible only under the assumption that the hosts have identical hardware, firmware, software, and the way network devices are plugged into them. For example, an administrator will always plug the network devices into the same PCI slot position on multiple hosts if they want these network devices to connect to the same network.

The ordering is considered consistent if the positions of such network devices (plugged into the same PCI slot position) in the generated orders are the same.

The biosdevname

Particularly, when the assumption above holds, a consistent initial order can be worked out on multiple hosts independently with the help of biosdevname. The “all_ethN” policy of the biosdevname utility can generate a device order based on whether the device is embedded or not, PCI cards in ascending slot order, and ports in ascending PCI bus/device/function order breadth-first. Since the hosts are identical, the orders generated by the biosdevname are consistent across the hosts.

The following is an example of biosdevname’s output. The initial order can be derived from the BIOS device field.

# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth1
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...

However, the BIOS device of a particular network device may change with the addition or removal of devices. For example:

# biosdevname --policy all_ethN -d -x
BIOS device: eth0
Kernel name: enp4s0
Permanent MAC: EC:F4:BB:E6:D7:BB
Assigned MAC : EC:F4:BB:E6:D7:BB
Bus Info: 0000:04:00.0
...

BIOS device: eth1
Kernel name: enp5s0
Permanent MAC: 00:02:C9:ED:FD:F0
Assigned MAC : 00:02:C9:ED:FD:F0
Bus Info: 0000:05:00.0
...

BIOS device: eth2
Kernel name: enp5s1
Permanent MAC: 00:02:C9:ED:FD:F1
Assigned MAC : 00:02:C9:ED:FD:F1
Bus Info: 0000:05:01.0
...

Therefore, the order derived from these values is used solely for determining the initial order and the order of newly added devices.
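As an illustration of how the output above could be turned into an initial order, the following sketch parses the text format shown in the examples. It is an assumption-laden sketch rather than the Toolstack's implementation: the record nic and the functions parse_field, parse_record, parse_output and position are hypothetical names, and reading the position as the number N in ethN is an interpretation of "derived from the BIOS device field".

```ocaml
(* Hypothetical record holding the fields of interest for one device. *)
type nic = { bios_device : string; mac : string; bus_info : string }

(* Split "Key : value" at the first colon; MAC values keep their colons. *)
let parse_field line =
  match String.index_opt line ':' with
  | None -> None
  | Some i ->
      let key = String.trim (String.sub line 0 i) in
      let value =
        String.trim (String.sub line (i + 1) (String.length line - i - 1))
      in
      Some (key, value)

let parse_record lines =
  let fields = List.filter_map parse_field lines in
  match
    ( List.assoc_opt "BIOS device" fields,
      List.assoc_opt "Permanent MAC" fields,
      List.assoc_opt "Bus Info" fields )
  with
  | Some bios_device, Some mac, Some bus_info ->
      Some { bios_device; mac; bus_info }
  | _ -> None

(* Records are separated by blank lines. *)
let parse_output text =
  let flush cur acc = if cur = [] then acc else List.rev cur :: acc in
  let cur, acc =
    List.fold_left
      (fun (cur, acc) line ->
        if String.trim line = "" then ([], flush cur acc)
        else (line :: cur, acc))
      ([], [])
      (String.split_on_char '\n' text)
  in
  List.rev (flush cur acc) |> List.filter_map parse_record

(* Assumed reading: the position is the number N in "ethN". *)
let position nic =
  let s = nic.bios_device in
  if String.length s > 3 && String.sub s 0 3 = "eth" then
    int_of_string_opt (String.sub s 3 (String.length s - 3))
  else None
```

Lines without a colon, such as the elided "..." lines, are ignored by parse_field.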

Principles

Initially, the order is aligned with PCI slots. This is to make the connection between cabling and order predictable: The network devices in identical PCI slots have the same position. The rationale is that PCI slots are more predictable than MAC addresses and correspond to physical locations.
Once a previous order has been established, the ordering should be maintained as stable as possible despite changes to MAC addresses or PCI addresses. The rationale is that the assumption is less likely to hold as long as the hosts are experiencing updates and maintenance. Therefore, maintaining the stable order is the best choice for automatic ordering.

Notation

mac:pci:position
!mac:pci:position

A network device is characterised by

MAC address, which is unique.
PCI slot, which is not unique: multiple network devices can share a PCI slot. PCI addresses correspond to hardware PCI slots and thus are physically observable.
position, the position assigned to this network device by xcp-networkd. At any given time, no position is assigned twice but the sequence of positions may have holes.
The !mac:pci:position notation indicates that this position was previously used but is currently free because the device it was assigned to was removed.

On a Linux system, MAC and PCI addresses have specific formats. However, for simplicity, symbolic names are used here: MAC addresses use lowercase letters, PCI addresses use uppercase letters, and positions use numbers.

Scenarios

The initial order

As mentioned above, the biosdevname can be used to generate consistent orders for the network devices on multiple hosts.

current input: a:A   b:D   c:C
initial order: a:A:0 c:C:1 b:D:2

This only works if the assumption of identical hardware, firmware, software, and network device placement holds. This assumption is expected to hold for the majority of use cases.

Otherwise, the order can be generated from a user’s configuration. The user can specify the order explicitly for individual hosts. However, administrators would prefer to avoid this as much as possible when managing many hosts.

user spec:     a::0  c::1  b::2
current input: a:A   b:D   c:C
initial order: a:A:0 c:C:1 b:D:2

Keep the order as stable as possible

Once an initial order is created on an individual host, it should be kept as stable as possible across host boot-ups and at runtime. For example, unless there are hardware changes, the position of a network device in the initial order should remain the same regardless of how many times the host is rebooted.

To achieve this, the initial order should be saved persistently on the host’s local storage so it can be referenced in subsequent orderings. When performing another ordering after the initial order has been saved, the position of a currently unordered network device should be determined by finding its position in the last saved order. The MAC address of the network device is a reliable attribute for this purpose, as it is considered unique for each network device globally.

Therefore, the network devices in the saved order should have their MAC addresses saved together, effectively mapping each position to a MAC address. When performing an ordering, the stable position can be found by searching the last saved order using the MAC address.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:D    c:C
new order:     a:A:0  c:C:1  b:D:2

Name labels of the network devices are not considered reliable enough to identify particular devices. For example, if the name labels are determined by the PCI address via systemd, and a firmware update changes the PCI addresses of the network devices, the name labels will also change.

The PCI addresses are not considered reliable either. They may change due to firmware updates, firmware setting changes, or even plugging/unplugging other devices.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:B    c:E
new order:     a:A:0  c:E:1  b:B:2

Replacement

However, what happens when the MAC address of an unordered network device cannot be found in the last saved order? There are two possible scenarios:

It’s a newly added network device since the last ordering.
It’s a new device that replaces an existing network device.

Replacement is a supported scenario, as an administrator might replace a broken network device with a new one.

This can be recognized by comparing the PCI address where the network device is located. Therefore, the PCI address of each network device should also be saved in the order. In this case, searching the PCI address in the order results in one of the following:

Not found: This means the PCI address was not occupied during the last ordering, indicating a newly added network device.
Found with a MAC address, but another device with this MAC address is still present in the system: This suggests that the PCI address of an existing network device (with the same MAC address) has changed since the last ordering. This may be caused by a device move or by another change such as a firmware update. In this case, the current unordered network device is considered newly added.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:B    c:C    d:D
new order:     a:A:0  c:C:1  b:B:2  d:D:3

Found with a MAC address, and no current devices have this MAC address: This indicates that a new network device has replaced the old one in the same PCI slot. The replacing network device should be assigned the same position as the replaced one.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    c:C    d:D
new order:     a:A:0  c:C:1  d:D:2

Removed devices

A network device can be removed or unplugged since the last ordering. Its position, MAC address, and PCI address are saved for future reference, and its position will be reserved. This means there may be a gap in the order: a position that was previously assigned to a network device is now vacant because the device has been removed.

last order:    a:A:0  c:C:1  b:D:2
current input: a:A    b:D
new order:     a:A:0  !c:C:1 b:D:2

Newly added devices

As long as the assumption holds, newly added devices since the last ordering can be assigned positions consistently across multiple hosts. Newly added devices will not be assigned the positions reserved for removed devices.

last order:    a:A:0 !c:C:1  d:D:2
current input: a:A           d:D    e:E
new order:     a:A:0 !c:C:1  d:D:2  e:E:3

Removed and then added back

It is a supported scenario for a removed device to be plugged back in, regardless of whether it is in the same PCI slot or not. This can be recognized by searching for the device in the saved removed devices using its MAC address. The reserved position will be reassigned to the device when it is added back.

last order:    a:A:0 !c:C:1 d:D:2
current input: a:A   c:F    d:D   e:E
new order:     a:A:0 c:F:1  d:D:2 e:E:3

Multinic functions

The multinic function is a special kind of network device. When this type of physical device is plugged into a PCI slot, multiple network devices are reported at a single PCI address. Additionally, the number of reported network devices may change due to driver updates.

current input: a:A b:A c:A d:A
initial order: a:A:0 b:A:1 c:A:2 d:A:3

As long as the assumption holds, the initial order of these devices can be generated automatically and kept stable by using MAC addresses to identify individual devices. However, biosdevname cannot reliably generate an order for all devices reported at one PCI address. For devices located at the same PCI address, their MAC addresses are used to generate the initial order.

last order:    a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5
current input: a:A   b:A   c:A   d:A   e:A   f:A   m:M   n:N
new order:     a:A:0 b:A:1 c:A:2 d:A:3 m:M:4 n:N:5 e:A:6 f:A:7

For example, suppose biosdevname generates an order for a multinic function and other non-multinic devices. Within this order, the N devices of the multinic function with MAC addresses mac[1], …, mac[N] are assigned positions pos[1], …, pos[N] correspondingly. biosdevname cannot ensure that the device with mac[1] is always assigned position pos[1]. Instead, it ensures that the entire set of positions pos[1], …, pos[N] remains stable for the devices of the multinic function. Therefore, to ensure the order follows the MAC address order, the devices of the multinic function need to be sorted by their MAC addresses within the set of positions.

last order:    a:A:0 b:A:1 c:A:2 d:A:3 m:M:4
current input: e:A   f:A   g:A   h:A   m:M
new order:     e:A:0 f:A:1 g:A:2 h:A:3 m:M:4
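The sorting step for a multinic function can be sketched as below. This is only an illustration: the function name sort_within_positions is hypothetical, and MAC addresses are assumed to be uniformly formatted strings so that lexical order matches numeric order.

```ocaml
(* Input: (MAC, position) pairs for the devices of one multinic
   function, as assigned by an initial ordering. Output: the same set
   of positions, redistributed so that ascending MAC addresses receive
   ascending positions. *)
let sort_within_positions (devices : (string * int) list) :
    (string * int) list =
  let macs = List.sort String.compare (List.map fst devices) in
  let positions = List.sort compare (List.map snd devices) in
  List.combine macs positions
```

The set of positions is left unchanged; only the assignment of devices to positions within that set is made deterministic.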

Rare cases that cannot be handled automatically

In summary, to keep the order stable, the auto-generated order needs to be saved for the next ordering. When performing an automatic ordering for the current network devices, either the MAC address or the PCI address is used to recognize the device that was assigned the same position in the last ordering. If neither the MAC address nor the PCI address can be used to find a position from the last ordering, the device is considered newly added and is assigned a new position.
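The matching rules summarised above can be sketched in a few lines. This is a simplified model and not the xcp-networkd implementation: the types dev and entry and the functions reorder, show and mk are hypothetical, the current list is assumed to arrive already in the biosdevname-derived order, and the sketch assumes that only devices present in the last order can be replaced, so positions reserved for removed devices are reclaimed by MAC address only.

```ocaml
type dev = { mac : string; pci : string }

type entry = { dev : dev; pos : int; present : bool }

let reorder (last : entry list) (current : dev list) : entry list =
  let in_current mac = List.exists (fun d -> d.mac = mac) current in
  (* A new position is one past the highest position ever used. *)
  let next_free assigned =
    1 + List.fold_left (fun m e -> max m e.pos) (-1) (last @ assigned)
  in
  let place assigned d =
    let taken p = List.exists (fun e -> e.pos = p) assigned in
    (* Rule 1: same MAC as before (present or removed). *)
    let by_mac = List.find_opt (fun e -> e.dev.mac = d.mac) last in
    (* Rule 2: same PCI address as a device whose MAC has disappeared. *)
    let replaced =
      List.find_opt
        (fun e ->
          e.present && e.dev.pci = d.pci
          && (not (in_current e.dev.mac))
          && not (taken e.pos))
        last
    in
    let pos =
      match (by_mac, replaced) with
      | Some e, _ -> e.pos
      | None, Some e -> e.pos
      | None, None -> next_free assigned (* Rule 3: newly added. *)
    in
    { dev = d; pos; present = true } :: assigned
  in
  let assigned = List.fold_left place [] current in
  (* Devices that vanished keep their position reserved. *)
  let removed =
    List.filter
      (fun e ->
        (not (in_current e.dev.mac))
        && not (List.exists (fun a -> a.pos = e.pos) assigned))
      last
    |> List.map (fun e -> { e with present = false })
  in
  List.sort (fun a b -> compare a.pos b.pos) (assigned @ removed)

(* Render an entry in the mac:pci:position notation used above. *)
let show e =
  Printf.sprintf "%s%s:%s:%d"
    (if e.present then "" else "!")
    e.dev.mac e.dev.pci e.pos

let mk mac pci pos = { dev = { mac; pci }; pos; present = true }

let last = [ mk "a" "A" 0; mk "c" "C" 1; mk "b" "D" 2 ]
```

With this last order, calling reorder on the inputs from the replacement, removal and PCI-change scenarios reproduces the new orders described in those sections.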

However, following this sorting logic, the ordering result may not always be as expected. In practice, this can be caused by various rare cases, such as switching an existing network device to connect to another network, performing firmware updates, changing firmware settings, or plugging/unplugging network devices. It is not worth complicating the entire function for these rare cases. Instead, the initial user’s configuration can be used to handle these rare scenarios.

Squeezed

Squeezed is the XAPI Toolstack’s host memory manager (aka ballooning daemon). Squeezed uses ballooning to move memory between running VMs, to avoid wasting host memory.

Principles

Avoid wasting host memory: unused memory should be put to use by returning it to VMs.
Memory should be shared in proportion to the configured policy.
Operate entirely at the level of domains (not VMs), and be independent of Xen toolstack.

Squeezed Architecture

Squeezed is the XAPI Toolstack’s host memory ballooning daemon. It “balances” memory between VMs according to a policy written to Xenstore.

The following diagram shows the internals of Squeezed:

Internals of squeezed

At the center of squeezed is an abstract model of a Xen host. The model includes:

The amount of already-used host memory (used by fixed overheads such as Xen and the crash kernel).
Per-domain memory policy, specifically dynamic-min and dynamic-max, which together describe a range within which the domain’s actual used memory should remain.
Per-domain calibration data which allows us to compute the necessary balloon target value to achieve a particular memory usage value.

Squeezed is a single-threaded program which receives commands from xenopsd over a Unix domain socket. When Xenopsd wishes to start a new VM, squeezed will be asked to create a “reservation”. Note this is different to the Xen notion of a reservation. A squeezed reservation consists of an amount of memory squeezed will guarantee to keep free labelled with an id. When Xenopsd later creates the domain to notionally use the reservation, the reservation is “transferred” to the domain before the domain is built.

Squeezed will also wake up every 30s and attempt to rebalance the memory on a host. This is useful to correct imbalances caused by balloon drivers temporarily failing to reach their targets. Note that ballooning is fundamentally a co-operative process, so squeezed must handle cases where the domains refuse to obey commands.

The “output” of squeezed is a list of “actions” which include:

Set domain x’s memory/target to a new value.
Set the maxmem of a domain to a new value (as a hard limit beyond which the domain cannot allocate).

Design

Squeezed is a single host memory ballooning daemon. It helps by:

Allowing VM memory to be adjusted dynamically without having to reboot; and
Avoiding wasting memory by keeping everything fully utilised, while retaining the ability to take memory back to start new VMs.

Squeezed currently includes a simple Ballooning policy which serves as a useful default. The policy is written with respect to an abstract Xen memory model, which is based on a number of assumptions about the environment, for example that most domains have co-operative balloon drivers. In theory the policy could be replaced later with something more sophisticated (for example see [xenballoond](https://github.com/avsm/xen-unstable/blob/master/tools/xenballoon/xenballoond.README)).

The Toolstack interface is used by Xenopsd to free memory for starting new VMs. Although the only known client is Xenopsd, the interface can in theory be used by other clients. Multiple clients can safely use the interface at the same time.

The internal structure consists of a single-thread event loop. To see how it works end-to-end, consult the example.

No software is ever perfect; to understand the flaws in Squeezed, please consult the list of issues.

Environmental assumptions

The Squeezed daemon runs within a Xen domain 0 and communicates to xenstored via a Unix domain socket. Therefore Squeezed is granted full access to xenstore, enabling it to modify every domain’s memory/target.
The Squeezed daemon calls setmaxmem in order to cap the amount of memory a domain can use. This relies on a patch to xen which allows maxmem to be set lower than totpages. See Section maxmem for more information.
The Squeezed daemon assumes that only domains which write control/feature-balloon into xenstore can respond to ballooning requests. It will not ask any other domains to balloon.
The Squeezed daemon assumes that the memory used by a domain is: (i) that listed in domain_getinfo as totpages; (ii) shadow as given by shadow_allocation_get; and (iii) a small amount (a few KiB) of miscellaneous Xen structures (e.g. for domains, vcpus) which are invisible.
The Squeezed daemon assumes that a domain which is created with a particular memory/target (and startmem, to within rounding error) will reach a stable value of totpages before writing control/feature-balloon. The daemon writes this value to memory/memory-offset for future reference.
- The Squeezed daemon does not know or care exactly what causes the difference between totpages and memory/target and it does not expect it to remain constant across Xen releases. It only expects the value to remain constant over the lifetime of a domain.
The Squeezed daemon assumes that the balloon driver has hit its target when the difference between memory/target and totpages equals the memory-offset value.
- Corollary: to make a domain with a responsive balloon driver currently using totpages allocate or free x, it suffices to set memory/target to x + totpages - memory-offset (with x negative when freeing) and wait for the balloon driver to finish. See Section memory model for more detail.
The Squeezed daemon must maintain a “slush fund” of memory (currently 9MiB) which it must prevent any domain from allocating. Since (i) some Xen operations (such as domain creation) require memory within a physical address range (e.g. less than 4GiB) and (ii) since Xen preferentially allocates memory outside these ranges, it follows that by preventing guests from allocating all host memory (even transiently) we guarantee that memory from within these special ranges is always available. Squeezed operates in two phases: first causing memory to be freed; and second causing memory to be allocated.
The Squeezed daemon assumes that it may set memory/target to any value within the range memory/dynamic-min to memory/dynamic-max.
The Squeezed daemon assumes that the probability of a domain booting successfully may be increased by setting memory/target closer to memory/static-max.
The Squeezed daemon assumes that, if a balloon driver has not made any visible progress after 5 seconds, it is effectively inactive. Active domains will be expected to pick up the slack.

Toolstack interface

The toolstack interface introduces the concept of a reservation. A reservation is: an amount of host free memory tagged with an associated reservation id. Note this is an internal Squeezed concept and Xen is completely unaware of it. When the daemon is moving memory between domains, it always aims to keep

host free memory >= s + sum_i(reservation_i)

where s is the size of the “slush fund” (currently 9MiB) and reservation_i is the amount corresponding to the ith reservation.
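The inequality above can be written as a small check. This is a sketch only: the names slush_fund_kib, invariant_holds and max_new_reservation_kib are hypothetical, and all quantities are assumed to be expressed in KiB.

```ocaml
(* Slush fund size from the text: 9 MiB, expressed here in KiB. *)
let slush_fund_kib = 9 * 1024

(* host free memory >= s + sum_i(reservation_i) *)
let invariant_holds ~host_free_kib ~reservations_kib =
  host_free_kib >= slush_fund_kib + List.fold_left ( + ) 0 reservations_kib

(* Largest additional reservation that keeps the invariant true
   without freeing any more memory (never negative). *)
let max_new_reservation_kib ~host_free_kib ~reservations_kib =
  max 0
    (host_free_kib - slush_fund_kib
    - List.fold_left ( + ) 0 reservations_kib)
```

For example, with 20000 KiB free and two reservations of 4096 KiB each, the right-hand side is 9216 + 8192 = 17408 KiB, so the invariant holds.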

As an aside: Earlier versions of Squeezed always associated memory with a Xen domain. Unfortunately this required domains to be created before memory was freed which was problematic because domain creation requires small amounts of contiguous frames. Rather than implement some form of memory defragmentation, Squeezed and Xenopsd were modified to free memory before creating a domain. This necessitated making memory reservations first-class stand-alone entities.

Once a reservation is made (and the corresponding memory is freed), it can be transferred to a domain created by a toolstack. This associates the reservation with that domain so that, if the domain is destroyed, the reservation is also freed. Note that Squeezed is careful not to count both a domain’s reservation and its totpages during e.g. domain building: instead it considers the domain’s allocation to be the maximum of reservation and totpages.

The size of a reservation may either be specified exactly by the caller or the caller may provide a memory range. If a range is provided the daemon will allocate at least as much as the minimum value provided and as much as possible up to the maximum. By allocating as much memory as possible to the domain, the probability of a successful boot is increased.

Clients of Squeezed provide a string name when they log in. All untransferred reservations made by a client are automatically deleted when that client next logs in. This prevents memory leaks where a client crashes and loses track of its own reservation ids.

The interface looks like this:

string session_id login(
  string client_name
)

string reservation_id reserve_memory(
  string client_name,
  int kib
)

int amount, string reservation_id reserve_memory_range(
  string client_name,
  int min,
  int max
)

void delete_reservation(
  string client_name,
  string reservation_id
)

void transfer_reservation_to_domain(
  string client_name,
  string reservation_id,
  int domid
)

The Xenopsd code in pseudocode works as follows:

 r_id = reserve_memory_range("xenopsd", min, max);
 try:
    d = domain_create()
    transfer_reservation_to_domain("xenopsd", r_id, d)
 with:
    delete_reservation("xenopsd", r_id)
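The same reserve, create, transfer pattern can be sketched in OCaml. This is not the Xenopsd source: the function start_domain is hypothetical, and the four operations are passed in as labelled arguments (mirroring the interface listed above) so that the sketch stays self-contained. Re-raising the exception after cleanup is an assumption of this sketch.

```ocaml
(* Reserve memory first, then create the domain and hand the
   reservation over. If anything fails after the reservation was made,
   delete it so that the memory is not leaked. *)
let start_domain ~reserve_memory_range ~domain_create
    ~transfer_reservation_to_domain ~delete_reservation ~min_kib ~max_kib =
  let _amount, r_id = reserve_memory_range "xenopsd" min_kib max_kib in
  try
    let domid = domain_create () in
    transfer_reservation_to_domain "xenopsd" r_id domid;
    domid
  with e ->
    delete_reservation "xenopsd" r_id;
    raise e
```

Once the transfer has succeeded, the reservation belongs to the domain and is released when the domain is destroyed, so no explicit cleanup is needed on the success path.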

The interface is currently implemented using a trivial RPC protocol over a Unix domain socket in domain 0.

Ballooning policy

This section describes the very simple default policy currently built-into Squeezed.

Every domain has a pair of values written into xenstore: memory/dynamic-min and memory/dynamic-max with the following meanings:

memory/dynamic-min the lowest value that Squeezed is allowed to set memory/target. The administrator should make this as low as possible but high enough to ensure that the applications inside the domain actually work.
memory/dynamic-max the highest value that Squeezed is allowed to set memory/target. This can be used to dynamically cap the amount of memory a domain can use.

If all balloon drivers are responsive then the Squeezed daemon allocates memory proportionally, so that each domain has the same value of: (target - min)/(max - min)

So:

if memory is plentiful then all domains will have memory/target=memory/dynamic-max
if memory is scarce then all domains will have memory/target=memory/dynamic-min
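The proportional rule can be illustrated with a short sketch. It is a simplified model, not Squeezed's code: the type domain and the functions fraction and targets are hypothetical, available stands for the memory to be shared among the ballooning domains, all values use the same unit, and memory-offset and inactive domains are ignored.

```ocaml
type domain = { name : string; dynamic_min : int; dynamic_max : int }

(* Common fraction g = (target - min) / (max - min), clamped to [0, 1]. *)
let fraction ~available domains =
  let sum f = List.fold_left (fun acc d -> acc + f d) 0 domains in
  let total_min = sum (fun d -> d.dynamic_min) in
  let total_range = sum (fun d -> d.dynamic_max - d.dynamic_min) in
  if total_range = 0 then 0.0
  else
    let g =
      float_of_int (available - total_min) /. float_of_int total_range
    in
    max 0.0 (min 1.0 g)

(* Every domain gets target = min + g * (max - min). *)
let targets ~available domains =
  let g = fraction ~available domains in
  List.map
    (fun d ->
      let range = float_of_int (d.dynamic_max - d.dynamic_min) in
      (d.name, d.dynamic_min + int_of_float (g *. range)))
    domains
```

When available is at least the sum of all dynamic-max values the fraction is 1 and every target equals dynamic-max; when it is no more than the sum of all dynamic-min values the fraction is 0 and every target equals dynamic-min.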

Note that the values of memory/target suggested by the policy are ideal values. In many real-life situations (e.g. when a balloon driver fails to make progress and is declared inactive) the memory/target values will be different.

Note that, by default, domain 0 has memory/dynamic-min=memory/dynamic-max, effectively disabling ballooning. Clearly a more sophisticated policy would be required here since ballooning down domain 0 as extra domains are started would be counterproductive while backends and control interfaces remain in domain 0.

The memory model

Squeezed considers a ballooning-aware domain (i.e. one which has written the feature-balloon flag into xenstore) to be completely described by the parameters:

dynamic-min: policy value written to memory/dynamic-min in xenstore by a toolstack (see Section Ballooning policy)
dynamic-max: policy value written to memory/dynamic-max in xenstore by a toolstack (see Section Ballooning policy)
target: balloon driver target written to memory/target in xenstore by Squeezed.
totpages: instantaneous number of pages used by the domain as returned by the hypercall domain_getinfo
memory-offset: constant difference between totpages and target when the balloon driver believes no ballooning is necessary, i.e. memory-offset = totpages - target when the balloon driver believes it has reached its target.
maxmem: upper limit on totpages: where totpages <= maxmem

For convenience we define an adjusted-target to be the target value necessary to cause a domain currently using totpages to maintain this value indefinitely, so adjusted-target = totpages - memory-offset.

The Squeezed daemon believes that:

a domain should be ballooning iff adjusted-target <> target (unless it has become inactive)
a domain has hit its target iff adjusted-target = target (to within 1 page);
if a domain has target = x then, when ballooning is complete, it will have totpages = memory-offset + x; and therefore
to cause a domain to free y it suffices to set target := totpages - memory-offset - y.
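The relations in this list translate directly into code. The following is a sketch with hypothetical names (domain, adjusted_target, has_hit_target, target_to_free), assuming that all quantities are expressed in pages.

```ocaml
type domain = { target : int; totpages : int; memory_offset : int }

(* adjusted-target = totpages - memory-offset *)
let adjusted_target d = d.totpages - d.memory_offset

(* A domain has hit its target iff adjusted-target = target,
   to within 1 page. *)
let has_hit_target d = abs (adjusted_target d - d.target) <= 1

(* New target value that makes the domain free y pages:
   target := totpages - memory-offset - y *)
let target_to_free d y = d.totpages - d.memory_offset - y
```

For a domain with totpages = 1010 and memory-offset = 10, the adjusted-target is 1000, so freeing 100 pages means setting the target to 900.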

The Squeezed daemon considers non-ballooning aware domains (i.e. those which have not written feature-balloon) to be represented by pairs of:

totpages: instantaneous number of pages used by the domain as returned by domain_getinfo
reservation: memory initially freed for this domain by Squeezed after a transfer_reservation_to_domain call

Note that non-ballooning aware domains will always have startmem = target since the domain will not be instructed to balloon. Since a domain which is being built will have 0 <= totpages <= reservation, Squeezed computes unused(i) = reservation(i) - totpages(i) and subtracts this from its model of the host’s free memory, ensuring that it doesn’t accidentally reallocate this memory for some other purpose.

The Squeezed daemon believes that:

all guest domains start out as non-ballooning aware domains where target = reservation = startmem;
some guest domains become ballooning-aware during their boot sequence i.e. when they write feature-balloon

The Squeezed daemon considers a host to be represented by:

ballooning domains: a set of domains which Squeezed will instruct to balloon;
other domains: a set of booting domains and domains which have no balloon drivers (or whose balloon drivers have failed)
a “slush fund” of low memory required for Xen
physinfo.free_pages: total amount of memory instantaneously free (including both free_pages and scrub_pages)
reservations: batches of free memory which are not (yet) associated with any domain

The Squeezed daemon considers memory to be unused (i.e. not allocated for any useful purpose) if it is neither in use by a domain nor reserved.

The main loop

The main loop is triggered by either:

the arrival of an allocation request on the toolstack interface; or
the policy engine – polled every 10s – deciding that a target adjustment is needed.

Each iteration of the main loop generates the following actions:

Domains which were active but have failed to make progress towards their target in 5s are declared inactive. These domains then have: maxmem set to the minimum of target and totpages.
Domains which were inactive but have started to make progress towards their target are declared active. These domains then have: maxmem set to target.
Domains which are currently active have new targets computed according to the policy (see Section Ballooning policy). Note that inactive domains are ignored and not expected to balloon.

Note that domains remain classified as inactive only during one run of the main loop. Once the loop has terminated all domains are optimistically assumed to be active again. Therefore should a domain be classified as inactive once, it will get many later chances to respond.
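The per-domain decisions made in one iteration can be sketched as a pure function. This is an illustrative model, not the daemon's code: the types state, domain and action and the function step are hypothetical, the stalled flag stands for "has failed to make progress towards its target in 5s", and policy_target is a placeholder for the policy described in Section Ballooning policy.

```ocaml
type state = Active | Inactive

type domain = {
  domid : int;
  state : state;
  target : int;
  totpages : int;
  stalled : bool; (* no progress towards the target in the last 5s *)
}

type action =
  | Set_maxmem of int * int (* domid, new maxmem *)
  | Set_target of int * int (* domid, new memory/target *)

(* Returns the updated domain and the actions to perform for it. *)
let step ~policy_target d =
  match (d.state, d.stalled) with
  | Active, true ->
      (* Declared inactive: cap at min(target, totpages). *)
      ( { d with state = Inactive },
        [ Set_maxmem (d.domid, min d.target d.totpages) ] )
  | Inactive, false ->
      (* Progress resumed: declared active again, cap at target. *)
      ({ d with state = Active }, [ Set_maxmem (d.domid, d.target) ])
  | Active, false ->
      (* Still active: compute a new target according to the policy. *)
      let t = policy_target d in
      ({ d with target = t }, [ Set_target (d.domid, t) ])
  | Inactive, true ->
      (* Inactive domains are ignored and not expected to balloon. *)
      (d, [])
```

Since all domains are assumed to be active again at the start of the next run, a caller following the text would reset state to Active once the loop has terminated.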

The targets are set in two phases. The maxmem is used to prevent domains suddenly allocating more memory than we want them to.

The main loop has a notion of a host free memory “target”, similar to the existing domain memory target. When we are trying to free memory (e.g. for starting a new VM), the host free memory “target” is increased. When we are trying to distribute memory among guests (e.g. after a domain has shut down and freed lots of memory), the host free memory “target” is decreased. Note the host free memory “target” is always at least several MiB to ensure that some host free memory with physical address less than 4GiB is free (see Two phase target setting for related information).

The main loop terminates when all active domains have reached their targets (this could be because all domains responded or because they all wedged and became inactive); and the policy function hasn’t suggested any new target changes. There are three possible results:

Success if the host free memory is near enough its “target”;
Failure if the operation is simply impossible within the policy limits (i.e. dynamic_min values are too high);
Failure if the operation failed because one or more domains became inactive and this prevented us from reaching our host free memory “target”.

Note that, since only active domains have their targets set, the system effectively rewards domains which refuse to free memory (inactive) and punishes those which do free memory (active). This effect is countered by signalling to the admin which domains/VMs aren’t responding so they can take corrective action. To achieve this, the daemon monitors the list of inactive domains and if a domain is inactive for more than 20s it writes a flag into xenstore memory/uncooperative. This key can be monitored and used to generate an alert, if desired.
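The three possible results and the “uncooperative” flag can be sketched in Python as below. This is an illustration under stated assumptions rather than the daemon's logic: the `tolerance` used for “near enough”, the `freeable_at_dynamic_min` input (the memory that would be freed if every domain were ballooned down to dynamic_min) and all function names are hypothetical; only the three outcomes and the 20s threshold come from the text.

```python
from enum import Enum
from typing import Dict, List

UNCOOPERATIVE_AFTER_SECONDS = 20.0  # from the text


class Outcome(Enum):
    SUCCESS = "host free memory is near enough its target"
    IMPOSSIBLE = "impossible within the policy limits"
    DOMAINS_INACTIVE = "prevented by inactive domains"


def outcome(host_free: int, host_target: int, tolerance: int,
            freeable_at_dynamic_min: int) -> Outcome:
    if abs(host_free - host_target) <= tolerance:
        return Outcome.SUCCESS
    if host_free + freeable_at_dynamic_min < host_target - tolerance:
        # Even with every domain at dynamic_min the target is out of reach.
        return Outcome.IMPOSSIBLE
    # The loop has terminated, so the shortfall is due to inactive domains.
    return Outcome.DOMAINS_INACTIVE


def uncooperative(inactive_since: Dict[int, float], now: float) -> List[int]:
    """Domain ids that have been inactive for more than 20s."""
    return sorted(domid for domid, since in inactive_since.items()
                  if now - since > UNCOOPERATIVE_AFTER_SECONDS)
```

The domains returned by `uncooperative` are the ones for which the memory/uncooperative flag would be written to xenstore.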

Two phase target setting

The following diagram shows how a system with two domains can evolve if domain memory/target values are increased for some domains and decreased for others, at the same time. Each graph shows two domains (domain 1 and domain 2) and a host. For a domain, the square box shows its adjusted-totpages and the arrow indicates the direction of the memory/target. For the host the square box indicates total free memory. Note the highlighted state where the host’s free memory is temporarily exhausted.

Two phase target setting

In the initial state (at the top of the diagram), there are two domains, one which has been requested to use more memory and the other requested to use less memory. In effect the memory is to be transferred from one domain to the other. In the final state (at the bottom of the diagram), both domains have reached their respective targets, the memory has been transferred and the host free memory is at the same value it was initially. However the system will not move atomically from the initial state to the final: there are a number of possible transient in-between states, two of which have been drawn in the middle of the diagram. In the left-most transient state the domain which was asked to free memory has freed all the memory requested: this is reflected in the large amount of host memory free. In the right-most transient state the domain which was asked to allocate memory has allocated all the memory requested: now the host’s free memory has hit zero.

If the host’s free memory hits zero then Xen has been forced to give all memory to guests, including memory less than 4GiB which is critical for allocating certain structures. Even if we ask a domain to free memory via the balloon driver there is no guarantee that it will free the useful memory. This leads to an annoying failure mode where operations such as creating a domain fail due to ENOMEM despite the fact that there is apparently lots of memory free.

The solution to this problem is to adopt a two-phase memory/target setting policy. The Squeezed daemon forces domains to free memory first before allowing domains to allocate, in effect forcing the system to move through the left-most state in the diagram above.
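The ordering imposed by the two-phase policy can be illustrated with a short Python sketch. The function name and the dictionaries mapping a domain id to a memory/target value are hypothetical; the sketch only shows how new targets are split so that decreases are applied (and waited for) before increases.

```python
from typing import Dict, Tuple


def two_phase_plan(current: Dict[int, int],
                   new: Dict[int, int]) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Split new targets into (phase 1: decreases, phase 2: increases)."""
    decreases = {dom: t for dom, t in new.items() if t < current[dom]}
    increases = {dom: t for dom, t in new.items() if t > current[dom]}
    return decreases, increases


# Domain 1 is asked to shrink and domain 2 to grow by the same amount.
phase1, phase2 = two_phase_plan({1: 2048, 2: 1024}, {1: 1024, 2: 2048})
```

Here `phase1` contains only domain 1 and `phase2` only domain 2. Applying `phase2` after the domains in `phase1` have freed their memory keeps the system away from the state where host free memory hits zero.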

Use of maxmem

The Xen domain maxmem value is used to limit memory allocations by the domain. The rules are:

if the domain has never been run and is paused then maxmem is set to reservation (reservations were described in the Toolstack interface section above);
- these domains are probably still being built and we must let them allocate their startmem
- FIXME: this “never been run” concept pre-dates the feature-balloon flag: perhaps we should use the feature-balloon flag instead.
if the domain is running and the balloon driver is thought to be working then maxmem is set to target; and
- there may be a delay between lowering a target and the domain noticing so we prevent the domain from allocating memory when it should in fact be deallocating.
if the domain is running and the balloon driver is thought to be inactive then maxmem is set to the minimum of target and actual.
- if the domain is using more memory than it should then we allow it to make progress down towards its target; however
- if the domain is using less memory than it should then we must prevent it from suddenly waking up and allocating more since we have probably just given it to someone else
- FIXME: should we reduce the target to leave the domain in a neutral state instead of asking it to allocate and fail forever?
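The three maxmem rules can be condensed into one function. The following Python sketch uses hypothetical parameter names and is not taken from the Squeezed sources; it simply restates the rules listed above.

```python
def maxmem_for(never_run_and_paused: bool, balloon_working: bool,
               reservation: int, target: int, actual: int) -> int:
    """Return the maxmem value to set for a domain."""
    if never_run_and_paused:
        # Probably still being built: let it allocate its startmem.
        return reservation
    if balloon_working:
        # Stop the domain allocating while it should be deallocating.
        return target
    # Inactive balloon driver: allow progress downwards, but never let the
    # domain wake up and grab memory that may already have been given away.
    return min(target, actual)
```

For an inactive domain with target 2048 and actual usage 1024, this yields a maxmem of 1024, which is the case discussed in the last FIXME above.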

Example operation

The diagram shows an initial system state comprising 3 domains on a single host. The state is not ideal; the domains each have the same policy settings (dynamic-min and dynamic-max) and yet are using differing values of adjusted-totpages. In addition the host has more memory free than desired. The second diagram shows the result of computing ideal target values and the third diagram shows the result after targets have been set and the balloon drivers have responded.

calculation

The scenario above includes 3 domains (domain 1, domain 2, domain 3) on a host. Each of the domains has a non-ideal adjusted-totpages value.

Recall we also have the policy constraint that:

dynamic-min <= target <= dynamic-max

Hypothetically, if we reduce target by target - dynamic-min (i.e. by setting target to dynamic-min) then we should reduce totpages by the same amount, freeing this much memory on the host. In the upper-most graph in the diagram above, the total amount of memory which would be freed if we set each of the 3 domains’ targets to dynamic-min is:

d1 + d2 + d3

In this hypothetical situation we would now have x + s + d1 + d2 + d3 free on the host, where s is the host slush fund and x is completely unallocated. Since we always want to keep the host free memory above s, we are free to return x + d1 + d2 + d3 to guests. If we use the default built-in proportional policy then, since all domains have the same dynamic-min and dynamic-max, each gets the same fraction of this free memory, which we call g:

g = (x + d1 + d2 + d3) / 3

For each domain, the ideal balloon target is now target = dynamic-min + g. Squeezed does not set all the targets at once: this would allow the allocating domains to race with the deallocating domains, potentially allowing all low memory to be allocated. Therefore Squeezed sets the targets in two phases.
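The calculation can be worked through with concrete numbers. The Python sketch below covers only the special case described here, where every domain has the same dynamic-min and dynamic-max so each receives an equal share; the function name and the sample values are invented for illustration, and the clamp to dynamic-max simply applies the policy constraint.

```python
from typing import List


def ideal_targets(totpages: List[int], dynamic_min: int, dynamic_max: int,
                  host_free: int, slush_fund: int) -> List[int]:
    # d_i: memory freed if domain i were ballooned down to dynamic_min.
    d = [t - dynamic_min for t in totpages]
    # x: host memory that is completely unallocated (free minus slush fund s).
    x = host_free - slush_fund
    # Equal share g of everything that may be handed back to guests.
    g = (x + sum(d)) // len(totpages)
    # target = dynamic_min + g, kept within dynamic_min <= target <= dynamic_max.
    return [min(dynamic_max, dynamic_min + g) for _ in totpages]


# Three domains using 1524, 2024 and 2524 with dynamic-min 1024:
# d = 500, 1000, 1500; x = 356 - 56 = 300; g = (300 + 3000) / 3 = 1100.
targets = ideal_targets([1524, 2024, 2524], 1024, 4096, host_free=356, slush_fund=56)
```

Every domain ends up with the same target of 1024 + 1100 = 2124, and the host is left with exactly its slush fund free.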

The structure of the daemon

Squeezed is a single-threaded daemon which is started by an init.d script. It sits waiting for incoming requests on its toolstack interface and checks every 10s whether all domain targets are set to the ideal values (recall the Ballooning policy). If an allocation request arrives or if the domain targets require adjusting then it calls into the module squeeze_xen.ml.

The module src/squeeze_xen.ml contains code which inspects the state of the host (through hypercalls and reading xenstore) and creates a set of records describing the current state of the host and all the domains. Note this snapshot of state is not atomic: it is pieced together from multiple hypercalls and xenstore reads. We assume that the errors generated are small and we ignore them. These records are passed into the squeeze.ml module where they are processed and converted into a list of actions, i.e. (i) updates to memory/target; and (ii) declarations that particular domains have become inactive or active. The rationale for separating the Xen interface from the main ballooning logic was to make testing easier: the module test/squeeze_test.ml contains a simple simulator which allows various edge-cases to be checked.

Issues

If a Linux domU kernel has the netback, blkback or blktap modules then they take pages away via alloc_empty_pages_and_pagevec() during boot. This interacts with the balloon driver to break the assumption that reducing the target by x from a neutral value should free x amount of memory.
Polling the state of the host (particularly the xenstore contents) is a bit inefficient. Perhaps we should move the policy values dynamic_min and dynamic_max to a separate place in the xenstore tree and use watches instead.
The memory values given to the domain builder are in units of MiB. We may wish to similarly quantise the target value or check that the memory-offset calculation still works.
The Xen patch queue reintroduces the lowmem emergency pool. This was an attempt to prevent guests from allocating lowmem before we switched to a two-phase target setting procedure. This patch can probably be removed.
It seems unnecessarily evil to modify an inactive domain’s maxmem leaving maxmem less than target, causing the guest to attempt allocations forever. It’s probably neater to move the target at the same time.
Declaring a domain active just because it makes small amounts of progress shouldn’t be enough. Otherwise a domain could free 1 byte (or maybe 1 page) every 5s.
Likewise, declaring a domain “uncooperative” only if it has been inactive for 20s means that a domain could alternate between inactive for 19s and active for 1s and not be declared “uncooperative”.

Document history

Version	Date	Change
0.2	10th Nov 2014	Update to markdown
0.1	9th Nov 2009	Initial version

Xapi-guard

The xapi-guard daemon is the component in the xapi toolstack that is responsible for handling persistence requests from VMs (domains). Currently these are UEFI vars and vTPM updates.

The code is in ocaml/xapi-guard. When the daemon managed only UEFI updates it was called varstored-guard. Some files and package names still use the previous name.

Principles

Calls from domains must be limited in privilege to do certain API calls, and to read and write from their corresponding VM in xapi’s database only.
Xenopsd is able to control xapi-guard through message switch; this access is not limited.
Listening on domain sockets is restored whenever the daemon restarts to minimize disruption of running domains.
Disruption to requests when xapi is unavailable is minimized. The startup procedure is not blocked by the availability of xapi, and write requests from domains must not fail because xapi is unavailable.

Overview

Xapi-guard forwards calls from domains to xapi to persist UEFI variables, and update vTPMs. To do this, it listens to 1 socket per service (varstored, or swtpm) per domain. To create these sockets before the domains are running, it listens to a message-switch socket. This socket listens to calls from xenopsd, which orchestrates the domain creation.

To protect the domains from xapi being unavailable transiently, xapi-guard provides an on-disk cache for vTPM writes. This cache acts as a buffer and stores the requests temporarily until xapi can be contacted again. This situation usually happens when xapi is being restarted as part of an update. SWTPM, the vTPM daemon, reads the contents of the TPM from xapi-guard on startup, suspend, and resume. During normal operation SWTPM does not send read requests to xapi-guard.

Structure

The cache module consists of two Lwt threads, one that writes to disk, and another one that reads from disk. The writer is triggered when a VM writes to the vTPM. It never blocks if xapi is unreachable, but responds as soon as the data has been stored either by xapi or on the local disk, such that the VM receives a timely response to the write request. Depending on the state, both threads try to send requests to xapi, attempting to write all the cached data back to xapi and stop using the cache. The threads communicate through a bounded queue, which limits the amount of memory used. This queue is a performance optimisation: the writer tells the reader the exact names of the cache files, so that the reader does not need to list the cache directory. A full queue does not mean data loss, just a loss of performance; vTPM writes are still cached.

This means that the cache operates in three modes:

Direct: during normal operation the disk is not used at all
Engaged: both threads use the queue to order events
Disengaged: one thread dumps requests to disk while the other reads the cache until it’s empty

---
title: Cache State
---
stateDiagram-v2
    Disengaged
    note right of Disengaged
        Writer doesn't add requests to queue
        Reader reads from cache and tries to push to xapi
    end note
    Direct
    note left of Direct
        Writer bypasses cache, send to xapi
        Reader waits
    end note
    Engaged
    note right of Engaged
        Writer writes to cache and adds requests to queue
        Reader reads from queue and tries to push to xapi
    end note

    [*] --> Disengaged

    Disengaged --> Disengaged : Reader pushed pending TPMs to xapi, in the meantime TPMs appeared in the cache
    Disengaged --> Direct : Reader pushed pending TPMs to xapi, cache is empty

    Direct --> Direct : Writer receives TPM, sent to xapi
    Direct --> Engaged : Writer receives TPM, error when sent to xapi

    Engaged --> Direct : Reader sent TPM to xapi, finds an empty queue
    Engaged --> Engaged : Writer receives TPM, queue is not full
    Engaged --> Disengaged : Writer receives TPM, queue is full
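
The transitions in the diagram can also be written as a lookup table. The following Python sketch is purely illustrative (xapi-guard itself is written in OCaml); the `State` and `Event` names are invented labels for the states and edge descriptions in the diagram.

```python
from enum import Enum, auto


class State(Enum):
    DISENGAGED = auto()
    DIRECT = auto()
    ENGAGED = auto()


class Event(Enum):
    FLUSHED_CACHE_NOT_EMPTY = auto()   # reader pushed TPMs, more appeared
    FLUSHED_CACHE_EMPTY = auto()       # reader pushed TPMs, cache is empty
    WRITE_SENT_TO_XAPI = auto()        # writer received TPM, sent to xapi
    WRITE_FAILED = auto()              # writer received TPM, xapi errored
    READER_FOUND_EMPTY_QUEUE = auto()  # reader sent TPM, queue is empty
    WRITE_QUEUED = auto()              # writer received TPM, queue not full
    WRITE_QUEUE_FULL = auto()          # writer received TPM, queue is full


INITIAL = State.DISENGAGED

TRANSITIONS = {
    (State.DISENGAGED, Event.FLUSHED_CACHE_NOT_EMPTY): State.DISENGAGED,
    (State.DISENGAGED, Event.FLUSHED_CACHE_EMPTY): State.DIRECT,
    (State.DIRECT, Event.WRITE_SENT_TO_XAPI): State.DIRECT,
    (State.DIRECT, Event.WRITE_FAILED): State.ENGAGED,
    (State.ENGAGED, Event.READER_FOUND_EMPTY_QUEUE): State.DIRECT,
    (State.ENGAGED, Event.WRITE_QUEUED): State.ENGAGED,
    (State.ENGAGED, Event.WRITE_QUEUE_FULL): State.DISENGAGED,
}


def step(state: State, event: Event) -> State:
    """Return the next cache state, rejecting edges absent from the diagram."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.name} on {event.name}")
```

Starting from `INITIAL`, an empty cache leads to Direct; a failed write then engages the cache, and a full queue sends it back to Disengaged.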

Startup

At startup, there’s a dedicated routine to transform the existing contents of the cache. This is currently done because the timestamp reference changes on each boot. This means that the existing contents might have timestamps considered more recent than timestamps of writes coming from running events, leading to missing content updates. To avoid this, updates with offending timestamps are renamed to a timestamp derived from the current time, ensuring a consistent ordering. The routine is also used to keep a minimal file tree: unrecognised files are deleted, temporary files created to ensure atomic writes are left untouched, and empty directories are deleted. This mechanism can be changed in the future to migrate to other formats.

Design Documents

Key: Revision Proposed Confirmed Released (vA.B) Unrecognised status

Better VM revert v2 confirmed
Multiple Cluster Managers v2 confirmed
SR-Level RRDs v11 confirmed
Backtrace support v1 confirmed
Add qcow tool to allow VDI import/export v1 proposed
Add supported image formats in sm-list v3 proposed
Aggregated Local Storage and Host Reboots v3 proposed
Code Coverage Profiling v2 proposed
Distributed database v1 proposed
FCoE capable NICs v3 proposed
Local database v1 proposed
Management Interface on VLAN v3 proposed
Multiple device emulators v1 proposed
NUMA v1 proposed
OCFS2 storage v1 proposed
patches in VDIs v1 proposed
PCI passthrough support v1 proposed
Pool-wide SSH v1 proposed
Process events from xenopsd in a timely manner v1 proposed
RRDD plugin protocol v3 v1 proposed
Schedule Snapshot Design v2 proposed
Specifying Emulated PCI Devices v1 proposed
thin LVHD storage v3 proposed
XenPrep v2 proposed
TLS verification for intra-pool communications v2 released (22.6.0)
Tunnelling API design v1 released (5.6 fp1)
Heterogeneous pools v1 released (5.6)
Emergency Network Reset Design v1 released (6.0.2)
Bonding Improvements design v1 released (6.0)
GPU pass-through support v1 released (6.0)
Integrated GPU passthrough support v3 released (6.5 sp1)
GRO and other properties of PIFs v1 released (6.5)
RRDD archival redesign v1 released (7.0)
CPU feature levelling 2.0 v7 released (7.0)
GPU support evolution v3 released (7.0)
RRDD plugin protocol v2 v1 released (7.0)
VGPU type identifiers v1 released (7.0)
Virtual Hardware Platform Version v1 released (7.0)
SMAPIv3 v1 released (7.6)
User-installable host certificates v2 released (8.2)
RDP control v2 released (xenserver 6.5 sp1)

Design document
Revision	v1
Status	proposed

Add qcow tool to allow VDI import/export

Introduction

At XCP-ng, we are working on overcoming the 2TiB limitation for VM disks while preserving essential features such as snapshots, copy-on-write capabilities, and live migration.

To achieve this, we are introducing Qcow2 support in SMAPI and the blktap driver. With the alpha release, we can:
- Create a VDI
- Snapshot it
- Export and import it to/from XVA
- Perform full backups

However, we currently cannot export a VDI to a Qcow2 file, nor import one.

The purpose of this design proposal is to outline a solution for implementing VDI import/export in Qcow2 format.

Design Proposal

The import and export of VHD-based VDIs currently rely on vhd-tool, which is responsible for streaming data between a VDI and a file. It supports both Raw and VHD formats, but not Qcow2.

There is an existing tool called qcow-tool originally packaged by MirageOS. It is no longer actively maintained, but it can produce Qcow files readable by QEMU.

Currently, qcow-tool does not support streaming, but we propose to add this capability. This means replicating the approach used in vhd-tool, where data is pushed to a socket.

We have contacted the original developer, David Scott, and there are no objections to us maintaining the tool if needed.

Therefore, the most appropriate way to enable Qcow2 import/export in XAPI is to add streaming support to qcow-tool.

XenAPI changes

The workflow

The export and import of VDIs are handled by the XAPI HTTP server:
- GET /export_raw_vdi
- PUT /import_raw_vdi
The corresponding handlers are Export_raw_vdi.handler and Import_raw_vdi.handler.
Since the format is checked in the handler, we need to add support for Qcow2, as currently only Raw, Tar, and Vhd are supported.
This requires adding a new type in the Importexport.Format module and a new content type: application/x-qemu-disk. See mime-types format.
This allows the format to be properly decoded. Currently, all formats use a wrapper called Vhd_tool_wrapper, which sets up parameters for vhd-tool. We need to add a new wrapper for the Qcow2 format, which will instead use qcow-tool, a tool that we will package (see the section below).
The new wrapper will be responsible for setting up parameters (source, destination, etc.). Since it only manages Qcow2 files, we don’t need to pass additional format information.
The format (qcow2) will be specified in the URI. For example:
- /import_raw_vdi?session_id=<OpaqueRef>&task_id=<OpaqueRef>&vdi=<OpaqueRef>&format=qcow2
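
As an illustration of how a client could build such a request path, here is a minimal Python sketch. The function name is hypothetical and the references passed in are placeholders; only the handler names and the query parameters (session_id, task_id, vdi, format) come from the text above.

```python
from urllib.parse import urlencode


def vdi_transfer_path(handler: str, session_ref: str, task_ref: str,
                      vdi_ref: str, image_format: str) -> str:
    """Build the path and query string for a VDI import or export request."""
    query = urlencode({
        "session_id": session_ref,
        "task_id": task_ref,
        "vdi": vdi_ref,
        "format": image_format,   # e.g. "qcow2" once the new format exists
    })
    return f"/{handler}?{query}"


# Placeholder values; real calls would pass OpaqueRef strings from the API.
path = vdi_transfer_path("import_raw_vdi", "S", "T", "V", "qcow2")
```

Note that `urlencode` percent-encodes reserved characters such as the colon in a real OpaqueRef value.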

Adding and modifying qcow-tool

We need to package qcow-tool.
This new tool will be called from ocaml/xapi/qcow_tool_wrapper.ml, as described in the previous section.
To export a VDI to a Qcow2 file, we need to add functionality similar to Vhd_tool_wrapper.send, which calls vhd-tool stream.
- It writes data from the source to a destination. Unlike vhd-tool, which supports multiple destinations, we will only support Qcow2 files.
- Here is a typical call to vhd-tool stream

/bin/vhd-tool stream \
    --source-protocol none \
    --source-format hybrid \
    --source /dev/sm/backend/ff1b27b1-3c35-972e-76ec-a56fe9f25e36/87711319-2b05-41a3-8ee0-3b63a2fc7035:/dev/VG_XenStorage-ff1b27b1-3c35-972e-76ec-a56fe9f25e36/VHD-87711319-2b05-41a3-8ee0-3b63a2fc7035 \
    --destination-protocol none \
    --destination-format vhd \
    --destination-fd 2585f988-7374-8131-5b66-77bbc239cbb2 \
    --tar-filename-prefix  \
    --progress \
    --machine \
    --direct \
    --path /dev/mapper:.

To import a VDI from a Qcow2 file, we need to implement functionality similar to Vhd_tool_wrapper.receive, which calls vhd-tool serve.
- This is the reverse of the export process. As with export, we will only support a single type of import: from a Qcow2 file.
- Here is a typical call to vhd-tool serve

/bin/vhd-tool serve \
    --source-format raw \
    --source-protocol none \
    --source-fd 3451d7ed-9078-8b01-95bf-293d3bc53e7a \
    --tar-filename-prefix  \
    --destination file:///dev/sm/backend/f939be89-5b9f-c7c7-e1e8-30c419ee5de6/4868ac1d-8321-4826-b058-952d37a29b82 \
    --destination-format raw \
    --progress \
    --machine \
    --direct \
    --destination-size 180405760 \
    --prezeroed

We don’t need to support different protocols or different formats. Since only Qcow2 is handled, we just need to copy data from a socket into a file and from a file to a socket. Sockets and files will be managed in the qcow_tool_wrapper. The forkhelpers.ml module manages the list of file descriptors, and we will mimic what the vhd-tool wrapper does to link a UUID to a socket.

Design document
Revision	v3
Status	proposed

Add supported image formats in sm-list

Introduction

At XCP-ng, we are enhancing support for QCOW2 images in SMAPI. The primary motivation for this change is to overcome the 2TB size limitation imposed by the VHD format. By adding support for QCOW2, a Storage Repository (SR) will be able to host disks in VHD and/or QCOW2 formats, depending on the SR type. In the future, additional formats—such as VHDx—could also be supported.

We need a mechanism to expose to end users which image formats are supported by a given SR. The proposal is to extend the SM API object with a new field that clients (such as XenCenter, XenOrchestra, etc.) can use to determine the available formats.

Design Proposal

To expose the available image formats to clients (e.g., XenCenter, XenOrchestra, etc.), we propose adding a new field called supported_image_formats to the Storage Manager (SM) module. This field will be included in the output of the SM.get_all_records call.

With this new information, listing all parameters of the SM object will return:

# xe sm-list params=all

Output of the command will look like (notice that CLI uses hyphens):

uuid ( RO)                         : c6ae9a43-fff6-e482-42a9-8c3f8c533e36
name-label ( RO)                   : Local EXT3 VHD
name-description ( RO)             : SR plugin representing disks as VHD files stored on a local EXT3 filesystem, created inside an LVM volume
type ( RO)                         : ext
vendor ( RO)                       : Citrix Systems Inc
copyright ( RO)                    : (C) 2008 Citrix Systems Inc
required-api-version ( RO)         : 1.0
capabilities ( RO) [DEPRECATED]     : SR_PROBE; SR_SUPPORTS_LOCAL_CACHING; SR_UPDATE; THIN_PROVISIONING; VDI_ACTIVATE; VDI_ATTACH; VDI_CLONE; VDI_CONFIG_CBT; VDI_CREATE; VDI_DEACTIVATE; VDI_DELETE; VDI_DETACH; VDI_GENERATE_CONFIG; VDI_MIRROR; VDI_READ_CACHING; VDI_RESET_ON_BOOT; VDI_RESIZE; VDI_SNAPSHOT; VDI_UPDATE
features (MRO)                    : SR_PROBE: 1; SR_SUPPORTS_LOCAL_CACHING: 1; SR_UPDATE: 1; THIN_PROVISIONING: 1; VDI_ACTIVATE: 1; VDI_ATTACH: 1; VDI_CLONE: 1; VDI_CONFIG_CBT: 1; VDI_CREATE: 1; VDI_DEACTIVATE: 1; VDI_DELETE: 1; VDI_DETACH: 1; VDI_GENERATE_CONFIG: 1; VDI_MIRROR: 1; VDI_READ_CACHING: 1; VDI_RESET_ON_BOOT: 2; VDI_RESIZE: 1; VDI_SNAPSHOT: 1; VDI_UPDATE: 1
configuration ( RO)               : device: local device path (required) (e.g. /dev/sda3)
driver-filename ( RO)              : /opt/xensource/sm/EXTSR
required-cluster-stack ( RO)       :
supported-image-formats ( RO)       : vhd, raw, qcow2

Implementation details

The supported_image_formats field will be populated by retrieving information from the SMAPI drivers. Specifically, each driver will update its DRIVER_INFO dictionary with a new key, supported_image_formats, which will contain a list of strings representing the supported image formats (for example: ["vhd", "raw", "qcow2"]). Although the formats are listed as a list of strings, they are treated as a set: specifying the same format multiple times has no effect.
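
A minimal Python sketch of what this could look like on the driver side follows. The other `DRIVER_INFO` entries and the helper function are hypothetical; only the supported_image_formats key, its list-of-strings type and its set semantics come from this proposal. A driver that omits the key is treated as advertising an empty list.

```python
from typing import Dict, List

# Hypothetical driver description; the duplicated "vhd" is deliberate.
DRIVER_INFO: Dict[str, object] = {
    "name": "Example SR driver",
    "supported_image_formats": ["vhd", "raw", "qcow2", "vhd"],
}


def supported_image_formats(driver_info: Dict[str, object]) -> List[str]:
    """Return the advertised formats with set semantics (duplicates ignored)."""
    formats = driver_info.get("supported_image_formats", [])
    return sorted(set(formats))
```

Calling `supported_image_formats(DRIVER_INFO)` yields each format once, and an existing driver without the key yields an empty list.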

Driver behavior without `supported_image_formats`

If a driver does not provide this information (as is currently the case with existing drivers), the default value will be an empty list. This signifies that the driver determines which format to use when creating a VDI. During a migration, the destination driver will choose the format of the VDI if none is explicitly specified. This ensures backward compatibility with both current and future drivers.

Specifying image formats for VDI creation

If the supported image formats are exposed to the client, then, when creating a new VDI, the user can specify the desired format via the sm_config parameter image-format=qcow2 (or any other supported format). If no format is specified, the driver will use its preferred default format. If the specified format is not supported, an error will be generated indicating that the SR does not support it. Here is how it can be achieved using the xe CLI:

# xe vdi-create \
    sr-uuid=cbe2851e-9f9b-f310-9bca-254c1cf3edd8 \
    name-label="A new VDI" \
    virtual-size=10240 \
    sm-config:image-format=vhd

Specifying image formats for VDI migration

When migrating a VDI, an API client may need to specify the desired image format if the destination SR supports multiple storage formats.

VDI pool migrate

To support this, a new parameter, dest_img_format, is introduced to VDI.pool_migrate. This field accepts a string specifying the desired format (e.g., qcow2), ensuring that the VDI is migrated in the correct format. The new signature of VDI.pool_migrate will be VDI ref pool_migrate (session ref, VDI ref, SR ref, string, (string -> string) map).

If the specified format is not supported or cannot be used (e.g., due to size limitations), an error will be generated. Validation will be performed as early as possible to prevent disruptions during migration. These checks can be performed by examining the XAPI database to determine whether the SR provided as the destination has a corresponding SM object with the expected format. If this is not the case, a format not found error will be returned. If no format is specified by the client, the destination driver will determine the appropriate format.

# xe vdi-pool-migrate \
    uuid=<VDI_UUID> \
    sr-uuid=<SR_UUID> \
    dest-img-format=qcow2
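
The early validation described above can be sketched as follows. This Python illustration is not xapi code: the exception class and function name are hypothetical stand-ins for the “format not found” error and for the database lookup of the destination SR's SM record.

```python
from typing import Iterable, Optional


class FormatNotFound(Exception):
    """Hypothetical stand-in for the 'format not found' API error."""


def check_dest_format(requested: Optional[str],
                      sm_supported_formats: Iterable[str]) -> Optional[str]:
    """Validate the requested destination format before migration starts."""
    if not requested:
        # No format given: the destination driver picks one.
        return None
    if requested not in set(sm_supported_formats):
        raise FormatNotFound(requested)
    return requested
```

With this check, requesting qcow2 on an SR whose SM record lists only vhd and raw fails before any data is moved, while omitting the format always passes.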

VM migration to remote host

A VDI migration can also occur during a VM migration. In this case, we need to be able to specify the expected destination format as well. Unlike VDI.pool_migrate, which applies to a single VDI, VM migration may involve multiple VDIs. The current signature of VM.migrate_send is (session ref, VM ref, (string -> string) map, bool, (VDI ref -> SR ref) map, (VIF ref -> network ref) map, (string -> string) map, (VGPU ref -> GPU_group ref) map). Thus there is already a parameter that maps each source VDI to its destination SR. We propose to add a new parameter that allows specifying the desired destination format for a given source VDI: (VDI ref -> string). It is similar to the VDI-to-SR mapping. We will update the xe CLI to support this new parameter. It would be image-format:<source-vdi-uuid>=<destination-image-format>:

# xe vm-migrate \
        host-uuid=<HOST_UUID> \
        remote-master=<IP> \
        remote-password=<PASS> \
        remote-username=<USER> \
        vdi:<VDI1_UUID>=<SR1_DEST_UUID> \
        vdi:<VDI2_UUID>=<SR2_DEST_UUID> \
        image-format:<VDI1_UUID>=vhd \
        image-format:<VDI2_UUID>=qcow2 \
        uuid=<VM_UUID>

The destination image format would be a string such as vhd, qcow2, or another supported format. It is optional to specify a format. If omitted, the driver managing the destination SR will determine the appropriate format. As with VDI pool migration, if this parameter is not supported by the SM driver, a format not found error will be returned. The validation must happen before sending a creation message to the SM driver, ideally at the same time as checking whether all VDIs can be migrated.

To be able to check the format, we will need to modify VM.assert_can_migrate and add the mapping from VDI references to their image formats, as is done in VM.migrate_send.
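
To illustrate how the CLI parameters relate to the new (VDI ref -> string) map, here is a small Python sketch. It is hypothetical: it works on UUID strings rather than references, and the function name and the dictionary of raw CLI parameters are assumptions made for the example.

```python
from typing import Dict

IMAGE_FORMAT_PREFIX = "image-format:"


def image_format_map(cli_params: Dict[str, str]) -> Dict[str, str]:
    """Collect image-format:<vdi-uuid>=<format> pairs into a VDI -> format map."""
    return {
        key[len(IMAGE_FORMAT_PREFIX):]: value
        for key, value in cli_params.items()
        if key.startswith(IMAGE_FORMAT_PREFIX)
    }


params = {
    "uuid": "vm-uuid",
    "vdi:vdi1": "sr1",
    "vdi:vdi2": "sr2",
    "image-format:vdi1": "vhd",
    "image-format:vdi2": "qcow2",
}
formats = image_format_map(params)
```

VDIs that do not appear in the resulting map have no requested format, so the driver managing the destination SR chooses one.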

Impact

It should have no impact on existing storage repositories that do not provide any information about the supported image format.

This change impacts the SM data model, and as such, the XAPI database version will be incremented. It also impacts the API.

Data Model:
- A new field (supported_image_formats) is added to the SM records.
- A new parameter is added to VM.migrate_send: (VDI ref -> string) map
- A new parameter is added to VM.assert_can_migrate: (VDI ref -> string) map
- A new parameter is added to VDI.pool_migrate: string
Client Awareness: Clients like the xe CLI will now be able to query and display the supported image formats for a given SR.
Database Versioning: The XAPI database version will be updated to reflect this change.

Design document
Revision	v3
Status	proposed
Review	#144
Revision history
v1	Initial version
v2	Included some open questions under Xapi point 2
v3	Added new error, task, and assumptions

Aggregated Local Storage and Host Reboots

Introduction

When hosts use an aggregated local storage SR, then disks are going to be mirrored to several different hosts in the pool (RAID). This ensures that if a host goes down (e.g. due to a reboot after installing a hotfix or upgrade, or when “fenced” by the HA feature), all disk contents in the SR are still accessible. This also means that if all disks are mirrored to just two hosts (worst-case scenario), just one host may be down at any point in time to keep the SR fully available.

When a node comes back up after a reboot, it will resynchronise all its disks with the related mirrors on the other hosts in the pool. This syncing takes some time, and only after this is done, we may consider the host “up” again, and allow another host to be shut down.

Therefore, when installing a hotfix to a pool that uses aggregated local storage, or doing a rolling pool upgrade, we need to make sure that we do hosts one-by-one, and we wait for the storage syncing to finish before doing the next.

This design aims to provide guidance and protection around this by blocking hosts from being shut down or rebooted through the XenAPI except when safe, and setting the host.allowed_operations field accordingly.

XenAPI

If an aggregated local storage SR is in use, and one of the hosts is rebooting or down (for whatever reason), or resynchronising its storage, the operations reboot and shutdown will be removed from the host.allowed_operations field of all hosts in the pool that have a PBD for the SR.

This is a conservative approach in that it assumes that this kind of SR tolerates only one node “failure”, and assumes no knowledge about how the SR distributes its mirrors. We may refine this in future, in order to allow some hosts to be down simultaneously.

The presence of the reboot operation in host.allowed_operations indicates whether the host.reboot XenAPI call is allowed or not (similarly for shutdown and host.shutdown). It will not, of course, prevent anyone from rebooting a host from the dom0 console or power switch.

Clients, such as XenCenter, can use host.allowed_operations, when applying an update to a pool, to guide them when it is safe to update and reboot the next host in the sequence.

In case host.reboot or host.shutdown is called while the storage is busy resyncing mirrors, the call will fail with a new error MIRROR_REBUILD_IN_PROGRESS.
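
The rule for host.allowed_operations can be sketched in a few lines of Python. This is an illustration only: the function and parameter names are hypothetical, and the three boolean inputs stand for information that xapi would obtain from its database and from the storage backend.

```python
from typing import Set

BLOCKED_OPERATIONS = {"reboot", "shutdown"}


def allowed_operations(base_ops: Set[str],
                       has_pbd_for_aggregated_sr: bool,
                       other_host_down_or_rebooting: bool,
                       resync_in_progress: bool) -> Set[str]:
    """Remove reboot/shutdown while the aggregated SR cannot lose another host."""
    unsafe = other_host_down_or_rebooting or resync_in_progress
    if has_pbd_for_aggregated_sr and unsafe:
        return set(base_ops) - BLOCKED_OPERATIONS
    return set(base_ops)
```

A host with a PBD for the SR loses both operations while a mirror resync is running, whereas a host without such a PBD is unaffected.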

Xapi

Xapi needs to be able to:

Determine whether aggregated local storage is in use; this just means that a PBD for such an SR is present.
- TBD: To avoid SR-specific code in xapi, the storage backend should tell us whether it is an aggregated local storage SR.
Determine whether the storage system is resynchronising its mirrors; it will need to be able to query the storage backend for this kind of information.
- Xapi will poll for this and will reflect that a resync is happening by creating a Task for it (in the DB). This task can be used to track progress, if available.
- The exact way to get the syncing information from the storage backend is SR specific. The check may be implemented in a separate script or binary that xapi calls from the polling thread. Ideally this would be integrated with the storage backend.
Update host.allowed_operations for all hosts in the pool according to the rules described above. This comes down to updating the function valid_operations in xapi_host_helpers.ml, and will need to use a combination of the functionality from the two points above, plus an indication of host liveness from host_metrics.live.
Trigger an update of the allowed operations when a host shuts down or reboots (due to a XenAPI call or otherwise), and when it has finished resynchronising when back up. Triggers must be in the following places (some may already be present, but are listed for completeness, and to confirm this):
- Wherever host_metrics.live is updated to detect pool slaves going up and down (probably at least in Db_gc.check_host_liveness and Xapi_ha).
- Immediately when a host.reboot or host.shutdown call is executed: Message_forwarding.Host.{reboot,shutdown,with_host_operation}.
- When a storage resync is starting or finishing.

All of the above runs on the pool master (= SR master) only.
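
As a rough sketch of the "resync is starting or finishing" trigger, the polling thread only needs to react to changes in the answer it gets from the storage backend. The example below is in Python for brevity and is not xapi code; query_resync and on_change are hypothetical callables standing in for the SR-specific check and for the code that refreshes host.allowed_operations.

```python
class ResyncWatcher:
    """Turn repeated polling into start/finish notifications."""

    def __init__(self, query_resync, on_change):
        # query_resync: hypothetical callable, True while mirrors are rebuilding.
        # on_change: hypothetical callback, e.g. to refresh allowed_operations.
        self._query = query_resync
        self._on_change = on_change
        self._resyncing = False

    def poll_once(self):
        """Poll the backend once and notify only when the state flips."""
        now = bool(self._query())
        if now != self._resyncing:
            self._resyncing = now
            self._on_change(now)
        return now
```

Calling poll_once from the polling thread at a fixed interval would then fire on_change(True) when a resync starts and on_change(False) when it finishes.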

Assumptions

The above will be safe if the storage cluster is equal to the XenServer pool. In general, however, it may be desirable to have a storage cluster that is larger than the pool, have multiple XS pools on a single cluster, or even share the cluster with other kinds of nodes.

To ensure that the storage is “safe” in these scenarios, xapi needs to be able to ask the storage backend:

if a mirror is being rebuilt “somewhere” in the cluster, AND
if “some node” in the cluster is offline (even if the node is not in the XS pool).

If the cluster is equal to the pool, then xapi can do point 2 without asking the storage backend, which will simplify things. For the moment, we assume that the storage cluster is equal to the XS pool, to avoid making things too complicated (while keeping in mind that we may change this in future).

Design document
Revision	v1
Status	confirmed

Backtrace support

We want to make debugging easier by recording exception backtraces which are

reliable
cross-process (e.g. xapi to xenopsd)
cross-language
cross-host (e.g. master to slave)

We therefore need

to ensure that backtraces are captured in our OCaml and python code
a marshalling format for backtraces
conventions for storing and retrieving backtraces

Backtraces in OCaml

OCaml has fast exceptions which can be used for both

control flow i.e. fast jumps from inner scopes to outer scopes
reporting errors to users (e.g. the toplevel or an API user)

To keep the exceptions fast, exceptions and backtraces are decoupled: there is a single active backtrace per-thread at any one time. If you have caught an exception and then throw another exception, the backtrace buffer will be reinitialised, destroying your previous records. For example consider a ‘finally’ function:

let finally f cleanup =
  try
    let result = f () in
    cleanup ();
    result
  with e ->
    cleanup ();
    raise e (* <-- backtrace starts here now *)

This function performs some action (i.e. f ()) and guarantees to perform some cleanup action (cleanup ()) whether or not an exception is thrown. This is a common pattern to ensure resources are freed (e.g. closing a socket or file descriptor). Unfortunately the raise e in the exception handler loses the backtrace context: when the exception gets to the toplevel, Printexc.get_backtrace () will point at the finally rather than the real cause of the error.

We will use a variant of the solution proposed by Jacques-Henri Jourdan where we will record backtraces when we catch exceptions, before the buffer is reinitialised. Our finally function will now look like this:

let finally f cleanup =
  try
    let result = f () in
    cleanup ();
    result
  with e ->
    Backtrace.is_important e;
    cleanup ();
    raise e

The function Backtrace.is_important e associates the exception e with the current backtrace before it gets deleted.

Xapi always has high-level exception handlers or other wrappers around all the threads it spawns. In particular Xapi tries really hard to associate threads with active tasks, so it can prefix all log lines with a task id. This helps admins see the related log lines even when there is lots of concurrent activity. Xapi also tries very hard to label other threads with names for the same reason (e.g. db_gc). Every thread should end up being wrapped in with_thread_named which allows us to catch exceptions and log stacktraces from Backtrace.get on the way out.

OCaml design guidelines

Making nice backtraces requires us to think when we write our exception raising and handling code. In particular:

If a function handles an exception and re-raises it, you must call Backtrace.is_important e with the exception to capture the backtrace first.
If a function raises a different exception (e.g. Not_found becoming a XenAPI INTERNAL_ERROR) then you must use Backtrace.reraise <old> <new> to ensure the backtrace is preserved.
All exceptions should be printable – if the generic printer doesn’t do a good enough job then register a custom printer.
If you are the last person who will see an exception (because you aren’t going to rethrow it) then you may log the backtrace via Debug.log_backtrace e if and only if you reasonably expect the resulting backtrace to be helpful and not spammy.
If you aren’t the last person who will see an exception (because you are going to rethrow it or another exception), then do not log the backtrace; the next handler will do that.
All threads should have a final exception handler at the outermost level; for example, Debug.with_thread_named will do this for you.

Backtraces in python

Python exceptions behave similarly to the OCaml ones: if you raise a new exception while handling an exception, the backtrace buffer is overwritten. Therefore the same considerations apply.

Python design guidelines

The function sys.exc_info() can be used to capture the traceback associated with the last exception. We must guarantee to call this before constructing another exception. In particular, this does not work:

  raise MyException(sys.exc_info())

Instead you must capture the traceback first:

  exc_info = sys.exc_info()
  raise MyException(exc_info)
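
A slightly fuller sketch of the capture-first pattern follows. MyException, parse_port and frames_of are hypothetical names used only for illustration; sys.exc_info and traceback.extract_tb are standard library calls.

```python
import sys
import traceback


class MyException(Exception):
    """Hypothetical wrapper that carries the original exc_info triple."""

    def __init__(self, exc_info):
        Exception.__init__(self, str(exc_info[1]))
        self.exc_info = exc_info


def parse_port(text):
    """Hypothetical function that converts a low-level error into MyException."""
    try:
        return int(text)
    except ValueError:
        exc_info = sys.exc_info()  # capture before constructing anything else
        raise MyException(exc_info)


def frames_of(exc):
    """Return (filename, line) pairs from the traceback captured in exc."""
    return [(frame[0], frame[1]) for frame in traceback.extract_tb(exc.exc_info[2])]
```

A handler further up the stack can then log frames_of(e) alongside its own backtrace.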

Marshalling backtraces

We need to be able to take an exception thrown from python code, gather the backtrace, transmit it to an OCaml program (e.g. xenopsd) and glue it onto the end of the OCaml backtrace. We will use a simple json marshalling format for the raw backtrace data consisting of

a string summary of the error (e.g. an exception name)
a list of filenames
a corresponding list of lines

(Note we don’t use the more natural list of pairs as this confuses the “rpclib” code generating library)

In python:

    results = {
      "error": str(s[1]),
      "files": files,
      "lines": lines,
    }
    print(json.dumps(results))

In OCaml:

  type error = {
    error: string;
    files: string list;
    lines: int list;
  } with rpc
  print_string (Jsonrpc.to_string (rpc_of_error ...))
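
To make the format concrete, here is a minimal Python sketch that produces the three fields from a captured exception and pairs the two lists up again on the receiving side. The function names are hypothetical; only the field names error, files and lines come from the format above.

```python
import json
import sys
import traceback


def marshal_backtrace(exc_info):
    """Encode exc_info as JSON with two parallel lists, files and lines."""
    frames = traceback.extract_tb(exc_info[2])
    return json.dumps({
        "error": str(exc_info[1]),
        "files": [frame[0] for frame in frames],
        "lines": [frame[1] for frame in frames],
    })


def unmarshal_backtrace(text):
    """Decode the JSON and pair each file with its line again."""
    data = json.loads(text)
    return data["error"], list(zip(data["files"], data["lines"]))
```

Keeping files and lines as separate lists costs only a zip on the receiving side.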

Retrieving backtraces

Backtraces will be written to syslog as usual. However it will also be possible to retrieve the information via the CLI to allow diagnostic tools to be written more easily.

The CLI

We add a global CLI argument “--trace” which requests the backtrace be printed, if one is available:

# xe vm-start vm=hvm --trace
Error code: SR_BACKEND_FAILURE_202
Error parameters: , General backend error [opterr=exceptions must be old-style classes or derived from BaseException, not str],
Raised Server_error(SR_BACKEND_FAILURE_202, [ ; General backend error [opterr=exceptions must be old-style classes or derived from BaseException, not str];  ])
Backtrace:
0/50 EXT @ st30 Raised at file /opt/xensource/sm/SRCommand.py, line 110
1/50 EXT @ st30 Called from file /opt/xensource/sm/SRCommand.py, line 159
2/50 EXT @ st30 Called from file /opt/xensource/sm/SRCommand.py, line 263
3/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1486
4/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 83
5/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1519
6/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1567
7/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1065
8/50 EXT @ st30 Called from file /opt/xensource/sm/EXTSR.py, line 221
9/50 xenopsd-xc @ st30 Raised by primitive operation at file "lib/storage.ml", line 32, characters 3-26
10/50 xenopsd-xc @ st30 Called from file "lib/task_server.ml", line 176, characters 15-19
11/50 xenopsd-xc @ st30 Raised at file "lib/task_server.ml", line 184, characters 8-9
12/50 xenopsd-xc @ st30 Called from file "lib/storage.ml", line 57, characters 1-156
13/50 xenopsd-xc @ st30 Called from file "xc/xenops_server_xen.ml", line 254, characters 15-63
14/50 xenopsd-xc @ st30 Called from file "xc/xenops_server_xen.ml", line 1643, characters 15-76
15/50 xenopsd-xc @ st30 Called from file "lib/xenctrl.ml", line 127, characters 13-17
16/50 xenopsd-xc @ st30 Re-raised at file "lib/xenctrl.ml", line 127, characters 56-59
17/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 937, characters 3-54
18/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1103, characters 4-71
19/50 xenopsd-xc @ st30 Called from file "list.ml", line 84, characters 24-34
20/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1098, characters 2-367
21/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1203, characters 3-46
22/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1441, characters 3-9
23/50 xenopsd-xc @ st30 Raised at file "lib/xenops_server.ml", line 1452, characters 9-10
24/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1458, characters 48-60
25/50 xenopsd-xc @ st30 Called from file "lib/task_server.ml", line 151, characters 15-26
26/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 1719, characters 11-14
27/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
28/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 2005, characters 13-14
29/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
30/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 1785, characters 15-16
31/50 xapi @ st30 Called from file "message_forwarding.ml", line 233, characters 25-44
32/50 xapi @ st30 Called from file "message_forwarding.ml", line 915, characters 15-67
33/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
34/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12
35/50 xapi @ st30 Called from file "message_forwarding.ml", line 1205, characters 21-199
36/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
37/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12
38/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
39/50 xapi @ st30 Raised at file "rbac.ml", line 236, characters 10-15
40/50 xapi @ st30 Called from file "server_helpers.ml", line 75, characters 11-41
41/50 xapi @ st30 Raised at file "cli_util.ml", line 78, characters 9-12
42/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
43/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12
44/50 xapi @ st30 Called from file "cli_operations.ml", line 1889, characters 2-6
45/50 xapi @ st30 Re-raised at file "cli_operations.ml", line 1898, characters 10-11
46/50 xapi @ st30 Called from file "cli_operations.ml", line 1821, characters 14-18
47/50 xapi @ st30 Called from file "cli_operations.ml", line 2109, characters 7-526
48/50 xapi @ st30 Called from file "xapi_cli.ml", line 113, characters 18-56
49/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9

One can automatically set “--trace” for a whole shell session as follows:

export XE_EXTRA_ARGS="--trace"
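
As an example of the kind of diagnostic tool this enables, the sketch below groups the printed frames by the process that produced them. The line format is inferred solely from the sample output above and is an assumption, not a documented interface, so a real tool should tolerate lines that do not match.

```python
import re
from collections import OrderedDict

# Pattern inferred from the sample output; not a stable, documented format.
FRAME = re.compile(
    r'^(?P<index>\d+)/(?P<total>\d+) (?P<process>\S+) @ (?P<host>\S+) '
    r'(?P<action>.+?) (?:at|from) file "?(?P<file>[^",]+)"?, line (?P<line>\d+)'
)


def parse_frame(text):
    """Return a dict describing one backtrace line, or None if it does not match."""
    match = FRAME.match(text)
    if match is None:
        return None
    frame = match.groupdict()
    for key in ("index", "total", "line"):
        frame[key] = int(frame[key])
    return frame


def frames_by_process(lines):
    """Group parsed frames by process name, preserving first-seen order."""
    groups = OrderedDict()
    for text in lines:
        frame = parse_frame(text)
        if frame is not None:
            groups.setdefault(frame["process"], []).append(frame)
    return groups
```

Applied to the output above, this would yield three groups (EXT, xenopsd-xc and xapi), which makes it easy to see where an error crossed a process boundary.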

The XenAPI

We already store error information in the XenAPI “Task” object and so we can store backtraces at the same time. We shall add a field “backtrace” which will have type “string” but which will contain s-expression encoded backtrace data. Clients should not attempt to parse this string: its contents may change in future. The reason it is different from the json mentioned before is that it also contains host and process information supplied by Xapi, and may be extended in future to contain other diagnostic information.

The Xenopsd API

We already store error information in the xenopsd API “Task” objects; we can extend these to store the backtrace in an additional field (“backtrace”). This field will have type “string” but will contain s-expression encoded backtrace data.

The SMAPIv1 API

Errors in SMAPIv1 are returned as XMLRPC “Faults” containing a code and a status line. Xapi transforms these into XenAPI exceptions usually of the form SR_BACKEND_FAILURE_<code>. We can extend the SM backends to use the XenAPI exception type directly: i.e. to marshal exceptions as dictionaries:

  results = {
    "Status": "Failure",
    "ErrorDescription": [ code, param1, ..., paramN ]
  }

We can then define a new backtrace-carrying error:

code = SR_BACKEND_FAILURE_WITH_BACKTRACE
param1 = json-encoded backtrace
param2 = code
param3 = reason

which is internally transformed into SR_BACKEND_FAILURE_<code> and the backtrace is appended to the current Task backtrace. From the client’s point of view the final exception should look the same, but Xapi will have a chance to see and log the whole backtrace.

As a side-effect, it is possible for SM plugins to throw XenAPI errors directly, without interpretation by Xapi.
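
The internal transformation might look roughly like the following Python sketch. The function name and the use of a plain list for the Task backtrace are hypothetical, and returning only the reason as the error parameter is a simplifying assumption.

```python
import json

WITH_BACKTRACE = "SR_BACKEND_FAILURE_WITH_BACKTRACE"


def unwrap_sm_failure(result, task_backtrace):
    """Return (code, params) for the client; append any carried backtrace.

    result is the marshalled failure dictionary; task_backtrace is a list
    standing in for the backtrace accumulated on the current Task.
    """
    description = result["ErrorDescription"]
    code, params = description[0], description[1:]
    if code != WITH_BACKTRACE:
        return code, params  # plain error: pass through unchanged
    backtrace, inner_code, reason = params[0], params[1], params[2]
    task_backtrace.append(json.loads(backtrace))
    return "SR_BACKEND_FAILURE_%s" % inner_code, [reason]
```

The client still sees SR_BACKEND_FAILURE_<code>, while the decoded backtrace ends up with the Task.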

Design document
Revision	v2
Status	confirmed

Better VM revert

Overview

XenAPI allows users to roll back the state of a VM to a previous state, which is stored in a snapshot, using the call VM.revert. Because there is no VDI.revert call, VM.revert uses VDI.clone on the snapshot to duplicate the contents of that disk and then uses the new clone as the storage for the VM.

Because VDI.clone creates new VDI refs and uuids, some problematic behaviours arise:

Clients such as Apache CloudStack need to include complex logic to keep track of the disks they are actively managing
Because the snapshot is cloned and the original VDI is deleted, references to the original VDI, such as VDI.snapshot_of, become invalid. This means that the database has to be combed through to change these references. Because the database doesn’t support transactions, this operation is not atomic and can produce inconsistent database states.

Additionally, some filesystems support snapshots natively; for these, doing the clone procedure is much costlier than allowing the filesystem to do the revert.

We will fix these problems by:

introducing the new feature VDI_REVERT in SM interface (xapi_smint). This allows backends to advertise that they support the new functionality
defining a new storage operation VDI.revert in storage_interface, which is gated by the feature VDI_REVERT
proxying the storage operation to SMAPIv3 and SMAPIv1 backends accordingly
adding VDI.revert to xapi_vdi which will call the storage operation if the backend advertises it, and fallback to the previous method that uses VDI.clone if it doesn’t advertise it, or issues are detected at runtime that prevent it
changing the Xapi implementation of VM.revert to use VDI.revert
implementing vdi_revert for common storage types, including File and LVM-based SRs
adding unit and quick tests to xapi to test that VM.revert does not regress

Current VM.revert behaviour

The code that reverts the state of storage is located in update_vifs_vbds_vgpus_and_vusbs. It performs the following steps:

destroys the VM’s VBDs (both disks and CDs)
destroys the VM’s VDIs (disks only), referenced by the snapshot’s VDIs using snapshot_of, as well as the suspend VDI.
clones the snapshot’s VDIs (disks and CDs); if one clone fails, none remain.
searches the database for all snapshot_of references to the deleted VDIs and replaces them with references to the newly cloned snapshots.
clones the snapshot’s resume VDI
creates copies of all the cloned VBDs and associates them with the cloned VDIs
assigns the new resume VDI to the VM

XenAPI design

API

The function VDI.revert will be added, with arguments:

in: snapshot: Ref(VDI): the snapshot to which we want to revert
in: driver_params: Map(String,String): optional extra parameters
out: Ref(VDI) reference to the new VDI with the reverted contents

The function will extract the reference of the VDI whose contents need to be replaced, which is the snapshot’s snapshot_of field. It will then call the storage function VDI.revert to have its contents replaced with the snapshot’s. The VDI object will not be modified, and the reference returned is the VDI’s original reference. If anything prevents an in-place revert from completing successfully, for example the SM backend does not advertise the feature VDI_REVERT, does not implement the feature, or the snapshot_of reference is invalid, an exception will be raised.

Xapi Storage

The function VDI.revert is added, with the following arguments:

in: dbg: the task identifier, useful for tracing
in: sr: SR where the new VDI must be created
in: snapshot_info: metadata of the snapshot, the contents of which must be made available in the VDI indicated by the snapshot_of field

SMAPIv1

The function vdi_revert is defined with the following arguments:

in: sr_uuid: the UUID of the SR containing both the VDI and the snapshot
in: vdi_uuid: the UUID of the snapshot whose contents must be duplicated
in: target_uuid: the UUID of the target whose contents must be replaced

The function will replace the contents of the target_uuid VDI with the contents of the vdi_uuid VDI without changing the identity of the target (i.e. name-label, uuid and location are guaranteed to remain the same). The vdi_uuid is preserved by this operation. The operation is obviously idempotent.
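
The contract can be illustrated with a toy in-memory model in Python. This is not the SM implementation: the SR is modelled as a hypothetical dictionary from UUID to a record with name_label, location and contents keys.

```python
def vdi_revert(sr, vdi_uuid, target_uuid):
    """Replace the target's contents with the snapshot's, keeping its identity.

    sr is a toy model: a dict mapping uuid -> {"name_label", "location", "contents"}.
    """
    snapshot = sr[vdi_uuid]
    target = sr[target_uuid]
    # Only the contents change; uuid, name-label and location stay as they were.
    target["contents"] = snapshot["contents"]
    return target_uuid
```

Calling it twice with the same arguments leaves the SR in the same state as calling it once, and the snapshot itself is never modified.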

SMAPIv3

In an analogous way to SMAPIv1, the function Volume.revert is defined with the following arguments:

in: dbg: the task identifier, useful for tracing
in: sr: the UUID of the SR containing both the VDI and the snapshot
in: snapshot: the UUID of the snapshot whose contents must be duplicated
in: vdi: the UUID of the VDI whose contents must be replaced

Xapi

add the capability VDI_REVERT so backends can advertise it
use VDI.revert in VM.revert after the VDIs have been destroyed, and before the snapshot’s VDIs have been cloned. If any of the reverts fail because a Not_implemented exception is thrown, or the snapshot_of contains an invalid reference, add the affected VDIs to the list to be cloned and recovered, using the existing method
expose a new xe vdi-revert CLI command
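
The fallback logic in VM.revert could be sketched as follows. This is Python pseudology of the control flow only, not xapi code: NotImplementedBySM, backend_revert and clone are hypothetical stand-ins for the Not_implemented exception, the storage-level VDI.revert and the existing VDI.clone path.

```python
class NotImplementedBySM(Exception):
    """Hypothetical stand-in for the backend's 'not implemented' error."""


def revert_disks(snapshots, backend_revert, clone):
    """Revert in place where possible; clone everything else.

    snapshots: list of dicts with "uuid" and "snapshot_of" (None if invalid).
    Returns a map from snapshot uuid to the uuid of the VDI holding its contents.
    """
    result, to_clone = {}, []
    for snap in snapshots:
        target = snap.get("snapshot_of")
        if target is None:  # invalid reference: use the existing method
            to_clone.append(snap)
            continue
        try:
            backend_revert(snap["uuid"], target)
            result[snap["uuid"]] = target  # same VDI, reverted contents
        except NotImplementedBySM:
            to_clone.append(snap)
    for snap in to_clone:
        result[snap["uuid"]] = clone(snap["uuid"])
    return result
```

Disks reverted in place keep their original reference, so only the cloned ones need their snapshot_of references rewritten.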

SM changes

We will modify

SRCommand.py and VDI.py to add a new vdi_revert function which throws a ’not implemented’ exception
FileSR.py to implement VDI.revert using a variant of the existing snapshot/clone machinery
EXTSR.py and NFSSR.py to advertise the VDI_REVERT capability
LVHDSR.py to implement VDI.revert using a variant of the existing snapshot/clone machinery
LVHDoISCSISR.py and LVHDoHBASR.py to advertise the VDI_REVERT capability

Prototype code from the previous proposal

Prototype code exists here:

xapi-project/xcp-idl#37 by @johnelse
xapi-project/xen-api#2058 mainly by @johnelse but with 2 extra patches from me
Definition of SMAPIv1 vdi_revert
Hacky implementation for EXT/NFS

Design document
Revision	v1
Status	released (6.0)

Bonding Improvements design

This document describes design details for the PR-1006 requirements.

XAPI and XenAPI

Creating a Bond

Current Behaviour on Bond creation

Steps for a user to create a bond:

Shutdown all VMs with VIFs using the interfaces that will be bonded, in order to unplug those VIFs.
Create a Network to be used by the bond: Network.create
Call Bond.create with a ref to this Network, a list of refs of slave PIFs, and a MAC address to use.
Call PIF.reconfigure_ip to configure the bond master.
Call Host.management_reconfigure if one of the slaves is the management interface. This command will call interface-reconfigure to bring up the master and bring down the slave PIFs, thereby activating the bond. Otherwise, call PIF.plug to activate the bond.

Bond.create XenAPI call:

Remove duplicates in the list of slaves.
Validate the following:
- Slaves must not be in a bond already.
- Slaves must not be VLAN masters.
- Slaves must be on the same host.
- Network does not already have a PIF on the same host as the slaves.
- The given MAC is valid.
Create the master PIF object.
- The device name of this PIF is bondx, with x the smallest unused non-negative integer.
- The MAC of the first-named slave is used if no MAC was specified.
Create the Bond object, specifying a reference to the master. The value of the PIF.master_of field on the master is dynamically computed on request.
Set the PIF.bond_slave_of fields of the slaves. The value of the Bond.slaves field is dynamically computed on request.
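
The device naming rule for the master PIF is simple enough to sketch directly. The example is in Python for illustration only, and the function name is hypothetical.

```python
import re


def next_bond_device(existing_devices):
    """Return "bond<x>" with x the smallest unused non-negative integer."""
    used = set()
    for name in existing_devices:
        match = re.match(r"^bond(\d+)$", name)
        if match:
            used.add(int(match.group(1)))
    x = 0
    while x in used:
        x += 1
    return "bond%d" % x
```

For a host that already has bond0 and bond2, this returns bond1, filling the gap rather than continuing from the highest number.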

New Behaviour on Bond creation

Steps for a user to create a bond:

Create a Network to be used by the bond: Network.create
Call Bond.create with a ref to this Network, a list of refs of slave PIFs, and a MAC address to use.
The new bond will automatically be plugged if one of the slaves was plugged.

In the following, for a host h, a VIF-to-move is a VIF associated with a VM that is either

running, suspended or paused on h, OR
halted, and h is the only host that the VM can be started on.
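
Expressed as a predicate on the VIF’s VM, the definition might look like this Python sketch. The dictionary keys (power_state, host, possible_hosts) are hypothetical names chosen for the example, not XenAPI fields.

```python
LIVE_STATES = {"running", "suspended", "paused"}


def is_vif_to_move(vm, h):
    """Decide whether a VIF of this VM counts as a VIF-to-move for host h.

    vm is a dict with hypothetical keys: "power_state", "host" (where the VM
    is running, suspended or paused) and "possible_hosts" (where it can start).
    """
    if vm["power_state"] in LIVE_STATES:
        return vm["host"] == h
    if vm["power_state"] == "halted":
        return set(vm["possible_hosts"]) == {h}
    return False
```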

The Bond.create XenAPI call is updated to do the following:

Remove duplicates in the list of slaves.
Validate the following, and raise an exception if any of these checks fails:
- Slaves must not be in a bond already.
- Slaves must not be VLAN masters.
- Slaves must not be Tunnel access PIFs.
- Slaves must be on the same host.
- Network does not already have a PIF on the same host as the slaves.
- The given MAC is valid.
Try unplugging all currently attached VIFs of the set of VIFs that need to be moved. Roll back and raise an exception if one of the VIFs cannot be unplugged (e.g. due to the absence of PV drivers in the VM).
Determine the primary slave: the management PIF (if among the slaves), or the first slave with IP configuration.
Create the master PIF object.
- The device name of this PIF is bondx, with x the smallest unused non-negative integer.
- The MAC of the primary slave is used if no MAC was specified.
- Include the IP configuration of the primary slave.
- If any of the slaves has PIF.disallow_unplug = true, this will be copied to the master.
Create the Bond object, specifying a reference to the master. The value of the PIF.master_of field on the master is dynamically computed on request. Also a reference to the primary slave is written to Bond.primary_slave on the new Bond object.
Set the PIF.bond_slave_of fields of the slaves. The value of the Bond.slaves field is dynamically computed on request.
Move VLANs, plus the VIFs-to-move on them, to the master.
- If all VLANs on the slaves have different tags, all VLANs will be moved to the bond master, while the same Network is used. The network effectively moves up to the bond and therefore no VIFs need to be moved.
- If multiple VLANs on different slaves have the same tag, they necessarily have different Networks as well. Only one VLAN with this tag is created on the bond master. All VIFs-to-move on the remaining VLAN networks are moved to the Network that was moved up.
Move Tunnels to the master. The tunnel Networks move up with the tunnels. As tunnel keys are different for all tunnel networks, there are no complications as in the VLAN case.
Move VIFs-to-move on the slaves to the master.
If one of the slaves is the current management interface, move management to the master; the master will automatically be plugged. If none of the slaves is the management interface, plug the master if any of the slaves was plugged. In both cases, the slaves will automatically be unplugged.
On all slaves, reset the IP configuration and set disallow_unplug to false.

Note: “moving” a VIF, VLAN or tunnel means “re-creating somewhere else, and destroying the old one”.
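
Two of the steps above, choosing the primary slave and deciding what happens to VLANs that share a tag, can be sketched as small Python functions. The data shapes are hypothetical. Two points are assumptions because the text does not settle them: the first function returns None when no slave is the management PIF or has IP configuration, and the second keeps the first VLAN seen for each tag.

```python
def choose_primary_slave(slaves):
    """slaves: list of dicts with "management" (bool) and "ip" (str or None)."""
    for pif in slaves:
        if pif["management"]:
            return pif
    for pif in slaves:
        if pif["ip"]:
            return pif
    return None  # assumption: this case is not covered by the text


def plan_vlan_moves(vlans):
    """vlans: list of (tag, network) pairs found on the slaves, in slave order.

    Returns (moved_up, vif_moves): the network kept for each tag on the bond
    master, and a map from each remaining network to the network that its
    VIFs-to-move should be moved to.
    """
    moved_up, vif_moves = {}, {}
    for tag, network in vlans:
        if tag not in moved_up:
            moved_up[tag] = network  # assumption: the first VLAN per tag wins
        else:
            vif_moves[network] = moved_up[tag]
    return moved_up, vif_moves
```

When all tags differ, vif_moves is empty, which matches the case where the networks simply move up with their VLANs.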

Destroying a Bond

Current Behaviour on Bond destruction

Steps for a user to destroy a bond:

If the management interface is on the bond, move it to another PIF using PIF.reconfigure_ip and Host.management_reconfigure. Otherwise, no PIF.unplug needs to be called on the bond master, as Bond.destroy does this automatically.
Call Bond.destroy with a ref to the Bond object.
If desired, bring up the former slave PIFs by calls to PIF.plug (this does not happen automatically).

Bond.destroy XenAPI call:

Validate the following constraints:
- No VLANs are attached to the bond master.
- The bond master is not the management PIF.
Bring down the master PIF and clean up the underlying network devices.
Remove the Bond and master PIF objects.

New Behaviour on Bond destruction

Steps for a user to destroy a bond:

Call Bond.destroy with a ref to the Bond object.
If desired, move VIFs/VLANs/tunnels/management from (former) primary slave to other PIFs.

Bond.destroy XenAPI call is updated to do the following:

Try unplugging all currently attached VIFs of the set of VIFs that need to be moved. Roll back and raise an exception if one of the VIFs cannot be unplugged (e.g. due to the absence of PV drivers in the VM).
Copy the IP configuration of the master to the primary slave.
Move VLANs, with their Networks, to the primary slave.
Move Tunnels, with their Networks, to the primary slave.
Move VIFs-to-move on the master to the primary slave.
If the master is the current management interface, move management to the primary slave. The primary slave will automatically be plugged.
If the master was plugged, plug the primary slave. This will automatically clean up the underlying devices of the bond.
If the master has PIF.disallow_unplug = true, this will be copied to the primary slave.
Remove the Bond and master PIF objects.

Using Bond Slaves

Current Behaviour for Bond Slaves

It is possible to plug any existing PIF, even bond slaves. Any other PIFs that cannot be attached at the same time as the PIF that is being plugged are automatically unplugged.
Similarly, it is possible to make a bond slave the management interface. Any other PIFs that cannot be attached at the same time as the PIF that is being plugged are automatically unplugged.
It is possible to have a VIF on a Network associated with a bond slave. When the VIF’s VM is started, or the VIF is hot-plugged, the PIF it relies on is automatically plugged, and any other PIFs that cannot be attached at the same time as this PIF are automatically unplugged.
It is possible to have a VLAN on a bond slave, though the bond (master) and the VLAN may not be simultaneously attached. This is not currently enforced (which may be considered a bug).

New behaviour for Bond Slaves

It is no longer possible to plug a bond slave. The exception CANNOT_PLUG_BOND_SLAVE is raised when trying to do so.
It is no longer possible to make a bond slave the management interface. The exception CANNOT_PLUG_BOND_SLAVE is raised when trying to do so.
It is still possible to have a VIF on the Network of a bond slave. However, it is not possible to start such a VIF’s VM on a host, if this would need a bond slave to be plugged. Trying this will result in a CANNOT_PLUG_BOND_SLAVE exception. Likewise, it is not possible to hot-plug such a VIF.
It is no longer possible to place a VLAN on a bond slave. The exception CANNOT_ADD_VLAN_TO_BOND_SLAVE is raised when trying to do so.
It is no longer possible to place a tunnel on a bond slave. The exception CANNOT_ADD_TUNNEL_TO_BOND_SLAVE is raised when trying to do so.
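
These checks amount to one guard keyed by the operation being attempted. The Python sketch below is illustrative only: XenAPIError, assert_not_bond_slave and the operation names are hypothetical, while the error codes are the ones listed above.

```python
class XenAPIError(Exception):
    """Hypothetical carrier for a XenAPI error code."""

    def __init__(self, code):
        Exception.__init__(self, code)
        self.code = code


ERROR_FOR_OPERATION = {
    "plug": "CANNOT_PLUG_BOND_SLAVE",
    "management": "CANNOT_PLUG_BOND_SLAVE",
    "vlan": "CANNOT_ADD_VLAN_TO_BOND_SLAVE",
    "tunnel": "CANNOT_ADD_TUNNEL_TO_BOND_SLAVE",
}


def assert_not_bond_slave(pif, operation):
    """Raise the matching error if pif (a dict) is a bond slave."""
    if pif.get("bond_slave_of") is None:
        return
    raise XenAPIError(ERROR_FOR_OPERATION[operation])
```

Starting a VM whose VIF would require a bond slave to be plugged would go through the same "plug" check.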

Actions on Start-up

Current Behaviour on Start-up

When a pool slave starts up, bonds and VLANs on the pool master are replicated on the slave:

Create all VLANs that the master has, but the slave has not. VLANs are identified by their tag, the device name of the slave PIF, and the Networks of the master and slave PIFs.
Create all bonds that the master has, but the slave has not. If the interfaces needed for the bond are not all available on the slave, a partial bond is created. If some of these interfaces are already bonded on the slave, this bond is destroyed first.

New Behaviour on Start-up

The current VLAN/tunnel/bond recreation code is retained, as it uses the new Bond.create and Bond.destroy functions, and therefore does what it needs to do.
Before VLAN/tunnel/bond recreation, any violations of the rules defined in R2 are rectified, by moving VIFs, VLANs, tunnels or management up to bonds.

CLI

The behaviour of the xe CLI commands bond-create, bond-destroy, pif-plug, and host-management-reconfigure is changed to match their associated XenAPI calls.

XenCenter

XenCenter already automatically moves the management interface when a bond is created or destroyed. This is no longer necessary, as the Bond.create/destroy calls already do this. XenCenter only needs to copy any PIF.other_config keys that it needs between primary slave and bond master.

Manual Tests

Create a bond of two interfaces…
- without VIFs/VLANs/management on them;
- with management on one of them;
- with a VLAN on one of them;
- with two VLANs on two different interfaces, having the same VLAN tag;
- with a VIF associated with a halted VM on one of them;
- with a VIF associated with a running VM (with and without PV drivers) on one of them.
Destroy a bond of two interfaces…
- without VIFs/VLANs/management on it;
- with management on it;
- with a VLAN on it;
- with a VIF associated with a halted VM on it;
- with a VIF associated with a running VM (with and without PV drivers) on it.
In a pool of two hosts, having VIFs/VLANs/management on the interfaces of the pool slave, create a bond on the pool master, and restart XAPI on the slave.
Restart XAPI on a host with a networking configuration that has become illegal due to these requirements.

Design document
Revision	v2
Status	proposed

Code Coverage Profiling

We would like to add optional coverage profiling to existing OCaml projects in the context of XenServer and XenAPI. This article presents how we do it.

Binaries instrumented for coverage profiling in the XenServer project need to run in an environment where several services act together as they provide operating-system-level services. This makes it a little harder than profiling code that can be profiled and executed in isolation.

TL;DR

To build binaries with coverage profiling, do:

./configure --enable-coverage
make

Binaries will log coverage data to /tmp/bisect*.out from which a coverage report can be generated in coverage/:

bisect-ppx-report -I _build -html coverage /tmp/bisect*.out
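
When several daemons have written coverage files, a test harness may want to assemble this command itself instead of relying on shell globbing. The Python sketch below only builds the argument list shown above; the function name and its parameters are hypothetical, and list_files exists purely so the file lookup can be replaced in tests.

```python
import glob


def report_command(pattern="/tmp/bisect*.out", build_dir="_build",
                   out_dir="coverage", list_files=glob.glob):
    """Build the bisect-ppx-report argument list, or None if no data exists."""
    files = sorted(list_files(pattern))
    if not files:
        return None  # nothing was logged, so there is nothing to report
    return ["bisect-ppx-report", "-I", build_dir, "-html", out_dir] + files
```

The resulting list can be passed to subprocess.call once the instrumented binaries have exited.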

Profiling Framework Bisect-PPX

The open-source BisectPPX instrumentation framework uses extension points (PPX) in the OCaml compiler to instrument code during compilation. Instrumented code for a binary is then compiled as usual and, during execution, logs data to in-memory data structures. Before an instrumented binary terminates, it writes the logged data to a file. This data can then be analysed with the bisect-ppx-report tool to produce a summary of annotated code that highlights which parts of a codebase were executed.

BisectPPX has several desirable properties:

it has a robust code base that is well tested
it is easy to integrate into the compilation pipeline (see below)
it is specific to the OCaml language; an expression-oriented language like OCaml doesn’t fit the traditional statement coverage well
it is actively maintained
it generates useful reports for interactive and non-interactive use that help to improve code coverage

Coverage Analysis

Red parts indicate code that wasn’t executed whereas green parts were. Hovering over a dark green spot reveals how often that point was executed.

The individual steps of instrumenting code with BisectPPX are greatly abstracted by OCamlfind (OCaml’s library manager) and OCamlbuild (OCaml’s compilation manager):

# write code
vim example.ml

# build it with instrumentation from bisect_ppx
ocamlbuild -use-ocamlfind -pkg bisect_ppx -pkg unix example.native

# execute it - generates files ./bisect*.out
./example.native

# generate report
bisect-ppx-report -I _build -html coverage bisect000*

# view coverage/index.html

Summary:
 - 'binding' points: 2/2 (100.00%)
 - 'sequence' points: 10/10 (100.00%)
 - 'match/function' points: 5/8 (62.50%)
 - total: 17/20 (85.00%)

The fourth step generates an HTML report in coverage/. All it takes is to declare to OCamlbuild that a module depends on bisect_ppx and it will be instrumented during compilation. Behind the scenes ocamlfind makes sure that the compiler uses a preprocessing step that instruments the code.

Signal Handling

During execution the code instrumentation leads to the collection of data. This code registers a function with at_exit that writes the data to bisect*.out when exit is called. A binary can terminate without calling exit and in that case the file would not be written. It is therefore important to make sure that exit is called. If this does not happen naturally, for example in the context of a daemon that is terminated by receiving the TERM signal, a signal handler must be installed:

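(* On SIGTERM, call exit so that the at_exit hook writes bisect*.out *)
let stop signal =
  Printf.printf "caught signal %a\n" Debug.Pp.signal signal;
  exit 0

(* Wrapped in "let () =" so it is not parsed as arguments to "exit 0" *)
let () = Sys.set_signal Sys.sigterm (Sys.Signal_handle stop)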

Dumping coverage information at runtime

By default coverage data can only be dumped at exit, which is inconvenient if you have a test-suite that needs to reuse a long-running daemon and starting/stopping it each time is not feasible.

In such cases we need an API to dump coverage at runtime, which is provided by bisect_ppx >= 1.3.0. However, each daemon will need to set up a way to listen for an event that triggers this coverage dump. Furthermore, runtime coverage dumping should be compiled in conditionally, to be absolutely sure that production builds do not use coverage-preprocessed code.

Hence, instead of duplicating all this build logic in each daemon (xapi, xenopsd, etc.), we provide this functionality in a common library, xapi-idl, that:

logs a message on startup so we know it is active
sets BISECT_FILE environment variable to dump coverage in the appropriate place
listens on org.xen.xapi.coverage.<name> message queue for runtime coverage dump commands:
- sending dump <Number> will cause runtime coverage to be dumped to a file named bisect-<name>-<random>.<Number>.out
- sending reset will cause the runtime coverage counters to be reset
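
The two commands above could be decoded along the following lines. This is only a minimal sketch: the command type and the parse_command function are hypothetical names chosen for illustration and are not taken from xapi-idl.

```ocaml
(* Hypothetical sketch: decode the textual commands sent on the
   coverage message queue. Names are illustrative, not xapi-idl's. *)
type command =
  | Dump of int  (* "dump <Number>": write counters to a numbered file *)
  | Reset        (* "reset": clear the runtime counters *)

let parse_command (s : string) : command option =
  match String.split_on_char ' ' (String.trim s) with
  | [ "reset" ] -> Some Reset
  | [ "dump"; n ] -> (
      (* Reject anything that is not a non-negative integer. *)
      match int_of_string_opt n with
      | Some i when i >= 0 -> Some (Dump i)
      | _ -> None)
  | _ -> None
```

Returning an option rather than raising lets a daemon ignore malformed messages without disturbing its normal operation.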

Daemons that use Xcp_service.configure2 (e.g. xenopsd) will benefit from this runtime trigger automatically, provided they are themselves preprocessed with bisect_ppx.

Since we are interested in collecting coverage data for system-wide test-suite runs we need a way to trigger dumping of coverage data centrally, and a good candidate for that is xapi as the top-level daemon.

It will call Xcp_coverage.dispatcher_init (), which listens on org.xen.xapi.coverage.dispatch and dispatches the coverage dump command to all message queues under org.xen.xapi.coverage.* except itself.
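
The selection of queues the dispatcher forwards to can be sketched as a pure function. The function names below are hypothetical, and the sketch assumes the list of existing queue names has already been obtained by some other means.

```ocaml
(* Hypothetical sketch: choose the queues a dispatcher forwards to. *)
let prefix = "org.xen.xapi.coverage."

(* True when [s] starts with [prefix]. *)
let has_prefix ~prefix s =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Keep every coverage queue except the dispatcher's own. *)
let targets ~self queues =
  List.filter (fun q -> has_prefix ~prefix q && q <> self) queues
```

Excluding the dispatcher's own queue avoids a forwarding loop.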

In production and regular builds all of this is a no-op. This is ensured by using separate lib/coverage/disabled.ml and lib/coverage/enabled.ml files, which implement the same interface, and choosing which one to use at build time.
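
The idea of two implementations behind one interface can be illustrated as follows. This is a sketch under stated assumptions: the COVERAGE signature, its active flag and the select function are hypothetical, and the real choice is made by the build system rather than at runtime.

```ocaml
(* Hypothetical sketch of two interchangeable implementations. *)
module type COVERAGE = sig
  val init : string -> unit
  val active : bool
end

(* Stand-in for disabled.ml: every entry point is a no-op. *)
module Disabled : COVERAGE = struct
  let init (_name : string) = ()
  let active = false
end

(* Stand-in for enabled.ml: a real version would set BISECT_FILE and
   listen for dump commands; this one only logs a message. *)
module Enabled : COVERAGE = struct
  let init name = prerr_endline ("coverage profiling active for " ^ name)
  let active = true
end

(* The build would link exactly one of the two files; this sketch
   makes the same choice with a first-class module instead. *)
let select ~enabled : (module COVERAGE) =
  if enabled then (module Enabled : COVERAGE)
  else (module Disabled : COVERAGE)
```

Because both modules satisfy the same signature, callers such as a daemon's start-up code do not need any conditional logic of their own.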

Where Data is Written

By default, BisectPPX writes data in a binary’s current working directory as bisectXXXX.out. It doesn’t overwrite existing files and files from several runs can be combined during analysis. However, this name and the location can be inconvenient when multiple programs share a directory.

BisectPPX’s default can be overridden with the BISECT_FILE environment variable. This can happen on the command line:

BISECT_FILE=/tmp/example ./example.native

In the context of XenServer we could do this in startup scripts. However, we added a bit of code

val Coverage.init: string -> unit

that sets the environment variable from inside the program. The files are written to a temporary directory (respecting $TMP or using /tmp), and the string-typed argument is included in the file name. To be effective, this function must be called before the program exits. For clarity it is called at the beginning of program execution.
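
How such a value might be computed is sketched below. The helper names and the exact "bisect-<name>-" pattern are assumptions made for illustration. A real implementation would also export the result, for example with Unix.putenv "BISECT_FILE"; that call is omitted here so that the sketch depends only on the standard library.

```ocaml
(* Hypothetical sketch of the value Coverage.init could assign to
   BISECT_FILE. *)

(* Honour $TMP when it is set and non-empty, else fall back to /tmp. *)
let tmp_dir () =
  match Sys.getenv_opt "TMP" with
  | Some dir when dir <> "" -> dir
  | _ -> "/tmp"

(* Build a per-program path; the name pattern is an assumption. *)
let bisect_file ?(tmp = tmp_dir ()) name =
  Filename.concat tmp ("bisect-" ^ name ^ "-")
```

Including the program name keeps the output of several instrumented daemons apart even when they share one temporary directory.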

Instrumenting an Oasis Project

While instrumentation is easy on the level of a small file or project, it is challenging in a bigger project. We decided to focus on projects that are built with the Oasis build and packaging manager. These have a well-defined structure and compilation process that is controlled by a central _oasis file. This file describes for each library and binary its dependencies at a package level. From this, Oasis generates a configure script and compilation rules for the OCamlbuild system. Oasis is designed so that the generated files can be shipped without requiring Oasis itself to be available.

Goals for instrumentation are:

what files are instrumented should be obvious and easy to manage
instrumentation must be optional, yet easy to activate
avoid methods that require keeping several files in sync, like multiple _oasis files
avoid separate Git branches for instrumented and non-instrumented code

In the ideal case, we could introduce a configuration switch ./configure --enable-coverage that would prepare compilation for coverage instrumentation. While Oasis supports the creation of such switches, they cannot be used to control build dependencies like compiling a file with or without package bisect_ppx. We have chosen a different method:

A Makefile target coverage augments the _tags file to include the rules in file _tags.coverage that cause files to be instrumented:

make coverage # prepare
make          # build

leads to the execution of this code during preparation:

coverage: _tags _tags.coverage
  test ! -f _tags.orig && mv _tags _tags.orig || true
  cat _tags.coverage _tags.orig > _tags

The file _tags.coverage contains two simple OCamlbuild rules that could be tweaked to instrument only some files:

<**/*.ml{,i,y}>: pkg_bisect_ppx
<**/*.native>:   pkg_bisect_ppx

When make coverage is not called, these rules are not active and hence, code is not instrumented for coverage. We believe that this solution to control instrumentation meets the goals from above. In particular, what files are instrumented and when is controlled by very few lines of declarative code that lives in the main repository of a project.

Project Layout

The crucial files in an Oasis-controlled project that is set up for coverage analysis are:

./_oasis                      - make "profiling" a build dependency
./_tags.coverage              - what files get instrumented
./profiling/coverage.ml       - support file, sets env var
./Makefile                    - target 'coverage'

The _oasis file bundles the files under profiling/ into an internal library which executables then depend on:

	# Support files for profiling
	Library profiling
		CompiledObject:     best
		Path:               profiling
		Install:            false
		Findlibname:        profiling
		Modules:            Coverage
		BuildDepends:

	Executable set_domain_uuid
		CompiledObject:     best
		Path:               tools
		ByteOpt:            -warn-error +a-3
		NativeOpt:          -warn-error +a-3
		MainIs:             set_domain_uuid.ml
		Install:            false
		BuildDepends:
			xenctrl,
			uuidm,
			cmdliner,
			profiling			# <-- here

The Makefile target coverage primes the project for a profiling build:

	# make coverage - prepares for building with coverage analysis

	coverage: _tags _tags.coverage
		test ! -f _tags.orig && mv _tags _tags.orig || true
		cat _tags.coverage _tags.orig > _tags

Design document
Revision	v7
Status	released (7.0)
Revision history
v1	Initial version
v2	Add details about VM migration and import
v3	Included and excluded use cases
v4	Rolling Pool Upgrade use cases
v5	Lots of changes to simplify the design
v6	Use case refresh based on simplified design
v7	RPU refresh based on simplified design

CPU feature levelling 2.0

Executive Summary

The old XS 5.6-style Heterogeneous Pool feature that is based around hardware-level CPUID masking will be replaced by a safer and more flexible software-based levelling mechanism.

History

Original XS 5.6 design: heterogeneous-pools
Changes made in XS 5.6 FP1 for the DR feature (added CPUID checks upon migration)
XS 6.1: migration checks extended for cross-pool scenario

High-level Interfaces and Behaviour

A VM can only be migrated safely from one host to another if both hosts offer the set of CPU features which the VM expects. If this is not the case, CPU features may appear or disappear as the VM is migrated, causing it to crash. The purpose of feature levelling is to hide features which the hosts do not have in common from the VM, so that it does not see any change in CPU capabilities when it is migrated.

Most pools start off with homogeneous hardware, but over time it may become impossible to source new hosts with the same specifications as the ones already in the pool. The main use of feature levelling is to allow such newer, more capable hosts to be added to an existing pool while preserving the ability to migrate existing VMs to any host in the pool.

Principles for Migration

The CPU levelling feature aims to both:

Make VM migrations safe by ensuring that a VM will see the same CPU features before and after a migration.
Make VMs as mobile as possible, so that they can be freely migrated around in a XenServer pool.

To make migrations safe:

A migration request will be blocked if the destination host does not offer some of the CPU features that the VM currently sees.
Any additional CPU features that the destination host is able to offer will be hidden from the VM.

Note: Due to the limitations of the old Heterogeneous Pools feature, we are not able to guarantee the safety of VMs that are migrated to a Levelling-v2 host from an older host, during a rolling pool upgrade. This is because such VMs may be using CPU features that were not captured in the old feature sets, of which we are therefore unaware. However, migrations between the same two hosts, but before the upgrade, may have already been unsafe. The promise is that we will not make migrations more unsafe during a rolling pool upgrade.

To make VMs mobile:

A VM that is started in a XenServer pool will be able to see only CPU features that are common to all hosts in the pool. The set of common CPU features is referred to in this document as the pool CPU feature level, or simply the pool level.

Use Cases for Pools

A user wants to add a new host to an existing XenServer pool. The new host has all the features of the existing hosts, plus extra features which the existing hosts do not. The new host will be allowed to join the pool, but its extra features will be hidden from VMs that are started on the host or migrated to it. The join does not require any host reboots.
A user wants to add a new host to an existing XenServer pool. The new host does not have all the features of the existing ones. XenCenter warns the user that adding the host to the pool is possible, but it would lower the pool’s CPU feature level. The user accepts this and continues the join. The join does not require any host reboots. VMs that are started anywhere on the pool, from now on, will only see the features of the new host (the lowest common denominator), such that they are migratable to any host in the pool, including the new one. VMs that were running before the pool join will not be migratable to the new host, because these VMs may be using features that the new host does not have. However, after a reboot, such VMs will be fully mobile.
A user wants to add a new host to an existing XenServer pool. The new host does not have all the features of the existing ones, and at the same time, it has certain features that the pool does not have (the feature sets overlap). This is essentially a combination of the two use cases above, where the pool’s CPU feature level will be downgraded to the intersection of the feature sets of the pool and the new host. The join does not require any host reboots.
A user wants to upgrade or repair the hardware of a host in an existing XenServer pool. After upgrade the host has all the features it used to have, plus extra features which other hosts in the pool do not have. The extra features are masked out and the host resumes its place in the pool when it is booted up again.
A user wants to upgrade or repair the hardware of a host in an existing XenServer pool. After upgrade the host has fewer features than it used to have. When the host is booted up again, the pool’s CPU feature level will be automatically lowered, and the user will be alerted of this fact (through the usual alerting mechanism).
A user wants to remove a host from an existing XenServer pool. The host will be removed as normal after any VMs on it have been migrated away. The feature set offered by the pool will be automatically re-levelled upwards in case the host which was removed was the least capable in the pool, and additional features common to the remaining hosts will be unmasked.

Rolling Pool Upgrade

A VM which was running on the pool before the upgrade is expected to continue to run afterwards. However, when the VM is migrated to an upgraded host, some of the CPU features it had been using might disappear, either because they are not offered by the host or because the new feature-levelling mechanism hides them. To have the best chance for such a VM to successfully migrate (see the note under “Principles for Migration”), it will be given a temporary VM-level feature set providing all of the destination’s CPU features that were unknown to XenServer before the upgrade. When the VM is rebooted it will inherit the pool-level feature set.
A VM which is started during the upgrade will be given the current pool-level feature set. The pool-level feature set may drop after the VM is started, as more hosts are upgraded and re-join the pool; however, the VM is guaranteed to be able to migrate to any host which has already been upgraded. If the VM is started on the master, there is a risk that it may only be able to run on that host.
To allow the VMs with grandfathered-in flags to be migrated around in the pool, the intra pool VM migration pre-checks will compare the VM’s feature flags to the target host’s flags, not the pool flags. This will maximise the chance that a VM can be migrated somewhere in a heterogeneous pool, particularly in the case where only a few hosts in the pool do not have features which the VMs require.
To allow cross-pool migration, including to a pool with a higher XenServer version, we will still check the VM’s requirements against the pool-level features of the target pool. This is to avoid the possibility that we migrate a VM to an ‘island’ in the other pool, from which it cannot be migrated any further.

XenAPI Changes

Fields

host.cpu_info is a field of type (string -> string) map that contains information about the CPUs in a host. It contains the following keys: cpu_count, socket_count, vendor, speed, modelname, family, model, stepping, flags, features, features_after_reboot, physical_features and maskable.
- The following keys are specific to hardware-based CPU masking and will be removed: features_after_reboot, physical_features and maskable.
- The features key will continue to hold the current CPU features that the host is able to use. In practice, these features will be available to Xen itself and dom0; guests may only see a subset. The current format is a string of four 32-bit words represented as four groups of 8 hexadecimal digits, separated by dashes. This will change to an arbitrary number of 32-bit words. Each bit at a particular position (starting from the left) still refers to a distinct CPU feature (1: feature is present; 0: feature is absent), and feature strings may be compared between hosts. The old format simply becomes a special (4 word) case of the new format, and bits in the same position may be compared between old and new feature strings.
- The new key features_pv will be added, representing the subset of features that the host is able to offer to a PV guest.
- The new key features_hvm will be added, representing the subset of features that the host is able to offer to an HVM guest.
A new field pool.cpu_info of type (string -> string) map (read only) will be added. It will contain:
- vendor: The common CPU vendor across all hosts in the pool.
- features_pv: The intersection of features_pv across all hosts in the pool, representing the feature set that a PV guest will see when started on the pool.
- features_hvm: The intersection of features_hvm across all hosts in the pool, representing the feature set that an HVM guest will see when started on the pool.
- cpu_count: the total number of CPU cores in the pool.
- socket_count: the total number of CPU sockets in the pool.
The pool.other_config:cpuid_feature_mask override key will no longer have any effect on pool join or VM migration.
The field VM.last_boot_CPU_flags will be updated to the new format (see host.cpu_info:features). It will still contain the feature set that the VM was started with as well as the vendor (under the features and vendor keys respectively).
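
The feature string format and the intersection used for the pool level can be illustrated with a few helpers. This is a minimal sketch, not xapi's actual code: the function names are hypothetical, and treating words that are missing from a shorter string as zero is an assumption.

```ocaml
(* Hypothetical helpers for feature strings such as
   "ffffff7f-ffffffff-ffffffff-ffffffff". *)

(* Parse dash-separated groups of 8 hex digits into 32-bit words.
   Raises Failure on malformed input. *)
let features_of_string (s : string) : int64 array =
  String.split_on_char '-' s
  |> List.map (fun word -> Int64.of_string ("0x" ^ word))
  |> Array.of_list

(* Print each word as 8 lower-case hex digits, joined by dashes. *)
let string_of_features (f : int64 array) : string =
  Array.to_list f
  |> List.map (Printf.sprintf "%08Lx")
  |> String.concat "-"

(* Bitwise AND of two feature sets. A missing word counts as zero
   (an assumption), so the result is as long as the longer input. *)
let intersect (a : int64 array) (b : int64 array) : int64 array =
  let word f i = if i < Array.length f then f.(i) else 0L in
  Array.init
    (max (Array.length a) (Array.length b))
    (fun i -> Int64.logand (word a i) (word b i))
```

Under these assumptions, a pool-level value such as pool.cpu_info:features_pv would be obtained by folding intersect over the features_pv values of all hosts.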

Messages

pool.join currently requires that the CPU vendor and feature set (according to host.cpu_info:vendor and host.cpu_info:features) of the joining host are equal to those of the pool master. This requirement will be loosened to mandate only equality in CPU vendor:
- The join will be allowed if host.cpu_info:vendor equals pool.cpu_info:vendor.
- This means that xapi will additionally allow hosts that have a more extensive feature set than the pool (as long as the CPU vendor is common). Such hosts are transparently down-levelled to the pool level (without needing reboots).
- This further means that xapi will additionally allow hosts that have a less extensive feature set than the pool (as long as the CPU vendor is common). In this case, the pool is transparently down-levelled to the new host’s level (without needing reboots). Note that this does not affect any running VMs in any way: their mobility will not be restricted, and they can still migrate to any host they could migrate to before. It does mean that those running VMs will not be migratable to the new host.
- The current error raised in case of a CPU mismatch is POOL_HOSTS_NOT_HOMOGENEOUS with reason argument "CPUs differ". This will remain the error that is raised if the pool join fails due to incompatible CPU vendors.
- The pool.other_config:cpuid_feature_mask override key will no longer have any effect.
host.set_cpu_features and host.reset_cpu_features will be removed: it is no longer possible to use the old method of CPU feature masking (CPU feature sets are controlled automatically by xapi). Calls will fail with MESSAGE_REMOVED.
VM lifecycle operations will be updated internally to use the new feature fields, to ensure that:
- Newly started VMs will be given CPU features according to the pool level for maximal mobility.
- For safety, running VMs will maintain their feature set across migrations and suspend/resume cycles. CPU features will transparently be hidden from VMs.
- Furthermore, migrate and resume will only be allowed in case the target host’s CPUs are capable enough, i.e. host.cpu_info:vendor = VM.last_boot_CPU_flags:vendor and host.cpu_info:features_{pv,hvm} ⊇ VM.last_boot_CPU_flags:features. A VM_INCOMPATIBLE_WITH_THIS_HOST error will be returned otherwise (as happens today).
- For cross pool migrations, to ensure maximal mobility in the target pool, a stricter condition will apply: the VM must satisfy the pool CPU level rather than just the target host’s level: pool.cpu_info:vendor = VM.last_boot_CPU_flags:vendor and pool.cpu_info:features_{pv,hvm} ⊇ VM.last_boot_CPU_flags:features
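
The two conditions above (equal vendor, and the target's feature set being a superset of the VM's) can be sketched as follows. The record type and function names are hypothetical, and zero-extending the shorter feature array is an assumption.

```ocaml
(* Hypothetical sketch of the migration pre-check. *)
type cpu = { vendor : string; features : int64 array }

(* [subset a b] holds when every bit set in [a] is also set in [b];
   words missing from the shorter array are treated as zero. *)
let subset (a : int64 array) (b : int64 array) : bool =
  let word f i = if i < Array.length f then f.(i) else 0L in
  let n = max (Array.length a) (Array.length b) in
  let rec go i =
    i >= n
    || (Int64.logand (word a i) (Int64.lognot (word b i)) = 0L
        && go (i + 1))
  in
  go 0

(* [target] stands for host.cpu_info on an in-pool migration and for
   pool.cpu_info on a cross-pool one, with [features] already taken
   from features_pv or features_hvm to match the VM type. *)
let migration_allowed ~(vm : cpu) ~(target : cpu) : bool =
  vm.vendor = target.vendor && subset vm.features target.features
```

A caller would raise VM_INCOMPATIBLE_WITH_THIS_HOST when migration_allowed returns false.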

CLI Changes

The following changes to the xe CLI will be made:

xe host-cpu-info (as well as xe host-param-list and friends) will return the fields of host.cpu_info as described above.
xe host-set-cpu-features and xe host-reset-cpu-features will be removed.
xe host-get-cpu-features will still return the value of host.cpu_info:features for a given host.

Low-level implementation

Xenctrl

The old xc_get_boot_cpufeatures hypercall will be removed, and replaced by two new functions, which are available to xenopsd through the Xenctrl module:

external get_levelling_caps : handle -> int64 = "stub_xc_get_levelling_caps"

type featureset_index = Featureset_host | Featureset_pv | Featureset_hvm
external get_featureset : handle -> featureset_index -> int64 array = "stub_xc_get_featureset"

In particular, the get_featureset function will be used by xapi/xenopsd to ask Xen which are the widest sets of CPU features that it can offer to a VM (PV or HVM). I don’t think there is a use for get_levelling_caps yet.

Xenopsd

Update the type Host.cpu_info, which contains all the fields that need to go into the host.cpu_info field in the xapi DB. The type already exists but is unused. Add the function HOST.get_cpu_info to obtain an instance of the type. Some code from xapi and the cpuid.ml from xen-api-libs can be reused.
Add a platform key featureset (Vm.t.platformdata), which xenopsd will write to xenstore along with the other platform keys (no code change needed in xenopsd). Xenguest will pick this up when a domain is created, and will apply the CPUID policy to the domain. This has the effect of masking out features that the host may have, but which have a 0 in the feature set bitmap.
Review current cpuid-related functions in xc/domain.ml.

Xapi

Xapi startup

Update Create_misc.create_host_cpu function to use the new xenopsd call.
If the host features fall below pool level, e.g. due to a change in hardware: down-level the pool by updating pool.cpu_info.features_{pv,hvm}. Newly started VMs will inherit the new level; already running VMs will not be affected, but will not be able to migrate to this host.
To notify the admin of this event, an API alert (message) will be set: pool_cpu_features_downgraded.

VM start

Inherit feature set from pool (pool.cpu_info.features_{pv,hvm}) and set VM.last_boot_CPU_flags (cpuid_helpers.ml).
The domain will be started with this CPU feature set enabled, by writing the feature set string to platformdata (see above).

VM migrate and resume

There are already CPU compatibility checks on migration, both in-pool and cross-pool, as well as resume. Xapi compares VM.last_boot_CPU_flags of the VM to-migrate with host.cpu_info of the receiving host. Migration is only allowed if the CPU vendors are the same, and host.cpu_info:features ⊇ VM.last_boot_CPU_flags:features. The check can be overridden by setting the force argument to true.
For in-pool migrations, these checks will be updated to use the appropriate features_pv or features_hvm field.
For cross-pool migrations, these checks will be updated to use pool.cpu_info (features_pv or features_hvm depending on how the VM was booted) rather than host.cpu_info.
If the above checks pass, then the VM.last_boot_CPU_flags will be maintained, and the new domain will be started with the same CPU feature set enabled, by writing the feature set string to platformdata (see above).
In case the VM is migrated to a host with a higher xapi software version (e.g. a migration from a host that does not have CPU levelling v2), the feature string may be longer. This may happen during a rolling pool upgrade or a cross-pool migration, or when a suspended VM is resumed after an upgrade. In this case, the following safety rules apply:
- Only the existing (shorter) feature string will be used to determine whether the migration will be allowed. This is the best we can do, because we are unaware of the state of the extended feature set on the older host.
- The existing feature set in VM.last_boot_CPU_flags will be extended with the extra bits in host.cpu_info:features_{pv,hvm}, i.e. the widest feature set that can possibly be granted to the VM (just in case the VM was using any of these features before the migration).
- Strictly speaking, a migration of a VM from host A to B that was allowed before B was upgraded, may no longer be allowed after the upgrade, due to stricter feature sets in the new implementation (from the xc_get_featureset hypercall). However, the CPU features that are switched off by the new implementation are features that a VM would not have been able to actually use. We therefore need a don’t-care feature set (similar to the old pool.other_config:cpuid_feature_mask key) with bits that we may ignore in migration checks, and switch off after the migration. This will be a xapi config file option.
- XXX: Can we actually block a cross-pool migration at the receiver end??
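
The first two safety rules above can be sketched as a pair of functions over feature words. The names are hypothetical and the don’t-care mask from the third rule is left out, so this is an illustration of the idea rather than the implementation.

```ocaml
(* Hypothetical sketch of the upgrade rules for feature strings. *)

(* Rule 1: only the words present in the VM's (shorter) string take
   part in the compatibility check. *)
let allowed_on_prefix ~(vm : int64 array) ~(host : int64 array) : bool =
  let ok = ref true in
  Array.iteri
    (fun i w ->
      let h = if i < Array.length host then host.(i) else 0L in
      (* Any bit set in the VM but not in the host blocks migration. *)
      if Int64.logand w (Int64.lognot h) <> 0L then ok := false)
    vm;
  !ok

(* Rule 2: keep the VM's words and append the host's extra words. *)
let extend ~(vm : int64 array) ~(host : int64 array) : int64 array =
  let n = Array.length vm in
  if Array.length host <= n then vm
  else Array.append vm (Array.sub host n (Array.length host - n))
```

In this sketch extend never changes the words the VM already had, which matches the aim of maintaining the VM's existing feature set across the migration.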

VM import

The VM.last_boot_CPU_flags field must be upgraded to the new format (only really needed for VMs that were suspended while exported; preserve_power_state=true), as described above.

Pool join

Update pool join checks according to the rules above (see pool.join), i.e. remove the CPU features constraints.

Upgrade

The pool level (pool.cpu_info) will be initialised when the pool master upgrades, and automatically adjusted if needed (downwards) when slaves are upgraded, by each upgraded host’s start-up sequence (as above under “Xapi startup”).
The VM.last_boot_CPU_flags fields of running and suspended VMs will be “upgraded” to the new format on demand, when a VM is migrated to or resumed on an upgraded host, as described above.

XenCenter integration

Don’t explicitly down-level upon join anymore
Become aware of new pool join rule
Update Rolling Pool Upgrade

Design document
Revision	v1
Status	proposed

Distributed database

All hosts in a pool use the shared database by sending queries to the pool master. This creates

a performance bottleneck as the pool size increases
a reliability problem when the master fails.

The reliability problem can be ameliorated by running with HA enabled, but this is not always possible.

Both problems can be addressed by observing that the database objects correspond to distinct physical objects where eventual consistency is perfectly ok. For example if host ‘A’ is running a VM and changes the VM’s name, it doesn’t matter if it takes a while before the change shows up on host ‘B’. If host ‘B’ changes its network configuration then it doesn’t matter how long it takes host ‘A’ to notice. We would still like the metadata to be replicated to cope with failure, but we can allow changes to be committed locally and synchronised later.

Note the one exception to this pattern: the current SM plugins use database fields to implement locks. This should be shifted to a special-purpose lock acquire/release API.

Using git via Irmin

A git repository is a database of key=value pairs with branching history. If we placed our host and VM metadata in git then we could commit changes and pull and push them between replicas. The Irmin library provides an easy programming interface on top of git which we could link with the Xapi database layer.

Proposed new architecture

Pools of one

The diagram above shows two hosts: one a master and the other a regular host. The XenAPI client has sent a request to the wrong host; normally this would result in a HOST_IS_SLAVE error being sent to the client. In the new world, the host is able to process the request, only contacting the master if it is necessary to acquire a lock. Starting a VM would require a lock; but rebooting or migrating an existing VM would not. Assuming the lock can be acquired, then the operation is executed locally with all state updates being made to a git topic branch.

Topic branches

Roughly we would have 1 topic branch per pending XenAPI Task. Once the Task completes successfully, the topic branch (containing the new VM state) is merged back into master. Separately each host will pull and push updates between each other for replication.

We would avoid merge conflicts by construction; either

a host’s configuration will always be “owned” by the host and it will be an error for anyone else to merge updates to it
the master’s locking will guarantee that a VM is running on at most one host at a time. It will be an error for anyone else to merge updates to it.

What we gain

We will gain the following

the master will only be a bottleneck when the number of VM locks gets really large;
you will be able to connect XenCenter to hosts without a master and manage them. Today such hosts are unmanageable.
the database will have a history and you’ll be able to “go back in time” either for debugging or to recover from mistakes
bugs caused by concurrent threads (in separate Tasks) confusing each other will be vanquished. A typical failure mode is: one active thread destroys an object; a passive thread sees the object and then tries to read it and gets a database failure instead. Since every thread is operating a separate Task they will all have their own branch and will be isolated from each other.

What we lose

We will lose the following

the ability to use the Xapi database as a “lock”
coherence between hosts: there will be no guarantee that an effect seen by host ‘A’ will be seen immediately by host ‘B’. In particular this means that clients should send all their commands and event.from calls to the same host (although any host will do)

Stuff we need to build

A pull/push replicator: this would have to monitor the list of hosts in the pool and distribute updates to them in some vaguely efficient manner. Ideally we would avoid hassling the pool master and use some more efficient topology: perhaps a tree?
A git diff to XenAPI event converter: whenever a host pulls updates from another it needs to convert the diff into a set of touched objects for any event.from to read. We could send the changeset hash as the event.from token.
Irmin nested views: since Tasks can be nested (and git branches can be nested) we need to make sure that Irmin views can be nested.
We need to go through the xapi code and convert all mixtures of database access and XenAPI updates into pure database calls. With the previous system it was better to use a XenAPI call to remote large chunks of database effects to the master than to perform them locally. It will now be better to run them all locally and merge them at the end. Additionally, since a Task will have a local branch, it won’t be possible to see the state on a remote host without triggering an early merge (which would harm efficiency).
We need to create a first-class locking API to use instead of the VDI.sm_config locks.
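
The diff-to-event converter mentioned above could start from the list of file paths that a pull changed. The sketch below is hypothetical: it assumes the xapi/<table>/<reference>/<field> layout used by the prototype described in this document, and the function name is made up for illustration.

```ocaml
(* Hypothetical sketch: turn changed file paths from a diff into the
   set of touched (table, reference) pairs. *)
let touched_objects (changed_paths : string list) : (string * string) list =
  changed_paths
  |> List.filter_map (fun path ->
         match String.split_on_char '/' path with
         | "xapi" :: table :: reference :: _ -> Some (table, reference)
         | _ -> None (* not an object field: ignore *))
  |> List.sort_uniq compare
```

Deduplicating means that an object whose fields changed several times in one pull would be reported only once to event.from.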

Prototype

A basic prototype has been created:

$ opam pin xen-api-client git://github.com/djs55/xen-api-client#improvements
$ opam pin add xapi-database git://github.com/djs55/xapi-database
$ opam pin add xapi git://github.com/djs55/xen-api#schema-sexp

The xapi-database is a clone of the existing Xapi database code configured to run as a separate process. There is code to convert from XML to git and an implementation of the Xapi remote database API which uses the following layout:

$ git clone /xapi.db db
Cloning into 'db'...
done.

$ cd db; ls
xapi

$ ls xapi
console   host_metrics  PCI          pool     SR      user  VM
host      network       PIF          session  tables  VBD   VM_metrics
host_cpu  PBD           PIF_metrics  SM       task    VDI

$ ls xapi/pool
OpaqueRef:39adc911-0c32-9e13-91a8-43a25939110b

$ ls xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/
crash_dump_SR                 __mtime           suspend_image_SR
__ctime                       name_description  uuid
default_SR                    name_label        vswitch_controller
ha_allow_overcommit           other_config      wlb_enabled
ha_enabled                    redo_log_enabled  wlb_password
ha_host_failures_to_tolerate  redo_log_vdi      wlb_url
ha_overcommitted              ref               wlb_username
ha_plan_exists_for            _ref              wlb_verify_cert
master                        restrictions

$ ls xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/other_config/
cpuid_feature_mask  memory-ratio-hvm  memory-ratio-pv

$ cat xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/other_config/cpuid_feature_mask
ffffff7f-ffffffff-ffffffff-ffffffff

Notice how:

every object is a directory
every key/value pair is represented as a file
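
The layout above maps naturally onto nested dictionaries. The following is a minimal Python sketch, not part of the prototype, that reads one object directory back into a dictionary. It assumes that plain files hold field values and that sub-directories such as other_config hold map-valued fields; the name read_object is hypothetical.

```python
import os


def read_object(path):
    """Read an object directory into a dict (hypothetical helper).

    Files become string values. Sub-directories are assumed to be
    map-valued fields (such as other_config) and become nested dicts.
    """
    result = {}
    for name in sorted(os.listdir(path)):
        entry = os.path.join(path, name)
        if os.path.isdir(entry):
            result[name] = read_object(entry)
        else:
            with open(entry) as f:
                # Assumption: a trailing newline is not part of the value.
                result[name] = f.read().rstrip("\n")
    return result
```

Calling read_object on one of the OpaqueRef directories under xapi/pool would return the pool's fields, with other_config as a nested dictionary.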

Design document
Revision	v1
Status	released (6.0.2)

Emergency Network Reset Design

This document describes design details for the PR-1032 requirements.

The design consists of four parts:

A new XenAPI call Host.reset_networking, which removes all the PIFs, Bonds, VLANs and tunnels associated with the given host, and a call PIF.scan_bios to bring back the PIFs with device names as defined in the BIOS.
A xe-reset-networking script that can be executed on a XenServer host, which prepares the reset and causes the host to reboot.
An xsconsole page that essentially does the same as xe-reset-networking.
A new item in the XAPI start-up sequence, which when triggered by xe-reset-networking, calls Host.reset_networking and re-creates the PIFs.

Command-Line Utility

The xe-reset-networking script takes the following parameters:

Parameter	Description
`-m`, `--master`	The IP address of the master. Optional if the host is a pool slave, ignored otherwise.
`--device`	Device name of management interface. Optional. If not specified, it is taken from the firstboot data.
`--mode`	IP configuration mode for management interface. Optional. Either `dhcp` or `static` (default is `dhcp`).
`--ip`	IP address for management interface. Required if `--mode=static`, ignored otherwise.
`--netmask`	Netmask for management interface. Required if `--mode=static`, ignored otherwise.
`--gateway`	Gateway for management interface. Optional; ignored if `--mode=dhcp`.
`--dns`	DNS server for management interface. Optional; ignored if `--mode=dhcp`.
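
The parameter rules in the table can be sketched with Python's argparse module. This is an illustrative sketch only and not the actual script: the function names build_parser and parse_args are hypothetical, and the dependency between `--mode=static` and `--ip`/`--netmask` is checked by hand after parsing.

```python
import argparse


def build_parser():
    """Build a parser mirroring the parameter table (illustrative only)."""
    parser = argparse.ArgumentParser(prog="xe-reset-networking")
    parser.add_argument("-m", "--master")
    parser.add_argument("--device")
    parser.add_argument("--mode", choices=["dhcp", "static"], default="dhcp")
    parser.add_argument("--ip")
    parser.add_argument("--netmask")
    parser.add_argument("--gateway")
    parser.add_argument("--dns")
    return parser


def parse_args(argv):
    """Parse argv and enforce that static mode has an IP and a netmask."""
    args = build_parser().parse_args(argv)
    if args.mode == "static":
        missing = [n for n in ("ip", "netmask") if getattr(args, n) is None]
        if missing:
            raise ValueError(
                "--mode=static requires: "
                + ", ".join("--" + n for n in missing)
            )
    return args
```

For example, parse_args(["--mode", "static", "--ip", "10.0.0.2"]) raises ValueError because no netmask was given.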


The script takes the following steps after processing the given parameters:

Inform the user that the host will be restarted, and that any running VMs should be shut down. Make the user confirm that they really want to reset the networking by typing ‘yes’.
Read /etc/xensource/pool.conf to determine whether the host is a pool master or pool slave.
If a pool slave, update the IP address in the pool.conf file to the one given in the -m parameter, if present.
Shut down networking subsystem (service network stop).
If no management device is specified, take it from /etc/firstboot.d/data/management.conf.
If XAPI is running, stop it.
Reconfigure the management interface and associated bridge by interface-reconfigure --force.
Update MANAGEMENT_INTERFACE and clear CURRENT_INTERFACES in /etc/xensource-inventory.
Create the file /tmp/network-reset to trigger XAPI to complete the network reset after the reboot. This file should contain the full configuration details of the management interface as key/value pairs (format: <key>=<value>\n), and looks similar to the firstboot data files. The file contains at least the keys DEVICE and MODE, and IP, NETMASK, GATEWAY, or DNS when appropriate.
Reboot
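
The trigger file format described above (one <key>=<value> pair per line) can be sketched as follows. This is a hedged illustration, not the script's code: the function names are hypothetical, and the sketch assumes that values are written without quoting.

```python
def format_network_reset(config):
    """Render a dict as <key>=<value> lines, one pair per line."""
    return "".join("%s=%s\n" % (key, value) for key, value in config.items())


def parse_network_reset(text):
    """Parse <key>=<value> lines and check the keys that must be present."""
    config = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        key, _, value = line.partition("=")
        config[key] = value
    for required in ("DEVICE", "MODE"):
        if required not in config:
            raise ValueError("missing key: " + required)
    return config
```

A static configuration would then carry IP and NETMASK (and optionally GATEWAY and DNS) alongside DEVICE and MODE.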

XAPI

XenAPI

A new hidden API call:

Host.reset_networking
- Parameter: host reference host
- Calling this function removes all the PIF, Bond, VLAN and tunnel objects associated with the given host from the master database. All Network and VIF objects are maintained, as these do not necessarily belong to a single host.

Start-up Sequence

After reboot, in the XAPI start-up sequence triggered by the presence of /tmp/network-reset:

Read the desired management configuration from /tmp/network-reset.
Call Host.reset_networking with a ref to the localhost.
Call PIF.scan with a ref to the localhost to recreate the (physical) PIFs.
Call PIF.reconfigure_ip to configure the management interface.
Call Host.management_reconfigure.
Delete /tmp/network-reset.

xsconsole

Add an “Emergency Network Reset” option under the “Network and Management Interface” menu. Selecting this option will show some explanation in the pane on the right-hand side. Pressing <Enter> will bring up a dialogue to select the interface to use as the management interface after the reset. After choosing a device, the dialogue continues with configuration options like in the “Configure Management Interface” dialogue. After completing the dialogue, the same steps as listed for xe-reset-networking are executed.

Notes

On a pool slave, the management interface should be the same as on the master (the same device name, e.g. eth0).
Resetting the networking configuration on the master should ideally be followed by resets of the pool slaves as well, in order to synchronise their configuration (especially bonds/VLANs/tunnels). Furthermore, in case the IP address of the master has changed, as a result of a network reset or Host.management_reconfigure, pool slaves may also use the network reset functionality to reconnect to the master on its new IP.

Design document
Revision	v3
Status	proposed
Review	#120

FCoE capable NICs

It is possible to identify the NICs of a host that support FCoE. This property is listed in the capabilities field of the PIF object.

Introduction

FCoE support on a NIC is a hardware property. With the help of dcbtool, we can identify which NICs support FCoE.
The new field capabilities will be of type Set(String) in the PIF object. The PIF of an FCoE-capable NIC will have the string “fcoe” in its capabilities field.
The capabilities field will be read-only; it cannot be modified by the user.

PIF Object

New field:

Field PIF.capabilities will be of type Set(string).
The default value of PIF.capabilities will be the empty set.

Xapi Changes

Set “fcoe” in the capabilities field depending on the output of the xcp-networkd call get_capabilities.
The “fcoe” capability can be set in introduce_internal when creating a PIF.
The “fcoe” capability can be updated in refresh_all on xapi startup.
The field will therefore be set every time xapi restarts.

XCP-Networkd Changes

New function:

string list get_capabilities (string)
Argument: device_name for the PIF.
This function calls method capable exposed by fcoe_driver.py as part of dom0.
It returns string list [“fcoe”] or [] depending on capable method output.

Defaults, Installation and Upgrade

Any newly introduced PIF will have an empty set in its capabilities field until the fcoe_driver method capable states that FCoE is supported on the NIC.
This includes PIFs obtained after a fresh install of XenServer, as well as PIFs created using PIF.introduce then PIF.scan.
During an upgrade, the xapi restart will call refresh_all, which then populates the capabilities field as an empty set.

Command Line Interface

The PIF.capabilities field is exposed through xe pif-list and xe pif-param-list as usual.

Design document
Revision	v1
Status	released (6.0)

GPU pass-through support

This document contains the software design for GPU pass-through. This code was originally included in the version of Xapi used in XenServer 6.0.

Overview

Rather than modelling GPU pass-through from a PCI perspective, and having the user manipulate PCI devices directly, we are taking a higher-level view by introducing a dedicated graphics model. The graphics model is similar to the networking and storage model, in which virtual and physical devices are linked through an intermediate abstraction layer (e.g. the “Network” class in the networking model).

The basic graphics model is as follows:

A host owns a number of physical GPU devices (pGPUs), each of which is available for passing through to a VM.
A VM may have a virtual GPU device (vGPU), which means it expects to have access to a GPU when it is running.
Identical pGPUs are grouped across a resource pool in GPU groups. GPU groups are automatically created and maintained by XS.
A GPU group connects vGPUs to pGPUs in the same way as VIFs are connected to PIFs by Network objects: for a VM v having a vGPU on GPU group p to run on host h, host h must have a pGPU in GPU group p and pass it through to VM v.
VM start and non-live migration rules are analogous to the network API and follow the above rules.
If a VM that has a vGPU is started while no pGPU is available, an exception will occur and the VM won’t start. As a result, in order to guarantee that a VM always has access to a pGPU, the number of vGPUs should not exceed the number of pGPUs in a GPU group.

Currently, the following restrictions apply:

Hotplug is not supported.
Suspend/resume and checkpointing (memory snapshots) are not supported.
Live migration (XenMotion) is not supported.
No more than one GPU per VM will be supported.
Only Windows guests will be supported.

XenAPI Changes

The design introduces a new generic class called PCI to capture state and information about relevant PCI devices in a host. By default, xapi would not create PCI objects for all PCI devices, but only for the ones that are managed and configured by xapi; currently only GPU devices.

The PCI class has no fields specific to the type of the PCI device (e.g. a graphics card or NIC). Instead, device specific objects will contain a link to their underlying PCI device’s object.

The new XenAPI classes and changes to existing classes are detailed below.

PCI class

Fields:

Name	Type	Description
uuid	string	Unique identifier/object reference.
class_id	string	PCI class ID (hidden field)
class_name	string	PCI class name (GPU, NIC, …)
vendor_id	string	Vendor ID (hidden field).
vendor_name	string	Vendor name.
device_id	string	Device ID (hidden field).
device_name	string	Device name.
host	host ref	The host that owns the PCI device.
pci_id	string	BDF (domain/Bus/Device/Function identifier) of the (physical) PCI function, e.g. “0000:00:1a.1”. The format is hhhh:hh:hh.h, where h is a hexadecimal digit.
functions	int	Number of (physical + virtual) functions; currently fixed at 1 (hidden field).
attached_VMs	VM ref set	List of VMs that have this PCI device “currently attached”, i.e. plugged, i.e. passed-through to (hidden field).
dependencies	PCI ref set	List of dependent PCI devices: all of these need to be passed-thru to the same VM (co-location).
other_config	(string -> string) map	Additional optional configuration (as usual).

Hidden fields are only for use by xapi internally, and not visible to XenAPI users.

Messages: none.
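
The pci_id format described above (hhhh:hh:hh.h) can be validated and split with a regular expression. The following Python sketch is illustrative only; the function name parse_pci_id and the returned dictionary keys are assumptions, not part of the XenAPI.

```python
import re

# hhhh:hh:hh.h, where h is a hexadecimal digit (as stated for PCI.pci_id).
BDF_RE = re.compile(
    r"^([0-9a-fA-F]{4}):([0-9a-fA-F]{2}):([0-9a-fA-F]{2})\.([0-9a-fA-F])$"
)


def parse_pci_id(pci_id):
    """Split a BDF string into its numeric domain/bus/device/function parts."""
    match = BDF_RE.match(pci_id)
    if match is None:
        raise ValueError("not a valid BDF: %r" % pci_id)
    domain, bus, device, fn = (int(group, 16) for group in match.groups())
    return {"domain": domain, "bus": bus, "device": device, "fn": fn}
```

For the example in the table, parse_pci_id("0000:00:1a.1") yields domain 0, bus 0, device 26 and function 1.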

PGPU class

A physical GPU device (pGPU).

Fields:

Name	Type	Description
uuid	string	Unique identifier/object reference.
PCI	PCI ref	Link to the underlying PCI device.
other_config	(string -> string) map	Additional optional configuration (as usual).
host	host ref	The host that owns the GPU.
GPU_group	GPU_group ref	GPU group the pGPU is contained in. Can be Null.

Messages: none.

GPU_group class

A group of identical GPUs across hosts. A VM that is associated with a GPU group can use any of the GPUs in the group. A VM does not need to install new GPU drivers if moving from one GPU to another one in the same GPU group.

Fields:

Name	Type	Description
VGPUs	VGPU ref set	List of vGPUs in the group.
uuid	string	Unique identifier/object reference.
PGPUs	PGPU ref set	List of pGPUs in the group.
other_config	(string -> string) map	Additional optional configuration (as usual).
name_label	string	A human-readable name.
name_description	string	A notes field containing human-readable description.
GPU_types	string set	List of GPU types (vendor+device ID) that can be in this group (hidden field).

Messages: none.

VGPU class

A virtual GPU device (vGPU).

Fields:

Name	Type	Description
uuid	string	Unique identifier/object reference.
VM	VM ref	VM that owns the vGPU.
GPU_group	GPU_group ref	GPU group the vGPU is contained in.
currently_attached	bool	Reflects whether the virtual device is currently “connected” to a physical device.
device	string	Order in which the devices are plugged into the VM. Restricted to “0” for now.
other_config	(string -> string) map	Additional optional configuration (as usual).

Messages:

Prototype	Description
VGPU ref create (GPU_group ref, string, VM ref)	Manually assign the vGPU device to the VM given a device number, and link it to the given GPU group.
void destroy (VGPU ref)	Remove the association between the GPU group and the VM.

It is possible to assign more vGPUs to a group than the number of pGPUs in the group. When a VM is started, a pGPU must be available; if not, the VM will not start. Therefore, to guarantee that a VM has access to a pGPU at any time, one must manually enforce that the number of vGPUs in a GPU group does not exceed the number of pGPUs. XenCenter might display a warning, or simply refuse to assign a vGPU, if this constraint is violated. This is analogous to the handling of memory availability in a pool: a VM may not be able to start if there is no host having enough free memory.

VM class

Fields:

Deprecate (unused) PCI_bus field
Add field VGPU ref set VGPUs: List of vGPUs.
Add field PCI ref set attached_PCIs: List of PCI devices that are “currently attached” (plugged, passed-through) (hidden field).

host class

Fields:

Add field PCI ref set PCIs: List of PCI devices.
Add field PGPU ref set PGPUs: List of physical GPU devices.
Add field (string -> string) map chipset_info, which contains at least the key iommu. The value "true" indicates that the host has IOMMU/VT-d support built in and that this functionality is enabled by Xen; the value will be "false" otherwise.

Initialisation and Operations

Enabling IOMMU/VT-d

(This may not be needed in Xen 4.1. Confirm with Simon.)

Provide a command that does this:

/opt/xensource/libexec/xen-cmdline --set-xen iommu=1
reboot

Xapi startup

Definitions:

PCI devices are matched on the combination of their pci_id, vendor_id, and device_id.

First boot and any subsequent xapi start:

Find out from dmesg whether IOMMU support is present and enabled in Xen, and set host.chipset_info:iommu accordingly.
Detect GPU devices currently present in the host. For each:
1. If there is no matching PGPU object yet, create a PGPU object, and add it to a GPU group containing identical PGPUs, or a new group.
2. If there is no matching PCI object yet, create one, and also create or update the PCI objects for dependent devices.
Destroy all existing PCI objects of devices that are not currently present in the host (i.e. objects for devices that have been replaced or removed).
Destroy all existing PGPU objects of GPUs that are not currently present in the host. Send a XenAPI alert to notify the user of this fact.
Update the list of dependencies on all PCI objects.
Sync VGPU.currently_attached on all VGPU objects.
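
The create/destroy part of this synchronisation amounts to a set difference over the matching key (pci_id, vendor_id, device_id). The following Python sketch illustrates only that step, under the assumption that devices and database objects are represented as plain dictionaries; the function name reconcile is hypothetical.

```python
def reconcile(detected, existing):
    """Compare detected devices with existing objects.

    Both arguments are iterables of dicts with the keys pci_id,
    vendor_id and device_id. Returns (to_create, to_destroy) as
    sorted lists of matching keys.
    """
    def key(dev):
        return (dev["pci_id"], dev["vendor_id"], dev["device_id"])

    detected_keys = {key(d) for d in detected}
    existing_keys = {key(e) for e in existing}
    to_create = sorted(detected_keys - existing_keys)
    to_destroy = sorted(existing_keys - detected_keys)
    return to_create, to_destroy
```

A device that was replaced by a different model in the same slot keeps its pci_id but changes its vendor or device ID, so it shows up once in each list.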

Upgrade

For any VMs that have VM.other_config:pci set to use a GPU, create an appropriate vGPU, and remove the other_config option.

Generic PCI Interface

A generic PCI interface exposed to higher-level code, such as the networking and GPU management modules within Xapi. This functionality relies on Xenops.

The PCI module exposes the following functions:

Check whether a PCI device has free (unassigned) functions. This is the case if the number of assignments in PCI.attached_VMs is smaller than PCI.functions.
Plug a PCI function into a running VM.
1. Raise exception if there are no free functions.
2. Plug PCI device, as well as dependent PCI devices. The PCI module must also tell device-specific modules to update the currently_attached field on dependent VGPU objects etc.
3. Update PCI.attached_VMs.
Unplug a PCI function from a running VM.
1. Raise exception if the PCI function is not owned by (passed through to) the VM.
2. Unplug PCI device, as well as dependent PCI devices. The PCI module must also tell device-specific modules to update the currently_attached field on dependent VGPU objects etc.
3. Update PCI.attached_VMs.
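
The bookkeeping behind these three functions can be sketched in Python as follows. This is a simplified model under stated assumptions: the class and function names are hypothetical, the actual pass-through (done via Xenops) is omitted, and the notification of device-specific modules is left out.

```python
class PciError(Exception):
    """Raised when a plug or unplug request cannot be satisfied."""


class PciDevice:
    def __init__(self, pci_id, functions=1, dependencies=()):
        self.pci_id = pci_id
        self.functions = functions          # PCI.functions
        self.attached_vms = []              # PCI.attached_VMs
        self.dependencies = list(dependencies)

    def has_free_functions(self):
        # Free if fewer assignments than functions.
        return len(self.attached_vms) < self.functions


def plug(device, vm):
    """Record that device and its dependent devices are attached to vm."""
    if not device.has_free_functions():
        raise PciError("no free functions on " + device.pci_id)
    for dev in [device] + device.dependencies:
        dev.attached_vms.append(vm)


def unplug(device, vm):
    """Remove the attachment of device and its dependencies from vm."""
    if vm not in device.attached_vms:
        raise PciError(device.pci_id + " is not passed through to " + vm)
    for dev in [device] + device.dependencies:
        if vm in dev.attached_vms:
            dev.attached_vms.remove(vm)
```

With functions fixed at 1, a second plug of the same device fails until the first VM has unplugged it.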

Construction and Destruction

VGPU.create:

Check license. Raise FEATURE_RESTRICTED if the GPU feature has not been enabled.
Raise INVALID_DEVICE if the given device number is not “0”, or DEVICE_ALREADY_EXISTS if (indeed) the device already exists. This is a convenient way of enforcing that only one vGPU per VM is supported, for now.
Create VGPU object in the DB.
Initialise VGPU.currently_attached = false.
Return a ref to the new object.

VGPU.destroy:

Raise OPERATION_NOT_ALLOWED if VGPU.currently_attached = true and the VM is running.
Destroy VGPU object.
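
The checks in VGPU.create and VGPU.destroy can be sketched in Python as follows. This is an illustration only: the database is modelled as a dictionary keyed by (VM, device), the function and parameter names are hypothetical, and the sketch returns the record itself rather than a reference.

```python
class ApiError(Exception):
    """Carries a XenAPI-style error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def vgpu_create(db, gpu_group, device, vm, gpu_feature_enabled=True):
    """Create a VGPU record after the licence and device-number checks."""
    if not gpu_feature_enabled:
        raise ApiError("FEATURE_RESTRICTED")
    if device != "0":
        raise ApiError("INVALID_DEVICE")
    if (vm, device) in db:
        raise ApiError("DEVICE_ALREADY_EXISTS")
    vgpu = {
        "VM": vm,
        "GPU_group": gpu_group,
        "device": device,
        "currently_attached": False,
    }
    db[(vm, device)] = vgpu
    return vgpu


def vgpu_destroy(db, vgpu, vm_running):
    """Destroy a VGPU record unless it is attached to a running VM."""
    if vgpu["currently_attached"] and vm_running:
        raise ApiError("OPERATION_NOT_ALLOWED")
    del db[(vgpu["VM"], vgpu["device"])]
```

Because only device "0" is accepted, a second vgpu_create for the same VM always fails with DEVICE_ALREADY_EXISTS, which is how the one-vGPU-per-VM limit is enforced.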

VM Operations

VM.start(_on):

If host.chipset_info:iommu = "false", raise VM_REQUIRES_IOMMU.
Raise FEATURE_REQUIRES_HVM (carrying the string “GPU passthrough needs HVM”) if the VM is PV rather than HVM.
For each of the VM’s vGPUs:
1. Confirm that the given host has a pGPU in its associated GPU group. If not, raise VM_REQUIRES_GPU.
2. Consult the generic PCI module for all pGPUs in the group to find out whether a suitable PCI function is available. If a physical device is not available, raise VM_REQUIRES_GPU.
3. Ask PCI module to plug an available pGPU into the VM’s domain and set VGPU.currently_attached to true. As a side-effect, any dependent PCI devices would be plugged.
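
The VM.start(_on) checks can be sketched in Python as follows. This is a simplified model, not xapi's code: hosts and VMs are plain dictionaries, the function name is hypothetical, and the sketch assumes that the IOMMU and HVM checks only apply to VMs that actually have vGPUs. The plugging itself is omitted; the function only selects a pGPU per vGPU.

```python
class ApiError(Exception):
    """Carries a XenAPI-style error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


def select_pgpus_for_start(host, vm):
    """Return one free pGPU per vGPU, or raise the relevant error.

    host: {"iommu": bool, "pgpus": [{"pci_id", "group", "free"}]}
    vm:   {"hvm": bool, "vgpu_groups": [group, ...]}
    """
    if not vm["vgpu_groups"]:
        return []  # assumption: VMs without vGPUs skip all checks
    if not host["iommu"]:
        raise ApiError("VM_REQUIRES_IOMMU")
    if not vm["hvm"]:
        raise ApiError("FEATURE_REQUIRES_HVM")
    chosen, used = [], set()
    for group in vm["vgpu_groups"]:
        in_group = [p for p in host["pgpus"] if p["group"] == group]
        if not in_group:
            raise ApiError("VM_REQUIRES_GPU")
        free = [p for p in in_group if p["free"] and p["pci_id"] not in used]
        if not free:
            raise ApiError("VM_REQUIRES_GPU")
        chosen.append(free[0])
        used.add(free[0]["pci_id"])
    return chosen
```

The same error code is raised both when the host has no pGPU in the group and when all of them are in use, matching steps 1 and 2 above.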

VM.shutdown:

Ask PCI module to unplug all GPU devices.
Set VGPU.currently_attached to false for all the VM’s VGPUs.

VM.suspend, VM.resume(_on):

Raise VM_HAS_PCI_ATTACHED if the VM has any plugged VGPU objects, as suspend/resume for VMs with GPUs is currently not supported.

VM.pool_migrate:

Raise VM_HAS_PCI_ATTACHED if the VM has any plugged VGPU objects, as live migration for VMs with GPUs is currently not supported.

VM.clone, VM.copy, VM.snapshot:

Copy VGPU objects along with the VM.

VM.import, VM.export:

Include VGPU and GPU_group objects in the VM export format.

VM.checkpoint

Raise VM_HAS_PCI_ATTACHED if the VM has any plugged VGPU objects, as checkpointing for VMs with GPUs is currently not supported.

Pool Join and Eject

Pool join:

For each PGPU:
1. Copy it to the pool.
2. Add it to a GPU_group of identical PGPUs, or a new one.
Copy each VGPU to the pool together with the VM that owns it, and add it to the GPU group containing the same PGPU as before the join.

Step 1 is done automatically by the xapi startup code, and step 2 is handled by the VM export/import code. Hence, no work needed.

Pool eject:

VGPU objects will be automatically GC’ed when the VMs are removed.
Xapi’s startup code recreates the PGPU and GPU_group objects.

Hence, no work needed.

Required Low-level Interface

Xapi needs a way to obtain a list of all PCI devices present on a host. For each device, xapi needs to know:

The PCI ID (BDF).
The type of device (NIC, GPU, …) according to a well-defined and stable list of device types (as in /usr/share/hwdata/pci.ids).
The device and vendor ID+name (currently, for PIFs, xapi looks up the name in /usr/share/hwdata/pci.ids).
Which other devices/functions are required to be passed through to the same VM (co-located), e.g. other functions of a compound PCI device.

Command-Line Interface (xe)

xe pgpu-list
xe pgpu-param-list/get/set/add/remove/clear
xe gpu-group-list
xe gpu-group-param-list/get/set/add/remove/clear
xe vgpu-list
xe vgpu-create
xe vgpu-destroy
xe vgpu-param-list/get/set/add/remove/clear
xe host-param-get param-name=chipset-info param-key=iommu

Design document
Revision	v3
Status	released (7.0)
Revision history
v1	Documented interface changes between xapi and xenopsd for vGPU
v2	Added design for storing vGPU-to-pGPU allocation in xapi database
v3	Marked new xapi DB fields as internal-only

GPU support evolution

Introduction

As of XenServer 6.5, VMs can be provisioned with access to graphics processors (either emulated or passed through) in four different ways. Virtualisation of Intel graphics processors will exist as a fifth kind of graphics processing available to VMs. These five situations all require the VM’s device model to be created in subtly different ways:

Pure software emulation

qemu is launched with no special parameters if the basic Cirrus graphics processor is required; otherwise it is launched with the -std-vga flag.

Generic GPU passthrough

qemu is launched with the -priv flag to turn on privilege separation
qemu can additionally be passed the -std-vga flag to choose the corresponding emulated graphics card.

Intel integrated GPU passthrough (GVT-d)

As well as the -priv flag, qemu must be launched with the -std-vga and -gfx_passthru flags. The actual PCI passthrough is handled separately via xen.

NVIDIA vGPU

qemu is launched with the -vgpu flag
a secondary display emulator, demu, is launched with the following parameters:
- --domain - the VM’s domain ID
- --vcpus - the number of vcpus available to the VM
- --gpu - the PCI address of the physical GPU on which the emulated GPU will run
- --config - the path to the config file which contains detail of the GPU to emulate

Intel vGPU (GVT-g)

here demu is not used, but instead qemu is launched with five parameters:
- -xengt
- -vgt_low_gm_sz - the low GM size in MiB
- -vgt_high_gm_sz - the high GM size in MiB
- -vgt_fence_sz - the number of fence registers
- -priv
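
The five cases above can be summarised as a mapping from graphics mode to emulator arguments. The Python sketch below is illustrative only: the mode names and function names are hypothetical, and it assumes that each value is passed as a separate argument following its flag, which the text above does not specify.

```python
def qemu_gpu_args(mode, std_vga=False, gvt_g=None):
    """Return the GPU-related qemu flags for one of the five modes."""
    if mode == "software":
        return ["-std-vga"] if std_vga else []
    if mode == "passthrough":
        return ["-priv"] + (["-std-vga"] if std_vga else [])
    if mode == "gvt_d":
        return ["-priv", "-std-vga", "-gfx_passthru"]
    if mode == "nvidia":
        return ["-vgpu"]
    if mode == "gvt_g":
        return [
            "-xengt",
            "-vgt_low_gm_sz", str(gvt_g["low_gm_sz"]),
            "-vgt_high_gm_sz", str(gvt_g["high_gm_sz"]),
            "-vgt_fence_sz", str(gvt_g["fence_sz"]),
            "-priv",
        ]
    raise ValueError("unknown graphics mode: %r" % mode)


def demu_args(domid, vcpus, gpu, config):
    """Return the arguments for demu, used only for NVIDIA vGPU."""
    return [
        "--domain", str(domid),
        "--vcpus", str(vcpus),
        "--gpu", gpu,
        "--config", config,
    ]
```

Only the NVIDIA case needs both argument lists; in every other mode demu is not started at all.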

xenopsd

To handle all these possibilities, we will add some new types to xenopsd’s interface:

module Pci = struct
  type address = {
    domain: int;
    bus: int;
    device: int;
    fn: int;
  }

  ...
end

module Vgpu = struct
  type gvt_g = {
    physical_pci_address: Pci.address;
    low_gm_sz: int64;
    high_gm_sz: int64;
    fence_sz: int;
  }

  type nvidia = {
    physical_pci_address: Pci.address;
    config_file: string
  }

  type implementation =
    | GVT_g of gvt_g
    | Nvidia of nvidia

  type id = string * string

  type t = {
    id: id;
    position: int;
    implementation: implementation;
  }

  type state = {
    plugged: bool;
    emulator_pid: int option;
  }
end

module Vm = struct
  type igd_passthrough =
    | GVT_d

  type video_card =
    | Cirrus
    | Standard_VGA
    | Vgpu
    | Igd_passthrough of igd_passthrough

  ...
end

module Metadata = struct
  type t = {
    vm: Vm.t;
    vbds: Vbd.t list;
    vifs: Vif.t list;
    pcis: Pci.t list;
    vgpus: Vgpu.t list;
    domains: string option;
  }
end

The video_card type is used to indicate to the function Xenops_server_xen.VM.create_device_model_config how the VM’s emulated graphics card will be implemented. A value of Vgpu indicates that the VM needs to be started with one or more virtualised GPUs - the function will need to look at the list of GPUs associated with the VM to work out exactly what parameters to send to qemu.

If Vgpu.state.emulator_pid of a plugged vGPU is None, this indicates that the emulation of the vGPU is being done by qemu rather than by a separate emulator.

n.b. adding the vgpus field to Metadata.t will break backwards compatibility with old versions of xenopsd, so some upgrade logic will be required.

This interface will allow us to support multiple vGPUs per VM in future if necessary, although this may also require reworking the interface between xenopsd, qemu and demu. For now, xenopsd will throw an exception if it is asked to start a VM with more than one vGPU.

xapi

To support the above interface, xapi will convert all of a VM’s non-passthrough GPUs into Vgpu.t objects when sending VM metadata to xenopsd.

In contrast to GVT-d, which can only be run on an Intel GPU which has been hidden from dom0, GVT-g will only be allowed to run on a GPU which has not been hidden from dom0.

If a GVT-g-capable GPU is detected, and it is not hidden from dom0, xapi will create a set of VGPU_type objects to represent the vGPU presets which can run on the physical GPU. Exactly how these presets are defined is TBD, but a likely solution is via a set of config files as with NVIDIA vGPU.

Allocation of vGPUs to physical GPUs

For NVIDIA vGPU, when starting a VM, each vGPU attached to the VM is assigned to a physical GPU as a result of capacity planning at the pool level. The resulting configuration is stored in the VM.platform dictionary, under specific keys:

vgpu_pci_id - the address of the physical GPU on which the vGPU will run
vgpu_config - the path to the vGPU config file which the emulator will use

Instead of storing the assignment in these fields, we will add a new internal-only database field:

VGPU.scheduled_to_be_resident_on (API.ref_PGPU)

This will be set to the ref of the physical GPU on which the vGPU will run. From here, xapi can easily obtain the GPU’s PCI address. Capacity planning will also take into account which vGPUs are scheduled to be resident on a physical GPU, which will avoid races resulting from many vGPU-enabled VMs being started at once.

The path to the config file is already stored in the VGPU_type.internal_config dictionary, under the key vgpu_config. xapi will use this value directly rather than copying it to VM.platform.

To support other vGPU implementations, we will add another internal-only database field:

VGPU_type.implementation enum(Passthrough|Nvidia|GVT_g)

For the GVT_g implementation, no config file is needed. Instead, VGPU_type.internal_config will contain three key-value pairs, with the keys

vgt_low_gm_sz
vgt_high_gm_sz
vgt_fence_sz

The values of these pairs will be used to construct a value of type Xenops_interface.Vgpu.gvt_g, which will be passed down to xenopsd.

Design document
Revision	v1
Status	released (6.5)

GRO and other properties of PIFs

It has been possible to enable and disable GRO and other “ethtool” features on PIFs for a long time, but there was never an official API for it. Now there is.

Introduction

The former way to enable GRO via the CLI is as follows:

xe pif-param-set uuid=<pif-uuid> other-config:ethtool-gro=on
xe pif-plug uuid=<pif-uuid>

The other-config field is a grab-bag of options that are not clearly defined. The options exposed through other-config are mostly experimental features, and the interface is not considered stable. Furthermore, the field is read/write, does not have any input validation, and cannot trigger any actions immediately. The latter is why pif-plug must be called after setting the ethtool-gro key, in order to actually make things happen.

New API

New field:

Field PIF.properties of type (string -> string) map.
Physical and bond PIFs have a gro key in their properties, with possible values on and off. There are currently no other properties defined.
VLAN and Tunnel PIFs do not have any properties. They implicitly inherit the properties from the PIF they are based upon (either a physical PIF or a bond).
For backwards compatibility, if there is an other-config:ethtool-gro key present on the PIF, it will be treated as an override of the gro key in PIF.properties.

New function:

Message void PIF.set_property (PIF ref, string, string).
First argument: the reference of the PIF to act on.
Second argument: the key to change in the properties field.
Third argument: the value to write.
The function can only be used on physical PIFs that are not bonded, and on bond PIFs. Attempts to call the function on bond slaves, VLAN PIFs, or Tunnel PIFs, fail with CANNOT_CHANGE_PIF_PROPERTIES.
Calls with invalid keys or values fail with INVALID_VALUE.
When called on a bond PIF, the key in the properties of the associated bond slaves will also be set to the same value.
The function automatically causes the settings to be applied to the network devices (no additional plug is needed). This includes any VLANs that are on top of the PIF to-be-changed, as well as any bond slaves.
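
The validation rules of PIF.set_property can be sketched in Python as follows. This is a simplified model, not xapi's implementation: PIFs are plain dictionaries, the field names kind, is_bond_slave and slaves are hypothetical, and applying the settings to the network devices is omitted.

```python
class ApiError(Exception):
    """Carries a XenAPI-style error code."""

    def __init__(self, code):
        super().__init__(code)
        self.code = code


# The only property currently defined, with its permitted values.
VALID_PROPERTIES = {"gro": {"on", "off"}}


def set_property(pif, key, value):
    """Validate and store a property, propagating it to bond slaves."""
    allowed_kind = pif["kind"] in ("physical", "bond")
    if not allowed_kind or pif.get("is_bond_slave", False):
        raise ApiError("CANNOT_CHANGE_PIF_PROPERTIES")
    if key not in VALID_PROPERTIES or value not in VALID_PROPERTIES[key]:
        raise ApiError("INVALID_VALUE")
    pif["properties"][key] = value
    for slave in pif.get("slaves", []):
        slave["properties"][key] = value
```

Setting gro on a bond PIF therefore updates the bond and every slave in one call, while a direct call on a slave is rejected.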

Defaults, Installation and Upgrade

Any newly introduced PIF will have its properties field set to "gro" -> "on". This includes PIFs obtained after a fresh installation of XenServer, as well as PIFs created using PIF.introduce or PIF.scan. In other words, GRO will be “on” by default.
An upgrade from a version of XenServer that does not have the PIF.properties field, will give every physical and bond PIF a properties field set to "gro" -> "on". In other words, GRO will be “on” by default after an upgrade.

Bonding

When creating a bond, the bond-slaves-to-be must all have equal PIF.properties. If not, the bond.create call will fail with INCOMPATIBLE_BOND_PROPERTIES.
When a bond is created successfully, the properties of the bond PIF will be equal to the properties of the bond slaves.

Command Line Interface

The PIF.properties field is exposed through xe pif-list and xe pif-param-list as usual.
The PIF.set_property call is exposed through xe pif-param-set. For example: xe pif-param-set uuid=<pif-uuid> properties:gro=off.

Design document
Revision	v1
Status	released (5.6)

Heterogeneous pools

Notes

The cpuid instruction is used to obtain a CPU’s manufacturer, family, model, stepping and features information.
The feature bitvector is 128 bits wide: 2 times 32 bits of base features plus 2 times 32 bits of extended features, which are referred to as base_ecx, base_edx, ext_ecx and ext_edx (after the registers used by cpuid to store the results).
The feature bits can be masked by Intel FlexMigration and AMD Extended Migration. This means that features can be made to appear as absent. Hence, a CPU can appear as a less-capable CPU.
- AMD Extended Migration is able to mask both base and extended features.
- Intel FlexMigration on Core 2 CPUs (Penryn) is able to mask only the base features (base_ecx and base_edx). The newer Nehalem and Westmere CPUs support extended-feature masking as well.
A process in dom0 (e.g. xapi) is able to call cpuid to obtain the (possibly modified) CPU info, or can obtain this information from Xen. Masking is done only by Xen at boot time, before any domains are loaded.
To apply a feature mask, a dom0 process may specify the mask in the Xen command line in the file /boot/extlinux.conf. After a reboot, the mask will be enforced.
It is not possible to obtain the original features from a dom0 process, if the features have been masked. Before applying the first mask, the process could remember/store the original feature vector, or obtain the information from Xen.
All CPU cores on a host can be assumed to be identical. Masking will be done simultaneously on all cores in a host.
Whether a CPU supports FlexMigration/Extended Migration can (only) be derived from the family/model/stepping information.
XS5.5 has an exception for the EST feature in base_ecx. This flag is ignored on pool join.

Overview of XenAPI Changes

Fields

Currently, the datamodel has Host_cpu objects for each CPU core in a host. As they are all identical, we are considering keeping just one CPU record in the Host object itself, and deprecating the Host_cpu class. For backwards compatibility, the Host_cpu objects will remain as they are in MNR, but may be removed in subsequent releases.

Hence, there will be a new field called Host.cpu_info, a read-only string-string map, containing the following fixed set of keys:

Key name	Description
`cpu_count`	The number of CPU cores in the host.
`family`	The family (number) of the CPU.
`features`	The current (possibly masked) feature vector, as given by `cpuid`. Format: `"<base_ecx>-<base_edx>-<ext_ecx>-<ext_edx>"`, 4 groups of 8 hexadecimal digits, separated by dashes.
`features_after_reboot`	The feature vector to be used after rebooting the host. This field can be modified by calling `Host.set_cpu_features`. Same format as `features`.
`flags`	The flags of the physical CPU (a decoded version of the features field).
`maskable`	Indicating whether the CPU supports Intel FlexMigration or AMD Extended Migration. There are three possible values: `"no"` means that masking is not possible, `"base"` means that only base features can be masked, and `"full"` means that base as well as extended features can be masked.
`model`	The model number of the CPU.
`modelname`	The model name of the CPU.
`physical_features`	The original, unmasked features. Same format as `features`.
`speed`	The speed of the CPU.
`stepping`	The stepping of the CPU.
`vendor`	The manufacturer of the CPU.

Note: When the features and features_after_reboot are different, XenCenter could display a warning saying that a reboot is needed to enforce the feature masking.

The Pool.other_config:cpuid_feature_mask key is recognised. If this key is present and if it contains a value in the same format as Host.cpu_info:features, the value is used to mask the feature vectors before comparisons during any pool join in the pool it is defined on. This can be used to white-list certain feature flags, i.e. to ignore them when adding a new host to a pool. The default is ffffff7f-ffffffff-ffffffff-ffffffff, which white-lists the EST feature for compatibility with XS 5.5 and earlier.

Messages

New messages:

Host.set_cpu_features
- Parameters: Host reference host, new CPU feature vector features.
- Roles: only Pool Operator and Pool Admin.
- Sets the feature vector to be used after a reboot (Host.cpu_info:features_after_reboot), if features is valid.
Host.reset_cpu_features
- Parameter: Host reference host.
- Roles: only Pool Operator and Pool Admin.
- Removes the feature mask, such that after a reboot all features of the CPU are enabled.

XAPI

Back-end

Xen keeps the physical (unmasked) CPU features in memory when it starts, before applying any masks. Xen exposes the physical features, as well as the current (possibly masked) features, to dom0/xapi via the function xc_get_boot_cpufeatures in libxc.
A dom0 script, /etc/xensource/libexec/xen-cmdline, provides a future-proof way of modifying the Xen command-line key/value pairs. This script has the following options, where mask is one of cpuid_mask_ecx, cpuid_mask_edx, cpuid_mask_ext_ecx or cpuid_mask_ext_edx, and value is 0xhhhhhhhh (h represents a hex digit):
- --list-cpuid-masks
- --set-cpuid-masks mask=value mask=value
- --delete-cpuid-masks mask mask
A restrict_cpu_masking key has been added to the host licensing restrictions map. This will be true when the Host.edition is free, and false if it is enterprise or platinum.

Start-up

The Host.cpu_info field is refreshed:

The values for the keys cpu_count, vendor, speed, modelname, flags, stepping, model, and family are obtained from /etc/xensource/boot_time_cpus (and ultimately from /proc/cpuinfo).
The values of the features and physical_features are obtained from Xen and the features_after_reboot key is made equal to the features field.
The value of the maskable key is determined by the CPU details.
- for Intel Core2 (Penryn) CPUs: family = 6 and (model = 1dh or (model = 17h and stepping >= 4)) (maskable = "base")
- for Intel Nehalem/Westmere CPUs: family = 6 and ((model = 1ah and stepping > 2) or model = 1eh or model = 25h or model = 2ch or model = 2eh or model = 2fh) (maskable = "full")
- for AMD CPUs: family >= 10h (maskable = "full")
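
The rules above can be sketched as a small function. This is only an illustration of the decision table, not xapi's implementation: the function name is hypothetical, and the vendor strings "GenuineIntel" and "AuthenticAMD" are an assumption about how the vendor key is populated.

```python
def maskable(vendor: str, family: int, model: int, stepping: int) -> str:
    """Return "no", "base" or "full" following the rules listed above.

    The vendor strings are assumed to be the usual CPUID vendor IDs.
    """
    if vendor == "GenuineIntel" and family == 6:
        # Core2 (Penryn): only the base features can be masked.
        if model == 0x1D or (model == 0x17 and stepping >= 4):
            return "base"
        # Nehalem/Westmere: base and extended features can be masked.
        if (model == 0x1A and stepping > 2) or model in (0x1E, 0x25, 0x2C, 0x2E, 0x2F):
            return "full"
    if vendor == "AuthenticAMD" and family >= 0x10:
        return "full"
    # Anything else is treated as not supporting masking.
    return "no"
```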

Setting (Masking) and Resetting the CPU Features

The Host.set_cpu_features call:
- checks whether the license of the host is Enterprise or Platinum; throws FEATURE_RESTRICTED if not.
- expects a string of 32 hexadecimal digits, optionally containing spaces; throws INVALID_FEATURE_STRING if malformed.
- checks whether the given feature vector can be formed by masking the physical feature vector; throws INVALID_FEATURE_STRING if not. Note that on Intel Core 2 CPUs, it is only possible to mask the base features!
- checks whether the CPU supports FlexMigration/Extended Migration; throws CPU_FEATURE_MASKING_NOT_SUPPORTED if not.
- sets the value of features_after_reboot to the given feature vector.
- adds the new feature mask to the Xen command-line via the xen-cmdline script. The mask is represented by one or more of the following key/value pairs (where h represents a hex digit):
  - cpuid_mask_ecx=0xhhhhhhhh
  - cpuid_mask_edx=0xhhhhhhhh
  - cpuid_mask_ext_ecx=0xhhhhhhhh
  - cpuid_mask_ext_edx=0xhhhhhhhh
The Host.reset_cpu_features call:
- copies physical_features to features_after_reboot.
- removes the feature mask from the Xen command-line via the xen-cmdline script (if any).
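
The validation performed by Host.set_cpu_features might look roughly like the following sketch. The exception classes and function names are hypothetical stand-ins for the API errors named above, the licence check is omitted, and the sketch assumes that each mask word passed to Xen is simply the corresponding word of the requested feature vector.

```python
import re

MASK_KEYS = ("cpuid_mask_ecx", "cpuid_mask_edx",
             "cpuid_mask_ext_ecx", "cpuid_mask_ext_edx")

class InvalidFeatureString(Exception):
    """Stands in for the INVALID_FEATURE_STRING API error."""

class MaskingNotSupported(Exception):
    """Stands in for CPU_FEATURE_MASKING_NOT_SUPPORTED."""

def parse_features(text):
    """Parse 32 hex digits (spaces or dashes allowed) into four 32-bit words."""
    digits = re.sub(r"[\s-]", "", text)
    if not re.fullmatch(r"[0-9a-fA-F]{32}", digits):
        raise InvalidFeatureString(text)
    return [int(digits[i:i + 8], 16) for i in range(0, 32, 8)]

def mask_arguments(requested, physical, maskable):
    """Validate a requested vector and return key=value pairs for xen-cmdline."""
    req, phys = parse_features(requested), parse_features(physical)
    if maskable == "no":
        raise MaskingNotSupported()
    for index, (r, p) in enumerate(zip(req, phys)):
        # Masking can only clear bits, never set a bit the CPU lacks.
        if r & ~p:
            raise InvalidFeatureString(requested)
        # With maskable = "base", the two extended words must stay untouched.
        if maskable == "base" and index >= 2 and r != p:
            raise InvalidFeatureString(requested)
    return [f"{key}=0x{word:08x}" for key, word in zip(MASK_KEYS, req)]
```

The returned strings have the same shape as the arguments accepted by --set-cpuid-masks.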

Pool Join and Eject

Pool.join fails when the vendor and features keys do not match, and disregards any other key in Host.cpu_info.
- However, as XS5.5 disregards the EST flag, there is a new way to disregard/ignore feature flags on pool join, by setting a mask in Pool.other_config:cpuid_feature_mask. The value of this field should have the same format as Host.cpu_info:features. When comparing the CPUID features of the pool and the joining host for equality, this mask is applied before the comparison. The default is ffffff7f-ffffffff-ffffffff-ffffffff, which defines the EST feature, bit 7 of the base ecx flags, as “don’t care”.
Pool.eject clears the database (as usual), and additionally removes the feature mask from /boot/extlinux.conf (if any).
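
As an illustration of the masked comparison on pool join, the following sketch compares two feature vectors under a "don't care" mask. The function names are hypothetical; only the vector format and the default mask come from the text above.

```python
DEFAULT_MASK = "ffffff7f-ffffffff-ffffffff-ffffffff"

def words(features):
    """Split "<base_ecx>-<base_edx>-<ext_ecx>-<ext_edx>" into four integers."""
    return [int(group, 16) for group in features.split("-")]

def features_compatible(pool, joiner, mask=DEFAULT_MASK):
    """Compare two feature vectors, ignoring every bit that is 0 in the mask."""
    return all((p & m) == (j & m)
               for p, j, m in zip(words(pool), words(joiner), words(mask)))
```

With the default mask, two hosts that differ only in bit 7 of the base ecx word (EST) compare as equal; with a mask of all f digits they do not.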

CLI

New commands:

host-cpu-info
- Parameters: uuid (optional, uses localhost if absent).
- Lists Host.cpu_info associated with the host.
host-get-cpu-features
- Parameters: uuid (optional, uses localhost if absent).
- Returns the value of Host.cpu_info:features associated with the host.
host-set-cpu-features
- Parameters: features (string of 32 hexadecimal digits, optionally containing spaces or dashes), uuid (optional, uses localhost if absent).
- Calls Host.set_cpu_features.
host-reset-cpu-features
- Parameters: uuid (optional, uses localhost if absent).
- Calls Host.reset_cpu_features.

The following commands will be deprecated: host-cpu-list, host-cpu-param-get, host-cpu-param-list.

WARNING:

If the user is able to set any mask they like, they may end up disabling CPU features that are required by dom0 (and probably other guest OSes), resulting in a kernel panic when the machine restarts. Hence, using the set function is potentially dangerous.

It is apparently not easy to find out exactly which flags are safe to mask and which aren’t, so we cannot prevent an API/CLI user from making mistakes in this way. However, using XenCenter would always be safe, as XC always copies feature masks from real hosts.

If a machine ends up in such a bad state, there is a way to get out of it. At the boot prompt (before Xen starts), you can type “menu.c32”, select a boot option and alter the Xen command-line to remove the feature masks, after which the machine will again boot normally (note: in our set-up, there is first a PXE boot prompt; the second prompt is the one we mean here).

The API/CLI documentation should stress the potential danger of using this functionality, and explain how to get out of trouble again.

Design document
Revision	v3
Status	released (6.5 sp1)
Review	#33

Integrated GPU passthrough support

Introduction

Passthrough of discrete GPUs has been available since XenServer 6.0. With some extensions, we will also be able to support passthrough of integrated GPUs.

Whether an integrated GPU will be accessible to dom0 or available to passthrough to guests must be configurable via XenAPI.
Passthrough of an integrated GPU requires an extra flag to be sent to qemu.

Host Configuration

New fields will be added (both read-only):

PGPU.dom0_access enum(enabled|disable_on_reboot|disabled|enable_on_reboot)
host.display enum(enabled|disable_on_reboot|disabled|enable_on_reboot)

as well as new API calls used to modify the state of these fields:

PGPU.enable_dom0_access
PGPU.disable_dom0_access
host.enable_display
host.disable_display

Each of these API calls will return the new state of the field e.g. calling host.disable_display on a host with display = enabled will return disable_on_reboot.

Disabling dom0 access will modify the xen commandline (using the xen-cmdline tool) such that dom0 will not be able to access the GPU on next boot.

Calling host.disable_display will modify the xen and dom0 commandlines such that neither will attempt to send console output to the system display device.

A state diagram for the fields PGPU.dom0_access and host.display is shown below:

host.integrated_GPU_passthrough flow diagram

While it is possible for these two fields to be modified independently, a client must disable both the host display and dom0 access to the system display device before that device can be passed through to a guest.

Note that when a client enables or disables either of these fields, the change can be cancelled until the host is rebooted.
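
Since the state diagram is not reproduced here, the following sketch shows one plausible reading of the transitions. Only the enabled to disable_on_reboot transition and the ability to cancel before reboot are stated in the text; the remaining transitions, and all function names, are assumptions made for illustration.

```python
from enum import Enum

class State(Enum):
    ENABLED = "enabled"
    DISABLE_ON_REBOOT = "disable_on_reboot"
    DISABLED = "disabled"
    ENABLE_ON_REBOOT = "enable_on_reboot"

# Assumed transition tables; states not listed are left unchanged.
_DISABLE = {State.ENABLED: State.DISABLE_ON_REBOOT,
            State.ENABLE_ON_REBOOT: State.DISABLED}   # second entry cancels a pending enable
_ENABLE = {State.DISABLED: State.ENABLE_ON_REBOOT,
           State.DISABLE_ON_REBOOT: State.ENABLED}    # second entry cancels a pending disable
_REBOOT = {State.DISABLE_ON_REBOOT: State.DISABLED,
           State.ENABLE_ON_REBOOT: State.ENABLED}

def disable(state):
    """Model of PGPU.disable_dom0_access / host.disable_display: returns the new state."""
    return _DISABLE.get(state, state)

def enable(state):
    """Model of PGPU.enable_dom0_access / host.enable_display: returns the new state."""
    return _ENABLE.get(state, state)

def reboot(state):
    """Pending changes take effect when the host reboots."""
    return _REBOOT.get(state, state)

def can_pass_through(dom0_access, display):
    """Both fields must be disabled before the system display device can be passed through."""
    return dom0_access is State.DISABLED and display is State.DISABLED
```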

Handling vga_arbiter

Currently, xapi will not create a PGPU object for the PCI device with address reported by /dev/vga_arbiter. This is to prevent a GPU in use by dom0 from being passed through to a guest. This behaviour will be changed: instead of not creating a PGPU object at all, xapi will create a PGPU, but its supported_VGPU_types field will be empty.

However, the PGPU’s supported_VGPU_types will be populated as normal if:

dom0 access to the GPU is disabled.
The host’s display is disabled.
The vendor ID of the device is contained in a whitelist provided by xapi’s config file.

A read-only field will be added:

PGPU.is_system_display_device bool

This will be true for a PGPU iff /dev/vga_arbiter reports the PGPU as the system display device for the host on which the PGPU is installed.

Interfacing with xenopsd

When starting a VM attached to an integrated GPU, the VM config sent to xenopsd will contain a video_card of type IGD_passthrough. This will override the type determined from VM.platform:vga. xapi will consider a GPU to be integrated if both:

It resides on bus 0.
The vendor ID of the device is contained in a whitelist provided by xapi’s config file.

When xenopsd starts qemu for a VM with a video_card of type IGD_passthrough, it will pass the flags “-std-vga” AND “-gfx_passthru”.
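
The test for an integrated GPU described above could be sketched as follows. The function name is hypothetical, and the sketch assumes PCI addresses in the usual domain:bus:device.function form and vendor IDs given as lowercase hexadecimal strings.

```python
def is_integrated_gpu(pci_address, vendor_id, whitelist):
    """Return True if the device is on bus 0 and its vendor ID is whitelisted.

    pci_address is assumed to look like "0000:00:02.0"
    (domain:bus:device.function, all hexadecimal).
    """
    bus = int(pci_address.split(":")[1], 16)
    return bus == 0 and vendor_id.lower() in whitelist
```

For example, with a whitelist containing only "8086", a device at 0000:00:02.0 with vendor ID 8086 would be treated as integrated, while the same device on bus 1 would not.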

Design document
Revision	v1
Status	proposed

Local database

All hosts in a pool use the shared database by sending queries to the pool master. This creates a performance bottleneck as the pool size increases. All hosts in a pool receive a database backup from the master periodically, every couple of hours. This creates a reliability problem as updates may be lost if the master fails during the window before the backup.

The reliability problem can be avoided by running with HA or the redo log enabled, but this is not always possible.

We propose to:

adapt the existing event machinery to allow every host to maintain an up-to-date database replica;
actively cache the database locally on each host and satisfy read operations from the cache. Most database operations are reads so this should reduce the number of RPCs across the network.

In a later phase we can move to a completely distributed database.

Replicating the database

We will create a database-level variant of the existing XenAPI event.from API. The new RPC will block until a database event is generated, and then the events will be returned using the existing “redo-log” event types. We will add a few second delay into the RPC to batch the updates.

We will replace the pool database download logic with an event.from-like loop which fetches all the events from the master’s database and applies them to the local copy. The first call will naturally return the full database contents.
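
A toy model of this replication loop is sketched below. Everything in it is hypothetical (class names, the token being the generation count, the in-memory tables); it does not block or batch like the proposed RPC and does not model deletions. It only shows how a first call returns the full contents and later calls return the updates since the last token.

```python
class MasterDb:
    """In-memory stand-in for the master's database."""

    def __init__(self):
        self.generation = 0
        self.rows = {}
        self.log = []  # list of (generation, key, value)

    def write(self, key, value):
        self.generation += 1
        self.rows[key] = value
        self.log.append((self.generation, key, value))

    def events_from(self, token):
        """Return (events newer than token, new token)."""
        if token == 0:
            # First call: return the full database contents.
            events = [(self.generation, k, v) for k, v in self.rows.items()]
        else:
            events = [event for event in self.log if event[0] > token]
        return events, self.generation


class Replica:
    """Local copy kept up to date by repeatedly calling events_from."""

    def __init__(self):
        self.token = 0
        self.rows = {}

    def sync(self, master):
        events, self.token = master.events_from(self.token)
        for _generation, key, value in events:
            self.rows[key] = value
```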

We will turn on the existing “in memory db cache” mechanism on all hosts, not just the master. This will be where the database updates will go.

The result should be that every host will have a /var/xapi/state.db file, with writes going to the master first and then filtering down to all slaves.

Using the replica as a cache

We will re-use the Disaster Recovery multiple database mechanism to allow slaves to access their local database. We will change the default database “context” to snapshot the local database, perform reads locally and write-through to the master.

We will add an HTTP header to all forwarded XenAPI calls from the master which will include the current database generation count. When a forwarded XenAPI operation is received, the slave will deliberately wait until the local cache is at least as new as this, so that we always use fresh metadata for XenAPI calls (e.g. the VM.start uses the absolute latest VM memory size).
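
The "wait until the local cache is at least as new" step can be illustrated with a condition variable. This is a minimal sketch with hypothetical names; how the generation count is carried in the HTTP header is left out.

```python
import threading

class LocalCache:
    """Tracks the generation of the local replica and lets callers wait for it."""

    def __init__(self):
        self._generation = 0
        self._cond = threading.Condition()

    def apply_update(self, generation):
        """Called by the replication loop after applying updates up to `generation`."""
        with self._cond:
            self._generation = max(self._generation, generation)
            self._cond.notify_all()

    def wait_until(self, generation, timeout=None):
        """Block until the cache has reached `generation`; False on timeout."""
        with self._cond:
            return self._cond.wait_for(
                lambda: self._generation >= generation, timeout)
```

A handler for a forwarded call would read the generation count sent by the master, call wait_until with it, and only then read from the local database.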

We will document the new database coherence policy, i.e. that writes on a host will not immediately be seen by reads on another host. We believe that this is only a problem when we are using the database for locking and are attempting to hand over a lock to another host. We are already using XenAPI calls forwarded to the master for some of this, but may need to do a bit more of this; in particular the storage backends may need some updating.

Design document
Revision	v3
Status	proposed
Revision history
v1	Initial version
v2	Addition of `networkd_db` update for Upgrade
v3	More info on `networkd_db` and API Errors

Management Interface on VLAN

This document describes design details for the REQ-42: Support Use of VLAN on XAPI Management Interface.

XAPI and XCP-Networkd

Creating a VLAN

Creating a VLAN is already supported. The steps are listed here because they are referred to later in the document:

Check the PIFs created on a host for the physical devices eth0, eth1; this lists the pif-UUID: xe pif-list params=uuid physical=true host-uuid=UUID
Create a new network for the VLAN interface; this returns a new network-UUID: xe network-create name-label=VLAN1
Create a VLAN PIF; this returns a new VLAN PIF, new-pif-UUID: xe vlan-create pif-uuid=pif-UUID network-uuid=network-UUID vlan=VLAN-ID
Plug the VLAN PIF: xe pif-plug uuid=new-pif-UUID
Configure IP on the VLAN PIF; mode is mandatory and the other parameters are needed only when mode=static: xe pif-reconfigure-ip uuid=new-pif-UUID mode= IP= netmask= gateway= DNS=

Similarly, a VLAN PIF can be created with the corresponding XenAPI calls.

Recognise VLAN config from management.conf

For a newly installed host, if the host installer was asked to put the management interface on a given VLAN, we will expect a new entry VLAN=ID in /etc/firstboot.d/data/management.conf.

The current contents of management.conf, which are referred to later in the document, are:
LABEL=eth0 -> The physical device on which the management interface must reside.
MODE=dhcp||static -> The IP configuration mode for the management interface. In static mode there can be other parameters such as IP, NETMASK, GATEWAY and DNS.
VLAN=ID -> New entry specifying the VLAN tag to be configured on device LABEL. The management interface will be configured on this VLAN ID with the specified mode.
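
Reading these entries, including the new VLAN key, could look like the sketch below. The function name is hypothetical, and the sketch assumes a simple KEY=VALUE file in which values may optionally be quoted.

```python
def parse_management_conf(text):
    """Parse KEY=VALUE lines; VLAN is returned as an int, or None if absent."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        # Values are assumed to be optionally wrapped in single or double quotes.
        conf[key.strip()] = value.strip().strip("'\"")
    vlan = conf.get("VLAN")
    conf["VLAN"] = int(vlan) if vlan else None
    return conf
```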

Firstboot script needs to recognise VLAN config

The firstboot script /etc/firstboot.d/30-prepare-networking needs to be updated to configure the management interface on the provided VLAN ID.

Steps to be followed:

PIF.scan performed in the script must have created the PIFs for the underlying physical devices.
Get the PIF UUID for physical device LABEL.
Repeat the steps mentioned in Creating a VLAN, i.e. network-create, vlan-create and pif-plug. Now we have a new PIF for the VLAN.
Perform pif-reconfigure-ip for the new VLAN PIF.
Perform host-management-reconfigure using new VLAN PIF.

XCP-Networkd needs to recognise VLAN config during startup

During first boot and during the boot after a pool eject, XCP-Networkd gets the initial network setup from the management.conf and xensource-inventory files to update the network.db with the management interface info. XCP-Networkd must honour the new VLAN config.

Steps to be followed:

During startup, the read_config step tries to read the /var/lib/xcp/networkd.db file, which has not yet been created just after host installation.
Since reading networkd.db throws Read_Error, it tries to read network.dbcache, which is also not available, so it falls back to read_management_conf.
There are two possible values of MODE, static or dhcp, taken from management.conf.
bridge_name is taken as MANAGEMENT_INTERFACE from xensource-inventory; bridge_config and interface_config are then built based on MODE.
Bridge.make_config() and Interface.make_config() are called with the respective bridge_config and interface_config.

Updating networkd_db program

networkd_db provides the management interface info to the host installer during upgrade. It reads the /var/lib/xcp/networkd.db file to output the management interface information. We need to update networkd_db to output the VLAN information when a VLAN bridge is given as input.

Steps to be followed:

Currently, the VLAN interface IP information is provided correctly when a VLAN bridge is passed as input: networkd_db -iface xapi0 lists the mode as dhcp or static, and if mode=static it also provides ipaddr and netmask.
We need to update this program to provide the VLAN ID and parent bridge info when a VLAN bridge is passed as input: networkd_db -bridge xapi0 should output the VLAN info like: interfaces= vlan=vlanID parent=xenbr0. Using the parent bridge, the user can identify the physical interfaces. We will extract the VLAN and parent bridge from bridge_config in networkd.db.

Additional VLAN parameter for Emergency Network Reset

See http://xapi-project.github.io/xapi/design/emergency-network-reset.html for the detailed design of Emergency Network Reset. To use the xe-reset-networking utility to configure the management interface on a VLAN, we need to add one more parameter, --vlan=vlanID, to the utility. The parameters that can be passed to this utility are: --master, --device, --mode, --ip, --netmask, --gateway, --dns and the new --vlan.

VLAN parameter addition to xe-reset-networking

Steps to be followed:

If a VLANID is passed, let bridge=xapi0.
Write bridge=xapi0 into the xensource-inventory file. This should work, as xapi checks the available bridges while creating networks.
Write VLAN=vlanID into management.conf and /tmp/network-reset.
Modify check_network_reset in xapi.ml to perform the steps in Creating a VLAN and then perform management_reconfigure on the VLAN PIF. The Creating a VLAN steps must create the VLAN record in the xapi DB, as in the firstboot script.
If no VLANID is specified, retain the current one. The utility must take the management interface info from the networkd_db program and handle the VLAN config.
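
The handling of the new parameter might be sketched as follows. This is not the actual xe-reset-networking code: the function names are hypothetical, only the option names come from the text above, and the sketch merely computes what would be written instead of touching any files.

```python
import argparse

def build_parser():
    """Command-line parser with the existing options plus the new --vlan."""
    parser = argparse.ArgumentParser(prog="xe-reset-networking")
    for name in ("master", "device", "mode", "ip", "netmask", "gateway", "dns"):
        parser.add_argument(f"--{name}")
    parser.add_argument("--vlan", type=int,
                        help="VLAN ID for the management interface")
    return parser

def plan_reset(argv):
    """Return (bridge, management_conf) for the given arguments.

    bridge is None when no VLAN is given, meaning this sketch does not
    decide the bridge in that case.
    """
    args = build_parser().parse_args(argv)
    conf = {"LABEL": args.device, "MODE": args.mode}
    bridge = None
    if args.vlan is not None:
        bridge = "xapi0"               # to be written to xensource-inventory
        conf["VLAN"] = str(args.vlan)  # to be written to management.conf
    return bridge, conf
```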

VLAN parameter addition to xsconsole Emergency Network Reset

The Emergency Network Reset option is found under the Network and Management Interface menu. Selecting this option will show some explanation in the pane on the right-hand side. Pressing will bring up a dialogue to select the interfaces to use as management interface after the reset. After choosing a device, the dialogue continues with configuration options like in the Configure Management Interface dialogue. There will be an additional option for VLAN in the dialogue. After completing the dialogue, the same steps as listed for xe-reset-networking are executed.

Updating Pool Join/Eject operations

Pool Join while the Pool has its Management Interface on a VLAN

Currently, pool-join fails if VLANs are present on the host joining a pool. We need to allow pool-join only if the pool and the joining host both have their management interface on the same VLAN.

Steps to be followed:

Under pre_join_checks, update the function assert_only_physical_pifs to check that the pool master's management_interface is on the same VLAN.
Call Host.get_management_interface on the pool master to get the vlanID, and match it with the localhost management_interface VLAN ID. If it matches, allow the pool-join.
If there are multiple VLANs on the host joining a pool, fail the pool-join gracefully.
After the pool-join, the host's xapi db is synced from the pool master's xapi db, so having the management interface on a VLAN will be fine.
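
One possible reading of this check is sketched below. The exception class and function name are hypothetical, PIFs are modelled as plain dictionaries, and a VLAN value of -1 is assumed to denote a non-VLAN PIF.

```python
class PoolJoinError(Exception):
    """Stands in for whatever API error the real pre-join check would raise."""

def check_management_vlan(master_mgmt_vlan, host_pifs):
    """Raise PoolJoinError unless the joining host may join.

    host_pifs is a list of dicts with keys "VLAN" (int, -1 for a
    non-VLAN PIF) and "management" (bool).
    """
    vlan_pifs = [pif for pif in host_pifs if pif["VLAN"] >= 0]
    if len(vlan_pifs) > 1:
        raise PoolJoinError("joining host has multiple VLANs")
    management = [pif for pif in host_pifs if pif["management"]]
    if len(management) != 1:
        raise PoolJoinError("joining host has no unique management interface")
    if management[0]["VLAN"] != master_mgmt_vlan:
        raise PoolJoinError("management interfaces are on different VLANs")
    if vlan_pifs and not vlan_pifs[0]["management"]:
        raise PoolJoinError("joining host has a non-management VLAN")
```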

Pool Eject while the ejected host has its Management Interface on a VLAN

Currently, the management interface VLAN config of a host is not retained in the xensource-inventory or management.conf file. We need to retain the vlanID in the config files.

Steps to be followed:

In the Pool.eject call, we need to update the write_first_boot_management_interface_configuration_file function.
If management_interface is on a VLAN, get the VLANID from the PIF.
Write the VLANID into the management.conf file and the bridge into the xensource-inventory file, so that they are retained by XCP-Networkd on startup after the host is ejected.

New API for Pool Management Reconfigure

Currently there is no pool-level API to reconfigure the management_interface for all of the hosts in a pool at once. A new API, Pool.management_reconfigure, will be needed in order to reconfigure the management_interface on all hosts in a pool to the same network, either VLAN or physical.

Current behaviour to change the Management Interface on Host

Currently, calling Host.management_reconfigure with a VLAN pif-uuid changes the management_interface to the specified VLAN. The steps are listed here to explain the workflow of a management_interface reconfigure. We will use the Host.management_reconfigure call inside the new API.

Steps performed during management_reconfigure:

bring_pif_up gets called for the PIF.
xensource-inventory gets updated with the latest interface info.
update-mh-info updates the management_mac in xenstore.
The HTTP server gets restarted. Even though xapi listens on all IP addresses, the new interface, as _the_ management interface, is used by slaves to connect to the pool master.
on_dom0_networking_change refreshes console URIs for the new IP address.
Xapi db is updated with new management interface info.

Management Reconfigure on Pool from Physical Network to VLAN Network or from VLAN Network to Other VLAN Network or from VLAN Network to Physical Network

The following steps must be performed manually on each host or pool as a prerequisite to using the new API. We need to make sure that the new network that is going to be the management interface has PIFs configured on each host. In the case of a physical network, we assume PIFs are configured on each host; in the case of a VLAN network, we need to create VLAN PIFs on each host. We assume that the VLAN is available on the switch/network.

Manual steps to be performed before calling new API:

Create a VLAN network on the pool via network.create. In the case of physical NICs, the network must already be present.
Create a VLAN PIF on each host via VLAN.create, using the above network ref, the physical PIF ref and the vlanID. This is not needed in the case of a physical network. Alternatively, calling pool.create_VLAN with the device and the above network creates VLAN PIFs for all hosts in a pool.
Perform PIF.reconfigure_ip for each new network PIF on each host.

If the user wishes to change the management interface manually on each host in a pool, we should allow it. The guideline for that is:

The user can change the management interface on each host individually by calling Host.management_reconfigure with PIFs on physical devices or VLAN PIFs. This must be performed on the slaves first and on the master last, because changing the management_interface on the master disconnects the slaves from the master, after which no further Host.management_reconfigure calls can be performed until the master recovers the slaves via pool.recover_slaves.

API Details

Pool.management_reconfigure
- Parameter: network reference network.
- Calling this function configures management_interface on each host of a pool.
- For the network provided, it checks that PIFs are present on each host. In the case of a VLAN network, it checks that VLAN PIFs on the provided network are present on each host of the pool.
- Checks that an IP address is configured on the above PIFs on each host.
- If PIFs are not present or an IP address is not configured on the PIFs, this call must fail gracefully, asking the user to configure them.
- Calls Host.management_reconfigure on each slave, and lastly on the master.
- Calls pool.recover_slaves on the master in order to recover slaves which might have lost their connection to the master.

API errors

Possible API errors that may be raised by pool.management_reconfigure:

INTERFACE_HAS_NO_IP : the specified PIF (pif parameter) has no IP configuration. The new API checks that all PIFs on the new network have an IP address configured. The user may have forgotten to configure an IP address on the PIF on one or more of the hosts in a pool.

New API ERROR:

REQUIRED_PIF_NOT_PRESENT : the specified network (network parameter) has no PIF present on a host in the pool. The user may have forgotten to create a VLAN PIF on one or more of the hosts in a pool.
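
The checks and ordering of Pool.management_reconfigure can be sketched as a pure planning function. The names are hypothetical and hosts and PIFs are modelled as plain Python values; the real call would go on to invoke Host.management_reconfigure for each host in the returned order and then pool.recover_slaves.

```python
class ApiError(Exception):
    """Carries an API error code and the host it relates to."""

    def __init__(self, code, host):
        super().__init__(code, host)
        self.code = code
        self.host = host

def plan_management_reconfigure(master, slaves, pifs_on_network):
    """Validate the pool and return hosts in the order they are reconfigured.

    pifs_on_network maps a host to the PIF it has on the target network,
    modelled as a dict with an "IP" key; hosts without a PIF are absent.
    """
    for host in [*slaves, master]:
        pif = pifs_on_network.get(host)
        if pif is None:
            raise ApiError("REQUIRED_PIF_NOT_PRESENT", host)
        if not pif.get("IP"):
            raise ApiError("INTERFACE_HAS_NO_IP", host)
    # Slaves first, master last: reconfiguring the master disconnects the slaves.
    return [*slaves, master]
```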

CP-Tickets

CP-14027
CP-14028
CP-14029
CP-14030
CP-14031
CP-14032
CP-14033

Design document
Revision	v2
Status	confirmed
Revision history
v1	Initial revision
v2	Short-term simplifications and scope reduction

Multiple Cluster Managers

Introduction

Xapi currently uses a cluster manager called xhad. Sometimes other software comes with its own built-in way of managing clusters, which would clash with xhad (example: xhad could choose to fence node ‘a’ while the other system could fence node ‘b’ resulting in a total failure). To integrate xapi with this other software we have 2 choices:

modify the other software to take membership information from xapi; or
modify xapi to take membership information from this other software.

This document proposes a way to do the latter.

XenAPI changes

New field

We will add the following new field:

pool.ha_cluster_stack of type string (read-only)
- If HA is enabled, this field reflects which cluster stack is in use.
- Set to "xhad" on upgrade, which implies that so far we have used XenServer’s own cluster stack, called xhad.

Cluster-stack choice

We assume for now that a particular cluster manager will be mandated (only) by certain types of clustered storage, recognisable by SR type (e.g. OCFS2 or Melio). The SR backend will be able to inform xapi if the SR needs a particular cluster stack, and if so, what is the name of the stack.

When pool.enable_ha is called, xapi will determine which cluster stack to use based on the presence or absence of such SRs:

If an SR that needs its own cluster stack is attached to the pool, then xapi will use that cluster stack.
If no SR that needs a particular cluster stack is attached to the pool, then xapi will use xhad.

If multiple SRs that need a particular cluster stack exist, then the storage parts of xapi must ensure that no two such SRs are ever attached to a pool at the same time.
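
The selection rule can be sketched in a few lines. The function name and the representation of SR requirements (a stack name, or None when the SR has no requirement) are assumptions; raising an error for conflicting requirements is also an assumption, since the text only says that the storage layer must prevent that situation.

```python
def choose_cluster_stack(attached_sr_stacks):
    """Pick the cluster stack for pool.enable_ha.

    attached_sr_stacks holds, for each attached SR, the name of the
    cluster stack it requires, or None if it has no requirement.
    """
    required = {stack for stack in attached_sr_stacks if stack}
    if len(required) > 1:
        # Should never happen if the storage layer enforces its invariant.
        raise ValueError("SRs requiring different cluster stacks are attached")
    return required.pop() if required else "xhad"
```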

New errors

We will add the following API error that may be raised by pool.enable_ha:

INCOMPATIBLE_STATEFILE_SR: the specified SRs (heartbeat_srs parameter) are not of the right type to hold the HA statefile for the cluster_stack that will be used. For example, there is a Melio SR attached to the pool, and therefore the required cluster stack is the Melio one, but the given heartbeat SR is not a Melio SR. The single parameter will be the name of the required SR type.

The following new API error may be raised by PBD.plug:

INCOMPATIBLE_CLUSTER_STACK_ACTIVE: the operation cannot be performed because an incompatible cluster stack is active. The single parameter will be the name of the required cluster stack. This could happen (for example) if you tried to create an OCFS2 SR with XenServer HA already enabled.

Future extensions

In future, we may add a parameter to explicitly choose the cluster stack:

New parameter to pool.enable_ha called cluster_stack of type string which will have the default value of empty string (meaning: let the implementation choose).
With the additional parameter, pool.enable_ha may raise two new errors:
- UNKNOWN_CLUSTER_STACK: The operation cannot be performed because the requested cluster stack does not exist. The user should check the name was entered correctly and, failing that, check to see if the software is installed. The exception will have a single parameter: the name of the cluster stack which was not found.
- CLUSTER_STACK_CONSTRAINT: HA cannot be enabled with the provided cluster stack because some third-party software is already active which requires a different cluster stack setting. The two parameters are: a reference to an object (such as an SR) which has created the restriction, and the name of the cluster stack that this object requires.

Implementation

The xapi.conf file will have a new field: cluster-stack-root which will have the default value /usr/libexec/xapi/cluster-stack. The existing xhad scripts and tools will be moved to /usr/libexec/xapi/cluster-stack/xhad/. A hypothetical cluster stack called foo would be placed in /usr/libexec/xapi/cluster-stack/foo/.

In Pool.enable_ha with cluster_stack="foo" we will verify that the subdirectory <cluster-stack-root>/foo exists. If it does not exist, then the call will fail with UNKNOWN_CLUSTER_STACK.
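
The directory check could be sketched as follows. The exception class and function name are hypothetical; rejecting names that contain path separators is an extra precaution not mentioned in the text.

```python
import os

DEFAULT_ROOT = "/usr/libexec/xapi/cluster-stack"

class UnknownClusterStack(Exception):
    """Stands in for the UNKNOWN_CLUSTER_STACK API error."""

def cluster_stack_dir(name, root=DEFAULT_ROOT):
    """Return <root>/<name> if that directory exists, else raise."""
    # Precaution: only accept a plain directory name, not a path.
    if not name or name in (".", "..") or os.path.basename(name) != name:
        raise UnknownClusterStack(name)
    path = os.path.join(root, name)
    if not os.path.isdir(path):
        raise UnknownClusterStack(name)
    return path
```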

Alternative cluster stacks will need to conform to the exact same interface as xhad.

Design document
Revision	v1
Status	proposed

Multiple device emulators

Xen’s ioreq-server feature allows for several device emulator processes to be attached to the same domain, each emulating different sets of virtual hardware. This makes it possible, for example, to emulate network devices in a separate process for improved security and isolation, or to provide special purpose emulators for particular virtual hardware devices.

ioreq-server is currently used in XenServer to support vGPU, where it is configured via the legacy toolstack interface. These changes will make multiple emulators usable in open source Xen via the new libxl interface.

libxl changes

The singleton device_model_version, device_model_stubdomain and device_model fields in the b_info structure will be replaced by a list of (version, stubdomain, model, arguments) tuples, one for each emulator.
libxl_domain_create_new() will be changed to spawn a new device model for each entry in the list.

It may also be useful to spawn the device models separately and only attach them during domain creation. This could be supported by making each device_model entry a union of pid | parameter_tuple. If such an entry specifies a parameter tuple, it is processed as above; if it specifies a pid, libxl_domain_create_new() attaches the existing device model with that pid instead.

QEMU changes

Patches to make QEMU register with Xen as an ioreq-server have been submitted upstream, but not yet applied.
QEMU’s --machine none and --nodefaults options should make it possible to create an empty machine and add just a host bus, PCI bus and device. This has not yet been fully demonstrated, so QEMU changes may be required.

Xen changes

Until now, ioreq-server has only been used to connect one extra device model, in addition to the default one. Multiple emulators should work, but there is a chance that bugs will be discovered.

Interfacing with xenopsd

This functionality will only be available through the experimental Xenlight-based xenopsd.

The VM_build clause in the atomics_of_operation function will be changed to fill in the list of emulators to be created (or attached) in the b_info struct.

Host Configuration

vGPU support is implemented mostly in xenopsd, so no Xapi changes are required to support vGPU through the generic device model mechanism. Changes would be required if we decided to expose the additional device models through the API, but in the near future it is more likely that any additional device models will be dealt with entirely by xenopsd.

Design document
Revision	v1
Status	proposed

NUMA

NUMA stands for Non-Uniform Memory Access and describes that RAM access for CPUs in a large system is not equally fast for all of them. CPUs are grouped into so-called nodes and each node has fast access to RAM that is considered local to its node and slower access to other RAM. Conceptually, a node is a container that bundles some CPUs and RAM and there is an associated cost when accessing RAM in a different node. In the context of CPU virtualisation assigning vCPUs to NUMA nodes is an optimisation strategy to reduce memory latency. This document describes a design to make NUMA-related assignments for Xen domains (hence, VMs) visible to the user. Below we refer to these assignments and optimisations collectively as NUMA for simplicity.

NUMA is more generally discussed as NUMA Feature.

NUMA Properties

Xen 4.20 implements NUMA optimisation. We want to expose the following NUMA-related properties of VMs to API clients, and in particular XenCenter. Each one is represented by a new field in XAPI’s VM_metrics data model:

RO VM_metrics.numa_optimised: boolean: if the VM is optimised for NUMA
RO VM_metrics.numa_nodes: integer: number of NUMA nodes of the host the VM is using
MRO VM_metrics.numa_node_memory: int -> int map; mapping a NUMA node (int) to an amount of memory (bytes) in that node.

Required NUMA support is only available in Xen 4.20. Some parts of the code will have to be managed by patches.

XAPI High-Level Implementation

As far as Xapi clients are concerned, we implement new fields in the VM_metrics class of the data model and surface the values in the CLI via records.ml; we could decide to make numa_optimised visible by default in xe vm-list.

Introducing new fields requires defaults; these would be:

numa_optimised: false
numa_nodes: 0
numa_node_memory: []

The data model ensures that the values are visible to API clients.

XAPI Low-Level Implementation

NUMA properties are observed by Xenopsd and Xapi learns about them as part of the Client.VM.stat call implemented by Xenopsd. Xapi makes these calls frequently and we will update the Xapi VM fields related to NUMA simply as part of processing the result of such a call in Xapi.

For this to work, we extend the return type of VM.stat in

xenops_types.ml, type Vm.state

with three fields:

numa_optimised: bool
numa_nodes: int
numa_node_memory: (int * int64) list

matching the semantics from above.
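As a sketch, the three fields and the defaults proposed for the data model could look as follows in OCaml. The record name numa_state is hypothetical; in the design the fields are added to the existing Vm.state record, whose other fields are omitted here.

```ocaml
(* Sketch of the three proposed fields only; the real Vm.state record in
   xenops_types.ml has many more fields that are omitted here. *)
type numa_state = {
  numa_optimised : bool;
  numa_nodes : int;
  numa_node_memory : (int * int64) list;  (* NUMA node -> bytes *)
}

(* Defaults matching the ones proposed for the XAPI data model. *)
let default_numa_state =
  {numa_optimised= false; numa_nodes= 0; numa_node_memory= []}
```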

Xenopsd Implementation

Xenopsd implements the VM.stat return value in

Xenops_server_xen.get_state

where the three fields would be set. Xenopsd relies on bindings to Xen to observe NUMA-related properties of a domain.

Given that NUMA related functionality is only available for Xen 4.20, we probably will have to maintain a patch in xapi.spec for compatibility with earlier Xen versions.

The (existing) C bindings and changes come in two forms: new functions and an extension of a type used by an existing function.

    external domain_get_numa_info_node_pages_size : handle -> int -> int
      = "stub_xc_domain_get_numa_info_node_pages_size"

This function reports the number of NUMA nodes used by a Xen domain (supplied as an argument).

    type domain_numainfo_node_pages = {
      tot_pages_per_node : int64 array;
    }
    external domain_get_numa_info_node_pages :
      handle -> int -> int -> domain_numainfo_node_pages
      = "stub_xc_domain_get_numa_info_node_pages"

This function receives as arguments a domain ID and the number of nodes this domain is using (acquired using domain_get_numa_info_node_pages_size).

The number of NUMA nodes of the host (not domain) is reported by Xenctrl.physinfo which returns a value of type physinfo.

    index b4579862ff..491bd3fc73 100644
    --- a/tools/ocaml/libs/xc/xenctrl.ml
    +++ b/tools/ocaml/libs/xc/xenctrl.ml
    @@ -155,6 +155,7 @@ type physinfo =
         capabilities     : physinfo_cap_flag list;
         max_nr_cpus      : int;
         arch_capabilities : arch_physinfo_cap_flags;
    +    nr_nodes         : int;
       }

We are not reporting nr_nodes directly but use it to determine the value of numa_optimised for a domain/VM:

numa_optimised =
    (VM.numa_nodes = 1)
    or (VM.numa_nodes < physinfo.Xenctrl.nr_nodes)
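The rule above translates directly into an OCaml predicate. The function and argument names are illustrative; vm_nodes stands for VM.numa_nodes and host_nodes for the nr_nodes value from physinfo.

```ocaml
(* vm_nodes: number of NUMA nodes the domain uses;
   host_nodes: nr_nodes as reported by physinfo for the host. *)
let numa_optimised ~vm_nodes ~host_nodes =
  vm_nodes = 1 || vm_nodes < host_nodes

let () =
  (* A VM confined to one of two host nodes counts as optimised ... *)
  assert (numa_optimised ~vm_nodes:1 ~host_nodes:2) ;
  (* ... and one spread over both nodes does not. *)
  assert (not (numa_optimised ~vm_nodes:2 ~host_nodes:2))
```

Note that with this formula a VM reporting 0 nodes (the proposed default for numa_nodes) on a host with at least one node also evaluates to true, so an implementation may want to treat that case separately.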

Details

The three new fields that become part of type VM.state are updated as part of get_state() using the primitives above.
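One step of that update is turning the per-node page counts into the node-to-bytes mapping. The sketch below assumes 4 KiB pages and chooses to omit nodes on which the domain holds no memory; both are assumptions made for illustration, not part of the design.

```ocaml
(* Assumption: 4 KiB pages. *)
let page_size = 4096L

(* Convert per-node page counts into a (node, bytes) association list,
   skipping nodes on which the domain holds no memory. *)
let node_memory (tot_pages_per_node : int64 array) : (int * int64) list =
  tot_pages_per_node
  |> Array.to_list
  |> List.mapi (fun node pages -> (node, Int64.mul pages page_size))
  |> List.filter (fun (_, size) -> size > 0L)

let () =
  node_memory [|256L; 0L; 1L|]
  |> List.iter (fun (node, size) -> Printf.printf "node %d: %Ld bytes\n" node size)
```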

Design document
Revision	v1
Status	proposed

OCFS2 storage

OCFS2 is a (host-)clustered filesystem which runs on top of a shared raw block device. Hosts using OCFS2 form a cluster using a combination of network and storage heartbeats and host fencing to avoid split-brain.

The following diagram shows the proposed architecture with xapi:

Proposed architecture

Please note the following:

OCFS2 is configured to use global heartbeats rather than per-mount heartbeats because we quite often have many SRs and therefore many mountpoints
The OCFS2 global heartbeat should be collocated on the same SR as the XenServer HA SR so that we depend on fewer SRs (the storage is a single point of failure for OCFS2)
The OCFS2 global heartbeat should itself be a raw VDI within an LVHDSR.
Every host can be in at-most-one OCFS2 cluster i.e. the host cluster membership is a per-host thing rather than a per-SR thing. Therefore xapi will be modified to configure the cluster and manage the cluster node numbers.
Every SR will be a filesystem mount, managed by a SM plugin called “OCFS2”.
Xapi HA uses the xhad process which runs in userspace but in the realtime scheduling class so it has priority over all other userspace tasks. xhad sends heartbeats via the ha_statefile VDI and via UDP, and uses the Xen watchdog for host fencing.
OCFS2 HA uses the o2cb kernel driver which sends heartbeats via the o2cb_statefile and via TCP, fencing the host by panicking domain 0.

Managing O2CB

OCFS2 uses the O2CB “cluster stack” which is similar to our xhad. To configure O2CB we need to

assign each host an integer node number (from zero)
on pool/cluster join: update the configuration on every node to include the new node. In OCFS2 this can be done online.
on pool/cluster leave/eject: update the configuration on every node to exclude the old node. In OCFS2 this needs to be done offline.

In the current Xapi toolstack there is a single global implicit cluster called a “Pool” which is used for resource locking, “clustered” storage repositories and fault handling (in HA). In the long term we will allow these types of clusters to be managed separately or all together, depending on the sophistication of the admin and the complexity of their environment. We will take a small step in that direction by keeping the OCFS2 O2CB cluster management code at “arm’s length” from the Xapi Pool.join code.

In xcp-idl we will define a new API category called “Cluster” (in addition to the categories for Xen domains, ballooning, stats, networking and storage). These APIs will only be called by Xapi on localhost. In particular they will not be called across-hosts and therefore do not have to be backward compatible. These are “cluster plugin APIs”.

We will define the following APIs:

Plugin:Membership.create: add a host to a cluster. On exit the local host cluster software will know about the new host but it may need to be restarted before the change takes effect
- in:hostname:string: the hostname of the management domain
- in:uuid:string: a UUID identifying the host
- in:id:int: the lowest available unique integer identifying the host where an integer will never be re-used unless it is guaranteed that all nodes have forgotten any previous state associated with it
- in:address:string list: a list of addresses through which the host can be contacted
- out: Task.id
Plugin:Membership.destroy: removes a named host from the cluster. On exit the local host software will know about the change but it may need to be restarted before it can take effect
- in:uuid:string: the UUID of the host to remove
Plugin:Cluster.query: queries the state of the cluster
- out:maintenance_required:bool: true if there is some outstanding configuration change which cannot take effect until the cluster is restarted.
- out:hosts: a list of all known hosts together with a state including: whether they are known to be alive or dead; or whether they are currently “excluded” because the cluster software needs to be restarted
Plugin:Cluster.start: turn on the cluster software and let the local host join
Plugin:Cluster.stop: turn off the cluster software
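The id argument of Plugin:Membership.create is described as the lowest available unique integer. A minimal sketch of such an allocator follows; the function name is hypothetical, and the caller is assumed to include in in_use the ids of departed hosts that other nodes may still remember.

```ocaml
module IntSet = Set.Make (Int)

(* Return the lowest non-negative integer that is not in [in_use].
   [in_use] must also contain ids of departed hosts whose state has not
   yet been forgotten by every node. *)
let lowest_free_id in_use =
  let rec go n = if IntSet.mem n in_use then go (n + 1) else n in
  go 0

let () =
  Printf.printf "next id: %d\n" (lowest_free_id (IntSet.of_list [0; 1; 3]))
```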

Xapi will be modified to:

add table Cluster which will have columns
- name: string: this is the name of the Cluster plugin (TODO: use same terminology as SM?)
- configuration: Map(String,String): this will contain any cluster-global information, overrides for default values etc.
- enabled: Bool: this is true when the cluster “should” be running. It may require maintenance to synchronise changes across the hosts.
- maintenance_required: Bool: this is true when the cluster needs to be placed into maintenance mode to resync its configuration
add method XenAPI:Cluster.enable which sets enabled=true and waits for all hosts to report Membership.enabled=true.
add method XenAPI:Cluster.disable which sets enabled=false and waits for all hosts to report Membership.enabled=false.
add table Membership which will have columns
- id: int: automatically generated lowest available unique integer starting from 0
- cluster: Ref(Cluster): the type of cluster. This will never be NULL.
- host: Ref(host): the host which is a member of the cluster. This may be NULL.
- left: Date: if not 1/1/1970 this means the time at which the host left the cluster.
- maintenance_required: Bool: this is true when the Host believes the cluster needs to be placed into maintenance mode.
add field Host.memberships: Set(Ref(Membership))
extend enum vdi_type to include o2cb_statefile as well as ha_statefile
add method Pool.enable_o2cb with arguments
- in: heartbeat_sr: Ref(SR): the SR to use for global heartbeats
- in: configuration: Map(String,String): available for future configuration tweaks
- Like Pool.enable_ha this will find or create the heartbeat VDI, create the Cluster entry and the Membership entries. All Memberships will have maintenance_required=true reflecting the fact that the desired cluster state is out-of-sync with the actual cluster state.
add method XenAPI:Membership.enable
- in: self:Host: the host to modify
- in: cluster:Cluster: the cluster.
add method XenAPI:Membership.disable
- in: self:Host: the host to modify
- in: cluster:Cluster: the cluster name.
add a cluster monitor thread which
- watches the Host.memberships field and calls Plugin:Membership.create and Plugin:Membership.destroy to keep the local cluster software up-to-date when any host in the pool changes its configuration
- calls Plugin:Cluster.query after a Plugin:Membership.create or Plugin:Membership.destroy to see whether the SR needs maintenance
- when all hosts have a last start time later than a Membership record’s left date, deletes the Membership.
modify XenAPI:Pool.join to resync with the master’s Host.memberships list.
modify XenAPI:Pool.eject to
- call Membership.disable in the cluster plugin to stop the o2cb service
- call Membership.destroy in the cluster plugin to remove every other host from the local configuration
- remove the Host metadata from the pool
- set XenAPI:Membership.left to NOW()
modify XenAPI:Host.forget to
- remove the Host metadata from the pool
- set XenAPI:Membership.left to NOW()
- set XenAPI:Cluster.maintenance_required to true
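The monitor thread's rule for deleting old Membership records can be sketched as a predicate. The record below is a simplified, hypothetical stand-in for the database row; times are seconds since the epoch, so left = 0. corresponds to 1/1/1970, meaning the host is still a member.

```ocaml
(* Hypothetical, simplified record; times are seconds since the epoch. *)
type membership = {id : int; left : float}

(* A departed membership can be deleted once every remaining host has
   (re)started after the departure, so no node still remembers it. *)
let can_delete ~host_start_times m =
  m.left > 0.
  && List.for_all (fun started -> started > m.left) host_start_times

let () =
  let m = {id= 2; left= 100.} in
  Printf.printf "membership %d deletable: %b\n" m.id
    (can_delete ~host_start_times:[200.; 300.] m)
```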

A Cluster plugin called “o2cb” will be added which

on Plugin:Membership.destroy
- comment out the relevant node id in cluster.conf
- set the ‘needs a restart’ flag
on Plugin:Membership.create
- if the provided node id is too high: return an error. This means the cluster needs to be rebooted to free node ids.
- if the node id is not too high: rewrite the cluster.conf using the “online” tool.
on Plugin:Cluster.start: find the VDI with type=o2cb_statefile; add this to the “static-vdis” list; chkconfig the service on. We will use the global heartbeat mode of o2cb.
on Plugin:Cluster.stop: stop the service; chkconfig the service off; remove the “static-vdis” entry; leave the VDI itself alone
keeps track of the current ‘live’ cluster.conf which allows it to
- report the cluster service as ‘needing a restart’ (which implies we need maintenance mode)
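The two decisions the o2cb plugin makes can be sketched as follows. All names are hypothetical, the node id limit is passed in as a parameter because this document does not state it, and cluster.conf contents are modelled as plain strings.

```ocaml
type create_result =
  | Rewrite_online        (* update cluster.conf with the online tool *)
  | Needs_cluster_reboot  (* node ids must be freed by restarting the cluster *)

(* Decide how to handle Plugin:Membership.create for a given node id. *)
let on_membership_create ~max_node_id ~node_id =
  if node_id > max_node_id then Needs_cluster_reboot else Rewrite_online

(* The service needs a restart when the configuration it is running with
   differs from the one it would load after a restart. *)
let needs_restart ~live_conf ~disk_conf = live_conf <> disk_conf
```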

Summary of differences between this and xHA:

we allow for the possibility that hosts can join and leave, without necessarily taking the whole cluster down. In the case of o2cb we should be able to have join work live and only eject requires maintenance mode
rather than write explicit RPCs to update cluster configuration state we instead use an event watch and resync pattern, which is hopefully more robust to network glitches while a reconfiguration is in progress.

Managing xhad

We need to ensure that o2cb and xhad do not conflict by trying to fence hosts at the same time. We shall:

use the default o2cb timeouts (hosts fence if no I/O in 60s): this needs to be short because disk I/O on otherwise working hosts can be blocked while another host is failing/ has failed.
make the xhad host fence timeouts much longer: 300s. It’s much more important that this is reliable than fast. We will make this change globally and not just when using OCFS2.

In the xhad config we will cap the HeartbeatInterval and StatefileInterval at 5s (the default otherwise would be 31s). This means that 60 heartbeat messages have to be lost before xhad concludes that the host has failed.
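The figure of 60 follows from dividing the proposed fence timeout by the capped heartbeat interval, as this small calculation shows (the variable names are illustrative):

```ocaml
let fence_timeout_s = 300      (* proposed xhad host fence timeout *)
let heartbeat_interval_s = 5   (* proposed cap on HeartbeatInterval *)

(* Number of consecutive heartbeats that must be lost before fencing. *)
let missed_heartbeats = fence_timeout_s / heartbeat_interval_s

let () = Printf.printf "%d heartbeats\n" missed_heartbeats
```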

SM plugin

The SM plugin OCFS2 will be a file-based plugin.

TODO: which file format by default?

The SM plugin will first check whether the o2cb cluster is active and fail operations if it is not.

I/O paths

When either HA or OCFS2 O2CB “fences” the host it will look to the admin like a host crash and reboot. We need to (in priority order):

help the admin prevent fences by monitoring their I/O paths and fixing issues before they lead to trouble
when a fence/crash does happen, help the admin
- tell the difference between an I/O error (admin to fix) and a software bug (which should be reported)
- understand how to make their system more reliable

Monitoring I/O paths

If heartbeat I/O fails for more than 60s when running o2cb then the host will fence. This can happen either

for a good reason: for example the host software may have deadlocked or someone may have pulled out a network cable.
for a bad reason: for example a network bond link failure may have been ignored and then the second link failed; or the heartbeat thread may have been starved of I/O bandwidth by other processes

Since the consequences of fencing are severe – all VMs on the host crash simultaneously – it is important to avoid the host fencing for bad reasons.

We should recommend that all users

use network bonding for their network heartbeat
use multipath for their storage heartbeat

Furthermore we need to help users monitor their I/O paths. It’s no good if they use a bonded network but fail to notice when one of the paths has failed.

The current XenServer HA implementation generates the following I/O-related alerts:

HA_HEARTBEAT_APPROACHING_TIMEOUT (priority 5 “informational”): when half the network heartbeat timeout has been reached.
HA_STATEFILE_APPROACHING_TIMEOUT (priority 5 “informational”): when half the storage heartbeat timeout has been reached.
HA_NETWORK_BONDING_ERROR (priority 3 “service degraded”): when one of the bond links has failed.
HA_STATEFILE_LOST (priority 2 “service loss imminent”): when the storage heartbeat has completely failed and only the network heartbeat is left.
MULTIPATH_PERIODIC_ALERT (priority 3 “service degraded”): when one of the multipath links has failed.

Unfortunately alerts are triggered on “edges” i.e. when state changes, and not on “levels” so it is difficult to see whether the link is currently broken.

We should define datasources suitable for use by xcp-rrdd to expose the current state (and the history) of the I/O paths as follows:

pif_<name>_paths_failed: the total number of paths which we know have failed.
pif_<name>_paths_total: the total number of paths which are configured.
sr_<name>_paths_failed: the total number of storage paths which we know have failed.
sr_<name>_paths_total: the total number of storage paths which are configured.

The pif datasources should be generated by xcp-networkd which already has a network bond monitoring thread. The sr datasources should be generated by xcp-rrdd plugins since there is no storage daemon to generate them. We should create RRDs using the MAX consolidation function, otherwise information about failures will be lost by averaging.
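A small illustration of why MAX is preferred: with made-up samples of a paths_failed datasource over one consolidation interval, averaging dilutes a brief failure while the maximum preserves it.

```ocaml
(* Made-up samples of paths_failed over one consolidation interval:
   a single brief failure among otherwise healthy readings. *)
let samples = [0.; 0.; 1.; 0.; 0.]

let average xs = List.fold_left ( +. ) 0. xs /. float_of_int (List.length xs)

let maximum xs = List.fold_left max neg_infinity xs

let () =
  Printf.printf "AVERAGE=%.1f MAX=%.1f\n" (average samples) (maximum samples)
```

With averaging, the stored value falls well below 1 and shrinks further at each coarser archive level, whereas MAX still records that a path failed.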

XenCenter (and any diagnostic tools) should warn when the system is at risk of fencing, in particular if any of the following are true:

pif_<name>_paths_failed is non-zero
sr_<name>_paths_failed is non-zero
pif_<name>_paths_total is less than 2
sr_<name>_paths_total is less than 2

XenCenter (and any diagnostic tools) should warn if any of the following have been true over the past 7 days:

pif_<name>_paths_failed is non-zero
sr_<name>_paths_failed is non-zero
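The two sets of warning conditions can be sketched as predicates over datasource values. The record and function names are hypothetical; the same check applies to both the pif and the sr datasources.

```ocaml
(* Current values of a <kind>_<name>_paths_failed / _paths_total pair. *)
type paths = {failed : int; total : int}

(* Current risk: a failed path, or fewer than two configured paths. *)
let at_risk p = p.failed > 0 || p.total < 2

(* Historical warning: paths_failed was non-zero at some point in the
   window, for example the values archived over the past 7 days. *)
let failed_recently history = List.exists (fun failed -> failed > 0) history

let () =
  Printf.printf "at risk: %b, failed recently: %b\n"
    (at_risk {failed= 0; total= 1})
    (failed_recently [0; 0; 1; 0])
```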

Heartbeat “QoS”

The network and storage paths used by heartbeats must remain responsive otherwise the host will fence (i.e. the host and all VMs will crash).

Outstanding issue: how slow can multipathd get? How does it scale with the number of LUNs?

Post-crash diagnostics

When a host crashes the effect on the user is severe: all the VMs will also crash. In cases where the host crashed for a bad reason (such as a single failure after a configuration error) we must help the user understand how they can avoid the same situation happening again.

We must make sure the crash kernel runs reliably when xhad and o2cb fence the host.

Xcp-rrdd will be modified to store RRDs in an mmap(2)ed file in the dom0 filesystem (rather than in-memory). Xcp-rrdd will call msync(2) every 5s to ensure the historical records have hit the disk. We should use the same on-disk format as RRDtool (or as close to it as makes sense) because it has already been optimised to minimise the amount of I/O.

Xapi will be modified to run a crash-dump analyser program xen-crash-analyse.

xen-crash-analyse will:

parse the Xen and dom0 stacks and diagnose whether
- the dom0 kernel was panic’ed by o2cb
- the Xen watchdog was fired by xhad
- anything else: this would indicate a bug that should be reported
in cases where the system was fenced by o2cb or xhad then the analyser
- will read the archived RRDs and look for recent evidence of a path failure or of a bad configuration (i.e. one where the total number of paths is 1)
- will parse the xhad.log and look for evidence of heartbeats “approaching timeout”

TODO: depending on what information we can determine from the analyser, we will want to record some of it in the Host_crash_dump database table.

XenCenter will be modified to explain why the host crashed and explain what the user should do to fix it, specifically:

if the host crashed for no obvious reason then consider this a software bug and recommend a bugtool/system-status-report is taken and uploaded somewhere
if the host crashed because of o2cb or xhad then either
- if there is evidence of path failures in the RRDs: recommend the user increase the number of paths or investigate whether some of the equipment (NICs or switches or HBAs or SANs) is unreliable
- if there is evidence of insufficient paths: recommend the user add more paths
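The mapping from the analyser's findings to the advice shown to the user can be sketched as a pattern match. The types and message texts are hypothetical, and the final fallback case (fenced, but no path problem found) is an assumption that the text above does not cover.

```ocaml
type cause = Fenced_by_o2cb | Fenced_by_xhad | Unknown

type evidence = {path_failures : bool; insufficient_paths : bool}

let recommendation cause ev =
  match cause with
  | Unknown ->
      "Possible software bug: take and upload a system status report"
  | Fenced_by_o2cb | Fenced_by_xhad ->
      if ev.path_failures then
        "Add paths or investigate unreliable NICs, switches, HBAs or SANs"
      else if ev.insufficient_paths then "Add more paths"
      else "No path problem found in the RRDs" (* assumption: not in the design *)

let () =
  print_endline
    (recommendation Fenced_by_xhad {path_failures= false; insufficient_paths= true})
```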

Network configuration

The documentation should strongly recommend

the management network is bonded
the management network is dedicated i.e. used only for management traffic (including heartbeats)
the OCFS2 storage is multipathed

xcp-networkd will be modified to change the behaviour of the DHCP client. Currently the dhclient will wait for a response and eventually background itself. This is a big problem since DHCP can reset the hostname, and this can break o2cb. Therefore we must insist that PIF.reconfigure_ip becomes fully synchronous, supporting timeout and cancellation. Once the call returns – whether through success or failure – there must not be anything in the background which will change the system’s hostname.

TODO: figure out whether we need to request “maintenance mode” for hostname changes.

Maintenance mode

The purpose of “maintenance mode” is to take a host out of service and leave it in a state where it’s safe to fiddle with it without affecting services in VMs.

XenCenter currently does the following:

Host.disable: prevents new VMs starting here
makes a list of all the VMs running on the host
Host.evacuate: move the running VMs somewhere else

The problems with maintenance mode are:

it’s not safe to fiddle with the host network configuration with storage still attached. For NFS this risks deadlocking the SR. For OCFS2 this risks fencing the host.
it’s not safe to fiddle with the storage or network configuration if HA is running because the host will be fenced. It’s not safe to disable fencing unless we guarantee to reboot the host on exit from maintenance mode.

We should also

PBD.unplug: all storage. This allows the network to be safely reconfigured. If the network is reconfigured when NFS storage is plugged then the SR can permanently deadlock; if the network is reconfigured when OCFS2 storage is plugged then the host can crash.

TODO: should we add a Host.prepare_for_maintenance (better name TBD) to take care of all this without XenCenter having to script it? This would also help CLI and powershell users do the right thing.

TODO: should we insist that the host is rebooted to leave maintenance mode? This would make maintenance mode more reliable and allow us to integrate maintenance mode with xHA (where maintenance mode is a “staged reboot”)

TODO: should we leave all clusters as part of maintenance mode? We probably need to do this to avoid fencing.

Walk-through: adding OCFS2 storage

Assume you have an existing Pool of 2 hosts. First the client will set up the O2CB cluster, choosing where to put the global heartbeat volume. The client should check that the I/O paths have all been set up correctly with bonding and multipath and prompt the user to fix any obvious problems.

The client enables O2CB and then creates an SR

Internally within Pool.enable_o2cb Xapi will set up the cluster metadata on every host in the pool:

Xapi creates the cluster configuration and each host updates its metadata

At this point all hosts have in-sync cluster.conf files but all cluster services are disabled. We also have maintenance_required=true on all Membership entries and the global Cluster has enabled=false. The client will now try to enable the cluster with Cluster.enable:

Xapi enables the cluster software on all hosts

Now all hosts are in the cluster and the SR can be created using the standard SM APIs.

Walk-through: remove a host

Assume you have an existing Pool of 2 hosts with o2cb clustering enabled and at least one ocfs2 filesystem mounted. If the host is online then XenAPI:Pool.eject will:

Xapi ejects a host from the pool

Note that:

All hosts will have modified their o2cb cluster.conf to comment out the former host
The Membership table still remembers the node number of the ejected host; this cannot be re-used until the SR is taken down for maintenance.
All hosts can see the difference between their current cluster.conf and the one they would use if they restarted the cluster service, so all hosts report that the cluster must be taken offline i.e. maintenance_required=true.

Summary of the impact on the admin

OCFS2 is fundamentally a different type of storage to all existing storage types supported by xapi. OCFS2 relies upon O2CB, which provides Host-level High Availability. All HA implementations (including O2CB and xhad) impose restrictions on the server admin to prevent unnecessary host “fencing” (i.e. crashing). Once we have OCFS2 as a feature, we will have to live with these restrictions which previously only applied when HA was explicitly enabled. To reduce complexity we will not try to enforce restrictions only when OCFS2 is being used or is likely to be used.

Impact even if not using OCFS2

“Maintenance mode” now includes detaching all storage.
Host network reconfiguration can only be done in maintenance mode
XenServer HA enable takes longer
XenServer HA failure detection takes longer
Network configuration with DHCP must be fully synchronous i.e. it will block until the DHCP server responds. On a timeout, the change will not be made.

Impact when using OCFS2

Sometimes a host will not be able to join the pool without taking the pool into maintenance mode
Every VM will have to be XSM’ed (is that a verb?) to the new OCFS2 storage. This means that VMs with more than 2 snapshots will have their snapshots deleted; it means you need to provision another storage target, temporarily doubling your storage needs; and it will take a long time.
There will now be 2 different reasons why a host has fenced which the admin needs to understand.

Design document
Revision	v1
Status	proposed

patches in VDIs

“Patches” are signed binary blobs which can be queried and applied. They are stored in the dom0 filesystem under /var/patch. Unfortunately the patches can be quite large – imagine a repo full of RPMs – and the dom0 filesystem is usually quite small, so it can be difficult to upload and apply some patches.

Instead of writing patches to the dom0 filesystem, we shall write them to disk images (VDIs). We can then take advantage of features like

shared storage
cross-host VDI.copy

to manage the patches.

XenAPI changes

Add a field pool_patch.VDI of type Ref(VDI). When a new patch is stored in a VDI, it will be referenced here. Older patches and cleaned patches will have invalid references here.
The HTTP handler for uploading patches will choose an SR to stream the patch into. It will prefer to use the pool.default_SR and fall back to choosing an SR on the master whose driver supports the VDI_CLONE capability: we want the ability to fast clone patches, one per host concurrently installing them. A VDI will be created whose size is 4x the apparent size of the patch, defaulting to 4GiB if we have no size information (i.e. no content-length header)
pool_patch.clean_on_host will be deprecated. It will still try to clean a patch from the local filesystem but this is pointless for the new VDI patch uploads.
pool_patch.clean will be deprecated. It will still try to clean a patch from the local filesystem of the master but this is pointless for the new VDI patch uploads.
pool_patch.pool_clean will be deprecated. It will destroy any associated patch VDI. Users will be encouraged to call VDI.destroy instead.
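The sizing rule for the patch VDI (4x the apparent size, or 4GiB when no content-length header is available) can be sketched as a small function; the names are illustrative.

```ocaml
let gib = Int64.(mul 1024L (mul 1024L 1024L))

(* content_length: the apparent patch size in bytes, if the upload
   supplied a content-length header. *)
let patch_vdi_size = function
  | Some content_length -> Int64.mul 4L content_length
  | None -> Int64.mul 4L gib

let () =
  Printf.printf "no header: %Ld bytes\n" (patch_vdi_size None)
```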

Changes beneath the XenAPI

pool_patch records will only be deleted if both the filename field refers to a missing file on the master and the VDI field is a dangling reference
Patches stored in VDIs will be stored within a filesystem, like we used to do with suspend images. This is needed because (a) we want to execute the patches and block devices cannot be executed; and (b) we can use spare space in the VDI as temporary scratch space during the patch application process. Within the VDI we will call patches patch rather than using a complicated filename.
When a host wishes to apply a patch it will call VDI.copy to duplicate the VDI to a locally-accessible SR, mount the filesystem and execute it. If the patch is still in the master’s dom0 filesystem then it will fall back to the HTTP handler.

Summary of the impact on the admin

There will no longer be a size limit on hotfixes imposed by the mechanism itself.
There must be enough free space in an SR connected to the host to be able to apply a patch on that host.

Design document
Revision	v1
Status	proposed

PCI passthrough support

Introduction

GPU passthrough is already available in XAPI. This document proposes to also offer passthrough for all PCI devices through XAPI.

Design proposal

New methods for PCI object:

PCI.enable_dom0_access
PCI.disable_dom0_access
PCI.get_dom0_access_status: compares the outputs of /opt/xensource/libexec/xen-cmdline and /proc/cmdline to produce one of the four values that can be currently contained in the PGPU.dom0_access field:
- disabled
- disabled_on_reboot
- enabled
- enabled_on_reboot
How to determine the expected dom0 access state from the pciback.hide lists of /proc/cmdline (the running configuration) and xen-cmdline (the configuration for the next boot):
- if the device id is present in neither list: enabled
- if the device id is present in both lists: disabled
- if the device id is not present in the list of /proc/cmdline but is in the one of xen-cmdline: disabled_on_reboot
- if the device id is present in the list of /proc/cmdline but not in the one of xen-cmdline: enabled_on_reboot
A function rather than a field makes the data always accurate and even accounts for changes made by users outside XAPI, directly through /opt/xensource/libexec/xen-cmdline
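The status computation can be sketched as a match on two booleans. This sketch assumes, in line with the test cases below where hiding a PCI device is done with PCI.disable_dom0_access, that a device listed in pciback.hide is hidden from dom0 and so has dom0 access disabled. The type and function names are hypothetical.

```ocaml
type dom0_access = Enabled | Disabled | Enabled_on_reboot | Disabled_on_reboot

(* hidden_now: the device id is in pciback.hide of /proc/cmdline;
   hidden_on_boot: the device id is in pciback.hide reported by xen-cmdline. *)
let dom0_access_status ~hidden_now ~hidden_on_boot =
  match (hidden_now, hidden_on_boot) with
  | false, false -> Enabled
  | true, true -> Disabled
  | false, true -> Disabled_on_reboot
  | true, false -> Enabled_on_reboot
```

Matching on the pair makes the compiler check that all four combinations are covered.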

With these generic methods available, the following field and methods will be deprecated:

PGPU.enable_dom0_access
PGPU.disable_dom0_access
PGPU.dom0_access (DB field)

They would still be usable and up to date with the same info as for the PCI methods.

Test cases

hide a PCI:
- call PCI.disable_dom0_access on an enabled PCI
- check the PCI goes in state disabled_on_reboot
- reboot the host
- check the PCI goes in state disabled
unhide a PCI:
- call PCI.enable_dom0_access on a disabled PCI
- check the PCI goes in state enabled_on_reboot
- reboot the host
- check the PCI goes in state enabled
get a PCI dom0 access state:
- on an enabled PCI, make sure the get_dom0_access_status returns enabled
- hide the PCI
- make sure the get_dom0_access_status returns disabled_on_reboot
- reboot
- make sure the get_dom0_access_status returns disabled
- unhide the PCI
- make sure the get_dom0_access_status returns enabled_on_reboot
- reboot
- make sure the get_dom0_access_status returns enabled
Check PCI/PGPU dom0 access coherence:
- hide a PCI belonging to a PGPU and make sure both states remain coherent at every step
- unhide a PCI belonging to a PGPU and make sure both states remain coherent at every step
- hide a PGPU and make sure its state and its PCI’s state remain coherent at every step
- unhide a PGPU and make sure its state and its PCI’s state remain coherent at every step

Design document
Revision	v1
Status	proposed

Pool-wide SSH

Background

The SMAPIv3 plugin architecture requires that storage plugins are able to work in the absence of xapi. Amongst other benefits, this allows them to be tested in isolation and shared more widely than just within the XenServer community, and it causes less load on xapi’s database.

However, many of the currently existing SMAPIv1 backends require inter-host operations to be performed. This is achieved via the use of the Xen-API call ‘host.call_plugin’, which allows an API user to execute a pre-installed plugin on any pool member. This is important for operations such as coalesce / snapshot where the active data path for a VM somewhere in the pool needs to be refreshed in order to complete the operation. In order to use this, the RPM in which the SM backend lives is used to deliver a plugin script into /etc/xapi.d/plugins, and this executes the required function when the API call is made.

In order to support these use-cases without xapi running, a new mechanism needs to be provided to allow the execution of required functionality on remote hosts. The canonical method for remotely executing scripts is ssh - the secure shell. This design proposal is setting out how xapi might manage the public and private keys to enable passwordless authentication of ssh sessions between all hosts in a pool.

Modifications to the host

On firstboot (and after being ejected), the host should generate a host key (already done I believe), and an authentication key for the user (root/xapi?).

Modifications to xapi

Three new fields will be added to the host object:

host.ssh_public_host_key : string: This is the host key that identifies the host during the initial ssh key exchange protocol. This should be added to the ‘known_hosts’ field of any other host wishing to ssh to this host.
host.ssh_public_authentication_key : string: This field is the public key used for authentication when sshing from the root account on that host - host A. This can be added to host B’s authorized_keys file in order to allow passwordless logins from host A to host B.
host.ssh_ready : bool: A boolean flag indicating that the configuration files in use by the ssh server/client on the host are up to date.

One new field will be added to the pool record:

pool.revoked_authentication_keys : string list: This field records all authentication keys that have been used by hosts in the past. It is updated when a host is ejected from the pool.

Pool Join

On pool join, the master creates the record for the new host and populates the two public key fields with values supplied by the joining host. It then sets the ssh_ready field on all other hosts to false.

On each host in the pool, a thread watches for updates to the ssh_ready value for the local host. When this is set to false, the host adds the keys from xapi’s database to the appropriate places in the ssh configuration files and restarts sshd. Once this is done, the host sets the ssh_ready field to ‘true’.

Pool Eject

On pool eject, the host’s ssh_public_host_key is lost, but the authentication key is added to a list of revoked keys on the pool object. This allows all other hosts to remove the key from the authorized_keys list when they next sync, which in the usual case is as soon as the database is modified, thanks to the event watch thread. If a host is offline, though, its authorized_keys file will be updated the next time it comes online.
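
The sync step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `render_ssh_files` and the per-host "address" entry are hypothetical, and the sketch regenerates both files wholesale from the database contents rather than editing them in place.

```python
def render_ssh_files(local_uuid, hosts, revoked_keys):
    """Return (known_hosts, authorized_keys) text for the host `local_uuid`.

    `hosts` maps a host UUID to a dict holding the proposed fields
    ssh_public_host_key and ssh_public_authentication_key, plus a
    hypothetical "address" entry. `revoked_keys` mirrors
    pool.revoked_authentication_keys.
    """
    revoked = set(revoked_keys)
    known_hosts = []
    authorized_keys = []
    for uuid, record in sorted(hosts.items()):
        if uuid == local_uuid:
            continue  # a host does not need an entry for itself
        known_hosts.append(f'{record["address"]} {record["ssh_public_host_key"]}')
        auth_key = record["ssh_public_authentication_key"]
        if auth_key not in revoked:
            authorized_keys.append(auth_key)
    # One entry per line, each terminated by a newline.
    return (
        "".join(line + "\n" for line in known_hosts),
        "".join(line + "\n" for line in authorized_keys),
    )
```

A host would write the two strings to its ssh configuration files, restart sshd, and only then set its own ssh_ready field back to true.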

Questions

Do we want a new user? e.g. ‘xapi’ - how would we then use this user to execute privileged things? setuid binaries?
Is keeping the revoked_keys list useful? If we ‘control the world’ of the authorized_keys file, we could just remove anything that’s currently in there that xapi doesn’t know about

Design document
Revision	v1
Status	proposed

Process events from xenopsd in a timely manner

Background

There is a significant delay between the VM being unpaused and XAPI reporting it as started during a bootstorm. It can happen that the VM is able to send UDP packets already, but XAPI still reports it as not started for minutes.

XAPI currently processes all events from xenopsd in a single thread, so the unpause events get queued up behind a lot of other events generated by the already running VMs.

We need to ensure that unpause events from xenopsd get processed in a timely manner, even if XAPI is busy processing other events.

Timely processing of events

If we process the events in a Round-Robin fashion then unpause events are reported in a timely fashion. We need to ensure that events operating on the same VM are not processed in parallel.

Xenopsd already has code that does exactly this. The purpose of the xapi-work-queues refactoring PR is to reuse this code in XAPI by creating a shared package between xenopsd and xapi: xapi-work-queues.

xapi-work-queues

From the documentation of the new Worker Pool interface:

A worker pool has a limited number of worker threads. Each worker pops one tagged item from the queue in a round-robin fashion. While the item is executed the tag temporarily doesn’t participate in round-robin scheduling. If during execution more items get queued with the same tag they get redirected to a private queue. Once the item finishes execution the tag will participate in RR scheduling again.

This ensures that items with the same tag do not get executed in parallel, and that a tag with a lot of items does not starve the execution of other tags.
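
The scheduling rule quoted above can be modelled in a few lines. The following is a single-threaded sketch, not the xapi-work-queues implementation (which is OCaml and multi-threaded); the class and method names are hypothetical.

```python
from collections import OrderedDict, deque


class TaggedRoundRobin:
    """Single-threaded model of a tagged round-robin work queue."""

    def __init__(self):
        self.ready = OrderedDict()  # tag -> deque of items, in round-robin order
        self.running = {}           # tag -> private deque of items queued meanwhile

    def push(self, tag, item):
        if tag in self.running:
            # The tag is executing: redirect the item to its private queue.
            self.running[tag].append(item)
        else:
            self.ready.setdefault(tag, deque()).append(item)

    def pop(self):
        """Take one item from the tag at the front; that tag leaves scheduling."""
        if not self.ready:
            return None
        tag, items = self.ready.popitem(last=False)
        item = items.popleft()
        self.running[tag] = items  # remaining items wait until finish(tag)
        return tag, item

    def finish(self, tag):
        """The item for `tag` completed: rejoin at the back if work remains."""
        items = self.running.pop(tag)
        if items:
            self.ready[tag] = items
```

In this model a worker thread would loop over `pop`, execute the item, and call `finish`. A tag with many queued items gets one turn per round, so a single unpause event for another VM is not stuck behind them.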

The XAPI side of the changes will look like this

Known limitations: The number of active per-VM events should be small; this is already ensured in the push_with_coalesce / should_keep code on the xenopsd side. Events to XAPI from xenopsd should already arrive coalesced.

Design document
Revision	v2
Status	released (xenserver 6.5 sp1)
Review	#12

RDP control

Purpose

To administer guest VMs it can be useful to connect to them over Remote Desktop Protocol (RDP). XenCenter supports this; it has an integrated RDP client.

First it is necessary to turn on the RDP service in the guest.

This can be controlled from XenCenter. Several layers are involved. This description starts in the guest and works up the stack to XenCenter.

This feature was completed in the first quarter of 2015, and released in Service Pack 1 for XenServer 6.5.

The guest agent

The XenServer guest agent installed in Windows VMs can turn the RDP service on and off, and can report whether it is running.

The guest agent is at https://github.com/xenserver/win-xenguestagent

Interaction with the agent is done through some Xenstore keys:

The guest agent running in domain N writes two xenstore nodes when it starts up:

/local/domain/N/control/feature-ts = 1
/local/domain/N/control/feature-ts2 = 1

This indicates support for the rest of the functionality described below.

(The “…ts2” flag is new for this feature; older versions of the guest agent wrote the “…ts” flag and had support for only a subset of the functionality (no firewall modification), and had a bug in updating .../data/ts.)

To indicate whether RDP is running, the guest agent writes the string “1” (running) or “0” (disabled) to xenstore node

/local/domain/N/data/ts.

It does this on start-up, and also in response to the deletion of that node.

The guest agent also watches xenstore node /local/domain/N/control/ts and it turns RDP on and off in response to “1” or “0” (respectively) being written to that node. The agent acknowledges the request by deleting the node, and afterwards it deletes /local/domain/N/data/ts, thus triggering itself to update that node as described above.

When the guest agent turns the RDP service on/off, it also modifies the standard Windows firewall to allow/forbid incoming connections to the RDP port. This is the same as the firewall change that happens automatically when the RDP service is turned on/off through the standard Windows GUI.

XAPI etc.

xenopsd sets up watches on xenstore nodes including the control tree and data/ts, and prompts xapi to react by updating the relevant VM guest metrics record, which is available through a XenAPI call.

XenAPI includes a new message (function call) which can be used to ask the guest agent to turn RDP on and off.

This is VM.call_plugin (analogous to Host.call_plugin) in the hope that it can be used for other purposes in the future, even though for now it does not really call a plugin.

To use it, supply plugin="guest-agent-operation" and either fn="request_rdp_on" or fn="request_rdp_off".

See http://xapi-project.github.io/xen-api/classes/vm.html

The function strings are named with “request” (rather than, say, “enable_rdp” or “turn_rdp_on”) to make it clear that xapi only makes a request of the guest: when one of these calls returns successfully this means only that the appropriate string (1 or 0) was written to the control/ts node and it is up to the guest whether it responds.
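
A client-side call might look like the sketch below. It assumes a logged-in session object in the style of the XenAPI Python bindings, and it assumes that VM.call_plugin takes the VM reference, the plugin name, the function name and a string-to-string map of arguments, by analogy with Host.call_plugin. The helper name `request_rdp` is hypothetical.

```python
def request_rdp(session, vm_ref, enable):
    """Ask the guest agent in `vm_ref` to turn RDP on or off.

    A successful return only means the request was written to the
    control/ts node; the guest decides whether to act on it.
    """
    fn = "request_rdp_on" if enable else "request_rdp_off"
    # Assumption: the final argument is an (empty) map of extra arguments.
    return session.xenapi.VM.call_plugin(vm_ref, "guest-agent-operation", fn, {})
```

Because the call is only a request, a client that needs to know the outcome should watch the VM guest metrics record instead of relying on the return value.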

XenCenter

Behaviour on older XenServer versions that do not support RDP control

Note that the current behaviour depends on some global options: “Enable Remote Desktop console scanning” and “Automatically switch to the Remote Desktop console when it becomes available”.

When tools are not installed:
- As of XenCenter 6.5, the RDP button is absent.
When tools are installed but RDP is not switched on in the guest:
1. If “Enable Remote Desktop console scanning” is on:
  - The RDP button is present but greyed out. (It seems to sometimes read “Switch to Remote Desktop” and sometimes read “Looking for guest console…”: I haven’t yet worked out the difference).
  - We scan the RDP port to detect when RDP is turned on
2. If “Enable Remote Desktop console scanning” is off:
  - The RDP button is enabled and reads “Switch to Remote Desktop”
When tools are installed and RDP is switched on in the guest:
1. If “Enable Remote Desktop console scanning” is on:
  - The RDP button is enabled and reads “Switch to Remote Desktop”
  - If “Automatically switch” is on, we switch to RDP as soon as we detect it
2. If “Enable Remote Desktop console scanning” is off:
  - As above, the RDP button is enabled and reads “Switch to Remote Desktop”

New behaviour on XenServer versions that support RDP control

This new XenCenter behaviour is only for XenServer versions that support RDP control, with guests with the new guest agent: behaviour must be unchanged if the server or guest-agent is older.
There should be no change in the behaviour for Linux guests, either PV or HVM varieties: this must be tested.
We should never scan the RDP port; instead we should watch for a change in the relevant variable in guest_metrics.
The XenCenter option “Enable Remote Desktop console scanning” should change to read “Enable Remote Desktop console scanning (XenServer 6.5 and earlier)”
The XenCenter option “Automatically switch to the Remote Desktop console when it becomes available” should be enabled even when “Enable Remote Desktop console scanning” is off.
When tools are not installed:
- As above, the RDP button should be absent.
When tools are installed but RDP is not switched on in the guest:
- The RDP button should be enabled and read “Turn on Remote Desktop”
- If pressed, it should launch a dialog with the following wording: “Would you like to turn on Remote Desktop in this VM, and then connect to it over Remote Desktop? [Yes] [No]”
- That button should turn on RDP, wait for RDP to become enabled, and switch to an RDP connection. It should do this even if “Automatically switch” is off.
When tools are installed and RDP is switched on in the guest:
- The RDP button should be enabled and read “Switch to Remote Desktop”
- If “Automatically switch” is on, we should switch to RDP immediately
- There is no need for us to provide UI to switch RDP off again
We should also test the case where RDP has been switched on in the guest before the tools are installed.

Design document
Revision	v1
Status	released (7.0)

RRDD archival redesign

Introduction

Current problems with rrdd:

rrdd stores knowledge about whether it is running on a master or a slave

This determines the host to which rrdd will archive a VM’s rrd when the VM’s domain disappears - rrdd will always try to archive to the master. However, when a host joins a pool as a slave rrdd is not restarted so this knowledge is out of date. When a VM shuts down on the slave rrdd will archive the rrd locally. When starting this VM again the master xapi will attempt to push any locally-existing rrd to the host on which the VM is being started, but since no rrd archive exists on the master the slave rrdd will end up creating a new rrd and the previous rrd will be lost.

rrdd handles rebooting VMs unpredictably

When rebooting a VM, there is a chance rrdd will attempt to update that VM’s rrd during the brief period when there is no domain for that VM. If this happens, rrdd will archive the VM’s rrd to the master, and then create a new rrd for the VM when it sees the new domain. If rrdd doesn’t attempt to update that VM’s rrd during this period, rrdd will continue to add data for the new domain to the old rrd.

Proposal

To solve these problems, we will remove some of the intelligence from rrdd and make it into more of a slave process of xapi. This will entail removing all knowledge from rrdd of whether it is running on a master or a slave, and also modifying rrdd to only start monitoring a VM when it is told to, and only archiving an rrd (to a specified address) when it is told to. This matches the way xenopsd only manages domains which it has been told to manage.

Design

For most VM lifecycle operations, xapi and rrdd processes (sometimes across more than one host) cooperate to start or stop recording a VM’s metrics and/or to restore or backup the VM’s archived metrics. Below we will describe, for each relevant VM operation, how the VM’s rrd is currently handled, and how we propose it will be handled after the redesign.

VM.destroy

The master xapi makes a remove_rrd call to the local rrdd, which causes rrdd to delete the VM’s archived rrd from disk. This behaviour will remain unchanged.

VM.start(_on) and VM.resume(_on)

The master xapi makes a push_rrd call to the local rrdd, which causes rrdd to send any locally-archived rrd for the VM in question to the rrdd of the host on which the VM is starting. This behaviour will remain unchanged.

VM.shutdown and VM.suspend

Every update cycle rrdd compares its list of registered VMs to the list of domains actually running on the host. Any registered VMs which do not have a corresponding domain have their rrds archived to the rrdd running on the host believed to be the master. We will change this behaviour by stopping rrdd from doing the archiving itself; instead we will expose a new function in rrdd’s interface:

val archive_rrd : vm_uuid:string -> remote_address:string -> unit

This will cause rrdd to remove the specified rrd from its table of registered VMs, and archive the rrd to the specified host. When a VM has finished shutting down or suspending, the xapi process on the host on which the VM was running will call archive_rrd to ask the local rrdd to archive back to the master rrdd.

VM.reboot

Removing rrdd’s ability to automatically archive the rrds for disappeared domains will have the bonus effect of fixing how the rrds of rebooting VMs are handled, as we don’t want the rrds of rebooting VMs to be archived at all.

VM.checkpoint

This will be handled automatically, as internally VM.checkpoint carries out a VM.suspend followed by a VM.resume.

VM.pool_migrate and VM.migrate_send

The source host’s xapi makes a migrate_rrd call to the local rrdd, with a destination address and an optional session ID. The session ID is only required for cross-pool migration. The local rrdd sends the rrd for that VM to the destination host’s rrdd as an HTTP PUT. This behaviour will remain unchanged.

Design document
Revision	v1
Status	released (7.0)
Revision history
v1	Initial version

RRDD plugin protocol v2

Motivation

rrdd plugins currently report datasources via a shared-memory file, using the following format:

DATASOURCES
000001e4
dba4bf7a84b6d11d565d19ef91f7906e
{
  "timestamp": 1339685573.245,
  "data_sources": {
    "cpu-temp-cpu0": {
      "description": "Temperature of CPU 0",
      "type": "absolute",
      "units": "degC",
      "value": "64.33",
      "value_type": "float"
    },
    "cpu-temp-cpu1": {
      "description": "Temperature of CPU 1",
      "type": "absolute",
      "units": "degC",
      "value": "62.14",
      "value_type": "float"
    }
  }
}

This format contains four main components:

A constant header string

DATASOURCES

This should always be present.

The JSON data length, encoded as hexadecimal

000001e4

The md5sum of the JSON data

dba4bf7a84b6d11d565d19ef91f7906e

The JSON data itself, encoding the values and metadata associated with the reported datasources.

Example

{
  "timestamp": 1339685573.245,
  "data_sources": {
    "cpu-temp-cpu0": {
      "description": "Temperature of CPU 0",
      "type": "absolute",
      "units": "degC",
      "value": "64.33",
      "value_type": "float"
    },
    "cpu-temp-cpu1": {
      "description": "Temperature of CPU 1",
      "type": "absolute",
      "units": "degC",
      "value": "62.14",
      "value_type": "float"
    }
  }
}
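
The four components can be produced and checked with a short sketch like the one below. It assumes that the components are separated by newlines, as in the listing above, and that the length field counts the bytes of the JSON text; neither detail is stated explicitly here. The function names are hypothetical.

```python
import hashlib
import json

HEADER = "DATASOURCES"


def encode_v1(payload):
    """Serialise `payload` as header, hex length, md5 and JSON text."""
    body = json.dumps(payload)
    raw = body.encode("utf-8")
    return "\n".join([HEADER, f"{len(raw):08x}", hashlib.md5(raw).hexdigest(), body])


def decode_v1(text):
    """Validate the framing and return the parsed JSON data."""
    header, length_hex, checksum, body = text.split("\n", 3)
    if header != HEADER:
        raise ValueError("invalid header")
    # Assumption: the length is a byte count of the JSON text.
    raw = body.encode("utf-8")[: int(length_hex, 16)]
    if hashlib.md5(raw).hexdigest() != checksum:
        raise ValueError("invalid checksum")
    return json.loads(raw)
```

Note that `decode_v1` has to run `json.loads` over the whole structure on every read, even when only the values have changed.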

The disadvantage of this protocol is that rrdd has to parse the entire JSON structure each tick, even though most of the time only the values will change.

For this reason a new protocol is proposed.

Protocol V2

value	bits	format	notes
header string	(string length)*8	string	“DATASOURCES” as in the V1 protocol
data checksum	32	int32	binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values
metadata checksum	32	int32	binary-encoded crc32 of the metadata string (see below)
number of datasources	32	int32	only needed if the metadata has changed - otherwise RRDD can use a cached value
timestamp	64	double	Unix epoch
datasource values	n * 64	int64 | double	n is the number of datasources exported by the plugin, type dependent on the setting in the metadata for value_type [int64|float]
metadata length	32	int32
metadata	(string length)*8	string

All integers and doubles are big-endian. The metadata will have the same JSON-based format as in the V1 protocol, minus the timestamp and the value key-value pair for each datasource.

field	values	notes	required
description	string	Description of the datasource	no
owner	host | vm | sr	The object to which the data relates	no, default host
value_type	int64 | float	The type of the datasource	yes
type	absolute | derive | gauge	The type of measurement being sent. Absolute for counters which are reset on reading, derive stores the derivative of the recorded values (useful for metrics which continually increase like amount of data written since start), gauge for things like temperature	no, default absolute
default	true | false	Whether the source is default enabled or not	no, default false
units		The units the data should be displayed in	no
min		The minimum value for the datasource	no, default -infinity
max		The maximum value for the datasource	no, default +infinity

Example

{
  "datasources": {
    "memory_reclaimed": {
      "description":"Host memory reclaimed by squeezed",
      "owner":"host",
      "value_type":"int64",
      "type":"absolute",
      "default":"true",
      "units":"B",
      "min":"-inf",
      "max":"inf"
    },
    "memory_reclaimed_max": {
      "description":"Host memory that could be reclaimed by squeezed",
      "owner":"host",
      "value_type":"int64",
      "type":"absolute",
      "default":"true",
      "units":"B",
      "min":"-inf",
      "max":"inf"
    },
    "cpu-temp-cpu0": {
      "description": "Temperature of CPU 0",
      "owner":"host",
      "value_type": "float",
      "type": "absolute",
      "default":"true",
      "units": "degC",
      "min":"-inf",
      "max":"inf"
    },
    "cpu-temp-cpu1": {
      "description": "Temperature of CPU 1",
      "owner":"host",
      "value_type": "float",
      "type": "absolute",
      "default":"true",
      "units": "degC",
      "min":"-inf",
      "max":"inf"
    }
  }
}

The above formatting is not required, but added here for readability.

Reading algorithm

if header != expected_header:
    raise InvalidHeader()
if data_checksum == last_data_checksum:
    raise NoUpdate()
if data_checksum != crc32(encoded_timestamp_and_values):
    raise InvalidChecksum()
if metadata_checksum != last_metadata_checksum:
    # The metadata changed: verify it and rebuild the cached datasources.
    if metadata_checksum != crc32(metadata):
        raise InvalidChecksum()
    cached_datasources = create_datasources(metadata)
# Values are stored in the same order as the datasources in the metadata.
for datasource, value in zip(cached_datasources, values):
    update(datasource, value)

This means that for a normal update, RRDD will only have to read the header plus the first (4 + 4 + 4 + 8 + 8*n) bytes of data (the two 32-bit checksums, the 32-bit datasource count, the 64-bit timestamp and n 64-bit values), where n is the number of datasources exported by the plugin. If the metadata changes RRDD will have to read all the data (and parse the metadata).
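
As an illustration, the layout in the table and the reading algorithm can be exercised with Python's struct and zlib modules. This is a sketch under assumptions, not the rrdd implementation: the checksums are packed as unsigned 32-bit integers, the function names are hypothetical, and the NoUpdate check against the previous data checksum is omitted for brevity.

```python
import json
import struct
import zlib

HEADER = b"DATASOURCES"


def _value_format(meta):
    # int64 datasources are packed as signed 64-bit, everything else as double.
    return ">q" if meta["value_type"] == "int64" else ">d"


def encode_v2(metadata, timestamp, values):
    """Build a V2 payload; values follow the order of the metadata map."""
    meta_raw = json.dumps(metadata).encode("utf-8")
    data = struct.pack(">d", timestamp)
    for name, meta in metadata["datasources"].items():
        data += struct.pack(_value_format(meta), values[name])
    counts = struct.pack(">III", zlib.crc32(data), zlib.crc32(meta_raw),
                         len(metadata["datasources"]))
    return HEADER + counts + data + struct.pack(">I", len(meta_raw)) + meta_raw


def decode_v2(blob, cache=None):
    """Return (timestamp, values, cache); cache is (metadata crc, metadata)."""
    if blob[:len(HEADER)] != HEADER:
        raise ValueError("invalid header")
    pos = len(HEADER)
    data_crc, meta_crc, count = struct.unpack_from(">III", blob, pos)
    pos += 12
    data = blob[pos:pos + 8 + 8 * count]
    if zlib.crc32(data) != data_crc:
        raise ValueError("invalid data checksum")
    pos += len(data)
    if cache is None or cache[0] != meta_crc:
        # Slow path: the metadata changed, so read and parse it.
        (meta_len,) = struct.unpack_from(">I", blob, pos)
        meta_raw = blob[pos + 4:pos + 4 + meta_len]
        if zlib.crc32(meta_raw) != meta_crc:
            raise ValueError("invalid metadata checksum")
        cache = (meta_crc, json.loads(meta_raw))
    (timestamp,) = struct.unpack_from(">d", data, 0)
    values = {}
    for i, (name, meta) in enumerate(cache[1]["datasources"].items()):
        (values[name],) = struct.unpack_from(_value_format(meta), data, 8 + 8 * i)
    return timestamp, values, cache
```

When the caller passes back the cache from a previous read and the metadata checksum is unchanged, `decode_v2` never touches the bytes after the datasource values.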

Design document
Revision	v1
Status	proposed
Revision history
v1	Initial version

RRDD plugin protocol v3

Motivation

Version 2 of the rrdd plugin protocol reports datasources via a shared-memory file, but it has several limitations:

metrics are identified solely by their names, so several metrics cannot share the same name (e.g. vCPU usage per VM)
only numeric metrics are supported; for example, string metrics (e.g. CPU model) cannot be exposed

These restrictions constrain plugins and limit OpenMetrics support in the metrics daemon.

Moreover, the protocol can be impractical for plugin developers and parser implementations:

JSON implementations may not preserve insertion order in maps, which causes problems because the datasource values must appear in the same order as the entries of the metadata map
the header length is not constant and depends on the datasource count, which complicates parsing
it still requires a fairly advanced parser to convert between bytes and numbers according to the metadata

A simpler protocol is proposed, based on the OpenMetrics binary format, to ease plugin and parser implementations.

Protocol V3

For this protocol, we still use a shared-memory file, but significantly change the structure of the file.

value	bits	format	notes
header string	12*8=96	string	“OPENMETRICS1” which is one byte longer than “DATASOURCES”, intentionally made at 12 bytes for alignment purposes
data checksum	32	uint32	Checksum of the concatenation of the rest of the header (from timestamp) and the payload data
timestamp	64	uint64	Unix epoch
payload length	32	uint32	Payload length
payload data	8*(payload length)	binary	OpenMetrics encoded metrics data (protocol-buffers format)

All values are big-endian.

The header size is constant (28 bytes), which implementations can rely on: they can read the entire header in one go, and memory mapping is simpler to use.

Unlike protocol v2, but like protocol v1, metadata is included alongside the metrics, in OpenMetrics format.

The owner attribute of a metric should instead be exposed as an OpenMetrics label (named owner).

Multiple metrics that share the same name should be exposed under the same Metric Family and be differentiated by labels (e.g. owner).

Reading algorithm

if header != expected_header:
    raise InvalidHeader()
if data_checksum == last_data_checksum:
    raise NoUpdate()
if timestamp == last_timestamp:
    raise NoUpdate()
if data_checksum != crc32(concat_header_end_payload):
    raise InvalidChecksum()

metrics = parse_openmetrics(payload_data)

for family in metrics:
    if family_exists(family):
        update_family(family)
    else:
        create_family(family)

track_removed_families(metrics)
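
The fixed-size header makes the framing easy to handle with a single struct format. The sketch below covers only the framing and checksum; the payload is treated as opaque bytes and no OpenMetrics parsing is attempted. The function names are hypothetical, and crc32 is assumed as the checksum, as in the pseudocode above.

```python
import struct
import zlib

HEADER = b"OPENMETRICS1"
# header string, data checksum, timestamp, payload length (all big-endian)
HEADER_FORMAT = ">12sIQI"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 12 + 4 + 8 + 4 bytes


def encode_v3(timestamp, payload):
    """Frame an already-encoded OpenMetrics payload."""
    tail = struct.pack(">QI", timestamp, len(payload)) + payload
    return HEADER + struct.pack(">I", zlib.crc32(tail)) + tail


def decode_v3(blob, last_checksum=None):
    """Return (timestamp, payload, checksum), or None if nothing changed."""
    magic, checksum, timestamp, length = struct.unpack_from(HEADER_FORMAT, blob, 0)
    if magic != HEADER:
        raise ValueError("invalid header")
    if checksum == last_checksum:
        return None  # same data as the previous read
    payload = blob[HEADER_SIZE:HEADER_SIZE + length]
    # The checksum covers the header from the timestamp onwards plus the payload.
    if zlib.crc32(blob[16:HEADER_SIZE] + payload) != checksum:
        raise ValueError("invalid checksum")
    return timestamp, payload, checksum
```

A reader would keep the returned checksum and pass it as `last_checksum` on the next tick, then hand the payload to an OpenMetrics protocol-buffers decoder.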

Design document
Revision	v2
Status	proposed
Review	#186
Revision history
v1	Initial version
v2	Renaming VMSS fields and APIs. API message_create supersedes vmss_create_alerts.
v3	Remove VMSS alarm_config details and use existing pool wide alarm config
v4	Renaming field from retention-value to retained-snapshots and schedule-snapshot to scheduled-snapshot
v5	Add new API task_set_status

Schedule Snapshot Design

The scheduled snapshot feature will utilize the existing architecture of VMPR. In terms of functionality, scheduled snapshot is basically VMPR without its archiving capability.

Introduction

Scheduled snapshot will be a new object in xapi, called VMSS.
A pool can have multiple VMSS.
Multiple VMs can be part of a VMSS, but a VM cannot be part of more than one VMSS.
A VMSS takes VM snapshots of type [snapshot, checkpoint, snapshot_with_quiesce].
A VMSS takes snapshots of its VMs at configured intervals:
- hourly -> On every day, Each hour, Mins [0;15;30;45]
- daily -> On every day, Hour [0 to 23], Mins [0;15;30;45]
- weekly -> Days [Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday], Hour[0 to 23], Mins [0;15;30;45]
A VMSS will limit the number of snapshots retained per VM, in the range [1 to 10].

Datapath Design

There will be a cron job for VMSS.
VMSS plugin will go through all the scheduled snapshot policies in the pool and check if any of them are due.
If a snapshot is due, go through all the VM objects in XAPI associated with this scheduled snapshot policy and create a new snapshot.
If the snapshot operation fails, create a notification alert for the event and move to the next VM.
Check if an older snapshot now needs to be deleted to comply with the retained snapshots defined in the scheduled policy.
If we need to delete any existing snapshots, delete the oldest snapshot created via scheduled policy.
Set the last-run timestamp in the scheduled policy.
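
Two of the decisions in this loop, whether a policy matches the current time slot and which snapshots exceed the retention limit, can be sketched as below. The function names are hypothetical, and the sketch assumes that the schedule's "days" value is a comma-separated string of English day names. A real plugin would also compare against the last-run timestamp to decide whether a policy is due.

```python
from datetime import datetime

DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]


def is_scheduled_at(frequency, schedule, now):
    """True if `now` falls on the slot described by frequency and schedule."""
    if now.minute != int(schedule["min"]):
        return False
    if frequency == "hourly":
        return True
    if now.hour != int(schedule["hour"]):
        return False
    if frequency == "daily":
        return True
    # weekly; assumption: "days" is a comma-separated list of day names
    days = [day.strip() for day in schedule["days"].split(",")]
    return DAYS[now.weekday()] in days


def snapshots_to_delete(snapshot_times, retained):
    """Return the oldest snapshot times that exceed `retained`, oldest first."""
    if not 1 <= retained <= 10:
        raise ValueError("retained-snapshots must be between 1 and 10")
    ordered = sorted(snapshot_times)
    excess = len(ordered) - retained
    return ordered[:excess] if excess > 0 else []
```

Only snapshots that were created by the scheduled policy should be passed to `snapshots_to_delete`, so that manually created snapshots are never pruned.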

Xapi Changes

There is a new record for VM Scheduled Snapshot with new fields.

New fields:

name-label type String : Name label for VMSS.
name-description type String : Name description for VMSS.
enabled type Bool : Enable/Disable VMSS to take snapshot.
type type Enum [snapshot; checkpoint; snapshot_with_quiesce] : Type of snapshot VMSS takes.
retained-snapshots type Int64 : Number of snapshots limit for a VM, max limit is 10 and default is 7.
frequency type Enum [hourly; daily; weekly] : Frequency of taking snapshot of VMs.
schedule type Map(String,String) with (key, value) pair:
- hour : 0 to 23
- min : [0;15;30;45]
- days : [Monday,Tuesday,Wednesday,Thursday,Friday,Saturday,Sunday]
last-run-time type Date : DateTime of last execution of VMSS.
VMs type VM refs : List of VMs part of VMSS.

New fields to VM record:

scheduled-snapshot type VMSS ref : VM part of VMSS.
is-vmss-snapshot type Bool : If snapshot created from VMSS.

New APIs

vmss_snapshot_now (Ref vmss, Pool_Operator) -> String : This call executes the scheduled snapshot immediately.
vmss_set_retained_snapshots (Ref vmss, Int value, Pool_Operator) -> unit : Set the value of vmss retained snapshots, max is 10.
vmss_set_frequency (Ref vmss, String “value”, Pool_Operator) -> unit : Set the value of the vmss frequency field.
vmss_set_type (Ref vmss, String “value”, Pool_Operator) -> unit : Set the snapshot type of the vmss type field.
vmss_set_scheduled (Ref vmss, Map(String,String) “value”, Pool_Operator) -> unit : Set the vmss scheduled to take snapshot.
vmss_add_to_schedule (Ref vmss, String “key”, String “value”, Pool_Operator) -> unit : Add key value pair to VMSS schedule.
vmss_remove_from_schedule (Ref vmss, String “key”, Pool_Operator) -> unit : Remove key from VMSS schedule.
vmss_set_last_run_time (Ref vmss, DateTime “value”, Local_Root) -> unit : Set the last run time for VMSS.
task_set_status (Ref task, status_type “value”, READ_ONLY) -> unit : Set the status of task owned by same user, Pool_Operator can set status for any tasks.

New CLIs

vmss-create (required : “name-label”;“type”;“frequency”, optional : “name-description”;“enabled”;“schedule:”;“retained-snapshots”) -> unit : Creates VM scheduled snapshot.
vmss-destroy (required : uuid) -> unit : Destroys a VM scheduled snapshot.

Design document
Revision	v1
Status	released (7.6)

SMAPIv3

Xapi accesses storage through “plugins” which currently use a protocol called “SMAPIv1”. This protocol has a number of problems:

the protocol has many missing features, and this leads to people using the XenAPI from within a plugin, which is racy, difficult to get right, unscalable and makes component testing impossible.
the protocol expects plugin authors to have a deep knowledge of the Xen storage datapath (tapdisk, blkback etc) and the storage.
the protocol is undocumented.

We shall create a new revision of the protocol (“SMAPIv3”) to address these problems.

The following diagram shows the new control plane:

Storage control plane

Requests from xapi are filtered through the existing storage_access layer which is responsible for managing the mapping between VM VBDs and VDIs.

Each plugin is represented by a named queue, with APIs for

querying the state of each queue
explicitly cancelling or replying to messages

Legacy SMAPIv1 plugins will be processed via the existing storage_access.SMAPIv1 module. Newer SMAPIv3 plugins will be handled by a new xapi-storage-script service.

The SMAPIv3 APIs will be defined in an IDL format in a separate repo.

xapi-storage-script

The xapi-storage-script will run as a service and will

use inotify to monitor a well-known path in dom0
when a directory is created, check whether it contains storage plugins by executing a Plugin.query
assuming the directory contains plugins, it will register the queue name and start listening for messages
when messages from xapi or the CLI are received, it will generate the SMAPIv3 .json message and fork the relevant script.

SMAPIv3 IDL

The IDL will support

documentation for all functions, parameters and results
- this will be extended to be a XenAPI-style versioning scheme in future
generating hyperlinked HTML documentation, published on github
generating libraries for python and OCaml
- the libraries will include marshalling, unmarshalling, type-checking and command-line parsing and help generation

Diagnostic tools

It will be possible to view the contents of the queue associated with any plugin, and see whether

the queue is being served or not (perhaps the xapi-storage-script has crashed)
there are unanswered messages (perhaps one of the messages has caused a deadlock in the implementation?)

It will be possible to

delete/clear queues/messages
download a message-sequence chart of the last N messages for inclusion in bugtools.

Anatomy of a plugin

The following diagram shows what a plugin would look like:

Anatomy of a plugin

The SMAPIv3

Please read the current SMAPIv3 documentation.

Design document
Revision	v1
Status	proposed

Specifying Emulated PCI Devices

Background and goals

At present (early March 2015) the datamodel defines a VM as having a “platform” string-string map, in which two keys are interpreted as specifying a PCI device which should be emulated for the VM. Those keys are “device_id” and “revision” (with int values represented as decimal strings).

Limitations:

Hardcoded defaults are used for the vendor ID and all other parameters except device_id and revision.
Only one emulated PCI device can be specified.

When instructing qemu to emulate PCI devices, qemu accepts twelve parameters for each device.

Future guest-agent features rely on additional emulated PCI devices. We cannot know in advance the full details of all the devices that will be needed, but we can predict some.

We need a way to configure VMs such that they will be given additional emulated PCI devices.

Design

In the datamodel, there will be a new type of object for emulated PCI devices.

Tentative name: “emulated_pci_device”

Fields to be passed through to qemu are the following, all static read-only, and all ints except devicename:

devicename (string)
vendorid
deviceid
command
status
revision
classcode
headertype
subvendorid
subsystemid
interruptline
interruptpin

We also need a “built_in” flag: see below.

Allow creation of these objects through the API (and CLI).

(It would be nice, but by no means essential, to be able to create one by specifying an existing one as a basis, along with one or more altered fields, e.g. “Make a new one just like that existing one except with interruptpin=9.”)

Create some of these devices to be defined as standard in XenServer, along the same lines as the VM templates. Those ones should have built_in=true.

Allow destruction of these objects through the API (and CLI), but not if they are in use or if they have built_in=true.

A VM will have a list of zero or more of these emulated-pci-device objects. (OPEN QUESTION: Should we forbid having more than one of a given device?)

Provide API (and CLI) commands to add and remove one of these devices from a VM (identifying the VM and device by uuid or other identifier such as name).

The CLI should allow performing this on multiple VMs in one go, based on a selector or filter for the VMs. We have this concept already in the CLI in commands such as vm-start.

In the function that adds an emulated PCI device to a VM, we must check if this is the first device to be added, and must refuse if the VM’s Virtual Hardware Platform Version is too low. (Or should we just raise the version automatically if needed?)

When starting a VM, check its list of emulated pci devices and pass the details through to qemu (via xenopsd).

Design document
Revision	v11
Status	confirmed
Review	#139
Revision history
v1	Initial version
v2	Added details about the VDI's binary format and size, and the SR capability name.
v3	Tar was not needed after all!
v4	Add details about discovering the VDI using a new vdi_type.
v5	Add details about the http handlers and interaction with xapi's database
v6	Add details about the framing of the data within the VDI
v7	Redesign semantics of the rrd_updates handler
v8	Redesign semantics of the rrd_updates handler (again)
v9	Magic number change in framing format of vdi
v10	Add details of new APIs added to xapi and xcp-rrdd
v11	Remove unneeded API calls

SR-Level RRDs

Introduction

Xapi has RRDs to track VM- and host-level metrics. There is a desire to have SR-level RRDs as a new category, because SR stats are not specific to a certain VM or host. Examples are size and free space on the SR. While recording SR metrics is relatively straightforward within the current RRD system, the main question is where to archive them, which is what this design aims to address.

Stats Collection

All SR types, including the existing ones, should be able to have RRDs defined for them. Some RRDs, such as a “free space” one, may make sense for multiple (if not all) SR types. However, the way to measure something like free space will be SR specific. Furthermore, it should be possible for each type of SR to have its own specialised RRDs.

It follows that each SR will need its own xcp-rrdd plugin, which runs on the SR master and defines and collects the stats. For the new thin-lvhd SR this could be xenvmd itself. The plugin registers itself with xcp-rrdd, so that the latter records the live stats from the plugin into RRDs.

Archiving

SR-level RRDs will be archived in the SR itself, in a VDI, rather than in the local filesystem of the SR master. This way, we don’t need to worry about master failover.

The VDI will be 4MB in size. This is a little more space than we would need for the RRDs we have in mind at the moment, but will give us enough headroom for the foreseeable future. It will not have a filesystem on it for simplicity and performance. There will only be one RRD archive file for each SR (possibly containing data for multiple metrics), which is gzipped by xcp-rrdd, and can be copied onto the VDI.

There will be a simple framing format for the data on the VDI. This will be as follows:

Offset	Type	Name	Comment
0	32 bit network-order int	magic	Magic number = 0x7ada7ada
4	32 bit network-order int	version	1
8	32 bit network-order int	length	length of payload
12	gzipped data	data
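
The framing above can be sketched in Python as follows. This is an illustration only: the function names are hypothetical, the example compresses the data itself to stay self-contained (in the design it is xcp-rrdd that gzips the archive), and it assumes that the length field counts the gzipped payload bytes, which the table does not state explicitly.

```python
import gzip
import struct

MAGIC = 0x7ADA7ADA               # magic number from the framing table
VERSION = 1
HEADER = struct.Struct(">III")   # three 32-bit network-order (big-endian) ints


def frame_rrd(rrd_bytes: bytes) -> bytes:
    """Gzip an RRD archive and prepend the 12-byte header."""
    payload = gzip.compress(rrd_bytes)
    # Assumption: "length" is the size of the gzipped payload.
    return HEADER.pack(MAGIC, VERSION, len(payload)) + payload


def unframe_rrd(vdi_bytes: bytes) -> bytes:
    """Validate the header and return the decompressed RRD archive."""
    magic, version, length = HEADER.unpack_from(vdi_bytes, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number: no RRD archive on this VDI")
    if version != VERSION:
        raise ValueError(f"unsupported framing version {version}")
    payload = vdi_bytes[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("truncated payload")
    return gzip.decompress(payload)
```

Because the header records the payload length, whatever follows the payload on the 4MB VDI is simply ignored on read.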

Xapi will be in charge of the lifecycle of this VDI, not the plugin or xcp-rrdd, which will make it a little easier to manage them. Only xapi will attach/detach and read from/write to this VDI. We will keep xcp-rrdd as simple as possible, and have it archive to its standard path in the local file system. Xapi will then copy the RRDs in and out of the VDI.

A new value "rrd" in the vdi_type enum of the datamodel will be defined, and the VDI.type of the VDI will be set to that value. The storage backend will write the VDI type to the LVM metadata of the VDI, so that xapi can discover the VDI containing the SR-level RRDs when attaching an SR to a new pool. This means that SR-level RRDs are currently restricted to LVM SRs.

Because we will not write plugins for all SRs at once, and therefore do not need xapi to set up the VDI for all SRs, we will add an SR “capability” for the backends to be able to tell xapi whether it has the ability to record stats and will need storage for them. The capability name will be: SR_STATS.

Management of the SR-stats VDI

The SR-stats VDI will be attached/detached on PBD.plug/unplug on the SR master.

On PBD.plug on the SR master, if the SR has the stats capability, xapi:
- Creates a stats VDI if not already there (search for an existing one based on the VDI type).
- Attaches the stats VDI if it did already exist, and copies the RRDs to the local file system (standard location in the filesystem; asks xcp-rrdd where to put them).
- Informs xcp-rrdd about the RRDs so that it will load the RRDs and add newly recorded data to them (needs a function like push_rrd_local for VM-level RRDs).
- Detaches stats VDI.
On PBD.unplug on the SR master, if the SR has the stats capability, xapi:
- Tells xcp-rrdd to archive the RRDs for the SR, which it will do to the local filesystem.
- Attaches the stats VDI, copies the RRDs into it, detaches VDI.

Periodic Archiving

Xapi’s periodic scheduler regularly triggers xcp-rrdd to archive the host and VM RRDs. It will need to do this for the SR ones as well. Furthermore, xapi will need to attach the stats VDI and copy the RRD archives into it (as on PBD.unplug).

Exporting

There will be a new handler for downloading an SR RRD:

http://<server>/sr_rrd?session_id=<SESSION HANDLE>&uuid=<SR UUID>

RRD updates for the host, VMs and SRs are handled by a single handler at /rrd_updates. Exactly what is returned will be determined by the parameters passed to this handler.

Whether the host RRD updates are returned is governed by the presence of host=true in the parameters. host=<anything else> or the absence of the host key will mean the host RRD is not returned.

Whether the VM RRD updates are returned is governed by the vm_uuid key in the URL parameters. vm_uuid=all will return RRD updates for all VM RRDs. vm_uuid=xxx will return the RRD updates for the VM with uuid xxx only. If vm_uuid is none (or any other string which is not a valid VM UUID) then the handler will return no VM RRD updates. If the vm_uuid key is absent, RRD updates for all VMs will be returned.

Whether the SR RRD updates are returned is governed by the sr_uuid key in the URL parameters. sr_uuid=all will return RRD updates for all SR RRDs. sr_uuid=xxx will return the RRD updates for the SR with uuid xxx only. If sr_uuid is none (or any other string which is not a valid SR UUID) then the handler will return no SR RRD updates. If the sr_uuid key is absent, no SR RRD updates will be returned.

It will be possible to mix and match these parameters; for example to return RRD updates for the host and all VMs, the URL to use would be:

http://<server>/rrd_updates?session_id=<SESSION HANDLE>&start=10258122541&host=true&vm_uuid=all&sr_uuid=none

Or, to return RRD updates for all SRs but nothing else, the URL to use would be:

http://<server>/rrd_updates?session_id=<SESSION HANDLE>&start=10258122541&host=false&vm_uuid=none&sr_uuid=all

While behaviour is defined if any of the keys host, vm_uuid and sr_uuid is missing, this is for backwards compatibility and it is recommended that clients specify each parameter explicitly.
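
The parameter rules above can be summarised in a short Python sketch. The function name and the shape of the returned dictionary are hypothetical, and Python's uuid module is used as a stand-in for whatever UUID validation the handler really performs.

```python
import uuid
from urllib.parse import parse_qs


def _is_uuid(value: str) -> bool:
    """Stand-in UUID check; the real handler's validation may differ."""
    try:
        uuid.UUID(value)
        return True
    except ValueError:
        return False


def select_rrd_updates(query: str) -> dict:
    """Decide which RRD updates a /rrd_updates query string asks for."""
    params = {k: v[0] for k, v in parse_qs(query, keep_blank_values=True).items()}

    def selector(key: str, default: str):
        value = params.get(key, default)
        if value == "all" or _is_uuid(value):
            return value
        return None  # "none" or any string that is not a valid UUID

    return {
        "host": params.get("host") == "true",  # only host=true selects the host
        "vm": selector("vm_uuid", "all"),      # absent key: all VMs
        "sr": selector("sr_uuid", "none"),     # absent key: no SRs
    }
```

For example, the query string of the first URL above selects the host and all VMs but no SRs.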

Database updating.

If the SR is presenting a data source called ‘physical_utilisation’, xapi will record this periodically in its database. In order to do this, xapi will fork a thread that, every n minutes (2 suggested, but open to suggestions here), will query the attached SRs, then query RRDD for the latest data source for these, and update the database.

The utilisation of VDIs will not be updated in this way until scalability worries for RRDs are addressed.

Xapi will cache whether it is SR master for every attached SR and only attempt to update if it is the SR master.

New APIs.

xcp-rrdd:

Get the filesystem location where sr rrds are archived: val sr_rrds_path : uid:string -> string
Archive the sr rrds to the filesystem: val archive_sr_rrd : sr_uuid:string -> unit
Load the sr rrds from the filesystem: val push_sr_rrd : sr_uuid:string -> unit

Design document
Revision	v3
Status	proposed

thin LVHD storage

LVHD is a block-based storage system built on top of Xapi and LVM. LVHD disks are represented as LVM LVs with vhd-format data inside. When a disk is snapshotted, the LVM LV is “deflated” to the minimum-possible size, just big enough to store the current vhd data. All other disks are stored “inflated” i.e. consuming the maximum amount of storage space. This proposal describes how we could add dynamic thin-provisioning to LVHD such that

disks only consume the space they need (plus an adjustable small overhead)
when a disk needs more space, the allocation can be done locally in the common-case; in particular there is no network RPC needed
when the resource pool master host has failed, allocations can still continue, up to some limit, allowing time for the master host to be recovered; in particular there is no need for very low HA timeouts.
we can (in future) support in-kernel block allocation through the device mapper dm-thin target.

The following diagram shows the “Allocation plane”:

Allocation plane

All VM disk writes are channelled through tapdisk which keeps track of the remaining reserved space within the device mapper device. When the free space drops below a “low-water mark”, tapdisk sends a message to a local per-SR daemon called local-allocator and requests more space.

The local-allocator maintains a free pool of blocks available for allocation locally (hence the name). It will pick some blocks and transactionally send the update to the xenvmd process running on the SRmaster via the shared ring (labelled ToLVM queue in the diagram) and update the device mapper tables locally.

There is one xenvmd process per SR on the SRmaster. xenvmd receives local allocations from all the host shared rings (labelled ToLVM queue in the diagram) and combines them together, appending them to a redo-log also on shared storage. When xenvmd notices that a host’s free space (represented in the metadata as another LV) is low it allocates new free blocks and pushes these to the host via another shared ring (labelled FromLVM queue in the diagram).

The xenvmd process maintains a cache of the current VG metadata for fast query and update. All updates are appended to the redo-log to ensure they operate in O(1) time. The redo log updates are periodically flushed to the primary LVM metadata.

Since the operations are stored in the redo-log and will only be removed after the real metadata has been written, the implication is that it is possible for the operations to be performed more than once. This will occur if the xenvmd process exits between flushing to the real metadata and acknowledging the operations as completed. For this to work as expected, every individual operation stored in the redo-log must be idempotent.

Note on running out of blocks

Note that, while the host has plenty of free blocks, local allocations should be fast. If the master fails and the local free pool starts running out and tapdisk asks for more blocks, then the local allocator won’t be able to provide them. tapdisk should start to slow I/O in order to give the local allocator more time. Eventually, if tapdisk runs out of space before the local allocator can satisfy the request, guest I/O will block. Note that Windows VMs will start to crash if guest I/O blocks for more than 70s. Linux VMs, whether PV or HVM, may suffer from the “block for more than 120 seconds” issue due to slow I/O: slow I/O during dirty page writeback/flush may cause memory starvation, which in turn blocks other userland processes or kernel threads.

The following diagram shows the control-plane:

control plane

When thin-provisioning is enabled we will be modifying the LVM metadata at an increased rate. We will cache the current metadata in the xenvmd process and funnel all queries through it, rather than “peeking” at the metadata on-disk. Note it will still be possible to peek at the on-disk metadata but it will be out-of-date. Peeking can still be used to query the PV state of the volume group.

The xenvm CLI uses a simple RPC interface to query the xenvmd process, tunnelled through xapi over the management network. The RPC interface can be used for

activating volumes locally: xenvm will query the LV segments and program device mapper
deactivating volumes locally
listing LVs, PVs etc

Note that current LVHD requires the management network for these control-plane functions.

When the SM backend wishes to query or update volume group metadata it should use the xenvm CLI while thin-provisioning is enabled.

The xenvmd process shall use a redo-log to ensure that metadata updates are persisted in constant time and flushed lazily to the regular metadata area.

Tunnelling through xapi will be done by POSTing to the localhost URI

/services/xenvmd/<SR uuid>

Xapi will then either proxy the request transparently to the SRmaster, or issue an HTTP-level redirect that the xenvm CLI would need to follow.

If the xenvmd process is not running on the host on which it should be, xapi will start it.

Components: roles and responsibilities

xenvmd:

one per plugged SRmaster PBD
owns the LVM metadata
provides a fast query/update API so we can (for example) create lots of LVs very fast
allocates free blocks to hosts when they are running low
receives block allocations from hosts and incorporates them in the LVM metadata
can safely flush all updates and downgrade to regular LVM

xenvm:

a CLI which talks the xenvmd protocol to query / update LVs
can be run on any host, calls (except “format” and “upgrade”) are forwarded by xapi
can “format” a LUN to prepare it for xenvmd
can “upgrade” a LUN to prepare it for xenvmd

local_allocator:

one per plugged PBD
exposes a simple interface to tapdisk for requesting more space
receives free block allocations via a queue on the shared disk from xenvmd
sends block allocations to xenvmd and updates the device mapper target locally

tapdisk:

monitors the free space inside LVs and requests more space when running out
slows down I/O when nearly out of space

xapi:

provides authenticated communication tunnels
ensures the xenvmd daemons are only running on the correct hosts.

SM:

writes the configuration file for xenvmd (though doesn’t start it)
has an on/off switch for thin-provisioning
can use either normal LVM or the xenvm CLI

membership_monitor

configures and manages the connections between xenvmd and the local_allocator

Queues on the shared disk

The local_allocator communicates with xenvmd via a pair of queues on the shared disk. Using the disk rather than the network means that VMs will continue to run even if the management network is not working. In particular

if the (management) network fails, VMs continue to run on SAN storage
if a host changes IP address, nothing needs to be reconfigured
if xapi fails, VMs continue to run.

Logical messages in the queues

The local_allocator needs to tell the xenvmd which blocks have been allocated to which guest LV. xenvmd needs to tell the local_allocator which blocks have become free. Since we are based on LVM, a “block” is an extent, and an “allocation” is a segment i.e. the placing of a physical extent at a logical extent in the logical volume.

The local_allocator needs to send a message with logical contents:

volume: a human-readable name of the LV
segments: a list of LVM segments which says “place physical extent x at logical extent y using a linear mapping”.

Note this message is idempotent.
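
To illustrate why the message is idempotent, here is a minimal Python sketch. The in-memory dictionary and the function name are hypothetical stand-ins for the LVM metadata held by xenvmd. Each segment is applied as an assignment rather than an append, so replaying the same message leaves the mapping unchanged.

```python
def apply_allocation(lv_maps: dict, volume: str, segments: list) -> None:
    """Apply 'place physical extent x at logical extent y' to an LV mapping.

    lv_maps maps an LV name to a {logical extent: physical extent} dictionary.
    """
    mapping = lv_maps.setdefault(volume, {})
    for physical, logical in segments:
        # Assignment, not append: applying the same segment twice is harmless.
        mapping[logical] = physical
```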

The xenvmd needs to send a message with logical contents:

extents: a list of physical extents which are free for the host to use

Although for internal housekeeping xenvmd will want to assign these physical extents to logical extents within the host’s free LV, the local_allocator doesn’t need to know the logical extents. It only needs to know the set of blocks which it is free to allocate.

Starting up the local_allocator

What happens when a local_allocator (re)starts, after a

process crash, respawn
host crash, reboot?

When the local_allocator starts up, there are 2 cases:

the host has just rebooted, there are no attached disks and no running VMs
the process has just crashed, there are attached disks and running VMs

Case 1 is uninteresting. In case 2 there may have been an allocation in progress when the process crashed and this must be completed. Therefore the operation is journalled in a local filesystem in a directory which is deliberately deleted on host reboot (Case 1). The allocation operation consists of:

pushing the allocation to xenvmd on the SRmaster
updating the device mapper

Note that both parts of the allocation operation are idempotent and hence the whole operation is idempotent. The journalling will guarantee it executes at-least-once.

When the local_allocator starts up it needs to discover the list of free blocks. Rather than have 2 code paths, it’s best to treat everything as if it is a cold start (i.e. no local caches already populated) and to ask the master to resync the free block list. The resync is performed by executing a “suspend” and “resume” of the free block queue, and requiring the remote allocator to:

pop all block allocations and incorporate these updates
send the complete set of free blocks “now” (i.e. while the queue is suspended) to the local allocator.

Starting xenvmd

xenvmd needs to know

the device containing the volume group
the hosts to “connect” to via the shared queues

The device containing the volume group should be written to a config file when the SR is plugged.

xenvmd does not remember which hosts it is listening to across crashes, restarts or master failovers. The membership_monitor will keep the xenvmd list in sync with the PBD.currently_attached fields.

Shutting down the local_allocator

The local_allocator should be able to crash at any time and recover afterwards. If the user requests a PBD.unplug we can perform a clean shutdown by:

signalling xenvmd to suspend the block allocation queue
arranging for the local_allocator to acknowledge the suspension and exit
when the xenvmd sees the acknowledgement, we know that the local_allocator is offline and it doesn’t need to poll the queue any more

Downgrading metadata

xenvmd can be terminated at any time and restarted, since all compound operations are journalled.

Downgrade is a special case of shutdown. To downgrade, we need to stop all hosts allocating and ensure all updates are flushed to the global LVM metadata. xenvmd can shut down by:

shutting down all local_allocators (see previous section)
flushing all outstanding block allocations to the LVM redo log
flushing the LVM redo log to the global LVM metadata

Queues as rings

We can use a simple ring protocol to represent the queues on the disk. Each queue will have a single consumer and single producer and reside within a single logical volume.

To make diagnostics simpler, we can require the ring to only support push and pop of whole messages i.e. there can be no partial reads or partial writes. This means that the producer and consumer pointers will always point to valid message boundaries.

One possible format used by the prototype is as follows:

sector 0: a magic string
sector 1: producer state
sector 2: consumer state
sector 3…: data

Within the producer state sector we can have:

octets 0-7: producer offset: a little-endian 64-bit integer
octet 8: 1 means “suspend acknowledged”; 0 otherwise

Within the consumer state sector we can have:

octets 0-7: consumer offset: a little-endian 64-bit integer
octet 8: 1 means “suspend requested”; 0 otherwise

The consumer and producer pointers point to message boundaries. Each message is prefixed with a 4 byte length and padded to the next 4-byte boundary.

To push a message onto the ring we need to

check whether the message is too big to ever fit: this is a permanent error
check whether the message is too big to fit given the current free space: this is a transient error
write the message into the ring
advance the producer pointer

To pop a message from the ring we need to

check whether there is unconsumed space: if not this is a transient error
read the message from the ring and process it
advance the consumer pointer
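
The push and pop steps above can be sketched in Python as follows. This is an in-memory illustration rather than the prototype's code: the Ring class, its method names and the two exception types are hypothetical, a bytearray stands in for the data sectors, and the little-endian length prefix is taken from the Ring protocols section of this document.

```python
import struct


class PermanentError(Exception):
    """The message can never fit in this ring."""


class TransientError(Exception):
    """Retry later: the ring is currently full (push) or empty (pop)."""


class Ring:
    def __init__(self, size: int):
        if size % 4 != 0:
            raise ValueError("data area must be a multiple of 4 bytes")
        self.data = bytearray(size)  # stands in for "sector 3 onwards"
        self.producer = 0            # free-running byte offsets
        self.consumer = 0

    def _write(self, offset: int, buf: bytes) -> None:
        for i, b in enumerate(buf):
            self.data[(offset + i) % len(self.data)] = b

    def _read(self, offset: int, n: int) -> bytes:
        return bytes(self.data[(offset + i) % len(self.data)] for i in range(n))

    def push(self, msg: bytes) -> None:
        # 4-byte length prefix, payload, padding to the next 4-byte boundary.
        needed = 4 + len(msg) + (-len(msg)) % 4
        if needed > len(self.data):
            raise PermanentError("message too big for this ring")
        if needed > len(self.data) - (self.producer - self.consumer):
            raise TransientError("not enough free space right now")
        self._write(self.producer, struct.pack("<I", len(msg)) + msg)
        self.producer += needed      # advance only after the data is written

    def pop(self) -> bytes:
        if self.consumer == self.producer:
            raise TransientError("nothing to consume")
        (length,) = struct.unpack("<I", self._read(self.consumer, 4))
        msg = self._read(self.consumer + 4, length)
        self.consumer += 4 + length + (-length) % 4
        return msg
```

Because only whole messages are pushed and popped, both pointers always sit on message boundaries, which is the property that makes diagnostics simpler.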

Journals as queues

When we journal an operation we want to guarantee to execute it never or at-least-once. We can re-use the queue implementation by pushing a description of the work item to the queue and waiting for the item to be popped, processed and finally consumed by advancing the consumer pointer. The journal code needs to check for unconsumed data during startup, and to process it before continuing.
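
A minimal sketch of the "never or at-least-once" guarantee in Python. To keep the example short it journals to a plain file of JSON lines instead of re-using the ring, so it illustrates the recovery logic only, not the on-disk format; the Journal class and its methods are hypothetical. The perform callback must be idempotent, since an operation may run again after a crash.

```python
import json
import os


class Journal:
    """Executes operations never or at-least-once, with replay on startup."""

    def __init__(self, path: str, perform):
        self.path = path
        self.perform = perform  # must be idempotent
        self.replay()           # process unconsumed items before continuing

    def replay(self) -> None:
        if not os.path.exists(self.path):
            return
        with open(self.path) as f:
            pending = [json.loads(line) for line in f if line.strip()]
        for op in pending:
            self.perform(op)
        os.remove(self.path)    # everything has now been consumed

    def execute(self, op) -> None:
        # Record the work item durably before acting on it.
        with open(self.path, "a") as f:
            f.write(json.dumps(op) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self.perform(op)
        os.remove(self.path)    # "advance the consumer pointer"
```

If the process dies before the record is written, the operation never happens; if it dies afterwards, the next startup replays it.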

Suspending and resuming queues

During startup (resync the free blocks) and shutdown (flush the allocations) we need to suspend and resume queues. The ring protocol can be extended to allow the consumer to suspend the ring by:

the consumer asserts the “suspend requested” bit
the producer push function checks the bit and writes “suspend acknowledged”
the producer also periodically polls the queue state and writes “suspend acknowledged” (to catch the case where no items are to be pushed)
after the producer has acknowledged it will guarantee to push no more items
when the consumer polls the producer’s state and spots the “suspend acknowledged”, it concludes that the queue is now suspended.

The key detail is that the handshake on the ring causes the two sides to synchronise and both agree that the ring is now suspended/resumed.
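
The handshake can be sketched in Python as two small classes sharing the flag bytes. All names here are hypothetical, and the resume path (the consumer clears its request and the producer clears its acknowledgement on the next poll) is an assumption, since the steps above only spell out suspension.

```python
class RingFlags:
    """The two flag bytes stored in the producer and consumer sectors."""

    def __init__(self):
        self.suspend = False      # "suspend requested", written by the consumer
        self.suspend_ack = False  # "suspend acknowledged", written by the producer


class Producer:
    def __init__(self, flags: RingFlags, queue: list):
        self.flags = flags
        self.queue = queue

    def poll(self) -> None:
        # Called periodically, and from push, so that a request is noticed
        # even when there is nothing to push.
        self.flags.suspend_ack = self.flags.suspend

    def push(self, item) -> bool:
        self.poll()
        if self.flags.suspend_ack:
            return False          # after acknowledging, push no more items
        self.queue.append(item)
        return True


class Consumer:
    def __init__(self, flags: RingFlags):
        self.flags = flags

    def request_suspend(self) -> None:
        self.flags.suspend = True

    def request_resume(self) -> None:
        self.flags.suspend = False  # assumed resume path

    def is_suspended(self) -> bool:
        # Suspended only once the producer has acknowledged the request.
        return self.flags.suspend and self.flags.suspend_ack
```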

Modelling the suspend/resume protocol

To check that the suspend/resume protocol works well enough to be used to resynchronise the free blocks list on a slave, a simple promela model was created. We model the queue state as 2 boolean flags:

bool suspend /* suspend requested */
bool suspend_ack /* suspend acknowledged */

and an abstract representation of the data within the ring:

/* the queue may have no data (none); a delta or a full sync.
   the full sync is performed immediately on resume. */
mtype = { sync delta none }
mtype inflight_data = none

There is a “producer” and a “consumer” process which run forever, exchanging data and suspending and resuming whenever they want. The special data item sync is only sent immediately after a resume and we check that we never desynchronise with asserts:

  :: (inflight_data != none) ->
    /* In steady state we receive deltas */
    assert (suspend_ack == false);
    assert (inflight_data == delta);
    inflight_data = none

i.e. when we are receiving data normally (outside of the suspend/resume code) we aren’t suspended and we expect deltas, not full syncs.

The model-checker spin verifies this property holds.

Interaction with HA

Consider what will happen if a host fails when HA is disabled:

if the host is a slave: the VMs running on the host will crash but no other host is affected.
if the host is a master: allocation requests from running VMs will continue provided enough free blocks are cached on the hosts. If a host eventually runs out of free blocks, then guest I/O will start to block and VMs may eventually crash.

Therefore we recommend that users enable HA and only disable it for short periods of time. Note that, unlike other thin-provisioning implementations, we will allow HA to be disabled.

Host-local LVs

When a host calls SMAPI sr_attach, it will use xenvm to tell xenvmd on the SRmaster to connect to the local_allocator on the host. The xenvmd daemon will create the volumes for queues and a volume to represent the “free blocks” which a host is allowed to allocate.

Monitoring

The xenvmd process should export RRD datasources over shared memory named

sr_<SR uuid>_<host uuid>_free: the number of free blocks in the local cache. It’s useful to look at this and verify that it doesn’t usually hit zero, since that’s when allocations will start to block. For this reason we should use the MIN consolidation function.
sr_<SR uuid>_<host uuid>_requests: a counter of the number of satisfied allocation requests. If this number is too high then the quantum of allocation should be increased. For this reason we should use the MAX consolidation function.
sr_<SR uuid>_<host uuid>_allocations: a counter of the number of bytes being allocated. If the allocation rate is too high compared with the number of free blocks divided by the HA timeout period then the SRmaster-allocator should be reconfigured to supply more blocks to the host.
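
The comparison described for the allocations datasource amounts to the following arithmetic, sketched in Python. The function name and parameters are hypothetical, and the 4 MiB block size is an assumption based on the extent size quoted elsewhere in this document.

```python
BLOCK_SIZE = 4 * 2**20  # assumed: one block is one 4 MiB LVM extent


def host_needs_more_blocks(free_blocks: int,
                           alloc_rate_bytes_per_s: float,
                           ha_timeout_s: float) -> bool:
    """True if the local free pool could run dry within one HA timeout."""
    # Highest allocation rate the cached free blocks can sustain while the
    # master is unavailable for one HA timeout period.
    sustainable_rate = free_blocks * BLOCK_SIZE / ha_timeout_s
    return alloc_rate_bytes_per_s > sustainable_rate
```

For example, with 100 free blocks and a 60 second HA timeout the host can sustain roughly 6.7 MiB/s, so a workload allocating 10 MiB/s would call for a larger supply of blocks.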

Modifications to tapdisk

TODO: to be updated by Germano

tapdisk will be modified to

on open: discover the current maximum size of the file/LV (for a file we assume there is no limit for now)
read a low-water mark value from a config file /etc/tapdisk3.conf
read a very-low-water mark value from a config file /etc/tapdisk3.conf
read a Unix domain socket path from a config file /etc/tapdisk3.conf
when there is less free space available than the low-water mark: connect to Unix domain socket and write an “extend” request
upon receiving the “extend” response, re-read the maximum size of the file/LV
when there is less free space available than the very-low-water mark: start to slow I/O responses and write a single ’error’ line to the log.

The extend request

TODO: to be updated by Germano

The request has the following format:

Octet offsets	Name	Description
0,1	tl	Total length (including this field) of message (in network byte order)
2	type	The value ‘0’ indicating an extend request
3	nl	The length of the LV name in octets, including NULL terminator
4,..,4+nl-1	name	The LV name
4+nl,..,12+nl-1	vdi_size	The virtual size of the logical VDI (in network byte order)
12+nl,..,20+nl-1	lv_size	The current size of the LV (in network byte order)
20+nl,..,28+nl-1	cur_size	The current size of the vhd metadata (in network byte order)
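
The layout in the table can be sketched with Python's struct module. The function names are hypothetical. The sketch assumes that the type value ‘0’ is the integer zero rather than the ASCII character, that the three sizes are unsigned 64-bit integers (each occupies 8 octets in the table), and that LV names are ASCII.

```python
import struct

EXTEND_REQUEST = 0  # assumed numeric encoding of the type value '0'


def pack_extend_request(lv_name: str, vdi_size: int,
                        lv_size: int, cur_size: int) -> bytes:
    name = lv_name.encode("ascii") + b"\x00"  # nl counts the terminator
    nl = len(name)
    if nl > 255:
        raise ValueError("LV name does not fit the one-octet length field")
    tl = 4 + nl + 3 * 8                       # header + name + three sizes
    # ">" selects network byte order with no padding between fields.
    return struct.pack(f">HBB{nl}sQQQ", tl, EXTEND_REQUEST, nl, name,
                       vdi_size, lv_size, cur_size)


def unpack_extend_request(msg: bytes):
    tl, msg_type, nl = struct.unpack_from(">HBB", msg, 0)
    if tl != len(msg) or msg_type != EXTEND_REQUEST:
        raise ValueError("not a well-formed extend request")
    name = msg[4:4 + nl].rstrip(b"\x00").decode("ascii")
    vdi_size, lv_size, cur_size = struct.unpack_from(">QQQ", msg, 4 + nl)
    return name, vdi_size, lv_size, cur_size
```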

The extend response

The response is a single byte value “0” which is a signal to re-examine the LV size. The request will block indefinitely until it succeeds. The request will block for a long time if

the SR has genuinely run out of space. The admin should observe the existing free space graphs/alerts and perform an SR resize.
the master has failed and HA is disabled. The admin should re-enable HA or fix the problem manually.

The local_allocator

There is one local_allocator process per plugged PBD. The process will be spawned by the SM sr_attach call, and shut down by the sr_detach call.

The local_allocator accepts the following configuration (via a config file):

socket: path to a local Unix domain socket. This is where the local_allocator listens for requests from tapdisk
allocation_quantum: number of megabytes to allocate to each tapdisk on request
local_journal: path to a block device or file used for local journalling. This should be deleted on reboot.
free_pool: name of the LV used to store the host’s free blocks
devices: list of local block devices containing the PVs
to_LVM: name of the LV containing the queue of block allocations sent to xenvmd
from_LVM: name of the LV containing the queue of messages sent from xenvmd. There are two types of messages:
1. Free blocks to put into the free pool
2. Cap requests to remove blocks from the free pool.

When the local_allocator process starts up it will read the host local journal and

re-execute any pending allocation requests from tapdisk
suspend and resume the from_LVM queue to trigger a full retransmit of free blocks from xenvmd

The procedure for handling an allocation request from tapdisk is:

if there aren’t enough free blocks in the free pool, wait polling the from_LVM queue
choose a range of blocks to assign to the tapdisk LV from the free LV
write the operation (i.e. exactly what we are about to do) to the journal. This ensures that it will be repeated if the allocator crashes and restarts. Note that, since the operation may be repeated multiple times, it must be idempotent.
push the block assignment to the toLVM queue
suspend the device mapper device
add/modify the device mapper target
resume the device mapper device
remove the operation from the local journal (i.e. there’s no need to repeat it now)
reply to tapdisk

Shutting down the local-allocator

The SM sr_detach called from PBD.unplug will use the xenvm CLI to request that xenvmd disconnects from a host. The procedure is:

SM calls xenvm disconnect host
xenvm sends an RPC to xenvmd tunnelled through xapi
xenvmd suspends the to_LVM queue
the local_allocator acknowledges the suspend and exits
xenvmd flushes all updates from the to_LVM queue and stops listening

xenvmd

xenvmd is a daemon running per SRmaster PBD, started in sr_attach and terminated in sr_detach. xenvmd has a config file containing:

socket: Unix domain socket where xenvmd listens for requests from xenvm tunnelled by xapi
host_allocation_quantum: number of megabytes to hand to a host at a time
host_low_water_mark: threshold below which we will hand blocks to a host
devices: local devices containing the PVs

xenvmd continually

peeks updates from all the to_LVM queues
calculates how much free space each host still has
if the size of a host’s free pool drops below some threshold:
- choose some free blocks
if the size of a host’s free pool goes above some threshold:
- request a cap of the host’s free pool
writes the change it is going to make to a journal stored in an LV
pops the updates from the to_LVM queues
pushes the updates to the from_LVM queues
pushes updates to the LVM redo-log
periodically flushes the LVM redo-log to the LVM metadata area

The membership monitor

The role of the membership monitor is to keep the list of xenvmd connections in sync with the PBD.currently_attached fields.

We shall

install a host-pre-declare-dead script to use xenvm to send an RPC to xenvmd to forcibly flush (without acknowledgement) the to_LVM queue and destroy the LVs.
modify XenAPI Host.declare_dead to call host-pre-declare-dead before the VMs are unlocked
add a host-pre-forget hook type which will be called just before a Host is forgotten
install a host-pre-forget script to use xenvm to call xenvmd to destroy the host’s local LVs

Modifications to LVHD SR

sr_attach should:
- if an SRmaster, update the MGT major version number to prevent old hosts from plugging in the SR
- Write the xenvmd configuration file (on all hosts, not just SRmaster)
- spawn local_allocator
sr_detach should:
- call xenvm to request the shutdown of local_allocator
vdi_deactivate should:
- call xenvm to request the flushing of all the to_LVM queues to the redo log
vdi_activate should:
- if necessary, call xenvm to deflate the LV to the minimum size (with some slack)

Note that it is possible to attach and detach the individual hosts in any order but when the SRmaster is unplugged then there will be no “refilling” of the host local free LVs; it will behave as if the master host has failed.

Modifications to xapi

Xapi needs to learn how to forward xenvm connections to the SR master.
Xapi needs to start and stop xenvmd at the appropriate times
We must disable unplugging the PBDs for shared SRs on the pool master if any slave still has its PBD plugged. This actually fixes an issue that exists today: LVHD SRs require the master PBD to be plugged to do many operations.
Xapi should provide a mechanism by which the xenvmd process can be killed once the last PBD for an SR has been unplugged.

Enabling thin provisioning

Thin provisioning will be automatically enabled on upgrade. When the SRmaster plugs in its PBD, the MGT major version number will be bumped to prevent old hosts from plugging in the SR and getting confused. When a VDI is activated, it will be deflated to the new low size.

Disabling thin provisioning

We shall make a tool which will

allow someone to downgrade their pool after enabling thin provisioning
allow developers to test the upgrade logic without fully downgrading their hosts

The tool will

check if there is enough space to fully inflate all non-snapshot leaves
unplug all the non-SRmaster PBDs
unplug the SRmaster PBD. As a side-effect all pending LVM updates will be written to the LVM metadata.
modify the MGT volume to have the lower metadata version
fully inflate all non-snapshot leaves

Walk-through: upgrade

Rolling upgrade should work in the usual way. As soon as the pool master has been upgraded, hosts will be able to use thin provisioning when new VDIs are attached. A VM suspend/resume/reboot or migrate will be needed to turn on thin provisioning for existing running VMs.

Walk-through: downgrade

A pool may be safely downgraded to a previous version without thin provisioning provided that the downgrade tool is run. If the tool hasn’t run then the old pool will refuse to attach the SR because the metadata has been upgraded.

Walk-through: after a host failure

If HA is enabled:

xhad elects a new master if necessary
Xapi on the master will start xenvmd processes for shared thin-lvhd SRs
the xhad tells Xapi which hosts are alive and which have failed.
Xapi runs the host-pre-declare-dead scripts for every failed host
the host-pre-declare-dead tells xenvmd to flush the to_LVM updates
Xapi unlocks the VMs and restarts them on new hosts.

If HA is not enabled:

The admin should verify the host is definitely dead
If the dead host was the master, a new master must be designated. This will start the xenvmd processes for the shared thin-lvhd SRs.
the admin must tell Xapi which hosts have failed with xe host-declare-dead
Xapi runs the host-pre-declare-dead scripts for every failed host
the host-pre-declare-dead tells xenvmd to flush the to_LVM updates
Xapi unlocks the VMs
the admin may now restart the VMs on new hosts.

Walk-through: co-operative master transition

The admin calls Pool.designate_new_master. This initiates a two-phase commit of the new master. As part of this, the slaves will restart, and on restart each host’s xapi will kill any xenvmd that should only run on the pool master. The new designated master will then restart itself and start up the xenvmd process on itself.

Future use of dm-thin?

Dm-thin also uses 2 local LVs: one for the “thin pool” and one for the metadata. After replaying our journal we could potentially delete our host local LVs and switch over to dm-thin.

Summary of the impact on the admin

If the VM workload performs a lot of disk allocation, then the admin should enable HA.
The admin must not downgrade the pool without first cleanly detaching the storage.
Extra metadata is needed to track thin provisioning, reducing the amount of space available for user volumes.
If an SR is completely full then it will not be possible to enable thin provisioning.
There will be more fragmentation, but the extent size is large (4MiB) so it shouldn’t be too bad.

Ring protocols

Each ring consists of 3 sectors of metadata followed by the data area. The contents of the first 3 sectors are:

Sector, Octet offsets	Name	Type	Description
0,0-30	signature	string	Signature (“mirage shared-block-device 1.0”)
1,0-7	producer	uint64	Pointer to the end of data written by the producer
1,8	suspend_ack	uint8	Suspend acknowledgement byte
2,0-7	consumer	uint64	Pointer to the end of data read by the consumer
2,8	suspend	uint8	Suspend request byte

Note. producer and consumer pointers are stored in little endian format.

The pointers are free running byte offsets rounded up to the next 4-byte boundary, and the position of the actual data is found by finding the remainder when dividing by the size of the data area. The producer pointer points to the first free byte, and the consumer pointer points to the byte after the last data consumed. The actual payload is preceded by a 4-byte length field, stored in little endian format. When writing a 1 byte payload, the next value of the producer pointer will therefore be 8 bytes on from the previous - 4 for the length (which will contain [0x01,0x00,0x00,0x00]), 1 byte for the payload, and 3 bytes padding.
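The pointer arithmetic described above can be sketched as follows. This is a minimal illustration, not the actual ring implementation: the helper names (round_up_4, push) and the 64-byte data area are hypothetical, checks against the consumer pointer are omitted, and byte-wise wrapping of an item across the end of the data area is an assumption.

```python
import struct


def round_up_4(n: int) -> int:
    """Round n up to the next multiple of 4."""
    return (n + 3) & ~3


def push(data_area: bytearray, producer: int, payload: bytes) -> int:
    """Write one length-prefixed item at the producer pointer.

    Returns the new free-running producer pointer. Free-space checks
    against the consumer pointer are omitted in this sketch.
    """
    size = len(data_area)
    # 4-byte little-endian length field followed by the payload.
    record = struct.pack("<I", len(payload)) + payload
    for i, byte in enumerate(record):
        # The free-running pointer is reduced modulo the data area size
        # to find the actual position (assumed to wrap byte by byte).
        data_area[(producer + i) % size] = byte
    # The pointer advances to the next 4-byte boundary.
    return producer + round_up_4(len(record))


area = bytearray(64)
new_producer = push(area, 0, b"A")  # 4 (length) + 1 (payload) + 3 (padding)
```

With a 1-byte payload the pointer moves on by 8 bytes, matching the example in the text.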

A ring is suspended and resumed by the consumer. To suspend, the consumer first checks that the producer and consumer agree on the current suspend status. If they do not, the ring cannot be suspended. The consumer then writes the byte 0x02 into byte 8 of sector 2. The consumer must then wait for the producer to acknowledge the suspend, which it will do by writing 0x02 into byte 8 of sector 1.
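The suspend handshake can be sketched on an in-memory copy of the three metadata sectors. The 512-byte sector size and all function names are assumptions made for illustration; only the byte offsets and the value 0x02 come from the description above.

```python
SECTOR_SIZE = 512                          # assumed sector size
SUSPEND_ACK_OFFSET = 1 * SECTOR_SIZE + 8   # sector 1, byte 8 (written by producer)
SUSPEND_OFFSET = 2 * SECTOR_SIZE + 8       # sector 2, byte 8 (written by consumer)
SUSPENDED = 0x02


def request_suspend(ring: bytearray) -> bool:
    """Consumer side: request a suspend. Returns False if it cannot be requested."""
    if ring[SUSPEND_OFFSET] != ring[SUSPEND_ACK_OFFSET]:
        # Producer and consumer disagree on the current suspend status.
        return False
    ring[SUSPEND_OFFSET] = SUSPENDED
    return True


def acknowledge_suspend(ring: bytearray) -> None:
    """Producer side: acknowledge a pending suspend request."""
    if ring[SUSPEND_OFFSET] == SUSPENDED:
        ring[SUSPEND_ACK_OFFSET] = SUSPENDED


def suspend_acknowledged(ring: bytearray) -> bool:
    """Consumer side: poll this until it returns True."""
    return ring[SUSPEND_ACK_OFFSET] == SUSPENDED
```

In a real deployment the two sides run in different processes and the bytes live on shared storage, so the consumer would re-read sector 1 in a loop rather than inspect a local buffer.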

The FromLVM ring

Two different types of message can be sent on the FromLVM ring.

The FreeAllocation message contains the blocks for the free pool. Example message:

(FreeAllocation((blocks((pv0(12326 12249))(pv0(11 1))))(generation 2)))

Pretty-printed:

(FreeAllocation
    (
        (blocks
            (
                (pv0(12326 12249))
                (pv0(11 1))
            )
        )
        (generation 2)
    )
)

This is a message to add two new sets of extents to the free pool: a span of length 12249 extents starting at extent 12326, and a span of length 1 starting from extent 11, both within the physical volume ‘pv0’. The generation count of this message is ‘2’. The semantics of the generation is that the local allocator must record the generation of the last message it received since the FromLVM ring was resumed, and ignore any message with a generation less than or equal to that of the last message received.
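The generation rule can be expressed as a small filter. This is a sketch of the stated semantics only; the class and method names are hypothetical and do not correspond to the local allocator's code.

```python
class GenerationFilter:
    """Tracks the last generation seen since the FromLVM ring was resumed."""

    def __init__(self):
        self.last = None  # no message received since the last resume

    def on_resume(self):
        """Forget the recorded generation when the ring is resumed."""
        self.last = None

    def accept(self, generation: int) -> bool:
        """Return True if the message should be processed, False if ignored."""
        if self.last is not None and generation <= self.last:
            return False
        self.last = generation
        return True


flt = GenerationFilter()
decisions = [flt.accept(g) for g in (1, 2, 2, 1, 3)]
```

Here the repeated generation 2 and the stale generation 1 are ignored, while 3 is accepted.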

The CapRequest message contains a request to cap the free pool at a maximum size. Example message:

(CapRequest((cap 6127)(name host1-freeme)))

Pretty-printed:

(CapRequest
    (
        (cap 6127)
        (name host1-freeme)
    )
)

This is a request to cap the free pool at a maximum size of 6127 extents. The ’name’ parameter reflects the name of the LV into which the extents should be transferred.
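To make the two FromLVM message shapes concrete, the following sketch parses the example messages into Python values. It assumes that atoms never contain whitespace, parentheses or quoting (true of the examples shown, but a simplification of the general S-expression syntax); parse_sexp and decode_from_lvm are hypothetical helper names.

```python
import re

TOKEN = re.compile(r"[()]|[^\s()]+")


def parse_sexp(text: str):
    """Parse one S-expression into nested lists of strings."""
    tokens = TOKEN.findall(text)
    pos = 0

    def read():
        nonlocal pos
        token = tokens[pos]
        pos += 1
        if token == "(":
            items = []
            while tokens[pos] != ")":
                items.append(read())
            pos += 1  # skip the closing parenthesis
            return items
        return token

    return read()


def decode_from_lvm(text: str):
    """Decode a FromLVM message into (tag, fields)."""
    tag, fields = parse_sexp(text)
    record = {key: value for key, value in fields}
    if tag == "FreeAllocation":
        # Each block is (physical volume, start extent, length in extents).
        blocks = [(pv, int(start), int(length))
                  for pv, (start, length) in record["blocks"]]
        return tag, {"blocks": blocks, "generation": int(record["generation"])}
    if tag == "CapRequest":
        return tag, {"cap": int(record["cap"]), "name": record["name"]}
    raise ValueError(f"unknown FromLVM message: {tag}")


free = decode_from_lvm(
    "(FreeAllocation((blocks((pv0(12326 12249))(pv0(11 1))))(generation 2)))")
cap = decode_from_lvm("(CapRequest((cap 6127)(name host1-freeme)))")
```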

The ToLVM Ring

The ToLVM ring only contains 1 type of message. Example:

((volume test5)(segments(((start_extent 1)(extent_count 32)(cls(Linear((name pv0)(start_extent 12328))))))))

Pretty-printed:

(
    (volume test5)
    (segments
        (
            (
                (start_extent 1)
                (extent_count 32)
                (cls
                    (Linear
                        (
                            (name pv0)
                            (start_extent 12328)
                        )
                    )
                )
            )
        )
    )
)

This message is extending an LV named ’test5’ by giving it 32 extents starting at extent 1, coming from PV ‘pv0’ starting at extent 12328. The ‘cls’ field should always be ‘Linear’ - this is the only acceptable value.
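As an illustration of the message layout, the sketch below builds the example ToLVM message from its parts and prints it in the compact form shown above. The function names are hypothetical, and the spacing rule (a space only between two adjacent atoms) is inferred from the example rather than taken from the implementation.

```python
def to_sexp(node) -> str:
    """Render nested lists of strings in compact S-expression form."""
    if isinstance(node, str):
        return node
    out = []
    previous_was_atom = False
    for item in node:
        is_atom = isinstance(item, str)
        if is_atom and previous_was_atom:
            out.append(" ")  # only adjacent atoms need a separator
        out.append(to_sexp(item))
        previous_was_atom = is_atom
    return "(" + "".join(out) + ")"


def extend_message(volume, start_extent, extent_count, pv, pv_start_extent) -> str:
    """Build a ToLVM message extending `volume` with one Linear segment."""
    segment = [
        ["start_extent", str(start_extent)],
        ["extent_count", str(extent_count)],
        ["cls", ["Linear", [["name", pv],
                            ["start_extent", str(pv_start_extent)]]]],
    ]
    return to_sexp([["volume", volume], ["segments", [segment]]])


message = extend_message("test5", 1, 32, "pv0", 12328)
```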

Cap requests

Xenvmd will try to keep the free pools of the hosts within a range set as a fraction of free space. There are 3 parameters adjustable via the config file:

low_water_mark_factor
medium_water_mark_factor
high_water_mark_factor

These three are all numbers between 0 and 1. Xenvmd will sum the free size and the sizes of all hosts’ free pools to find the total effective free size in the VG, F. It will then subtract the sizes of any pending desired space from in-flight create or resize calls, s. The result is divided by the number of hosts connected, n, and multiplied by each of the three factors above to find the absolute values of the high, medium and low watermarks.

{high, medium, low} * (F - s) / n

When xenvmd notices that a host’s free pool size has dropped below the low watermark, it will be topped up such that the size is equal to the medium watermark. If xenvmd notices that a host’s free pool size is above the high watermark, it will issue a ‘cap request’ to the host’s local allocator, which will then respond by allocating from its free pool into the fake LV, which xenvmd will then delete as soon as it gets the update.
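The watermark calculation and the resulting decisions can be sketched as below. The function names, the example numbers and the shape of the returned actions are hypothetical; the formula and the two thresholds follow the description above.

```python
def watermarks(vg_free, host_pools, pending, low_factor, medium_factor, high_factor):
    """Return (low, medium, high) watermarks in extents.

    vg_free:    free size of the VG outside the hosts' free pools
    host_pools: mapping of host name to free pool size
    pending:    space reserved by in-flight create or resize calls (s)
    """
    n = len(host_pools)
    effective_free = vg_free + sum(host_pools.values())  # F
    per_host = (effective_free - pending) / n            # (F - s) / n
    return (low_factor * per_host,
            medium_factor * per_host,
            high_factor * per_host)


def plan(host_pools, low, medium, high):
    """Decide what to do for each host's free pool."""
    actions = {}
    for host, size in host_pools.items():
        if size < low:
            # Top up so that the pool size equals the medium watermark.
            actions[host] = ("top_up", medium - size)
        elif size > high:
            # Ask the host's local allocator to give extents back.
            actions[host] = ("cap_request", None)
    return actions


pools = {"host1": 50, "host2": 400}
low, medium, high = watermarks(600, pools, 50, 0.25, 0.5, 0.75)
actions = plan(pools, low, medium, high)
```

With these numbers F is 1050, s is 50 and n is 2, so the watermarks are 125, 250 and 375 extents: host1 is topped up by 200 extents and host2 receives a cap request.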

Xenvmd keeps track of the last update it has sent to the local allocator, and will not resend the same request twice, unless it is restarted.

Design document
Revision	v2
Status	released (22.6.0)

TLS verification for intra-pool communications

Overview

XenServer has used TLS-encrypted communications between xapi daemons in a pool since its first release. However, it does not use TLS certificates to authenticate the servers it connects to. This gives attackers an opportunity to impersonate servers when a pool’s management network is compromised.

In order to enable certificate verification, certificate exchange as well as proper set up to trust them must be provided by xapi. This is currently done by allowing users to generate, sign and install the certificates themselves; and then enable the Common Criteria mode. This requires a CA and has a high barrier of entry.

Using the same certificates for intra-pool communication creates friction between what the user needs and what the host needs. Instead of trying to reconcile these two uses with one set of certificates, each host will serve two certificates: one for API calls from external clients, which is the one that can be changed by the users; and one that is used for intra-pool communications. The TLS server in the host can select which certificate to serve depending on the service name the client requests when opening a TLS connection. This mechanism is called Server Name Indication, or SNI for short.

Last but not least the update bearing these changes must not disrupt pool operations while or after being applied.

Glossary

Term	Meaning
SNI	Server Name Indication. This TLS protocol extension allows a server to select a certificate during the initial TLS handshake depending on a client-provided name. This usually allows a single reverse-proxy to serve several HTTPS websites.
Host certificate	Certificate that a host sends clients when the latter initiate a connection with the former. The clients may close the connection depending on the properties of this certificate and whether they have decided to trust it previously.
Trusted certificate	Certificate that a computer uses to verify whether a host certificate is valid. If the host certificate’s chain of trust does not include a trusted certificate it will be considered invalid.
Default Certificate	XenServer hosts present this certificate to clients which do not request an SNI. Users are allowed to install their own custom certificate.
Pool Certificate	XenServer hosts present this certificate to clients which request xapi:pool as the SNI. It is used for host-to-host communications.
Common Criteria	Common Criteria for Information Technology Security Evaluation is a certification on computer security.

Certificates and Identity management

Currently XenServer hosts generate self-signed certificates with the IP or FQDN as their subjects. Users may also choose to install certificates. When installing these certificates only the cryptographic algorithms used to generate the certificates (private key and hash) are validated and no properties about them are required.

This means that using user-installed certificates for intra-pool communication may prove difficult as restrictions regarding FQDN and chain validation need to be ensured before enabling TLS certificate checking or the pool communications will break down.

Instead, a different certificate is used only for pool communication. This decouples whatever requirements users might have for the certificates they install from the requirements needed for secure pool communication. This has several benefits:

Frees the pool from ensuring a sound hostname resolution on the internal communications.
Allows the pool to rotate the certificates when it deems necessary. (in particular expiration, or forced invalidation)
Hosts never share a host certificate, and their private keys never get transmitted.

In general, the project is able to more safely change the parameters of intra-pool communication without disrupting how users use custom certificates.

To be able to establish trust in a pool, hosts must distribute the certificates to the rest of the pool members. Once that is done servers can verify whether they are connecting to another host in the pool by comparing the server certificate with the certificates in the trust root. Certificate pinning is available and would allow more stringent checks, but it doesn’t seem a necessity: hosts in a pool already share a secret that allows them to have full control of the pool.

SNI will be used to select a host certificate depending on whether the connection is intra-pool or comes from API clients. SNI allows clients to ask for a service when establishing a TLS connection, which in turn allows the server to choose the certificate it offers when negotiating the connection with the client. The hosts will exploit this to request a particular service when they establish a connection with other hosts in the pool. When initiating a connection to another host in the pool, a server will create requests for TLS connections with the server_name xapi:pool and the name_type DNS. This goes against RFC 6066, as this server_name is not resolvable. It still works because we control the implementation in both peers of the connection and can follow the same convention.
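The client side of this convention can be illustrated with Python's standard ssl module. This is only a sketch: in xapi the TLS client is stunnel, not Python, and the function names and the default port 443 are assumptions. Hostname checking is switched off because the Pool certificates carry no hostname or IP claims, so trust rests solely on the certificate being present in the pool bundle.

```python
import socket
import ssl

POOL_SNI = "xapi:pool"
POOL_BUNDLE = "/etc/stunnel/xapi-pool-ca-bundle.pem"


def pool_client_context(ca_bundle=None) -> ssl.SSLContext:
    """Build a client context that verifies peers against the pool bundle."""
    ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
    # Pool certificates only identify the host UUID, so the name requested
    # through SNI cannot be matched against the certificate.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_REQUIRED
    if ca_bundle is not None:
        ctx.load_verify_locations(cafile=ca_bundle)
    return ctx


def connect_to_pool_member(address: str, ctx: ssl.SSLContext, port: int = 443):
    """Open a TLS connection that requests the pool certificate through SNI."""
    sock = socket.create_connection((address, port))
    return ctx.wrap_socket(sock, server_hostname=POOL_SNI)


context = pool_client_context()  # pass POOL_BUNDLE on a real host
```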

In addition, connections to the WLB appliance will continue to be validated using the current scheme of user-installed CA certificates. This means that hosts connecting to the appliance will need a special case to only trust user-installed certificates when establishing the connection. Conversely, pool connections will ignore these certificates.

Name	Filesystem location	User-configurable	Used for
Host Default	/etc/xensource/xapi-ssl.pem	yes (using API)	Hosts serve it to normal API clients
Host Pool	/etc/xensource/xapi-pool-tls.pem	no	Hosts serve to clients requesting “xapi:pool” as the SNI
Trusted Default	/etc/stunnel/certs/	yes (using API)	Certificates that users can install for trusting appliances
Trusted Pool	/etc/stunnel/certs-pool/	no	Certificates that are managed by the pool for host-to-host communications
Default Bundle	/etc/stunnel/xapi-stunnel-ca-bundle.pem	no	Bundle of certificates that hosts use to verify appliances (in particular WLB), this is kept in sync with “Trusted Default”
Pool Bundle	/etc/stunnel/xapi-pool-ca-bundle.pem	no	Bundle of certificates that hosts use to verify other hosts on pool communications, this is kept in sync with “Trusted Pool”

Cryptography of certificates

The certificates until now have been signed using sha256WithRSAEncryption:

Pre-8.0 releases use 1024-bit RSA keys.
8.0, 8.1 and 8.2 use 2048-bit RSA keys.

The Default Certificates served to API clients will continue to use sha256WithRSAEncryption with 2048-bit RSA keys. The Pool certificates will use the same algorithms for consistency.

The self-signed certificates until now have used a mix of IP and hostname claims:

All released versions:
- Subject and issuer have CN FQDN if the hostname is different from localhost, or CN management IP
- Subject Alternate Names extension contains all the domain names as DNS names
Next release:
- Subject and issuer have CN management IP
- SAN extension contains all domain names as DNS names and the management IP as IP

The Pool certificates do not contain claims about IPs nor hostnames as this may change during runtime and depending on their validity may make pool communication more brittle. Instead the only claim they have is that their Issuer and their Subject are CN Host UUID, along with a serial number.

Self-signed certificates produced until now have had validity periods of 3650 days (~10 years). The Pool certificates will have the same validity period.

Server Components

HTTPS Connections between hosts usually involve the xapi daemons and stunnel processes:

When a xapi daemon needs to initiate a connection with another host it starts an HTTP connection with a local stunnel process.
The stunnel processes wrap http connections inside a TLS connection, allowing HTTPS to be used when hosts communicate

This means that stunnel needs to be set up correctly to verify certificates when connecting to other hosts. Some aspects like CA certificates are already managed, but certificate pinning is not.

Use Cases

There are several use cases that need to be modified in order to correctly manage trust between hosts.

Opening a connection with a pool host

This is the main use case for the feature; the rest of the use cases that need changes are modified to support this one. Currently a XenServer host connecting to another host within the pool does not try to authenticate the receiving server when opening a TLS connection. (The receiving server authenticates the originating server by xapi authentication, see below.)

Stunnel will be configured to verify the peer certificate against the CA certificates that are present in the host. The CA certificates must be correctly set up when a host joins the pool to correctly establish trust.

The previous behaviour for WLB must be kept as the WLB connection must be checked against the user-friendly CA certificates.

Receiving an incoming TLS connection

All incoming connections authenticate the client using credentials (username and password, or the pool secret). This does not need the addition of certificates.

The hosts must present the certificate file to incoming connections so the client can authenticate them. This is already managed by xapi, which configures stunnel to present the configured host certificate. The configuration has to be changed so that stunnel responds to SNI requests containing the string xapi:pool by serving the internal certificate instead of the client-installed one.
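The selection rule amounts to a lookup on the requested server name. The sketch below uses the file locations listed earlier in this document; the function name is hypothetical, and in practice the selection is done by stunnel through its configuration rather than by code like this.

```python
from typing import Optional

DEFAULT_CERT = "/etc/xensource/xapi-ssl.pem"     # served to normal API clients
POOL_CERT = "/etc/xensource/xapi-pool-tls.pem"   # served for host-to-host traffic
POOL_SNI = "xapi:pool"


def certificate_for(server_name: Optional[str]) -> str:
    """Pick the certificate file for an incoming TLS connection.

    server_name is None when the client did not send an SNI.
    """
    if server_name == POOL_SNI:
        return POOL_CERT
    return DEFAULT_CERT
```

Any name other than xapi:pool, including no name at all, falls back to the Default Certificate, so existing API clients see no change.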

U1. Host Installation

On xapi startup an additional certificate is created now for pool operations. It’s added to the trusted pool certificates. The certificate’s only claim is the host’s UUID. No IP nor hostname information is kept as the clients only check for the certificate presence in the trust root.

U2. Pool Join

This use-case is delicate as it is the point where trust is established between hosts. This is done with a call from the joiner to the pool coordinator where the certificate of the coordinator is not verified. In this call the joiner transmits its certificate to the coordinator and the coordinator returns a list of the pool members’ UUIDs and certificates. This means that the policy used is trust on first use.

To deal with parallel pool joins, hosts download all the Pool certificates in the pool from the coordinator after all restarts.

The connection is initiated by a client, just like before. There is no change in the API, as all the information needed to start the join is already provided (pool username and password, IP of coordinator).

sequenceDiagram
participant clnt as Client
participant join as Joiner
participant coor as Coordinator
participant memb as Member
clnt->>join: pool.join coordinator_ip coordinator_username coordinator_password
join->>coor:login_with_password coordinator_ip coordinator_username coordinator_password

Note over join: pre_join_checks
join->>join: remote_pool_has_tls_enabled = self_pool_has_tls_enabled
alt are different
Note over join: interrupt join, raise error
end
Note right of join: certificate distribution
coor-->>join:
join->>coor: pool.internal_certificate_list_content
coor-->>join:

join->>coor: pool.upload_identity_host_certificate joiner_certificate uuid
coor->>memb: pool.internal_certificates_sync
memb-->>coor:

loop for every <user CA certificate> in Joiner
join->>coor: Pool.install_ca_certificate <user CA certificate>
coor-->>join:
end

loop for every <user CRL> in Joiner
join->>coor: Pool.install_crl <user CRL>
coor-->>join:
end

join->>coor: host.add joiner
coor-->>join:

join->>join: restart_as_slave
join->>coor: pool.user_certificates_sync
join->>coor: host.copy_primary_host_certs

U3. Pool Eject

During pool eject the pool must remove the host certificate of the ejected member from the internal trust root. This must be done by the xapi daemon of the coordinator.

The ejected member will recreate both server certificates to replicate a new installation. This can be triggered by deleting the certificates and their private keys in the host before rebooting: the current boot scripts automatically generate a new self-signed certificate if the file is not present. Additionally, both the user and the internal trust roots will be cleared before rebooting as well.

U4. Pool Upgrade

When a pool has finished upgrading to the version with certificate checking, the database reflects that the feature is turned off. This is done as part of the database upgrade procedure in xen-api. The internal certificate is created on restart and added to the internal trusted certificates directory. The distribution of certificates happens afterwards, when TLS verification is turned on.

U5. Host certificate state inspection

In order to give information about the validity and useful information of installed user-facing certificates to API clients as well as the certificates used for internal purposes, 2 fields are added to certificate records in xapi’s datamodel and database:

type: indicates which of the 3 kinds of certificate the record describes: a user-installed trusted CA certificate, a server certificate served to clients that do not use SNI, or a server certificate served when the SNI xapi:pool is used. The exact values are ca, host and host-internal, respectively.
name: the human-readable name given by the user. This field is only present on trusted CA certificates and allows the pool operators to better recognise the certificates.

Additionally, now the _host field contains a null reference if the certificate is a corporate CA (a ca certificate).

The fields will be exposed in the CLI whenever a certificate record is listed. This requires xapi-cli-server to be modified to show the new fields.

U6. Migrating a VM to another pool

To enable a frictionless migration when pools have TLS verification enabled, the host certificate of the host receiving the VM is sent to the sender. This is done by adding the certificate of the receiving host, as well as that of its pool coordinator, to the return value of the migrate_receive function. The sender can then add the certificate to the folder of CA certificates that stunnel uses to verify the server in a TLS connection. When the transaction finishes, whether it fails or succeeds, the CA certificate is deleted.

The certificate is stored in a temporary location so xapi can clean up the file when it starts up, in case the host fences or power cycles while the migration is in progress.

Xapi invokes sparse_dd with the filename of the correct trusted bundle as a parameter so it can verify the vhd-server running on the other host.

Xapi also invokes xcp-rrdd to migrate the VM metrics. xcp-rrdd is passed the 2 certificates to verify the remote hosts when sending the metrics.

Clients should not be aware of this change and require no change.

Xapi-cli-server, the server of xe embedded into xapi, connects to the remote coordinator using TLS to be able to initiate the migration. Currently no verification is done. A certificate is required to initiate the connection to verify the remote server.

In u6.3 and u6.4 no changes seem necessary.

U7. Change a host’s name

The Pool certificates do not depend on hostnames. Changing the hostnames does not affect TLS certificate verification in a pool.

U8. Installing a certificate (corporate CA)

Installation of a corporate CA can be done with the current API. Certificates are added to the database as CA certificates.

U9. Resetting a certificate (to self-signed certificate)

This needs a reimplementation of the current API to reset the host certificate, this time allowing the operation to happen when the host is not in emergency mode and to be done remotely.

U10. Enabling certificate verification

A new API call is introduced to enable TLS certificate verification: Pool.enable_tls_verification. This is used by the CLI command pool-enable-tls-verification. The call causes the coordinator of the pool to install the Pool certificates of all the members in its internal trust root. It then calls the API on each member to install all of these certificates. After this public key exchange is done, TLS certificate verification is enabled on the members, with the coordinator being the last to enable it.

When there are issues that block enabling the feature, the call returns an error specific to that problem:

HA must not be enabled, as it can interrupt the procedure when certificates are distributed
Pool operations that can disrupt the certificate exchange block this operation: These operations are listed in here
There was an issue with the certificate exchange in the pool.

The coordinator enables verification last to ensure that, if there is any issue enabling it, the coordinator host can still connect to members and roll back the setting.
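The ordering argument can be illustrated with a short sketch. The enable and disable callables stand in for whatever per-host mechanism xapi uses and are purely hypothetical, as is the exact rollback behaviour; the only property taken from the text is that members are switched first and the coordinator last.

```python
def enable_tls_verification(coordinator, members, enable, disable):
    """Enable verification on members first and on the coordinator last.

    enable(host) is expected to raise an exception on failure. If any step
    fails, hosts that were already switched are rolled back and the error
    is re-raised.
    """
    enabled = []
    try:
        for host in members:
            enable(host)
            enabled.append(host)
        # The coordinator keeps unverified connections until every member
        # has succeeded, so it can still reach them to roll back.
        enable(coordinator)
    except Exception:
        for host in reversed(enabled):
            disable(host)
        raise


log = []
enable_tls_verification(
    "coordinator", ["member1", "member2"],
    enable=lambda host: log.append(("enable", host)),
    disable=lambda host: log.append(("disable", host)),
)
```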

A new field is added to the pool: tls_verification_enabled. This enables clients to query whether TLS verification is enabled.

U11. Disabling certificate verification

A new emergency command is added: emergency-host-disable-tls-verification. This command disables TLS verification for the xapi daemon in a host. This allows the host to communicate with other hosts in the pool.

After that, the admin can regenerate the certificates using the new host-refresh-server-certificate in the hosts with invalid certificates. Finally, they can re-enable TLS certificate checking using the call emergency-host-reenable-tls-verification.

The documentation will include instructions for administrators on how to reset certificates and manually install the host certificates as CA certificates to recover pools.

This means they will not have to disable TLS and compromise on security.

U12. Being aware of certificate expiry

Stockholm hosts provide alerts 30 days before host certificates expire. This must be extended to also alert about users’ CA certificates expiring.

Pool certificates need to be cycled when the certificate expiry is approaching. Alerts are introduced to warn the administrator that this task must be done, as otherwise the operation of the pool is at risk. A new API is introduced to create certificates for all members in a pool and replace the existing internal certificates with these. This call imposes the same requirements on a pool as the pool secret rotation: it cannot be run in a pool unless all the hosts are online, it can only be started by the coordinator, the coordinator is in a valid state, HA is disabled, no RPU is in progress, and no pool operations are in progress. The API call is Pool.rotate_internal_certificates. It is exposed by xe as pool-rotate-internal-certificates.
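The preconditions for rotation can be read as a checklist. The sketch below models them with a hypothetical PoolState record; neither the type nor the messages correspond to xapi's real data model or error codes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PoolState:
    all_hosts_online: bool
    caller_is_coordinator: bool
    coordinator_valid: bool
    ha_enabled: bool
    rpu_in_progress: bool
    pool_operations_in_progress: bool


def rotation_blockers(state: PoolState) -> List[str]:
    """Return the reasons why certificate rotation cannot start (empty if none)."""
    checks = [
        (state.all_hosts_online, "not all hosts are online"),
        (state.caller_is_coordinator, "call was not started by the coordinator"),
        (state.coordinator_valid, "coordinator is not in a valid state"),
        (not state.ha_enabled, "HA is enabled"),
        (not state.rpu_in_progress, "a rolling pool upgrade is in progress"),
        (not state.pool_operations_in_progress, "pool operations are in progress"),
    ]
    return [reason for ok, reason in checks if not ok]


healthy = PoolState(True, True, True, False, False, False)
blockers = rotation_blockers(healthy)
```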

Changes

Xapi startup has to account for host changes that affect this feature and modify the filesystem and pool database accordingly.

Public certificate changed: On first boot, after a pool join and when doing emergency repairs the server certificate record of the host may not match the contents in the filesystem. A check is to be introduced that detects if the database does not associate a certificate with the host or if the certificate’s public key in the database and the filesystem are different. If that’s the case the database is updated with the certificate in the filesystem.
Pool certificate not present: In the same way the public certificate served is generated on startup, the internal certificate must be generated if the certificate is not present in the filesystem.
Pool certificate changed: On first boot, after a pool join and after having done emergency repairs the internal server certificate record may not match the contents of the filesystem. A check is to be introduced that detects if the database does not associate a certificate with the host or if the certificate’s public key in the database and the filesystem are different. This check is aware of whether the host is joining a pool or is on first boot; it determines this by counting the number of hosts in the pool from the database. In the case where it’s joining a pool it simply updates the database record with the correct information from the filesystem, as the filesystem contents have been put in place before the restart. In the case of first boot the public part of the certificate is copied to the directory and the bundle for internally-trusted certificates: /etc/stunnel/certs-pool/ and /etc/stunnel/xapi-pool-ca-bundle.pem.

The xapi database records for certificates must be changed in accordance with the additions explained before.

API

Additions

Pool.tls_verification_enabled: this is a field that indicates whether TLS verification is enabled.
Pool.enable_tls_verification: this call is allowed for role _R_POOL_ADMIN. It’s not allowed to run if HA is enabled or pool operations are in progress. All the hosts in the pool transmit their certificate to the coordinator and the coordinator then distributes the certificates to all members of the pool. Once that is done the coordinator tries to initiate a session with all the pool members with TLS verification enabled. If it’s successful TLS verification is enabled for the whole pool, otherwise the error COULD_NOT_VERIFY_HOST [member UUID] is emitted.
TLS_VERIFICATION_ENABLE_IN_PROGRESS is a new error that is produced when trying to do other pool operations while enabling TLS verification is in progress
Host.emergency_disable_tls_verification: this call is allowed for role _R_LOCAL_ROOT_ONLY: it’s an emergency command and acts locally. It forces connections in xapi to stop verifying the peers on outgoing connections. It generates an alert to warn the administrators of this uncommon state.
Host.emergency_reenable_tls_verification: this call is allowed for role _R_LOCAL_ROOT_ONLY: it’s an emergency command and acts locally. It changes the configuration so xapi verifies connections by default after being switched off with the previous command.
Pool.install_ca_certificate: rename of Pool.certificate_install, add the ca certificate to the database.
Pool.uninstall_ca_certificate: rename of Pool.certificate_uninstall, removes the certificate from the database.
Host.reset_server_certificate: replaces Host.emergency_reset_server_certificate, now it’s allowed for role _R_POOL_ADMIN. It adds a record for the generated Default Certificate to the database while removing the previous record, if any.
Pool.rotate_internal_certificates: This call generates new Pool certificates, and substitutes the previous certificates with these. See the certificate expiry section for more details.

Modifications:

Pool.join: certificates must be correctly distributed. API Error POOL_JOINING_HOST_TLS_VERIFICATION_MISMATCH is returned if the tls_verification of the two pools doesn’t match.
Pool.eject: all certificates must be deleted from the ejected host’s filesystem and the ejected host’s certificate must be deleted from the pool’s trust root.
Host.install_server_certificate: the certificate type host for the record must be added to denote it’s a Standard Certificate.

Deprecations:

pool.certificate_install
pool.certificate_uninstall
pool.certificate_list
pool.wlb_verify_cert: This setting is superseded by pool.enable_tls_verification. It cannot be removed, however. When updating from a previous version when this setting is on, TLS connections to WLB must still verify the external host. When the global setting is enabled this setting is ignored.
host.emergency_reset_server_certificate: host.reset_server_certificate should be used instead as this call does not modify the database.

CLI

Following API additions:

pool-enable-tls-verification
pool-install-ca-certificate
pool-uninstall-ca-certificate
pool-internal-certificates-rotation
host-reset-server-certificate
host-emergency-disable-tls-verification (emits a warning when verification is off and the pool-level is on)
host-emergency-reenable-tls-verification

And removals:

host-emergency-server-certificate

Feature Flags

This feature needs clients to behave differently when initiating pool joins. To allow them to choose the right behaviour, the toolstack will expose a new feature flag ‘Certificate_verification’. This flag will be part of the express edition as it’s meant to aid detection of a feature and not block access to it.

Alerts

Several alerts are introduced:

POOL_CA_CERTIFICATE_EXPIRING_30, POOL_CA_CERTIFICATE_EXPIRING_14, POOL_CA_CERTIFICATE_EXPIRING_07, POOL_CA_CERTIFICATE_EXPIRED: Similar to host certificates, now the user-installable pool’s CA certificates are monitored for expiry dates and alerts are generated about them. The body for this type of message is:
The trusted TLS server certificate {is expiring soon|has expired}. 20210302T02:00:01Z
HOST_INTERNAL_CERTIFICATE_EXPIRING_30, HOST_INTERNAL_CERTIFICATE_EXPIRING_14, HOST_INTERNAL_CERTIFICATE_EXPIRING_07, HOST_INTERNAL_CERTIFICATE_EXPIRED: Similar to host certificates, the newly-introduced hosts’ internal server certificates are monitored for expiry dates and alerts are generated about them. The body for this type of message is:
The TLS server certificate for internal communications {is expiring soon|has expired}. 20210302T02:00:01Z
TLS_VERIFICATION_EMERGENCY_DISABLED: The host is in emergency mode and is not enforcing TLS verification anymore. The situation that forced the disabling must be fixed and the verification re-enabled as soon as possible. The body for this type of message is:
HOST-UUID
FAILED_LOGIN_ATTEMPTS: An hourly alert that contains the number of failed attempts and the 3 most common origins for these failed alerts. The body for this type of message is:
35 usr5origin55.4.3.21020200922T15:03:13Z usr4UA620200922T15:03:13Z UA4.3.2.1420200922T14:57:11Z 10
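The naming scheme of the certificate expiry alerts above suggests a simple mapping from the time remaining to an alert name. The sketch below assumes that the thresholds are inclusive and that the tightest matching threshold wins; both are assumptions, and the function name is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def expiry_alert(prefix: str, not_after: datetime, now: datetime):
    """Return the alert name for a certificate, or None if none is due.

    prefix is, for example, POOL_CA_CERTIFICATE or HOST_INTERNAL_CERTIFICATE.
    """
    remaining = not_after - now
    if remaining <= timedelta(0):
        return f"{prefix}_EXPIRED"
    for days in (7, 14, 30):
        if remaining <= timedelta(days=days):
            return f"{prefix}_EXPIRING_{days:02d}"
    return None


now = datetime(2021, 3, 2, 2, 0, 1, tzinfo=timezone.utc)
alert = expiry_alert("POOL_CA_CERTIFICATE", now + timedelta(days=10), now)
```

A certificate with ten days left falls between the 7-day and 14-day thresholds, so under these assumptions it maps to POOL_CA_CERTIFICATE_EXPIRING_14.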

Design document
Revision	v1
Status	released (5.6 fp1)

Tunnelling API design

To isolate network traffic between VMs (e.g. for security reasons) one can use VLANs. The number of possible VLANs on a network, however, is limited, and setting up a VLAN requires configuring the physical switches in the network. GRE tunnels provide a similar, though more flexible solution. This document proposes a design that integrates the use of tunnelling in the XenAPI. The design relies on the recent introduction of the Open vSwitch, and requires an Open vSwitch (OpenFlow) controller (further referred to as the controller) to set up and maintain the actual GRE tunnels.

We suggest following the way VLANs are modelled in the datamodel. Introducing a VLAN involves creating a Network object for the VLAN, that VIFs can connect to. The VLAN.create API call takes references to a PIF and Network to use and a VLAN tag, and creates a VLAN object and a PIF object. We propose something similar for tunnels; the resulting objects and relations for two hosts would look like this:

PIF (transport) -- Tunnel -- PIF (access) \          / VIF
                                            Network -- VIF
PIF (transport) -- Tunnel -- PIF (access) /          \ VIF

XenAPI changes

New tunnel class

Fields

string uuid (read-only)
PIF ref access_PIF (read-only)
PIF ref transport_PIF (read-only)
(string -> string) map status (read/write); owned by the controller, containing at least the key active, plus the keys key and error when appropriate (see below)
(string -> string) map other_config (read/write)

New fields in PIF class (automatically linked to the corresponding tunnel fields):

tunnel ref set tunnel_access_PIF_of (read-only)
tunnel ref set tunnel_transport_PIF_of (read-only)

Messages

tunnel ref create (PIF ref, network ref)
void destroy (tunnel ref)

Backends

For clients to determine which network backend is in use (to decide whether tunnelling functionality is enabled) a key network_backend is added to the Host.software_version map on each host. The value of this key can be:

bridge: the Linux bridging backend is in use;
openvswitch: the Open vSwitch backend is in use.

Notes

The user is responsible for creating tunnel and network objects, associating VIFs with the right networks, and configuring the physical PIFs, all using the XenAPI/CLI/XC.
The tunnel.status field is owned by the controller. It may be possible to define an RBAC role for the controller, such that only the controller is able to write to it.
The tunnel.create message does not take a tunnel identifier (GRE key). The controller is responsible for assigning the right keys transparently. When a tunnel has been set up, the controller will write its key to tunnel.status:key, and it will set tunnel.status:active to "true" in the same field.
In case a tunnel could not be set up, an error code (to be defined) will be written to tunnel.status:error, and tunnel.status:active will be "false".

Xapi

tunnel.create

Fails with OPENVSWITCH_NOT_ACTIVE if the Open vSwitch networking sub-system is not active (the host uses Linux bridging).
Fails with IS_TUNNEL_ACCESS_PIF if the specified transport PIF is a tunnel access PIF.
Takes care of creating and connecting the new tunnel and PIF objects.
- Sets a random MAC on the access PIF.
- IP configuration of the tunnel access PIF is left blank. (The IP configuration on a PIF is normally used for the interface in dom0. In this case, there is no tunnel interface for dom0 to use. Such functionality may be added in future.)
- The tunnel.status:active field is initialised to "false", indicating that no actual tunnelling infrastructure has been set up yet.
Calls PIF.plug on the new tunnel access PIF.
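The checks and initialisation listed above can be sketched as follows. This is a minimal in-memory model in Python, not xapi's implementation: the dictionary-based PIF and tunnel records, the ApiError class and the create_tunnel and random_mac functions are assumptions made for illustration. Only the error codes and the status:active key come from the text, and the final PIF.plug call is not modelled.

```python
import random
import uuid


class ApiError(Exception):
    """Hypothetical stand-in for a XenAPI error carrying an error code."""


def random_mac():
    # Locally administered, unicast MAC address (illustrative choice).
    first = (random.randrange(256) | 0x02) & 0xFE
    rest = [random.randrange(256) for _ in range(5)]
    return ":".join(f"{b:02x}" for b in [first] + rest)


def create_tunnel(transport_pif, network, backend):
    """Model of tunnel.create; all record layouts here are assumptions."""
    if backend != "openvswitch":
        raise ApiError("OPENVSWITCH_NOT_ACTIVE")
    if transport_pif.get("tunnel_access_PIF_of"):
        raise ApiError("IS_TUNNEL_ACCESS_PIF")
    access_pif = {
        "MAC": random_mac(),
        "network": network,
        "ip_configuration_mode": "None",  # IP configuration is left blank
        "tunnel_access_PIF_of": [],
        "currently_attached": False,
    }
    tunnel = {
        "uuid": str(uuid.uuid4()),
        "access_PIF": access_pif,
        "transport_PIF": transport_pif,
        "status": {"active": "false"},  # no tunnelling infrastructure yet
        "other_config": {},
    }
    access_pif["tunnel_access_PIF_of"].append(tunnel["uuid"])
    return tunnel
```

Passing the access PIF of an existing tunnel back in as a transport PIF raises IS_TUNNEL_ACCESS_PIF, which mirrors the rule that tunnels cannot be stacked on tunnel access PIFs.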

tunnel.destroy

Calls PIF.unplug on the tunnel access PIF. Destroys the tunnel and tunnel access PIF objects.

PIF.plug on a tunnel access PIF

Fails with TRANSPORT_PIF_NOT_CONFIGURED if the underlying transport PIF has PIF.ip_configuration_mode = None, as this interface needs to be configured for the tunnelling to work. Otherwise, the transport PIF will be plugged.
Xapi requests interface-reconfigure to “bring up” the tunnel access PIF, which causes it to create a local bridge.
No link will be made between the new bridge and the physical interface by interface-reconfigure. The controller is responsible for setting up these links. If the controller is not available, no links can be created, and the tunnel network degrades to an internal network (only intra-host connectivity).
PIF.currently_attached is set to true.

PIF.unplug on a tunnel access PIF

Xapi requests interface-reconfigure to “bring down” the tunnel PIF, which causes it to destroy the local bridge.
PIF.currently_attached is set to false.

PIF.unplug on a tunnel transport PIF

Calls PIF.unplug on the associated tunnel access PIF(s).

PIF.forget on a tunnel access or transport PIF

Fails with PIF_TUNNEL_STILL_EXISTS.

VLAN.create

Tunnels can only exist on top of physical/VLAN/Bond PIFs, and not the other way around. VLAN.create fails with IS_TUNNEL_ACCESS_PIF if given an underlying PIF that is a tunnel access PIF.

Pool join

As for VLANs, when a host joins a pool, it will inherit the tunnels that are present on the pool master.
Any tunnels (tunnel and access PIF objects) configured on the host are removed, which will leave their networks disconnected (the networks become internal networks). As a joining host is always a single host, there is no real use for having had tunnels on it, so this probably will never be an issue.

The controller

The controller tracks the tunnel class to determine which bridges/networks require GRE tunnelling.
- On start-up, it calls tunnel.get_all to obtain the information about all tunnels.
- Registers for events on the tunnel class to stay up-to-date.
A tunnel network is organised as a star topology. The controller is free to decide which host will be the central host (“switching host”).
If the current switching host goes down, a new one will be selected, and GRE tunnels will be reconstructed.
The controller creates GRE tunnels connecting each existing Open vSwitch bridge that is associated with the same tunnel network, after assigning the network a unique GRE key.
The controller destroys GRE tunnels if associated Open vSwitch bridges are destroyed. If the destroyed bridge was on the switching host, and other hosts are still using the same tunnel network, a new switching host will be selected, and GRE tunnels will be reconstructed.
The controller sets tunnel.status:active to "true" for all tunnel links that have been set up, and "false" if links are broken.
The controller writes an appropriate error code (to be defined) to tunnel.status:error in case something went wrong.
When an access PIF is plugged, and the controller succeeds in setting up the tunnelling infrastructure, it writes the GRE key to tunnel.status:key on the associated tunnel object (at the same time tunnel.status:active will be set to "true").
When the tunnel infrastructure is not up and running, the controller may remove the key tunnel.status:key (optional; the key should anyway be disregarded if tunnel.status:active is "false").
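A client that reads tunnel.status therefore has to check active before trusting key. The rule can be sketched as below; the function name gre_key and the plain dictionary standing in for the status map are assumptions for illustration.

```python
def gre_key(status):
    """Return the GRE key from a tunnel.status map, or None.

    The key is disregarded unless status["active"] is the string "true",
    because a stale key may remain after the tunnel has gone down.
    """
    if status.get("active") != "true":
        return None
    return status.get("key")
```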

CLI

New xe commands (analogous to xe vlan-):

tunnel-create
tunnel-destroy
tunnel-list
tunnel-param-get
tunnel-param-list

Design document
Revision	v2
Status	released (8.2)

User-installable host certificates

Introduction

It is often necessary to replace the TLS certificate used to secure communications to XenServer hosts, for example to allow a XenAPI user such as Citrix Virtual Apps and Desktops (CVAD) to validate that the host is genuine and not impersonating the actual host.

Historically there has not been a supported mechanism to do this, and as a result users have had to rely on guides written by third parties that show how to manually replace the xapi-ssl.pem file on a host. This process is error-prone, and if a mistake is made, can result in an unusable system. This design provides a fully supported mechanism to allow replacing the certificates.

Design proposal

It is expected that an API caller will provide, in a single API call, a private key, and one or more certificates for use on the host. The key will be provided in PKCS #8 format, and the certificates in X509 format, both in base-64-encoded PEM containers.

Multiple certificates can be provided to cater for the case where an intermediate certificate or certificates are required for the caller to be able to verify the certificate back to a trusted root (best practice for Certificate Authorities is to have an ‘offline’ root, and issue certificates from an intermediate Certificate Authority). In this situation, it is expected (and common practice among other tools) that the first certificate provided in the chain is the host’s unique server certificate, and subsequent certificates form the chain.

To detect mistakes a user may make, certain checks will be carried out on the provided key and certificate(s) before they are used on the host. If all checks pass, the key and certificate(s) will be written to the host, at which stage a signal will be sent to stunnel that will cause it to start serving the new certificate.

Certificate Installation

API Additions

Xapi must provide an API call through the Host RPC API to install host certificates:

let install_server_certificate = call
    ~lifecycle:[Published, rel_stockholm, ""]
    ~name:"install_server_certificate"
    ~doc:"Install the TLS server certificate."
    ~versioned_params:
      [{ param_type=Ref _host; param_name="host"; param_doc="The host"
       ; param_release=stockholm_release; param_default=None}
      ;{ param_type=String; param_name="certificate"
       ; param_doc="The server certificate, in PEM form"
       ; param_release=stockholm_release; param_default=None}
      ;{ param_type=String; param_name="private_key"
       ; param_doc="The unencrypted private key used to sign the certificate, \
                    in PKCS#8 form"
       ; param_release=stockholm_release; param_default=None}
      ;{ param_type=String; param_name="certificate_chain"
       ; param_doc="The certificate chain, in PEM form"
       ; param_release=stockholm_release; param_default=Some (VString "")}
      ]
    ~allowed_roles:_R_POOL_ADMIN
    ()

This call should be implemented within xapi, using the already-existing crypto libraries available to it.

Analogous to the API call, a new CLI call host-server-certificate-install must be introduced, which takes the parameters certificate, private-key and certificate-chain. These parameters are expected to be filenames, from which the key and certificate(s) must be read and passed to the install_server_certificate RPC call.

The CLI will be defined as:

"host-server-certificate-install",
{
  reqd=["certificate"; "private-key"];
  optn=["certificate-chain"];
  help="Install a server TLS certificate on a host";
  implementation=With_fd Cli_operations.host_install_server_certificate;
  flags=[ Host_selectors ];
};

Validation

Xapi must perform the following validation steps on the provided key and certificate. If any validation step fails, the API call must return an error with the specified error code, providing any associated text:

Private Key

Validate that it is a pem-encoded PKCS#8 key, use error SERVER_CERTIFICATE_KEY_INVALID [] and exposed as “The provided key is not in a pem-encoded PKCS#8 format.”
Validate that the algorithm of the key is RSA, use error SERVER_CERTIFICATE_KEY_ALGORITHM_NOT_SUPPORTED, [<algorithm's ASN.1 OID>] and exposed as “The provided key uses an unsupported algorithm.”
Validate that the key length is ≥ 2048, and ≤ 4096 bits, use error SERVER_CERTIFICATE_KEY_RSA_LENGTH_NOT_SUPPORTED, [length] and exposed as “The provided RSA key does not have a length between 2048 and 4096.”
The library used does not support multi-prime RSA keys. When one is encountered, use error SERVER_CERTIFICATE_KEY_RSA_MULTI_NOT_SUPPORTED [] and exposed as “The provided RSA key is using more than 2 primes, expecting only 2”

Server Certificate

Validate that it is a pem-encoded X509 certificate, use error SERVER_CERTIFICATE_INVALID [] and exposed as “The provided certificate is not in a pem-encoded X509.”
Validate that the public key of the certificate matches the public key from the private key, using error SERVER_CERTIFICATE_KEY_MISMATCH [] and exposing it as “The provided key does not match the provided certificate’s public key.”
Validate that the certificate is currently valid. (ensure all time comparisons are done using UTC, and any times presented in errors are using ISO8601 format):
- Ensure the certificate’s not_before date is ≤ NOW, use error SERVER_CERTIFICATE_NOT_VALID_YET, [<NOW>; <not_before>] and exposed as “The provided certificate is not valid yet.”
- Ensure the certificate’s not_after date is > NOW, use error SERVER_CERTIFICATE_EXPIRED, [<NOW>; <not_after>] and exposed as “The provided certificate has expired.”
Validate that the certificate signature algorithm is SHA-256, use error SERVER_CERTIFICATE_SIGNATURE_NOT_SUPPORTED [] and exposed as “The provided certificate is not using the SHA256 (SHA2) signature algorithm.”

Intermediate Certificates

Validate that it is an X509 certificate, use SERVER_CERTIFICATE_CHAIN_INVALID [] and exposed as “The provided intermediate certificates are not in a pem-encoded X509.”
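Two of the checks above, the RSA key length and the validity window, need no cryptographic parsing once the relevant values have been extracted, so they can be illustrated in isolation. The sketch below is in Python rather than the OCaml that xapi would use. The function names and the (error code, parameters) return convention are assumptions; only the error codes, the bounds and the UTC/ISO8601 requirements come from the text.

```python
from datetime import datetime, timezone

ISO8601 = "%Y-%m-%dT%H:%M:%SZ"


def check_rsa_key_length(bits):
    """Return None if the length is acceptable, else (code, params)."""
    if not 2048 <= bits <= 4096:
        return ("SERVER_CERTIFICATE_KEY_RSA_LENGTH_NOT_SUPPORTED", [str(bits)])
    return None


def check_validity(not_before, not_after, now=None):
    """Check the validity window using timezone-aware datetimes.

    All comparisons are done in UTC and error parameters are ISO8601.
    """
    now = now or datetime.now(timezone.utc)

    def fmt(t):
        return t.astimezone(timezone.utc).strftime(ISO8601)

    if not_before > now:  # valid only if not_before <= NOW
        return ("SERVER_CERTIFICATE_NOT_VALID_YET", [fmt(now), fmt(not_before)])
    if not_after <= now:  # valid only if not_after > NOW
        return ("SERVER_CERTIFICATE_EXPIRED", [fmt(now), fmt(not_after)])
    return None
```

Returning the first failing check keeps the error specific, which matches the requirement that each validation step reports its own error code.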

Filesystem Interaction

If validation has been completed successfully, a temporary file must be created with permissions 0400 (octal, readable only by its owner) containing the key and certificate(s), in that order, separated by an empty line.

This file must then be atomically moved to /etc/xensource/xapi-ssl.pem in order to ensure the integrity of the contents. This may be done using rename with the origin and destination in the same mount-point.
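The write-then-rename pattern described here can be sketched as follows. This is an illustration in Python, not xapi's code: the function name install_pem and its parameters are assumptions. The temporary file is created in the destination's directory so that the rename stays within one mount-point.

```python
import os
import tempfile


def install_pem(key_pem, cert_pems, dest):
    """Atomically replace dest with the key followed by the certificate(s)."""
    # Key first, then certificates, separated by an empty line.
    contents = "\n".join(p.strip() + "\n" for p in [key_pem] + list(cert_pems))
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(dest) or ".")
    try:
        with os.fdopen(fd, "w") as f:
            os.fchmod(f.fileno(), 0o400)  # owner read-only
            f.write(contents)
        os.replace(tmp, dest)  # atomic rename on the same filesystem
    except BaseException:
        os.unlink(tmp)  # do not leave a partial file behind
        raise
```

Readers of the destination path see either the old file or the complete new one, never a partially written file.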

Alerting

A daily task must be added. This task must check the expiry date of the first certificate present in /etc/xensource/xapi-ssl.pem, and if it is within 30 days of expiry, generate a message to alert the administrator that the certificate is due to expire shortly.

The body of the message should contain:

<body>
  <message>
    The TLS server certificate is expiring soon
  </message>
  <date>
    <expiry date in ISO8601 'YYYY-MM-DDThh:mm:ssZ' format>
  </date>
</body>

The priority of the message should be based on the number of days to expiry as follows:

Number of days	Priority
0-7	1
8-14	2
15+	3

The other fields of the message should be:

Field	Value
name	HOST_SERVER_CERTIFICATE_EXPIRING
class	Host
obj-uuid	< Host UUID >

Any existing HOST_SERVER_CERTIFICATE_EXPIRING messages with this host’s UUID should be removed to avoid a build-up of messages.
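As an illustration, the expiring-certificate message described above could be assembled as below. This is a sketch under assumptions: the function name, the dictionary layout of the returned message and the use of whole days are all hypothetical, and exactly 14 days is treated as priority 2. Removing older messages for the same host is not modelled.

```python
from datetime import datetime, timezone


def expiring_message(host_uuid, not_after, now):
    """Return a message dict, or None if no expiring alert is due."""
    days = (not_after - now).days  # whole days left until expiry
    if not_after <= now or days > 30:
        return None  # already expired, or not within 30 days of expiry
    if days <= 7:
        priority = 1
    elif days <= 14:
        priority = 2
    else:
        priority = 3
    date = not_after.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    body = (
        "<body><message>The TLS server certificate is expiring soon"
        f"</message><date>{date}</date></body>"
    )
    return {
        "name": "HOST_SERVER_CERTIFICATE_EXPIRING",
        "class": "Host",
        "obj-uuid": host_uuid,
        "priority": priority,
        "body": body,
    }
```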

Additionally, the task may also produce messages for expired server certificates which must use the name HOST_SERVER_CERTIFICATE_EXPIRED. This kind of message must contain the message “The TLS server certificate has expired.” as well as the expiry date, like the expiring messages. They also may replace the existing expiring messages in a host.

Expose Certificate metadata

Currently xapi exposes a CLI command to print the certificate being used to verify external hosts. We would like to also expose through the API and the CLI useful metadata about the certificates in use by each host.

The new class is meant to cover server certificates and trusted certificates.

Schema

A new class, Certificate, will be added with the following schema:

Field	Type	Notes
uuid
type	CA	Certificate trusted by all hosts
	Host	Certificate that the host presents to normal clients
name	String	Name, only present for trusted certificates
host	Ref _host	Host where the certificate is installed
not_before	DateTime	Date after which the certificate is valid
not_after	DateTime	Date before which the certificate is valid
fingerprint_sha256	String	The certificate’s SHA256 fingerprint / hash
fingerprint_sha1	String	The certificate’s SHA1 fingerprint / hash

CLI / API

There are currently-existing CLI parameters for certificates: pool-certificate-{install,uninstall,list,sync}, pool-crl-{install,uninstall,list} and host-get-server-certificate.

The new command must show the metadata of installed server certificates in the pool. It must be able to show all of them in the same call, and be able to filter the certificates per-host.

To make it easy to separate it from the previous calls and to reflect that certificates are a class type in xapi the call will be named certificate-list and it will accept the parameter host-uuid=<uuid>.

Recovery mechanism

If a certificate is allowed to expire, TLS clients connecting to the host will refuse to establish the connection. This means that the host can no longer be managed using the xapi API (XenCenter, or a CVAD control plane).

There needs to be a mechanism to recover from this situation. A CLI command must be provided to install a self-signed certificate, in the same way it is generated during the setup process at the moment. The command will be host-emergency-reset-server-certificate. This command is never to be forwarded to another host and will call openssl to create a new RSA private key.

The command must notify stunnel to make sure stunnel uses the newly-created certificate.

Miscellaneous

The auto-generated xapi-ssl.pem currently contains Diffie-Hellman (DH) Parameters, specifically 512 bits worth. We no longer support any ciphers which require DH parameters, so these are no longer needed, and it is acceptable for them to be lost as part of installing a new certificate/key pair.

The generation should also be modified to avoid creating these for new installations.

Design document
Revision	v1
Status	released (7.0)
Review	#156
Revision history
v1	Initial version

VGPU type identifiers

Introduction

When xapi starts, it may create a number of VGPU_type objects. These act as VGPU presets, and exactly which VGPU_type objects are created depends on the installed hardware and in certain cases the presence of certain files in dom0.

When deciding which VGPU_type objects need to be created, xapi needs to determine whether a suitable VGPU_type object already exists, as there should never be duplicates. At the moment the combination of vendor name and model name is used as a primary key, but this is not ideal as these values are subject to change. We therefore need a way of creating a primary key to uniquely identify VGPU_type objects.

Identifier

We will add a new read-only field to the database:

VGPU_type.identifier (string)

This field will contain a string representation of the parameters required to uniquely identify a VGPU_type. The parameters required can be summed up with the following OCaml type:

type nvidia_id = {
  pdev_id : int;
  psubdev_id : int option;
  vdev_id : int;
  vsubdev_id : int;
}

type gvt_g_id = {
  pdev_id : int;
  low_gm_sz : int64;
  high_gm_sz : int64;
  fence_sz : int64;
  monitor_config_file : string option;
}

type t =
  | Passthrough
  | Nvidia of nvidia_id
  | GVT_g of gvt_g_id

When converting this type to a string, the string will always be prefixed with 0001: to enable future versioning of the serialisation format.

For passthrough, the string will simply be:

0001:passthrough

For NVIDIA, the string will be nvidia followed by the four device IDs serialised as four-digit hex values, separated by commas. If psubdev_id is None, the empty string will be used e.g.

Nvidia {
  pdev_id = 0x11bf;
  psubdev_id = None;
  vdev_id = 0x11b0;
  vsubdev_id = 0x109d;
}

would map to

0001:nvidia,11bf,,11b0,109d
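The passthrough and NVIDIA encodings can be reproduced with a few lines of code. The sketch below uses Python instead of OCaml, and the function names are assumptions; the output format follows the description and example above.

```python
VERSION_PREFIX = "0001:"


def passthrough_identifier():
    return VERSION_PREFIX + "passthrough"


def nvidia_identifier(pdev_id, psubdev_id, vdev_id, vsubdev_id):
    """Serialise the four device IDs as comma-separated four-digit hex."""

    def hex4(value):
        # A missing psubdev_id (None) becomes the empty string.
        return "" if value is None else f"{value:04x}"

    ids = (pdev_id, psubdev_id, vdev_id, vsubdev_id)
    return VERSION_PREFIX + "nvidia," + ",".join(hex4(v) for v in ids)
```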

For GVT-g, the string will be gvt-g followed by the physical device ID encoded as four-digit hex, followed by low_gm_sz, high_gm_sz and fence_sz encoded as hex, followed by monitor_config_file (or the empty string if it is None) e.g.

GVT_g {
  pdev_id = 0x162a;
  low_gm_sz = 128L;
  high_gm_sz = 384L;
  fence_sz = 4L;
  monitor_config_file = None;
}

would map to

0001:gvt-g,162a,80,180,4,,

Having this string in the database will allow us to do a simple lookup to test whether a certain VGPU_type already exists. Although it is not currently required, this string can also be converted back to the type from which it was generated.

When deciding whether to create VGPU_type objects, xapi will generate the identifier string and use it to look for existing VGPU_type objects in the database. If none are found, xapi will look for existing VGPU_type objects with the tuple of model name and vendor name. If still none are found, xapi will create a new VGPU_type object.
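The lookup order described above (identifier first, then the legacy vendor/model pair, then creation) can be sketched like this. The list of dictionaries standing in for the database and the function name find_or_create are assumptions for illustration. Whether a legacy match should have its identifier filled in is not stated in this design, so the sketch leaves such records unchanged.

```python
def find_or_create(db, identifier, vendor_name, model_name):
    """Return an existing VGPU_type record, or append and return a new one."""
    # 1. Preferred lookup: the unique identifier string.
    for vgpu_type in db:
        if vgpu_type.get("identifier") == identifier:
            return vgpu_type
    # 2. Fallback: the old (vendor name, model name) key.
    for vgpu_type in db:
        key = (vgpu_type["vendor_name"], vgpu_type["model_name"])
        if key == (vendor_name, model_name):
            return vgpu_type
    # 3. Nothing suitable exists yet, so create a new record.
    created = {
        "identifier": identifier,
        "vendor_name": vendor_name,
        "model_name": model_name,
    }
    db.append(created)
    return created
```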

Design document
Revision	v1
Status	released (7.0)

Virtual Hardware Platform Version

Background and goal

Some VMs can only be run on hosts of sufficiently recent versions.

We want a clean way to ensure that xapi only tries to run a guest VM on a host that supports the “virtual hardware platform” required by the VM.

Suggested design

In the datamodel, VM has a new integer field “hardware_platform_version” which defaults to zero.
In the datamodel, Host has a corresponding new integer-list field “virtual_hardware_platform_versions” which defaults to a list containing a single zero element (i.e. [0] or [0L] in OCaml notation). The zero represents the implicit version supported by older hosts that lack the code to handle the Virtual Hardware Platform Version concept.
When a host boots it populates its own entry from a hardcoded value, currently [0; 1] i.e. a list containing the two integer elements 0 and 1. (Alternatively this could come from a config file.)
- If this new version-handling functionality is introduced in a hotfix, at some point the pool master will have the new functionality while at least one slave does not. An old slave-host that does not yet have software to handle this feature will not set its DB entry, which will therefore remain as [0] (maintained in the DB by the master).
The existing test for whether a VM can run on (or migrate to) a host must include a check that the VM’s virtual hardware platform version is in the host’s list of supported versions.
When a VM is made to start using a feature that is available only in a certain virtual hardware platform version, xapi must set the VM’s hardware_platform_version to the maximum of that version-number and its current value (i.e. raise if needed).

For the version we could consider some type other than integer, but a strict ordering is needed.
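The two rules in the list above, the membership test at placement time and the "raise if needed" update, are small enough to show directly. This is a sketch with hypothetical function names, written in Python rather than xapi's OCaml.

```python
def can_run_on(vm_version, host_versions):
    """A VM may run on a host only if the host lists the VM's version."""
    return vm_version in host_versions


def raise_version(current_version, required_version):
    """Never lower a VM's version; only raise it when a feature needs it."""
    return max(current_version, required_version)
```

A host that has never populated its entry keeps the default [0], so a VM raised to version 1 is rejected there while a version 0 VM still runs.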

First use-case

Version 1 denotes support for a certain feature:

When a VM starts, if a certain flag is set in VM.platform then XenServer will provide an emulated PCI device which will trigger the guest Windows OS to seek drivers for the device, or updates for those drivers. Thus updated drivers can be obtained through the standard Windows Update mechanism.

If the PCI device is removed, the guest OS will fail to boot. A VM using this feature must not be migrated to or started on a XenServer that lacks support for the feature.

Therefore at VM start, we can look at whether this feature is being used; if it is, then if the VM’s Virtual Hardware Platform Version is less than 1 we should raise it to 1.

Limitation

Consider a VM that requires version 1 or higher. Suppose it is exported, then imported into an old host that does not support this feature. Then the host will not check the versions but will attempt to run the VM, which will then have difficulties.

The only way to prevent this would be to make a backwards-incompatible change to the VM metadata (e.g. a new item in an enum) so that the old hosts cannot read it, but that seems like a bad idea.

Design document
Revision	v2
Status	proposed

XenPrep

Background

Windows guests should have XenServer-specific drivers installed. As of mid-2015 these have always been installed and upgraded by an essentially manual process involving an ISO carrying the drivers. We have a plan to enable automation through the standard Windows Update mechanism. This will involve a new additional virtual PCI device being provided to the VM, to trigger Windows Update to fetch drivers for the device.

There are many existing Windows guests that have drivers installed already. These drivers must be uninstalled before the new drivers are installed (and ideally before the new PCI device is added). To make this easier, we are planning a XenAPI call that will cause the removal of the old drivers and the addition of the new PCI device.

Since this is only to help with updating old guests, the call may well be removed at some point in the future.

Brief high-level design

The XenAPI call will be called VM.xenprep_start. It will update the VM record to note that the process has started, and will insert a special ISO into the VM’s virtual CD drive.

That ISO will contain a tool which will be set up to auto-run (if auto-run is enabled in the guest). The tool will:

Lock the CD drive so other Windows programs cannot eject the disc.
Uninstall the old drivers.
Eject the CD to signal success.
Shut down the VM.

XenServer will interpret the ejection of the CD as a success signal, and when the VM shuts down without the special ISO in the drive, XenServer will:

Update the VM record:

Remove the mark that shows that the xenprep process is in progress
Give it the new PCI device: set VM.auto_update_drivers to true.
If VM.virtual_hardware_platform_version is less than 2, then set it to 2.

Start the VM.

More details of the xapi-project parts

(The tool that runs in the guest is out of scope for this document.)

Start

The XenAPI call VM.xenprep_start will throw a power-state error if the VM is not running. For RBAC roles, it will be available to “VM Operator” and above.

It will:

Insert the xenprep ISO into the VM’s virtual CD drive.
Write VM.other_config key xenprep_progress=ISO_inserted to record the fact that the xenprep process has been initiated.

If xenprep_start is called on a VM already undergoing xenprep, the call will return successfully but will not do anything.

If the VM does not have an empty virtual CD drive, the call will fail with a suitable error.

Cancellation

While xenprep is in progress, any request to eject the xenprep ISO (except from inside the guest) will be rejected with a new error “VBD_XENPREP_CD_IN_USE”.

There will be a new XenAPI call VM.xenprep_abort which will:

Remove the xenprep_progress entry from VM.other_config.
Make a best-effort attempt to eject the CD. (The guest might prevent ejection.)

This is not intended for cancellation while the xenprep tool is running, but rather for use before it starts, for example if auto-run is disabled or if the VM has a non-Windows OS.

Completion

Aim: when the guest shuts down after ejecting the CD, XenServer will start the guest again with the new PCI device.

Xapi works through the queue of events it receives from xenopsd. It is possible that by the time xapi processes the cd-eject event, the guest might have shut down already.

When the shutdown (not reboot) event is handled, we shall check whether we need to do anything xenprep-related. If

The VM other_config map has xenprep_progress as either of ISO_inserted or shutdown, and
The xenprep ISO is no longer in the drive

then we must (in the specified order)

Update the VM record:
In VM.other_config set xenprep_progress=shutdown
If VM.virtual_hardware_platform_version is less than 2, then set it to 2.
Give it the new PCI device: set VM.auto_update_drivers to true.
Initiate VM start.
Remove xenprep_progress from VM.other_config
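The shutdown handling above can be sketched as a single function. This is an in-memory illustration, not the code in xapi_xenops.ml: the dictionary-based VM record, the iso_in_drive flag and the start_vm callback are assumptions; the field names and the order of the steps follow the text.

```python
def on_shutdown(vm, iso_in_drive, start_vm):
    """Handle a shutdown (not reboot) event; return True if xenprep completed."""
    progress = vm["other_config"].get("xenprep_progress")
    if progress not in ("ISO_inserted", "shutdown") or iso_in_drive:
        return False  # nothing xenprep-related to do
    # Update the VM record, in the specified order.
    vm["other_config"]["xenprep_progress"] = "shutdown"
    if vm["virtual_hardware_platform_version"] < 2:
        vm["virtual_hardware_platform_version"] = 2
    vm["auto_update_drivers"] = True  # gives the VM the new PCI device
    start_vm(vm)
    del vm["other_config"]["xenprep_progress"]
    return True
```

Recording xenprep_progress=shutdown before the other updates means an interrupted run still satisfies the entry condition when the event is handled again.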

The most relevant code is probably the update_vm function in ocaml/xapi/xapi_xenops.ml in the xen-api repo (or in some function called from there).

Python

Introduction

Most Python3 scripts and plugins shall be located below the python3 directory. The structure of the directory is as follows:

python3/bin: Contains files that are installed in /opt/xensource/bin and are meant to be run by users.
python3/libexec: Contains files that are installed in /opt/xensource/libexec and are meant to be run only by xapi and other daemons.
python3/packages: Contains files that are installed in Python’s site-packages. They are meant to be modules and packages that are imported by other scripts or executed via python3 -m.
python3/plugins: Contains files that are meant to be xapi plugins.
python3/tests: Contains the tests that exercise and cover the Python scripts and plugins.

Dependencies for development and testing

In GitHub CI and local testing, we can use pre-commit to execute the tests. It provides a dedicated, clearly defined and always consistent Python environment. The easiest way to run all tests and checks is to simply run pre-commit. The example commands below assume that you have Python3 in your PATH. Currently, Python 3.11 is required for it:

pip3 install pre-commit
pre-commit run -av
# Or, to just run the pytest hook:
pre-commit run -av pytest

Note: By default, CentOS 8 provides Python 3.6, whereas some tests need Python >= 3.7

Alternatively, you can of course run the tests in any suitable environment, given that you install the supported versions of all dependencies. You can find the dependencies in the list additional_dependencies of the pytest hook in the pre-commit configuration file .pre-commit-config.yaml.

.pre-commit-config.yaml:

    hooks:
    -   id: pytest
        files: python3/
        name: check that the Python3 test suite in passes
        entry: sh -c 'coverage run && coverage xml && coverage html && coverage report && diff-cover --ignore-whitespace --compare-branch=origin/master --show-uncovered --format html:.git/coverage-diff.html --fail-under 50 .git/coverage3.11.xml'
        require_serial: true
        pass_filenames: false
        language: python
        types: [python]
        additional_dependencies:
        - coverage
        - diff-cover
        - future
        - opentelemetry-api
        - opentelemetry-exporter-zipkin-json
        - opentelemetry-sdk
        - pytest-mock
        - mock
        - wrapt
        - XenAPI

Coverage

Code moved to the python3 directory tree shall have good code coverage using tests that are executed, verified and covered using pytest and Coverage.py. The coverage tool and pytest are configured in pyproject.toml and coverage run is configured to run pytest by default.

coverage run collects coverage from the run and stores it in its database. The simplest command line to run and report coverage to stdout is:

coverage run && coverage report

Other commands also used in the pytest hook example above:

coverage xml: Generates an XML report from the coverage database to .git/coverage3.11.xml. It is needed for upload to https://codecov.io
coverage html: Generates an HTML report from the coverage database to .git/coverage_html/

We configure the file paths used for the generated database and other coverage configuration in the sections [tool.coverage.run] and [tool.coverage.report] of pyproject.toml.

Pytest

If your Python environment has the dependencies for the tests installed, you can run pytest in this environment without any arguments to use the defaults.

For development, pytest can also run just a single test.

To run a specific test, run pytest and pass the test file to it, for example:

pytest python3/tests/test_perfmon.py

coverage run -m pytest python3/tests/test_perfmon.py && coverage report

RRDD

The xcp-rrdd daemon (hereafter simply called “rrdd”) is a component in the xapi toolstack that is responsible for collecting metrics, storing them as “Round-Robin Databases” (RRDs) and exposing these to clients.

The code is in ocaml/xcp-rrdd.

Design document
Revision	v1
Status	released (7.0)

RRDD archival redesign

Introduction

Current problems with rrdd:

rrdd stores knowledge about whether it is running on a master or a slave

rrdd handles rebooting VMs unpredictably

Proposal

Design

VM.destroy

The master xapi makes a remove_rrd call to the local rrdd, which causes rrdd to delete the VM’s archived rrd from disk. This behaviour will remain unchanged.

VM.start(_on) and VM.resume(_on)

VM.shutdown and VM.suspend

val archive_rrd : vm_uuid:string -> remote_address:string -> unit

VM.reboot

VM.checkpoint

This will be handled automatically, as internally VM.checkpoint carries out a VM.suspend followed by a VM.resume.

VM.pool_migrate and VM.migrate_send

Design document
Revision	v1
Status	released (7.0)
Revision history
v1	Initial version

RRDD plugin protocol v2

Motivation

rrdd plugins currently report datasources via a shared-memory file, using the following format:

DATASOURCES
000001e4
dba4bf7a84b6d11d565d19ef91f7906e
{
  "timestamp": 1339685573.245,
  "data_sources": {
    "cpu-temp-cpu0": {
      "description": "Temperature of CPU 0",
      "type": "absolute",
      "units": "degC",
      "value": "64.33",
      "value_type": "float"
    },
    "cpu-temp-cpu1": {
      "description": "Temperature of CPU 1",
      "type": "absolute",
      "units": "degC",
      "value": "62.14",
      "value_type": "float"
    }
  }
}

This format contains four main components:

A constant header string

DATASOURCES

This should always be present.

The JSON data length, encoded as hexadecimal

000001e4

The md5sum of the JSON data

dba4bf7a84b6d11d565d19ef91f7906e

The JSON data itself, encoding the values and metadata associated with the reported datasources.

Example

{
  "timestamp": 1339685573.245,
  "data_sources": {
    "cpu-temp-cpu0": {
      "description": "Temperature of CPU 0",
      "type": "absolute",
      "units": "degC",
      "value": "64.33",
      "value_type": "float"
    },
    "cpu-temp-cpu1": {
      "description": "Temperature of CPU 1",
      "type": "absolute",
      "units": "degC",
      "value": "62.14",
      "value_type": "float"
    }
  }
}
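
The four components can be checked in order before the JSON is used. The following Python sketch is illustrative only: it assumes the components are separated by newlines, as in the listing above, and that the length and md5sum are computed over the UTF-8 bytes of the JSON text. The function names are hypothetical and are not part of rrdd.

```python
import hashlib
import json

HEADER = "DATASOURCES"


def build_v1_payload(data):
    """Serialise `data` in the V1 layout: header, hex length, md5sum, JSON."""
    body = json.dumps(data)
    raw = body.encode("utf-8")
    length = "%08x" % len(raw)  # 8 hexadecimal digits, zero padded
    digest = hashlib.md5(raw).hexdigest()
    return "\n".join([HEADER, length, digest, body])


def parse_v1_payload(payload):
    """Validate the three leading components, then parse the JSON."""
    header, length_hex, digest, body = payload.split("\n", 3)
    if header != HEADER:
        raise ValueError("invalid header")
    raw = body.encode("utf-8")
    if int(length_hex, 16) != len(raw):
        raise ValueError("length mismatch")
    if hashlib.md5(raw).hexdigest() != digest:
        raise ValueError("checksum mismatch")
    return json.loads(body)
```

Note that the whole JSON text is hashed and parsed on every call, even when only the values have changed.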

The disadvantage of this protocol is that rrdd has to parse the entire JSON structure each tick, even though most of the time only the values will change.

For this reason a new protocol is proposed.

Protocol V2

value	bits	format	notes
header string	(string length)*8	string	“DATASOURCES” as in the V1 protocol
data checksum	32	int32	binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values
metadata checksum	32	int32	binary-encoded crc32 of the metadata string (see below)
number of datasources	32	int32	only needed if the metadata has changed - otherwise RRDD can use a cached value
timestamp	64	double	Unix epoch
datasource values	n * 64	int64 | double	n is the number of datasources exported by the plugin; the type of each value depends on the value_type setting in the metadata [int64|float]
metadata length	32	int32
metadata	(string length)*8	string

All integers and doubles are big-endian. The metadata will have the same JSON-based format as in the V1 protocol, minus the timestamp and the value key-value pair of each datasource.

field	values	notes	required
description	string	Description of the datasource	no
owner	host | vm | sr	The object to which the data relates	no, default host
value_type	int64 | float	The type of the datasource	yes
type	absolute | derive | gauge	The type of measurement being sent. Absolute is for counters which are reset on reading; derive stores the derivative of the recorded values (useful for metrics which continually increase, like the amount of data written since start); gauge is for things like temperature	no, default absolute
default	true | false	Whether the source is default enabled or not	no, default false
units		The units the data should be displayed in	no
min		The minimum value for the datasource	no, default -infinity
max		The maximum value for the datasource	no, default +infinity

Example

{
  "datasources": {
    "memory_reclaimed": {
      "description":"Host memory reclaimed by squeezed",
      "owner":"host",
      "value_type":"int64",
      "type":"absolute",
      "default":"true",
      "units":"B",
      "min":"-inf",
      "max":"inf"
    },
    "memory_reclaimed_max": {
      "description":"Host memory that could be reclaimed by squeezed",
      "owner":"host",
      "value_type":"int64",
      "type":"absolute",
      "default":"true",
      "units":"B",
      "min":"-inf",
      "max":"inf"
    },
    "cpu-temp-cpu0": {
      "description": "Temperature of CPU 0",
      "owner":"host",
      "value_type": "float",
      "type": "absolute",
      "default":"true",
      "units": "degC",
      "min":"-inf",
      "max":"inf"
    },
    "cpu-temp-cpu1": {
      "description": "Temperature of CPU 1",
      "owner":"host",
      "value_type": "float",
      "type": "absolute",
      "default":"true",
      "units": "degC",
      "min":"-inf",
      "max":"inf"
    }
  }
}

The above formatting is not required, but added here for readability.
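
To make the layout in the protocol table concrete, the following Python sketch packs a V2 payload with the standard struct and zlib modules. It is illustrative only: the function name encode_v2 is hypothetical, the values are assumed to be written in the order in which the datasources appear in the metadata, and the checksums are packed as unsigned 32-bit integers, which gives the same 32 bits on the wire as the int32 in the table.

```python
import json
import struct
import zlib

HEADER = b"DATASOURCES"


def encode_v2(timestamp, values, metadata):
    """Pack one payload; `values` maps each datasource name to its value."""
    meta = json.dumps(metadata).encode("utf-8")
    # The timestamp and every value are big-endian and 64 bits wide.
    data = struct.pack(">d", timestamp)
    for name, source in metadata["datasources"].items():
        fmt = ">q" if source["value_type"] == "int64" else ">d"
        data += struct.pack(fmt, values[name])
    return b"".join([
        HEADER,
        struct.pack(">I", zlib.crc32(data)),  # data checksum
        struct.pack(">I", zlib.crc32(meta)),  # metadata checksum
        struct.pack(">i", len(metadata["datasources"])),
        data,                                 # timestamp followed by values
        struct.pack(">i", len(meta)),
        meta,
    ])
```

Because the metadata comes last, a plugin whose metadata is unchanged only has to rewrite the data checksum, the timestamp and the values on each tick.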

Reading algorithm

if header != expected_header:
    raise InvalidHeader()
if data_checksum == last_data_checksum:
    raise NoUpdate()
if data_checksum != crc32(encoded_timestamp_and_values):
    raise InvalidChecksum()
if metadata_checksum != last_metadata_checksum:
    # The metadata has changed: verify it and rebuild the cached datasources.
    if metadata_checksum != crc32(metadata):
        raise InvalidChecksum()
    cached_datasources = create_datasources(metadata)
    last_metadata_checksum = metadata_checksum
# Pair each cached datasource with the value at the same position.
for datasource, value in zip(cached_datasources, values):
    update(datasource, value)
last_data_checksum = data_checksum
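
The reading algorithm can be sketched in Python against the byte layout of the V2 table. This is a sketch under stated assumptions, not the rrdd implementation: the class name PluginReader and the NoUpdate exception are hypothetical, ValueError stands in for InvalidHeader and InvalidChecksum, values are assumed to follow the order of the datasources in the metadata, and the datasource count is read from every payload for simplicity even though the table allows it to be cached.

```python
import json
import struct
import zlib

HEADER = b"DATASOURCES"


class NoUpdate(Exception):
    """The data checksum is unchanged since the last read."""


class PluginReader:
    def __init__(self):
        self.last_data_crc = None
        self.last_meta_crc = None
        self.datasources = []  # cached (name, value_type) pairs

    def read(self, payload):
        if not payload.startswith(HEADER):
            raise ValueError("invalid header")
        pos = len(HEADER)
        data_crc, meta_crc, count = struct.unpack_from(">IIi", payload, pos)
        pos += 12
        if data_crc == self.last_data_crc:
            raise NoUpdate()
        # Timestamp plus `count` values, 8 bytes each.
        data = payload[pos:pos + 8 * (count + 1)]
        if zlib.crc32(data) != data_crc:
            raise ValueError("invalid data checksum")
        if meta_crc != self.last_meta_crc:
            # Only parse the JSON metadata when its checksum has changed.
            pos += len(data)
            (meta_len,) = struct.unpack_from(">i", payload, pos)
            meta = payload[pos + 4:pos + 4 + meta_len]
            if zlib.crc32(meta) != meta_crc:
                raise ValueError("invalid metadata checksum")
            sources = json.loads(meta)["datasources"]
            self.datasources = [(n, s["value_type"]) for n, s in sources.items()]
            self.last_meta_crc = meta_crc
        (timestamp,) = struct.unpack_from(">d", data, 0)
        values = {}
        for i, (name, vtype) in enumerate(self.datasources):
            fmt = ">q" if vtype == "int64" else ">d"
            (values[name],) = struct.unpack_from(fmt, data, 8 * (i + 1))
        self.last_data_crc = data_crc
        return timestamp, values
```

On a tick where only the values change, the JSON parser is never invoked, which is the saving the new protocol is designed to achieve.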

Design document
Revision	v11
Status	confirmed
Review	#139
Revision history
v1	Initial version
v2	Added details about the VDI's binary format and size, and the SR capability name.
v3	Tar was not needed after all!
v4	Add details about discovering the VDI using a new vdi_type.
v5	Add details about the http handlers and interaction with xapi's database
v6	Add details about the framing of the data within the VDI
v7	Redesign semantics of the rrd_updates handler
v8	Redesign semantics of the rrd_updates handler (again)
v9	Magic number change in framing format of vdi
v10	Add details of new APIs added to xapi and xcp-rrdd
v11	Remove unneeded API calls

SR-Level RRDs

Introduction

Stats Collection

Archiving

SR-level RRDs will be archived in the SR itself, in a VDI, rather than in the local filesystem of the SR master. This way, we don’t need to worry about master failover.

There will be a simple framing format for the data on the VDI. This will be as follows:

Offset	Type	Name	Comment
0	32 bit network-order int	magic	Magic number = 0x7ada7ada
4	32 bit network-order int	version	1
8	32 bit network-order int	length	length of payload
12	gzipped data	data
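
A minimal Python sketch of this framing follows. It assumes that the length field counts the gzipped bytes that follow the 12-byte header, which the table does not state explicitly; the function names are hypothetical.

```python
import gzip
import struct

MAGIC = 0x7ADA7ADA
VERSION = 1
# magic, version and length, each a 32-bit integer in network byte order
FRAME_HEADER = struct.Struct(">III")


def frame(rrd_bytes):
    """Compress the payload and prepend the framing header."""
    data = gzip.compress(rrd_bytes)
    return FRAME_HEADER.pack(MAGIC, VERSION, len(data)) + data


def unframe(blob):
    """Check the header and return the decompressed payload."""
    magic, version, length = FRAME_HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic number")
    if version != VERSION:
        raise ValueError("unsupported version")
    start = FRAME_HEADER.size
    return gzip.decompress(blob[start:start + length])
```

The explicit length lets a reader ignore whatever follows the payload, which matters because a VDI is normally larger than the data stored in it.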

Management of the SR-stats VDI

The SR-stats VDI will be attached/detached on PBD.plug/unplug on the SR master.

On PBD.plug on the SR master, if the SR has the stats capability, xapi:
- Creates a stats VDI if not already there (search for an existing one based on the VDI type).
- Attaches the stats VDI if it did already exist, and copies the RRDs to the local file system (standard location in the filesystem; asks xcp-rrdd where to put them).
- Informs xcp-rrdd about the RRDs so that it will load the RRDs and add newly recorded data to them (needs a function like push_rrd_local for VM-level RRDs).
- Detaches stats VDI.
On PBD.unplug on the SR master, if the SR has the stats capability, xapi:
- Tells xcp-rrdd to archive the RRDs for the SR, which it will do to the local filesystem.
- Attaches the stats VDI, copies the RRDs into it, detaches VDI.

Periodic Archiving

Exporting

There will be a new handler for downloading an SR RRD:

http://<server>/sr_rrd?session_id=<SESSION HANDLE>&uuid=<SR UUID>

Whether the host RRD updates are returned is governed by the presence of host=true in the parameters. host=<anything else> or the absence of the host key will mean the host RRD is not returned.

It will be possible to mix and match these parameters; for example to return RRD updates for the host and all VMs, the URL to use would be:

http://<server>/rrd_updates?session_id=<SESSION HANDLE>&start=10258122541&host=true&vm_uuid=all&sr_uuid=none

Or, to return RRD updates for all SRs but nothing else, the URL to use would be:

http://<server>/rrd_updates?session_id=<SESSION HANDLE>&start=10258122541&host=false&vm_uuid=none&sr_uuid=all

While behaviour is defined if any of the keys host, vm_uuid and sr_uuid is missing, this is for backwards compatibility and it is recommended that clients specify each parameter explicitly.
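
The URLs above can be assembled programmatically. The following Python sketch builds an rrd_updates URL with every key given explicitly, as recommended; the function name is hypothetical, and the server and session values are placeholders.

```python
from urllib.parse import urlencode


def rrd_updates_url(server, session_id, start,
                    host=False, vm_uuid="none", sr_uuid="none"):
    """Build an rrd_updates URL that always specifies host, vm_uuid and sr_uuid."""
    query = urlencode({
        "session_id": session_id,
        "start": start,
        "host": "true" if host else "false",
        "vm_uuid": vm_uuid,
        "sr_uuid": sr_uuid,
    })
    return f"http://{server}/rrd_updates?{query}"
```

For example, rrd_updates_url("server", "handle", 10258122541, sr_uuid="all") requests updates for all SRs and nothing else.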

Database updating

The utilisation of VDIs will not be updated in this way until scalability worries for RRDs are addressed.

Xapi will cache whether it is SR master for every attached SR and only attempt to update if it is the SR master.

New APIs

xcp-rrdd:

Get the filesystem location where sr rrds are archived: val sr_rrds_path : uid:string -> string
Archive the sr rrds to the filesystem: val archive_sr_rrd : sr_uuid:string -> unit
Load the sr rrds from the filesystem: val push_sr_rrd : sr_uuid:string -> unit

XenAPI

XenAPI Basics

This document contains a description of the Xen Management API - an interface for remotely configuring and controlling virtualised guests running on a Xen-enabled host.

The API is presented here as a set of Remote Procedure Calls (RPCs). There are two supported wire formats, one based upon XML-RPC and one based upon JSON-RPC (v1.0 and v2.0 are both recognized). No specific language bindings are prescribed, although examples are given in the Python programming language.

Although we adopt some terminology from object-oriented programming, future client language bindings may or may not be object-oriented. The API reference uses the terminology classes and objects. For our purposes a class is simply a hierarchical namespace; an object is an instance of a class with its fields set to specific values. Objects are persistent and exist on the server-side. Clients may obtain opaque references to these server-side objects and then access their fields via get/set RPCs.

For each class we specify a list of fields along with their types and qualifiers. A qualifier is one of:

RO/runtime: the field is Read Only. Furthermore, its value is automatically computed at runtime. For example, current CPU load and disk IO throughput.
RO/constructor: the field must be manually set when a new object is created, but is then Read Only for the duration of the object’s life. For example, the maximum memory addressable by a guest is set before the guest boots.
RW: the field is Read/Write. For example, the name of a VM.

Types

The following types are used to specify methods and fields in the API Reference:

string: Text strings.
int: 64-bit integers.
float: IEEE double-precision floating-point numbers.
bool: Boolean.
datetime: Date and timestamp.
c ref: Reference to an object of class c.
t set: Arbitrary-length set of values of type t.
(k -> v) map: Mapping from values of type k to values of type v.
e enum: Enumeration type with name e. Enums are defined in the API reference together with classes that use them.

Note that there are a number of cases where refs are doubly linked. For example, a VM has a field called VIFs of type VIF ref set; this field lists the network interfaces attached to a particular VM. Similarly, the VIF class has a field called VM of type VM ref which references the VM to which the interface is connected. These two fields are bound together, in the sense that creating a new VIF causes the VIFs field of the corresponding VM object to be updated automatically.

The API reference lists explicitly the fields that are bound together in this way. It also contains a diagram that shows relationships between classes. In this diagram an edge signifies the existence of a pair of fields that are bound together, using standard crows-foot notation to signify the type of relationship (e.g. one-many, many-many).

RPCs associated with fields

Each field, f, has an RPC accessor associated with it that returns f’s value:

get_f(r): takes a ref, r, that refers to an object and returns the value of f.

Each field, f, with qualifier RW and whose outermost type is set has the following additional RPCs associated with it:

add_f(r, v): adds a new element v to the set. Note that sets cannot contain duplicate values, hence this operation has no action in the case that v is already in the set.
remove_f(r, v): removes element v from the set.

Each field, f, with qualifier RW and whose outermost type is map has the following additional RPCs associated with it:

add_to_f(r, k, v): adds new pair k -> v to the mapping stored in f in object r. Attempting to add a new pair for duplicate key, k, fails with a MAP_DUPLICATE_KEY error.
remove_from_f(r, k): removes the pair with key k from the mapping stored in f in object r.

Each field whose outermost type is neither set nor map, but whose qualifier is RW has an RPC accessor associated with it that sets its value:

set_f(r, v): sets the field f on object r to value v.
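
The rules above can be summarised as a small function that derives the accessor names for a field. This Python sketch only restates the rules as written in this section; the function name is hypothetical, and the API reference remains the authority on which RPCs a given field actually has.

```python
def field_rpcs(field, qualifier, outer_type=None):
    """Return the accessor RPC names for a field.

    `qualifier` is "RW", "RO/runtime" or "RO/constructor";
    `outer_type` is "set", "map" or None for any other type.
    """
    rpcs = [f"get_{field}"]  # every field has a getter
    if qualifier == "RW":
        if outer_type == "set":
            rpcs += [f"add_{field}", f"remove_{field}"]
        elif outer_type == "map":
            rpcs += [f"add_to_{field}", f"remove_from_{field}"]
        else:
            rpcs.append(f"set_{field}")
    return rpcs
```

For instance, an RW field named name_label yields get_name_label and set_name_label, while a read-only field yields only its getter.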

RPCs associated with classes

Most classes have a constructor RPC named create that takes as parameters all fields marked RW and RO/constructor. The result of this RPC is that a new persistent object is created on the server-side with the specified field values.
Each class has a get_by_uuid(uuid) RPC that returns the object of that class that has the specified uuid.
Each class that has a name_label field has a get_by_name_label(name_label) RPC that returns a set of objects of that class that have the specified name_label.
Most classes have a destroy(r) RPC that explicitly deletes the persistent object specified by r from the system. This is a non-cascading delete - if the object being removed is referenced by another object then the destroy call will fail.

Apart from the RPCs enumerated above, most classes have additional RPCs associated with them. For example, the VM class has RPCs for cloning, suspending, starting etc. Such additional RPCs are described explicitly in the API reference.

Wire Protocol

API calls are sent over a network to a Xen-enabled host using an RPC protocol. Here we describe how the higher-level types used in our API Reference are mapped to primitive RPC types, covering the two supported wire formats XML-RPC and JSON-RPC.

XML-RPC Protocol

We specify the signatures of API functions in the following style:

(VM ref set)  VM.get_all()

This specifies that the function with name VM.get_all takes no parameters and returns a set of VM ref. These types are mapped onto XML-RPC types in a straightforward manner:

the types float, bool, datetime, and string map directly to the XML-RPC <double>, <boolean>, <dateTime.iso8601>, and <string> elements.
all ref types are opaque references, encoded as the XML-RPC’s <string> type. Users of the API should not make assumptions about the concrete form of these strings and should not expect them to remain valid after the client’s session with the server has terminated.
fields named uuid of type string are mapped to the XML-RPC <string> type. The string itself is the OSF DCE UUID presentation format (as output by uuidgen).
int is assumed to be 64-bit in our API and is encoded as a string of decimal digits (rather than using XML-RPC’s built-in 32-bit <i4> type).
values of enum types are encoded as strings. For example, the value destroy of enum on_normal_exit, would be conveyed as:

    <value><string>destroy</string></value>

for all our types, t, our type t set simply maps to XML-RPC’s <array> type, so, for example, a value of type string set would be transmitted like this:

    <array>
      <data>
        <value><string>CX8</string></value>
        <value><string>PSE36</string></value>
        <value><string>FPU</string></value>
      </data>
    </array>

for types k and v, our type (k -> v) map maps onto an XML-RPC <struct>, with each key as the name of a struct member. Note that the (k -> v) map type is only valid when k is a string, ref, or int, and in each case the keys of the maps are stringified as above. For example, the (string -> float) map containing the mappings Mike -> 2.3 and John -> 1.2 would be represented as:

    <value>
      <struct>
        <member>
          <name>Mike</name>
          <value><double>2.3</double></value>
        </member>
        <member>
          <name>John</name>
          <value><double>1.2</double></value>
        </member>
      </struct>
    </value>

our void type is transmitted as an empty string.
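
These mappings can be observed with Python's standard xmlrpc.client module, which serialises lists as <array> and dictionaries as <struct>. The sketch below is illustrative: the method name example.method and the helper encode_int are hypothetical, and the int is converted to a string by hand because that is how this API transmits 64-bit integers.

```python
import xmlrpc.client


def encode_int(n):
    """64-bit ints travel as strings of decimal digits rather than <i4>."""
    return str(n)


params = (
    ["CX8", "PSE36", "FPU"],     # string set -> <array>
    {"Mike": 2.3, "John": 1.2},  # (string -> float) map -> <struct>
    encode_int(2**40),           # int -> <string>
)

# Serialise a call, then parse it back to check the round trip.
xml = xmlrpc.client.dumps(params, methodname="example.method")
decoded, method = xmlrpc.client.loads(xml)
```

After the round trip the integer is still a string, so a client has to convert it back with int() itself.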

XML-RPC Return Values and Status Codes

The return value of an RPC call is an XML-RPC <struct>.

The first element of the struct is named Status; it contains a string value indicating whether the result of the call was a Success or a Failure.

If the Status is Success then the struct contains a second element named Value:

The element of the struct named Value contains the function’s return value.

If the Status is Failure then the struct contains a second element named ErrorDescription:

The element of the struct named ErrorDescription contains an array of string values. The first element of the array is an error code; the rest of the elements are strings representing error parameters relating to that code.

For example, an XML-RPC return value from the host.get_resident_VMs function may look like this:

    <struct>
       <member>
         <name>Status</name>
         <value>Success</value>
       </member>
       <member>
          <name>Value</name>
          <value>
            <array>
               <data>
                 <value>81547a35-205c-a551-c577-00b982c5fe00</value>
                 <value>61c85a22-05da-b8a2-2e55-06b0847da503</value>
                 <value>1d401ec4-3c17-35a6-fc79-cee6bd9811fe</value>
               </data>
            </array>
         </value>
       </member>
    </struct>
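
A client typically unwraps this struct once, in a single place. The following Python sketch assumes the response has already been decoded into a dictionary, as Python's xmlrpc.client does; the names unwrap and XenAPIFailure are hypothetical.

```python
class XenAPIFailure(Exception):
    """Raised when the Status member of a response is Failure."""

    def __init__(self, description):
        super().__init__(*description)
        self.code = description[0]     # the error code
        self.params = description[1:]  # the error parameters


def unwrap(response):
    """Return the Value of a successful call or raise XenAPIFailure."""
    if response["Status"] == "Success":
        return response["Value"]
    raise XenAPIFailure(response["ErrorDescription"])
```

Keeping the error code and its parameters separate lets the caller react to specific codes without parsing a message string.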

JSON-RPC Protocol

We specify the signatures of API functions in the following style:

(VM ref set)  VM.get_all()

This specifies that the function with name VM.get_all takes no parameters and returns a set of VM ref. These types are mapped onto JSON-RPC types in the following manner:

the types float and bool map directly to the JSON types number and boolean, while datetime and string are represented as the JSON string type.
all ref types are opaque references, encoded as the JSON string type. Users of the API should not make assumptions about the concrete form of these strings and should not expect them to remain valid after the client’s session with the server has terminated.
fields named uuid of type string are mapped to the JSON string type. The string itself is the OSF DCE UUID presentation format (as output by uuidgen).
int is assumed to be 64-bit in our API and is encoded as a JSON number without decimal point or exponent, preserved as a string.
values of enum types are encoded as the JSON string type. For example, the value destroy of enum on_normal_exit, would be conveyed as:

  "destroy"

for all our types, t, our type t set simply maps to the JSON array type, so, for example, a value of type string set would be transmitted like this:

  [ "CX8", "PSE36", "FPU" ]

for types k and v, our type (k -> v) map maps onto a JSON object which contains members with name k and value v. Note that the (k -> v) map type is only valid when k is a string, ref, or int, and in each case the keys of the maps are stringified as above. For example, the (string -> float) map containing the mappings Mike -> 2.3 and John -> 1.2 would be represented as:

  {
    "Mike": 2.3,
    "John": 1.2
  }

our void type is transmitted as an empty string.

Both versions 1.0 and 2.0 of the JSON-RPC wire format are recognised and, depending on your client library, you can use either of them.

JSON-RPC v1.0

JSON-RPC v1.0 Requests

An API call is represented by sending a single JSON object to the server, which contains the members method, params, and id.

method: A JSON string containing the name of the function to be invoked.
params: A JSON array of values, which represents the parameters of the function to be invoked.
id: A JSON string or integer representing the call id. Note that, diverging from the JSON-RPC v1.0 specification, the API does not accept notification requests (requests without responses), i.e. the id cannot be null.

For example, the body of a JSON-RPC v1.0 request to retrieve the resident VMs of a host may look like this:

  {
    "method": "host.get_resident_VMs",
    "params": [
      "OpaqueRef:74f1a19cd-b660-41e3-a163-10f03e0eae67",
      "OpaqueRef:08c34fc9-f418-4f09-8274-b9cb25cd8550"
    ],
    "id": "xyz"
  }

In the above example, the first element of the params array is the reference of the open session to the host, while the second is the host reference.

JSON-RPC v1.0 Return Values

The return value of a JSON-RPC v1.0 call is a single JSON object containing the members result, error, and id.

result: If the call is successful, it is a JSON value (string, array etc.) representing the return value of the invoked function. If an error has occurred, it is null.
error: If the call is successful, it is null. If the call has failed, it is a JSON array of string values. The first element of the array is an error code; the remainder of the array are strings representing error parameters relating to that code.
id: The call id. It is a JSON string or integer and it is the same id as the request it is responding to.

For example, a JSON-RPC v1.0 return value from the host.get_resident_VMs function may look like this:

  {
    "result": [
        "OpaqueRef:604f51e7-630f-4412-83fa-b11c6cf008ab",
        "OpaqueRef:670d08f5-cbeb-4336-8420-ccd56390a65f"
    ],
    "error": null,
    "id": "xyz"
  }

while the return value of the same call made on a logged out session may look like this:

  {
    "result": null,
    "error": [
        "SESSION_INVALID",
        "OpaqueRef:93f1a23cd-a640-41e3-b163-10f86e0eae67"
    ],
    "id": "xyz"
  }

JSON-RPC v2.0

JSON-RPC v2.0 Requests

An API call is represented by sending a single JSON object to the server, which contains the members jsonrpc, method, params, and id.

jsonrpc: A JSON string specifying the version of the JSON-RPC protocol. It is exactly “2.0”.
method: A JSON string containing the name of the function to be invoked.
params: A JSON array of values, which represents the parameters of the function to be invoked. Although the JSON-RPC v2.0 specification allows this member to be omitted, in practice all API calls accept at least one parameter.
id: A JSON string or integer representing the call id. Note that, diverging from the JSON-RPC v2.0 specification, it cannot be null. Neither can it be omitted, because the API does not accept notification requests (requests without responses).

For example, the body of a JSON-RPC v2.0 request to retrieve the VMs resident on a host may look like this:

  {
    "jsonrpc": "2.0",
    "method": "host.get_resident_VMs",
    "params": [
      "OpaqueRef:c90cd28f-37ec-4dbf-88e6-f697ccb28b39",
      "OpaqueRef:08c34fc9-f418-4f09-8274-b9cb25cd8550"
    ],
    "id": 3
  }

As before, the first element of the parameter array is the reference of the open session to the host, while the second is the host reference.

JSON-RPC v2.0 Return Values

The return value of a JSON-RPC v2.0 call is a single JSON object containing the members jsonrpc, either result or error depending on the outcome of the call, and id.

jsonrpc: A JSON string specifying the version of the JSON-RPC protocol. It is exactly “2.0”.
result: If the call is successful, it is a JSON value (string, array etc.) representing the return value of the invoked function. If an error has occurred, it does not exist.
error: If the call is successful, it does not exist. If the call has failed, it is a single structured JSON object (see below).
id: The call id. It is a JSON string or integer and it is the same id as the request it is responding to.

The error object contains the members code, message, and data.

code: The API does not make use of this member and only retains it for compliance with the JSON-RPC v2.0 specification. It is a JSON integer which has a non-zero value.
message: A JSON string representing an API error code.
data: A JSON array of string values representing error parameters relating to the aforementioned API error code.

For example, a JSON-RPC v2.0 return value from the host.get_resident_VMs function may look like this:

  {
    "jsonrpc": "2.0",
    "result": [
        "OpaqueRef:604f51e7-630f-4412-83fa-b11c6cf008ab",
        "OpaqueRef:670d08f5-cbeb-4336-8420-ccd56390a65f"
    ],
    "id": 3
  }

while the return value of the same call made on a logged out session may look like this:

  {
    "jsonrpc": "2.0",
    "error": {
        "code": 1,
        "message": "SESSION_INVALID",
        "data": [
            "OpaqueRef:c90cd28f-37ec-4dbf-88e6-f697ccb28b39"
        ]
    },
    "id": 3
  }
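
The request and response shapes of both JSON-RPC versions can be handled by one pair of helpers. The following Python sketch is illustrative; the names build_request, parse_response and RpcFailure are hypothetical, and the transport (an HTTP POST of the request body) is left out.

```python
import json


class RpcFailure(Exception):
    """An API error carrying the error code and its parameters."""

    def __init__(self, code, params):
        super().__init__(code, *params)
        self.code = code
        self.params = list(params)


def build_request(method, params, call_id, version="2.0"):
    """Serialise a call; the id is always present and never null."""
    request = {"method": method, "params": list(params), "id": call_id}
    if version == "2.0":
        request["jsonrpc"] = "2.0"
    return json.dumps(request)


def parse_response(body):
    """Return the result of a v1.0 or v2.0 response, or raise RpcFailure."""
    response = json.loads(body)
    error = response.get("error")  # null in v1.0 or absent in v2.0 on success
    if error is None:
        return response["result"]
    if isinstance(error, dict):
        # v2.0: the API error code is carried in the message member
        raise RpcFailure(error["message"], error["data"])
    # v1.0: an array whose first element is the error code
    raise RpcFailure(error[0], error[1:])
```

A caller would pass the session reference as the first element of params, followed by the arguments of the function being invoked.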

Errors

When a low-level transport error occurs, or a request is malformed at the HTTP or RPC level, the server may send an HTTP 500 error response, or the client may simulate the same. The client must be prepared to handle these errors, though they may be treated as fatal.

For example, the following malformed request when using the XML-RPC protocol:

$ curl -D - -X POST https://server -H 'Content-Type: application/xml' \
  -d '<?xml version="1.0"?>
  <methodCall>
    <methodName>session.logout</methodName>
  </methodCall>'

results in the following response:

HTTP/1.1 500 Internal Error
content-length: 297
content-type:text/html
connection:close
cache-control:no-cache, no-store

<html><body><h1>HTTP 500 internal server error</h1>An unexpected error occurred;
 please wait a while and try again. If the problem persists, please contact your
 support representative.<h1> Additional information </h1>Xmlrpc.Parse_error(&quo
t;close_tag&quot;, &quot;open_tag&quot;, _)</body></html>

When using the JSON-RPC protocol:

$ curl -D - -X POST https://server/jsonrpc -H 'Content-Type: application/json' \
  -d '{
      "jsonrpc": "2.0",
      "method": "session.login_with_password",
      "id": 0
  }'

the response is:

HTTP/1.1 500 Internal Error
content-length: 308
content-type:text/html
connection:close
cache-control:no-cache, no-store

<html><body><h1>HTTP 500 internal server error</h1>An unexpected error occurred;
 please wait a while and try again. If the problem persists, please contact your
 support representative.<h1> Additional information </h1>Jsonrpc.Malformed_metho
d_request(&quot;{jsonrpc=...,method=...,id=...}&quot;)</body></html>

All other failures are reported with a more structured error response, to allow better automatic response to failures, proper internationalization of any error message, and easier debugging.

On the wire, these are transmitted like this when using the XML-RPC protocol:

<struct>
    <member>
        <name>Status</name>
        <value>Failure</value>
    </member>
    <member>
        <name>ErrorDescription</name>
        <value>
            <array>
                <data>
                    <value>MAP_DUPLICATE_KEY</value>
                    <value>Customer</value>
                    <value>eSpiel Inc.</value>
                    <value>eSpiel Incorporated</value>
                </data>
            </array>
        </value>
    </member>
</struct>

Note that the ErrorDescription value is an array of string values. The first element of the array is an error code; the remainder of the array are strings representing error parameters relating to that code. In this case, the client has attempted to add the mapping Customer -> eSpiel Incorporated to a Map, but it already contains the mapping Customer -> eSpiel Inc., hence the request has failed.

When using the JSON-RPC protocol v2.0, the above error is transmitted as:

{
    "jsonrpc": "2.0",
    "error": {
        "code": 1,
        "message": "MAP_DUPLICATE_KEY",
        "data": [
            "Customer",
            "eSpiel Inc.",
            "eSpiel Incorporated"
        ]
    },
    "id": 3
}

Finally, when using the JSON-RPC protocol v1.0:

{
    "result": null,
    "error": [
        "MAP_DUPLICATE_KEY",
        "Customer",
        "eSpiel Inc.",
        "eSpiel Incorporated"
    ],
    "id": "xyz"
}

Each possible error code is documented in the last section of the API reference.

Note on References vs UUIDs

References are opaque types - encoded as XML-RPC and JSON-RPC strings on the wire - understood only by the particular server which generated them. Servers are free to choose any concrete representation they find convenient; clients should not make any assumptions or attempt to parse the string contents. References are not guaranteed to be permanent identifiers for objects; clients should not assume that references generated during one session are valid for any future session. References do not allow objects to be compared for equality. Two references to the same object are not guaranteed to be textually identical.

UUIDs are intended to be permanent identifiers for objects. They are guaranteed to be in the OSF DCE UUID presentation format (as output by uuidgen). Clients may store UUIDs on disk and use them to look up objects in subsequent sessions with the server. Clients may also test equality on objects by comparing UUID strings.

The API provides mechanisms for translating between UUIDs and opaque references. Each class that contains a UUID field provides:

A get_by_uuid method that takes a UUID and returns an opaque reference to the server-side object that has that UUID;
A get_uuid function (a regular “field getter” RPC) that takes an opaque reference and returns the UUID of the server-side object that is referenced by it.
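
Because two references to the same object need not be textually identical, equality should be tested on UUIDs. The Python sketch below assumes an XML-RPC client whose results are dictionaries with a Value key, as in the return value format described earlier; the function name same_object is hypothetical, and api_class stands for a class namespace of the proxy, such as the VM class.

```python
def same_object(api_class, session, ref_a, ref_b):
    """Compare two opaque references by the UUIDs of the objects they name."""
    # Error handling is omitted: a Failure response has no Value key.
    uuid_a = api_class.get_uuid(session, ref_a)["Value"]
    uuid_b = api_class.get_uuid(session, ref_b)["Value"]
    return uuid_a == uuid_b
```

Comparing ref_a == ref_b directly may report a difference even when both references name the same server-side object.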

Making RPC Calls

Transport Layer

The following transport layers are currently supported:

HTTP/HTTPS for remote administration
HTTP over Unix domain sockets for local administration

Session Layer

The RPC interface is session-based; before you can make arbitrary RPC calls you must login and initiate a session. For example:

   (session ref) session.login_with_password(string uname, string pwd,
                   string version, string originator)

where uname and pwd refer to your username and password, as defined by the Xen administrator, while version and originator are optional. The session ref returned by session.login_with_password is passed to subsequent RPC calls as an authentication token. Note that a session reference obtained by a login request to the XML-RPC backend can be used in subsequent requests to the JSON-RPC backend, and vice versa.

A session can be terminated with the session.logout function:

   void  session.logout(session ref session_id)
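
Pairing every login with a logout is easy to get wrong when an exception interrupts the client. The following Python sketch wraps the two calls in a context manager. It assumes an XML-RPC proxy object such as xmlrpc.client.ServerProxy, whose results are dictionaries with Status and Value keys; the name xapi_session is hypothetical, and the optional version and originator arguments are left out.

```python
from contextlib import contextmanager


@contextmanager
def xapi_session(proxy, uname, pwd):
    """Log in, yield the session reference, and always log out afterwards."""
    response = proxy.session.login_with_password(uname, pwd)
    if response["Status"] != "Success":
        raise RuntimeError(response["ErrorDescription"])
    session = response["Value"]
    try:
        yield session
    finally:
        # Runs even if the body of the with block raises.
        proxy.session.logout(session)
```

A client would then write "with xapi_session(proxy, user, password) as session:" and pass session as the first argument of each call inside the block.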

Synchronous and Asynchronous Invocation

Each method call (apart from methods on the Session and Task objects and “getters” and “setters” derived from fields) can be made either synchronously or asynchronously. A synchronous RPC call blocks until the return value is received; the return value of a synchronous RPC call is exactly as specified above.

Only synchronous API calls are listed explicitly in this document. All their asynchronous counterparts are in the special Async namespace. For example, the synchronous call VM.clone(...) has an asynchronous counterpart, Async.VM.clone(...), that is non-blocking.

Instead of returning its result directly, an asynchronous RPC call returns an identifier of type task ref which is subsequently used to track the status of a running asynchronous RPC.

Note that an asynchronous call may fail immediately, before a task has even been created. When using the XML-RPC wire protocol, this eventuality is represented by wrapping the returned task ref in an XML-RPC struct with Status, ErrorDescription, and Value fields, exactly as specified above; the task ref is provided in the Value field if Status is set to Success. When using the JSON-RPC protocol, the task ref is wrapped in a response JSON object as specified above and it is provided by the value of the result member of a successful call.

The RPC call

    (task ref set)  Task.get_all(session ref session_id)

returns a set of all task identifiers known to the system. The status (including any returned result and error codes) of these can then be queried by accessing the fields of the Task object in the usual way. Note that, in order to get a consistent snapshot of a task’s state, it is advisable to call the get_record function.
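
Tracking an asynchronous call usually means polling the task until it leaves its initial state. The Python sketch below is a minimal illustration under stated assumptions: get_record is a caller-supplied function that wraps the Task class's get_record RPC and returns the task record as a dictionary, and the record is assumed to have a status field whose value is "pending" while the task is still running (consult the API reference for the exact field names and values). The function name wait_for_task is hypothetical.

```python
import time


def wait_for_task(get_record, task_ref, poll_interval=1.0, timeout=60.0):
    """Poll a task until it is no longer pending and return its final record."""
    deadline = time.monotonic() + timeout
    while True:
        # One get_record call gives a consistent snapshot of the task.
        record = get_record(task_ref)
        if record["status"] != "pending":
            return record
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_ref} is still pending")
        time.sleep(poll_interval)
```

Reading the whole record in one call avoids combining fields fetched at different moments, which is why get_record is preferred over individual getters here.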

Example interactive session

This section describes how an interactive session might look, using Python XML-RPC and JSON-RPC client libraries.

First, initialise Python:

$ python3
>>>

Using the XML-RPC Protocol

Import the library xmlrpc.client and create a Python object referencing the remote server as shown below:

>>> import xmlrpc.client
>>> xen = xmlrpc.client.ServerProxy("https://localhost:443")

Note that you may need to disable SSL certificate validation to establish the connection. This can be done as follows:

>>> import ssl
>>> ctx = ssl._create_unverified_context()
>>> xen = xmlrpc.client.ServerProxy("https://localhost:443", context=ctx)

Acquire a session reference by logging in with a username and password; the session reference is returned under the key Value in the resulting dictionary (error-handling omitted for brevity):

>>> session = xen.session.login_with_password("user", "passwd",
...                                           "version", "originator")['Value']

This is what the call looks like when serialized:

<?xml version='1.0'?>
<methodCall>
    <methodName>session.login_with_password</methodName>
    <params>
        <param><value><string>user</string></value></param>
        <param><value><string>passwd</string></value></param>
        <param><value><string>version</string></value></param>
        <param><value><string>originator</string></value></param>
    </params>
</methodCall>

Next, the user may acquire a list of all the VMs known to the system (note the call takes the session reference as the only parameter):

>>> all_vms = xen.VM.get_all(session)['Value']
>>> all_vms
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]

The VM references here have the form OpaqueRef:X (though they may not be that simple in reality) and you should treat them as opaque strings. Templates are VMs with the is_a_template field set to true. We can find the subset of template VMs using a command like the following:

>>> all_templates = [x for x in all_vms
...                  if xen.VM.get_is_a_template(session, x)['Value']]

Once a reference to a VM has been acquired, a lifecycle operation may be invoked:

>>> xen.VM.start(session, all_templates[0], False, False)
{'Status': 'Failure', 'ErrorDescription': ['VM_IS_TEMPLATE', 'OpaqueRef:X']}

In this case the start message has been rejected, because the VM is a template, and so an error response has been returned. These high-level errors are returned as structured data (rather than as XML-RPC faults), allowing them to be internationalized.
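Because every XML-RPC response carries the same Status, Value and ErrorDescription fields, a client may prefer to unwrap them in one place. The following is a minimal sketch of such a helper; the names `unwrap` and `XenAPIFailure` are hypothetical and not part of any library.

```python
class XenAPIFailure(Exception):
    """Raised when a response has Status set to 'Failure'."""

    def __init__(self, description):
        super().__init__(*description)
        self.code = description[0]     # e.g. 'VM_IS_TEMPLATE'
        self.params = description[1:]  # e.g. ['OpaqueRef:X']


def unwrap(response):
    """Return the Value of a successful response, or raise XenAPIFailure."""
    if response["Status"] == "Success":
        return response["Value"]
    raise XenAPIFailure(response["ErrorDescription"])
```

With this helper, a call such as `unwrap(xen.VM.get_all(session))` yields the list of references directly, and a rejected VM.start surfaces as an exception whose `code` attribute holds the error name.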

Rather than querying fields individually, whole records may be returned at once. To retrieve the record of a single object as a python dictionary:

>>> record = xen.VM.get_record(session, all_templates[0])['Value']
>>> record['power_state']
'Halted'
>>> record['name_label']
'Windows 10 (64-bit)'

To retrieve all the VM records in a single call:

>>> records = xen.VM.get_all_records(session)['Value']
>>> list(records.keys())
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]
>>> records['OpaqueRef:1']['name_label']
'Red Hat Enterprise Linux 7'

Using the JSON-RPC Protocol

For this example we are making use of the package jsonrpcclient and the requests library due to their simplicity, although other packages can also be used.

First, import the requests and jsonrpcclient libraries:

>>> import requests
>>> import jsonrpcclient

Now we construct a utility method to make using these libraries easier:

>>> def jsonrpccall(method, params):
...     r = requests.post("https://localhost:443/jsonrpc",
...                       json=jsonrpcclient.request(method, params=params),
...                       verify=False)
...     p = jsonrpcclient.parse(r.json())
...     if isinstance(p, jsonrpcclient.Ok):
...         return p.result
...     raise Exception(p.message, p.data)

Acquire a session reference by logging in with a username and password:

>>> session = jsonrpccall("session.login_with_password",
...                       ("user", "passwd", "version", "originator"))

jsonrpcclient uses the JSON-RPC protocol v2.0, so this is what the serialized request looks like:

  {
    "jsonrpc": "2.0",
    "method": "session.login_with_password",
    "params": ["user", "passwd", "version", "originator"],
    "id": 0
  }

Next, the user may acquire a list of all the VMs known to the system (note the call takes the session reference as the only parameter):

>>> all_vms = jsonrpccall("VM.get_all", (session,))
>>> all_vms
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]

>>> all_templates = filter(
...     lambda x: jsonrpccall("VM.get_is_a_template", (session, x)),
...     all_vms)

Once a reference to a VM has been acquired, a lifecycle operation may be invoked:

>>> try:
...     jsonrpccall("VM.start", (session, next(all_templates), False, False))
... except Exception as e:
...     e
...
Exception('VM_IS_TEMPLATE', ['OpaqueRef:1', 'start'])

In this case the start message has been rejected because the VM is a template, hence an error response has been returned. These high-level errors are returned as structured data, allowing them to be internationalized.

Rather than querying fields individually, whole records may be returned at once. To retrieve the record of a single object as a python dictionary:

>>> record = jsonrpccall("VM.get_record", (session, next(all_templates)))
>>> record['power_state']
'Halted'
>>> record['name_label']
'Windows 10 (64-bit)'

To retrieve all the VM records in a single call:

>>> records = jsonrpccall("VM.get_all_records", (session,))
>>> list(records.keys())
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]
>>> records['OpaqueRef:1']['name_label']
'Red Hat Enterprise Linux 7'

Overview of the XenAPI

This chapter introduces the XenAPI and its associated object model. The API has the following key features:

Management of all aspects of the XenServer Host. The API allows you to manage VMs, storage, networking, host configuration and pools. Performance and status metrics can also be queried from the API.
Persistent Object Model. The results of all side-effecting operations (e.g. object creation, deletion and parameter modifications) are persisted in a server-side database that is managed by the XenServer installation.
An event mechanism. Through the API, clients can register to be notified when persistent (server-side) objects are modified. This enables applications to keep track of datamodel modifications performed by concurrently executing clients.
Synchronous and asynchronous invocation. All API calls can be invoked synchronously (that is, block until completion); any API call that may be long-running can also be invoked asynchronously. Asynchronous calls return immediately with a reference to a task object. This task object can be queried (through the API) for progress and status information. When an asynchronously invoked operation completes, the result (or error code) is available from the task object.
Remotable and Cross-Platform. The client issuing the API calls does not have to be resident on the host being managed; nor does it have to be connected to the host over ssh in order to execute the API. API calls make use of the XML-RPC protocol to transmit requests and responses over the network.
Secure and Authenticated Access. The XML-RPC API server executing on the host accepts secure socket connections. This allows a client to execute the APIs over the https protocol. Further, all the API calls execute in the context of a login session generated through username and password validation at the server. This provides secure and authenticated access to the XenServer installation.

Getting Started with the API

We will start our tour of the API by describing the calls required to create a new VM on a XenServer installation, and take it through a start/suspend/resume/stop cycle. This is done without reference to code in any specific language; at this stage we just describe the informal sequence of RPC invocations that accomplish our “install and start” task.

Authentication: acquiring a session reference

The first step is to call Session.login_with_password(username, password, version, originator). The API is session based, so before you can make other calls you will need to authenticate with the server. Assuming the username and password are authenticated correctly, the result of this call is a session reference. Subsequent API calls take the session reference as a parameter. In this way we ensure that only API users who are suitably authorized can perform operations on a XenServer installation. You can continue to use the same session for any number of API calls. When you have finished with the session, Citrix recommends that you call Session.logout(session) to clean up: see later.

Acquiring a list of templates to base a new VM installation on

The next step is to query the list of “templates” on the host. Templates are specially-marked VM objects that specify suitable default parameters for a variety of supported guest types. (If you want to see a quick enumeration of the templates on a XenServer installation for yourself then you can execute the xe template-list CLI command.) To get a list of templates from the API, we need to find the VM objects on the server that have their is_a_template field set to true. One way to do this is by calling VM.get_all_records(session) where the session parameter is the reference we acquired from our Session.login_with_password call earlier. This call queries the server, returning a snapshot (taken at the time of the call) containing all the VM object references and their field values.

(Remember that at this stage we are not concerned about the particular mechanisms by which the returned object references and field values can be manipulated in any particular client language: that detail is dealt with by our language-specific API bindings and described concretely in the following chapter. For now it suffices just to assume the existence of an abstract mechanism for reading and manipulating objects and field values returned by API calls.)

Now that we have a snapshot of all the VM objects’ field values in the memory of our client application we can simply iterate through them and find the ones that have their “is_a_template” set to true. At this stage let’s assume that our example application further iterates through the template objects and remembers the reference corresponding to the one that has its “name_label” set to “Debian Etch 4.0” (one of the default Linux templates supplied with XenServer).
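To make the iteration concrete, here is a minimal sketch that searches such a snapshot. It assumes the snapshot is a Python dictionary mapping VM references to records, as returned by VM.get_all_records in the interactive session earlier; the function name `find_template` is hypothetical.

```python
def find_template(records, name_label):
    """Return the reference of the first template with the given name_label.

    `records` maps VM references to record dictionaries. Returns None if
    no matching template exists.
    """
    for ref, record in records.items():
        if record["is_a_template"] and record["name_label"] == name_label:
            return ref
    return None
```

For example, `find_template(records, "Debian Etch 4.0")` would return the reference that the text calls t_ref, provided that template is present.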

Installing the VM based on a template

Continuing through our example, we must now install a new VM based on the template we selected. The installation process requires 4 API calls:

First we must invoke the API call VM.clone(session, t_ref, "my first VM"). This tells the server to clone the VM object referenced by t_ref in order to make a new VM object. The return value of this call is the VM reference corresponding to the newly-created VM. Let’s call this new_vm_ref.
Next, we need to specify the UUID of the Storage Repository where the VM’s disks will be instantiated. We have to put this in the sr attribute in the disk provisioning XML stored under the “disks” key in the other_config map of the newly-created VM. This field can be updated by calling its getter (other_config <- VM.get_other_config(session, new_vm_ref)) and then its setter (VM.set_other_config(session, new_vm_ref, other_config)) with the modified other_config map.
At this stage the object referred to by new_vm_ref is still a template (just like the VM object referred to by t_ref, from which it was cloned). To make new_vm_ref into a VM object we need to call VM.provision(session, new_vm_ref). When this call returns the new_vm_ref object will have had its is_a_template field set to false, indicating that new_vm_ref now refers to a regular VM ready for starting.

Note
The provision operation may take a few minutes, as it is during this call that the template’s disk images are created. In the case of the Debian template, the newly created disks are also at this stage populated with a Debian root filesystem.
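The four installation calls can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: `call(method, params)` is a hypothetical helper that performs one RPC and returns the unwrapped result, and the disk provisioning XML is assumed to consist of disk elements that each carry an sr attribute.

```python
import xml.etree.ElementTree as ET


def set_disk_sr(disks_xml, sr_uuid):
    """Set the sr attribute of every disk element in the provisioning XML."""
    root = ET.fromstring(disks_xml)
    for disk in root.iter("disk"):
        disk.set("sr", sr_uuid)
    return ET.tostring(root, encoding="unicode")


def install_from_template(call, session, t_ref, name, sr_uuid):
    """Clone a template, point its disks at an SR and provision the clone."""
    new_vm_ref = call("VM.clone", (session, t_ref, name))
    # Read the whole map, modify the "disks" entry, then write it back.
    other_config = call("VM.get_other_config", (session, new_vm_ref))
    other_config["disks"] = set_disk_sr(other_config["disks"], sr_uuid)
    call("VM.set_other_config", (session, new_vm_ref, other_config))
    # May take a few minutes: the disk images are created here.
    call("VM.provision", (session, new_vm_ref))
    return new_vm_ref
```

Parsing the XML rather than editing it as a string keeps the sketch independent of attribute order and quoting in the template's provisioning data.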

Taking the VM through a start/suspend/resume/stop cycle

Now we have an object reference representing our newly-installed VM, it is trivial to take it through a few lifecycle operations:

To start our VM we can just call VM.start(session, new_vm_ref)
After it’s running, we can suspend it by calling VM.suspend(session, new_vm_ref),
and then resume it by calling VM.resume(session, new_vm_ref).
We can call VM.shutdown(session, new_vm_ref) to shut down the VM cleanly.
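The cycle above might be scripted as in the following sketch. The helper `call(method, params)` is hypothetical and stands for one RPC. Note that the informal description omits some parameters: the interactive session earlier passes two extra booleans to VM.start, and the sketch assumes that these are start_paused and force and that VM.resume takes the same pair.

```python
def run_lifecycle(call, session, vm_ref):
    """Take a VM through a start/suspend/resume/shutdown cycle."""
    # Assumed extra parameters: start_paused=False, force=False.
    call("VM.start", (session, vm_ref, False, False))
    call("VM.suspend", (session, vm_ref))
    call("VM.resume", (session, vm_ref, False, False))
    call("VM.shutdown", (session, vm_ref))
```

Each call blocks until the operation has completed, so no polling is needed between the steps.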

Logging out

Once an application is finished interacting with a XenServer Host it is good practice to call Session.logout(session). This invalidates the session reference (so it cannot be used in subsequent API calls) and simultaneously deallocates server-side memory used to store the session object.

Although inactive sessions will eventually time out, the server has a hardcoded limit of 500 concurrent sessions for each username or originator. Once this limit has been reached, fresh logins will evict the session objects that have been used least recently, causing their associated session references to become invalid. For successful interoperability with other applications that access the server concurrently, the best policy is:

Choose a string that identifies your application and its version.
Create a single session at start-of-day, using that identifying string for the originator parameter to Session.login_with_password.
Use this session throughout the application (note that sessions can be used across multiple separate client-server network connections) and then explicitly log out when possible.

If a poorly written client leaks sessions or otherwise exceeds the limit, then as long as the client uses an appropriate originator argument, it will be easily identifiable from the XenServer logs and XenServer will destroy the longest-idle sessions of the rogue client only; this may cause problems for that client but not for other clients. If the misbehaving client did not specify an originator, it would be harder to identify and would cause the premature destruction of sessions of any clients that also did not specify an originator.
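One way to follow this policy in Python is to wrap login and logout in a context manager, so that the session is released even if the application raises an exception. This is a minimal sketch: `call(method, params)` is a hypothetical helper that performs one RPC, and the originator string used in the usage note is an invented example.

```python
from contextlib import contextmanager


@contextmanager
def xapi_session(call, username, password, version, originator):
    """Log in once, yield the session reference, and always log out."""
    session = call("session.login_with_password",
                   (username, password, version, originator))
    try:
        yield session
    finally:
        # Runs on normal exit and on exceptions, so no session is leaked.
        call("session.logout", (session,))
```

An application would then run its whole workload inside a single block such as `with xapi_session(call, "user", "passwd", "version", "my-app/1.0") as session:`.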

Install and start example: summary

We have seen how the API can be used to install a VM from a XenServer template and perform a number of lifecycle operations on it. You will note that the number of calls we had to make in order to effect these operations was small:

One call to acquire a session: Session.login_with_password()
One call to query the VM (and template) objects present on the XenServer installation: VM.get_all_records(). Recall that we used the information returned from this call to select a suitable template to install from.
Four calls to install a VM from our chosen template: VM.clone(), followed by the getter and setter of the other_config field to specify where to create the disk images of the template, and then VM.provision().
One call to start the resultant VM: VM.start() (and similarly other single calls to suspend, resume and shutdown accordingly)
And then one call to log out: Session.logout()

The take-home message here is that, although the API as a whole is complex and fully featured, common tasks (such as creating and performing lifecycle operations on VMs) are very straightforward to perform, requiring only a small number of simple API calls. Keep this in mind while you study the next section which may, on first reading, appear a little daunting!

Object Model Overview

This section gives a high-level overview of the object model of the API. A more detailed description of the parameters and methods of each class outlined here can be found in the XenServer API Reference document.

We start by giving a brief outline of some of the core classes that make up the API. (Don’t worry if these definitions seem somewhat abstract in their initial presentation; the textual description in subsequent sections, and the code-sample walk through in the next Chapter will help make these concepts concrete.)

Class	Description
VM	A VM object represents a particular virtual machine instance on a XenServer Host or Resource Pool. Example methods include `start`, `suspend`, `pool_migrate`; example parameters include `power_state`, `memory_static_max`, and `name_label`. (In the previous section we saw how the VM class is used to represent both templates and regular VMs)
Host	A host object represents a physical host in a XenServer pool. Example methods include `reboot` and `shutdown`. Example parameters include `software_version`, `hostname`, and [IP] `address`.
VDI	A VDI object represents a Virtual Disk Image. Virtual Disk Images can be attached to VMs, in which case a block device appears inside the VM through which the bits encapsulated by the Virtual Disk Image can be read and written. Example methods of the VDI class include “resize” and “clone”. Example fields include “virtual_size” and “sharable”. (When we called `VM.provision` on the VM template in our previous example, some VDI objects were automatically created to represent the newly created disks, and attached to the VM object.)
SR	An SR (Storage Repository) aggregates a collection of VDIs and encapsulates the properties of physical storage on which the VDIs’ bits reside. Example parameters include `type` (which determines the storage-specific driver a XenServer installation uses to read/write the SR’s VDIs) and `physical_utilisation`; example methods include `scan` (which invokes the storage-specific driver to acquire a list of the VDIs contained within the SR and the properties of these VDIs) and `create` (which initializes a block of physical storage so it is ready to store VDIs).
Network	A network object represents a layer-2 network that exists in the environment in which the XenServer Host instance lives. Since XenServer does not manage networks directly this is a lightweight class that serves merely to model physical and virtual network topology. VM and Host objects that are attached to a particular Network object (by virtue of VIF and PIF instances – see below) can send network packets to each other.

At this point, readers who are finding this enumeration of classes rather terse may wish to skip to the code walk-throughs of the next chapter: there are plenty of useful applications that can be written using only a subset of the classes already described! For those who wish to continue this description of classes in the abstract, read on.

On top of the classes listed above, there are 4 more that act as connectors, specifying relationships between VMs and Hosts, and Storage and Networks. The first 2 of these classes that we will consider, VBD and VIF, determine how VMs are attached to virtual disks and network objects respectively:

Class	Description
VBD	A VBD (Virtual Block Device) object represents an attachment between a VM and a VDI. When a VM is booted its VBD objects are queried to determine which disk images (VDIs) should be attached. Example methods of the VBD class include “plug” (which hot plugs a disk device into a running VM, making the specified VDI accessible therein) and “unplug” (which hot unplugs a disk device from a running guest); example fields include “device” (which determines the device name inside the guest under which the specified VDI will be made accessible).
VIF	A VIF (Virtual network InterFace) object represents an attachment between a VM and a Network object. When a VM is booted its VIF objects are queried to determine which network devices should be created. Example methods of the VIF class include “plug” (which hot plugs a network device into a running VM) and “unplug” (which hot unplugs a network device from a running guest).

The second set of “connector classes” that we will consider determine how Hosts are attached to Networks and Storage.

Class	Description
PIF	A PIF (Physical InterFace) object represents an attachment between a Host and a Network object. If a host is connected to a Network (over a PIF) then packets on the specified network can be transmitted/received by the corresponding host. Example fields of the PIF class include “device” (which specifies the device name to which the PIF corresponds – e.g. eth0) and “MAC” (which specifies the MAC address of the underlying NIC that a PIF represents). Note that PIFs abstract both physical interfaces and VLANs (the latter distinguished by the existence of a positive integer in the “VLAN” field).
PBD	A PBD (Physical Block Device) object represents an attachment between a Host and an SR (Storage Repository) object. Fields include “currently_attached” (which specifies whether the chunk of storage represented by the specified SR object is currently available to the host) and “device_config” (which specifies storage-driver specific parameters that determine how the low-level storage devices are configured on the specified host – e.g. in the case of an SR rendered on an NFS filer, device_config may specify the host-name of the filer and the path on the filer in which the SR files live).

Graphical overview of API classes for managing VMs, Hosts, Storage and Networking

The figure above presents a graphical overview of the API classes involved in managing VMs, Hosts, Storage and Networking. From this diagram, the symmetry between storage and network configuration, and also the symmetry between virtual machine and host configuration is plain to see.

Working with VIFs and VBDs

In this section we walk through a few more complex scenarios, describing informally how various tasks involving virtual storage and network devices can be accomplished using the API.

Creating disks and attaching them to VMs

Let’s start by considering how to make a new blank disk image and attach it to a running VM. We will assume that we already have ourselves a running VM, and we know its corresponding API object reference (e.g. we may have created this VM using the procedure described in the previous section, and had the server return its reference to us.) We will also assume that we have authenticated with the XenServer installation and have a corresponding session reference. Indeed in the rest of this chapter, for the sake of brevity, we will stop mentioning sessions altogether.

Creating a new blank disk image

The first step is to instantiate the disk image on physical storage. We do this by calling VDI.create(). The VDI.create call takes a number of parameters, including:

name_label and name_description: a human-readable name/description for the disk (e.g. for convenient display in the UI etc.). These fields can be left blank if desired.
SR: the object reference of the Storage Repository representing the physical storage in which the VDI’s bits will be placed.
read_only: setting this field to true indicates that the VDI can only be attached to VMs in a read-only fashion. (Attempting to attach a VDI with its read_only field set to true in a read/write fashion results in error.)

Invoking the VDI.create call causes the XenServer installation to create a blank disk image on physical storage, create an associated VDI object (the datamodel instance that refers to the disk image on physical storage) and return a reference to this newly created VDI object.

The way in which the disk image is represented on physical storage depends on the type of the SR in which the created VDI resides. For example, if the SR is of type “lvm” then the new disk image will be rendered as an LVM volume; if the SR is of type “nfs” then the new disk image will be a sparse VHD file created on an NFS filer. (You can query the SR type through the API using the SR.get_type() call.)

Note
Some SR types might round up the virtual-size value to make it divisible by a configured block size.
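As a rough illustration, the parameters of VDI.create can be gathered into a single record. The sketch below is not authoritative: the exact set of required fields (in particular virtual_size, type, sharable and other_config), the value "user" for type, and the string encoding of the size are assumptions to verify against the API reference. `call(method, params)` is a hypothetical helper that performs one RPC.

```python
def new_vdi_record(sr_ref, name_label, virtual_size, read_only=False):
    """Build a record describing a blank disk image (field set assumed)."""
    return {
        "name_label": name_label,
        "name_description": "",
        "SR": sr_ref,
        # Sent as a string on the assumption that 64-bit integers are
        # encoded that way on the wire.
        "virtual_size": str(virtual_size),
        "type": "user",
        "sharable": False,
        "read_only": read_only,
        "other_config": {},
    }


def create_blank_disk(call, session, sr_ref, name_label, virtual_size):
    """Create the disk image and return the reference of the new VDI."""
    record = new_vdi_record(sr_ref, name_label, virtual_size)
    return call("VDI.create", (session, record))
```

Because some SR types round the size up, a client that needs the exact figure should read the virtual_size field back from the created VDI rather than rely on the requested value.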

Attaching the disk image to a VM

So far we have a running VM (that we assumed the existence of at the start of this example) and a fresh VDI that we just created. Right now, these are both independent objects that exist on the XenServer Host, but there is nothing linking them together. So our next step is to create such a link, associating the VDI with our VM.

The attachment is formed by creating a new “connector” object called a VBD (Virtual Block Device). To create our VBD we invoke the VBD.create() call. The VBD.create() call takes a number of parameters including:

VM - the object reference of the VM to which the VDI is to be attached
VDI - the object reference of the VDI that is to be attached
mode - specifies whether the VDI is to be attached in a read-only or a read-write fashion
userdevice - specifies the block device inside the guest through which applications running inside the VM will be able to read/write the VDI’s bits.
type - specifies whether the VDI should be presented inside the VM as a regular disk or as a CD. (Note that this particular field has more meaning for Windows VMs than it does for Linux VMs, but we will not explore this level of detail in this chapter.)

Invoking VBD.create makes a VBD object on the XenServer installation and returns its object reference. However, this call in itself does not have any side-effects on the running VM (that is, if you go and look inside the running VM you will see that the block device has not been created). The fact that the VBD object exists but that the block device in the guest is not active, is reflected by the fact that the VBD object’s currently_attached field is set to false.

A VM object with 2 associated VDIs

For expository purposes, the figure above presents a graphical example that shows the relationship between VMs, VBDs, VDIs and SRs. In this instance a VM object has 2 attached VDIs: there are 2 VBD objects that form the connections between the VM object and its VDIs; and the VDIs reside within the same SR.

Hotplugging the VBD

If we rebooted the VM at this stage then, after rebooting, the block device corresponding to the VBD would appear: on boot, XenServer queries all VBDs of a VM and actively attaches each of the corresponding VDIs.

Rebooting the VM is all very well, but recall that we wanted to attach a newly created blank disk to a running VM. This can be achieved by invoking the plug method on the newly created VBD object. When the plug call returns successfully, the block device to which the VBD relates will have appeared inside the running VM – i.e. from the perspective of the running VM, the guest operating system is led to believe that a new disk device has just been hot plugged. Mirroring this fact in the managed world of the API, the currently_attached field of the VBD is set to true.

Unsurprisingly, the VBD plug method has a dual called “unplug”. Invoking the unplug method on a VBD object causes the associated block device to be hot unplugged from a running VM, setting the currently_attached field of the VBD object to false accordingly.
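Putting the attachment and hotplug steps together, a client might proceed as in the sketch below. It is written under assumptions: `call(method, params)` is a hypothetical helper that performs one RPC, and the record fields beyond VM, VDI, mode, userdevice and type (as well as the enum values "RW", "RO" and "Disk") should be checked against the API reference.

```python
def attach_disk(call, session, vm_ref, vdi_ref, userdevice, read_only=False):
    """Create a VBD linking vm_ref to vdi_ref, then hot plug it.

    Returns the VBD reference and the value of its currently_attached field.
    """
    vbd_record = {
        "VM": vm_ref,
        "VDI": vdi_ref,
        "userdevice": userdevice,
        "mode": "RO" if read_only else "RW",
        "type": "Disk",
        # The remaining fields are assumed defaults for a plain data disk.
        "bootable": False,
        "unpluggable": True,
        "empty": False,
        "other_config": {},
        "qos_algorithm_type": "",
        "qos_algorithm_params": {},
    }
    # Creating the VBD alone has no effect on the running VM.
    vbd_ref = call("VBD.create", (session, vbd_record))
    # Plugging makes the block device appear inside the guest.
    call("VBD.plug", (session, vbd_ref))
    attached = call("VBD.get_currently_attached", (session, vbd_ref))
    return vbd_ref, attached
```

Detaching is the mirror image: a call to VBD.unplug on the same reference sets currently_attached back to false.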

Creating and attaching Network Devices to VMs

The API calls involved in configuring virtual network interfaces in VMs are similar in many respects to the calls involved in configuring virtual disk devices. For this reason we will not run through a full example of how one can create network interfaces using the API object-model; instead we will use this section just to outline briefly the symmetry between virtual networking device and virtual storage device configuration.

The networking analogue of the VBD class is the VIF class. Just as a VBD is the API representation of a block device inside a VM, a VIF (Virtual network InterFace) is the API representation of a network device inside a VM. Whereas VBDs associate VM objects with VDI objects, VIFs associate VM objects with Network objects. Just like VBDs, VIFs have a currently_attached field that determines whether the network device (inside the guest) associated with the VIF is currently active. And as we saw with VBDs, at VM boot-time the VIFs of the VM are queried and a corresponding network device for each is created inside the booting VM. Similarly, VIFs also have plug and unplug methods for hot plugging/unplugging network devices in/out of running VMs.

Host configuration for networking and storage

We have seen that the VBD and VIF classes are used to manage configuration of block devices and network devices (respectively) inside VMs. To manage host configuration of storage and networking there are two analogous classes: PBD (Physical Block Device) and PIF (Physical [network] InterFace).

Host storage configuration: PBDs

Let us start by considering the PBD class. A PBD.create() call takes a number of parameters including:

Parameter	Description
host	physical machine on which the PBD is available
SR	the Storage Repository that the PBD connects to
device_config	a string-to-string map that is provided to the host’s SR-backend-driver, containing the low-level parameters required to configure the physical storage device(s) on which the SR is to be realized. The specific contents of the `device_config` field depend on the type of the SR to which the PBD is connected. (Executing `xe sm-list` will show a list of possible SR types; the configuration field in this enumeration specifies the `device_config` parameters that each SR type expects.)

For example, imagine we have an SR object s of type “nfs” (representing a directory on an NFS filer within which VDIs are stored as VHD files); and let’s say that we want a host, h, to be able to access s. In this case we invoke PBD.create() specifying host h, SR s, and a value for the device_config parameter that is the following map:

("server", "my_nfs_server.example.com"), ("serverpath", "/scratch/mysrs/sr1")

This tells the XenServer Host that SR s is accessible on host h, and further that to access SR s, the host needs to mount the directory /scratch/mysrs/sr1 on the NFS server named my_nfs_server.example.com.

Like VBD objects, PBD objects also have a field called currently_attached. Storage repositories can be attached and detached from a given host by invoking PBD.plug and PBD.unplug methods respectively.
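The NFS example can be sketched as follows. `call(method, params)` is a hypothetical helper that performs one RPC; the record contains only the three parameters listed above, and any further fields the server may require are not shown.

```python
def attach_nfs_sr(call, session, host_ref, sr_ref, server, serverpath):
    """Create a PBD connecting host_ref to an NFS-backed SR and plug it."""
    pbd_record = {
        "host": host_ref,
        "SR": sr_ref,
        # String-to-string map handed to the host's SR backend driver.
        "device_config": {"server": server, "serverpath": serverpath},
    }
    pbd_ref = call("PBD.create", (session, pbd_record))
    # Plugging attaches the storage repository to the host.
    call("PBD.plug", (session, pbd_ref))
    return pbd_ref
```

For the values in the text this would be invoked as `attach_nfs_sr(call, session, h, s, "my_nfs_server.example.com", "/scratch/mysrs/sr1")`.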

Host networking configuration: PIFs

Host network configuration is specified by virtue of PIF objects. If a PIF object connects a network object, n, to a host object h, then the network corresponding to n is bridged onto a physical interface (or a physical interface plus a VLAN tag) specified by the fields of the PIF object.

For example, imagine a PIF object exists connecting host h to a network n, and that the device field of the PIF object is set to eth0. This means that all packets on network n are bridged to the NIC in the host corresponding to host network device eth0.
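To find out which interface carries a given network on a given host, a client can inspect PIF records. The sketch below assumes a dictionary of PIF records keyed by reference (for instance from a get_all_records call) with host, network, device and VLAN fields, and assumes that a negative VLAN value marks an untagged physical interface.

```python
def interfaces_for_network(pif_records, host_ref, network_ref):
    """List (device, vlan) pairs that bridge network_ref onto host_ref.

    vlan is None for an untagged interface (negative VLAN field, assumed).
    """
    result = []
    for record in pif_records.values():
        if record["host"] == host_ref and record["network"] == network_ref:
            vlan = int(record["VLAN"])  # may arrive as a string
            result.append((record["device"], vlan if vlan >= 0 else None))
    return result
```

In the example from the text, the function would return a single pair whose device is eth0.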

XML-RPC notes

Datetimes

The API deviates from the XML-RPC specification in handling of datetimes. The API appends a “Z” to the end of datetime strings, which is meant to indicate that the time is expressed in UTC.
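A client that receives such a string can convert it into a timezone-aware value. The following sketch assumes the layout YYYYMMDDTHH:MM:SS followed by the trailing Z; the exact layout is an assumption and should be confirmed against real server output.

```python
from datetime import datetime, timezone


def parse_xapi_datetime(text):
    """Parse a datetime string with a trailing 'Z' as a UTC datetime."""
    # strptime matches the literal 'Z'; the UTC zone is attached explicitly.
    naive = datetime.strptime(text, "%Y%m%dT%H:%M:%SZ")
    return naive.replace(tzinfo=timezone.utc)
```

Attaching the timezone explicitly avoids accidentally interpreting the value in the client's local time.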

API evolution

All APIs evolve as bugs are fixed, new features are added and old features are removed; the XenAPI is no exception. This document lists policies describing how the XenAPI evolves over time.

The goals of XenAPI evolution are:

to allow bugs to be fixed efficiently;
to allow new, innovative features to be added easily;
to keep old, unmodified clients working as much as possible; and
where backwards-incompatible changes are to be made, publish this information early to enable affected parties to give timely feedback.

Background

In this document, the term XenAPI refers to the XMLRPC-derived wire protocol used by xapi. The XenAPI has objects which each have fields and messages. The XenAPI is described in detail elsewhere.

XenAPI Lifecycle

graph LR
    Prototype -->|1| Published -->|4| Deprecated -->|5| Removed
    Published -->|2,3| Published

Each element of the XenAPI (objects, messages and fields) follows the lifecycle diagram above. When an element is newly created and is still in development, it is in the Prototype state. Elements in this state may be stubs: the interface is there and can be used by clients for prototyping their new features, but the actual implementation is not yet ready.

When the element subsequently becomes ready for use (the stub is replaced by a real implementation), it transitions to the Published state. This is the only state in which the object, message or field should be used. From this point onwards, the element needs to have clearly defined semantics that are available for reference in the XenAPI documentation.

If the XenAPI element becomes Deprecated, it will still function as it did before, but its use is discouraged. The final stage of the lifecycle is the Removed state, in which the element is not available anymore.

The numbered state changes in the diagram have the following meaning:

Publish: declare that the XenAPI element is ready for people to use.
Extend: a backwards-compatible extension of the XenAPI, for example an additional parameter in a message with an appropriate default value. If the API is used as before, it still has the same effect.
Change: a backwards-incompatible change. That is, the message now behaves differently, or the field has different semantics. Such changes are discouraged and should only be considered in special cases (always consider whether deprecation is a better solution). The use of a message can for example be restricted for security or efficiency reasons, or the behaviour can be changed simply to fix a bug.
Deprecate: declare that the use of this XenAPI element should be avoided from now on. Reasons for doing this include: the element is redundant (it duplicates functionality elsewhere), it is inconsistent with other parts of the XenAPI, or it is insecure or inefficient. For examples of deprecation policies of other projects, see the policies of the Eclipse and OVAL projects.
Remove: the element is taken out of the public API and can no longer be used.
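
The lifecycle above can be read as a small state machine. The following Python fragment is a minimal sketch of that reading, not code taken from xapi: the table `ALLOWED` and the lower-case transition names are assumptions chosen for illustration.

```python
# Hypothetical model of the lifecycle diagram; the names are illustrative only.
ALLOWED = {
    ("Prototype", "publish"): "Published",
    ("Published", "extend"): "Published",
    ("Published", "change"): "Published",
    ("Published", "deprecate"): "Deprecated",
    ("Deprecated", "remove"): "Removed",
}


def apply_transition(state, transition):
    """Return the state reached from `state`, or raise ValueError."""
    try:
        return ALLOWED[(state, transition)]
    except KeyError:
        raise ValueError(
            "%r is not allowed in state %r" % (transition, state)
        ) from None


def final_state(transitions, start="Prototype"):
    """Apply a sequence of transitions, checking each step."""
    state = start
    for transition in transitions:
        state = apply_transition(state, transition)
    return state
```

For instance, `final_state(["publish", "extend", "deprecate", "remove"])` walks an element through its whole life, while `apply_transition("Prototype", "remove")` is rejected because a prototype cannot be removed directly in this model.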

Each lifecycle transition must be accompanied by an explanation describing the change and the reason for the change. This message should be enough to understand the semantics of the XenAPI element after the change, and in the case of backwards-incompatible changes or deprecation, it should give directions about how to modify a client to deal with the change (for example, how to avoid using the deprecated field or message).

Releases

Every release must be accompanied by release notes listing all objects, fields and messages that are newly prototyped, published, extended, changed, deprecated or removed in the release. Each item should have an explanation as implied above, documenting the new or changed XenAPI element. The release notes for every release shall be prominently displayed in the XenAPI HTML documentation.

Documentation

The XenAPI documentation will contain its complete lifecycle history for each XenAPI element. Only the elements described in the documentation are “official” and supported.

Each object, message and field in datamodel.ml will have lifecycle metadata attached to it, which is a list of transitions (transition type * release * explanation string) as described above. Release notes are automatically generated from this data.
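
To illustrate how release notes could be derived from such metadata, here is a small Python sketch. It is not the generator used by xapi (the real metadata lives in datamodel.ml); the element names, release names and explanations in `LIFECYCLE` are invented sample data that only follow the (transition type, release, explanation) shape described above.

```python
from collections import defaultdict

# Invented sample data: element name -> list of
# (transition type, release, explanation) tuples.
LIFECYCLE = {
    "VM.example_field": [
        ("Prototyped", "rel-a", "Stub added for early feedback"),
        ("Published", "rel-b", "Implementation completed"),
    ],
    "VM.old_message": [
        ("Published", "rel-a", "Initial version"),
        ("Deprecated", "rel-b", "Use VM.new_message instead"),
    ],
}


def release_notes(lifecycle):
    """Group every transition by the release in which it happened."""
    notes = defaultdict(list)
    for element, transitions in sorted(lifecycle.items()):
        for kind, release, explanation in transitions:
            notes[release].append("%s: %s. %s" % (element, kind, explanation))
    return dict(notes)
```

Calling `release_notes(LIFECYCLE)` returns one list of note lines per release, which a documentation build could then render.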

Using the API

This chapter describes how to use the XenServer Management API from real programs to manage XenServer Hosts and VMs. The chapter begins with a walk-through of a typical client application and demonstrates how the API can be used to perform common tasks. Example code fragments are given in Python syntax but equivalent code in the other programming languages would look very similar. The language bindings themselves are discussed afterwards and the chapter finishes with walk-throughs of two complete examples.

Anatomy of a typical application

This section describes the structure of a typical application using the XenServer Management API. Most client applications begin by connecting to a XenServer Host and authenticating (e.g. with a username and password). Assuming the authentication succeeds, the server will create a “session” object and return a reference to the client. This reference will be passed as an argument to all future API calls. Once authenticated, the client may search for references to other useful objects (e.g. XenServer Hosts, VMs, etc.) and invoke operations on them. Operations may be invoked either synchronously or asynchronously; special task objects represent the state and progress of asynchronous operations. These application elements are all described in detail in the following sections.

Choosing a low-level transport

API calls can be issued over two transports:

SSL-encrypted TCP on port 443 (https) over an IP network
plaintext over a local Unix domain socket: /var/xapi/xapi

The SSL-encrypted TCP transport is used for all off-host traffic while the Unix domain socket can be used from services running directly on the XenServer Host itself. In the SSL-encrypted TCP transport, all API calls should be directed at the Resource Pool master; failure to do so will result in the error HOST_IS_SLAVE, which includes the IP address of the master as an error parameter.

Because the master host of a pool can change, especially if HA is enabled on a pool, clients must implement the following steps to detect a master host change and connect to the new master as required:

Subscribe to updates to the list of hosts, and maintain a current list of hosts in the pool

If the connection to the pool master fails to respond, attempt to connect to all hosts in the list until one responds

The first host to respond will return the HOST_IS_SLAVE error message, which contains the identity of the new pool master (unless of course the host is the new master)

Connect to the new master

Note
As a special-case, all messages sent through the Unix domain socket are transparently forwarded to the correct node.
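
The master-detection steps above can be sketched in Python as follows. This is an illustration under stated assumptions rather than code from the SDK: `Failure` is a local stand-in for XenAPI.Failure (which carries the error code and its parameters in a `details` list), and `try_login` is a hypothetical caller-supplied function that attempts to log in to one address. Treating an unreachable host as an `OSError` is also an assumption.

```python
class Failure(Exception):
    """Stand-in for XenAPI.Failure: `details` is [error_code, param, ...]."""

    def __init__(self, details):
        super().__init__(details)
        self.details = details


def find_master(hosts, try_login):
    """Return the address of the current pool master.

    `hosts` is the client's cached list of host addresses. `try_login`
    (hypothetical) logs in to one address and raises Failure or OSError.
    """
    for address in hosts:
        try:
            try_login(address)
            return address  # the login was accepted: this host is the master
        except Failure as e:
            if e.details and e.details[0] == "HOST_IS_SLAVE":
                return e.details[1]  # the error parameter names the master
            raise
        except OSError:
            continue  # host did not respond: try the next one
    raise RuntimeError("no host in the pool responded")
```

The client would then open its real connection to the returned address.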

Authentication and session handling

The vast majority of API calls take a session reference as their first parameter; failure to supply a valid reference will result in a SESSION_INVALID error being returned. Acquire a session reference by supplying a username and password to the login_with_password function.

Note
As a special-case, if this call is executed over the local Unix domain socket then the username and password are ignored and the call always succeeds.

Every session has an associated “last active” timestamp which is updated on every API call. The server software currently has a built-in limit of 500 active sessions and will remove those with the oldest “last active” field if this limit is exceeded for a given username or originator. In addition all sessions whose “last active” field is older than 24 hours are also removed. Therefore it is important to:

Specify an appropriate originator when logging in; and
Remember to log out of active sessions to avoid leaking them; and
Be prepared to log in again to the server if a SESSION_INVALID error is caught.
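
The last point can be handled with a small retry wrapper. The sketch below is one possible shape, not part of the bindings: `Failure` is a local stand-in for XenAPI.Failure, and `call` and `login` are hypothetical callables supplied by the application.

```python
class Failure(Exception):
    """Stand-in for XenAPI.Failure: `details` is [error_code, param, ...]."""

    def __init__(self, details):
        super().__init__(details)
        self.details = details


def call_with_relogin(call, login, max_attempts=2):
    """Run call(); if the session has expired, log in again and retry."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            return call()
        except Failure as e:
            is_last = attempt == max_attempts - 1
            if e.details[:1] != ["SESSION_INVALID"] or is_last:
                raise  # a different error, or no attempts left
            login()  # obtain a fresh session, then loop to retry
```

An application might wrap each API call as `call_with_relogin(lambda: do_work(), relogin)`, where `do_work` and `relogin` are its own functions.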

In the following Python fragment a connection is established over the Unix domain socket and a session is created:

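import XenAPI

# Connect over the local Unix domain socket; the username and password are ignored
session = XenAPI.xapi_local()
session.xenapi.login_with_password("root", "", "2.3", "My Widget v0.1")
try:
    ...
finally:
    # Always log out so that the session is not leaked
    session.xenapi.session.logout()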

Finding references to useful objects

Once an application has authenticated the next step is to acquire references to objects in order to query their state or invoke operations on them. All objects have a set of “implicit” messages which include the following:

get_by_name_label : return a list of all objects of a particular class with a particular label;
get_by_uuid : return a single object named by its UUID;
get_all : return a set of references to all objects of a particular class; and
get_all_records : return a map of reference to records for each object of a particular class.

For example, to list all hosts:

hosts = session.xenapi.host.get_all()

To find all VMs with the name “my first VM”:

vms = session.xenapi.VM.get_by_name_label('my first VM')

Note
Object name_label fields are not guaranteed to be unique and so the get_by_name_label API call returns a set of references rather than a single reference.

In addition to the methods of finding objects described above, most objects also contain references to other objects within fields. For example it is possible to find the set of VMs running on a particular host by calling:

vms = session.xenapi.host.get_resident_VMs(host)
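
When many objects need to be inspected, a single get_all_records call followed by client-side filtering avoids one round trip per object. The following sketch filters a map of VM records in plain Python. The sample data is hand-written in the shape of a get_all_records result, and the references such as "OpaqueRef:vm-1" are made up for illustration.

```python
def running_guests(vm_records):
    """Return the refs of running VMs, skipping templates and control domains."""
    return [
        ref
        for ref, rec in vm_records.items()
        if not rec["is_a_template"]
        and not rec["is_control_domain"]
        and rec["power_state"] == "Running"
    ]


# Hand-written sample in the shape of a VM.get_all_records() result
sample = {
    "OpaqueRef:vm-1": {"is_a_template": False, "is_control_domain": False,
                       "power_state": "Running"},
    "OpaqueRef:vm-2": {"is_a_template": True, "is_control_domain": False,
                       "power_state": "Halted"},
    "OpaqueRef:vm-3": {"is_a_template": False, "is_control_domain": True,
                       "power_state": "Running"},
}
guests = running_guests(sample)
```

With a live session the argument would be the result of `session.xenapi.VM.get_all_records()`.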

Invoking synchronous operations on objects

Once object references have been acquired, operations may be invoked on them. For example to start a VM:

session.xenapi.VM.start(vm, False, False)

All API calls are by default synchronous and will not return until the operation has completed or failed. For example in the case of VM.start the call does not return until the VM has started booting.

Note
When the VM.start call returns the VM will be booting. To determine when the booting has finished, wait for the in-guest agent to report internal statistics through the VM_guest_metrics object.

Using Tasks to manage asynchronous operations

To simplify managing operations which take quite a long time (e.g. VM.clone and VM.copy) functions are available in two forms: synchronous (the default) and asynchronous. Each asynchronous function returns a reference to a task object which contains information about the in-progress operation including:

whether it is pending
whether it has succeeded or failed
progress (in the range 0-1)
the result or error code returned by the operation

An application which wanted to track the progress of a VM.clone operation and display a progress bar would have code like the following:

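import time

# get_by_name_label returns a list of references; take the first match
vm = session.xenapi.VM.get_by_name_label('my vm')[0]
# The asynchronous variant returns a task reference immediately;
# the second argument is the name of the clone (an illustrative value)
task = session.xenapi.Async.VM.clone(vm, 'my cloned vm')
while session.xenapi.task.get_status(task) == "pending":
    progress = session.xenapi.task.get_progress(task)
    update_progress_bar(progress)
    time.sleep(1)
session.xenapi.task.destroy(task)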

Note
A well-behaved client should remember to delete tasks created by asynchronous operations when it has finished reading the result or error. If the number of tasks exceeds a built-in threshold then the server will delete the oldest of the completed tasks.
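
A polling loop is often wrapped in a helper that also enforces a time limit. The sketch below is one way to do this and is not part of the bindings: `wait_for_task`, its parameters and the timeout behaviour are assumptions, and `task_api` is any object that offers get_status and get_progress.

```python
import time


def wait_for_task(task_api, task, on_progress=None, poll_interval=1.0,
                  timeout=None):
    """Poll until the task leaves the "pending" state; return its status."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        status = task_api.get_status(task)
        if status != "pending":
            return status
        if on_progress is not None:
            on_progress(task_api.get_progress(task))
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("task %s is still pending" % task)
        time.sleep(poll_interval)
```

With the Python bindings this would presumably be called as `wait_for_task(session.xenapi.task, task, update_progress_bar)`, after which the caller reads the result or error and destroys the task.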

Subscribing to and listening for events

With the exception of the task and metrics classes, whenever an object is modified the server generates an event. Clients can subscribe to this event stream on a per-class basis and receive updates rather than resorting to frequent polling. Events come in three types:

add - generated when an object has been created;
del - generated immediately before an object is destroyed; and
mod - generated when an object’s field has changed.

Events also contain a monotonically increasing ID, the name of the class of object and a snapshot of the object state equivalent to the result of a get_record().

Clients register for events by calling event.register() with a list of class names or the special string “*”. Clients receive events by executing event.next() which blocks until events are available and returns the new events.

Note
Since the queue of generated events on the server is of finite length a very slow client might fail to read the events fast enough; if this happens an EVENTS_LOST error is returned. Clients should be prepared to handle this by re-registering for events and checking that the condition they are waiting for hasn’t become true while they were unregistered.

The following Python code fragment demonstrates how to print a summary of every event generated by a system (similar code exists in XenServer-SDK/XenServerPython/samples/watch-all-events.py):

fmt = "%8s  %20s  %5s  %s"
session.xenapi.event.register(["*"])
while True:
    try:
        for event in session.xenapi.event.next():
            name = "(unknown)"
            if "snapshot" in event:
                snapshot = event["snapshot"]
                if "name_label" in snapshot:
                    name = snapshot["name_label"]
            print(fmt % (event['id'], event['class'], event['operation'], name))
    except XenAPI.Failure as e:
        if e.details != ["EVENTS_LOST"]:
            raise
        # The server's event queue overflowed: unregister and register again
        print("Caught EVENTS_LOST; re-registering")
        session.xenapi.event.unregister(["*"])
        session.xenapi.event.register(["*"])
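
Rather than printing events, a client will often fold them into a local cache of object records. The sketch below shows the idea on plain dictionaries. It assumes that each event carries a "ref" key identifying the object in addition to the "operation" and "snapshot" keys used above; the sample events are made up.

```python
def apply_event(cache, event):
    """Update `cache` (object ref -> record) from one event dictionary."""
    ref = event["ref"]  # assumption: events identify the object by reference
    if event["operation"] == "del":
        cache.pop(ref, None)  # the object is about to be destroyed
    else:
        # "add" or "mod": the snapshot is the object's current record
        cache[ref] = event["snapshot"]
    return cache


cache = {}
apply_event(cache, {"operation": "add", "ref": "OpaqueRef:a",
                    "snapshot": {"name_label": "vm-a"}})
apply_event(cache, {"operation": "mod", "ref": "OpaqueRef:a",
                    "snapshot": {"name_label": "vm-a-renamed"}})
```

After an EVENTS_LOST error such a cache can no longer be trusted and would need to be rebuilt, for example from get_all_records.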

Language bindings

C

The SDK includes the source to the C language binding in the directory XenServer-SDK/libxenserver/src together with a Makefile which compiles the binding into a library. Every API object is associated with a header file which contains declarations for all that object’s API functions; for example the type definitions and functions required to invoke VM operations are all contained in xen_vm.h.

C binding dependencies

Platform supported:	Linux
Library:	The language binding is generated as a `libxenserver.so` that is linked by C programs.
Dependencies:	XML library (libxml2.so on GNU Linux) and Curl library (libcurl2.so)

The following simple examples are included with the C bindings:

test_vm_async_migrate: demonstrates how to use asynchronous API calls to migrate running VMs from a slave host to the pool master;
test_vm_ops: demonstrates how to query the capabilities of a host, create a VM, attach a fresh blank disk image to the VM and then perform various powercycle operations;
test_failures: demonstrates how to translate error strings into enum_xen_api_failure, and vice versa;
test_event_handling: demonstrates how to listen for events on a connection;
test_enumerate: demonstrates how to enumerate the various API objects.

C#

The C# bindings are contained within the directory XenServer-SDK/XenServer.NET and include project files suitable for building under Microsoft Visual Studio. Every API object is associated with one C# file; for example the functions implementing the VM operations are contained within the file VM.cs.

C# binding dependencies

Platform supported:	Windows with .NET version 4.5
Library:	The language binding is generated as a Dynamic Link Library `XenServer.dll` that is linked by C# programs.
Dependencies:	`CookComputing.XMLRpcV2.dll` is needed for the XenServer.dll to be able to communicate with the xml-rpc server. We test with version 2.1.0.6 and recommend that you use this version, though others may work.

Three examples are included with the C# bindings in the directory XenServer-SDK/XenServer.NET/samples as separate projects of the XenSdkSample.sln solution:

GetVariousRecords: logs into a XenServer Host and displays information about hosts, storage and virtual machines;
GetVmRecords: logs into a XenServer Host and lists all the VM records;
VmPowerStates: logs into a XenServer Host, finds a VM and takes it through the various power states. Requires a shut-down VM to be already installed.

Java

The Java bindings are contained within the directory XenServer-SDK/XenServerJava and include project files suitable for building under Microsoft Visual Studio. Every API object is associated with one Java file; for example the functions implementing the VM operations are contained within the file VM.java.

Java binding dependencies

Platform supported:	Linux and Windows
Library:	The language binding is generated as a Java Archive file `xenserver-PRODUCT_VERSION.jar` that is linked by Java programs.
Dependencies:	xmlrpc-client-3.1.jar is needed for the xenserver.jar to be able to communicate with the xml-rpc server. ws-commons-util-1.0.2.jar is needed to run the examples.

Running the main file XenServer-SDK/XenServerJava/samples/RunTests.java will run a series of examples included in the same directory:

AddNetwork: Adds a new internal network not attached to any NICs;
SessionReuse: Demonstrates how a Session object can be shared between multiple Connections;
AsyncVMCreate: Asynchronously creates a new VM from a built-in template, then starts and stops it;
VdiAndSrOps: Performs various SR and VDI tests, including creating a dummy SR;
CreateVM: Creates a VM on the default SR with a network and DVD drive;
DeprecatedMethod: Tests that a warning is displayed when a deprecated API method is called;
GetAllRecordsOfAllTypes: Retrieves all the records for all types of objects;
SharedStorage: Creates a shared NFS SR;
StartAllVMs: Connects to a host and tries to start each VM on it.

PowerShell

The PowerShell bindings are contained within the directory XenServer-SDK/XenServerPowerShell. We provide the PowerShell module XenServerPSModule and source code exposing the XenServer API as Windows PowerShell cmdlets.

PowerShell binding dependencies

Platform supported:	Windows with .NET Framework 4.5 and PowerShell v4.0
Library:	`XenServerPSModule`
Dependencies:	`CookComputing.XMLRpcV2.dll` is needed to be able to communicate with the xml-rpc server. We test with version 2.1.0.6 and recommend that you use this version, though others may work.

These example scripts are included with the PowerShell bindings in the directory XenServer-SDK/XenServerPowerShell/samples:

AutomatedTestCore.ps1: demonstrates how to log into a XenServer host, create a storage repository and a VM, and then perform various powercycle operations;
HttpTest.ps1: demonstrates how to log into a XenServer host, create a VM, and then perform operations such as VM importing and exporting, patch upload, and retrieval of performance statistics.

Python

The Python bindings are contained within a single file: XenServer-SDK/XenServerPython/XenAPI.py.

Python binding dependencies

The SDK includes the following Python examples:

fixpbds.py - reconfigures the settings used to access shared storage;
install.py - installs a Debian VM, connects it to a network, starts it up and waits for it to report its IP address;
license.py - uploads a fresh license to a XenServer Host;
permute.py - selects a set of VMs and uses XenMotion to move them simultaneously between hosts;
powercycle.py - selects a set of VMs and powercycles them;
shell.py - a simple interactive shell for testing;
vm_start_async.py - demonstrates how to invoke operations asynchronously;
watch-all-events.py - registers for all events and prints details when they occur.

Command Line Interface (CLI)

Besides using raw XML-RPC or one of the supplied language bindings, third-party software developers may integrate with XenServer Hosts by using the XE command line interface xe. The xe CLI is installed by default on XenServer hosts; a stand-alone remote CLI is also available for Linux. On Windows, the xe.exe CLI executable is installed along with XenCenter.

CLI dependencies

The CLI allows almost every API call to be directly invoked from a script or other program, silently taking care of the required session management. The XE CLI syntax and capabilities are described in detail in the XenServer Administrator’s Guide. For additional resources and examples, visit the Citrix Knowledge Center.

Note
When running the CLI from a XenServer Host console, tab-completion of both command names and arguments is available.

Complete application examples

This section describes two complete examples of real programs using the API.

Simultaneously migrating VMs using XenMotion

This Python example (contained in XenServer-SDK/XenServerPython/samples/permute.py) demonstrates how to use XenMotion to move VMs simultaneously between hosts in a Resource Pool. The example makes use of asynchronous API calls and shows how to wait for a set of tasks to complete.

The program begins with some standard boilerplate and imports the API bindings module:

import sys, time
import XenAPI

Next the command-line arguments containing a server URL, username, password and a number of iterations are parsed. The username and password are used to establish a session which is passed to the function main, which is called multiple times in a loop. Note the use of try: finally: to make sure the program logs out of its session at the end.

if __name__ == "__main__":
    if len(sys.argv) != 5:
        print("Usage:")
        print(sys.argv[0], "<url> <username> <password> <iterations>")
        sys.exit(1)
    url = sys.argv[1]
    username = sys.argv[2]
    password = sys.argv[3]
    iterations = int(sys.argv[4])
    # First acquire a valid session by logging in:
    session = XenAPI.Session(url)
    session.xenapi.login_with_password(username, password, "2.3",
                                       "Example migration-demo v0.1")
    try:
        for i in range(iterations):
            main(session, i)
    finally:
        session.xenapi.session.logout()

The main function examines each running VM in the system, taking care to filter out control domains (which are part of the system and not controllable by the user). A list of running VMs and their current hosts is constructed.

def main(session, iteration):
    # Collect every running VM that is neither a template nor a control domain
    all_vms = session.xenapi.VM.get_all()
    vms = []
    hosts = []
    for vm in all_vms:
        record = session.xenapi.VM.get_record(vm)
        if not record["is_a_template"] and \
           not record["is_control_domain"] and \
           record["power_state"] == "Running":
            vms.append(vm)
            hosts.append(record["resident_on"])
    print("%d: Found %d suitable running VMs" % (iteration, len(vms)))

Next the list of hosts is rotated:

    # use a rotation as a permutation
    hosts = [hosts[-1]] + hosts[:(len(hosts)-1)]
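
To make the effect of the rotation concrete, here is a small stand-alone illustration using made-up host names in place of host references. The helper `rotate` is written for this illustration only; it uses slices so that an empty list is also handled.

```python
def rotate(hosts):
    """Rotate right by one: the last element moves to the front."""
    return hosts[-1:] + hosts[:-1]


before = ["host-a", "host-b", "host-c"]
after = rotate(before)
# The VM at index i currently runs on before[i] and will move to after[i]
moves = list(zip(before, after))
```

Here `moves` pairs each VM's current host with its destination, so the VM on host-b is sent to host-a and the VM on host-a is sent to host-c.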

Each VM is then moved using XenMotion to the new host under this rotation (i.e. a VM running on the host at position 2 in the list will be moved to the host at position 1 in the list, and so on). To execute the movements in parallel, the asynchronous version of VM.pool_migrate is used and a list of task references is constructed. Note the live flag passed to VM.pool_migrate; this causes the VMs to be moved while they are still running.

    tasks = []
    for i in range(0, len(vms)):
        vm = vms[i]
        host = hosts[i]
        task = session.xenapi.Async.VM.pool_migrate(vm, host, { "live": "true" })
        tasks.append(task)

The list of tasks is then polled for completion:

    finished = False
    records = {}
    while not(finished):
        finished = True
        for task in tasks:
            record = session.xenapi.task.get_record(task)
            records[task] = record
            if record["status"] == "pending":
                finished = False
        time.sleep(1)

Once all tasks have left the pending state (i.e. they have successfully completed, failed or been cancelled), the recorded task states are examined to see whether they all succeeded:

    allok = True
    for task in tasks:
        record = records[task]
        if record["status"] != "success":
            allok = False

If any one of the tasks failed then details are printed, an exception is raised and the task objects left around for further inspection. If all tasks succeeded then the task objects are destroyed and the function returns.

    if not allok:
        print("One of the tasks didn't succeed at",
              time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()))
        for idx, task in enumerate(tasks):
            record = records[task]
            vm_name = session.xenapi.VM.get_name_label(vms[idx])
            host_name = session.xenapi.host.get_name_label(hosts[idx])
            print("%s : %12s %s -> %s [ status: %s; result = %s; error = %s ]" %
                  (record["uuid"], record["name_label"], vm_name, host_name,
                   record["status"], record["result"], repr(record["error_info"])))
        # String exceptions are not valid Python; raise a real exception instead
        raise RuntimeError("Task failed")
    else:
        for task in tasks:
            session.xenapi.task.destroy(task)

Cloning a VM using the XE CLI

This example is a bash script which uses the XE CLI to clone a VM taking care to shut it down first if it is powered on.

The example begins with some boilerplate which first checks if the environment variable XE has been set: if it has it assumes that it points to the full path of the CLI, else it is assumed that the XE CLI is on the current path. Next the script prompts the user for a server name, username and password:

# Allow the path to the 'xe' binary to be overridden by the XE environment variable
if [ -z "${XE}" ]; then
  XE=xe
fi

if [ ! -e "${HOME}/.xe" ]; then
  read -p "Server name: " SERVER
  read -p "Username: " USERNAME
  read -p "Password: " PASSWORD
  XE="${XE} -s ${SERVER} -u ${USERNAME} -pw ${PASSWORD}"
fi

Next the script checks its command-line arguments. It requires exactly one: the UUID of the VM which is to be cloned:

# Check that exactly one argument (the VM uuid) was supplied
if [ $# -ne 1 ]; then
        echo "usage: $0 <vm uuid>"
        exit 1
fi
vmuuid=$1
# Check if there's a VM by the uuid specified
${XE} vm-list params=uuid | grep -q " ${vmuuid}$"
if [ $? -ne 0 ]; then
        echo "error: no vm uuid \"${vmuuid}\" found"
        exit 2
fi

The script then checks the power state of the VM and if it is running, it attempts a clean shutdown. The event system is used to wait for the VM to enter state “Halted”.

Note
The XE CLI supports a command-line argument --minimal which causes it to print its output without excess whitespace or formatting, ideal for use from scripts. If multiple values are returned they are comma-separated.
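
A program that drives the CLI from Python rather than bash could split this output as sketched below. The helper `parse_minimal` is hypothetical and assumes that the individual values do not themselves contain commas.

```python
def parse_minimal(output):
    """Split the output of an `xe ... --minimal` command into values."""
    text = output.strip()
    # An empty result means that nothing matched, not one empty value
    return text.split(",") if text else []
```

For example, `parse_minimal("uuid-1,uuid-2\n")` yields a list of two strings, and an empty line yields an empty list.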

# Check the power state of the vm
name=$(${XE} vm-list uuid=${vmuuid} params=name-label --minimal)
state=$(${XE} vm-list uuid=${vmuuid} params=power-state --minimal)
wasrunning=0

# If the VM state is running, we shutdown the vm first
if [ "${state}" = "running" ]; then
        ${XE} vm-shutdown uuid=${vmuuid}
        ${XE} event-wait class=vm power-state=halted uuid=${vmuuid}
        wasrunning=1
fi

The VM is then cloned and the new VM has its name_label set to cloned_vm.

# Clone the VM
newuuid=$(${XE} vm-clone uuid=${vmuuid} new-name-label=cloned_vm)

Finally, if the original VM had been running and was shut down, both it and the new VM are started.

# If the VM state was running before cloning, we start it again
# along with the new VM.
if [ "$wasrunning" -eq 1 ]; then
        ${XE} vm-start uuid=${vmuuid}
        ${XE} vm-start uuid=${newuuid}
fi

XenAPI Reference

XenAPI Classes


Classes, Fields and Messages

Classes have both fields and messages. Messages are either implicit or explicit where an implicit message is one of:

a constructor (usually called "create");
a destructor (usually called "destroy");
"get_by_name_label";
"get_by_uuid";
"get_record";
"get_all"; and
"get_all_records".

Explicit messages are all the remaining, class-specific messages (e.g. "VM.start", "VM.clone").

Every field has at least one accessor depending both on its type and whether it is read-only or read-write. Accessors for a field named "X" would be a proper subset of:

set_X: change the value of field X (only if it is read-write);
get_X: retrieve the value of field X;
add_X: add an element (for fields of type set);
remove_X: remove an element (for fields of type set);
add_to_X: add a key/value pair (for fields of type map); and
remove_from_X: remove a key (for fields of type map).
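
The naming rules above can be summarised in a few lines of Python. This is a simplified sketch for illustration, not the generator used by xapi: the function `accessor_names` and its `kind` argument are assumptions, and individual classes may expose fewer accessors or additional explicit messages.

```python
def accessor_names(field, kind="scalar", read_write=False):
    """List the accessor names implied by the rules above for one field.

    `kind` is "scalar", "set" or "map"; read-only fields only get a getter.
    """
    names = ["get_" + field]
    if read_write:
        names.append("set_" + field)
        if kind == "set":
            names += ["add_" + field, "remove_" + field]
        elif kind == "map":
            names += ["add_to_" + field, "remove_from_" + field]
    return names
```

For a read-write map field such as other_config this produces get_other_config, set_other_config, add_to_other_config and remove_from_other_config, which matches the messages listed for the classes below.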

auth

Class: auth

Management of remote authentication services

Fields

Messages

string set get_group_membership (session ref, string)

string get_subject_identifier (session ref, string)

(string → string) map get_subject_information_from_identifier (session ref, string)

blob

Class: blob

A placeholder for a binary blob

Fields

datetime last_updated [RO/constructor]

string mime_type [RO/constructor]

string name_description [RW]

string name_label [RW]

bool public [RW]

int size [RO/runtime]

string uuid [RO/runtime]

Messages

blob ref create (session ref, string, bool)

void destroy (session ref, blob ref)

blob ref set get_all (session ref)

(blob ref → blob record) map get_all_records (session ref)

blob ref set get_by_name_label (session ref, string)

blob ref get_by_uuid (session ref, string)

datetime get_last_updated (session ref, blob ref)

string get_mime_type (session ref, blob ref)

string get_name_description (session ref, blob ref)

string get_name_label (session ref, blob ref)

bool get_public (session ref, blob ref)

blob record get_record (session ref, blob ref)

int get_size (session ref, blob ref)

string get_uuid (session ref, blob ref)

void set_name_description (session ref, blob ref, string)

void set_name_label (session ref, blob ref, string)

void set_public (session ref, blob ref, bool)

Bond

Class: Bond

A Network bond that combines physical network interfaces, also known as link aggregation

Enums

bond_mode

Fields

bool auto_update_mac [RO/runtime]

int links_up [RO/runtime]

PIF ref master [RO/constructor]

enum bond_mode mode [RO/runtime]

(string → string) map other_config [RW]

PIF ref primary_slave [RO/runtime]

(string → string) map properties [RO/runtime]

PIF ref set slaves [RO/runtime]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, Bond ref, string, string)

Bond ref create (session ref, network ref, PIF ref set, string, enum bond_mode, (string → string) map)

void destroy (session ref, Bond ref)

Bond ref set get_all (session ref)

(Bond ref → Bond record) map get_all_records (session ref)

bool get_auto_update_mac (session ref, Bond ref)

Bond ref get_by_uuid (session ref, string)

int get_links_up (session ref, Bond ref)

PIF ref get_master (session ref, Bond ref)

enum bond_mode get_mode (session ref, Bond ref)

(string → string) map get_other_config (session ref, Bond ref)

PIF ref get_primary_slave (session ref, Bond ref)

(string → string) map get_properties (session ref, Bond ref)

Bond record get_record (session ref, Bond ref)

PIF ref set get_slaves (session ref, Bond ref)

string get_uuid (session ref, Bond ref)

void remove_from_other_config (session ref, Bond ref, string)

void set_mode (session ref, Bond ref, enum bond_mode)

void set_other_config (session ref, Bond ref, (string → string) map)

void set_property (session ref, Bond ref, string, string)

Certificate

Class: Certificate

An X509 certificate used for TLS connections

Enums

certificate_type

Fields

string fingerprint [RO/constructor]

host ref host [RO/constructor]

string name [RO/runtime]

datetime not_after [RO/constructor]

datetime not_before [RO/constructor]

enum certificate_type type [RO/runtime]

string uuid [RO/runtime]

Messages

Certificate ref set get_all (session ref)

(Certificate ref → Certificate record) map get_all_records (session ref)

Certificate ref get_by_uuid (session ref, string)

string get_fingerprint (session ref, Certificate ref)

host ref get_host (session ref, Certificate ref)

string get_name (session ref, Certificate ref)

datetime get_not_after (session ref, Certificate ref)

datetime get_not_before (session ref, Certificate ref)

Certificate record get_record (session ref, Certificate ref)

enum certificate_type get_type (session ref, Certificate ref)

string get_uuid (session ref, Certificate ref)

Cluster

Class: Cluster

Cluster-wide Cluster metadata

Enums

cluster_operation

Fields

enum cluster_operation set allowed_operations [RO/runtime]

(string → string) map cluster_config [RO/constructor]

Cluster_host ref set cluster_hosts [RO/runtime]

string cluster_stack [RO/constructor]

string cluster_token [RO/constructor]

(string → enum cluster_operation) map current_operations [RO/runtime]

Prototype

bool is_quorate [RO/runtime]

Prototype

int live_hosts [RO/runtime]

(string → string) map other_config [RW]

string set pending_forget [RO/runtime]

bool pool_auto_join [RO/constructor]

Prototype

int quorum [RO/runtime]

float token_timeout [RO/constructor]

float token_timeout_coefficient [RO/constructor]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, Cluster ref, string, string)

Cluster ref create (session ref, PIF ref, string, bool, float, float)

void destroy (session ref, Cluster ref)

Cluster ref set get_all (session ref)

(Cluster ref → Cluster record) map get_all_records (session ref)

enum cluster_operation set get_allowed_operations (session ref, Cluster ref)

Cluster ref get_by_uuid (session ref, string)

(string → string) map get_cluster_config (session ref, Cluster ref)

Cluster_host ref set get_cluster_hosts (session ref, Cluster ref)

string get_cluster_stack (session ref, Cluster ref)

string get_cluster_token (session ref, Cluster ref)

(string → enum cluster_operation) map get_current_operations (session ref, Cluster ref)

Prototype

bool get_is_quorate (session ref, Cluster ref)

Prototype

int get_live_hosts (session ref, Cluster ref)

network ref get_network (session ref, Cluster ref)

(string → string) map get_other_config (session ref, Cluster ref)

string set get_pending_forget (session ref, Cluster ref)

bool get_pool_auto_join (session ref, Cluster ref)

Prototype

int get_quorum (session ref, Cluster ref)

Cluster record get_record (session ref, Cluster ref)

float get_token_timeout (session ref, Cluster ref)

float get_token_timeout_coefficient (session ref, Cluster ref)

string get_uuid (session ref, Cluster ref)

Cluster ref pool_create (session ref, network ref, string, float, float)

void pool_destroy (session ref, Cluster ref)

void pool_force_destroy (session ref, Cluster ref)

void pool_resync (session ref, Cluster ref)

void remove_from_other_config (session ref, Cluster ref, string)

void set_other_config (session ref, Cluster ref, (string → string) map)
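
The Cluster getters listed above can be combined into a small health report. The following is a minimal sketch, not part of the Toolstack: `cluster_health` is a hypothetical helper, and `xenapi` is assumed to behave like the `session.xenapi` proxy of the Python XenAPI bindings, which supplies the session ref implicitly. Note that `is_quorate`, `live_hosts` and `quorum` are marked Prototype above, so they may change.

```python
def cluster_health(xenapi):
    """Return {cluster_ref: status} for every Cluster object.

    `xenapi` is assumed to expose Cluster.get_all, Cluster.get_is_quorate,
    Cluster.get_live_hosts and Cluster.get_quorum, as listed in the
    Messages section, with the session ref passed implicitly.
    """
    report = {}
    for ref in xenapi.Cluster.get_all():
        report[ref] = {
            "is_quorate": bool(xenapi.Cluster.get_is_quorate(ref)),
            # int() accepts both native and string-encoded integers.
            "live_hosts": int(xenapi.Cluster.get_live_hosts(ref)),
            "quorum": int(xenapi.Cluster.get_quorum(ref)),
        }
    return report
```

The helper only reads state, so it should be safe to call repeatedly while investigating a clustering problem.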

Cluster_host

Class: Cluster_host

Cluster member metadata

Enums

cluster_host_operation

Fields

enum cluster_host_operation set allowed_operations [RO/runtime]

Cluster ref cluster [RO/constructor]

(string → enum cluster_host_operation) map current_operations [RO/runtime]

bool enabled [RO/runtime]

host ref host [RO/constructor]

bool joined [RO/runtime]

Prototype

datetime last_update_live [RO/runtime]

Prototype

bool live [RO/runtime]

(string → string) map other_config [RO/constructor]

PIF ref PIF [RO/constructor]

string uuid [RO/runtime]

Messages

Cluster_host ref create (session ref, Cluster ref, host ref, PIF ref)

void destroy (session ref, Cluster_host ref)

void disable (session ref, Cluster_host ref)

void enable (session ref, Cluster_host ref)

void force_destroy (session ref, Cluster_host ref)

Cluster_host ref set get_all (session ref)

(Cluster_host ref → Cluster_host record) map get_all_records (session ref)

enum cluster_host_operation set get_allowed_operations (session ref, Cluster_host ref)

Cluster_host ref get_by_uuid (session ref, string)

Cluster ref get_cluster (session ref, Cluster_host ref)

(string → enum cluster_host_operation) map get_current_operations (session ref, Cluster_host ref)

bool get_enabled (session ref, Cluster_host ref)

host ref get_host (session ref, Cluster_host ref)

bool get_joined (session ref, Cluster_host ref)

Prototype

datetime get_last_update_live (session ref, Cluster_host ref)

Prototype

bool get_live (session ref, Cluster_host ref)

(string → string) map get_other_config (session ref, Cluster_host ref)

PIF ref get_PIF (session ref, Cluster_host ref)

Cluster_host record get_record (session ref, Cluster_host ref)

string get_uuid (session ref, Cluster_host ref)

console

Class: console

A console

Enums

console_protocol

Fields

string location [RO/runtime]

(string → string) map other_config [RW]

enum console_protocol protocol [RO/runtime]

string uuid [RO/runtime]

VM ref VM [RO/runtime]

Messages

void add_to_other_config (session ref, console ref, string, string)

console ref create (session ref, console record)

void destroy (session ref, console ref)

console ref set get_all (session ref)

(console ref → console record) map get_all_records (session ref)

console ref get_by_uuid (session ref, string)

string get_location (session ref, console ref)

(string → string) map get_other_config (session ref, console ref)

enum console_protocol get_protocol (session ref, console ref)

console record get_record (session ref, console ref)

string get_uuid (session ref, console ref)

VM ref get_VM (session ref, console ref)

void remove_from_other_config (session ref, console ref, string)

void set_other_config (session ref, console ref, (string → string) map)

crashdump

Deprecated

Class: crashdump

A VM crashdump

Fields

(string → string) map other_config [RW]

string uuid [RO/runtime]

VDI ref VDI [RO/constructor]

VM ref VM [RO/constructor]

Messages

void add_to_other_config (session ref, crashdump ref, string, string)

void destroy (session ref, crashdump ref)

Deprecated

crashdump ref set get_all (session ref)

Deprecated

(crashdump ref → crashdump record) map get_all_records (session ref)

Deprecated

crashdump ref get_by_uuid (session ref, string)

(string → string) map get_other_config (session ref, crashdump ref)

Deprecated

crashdump record get_record (session ref, crashdump ref)

string get_uuid (session ref, crashdump ref)

VDI ref get_VDI (session ref, crashdump ref)

VM ref get_VM (session ref, crashdump ref)

void remove_from_other_config (session ref, crashdump ref, string)

void set_other_config (session ref, crashdump ref, (string → string) map)

data_source

Class: data_source

Data sources for logging in RRDs

Fields

bool enabled [RO/runtime]

float max [RO/runtime]

float min [RO/runtime]

string name_description [RO/runtime]

string name_label [RO/runtime]

bool standard [RO/runtime]

string units [RO/runtime]

float value [RO/runtime]

Messages

DR_task

Class: DR_task

Disaster recovery (DR) task

Fields

SR ref set introduced_SRs [RO/runtime]

string uuid [RO/runtime]

Messages

DR_task ref create (session ref, string, (string → string) map, string set)

void destroy (session ref, DR_task ref)

DR_task ref set get_all (session ref)

(DR_task ref → DR_task record) map get_all_records (session ref)

DR_task ref get_by_uuid (session ref, string)

SR ref set get_introduced_SRs (session ref, DR_task ref)

DR_task record get_record (session ref, DR_task ref)

string get_uuid (session ref, DR_task ref)

event

Class: event

Asynchronous event registration and handling

Enums

event_operation

Fields

string class [RO/constructor]

int id [RO/constructor]

Deprecated

string obj_uuid [RO/constructor]

enum event_operation operation [RO/constructor]

string ref [RO/constructor]

<class> record snapshot [RO/runtime]

Deprecated

datetime timestamp [RO/constructor]

Messages

an event batch from (session ref, string set, string, float)

int get_current_id (session ref)

string inject (session ref, string, string)

Deprecated

event record set next (session ref)

Deprecated

void register (session ref, string set)

Deprecated

void unregister (session ref, string set)
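
Since `register`, `unregister` and `next` are marked Deprecated, a client would normally poll with `from`. The following is a minimal sketch under stated assumptions: `poll_events` and `summarize` are hypothetical helpers, `xenapi` is assumed to behave like the `session.xenapi` proxy of the Python XenAPI bindings, and the returned batch is assumed to be a map with `events` and `token` entries (the listing above does not spell out its layout). The arguments are passed in the listed order: string set, string, float.

```python
def poll_events(xenapi, classes, token="", timeout=30.0):
    """Fetch one event batch and return (events, next_token)."""
    # `from` is a reserved word in Python, so look the method up by name.
    event_from = getattr(xenapi.event, "from")
    batch = event_from(classes, token, timeout)
    # Assumption: the batch is a map carrying "events" and "token".
    return batch.get("events", []), batch.get("token", token)


def summarize(events):
    """Reduce event records to (class, operation, ref) triples."""
    return [(e["class"], e["operation"], e["ref"]) for e in events]
```

A caller would feed the returned token into the next `poll_events` call so that each batch continues where the previous one ended.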

Feature

Class: Feature

A new piece of functionality

Fields

bool enabled [RO/runtime]

bool experimental [RO/constructor]

host ref host [RO/runtime]

string name_description [RO/constructor]

string name_label [RO/constructor]

string uuid [RO/runtime]

string version [RO/constructor]

Messages

Feature ref set get_all (session ref)

(Feature ref → Feature record) map get_all_records (session ref)

Feature ref set get_by_name_label (session ref, string)

Feature ref get_by_uuid (session ref, string)

bool get_enabled (session ref, Feature ref)

bool get_experimental (session ref, Feature ref)

host ref get_host (session ref, Feature ref)

string get_name_description (session ref, Feature ref)

string get_name_label (session ref, Feature ref)

Feature record get_record (session ref, Feature ref)

string get_uuid (session ref, Feature ref)

string get_version (session ref, Feature ref)

GPU_group

Class: GPU_group

A group of compatible GPUs across the resource pool

Enums

allocation_algorithm

Fields

enum allocation_algorithm allocation_algorithm [RW]

VGPU_type ref set enabled_VGPU_types [RO/runtime]

string set GPU_types [RO/runtime]

string name_description [RW]

string name_label [RW]

(string → string) map other_config [RW]

PGPU ref set PGPUs [RO/runtime]

VGPU_type ref set supported_VGPU_types [RO/runtime]

string uuid [RO/runtime]

VGPU ref set VGPUs [RO/runtime]

Messages

void add_to_other_config (session ref, GPU_group ref, string, string)

GPU_group ref create (session ref, string, string, (string → string) map)

void destroy (session ref, GPU_group ref)

GPU_group ref set get_all (session ref)

(GPU_group ref → GPU_group record) map get_all_records (session ref)

enum allocation_algorithm get_allocation_algorithm (session ref, GPU_group ref)

GPU_group ref set get_by_name_label (session ref, string)

GPU_group ref get_by_uuid (session ref, string)

VGPU_type ref set get_enabled_VGPU_types (session ref, GPU_group ref)

string set get_GPU_types (session ref, GPU_group ref)

string get_name_description (session ref, GPU_group ref)

string get_name_label (session ref, GPU_group ref)

(string → string) map get_other_config (session ref, GPU_group ref)

PGPU ref set get_PGPUs (session ref, GPU_group ref)

GPU_group record get_record (session ref, GPU_group ref)

int get_remaining_capacity (session ref, GPU_group ref, VGPU_type ref)

VGPU_type ref set get_supported_VGPU_types (session ref, GPU_group ref)

string get_uuid (session ref, GPU_group ref)

VGPU ref set get_VGPUs (session ref, GPU_group ref)

void remove_from_other_config (session ref, GPU_group ref, string)

void set_allocation_algorithm (session ref, GPU_group ref, enum allocation_algorithm)

void set_name_description (session ref, GPU_group ref, string)

void set_name_label (session ref, GPU_group ref, string)

void set_other_config (session ref, GPU_group ref, (string → string) map)
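
`get_remaining_capacity` takes a VGPU_type ref, so a per-type overview needs one call per enabled type. A minimal sketch follows; `remaining_capacity_by_type` is a hypothetical helper, and `xenapi` is assumed to behave like the `session.xenapi` proxy of the Python XenAPI bindings.

```python
def remaining_capacity_by_type(xenapi, group_ref):
    """Map each enabled VGPU_type ref of a GPU_group to its remaining capacity."""
    capacities = {}
    for vgpu_type in xenapi.GPU_group.get_enabled_VGPU_types(group_ref):
        remaining = xenapi.GPU_group.get_remaining_capacity(group_ref, vgpu_type)
        # int() accepts both native and string-encoded integers.
        capacities[vgpu_type] = int(remaining)
    return capacities
```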

host

Class: host

A physical host

Enums

host_allowed_operations

latest_synced_updates_applied_state

update_guidances

host_display

host_sched_gran

host_numa_affinity_policy

Fields

string address [RW]

enum host_allowed_operations set allowed_operations [RO/runtime]

int API_version_major [RO/runtime]

int API_version_minor [RO/runtime]

string API_version_vendor [RO/runtime]

(string → string) map API_version_vendor_implementation [RO/runtime]

(string → string) map bios_strings [RO/runtime]

(string → blob ref) map blobs [RO/runtime]

string set capabilities [RO/constructor]

Certificate ref set certificates [RO/runtime]

(string → string) map chipset_info [RO/runtime]

VM ref control_domain [RO/runtime]

(string → string) map cpu_configuration [RO/runtime]

(string → string) map cpu_info [RO/runtime]

SR ref crash_dump_sr [RW]

host_crashdump ref set crashdumps [RO/runtime]

(string → enum host_allowed_operations) map current_operations [RO/runtime]

enum host_display display [RW]

string edition [RO/runtime]

string set editions [RO/runtime]

bool enabled [RO/runtime]

(string → string) map external_auth_configuration [RO/runtime]

string external_auth_service_name [RO/runtime]

string external_auth_type [RO/runtime]

Feature ref set features [RO/runtime]

(string → string) map guest_VCPUs_params [RW]

string set ha_network_peers [RO/runtime]

string set ha_statefiles [RO/runtime]

host_cpu ref set host_CPUs [RO/runtime]

string hostname [RW]

Prototype

bool https_only [RO/runtime]

string iscsi_iqn [RO/constructor]

Prototype

datetime last_software_update [RO/runtime]

Prototype

string last_update_hash [RO/runtime]

Prototype

enum latest_synced_updates_applied_state latest_synced_updates_applied [RO/runtime]

(string → string) map license_params [RO/runtime]

(string → string) map license_server [RW]

SR ref local_cache_sr [RO/constructor]

(string → string) map logging [RW]

int memory_overhead [RO/runtime]

host_metrics ref metrics [RO/runtime]

bool multipathing [RO/constructor]

string name_description [RW]

string name_label [RW]

Prototype

enum host_numa_affinity_policy numa_affinity_policy [RO/runtime]

(string → string) map other_config [RW]

Deprecated

host_patch ref set patches [RO/runtime]

PBD ref set PBDs [RO/runtime]

PCI ref set PCIs [RO/runtime]

enum update_guidances set pending_guidances [RO/runtime]

Prototype

enum update_guidances set pending_guidances_full [RO/runtime]

Prototype

enum update_guidances set pending_guidances_recommended [RO/runtime]

PGPU ref set PGPUs [RO/runtime]

PIF ref set PIFs [RO/runtime]

(string → string) map power_on_config [RO/runtime]

string power_on_mode [RO/runtime]

PUSB ref set PUSBs [RO/runtime]

VM ref set resident_VMs [RO/runtime]

string sched_policy [RO/runtime]

(string → string) map software_version [RO/constructor]

Deprecated

bool ssl_legacy [RO/constructor]

string set supported_bootloaders [RO/runtime]

SR ref suspend_image_sr [RW]

string set tags [RW]

bool tls_verification_enabled [RO/runtime]

Deprecated

string uefi_certificates [RO/constructor]

pool_update ref set updates [RO/runtime]

pool_update ref set updates_requiring_reboot [RO/runtime]

string uuid [RO/runtime]

int set virtual_hardware_platform_versions [RO/runtime]

Messages

void add_tags (session ref, host ref, string)

void add_to_guest_VCPUs_params (session ref, host ref, string, string)

void add_to_license_server (session ref, host ref, string, string)

void add_to_logging (session ref, host ref, string, string)

void add_to_other_config (session ref, host ref, string, string)

void apply_edition (session ref, host ref, string, bool)

Removed

void apply_recommended_guidances (session ref, host ref)

string set set apply_updates (session ref, host ref, string)

void assert_can_evacuate (session ref, host ref)

void backup_rrds (session ref, host ref, float)

void bugreport_upload (session ref, host ref, string, (string → string) map)

string call_extension (session ref, host ref, string)

string call_plugin (session ref, host ref, string, string, (string → string) map)

int compute_free_memory (session ref, host ref)

int compute_memory_overhead (session ref, host ref)

blob ref create_new_blob (session ref, host ref, string, string, bool)

void declare_dead (session ref, host ref)

void destroy (session ref, host ref)

void disable (session ref, host ref)

enum host_display disable_display (session ref, host ref)

void disable_external_auth (session ref, host ref, (string → string) map)

void disable_local_storage_caching (session ref, host ref)

string dmesg (session ref, host ref)

string dmesg_clear (session ref, host ref)

Prototype

void emergency_clear_mandatory_guidance (session ref)

void emergency_disable_tls_verification (session ref)

void emergency_ha_disable (session ref, bool)

void emergency_reenable_tls_verification (session ref)

void emergency_reset_server_certificate (session ref)

void enable (session ref, host ref)

enum host_display enable_display (session ref, host ref)

void enable_external_auth (session ref, host ref, (string → string) map, string, string)

void enable_local_storage_caching (session ref, host ref, SR ref)

void evacuate (session ref, host ref, network ref, int)

void forget_data_source_archives (session ref, host ref, string)

string get_address (session ref, host ref)

host ref set get_all (session ref)

(host ref → host record) map get_all_records (session ref)

enum host_allowed_operations set get_allowed_operations (session ref, host ref)

int get_API_version_major (session ref, host ref)

int get_API_version_minor (session ref, host ref)

string get_API_version_vendor (session ref, host ref)

(string → string) map get_API_version_vendor_implementation (session ref, host ref)

(string → string) map get_bios_strings (session ref, host ref)

(string → blob ref) map get_blobs (session ref, host ref)

host ref set get_by_name_label (session ref, string)

host ref get_by_uuid (session ref, string)

string set get_capabilities (session ref, host ref)

Certificate ref set get_certificates (session ref, host ref)

(string → string) map get_chipset_info (session ref, host ref)

VM ref get_control_domain (session ref, host ref)

(string → string) map get_cpu_configuration (session ref, host ref)

(string → string) map get_cpu_info (session ref, host ref)

SR ref get_crash_dump_sr (session ref, host ref)

host_crashdump ref set get_crashdumps (session ref, host ref)

(string → enum host_allowed_operations) map get_current_operations (session ref, host ref)

data_source record set get_data_sources (session ref, host ref)

enum host_display get_display (session ref, host ref)

string get_edition (session ref, host ref)

string set get_editions (session ref, host ref)

bool get_enabled (session ref, host ref)

(string → string) map get_external_auth_configuration (session ref, host ref)

string get_external_auth_service_name (session ref, host ref)

string get_external_auth_type (session ref, host ref)

Feature ref set get_features (session ref, host ref)

(string → string) map get_guest_VCPUs_params (session ref, host ref)

string set get_ha_network_peers (session ref, host ref)

string set get_ha_statefiles (session ref, host ref)

host_cpu ref set get_host_CPUs (session ref, host ref)

string get_hostname (session ref, host ref)

Prototype

bool get_https_only (session ref, host ref)

string get_iscsi_iqn (session ref, host ref)

Prototype

datetime get_last_software_update (session ref, host ref)

Prototype

string get_last_update_hash (session ref, host ref)

Prototype

enum latest_synced_updates_applied_state get_latest_synced_updates_applied (session ref, host ref)

(string → string) map get_license_params (session ref, host ref)

(string → string) map get_license_server (session ref, host ref)

SR ref get_local_cache_sr (session ref, host ref)

string get_log (session ref, host ref)

(string → string) map get_logging (session ref, host ref)

PIF ref get_management_interface (session ref, host ref)

int get_memory_overhead (session ref, host ref)

host_metrics ref get_metrics (session ref, host ref)

bool get_multipathing (session ref, host ref)

string get_name_description (session ref, host ref)

string get_name_label (session ref, host ref)

Prototype

enum host_numa_affinity_policy get_numa_affinity_policy (session ref, host ref)

(string → string) map get_other_config (session ref, host ref)

Deprecated

host_patch ref set get_patches (session ref, host ref)

PBD ref set get_PBDs (session ref, host ref)

PCI ref set get_PCIs (session ref, host ref)

enum update_guidances set get_pending_guidances (session ref, host ref)

Prototype

enum update_guidances set get_pending_guidances_full (session ref, host ref)

Prototype

enum update_guidances set get_pending_guidances_recommended (session ref, host ref)

PGPU ref set get_PGPUs (session ref, host ref)

PIF ref set get_PIFs (session ref, host ref)

(string → string) map get_power_on_config (session ref, host ref)

string get_power_on_mode (session ref, host ref)

PUSB ref set get_PUSBs (session ref, host ref)

host record get_record (session ref, host ref)

VM ref set get_resident_VMs (session ref, host ref)

enum host_sched_gran get_sched_gran (session ref, host ref)

string get_sched_policy (session ref, host ref)

string get_server_certificate (session ref, host ref)

datetime get_server_localtime (session ref, host ref)

datetime get_servertime (session ref, host ref)

(string → string) map get_software_version (session ref, host ref)

Deprecated

bool get_ssl_legacy (session ref, host ref)

string set get_supported_bootloaders (session ref, host ref)

SR ref get_suspend_image_sr (session ref, host ref)

string get_system_status_capabilities (session ref, host ref)

string set get_tags (session ref, host ref)

bool get_tls_verification_enabled (session ref, host ref)

Deprecated

string get_uefi_certificates (session ref, host ref)

Deprecated

VM ref set get_uncooperative_resident_VMs (session ref, host ref)

pool_update ref set get_updates (session ref, host ref)

pool_update ref set get_updates_requiring_reboot (session ref, host ref)

string get_uuid (session ref, host ref)

int set get_virtual_hardware_platform_versions (session ref, host ref)

(VM ref → string set) map get_vms_which_prevent_evacuation (session ref, host ref)

bool has_extension (session ref, host ref, string)

void install_server_certificate (session ref, host ref, string, string, string)

void license_add (session ref, host ref, string)

Removed

void license_apply (session ref, host ref, string)

void license_remove (session ref, host ref)

string set list_methods (session ref)

void local_management_reconfigure (session ref, string)

void management_disable (session ref)

void management_reconfigure (session ref, PIF ref)

(string → string) map migrate_receive (session ref, host ref, network ref, (string → string) map)

void power_on (session ref, host ref)

float query_data_source (session ref, host ref, string)

void reboot (session ref, host ref)

void record_data_source (session ref, host ref, string)

Deprecated

void refresh_pack_info (session ref, host ref)

void refresh_server_certificate (session ref, host ref)

void remove_from_guest_VCPUs_params (session ref, host ref, string)

void remove_from_license_server (session ref, host ref, string)

void remove_from_logging (session ref, host ref, string)

void remove_from_other_config (session ref, host ref, string)

void remove_tags (session ref, host ref, string)

Removed

void reset_cpu_features (session ref, host ref)

void reset_server_certificate (session ref, host ref)

void restart_agent (session ref, host ref)

(VM ref → string set) map retrieve_wlb_evacuate_recommendations (session ref, host ref)

void send_debug_keys (session ref, host ref, string)

void set_address (session ref, host ref, string)

Removed

void set_cpu_features (session ref, host ref, string)

void set_crash_dump_sr (session ref, host ref, SR ref)

void set_display (session ref, host ref, enum host_display)

void set_guest_VCPUs_params (session ref, host ref, (string → string) map)

void set_hostname (session ref, host ref, string)

void set_hostname_live (session ref, host ref, string)

Prototype

void set_https_only (session ref, host ref, bool)

void set_iscsi_iqn (session ref, host ref, string)

void set_license_server (session ref, host ref, (string → string) map)

void set_logging (session ref, host ref, (string → string) map)

void set_multipathing (session ref, host ref, bool)

void set_name_description (session ref, host ref, string)

void set_name_label (session ref, host ref, string)

Prototype

void set_numa_affinity_policy (session ref, host ref, enum host_numa_affinity_policy)

void set_other_config (session ref, host ref, (string → string) map)

void set_power_on_mode (session ref, host ref, string, (string → string) map)

void set_sched_gran (session ref, host ref, enum host_sched_gran)

void set_ssl_legacy (session ref, host ref, bool)

void set_suspend_image_sr (session ref, host ref, SR ref)

void set_tags (session ref, host ref, string set)

Deprecated

void set_uefi_certificates (session ref, host ref, string)

void shutdown (session ref, host ref)

void shutdown_agent (session ref)

void sync_data (session ref, host ref)

void syslog_reconfigure (session ref, host ref)
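
Before evacuating a host, a client can ask which VMs would block the operation. The sketch below is illustrative only: `evacuation_blockers` is a hypothetical helper, and `xenapi` is assumed to behave like the `session.xenapi` proxy of the Python XenAPI bindings. It uses only `get_vms_which_prevent_evacuation`, which returns a (VM ref → string set) map according to the listing above.

```python
def evacuation_blockers(xenapi, host_ref):
    """Return (can_evacuate, blockers) for a host.

    `blockers` is the map returned by host.get_vms_which_prevent_evacuation;
    an empty map is taken to mean that no VM blocks evacuation.
    """
    blockers = xenapi.host.get_vms_which_prevent_evacuation(host_ref)
    return len(blockers) == 0, dict(blockers)
```

The check is read-only. A client would call `disable` and `evacuate` itself once the blocker map is empty.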

host_cpu

Deprecated

Class: host_cpu

A physical CPU

Fields

int family [RO/runtime]

string features [RO/runtime]

string flags [RO/runtime]

host ref host [RO/runtime]

int model [RO/runtime]

string modelname [RO/runtime]

int number [RO/runtime]

(string → string) map other_config [RW]

int speed [RO/runtime]

string stepping [RO/runtime]

float utilisation [RO/runtime]

string uuid [RO/runtime]

string vendor [RO/runtime]

Messages

void add_to_other_config (session ref, host_cpu ref, string, string)

Deprecated

host_cpu ref set get_all (session ref)

Deprecated

(host_cpu ref → host_cpu record) map get_all_records (session ref)

Deprecated

host_cpu ref get_by_uuid (session ref, string)

int get_family (session ref, host_cpu ref)

string get_features (session ref, host_cpu ref)

string get_flags (session ref, host_cpu ref)

host ref get_host (session ref, host_cpu ref)

int get_model (session ref, host_cpu ref)

string get_modelname (session ref, host_cpu ref)

int get_number (session ref, host_cpu ref)

(string → string) map get_other_config (session ref, host_cpu ref)

Deprecated

host_cpu record get_record (session ref, host_cpu ref)

int get_speed (session ref, host_cpu ref)

string get_stepping (session ref, host_cpu ref)

float get_utilisation (session ref, host_cpu ref)

string get_uuid (session ref, host_cpu ref)

string get_vendor (session ref, host_cpu ref)

void remove_from_other_config (session ref, host_cpu ref, string)

void set_other_config (session ref, host_cpu ref, (string → string) map)

host_crashdump

Class: host_crashdump

Represents a host crash dump

Fields

host ref host [RO/constructor]

(string → string) map other_config [RW]

int size [RO/runtime]

datetime timestamp [RO/runtime]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, host_crashdump ref, string, string)

void destroy (session ref, host_crashdump ref)

host_crashdump ref set get_all (session ref)

(host_crashdump ref → host_crashdump record) map get_all_records (session ref)

host_crashdump ref get_by_uuid (session ref, string)

host ref get_host (session ref, host_crashdump ref)

(string → string) map get_other_config (session ref, host_crashdump ref)

host_crashdump record get_record (session ref, host_crashdump ref)

int get_size (session ref, host_crashdump ref)

datetime get_timestamp (session ref, host_crashdump ref)

string get_uuid (session ref, host_crashdump ref)

void remove_from_other_config (session ref, host_crashdump ref, string)

void set_other_config (session ref, host_crashdump ref, (string → string) map)

void upload (session ref, host_crashdump ref, string, (string → string) map)

host_metrics

Class: host_metrics

The metrics associated with a host

Fields

datetime last_updated [RO/runtime]

bool live [RO/runtime]

Removed

int memory_free [RO/runtime]

int memory_total [RO/runtime]

(string → string) map other_config [RW]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, host_metrics ref, string, string)

host_metrics ref set get_all (session ref)

(host_metrics ref → host_metrics record) map get_all_records (session ref)

host_metrics ref get_by_uuid (session ref, string)

datetime get_last_updated (session ref, host_metrics ref)

bool get_live (session ref, host_metrics ref)

Removed

int get_memory_free (session ref, host_metrics ref)

int get_memory_total (session ref, host_metrics ref)

(string → string) map get_other_config (session ref, host_metrics ref)

host_metrics record get_record (session ref, host_metrics ref)

string get_uuid (session ref, host_metrics ref)

void remove_from_other_config (session ref, host_metrics ref, string)

void set_other_config (session ref, host_metrics ref, (string → string) map)

host_patch

Deprecated

Class: host_patch

Represents a patch stored on a server

Fields

bool applied [RO/runtime]

host ref host [RO/constructor]

string name_description [RO/constructor]

string name_label [RO/constructor]

(string → string) map other_config [RW]

pool_patch ref pool_patch [RO/constructor]

int size [RO/runtime]

datetime timestamp_applied [RO/runtime]

string uuid [RO/runtime]

string version [RO/constructor]

Messages

void add_to_other_config (session ref, host_patch ref, string, string)

Deprecated

string apply (session ref, host_patch ref)

Deprecated

void destroy (session ref, host_patch ref)

Deprecated

host_patch ref set get_all (session ref)

Deprecated

(host_patch ref → host_patch record) map get_all_records (session ref)

bool get_applied (session ref, host_patch ref)

Deprecated

host_patch ref set get_by_name_label (session ref, string)

Deprecated

host_patch ref get_by_uuid (session ref, string)

host ref get_host (session ref, host_patch ref)

string get_name_description (session ref, host_patch ref)

string get_name_label (session ref, host_patch ref)

(string → string) map get_other_config (session ref, host_patch ref)

pool_patch ref get_pool_patch (session ref, host_patch ref)

Deprecated

host_patch record get_record (session ref, host_patch ref)

int get_size (session ref, host_patch ref)

datetime get_timestamp_applied (session ref, host_patch ref)

string get_uuid (session ref, host_patch ref)

string get_version (session ref, host_patch ref)

void remove_from_other_config (session ref, host_patch ref, string)

void set_other_config (session ref, host_patch ref, (string → string) map)

LVHD

Class: LVHD

LVHD SR specific operations

Fields

string uuid [RO/runtime]

Messages

string enable_thin_provisioning (session ref, host ref, SR ref, int, int)

LVHD ref get_by_uuid (session ref, string)

LVHD record get_record (session ref, LVHD ref)

string get_uuid (session ref, LVHD ref)

message

Class: message

A message for the attention of the administrator

Enums

cls

Fields

string body [RO/runtime]

enum cls cls [RO/runtime]

string name [RO/runtime]

string obj_uuid [RO/runtime]

int priority [RO/runtime]

datetime timestamp [RO/runtime]

string uuid [RO/runtime]

Messages

message ref create (session ref, string, int, enum cls, string, string)

void destroy (session ref, message ref)

Prototype

void destroy_many (session ref, message ref set)

(message ref → message record) map get (session ref, enum cls, string, datetime)

message ref set get_all (session ref)

(message ref → message record) map get_all_records (session ref)

(message ref → message record) map get_all_records_where (session ref, string)

message ref get_by_uuid (session ref, string)

message record get_record (session ref, message ref)

(message ref → message record) map get_since (session ref, datetime)
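
Message records carry the `obj_uuid` of the object they refer to, so a client can filter them locally. A minimal sketch follows; `messages_for` is a hypothetical helper, and `xenapi` is assumed to behave like the `session.xenapi` proxy of the Python XenAPI bindings, with records returned as maps keyed by the field names listed above.

```python
def messages_for(xenapi, obj_uuid):
    """Return the message records for one object, ordered by numeric priority."""
    records = xenapi.message.get_all_records()
    matching = [r for r in records.values() if r["obj_uuid"] == obj_uuid]
    # int() accepts both native and string-encoded integers.
    return sorted(matching, key=lambda r: int(r["priority"]))
```

For a large pool, `get_since` with a recent datetime would likely return a smaller map to filter than `get_all_records`.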

network

Class: network

A virtual network

Enums

network_operations

network_default_locking_mode

network_purpose

Fields

enum network_operations set allowed_operations [RO/runtime]

(VIF ref → string) map assigned_ips [RO/runtime]

(string → blob ref) map blobs [RO/runtime]

string bridge [RO/constructor]

(string → enum network_operations) map current_operations [RO/runtime]

enum network_default_locking_mode default_locking_mode [RO/runtime]

bool managed [RO/constructor]

int MTU [RW]

string name_description [RW]

string name_label [RW]

(string → string) map other_config [RW]

PIF ref set PIFs [RO/runtime]

enum network_purpose set purpose [RO/runtime]

string set tags [RW]

string uuid [RO/runtime]

VIF ref set VIFs [RO/runtime]

Messages

void add_purpose (session ref, network ref, enum network_purpose)

void add_tags (session ref, network ref, string)

void add_to_other_config (session ref, network ref, string, string)

network ref create (session ref, network record)

blob ref create_new_blob (session ref, network ref, string, string, bool)

void destroy (session ref, network ref)

network ref set get_all (session ref)

(network ref → network record) map get_all_records (session ref)

enum network_operations set get_allowed_operations (session ref, network ref)

(VIF ref → string) map get_assigned_ips (session ref, network ref)

(string → blob ref) map get_blobs (session ref, network ref)

string get_bridge (session ref, network ref)

network ref set get_by_name_label (session ref, string)

network ref get_by_uuid (session ref, string)

(string → enum network_operations) map get_current_operations (session ref, network ref)

enum network_default_locking_mode get_default_locking_mode (session ref, network ref)

bool get_managed (session ref, network ref)

int get_MTU (session ref, network ref)

string get_name_description (session ref, network ref)

string get_name_label (session ref, network ref)

(string → string) map get_other_config (session ref, network ref)

PIF ref set get_PIFs (session ref, network ref)

enum network_purpose set get_purpose (session ref, network ref)

network record get_record (session ref, network ref)

string set get_tags (session ref, network ref)

string get_uuid (session ref, network ref)

VIF ref set get_VIFs (session ref, network ref)

void remove_from_other_config (session ref, network ref, string)

void remove_purpose (session ref, network ref, enum network_purpose)

void remove_tags (session ref, network ref, string)

void set_default_locking_mode (session ref, network ref, enum network_default_locking_mode)

void set_MTU (session ref, network ref, int)

void set_name_description (session ref, network ref, string)

void set_name_label (session ref, network ref, string)

void set_other_config (session ref, network ref, (string → string) map)

void set_tags (session ref, network ref, string set)

network_sriov

Class: network_sriov

A network_sriov object, which connects a logical PIF to a physical PIF

Enums

sriov_configuration_mode

Fields

enum sriov_configuration_mode configuration_mode [RO/runtime]

PIF ref logical_PIF [RO/constructor]

PIF ref physical_PIF [RO/constructor]

bool requires_reboot [RO/runtime]

string uuid [RO/runtime]

Messages

network_sriov ref create (session ref, PIF ref, network ref)

void destroy (session ref, network_sriov ref)

network_sriov ref set get_all (session ref)

(network_sriov ref → network_sriov record) map get_all_records (session ref)

network_sriov ref get_by_uuid (session ref, string)

enum sriov_configuration_mode get_configuration_mode (session ref, network_sriov ref)

PIF ref get_logical_PIF (session ref, network_sriov ref)

PIF ref get_physical_PIF (session ref, network_sriov ref)

network_sriov record get_record (session ref, network_sriov ref)

int get_remaining_capacity (session ref, network_sriov ref)

bool get_requires_reboot (session ref, network_sriov ref)

string get_uuid (session ref, network_sriov ref)
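
The signatures above can be read as remote procedure calls: the class and message names are joined into a single method name, and the session reference is always the first parameter. The following is a minimal sketch, assuming the XML-RPC wire format and using only the Python standard library. The helper name `build_call` and the `OpaqueRef:...-placeholder` strings are illustrative assumptions, not part of the API.

```python
import xmlrpc.client


def build_call(class_name, message, session_ref, *args):
    """Serialise an API call as an XML-RPC request body.

    The method name is "<Class>.<message>" and the session reference
    comes first, followed by the remaining parameters in listed order.
    """
    method = f"{class_name}.{message}"
    return xmlrpc.client.dumps((session_ref, *args), methodname=method)


# Placeholder references; real ones are returned by earlier API calls.
SESSION = "OpaqueRef:session-placeholder"
PIF = "OpaqueRef:pif-placeholder"
NETWORK = "OpaqueRef:network-placeholder"

# network_sriov ref create (session ref, PIF ref, network ref)
request_body = build_call("network_sriov", "create", SESSION, PIF, NETWORK)

# Decode the body again to show what a server would receive.
params, method = xmlrpc.client.loads(request_body)
print(method)
print(params)
```

The same helper would apply to any other message in this reference, since only the class name, message name and parameter list change.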

Observer

Prototype

Class: Observer

Describes an observer which will control observability activity in the Toolstack

Fields

Prototype

(string → string) map attributes [RO/constructor]

Prototype

string set components [RO/constructor]

Prototype

bool enabled [RO/constructor]

Prototype

string set endpoints [RO/constructor]

Prototype

host ref set hosts [RO/constructor]

string name_description [RW]

string name_label [RW]

Prototype

string uuid [RO/runtime]

Messages

Prototype

Observer ref create (session ref, Observer record)

Prototype

void destroy (session ref, Observer ref)

Prototype

Observer ref set get_all (session ref)

Prototype

(Observer ref → Observer record) map get_all_records (session ref)

Prototype

(string → string) map get_attributes (session ref, Observer ref)

Prototype

Observer ref set get_by_name_label (session ref, string)

Prototype

Observer ref get_by_uuid (session ref, string)

Prototype

string set get_components (session ref, Observer ref)

Prototype

bool get_enabled (session ref, Observer ref)

Prototype

string set get_endpoints (session ref, Observer ref)

Prototype

host ref set get_hosts (session ref, Observer ref)

string get_name_description (session ref, Observer ref)

string get_name_label (session ref, Observer ref)

Prototype

Observer record get_record (session ref, Observer ref)

Prototype

string get_uuid (session ref, Observer ref)

Prototype

void set_attributes (session ref, Observer ref, (string → string) map)

Prototype

void set_components (session ref, Observer ref, string set)

Prototype

void set_enabled (session ref, Observer ref, bool)

Prototype

void set_endpoints (session ref, Observer ref, string set)

Prototype

void set_hosts (session ref, Observer ref, host ref set)

void set_name_description (session ref, Observer ref, string)

void set_name_label (session ref, Observer ref, string)
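
Because `create` takes an Observer record, a caller has to assemble the constructor and read-write fields listed above into one structure. The sketch below shows one way this might look in Python, assuming sets map to lists and maps to dictionaries on the wire. The helper name `observer_record` and all example values are hypothetical, and since the class is marked Prototype the fields may change.

```python
import xmlrpc.client


def observer_record(name_label, endpoints, components, hosts=(),
                    enabled=False, name_description="", attributes=None):
    """Build an Observer record from the constructor and RW fields."""
    return {
        "name_label": name_label,
        "name_description": name_description,
        "hosts": list(hosts),                  # host ref set
        "attributes": dict(attributes or {}),  # (string -> string) map
        "endpoints": list(endpoints),          # string set
        "components": list(components),        # string set
        "enabled": enabled,                    # bool
    }


record = observer_record(
    name_label="example-observer",
    endpoints=["example-endpoint"],      # hypothetical value
    components=["example-component"],    # hypothetical value
)

# Observer ref create (session ref, Observer record)
body = xmlrpc.client.dumps(
    ("OpaqueRef:session-placeholder", record),
    methodname="Observer.create",
)
decoded_params, decoded_method = xmlrpc.client.loads(body)
```

The `uuid` field is left out of the record because it is marked RO/runtime, so it is assigned by the server rather than supplied by the caller.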

PBD

Class: PBD

The physical block devices through which hosts access SRs

Fields

bool currently_attached [RO/runtime]

(string → string) map device_config [RO/constructor]

host ref host [RO/constructor]

(string → string) map other_config [RW]

SR ref SR [RO/constructor]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, PBD ref, string, string)

PBD ref create (session ref, PBD record)

void destroy (session ref, PBD ref)

PBD ref set get_all (session ref)

(PBD ref → PBD record) map get_all_records (session ref)

PBD ref get_by_uuid (session ref, string)

bool get_currently_attached (session ref, PBD ref)

(string → string) map get_device_config (session ref, PBD ref)

host ref get_host (session ref, PBD ref)

(string → string) map get_other_config (session ref, PBD ref)

PBD record get_record (session ref, PBD ref)

SR ref get_SR (session ref, PBD ref)

string get_uuid (session ref, PBD ref)

void plug (session ref, PBD ref)

void remove_from_other_config (session ref, PBD ref, string)

void set_device_config (session ref, PBD ref, (string → string) map)

void set_other_config (session ref, PBD ref, (string → string) map)

void unplug (session ref, PBD ref)
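
A PBD is created from a record and only then plugged, so attaching an SR to a host is naturally a two-call sequence. The sketch below illustrates that ordering with an injectable transport, so it runs without a server. `attach_sr`, `RecordingTransport` and every reference string are assumptions made for illustration; a real client would send the calls over HTTP(S).

```python
class RecordingTransport:
    """Stand-in for a real connection: records calls, returns canned refs."""

    def __init__(self):
        self.calls = []

    def __call__(self, method, params):
        self.calls.append((method, params))
        # Only PBD.create returns a value; PBD.plug is void.
        return "OpaqueRef:pbd-placeholder" if method == "PBD.create" else None


def attach_sr(transport, session, host_ref, sr_ref, device_config):
    """Create a PBD linking host and SR, plug it, and return the PBD ref."""
    record = {
        "host": host_ref,                # RO/constructor
        "SR": sr_ref,                    # RO/constructor
        "device_config": device_config,  # RO/constructor
        "other_config": {},              # RW
    }
    pbd_ref = transport("PBD.create", (session, record))
    transport("PBD.plug", (session, pbd_ref))
    return pbd_ref


fake = RecordingTransport()
pbd = attach_sr(
    fake,
    "OpaqueRef:session-placeholder",
    "OpaqueRef:host-placeholder",
    "OpaqueRef:sr-placeholder",
    {"example-key": "example-value"},  # hypothetical device_config
)
print([method for method, _ in fake.calls])
```

Keeping the transport as a parameter makes the call order easy to verify in a unit test before pointing the same function at a live pool.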

PCI

Class: PCI

A PCI device

Fields

string class_name [RO/constructor]

PCI ref set dependencies [RO/runtime]

string device_name [RO/constructor]

string driver_name [RO/constructor]

host ref host [RO/constructor]

(string → string) map other_config [RW]

string pci_id [RO/constructor]

string subsystem_device_name [RO/constructor]

string subsystem_vendor_name [RO/constructor]

string uuid [RO/runtime]

string vendor_name [RO/constructor]

Messages

void add_to_other_config (session ref, PCI ref, string, string)

PCI ref set get_all (session ref)

(PCI ref → PCI record) map get_all_records (session ref)

PCI ref get_by_uuid (session ref, string)

string get_class_name (session ref, PCI ref)

PCI ref set get_dependencies (session ref, PCI ref)

string get_device_name (session ref, PCI ref)

string get_driver_name (session ref, PCI ref)

host ref get_host (session ref, PCI ref)

(string → string) map get_other_config (session ref, PCI ref)

string get_pci_id (session ref, PCI ref)

PCI record get_record (session ref, PCI ref)

string get_subsystem_device_name (session ref, PCI ref)

string get_subsystem_vendor_name (session ref, PCI ref)

string get_uuid (session ref, PCI ref)

string get_vendor_name (session ref, PCI ref)

void remove_from_other_config (session ref, PCI ref, string)

void set_other_config (session ref, PCI ref, (string → string) map)

PGPU

Class: PGPU

A physical GPU (pGPU)

Enums

pgpu_dom0_access

Fields

(string → string) map compatibility_metadata [RO/runtime]

enum pgpu_dom0_access dom0_access [RO/runtime]

VGPU_type ref set enabled_VGPU_types [RO/runtime]

GPU_group ref GPU_group [RO/constructor]

host ref host [RO/runtime]

bool is_system_display_device [RO/runtime]

(string → string) map other_config [RW]

PCI ref PCI [RO/constructor]

VGPU ref set resident_VGPUs [RO/runtime]

(VGPU_type ref → int) map supported_VGPU_max_capacities [RO/runtime]

VGPU_type ref set supported_VGPU_types [RO/runtime]

string uuid [RO/runtime]

Messages

void add_enabled_VGPU_types (session ref, PGPU ref, VGPU_type ref)

void add_to_other_config (session ref, PGPU ref, string, string)

enum pgpu_dom0_access disable_dom0_access (session ref, PGPU ref)

enum pgpu_dom0_access enable_dom0_access (session ref, PGPU ref)

PGPU ref set get_all (session ref)

(PGPU ref → PGPU record) map get_all_records (session ref)

PGPU ref get_by_uuid (session ref, string)

(string → string) map get_compatibility_metadata (session ref, PGPU ref)

enum pgpu_dom0_access get_dom0_access (session ref, PGPU ref)

VGPU_type ref set get_enabled_VGPU_types (session ref, PGPU ref)

GPU_group ref get_GPU_group (session ref, PGPU ref)

host ref get_host (session ref, PGPU ref)

bool get_is_system_display_device (session ref, PGPU ref)

(string → string) map get_other_config (session ref, PGPU ref)

PCI ref get_PCI (session ref, PGPU ref)

PGPU record get_record (session ref, PGPU ref)

int get_remaining_capacity (session ref, PGPU ref, VGPU_type ref)

VGPU ref set get_resident_VGPUs (session ref, PGPU ref)

(VGPU_type ref → int) map get_supported_VGPU_max_capacities (session ref, PGPU ref)

VGPU_type ref set get_supported_VGPU_types (session ref, PGPU ref)

string get_uuid (session ref, PGPU ref)

void remove_enabled_VGPU_types (session ref, PGPU ref, VGPU_type ref)

void remove_from_other_config (session ref, PGPU ref, string)

void set_enabled_VGPU_types (session ref, PGPU ref, VGPU_type ref set)

void set_GPU_group (session ref, PGPU ref, GPU_group ref)

void set_other_config (session ref, PGPU ref, (string → string) map)

PIF

Class: PIF

A physical network interface (note: separate VLANs are represented as several PIFs)

Enums

pif_igmp_status

ip_configuration_mode

ipv6_configuration_mode

primary_address_type

Fields

Bond ref set bond_master_of [RO/runtime]

Bond ref bond_slave_of [RO/runtime]

string set capabilities [RO/runtime]

bool currently_attached [RO/runtime]

string device [RO/constructor]

bool disallow_unplug [RO/runtime]

string DNS [RO/runtime]

string gateway [RO/runtime]

host ref host [RO/constructor]

enum pif_igmp_status igmp_snooping_status [RO/runtime]

string IP [RO/runtime]

enum ip_configuration_mode ip_configuration_mode [RO/runtime]

string set IPv6 [RO/runtime]

enum ipv6_configuration_mode ipv6_configuration_mode [RO/runtime]

string ipv6_gateway [RO/runtime]

string MAC [RO/constructor]

bool managed [RO/constructor]

bool management [RO/runtime]

PIF_metrics ref metrics [RO/runtime]

int MTU [RO/constructor]

string netmask [RO/runtime]

network ref network [RO/constructor]

(string → string) map other_config [RW]

PCI ref PCI [RO/runtime]

bool physical [RO/runtime]

enum primary_address_type primary_address_type [RO/runtime]

(string → string) map properties [RO/runtime]

network_sriov ref set sriov_logical_PIF_of [RO/runtime]

network_sriov ref set sriov_physical_PIF_of [RO/runtime]

tunnel ref set tunnel_access_PIF_of [RO/runtime]

tunnel ref set tunnel_transport_PIF_of [RO/runtime]

string uuid [RO/runtime]

int VLAN [RO/constructor]

VLAN ref VLAN_master_of [RO/runtime]

VLAN ref set VLAN_slave_of [RO/runtime]

Messages

void add_to_other_config (session ref, PIF ref, string, string)

Deprecated

PIF ref create_VLAN (session ref, string, network ref, host ref, int)

void db_forget (session ref, PIF ref)

PIF ref db_introduce (session ref, string, network ref, host ref, string, int, int, bool, enum ip_configuration_mode, string, string, string, string, Bond ref, VLAN ref, bool, (string → string) map, bool, enum ipv6_configuration_mode, string set, string, enum primary_address_type, bool, (string → string) map)

Deprecated

void destroy (session ref, PIF ref)

void forget (session ref, PIF ref)

PIF ref set get_all (session ref)

(PIF ref → PIF record) map get_all_records (session ref)

Bond ref set get_bond_master_of (session ref, PIF ref)

Bond ref get_bond_slave_of (session ref, PIF ref)

PIF ref get_by_uuid (session ref, string)

string set get_capabilities (session ref, PIF ref)

bool get_currently_attached (session ref, PIF ref)

string get_device (session ref, PIF ref)

bool get_disallow_unplug (session ref, PIF ref)

string get_DNS (session ref, PIF ref)

string get_gateway (session ref, PIF ref)

host ref get_host (session ref, PIF ref)

enum pif_igmp_status get_igmp_snooping_status (session ref, PIF ref)

string get_IP (session ref, PIF ref)

enum ip_configuration_mode get_ip_configuration_mode (session ref, PIF ref)

string set get_IPv6 (session ref, PIF ref)

enum ipv6_configuration_mode get_ipv6_configuration_mode (session ref, PIF ref)

string get_ipv6_gateway (session ref, PIF ref)

string get_MAC (session ref, PIF ref)

bool get_managed (session ref, PIF ref)

bool get_management (session ref, PIF ref)

PIF_metrics ref get_metrics (session ref, PIF ref)

int get_MTU (session ref, PIF ref)

string get_netmask (session ref, PIF ref)

network ref get_network (session ref, PIF ref)

(string → string) map get_other_config (session ref, PIF ref)

PCI ref get_PCI (session ref, PIF ref)

bool get_physical (session ref, PIF ref)

enum primary_address_type get_primary_address_type (session ref, PIF ref)

(string → string) map get_properties (session ref, PIF ref)

PIF record get_record (session ref, PIF ref)

network_sriov ref set get_sriov_logical_PIF_of (session ref, PIF ref)

network_sriov ref set get_sriov_physical_PIF_of (session ref, PIF ref)

tunnel ref set get_tunnel_access_PIF_of (session ref, PIF ref)

tunnel ref set get_tunnel_transport_PIF_of (session ref, PIF ref)

string get_uuid (session ref, PIF ref)

int get_VLAN (session ref, PIF ref)

VLAN ref get_VLAN_master_of (session ref, PIF ref)

VLAN ref set get_VLAN_slave_of (session ref, PIF ref)

PIF ref introduce (session ref, host ref, string, string, bool)

void plug (session ref, PIF ref)

void reconfigure_ip (session ref, PIF ref, enum ip_configuration_mode, string, string, string, string)

void reconfigure_ipv6 (session ref, PIF ref, enum ipv6_configuration_mode, string, string, string)

void remove_from_other_config (session ref, PIF ref, string)

void scan (session ref, host ref)

void set_disallow_unplug (session ref, PIF ref, bool)

void set_other_config (session ref, PIF ref, (string → string) map)

void set_primary_address_type (session ref, PIF ref, enum primary_address_type)

void set_property (session ref, PIF ref, string, string)

void unplug (session ref, PIF ref)
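
The `reconfigure_ip` message takes four unnamed strings, so it is easy to pass them in the wrong order or with inconsistent values. The sketch below validates a static IPv4 configuration before assembling the parameter tuple. It assumes the strings are IP, netmask, gateway and DNS in that order, and that `"Static"` is a value of `ip_configuration_mode`; neither is stated in the listing above. The helper name `static_ip_params` is hypothetical.

```python
import ipaddress


def static_ip_params(session, pif_ref, ip, netmask, gateway, dns=""):
    """Validate and assemble parameters for a static PIF.reconfigure_ip call."""
    # IPv4Interface accepts "address/netmask" and raises ValueError if invalid.
    iface = ipaddress.IPv4Interface(f"{ip}/{netmask}")
    if gateway and ipaddress.IPv4Address(gateway) not in iface.network:
        raise ValueError(f"gateway {gateway} is outside {iface.network}")
    # Assumed order: mode, IP, netmask, gateway, DNS.
    return (session, pif_ref, "Static", ip, netmask, gateway, dns)


# Addresses taken from the documentation-only range 192.0.2.0/24.
params = static_ip_params(
    "OpaqueRef:session-placeholder",
    "OpaqueRef:pif-placeholder",
    ip="192.0.2.10",
    netmask="255.255.255.0",
    gateway="192.0.2.1",
)
print(params)
```

Checking the gateway against the interface's network on the client side catches a common misconfiguration before the management interface is changed.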

PIF_metrics

Class: PIF_metrics

The metrics associated with a physical network interface

Fields

bool carrier [RO/runtime]

string device_id [RO/runtime]

string device_name [RO/runtime]

bool duplex [RO/runtime]

Removed

float io_read_kbs [RO/runtime]

Removed

float io_write_kbs [RO/runtime]

datetime last_updated [RO/runtime]

(string → string) map other_config [RW]

string pci_bus_path [RO/runtime]

int speed [RO/runtime]

string uuid [RO/runtime]

string vendor_id [RO/runtime]

string vendor_name [RO/runtime]

Messages

void add_to_other_config (session ref, PIF_metrics ref, string, string)

PIF_metrics ref set get_all (session ref)

(PIF_metrics ref → PIF_metrics record) map get_all_records (session ref)

PIF_metrics ref get_by_uuid (session ref, string)

bool get_carrier (session ref, PIF_metrics ref)

string get_device_id (session ref, PIF_metrics ref)

string get_device_name (session ref, PIF_metrics ref)

bool get_duplex (session ref, PIF_metrics ref)

Removed

float get_io_read_kbs (session ref, PIF_metrics ref)

Removed

float get_io_write_kbs (session ref, PIF_metrics ref)

datetime get_last_updated (session ref, PIF_metrics ref)

(string → string) map get_other_config (session ref, PIF_metrics ref)

string get_pci_bus_path (session ref, PIF_metrics ref)

PIF_metrics record get_record (session ref, PIF_metrics ref)

int get_speed (session ref, PIF_metrics ref)

string get_uuid (session ref, PIF_metrics ref)

string get_vendor_id (session ref, PIF_metrics ref)

string get_vendor_name (session ref, PIF_metrics ref)

void remove_from_other_config (session ref, PIF_metrics ref, string)

void set_other_config (session ref, PIF_metrics ref, (string → string) map)

pool

Class: pool

Pool-wide information

Enums

pool_allowed_operations

telemetry_frequency

update_sync_frequency

Fields

enum pool_allowed_operations set allowed_operations [RO/runtime]

(string → blob ref) map blobs [RO/runtime]

bool client_certificate_auth_enabled [RO/runtime]

string client_certificate_auth_name [RO/runtime]

bool coordinator_bias [RW]

(string → string) map cpu_info [RO/runtime]

SR ref crash_dump_SR [RW]

(string → enum pool_allowed_operations) map current_operations [RO/runtime]

Prototype

string custom_uefi_certificates [RO/constructor]

SR ref default_SR [RW]

Prototype

int ext_auth_max_threads [RO/constructor]

(string → string) map guest_agent_config [RO/runtime]

(string → string) map gui_config [RW]

bool ha_allow_overcommit [RW]

string ha_cluster_stack [RO/runtime]

(string → string) map ha_configuration [RO/runtime]

bool ha_enabled [RO/runtime]

int ha_host_failures_to_tolerate [RO/runtime]

bool ha_overcommitted [RO/runtime]

int ha_plan_exists_for [RO/runtime]

string set ha_statefiles [RO/runtime]

(string → string) map health_check_config [RW]

bool igmp_snooping_enabled [RO/runtime]

bool is_psr_pending [RW]

Prototype

datetime last_update_sync [RO/runtime]

bool live_patching_disabled [RW]

Prototype

int local_auth_max_threads [RO/constructor]

host ref master [RO/runtime]

VDI ref set metadata_VDIs [RO/runtime]

Prototype

bool migration_compression [RW]

string name_description [RW]

string name_label [RW]

(string → string) map other_config [RW]

bool policy_no_vendor_device [RW]

bool redo_log_enabled [RO/runtime]

VDI ref redo_log_vdi [RO/runtime]

Repository ref set repositories [RO/runtime]

secret ref repository_proxy_password [RO/runtime]

string repository_proxy_url [RO/runtime]

string repository_proxy_username [RO/runtime]

(string → string) map restrictions [RO/runtime]

SR ref suspend_image_SR [RW]

string set tags [RW]

Prototype

enum telemetry_frequency telemetry_frequency [RO/runtime]

Prototype

datetime telemetry_next_collection [RO/runtime]

Prototype

secret ref telemetry_uuid [RO/runtime]

bool tls_verification_enabled [RO/runtime]

string uefi_certificates [RO/constructor]

Prototype

int update_sync_day [RO/runtime]

Prototype

bool update_sync_enabled [RO/runtime]

Prototype

enum update_sync_frequency update_sync_frequency [RO/runtime]

string uuid [RO/runtime]

Deprecated

string vswitch_controller [RO/runtime]

bool wlb_enabled [RW]

string wlb_url [RO/runtime]

string wlb_username [RO/runtime]

Deprecated

bool wlb_verify_cert [RW]

Messages

void add_repository (session ref, pool ref, Repository ref)

void add_tags (session ref, pool ref, string)

void add_to_guest_agent_config (session ref, pool ref, string, string)

void add_to_gui_config (session ref, pool ref, string, string)

void add_to_health_check_config (session ref, pool ref, string, string)

void add_to_other_config (session ref, pool ref, string, string)

void apply_edition (session ref, pool ref, string)

Deprecated

void certificate_install (session ref, string, string)

Deprecated

string set certificate_list (session ref)

void certificate_sync (session ref)

Deprecated

void certificate_uninstall (session ref, string)

string set set check_update_readiness (session ref, pool ref, bool)

void configure_repository_proxy (session ref, pool ref, string, string, string)

Prototype

void configure_update_sync (session ref, pool ref, enum update_sync_frequency, int)

blob ref create_new_blob (session ref, pool ref, string, string, bool)

PIF ref set create_VLAN (session ref, string, network ref, int)

PIF ref set create_VLAN_from_PIF (session ref, PIF ref, network ref, int)

void crl_install (session ref, string, string)

string set crl_list (session ref)

void crl_uninstall (session ref, string)

void deconfigure_wlb (session ref)

void designate_new_master (session ref, host ref)

void detect_nonhomogeneous_external_auth (session ref, pool ref)

void disable_client_certificate_auth (session ref, pool ref)

void disable_external_auth (session ref, pool ref, (string → string) map)

void disable_ha (session ref)

void disable_local_storage_caching (session ref, pool ref)

void disable_redo_log (session ref)

void disable_repository_proxy (session ref, pool ref)

Deprecated

void disable_ssl_legacy (session ref, pool ref)

void eject (session ref, host ref)

void emergency_reset_master (session ref, string)

void emergency_transition_to_master (session ref)

void enable_client_certificate_auth (session ref, pool ref, string)

void enable_external_auth (session ref, pool ref, (string → string) map, string, string)

void enable_ha (session ref, SR ref set, (string → string) map)

void enable_local_storage_caching (session ref, pool ref)

void enable_redo_log (session ref, SR ref)

Removed

void enable_ssl_legacy (session ref, pool ref)

void enable_tls_verification (session ref)

pool ref set get_all (session ref)

(pool ref → pool record) map get_all_records (session ref)

enum pool_allowed_operations set get_allowed_operations (session ref, pool ref)

(string → blob ref) map get_blobs (session ref, pool ref)

pool ref get_by_uuid (session ref, string)

bool get_client_certificate_auth_enabled (session ref, pool ref)

string get_client_certificate_auth_name (session ref, pool ref)

bool get_coordinator_bias (session ref, pool ref)

(string → string) map get_cpu_info (session ref, pool ref)

SR ref get_crash_dump_SR (session ref, pool ref)

(string → enum pool_allowed_operations) map get_current_operations (session ref, pool ref)

Prototype

string get_custom_uefi_certificates (session ref, pool ref)

SR ref get_default_SR (session ref, pool ref)

Prototype

int get_ext_auth_max_threads (session ref, pool ref)

(string → string) map get_guest_agent_config (session ref, pool ref)

(string → string) map get_gui_config (session ref, pool ref)

bool get_ha_allow_overcommit (session ref, pool ref)

string get_ha_cluster_stack (session ref, pool ref)

(string → string) map get_ha_configuration (session ref, pool ref)

bool get_ha_enabled (session ref, pool ref)

int get_ha_host_failures_to_tolerate (session ref, pool ref)

bool get_ha_overcommitted (session ref, pool ref)

int get_ha_plan_exists_for (session ref, pool ref)

string set get_ha_statefiles (session ref, pool ref)

(string → string) map get_health_check_config (session ref, pool ref)

bool get_igmp_snooping_enabled (session ref, pool ref)

bool get_is_psr_pending (session ref, pool ref)

Prototype

datetime get_last_update_sync (session ref, pool ref)

(string → string) map get_license_state (session ref, pool ref)

bool get_live_patching_disabled (session ref, pool ref)

Prototype

int get_local_auth_max_threads (session ref, pool ref)

host ref get_master (session ref, pool ref)

VDI ref set get_metadata_VDIs (session ref, pool ref)

Prototype

bool get_migration_compression (session ref, pool ref)

string get_name_description (session ref, pool ref)

string get_name_label (session ref, pool ref)

(string → string) map get_other_config (session ref, pool ref)

bool get_policy_no_vendor_device (session ref, pool ref)

pool record get_record (session ref, pool ref)

bool get_redo_log_enabled (session ref, pool ref)

VDI ref get_redo_log_vdi (session ref, pool ref)

Repository ref set get_repositories (session ref, pool ref)

secret ref get_repository_proxy_password (session ref, pool ref)

string get_repository_proxy_url (session ref, pool ref)

string get_repository_proxy_username (session ref, pool ref)

(string → string) map get_restrictions (session ref, pool ref)

SR ref get_suspend_image_SR (session ref, pool ref)

string set get_tags (session ref, pool ref)

Prototype

enum telemetry_frequency get_telemetry_frequency (session ref, pool ref)

Prototype

datetime get_telemetry_next_collection (session ref, pool ref)

Prototype

secret ref get_telemetry_uuid (session ref, pool ref)

bool get_tls_verification_enabled (session ref, pool ref)

string get_uefi_certificates (session ref, pool ref)

Prototype

int get_update_sync_day (session ref, pool ref)

Prototype

bool get_update_sync_enabled (session ref, pool ref)

Prototype

enum update_sync_frequency get_update_sync_frequency (session ref, pool ref)

string get_uuid (session ref, pool ref)

Deprecated

string get_vswitch_controller (session ref, pool ref)

bool get_wlb_enabled (session ref, pool ref)

string get_wlb_url (session ref, pool ref)

string get_wlb_username (session ref, pool ref)

Deprecated

bool get_wlb_verify_cert (session ref, pool ref)

int ha_compute_hypothetical_max_host_failures_to_tolerate (session ref, (VM ref → string) map)

int ha_compute_max_host_failures_to_tolerate (session ref)

(VM ref → (string → string) map) map ha_compute_vm_failover_plan (session ref, host ref set, VM ref set)

bool ha_failover_plan_exists (session ref, int)

void ha_prevent_restarts_for (session ref, int)

bool has_extension (session ref, pool ref, string)

void initialize_wlb (session ref, string, string, string, string, string)

void install_ca_certificate (session ref, string, string)

void join (session ref, string, string, string)

void join_force (session ref, string, string, string)

void management_reconfigure (session ref, network ref)

host ref set recover_slaves (session ref)

void remove_from_guest_agent_config (session ref, pool ref, string)

void remove_from_gui_config (session ref, pool ref, string)

void remove_from_health_check_config (session ref, pool ref, string)

void remove_from_other_config (session ref, pool ref, string)

void remove_repository (session ref, pool ref, Repository ref)

void remove_tags (session ref, pool ref, string)

Prototype

void reset_telemetry_uuid (session ref, pool ref)

(string → string) map retrieve_wlb_configuration (session ref)

(VM ref → string set) map retrieve_wlb_recommendations (session ref)

void rotate_secret (session ref)

string send_test_post (session ref, string, int, string)

void send_wlb_configuration (session ref, (string → string) map)

void set_coordinator_bias (session ref, pool ref, bool)

void set_crash_dump_SR (session ref, pool ref, SR ref)

Prototype

void set_custom_uefi_certificates (session ref, pool ref, string)

void set_default_SR (session ref, pool ref, SR ref)

Prototype

void set_ext_auth_max_threads (session ref, pool ref, int)

void set_gui_config (session ref, pool ref, (string → string) map)

void set_ha_allow_overcommit (session ref, pool ref, bool)

void set_ha_host_failures_to_tolerate (session ref, pool ref, int)

void set_health_check_config (session ref, pool ref, (string → string) map)

Prototype

void set_https_only (session ref, pool ref, bool)

void set_igmp_snooping_enabled (session ref, pool ref, bool)

void set_is_psr_pending (session ref, pool ref, bool)

void set_live_patching_disabled (session ref, pool ref, bool)

Prototype

void set_local_auth_max_threads (session ref, pool ref, int)

Prototype

void set_migration_compression (session ref, pool ref, bool)

void set_name_description (session ref, pool ref, string)

void set_name_label (session ref, pool ref, string)

void set_other_config (session ref, pool ref, (string → string) map)

void set_policy_no_vendor_device (session ref, pool ref, bool)

void set_repositories (session ref, pool ref, Repository ref set)

void set_suspend_image_SR (session ref, pool ref, SR ref)

void set_tags (session ref, pool ref, string set)

Prototype

void set_telemetry_next_collection (session ref, pool ref, datetime)

Deprecated

void set_uefi_certificates (session ref, pool ref, string)

Prototype

void set_update_sync_enabled (session ref, pool ref, bool)

Deprecated

void set_vswitch_controller (session ref, string)

void set_wlb_enabled (session ref, pool ref, bool)

Deprecated

void set_wlb_verify_cert (session ref, pool ref, bool)

void sync_database (session ref)

string sync_updates (session ref, pool ref, bool, string, string)

string test_archive_target (session ref, pool ref, (string → string) map)

void uninstall_ca_certificate (session ref, string)
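
The return types listed above, such as `pool ref set` for `get_all`, describe the payload only. This sketch assumes the XML-RPC convention in which each reply is a struct carrying a `Status` of `Success` with a `Value`, or `Failure` with an `ErrorDescription`; that envelope is not described in this listing. The names `unwrap` and `ApiFailure`, and the sample replies, are illustrative.

```python
import xmlrpc.client


class ApiFailure(Exception):
    """Raised when the reply reports a Status other than Success."""


def unwrap(response_body):
    """Decode an XML-RPC response and return its Value, or raise ApiFailure."""
    (result,), _ = xmlrpc.client.loads(response_body)
    if result.get("Status") == "Success":
        return result["Value"]
    raise ApiFailure(result.get("ErrorDescription"))


# Simulated reply to: pool ref set get_all (session ref)
ok_reply = xmlrpc.client.dumps(
    ({"Status": "Success", "Value": ["OpaqueRef:pool-placeholder"]},),
    methodresponse=True,
)
pools = unwrap(ok_reply)
print(pools)

# Simulated failure reply with a hypothetical error description.
bad_reply = xmlrpc.client.dumps(
    ({"Status": "Failure", "ErrorDescription": ["EXAMPLE_ERROR", "detail"]},),
    methodresponse=True,
)
try:
    unwrap(bad_reply)
    failure = None
except ApiFailure as exc:
    failure = exc.args[0]
```

Turning failures into exceptions keeps calling code focused on the documented return type instead of inspecting the envelope after every call.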

pool_patch

Deprecated

Class: pool_patch

Pool-wide patches

Enums

after_apply_guidance

Fields

enum after_apply_guidance set after_apply_guidance [RO/runtime]

host_patch ref set host_patches [RO/runtime]

string name_description [RO/constructor]

string name_label [RO/constructor]

(string → string) map other_config [RW]

bool pool_applied [RO/runtime]

pool_update ref pool_update [RO/constructor]

int size [RO/runtime]

string uuid [RO/runtime]

string version [RO/constructor]

Messages

void add_to_other_config (session ref, pool_patch ref, string, string)

Deprecated

string apply (session ref, pool_patch ref, host ref)

Deprecated

void clean (session ref, pool_patch ref)

Deprecated

void clean_on_host (session ref, pool_patch ref, host ref)

Deprecated

void destroy (session ref, pool_patch ref)

enum after_apply_guidance set get_after_apply_guidance (session ref, pool_patch ref)

Deprecated

pool_patch ref set get_all (session ref)

Deprecated

(pool_patch ref → pool_patch record) map get_all_records (session ref)

Deprecated

pool_patch ref set get_by_name_label (session ref, string)

Deprecated

pool_patch ref get_by_uuid (session ref, string)

host_patch ref set get_host_patches (session ref, pool_patch ref)

string get_name_description (session ref, pool_patch ref)

string get_name_label (session ref, pool_patch ref)

(string → string) map get_other_config (session ref, pool_patch ref)

bool get_pool_applied (session ref, pool_patch ref)

pool_update ref get_pool_update (session ref, pool_patch ref)

Deprecated

pool_patch record get_record (session ref, pool_patch ref)

int get_size (session ref, pool_patch ref)

string get_uuid (session ref, pool_patch ref)

string get_version (session ref, pool_patch ref)

Deprecated

void pool_apply (session ref, pool_patch ref)

Deprecated

void pool_clean (session ref, pool_patch ref)

Deprecated

string precheck (session ref, pool_patch ref, host ref)

void remove_from_other_config (session ref, pool_patch ref, string)

void set_other_config (session ref, pool_patch ref, (string → string) map)

pool_update

Class: pool_update

Pool-wide updates to the host software

Enums

update_after_apply_guidance

livepatch_status

Fields

enum update_after_apply_guidance set after_apply_guidance [RO/constructor]

bool enforce_homogeneity [RO/constructor]

host ref set hosts [RO/runtime]

int installation_size [RO/constructor]

string key [RO/constructor]

string name_description [RO/constructor]

string name_label [RO/constructor]

(string → string) map other_config [RW]

string uuid [RO/runtime]

VDI ref vdi [RO/constructor]

string version [RO/constructor]

Messages

void add_to_other_config (session ref, pool_update ref, string, string)

void apply (session ref, pool_update ref, host ref)

void destroy (session ref, pool_update ref)

enum update_after_apply_guidance set get_after_apply_guidance (session ref, pool_update ref)

pool_update ref set get_all (session ref)

(pool_update ref → pool_update record) map get_all_records (session ref)

pool_update ref set get_by_name_label (session ref, string)

pool_update ref get_by_uuid (session ref, string)

bool get_enforce_homogeneity (session ref, pool_update ref)

host ref set get_hosts (session ref, pool_update ref)

int get_installation_size (session ref, pool_update ref)

string get_key (session ref, pool_update ref)

string get_name_description (session ref, pool_update ref)

string get_name_label (session ref, pool_update ref)

(string → string) map get_other_config (session ref, pool_update ref)

pool_update record get_record (session ref, pool_update ref)

string get_uuid (session ref, pool_update ref)

VDI ref get_vdi (session ref, pool_update ref)

string get_version (session ref, pool_update ref)

pool_update ref introduce (session ref, VDI ref)

void pool_apply (session ref, pool_update ref)

void pool_clean (session ref, pool_update ref)

enum livepatch_status precheck (session ref, pool_update ref, host ref)

void remove_from_other_config (session ref, pool_update ref, string)

void set_other_config (session ref, pool_update ref, (string → string) map)

probe_result

Class: probe_result

A set of properties that describe one result element of SR.probe. Result elements and properties can change dynamically based on changes to the SR.probe input parameters or the target.

Fields

bool complete [RO/runtime]

(string → string) map configuration [RO/runtime]

(string → string) map extra_info [RO/runtime]

sr_stat record option sr [RO/runtime]

Messages

PUSB

Class: PUSB

A physical USB device

Fields

string description [RO/constructor]

host ref host [RO/constructor]

(string → string) map other_config [RW]

bool passthrough_enabled [RO/runtime]

string path [RO/constructor]

string product_desc [RO/constructor]

string product_id [RO/constructor]

string serial [RO/constructor]

float speed [RO/constructor]

USB_group ref USB_group [RO/constructor]

string uuid [RO/runtime]

string vendor_desc [RO/constructor]

string vendor_id [RO/constructor]

string version [RO/constructor]

Messages

void add_to_other_config (session ref, PUSB ref, string, string)

PUSB ref set get_all (session ref)

(PUSB ref → PUSB record) map get_all_records (session ref)

PUSB ref get_by_uuid (session ref, string)

string get_description (session ref, PUSB ref)

host ref get_host (session ref, PUSB ref)

(string → string) map get_other_config (session ref, PUSB ref)

bool get_passthrough_enabled (session ref, PUSB ref)

string get_path (session ref, PUSB ref)

string get_product_desc (session ref, PUSB ref)

string get_product_id (session ref, PUSB ref)

PUSB record get_record (session ref, PUSB ref)

string get_serial (session ref, PUSB ref)

float get_speed (session ref, PUSB ref)

USB_group ref get_USB_group (session ref, PUSB ref)

string get_uuid (session ref, PUSB ref)

string get_vendor_desc (session ref, PUSB ref)

string get_vendor_id (session ref, PUSB ref)

string get_version (session ref, PUSB ref)

void remove_from_other_config (session ref, PUSB ref, string)

void scan (session ref, host ref)

void set_other_config (session ref, PUSB ref, (string → string) map)

void set_passthrough_enabled (session ref, PUSB ref, bool)

PVS_cache_storage

Class: PVS_cache_storage

Describes the storage that is available to a PVS site for caching purposes

Fields

host ref host [RO/constructor]

PVS_site ref site [RO/constructor]

int size [RO/constructor]

SR ref SR [RO/constructor]

string uuid [RO/runtime]

VDI ref VDI [RO/runtime]

Messages

PVS_cache_storage ref create (session ref, PVS_cache_storage record)

void destroy (session ref, PVS_cache_storage ref)

PVS_cache_storage ref set get_all (session ref)

(PVS_cache_storage ref → PVS_cache_storage record) map get_all_records (session ref)

PVS_cache_storage ref get_by_uuid (session ref, string)

host ref get_host (session ref, PVS_cache_storage ref)

PVS_cache_storage record get_record (session ref, PVS_cache_storage ref)

PVS_site ref get_site (session ref, PVS_cache_storage ref)

int get_size (session ref, PVS_cache_storage ref)

SR ref get_SR (session ref, PVS_cache_storage ref)

string get_uuid (session ref, PVS_cache_storage ref)

VDI ref get_VDI (session ref, PVS_cache_storage ref)

PVS_proxy

Class: PVS_proxy

A proxy that connects a VM/VIF with a PVS site

Enums

pvs_proxy_status

Fields

bool currently_attached [RO/runtime]

PVS_site ref site [RO/constructor]

enum pvs_proxy_status status [RO/runtime]

string uuid [RO/runtime]

VIF ref VIF [RO/constructor]

Messages

PVS_proxy ref create (session ref, PVS_site ref, VIF ref)

void destroy (session ref, PVS_proxy ref)

PVS_proxy ref set get_all (session ref)

(PVS_proxy ref → PVS_proxy record) map get_all_records (session ref)

PVS_proxy ref get_by_uuid (session ref, string)

bool get_currently_attached (session ref, PVS_proxy ref)

PVS_proxy record get_record (session ref, PVS_proxy ref)

PVS_site ref get_site (session ref, PVS_proxy ref)

enum pvs_proxy_status get_status (session ref, PVS_proxy ref)

string get_uuid (session ref, PVS_proxy ref)

VIF ref get_VIF (session ref, PVS_proxy ref)

PVS_server

Class: PVS_server

An individual machine serving provisioning (block) data

Fields

string set addresses [RO/constructor]

int first_port [RO/constructor]

int last_port [RO/constructor]

PVS_site ref site [RO/constructor]

string uuid [RO/runtime]

Messages

void forget (session ref, PVS_server ref)

string set get_addresses (session ref, PVS_server ref)

PVS_server ref set get_all (session ref)

(PVS_server ref → PVS_server record) map get_all_records (session ref)

PVS_server ref get_by_uuid (session ref, string)

int get_first_port (session ref, PVS_server ref)

int get_last_port (session ref, PVS_server ref)

PVS_server record get_record (session ref, PVS_server ref)

PVS_site ref get_site (session ref, PVS_server ref)

string get_uuid (session ref, PVS_server ref)

PVS_server ref introduce (session ref, string set, int, int, PVS_site ref)

PVS_site

Class: PVS_site

Machines serving blocks of data for provisioning VMs

Fields

PVS_cache_storage ref set cache_storage [RO/runtime]

string name_description [RW]

string name_label [RW]

PVS_proxy ref set proxies [RO/runtime]

string PVS_uuid [RO/constructor]

PVS_server ref set servers [RO/runtime]

string uuid [RO/runtime]

Messages

void forget (session ref, PVS_site ref)

PVS_site ref set get_all (session ref)

(PVS_site ref → PVS_site record) map get_all_records (session ref)

PVS_site ref set get_by_name_label (session ref, string)

PVS_site ref get_by_uuid (session ref, string)

PVS_cache_storage ref set get_cache_storage (session ref, PVS_site ref)

string get_name_description (session ref, PVS_site ref)

string get_name_label (session ref, PVS_site ref)

PVS_proxy ref set get_proxies (session ref, PVS_site ref)

string get_PVS_uuid (session ref, PVS_site ref)

PVS_site record get_record (session ref, PVS_site ref)

PVS_server ref set get_servers (session ref, PVS_site ref)

string get_uuid (session ref, PVS_site ref)

PVS_site ref introduce (session ref, string, string, string)

void set_name_description (session ref, PVS_site ref, string)

void set_name_label (session ref, PVS_site ref, string)

void set_PVS_uuid (session ref, PVS_site ref, string)

Repository

Class: Repository

Repository for updates

Fields

string binary_url [RO/constructor]

Prototype

string gpgkey_path [RO/constructor]

string hash [RO/runtime]

string name_description [RW]

string name_label [RW]

string source_url [RO/constructor]

Removed

bool up_to_date [RO/runtime]

bool update [RO/constructor]

string uuid [RO/runtime]

Messages

void forget (session ref, Repository ref)

Repository ref set get_all (session ref)

(Repository ref → Repository record) map get_all_records (session ref)

string get_binary_url (session ref, Repository ref)

Repository ref set get_by_name_label (session ref, string)

Repository ref get_by_uuid (session ref, string)

Prototype

string get_gpgkey_path (session ref, Repository ref)

string get_hash (session ref, Repository ref)

string get_name_description (session ref, Repository ref)

string get_name_label (session ref, Repository ref)

Repository record get_record (session ref, Repository ref)

string get_source_url (session ref, Repository ref)

Removed

bool get_up_to_date (session ref, Repository ref)

bool get_update (session ref, Repository ref)

string get_uuid (session ref, Repository ref)

Repository ref introduce (session ref, string, string, string, string, bool, string)

Prototype

void set_gpgkey_path (session ref, Repository ref, string)

void set_name_description (session ref, Repository ref, string)

void set_name_label (session ref, Repository ref, string)

role

Class: role

A set of permissions associated with a subject

Fields

bool is_internal [RO/runtime]

string name_description [RO/constructor]

string name_label [RO/constructor]

role ref set subroles [RO/constructor]

string uuid [RO/runtime]

Messages

role ref set get_all (session ref)

(role ref → role record) map get_all_records (session ref)

role ref set get_by_name_label (session ref, string)

role ref set get_by_permission (session ref, role ref)

role ref set get_by_permission_name_label (session ref, string)

role ref get_by_uuid (session ref, string)

bool get_is_internal (session ref, role ref)

string get_name_description (session ref, role ref)

string get_name_label (session ref, role ref)

role ref set get_permissions (session ref, role ref)

string set get_permissions_name_label (session ref, role ref)

role record get_record (session ref, role ref)

role ref set get_subroles (session ref, role ref)

string get_uuid (session ref, role ref)

SDN_controller

Class: SDN_controller

Describes the SDN controller that is to connect with the pool

Enums

sdn_controller_protocol

Fields

string address [RO/constructor]

int port [RO/constructor]

enum sdn_controller_protocol protocol [RO/constructor]

string uuid [RO/runtime]

Messages

void forget (session ref, SDN_controller ref)

string get_address (session ref, SDN_controller ref)

SDN_controller ref set get_all (session ref)

(SDN_controller ref → SDN_controller record) map get_all_records (session ref)

SDN_controller ref get_by_uuid (session ref, string)

int get_port (session ref, SDN_controller ref)

enum sdn_controller_protocol get_protocol (session ref, SDN_controller ref)

SDN_controller record get_record (session ref, SDN_controller ref)

string get_uuid (session ref, SDN_controller ref)

SDN_controller ref introduce (session ref, enum sdn_controller_protocol, string, int)

secret

Class: secret

A secret

Fields

(string → string) map other_config [RW]

string uuid [RO/runtime]

string value [RW]

Messages

void add_to_other_config (session ref, secret ref, string, string)

secret ref create (session ref, secret record)

void destroy (session ref, secret ref)

secret ref set get_all (session ref)

(secret ref → secret record) map get_all_records (session ref)

secret ref get_by_uuid (session ref, string)

(string → string) map get_other_config (session ref, secret ref)

secret record get_record (session ref, secret ref)

string get_uuid (session ref, secret ref)

string get_value (session ref, secret ref)

void remove_from_other_config (session ref, secret ref, string)

void set_other_config (session ref, secret ref, (string → string) map)

void set_value (session ref, secret ref, string)

session

Class: session

A session

Fields

string auth_user_name [RO/runtime]

string auth_user_sid [RO/runtime]

bool client_certificate [RO/runtime]

bool is_local_superuser [RO/runtime]

datetime last_active [RO/runtime]

string originator [RO/runtime]

(string → string) map other_config [RW]

session ref parent [RO/constructor]

bool pool [RO/runtime]

string set rbac_permissions [RO/constructor]

subject ref subject [RO/runtime]

task ref set tasks [RO/runtime]

host ref this_host [RO/runtime]

user ref this_user [RO/runtime]

string uuid [RO/runtime]

datetime validation_time [RO/runtime]

Messages

void add_to_other_config (session ref, session ref, string, string)

void change_password (session ref, string, string)

session ref create_from_db_file (session ref, string)

string set get_all_subject_identifiers (session ref)

string get_auth_user_name (session ref, session ref)

string get_auth_user_sid (session ref, session ref)

session ref get_by_uuid (session ref, string)

bool get_client_certificate (session ref, session ref)

bool get_is_local_superuser (session ref, session ref)

datetime get_last_active (session ref, session ref)

string get_originator (session ref, session ref)

(string → string) map get_other_config (session ref, session ref)

session ref get_parent (session ref, session ref)

bool get_pool (session ref, session ref)

string set get_rbac_permissions (session ref, session ref)

session record get_record (session ref, session ref)

subject ref get_subject (session ref, session ref)

task ref set get_tasks (session ref, session ref)

host ref get_this_host (session ref, session ref)

user ref get_this_user (session ref, session ref)

string get_uuid (session ref, session ref)

datetime get_validation_time (session ref, session ref)

void local_logout (session ref)

session ref login_with_password (string, string, string, string)

void logout (session ref)

void logout_subject_identifier (session ref, string)

void remove_from_other_config (session ref, session ref, string)

void set_other_config (session ref, session ref, (string → string) map)

session ref slave_local_login_with_password (string, string)
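
In the signatures above, login_with_password is one of the few messages that does not take a session ref; it returns one, and nearly every other message in this reference expects that reference as its first argument. The sketch below builds the corresponding request bodies without contacting a host. It assumes a JSON-RPC 2.0 transport and the parameter order user name, password, API version, originator; the helper names build_call, login_request and with_session, the default argument values and the "OpaqueRef:example" placeholder are all hypothetical.

```python
import itertools
import json

_ids = itertools.count(1)


def build_call(method, *params):
    """Return a JSON-RPC 2.0 request body for a 'class.message' call."""
    return {
        "jsonrpc": "2.0",
        "method": method,
        "params": list(params),
        "id": next(_ids),
    }


def login_request(user, password, version="1.0", originator="example-client"):
    # Assumed parameter order: user name, password, API version, originator.
    return build_call("session.login_with_password",
                      user, password, version, originator)


def with_session(session_ref, method, *args):
    # Every other message takes the session ref as its first parameter.
    return build_call(method, session_ref, *args)


if __name__ == "__main__":
    print(json.dumps(login_request("root", "secret")))
    # "OpaqueRef:example" is a placeholder, not a real reference.
    print(json.dumps(with_session("OpaqueRef:example", "session.logout")))
```

Note that for the session class itself the reference appears twice in getters such as get_uuid (session ref, session ref): once as the authenticating session and once as the object being queried.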

SM

Class: SM

A storage manager plugin

Fields

Deprecated

string set capabilities [RO/runtime]

(string → string) map configuration [RO/runtime]

string copyright [RO/runtime]

string driver_filename [RO/runtime]

(string → int) map features [RO/runtime]

string name_description [RO/runtime]

string name_label [RO/runtime]

(string → string) map other_config [RW]

string required_api_version [RO/runtime]

string set required_cluster_stack [RO/runtime]

string type [RO/runtime]

string uuid [RO/runtime]

string vendor [RO/runtime]

string version [RO/runtime]

Messages

void add_to_other_config (session ref, SM ref, string, string)

SM ref set get_all (session ref)

(SM ref → SM record) map get_all_records (session ref)

SM ref set get_by_name_label (session ref, string)

SM ref get_by_uuid (session ref, string)

Deprecated

string set get_capabilities (session ref, SM ref)

(string → string) map get_configuration (session ref, SM ref)

string get_copyright (session ref, SM ref)

string get_driver_filename (session ref, SM ref)

(string → int) map get_features (session ref, SM ref)

string get_name_description (session ref, SM ref)

string get_name_label (session ref, SM ref)

(string → string) map get_other_config (session ref, SM ref)

SM record get_record (session ref, SM ref)

string get_required_api_version (session ref, SM ref)

string set get_required_cluster_stack (session ref, SM ref)

string get_type (session ref, SM ref)

string get_uuid (session ref, SM ref)

string get_vendor (session ref, SM ref)

string get_version (session ref, SM ref)

void remove_from_other_config (session ref, SM ref, string)

void set_other_config (session ref, SM ref, (string → string) map)

SR

Class: SR

A storage repository

Enums

storage_operations

Fields

enum storage_operations set allowed_operations [RO/runtime]

(string → blob ref) map blobs [RO/runtime]

bool clustered [RO/runtime]

string content_type [RO/constructor]

(string → enum storage_operations) map current_operations [RO/runtime]

DR_task ref introduced_by [RO/runtime]

bool is_tools_sr [RO/runtime]

bool local_cache_enabled [RO/runtime]

string name_description [RO/constructor]

string name_label [RO/constructor]

(string → string) map other_config [RW]

PBD ref set PBDs [RO/runtime]

int physical_size [RO/constructor]

int physical_utilisation [RO/runtime]

bool shared [RO/runtime]

(string → string) map sm_config [RW]

string set tags [RW]

string type [RO/constructor]

string uuid [RO/runtime]

VDI ref set VDIs [RO/runtime]

int virtual_allocation [RO/runtime]

Messages

void add_tags (session ref, SR ref, string)

void add_to_other_config (session ref, SR ref, string, string)

void add_to_sm_config (session ref, SR ref, string, string)

void assert_can_host_ha_statefile (session ref, SR ref)

void assert_supports_database_replication (session ref, SR ref)

SR ref create (session ref, host ref, (string → string) map, int, string, string, string, string, bool, (string → string) map)

blob ref create_new_blob (session ref, SR ref, string, string, bool)

void destroy (session ref, SR ref)

void disable_database_replication (session ref, SR ref)

void enable_database_replication (session ref, SR ref)

void forget (session ref, SR ref)

void forget_data_source_archives (session ref, SR ref, string)

SR ref set get_all (session ref)

(SR ref → SR record) map get_all_records (session ref)

enum storage_operations set get_allowed_operations (session ref, SR ref)

(string → blob ref) map get_blobs (session ref, SR ref)

SR ref set get_by_name_label (session ref, string)

SR ref get_by_uuid (session ref, string)

bool get_clustered (session ref, SR ref)

string get_content_type (session ref, SR ref)

(string → enum storage_operations) map get_current_operations (session ref, SR ref)

data_source record set get_data_sources (session ref, SR ref)

DR_task ref get_introduced_by (session ref, SR ref)

bool get_is_tools_sr (session ref, SR ref)

bool get_local_cache_enabled (session ref, SR ref)

string get_name_description (session ref, SR ref)

string get_name_label (session ref, SR ref)

(string → string) map get_other_config (session ref, SR ref)

PBD ref set get_PBDs (session ref, SR ref)

int get_physical_size (session ref, SR ref)

int get_physical_utilisation (session ref, SR ref)

SR record get_record (session ref, SR ref)

bool get_shared (session ref, SR ref)

(string → string) map get_sm_config (session ref, SR ref)

string set get_supported_types (session ref)

string set get_tags (session ref, SR ref)

string get_type (session ref, SR ref)

string get_uuid (session ref, SR ref)

VDI ref set get_VDIs (session ref, SR ref)

int get_virtual_allocation (session ref, SR ref)

SR ref introduce (session ref, string, string, string, string, string, bool, (string → string) map)

Deprecated

string make (session ref, host ref, (string → string) map, int, string, string, string, string, (string → string) map)

string probe (session ref, host ref, (string → string) map, string, (string → string) map)

probe_result record set probe_ext (session ref, host ref, (string → string) map, string, (string → string) map)

float query_data_source (session ref, SR ref, string)

void record_data_source (session ref, SR ref, string)

void remove_from_other_config (session ref, SR ref, string)

void remove_from_sm_config (session ref, SR ref, string)

void remove_tags (session ref, SR ref, string)

void scan (session ref, SR ref)

void set_name_description (session ref, SR ref, string)

void set_name_label (session ref, SR ref, string)

void set_other_config (session ref, SR ref, (string → string) map)

void set_physical_size (session ref, SR ref, int)

void set_shared (session ref, SR ref, bool)

void set_sm_config (session ref, SR ref, (string → string) map)

void set_tags (session ref, SR ref, string set)

void update (session ref, SR ref)
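
The three integer fields physical_size, physical_utilisation and virtual_allocation are usually read together. As an illustration, the sketch below derives free space, the used fraction and an over-commit flag from them. It assumes all three values are byte counts and that virtual_allocation is the sum of the virtual sizes of the VDIs in the SR; the SRUsage class and its property names are hypothetical and not part of the API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SRUsage:
    """Values as returned by the three SR getters (assumed to be bytes)."""

    physical_size: int
    physical_utilisation: int
    virtual_allocation: int

    @property
    def free_bytes(self):
        # Clamp at zero in case utilisation is reported above the size.
        return max(self.physical_size - self.physical_utilisation, 0)

    @property
    def used_fraction(self):
        # Avoid dividing by zero for SRs that report no physical size.
        if self.physical_size <= 0:
            return 0.0
        return self.physical_utilisation / self.physical_size

    @property
    def overcommitted(self):
        # True when the allocated virtual space exceeds the physical size.
        return self.virtual_allocation > self.physical_size
```

A caller would fill the dataclass from get_physical_size, get_physical_utilisation and get_virtual_allocation, ideally after a scan so that the figures are current.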

sr_stat

Class: sr_stat

A set of high-level properties associated with an SR.

Enums

sr_health

Fields

bool clustered [RO/runtime]

int free_space [RO/runtime]

enum sr_health health [RO/runtime]

string name_description [RO/runtime]

string name_label [RO/runtime]

int total_space [RO/runtime]

string option uuid [RO/runtime]

Messages

subject

Class: subject

A user or group that can log in to xapi

Fields

(string → string) map other_config [RO/constructor]

role ref set roles [RO/runtime]

string subject_identifier [RO/constructor]

string uuid [RO/runtime]

Messages

void add_to_roles (session ref, subject ref, role ref)

subject ref create (session ref, subject record)

void destroy (session ref, subject ref)

subject ref set get_all (session ref)

(subject ref → subject record) map get_all_records (session ref)

subject ref get_by_uuid (session ref, string)

(string → string) map get_other_config (session ref, subject ref)

string set get_permissions_name_label (session ref, subject ref)

subject record get_record (session ref, subject ref)

role ref set get_roles (session ref, subject ref)

string get_subject_identifier (session ref, subject ref)

string get_uuid (session ref, subject ref)

void remove_from_roles (session ref, subject ref, role ref)

task

Class: task

A long-running asynchronous task

Enums

task_allowed_operations

task_status_type

Fields

enum task_allowed_operations set allowed_operations [RO/runtime]

string backtrace [RO/runtime]

datetime created [RO/runtime]

(string → enum task_allowed_operations) map current_operations [RO/runtime]

string set error_info [RO/runtime]

datetime finished [RO/runtime]

string name_description [RO/runtime]

string name_label [RO/runtime]

(string → string) map other_config [RW]

float progress [RO/runtime]

host ref resident_on [RO/runtime]

string result [RO/runtime]

enum task_status_type status [RO/runtime]

task ref subtask_of [RO/runtime]

task ref set subtasks [RO/runtime]

string type [RO/runtime]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, task ref, string, string)

void cancel (session ref, task ref)

task ref create (session ref, string, string)

void destroy (session ref, task ref)

task ref set get_all (session ref)

(task ref → task record) map get_all_records (session ref)

enum task_allowed_operations set get_allowed_operations (session ref, task ref)

string get_backtrace (session ref, task ref)

task ref set get_by_name_label (session ref, string)

task ref get_by_uuid (session ref, string)

datetime get_created (session ref, task ref)

(string → enum task_allowed_operations) map get_current_operations (session ref, task ref)

string set get_error_info (session ref, task ref)

datetime get_finished (session ref, task ref)

string get_name_description (session ref, task ref)

string get_name_label (session ref, task ref)

(string → string) map get_other_config (session ref, task ref)

float get_progress (session ref, task ref)

task record get_record (session ref, task ref)

host ref get_resident_on (session ref, task ref)

string get_result (session ref, task ref)

enum task_status_type get_status (session ref, task ref)

task ref get_subtask_of (session ref, task ref)

task ref set get_subtasks (session ref, task ref)

string get_type (session ref, task ref)

string get_uuid (session ref, task ref)

void remove_from_other_config (session ref, task ref, string)

void set_error_info (session ref, task ref, string set)

void set_other_config (session ref, task ref, (string → string) map)

void set_progress (session ref, task ref, float)

void set_result (session ref, task ref, string)

void set_status (session ref, task ref, enum task_status_type)
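
Because a task is long-running and asynchronous, clients typically poll get_status (and optionally get_progress) until the task finishes. The loop below is a minimal sketch of that pattern. It is transport-agnostic: get_status and get_progress are caller-supplied callables that would wrap the corresponding API calls. The function name wait_for_task is hypothetical, the status strings are the assumed values of the task_status_type enum, and progress is assumed to be a fraction between 0 and 1.

```python
import time

# Assumed terminal values of the task_status_type enum.
TERMINAL = {"success", "failure", "cancelled"}


def wait_for_task(get_status, get_progress=None, interval=0.5,
                  timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Poll until the task reaches a terminal status, then return it."""
    deadline = clock() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if get_progress is not None:
            # Assumption: progress is a fraction in the range 0..1.
            print(f"progress: {get_progress():.0%}")
        if clock() >= deadline:
            raise TimeoutError(f"task still {status!r} after {timeout}s")
        sleep(interval)
```

On a "failure" result the caller would then read get_error_info, and on "success" get_result. Injecting sleep and clock keeps the loop easy to test without real delays.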

tunnel

Class: tunnel

A tunnel for network traffic

Enums

tunnel_protocol

Fields

PIF ref access_PIF [RO/constructor]

(string → string) map other_config [RW]

enum tunnel_protocol protocol [RW]

(string → string) map status [RW]

PIF ref transport_PIF [RO/constructor]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, tunnel ref, string, string)

void add_to_status (session ref, tunnel ref, string, string)

tunnel ref create (session ref, PIF ref, network ref, enum tunnel_protocol)

void destroy (session ref, tunnel ref)

PIF ref get_access_PIF (session ref, tunnel ref)

tunnel ref set get_all (session ref)

(tunnel ref → tunnel record) map get_all_records (session ref)

tunnel ref get_by_uuid (session ref, string)

(string → string) map get_other_config (session ref, tunnel ref)

enum tunnel_protocol get_protocol (session ref, tunnel ref)

tunnel record get_record (session ref, tunnel ref)

(string → string) map get_status (session ref, tunnel ref)

PIF ref get_transport_PIF (session ref, tunnel ref)

string get_uuid (session ref, tunnel ref)

void remove_from_other_config (session ref, tunnel ref, string)

void remove_from_status (session ref, tunnel ref, string)

void set_other_config (session ref, tunnel ref, (string → string) map)

void set_protocol (session ref, tunnel ref, enum tunnel_protocol)

void set_status (session ref, tunnel ref, (string → string) map)

USB_group

Class: USB_group

A group of compatible USBs across the resource pool

Fields

string name_description [RW]

string name_label [RW]

(string → string) map other_config [RW]

PUSB ref set PUSBs [RO/runtime]

string uuid [RO/runtime]

VUSB ref set VUSBs [RO/runtime]

Messages

void add_to_other_config (session ref, USB_group ref, string, string)

USB_group ref create (session ref, string, string, (string → string) map)

void destroy (session ref, USB_group ref)

USB_group ref set get_all (session ref)

(USB_group ref → USB_group record) map get_all_records (session ref)

USB_group ref set get_by_name_label (session ref, string)

USB_group ref get_by_uuid (session ref, string)

string get_name_description (session ref, USB_group ref)

string get_name_label (session ref, USB_group ref)

(string → string) map get_other_config (session ref, USB_group ref)

PUSB ref set get_PUSBs (session ref, USB_group ref)

USB_group record get_record (session ref, USB_group ref)

string get_uuid (session ref, USB_group ref)

VUSB ref set get_VUSBs (session ref, USB_group ref)

void remove_from_other_config (session ref, USB_group ref, string)

void set_name_description (session ref, USB_group ref, string)

void set_name_label (session ref, USB_group ref, string)

void set_other_config (session ref, USB_group ref, (string → string) map)

user

Deprecated

Class: user

A user of the system

Fields

string fullname [RW]

(string → string) map other_config [RW]

string short_name [RO/constructor]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, user ref, string, string)

Deprecated

user ref create (session ref, user record)

Deprecated

void destroy (session ref, user ref)

Deprecated

user ref get_by_uuid (session ref, string)

string get_fullname (session ref, user ref)

(string → string) map get_other_config (session ref, user ref)

Deprecated

user record get_record (session ref, user ref)

string get_short_name (session ref, user ref)

string get_uuid (session ref, user ref)

void remove_from_other_config (session ref, user ref, string)

void set_fullname (session ref, user ref, string)

void set_other_config (session ref, user ref, (string → string) map)

VBD

Class: VBD

A virtual block device

Enums

vbd_operations

vbd_type

vbd_mode

Fields

enum vbd_operations set allowed_operations [RO/runtime]

bool bootable [RW]

(string → enum vbd_operations) map current_operations [RO/runtime]

bool currently_attached [RO/constructor]

string device [RO/constructor]

bool empty [RO/constructor]

Removed

VBD_metrics ref metrics [RO/runtime]

enum vbd_mode mode [RO/constructor]

(string → string) map other_config [RW]

(string → string) map qos_algorithm_params [RW]

string qos_algorithm_type [RW]

string set qos_supported_algorithms [RO/runtime]

(string → string) map runtime_properties [RO/runtime]

int status_code [RO/runtime]

string status_detail [RO/runtime]

bool storage_lock [RO/runtime]

enum vbd_type type [RW]

bool unpluggable [RW]

string userdevice [RW]

string uuid [RO/runtime]

VDI ref VDI [RO/constructor]

VM ref VM [RO/constructor]

Messages

void add_to_other_config (session ref, VBD ref, string, string)

void add_to_qos_algorithm_params (session ref, VBD ref, string, string)

void assert_attachable (session ref, VBD ref)

VBD ref create (session ref, VBD record)

void destroy (session ref, VBD ref)

void eject (session ref, VBD ref)

VBD ref set get_all (session ref)

(VBD ref → VBD record) map get_all_records (session ref)

enum vbd_operations set get_allowed_operations (session ref, VBD ref)

bool get_bootable (session ref, VBD ref)

VBD ref get_by_uuid (session ref, string)

(string → enum vbd_operations) map get_current_operations (session ref, VBD ref)

bool get_currently_attached (session ref, VBD ref)

string get_device (session ref, VBD ref)

bool get_empty (session ref, VBD ref)

Removed

VBD_metrics ref get_metrics (session ref, VBD ref)

enum vbd_mode get_mode (session ref, VBD ref)

(string → string) map get_other_config (session ref, VBD ref)

(string → string) map get_qos_algorithm_params (session ref, VBD ref)

string get_qos_algorithm_type (session ref, VBD ref)

string set get_qos_supported_algorithms (session ref, VBD ref)

VBD record get_record (session ref, VBD ref)

(string → string) map get_runtime_properties (session ref, VBD ref)

int get_status_code (session ref, VBD ref)

string get_status_detail (session ref, VBD ref)

bool get_storage_lock (session ref, VBD ref)

enum vbd_type get_type (session ref, VBD ref)

bool get_unpluggable (session ref, VBD ref)

string get_userdevice (session ref, VBD ref)

string get_uuid (session ref, VBD ref)

VDI ref get_VDI (session ref, VBD ref)

VM ref get_VM (session ref, VBD ref)

void insert (session ref, VBD ref, VDI ref)

void plug (session ref, VBD ref)

void remove_from_other_config (session ref, VBD ref, string)

void remove_from_qos_algorithm_params (session ref, VBD ref, string)

void set_bootable (session ref, VBD ref, bool)

void set_mode (session ref, VBD ref, enum vbd_mode)

void set_other_config (session ref, VBD ref, (string → string) map)

void set_qos_algorithm_params (session ref, VBD ref, (string → string) map)

void set_qos_algorithm_type (session ref, VBD ref, string)

void set_type (session ref, VBD ref, enum vbd_type)

void set_unpluggable (session ref, VBD ref, bool)

void set_userdevice (session ref, VBD ref, string)

void unplug (session ref, VBD ref)

void unplug_force (session ref, VBD ref)

VBD_metrics

Removed

Class: VBD_metrics

The metrics associated with a virtual block device

Fields

Removed

float io_read_kbs [RO/runtime]

Removed

float io_write_kbs [RO/runtime]

Removed

datetime last_updated [RO/runtime]

Removed

(string → string) map other_config [RW]

string uuid [RO/runtime]

Messages

Removed

void add_to_other_config (session ref, VBD_metrics ref, string, string)

Removed

VBD_metrics ref set get_all (session ref)

Removed

(VBD_metrics ref → VBD_metrics record) map get_all_records (session ref)

Removed

VBD_metrics ref get_by_uuid (session ref, string)

Removed

float get_io_read_kbs (session ref, VBD_metrics ref)

Removed

float get_io_write_kbs (session ref, VBD_metrics ref)

Removed

datetime get_last_updated (session ref, VBD_metrics ref)

Removed

(string → string) map get_other_config (session ref, VBD_metrics ref)

Removed

VBD_metrics record get_record (session ref, VBD_metrics ref)

string get_uuid (session ref, VBD_metrics ref)

Removed

void remove_from_other_config (session ref, VBD_metrics ref, string)

Removed

void set_other_config (session ref, VBD_metrics ref, (string → string) map)

VDI

Class: VDI

A virtual disk image

Enums

vdi_operations

vdi_type

on_boot

Fields

bool allow_caching [RO/runtime]

enum vdi_operations set allowed_operations [RO/runtime]

bool cbt_enabled [RO/runtime]

crashdump ref set crash_dumps [RO/runtime]

(string → enum vdi_operations) map current_operations [RO/runtime]

bool is_a_snapshot [RO/runtime]

bool is_tools_iso [RO/runtime]

string location [RO/runtime]

bool managed [RO/runtime]

bool metadata_latest [RO/runtime]

pool ref metadata_of_pool [RO/runtime]

bool missing [RO/runtime]

string name_description [RO/constructor]

string name_label [RO/constructor]

enum on_boot on_boot [RO/runtime]

(string → string) map other_config [RW]

Deprecated

VDI ref parent [RO/runtime]

int physical_utilisation [RO/runtime]

bool read_only [RO/constructor]

bool sharable [RO/constructor]

(string → string) map sm_config [RW]

VDI ref snapshot_of [RO/runtime]

datetime snapshot_time [RO/runtime]

VDI ref set snapshots [RO/runtime]

SR ref SR [RO/constructor]

bool storage_lock [RO/runtime]

string set tags [RW]

enum vdi_type type [RO/constructor]

string uuid [RO/runtime]

VBD ref set VBDs [RO/runtime]

int virtual_size [RO/constructor]

(string → string) map xenstore_data [RW]

Messages

void add_tags (session ref, VDI ref, string)

void add_to_other_config (session ref, VDI ref, string, string)

void add_to_sm_config (session ref, VDI ref, string, string)

void add_to_xenstore_data (session ref, VDI ref, string, string)

VDI ref clone (session ref, VDI ref, (string → string) map)

VDI ref copy (session ref, VDI ref, SR ref, VDI ref, VDI ref)

VDI ref create (session ref, VDI record)

void data_destroy (session ref, VDI ref)

void destroy (session ref, VDI ref)

void disable_cbt (session ref, VDI ref)

void enable_cbt (session ref, VDI ref)

void forget (session ref, VDI ref)

VDI ref set get_all (session ref)

(VDI ref → VDI record) map get_all_records (session ref)

bool get_allow_caching (session ref, VDI ref)

enum vdi_operations set get_allowed_operations (session ref, VDI ref)

VDI ref set get_by_name_label (session ref, string)

VDI ref get_by_uuid (session ref, string)

bool get_cbt_enabled (session ref, VDI ref)

crashdump ref set get_crash_dumps (session ref, VDI ref)

(string → enum vdi_operations) map get_current_operations (session ref, VDI ref)

bool get_is_a_snapshot (session ref, VDI ref)

bool get_is_tools_iso (session ref, VDI ref)

string get_location (session ref, VDI ref)

bool get_managed (session ref, VDI ref)

bool get_metadata_latest (session ref, VDI ref)

pool ref get_metadata_of_pool (session ref, VDI ref)

bool get_missing (session ref, VDI ref)

string get_name_description (session ref, VDI ref)

string get_name_label (session ref, VDI ref)

vdi_nbd_server_info record set get_nbd_info (session ref, VDI ref)

enum on_boot get_on_boot (session ref, VDI ref)

(string → string) map get_other_config (session ref, VDI ref)

Deprecated

VDI ref get_parent (session ref, VDI ref)

int get_physical_utilisation (session ref, VDI ref)

bool get_read_only (session ref, VDI ref)

VDI record get_record (session ref, VDI ref)

bool get_sharable (session ref, VDI ref)

(string → string) map get_sm_config (session ref, VDI ref)

VDI ref get_snapshot_of (session ref, VDI ref)

datetime get_snapshot_time (session ref, VDI ref)

VDI ref set get_snapshots (session ref, VDI ref)

SR ref get_SR (session ref, VDI ref)

bool get_storage_lock (session ref, VDI ref)

string set get_tags (session ref, VDI ref)

enum vdi_type get_type (session ref, VDI ref)

string get_uuid (session ref, VDI ref)

VBD ref set get_VBDs (session ref, VDI ref)

int get_virtual_size (session ref, VDI ref)

(string → string) map get_xenstore_data (session ref, VDI ref)

VDI ref introduce (session ref, string, string, string, SR ref, enum vdi_type, bool, bool, (string → string) map, string, (string → string) map, (string → string) map, bool, int, int, pool ref, bool, datetime, VDI ref)

string list_changed_blocks (session ref, VDI ref, VDI ref)

session ref open_database (session ref, VDI ref)

VDI ref pool_migrate (session ref, VDI ref, SR ref, (string → string) map)

string read_database_pool_uuid (session ref, VDI ref)

void remove_from_other_config (session ref, VDI ref, string)

void remove_from_sm_config (session ref, VDI ref, string)

void remove_from_xenstore_data (session ref, VDI ref, string)

void remove_tags (session ref, VDI ref, string)

void resize (session ref, VDI ref, int)

Removed

void resize_online (session ref, VDI ref, int)

void set_allow_caching (session ref, VDI ref, bool)

void set_name_description (session ref, VDI ref, string)

void set_name_label (session ref, VDI ref, string)

void set_on_boot (session ref, VDI ref, enum on_boot)

void set_other_config (session ref, VDI ref, (string → string) map)

void set_read_only (session ref, VDI ref, bool)

void set_sharable (session ref, VDI ref, bool)

void set_sm_config (session ref, VDI ref, (string → string) map)

void set_tags (session ref, VDI ref, string set)

void set_xenstore_data (session ref, VDI ref, (string → string) map)

VDI ref snapshot (session ref, VDI ref, (string → string) map)

void update (session ref, VDI ref)
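
The string returned by list_changed_blocks needs decoding before a backup client can use it. The sketch below assumes the string is a base64-encoded bitmap with one bit per 64 KiB block, read most-significant bit first; the listing above does not state the block size or bit order, so check both against the storage documentation for your release. The function name changed_extents is hypothetical.

```python
import base64

BLOCK_SIZE = 64 * 1024  # assumed bitmap granularity in bytes (not stated in the listing)


def changed_extents(encoded_bitmap, block_size=BLOCK_SIZE):
    """Return (offset, length) byte extents for each run of set bits."""
    raw = base64.b64decode(encoded_bitmap)
    total_bits = len(raw) * 8
    extents = []
    run_start = None
    for bit in range(total_bits):
        # Assumption: bit 0 is the most significant bit of the first byte.
        is_set = (raw[bit // 8] >> (7 - bit % 8)) & 1
        if is_set and run_start is None:
            run_start = bit
        elif not is_set and run_start is not None:
            extents.append((run_start * block_size, (bit - run_start) * block_size))
            run_start = None
    if run_start is not None:
        # The bitmap ended while a run was still open.
        extents.append((run_start * block_size, (total_bits - run_start) * block_size))
    return extents
```

Merging adjacent changed blocks into extents keeps the number of read requests small when the changed data is later fetched, for example over NBD.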

vdi_nbd_server_info

Class: vdi_nbd_server_info

Details for connecting to a VDI using the Network Block Device protocol

Fields

string address [RO/runtime]

string cert [RO/runtime]

string exportname [RO/runtime]

int port [RO/runtime]

string subject [RO/runtime]

Messages
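
A client usually turns a vdi_nbd_server_info record into something an NBD client can connect to. The sketch below mirrors the five fields above in a small Python class and builds an NBD URI from them. It assumes that the record arrives as a dictionary whose int fields are encoded as strings, and that the server expects TLS (which the cert and subject fields suggest), hence the nbds scheme. The names NbdServerInfo, from_record and nbd_uri are hypothetical.

```python
from dataclasses import dataclass
from urllib.parse import quote


@dataclass(frozen=True)
class NbdServerInfo:
    """Mirror of the vdi_nbd_server_info fields listed above."""
    address: str
    cert: str
    exportname: str
    port: int
    subject: str

    @classmethod
    def from_record(cls, record):
        # Assumption: int fields are transported as strings, so convert the port.
        return cls(
            address=record["address"],
            cert=record["cert"],
            exportname=record["exportname"],
            port=int(record["port"]),
            subject=record["subject"],
        )


def nbd_uri(info):
    """Build an nbds:// URI; the export name is fully percent-encoded."""
    # IPv6 literals must be bracketed in the authority part of a URI.
    host = f"[{info.address}]" if ":" in info.address else info.address
    return f"nbds://{host}:{info.port}/{quote(info.exportname, safe='')}"
```

The cert and subject fields are not part of the URI; a client would hand them to its TLS layer to verify the server it connects to.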

VGPU

Class: VGPU

A virtual GPU (vGPU)

Fields

(string → string) map compatibility_metadata [RO/runtime]

bool currently_attached [RO/runtime]

string device [RO/runtime]

string extra_args [RW]

GPU_group ref GPU_group [RO/runtime]

(string → string) map other_config [RW]

PCI ref PCI [RO/runtime]

PGPU ref resident_on [RO/runtime]

PGPU ref scheduled_to_be_resident_on [RO/runtime]

VGPU_type ref type [RO/runtime]

string uuid [RO/runtime]

VM ref VM [RO/runtime]

Messages

void add_to_other_config (session ref, VGPU ref, string, string)

VGPU ref create (session ref, VM ref, GPU_group ref, string, (string → string) map, VGPU_type ref)

void destroy (session ref, VGPU ref)

VGPU ref set get_all (session ref)

(VGPU ref → VGPU record) map get_all_records (session ref)

VGPU ref get_by_uuid (session ref, string)

(string → string) map get_compatibility_metadata (session ref, VGPU ref)

bool get_currently_attached (session ref, VGPU ref)

string get_device (session ref, VGPU ref)

string get_extra_args (session ref, VGPU ref)

GPU_group ref get_GPU_group (session ref, VGPU ref)

(string → string) map get_other_config (session ref, VGPU ref)

PCI ref get_PCI (session ref, VGPU ref)

VGPU record get_record (session ref, VGPU ref)

PGPU ref get_resident_on (session ref, VGPU ref)

PGPU ref get_scheduled_to_be_resident_on (session ref, VGPU ref)

VGPU_type ref get_type (session ref, VGPU ref)

string get_uuid (session ref, VGPU ref)

VM ref get_VM (session ref, VGPU ref)

void remove_from_other_config (session ref, VGPU ref, string)

void set_extra_args (session ref, VGPU ref, string)

void set_other_config (session ref, VGPU ref, (string → string) map)

VGPU_type

Class: VGPU_type

A type of virtual GPU

Enums

vgpu_type_implementation

Fields

VGPU_type ref set compatible_types_in_vm [RO/runtime]

GPU_group ref set enabled_on_GPU_groups [RO/runtime]

PGPU ref set enabled_on_PGPUs [RO/runtime]

bool experimental [RO/constructor]

int framebuffer_size [RO/constructor]

string identifier [RO/constructor]

enum vgpu_type_implementation implementation [RO/constructor]

int max_heads [RO/constructor]

int max_resolution_x [RO/constructor]

int max_resolution_y [RO/constructor]

string model_name [RO/constructor]

GPU_group ref set supported_on_GPU_groups [RO/runtime]

PGPU ref set supported_on_PGPUs [RO/runtime]

string uuid [RO/runtime]

string vendor_name [RO/constructor]

VGPU ref set VGPUs [RO/runtime]

Messages

VGPU_type ref set get_all (session ref)

(VGPU_type ref → VGPU_type record) map get_all_records (session ref)

VGPU_type ref get_by_uuid (session ref, string)

VGPU_type ref set get_compatible_types_in_vm (session ref, VGPU_type ref)

GPU_group ref set get_enabled_on_GPU_groups (session ref, VGPU_type ref)

PGPU ref set get_enabled_on_PGPUs (session ref, VGPU_type ref)

bool get_experimental (session ref, VGPU_type ref)

int get_framebuffer_size (session ref, VGPU_type ref)

string get_identifier (session ref, VGPU_type ref)

enum vgpu_type_implementation get_implementation (session ref, VGPU_type ref)

int get_max_heads (session ref, VGPU_type ref)

int get_max_resolution_x (session ref, VGPU_type ref)

int get_max_resolution_y (session ref, VGPU_type ref)

string get_model_name (session ref, VGPU_type ref)

VGPU_type record get_record (session ref, VGPU_type ref)

GPU_group ref set get_supported_on_GPU_groups (session ref, VGPU_type ref)

PGPU ref set get_supported_on_PGPUs (session ref, VGPU_type ref)

string get_uuid (session ref, VGPU_type ref)

string get_vendor_name (session ref, VGPU_type ref)

VGPU ref set get_VGPUs (session ref, VGPU_type ref)

VIF

Class: VIF

A virtual network interface

Enums

vif_operations

vif_locking_mode

vif_ipv4_configuration_mode

vif_ipv6_configuration_mode

Fields

enum vif_operations set allowed_operations [RO/runtime]

(string → enum vif_operations) map current_operations [RO/runtime]

bool currently_attached [RO/constructor]

string device [RO/constructor]

string set ipv4_addresses [RO/runtime]

string set ipv4_allowed [RO/constructor]

enum vif_ipv4_configuration_mode ipv4_configuration_mode [RO/runtime]

string ipv4_gateway [RO/runtime]

string set ipv6_addresses [RO/runtime]

string set ipv6_allowed [RO/constructor]

enum vif_ipv6_configuration_mode ipv6_configuration_mode [RO/runtime]

string ipv6_gateway [RO/runtime]

enum vif_locking_mode locking_mode [RO/constructor]

string MAC [RO/constructor]

bool MAC_autogenerated [RO/runtime]

Removed

VIF_metrics ref metrics [RO/runtime]

int MTU [RO/constructor]

network ref network [RO/constructor]

(string → string) map other_config [RW]

(string → string) map qos_algorithm_params [RW]

string qos_algorithm_type [RW]

string set qos_supported_algorithms [RO/runtime]

(string → string) map runtime_properties [RO/runtime]

int status_code [RO/runtime]

string status_detail [RO/runtime]

string uuid [RO/runtime]

VM ref VM [RO/constructor]

Messages

void add_ipv4_allowed (session ref, VIF ref, string)

void add_ipv6_allowed (session ref, VIF ref, string)

void add_to_other_config (session ref, VIF ref, string, string)

void add_to_qos_algorithm_params (session ref, VIF ref, string, string)

void configure_ipv4 (session ref, VIF ref, enum vif_ipv4_configuration_mode, string, string)

void configure_ipv6 (session ref, VIF ref, enum vif_ipv6_configuration_mode, string, string)

VIF ref create (session ref, VIF record)

void destroy (session ref, VIF ref)

VIF ref set get_all (session ref)

(VIF ref → VIF record) map get_all_records (session ref)

enum vif_operations set get_allowed_operations (session ref, VIF ref)

VIF ref get_by_uuid (session ref, string)

(string → enum vif_operations) map get_current_operations (session ref, VIF ref)

bool get_currently_attached (session ref, VIF ref)

string get_device (session ref, VIF ref)

string set get_ipv4_addresses (session ref, VIF ref)

string set get_ipv4_allowed (session ref, VIF ref)

enum vif_ipv4_configuration_mode get_ipv4_configuration_mode (session ref, VIF ref)

string get_ipv4_gateway (session ref, VIF ref)

string set get_ipv6_addresses (session ref, VIF ref)

string set get_ipv6_allowed (session ref, VIF ref)

enum vif_ipv6_configuration_mode get_ipv6_configuration_mode (session ref, VIF ref)

string get_ipv6_gateway (session ref, VIF ref)

enum vif_locking_mode get_locking_mode (session ref, VIF ref)

string get_MAC (session ref, VIF ref)

bool get_MAC_autogenerated (session ref, VIF ref)

Removed

VIF_metrics ref get_metrics (session ref, VIF ref)

int get_MTU (session ref, VIF ref)

network ref get_network (session ref, VIF ref)

(string → string) map get_other_config (session ref, VIF ref)

(string → string) map get_qos_algorithm_params (session ref, VIF ref)

string get_qos_algorithm_type (session ref, VIF ref)

string set get_qos_supported_algorithms (session ref, VIF ref)

VIF record get_record (session ref, VIF ref)

(string → string) map get_runtime_properties (session ref, VIF ref)

int get_status_code (session ref, VIF ref)

string get_status_detail (session ref, VIF ref)

string get_uuid (session ref, VIF ref)

VM ref get_VM (session ref, VIF ref)

void move (session ref, VIF ref, network ref)

void plug (session ref, VIF ref)

void remove_from_other_config (session ref, VIF ref, string)

void remove_from_qos_algorithm_params (session ref, VIF ref, string)

void remove_ipv4_allowed (session ref, VIF ref, string)

void remove_ipv6_allowed (session ref, VIF ref, string)

void set_ipv4_allowed (session ref, VIF ref, string set)

void set_ipv6_allowed (session ref, VIF ref, string set)

void set_locking_mode (session ref, VIF ref, enum vif_locking_mode)

void set_other_config (session ref, VIF ref, (string → string) map)

void set_qos_algorithm_params (session ref, VIF ref, (string → string) map)

void set_qos_algorithm_type (session ref, VIF ref, string)

void unplug (session ref, VIF ref)

void unplug_force (session ref, VIF ref)
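
The configure_ipv4 message above takes a configuration mode followed by two strings. The sketch below checks those arguments on the client before the call is made. It assumes that the two strings are the address in address/prefix form and the gateway, and that the mode values are "None" and "Static"; neither detail appears in the listing above, so treat both as assumptions. The helper name ipv4_config_args is hypothetical.

```python
import ipaddress


def ipv4_config_args(mode, address="", gateway=""):
    """Return a (mode, address, gateway) triple suitable for VIF.configure_ipv4."""
    if mode == "None":
        # No static configuration: address and gateway are left empty.
        return ("None", "", "")
    if mode != "Static":
        raise ValueError(f"unknown IPv4 configuration mode: {mode!r}")
    if "/" not in address:
        raise ValueError("address must include a prefix length, e.g. 192.168.1.10/24")
    iface = ipaddress.IPv4Interface(address)  # raises ValueError on malformed input
    if gateway:
        ipaddress.IPv4Address(gateway)  # syntax check only
    return ("Static", str(iface), gateway)
```

With a Python binding that exposes messages as methods, the triple could then be passed along the lines of session.xenapi.VIF.configure_ipv4(vif, *args); the server remains the authority on whether the values are accepted.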

VIF_metrics

Removed

Class: VIF_metrics

The metrics associated with a virtual network device

Fields

Removed

float io_read_kbs [RO/runtime]

Removed

float io_write_kbs [RO/runtime]

datetime last_updated [RO/runtime]

(string → string) map other_config [RW]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, VIF_metrics ref, string, string)

Removed

VIF_metrics ref set get_all (session ref)

Removed

(VIF_metrics ref → VIF_metrics record) map get_all_records (session ref)

Removed

VIF_metrics ref get_by_uuid (session ref, string)

Removed

float get_io_read_kbs (session ref, VIF_metrics ref)

Removed

float get_io_write_kbs (session ref, VIF_metrics ref)

datetime get_last_updated (session ref, VIF_metrics ref)

(string → string) map get_other_config (session ref, VIF_metrics ref)

Removed

VIF_metrics record get_record (session ref, VIF_metrics ref)

string get_uuid (session ref, VIF_metrics ref)

void remove_from_other_config (session ref, VIF_metrics ref, string)

void set_other_config (session ref, VIF_metrics ref, (string → string) map)

VLAN

Class: VLAN

A VLAN mux/demux

Fields

(string → string) map other_config [RW]

int tag [RO/constructor]

PIF ref tagged_PIF [RO/constructor]

PIF ref untagged_PIF [RO/runtime]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, VLAN ref, string, string)

VLAN ref create (session ref, PIF ref, int, network ref)

void destroy (session ref, VLAN ref)

VLAN ref set get_all (session ref)

(VLAN ref → VLAN record) map get_all_records (session ref)

VLAN ref get_by_uuid (session ref, string)

(string → string) map get_other_config (session ref, VLAN ref)

VLAN record get_record (session ref, VLAN ref)

int get_tag (session ref, VLAN ref)

PIF ref get_tagged_PIF (session ref, VLAN ref)

PIF ref get_untagged_PIF (session ref, VLAN ref)

string get_uuid (session ref, VLAN ref)

void remove_from_other_config (session ref, VLAN ref, string)

void set_other_config (session ref, VLAN ref, (string → string) map)

VM

Class: VM

A virtual machine (or 'guest').

Enums

vm_power_state

update_guidances

on_softreboot_behavior

on_normal_exit

vm_operations

Values:	snapshot	refers to the operation "snapshot"
	clone	refers to the operation "clone"
	copy	refers to the operation "copy"
	create_template	refers to the operation "create_template"
	revert	refers to the operation "revert"
	checkpoint	refers to the operation "checkpoint"
	snapshot_with_quiesce	refers to the operation "snapshot_with_quiesce"
	provision	refers to the operation "provision"
	start	refers to the operation "start"
	start_on	refers to the operation "start_on"
	pause	refers to the operation "pause"
	unpause	refers to the operation "unpause"
	clean_shutdown	refers to the operation "clean_shutdown"
	clean_reboot	refers to the operation "clean_reboot"
	hard_shutdown	refers to the operation "hard_shutdown"
	power_state_reset	refers to the operation "power_state_reset"
	hard_reboot	refers to the operation "hard_reboot"
	suspend	refers to the operation "suspend"
	csvm	refers to the operation "csvm"
	resume	refers to the operation "resume"
	resume_on	refers to the operation "resume_on"
	pool_migrate	refers to the operation "pool_migrate"
	migrate_send	refers to the operation "migrate_send"
	get_boot_record	refers to the operation "get_boot_record"
	send_sysrq	refers to the operation "send_sysrq"
	send_trigger	refers to the operation "send_trigger"
	query_services	refers to the operation "query_services"
	shutdown	refers to the operation "shutdown"
	call_plugin	refers to the operation "call_plugin"
	changing_memory_live	Changing the memory settings
	awaiting_memory_live	Waiting for the memory settings to change
	changing_dynamic_range	Changing the memory dynamic range
	changing_static_range	Changing the memory static range
	changing_memory_limits	Changing the memory limits
	changing_shadow_memory	Changing the shadow memory for a halted VM.
	changing_shadow_memory_live	Changing the shadow memory for a running VM.
	changing_VCPUs	Changing VCPU settings for a halted VM.
	changing_VCPUs_live	Changing VCPU settings for a running VM.
	changing_NVRAM	Changing NVRAM for a halted VM.
	assert_operation_valid
	data_source_op	Add, remove, query or list data sources
	update_allowed_operations
	make_into_template	Turning this VM into a template
	import	Importing a VM from a network stream
	export	Exporting a VM to a network stream
	metadata_export	Exporting VM metadata to a network stream
	reverting	Reverting the VM to a previous snapshotted state
	destroy	refers to the act of uninstalling the VM
	create_vtpm	Creating and adding a VTPM to this VM

on_crash_behaviour

domain_type
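
The vm_operations values listed above appear in the allowed_operations, blocked_operations and current_operations fields of a VM. The sketch below shows how a client might consult a VM record before attempting an operation. It assumes the record is a dictionary as returned by a get_record call, with allowed_operations as a list of value names and blocked_operations as a map from value name to a reason string. The helper name can_attempt is hypothetical.

```python
def can_attempt(vm_record, operation):
    """Return (ok, reason) for a vm_operations value against a VM record."""
    blocked = vm_record.get("blocked_operations", {})
    if operation in blocked:
        # blocked_operations maps the operation to an administrator-supplied reason.
        return False, blocked[operation]
    if operation not in vm_record.get("allowed_operations", []):
        return False, "not in allowed_operations"
    return True, ""
```

A check like this is only advisory: the record is a snapshot, and the server validates the operation again when the call is actually made.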

Fields

enum on_crash_behaviour actions_after_crash [RO/constructor]

enum on_normal_exit actions_after_reboot [RW]

enum on_normal_exit actions_after_shutdown [RW]

Prototype

enum on_softreboot_behavior actions_after_softreboot [RW]

host ref affinity [RW]

enum vm_operations set allowed_operations [RO/runtime]

VM_appliance ref appliance [RO/constructor]

PCI ref set attached_PCIs [RO/runtime]

(string → string) map bios_strings [RO/runtime]

(string → blob ref) map blobs [RO/runtime]

(enum vm_operations → string) map blocked_operations [RW]

VM ref set children [RO/runtime]

console ref set consoles [RO/runtime]

crashdump ref set crash_dumps [RO/runtime]

(string → enum vm_operations) map current_operations [RO/runtime]

enum domain_type domain_type [RO/constructor]

string domarch [RO/runtime]

int domid [RO/runtime]

string generation_id [RO/constructor]

VM_guest_metrics ref guest_metrics [RO/runtime]

Deprecated

bool ha_always_run [RO/constructor]

string ha_restart_priority [RO/constructor]

int hardware_platform_version [RW]

bool has_vendor_device [RO/constructor]

(string → string) map HVM_boot_params [RW]

Deprecated

string HVM_boot_policy [RO/constructor]

float HVM_shadow_multiplier [RO/constructor]

bool is_a_snapshot [RO/runtime]

bool is_a_template [RW]

bool is_control_domain [RO/runtime]

bool is_default_template [RO/runtime]

Removed

bool is_snapshot_from_vmpp [RO/constructor]

bool is_vmss_snapshot [RO/constructor]

(string → string) map last_boot_CPU_flags [RO/constructor]

string last_booted_record [RO/constructor]

int memory_dynamic_max [RO/constructor]

int memory_dynamic_min [RO/constructor]

int memory_overhead [RO/runtime]

int memory_static_max [RO/constructor]

int memory_static_min [RO/constructor]

Deprecated

int memory_target [RO/constructor]

VM_metrics ref metrics [RO/runtime]

string name_description [RW]

string name_label [RW]

(string → string) map NVRAM [RO/constructor]

int order [RO/constructor]

(string → string) map other_config [RW]

VM ref parent [RO/runtime]

Deprecated

string PCI_bus [RW]

enum update_guidances set pending_guidances [RO/runtime]

Prototype

enum update_guidances set pending_guidances_full [RO/runtime]

Prototype

enum update_guidances set pending_guidances_recommended [RO/runtime]

(string → string) map platform [RW]

enum vm_power_state power_state [RO/constructor]

Deprecated

VMPP ref protection_policy [RO/constructor]

string PV_args [RW]

string PV_bootloader [RW]

string PV_bootloader_args [RW]

string PV_kernel [RW]

string PV_legacy_args [RW]

string PV_ramdisk [RW]

string recommendations [RW]

string reference_label [RO/constructor]

bool requires_reboot [RO/runtime]

host ref resident_on [RO/runtime]

host ref scheduled_to_be_resident_on [RO/runtime]

int shutdown_delay [RO/constructor]

(string → string) map snapshot_info [RO/runtime]

string snapshot_metadata [RO/runtime]

VM ref snapshot_of [RO/runtime]

VMSS ref snapshot_schedule [RO/constructor]

datetime snapshot_time [RO/runtime]

VM ref set snapshots [RO/runtime]

int start_delay [RO/constructor]

SR ref suspend_SR [RW]

VDI ref suspend_VDI [RO/constructor]

string set tags [RW]

string transportable_snapshot_id [RO/runtime]

int user_version [RW]

string uuid [RO/runtime]

VBD ref set VBDs [RO/runtime]

int VCPUs_at_startup [RO/constructor]

int VCPUs_max [RO/constructor]

(string → string) map VCPUs_params [RW]

int version [RO/constructor]

VGPU ref set VGPUs [RO/runtime]

VIF ref set VIFs [RO/runtime]

VTPM ref set VTPMs [RO/runtime]

VUSB ref set VUSBs [RO/runtime]

(string → string) map xenstore_data [RW]

Messages

void add_tags (session ref, VM ref, string)

void add_to_blocked_operations (session ref, VM ref, enum vm_operations, string)

void add_to_HVM_boot_params (session ref, VM ref, string, string)

void add_to_NVRAM (session ref, VM ref, string, string)

void add_to_other_config (session ref, VM ref, string, string)

void add_to_platform (session ref, VM ref, string, string)

void add_to_VCPUs_params (session ref, VM ref, string, string)

void add_to_VCPUs_params_live (session ref, VM ref, string, string)

void add_to_xenstore_data (session ref, VM ref, string, string)

void assert_agile (session ref, VM ref)

void assert_can_be_recovered (session ref, VM ref, session ref)

void assert_can_boot_here (session ref, VM ref, host ref)

Returns an error if the VM could not boot on this host for some reason

Parameters:	session ref session_id	Reference to a valid session
	VM ref self	The VM
	host ref host	The host
Minimum role:	read-only
Errors:	HOST_NOT_ENOUGH_FREE_MEMORY	Not enough server memory is available to perform this operation.
	HOST_NOT_ENOUGH_PCPUS	The host does not have enough pCPUs to run the VM. It needs at least as many as the VM has vCPUs.
	NETWORK_SRIOV_INSUFFICIENT_CAPACITY	There is insufficient capacity for VF reservation
	HOST_NOT_LIVE	This operation cannot be completed as the server is not live.
	HOST_DISABLED	The specified server is disabled.
	HOST_CANNOT_ATTACH_NETWORK	Server cannot attach network (in the case of NIC bonding, this may be because attaching the network on this server would require other networks - that are currently active - to be taken down).
	VM_HVM_REQUIRED	HVM is required for this operation
	VM_REQUIRES_GPU	You attempted to run a VM on a host which doesn't have a pGPU available in the GPU group needed by the VM. The VM has a vGPU attached to this GPU group.
	VM_REQUIRES_IOMMU	You attempted to run a VM on a host which doesn't have I/O virtualization (IOMMU/VT-d) enabled, which is needed by the VM.
	VM_REQUIRES_NETWORK	You attempted to run a VM on a host which doesn't have a PIF on a Network needed by the VM. The VM has at least one VIF attached to the Network.
	VM_REQUIRES_SR	You attempted to run a VM on a host which doesn't have access to an SR needed by the VM. The VM has at least one VBD attached to a VDI in the SR.
	VM_REQUIRES_VGPU	You attempted to run a VM on a host on which the vGPU required by the VM cannot be allocated on any pGPUs in the GPU_group needed by the VM.
	VM_HOST_INCOMPATIBLE_VERSION	This VM operation cannot be performed on an older-versioned host during an upgrade.
	VM_HOST_INCOMPATIBLE_VIRTUAL_HARDWARE_PLATFORM_VERSION	You attempted to run a VM on a host that cannot provide the VM's required Virtual Hardware Platform version.
	INVALID_VALUE	The value given is invalid
	MEMORY_CONSTRAINT_VIOLATION	The dynamic memory range does not satisfy the following constraint.
	OPERATION_NOT_ALLOWED	You attempted an operation that was not allowed.
	VALUE_NOT_SUPPORTED	You attempted to set a value that is not supported by this implementation. The fully-qualified field name and the value that you tried to set are returned. Also returned is a developer-only diagnostic reason.
	VM_INCOMPATIBLE_WITH_THIS_HOST	The VM is incompatible with the CPU features of this host.
Published in:	XenServer 4.0 (rio)
Changed in:	Citrix Hypervisor 8.1 (quebec)	Does additional compatibility checks when VM powerstate is not halted (e.g. CPUID). Use this before calling VM.resume or VM.pool_migrate.
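
As the change note above suggests, assert_can_boot_here can be used to pre-filter hosts before calling VM.resume or VM.pool_migrate. The sketch below assumes a Python binding in which messages are reached as session.xenapi.VM.<message> with the session ref supplied implicitly, and in which API errors are raised as exceptions carrying a details list whose first element is the error code. The Failure class and fake_session are stand-ins so the sketch runs without a pool; hosts_vm_can_boot_on is a hypothetical helper name.

```python
from types import SimpleNamespace


class Failure(Exception):
    """Stand-in for the binding's failure exception (assumed to expose .details)."""
    def __init__(self, details):
        super().__init__(details)
        self.details = details


def hosts_vm_can_boot_on(session, vm, hosts):
    """Split hosts into (usable, rejected), where rejected maps host -> error code."""
    usable, rejected = [], {}
    for host in hosts:
        try:
            session.xenapi.VM.assert_can_boot_here(vm, host)
        except Exception as exc:
            details = getattr(exc, "details", None)
            if not details:
                raise  # not an API failure: let it propagate
            rejected[host] = details[0]  # e.g. HOST_NOT_ENOUGH_FREE_MEMORY
        else:
            usable.append(host)
    return usable, rejected


def fake_session(errors_by_host):
    """Build a stand-in session whose check fails for the given hosts."""
    def assert_can_boot_here(vm, host):
        if host in errors_by_host:
            raise Failure([errors_by_host[host], vm, host])
    vm_class = SimpleNamespace(assert_can_boot_here=assert_can_boot_here)
    return SimpleNamespace(xenapi=SimpleNamespace(VM=vm_class))
```

Keeping the error code for each rejected host lets a caller report why a host was skipped, using the codes from the Errors table above.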

void assert_can_migrate (session ref, VM ref, (string → string) map, bool, (VDI ref → SR ref) map, (VIF ref → network ref) map, (string → string) map, (VGPU ref → GPU_group ref) map)

void assert_operation_valid (session ref, VM ref, enum vm_operations)

string call_plugin (session ref, VM ref, string, string, (string → string) map)

VM ref checkpoint (session ref, VM ref, string)

void clean_reboot (session ref, VM ref)

void clean_shutdown (session ref, VM ref)

VM ref clone (session ref, VM ref, string)

int compute_memory_overhead (session ref, VM ref)

VM ref copy (session ref, VM ref, string, SR ref)

void copy_bios_strings (session ref, VM ref, host ref)

VM ref create (session ref, VM record)

blob ref create_new_blob (session ref, VM ref, string, string, bool)

void destroy (session ref, VM ref)

void forget_data_source_archives (session ref, VM ref, string)

enum on_crash_behaviour get_actions_after_crash (session ref, VM ref)

enum on_normal_exit get_actions_after_reboot (session ref, VM ref)

enum on_normal_exit get_actions_after_shutdown (session ref, VM ref)

Prototype

enum on_softreboot_behavior get_actions_after_softreboot (session ref, VM ref)

host ref get_affinity (session ref, VM ref)

VM ref set get_all (session ref)

(VM ref → VM record) map get_all_records (session ref)

enum vm_operations set get_allowed_operations (session ref, VM ref)

string set get_allowed_VBD_devices (session ref, VM ref)

string set get_allowed_VIF_devices (session ref, VM ref)

VM_appliance ref get_appliance (session ref, VM ref)

PCI ref set get_attached_PCIs (session ref, VM ref)

(string → string) map get_bios_strings (session ref, VM ref)

(string → blob ref) map get_blobs (session ref, VM ref)

(enum vm_operations → string) map get_blocked_operations (session ref, VM ref)

Deprecated

VM record get_boot_record (session ref, VM ref)

VM ref set get_by_name_label (session ref, string)

VM ref get_by_uuid (session ref, string)

VM ref set get_children (session ref, VM ref)

console ref set get_consoles (session ref, VM ref)

Deprecated

bool get_cooperative (session ref, VM ref)

crashdump ref set get_crash_dumps (session ref, VM ref)

(string → enum vm_operations) map get_current_operations (session ref, VM ref)

data_source record set get_data_sources (session ref, VM ref)

enum domain_type get_domain_type (session ref, VM ref)

string get_domarch (session ref, VM ref)

int get_domid (session ref, VM ref)

string get_generation_id (session ref, VM ref)

VM_guest_metrics ref get_guest_metrics (session ref, VM ref)

Deprecated

bool get_ha_always_run (session ref, VM ref)

string get_ha_restart_priority (session ref, VM ref)

int get_hardware_platform_version (session ref, VM ref)

bool get_has_vendor_device (session ref, VM ref)

(string → string) map get_HVM_boot_params (session ref, VM ref)

Deprecated

string get_HVM_boot_policy (session ref, VM ref)

float get_HVM_shadow_multiplier (session ref, VM ref)

bool get_is_a_snapshot (session ref, VM ref)

bool get_is_a_template (session ref, VM ref)

bool get_is_control_domain (session ref, VM ref)

bool get_is_default_template (session ref, VM ref)

Removed

bool get_is_snapshot_from_vmpp (session ref, VM ref)

bool get_is_vmss_snapshot (session ref, VM ref)

(string → string) map get_last_boot_CPU_flags (session ref, VM ref)

string get_last_booted_record (session ref, VM ref)

int get_memory_dynamic_max (session ref, VM ref)

int get_memory_dynamic_min (session ref, VM ref)

int get_memory_overhead (session ref, VM ref)

int get_memory_static_max (session ref, VM ref)

int get_memory_static_min (session ref, VM ref)

Deprecated

int get_memory_target (session ref, VM ref)

VM_metrics ref get_metrics (session ref, VM ref)

string get_name_description (session ref, VM ref)

string get_name_label (session ref, VM ref)

(string → string) map get_NVRAM (session ref, VM ref)

int get_order (session ref, VM ref)

(string → string) map get_other_config (session ref, VM ref)

VM ref get_parent (session ref, VM ref)

Deprecated

string get_PCI_bus (session ref, VM ref)

enum update_guidances set get_pending_guidances (session ref, VM ref)

Prototype

enum update_guidances set get_pending_guidances_full (session ref, VM ref)

Prototype

enum update_guidances set get_pending_guidances_recommended (session ref, VM ref)

(string → string) map get_platform (session ref, VM ref)

host ref set get_possible_hosts (session ref, VM ref)

enum vm_power_state get_power_state (session ref, VM ref)

Deprecated

VMPP ref get_protection_policy (session ref, VM ref)

string get_PV_args (session ref, VM ref)

string get_PV_bootloader (session ref, VM ref)

string get_PV_bootloader_args (session ref, VM ref)

string get_PV_kernel (session ref, VM ref)

string get_PV_legacy_args (session ref, VM ref)

string get_PV_ramdisk (session ref, VM ref)

string get_recommendations (session ref, VM ref)

VM record get_record (session ref, VM ref)

string get_reference_label (session ref, VM ref)

bool get_requires_reboot (session ref, VM ref)

host ref get_resident_on (session ref, VM ref)

host ref get_scheduled_to_be_resident_on (session ref, VM ref)

int get_shutdown_delay (session ref, VM ref)

(string → string) map get_snapshot_info (session ref, VM ref)

string get_snapshot_metadata (session ref, VM ref)

VM ref get_snapshot_of (session ref, VM ref)

VMSS ref get_snapshot_schedule (session ref, VM ref)

datetime get_snapshot_time (session ref, VM ref)

VM ref set get_snapshots (session ref, VM ref)

SR ref set get_SRs_required_for_recovery (session ref, VM ref, session ref)

int get_start_delay (session ref, VM ref)

SR ref get_suspend_SR (session ref, VM ref)

VDI ref get_suspend_VDI (session ref, VM ref)

string set get_tags (session ref, VM ref)

string get_transportable_snapshot_id (session ref, VM ref)

int get_user_version (session ref, VM ref)

string get_uuid (session ref, VM ref)

VBD ref set get_VBDs (session ref, VM ref)

int get_VCPUs_at_startup (session ref, VM ref)

int get_VCPUs_max (session ref, VM ref)

(string → string) map get_VCPUs_params (session ref, VM ref)

int get_version (session ref, VM ref)

VGPU ref set get_VGPUs (session ref, VM ref)

VIF ref set get_VIFs (session ref, VM ref)

VTPM ref set get_VTPMs (session ref, VM ref)

VUSB ref set get_VUSBs (session ref, VM ref)

(string → string) map get_xenstore_data (session ref, VM ref)

void hard_reboot (session ref, VM ref)

void hard_shutdown (session ref, VM ref)

VM ref set import (session ref, string, SR ref, bool, bool)

void import_convert (session ref, string, string, string, SR ref, (string → string) map)

int maximise_memory (session ref, VM ref, int, bool)

VM ref migrate_send (session ref, VM ref, (string → string) map, bool, (VDI ref → SR ref) map, (VIF ref → network ref) map, (string → string) map, (VGPU ref → GPU_group ref) map)

void pause (session ref, VM ref)

void pool_migrate (session ref, VM ref, host ref, (string → string) map)

void power_state_reset (session ref, VM ref)

void provision (session ref, VM ref)

float query_data_source (session ref, VM ref, string)

(string → string) map query_services (session ref, VM ref)

void record_data_source (session ref, VM ref, string)

void recover (session ref, VM ref, session ref, bool)

void remove_from_blocked_operations (session ref, VM ref, enum vm_operations)

void remove_from_HVM_boot_params (session ref, VM ref, string)

void remove_from_NVRAM (session ref, VM ref, string)

void remove_from_other_config (session ref, VM ref, string)

void remove_from_platform (session ref, VM ref, string)

void remove_from_VCPUs_params (session ref, VM ref, string)

void remove_from_xenstore_data (session ref, VM ref, string)

void remove_tags (session ref, VM ref, string)

Prototype

void restart_device_models (session ref, VM ref)

void resume (session ref, VM ref, bool, bool)

void resume_on (session ref, VM ref, host ref, bool, bool)

(host ref → string set) map retrieve_wlb_recommendations (session ref, VM ref)

void revert (session ref, VM ref)

void send_sysrq (session ref, VM ref, string)

void send_trigger (session ref, VM ref, string)

void set_actions_after_crash (session ref, VM ref, enum on_crash_behaviour)

void set_actions_after_reboot (session ref, VM ref, enum on_normal_exit)

void set_actions_after_shutdown (session ref, VM ref, enum on_normal_exit)

Prototype

void set_actions_after_softreboot (session ref, VM ref, enum on_softreboot_behavior)

void set_affinity (session ref, VM ref, host ref)

void set_appliance (session ref, VM ref, VM_appliance ref)

void set_bios_strings (session ref, VM ref, (string → string) map)

void set_blocked_operations (session ref, VM ref, (enum vm_operations → string) map)

void set_domain_type (session ref, VM ref, enum domain_type)

Deprecated

void set_ha_always_run (session ref, VM ref, bool)

void set_ha_restart_priority (session ref, VM ref, string)

void set_hardware_platform_version (session ref, VM ref, int)

void set_has_vendor_device (session ref, VM ref, bool)

void set_HVM_boot_params (session ref, VM ref, (string → string) map)

Deprecated

void set_HVM_boot_policy (session ref, VM ref, string)

void set_HVM_shadow_multiplier (session ref, VM ref, float)

void set_is_a_template (session ref, VM ref, bool)

void set_memory (session ref, VM ref, int)

void set_memory_dynamic_max (session ref, VM ref, int)

void set_memory_dynamic_min (session ref, VM ref, int)

void set_memory_dynamic_range (session ref, VM ref, int, int)

void set_memory_limits (session ref, VM ref, int, int, int, int)

void set_memory_static_max (session ref, VM ref, int)

void set_memory_static_min (session ref, VM ref, int)

void set_memory_static_range (session ref, VM ref, int, int)

Deprecated

void set_memory_target_live (session ref, VM ref, int)

void set_name_description (session ref, VM ref, string)

void set_name_label (session ref, VM ref, string)

void set_NVRAM (session ref, VM ref, (string → string) map)

void set_order (session ref, VM ref, int)

void set_other_config (session ref, VM ref, (string → string) map)

Deprecated

void set_PCI_bus (session ref, VM ref, string)

void set_platform (session ref, VM ref, (string → string) map)

Removed

void set_protection_policy (session ref, VM ref, VMPP ref)

void set_PV_args (session ref, VM ref, string)

void set_PV_bootloader (session ref, VM ref, string)

void set_PV_bootloader_args (session ref, VM ref, string)

void set_PV_kernel (session ref, VM ref, string)

void set_PV_legacy_args (session ref, VM ref, string)

void set_PV_ramdisk (session ref, VM ref, string)

void set_recommendations (session ref, VM ref, string)

void set_shadow_multiplier_live (session ref, VM ref, float)

void set_shutdown_delay (session ref, VM ref, int)

void set_snapshot_schedule (session ref, VM ref, VMSS ref)

void set_start_delay (session ref, VM ref, int)

void set_suspend_SR (session ref, VM ref, SR ref)

void set_suspend_VDI (session ref, VM ref, VDI ref)

void set_tags (session ref, VM ref, string set)

void set_user_version (session ref, VM ref, int)

void set_VCPUs_at_startup (session ref, VM ref, int)

void set_VCPUs_max (session ref, VM ref, int)

void set_VCPUs_number_live (session ref, VM ref, int)

void set_VCPUs_params (session ref, VM ref, (string → string) map)

void set_xenstore_data (session ref, VM ref, (string → string) map)

void shutdown (session ref, VM ref)

VM ref snapshot (session ref, VM ref, string, VDI ref set)

Removed

VM ref snapshot_with_quiesce (session ref, VM ref, string)

void start (session ref, VM ref, bool, bool)

void start_on (session ref, VM ref, host ref, bool, bool)

void suspend (session ref, VM ref)

void unpause (session ref, VM ref)

void update_allowed_operations (session ref, VM ref)

Deprecated

void wait_memory_target_live (session ref, VM ref)
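The VM memory setters listed above (set_memory_limits, set_memory_static_range, set_memory_dynamic_range) take byte counts. As a minimal sketch, a client could validate the values before making the call. Two assumptions here are not stated in this listing: that the limits must satisfy static_min <= dynamic_min <= dynamic_max <= static_max, and that the four int arguments of set_memory_limits are ordered static_min, static_max, dynamic_min, dynamic_max. The helper name check_memory_limits is hypothetical.

```python
GiB = 1024 ** 3


def check_memory_limits(static_min, static_max, dynamic_min, dynamic_max):
    """Return the four limits as a tuple, or raise ValueError if they are out of order.

    Assumed invariant (not stated in the listing above):
    static_min <= dynamic_min <= dynamic_max <= static_max
    """
    ordered = [static_min, dynamic_min, dynamic_max, static_max]
    if any(value < 0 for value in ordered):
        raise ValueError("memory limits must be non-negative byte counts")
    if ordered != sorted(ordered):
        raise ValueError(
            "expected static_min <= dynamic_min <= dynamic_max <= static_max"
        )
    # Same order as the assumed argument order of VM.set_memory_limits.
    return (static_min, static_max, dynamic_min, dynamic_max)


limits = check_memory_limits(1 * GiB, 4 * GiB, 2 * GiB, 4 * GiB)
```

Checking on the client side only gives an earlier, friendlier error; the server remains the authority on which combinations it accepts.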

VM_appliance

Class: VM_appliance

VM appliance

Enums

vm_appliance_operation

Fields

enum vm_appliance_operation set allowed_operations [RO/runtime]

(string → enum vm_appliance_operation) map current_operations [RO/runtime]

string name_description [RW]

string name_label [RW]

string uuid [RO/runtime]

VM ref set VMs [RO/runtime]

Messages

void assert_can_be_recovered (session ref, VM_appliance ref, session ref)

void clean_shutdown (session ref, VM_appliance ref)

VM_appliance ref create (session ref, VM_appliance record)

void destroy (session ref, VM_appliance ref)

VM_appliance ref set get_all (session ref)

(VM_appliance ref → VM_appliance record) map get_all_records (session ref)

enum vm_appliance_operation set get_allowed_operations (session ref, VM_appliance ref)

VM_appliance ref set get_by_name_label (session ref, string)

VM_appliance ref get_by_uuid (session ref, string)

(string → enum vm_appliance_operation) map get_current_operations (session ref, VM_appliance ref)

string get_name_description (session ref, VM_appliance ref)

string get_name_label (session ref, VM_appliance ref)

VM_appliance record get_record (session ref, VM_appliance ref)

SR ref set get_SRs_required_for_recovery (session ref, VM_appliance ref, session ref)

string get_uuid (session ref, VM_appliance ref)

VM ref set get_VMs (session ref, VM_appliance ref)

void hard_shutdown (session ref, VM_appliance ref)

void recover (session ref, VM_appliance ref, session ref, bool)

void set_name_description (session ref, VM_appliance ref, string)

void set_name_label (session ref, VM_appliance ref, string)

void shutdown (session ref, VM_appliance ref)

void start (session ref, VM_appliance ref, bool)
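The VM_appliance messages above can be combined into a simple restart routine. This is a sketch, assuming `session` behaves like `XenAPI.Session` from the XenAPI Python bindings (which supplies the leading session ref argument implicitly) and assuming the bool argument of start means "start paused". The function name restart_appliance is hypothetical.

```python
def restart_appliance(session, name_label, force=False):
    """Shut down and then start every VM_appliance with the given name_label.

    Returns the list of appliance refs that were restarted.
    """
    appliances = session.xenapi.VM_appliance.get_by_name_label(name_label)
    for appliance in appliances:
        if force:
            session.xenapi.VM_appliance.hard_shutdown(appliance)
        else:
            session.xenapi.VM_appliance.clean_shutdown(appliance)
        # False: assumed to mean the VMs are started running, not paused.
        session.xenapi.VM_appliance.start(appliance, False)
    return appliances
```

get_by_name_label returns a set because name labels are not unique, so the loop handles zero, one or several matches.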

VM_guest_metrics

Class: VM_guest_metrics

The metrics reported by the guest (as opposed to inferred from outside)

Enums

tristate_type

Fields

enum tristate_type can_use_hotplug_vbd [RO/runtime]

enum tristate_type can_use_hotplug_vif [RO/runtime]

Removed

(string → string) map disks [RO/runtime]

datetime last_updated [RO/runtime]

bool live [RO/runtime]

Removed

(string → string) map memory [RO/runtime]

(string → string) map networks [RO/runtime]

(string → string) map os_version [RO/runtime]

(string → string) map other [RO/runtime]

(string → string) map other_config [RW]

bool PV_drivers_detected [RO/runtime]

Deprecated

bool PV_drivers_up_to_date [RO/runtime]

(string → string) map PV_drivers_version [RO/runtime]

string uuid [RO/runtime]

Messages

void add_to_other_config (session ref, VM_guest_metrics ref, string, string)

VM_guest_metrics ref set get_all (session ref)

(VM_guest_metrics ref → VM_guest_metrics record) map get_all_records (session ref)

VM_guest_metrics ref get_by_uuid (session ref, string)

enum tristate_type get_can_use_hotplug_vbd (session ref, VM_guest_metrics ref)

enum tristate_type get_can_use_hotplug_vif (session ref, VM_guest_metrics ref)

Removed

(string → string) map get_disks (session ref, VM_guest_metrics ref)

datetime get_last_updated (session ref, VM_guest_metrics ref)

bool get_live (session ref, VM_guest_metrics ref)

Removed

(string → string) map get_memory (session ref, VM_guest_metrics ref)

(string → string) map get_networks (session ref, VM_guest_metrics ref)

(string → string) map get_os_version (session ref, VM_guest_metrics ref)

(string → string) map get_other (session ref, VM_guest_metrics ref)

(string → string) map get_other_config (session ref, VM_guest_metrics ref)

bool get_PV_drivers_detected (session ref, VM_guest_metrics ref)

Deprecated

bool get_PV_drivers_up_to_date (session ref, VM_guest_metrics ref)

(string → string) map get_PV_drivers_version (session ref, VM_guest_metrics ref)

VM_guest_metrics record get_record (session ref, VM_guest_metrics ref)

string get_uuid (session ref, VM_guest_metrics ref)

void remove_from_other_config (session ref, VM_guest_metrics ref, string)

void set_other_config (session ref, VM_guest_metrics ref, (string → string) map)
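The networks field above is a flat (string → string) map, so clients usually need to regroup it. The sketch below assumes keys of the form "<device>/ip", "<device>/ipv4/<n>" and "<device>/ipv6/<n>", which this listing does not specify; the function name addresses_by_vif is hypothetical.

```python
from collections import defaultdict


def addresses_by_vif(networks):
    """Group the values of a VM_guest_metrics.networks map by VIF device number.

    Keys that do not match the assumed "<device>/ip...", "<device>/ipv4/..."
    or "<device>/ipv6/..." shapes are ignored.
    """
    grouped = defaultdict(set)
    for key, value in networks.items():
        parts = key.split("/")
        if len(parts) >= 2 and parts[0].isdigit() and parts[1] in ("ip", "ipv4", "ipv6"):
            grouped[int(parts[0])].add(value)
    # Sets remove duplicates such as the same address under "ip" and "ipv4/0".
    return {device: sorted(addrs) for device, addrs in grouped.items()}


sample = {
    "0/ip": "192.0.2.10",
    "0/ipv4/0": "192.0.2.10",
    "0/ipv6/0": "2001:db8::10",
    "1/ip": "198.51.100.7",
}
result = addresses_by_vif(sample)
```

The input would typically come from get_networks on a VM_guest_metrics ref; the sample map here is made up for illustration.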

VM_metrics

Class: VM_metrics

The metrics associated with a VM

Enums

domain_type

Fields

enum domain_type current_domain_type [RO/runtime]

bool hvm [RO/runtime]

datetime install_time [RO/runtime]

datetime last_updated [RO/runtime]

int memory_actual [RO/runtime]

bool nested_virt [RO/runtime]

bool nomigrate [RO/runtime]

(string → string) map other_config [RW]

datetime start_time [RO/runtime]

string set state [RO/runtime]

string uuid [RO/runtime]

(int → int) map VCPUs_CPU [RO/runtime]

(int → string set) map VCPUs_flags [RO/runtime]

int VCPUs_number [RO/runtime]

(string → string) map VCPUs_params [RO/runtime]

Removed

(int → float) map VCPUs_utilisation [RO/runtime]

Messages

void add_to_other_config (session ref, VM_metrics ref, string, string)

VM_metrics ref set get_all (session ref)

(VM_metrics ref → VM_metrics record) map get_all_records (session ref)

VM_metrics ref get_by_uuid (session ref, string)

enum domain_type get_current_domain_type (session ref, VM_metrics ref)

bool get_hvm (session ref, VM_metrics ref)

datetime get_install_time (session ref, VM_metrics ref)

datetime get_last_updated (session ref, VM_metrics ref)

int get_memory_actual (session ref, VM_metrics ref)

bool get_nested_virt (session ref, VM_metrics ref)

bool get_nomigrate (session ref, VM_metrics ref)

(string → string) map get_other_config (session ref, VM_metrics ref)

VM_metrics record get_record (session ref, VM_metrics ref)

datetime get_start_time (session ref, VM_metrics ref)

string set get_state (session ref, VM_metrics ref)

string get_uuid (session ref, VM_metrics ref)

(int → int) map get_VCPUs_CPU (session ref, VM_metrics ref)

(int → string set) map get_VCPUs_flags (session ref, VM_metrics ref)

int get_VCPUs_number (session ref, VM_metrics ref)

(string → string) map get_VCPUs_params (session ref, VM_metrics ref)

Removed

(int → float) map get_VCPUs_utilisation (session ref, VM_metrics ref)

void remove_from_other_config (session ref, VM_metrics ref, string)

void set_other_config (session ref, VM_metrics ref, (string → string) map)

VMPP

Removed

Class: VMPP

VM Protection Policy

Enums

vmpp_backup_type

vmpp_backup_frequency

vmpp_archive_frequency

vmpp_archive_target_type

Fields

Removed

(string → string) map alarm_config [RO/constructor]

Removed

enum vmpp_archive_frequency archive_frequency [RO/constructor]

Removed

datetime archive_last_run_time [RO/runtime]

Removed

(string → string) map archive_schedule [RO/constructor]

Removed

(string → string) map archive_target_config [RO/constructor]

Removed

enum vmpp_archive_target_type archive_target_type [RO/constructor]

Removed

enum vmpp_backup_frequency backup_frequency [RO/constructor]

Removed

datetime backup_last_run_time [RO/runtime]

Removed

int backup_retention_value [RO/constructor]

Removed

(string → string) map backup_schedule [RO/constructor]

Removed

enum vmpp_backup_type backup_type [RW]

Removed

bool is_alarm_enabled [RO/constructor]

Removed

bool is_archive_running [RO/runtime]

Removed

bool is_backup_running [RO/runtime]

Removed

bool is_policy_enabled [RW]

string name_description [RW]

string name_label [RW]

Removed

string set recent_alerts [RO/runtime]

Removed

string uuid [RO/runtime]

Removed

VM ref set VMs [RO/runtime]

Messages

Removed

void add_to_alarm_config (session ref, VMPP ref, string, string)

Removed

void add_to_archive_schedule (session ref, VMPP ref, string, string)

Removed

void add_to_archive_target_config (session ref, VMPP ref, string, string)

Removed

void add_to_backup_schedule (session ref, VMPP ref, string, string)

Removed

string archive_now (session ref, VM ref)

Removed

VMPP ref create (session ref, VMPP record)

Removed

void destroy (session ref, VMPP ref)

Removed

(string → string) map get_alarm_config (session ref, VMPP ref)

Removed

string set get_alerts (session ref, VMPP ref, int)

Removed

VMPP ref set get_all (session ref)

Removed

(VMPP ref → VMPP record) map get_all_records (session ref)

Removed

enum vmpp_archive_frequency get_archive_frequency (session ref, VMPP ref)

Removed

datetime get_archive_last_run_time (session ref, VMPP ref)

Removed

(string → string) map get_archive_schedule (session ref, VMPP ref)

Removed

(string → string) map get_archive_target_config (session ref, VMPP ref)

Removed

enum vmpp_archive_target_type get_archive_target_type (session ref, VMPP ref)

Removed

enum vmpp_backup_frequency get_backup_frequency (session ref, VMPP ref)

Removed

datetime get_backup_last_run_time (session ref, VMPP ref)

Removed

int get_backup_retention_value (session ref, VMPP ref)

Removed

(string → string) map get_backup_schedule (session ref, VMPP ref)

Removed

enum vmpp_backup_type get_backup_type (session ref, VMPP ref)

Removed

VMPP ref set get_by_name_label (session ref, string)

Removed

VMPP ref get_by_uuid (session ref, string)

Removed

bool get_is_alarm_enabled (session ref, VMPP ref)

Removed

bool get_is_archive_running (session ref, VMPP ref)

Removed

bool get_is_backup_running (session ref, VMPP ref)

Removed

bool get_is_policy_enabled (session ref, VMPP ref)

string get_name_description (session ref, VMPP ref)

string get_name_label (session ref, VMPP ref)

Removed

string set get_recent_alerts (session ref, VMPP ref)

Removed

VMPP record get_record (session ref, VMPP ref)

Removed

string get_uuid (session ref, VMPP ref)

Removed

VM ref set get_VMs (session ref, VMPP ref)

Removed

string protect_now (session ref, VMPP ref)

Removed

void remove_from_alarm_config (session ref, VMPP ref, string)

Removed

void remove_from_archive_schedule (session ref, VMPP ref, string)

Removed

void remove_from_archive_target_config (session ref, VMPP ref, string)

Removed

void remove_from_backup_schedule (session ref, VMPP ref, string)

Removed

void set_alarm_config (session ref, VMPP ref, (string → string) map)

Removed

void set_archive_frequency (session ref, VMPP ref, enum vmpp_archive_frequency)

Removed

void set_archive_last_run_time (session ref, VMPP ref, datetime)

Removed

void set_archive_schedule (session ref, VMPP ref, (string → string) map)

Removed

void set_archive_target_config (session ref, VMPP ref, (string → string) map)

Removed

void set_archive_target_type (session ref, VMPP ref, enum vmpp_archive_target_type)

Removed

void set_backup_frequency (session ref, VMPP ref, enum vmpp_backup_frequency)

Removed

void set_backup_last_run_time (session ref, VMPP ref, datetime)

Removed

void set_backup_retention_value (session ref, VMPP ref, int)

Removed

void set_backup_schedule (session ref, VMPP ref, (string → string) map)

Removed

void set_backup_type (session ref, VMPP ref, enum vmpp_backup_type)

Removed

void set_is_alarm_enabled (session ref, VMPP ref, bool)

Removed

void set_is_policy_enabled (session ref, VMPP ref, bool)

void set_name_description (session ref, VMPP ref, string)

void set_name_label (session ref, VMPP ref, string)

VMSS

Class: VMSS

VM Snapshot Schedule

Enums

vmss_frequency

vmss_type

Fields

bool enabled [RW]

enum vmss_frequency frequency [RO/constructor]

datetime last_run_time [RO/runtime]

string name_description [RW]

string name_label [RW]

int retained_snapshots [RO/constructor]

(string → string) map schedule [RO/constructor]

enum vmss_type type [RO/constructor]

string uuid [RO/runtime]

VM ref set VMs [RO/runtime]

Messages

void add_to_schedule (session ref, VMSS ref, string, string)

VMSS ref create (session ref, VMSS record)

void destroy (session ref, VMSS ref)

VMSS ref set get_all (session ref)

(VMSS ref → VMSS record) map get_all_records (session ref)

VMSS ref set get_by_name_label (session ref, string)

VMSS ref get_by_uuid (session ref, string)

bool get_enabled (session ref, VMSS ref)

enum vmss_frequency get_frequency (session ref, VMSS ref)

datetime get_last_run_time (session ref, VMSS ref)

string get_name_description (session ref, VMSS ref)

string get_name_label (session ref, VMSS ref)

VMSS record get_record (session ref, VMSS ref)

int get_retained_snapshots (session ref, VMSS ref)

(string → string) map get_schedule (session ref, VMSS ref)

enum vmss_type get_type (session ref, VMSS ref)

string get_uuid (session ref, VMSS ref)

VM ref set get_VMs (session ref, VMSS ref)

void remove_from_schedule (session ref, VMSS ref, string)

void set_enabled (session ref, VMSS ref, bool)

void set_frequency (session ref, VMSS ref, enum vmss_frequency)

void set_last_run_time (session ref, VMSS ref, datetime)

void set_name_description (session ref, VMSS ref, string)

void set_name_label (session ref, VMSS ref, string)

void set_retained_snapshots (session ref, VMSS ref, int)

void set_schedule (session ref, VMSS ref, (string → string) map)

void set_type (session ref, VMSS ref, enum vmss_type)

string snapshot_now (session ref, VMSS ref)
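VMSS.create takes a whole VMSS record, so it can help to build and check the record first. The following is a sketch under several assumptions that this listing does not confirm: that vmss_frequency has the values hourly, daily and weekly, that the schedule map uses the keys min, hour and days, and that int fields are sent as strings. The names REQUIRED_KEYS and vmss_record are hypothetical.

```python
# Keys assumed to be required in `schedule` for each vmss_frequency value.
REQUIRED_KEYS = {
    "hourly": {"min"},
    "daily": {"hour", "min"},
    "weekly": {"days", "hour", "min"},
}


def vmss_record(name_label, frequency, schedule, vmss_type="snapshot",
                retained_snapshots=7, enabled=True, name_description=""):
    """Build a dict shaped like a VMSS record from the fields listed above."""
    if frequency not in REQUIRED_KEYS:
        raise ValueError(f"unknown frequency: {frequency!r}")
    missing = REQUIRED_KEYS[frequency] - set(schedule)
    if missing:
        raise ValueError(f"schedule is missing keys: {sorted(missing)}")
    return {
        "name_label": name_label,
        "name_description": name_description,
        "enabled": enabled,
        "type": vmss_type,
        "frequency": frequency,
        # int values are written as strings: an assumption about the wire format.
        "retained_snapshots": str(retained_snapshots),
        "schedule": {key: str(value) for key, value in schedule.items()},
    }


record = vmss_record("nightly", "daily", {"hour": 2, "min": 30})
```

With a session object from the XenAPI Python bindings, the record would then be passed as the single argument of `session.xenapi.VMSS.create`.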

VTPM

Prototype

Class: VTPM

A virtual TPM device

Enums

vtpm_operations

persistence_backend

Fields

enum vtpm_operations set allowed_operations [RO/runtime]

VM ref backend [RO/runtime]

(string → enum vtpm_operations) map current_operations [RO/runtime]

Prototype

bool is_protected [RO/runtime]

Prototype

bool is_unique [RO/constructor]

Prototype

enum persistence_backend persistence_backend [RO/runtime]

string uuid [RO/runtime]

VM ref VM [RO/constructor]

Messages

Prototype

VTPM ref create (session ref, VM ref, bool)

Prototype

void destroy (session ref, VTPM ref)

Prototype

VTPM ref set get_all (session ref)

Prototype

(VTPM ref → VTPM record) map get_all_records (session ref)

enum vtpm_operations set get_allowed_operations (session ref, VTPM ref)

VM ref get_backend (session ref, VTPM ref)

Prototype

VTPM ref get_by_uuid (session ref, string)

(string → enum vtpm_operations) map get_current_operations (session ref, VTPM ref)

Prototype

bool get_is_protected (session ref, VTPM ref)

Prototype

bool get_is_unique (session ref, VTPM ref)

Prototype

enum persistence_backend get_persistence_backend (session ref, VTPM ref)

Prototype

VTPM record get_record (session ref, VTPM ref)

string get_uuid (session ref, VTPM ref)

VM ref get_VM (session ref, VTPM ref)

VUSB

Class: VUSB

Describes the vusb device

Enums

vusb_operations

Fields

enum vusb_operations set allowed_operations [RO/runtime]

(string → enum vusb_operations) map current_operations [RO/runtime]

bool currently_attached [RO/runtime]

(string → string) map other_config [RW]

USB_group ref USB_group [RO/runtime]

string uuid [RO/runtime]

VM ref VM [RO/runtime]

Messages

void add_to_other_config (session ref, VUSB ref, string, string)

VUSB ref create (session ref, VM ref, USB_group ref, (string → string) map)

void destroy (session ref, VUSB ref)

VUSB ref set get_all (session ref)

(VUSB ref → VUSB record) map get_all_records (session ref)

enum vusb_operations set get_allowed_operations (session ref, VUSB ref)

VUSB ref get_by_uuid (session ref, string)

(string → enum vusb_operations) map get_current_operations (session ref, VUSB ref)

bool get_currently_attached (session ref, VUSB ref)

(string → string) map get_other_config (session ref, VUSB ref)

VUSB record get_record (session ref, VUSB ref)

USB_group ref get_USB_group (session ref, VUSB ref)

string get_uuid (session ref, VUSB ref)

VM ref get_VM (session ref, VUSB ref)

void remove_from_other_config (session ref, VUSB ref, string)

void set_other_config (session ref, VUSB ref, (string → string) map)

void unplug (session ref, VUSB ref)
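A typical VUSB lifecycle uses create, unplug and destroy from the messages above. This is a sketch, assuming `session` behaves like `XenAPI.Session` from the XenAPI Python bindings and assuming the map argument of create is other_config; the function names are hypothetical.

```python
def attach_usb_group(session, vm, usb_group, other_config=None):
    """Create a VUSB linking a VM to a USB_group and return the new VUSB ref."""
    return session.xenapi.VUSB.create(vm, usb_group, other_config or {})


def detach_vusb(session, vusb):
    """Unplug the VUSB if it is currently attached, then destroy its record."""
    if session.xenapi.VUSB.get_currently_attached(vusb):
        session.xenapi.VUSB.unplug(vusb)
    session.xenapi.VUSB.destroy(vusb)
```

Checking currently_attached first avoids calling unplug on a device that is not plugged, which the server may reject.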

XenAPI Releases

XAPI 24.16.0

Code name: "24.16.0".

Changes

Change	Element	Description
Extended class	sr_stat	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.clustered	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.free_space	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.health	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.name_description	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.name_label	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.total_space	Enum extended with 'unreachable' and 'unavailable' values
Extended field	sr_stat.uuid	Enum extended with 'unreachable' and 'unavailable' values

XAPI 24.14.0

Code name: "24.14.0".

Changes

Change	Element	Description
Prototyped message	PCI.disable_dom0_access
Prototyped message	PCI.enable_dom0_access
Prototyped message	PCI.get_dom0_access_status
Changed field	VM.has_vendor_device	New default and not consulting Pool.policy_no_vendor_device
Deprecated field	PGPU.dom0_access	Use PCI.get_dom0_access_status instead.
Deprecated field	pool.policy_no_vendor_device	No longer considered by VM.create
Deprecated message	PGPU.disable_dom0_access	Use PCI.disable_dom0_access instead.
Deprecated message	PGPU.enable_dom0_access	Use PCI.enable_dom0_access instead.

XAPI 24.10.0

Code name: "24.10.0".

Changes

Change	Element	Description
Prototyped field	VM.pending_guidances_full
Prototyped field	VM.pending_guidances_recommended
Prototyped field	host.last_update_hash
Prototyped field	host.pending_guidances_full
Prototyped field	host.pending_guidances_recommended
Prototyped message	host.emergency_clear_mandatory_guidance

XAPI 24.3.0

Code name: "24.3.0".

Changes

Change	Element	Description
Prototyped field	Cluster.is_quorate
Prototyped field	Cluster.live_hosts
Prototyped field	Cluster.quorum
Prototyped field	Cluster_host.last_update_live
Prototyped field	Cluster_host.live

XAPI 24.0.0

Code name: "24.0.0".

Changes

Change	Element	Description
Prototyped field	host.numa_affinity_policy
Prototyped field	pool.custom_uefi_certificates
Prototyped message	host.set_numa_affinity_policy
Prototyped message	pool.set_custom_uefi_certificates
Deprecated message	pool.set_uefi_certificates	Use pool.set_custom_uefi_certificates instead

XAPI 23.30.0

Code name: "23.30.0".

Changes

Change	Element	Description
Prototyped message	VM.restart_device_models

XAPI 23.27.0

Code name: "23.27.0".

Changes

Change	Element	Description
Prototyped field	pool.ext_auth_max_threads
Prototyped field	pool.local_auth_max_threads
Prototyped message	pool.set_ext_auth_max_threads
Prototyped message	pool.set_local_auth_max_threads
Extended message	host.evacuate	Choose batch size of VM evacuation.

XAPI 23.25.0

Code name: "23.25.0".

Changes

Change	Element	Description
Removed message	host.apply_recommended_guidances

XAPI 23.18.0

Code name: "23.18.0".

Changes

Change	Element	Description
Prototyped field	host.latest_synced_updates_applied
Prototyped field	pool.last_update_sync
Prototyped field	pool.update_sync_day
Prototyped field	pool.update_sync_enabled
Prototyped field	pool.update_sync_frequency
Prototyped message	host.apply_recommended_guidances
Prototyped message	pool.configure_update_sync
Prototyped message	pool.set_update_sync_enabled
Removed field	Repository.up_to_date	The up_to_date field of repository was removed

XAPI 23.14.0

Code name: "23.14.0".

Changes

Change	Element	Description
Prototyped class	Observer
Prototyped field	Observer.attributes
Prototyped field	Observer.components
Prototyped field	Observer.enabled
Prototyped field	Observer.endpoints
Prototyped field	Observer.hosts
Prototyped field	Observer.uuid
Prototyped message	Observer.set_attributes
Prototyped message	Observer.set_components
Prototyped message	Observer.set_enabled
Prototyped message	Observer.set_endpoints
Prototyped message	Observer.set_hosts

XAPI 23.9.0

Code name: "23.9.0".

Changes

Change	Element	Description
Prototyped field	pool.telemetry_frequency
Prototyped field	pool.telemetry_next_collection
Prototyped field	pool.telemetry_uuid
Prototyped message	pool.reset_telemetry_uuid
Prototyped message	pool.set_telemetry_next_collection
Changed field	pool.repository_proxy_password	Changed internal_only to false

XAPI 23.1.0

Code name: "23.1.0".

Changes

Change	Element	Description
Prototyped field	VM.actions_after_softreboot

XAPI 22.37.0

Code name: "22.37.0".

Changes

Change	Element	Description
Prototyped field	pool.coordinator_bias

XAPI 22.33.0

Code name: "22.33.0".

Changes

Change	Element	Description
Prototyped field	pool.migration_compression

XAPI 22.27.0

Code name: "22.27.0".

Changes

Change	Element	Description
Prototyped field	host.https_only
Prototyped message	host.set_https_only
Prototyped message	pool.set_https_only

XAPI 22.26.0

Code name: "22.26.0".

Changes

Change	Element	Description
Prototyped class	VTPM
Prototyped field	VTPM.is_protected
Prototyped field	VTPM.is_unique
Prototyped field	VTPM.persistence_backend
Prototyped message	VTPM.create
Prototyped message	VTPM.destroy

XAPI 22.20.0

Code name: "22.20.0".

Changes

Change	Element	Description
Prototyped field	host.last_software_update

XAPI 22.19.0

Code name: "22.19.0".

Changes

Change	Element	Description
Prototyped message	message.destroy_many

XAPI 22.16.0

Code name: "22.16.0".

Changes

Change	Element	Description
Published message	pool.set_uefi_certificates	Set the UEFI certificates for a pool and all its hosts. Deprecated: use set_custom_uefi_certificates instead
Changed field	pool.uefi_certificates	Became StaticRO to be editable through new method
Deprecated field	host.uefi_certificates	Use Pool.uefi_certificates instead
Deprecated message	host.set_uefi_certificates	Use Pool.set_uefi_certificates instead

XAPI 22.12.0

Code name: "22.12.0".

Changes

Change	Element	Description
Prototyped field	Repository.gpgkey_path
Prototyped message	Repository.set_gpgkey_path

XAPI 22.5.0

Code name: "22.5.0".

Changes

Change	Element	Description
Published field	role.is_internal	Indicates whether the role is only to be assigned internally by xapi, or can be used by clients

XAPI 21.4.0

Code name: "21.4.0".

Changes

Change	Element	Description
Published message	pool.disable_repository_proxy	Disable the proxy for RPM package repositories.

XAPI 21.3.0

Code name: "21.3.0".

Changes

Change	Element	Description
Published field	pool.repository_proxy_password	Password for the authentication of the proxy used in syncing with the enabled repositories
Published field	pool.repository_proxy_url	URL of the proxy used in syncing with the enabled repositories
Published field	pool.repository_proxy_username	Username for the authentication of the proxy used in syncing with the enabled repositories
Published message	pool.configure_repository_proxy	Configure proxy for RPM package repositories.
Published message	task.set_error_info	Set the task error info
Published message	task.set_result	Set the task result

XAPI 21.2.0

Code name: "21.2.0".

Changes

Change	Element	Description
Published field	session.client_certificate	Indicates whether this session was authenticated using a client certificate

XAPI 1.329.0

Code name: "1.329.0".

Changes

Change	Element	Description
Published message	pool.sync_updates	Sync with the enabled repository

XAPI 1.318.0

Code name: "1.318.0".

Changes

Change	Element	Description
Published field	pool.client_certificate_auth_enabled	True if authentication by TLS client certificates is enabled
Published field	pool.client_certificate_auth_name	The name (CN/SAN) that an incoming client certificate must have to allow authentication
Published message	pool.disable_client_certificate_auth	Disable client certificate authentication on the pool
Published message	pool.enable_client_certificate_auth	Enable client certificate authentication on the pool

XAPI 1.313.0

Code name: "1.313.0".

Changes

Change	Element	Description
Published field	host.tls_verification_enabled	True if this host has TLS verification enabled
Extended field	message.cls	Added Certificate class

XAPI 1.307.0

Code name: "1.307.0".

Changes

Change	Element	Description
Published message	host.refresh_server_certificate	Replace the internal self-signed host certificate with a new one.

XAPI 1.304.0

Code name: "1.304.0".

Changes

Change	Element	Description
Published message	pool.check_update_readiness	Check if the pool is ready to be updated. If not, report the reasons.

XAPI 1.303.0

Code name: "1.303.0".

Changes

Change	Element	Description
Published field	VM.pending_guidances	The set of pending mandatory guidances after applying updates, which must be applied, as otherwise there may be e.g. VM failures
Published field	host.pending_guidances	The set of pending mandatory guidances after applying updates, which must be applied, as otherwise there may be e.g. VM failures

XAPI 1.301.0

Code name: "1.301.0".

Changes

Change	Element	Description
Published class	Repository	Repository for updates
Published field	Repository.binary_url	Base URL of binary packages in this repository
Published field	Repository.hash	SHA256 checksum of latest updateinfo.xml.gz in this repository if its 'update' is true
Published field	Repository.source_url	Base URL of source packages in this repository
Published field	Repository.up_to_date	True if all hosts in the pool are up to date with this repository
Published field	Repository.update	True if updateinfo.xml in this repository needs to be parsed
Published field	Repository.uuid	Unique identifier/object reference
Published field	pool.repositories	The set of currently enabled repositories
Published message	Repository.forget	Remove the repository record from the database
Published message	Repository.introduce	Add the configuration for a new repository
Published message	host.apply_updates	Apply updates from the currently enabled repository on a host
Published message	pool.add_repository	Add a repository to the enabled set
Published message	pool.remove_repository	Remove a repository from the enabled set
Published message	pool.set_repositories	Set enabled set of repositories

XAPI 1.298.0

Code name: "1.298.0".

Changes

Change	Element	Description
Published message	host.emergency_reenable_tls_verification	Reenable TLS verification for this host only

XAPI 1.297.0

Code name: "1.297.0".

Changes

Change	Element	Description
Extended message	host.evacuate	Enable migration network selection.

XAPI 1.294.0

Code name: "1.294.0".

Changes

Change	Element	Description
Published field	Certificate.name	The name of the certificate, only present on certificates of type 'ca'
Published field	Certificate.type	The type of the certificate, either 'ca', 'host' or 'host_internal'

XAPI 1.290.0

Code name: "1.290.0".

Changes

Change	Element	Description
Published field	pool.tls_verification_enabled	True iff TLS certificate verification is enabled
Published message	host.emergency_disable_tls_verification	Disable TLS verification for this host only
Published message	host.reset_server_certificate	Delete the current TLS server certificate and replace it with a new, self-signed one. This should only be used with extreme care.
Published message	pool.enable_tls_verification	Enable TLS server certificate verification
Published message	pool.install_ca_certificate	Install TLS CA certificate
Published message	pool.uninstall_ca_certificate	Uninstall TLS CA certificate
Deprecated field	pool.wlb_verify_cert	Deprecated: to enable TLS verification use Pool.enable_tls_verification instead
Deprecated message	pool.certificate_install	Use Pool.install_ca_certificate instead
Deprecated message	pool.certificate_list	Use openssl to inspect certificate
Deprecated message	pool.certificate_uninstall	Use Pool.uninstall_ca_certificate instead

XAPI 1.271.0

Code name: "1.271.0".

Changes

Change	Element	Description
Published message	host.get_sched_gran	Gets xen's sched-gran on a host
Published message	host.set_sched_gran	Sets xen's sched-gran on a host. See: https://xenbits.xen.org/docs/unstable/misc/xen-command-line.html#sched-gran-x86

XAPI 1.257.0

Code name: "1.257.0".

Changes

Change	Element	Description
Changed class	VM	Possibility to create a VM in suspended mode with a suspend_VDI set
Changed field	VBD.currently_attached	Made StaticRO to allow plugged VIF and VBD creation for Suspended VM
Changed field	VBD.device	Became static to allow plugged VBD creation for Suspended VM
Changed field	VIF.currently_attached	Made StaticRO to allow plugged VIF and VBD creation for Suspended VM
Changed field	VM.last_booted_record	Became static to allow Suspended VM creation
Changed field	VM.power_state	Made StaticRO to allow Suspended VM creation
Changed field	VM.suspend_VDI	Became static to allow Suspended VM creation

XAPI 1.250.0

Code name: "1.250.0".

Changes

Change	Element	Description
Published field	tunnel.protocol	Add protocol field to tunnel

XenServer 8 Preview

Code name: "nile-preview".

No changes...

Citrix Hypervisor 8.2 Hotfix 2

Code name: "stockholm_psr".

Changes

Change	Element	Description
Published field	pool.is_psr_pending	True if either a PSR is running or we are waiting for a PSR to be re-run
Published message	pool.rotate_secret

Citrix Hypervisor 8.2

Code name: "stockholm".

Changes

Change	Element	Description
Published class	Certificate	An X509 certificate used for TLS connections
Published field	Certificate.fingerprint	The certificate's SHA256 fingerprint / hash
Published field	Certificate.host	The host where the certificate is installed
Published field	Certificate.not_after	Date before which the certificate is valid
Published field	Certificate.not_before	Date after which the certificate is valid
Published field	Certificate.uuid	Unique identifier/object reference
Published field	PUSB.speed	USB device speed
Published field	host.certificates	List of certificates installed in the host
Published field	host.editions	List of all available product editions
Published message	host.emergency_reset_server_certificate	Delete the current TLS server certificate and replace it with a new, self-signed one. This should only be used with extreme care.
Published message	host.install_server_certificate	Install the TLS server certificate.
Published message	task.set_progress	Set the task progress
Changed message	host.set_power_on_mode	Removed iLO script
Changed message	host.set_ssl_legacy	Legacy SSL no longer supported
Deprecated field	host.ssl_legacy	Legacy SSL no longer supported
Deprecated message	pool.disable_ssl_legacy	Legacy SSL no longer supported
Removed message	pool.enable_ssl_legacy	Legacy SSL no longer supported

Citrix Hypervisor 8.1

Code name: "quebec".

Changes

Change	Element	Description
Published field	Bond.auto_update_mac	true if the MAC was taken from the primary slave when the bond was created, and false if the client specified the MAC
Published field	VGPU.PCI	Device passed through to VM, either as full device or SR-IOV virtual function
Published field	VGPU.extra_args	Extra arguments for vGPU and passed to demu
Published field	VGPU_type.compatible_types_in_vm	List of VGPU types which are compatible in one VM
Published field	host.uefi_certificates	The UEFI certificates allowing Secure Boot
Published field	pool.uefi_certificates	The UEFI certificates allowing Secure Boot
Published message	host.set_uefi_certificates	Sets the UEFI certificates on a host
Changed message	VM.assert_can_boot_here	Does additional compatibility checks when VM powerstate is not halted (e.g. CPUID). Use this before calling VM.resume or VM.pool_migrate.
Removed message	VM.snapshot_with_quiesce	VSS support has been removed

Citrix Hypervisor 8.0

Code name: "naples".

Changes

Change	Element	Description
Published field	VM.NVRAM	initial value for guest NVRAM (containing UEFI variables, etc). Cannot be changed while the VM is running
Published message	VM.add_to_NVRAM
Published message	VM.remove_from_NVRAM
Published message	VM.set_NVRAM

XenServer 7.6

Code name: "lima".

Changes

Change	Element	Description
Published class	Cluster	Cluster-wide Cluster metadata
Published class	Cluster_host	Cluster member metadata
Published class	probe_result	A set of properties that describe one result element of SR.probe. Result elements and properties can change dynamically based on changes to the SR.probe input parameters or the target.
Published class	sr_stat	A set of high-level properties associated with an SR.
Published field	Cluster.cluster_config	Contains read-only settings for the cluster, such as timeouts and other options. It can only be set at cluster create time
Published field	Cluster.cluster_hosts	A list of the cluster_host objects associated with the Cluster
Published field	Cluster.cluster_stack	Simply the string 'corosync'. No other cluster stacks are currently supported
Published field	Cluster.cluster_stack_version	Version of the cluster stack, not writable via the API. Defaults to 2 for backwards compatibility when upgrading from a cluster without this field, since such a cluster necessarily runs version 2 of corosync, the only cluster stack supported so far.
Published field	Cluster.cluster_token	The secret key used by xapi-clusterd when it talks to itself on other hosts
Published field	Cluster.other_config	Additional configuration
Published field	Cluster.pending_forget	Internal field used by Host.destroy to store the IP of cluster members marked as permanently dead but not yet removed
Published field	Cluster.pool_auto_join	True if automatically joining new pool members to the cluster. This will be `true` in the first release
Published field	Cluster.uuid	Unique identifier/object reference
Published field	Cluster_host.PIF	Reference to the PIF object
Published field	Cluster_host.cluster	Reference to the Cluster object
Published field	Cluster_host.enabled	Whether the cluster host believes that clustering should be enabled on this host. This field can be altered by calling the enable/disable message on a cluster host. Only enabled members run the underlying cluster stack. Disabled members are still considered a member of the cluster (see joined), and can be re-enabled by the user.
Published field	Cluster_host.host	Reference to the Host object
Published field	Cluster_host.joined	Whether the cluster host has joined the cluster. Contrary to enabled, a host that is not joined is not considered a member of the cluster, and hence enable and disable operations cannot be performed on this host.
Published field	Cluster_host.other_config	Additional configuration
Published field	Cluster_host.uuid	Unique identifier/object reference
Published field	probe_result.complete	True if this configuration is complete and can be used to call SR.create. False if it requires further iterative calls to SR.probe, to potentially narrow down on a configuration that can be used.
Published field	probe_result.configuration	Plugin-specific configuration which describes where and how to locate the storage repository. This may include the physical block device name, a remote NFS server and path or an RBD storage pool.
Published field	probe_result.extra_info	Additional plugin-specific information about this configuration, that might be of use for an API user. This can for example include the LUN or the WWPN.
Published field	probe_result.sr	Existing SR found for this configuration
Published field	sr_stat.clustered	Indicates whether the SR uses clustered local storage.
Published field	sr_stat.free_space	Free space on the backing storage (in bytes)
Published field	sr_stat.health	The health status of the SR.
Published field	sr_stat.name_description	Longer, human-readable description of the SR. Descriptions are generally only displayed by clients when the user is examining SRs in detail.
Published field	sr_stat.name_label	Short, human-readable label for the SR.
Published field	sr_stat.total_space	Total physical size of the backing storage (in bytes)
Published field	sr_stat.uuid	Uuid that uniquely identifies this SR, if one is available.
Published message	Cluster.create	Creates a Cluster object and one Cluster_host object as its first member
Published message	Cluster.destroy	Destroys a Cluster object and the one remaining Cluster_host member
Published message	Cluster.get_network	Returns the network used by the cluster for inter-host communication, i.e. the network shared by all cluster host PIFs
Published message	Cluster.pool_create	Attempt to create a Cluster from the entire pool
Published message	Cluster.pool_destroy	Attempt to destroy the Cluster_host objects for all hosts in the pool and then destroy the Cluster.
Published message	Cluster.pool_force_destroy	Attempt to force destroy the Cluster_host objects, and then destroy the Cluster.
Published message	Cluster.pool_resync	Resynchronise the cluster_host objects across the pool. Creates them where they need creating and then plugs them
Published message	Cluster_host.create	Add a new host to an existing cluster.
Published message	Cluster_host.destroy	Remove the host from an existing cluster. This operation is allowed even if a cluster host is not enabled.
Published message	Cluster_host.disable	Disable cluster membership for an enabled cluster host.
Published message	Cluster_host.enable	Enable cluster membership for a disabled cluster host.
Published message	Cluster_host.force_destroy	Remove a host from an existing cluster forcefully.
Published message	SR.probe_ext	Perform a backend-specific scan, using the given device_config. If the device_config is complete, then this will return a list of the SRs present of this type on the device, if any. If the device_config is partial, then a backend-specific scan will be performed, returning results that will guide the user in improving the device_config.
Changed field	Cluster.token_timeout	the unit is now seconds
Changed field	Cluster.token_timeout_coefficient	the unit is now seconds
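
The probe_result rows above describe an iterative workflow: a result with complete set to true can be passed to SR.create, while an incomplete one needs a further, narrower probe. The following is a minimal sketch of how a client might sort a list of results. It assumes each result has already been converted to a plain dictionary keyed by the field names listed above; the function name `triage_probe_results` and the choice to check `sr` before `complete` are illustrative, not part of the API.

```python
from typing import Any, Dict, List, Tuple

ProbeResult = Dict[str, Any]


def triage_probe_results(
    results: List[ProbeResult],
) -> Tuple[List[ProbeResult], List[ProbeResult], List[ProbeResult]]:
    """Split probe results into (ready, existing, incomplete).

    ready:      complete configurations with no existing SR (candidates for SR.create)
    existing:   configurations for which an SR was already found
    incomplete: configurations that need another, narrower probe
    """
    ready: List[ProbeResult] = []
    existing: List[ProbeResult] = []
    incomplete: List[ProbeResult] = []
    for result in results:
        if result.get("sr") is not None:
            # An SR already exists for this configuration (illustrative ordering).
            existing.append(result)
        elif result["complete"]:
            ready.append(result)
        else:
            incomplete.append(result)
    return ready, existing, incomplete


# Hypothetical sample data; the configuration keys are made up for illustration.
sample = [
    {"complete": False, "configuration": {"server": "nfs.example"}, "extra_info": {}, "sr": None},
    {"complete": True, "configuration": {"server": "nfs.example", "serverpath": "/a"}, "extra_info": {}, "sr": None},
    {"complete": True, "configuration": {"server": "nfs.example", "serverpath": "/b"}, "extra_info": {}, "sr": {"name_label": "old"}},
]
ready, existing, incomplete = triage_probe_results(sample)
```

A client would typically present the incomplete entries to the user, merge the chosen configuration into the next probe call, and repeat until a complete entry is found.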

XenServer 7.5

Code name: "kolkata".

Changes

Change	Element	Description
Prototyped class	Cluster
Prototyped class	Cluster_host
Prototyped class	probe_result
Prototyped class	sr_stat
Prototyped field	Cluster.cluster_config
Prototyped field	Cluster.cluster_hosts
Prototyped field	Cluster.cluster_stack
Prototyped field	Cluster.cluster_stack_version
Prototyped field	Cluster.cluster_token
Prototyped field	Cluster.other_config
Prototyped field	Cluster.pool_auto_join
Prototyped field	Cluster.token_timeout	the unit is milliseconds
Prototyped field	Cluster.token_timeout_coefficient	the unit is milliseconds
Prototyped field	Cluster.uuid
Prototyped field	Cluster_host.PIF
Prototyped field	Cluster_host.cluster
Prototyped field	Cluster_host.enabled
Prototyped field	Cluster_host.host
Prototyped field	Cluster_host.joined
Prototyped field	Cluster_host.other_config
Prototyped field	Cluster_host.uuid
Prototyped field	probe_result.complete
Prototyped field	probe_result.configuration
Prototyped field	probe_result.extra_info
Prototyped field	probe_result.sr
Prototyped field	sr_stat.clustered
Prototyped field	sr_stat.free_space
Prototyped field	sr_stat.health
Prototyped field	sr_stat.name_description
Prototyped field	sr_stat.name_label
Prototyped field	sr_stat.total_space
Prototyped field	sr_stat.uuid
Prototyped message	Cluster.create
Prototyped message	Cluster.destroy
Prototyped message	Cluster.get_network
Prototyped message	Cluster.pool_create
Prototyped message	Cluster.pool_destroy
Prototyped message	Cluster.pool_force_destroy
Prototyped message	Cluster.pool_resync
Prototyped message	Cluster_host.create
Prototyped message	Cluster_host.destroy
Prototyped message	Cluster_host.disable
Prototyped message	Cluster_host.enable
Prototyped message	Cluster_host.force_destroy
Prototyped message	SR.probe_ext
Published class	network_sriov	A network_sriov object, which connects a logical PIF and a physical PIF
Published field	PCI.driver_name	Driver name
Published field	PIF.PCI	Link to underlying PCI device
Published field	PIF.sriov_logical_PIF_of	Indicates which network_sriov this interface is the logical PIF of
Published field	PIF.sriov_physical_PIF_of	Indicates which network_sriov this interface is the physical PIF of
Published field	VM.domain_type	The field is now valid
Published field	VM_metrics.current_domain_type	This field now contains valid data
Published field	host.iscsi_iqn	The initiator IQN for the host
Published field	host.multipathing	Specifies whether multipathing is enabled
Published field	network_sriov.configuration_mode	The mode used to configure network SR-IOV
Published field	network_sriov.logical_PIF	The logical PIF to connect to the SR-IOV network after enabling SR-IOV on the physical PIF
Published field	network_sriov.physical_PIF	The PIF that has SR-IOV enabled
Published field	network_sriov.requires_reboot	Indicates whether the host needs to be rebooted before SR-IOV is enabled on the physical PIF
Published message	VM.set_domain_type	Set the VM.domain_type field of the given VM, which will take effect when it is next started
Published message	host.set_iscsi_iqn	Sets the initiator IQN for the host
Published message	host.set_multipathing	Specifies whether multipathing is enabled
Published message	network_sriov.create	Enable SR-IOV on the specified PIF. It will create a network_sriov based on the specified PIF and automatically create a logical PIF to connect to the specified network.
Published message	network_sriov.destroy	Disable SR-IOV on the specified PIF. It will destroy the network_sriov and the logical PIF accordingly.
Published message	network_sriov.get_remaining_capacity	Get the number of free SR-IOV VFs on the associated PIF
Deprecated field	VM.HVM_boot_policy	Replaced by VM.domain_type
Deprecated message	VM.set_HVM_boot_policy	Replaced by VM.set_domain_type

XenServer 7.4

Code name: "jura".

Changes

Change	Element	Description
Prototyped field	VM.domain_type	Internal-only field; not yet in the public API
Prototyped field	VM_metrics.current_domain_type	Not yet implemented (for future use)

XenServer 7.3

Code name: "inverness".

Changes

Change	Element	Description
Published class	PUSB	A physical USB device
Published class	USB_group	A group of compatible USBs across the resource pool
Published class	VUSB	Describes the VUSB device
Published class	vdi_nbd_server_info	Details for connecting to a VDI using the Network Block Device protocol
Published field	PGPU.compatibility_metadata	PGPU metadata to determine whether a VGPU can migrate between two PGPUs
Published field	PIF.igmp_snooping_status	The IGMP snooping status of the corresponding network bridge
Published field	PUSB.USB_group	USB group the PUSB is contained in
Published field	PUSB.description	USB device description
Published field	PUSB.host	Physical machine that owns the USB device
Published field	PUSB.other_config	additional configuration
Published field	PUSB.passthrough_enabled	enabled for passthrough
Published field	PUSB.path	port path of USB device
Published field	PUSB.product_desc	product description of the USB device
Published field	PUSB.product_id	product id of the USB device
Published field	PUSB.serial	serial of the USB device
Published field	PUSB.uuid	Unique identifier/object reference
Published field	PUSB.vendor_desc	vendor description of the USB device
Published field	PUSB.vendor_id	vendor id of the USB device
Published field	PUSB.version	USB device version
Published field	USB_group.PUSBs	List of PUSBs in the group
Published field	USB_group.VUSBs	List of VUSBs using the group
Published field	USB_group.name_description	a notes field containing human-readable description
Published field	USB_group.name_label	a human-readable name
Published field	USB_group.other_config	Additional configuration
Published field	USB_group.uuid	Unique identifier/object reference
Published field	VDI.cbt_enabled	True if changed blocks are tracked for this VDI
Published field	VGPU.compatibility_metadata	VGPU metadata to determine whether a VGPU can migrate between two PGPUs
Published field	VUSB.USB_group	USB group used by the VUSB
Published field	VUSB.VM	VM that owns the VUSB
Published field	VUSB.other_config	Additional configuration
Published field	VUSB.uuid	Unique identifier/object reference
Published field	host.PUSBs	List of physical USBs in the host
Published field	network.purpose	Set of purposes for which the server will use this network
Published field	pool.igmp_snooping_enabled	true if IGMP snooping is enabled in the pool, false otherwise.
Published field	pool_update.enforce_homogeneity	Flag - if true, all hosts in a pool must apply this update
Published field	pool_update.other_config	additional configuration
Published field	vdi_nbd_server_info.address	An address on which the server can be reached; this can be IPv4, IPv6, or a DNS name.
Published field	vdi_nbd_server_info.cert	The TLS certificate of the server
Published field	vdi_nbd_server_info.exportname	The exportname to request over NBD. This holds details including an authentication token, so it must be protected appropriately. Clients should regard the exportname as an opaque string or token.
Published field	vdi_nbd_server_info.port	The TCP port
Published field	vdi_nbd_server_info.subject	For convenience, this redundant field holds a DNS (hostname) subject of the certificate. This can be a wildcard, but only for a certificate that has a wildcard subject and no concrete hostname subjects.
Published message	PUSB.scan
Published message	PUSB.set_passthrough_enabled
Published message	USB_group.create
Published message	USB_group.destroy
Published message	VDI.data_destroy	Delete the data of the snapshot VDI, but keep its changed block tracking metadata. When successful, this call changes the type of the VDI to cbt_metadata. This operation is idempotent: calling it on a VDI of type cbt_metadata results in a no-op, and no error will be thrown.
Published message	VDI.disable_cbt	Disable changed block tracking for the VDI. This call is only allowed on VDIs that support enabling CBT. It is an idempotent operation - disabling CBT for a VDI for which CBT is not enabled results in a no-op, and no error will be thrown.
Published message	VDI.enable_cbt	Enable changed block tracking for the VDI. This call is idempotent - enabling CBT for a VDI for which CBT is already enabled results in a no-op, and no error will be thrown.
Published message	VDI.get_nbd_info	Get details specifying how to access this VDI via a Network Block Device server. For each of a set of NBD server addresses on which the VDI is available, the return value set contains a vdi_nbd_server_info object that contains an exportname to request once the NBD connection is established, and connection details for the address. An empty list is returned if there is no network that has a PIF on a host with access to the relevant SR, or if no such network has been assigned an NBD-related purpose in its purpose field. To access the given VDI, any of the vdi_nbd_server_info objects can be used to make a connection to a server, and then the VDI will be available by requesting the exportname.
Published message	VDI.list_changed_blocks	Compare two VDIs in 64k block increments and report which blocks differ. This operation is not allowed when vdi_to is attached to a VM.
Published message	VM.set_bios_strings	Set custom BIOS strings for this VM. The VM will be given a default set of BIOS strings, only some of which can be overridden by the supplied values. Allowed keys are: 'bios-vendor', 'bios-version', 'system-manufacturer', 'system-product-name', 'system-version', 'system-serial-number', 'enclosure-asset-tag', 'baseboard-manufacturer', 'baseboard-product-name', 'baseboard-version', 'baseboard-serial-number', 'baseboard-asset-tag', 'baseboard-location-in-chassis'
Published message	VUSB.create	Create a new VUSB record in the database only
Published message	VUSB.destroy	Removes a VUSB record from the database
Published message	VUSB.unplug	Unplug the VUSB device from the VM.
Published message	network.add_purpose	Give a network a new purpose (if not present already)
Published message	network.remove_purpose	Remove a purpose from a network (if present)
Published message	pool.management_reconfigure	Reconfigure the management network interface for all Hosts in the Pool
Published message	pool.set_igmp_snooping_enabled	Enable or disable IGMP Snooping on the pool.
Changed message	host.get_server_certificate	Now available to all RBAC roles.
Deprecated class	crashdump
Deprecated message	VM.get_boot_record	Use the current VM record/fields instead
Removed message	VDI.resize_online	Online VDI resize is not supported by any of the storage backends.
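
To make the "64k block increments" wording of VDI.list_changed_blocks concrete, the sketch below compares two in-memory byte buffers block by block and reports which blocks differ. It only illustrates the granularity of the comparison: the real call works on VDIs with changed block tracking, and the encoding of its return value is not described in this table. The name `changed_blocks` and the use of byte buffers are assumptions made for the example.

```python
from typing import List

BLOCK_SIZE = 64 * 1024  # 64 KiB, the increment named in the description


def changed_blocks(base: bytes, target: bytes, block_size: int = BLOCK_SIZE) -> List[bool]:
    """Return one flag per block: True if the block differs between base and target."""
    if len(base) != len(target):
        raise ValueError("buffers must have the same length")
    flags: List[bool] = []
    for offset in range(0, len(base), block_size):
        end = offset + block_size
        flags.append(base[offset:end] != target[offset:end])
    return flags


# Three blocks of zeros; one byte is modified in the middle block.
base = bytes(3 * BLOCK_SIZE)
modified = bytearray(base)
modified[BLOCK_SIZE + 5] = 1
flags = changed_blocks(base, bytes(modified))
```

A backup client could use such a per-block view to read only the blocks that changed since the previous snapshot.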

XenServer 7.2

Code name: "falcon".

Changes

Change	Element	Description
Published class	Feature	A new piece of functionality
Published class	SDN_controller	Describes the SDN controller that is to connect with the pool
Published class	VMSS	VM Snapshot Schedule
Published field	Feature.enabled	Indicates whether the feature is enabled
Published field	Feature.experimental	Indicates whether the feature is experimental (as opposed to stable and fully supported)
Published field	Feature.host	The host where this feature is available
Published field	Feature.uuid	Unique identifier/object reference
Published field	Feature.version	The version of this feature
Published field	SDN_controller.address	IP address of the controller
Published field	SDN_controller.port	TCP port of the controller
Published field	SDN_controller.protocol	Protocol to connect with SDN controller
Published field	SDN_controller.uuid	Unique identifier/object reference
Published field	VM.is_default_template	Identifies default templates
Published field	VM.is_vmss_snapshot	true if this snapshot was created by the snapshot schedule
Published field	VM.snapshot_schedule	Ref pointing to a snapshot schedule for this VM
Published field	host.features	List of features available on this host
Published field	network.managed	true if the bridge is managed by xapi
Published message	SDN_controller.forget	Remove the OVS manager of the pool and destroy the db record.
Published message	SDN_controller.introduce	Introduce an SDN controller to the pool.
Published message	VM.set_snapshot_schedule	Set the value of the snapshot schedule field
Published message	VMSS.add_to_schedule
Published message	VMSS.remove_from_schedule
Published message	VMSS.set_frequency	Set the value of the frequency field
Published message	VMSS.set_last_run_time
Published message	VMSS.set_retained_snapshots
Published message	VMSS.set_schedule
Published message	VMSS.set_type
Published message	VMSS.snapshot_now	This call executes the snapshot schedule immediately
Published message	task.set_status	Set the task status
Changed field	network.bridge	Added to the constructor (network.create)
Deprecated field	pool.vswitch_controller	Deprecated: set the IP address of the vswitch controller in SDN_controller instead.
Deprecated message	pool.set_vswitch_controller	Deprecated: use 'SDN_controller.introduce' and 'SDN_controller.forget' instead.

XenServer 7.1

Code name: "ely".

Changes

Change	Element	Description
Published class	PVS_cache_storage	Describes the storage that is available to a PVS site for caching purposes
Published class	PVS_proxy	a proxy connects a VM/VIF with a PVS site
Published class	PVS_server	individual machine serving provisioning (block) data
Published class	PVS_site	machines serving blocks of data for provisioning VMs
Published class	pool_update	Pool-wide updates to the host software
Published field	PVS_cache_storage.SR	SR providing storage for the PVS cache
Published field	PVS_cache_storage.VDI	The VDI used for caching
Published field	PVS_cache_storage.host	The host on which this object defines PVS cache storage
Published field	PVS_cache_storage.site	The PVS_site for which this object defines the storage
Published field	PVS_cache_storage.size	The size of the cache VDI (in bytes)
Published field	PVS_cache_storage.uuid	Unique identifier/object reference
Published field	PVS_proxy.VIF	VIF of the VM using the proxy
Published field	PVS_proxy.currently_attached	true = VM is currently proxied
Published field	PVS_proxy.site	PVS site this proxy is part of
Published field	PVS_proxy.status	The run-time status of the proxy
Published field	PVS_proxy.uuid	Unique identifier/object reference
Published field	PVS_server.addresses	IPv4 addresses of this server
Published field	PVS_server.first_port	First UDP port accepted by this server
Published field	PVS_server.last_port	Last UDP port accepted by this server
Published field	PVS_server.site	PVS site this server is part of
Published field	PVS_server.uuid	Unique identifier/object reference
Published field	PVS_site.PVS_uuid	Unique identifier of the PVS site, as configured in PVS
Published field	PVS_site.cache_storage	The SR used by PVS proxy for the cache
Published field	PVS_site.name_description	a notes field containing human-readable description
Published field	PVS_site.name_label	a human-readable name
Published field	PVS_site.proxies	The set of proxies associated with the site
Published field	PVS_site.servers	The set of PVS servers in the site
Published field	PVS_site.uuid	Unique identifier/object reference
Published field	VM.reference_label	Textual reference to the template used to create a VM. This can be used by clients in need of an immutable reference to the template since the latter's uuid and name_label may change, for example, after a package installation or upgrade.
Published field	VM.requires_reboot	Indicates whether a VM requires a reboot in order to update its configuration, e.g. its memory allocation.
Published field	VM_metrics.hvm	hardware virtual machine
Published field	VM_metrics.nested_virt	VM supports nested virtualisation
Published field	VM_metrics.nomigrate	VM is immobile and can't migrate between hosts
Published field	host.control_domain	The control domain (domain 0)
Published field	host.updates	Set of updates
Published field	host.updates_requiring_reboot	List of updates which require reboot
Published field	pool.live_patching_disabled	Pool-wide flag showing whether the live patching feature is disabled.
Published field	pool_patch.pool_update	A reference to the associated pool_update object
Published field	pool_update.after_apply_guidance	What the client should do after this update has been applied.
Published field	pool_update.hosts	The hosts that have applied this update.
Published field	pool_update.installation_size	Size of the update in bytes
Published field	pool_update.key	GPG key of the update
Published field	pool_update.version	Update version number
Published message	PVS_proxy.create	Configure a VM/VIF to use a PVS proxy
Published message	PVS_proxy.destroy	remove (or switch off) a PVS proxy for this VM
Published message	PVS_server.forget	forget a PVS server
Published message	PVS_server.introduce	introduce new PVS server
Published message	PVS_site.forget	Remove a site's metadata
Published message	PVS_site.introduce	Introduce new PVS site
Published message	PVS_site.set_PVS_uuid	Update the PVS UUID of the PVS site
Published message	VIF.move	Move the specified VIF to the specified network, even while the VM is running
Published message	VM.set_memory	Set the memory allocation of this VM. Sets all of memory_static_max, memory_dynamic_min, and memory_dynamic_max to the given value, and leaves memory_static_min untouched.
Published message	host.call_extension	Call an API extension on this host
Published message	host.has_extension	Return true if the extension is available on the host
Published message	pool_update.apply	Apply the selected update to a host
Published message	pool_update.destroy	Removes the database entry. Only works on an unapplied update.
Published message	pool_update.introduce	Introduce update VDI
Published message	pool_update.pool_apply	Apply the selected update to all hosts in the pool
Published message	pool_update.pool_clean	Removes the update's files from all hosts in the pool, but does not revert the update
Published message	pool_update.precheck	Execute the precheck stage of the selected update on a host
Changed message	VM.set_VCPUs_number_live	Unless the feature is explicitly enabled for every host in the pool, this fails with Api_errors.license_restriction.
Deprecated class	host_patch
Deprecated class	pool_patch
Deprecated field	VDI.parent	The field was never used.
Deprecated field	host.patches
Deprecated message	host.refresh_pack_info	Use Pool_update.resync_host instead
Deprecated message	pool_patch.apply
Deprecated message	pool_patch.clean
Deprecated message	pool_patch.clean_on_host
Deprecated message	pool_patch.destroy
Deprecated message	pool_patch.pool_apply
Deprecated message	pool_patch.pool_clean
Deprecated message	pool_patch.precheck
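
The VM.set_memory row above states which memory fields the call touches. As a minimal sketch of that rule, the function below applies it to a plain dictionary standing in for a VM record. The dictionary representation and the name `apply_set_memory` are assumptions for illustration; any validation the real call performs is not modelled.

```python
from typing import Dict


def apply_set_memory(vm: Dict[str, int], value: int) -> Dict[str, int]:
    """Return a copy of the record with the fields VM.set_memory is described as setting."""
    updated = dict(vm)
    for field in ("memory_static_max", "memory_dynamic_min", "memory_dynamic_max"):
        updated[field] = value
    # memory_static_min is deliberately left untouched, as the description says.
    return updated


GiB = 1024 ** 3
before = {
    "memory_static_min": 1 * GiB,
    "memory_dynamic_min": 2 * GiB,
    "memory_dynamic_max": 4 * GiB,
    "memory_static_max": 4 * GiB,
}
after = apply_set_memory(before, 8 * GiB)
```

After the call, the dynamic range collapses to a single value equal to the static maximum, so the VM has a fixed memory allocation.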

XenServer 7.0

Code name: "dundee".

Changes

Change	Element	Description
Published class	LVHD	LVHD SR specific operations
Published field	PIF.capabilities	Additional capabilities on the interface.
Published field	SM.required_cluster_stack	The storage plugin requires that one of these cluster stacks is configured and running.
Published field	SR.clustered	True if the SR is using aggregated local storage
Published field	SR.is_tools_sr	True if this is the SR that contains the Tools ISO VDIs
Published field	VDI.is_tools_iso	Whether this VDI is a Tools ISO
Published field	VGPU.scheduled_to_be_resident_on	The PGPU on which this VGPU is scheduled to run
Published field	VGPU_type.experimental	Indicates whether VGPUs of this type should be considered experimental
Published field	VGPU_type.identifier	Key used to identify VGPU types and avoid creating duplicates - this field is used internally and not intended for interpretation by API clients
Published field	VGPU_type.implementation	The internal implementation of this VGPU type
Published field	VIF.ipv4_addresses	IPv4 addresses in CIDR format
Published field	VIF.ipv4_configuration_mode	Determines whether IPv4 addresses are configured on the VIF
Published field	VIF.ipv4_gateway	IPv4 gateway (the empty string means that no gateway is set)
Published field	VIF.ipv6_addresses	IPv6 addresses in CIDR format
Published field	VIF.ipv6_configuration_mode	Determines whether IPv6 addresses are configured on the VIF
Published field	VIF.ipv6_gateway	IPv6 gateway (the empty string means that no gateway is set)
Published field	VM.has_vendor_device	When an HVM guest starts, this controls the presence of the emulated C000 PCI device which triggers Windows Update to fetch or update PV drivers.
Published field	VM_guest_metrics.PV_drivers_detected	At least one of the guest's devices has successfully connected to the backend.
Published field	VM_guest_metrics.can_use_hotplug_vbd	To be used where relevant and available instead of checking PV driver version.
Published field	VM_guest_metrics.can_use_hotplug_vif	To be used where relevant and available instead of checking PV driver version.
Published field	host.ssl_legacy	Allow SSLv3 protocol and ciphersuites as used by older server versions. This controls both incoming and outgoing connections. When this is set to a different value, the host immediately restarts its SSL/TLS listening service; typically this takes less than a second but existing connections to it will be broken. API login sessions will remain valid.
Published field	pool.cpu_info	Details about the physical CPUs on the pool
Published field	pool.guest_agent_config	Pool-wide guest agent configuration information
Published field	pool.ha_cluster_stack	The HA cluster stack that is currently in use. Only valid when HA is enabled.
Published field	pool.health_check_config	Configuration for the automatic health check feature
Published field	pool.policy_no_vendor_device	This field was consulted when VM.create did not specify a value for 'has_vendor_device'; VM.create now uses a simple default and no longer consults this value.
Published field	task.backtrace	Function call trace for debugging.
Published message	LVHD.enable_thin_provisioning	Upgrades an LVHD SR to enable thin-provisioning. Future VDIs created in this SR will be thinly-provisioned, although existing VDIs will be left alone. Note that the SR must be attached to the SRmaster for upgrade to work.
Published message	SR.forget_data_source_archives	Forget the recorded statistics related to the specified data source
Published message	SR.get_data_sources
Published message	SR.query_data_source	Query the latest value of the specified data source
Published message	SR.record_data_source	Start recording the specified data source
Published message	VIF.configure_ipv4	Configure IPv4 settings for this virtual interface
Published message	VIF.configure_ipv6	Configure IPv6 settings for this virtual interface
Published message	VM.import	Import an XVA from a URI
Published message	VM.set_has_vendor_device	Controls whether, when the VM starts in HVM mode, its virtual hardware will include the emulated PCI device for which drivers may be available through Windows Update. Usually this should never be changed on a VM on which Windows has been installed: changing it on such a VM is likely to lead to a crash on next start.
Published message	host.set_ssl_legacy	Enable/disable SSLv3 for interoperability with older server versions. When this is set to a different value, the host immediately restarts its SSL/TLS listening service; typically this takes less than a second but existing connections to it will be broken. API login sessions will remain valid.
Published message	pool.add_to_guest_agent_config	Add a key-value pair to the pool-wide guest agent configuration
Published message	pool.disable_ssl_legacy	Sets ssl_legacy false on each host, pool-master last. See Host.ssl_legacy and Host.set_ssl_legacy.
Published message	pool.has_extension	Return true if the extension is available on the pool
Published message	pool.remove_from_guest_agent_config	Remove a key-value pair from the pool-wide guest agent configuration
Published message	session.create_from_db_file
Deprecated field	VM_guest_metrics.PV_drivers_up_to_date	Deprecated in favour of PV_drivers_detected, and redefined in terms of it
Deprecated message	pool.enable_ssl_legacy	Legacy SSL will soon cease to be supported
Removed message	host.reset_cpu_features	Manual CPU feature setting was removed
Removed message	host.set_cpu_features	Manual CPU feature setting was removed
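
The VIF rows above say that addresses are given in CIDR format and that an empty gateway string means no gateway is set. The sketch below shows one way a client might parse such values with Python's standard `ipaddress` module. The function name `parse_vif_ipv4` and the idea of parsing on the client side are assumptions for illustration.

```python
import ipaddress
from typing import List, Optional, Tuple


def parse_vif_ipv4(
    addresses: List[str], gateway: str
) -> Tuple[List[ipaddress.IPv4Interface], Optional[ipaddress.IPv4Address]]:
    """Parse CIDR-formatted addresses and a gateway string ("" means no gateway)."""
    interfaces = [ipaddress.IPv4Interface(a) for a in addresses]
    parsed_gateway = ipaddress.IPv4Address(gateway) if gateway else None
    return interfaces, parsed_gateway


interfaces, gateway = parse_vif_ipv4(["192.168.0.10/24"], "192.168.0.1")
no_gw_interfaces, no_gateway = parse_vif_ipv4(["10.0.0.5/8"], "")
```

Invalid input raises `ValueError` (through its `ipaddress` subclasses), which lets a client reject a malformed address before calling VIF.configure_ipv4.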

XenServer 6.5 SP1 Hotfix 31

Code name: "indigo".

Changes

Change	Element	Description
Published message	host.license_add	Functionality for parsing license files re-added
Published message	host.license_remove	Remove any license file from the specified host, and switch that host to the unlicensed edition

XenServer 6.5 SP1

Code name: "cream".

Changes

Change	Element	Description
Published field	PGPU.dom0_access	The accessibility of this device from dom0
Published field	PGPU.is_system_display_device	Is this device the system display device
Published field	VM.hardware_platform_version	The host virtual hardware platform version the VM can run on
Published field	host.display	indicates whether the host is configured to output its console to a physical display device
Published field	host.virtual_hardware_platform_versions	The set of versions of the virtual hardware platform that the host can offer to its guests
Published message	PGPU.disable_dom0_access
Published message	PGPU.enable_dom0_access
Published message	VM.call_plugin	Call an API plugin on this vm
Published message	host.disable_display	Disable console output to the physical display device next time this host boots
Published message	host.enable_display	Enable console output to the physical display device next time this host boots

XenServer 6.5

Code name: "creedence".

Changes

Change	Element	Description
Published field	PIF.properties	Additional configuration properties for the interface.
Published field	network.assigned_ips	The IP addresses assigned to VIFs on networks that have active xapi-managed DHCP
Published message	PIF.set_property	Set the value of a property of the PIF
Published message	VM.get_SRs_required_for_recovery	List all the SRs that are required for the VM to be recovered
Published message	VM_appliance.get_SRs_required_for_recovery	Get the list of SRs required by the VM appliance to recover.

XenServer 6.2 SP1 Hotfix 11

Code name: "clearwater-whetstone".

Changes

Change	Element	Description
Published field	PCI.subsystem_device_name	Subsystem device name
Published field	PCI.subsystem_vendor_name	Subsystem vendor name

XenServer 6.2 SP1 Hotfix 4

Code name: "clearwater-felton".

Changes

Change	Element	Description
Extended message	VDI.copy	The copy can now be performed into a pre-created VDI. It is now possible to request copying only changed blocks from a base VDI

XenServer 6.2 SP1

Code name: "vgpu-productisation".

Changes

Change	Element	Description
Published field	GPU_group.enabled_VGPU_types	vGPU types enabled on at least one of the pGPUs in this group
Published field	GPU_group.supported_VGPU_types	vGPU types supported on at least one of the pGPUs in this group
Published field	PGPU.supported_VGPU_max_capacities	A map relating each VGPU type supported on this GPU to the maximum number of VGPUs of that type which can run simultaneously on this GPU
Published field	PIF.managed	Indicates whether the interface is managed by xapi. If it is not, then xapi will not configure the interface, the commands PIF.plug/unplug/reconfigure_ip(v6) cannot be used, nor can the interface be bonded or have VLANs based on top through xapi.
Published field	VGPU_type.enabled_on_GPU_groups	List of GPU groups in which at least one PGPU has this VGPU type enabled
Published field	VGPU_type.max_resolution_x	Maximum resolution (width) supported by the VGPU type
Published field	VGPU_type.max_resolution_y	Maximum resolution (height) supported by the VGPU type
Published field	VGPU_type.supported_on_GPU_groups	List of GPU groups in which at least one PGPU supports this VGPU type

XenServer 6.2 SP1 Tech-Preview

Code name: "vgpu-tech-preview".

Changes

Change	Element	Description
Published class	VGPU_type	A type of virtual GPU
Published field	GPU_group.allocation_algorithm	Current allocation of vGPUs to pGPUs for this group
Published field	PGPU.enabled_VGPU_types	List of VGPU types which have been enabled for this PGPU
Published field	PGPU.resident_VGPUs	List of VGPUs running on this PGPU
Published field	PGPU.supported_VGPU_types	List of VGPU types supported by the underlying hardware
Published field	VGPU.resident_on	The PGPU on which this VGPU is running
Published field	VGPU.type	Preset type for this VGPU
Published field	VGPU_type.VGPUs	List of VGPUs of this type
Published field	VGPU_type.enabled_on_PGPUs	List of PGPUs that have this VGPU type enabled
Published field	VGPU_type.framebuffer_size	Framebuffer size of the VGPU type, in bytes
Published field	VGPU_type.max_heads	Maximum number of displays supported by the VGPU type
Published field	VGPU_type.model_name	Model name associated with the VGPU type
Published field	VGPU_type.supported_on_PGPUs	List of PGPUs that support this VGPU type
Published field	VGPU_type.uuid	Unique identifier/object reference
Published field	VGPU_type.vendor_name	Name of VGPU vendor
Published message	GPU_group.get_remaining_capacity
Published message	PGPU.add_enabled_VGPU_types
Published message	PGPU.get_remaining_capacity
Published message	PGPU.remove_enabled_VGPU_types
Published message	PGPU.set_GPU_group
Published message	PGPU.set_enabled_VGPU_types

XenServer 6.2

Code name: "clearwater".

Changes

Change	Element	Description
Published field	SM.features	capabilities of the SM plugin, with capability version numbers
Published field	VM.generation_id	Generation ID of the VM
Published field	session.originator	a key string provided by an API user to distinguish itself from other users sharing the same login name
Published message	VM.shutdown	Attempts a clean shutdown of the VM first and, if that fails, performs a hard shutdown on it.
Published message	host.declare_dead	Declare that a host is dead. This is a dangerous operation, and should only be called if the administrator is absolutely sure the host is definitely dead
Published message	pool.apply_edition	Apply an edition to all hosts in the pool
Published message	pool.get_license_state	This call returns the license state for the pool
Deprecated field	SM.capabilities	Use SM.features instead
Deprecated field	VM.protection_policy	The VMPR feature was removed
Removed class	VMPP	The VMPR feature was removed
Removed field	VM.is_snapshot_from_vmpp	The VMPR feature was removed
Removed field	VMPP.VMs	The VMPR feature was removed
Removed field	VMPP.alarm_config	The VMPR feature was removed
Removed field	VMPP.archive_frequency	The VMPR feature was removed
Removed field	VMPP.archive_last_run_time	The VMPR feature was removed
Removed field	VMPP.archive_schedule	The VMPR feature was removed
Removed field	VMPP.archive_target_config	The VMPR feature was removed
Removed field	VMPP.archive_target_type	The VMPR feature was removed
Removed field	VMPP.backup_frequency	The VMPR feature was removed
Removed field	VMPP.backup_last_run_time	The VMPR feature was removed
Removed field	VMPP.backup_retention_value	The VMPR feature was removed
Removed field	VMPP.backup_schedule	The VMPR feature was removed
Removed field	VMPP.backup_type	The VMPR feature was removed
Removed field	VMPP.is_alarm_enabled	The VMPR feature was removed
Removed field	VMPP.is_archive_running	The VMPR feature was removed
Removed field	VMPP.is_backup_running	The VMPR feature was removed
Removed field	VMPP.is_policy_enabled	The VMPR feature was removed
Removed field	VMPP.recent_alerts	The VMPR feature was removed
Removed field	VMPP.uuid	The VMPR feature was removed
Removed message	VM.set_protection_policy	The VMPR feature was removed
Removed message	VMPP.add_to_alarm_config	The VMPR feature was removed
Removed message	VMPP.add_to_archive_schedule	The VMPR feature was removed
Removed message	VMPP.add_to_archive_target_config	The VMPR feature was removed
Removed message	VMPP.add_to_backup_schedule	The VMPR feature was removed
Removed message	VMPP.archive_now	The VMPR feature was removed
Removed message	VMPP.get_alerts	The VMPR feature was removed
Removed message	VMPP.protect_now	The VMPR feature was removed
Removed message	VMPP.remove_from_alarm_config	The VMPR feature was removed
Removed message	VMPP.remove_from_archive_schedule	The VMPR feature was removed
Removed message	VMPP.remove_from_archive_target_config	The VMPR feature was removed
Removed message	VMPP.remove_from_backup_schedule	The VMPR feature was removed
Removed message	VMPP.set_alarm_config	The VMPR feature was removed
Removed message	VMPP.set_archive_frequency	The VMPR feature was removed
Removed message	VMPP.set_archive_last_run_time	The VMPR feature was removed
Removed message	VMPP.set_archive_schedule	The VMPR feature was removed
Removed message	VMPP.set_archive_target_config	The VMPR feature was removed
Removed message	VMPP.set_archive_target_type	The VMPR feature was removed
Removed message	VMPP.set_backup_frequency	The VMPR feature was removed
Removed message	VMPP.set_backup_last_run_time	The VMPR feature was removed
Removed message	VMPP.set_backup_retention_value	The VMPR feature was removed
Removed message	VMPP.set_backup_schedule	The VMPR feature was removed
Removed message	VMPP.set_is_alarm_enabled	The VMPR feature was removed
Removed message	host.license_apply	Free licenses no longer handled by xapi
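
The VM.shutdown entry above describes a clean-then-hard fallback. The following is a minimal client-side sketch of that behaviour, not xapi's implementation. The `clean_shutdown` and `hard_shutdown` callables are hypothetical stand-ins for whatever the client binding exposes, and catching a bare `Exception` is a simplification.

```python
def shutdown_with_fallback(clean_shutdown, hard_shutdown, vm):
    """Try a clean shutdown of `vm`; fall back to a hard shutdown on failure.

    Both callables are hypothetical stand-ins supplied by the caller.
    Returns "clean" or "hard" to report which path was taken.
    """
    try:
        clean_shutdown(vm)
        return "clean"
    except Exception:
        # Simplification: a real client would catch only the failure
        # type raised by its binding.
        hard_shutdown(vm)
        return "hard"
```

Passing the two operations in as callables keeps the sketch independent of any particular client library.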

XenServer 6.1

Code name: "tampa".

Changes

Change	Element	Description
Published field	Bond.links_up	Number of links up in this bond
Published field	Bond.properties	Additional configuration properties specific to the bond mode.
Published field	PIF.IPv6	IPv6 address
Published field	PIF.ipv6_configuration_mode	Sets if and how this interface gets an IPv6 address
Published field	PIF.ipv6_gateway	IPv6 gateway
Published field	PIF.primary_address_type	Which protocol should define the primary address of this interface
Published field	VIF.ipv4_allowed	A list of IPv4 addresses which can be used to filter traffic passing through this VIF
Published field	VIF.ipv6_allowed	A list of IPv6 addresses which can be used to filter traffic passing through this VIF
Published field	VIF.locking_mode	current locking mode of the VIF
Published field	blob.public	True if the blob is publicly accessible
Published field	host.guest_VCPUs_params	VCPUs params to apply to all resident guests
Published field	network.default_locking_mode	The network will use this value to determine the behaviour of all VIFs where locking_mode = default
Published message	Bond.set_property	Set the value of a property of the bond
Published message	PIF.reconfigure_ipv6	Reconfigure the IPv6 address settings for this interface
Published message	PIF.set_primary_address_type	Change the primary address type used by this PIF
Published message	VDI.pool_migrate	Migrate a VDI, which may be attached to a running guest, to a different SR. The destination SR must be visible to the guest.
Published message	VIF.add_ipv4_allowed	Associates an IPv4 address with this VIF
Published message	VIF.add_ipv6_allowed	Associates an IPv6 address with this VIF
Published message	VIF.remove_ipv4_allowed	Removes an IPv4 address from this VIF
Published message	VIF.remove_ipv6_allowed	Removes an IPv6 address from this VIF
Published message	VIF.set_ipv4_allowed	Set the IPv4 addresses to which traffic on this VIF can be restricted
Published message	VIF.set_ipv6_allowed	Set the IPv6 addresses to which traffic on this VIF can be restricted
Published message	VIF.set_locking_mode	Set the locking mode for this VIF
Published message	VM.assert_can_migrate	Assert whether a VM can be migrated to the specified destination.
Published message	VM.import_convert	Import using a conversion service.
Published message	VM.migrate_send	Migrate the VM to another host. This can only be called when the specified VM is in the Running state.
Published message	VM.query_services	Query the system services advertised by this VM and register them. This can only be applied to a system domain.
Published message	event.inject	Injects an artificial event on the given object and returns the corresponding ID in the form of a token, which can be used as a point of reference for database events. For example, to check whether an object has reached the right state before attempting an operation, one can inject an artificial event on the object and wait until the token returned by consecutive event.from calls is lexicographically greater than the one returned by event.inject.
Published message	host.get_management_interface	Returns the management interface for the specified host
Published message	host.migrate_receive	Prepare to receive a VM, returning a token which can be passed to VM.migrate_send.
Published message	network.set_default_locking_mode	Set the default locking mode for VIFs attached to this network
Published message	pool_patch.clean_on_host	Removes the patch's files from the specified host
Published message	pool_patch.pool_clean	Removes the patch's files from all hosts in the pool, but does not remove the database entries
Deprecated message	VM.get_cooperative
Deprecated message	host.get_uncooperative_resident_VMs
Removed class	VBD_metrics	Disabled in favour of RRD
Removed class	VIF_metrics	Disabled in favour of RRDs
Removed field	PIF_metrics.io_read_kbs	Disabled and replaced by RRDs
Removed field	PIF_metrics.io_write_kbs	Disabled and replaced by RRDs
Removed field	VBD.metrics	Disabled in favour of RRDs
Removed field	VBD_metrics.io_read_kbs	Disabled and replaced by RRDs
Removed field	VBD_metrics.io_write_kbs	Disabled and replaced by RRDs
Removed field	VBD_metrics.last_updated	Disabled in favour of RRD
Removed field	VBD_metrics.other_config	Disabled in favour of RRD
Removed field	VIF.metrics	Disabled in favour of RRDs
Removed field	VIF_metrics.io_read_kbs	Disabled and replaced by RRDs
Removed field	VIF_metrics.io_write_kbs	Disabled and replaced by RRDs
Removed field	VM_metrics.VCPUs_utilisation	Disabled in favour of RRDs
Removed field	host_metrics.memory_free	Disabled in favour of RRD
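
The event.inject entry above describes a synchronisation pattern: inject an event, then poll event.from until its token passes the injected one. A minimal sketch follows. The `inject` and `event_from` callables are hypothetical stand-ins for those two calls, the `"token"` key is an assumption about how a binding presents the returned batch, and the empty starting token and the poll limit are illustrative choices.

```python
def wait_past_injected_event(inject, event_from, cls, ref,
                             timeout=30.0, max_polls=100):
    """Poll until the event stream has moved past an injected event.

    inject(cls, ref) -> token string              (stand-in for event.inject)
    event_from(classes, token, timeout) -> dict   (stand-in for event.from)
    """
    target = inject(cls, ref)
    token = ""  # assumption: start from an empty token
    for _ in range(max_polls):
        batch = event_from([cls], token, timeout)
        token = batch["token"]
        # Python compares strings lexicographically, as the entry requires.
        if token > target:
            return token
    raise TimeoutError("event stream did not pass the injected event")
```

The returned token can be fed to later event.from calls so that the caller continues from the point at which the object was known to be up to date.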

XenServer 6.0

Code name: "boston".

Changes

Change	Element	Description
Published class	DR_task	DR task
Published class	GPU_group	A group of compatible GPUs across the resource pool
Published class	PCI	A PCI device
Published class	PGPU	A physical GPU (pGPU)
Published class	VGPU	A virtual GPU (vGPU)
Published class	VM_appliance	VM appliance
Published field	Bond.mode	The algorithm used to distribute traffic among the bonded NICs
Published field	Bond.primary_slave	The PIF of which the IP configuration and MAC were copied to the bond, and which will receive all configuration/VLANs/VIFs on the bond if the bond is destroyed
Published field	GPU_group.GPU_types	List of GPU types (vendor+device ID) that can be in this group
Published field	GPU_group.PGPUs	List of pGPUs in the group
Published field	GPU_group.VGPUs	List of vGPUs using the group
Published field	GPU_group.name_description	a notes field containing human-readable description
Published field	GPU_group.name_label	a human-readable name
Published field	GPU_group.other_config	Additional configuration
Published field	GPU_group.uuid	Unique identifier/object reference
Published field	PCI.class_name	PCI class name
Published field	PCI.dependencies	List of dependent PCI devices
Published field	PCI.device_name	Device name
Published field	PCI.host	Physical machine that owns the PCI device
Published field	PCI.other_config	Additional configuration
Published field	PCI.pci_id	PCI ID of the physical device
Published field	PCI.uuid	Unique identifier/object reference
Published field	PCI.vendor_name	Vendor name
Published field	PGPU.GPU_group	GPU group the pGPU is contained in
Published field	PGPU.PCI	Link to underlying PCI device
Published field	PGPU.host	Host that owns the GPU
Published field	PGPU.other_config	Additional configuration
Published field	PGPU.uuid	Unique identifier/object reference
Published field	SR.introduced_by	The disaster recovery task which introduced this SR
Published field	VDI.metadata_latest	Whether this VDI contains the latest known accessible metadata for the pool
Published field	VDI.metadata_of_pool	The pool whose metadata is contained in this VDI
Published field	VGPU.GPU_group	GPU group used by the vGPU
Published field	VGPU.VM	VM that owns the vGPU
Published field	VGPU.currently_attached	Reflects whether the virtual device is currently connected to a physical device
Published field	VGPU.device	Order in which the devices are plugged into the VM
Published field	VGPU.other_config	Additional configuration
Published field	VGPU.uuid	Unique identifier/object reference
Published field	VM.VGPUs	Virtual GPUs
Published field	VM.attached_PCIs	Currently passed-through PCI devices
Published field	VM.order	The point in the startup or shutdown sequence at which this VM will be started
Published field	VM.shutdown_delay	The delay to wait before proceeding to the next order in the shutdown sequence (seconds)
Published field	VM.start_delay	The delay to wait before proceeding to the next order in the startup sequence (seconds)
Published field	VM.suspend_SR	The SR on which a suspend image is stored
Published field	VM.version	The number of times this VM has been recovered
Published field	event.snapshot	The record of the database object that was added, changed or deleted
Published field	host.PCIs	List of PCI devices in the host
Published field	host.PGPUs	List of physical GPUs in the host
Published field	host.chipset_info	Information about chipset features
Published field	pool.metadata_VDIs	The set of currently known metadata VDIs for this pool
Published message	Bond.set_mode	Change the bond mode
Published message	DR_task.create	Create a disaster recovery task which will query the supplied list of devices
Published message	DR_task.destroy	Destroy the disaster recovery task, detaching and forgetting any SRs introduced which are no longer required
Published message	GPU_group.create
Published message	GPU_group.destroy
Published message	SR.assert_supports_database_replication	Returns successfully if the given SR supports database replication. Otherwise returns an error to explain why not.
Published message	SR.disable_database_replication
Published message	SR.enable_database_replication
Published message	VDI.open_database	Load the metadata found on the supplied VDI and return a session reference which can be used in API calls to query its contents.
Published message	VDI.read_database_pool_uuid	Check the VDI cache for the pool UUID of the database on this VDI.
Published message	VGPU.create
Published message	VGPU.destroy
Published message	VIF.unplug_force	Forcibly unplug the specified VIF
Published message	VM.assert_can_be_recovered	Assert whether all SRs required to recover this VM are available.
Published message	VM.recover	Recover the VM
Published message	VM.set_appliance	Assign this VM to an appliance.
Published message	VM.set_order	Set this VM's boot order
Published message	VM.set_shutdown_delay	Set this VM's shutdown delay in seconds
Published message	VM.set_start_delay	Set this VM's start delay in seconds
Published message	VM.set_suspend_VDI	Set this VM's suspend VDI, which must be identical to its current one
Published message	VM_appliance.assert_can_be_recovered	Assert whether all SRs required to recover this VM appliance are available.
Published message	VM_appliance.clean_shutdown	Perform a clean shutdown of all the VMs in the appliance
Published message	VM_appliance.hard_shutdown	Perform a hard shutdown of all the VMs in the appliance
Published message	VM_appliance.recover	Recover the VM appliance
Published message	VM_appliance.shutdown	For each VM in the appliance, try to shut it down cleanly. If this fails, perform a hard shutdown of the VM.
Published message	VM_appliance.start	Start all VMs in the appliance
Published message	event.from	Blocking call which returns a new token and a (possibly empty) batch of events. The returned token can be used in subsequent calls to this function.
Deprecated field	VM.PCI_bus	Field was never used
Deprecated field	VM.ha_always_run
Deprecated field	event.obj_uuid
Deprecated field	event.timestamp
Deprecated message	VM.set_ha_always_run
Deprecated message	event.next
Deprecated message	event.register
Deprecated message	event.unregister
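
The event.from entry above says that each call returns a new token to be used in the next call. A minimal sketch of that loop follows. The `event_from` callable is a hypothetical stand-in for the call, and the `"events"` and `"token"` keys are assumptions about how a binding presents the returned batch.

```python
def collect_events(event_from, classes, rounds, timeout=30.0):
    """Call event.from `rounds` times, threading the token through.

    event_from(classes, token, timeout) -> dict with "events" and "token"
    is a hypothetical stand-in supplied by the caller.
    Returns (all events seen, last token).
    """
    token = ""  # assumption: start from an empty token
    seen = []
    for _ in range(rounds):
        batch = event_from(classes, token, timeout)
        seen.extend(batch["events"])  # a batch may be empty
        token = batch["token"]        # reuse the token in the next call
    return seen, token
```

A long-running client would loop indefinitely rather than for a fixed number of rounds; the bound here only keeps the sketch easy to test.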

XenServer 5.6 FP1

Code name: "cowley".

Changes

Change	Element	Description
Published class	VMPP	VM Protection Policy
Published class	tunnel	A tunnel for network traffic
Published field	PIF.tunnel_access_PIF_of	Indicates to which tunnel this PIF gives access
Published field	PIF.tunnel_transport_PIF_of	Indicates to which tunnel this PIF provides transport
Published field	SR.local_cache_enabled	True if this SR is assigned to be the local cache for its host
Published field	VDI.allow_caching	true if this VDI is to be cached in the local cache SR
Published field	VDI.on_boot	The behaviour of this VDI on a VM boot
Published field	VM.is_snapshot_from_vmpp	true if this snapshot was created by the protection policy
Published field	VM.protection_policy	Ref pointing to a protection policy for this VM
Published field	VMPP.VMs	all VMs attached to this protection policy
Published field	VMPP.alarm_config	configuration for the alarm
Published field	VMPP.archive_frequency	frequency of the archive schedule
Published field	VMPP.archive_last_run_time	time of the last archive
Published field	VMPP.archive_schedule	schedule of the archive containing 'hour', 'min', 'days'. Date/time-related information is in Local Timezone
Published field	VMPP.archive_target_config	configuration for the archive, including its 'location', 'username', 'password'
Published field	VMPP.archive_target_type	type of the archive target config
Published field	VMPP.backup_frequency	frequency of the backup schedule
Published field	VMPP.backup_last_run_time	time of the last backup
Published field	VMPP.backup_retention_value	maximum number of backups that should be stored at any time
Published field	VMPP.backup_schedule	schedule of the backup containing 'hour', 'min', 'days'. Date/time-related information is in Local Timezone
Published field	VMPP.backup_type	type of the backup sub-policy
Published field	VMPP.is_alarm_enabled	true if alarm is enabled for this policy
Published field	VMPP.is_archive_running	true if this protection policy's archive is running
Published field	VMPP.is_backup_running	true if this protection policy's backup is running
Published field	VMPP.is_policy_enabled	enable or disable this policy
Published field	VMPP.recent_alerts	recent alerts
Published field	VMPP.uuid	Unique identifier/object reference
Published field	host.local_cache_sr	The SR that is used as a local cache
Published field	tunnel.access_PIF	The interface through which the tunnel is accessed
Published field	tunnel.other_config	Additional configuration
Published field	tunnel.status	Status information about the tunnel
Published field	tunnel.transport_PIF	The interface used by the tunnel
Published field	tunnel.uuid	Unique identifier/object reference
Published message	VDI.set_allow_caching	Set the value of the allow_caching parameter. This value can only be changed when the VDI is not attached to a running VM. The caching behaviour is only affected by this flag for VHD-based VDIs that have one parent and no child VHDs. Moreover, caching only takes place when the host running the VM containing this VDI has a nominated SR for local caching.
Published message	VDI.set_on_boot	Set the value of the on_boot parameter. This value can only be changed when the VDI is not attached to a running VM.
Published message	VM.set_protection_policy	Set the value of the protection_policy field
Published message	VMPP.add_to_alarm_config
Published message	VMPP.add_to_archive_schedule
Published message	VMPP.add_to_archive_target_config
Published message	VMPP.add_to_backup_schedule
Published message	VMPP.archive_now	This call archives the snapshot provided as a parameter
Published message	VMPP.get_alerts	This call fetches a history of alerts for a given protection policy
Published message	VMPP.protect_now	This call executes the protection policy immediately
Published message	VMPP.remove_from_alarm_config
Published message	VMPP.remove_from_archive_schedule
Published message	VMPP.remove_from_archive_target_config
Published message	VMPP.remove_from_backup_schedule
Published message	VMPP.set_alarm_config
Published message	VMPP.set_archive_frequency	Set the value of the archive_frequency field
Published message	VMPP.set_archive_last_run_time
Published message	VMPP.set_archive_schedule
Published message	VMPP.set_archive_target_config
Published message	VMPP.set_archive_target_type	Set the value of the archive_target_type field
Published message	VMPP.set_backup_frequency	Set the value of the backup_frequency field
Published message	VMPP.set_backup_last_run_time
Published message	VMPP.set_backup_retention_value
Published message	VMPP.set_backup_schedule
Published message	VMPP.set_is_alarm_enabled	Set the value of the is_alarm_enabled field
Published message	host.disable_local_storage_caching	Disable the use of a local SR for caching purposes
Published message	host.enable_local_storage_caching	Enable the use of a local SR for caching purposes
Published message	host.get_server_localtime	This call queries the host's clock for the current time in the host's local timezone
Published message	host.set_power_on_mode	Set the power-on-mode, host, user and password
Published message	pool.disable_local_storage_caching	This call disables pool-wide local storage caching
Published message	pool.enable_local_storage_caching	This call attempts to enable pool-wide local storage caching
Published message	pool.test_archive_target	This call tests if a location is valid
Published message	tunnel.create	Create a tunnel
Published message	tunnel.destroy	Destroy a tunnel
Extended message	VDI.copy	The copy can now be performed between any two SRs.
Extended message	VM.copy	The copy can now be performed between any two SRs.
Extended message	pool.set_vswitch_controller	Can now be set to the empty string (no controller is used).

XenServer 5.6

Code name: "midnight-ride".

Changes

Change	Element	Description
Published class	role	A set of permissions associated with a subject
Published class	secret	A secret
Published field	VM.bios_strings	BIOS strings
Published field	VM.children	List pointing to all the children of this VM
Published field	VM.parent	Ref pointing to the parent of this VM
Published field	VM.snapshot_info	Human-readable information concerning this snapshot
Published field	VM.snapshot_metadata	Encoded information about the metadata of the VM this is a snapshot of
Published field	host.bios_strings	BIOS strings
Published field	host.cpu_info	Details about the physical CPUs on this host
Published field	host.edition	Product edition
Published field	host.license_server	Contact information of the license server
Published field	host.power_on_config	The power on config
Published field	host.power_on_mode	The power on mode
Published field	network.MTU	MTU in octets
Published field	pool.redo_log_enabled	true if a redo-log is to be used other than when HA is enabled, false otherwise
Published field	pool.redo_log_vdi	indicates the VDI to use for the redo-log other than when HA is enabled
Published field	pool.restrictions	Pool-wide restrictions currently in effect
Published field	pool.vswitch_controller	the IP address of the vswitch controller.
Published field	role.name_description	what this role is for
Published field	role.name_label	a short user-friendly name for the role
Published field	role.subroles	a list of pointers to other roles or permissions
Published field	session.auth_user_name	the subject name of the user that was externally authenticated. If a session instance has is_local_superuser set, then the value of this field is undefined.
Published field	session.parent	references the parent session that created this session
Published field	session.rbac_permissions	list with all RBAC permissions for this session
Published field	session.tasks	list of tasks created using the current session
Published field	subject.roles	the roles associated with this subject
Published message	VM.checkpoint	Checkpoints the specified VM, making a new VM. Checkpoint automatically exploits the capabilities of the underlying storage repository in which the VM's disk images are stored (e.g. Copy on Write) and saves the memory image as well.
Published message	VM.compute_memory_overhead	Computes the virtualization memory overhead of a VM.
Published message	VM.copy_bios_strings	Copy the BIOS strings from the given host to this VM
Published message	VM.get_cooperative	Return true if the VM is currently 'co-operative', i.e. is expected to reach a balloon target and has actually done so
Published message	VM.revert	Reverts the specified VM to a previous state.
Published message	VM.set_HVM_shadow_multiplier	Set the shadow memory multiplier on a halted VM
Published message	VM.set_VCPUs_at_startup	Set the number of startup VCPUs for a halted VM
Published message	VM.set_VCPUs_max	Set the maximum number of VCPUs for a halted VM
Published message	VM.set_memory_dynamic_max	Set the value of the memory_dynamic_max field
Published message	VM.set_memory_dynamic_min	Set the value of the memory_dynamic_min field
Published message	VM.set_memory_dynamic_range	Set the minimum and maximum amounts of physical memory the VM is allowed to use.
Published message	VM.set_memory_limits	Set the memory limits of this VM.
Published message	VM.set_memory_static_min	Set the value of the memory_static_min field
Published message	VM.set_memory_static_range	Set the static (i.e. boot-time) range of virtual memory that the VM is allowed to use.
Published message	host.apply_edition	Change to another edition, or reactivate the current edition after a license has expired. This may be subject to the successful checkout of an appropriate license.
Published message	host.compute_memory_overhead	Computes the virtualization memory overhead of a host.
Published message	host.get_uncooperative_resident_VMs	Return a set of VMs which are not co-operating with the host's memory control system
Published message	host.refresh_pack_info	Refresh the list of installed Supplemental Packs.
Published message	host.reset_cpu_features	Remove the feature mask, such that after a reboot all features of the CPU are enabled.
Published message	host.set_cpu_features	Set the CPU features to be used after a reboot, if the given features string is valid.
Published message	pool.disable_redo_log	Disable the redo log if in use, unless HA is enabled.
Published message	pool.enable_redo_log	Enable the redo log on the given SR and start using it, unless HA is enabled.
Published message	pool.set_vswitch_controller	Set the IP address of the vswitch controller.
Published message	role.get_by_permission	This call returns a list of roles given a permission
Published message	role.get_by_permission_name_label	This call returns a list of roles given a permission name
Published message	role.get_permissions	This call returns a list of permissions given a role
Published message	role.get_permissions_name_label	This call returns a list of permission names given a role
Published message	subject.add_to_roles	This call adds a new role to a subject
Published message	subject.get_permissions_name_label	This call returns a list of permission names given a subject
Published message	subject.remove_from_roles	This call removes a role from a subject
Deprecated class	host_cpu	Deprecated in favour of the host.cpu_info field
Deprecated field	VM.memory_target
Deprecated field	host_metrics.memory_free	Will be disabled in favour of RRD
Deprecated message	VM.set_memory_target_live
Deprecated message	VM.wait_memory_target_live

XenServer 5.5

Code name: "george".

Changes

Change	Element	Description
Published class	auth	Management of remote authentication services
Published class	subject	A user or group that can log in to xapi
Published field	VIF.MAC_autogenerated	true if the MAC was autogenerated; false indicates it was set manually
Published field	host.external_auth_configuration	configuration specific to external authentication service
Published field	host.external_auth_service_name	name of external authentication service configured; empty if none configured.
Published field	host.external_auth_type	type of external authentication service configured; empty if none configured.
Published field	pool.wlb_enabled	true if workload balancing is enabled on the pool, false otherwise
Published field	pool.wlb_url	URL for the configured workload balancing host
Published field	pool.wlb_username	Username for accessing the workload balancing host
Published field	pool.wlb_verify_cert	true if communication with the WLB server should enforce TLS certificate verification.
Published field	session.auth_user_sid	the subject identifier of the user that was externally authenticated. If a session instance has is_local_superuser set, then the value of this field is undefined.
Published field	session.is_local_superuser	true iff this session was created using local superuser credentials
Published field	session.subject	references the subject instance that created the session. If a session instance has is_local_superuser set, then the value of this field is undefined.
Published field	session.validation_time	time when session was last validated
Published field	subject.other_config	additional configuration
Published field	subject.subject_identifier	the subject identifier, unique in the external directory service
Published message	VDI.set_sharable	Sets the VDI's sharable field
Published message	VM.retrieve_wlb_recommendations	Returns mapping of hosts to ratings, indicating the suitability of starting the VM at that location according to wlb. Rating is replaced with an error if the VM cannot boot there.
Published message	auth.get_group_membership	This call queries the external directory service to obtain the transitively-closed set of groups that the subject_identifier is a member of.
Published message	auth.get_subject_identifier	This call queries the external directory service to obtain the subject_identifier as a string from the human-readable subject_name
Published message	auth.get_subject_information_from_identifier	This call queries the external directory service to obtain the user information (e.g. username, organization etc) from the specified subject_identifier
Published message	host.disable_external_auth	This call disables external authentication on the local host
Published message	host.enable_external_auth	This call enables external authentication on a host
Published message	host.get_server_certificate	Get the installed server public TLS certificate.
Published message	host.retrieve_wlb_evacuate_recommendations	Retrieves recommended host migrations to perform when evacuating the host from the wlb server. If a VM cannot be migrated from the host the reason is listed instead of a recommendation.
Published message	pool.certificate_install	Install TLS CA certificate
Published message	pool.certificate_list	List installed TLS CA certificates
Published message	pool.certificate_sync	Copy the TLS CA certificates and CRLs of the master to all slaves.
Published message	pool.certificate_uninstall	Uninstall TLS CA certificate
Published message	pool.crl_install	Install a TLS CA-issued Certificate Revocation List, pool-wide.
Published message	pool.crl_list	List the names of all installed TLS CA-issued Certificate Revocation Lists.
Published message	pool.crl_uninstall	Remove a pool-wide TLS CA-issued Certificate Revocation List.
Published message	pool.deconfigure_wlb	Permanently deconfigures workload balancing monitoring on this pool
Published message	pool.detect_nonhomogeneous_external_auth	This call asynchronously detects if the external authentication configuration in any slave is different from that in the master and raises appropriate alerts
Published message	pool.disable_external_auth	This call disables external authentication on all the hosts of the pool
Published message	pool.enable_external_auth	This call enables external authentication on all the hosts of the pool
Published message	pool.initialize_wlb	Initializes workload balancing monitoring on this pool with the specified wlb server
Published message	pool.retrieve_wlb_configuration	Retrieves the pool optimization criteria from the workload balancing server
Published message	pool.retrieve_wlb_recommendations	Retrieves vm migrate recommendations for the pool from the workload balancing server
Published message	pool.send_test_post	Send the given body to the given host and port, using HTTPS, and print the response. This is used for debugging the SSL layer.
Published message	pool.send_wlb_configuration	Sets the pool optimization criteria for the workload balancing server
Published message	session.get_all_subject_identifiers	Return a list of all the user subject-identifiers of all existing sessions
Published message	session.logout_subject_identifier	Log out all sessions associated to a user subject-identifier, except the session associated with the context calling this function
Deprecated class	user	Deprecated in favor of subject
Removed field	VM_guest_metrics.memory	Disabled in favour of the RRDs, to improve scalability
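
The session.auth_user_sid and session.subject rows above note that both fields are undefined when is_local_superuser is set. A minimal Python sketch of how a client might guard against reading them in that case follows. It assumes the session record is already available as a plain dict keyed by field name; the helper name externally_authenticated_sid is hypothetical and not part of the API.

```python
from typing import Optional


def externally_authenticated_sid(session_record: dict) -> Optional[str]:
    """Return the subject identifier of an externally authenticated session.

    Returns None for local-superuser sessions, where auth_user_sid is
    documented as undefined and should not be trusted.
    """
    if session_record.get("is_local_superuser", False):
        return None
    # Treating an empty string as "no identifier" is an assumption of this sketch.
    return session_record.get("auth_user_sid") or None
```

Checking is_local_superuser first keeps callers from acting on whatever value happens to be stored in the undefined field.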

XenServer 5.0 Update 1

Code name: "orlando-update-1".

Changes

Change	Element	Description
Published message	pool.ha_prevent_restarts_for	When this call returns, the VM restart logic will not run for the requested number of seconds. If the argument is zero, the restart thread is immediately unblocked.

XenServer 5.0

Code name: "orlando".

Changes

Change	Element	Description
Published class	blob	A placeholder for a binary blob
Published class	data_source	Data sources for logging in RRDs
Published class	message	A message for the attention of the administrator
Published field	PIF.disallow_unplug	Prevent this PIF from being unplugged; set this to notify the management tool-stack that the PIF has a special use and should not be unplugged under any circumstances (e.g. because you're running storage traffic over it)
Published field	PIF_metrics.other_config	additional configuration
Published field	SM.driver_filename	filename of the storage driver
Published field	SR.blobs	Binary blobs associated with this SR
Published field	SR.tags	user-specified tags for categorization purposes
Published field	VBD_metrics.other_config	additional configuration
Published field	VDI.is_a_snapshot	true if this is a snapshot.
Published field	VDI.snapshot_of	Ref pointing to the VDI this snapshot is of.
Published field	VDI.snapshot_time	Date/time when this snapshot was created.
Published field	VDI.snapshots	List pointing to all the VDI's snapshots.
Published field	VDI.tags	user-specified tags for categorization purposes
Published field	VIF_metrics.other_config	additional configuration
Published field	VM.blobs	Binary blobs associated with this VM
Published field	VM.blocked_operations	List of operations which have been explicitly blocked and an error code
Published field	VM.ha_always_run	if true then the system will attempt to keep the VM running as much as possible.
Published field	VM.ha_restart_priority	has possible values: "best-effort" meaning "try to restart this VM if possible but don't consider the Pool to be overcommitted if this is not possible"; "restart" meaning "this VM should be restarted"; "" meaning "do not try to restart this VM"
Published field	VM.is_a_snapshot	true if this is a snapshot. Snapshotted VMs can never be started, they are used only for cloning other VMs
Published field	VM.snapshot_of	Ref pointing to the VM this snapshot is of.
Published field	VM.snapshot_time	Date/time when this snapshot was created.
Published field	VM.snapshots	List pointing to all the VM snapshots.
Published field	VM.tags	user-specified tags for categorization purposes
Published field	VM.transportable_snapshot_id	Transportable ID of the snapshot VM
Published field	VM_guest_metrics.live	True if the guest is sending heartbeat messages via the guest agent
Published field	VM_guest_metrics.other_config	additional configuration
Published field	VM_metrics.other_config	additional configuration
Published field	host.blobs	Binary blobs associated with this host
Published field	host.ha_network_peers	The set of hosts visible via the network from this host
Published field	host.ha_statefiles	The set of statefiles accessible from this host
Published field	host.tags	user-specified tags for categorization purposes
Published field	host_cpu.other_config	additional configuration
Published field	host_metrics.other_config	additional configuration
Published field	message.cls	The class of the object this message is associated with
Published field	network.blobs	Binary blobs associated with this network
Published field	network.tags	user-specified tags for categorization purposes
Published field	pool.blobs	Binary blobs associated with this pool
Published field	pool.gui_config	gui-specific configuration for pool
Published field	pool.ha_allow_overcommit	If set to false then operations which would cause the Pool to become overcommitted will be blocked.
Published field	pool.ha_configuration	The current HA configuration
Published field	pool.ha_enabled	true if HA is enabled on the pool, false otherwise
Published field	pool.ha_host_failures_to_tolerate	Number of host failures to tolerate before the Pool is declared to be overcommitted
Published field	pool.ha_overcommitted	True if the Pool is considered to be overcommitted i.e. if there exist insufficient physical resources to tolerate the configured number of host failures
Published field	pool.ha_plan_exists_for	Number of future host failures we have managed to find a plan for. Once this reaches zero any future host failures will cause the failure of protected VMs.
Published field	pool.ha_statefiles	HA statefile VDIs in use
Published field	pool.tags	user-specified tags for categorization purposes
Published field	task.subtask_of	Ref pointing to the task this is a subtask of.
Published field	task.subtasks	List pointing to all the subtasks.
Published field	user.other_config	additional configuration
Published message	PIF.db_forget	Destroy a PIF database record.
Published message	PIF.db_introduce	Create a new PIF record in the database only
Published message	PIF.set_disallow_unplug	Set whether unplugging the PIF is allowed
Published message	SR.assert_can_host_ha_statefile	Returns successfully if the given SR can host an HA statefile. Otherwise returns an error to explain why not
Published message	SR.create_new_blob	Create a placeholder for a named binary blob of data that is associated with this SR
Published message	VM.assert_agile	Returns an error if the VM is not considered agile e.g. because it is tied to a resource local to a host
Published message	VM.create_new_blob	Create a placeholder for a named binary blob of data that is associated with this VM
Published message	VM.forget_data_source_archives	Forget the recorded statistics related to the specified data source
Published message	VM.get_data_sources
Published message	VM.query_data_source	Query the latest value of the specified data source
Published message	VM.record_data_source	Start recording the specified data source
Published message	VM.set_ha_always_run	Set the value of the ha_always_run field
Published message	VM.set_ha_restart_priority	Set the value of the ha_restart_priority field
Published message	VM.set_memory_static_max	Set the value of the memory_static_max field
Published message	VM.snapshot	Snapshots the specified VM, making a new VM. Snapshot automatically exploits the capabilities of the underlying storage repository in which the VM's disk images are stored (e.g. Copy on Write).
Published message	VM.snapshot_with_quiesce	Snapshots the specified VM with quiesce, making a new VM. Snapshot automatically exploits the capabilities of the underlying storage repository in which the VM's disk images are stored (e.g. Copy on Write).
Published message	VM.wait_memory_target_live	Wait for a running VM to reach its current memory target
Published message	blob.create	Create a placeholder for a binary blob
Published message	blob.destroy
Published message	host.backup_rrds	This causes the RRDs to be backed up to the master
Published message	host.call_plugin	Call an API plugin on this host
Published message	host.compute_free_memory	Computes the amount of free memory on the host.
Published message	host.create_new_blob	Create a placeholder for a named binary blob of data that is associated with this host
Published message	host.emergency_ha_disable	This call disables HA on the local host. This should only be used with extreme care.
Published message	host.forget_data_source_archives	Forget the recorded statistics related to the specified data source
Published message	host.get_data_sources
Published message	host.get_servertime	This call queries the host's clock for the current time
Published message	host.get_vms_which_prevent_evacuation	Return a set of VMs which prevent the host being evacuated, with per-VM error codes
Published message	host.power_on	Attempt to power-on the host (if the capability exists).
Published message	host.query_data_source	Query the latest value of the specified data source
Published message	host.record_data_source	Start recording the specified data source
Published message	host.shutdown_agent	Shuts the agent down after a 10 second pause. WARNING: this is a dangerous operation. Any operations in progress will be aborted, and unrecoverable data loss may occur. The caller is responsible for ensuring that there are no operations in progress when this method is called.
Published message	host.sync_data	This causes the non-database data (messages, RRDs and so on) stored on the master to be synchronised with the host
Published message	message.create
Published message	message.destroy
Published message	message.get
Published message	message.get_all
Published message	message.get_all_records
Published message	message.get_all_records_where
Published message	message.get_by_uuid
Published message	message.get_record
Published message	message.get_since
Published message	network.create_new_blob	Create a placeholder for a named binary blob of data that is associated with this network
Published message	pool.create_new_blob	Create a placeholder for a named binary blob of data that is associated with this pool
Published message	pool.ha_compute_hypothetical_max_host_failures_to_tolerate	Returns the maximum number of host failures we could tolerate before we would be unable to restart the provided VMs
Published message	pool.ha_compute_max_host_failures_to_tolerate	Returns the maximum number of host failures we could tolerate before we would be unable to restart configured VMs
Published message	pool.ha_compute_vm_failover_plan	Return a VM failover plan assuming a given subset of hosts fail
Published message	pool.ha_failover_plan_exists	Returns true if a VM failover plan exists for up to 'n' host failures
Published message	pool.set_ha_host_failures_to_tolerate	Set the maximum number of host failures to consider in the HA VM restart planner
Removed field	VM_guest_metrics.disks	No data
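
The pool.ha_* fields published in this release are easiest to read together. The sketch below, in Python, turns them into a one-line status using only the semantics stated in the table above. It assumes the pool record is available as a plain dict keyed by field name, and that integer fields may arrive as strings and need coercing; the function name summarise_ha is hypothetical.

```python
def summarise_ha(pool_record: dict) -> str:
    """Summarise HA health from the pool.ha_* fields of a pool record."""
    if not pool_record["ha_enabled"]:
        return "HA disabled"
    # Integer fields may be delivered as strings (assumption), so coerce them.
    tolerate = int(pool_record["ha_host_failures_to_tolerate"])
    planned = int(pool_record["ha_plan_exists_for"])
    if pool_record["ha_overcommitted"]:
        # Insufficient resources for the configured number of host failures.
        return f"overcommitted: cannot tolerate the configured {tolerate} host failure(s)"
    if planned == 0:
        # Per the table: at zero, any further host failure fails protected VMs.
        return "no failover plan: any further host failure will fail protected VMs"
    return f"plan exists for {planned} host failure(s); configured to tolerate {tolerate}"
```

For example, a record with ha_enabled true, ha_overcommitted false and ha_plan_exists_for "0" reports that no failover plan remains.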

XenServer 4.1.1

Code name: "symc".

Changes

Change	Element	Description
Published message	SR.update	Refresh the fields on the SR object
Published message	VDI.update	Ask the storage backend to refresh the fields in the VDI object

XenServer 4.1

Code name: "miami".

Changes

Change	Element	Description
Published class	Bond	A Network bond that combines physical network interfaces, also known as link aggregation
Published class	VLAN	A VLAN mux/demux
Published class	pool_patch	Pool-wide patches
Published field	Bond.master	The bonded interface
Published field	Bond.other_config	additional configuration
Published field	Bond.slaves	The interfaces which are part of this bond
Published field	PBD.other_config	additional configuration
Published field	PIF.DNS	Comma separated list of the IP addresses of the DNS servers to use
Published field	PIF.IP	IP address
Published field	PIF.VLAN_master_of	Indicates which VLAN this interface receives untagged traffic from
Published field	PIF.VLAN_slave_of	Indicates which VLANs this interface transmits tagged traffic to
Published field	PIF.bond_master_of	Indicates this PIF represents the results of a bond
Published field	PIF.bond_slave_of	Indicates which bond this interface is part of
Published field	PIF.currently_attached	true if this interface is online
Published field	PIF.gateway	IP gateway
Published field	PIF.ip_configuration_mode	Sets if and how this interface gets an IP address
Published field	PIF.management	Indicates whether the control software is listening for connections on this interface
Published field	PIF.netmask	IP netmask
Published field	PIF.other_config	Additional configuration
Published field	PIF.physical	true if this represents a physical network interface
Published field	SM.capabilities	capabilities of the SM plugin
Published field	SM.other_config	additional configuration
Published field	SR.sm_config	SM dependent data
Published field	VBD.unpluggable	true if this VBD will support hot-unplug
Published field	VDI.location	location information
Published field	VDI.sm_config	SM dependent data
Published field	VDI.xenstore_data	data to be inserted into the xenstore tree (/local/domain/0/backend/vbd/<domid>/<device-id>/sm-data) after the VDI is attached. This is generally set by the SM backends on vdi_attach.
Published field	VLAN.other_config	additional configuration
Published field	VLAN.tag	VLAN tag in use
Published field	VLAN.tagged_PIF	interface on which traffic is tagged
Published field	VLAN.untagged_PIF	interface on which traffic is untagged
Published field	VM.HVM_shadow_multiplier	multiplier applied to the amount of shadow that will be made available to the guest
Published field	VM.last_booted_record	Marshalled value containing VM record at time of last boot, updated dynamically to reflect the runtime state of the domain
Published field	VM.xenstore_data	data to be inserted into the xenstore tree (/local/domain/<domid>/vm-data) after the VM is created.
Published field	crashdump.other_config	additional configuration
Published field	host_crashdump.other_config	additional configuration
Published field	host_patch.other_config	additional configuration
Published field	host_patch.pool_patch	The patch applied
Published field	pool_patch.after_apply_guidance	What the client should do after this patch has been applied.
Published field	pool_patch.host_patches	The hosts this patch is applied to.
Published field	pool_patch.other_config	additional configuration
Published field	pool_patch.pool_applied	This patch should be applied across the entire pool
Published field	pool_patch.size	Size of the patch
Published field	pool_patch.version	Patch version number
Published field	session.other_config	additional configuration
Published field	task.other_config	additional configuration
Published message	Bond.create	Create an interface bond
Published message	Bond.destroy	Destroy an interface bond
Published message	PBD.set_device_config	Sets the PBD's device_config field
Published message	PIF.forget	Destroy the PIF object matching a particular network interface
Published message	PIF.introduce	Create a PIF object matching a particular network interface
Published message	PIF.plug	Attempt to bring up a physical interface
Published message	PIF.reconfigure_ip	Reconfigure the IP address settings for this interface
Published message	PIF.scan	Scan for physical interfaces on a host and create PIF objects to represent them
Published message	PIF.unplug	Attempt to bring down a physical interface
Published message	SR.probe	Perform a backend-specific scan, using the given device_config. If the device_config is complete, then this will return a list of the SRs present of this type on the device, if any. If the device_config is partial, then a backend-specific scan will be performed, returning results that will guide the user in improving the device_config.
Published message	SR.set_physical_size	Sets the SR's physical_size field
Published message	VDI.introduce	Create a new VDI record in the database only
Published message	VLAN.create	Create a VLAN mux/demuxer
Published message	VLAN.destroy	Destroy a VLAN mux/demuxer
Published message	VM.maximise_memory	Returns the maximum amount of guest memory which will fit, together with overheads, in the supplied amount of physical memory. If 'exact' is true then an exact calculation is performed using the VM's current settings. If 'exact' is false then a more conservative approximation is used
Published message	host.assert_can_evacuate	Check this host can be evacuated.
Published message	host.evacuate	Migrate all VMs off of this host, where possible.
Published message	host.get_system_status_capabilities
Published message	host.local_management_reconfigure	Reconfigure the management network interface. Should only be used if Host.management_reconfigure is impossible because the network configuration is broken.
Published message	host.management_disable	Disable the management network interface
Published message	host.management_reconfigure	Reconfigure the management network interface
Published message	host.set_hostname_live	Sets the host name to the specified string. Both the API and lower-level system hostname are changed immediately.
Published message	host.syslog_reconfigure	Re-configure syslog logging
Published message	pool.designate_new_master	Perform an orderly handover of the role of master to the referenced host.
Published message	pool.disable_ha	Turn off High Availability mode
Published message	pool.enable_ha	Turn on High Availability mode
Published message	pool_patch.apply	Apply the selected patch to a host and return its output
Published message	pool_patch.clean	Removes the patch's files from the server
Published message	pool_patch.destroy	Removes the patch's files from all hosts in the pool, and removes the database entries. Only works on unapplied patches.
Published message	pool_patch.pool_apply	Apply the selected patch to all hosts in the pool and return a map of host_ref -> patch output
Published message	pool_patch.precheck	Execute the precheck stage of the selected patch on a host and return its output
Published message	session.local_logout	Log out of local session.
Published message	session.slave_local_login_with_password	Authenticate locally against a slave in emergency mode. Note the resulting sessions are only good for use on this host.
Deprecated message	PIF.create_VLAN	Replaced by VLAN.create
Deprecated message	PIF.destroy	Replaced by VLAN.destroy and Bond.destroy
Deprecated message	SR.make	Use SR.create instead
Deprecated message	host_patch.apply
Deprecated message	host_patch.destroy
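
Client code written against earlier releases may still call the messages deprecated here. As a rough illustration, the Python sketch below records the replacements exactly as the "Deprecated message" rows above state them and produces a migration hint. The names REPLACEMENTS and replacement_hint are hypothetical, and rows with no stated replacement are left empty rather than guessed.

```python
# Replacements as stated in the "Deprecated message" rows of the 4.1 table.
# Messages with no stated replacement map to an empty tuple.
REPLACEMENTS = {
    "PIF.create_VLAN": ("VLAN.create",),
    "PIF.destroy": ("VLAN.destroy", "Bond.destroy"),
    "SR.make": ("SR.create",),
    "host_patch.apply": (),
    "host_patch.destroy": (),
}


def replacement_hint(message: str) -> str:
    """Return a one-line migration hint for an API message name."""
    if message not in REPLACEMENTS:
        return f"{message}: not listed as deprecated in XenServer 4.1"
    replacements = REPLACEMENTS[message]
    if not replacements:
        return f"{message}: deprecated, no replacement stated"
    return f"{message}: deprecated, use {', '.join(replacements)}"
```

Calling replacement_hint("PIF.destroy") names both VLAN.destroy and Bond.destroy, since which one applies depends on what kind of interface the PIF represents.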

XenServer 4.0

Code name: "rio".

Changes

Change	Element	Description
Published class	PBD	The physical block devices through which hosts access SRs
Published class	PIF	A physical network interface (note separate VLANs are represented as several PIFs)
Published class	PIF_metrics	The metrics associated with a physical network interface
Published class	SM	A storage manager plugin
Published class	SR	A storage repository
Published class	VBD	A virtual block device
Published class	VBD_metrics	The metrics associated with a virtual block device
Published class	VDI	A virtual disk image
Published class	VIF	A virtual network interface
Published class	VIF_metrics	The metrics associated with a virtual network device
Published class	VM	A virtual machine (or 'guest').
Published class	VM_guest_metrics	The metrics reported by the guest (as opposed to inferred from outside)
Published class	VM_metrics	The metrics associated with a VM
Published class	console	A console
Published class	crashdump	A VM crashdump
Published class	event	Asynchronous event registration and handling
Published class	host	A physical host
Published class	host_cpu	A physical CPU
Published class	host_crashdump	Represents a host crash dump
Published class	host_metrics	The metrics associated with a host
Published class	host_patch	Represents a patch stored on a server
Published class	network	A virtual network
Published class	pool	Pool-wide information
Published class	session	A session
Published class	task	A long-running asynchronous task
Published class	user	A user of the system
Published field	Bond.uuid	Unique identifier/object reference
Published field	Cluster.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	Cluster.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	Cluster_host.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	Cluster_host.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	DR_task.introduced_SRs	All SRs introduced by this appliance
Published field	DR_task.uuid	Unique identifier/object reference
Published field	Feature.name_description	a notes field containing human-readable description
Published field	Feature.name_label	a human-readable name
Published field	LVHD.uuid	Unique identifier/object reference
Published field	Observer.name_description	a notes field containing human-readable description
Published field	Observer.name_label	a human-readable name
Published field	PBD.SR	the storage repository that the pbd realises
Published field	PBD.currently_attached	is the SR currently attached on this host?
Published field	PBD.device_config	a config string to string map that is provided to the host's SR-backend-driver
Published field	PBD.host	physical machine on which the pbd is available
Published field	PBD.uuid	Unique identifier/object reference
Published field	PIF.MAC	ethernet MAC address of physical interface
Published field	PIF.MTU	MTU in octets
Published field	PIF.VLAN	VLAN tag for all traffic passing through this interface
Published field	PIF.device	machine-readable name of the interface (e.g. eth0)
Published field	PIF.host	physical machine to which this pif is connected
Published field	PIF.metrics	metrics associated with this PIF
Published field	PIF.network	virtual network to which this pif is connected
Published field	PIF.uuid	Unique identifier/object reference
Published field	PIF_metrics.carrier	Report if the PIF got a carrier or not
Published field	PIF_metrics.device_id	Report device ID
Published field	PIF_metrics.device_name	Report device name
Published field	PIF_metrics.duplex	Full duplex capability of the link (if available)
Published field	PIF_metrics.io_read_kbs	Read bandwidth (KiB/s)
Published field	PIF_metrics.io_write_kbs	Write bandwidth (KiB/s)
Published field	PIF_metrics.last_updated	Time at which this information was last updated
Published field	PIF_metrics.pci_bus_path	PCI bus path of the pif (if available)
Published field	PIF_metrics.speed	Speed of the link in Mbit/s (if available)
Published field	PIF_metrics.uuid	Unique identifier/object reference
Published field	PIF_metrics.vendor_id	Report vendor ID
Published field	PIF_metrics.vendor_name	Report vendor name
Published field	Repository.name_description	a notes field containing human-readable description
Published field	Repository.name_label	a human-readable name
Published field	SM.configuration	names and descriptions of device config keys
Published field	SM.copyright	Entity which owns the copyright of this plugin
Published field	SM.name_description	a notes field containing human-readable description
Published field	SM.name_label	a human-readable name
Published field	SM.required_api_version	Minimum SM API version required on the server
Published field	SM.type	SR.type
Published field	SM.uuid	Unique identifier/object reference
Published field	SM.vendor	Vendor who created this plugin
Published field	SM.version	Version of the plugin
Published field	SR.PBDs	describes how particular hosts can see this storage repository
Published field	SR.VDIs	all virtual disks known to this storage repository
Published field	SR.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	SR.content_type	the type of the SR's content, if required (e.g. ISOs)
Published field	SR.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	SR.name_description	a notes field containing human-readable description
Published field	SR.name_label	a human-readable name
Published field	SR.other_config	additional configuration
Published field	SR.physical_size	total physical size of the repository (in bytes)
Published field	SR.physical_utilisation	physical space currently utilised on this storage repository (in bytes). Note that for sparse disk formats, physical_utilisation may be less than virtual_allocation
Published field	SR.shared	true if this SR is (capable of being) shared between multiple hosts
Published field	SR.type	type of the storage repository
Published field	SR.uuid	Unique identifier/object reference
Published field	SR.virtual_allocation	sum of virtual_sizes of all VDIs in this storage repository (in bytes)
Published field	VBD.VDI	the virtual disk
Published field	VBD.VM	the virtual machine
Published field	VBD.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VBD.bootable	true if this VBD is bootable
Published field	VBD.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VBD.currently_attached	is the device currently attached (erased on reboot)
Published field	VBD.device	device seen by the guest e.g. hda1
Published field	VBD.empty	if true this represents an empty drive
Published field	VBD.metrics	metrics associated with this VBD
Published field	VBD.mode	the mode the VBD should be mounted with
Published field	VBD.other_config	additional configuration
Published field	VBD.qos_algorithm_params	parameters for chosen QoS algorithm
Published field	VBD.qos_algorithm_type	QoS algorithm to use
Published field	VBD.qos_supported_algorithms	supported QoS algorithms for this VBD
Published field	VBD.runtime_properties	Device runtime properties
Published field	VBD.status_code	error/success code associated with last attach-operation (erased on reboot)
Published field	VBD.status_detail	error/success information associated with last attach-operation status (erased on reboot)
Published field	VBD.storage_lock	true if a storage level lock was acquired
Published field	VBD.type	how the VBD will appear to the guest (e.g. disk or CD)
Published field	VBD.userdevice	user-friendly device name e.g. 0,1,2,etc.
Published field	VBD.uuid	Unique identifier/object reference
Published field	VBD_metrics.io_read_kbs	Read bandwidth (KiB/s)
Published field	VBD_metrics.io_write_kbs	Write bandwidth (KiB/s)
Published field	VBD_metrics.last_updated	Time at which this information was last updated
Published field	VBD_metrics.uuid	Unique identifier/object reference
Published field	VDI.SR	storage repository in which the VDI resides
Published field	VDI.VBDs	list of vbds that refer to this disk
Published field	VDI.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VDI.crash_dumps	list of crash dumps that refer to this disk
Published field	VDI.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VDI.managed
Published field	VDI.missing	true if SR scan operation reported this VDI as not present on disk
Published field	VDI.name_description	a notes field containing human-readable description
Published field	VDI.name_label	a human-readable name
Published field	VDI.other_config	additional configuration
Published field	VDI.parent	This field is always null. Deprecated
Published field	VDI.physical_utilisation	amount of physical space that the disk image is currently taking up on the storage repository (in bytes)
Published field	VDI.read_only	true if this disk may ONLY be mounted read-only
Published field	VDI.sharable	true if this disk may be shared
Published field	VDI.storage_lock	true if this disk is locked at the storage level
Published field	VDI.type	type of the VDI
Published field	VDI.uuid	Unique identifier/object reference
Published field	VDI.virtual_size	size of disk as presented to the guest (in bytes). Note that, depending on storage backend type, requested size may not be respected exactly
Published field	VIF.MAC	ethernet MAC address of virtual interface, as exposed to guest
Published field	VIF.MTU	MTU in octets
Published field	VIF.VM	virtual machine to which this vif is connected
Published field	VIF.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VIF.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VIF.currently_attached	is the device currently attached (erased on reboot)
Published field	VIF.device	order in which VIF backends are created by xapi
Published field	VIF.metrics	metrics associated with this VIF
Published field	VIF.network	virtual network to which this vif is connected
Published field	VIF.other_config	additional configuration
Published field	VIF.qos_algorithm_params	parameters for chosen QoS algorithm
Published field	VIF.qos_algorithm_type	QoS algorithm to use
Published field	VIF.qos_supported_algorithms	supported QoS algorithms for this VIF
Published field	VIF.runtime_properties	Device runtime properties
Published field	VIF.status_code	error/success code associated with last attach-operation (erased on reboot)
Published field	VIF.status_detail	error/success information associated with last attach-operation status (erased on reboot)
Published field	VIF.uuid	Unique identifier/object reference
Published field	VIF_metrics.io_read_kbs	Read bandwidth (KiB/s)
Published field	VIF_metrics.io_write_kbs	Write bandwidth (KiB/s)
Published field	VIF_metrics.last_updated	Time at which this information was last updated
Published field	VIF_metrics.uuid	Unique identifier/object reference
Published field	VLAN.uuid	Unique identifier/object reference
Published field	VM.HVM_boot_params	HVM boot params
Published field	VM.HVM_boot_policy	HVM boot policy
Published field	VM.PCI_bus	PCI bus path for pass-through devices
Published field	VM.PV_args	kernel command-line arguments
Published field	VM.PV_bootloader	name of or path to bootloader
Published field	VM.PV_bootloader_args	miscellaneous arguments for the bootloader
Published field	VM.PV_kernel	path to the kernel
Published field	VM.PV_legacy_args	to make Zurich guests boot
Published field	VM.PV_ramdisk	path to the initrd
Published field	VM.VBDs	virtual block devices
Published field	VM.VCPUs_at_startup	Boot number of VCPUs
Published field	VM.VCPUs_max	Max number of VCPUs
Published field	VM.VCPUs_params	configuration parameters for the selected VCPU policy
Published field	VM.VIFs	virtual network interfaces
Published field	VM.VTPMs	virtual TPMs
Published field	VM.VUSBs	virtual usb devices
Published field	VM.actions_after_crash	action to take if the guest crashes
Published field	VM.actions_after_reboot	action to take after the guest has rebooted itself
Published field	VM.actions_after_shutdown	action to take after the guest has shutdown itself
Published field	VM.affinity	A host which the VM has some affinity for (or NULL). This is used as a hint to the start call when it decides where to run the VM. Resource constraints may cause the VM to be started elsewhere.
Published field	VM.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VM.appliance	the appliance to which this VM belongs
Published field	VM.consoles	virtual console devices
Published field	VM.crash_dumps	crash dumps associated with this VM
Published field	VM.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VM.domarch	Domain architecture (if available, null string otherwise)
Published field	VM.domid	domain ID (if available, -1 otherwise)
Published field	VM.guest_metrics	metrics associated with the running guest
Published field	VM.is_a_template	true if this is a template. Template VMs can never be started; they are used only for cloning other VMs
Published field	VM.is_control_domain	true if this is a control domain (domain 0 or a driver domain)
Published field	VM.last_boot_CPU_flags	describes the CPU flags on which the VM was last booted
Published field	VM.memory_dynamic_max	Dynamic maximum (bytes)
Published field	VM.memory_dynamic_min	Dynamic minimum (bytes)
Published field	VM.memory_overhead	Virtualization memory overhead (bytes).
Published field	VM.memory_static_max	Statically-set (i.e. absolute) maximum (bytes). The value of this field at VM start time acts as a hard limit of the amount of memory a guest can use. New values only take effect on reboot.
Published field	VM.memory_static_min	Statically-set (i.e. absolute) minimum (bytes). The value of this field indicates the least amount of memory this VM can boot with without crashing.
Published field	VM.memory_target	Dynamically-set memory target (bytes). The value of this field indicates the current target for memory available to this VM.
Published field	VM.metrics	metrics associated with this VM
Published field	VM.name_description	a notes field containing human-readable description
Published field	VM.name_label	a human-readable name
Published field	VM.other_config	additional configuration
Published field	VM.platform	platform-specific configuration
Published field	VM.power_state	Current power state of the machine
Published field	VM.recommendations	An XML specification of recommended values and ranges for properties of this VM
Published field	VM.resident_on	the host the VM is currently resident on
Published field	VM.scheduled_to_be_resident_on	the host on which the VM is due to be started/resumed/migrated. This acts as a memory reservation indicator
Published field	VM.suspend_VDI	The VDI that a suspend image is stored on. (Only has meaning if VM is currently suspended)
Published field	VM.user_version	Creators of VMs and templates may store version information here.
Published field	VM.uuid	Unique identifier/object reference
Published field	VMPP.name_description	a notes field containing human-readable description
Published field	VMPP.name_label	a human-readable name
Published field	VMSS.VMs	all VMs attached to this snapshot schedule
Published field	VMSS.enabled	enable or disable this snapshot schedule
Published field	VMSS.frequency	frequency of taking snapshot from snapshot schedule
Published field	VMSS.last_run_time	time of the last snapshot
Published field	VMSS.name_description	a notes field containing human-readable description
Published field	VMSS.name_label	a human-readable name
Published field	VMSS.retained_snapshots	maximum number of snapshots that should be stored at any time
Published field	VMSS.schedule	schedule of the snapshot containing 'hour', 'min', 'days'. Date/time-related information is in Local Timezone
Published field	VMSS.type	type of the snapshot schedule
Published field	VMSS.uuid	Unique identifier/object reference
Published field	VM_appliance.VMs	all VMs in this appliance
Published field	VM_appliance.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VM_appliance.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VM_appliance.name_description	a notes field containing human-readable description
Published field	VM_appliance.name_label	a human-readable name
Published field	VM_appliance.uuid	Unique identifier/object reference
Published field	VM_guest_metrics.PV_drivers_up_to_date	true if the PV drivers appear to be up to date
Published field	VM_guest_metrics.PV_drivers_version	version of the PV drivers
Published field	VM_guest_metrics.disks	Disk configuration/free space
Published field	VM_guest_metrics.last_updated	Time at which this information was last updated
Published field	VM_guest_metrics.memory	free/used/total
Published field	VM_guest_metrics.networks	network configuration
Published field	VM_guest_metrics.os_version	version of the OS
Published field	VM_guest_metrics.other	anything else
Published field	VM_guest_metrics.uuid	Unique identifier/object reference
Published field	VM_metrics.VCPUs_CPU	VCPU to PCPU map
Published field	VM_metrics.VCPUs_flags	CPU flags (blocked,online,running)
Published field	VM_metrics.VCPUs_number	Current number of VCPUs
Published field	VM_metrics.VCPUs_params	The live equivalent to VM.VCPUs_params
Published field	VM_metrics.VCPUs_utilisation	Utilisation for all of guest's current VCPUs
Published field	VM_metrics.install_time	Time at which the VM was installed
Published field	VM_metrics.last_updated	Time at which this information was last updated
Published field	VM_metrics.memory_actual	Guest's actual memory (bytes)
Published field	VM_metrics.start_time	Time at which this VM was last booted
Published field	VM_metrics.state	The state of the guest, eg blocked, dying etc
Published field	VM_metrics.uuid	Unique identifier/object reference
Published field	VTPM.VM	The virtual machine the TPM is attached to
Published field	VTPM.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VTPM.backend	The domain where the backend is located (unused)
Published field	VTPM.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VTPM.uuid	Unique identifier/object reference
Published field	VUSB.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	VUSB.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	VUSB.currently_attached	is the device currently attached
Published field	blob.last_updated	Time at which the data in the blob was last updated
Published field	blob.mime_type	The mime type associated with this object. Defaults to 'application/octet-stream' if the empty string is supplied
Published field	blob.name_description	a notes field containing human-readable description
Published field	blob.name_label	a human-readable name
Published field	blob.size	Size of the binary data, in bytes
Published field	blob.uuid	Unique identifier/object reference
Published field	console.VM	VM to which this console is attached
Published field	console.location	URI for the console service
Published field	console.other_config	additional configuration
Published field	console.protocol	the protocol used by this console
Published field	console.uuid	Unique identifier/object reference
Published field	crashdump.VDI	the virtual disk
Published field	crashdump.VM	the virtual machine
Published field	crashdump.uuid	Unique identifier/object reference
Published field	data_source.enabled	true if the data source is being logged
Published field	data_source.max	the maximum value of the data source
Published field	data_source.min	the minimum value of the data source
Published field	data_source.name_description	a notes field containing human-readable description
Published field	data_source.name_label	a human-readable name
Published field	data_source.standard	true if the data source is enabled by default. Non-default data sources cannot be disabled
Published field	data_source.units	the units of the value
Published field	data_source.value	current value of the data source
Published field	event.class	The name of the class of the object that changed
Published field	event.id	An ID, monotonically increasing, and local to the current session
Published field	event.obj_uuid	The uuid of the object that changed
Published field	event.operation	The operation that was performed
Published field	event.ref	A reference to the object that changed
Published field	event.timestamp	The time at which the event occurred
Published field	host.API_version_major	major version number
Published field	host.API_version_minor	minor version number
Published field	host.API_version_vendor	identification of vendor
Published field	host.API_version_vendor_implementation	details of vendor implementation
Published field	host.PBDs	physical blockdevices
Published field	host.PIFs	physical network interfaces
Published field	host.address	The address by which this host can be contacted from any other host in the pool
Published field	host.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	host.capabilities	Xen capabilities
Published field	host.cpu_configuration	The CPU configuration on this host. May contain keys such as "nr_nodes", "sockets_per_node", "cores_per_socket", or "threads_per_core"
Published field	host.crash_dump_sr	The SR in which VDIs for crash dumps are created
Published field	host.crashdumps	Set of host crash dumps
Published field	host.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	host.enabled	True if the host is currently enabled
Published field	host.host_CPUs	The physical CPUs on this host
Published field	host.hostname	The hostname of this host
Published field	host.license_params	State of the current license
Published field	host.logging	logging configuration
Published field	host.memory_overhead	Virtualization memory overhead (bytes).
Published field	host.metrics	metrics associated with this host
Published field	host.name_description	a notes field containing human-readable description
Published field	host.name_label	a human-readable name
Published field	host.other_config	additional configuration
Published field	host.patches	Set of host patches
Published field	host.resident_VMs	list of VMs currently resident on host
Published field	host.sched_policy	Scheduler policy currently in force on this host
Published field	host.software_version	version strings
Published field	host.supported_bootloaders	a list of the bootloaders installed on the machine
Published field	host.suspend_image_sr	The SR in which VDIs for suspend images are created
Published field	host.uuid	Unique identifier/object reference
Published field	host_cpu.family	the family (number) of the physical CPU
Published field	host_cpu.features	the physical CPU feature bitmap
Published field	host_cpu.flags	the flags of the physical CPU (a decoded version of the features field)
Published field	host_cpu.host	the host the CPU is in
Published field	host_cpu.model	the model number of the physical CPU
Published field	host_cpu.modelname	the model name of the physical CPU
Published field	host_cpu.number	the number of the physical CPU within the host
Published field	host_cpu.speed	the speed of the physical CPU
Published field	host_cpu.stepping	the stepping of the physical CPU
Published field	host_cpu.utilisation	the current CPU utilisation
Published field	host_cpu.uuid	Unique identifier/object reference
Published field	host_cpu.vendor	the vendor of the physical CPU
Published field	host_crashdump.host	Host the crashdump relates to
Published field	host_crashdump.size	Size of the crashdump
Published field	host_crashdump.timestamp	Time the crash happened
Published field	host_crashdump.uuid	Unique identifier/object reference
Published field	host_metrics.last_updated	Time at which this information was last updated
Published field	host_metrics.live	Pool master thinks this host is live
Published field	host_metrics.memory_free	Free host memory (bytes)
Published field	host_metrics.memory_total	Total host memory (bytes)
Published field	host_metrics.uuid	Unique identifier/object reference
Published field	host_patch.applied	True if the patch has been applied
Published field	host_patch.host	Host the patch relates to
Published field	host_patch.name_description	a notes field containing human-readable description
Published field	host_patch.name_label	a human-readable name
Published field	host_patch.size	Size of the patch
Published field	host_patch.timestamp_applied	Time the patch was applied
Published field	host_patch.uuid	Unique identifier/object reference
Published field	host_patch.version	Patch version number
Published field	message.body	The body of the message
Published field	message.name	The name of the message
Published field	message.obj_uuid	The uuid of the object this message is associated with
Published field	message.priority	The message priority, 0 being low priority
Published field	message.timestamp	The time at which the message was created
Published field	message.uuid	Unique identifier/object reference
Published field	network.PIFs	list of connected pifs
Published field	network.VIFs	list of connected vifs
Published field	network.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	network.bridge	name of the bridge corresponding to this network on the local host
Published field	network.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	network.name_description	a notes field containing human-readable description
Published field	network.name_label	a human-readable name
Published field	network.other_config	additional configuration
Published field	network.uuid	Unique identifier/object reference
Published field	network_sriov.uuid	Unique identifier/object reference
Published field	pool.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	pool.coordinator_bias	true if bias against pool master when scheduling vms is enabled, false otherwise
Published field	pool.crash_dump_SR	The SR in which VDIs for crash dumps are created
Published field	pool.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	pool.default_SR	Default SR for VDIs
Published field	pool.master	The host that is pool master
Published field	pool.name_description	Description
Published field	pool.name_label	Short name
Published field	pool.other_config	additional configuration
Published field	pool.suspend_image_SR	The SR in which VDIs for suspend images are created
Published field	pool.uuid	Unique identifier/object reference
Published field	pool_patch.name_description	a notes field containing human-readable description
Published field	pool_patch.name_label	a human-readable name
Published field	pool_patch.uuid	Unique identifier/object reference
Published field	pool_update.name_description	a notes field containing human-readable description
Published field	pool_update.name_label	a human-readable name
Published field	pool_update.uuid	Unique identifier/object reference
Published field	pool_update.vdi	VDI the update was uploaded to
Published field	role.uuid	Unique identifier/object reference
Published field	secret.other_config	other_config
Published field	secret.uuid	Unique identifier/object reference
Published field	secret.value	the secret
Published field	session.last_active	Timestamp for last time session was active
Published field	session.pool	True if this session relates to an intra-pool login, false otherwise
Published field	session.this_host	Currently connected host
Published field	session.this_user	Currently connected user
Published field	session.uuid	Unique identifier/object reference
Published field	subject.uuid	Unique identifier/object reference
Published field	task.allowed_operations	list of the operations allowed in this state. This list is advisory only and the server state may have changed by the time this field is read by a client.
Published field	task.created	Time task was created
Published field	task.current_operations	links each of the running tasks using this object (by reference) to a current_operation enum which describes the nature of the task.
Published field	task.error_info	if the task has failed, this field contains the set of associated error strings. Undefined otherwise.
Published field	task.finished	Time task finished (i.e. succeeded or failed). If task-status is pending, then the value of this field has no meaning
Published field	task.name_description	a notes field containing human-readable description
Published field	task.name_label	a human-readable name
Published field	task.progress	This field contains the estimated fraction of the task which is complete. This field should not be used to determine whether the task is complete - for this the status field of the task should be used.
Published field	task.resident_on	the host on which the task is running
Published field	task.result	if the task has completed successfully, this field contains the result value (either Void or an object reference). Undefined otherwise.
Published field	task.status	current status of the task
Published field	task.type	if the task has completed successfully, this field contains the type of the encoded result (i.e. name of the class whose reference is in the result field). Undefined otherwise.
Published field	task.uuid	Unique identifier/object reference
Published field	user.fullname	full name
Published field	user.short_name	short name (e.g. userid)
Published field	user.uuid	Unique identifier/object reference
Published message	PBD.plug	Activate the specified PBD, causing the referenced SR to be attached and scanned
Published message	PBD.unplug	Deactivate the specified PBD, causing the referenced SR to be detached and no longer scanned
Published message	PIF.create_VLAN	Create a VLAN interface from an existing physical interface
Published message	PIF.destroy	Destroy the PIF object (provided it is a VLAN interface)
Published message	SR.create	Create a new Storage Repository and introduce it into the managed system, creating both SR record and PBD record to attach it to current host (with specified device_config parameters)
Published message	SR.destroy	Destroy the specified SR, removing the SR record from the database and the SR from disk. (In order to effect this operation, the appropriate device_config is read from the specified SR's PBD on the current host)
Published message	SR.forget	Remove the specified SR record from the database, without attempting to remove the SR from disk
Published message	SR.get_supported_types	Return a set of all the SR types supported by the system
Published message	SR.introduce	Introduce a new Storage Repository into the managed system
Published message	SR.make	Create a new Storage Repository on disk
Published message	SR.scan	Refreshes the list of VDIs associated with an SR
Published message	SR.set_name_description	Set the name description of the SR
Published message	SR.set_name_label	Set the name label of the SR
Published message	SR.set_shared	Sets the shared flag on the SR
Published message	VBD.assert_attachable	Throws an error if this VBD could not be attached to this VM if the VM were running. Intended for debugging.
Published message	VBD.eject	Remove the media from the device and leave it empty
Published message	VBD.insert	Insert new media into the device
Published message	VBD.plug	Hotplug the specified VBD, dynamically attaching it to the running VM
Published message	VBD.set_mode	Sets the mode of the VBD. The power_state of the VM must be halted.
Published message	VBD.unplug	Hot-unplug the specified VBD, dynamically unattaching it from the running VM
Published message	VBD.unplug_force	Forcibly unplug the specified VBD
Published message	VDI.clone	Take an exact copy of the VDI and return a reference to the new disk. If any driver_params are specified then these are passed through to the storage-specific substrate driver that implements the clone operation. NB the clone lives in the same Storage Repository as its parent.
Published message	VDI.copy	Copies a VDI to an SR. There must be a host that can see both the source and destination SRs simultaneously
Published message	VDI.forget	Removes a VDI record from the database
Published message	VDI.resize	Resize the VDI.
Published message	VDI.resize_online	Resize the VDI which may or may not be attached to running guests.
Published message	VDI.set_name_description	Set the name description of the VDI. This can only happen when its SR is currently attached.
Published message	VDI.set_name_label	Set the name label of the VDI. This can only happen when its SR is currently attached.
Published message	VDI.set_read_only	Sets the VDI's read_only field
Published message	VDI.snapshot	Take a read-only snapshot of the VDI, returning a reference to the snapshot. If any driver_params are specified then these are passed through to the storage-specific substrate driver that takes the snapshot. NB the snapshot lives in the same Storage Repository as its parent.
Published message	VIF.plug	Hotplug the specified VIF, dynamically attaching it to the running VM
Published message	VIF.unplug	Hot-unplug the specified VIF, dynamically unattaching it from the running VM
Published message	VM.add_to_VCPUs_params_live	Add the given key-value pair to VM.VCPUs_params, and apply that value on the running VM
Published message	VM.assert_can_boot_here	Returns an error if the VM could not boot on this host for some reason
Published message	VM.assert_operation_valid	Check to see whether this operation is acceptable in the current state of the system, raising an error if the operation is invalid for some reason
Published message	VM.clean_reboot	Attempt to cleanly shut down and reboot the specified VM. (Note: this may not be supported---e.g. if a guest agent is not installed). This can only be called when the specified VM is in the Running state.
Published message	VM.clean_shutdown	Attempt to cleanly shutdown the specified VM. (Note: this may not be supported---e.g. if a guest agent is not installed). This can only be called when the specified VM is in the Running state.
Published message	VM.clone	Clones the specified VM, making a new VM. Clone automatically exploits the capabilities of the underlying storage repository in which the VM's disk images are stored (e.g. Copy on Write). This function can only be called when the VM is in the Halted State.
Published message	VM.copy	Copies a VM to an SR. There must be a host that can see both the source and destination SRs simultaneously
Published message	VM.get_allowed_VBD_devices	Returns a list of the allowed values that a VBD device field can take
Published message	VM.get_allowed_VIF_devices	Returns a list of the allowed values that a VIF device field can take
Published message	VM.get_boot_record	Returns a record describing the VM's dynamic state, initialised when the VM boots and updated to reflect runtime configuration changes e.g. CPU hotplug
Published message	VM.get_possible_hosts	Return the list of hosts on which this VM may run.
Published message	VM.hard_reboot	Stop executing the specified VM without attempting a clean shutdown and immediately restart the VM.
Published message	VM.hard_shutdown	Stop executing the specified VM without attempting a clean shutdown.
Published message	VM.pause	Pause the specified VM. This can only be called when the specified VM is in the Running state.
Published message	VM.pool_migrate	Migrate a VM to another Host.
Published message	VM.power_state_reset	Reset the power-state of the VM to halted in the database only. (Used to recover from slave failures in pooling scenarios by resetting the power-states of VMs running on dead slaves to halted.) This is a potentially dangerous operation; use with care.
Published message	VM.provision	Inspects the disk configuration contained within the VM's other_config, creates VDIs and VBDs and then executes any applicable post-install script.
Published message	VM.resume	Awaken the specified VM and resume it. This can only be called when the specified VM is in the Suspended state.
Published message	VM.resume_on	Awaken the specified VM and resume it on a particular Host. This can only be called when the specified VM is in the Suspended state.
Published message	VM.send_sysrq	Send the given key as a sysrq to this VM. The key is specified as a single character (a String of length 1). This can only be called when the specified VM is in the Running state.
Published message	VM.send_trigger	Send the named trigger to this VM. This can only be called when the specified VM is in the Running state.
Published message	VM.set_HVM_boot_policy	Set the VM.HVM_boot_policy field of the given VM, which will take effect when it is next started
Published message	VM.set_VCPUs_number_live	Set the number of VCPUs for a running VM
Published message	VM.set_actions_after_crash	Sets the actions_after_crash parameter
Published message	VM.set_memory_target_live	Set the memory target for a running VM
Published message	VM.set_shadow_multiplier_live	Set the shadow memory multiplier on a running VM
Published message	VM.start	Start the specified VM. This function can only be called when the VM is in the Halted State.
Published message	VM.start_on	Start the specified VM on a particular host. This function can only be called when the VM is in the Halted State.
Published message	VM.suspend	Suspend the specified VM to disk. This can only be called when the specified VM is in the Running state.
Published message	VM.unpause	Resume the specified VM. This can only be called when the specified VM is in the Paused state.
Published message	VM.update_allowed_operations	Recomputes the list of acceptable operations
Published message	crashdump.destroy	Destroy the specified crashdump
Published message	event.get_current_id	Return the ID of the next event to be generated by the system
Published message	event.next	Blocking call which returns a (possibly empty) batch of events. This method is only recommended for legacy use. New development should use event.from which supersedes this method.
Published message	event.register	Registers this session with the event system for a set of given classes. This method is only recommended for legacy use in conjunction with event.next.
Published message	event.unregister	Removes this session's registration with the event system for a set of given classes. This method is only recommended for legacy use in conjunction with event.next.
Published message	host.bugreport_upload	Run xen-bugtool --yestoall and upload the output to support
Published message	host.destroy	Destroy specified host record in database
Published message	host.disable	Puts the host into a state in which no new VMs can be started. Currently active VMs on the host continue to execute.
Published message	host.dmesg	Get the host xen dmesg.
Published message	host.dmesg_clear	Get the host xen dmesg, and clear the buffer.
Published message	host.enable	Puts the host into a state in which new VMs can be started.
Published message	host.get_log	Get the host's log file
Published message	host.license_apply	Apply a new license to a host
Published message	host.list_methods	List all supported methods
Published message	host.reboot	Reboot the host. (This function can only be called if there are no currently running VMs on the host and it is disabled.)
Published message	host.restart_agent	Restarts the agent after a 10 second pause. WARNING: this is a dangerous operation. Any operations in progress will be aborted, and unrecoverable data loss may occur. The caller is responsible for ensuring that there are no operations in progress when this method is called.
Published message	host.send_debug_keys	Inject the given string as debugging keys into Xen
Published message	host.shutdown	Shutdown the host. (This function can only be called if there are no currently running VMs on the host and it is disabled.)
Published message	host_crashdump.destroy	Destroy specified host crash dump, removing it from the disk.
Published message	host_crashdump.upload	Upload the specified host crash dump to a specified URL
Published message	host_patch.apply	Apply the selected patch and return its output
Published message	host_patch.destroy	Destroy the specified host patch, removing it from the disk. This does NOT reverse the patch
Published message	pool.create_VLAN	Create PIFs, mapping a network to the same physical interface/VLAN on each host. This call is deprecated: use Pool.create_VLAN_from_PIF instead.
Published message	pool.create_VLAN_from_PIF	Create a pool-wide VLAN by taking the PIF.
Published message	pool.eject	Instruct a pool master to eject a host from the pool
Published message	pool.emergency_reset_master	Instruct a slave already in a pool that the master has changed
Published message	pool.emergency_transition_to_master	Instruct host that's currently a slave to transition to being master
Published message	pool.join	Instruct host to join a new pool
Published message	pool.join_force	Instruct host to join a new pool
Published message	pool.recover_slaves	Instruct a pool master, M, to try and contact its slaves and, if slaves are in emergency mode, reset their master address to M.
Published message	pool.sync_database	Forcibly synchronise the database now
Published message	session.change_password	Change the account password; if your session is authenticated with root privileges then the old_pwd is validated and the new_pwd is set regardless
Published message	session.login_with_password	Attempt to authenticate the user, returning a session reference if successful
Published message	session.logout	Log out of a session
Published message	task.cancel	Request that a task be cancelled. Note that a task may fail to be cancelled and may complete or fail normally and note that, even when a task does cancel, it might take an arbitrary amount of time.
Published message	task.create	Create a new task object which must be manually destroyed.
Published message	task.destroy	Destroy the task object

Topics

API for configuring the udhcp server in Dom0

This API allows you to configure the DHCP service running on the Host Internal Management Network (HIMN). The API configures a udhcp daemon residing in Dom0 and alters the service configuration for any VM using the network.

For this reason, callers who modify the default configuration should be aware that their changes may have an adverse effect on other consumers of the HIMN.

Version history

Date        State
----        ----
2013-3-15   Stable

Stable: this API is considered stable and unlikely to change between software versions or between hotfixes.

API description

The API for configuring the network is based on a series of other_config keys that can be set by the caller on the HIMN XAPI network object. Once any of the keys below have been set, the caller must ensure that any VIFs attached to the HIMN are removed, destroyed, created and plugged.

ip_begin

The first IP address in the desired subnet that the caller wishes the DHCP service to use.

ip_end

The last IP address in the desired subnet that the caller wishes the DHCP service to use.

netmask

The subnet mask for each of the issued IP addresses.

ip_disable_gw

A boolean key for disabling the DHCP server from returning a default gateway for VMs on the network. To disable returning the gateway address set the key to True.

Note: By default, the DHCP server will issue a default gateway for those requesting an address. Setting this key may disrupt applications that require the default gateway for communicating with Dom0 and so should be used with care.

Example code

An example Python extract that sets the configuration for the network (session is assumed to be a logged-in XenAPI session):

def get_himn_ref():
    # Return the reference of the Host Internal Management Network (HIMN).
    networks = session.xenapi.network.get_all_records()
    for ref, rec in networks.items():
        if 'is_host_internal_management_network' in rec['other_config']:
            return ref

    raise Exception("Error: unable to find HIMN.")


himn_ref = get_himn_ref()
other_config = session.xenapi.network.get_other_config(himn_ref)

other_config['ip_begin'] = "169.254.0.1"
other_config['ip_end'] = "169.254.255.254"
other_config['netmask'] = "255.255.0.0"

session.xenapi.network.set_other_config(himn_ref, other_config)

An example for how to disable the server returning a default gateway:

himn_ref = get_himn_ref()
other_config = session.xenapi.network.get_other_config(himn_ref)

other_config['ip_disable_gw'] = True

session.xenapi.network.set_other_config(himn_ref, other_config)
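Because a wrong range affects every consumer of the HIMN, it can help to check the values before writing them. The following is a minimal sketch, not part of the API: the helper name build_himn_dhcp_config is hypothetical, and the checks (both addresses in the same subnet, begin not after end) are assumptions about what a sensible configuration looks like rather than rules enforced by xapi.

```python
import ipaddress


def build_himn_dhcp_config(ip_begin, ip_end, netmask):
    """Validate a DHCP range and return it as string-valued other_config keys."""
    begin = ipaddress.IPv4Address(ip_begin)
    end = ipaddress.IPv4Address(ip_end)
    # strict=False derives the subnet from a host address plus a netmask.
    subnet = ipaddress.IPv4Network(f"{ip_begin}/{netmask}", strict=False)
    if end not in subnet:
        raise ValueError(f"{ip_end} is not in {subnet}")
    if begin > end:
        raise ValueError("ip_begin must not be greater than ip_end")
    return {
        "ip_begin": str(begin),
        "ip_end": str(end),
        "netmask": str(subnet.netmask),
    }


config = build_himn_dhcp_config("169.254.0.1", "169.254.255.254", "255.255.0.0")
```

The returned dictionary can be merged into the map obtained from network.get_other_config before calling network.set_other_config.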

Guest agents

“Guest agents” are special programs which run inside VMs which can be controlled via the XenAPI.

One method of communication between XenAPI clients and guest agents is via Xenstore.

Adding Xenstore entries to VMs

Developers may wish to install guest agents into VMs which take special action based on the type of the VM. In order to communicate this information into the guest, a special Xenstore name-space known as vm-data is available which is populated at VM creation time. It is populated from the xenstore-data map in the VM record.

Set the xenstore-data parameter in the VM record:

xe vm-param-set uuid=<vm_uuid> xenstore-data:vm-data/foo=bar

Start the VM.

If it is a Linux-based VM, install the COMPANY_TOOLS and use xenstore-read to verify that the node exists in Xenstore.

Note
Only prefixes beginning with vm-data are permitted, and anything not in this name-space will be silently ignored when starting the VM.
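Since entries outside the name-space are dropped without an error, a client may want to check its keys before setting them. The sketch below only illustrates the rule stated in the note; the function name is hypothetical, and the exact matching rule (first path component equal to vm-data) is an assumption, not a description of xapi's implementation.

```python
def split_xenstore_data(xenstore_data):
    """Split a xenstore-data map into (kept, ignored) following the vm-data rule."""
    kept, ignored = {}, {}
    for key, value in xenstore_data.items():
        # Assumption: a key is accepted when its first path component is vm-data.
        if key == "vm-data" or key.startswith("vm-data/"):
            kept[key] = value
        else:
            ignored[key] = value
    return kept, ignored


kept, ignored = split_xenstore_data({"vm-data/foo": "bar", "other/key": "x"})
```

A client could log or reject the ignored entries instead of discovering later that they never reached the guest.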

Memory

Memory is used for many things:

the hypervisor code: this is the Xen executable itself
the hypervisor heap: this is needed for per-domain structures and per-vCPU structures
the crash kernel: this is needed to collect information after a host crash
domain RAM: this is the memory the VM believes it has
shadow memory: for HVM guests running on hosts without hardware assisted paging (HAP) Xen uses shadow to optimise page table updates. For all guests shadow is used during live migration for tracking the memory transfer.
video RAM for the virtual graphics card

Some of these are constants (e.g. hypervisor code) while some depend on the VM configuration (e.g. domain RAM). Xapi calls the constants “host overhead” and the quantities that depend on VM configuration “VM overhead”. These overheads are subtracted from free memory on the host when starting, resuming and migrating VMs.
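The accounting described above can be illustrated with a small calculation. This is a sketch under stated assumptions: the function name and all figures are made up for illustration, and each VM is modelled simply as a pair of domain RAM and VM overhead. It does not reproduce how xapi computes the overheads themselves.

```python
MIB = 1024 * 1024


def free_after_start(host_total, host_overhead, running_vms, new_vm):
    """Return the free bytes left after starting new_vm.

    Each VM is a (domain_ram, vm_overhead) pair in bytes.
    Raises ValueError if the new VM does not fit.
    """
    used = host_overhead + sum(ram + overhead for ram, overhead in running_vms)
    free = host_total - used
    needed = sum(new_vm)
    if needed > free:
        raise ValueError(f"need {needed} bytes but only {free} are free")
    return free - needed


# Hypothetical figures, in MiB.
remaining = free_after_start(
    host_total=16384 * MIB,
    host_overhead=1024 * MIB,
    running_vms=[(4096 * MIB, 40 * MIB), (2048 * MIB, 24 * MIB)],
    new_vm=(8192 * MIB, 72 * MIB),
)
```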

Metrics

xcp-rrdd records statistics about the host and the VMs running on top. The metrics are stored persistently for long-term access and analysis of historical trends. Statistics are stored in RRDs (Round Robin Databases). RRDs are fixed-size structures that store time series with decreasing time resolution: the older the data point is, the longer the timespan it represents. ‘Data sources’ are sampled every few seconds and points are added to the highest resolution RRD. Periodically each high-frequency RRD is ‘consolidated’ (e.g. averaged) to produce a data point for a lower-frequency RRD.

RRDs are resident on the host on which the VM is running, or the pool coordinator when the VM is not running. The RRDs are backed up every day.

Granularity

Statistics are persisted for a maximum of one year, and are stored at different granularities. The average and most recent values are stored at intervals of:

five seconds for the past ten minutes
one minute for the past two hours
one hour for the past week
one day for the past year
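The intervals above determine how many rows each archive holds and how many base samples are consolidated into one row. The following sketch works through that arithmetic. It assumes a 5-second base step (as in the step element of the RRD dump format) and a 365-day year; the archive sizes actually configured by xcp-rrdd may differ.

```python
BASE_STEP = 5  # seconds between samples in the highest-resolution archive

# (name, interval between rows in seconds, time span covered in seconds)
ARCHIVES = [
    ("five seconds", 5, 10 * 60),
    ("one minute", 60, 2 * 60 * 60),
    ("one hour", 60 * 60, 7 * 24 * 60 * 60),
    ("one day", 24 * 60 * 60, 365 * 24 * 60 * 60),
]


def archive_shapes(archives=ARCHIVES, base_step=BASE_STEP):
    """Return the row count and samples-per-row for each archive."""
    return [
        {
            "name": name,
            "rows": span // interval,
            "pdp_per_row": interval // base_step,
        }
        for name, interval, span in archives
    ]


shapes = archive_shapes()
total_rows = sum(shape["rows"] for shape in shapes)
```

Under these assumptions the four archives hold 120, 120, 168 and 365 rows, which shows why an RRD stays a fixed size regardless of how long it has been recording.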

RRDs are saved to disk as uncompressed XML. The size of each RRD when written to disk ranges from 200KiB to approximately 1.2MiB when the RRD stores the full year of statistics.

By default each RRD contains only averaged data to save storage space. To record minimum and maximum values in future RRDs, set the pool-wide flag:

xe pool-param-set uuid=<pool_uuid> other-config:create_min_max_in_new_VM_RRDs=true

Downloading

Statistics can be downloaded over HTTP in XML or JSON format, for example using wget. See rrddump and rrdxport for information about the XML format. The JSON format has the same structure as the XML. Parameters are appended to the URL following a question mark (?) and separated by ampersands (&). HTTP authentication can take the form of a username and password or a session token in a URL parameter.

Statistics may be downloaded all at once, including all history, or as deltas suitable for interactive graphing.

Downloading statistics all at once

To obtain a full dump of RRD data for a host use:

wget "http://hostname/host_rrd?session_id=OpaqueRef:43df3204-9360-c6ab-923e-41a8d19389ba"

where the session token has been fetched from the server using the API.

For example, using Python’s XenAPI library:

import XenAPI

username = "root"
password = "actual_password"
url = "http://hostname"

# Log in, then print the opaque session reference to pass as session_id.
session = XenAPI.Session(url)
session.xenapi.login_with_password(username, password, "1.0", "session_getter")
print(session._session)

A URL parameter is used to decide which format to return: XML is returned by default, and adding the parameter json makes the server return JSON. Starting from xapi version 23.17.0, the server uses the HTTP Accept header to decide which format to return. When both formats are accepted, for example with */*, JSON is returned. The clients wget and curl send this Accept header value by default, so with these versions their default output changes and the Accept header must be overridden to make the server return XML. In these newer versions the content type is provided in the response’s headers.
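A client that wants a predictable format on both older and newer xapi versions can set the URL parameter and the Accept header together. The sketch below only builds the request object and does not send it. The function name is hypothetical, and the value given to the json parameter and the two media types are assumptions that should be checked against the xapi version in use.

```python
import urllib.parse
import urllib.request

# Assumed media types; verify them against the server before relying on them.
XML_ACCEPT = "application/xml"
JSON_ACCEPT = "application/json"


def build_host_rrd_request(host, session_ref, want_json=False):
    """Build (but do not send) a request for the full host RRD."""
    query = {"session_id": session_ref}
    if want_json:
        # The text only says the parameter must be present; "true" is an assumption.
        query["json"] = "true"
    url = f"http://{host}/host_rrd?{urllib.parse.urlencode(query)}"
    accept = JSON_ACCEPT if want_json else XML_ACCEPT
    return urllib.request.Request(url, headers={"Accept": accept})


request = build_host_rrd_request("hostname", "OpaqueRef:abc")
```

Passing the request to urllib.request.urlopen would perform the download.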

The XML RRD data is in the format used by rrdtool and looks like this:

<?xml version="1.0"?>
<rrd>
  <version>0003</version>
  <step>5</step>
  <lastupdate>1213616574</lastupdate>
  <ds>
    <name>memory_total_kib</name>
    <type>GAUGE</type>
    <minimal_heartbeat>300.0000</minimal_heartbeat>
    <min>0.0</min>
    <max>Infinity</max>
    <last_ds>2070172</last_ds>
    <value>9631315.6300</value>
    <unknown_sec>0</unknown_sec>
  </ds>
  <ds>
   <!-- other dss - the order of the data sources is important
        and defines the ordering of the columns in the archives below -->
  </ds>
  <rra>
    <cf>AVERAGE</cf>
    <pdp_per_row>1</pdp_per_row>
     <params>
      <xff>0.5000</xff>
    </params>
    <cdp_prep> <!-- This is for internal use -->
      <ds>
        <primary_value>0.0</primary_value>
        <secondary_value>0.0</secondary_value>
        <value>0.0</value>
        <unknown_datapoints>0</unknown_datapoints>
      </ds>
      ...other dss - internal use only...
    </cdp_prep>
    <database>
     <row>
        <v>2070172.0000</v>  <!-- columns correspond to the DSs defined above -->
        <v>1756408.0000</v>
        <v>0.0</v>
        <v>0.0</v>
        <v>732.2130</v>
        <v>0.0</v>
        <v>782.9186</v>
        <v>0.0</v>
        <v>647.0431</v>
        <v>0.0</v>
        <v>0.0001</v>
        <v>0.0268</v>
        <v>0.0100</v>
        <v>0.0</v>
        <v>615.1072</v>
     </row>
     ...
    </database>
  </rra>
  ... other archives ...
</rrd>
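The column ordering rule in this format (the order of the top-level ds elements defines the order of the v elements in each row) can be sketched with Python's standard XML parser. The sample document here is a cut-down, well-formed version of the dump format, and memory_free_kib is used only as an illustrative second data source name.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<rrd>
  <version>0003</version>
  <step>5</step>
  <lastupdate>1213616574</lastupdate>
  <ds><name>memory_total_kib</name><type>GAUGE</type></ds>
  <ds><name>memory_free_kib</name><type>GAUGE</type></ds>
  <rra>
    <cf>AVERAGE</cf>
    <pdp_per_row>1</pdp_per_row>
    <database>
      <row><v>2070172.0000</v><v>1756408.0000</v></row>
    </database>
  </rra>
</rrd>"""


def parse_rrd(xml_text):
    """Return (data source names, archives) from an RRD XML dump."""
    root = ET.fromstring(xml_text)
    step = int(root.findtext("step"))
    # Only direct <ds> children define columns; <cdp_prep>/<ds> is internal.
    names = [ds.findtext("name") for ds in root.findall("ds")]
    archives = []
    for rra in root.findall("rra"):
        rows = [
            dict(zip(names, (float(v.text) for v in row.findall("v"))))
            for row in rra.find("database").findall("row")
        ]
        archives.append({
            "cf": rra.findtext("cf"),
            "seconds_per_row": step * int(rra.findtext("pdp_per_row")),
            "rows": rows,
        })
    return names, archives


names, archives = parse_rrd(SAMPLE)
```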

To obtain a full dump of RRD data of a VM with uuid x:

wget "http://hostname/vm_rrd?session_id=<token>&uuid=x"

Note that it is quite expensive to download full RRDs as they contain lots of historical information. For interactive displays clients should download deltas instead.

Downloading deltas

To obtain an update of all VM statistics on a host, the URL would be of the form:

wget "https://hostname/rrd_updates?session_id=<token>&start=<secondssinceepoch>"

This request returns data in an rrdtool xport style XML format, for every VM resident on the particular host that is being queried. To differentiate which column in the export is associated with which VM, the legend field is prefixed with the UUID of the VM.

An example rrd_updates output:

<xport>
  <meta>
    <start>1213578000</start>
    <step>3600</step>
    <end>1213617600</end>
    <rows>12</rows>
    <columns>12</columns>
    <legend>
      <entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry> <!-- nb - each data source might have multiple entries for different consolidation functions -->
      <entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
      <entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
      <entry>MIN:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry>
      <entry>MIN:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
      <entry>MIN:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
      <entry>MAX:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry>
      <entry>MAX:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
      <entry>MAX:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
      <entry>LAST:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry>
      <entry>LAST:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
      <entry>LAST:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
    </legend>
  </meta>
  <data>
    <row>
      <t>1213617600</t>
      <v>0.0</v> <!-- once again, the order of the columns is defined by the legend above -->
      <v>0.0282</v>
      <v>209715200.0000</v>
      <v>0.0</v>
      <v>0.0201</v>
      <v>209715200.0000</v>
      <v>0.0</v>
      <v>0.0445</v>
      <v>209715200.0000</v>
      <v>0.0</v>
      <v>0.0243</v>
      <v>209715200.0000</v>
    </row>
   ...
  </data>
</xport>
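To attribute each value to a VM, a client has to pair the legend entries with the columns of every row. The sketch below does this for a shortened sample of the output above; the function name is hypothetical, and it assumes every legend entry has the four colon-separated parts seen in the example (consolidation function, object kind, UUID, data source name).

```python
import xml.etree.ElementTree as ET

SAMPLE = """<xport>
  <meta>
    <start>1213578000</start>
    <step>3600</step>
    <end>1213617600</end>
    <rows>1</rows>
    <columns>2</columns>
    <legend>
      <entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
      <entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
    </legend>
  </meta>
  <data>
    <row>
      <t>1213617600</t>
      <v>0.0282</v>
      <v>209715200.0000</v>
    </row>
  </data>
</xport>"""


def parse_rrd_updates(xml_text):
    """Return (timestamp, cf, kind, uuid, name, value) tuples, one per cell."""
    root = ET.fromstring(xml_text)
    # Split at most three times so a data source name may itself contain ':'.
    legend = [
        tuple(entry.text.split(":", 3))
        for entry in root.findall("./meta/legend/entry")
    ]
    samples = []
    for row in root.findall("./data/row"):
        timestamp = int(row.findtext("t"))
        for key, v in zip(legend, row.findall("v")):
            samples.append((timestamp, *key, float(v.text)))
    return samples


samples = parse_rrd_updates(SAMPLE)
```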

To obtain host updates too, use the query parameter host=true:

wget "http://hostname/rrd_updates?session_id=<token>&start=<secondssinceepoch>&host=true"

The step will decrease as the period decreases, which means that if you request statistics for a shorter time period you will get more detailed statistics.

To download updates containing only the averages, or minimums or maximums, add the parameter cf=AVERAGE|MIN|MAX (note case is important) e.g.

wget "http://hostname/rrd_updates?session_id=<token>&start=0&cf=MAX"

To request a different update interval, add the parameter interval=seconds e.g.

wget "http://hostname/rrd_updates?session_id=<token>&start=0&interval=5"
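The query parameters described in this section can be combined in one helper. This is a minimal sketch with a hypothetical function name; it only builds the URL, and it rejects consolidation function names that are not in the exact upper-case form the text requires.

```python
from urllib.parse import urlencode

VALID_CFS = ("AVERAGE", "MIN", "MAX")


def rrd_updates_url(host, session_ref, start, include_host=False, cf=None, interval=None):
    """Build an rrd_updates URL from the parameters described above."""
    params = {"session_id": session_ref, "start": int(start)}
    if include_host:
        params["host"] = "true"
    if cf is not None:
        # The parameter is case-sensitive, so refuse anything but the exact names.
        if cf not in VALID_CFS:
            raise ValueError(f"cf must be one of {VALID_CFS}, got {cf!r}")
        params["cf"] = cf
    if interval is not None:
        params["interval"] = int(interval)
    return f"http://{host}/rrd_updates?{urlencode(params)}"


url = rrd_updates_url("hostname", "token", 0, cf="MAX")
```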

Snapshots

Snapshots represent the state of a VM, or a disk (VDI) at a point in time. They can be used for:

backups (hourly, daily, weekly etc)
experiments (take snapshot, try something, revert back again)
golden images (install OS, get it just right, clone it 1000s of times)

Read more about Snapshots: the High-Level Feature.

Taking a VDI snapshot

To take a snapshot of a single disk (VDI):

snapshot_vdi <- VDI.snapshot(session_id, vdi, driver_params)

where vdi is the reference to the disk to be snapshotted, and driver_params is a list of string pairs providing optional backend implementation-specific hints. The snapshot operation should be quick (i.e. it should never be implemented as a slow disk copy) and the resulting VDI will have

Field name	Description
is_a_snapshot	a flag, set to true, indicating the disk is a snapshot
snapshot_of	a reference to the disk the snapshot was created from
snapshot_time	the time the snapshot was taken

The resulting snapshot should be considered read-only. Depending on the backend implementation it may be technically possible to write to the snapshot, but clients must not do this. To create a writable disk from a snapshot, see “Restoring to a VDI snapshot” below.

Note that the storage backend is free to implement this in different ways. We do not assume the presence of a .vhd-formatted storage repository. Clients must never assume anything about the backend implementation without checking first with the maintainers of the backend implementation.

Restoring to a VDI snapshot

To restore from a VDI snapshot, first clone it:

new_vdi <- VDI.clone(session_id, snapshot_vdi, driver_params)

where snapshot_vdi is a reference to the snapshot VDI, and driver_params is a list of string pairs providing optional backend implementation-specific hints. The clone operation should be quick (i.e. it should never be implemented as a slow disk copy) and the resulting VDI will have

Field name	Description
is_a_snapshot	a flag, set to false, indicating the disk is not a snapshot
snapshot_of	an invalid reference
snapshot_time	an invalid time

The resulting disk is writable and can be used by the client as normal.

Note that the “restored” VDI will have a different VDI.uuid and reference from the original VDI.
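With the Python XenAPI bindings the session reference is passed implicitly, so the two calls above take only the VDI reference and the driver parameters. The following sketch assumes session is a logged-in XenAPI.Session; the function name is hypothetical and the final check merely restates the table above.

```python
def snapshot_and_restore(session, vdi, driver_params=None):
    """Snapshot vdi, then clone the snapshot into a new writable VDI.

    Returns (snapshot_vdi, new_vdi) references.
    """
    params = driver_params or {}
    snapshot_vdi = session.xenapi.VDI.snapshot(vdi, params)
    new_vdi = session.xenapi.VDI.clone(snapshot_vdi, params)
    # The clone is expected to be an ordinary disk, not a snapshot.
    if session.xenapi.VDI.get_is_a_snapshot(new_vdi):
        raise RuntimeError("clone unexpectedly reported as a snapshot")
    return snapshot_vdi, new_vdi
```

Because the restored disk has a new reference, any VBD that pointed at the original VDI has to be re-created against new_vdi by the caller.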

Taking a VM snapshot

A VM snapshot is a copy of the VM metadata and a snapshot of all the associated VDIs at around the same point in time. To take a VM snapshot:

snapshot_vm <- VM.snapshot(session_id, vm, new_name)

where vm is a reference to the existing VM and new_name will be the name_label of the resulting VM (snapshot) object. The resulting VM will have

Field name	Description
is_a_snapshot	a flag, set to true, indicating the VM is a snapshot
snapshot_of	a reference to the VM the snapshot was created from
snapshot_time	the time the snapshot was taken

Note that each disk is snapshotted one-by-one and not at the same time.

Restoring to a VM snapshot

A VM can be reverted to a snapshot using

VM.revert(session_id, snapshot_ref)

where snapshot_ref is a reference to the snapshot VM. Each VDI associated with the VM before the snapshot will be destroyed and each VDI associated with the snapshot will be cloned (see “Restoring to a VDI snapshot” above) and associated with the VM. The resulting VM will have

Field name	Description
is_a_snapshot	a flag, set to false, indicating the VM is not a snapshot
snapshot_of	an invalid reference
snapshot_time	an invalid time

Note that the VM.uuid and reference are preserved, but the VDI.uuid and VDI references are not.
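The VM-level calls can be wrapped in the same way. The sketch below assumes session is a logged-in XenAPI.Session and uses hypothetical helper names; the checks only restate the field tables above and are not required by the API.

```python
def snapshot_vm(session, vm, new_name):
    """Take a VM snapshot and confirm it refers back to the original VM."""
    snapshot_ref = session.xenapi.VM.snapshot(vm, new_name)
    record = session.xenapi.VM.get_record(snapshot_ref)
    if not record["is_a_snapshot"] or record["snapshot_of"] != vm:
        raise RuntimeError("unexpected snapshot record")
    return snapshot_ref


def revert_vm(session, snapshot_ref):
    """Revert the snapshot's parent VM and return the parent's reference."""
    # Look up the parent first: its reference is preserved by the revert.
    vm = session.xenapi.VM.get_snapshot_of(snapshot_ref)
    session.xenapi.VM.revert(snapshot_ref)
    return vm
```

After revert_vm returns, a client that cached VDI references for the VM should fetch them again, since they are not preserved.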

Downloading a disk or snapshot

Disks can be downloaded in either raw or vhd format using an HTTP 1.0 GET request as follows:

GET /export_raw_vdi?session_id=%s&task_id=%s&vdi=%s&format=%s[&base=%s] HTTP/1.0\r\n
Connection: close\r\n
\r\n
\r\n

where

session_id is a currently logged-in session
task_id is a Task reference which will be used to monitor the progress of this task and receive errors from it
vdi is the reference of the VDI from which the data will be exported
format is either vhd or raw
(optional) base is the reference of a VDI which has already been exported; if given, this export will only contain the blocks which have changed since then.

Note that the vhd format allows the disk to be sparse i.e. only contain allocated blocks. This helps reduce the size of the download.
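The request line above can be assembled as follows. This is a sketch with a hypothetical function name; it only produces the bytes to send and ends the header block with a single empty line. Opening the connection, following redirects and reading the response are left out.

```python
from urllib.parse import urlencode


def export_raw_vdi_request(session_id, task_id, vdi, fmt, base=None):
    """Return the raw HTTP/1.0 request bytes for downloading a disk."""
    if fmt not in ("raw", "vhd"):
        raise ValueError("format must be 'raw' or 'vhd'")
    params = {
        "session_id": session_id,
        "task_id": task_id,
        "vdi": vdi,
        "format": fmt,
    }
    if base is not None:
        # Only blocks changed since the base VDI are included in the export.
        params["base"] = base
    return (
        f"GET /export_raw_vdi?{urlencode(params)} HTTP/1.0\r\n"
        "Connection: close\r\n"
        "\r\n"
    ).encode("ascii")


request = export_raw_vdi_request("s", "t", "v", "vhd", base="b")
```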

The xapi-project/xen-api repo has a python download example

Uploading a disk or snapshot

Disks can be uploaded in either raw or vhd format using an HTTP 1.0 PUT request as follows:

PUT /import_raw_vdi?session_id=%s&task_id=%s&vdi=%s&format=%s HTTP/1.0\r\n
Connection: close\r\n
\r\n
\r\n

where

session_id is a currently logged-in session
task_id is a Task reference which will be used to monitor the progress of this task and receive errors from it
vdi is the reference of the VDI into which the data will be imported
format is either vhd or raw

Note that you must create the disk (with the correct size) before importing data to it. The disk doesn’t have to be empty, in fact if restoring from a series of incremental downloads it makes sense to upload them all to the same disk in order.

Example: incremental backup with xe

This section will show how easy it is to build an incremental backup tool using these APIs. For simplicity we will use the xe commands rather than raw XMLRPC and HTTP.

For a VDI with uuid $VDI, take a snapshot:

FULL=$(xe vdi-snapshot uuid=$VDI)

Next perform a full backup into a file “full.vhd”, in vhd format:

xe vdi-export uuid=$FULL filename=full.vhd format=vhd  --progress

If the SR was using the vhd format internally (this is the default) then the full backup will be sparse and will only contain blocks if they have been written to.

After some time has passed and the VDI has been written to, take another snapshot:

DELTA=$(xe vdi-snapshot uuid=$VDI)

Now we can back up only the disk blocks which have changed between the original snapshot $FULL and the next snapshot $DELTA into a file called “delta.vhd”:

xe vdi-export uuid=$DELTA filename=delta.vhd format=vhd base=$FULL --progress

We now have 2 files on the local system:

“full.vhd”: a complete backup of the first snapshot
“delta.vhd”: an incremental backup of the second snapshot, relative to the first

For example:

test $ ls -lh *.vhd
-rw------- 1 dscott xendev 213M Aug 15 10:39 delta.vhd
-rw------- 1 dscott xendev 8.0G Aug 15 10:39 full.vhd

To restore the original snapshot you must create an empty disk with the correct size. To find the size of a .vhd file use qemu-img as follows:

test $ qemu-img info delta.vhd
image: delta.vhd
file format: vpc
virtual size: 24G (25769705472 bytes)
disk size: 212M

Here the size is 25769705472 bytes. Create a fresh VDI in SR $SR to restore the backup as follows:

SIZE=25769705472
RESTORE=$(xe vdi-create name-label=restored virtual-size=$SIZE sr-uuid=$SR type=user)

then import “full.vhd” into it:

xe vdi-import uuid=$RESTORE filename=full.vhd format=vhd --progress

Once “full.vhd” has been imported, the incremental backup can be restored on top:

xe vdi-import uuid=$RESTORE filename=delta.vhd format=vhd --progress

Note there is no need to supply a “base” parameter when importing; Xapi will treat the “vhd differencing disk” as a set of blocks and import them. It is up to you to check you are importing them to the right place.

Now the VDI $RESTORE should have the same contents as $DELTA.
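When scripting the restore, the virtual size can be extracted from the qemu-img info output rather than copied by hand. The sketch below parses the output shown earlier in this section; the function name is hypothetical, and it assumes the byte count appears in parentheses on the “virtual size” line as in that output.

```python
import re

QEMU_IMG_INFO = """image: delta.vhd
file format: vpc
virtual size: 24G (25769705472 bytes)
disk size: 212M
"""


def virtual_size_bytes(info_text):
    """Extract the virtual size in bytes from `qemu-img info` output."""
    match = re.search(r"^virtual size:.*\((\d+) bytes\)", info_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'virtual size' line found")
    return int(match.group(1))


size = virtual_size_bytes(QEMU_IMG_INFO)
```

The resulting number is the value to pass as virtual-size to xe vdi-create.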

VM consoles

Most XenAPI graphical interfaces will want to gain access to the VM consoles, in order to render them to the user as if they were physical machines. There are several types of consoles available, depending on the type of guest or if the physical host console is being accessed:

Types of consoles

Operating System	Text	Graphical	Optimized graphical
Windows	No	VNC, using an API call	RDP, directly from guest
Linux	Yes, through VNC and an API call	No	VNC, directly from guest
Physical Host	Yes, through VNC and an API call	No	No

Hardware-assisted VMs, such as Windows, directly provide a graphical console over VNC. There is no text-based console, and guest networking is not necessary to use the graphical console. Once guest networking has been established, it is more efficient to set up Remote Desktop Access and use an RDP client to connect directly (this must be done outside of the XenAPI).

Paravirtual VMs, such as Linux guests, provide a native text console directly. XenServer provides a utility (called vncterm) to convert this text-based console into a graphical VNC representation. Guest networking is not necessary for this console to function. As with Windows above, Linux distributions often configure VNC within the guest, and directly connect to it over a guest network interface.

The physical host console is only available as a vt100 console, which is exposed through the XenAPI as a VNC console by using vncterm in the control domain.

RFB (Remote Framebuffer) is the protocol which underlies VNC, specified in The RFB Protocol. Third-party developers are expected to provide their own VNC viewers, and many freely available implementations can be adapted for this purpose. RFB 3.3 is the minimum version which viewers must support.

Retrieving VNC consoles using the API

VNC consoles are retrieved using a special URL passed through to the host agent. The sequence of API calls is as follows:

Client to Master/443: XML-RPC: Session.login_with_password().
Master/443 to Client: Returns a session reference to be used with subsequent calls.
Client to Master/443: XML-RPC: VM.get_by_name_label().
Master/443 to Client: Returns a reference to a particular VM (or the “control domain” if you want to retrieve the physical host console).
Client to Master/443: XML-RPC: VM.get_consoles().
Master/443 to Client: Returns a list of console objects associated with the VM.
Client to Master/443: XML-RPC: console.get_location().
Master/443 to Client: Returns a URI describing where the requested console is located. The URIs are of the form: https://192.168.0.1/console?ref=OpaqueRef:c038533a-af99-a0ff-9095-c1159f2dc6a0.
Client to 192.168.0.1: HTTP CONNECT “/console?ref=(…)”

The final HTTP CONNECT is slightly non-standard since the HTTP/1.1 RFC specifies that it should only be a host and a port, rather than a URL. Once the HTTP CONNECT is complete, the connection can be used directly as a connection to a VNC server, without any further HTTP protocol action.
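The last step of the sequence can be sketched by turning the console location into the CONNECT request to send. The function name is hypothetical, and passing the session reference as a session_id query parameter is an assumption about how the request is authenticated; the sketch does not open a socket.

```python
from urllib.parse import urlsplit


def console_connect_request(location, session_ref):
    """Return (host, port, request_bytes) for a console location URI."""
    parts = urlsplit(location)
    port = parts.port or (443 if parts.scheme == "https" else 80)
    # Non-standard CONNECT: the target is a path and query, not host:port.
    target = f"{parts.path}?{parts.query}&session_id={session_ref}"
    request = f"CONNECT {target} HTTP/1.0\r\n\r\n".encode("ascii")
    return parts.hostname, port, request


host, port, request = console_connect_request(
    "https://192.168.0.1/console?ref=OpaqueRef:c038533a-af99-a0ff-9095-c1159f2dc6a0",
    "OpaqueRef:1234",
)
```

A client would open a TLS connection to host and port, send the request, read the HTTP response headers, and then hand the same connection to its RFB implementation.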

This scheme requires direct access from the client to the control domain’s IP, and will not work correctly if there are Network Address Translation (NAT) devices blocking such connectivity. You can use the CLI to retrieve the console URI from the client and perform a connectivity check.

Retrieve the VM UUID by running:

$ VM=$(xe vm-list params=uuid --minimal name-label=<name>)

Retrieve the console information:

$ xe console-list vm-uuid=$VM
uuid ( RO)             : 8013b937-ff7e-60d1-ecd8-e52d66c5879e
          vm-uuid ( RO): 2d7c558a-8f03-b1d0-e813-cbe7adfa534c
    vm-name-label ( RO): 6
         protocol ( RO): RFB
         location ( RO): https://10.80.228.30/console?uuid=8013b937-ff7e-60d1-ecd8-e52d66c5879e

Use command-line utilities like ping to test connectivity to the IP address provided in the location field.

Disabling VNC forwarding for Linux VM

When creating and destroying Linux VMs, the host agent automatically manages the vncterm processes which convert the text console into VNC. Advanced users who wish to directly access the text console can disable VNC forwarding for that VM. The text console can then only be accessed directly from the control domain, and graphical interfaces such as XenCenter will not be able to render a console for that VM.

Before starting the guest, set the following parameter on the VM record:

$ xe vm-param-set uuid=$VM other-config:disable_pv_vnc=1

Start the VM.

Use the CLI to retrieve the underlying domain ID of the VM with:

$ DOMID=$(xe vm-list params=dom-id uuid=$VM --minimal)

On the host console, connect to the text console directly by:

$ /usr/lib/xen/bin/xenconsole $DOMID

This configuration is an advanced procedure, and we do not recommend that the text console is directly used for heavy I/O operations. Instead, connect to the guest over SSH or some other network-based connection mechanism.

VM import/export

VMs can be exported to a file and later imported to any Xapi host. The export protocol is a simple HTTP(S) GET, which should be sent to the Pool master. Authorization is either via a pre-created session_id or by HTTP basic authentication (particularly useful on the command-line). The VM to export is specified either by UUID or by reference. To keep track of the export, a task can be created and passed in using its reference. Note that Xapi may send an HTTP redirect if a different host has better access to the disk data.

The following arguments are passed as URI query parameters or HTTP cookies:

Argument	Description
session_id	the reference of the session being used to authenticate; required only when not using HTTP basic authentication
task_id	the reference of the task object with which to keep track of the operation; optional, required only if you have created a task object to keep track of the export
ref	the reference of the VM; required only if not using the UUID
uuid	the UUID of the VM; required only if not using the reference
use_compression	an optional boolean “true” or “false” (defaulting to “false”). If “true” then the output will be gzip-compressed before transmission.

For example, using the Linux command line tool cURL:

$ curl "http://root:foo@myxenserver1/export?uuid=<vm_uuid>" -o <exportfile>

will export the specified VM to the file exportfile.

To export just the metadata, use the URI http://server/export_metadata.
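The argument table above can be turned into a small URL builder. This is a sketch with a hypothetical function name; it enforces the rule that exactly one of uuid and ref is given and leaves authentication by HTTP basic authentication to the HTTP client when no session_id is supplied.

```python
from urllib.parse import urlencode


def export_url(server, *, uuid=None, ref=None, session_id=None,
               task_id=None, use_compression=False, metadata_only=False):
    """Build the URL for a VM export (or metadata-only export)."""
    if (uuid is None) == (ref is None):
        raise ValueError("specify exactly one of uuid or ref")
    params = {}
    if session_id is not None:
        params["session_id"] = session_id
    if task_id is not None:
        params["task_id"] = task_id
    if uuid is not None:
        params["uuid"] = uuid
    else:
        params["ref"] = ref
    if use_compression:
        params["use_compression"] = "true"
    path = "export_metadata" if metadata_only else "export"
    return f"http://{server}/{path}?{urlencode(params)}"


url = export_url("myxenserver1", uuid="abc", use_compression=True)
```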

The import protocol is similar, using HTTP(S) PUT. The session_id and task_id arguments are as for the export. The ref and uuid are not used; a new reference and uuid will be generated for the VM. There are some additional parameters:

Argument	Description
restore	if `true`, the import is treated as replacing the original VM - the implication of this currently is that the MAC addresses on the VIFs are exactly as the export was, which will lead to conflicts if the original VM is still being run.
force	if `true`, any checksum failures will be ignored (the default is to destroy the VM if a checksum error is detected)
sr_id	the reference of an SR into which the VM should be imported. The default behavior is to import into the `Pool.default_SR`

Note there is no need to specify whether the export is compressed, as Xapi will automatically detect and decompress gzip-encoded streams.

For example, again using cURL:

curl -T <exportfile> http://root:foo@myxenserver2/import

will import the VM to the default SR on the server.

Note
If no default SR has been set, and no sr_uuid is specified, the error message DEFAULT_SR_NOT_FOUND is returned.

Another example:

curl -T <exportfile> "http://root:foo@myxenserver2/import?sr_id=<ref_of_sr>"

will import the VM to the specified SR on the server.

To import just the metadata, use the URI http://server/import_metadata
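The same import can be prepared from Python. The sketch below builds a PUT request with an HTTP basic authentication header and the optional parameters from the table above, without sending it. The function name is hypothetical, and holding the whole export in memory as bytes is a simplification; a real client would more likely stream the file.

```python
import base64
import urllib.request
from urllib.parse import urlencode


def build_import_request(server, data, username, password,
                         sr_id=None, restore=False, force=False):
    """Build (but do not send) an HTTP PUT request that imports a VM."""
    params = {}
    if restore:
        params["restore"] = "true"
    if force:
        params["force"] = "true"
    if sr_id is not None:
        params["sr_id"] = sr_id
    url = f"http://{server}/import"
    if params:
        url += "?" + urlencode(params)
    # HTTP basic authentication: base64 of "username:password".
    token = base64.b64encode(f"{username}:{password}".encode()).decode("ascii")
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Authorization": f"Basic {token}"},
    )


request = build_import_request(
    "myxenserver2", b"export bytes", "root", "foo", sr_id="OpaqueRef:sr"
)
```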

Legacy VM Import Format

This section describes the legacy VM import/export format and is for historical interest only. It should be updated to describe the current format; see issue 64.

Xapi supports a human-readable legacy VM input format called XVA. This section describes the syntax and structure of XVA.

An XVA consists of a directory containing XML metadata and a set of disk images. A VM represented by an XVA is not intended to be directly executable. Data within an XVA package is compressed and intended for either archiving on permanent storage or for being transmitted to a VM server - such as a XenServer host - where it can be decompressed and executed.

XVA is a hypervisor-neutral packaging format; it should be possible to create simple tools to instantiate an XVA VM on any other platform. XVA does not specify any particular runtime format; for example disks may be instantiated as file images, LVM volumes, QCoW images, VMDK or VHD images. An XVA VM may be instantiated any number of times, each instantiation may have a different runtime format.

XVA does not:

specify any particular serialization or transport format
provide any mechanism for customizing VMs (or templates) on install
address how a VM may be upgraded post-install
define how multiple VMs, acting as an appliance, may communicate

These issues are all addressed by the related Open Virtual Appliance specification.

An XVA is a directory containing, at a minimum, a file called ova.xml. This file describes the VM contained within the XVA and is described in Section 3.2. Disks are stored within sub-directories and are referenced from the ova.xml. The format of disk data is described later in Section 3.3.

The following terms will be used in the rest of the chapter:

HVM: a mode in which unmodified OS kernels run with the help of virtualization support in the hardware.
PV: a mode in which specially modified “paravirtualized” kernels run explicitly on top of a hypervisor without requiring hardware support for virtualization.

The “ova.xml” file contains the following elements:

<appliance version="0.1">

The number in the attribute “version” indicates the version of this specification to which the XVA is constructed; in this case version 0.1. Inside the <appliance> there is exactly one <vm>: (in the OVA specification, multiple <vm>s are permitted)

<vm name="name">

Each <vm> element describes one VM. The “name” attribute is for future internal use only and must be unique within the ova.xml file. The “name” attribute is permitted to be any valid UTF-8 string. Inside each <vm> tag are the following compulsory elements:

<label>... text ... </label>

A short name for the VM to be displayed in a UI.

<shortdesc> ... description ... </shortdesc>

A description for the VM to be displayed in the UI. Note that for both <label> and <shortdesc> contents, leading and trailing whitespace will be ignored.

<config mem_set="268435456" vcpus="1"/>

The <config> element has attributes which describe the amount of memory in bytes (mem_set) and number of CPUs (vcpus) the VM should have.

Each <vm> has zero or more <vbd> elements representing block devices which look like the following:

<vbd device="sda" function="root" mode="w" vdi="vdi_sda"/>

The attributes have the following meanings:

device: name of the physical device to expose to the VM. For Linux guests we use “sd[a-z]” and for Windows guests we use “hd[a-d]”.
function: if marked as “root”, this disk will be used to boot the guest. (NB this does not imply the existence of the Linux root i.e. / filesystem) Only one device should be marked as “root”. See Section 3.4 describing VM booting. Any other string is ignored.
mode: either “w” or “ro” if the device is to be read/write or read-only
vdi: the name of the disk image (represented by a <vdi> element) to which this block device is connected

Each <vm> may have an optional <hacks> section like the following:

<hacks is_hvm="false" kernel_boot_cmdline="root=/dev/sda1 ro"/>

The <hacks> element will be removed in a future version. The attribute is_hvm is either true or false, depending on whether the VM should be booted in HVM mode or not. The kernel_boot_cmdline attribute contains additional kernel command-line arguments to use when booting a guest with pygrub.

In addition to a <vm> element, the <appliance> will contain zero or more <vdi> elements like the following:

<vdi name="vdi_sda" size="5368709120" source="file://sda" type="dir-gzipped-chunks">

Each <vdi> corresponds to a disk image. The attributes have the following meanings:

name: name of the VDI, referenced by the vdi attribute of <vbd> elements. Any valid UTF-8 string is permitted.
size: size of the required image in bytes
source: a URI describing where to find the data for the image. Only file:// URIs are currently permitted, and they must describe paths relative to the directory containing the ova.xml
type: describes the format of the disk data
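
As a rough illustration of how the elements described above fit together, the following Python sketch parses a small ova.xml with the standard library. The sample document is hand-written for this example from the element snippets in this chapter, and the function name summarize_ova is hypothetical; this is not the toolstack's own importer.

```python
import xml.etree.ElementTree as ET

# Hand-written sample assembled from the element examples in this chapter.
SAMPLE_OVA_XML = """\
<appliance version="0.1">
  <vm name="vm0">
    <label> Example VM </label>
    <shortdesc> A small test VM </shortdesc>
    <config mem_set="268435456" vcpus="1"/>
    <vbd device="sda" function="root" mode="w" vdi="vdi_sda"/>
    <hacks is_hvm="false" kernel_boot_cmdline="root=/dev/sda1 ro"/>
  </vm>
  <vdi name="vdi_sda" size="5368709120" source="file://sda" type="dir-gzipped-chunks"/>
</appliance>
"""


def summarize_ova(xml_text):
    """Return a plain dict describing the single VM in an ova.xml document."""
    appliance = ET.fromstring(xml_text)
    vms = appliance.findall("vm")
    if len(vms) != 1:
        raise ValueError("expected exactly one <vm> element")
    vm = vms[0]
    config = vm.find("config")
    # Index the <vdi> elements by name so each <vbd> can be resolved.
    vdis = {vdi.get("name"): vdi for vdi in appliance.findall("vdi")}
    disks = []
    for vbd in vm.findall("vbd"):
        vdi = vdis[vbd.get("vdi")]
        disks.append({
            "device": vbd.get("device"),
            "is_root": vbd.get("function") == "root",
            "read_only": vbd.get("mode") == "ro",
            "size": int(vdi.get("size")),
            "source": vdi.get("source"),
            "type": vdi.get("type"),
        })
    return {
        "version": appliance.get("version"),
        # Leading and trailing whitespace in <label>/<shortdesc> is ignored.
        "label": (vm.findtext("label") or "").strip(),
        "shortdesc": (vm.findtext("shortdesc") or "").strip(),
        "memory_bytes": int(config.get("mem_set")),
        "vcpus": int(config.get("vcpus")),
        "disks": disks,
    }
```

Calling summarize_ova(SAMPLE_OVA_XML) yields one disk entry whose device is “sda” and whose size is taken from the matching <vdi> element.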

A single disk image encoding is specified, with type “dir-gzipped-chunks”. Each image is represented by a directory containing a sequence of files as follows:

-rw-r--r-- 1 dscott xendev 458286013    Sep 18 09:51 chunk000000000.gz
-rw-r--r-- 1 dscott xendev 422271283    Sep 18 09:52 chunk000000001.gz
-rw-r--r-- 1 dscott xendev 395914244    Sep 18 09:53 chunk000000002.gz
-rw-r--r-- 1 dscott xendev 9452401      Sep 18 09:53 chunk000000003.gz
-rw-r--r-- 1 dscott xendev 1096066      Sep 18 09:53 chunk000000004.gz
-rw-r--r-- 1 dscott xendev 971976       Sep 18 09:53 chunk000000005.gz
-rw-r--r-- 1 dscott xendev 971976       Sep 18 09:53 chunk000000006.gz
-rw-r--r-- 1 dscott xendev 971976       Sep 18 09:53 chunk000000007.gz
-rw-r--r-- 1 dscott xendev 573930       Sep 18 09:53 chunk000000008.gz

Each file (named “chunkXXXXXXXXX.gz”, as in the listing above) is a gzipped file containing exactly 1e9 bytes (1 GB, not 1 GiB) of raw block data. The small size was chosen to be safely under the maximum file size limits of several filesystems. If the files are gunzipped and then concatenated together, the original image is recovered.
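
The recovery step (gunzip each chunk, then concatenate in order) could be sketched in Python as follows. This is a minimal illustration using only the standard library; the function name reassemble_image and the choice to pick up files matching chunk*.gz are assumptions for the example, not part of the specification.

```python
import gzip
import shutil
from pathlib import Path


def reassemble_image(chunk_dir, output_path):
    """Gunzip every chunk in name order and concatenate into output_path.

    Returns the number of bytes written.
    """
    # The zero-padded chunk numbers sort lexicographically in the right order.
    chunks = sorted(Path(chunk_dir).glob("chunk*.gz"))
    if not chunks:
        raise FileNotFoundError(f"no chunk files found in {chunk_dir}")
    with open(output_path, "wb") as out:
        for chunk in chunks:
            with gzip.open(chunk, "rb") as src:
                # Stream the data so a whole chunk is never held in memory.
                shutil.copyfileobj(src, out)
        return out.tell()
```

Streaming with shutil.copyfileobj keeps memory use flat regardless of how large each decompressed chunk is.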

Because the import and export of VMs can take some time to complete, an asynchronous HTTP interface to the import and export operations is provided. To perform an export using the XenServer API, construct an HTTP GET call providing a valid session ID, task ID and VM UUID, as shown in the following pseudo code:

task = Task.create()
result = HTTP.get(
  server, 80, "/export?session_id=<session_id>&task_id=<task_id>&ref=<vm_uuid>");

For the import operation, use an HTTP PUT call as demonstrated in the following pseudo code:

task = Task.create()
result = HTTP.put(
  server, 80, "/import?session_id=<session_id>&task_id=<task_id>&ref=<ref>");
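
To make the shape of these requests concrete, here is a small Python sketch that only builds the request URL; it does not contact a server. The function name transfer_url, the plain http scheme, and the example host name are assumptions for illustration. The session ID, task ID and ref value are expected to have been obtained beforehand through the API.

```python
from urllib.parse import urlencode


def transfer_url(server, operation, session_id, task_id, ref, port=80):
    """Build the URL for an HTTP export (GET) or import (PUT) request."""
    if operation not in ("export", "import"):
        raise ValueError("operation must be 'export' or 'import'")
    # urlencode percent-escapes any characters that are unsafe in a query.
    query = urlencode({"session_id": session_id, "task_id": task_id, "ref": ref})
    return f"http://{server}:{port}/{operation}?{query}"
```

For example, transfer_url("xenserver.example", "export", "s1", "t1", "vm-uuid") produces a URL whose path is /export and whose query string carries the three parameters in the order shown in the pseudo code.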

VM Lifecycle

The following figure shows the states that a VM can be in and the API calls that can be used to move the VM between these states.

graph
    halted-- start(paused) -->paused
    halted-- start(not paused) -->running
    running-- suspend -->suspended
    suspended-- resume(not paused) -->running
    suspended-- resume(paused) -->paused
    suspended-- hard shutdown -->halted
    paused-- unpause -->running
    paused-- hard shutdown -->halted
    running-- clean shutdown\n hard shutdown -->halted
    running-- pause -->paused
    halted-- destroy -->destroyed
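
The same diagram can be read as a lookup table from (state, operation) to the resulting state. The Python sketch below transcribes the edges of the figure; the operation labels are copied from the figure rather than being exact API call names, and the helper names are hypothetical.

```python
# (current state, operation) -> next state, transcribed from the figure.
TRANSITIONS = {
    ("halted", "start(paused)"): "paused",
    ("halted", "start(not paused)"): "running",
    ("running", "suspend"): "suspended",
    ("suspended", "resume(not paused)"): "running",
    ("suspended", "resume(paused)"): "paused",
    ("suspended", "hard shutdown"): "halted",
    ("paused", "unpause"): "running",
    ("paused", "hard shutdown"): "halted",
    ("running", "clean shutdown"): "halted",
    ("running", "hard shutdown"): "halted",
    ("running", "pause"): "paused",
    ("halted", "destroy"): "destroyed",
}


def next_state(state, operation):
    """Return the state reached by applying operation in state."""
    try:
        return TRANSITIONS[(state, operation)]
    except KeyError:
        raise ValueError(f"{operation!r} is not allowed in state {state!r}") from None


def allowed_operations(state):
    """Return the sorted operation labels that leave the given state."""
    return sorted(op for (s, op) in TRANSITIONS if s == state)
```

For instance, allowed_operations("paused") lists only “hard shutdown” and “unpause”, matching the two edges that leave the paused state in the figure.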

VM boot parameters

The VM class contains a number of fields that control the way in which the VM is booted. With reference to the fields defined in the VM class (see later in this document), this section outlines the boot options available and the mechanisms provided for controlling them.

VM booting is controlled by setting one of two mutually exclusive groups of fields: “PV” and “HVM”. If HVM_boot_policy is an empty string, then paravirtual domain building and booting will be used; otherwise the VM will be loaded as an HVM domain and booted using an emulated BIOS.

When paravirtual booting is in use, the PV_bootloader field indicates the bootloader to use. It may be “pygrub”, in which case the platform’s default installation of pygrub will be used, or a full path within the control domain to some other bootloader. The other fields, PV_kernel, PV_ramdisk, PV_args, and PV_bootloader_args, will be passed to the bootloader unmodified. Interpretation of those fields is then specific to the bootloader itself, including the possibility that the bootloader will ignore some or all of the given values. Finally, the paths of all bootable disks are added to the bootloader command line (a disk is bootable if its VBD has the bootable flag set). There may be zero, one, or many bootable disks; the bootloader decides which disk (if any) to boot from.

If the bootloader is pygrub, then menu.lst is parsed if it is present in the guest’s filesystem. Otherwise the specified kernel and ramdisk are used, or, if nothing is specified and autodetection is possible, an autodetected kernel is used. PV_args is appended to the kernel command line regardless of which mechanism is used to find the kernel.

If PV_bootloader is empty but PV_kernel is specified, then the kernel and ramdisk values will be treated as paths within the control domain. If both PV_bootloader and PV_kernel are empty, then the behaviour is as if PV_bootloader were specified as “pygrub”.
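
The selection rules in this section can be condensed into a short decision function. The following Python sketch follows the rules as stated here and is not the toolstack's actual implementation; the function name and the returned labels are hypothetical.

```python
def select_boot_method(hvm_boot_policy="", pv_bootloader="", pv_kernel=""):
    """Return a (mode, detail) pair describing how the VM would be booted.

    mode is "hvm" or "pv"; detail names the policy, the bootloader, or
    "direct-kernel" when the kernel path is used without a bootloader.
    """
    if hvm_boot_policy != "":
        # A non-empty HVM boot policy selects HVM booting.
        return ("hvm", hvm_boot_policy)
    if pv_bootloader != "":
        # "pygrub" or a full path to another bootloader in the control domain.
        return ("pv", pv_bootloader)
    if pv_kernel != "":
        # No bootloader: kernel and ramdisk are paths in the control domain.
        return ("pv", "direct-kernel")
    # Both fields empty: behave as if the bootloader were "pygrub".
    return ("pv", "pygrub")
```

With all fields left empty, the function returns ("pv", "pygrub"), which mirrors the fallback described in the last sentence above.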

When using HVM booting, HVM_boot_policy and HVM_boot_params specify the boot handling. Only one policy is currently defined, “BIOS order”. In this case, HVM_boot_params should contain one key-value pair “order” = “N” where N is the string that will be passed to QEMU. Optionally, HVM_boot_params can contain another key-value pair “firmware” with the value “bios” or “uefi” (the default is “bios” if absent). Secure Boot is not enabled by default; it can be enabled when “uefi” is selected by setting VM.platform["secureboot"] to true.
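
A hedged Python sketch of how a client might interpret these HVM fields is shown below. The function name is hypothetical, the maps are modelled as plain dictionaries of strings, and comparing the secureboot value against the string "true" is an assumption made for this example.

```python
def describe_hvm_boot(hvm_boot_policy, hvm_boot_params, platform=None):
    """Summarize the HVM boot settings held in the given field values."""
    platform = platform or {}
    if hvm_boot_policy != "BIOS order":
        raise ValueError("only the 'BIOS order' policy is defined")
    if "order" not in hvm_boot_params:
        raise ValueError("HVM_boot_params must contain an 'order' key")
    # The firmware key is optional and defaults to "bios".
    firmware = hvm_boot_params.get("firmware", "bios")
    if firmware not in ("bios", "uefi"):
        raise ValueError("firmware must be 'bios' or 'uefi'")
    # Assumption: the platform map stores the flag as the string "true".
    secureboot = firmware == "uefi" and platform.get("secureboot") == "true"
    return {
        "order": hvm_boot_params["order"],
        "firmware": firmware,
        "secureboot": secureboot,
    }
```

Note that the sketch reports Secure Boot as off whenever the firmware is “bios”, even if the platform key is set, reflecting the statement that it can only be enabled together with “uefi”.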

XenCenter

XenCenter uses some conventions on top of the XenAPI:

The SRs created at install time now have an other_config key indicating how their names may be internationalized.

other_config["i18n-key"] may be one of

local-hotplug-cd
local-hotplug-disk
local-storage
xenserver-tools

Additionally, other_config["i18n-original-value-<field name>"] gives the value of that field when the SR was created. If XenCenter sees a record where SR.name_label equals other_config["i18n-original-value-name_label"] (that is, the record has not changed since it was created during XenServer installation), then internationalization will be applied. In other words, XenCenter will disregard the current contents of that field, and instead use a value appropriate to the user’s own language.

If you change SR.name_label for your own purposes, then it is no longer the same as other_config["i18n-original-value-name_label"]. Therefore, XenCenter does not apply internationalization, and instead preserves the name you have given.
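
The convention can be expressed as a small lookup. The Python sketch below models an SR record as a plain dictionary and takes a table of translated names keyed by i18n-key; the function name and the translation table are hypothetical and are not taken from XenCenter's source.

```python
def display_name(sr, translations):
    """Return the name a client following this convention would display."""
    other_config = sr.get("other_config", {})
    key = other_config.get("i18n-key")
    original = other_config.get("i18n-original-value-name_label")
    # Translate only when the label is unchanged since installation.
    if key in translations and sr["name_label"] == original:
        return translations[key]
    # The label was customized (or no key is set): keep it as given.
    return sr["name_label"]
```

A record whose name_label still equals the stored original value is shown with the translated name; once the label is edited, the edited text is shown unchanged.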

Hiding objects from XenCenter

Networks, PIFs, and VMs can be hidden from XenCenter by adding the key HideFromXenCenter=true to the other_config parameter for the object. This capability is intended for ISVs who know what they are doing, not general use by everyday users. For example, you might want to hide certain VMs because they are cloned VMs that shouldn’t be used directly by general users in your environment.

In XenCenter, hidden Networks, PIFs, and VMs can be made visible, using the View menu.
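
A client that wants to honour the same convention could filter its object list along these lines. This Python sketch models records as dictionaries; the function name is hypothetical, and matching the value against the exact string "true" is an assumption for the example.

```python
def visible_objects(records, show_hidden=False):
    """Filter out records flagged with HideFromXenCenter=true."""
    if show_hidden:
        # Equivalent to making hidden objects visible from the View menu.
        return list(records)
    return [
        record
        for record in records
        if record.get("other_config", {}).get("HideFromXenCenter") != "true"
    ]
```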

Default value:	""
Published in:	XenServer 4.0 (rio)	a notes field containing human-readable description

Default value:	false
Published in:	XenServer 6.1 (tampa)	True if the blob is publicly accessible

Values:	balance-slb	Source-level balancing
	active-backup	Active/passive bonding: only one NIC is carrying traffic
	lacp	Link aggregation control protocol

Default value:	true
Published in:	Citrix Hypervisor 8.1 (quebec)

Default value:	Null
Published in:	XenServer 4.1 (miami)	The bonded interface

Default value:	OpaqueRef:NULL
Published in:	XenServer 6.0 (boston)

Values:	ca	Certificate that is trusted by the whole pool
	host	Certificate that identifies a single host to entities outside the pool
	host_internal	Certificate that identifies a single host to other pool members

Default value:	""
Published in:	Citrix Hypervisor 8.2 (stockholm)

Default value:	19700101T00:00:00Z
Published in:	Citrix Hypervisor 8.2 (stockholm)

Prototyped in:	XenServer 7.5 (kolkata)
Published in:	XenServer 7.6 (lima)

Values:	add	adding a new member to the cluster
	remove	removing a member from the cluster
	enable	enabling any cluster member
	disable	disabling any cluster member
	destroy	completely destroying a cluster

Default value:	"corosync"
Prototyped in:	XenServer 7.5 (kolkata)
Published in:	XenServer 7.6 (lima)

Values:	enable	enabling cluster membership on a particular host
	disable	disabling cluster membership on a particular host
	destroy	completely destroying a cluster host

Values:	vt100	VT100 terminal
	rfb	Remote FrameBuffer protocol (as used in VNC)
	rdp	Remote Desktop Protocol

Values:	add	An object has been created
	del	An object has been deleted
	mod	An object has been modified

Values:	breadth_first	vGPUs of a given type are allocated evenly across supporting pGPUs.
	depth_first	vGPUs of a given type are allocated on supporting pGPUs until they are full.

Values:	provision	Indicates this host is able to provision another VM
	evacuate	Indicates this host is evacuating
	shutdown	Indicates this host is in the process of shutting itself down
	reboot	Indicates this host is in the process of rebooting
	power_on	Indicates this host is in the process of being powered on
	vm_start	This host is starting a VM
	vm_resume	This host is resuming a VM
	vm_migrate	This host is the migration target of a VM
	apply_updates	Indicates this host is being updated
	enable	Indicates this host is in the process of enabling

Values:	yes	The host is up to date with the latest updates synced from the remote CDN
	no	The host is out of date with respect to the latest updates synced from the remote CDN
	unknown	It is unknown whether the host is up to date with the latest updates synced from the remote CDN

Values:	reboot_host	Indicates the updated host should reboot as soon as possible
	reboot_host_on_livepatch_failure	Indicates the updated host should reboot as soon as possible since one or more livepatch(es) failed to be applied.
	reboot_host_on_kernel_livepatch_failure	Indicates the updated host should reboot as soon as possible since one or more kernel livepatch(es) failed to be applied.
	reboot_host_on_xen_livepatch_failure	Indicates the updated host should reboot as soon as possible since one or more xen livepatch(es) failed to be applied.
	restart_toolstack	Indicates the Toolstack running on the updated host should restart as soon as possible
	restart_device_model	Indicates the device model of a running VM should restart as soon as possible
	restart_vm	Indicates the VM should restart as soon as possible

Values:	enabled	This host is outputting its console to a physical display device
	disable_on_reboot	The host will stop outputting its console to a physical display device on next boot
	disabled	This host is not outputting its console to a physical display device
	enable_on_reboot	The host will start outputting its console to a physical display device on next boot

Values:	core	core scheduling
	cpu	CPU scheduling
	socket	socket scheduling

Values:	any	VMs are spread across all available NUMA nodes
	best_effort	VMs are placed on the smallest number of NUMA nodes that they fit using soft-pinning, but the policy doesn't guarantee a balanced placement, falling back to the 'any' policy.
	default_policy	Use the NUMA affinity policy that is the default for the current version

Default value:	{}
Published in:	XenServer 5.6 (midnight-ride)	BIOS strings

Default value:	{}
Published in:	XenServer 5.0 (orlando)	Binary blobs associated with this host

Default value:	{}
Published in:	XenServer 5.5 (george)	configuration specific to external authentication service

Default value:	{"address" -> "localhost", "port" -> "27000"}
Published in:	XenServer 5.6 (midnight-ride)	Contact information of the license server

Default value:	{}
Published in:	XAPI 1.303.0 (1.303.0)	The set of pending mandatory guidances after applying updates, which must be applied, as otherwise there may be e.g. VM failures

Default value:	{0}
Published in:	XenServer 6.5 SP1 (cream)	The set of versions of the virtual hardware platform that the host can offer to its guests

Published in:	XenServer 5.0 (orlando)
Extended in:	XAPI 1.313.0 (1.313.0)	Added Certificate class

Values:	unlocked	Treat all VIFs on this network with locking_mode = 'default' as if they have locking_mode = 'unlocked'
	disabled	Treat all VIFs on this network with locking_mode = 'default' as if they have locking_mode = 'disabled'