Subsections of Design Documents
| Design document | |
|---|---|
| Revision | v3 |
| Status | proposed |
| Review | #144 |

| Revision history | |
|---|---|
| v1 | Initial version |
| v2 | Included some open questions under Xapi point 2 |
| v3 | Added new error, task, and assumptions |
Aggregated Local Storage and Host Reboots
Introduction
When hosts use an aggregated local storage SR, then disks are going to be mirrored to several different hosts in the pool (RAID). This ensures that if a host goes down (e.g. due to a reboot after installing a hotfix or upgrade, or when “fenced” by the HA feature), all disk contents in the SR are still accessible. This also means that if all disks are mirrored to just two hosts (worst-case scenario), just one host may be down at any point in time to keep the SR fully available.
When a node comes back up after a reboot, it will resynchronise all its disks with the related mirrors on the other hosts in the pool. This syncing takes some time, and only after it has finished may we consider the host "up" again and allow another host to be shut down.
Therefore, when installing a hotfix to a pool that uses aggregated local storage, or doing a rolling pool upgrade, we need to make sure that hosts are handled one by one, and that we wait for the storage syncing to finish before moving on to the next host.
This design aims to provide guidance and protection around this by blocking hosts from being shut down or rebooted via the XenAPI except when it is safe, and by setting the `host.allowed_operations` field accordingly.
XenAPI
If an aggregated local storage SR is in use, and one of the hosts is rebooting or down (for whatever reason), or resynchronising its storage, the operations `reboot` and `shutdown` will be removed from the `host.allowed_operations` field of all hosts in the pool that have a PBD for the SR.
This is a conservative approach, in that it assumes that this kind of SR tolerates only one node "failure", and it assumes no knowledge of how the SR distributes its mirrors. We may refine this in future, in order to allow some hosts to be down simultaneously.
The presence of the `reboot` operation in `host.allowed_operations` indicates whether the `host.reboot` XenAPI call is allowed or not (similarly for `shutdown` and `host.shutdown`). It will not, of course, prevent anyone from rebooting a host from the dom0 console or power switch.
Clients, such as XenCenter, can use `host.allowed_operations`, when applying an update to a pool, to determine when it is safe to update and reboot the next host in the sequence.
If `host.reboot` or `host.shutdown` is called while the storage is busy resyncing mirrors, the call will fail with a new error, `MIRROR_REBUILD_IN_PROGRESS`.
Xapi
Xapi needs to be able to:
- Determine whether aggregated local storage is in use; this just means that a PBD for such an SR is present.
  - TBD: To avoid SR-specific code in xapi, the storage backend should tell us whether it is an aggregated local storage SR.
- Determine whether the storage system is resynchronising its mirrors; it will need to be able to query the storage backend for this kind of information.
  - Xapi will poll for this and will reflect that a resync is happening by creating a `Task` for it (in the DB). This task can be used to track progress, if available.
  - The exact way to get the syncing information from the storage backend is SR specific. The check may be implemented in a separate script or binary that xapi calls from the polling thread. Ideally this would be integrated with the storage backend.
- Update `host.allowed_operations` for all hosts in the pool according to the rules described above. This comes down to updating the function `valid_operations` in `xapi_host_helpers.ml`, which will need to use a combination of the functionality from the two points above, plus an indication of host liveness from `host_metrics.live`.
- Trigger an update of the allowed operations when a host shuts down or reboots (due to a XenAPI call or otherwise), and when it has finished resynchronising when back up. Triggers must be in the following places (some may already be present, but are listed for completeness, and to confirm this):
  - Wherever `host_metrics.live` is updated to detect pool slaves going up and down (probably at least in `Db_gc.check_host_liveness` and `Xapi_ha`).
  - Immediately when a `host.reboot` or `host.shutdown` call is executed: `Message_forwarding.Host.{reboot,shutdown,with_host_operation}`.
  - When a storage resync is starting or finishing.
All of the above runs on the pool master (= SR master) only.
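To make the intended change more concrete, here is a minimal OCaml sketch of how the reboot/shutdown restriction could be expressed. The helpers `aggregated_local_storage_in_use` and `storage_resync_or_peer_down` are hypothetical (not existing xapi functions); the real check would be folded into `valid_operations` in `xapi_host_helpers.ml`, and the error string corresponds to the new `MIRROR_REBUILD_IN_PROGRESS` error proposed above.

```ocaml
(* Sketch only: the helper names below are assumptions, not existing xapi
   functions. Returns the reason an operation must be blocked, if any. *)
let storage_restriction ~__context op =
  match op with
  | `reboot | `shutdown
    when aggregated_local_storage_in_use ~__context
         && storage_resync_or_peer_down ~__context ->
      (* would surface to clients as the new error proposed in this design *)
      Some ("MIRROR_REBUILD_IN_PROGRESS", [])
  | _ -> None
```

The result could then be recorded against each host's `allowed_operations`, in the same way `valid_operations` records other blocked operations.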
Assumptions
The above will be safe if the storage cluster is equal to the XenServer pool. In general, however, it may be desirable to have a storage cluster that is larger than the pool, have multiple XS pools on a single cluster, or even share the cluster with other kinds of nodes.
To ensure that the storage is “safe” in these scenarios, xapi needs to be able to ask the storage backend:
1. if a mirror is being rebuilt "somewhere" in the cluster, AND
2. if "some node" in the cluster is offline (even if the node is not in the XS pool).
If the cluster is equal to the pool, then xapi can answer point 2 without asking the storage backend, which simplifies things. For the moment, we assume that the storage cluster is equal to the XS pool, to avoid making things too complicated (while keeping in mind that we may change this in future).
| Design document | |
|---|---|
| Revision | v1 |
| Status | confirmed |
Backtrace support
We want to make debugging easier by recording exception backtraces which are
- reliable
- cross-process (e.g. xapi to xenopsd)
- cross-language
- cross-host (e.g. master to slave)
We therefore need
- to ensure that backtraces are captured in our OCaml and python code
- a marshalling format for backtraces
- conventions for storing and retrieving backtraces
Backtraces in OCaml
OCaml has fast exceptions which can be used for both
- control flow i.e. fast jumps from inner scopes to outer scopes
- reporting errors to users (e.g. the toplevel or an API user)
To keep the exceptions fast, exceptions and backtraces are decoupled:
there is a single active backtrace per-thread at any one time. If you
have caught an exception and then throw another exception, the backtrace
buffer will be reinitialised, destroying your previous records. For example
consider a ‘finally’ function:
```ocaml
let finally f cleanup =
  try
    let result = f () in
    cleanup ();
    result
  with e ->
    cleanup ();
    raise e (* <-- backtrace starts here now *)
```
This function performs some action (i.e. `f ()`) and guarantees to perform some cleanup action (`cleanup ()`) whether or not an exception is thrown. This is a common pattern to ensure resources are freed (e.g. closing a socket or file descriptor). Unfortunately the `raise e` in the exception handler loses the backtrace context: when the exception gets to the toplevel, `Printexc.get_backtrace ()` will point at the `finally` rather than the real cause of the error.
We will use a variant of the solution proposed by Jacques-Henri Jourdan where we will record backtraces when we catch exceptions, before the buffer is reinitialised. Our `finally` function will now look like this:
```ocaml
let finally f cleanup =
  try
    let result = f () in
    cleanup ();
    result
  with e ->
    Backtrace.is_important e;
    cleanup ();
    raise e
```
The function `Backtrace.is_important e` associates the exception `e` with the current backtrace before it gets deleted.

Xapi always has high-level exception handlers or other wrappers around all the threads it spawns. In particular Xapi tries really hard to associate threads with active tasks, so it can prefix all log lines with a task id. This helps admins see the related log lines even when there is lots of concurrent activity. Xapi also tries very hard to label other threads with names for the same reason (e.g. `db_gc`). Every thread should end up being wrapped in `with_thread_named`, which allows us to catch exceptions and log stacktraces from `Backtrace.get` on the way out.
OCaml design guidelines
Making nice backtraces requires us to think when we write our exception raising
and handling code. In particular:
- If a function handles an exception and re-raises it, you must call `Backtrace.is_important e` with the exception to capture the backtrace first.
- If a function raises a different exception (e.g. `Not_found` becoming a XenAPI `INTERNAL_ERROR`) then you must use `Backtrace.reraise <old> <new>` to ensure the backtrace is preserved.
- All exceptions should be printable: if the generic printer doesn't do a good enough job then register a custom printer.
- If you are the last person who will see an exception (because you aren't going to rethrow it) then you may log the backtrace via `Debug.log_backtrace e`, if and only if you reasonably expect the resulting backtrace to be helpful and not spammy.
- If you aren't the last person who will see an exception (because you are going to rethrow it or another exception), then do not log the backtrace; the next handler will do that.
- All threads should have a final exception handler at the outermost level; for example `Debug.with_thread_named` will do this for you.
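As a small, hedged illustration of the second guideline (assuming xapi's `Api_errors` module; `lookup` is a hypothetical function that may raise `Not_found`):

```ocaml
(* Translate a low-level exception into a XenAPI error while preserving
   the backtrace of the original exception. *)
let find_vm name =
  try lookup name
  with Not_found as e ->
    Backtrace.reraise e
      (Api_errors.Server_error (Api_errors.internal_error, [ "VM not found: " ^ name ]))
```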
Backtraces in python
Python exceptions behave similarly to the OCaml ones: if you raise a new
exception while handling an exception, the backtrace buffer is overwritten.
Therefore the same considerations apply.
Python design guidelines
The function `sys.exc_info()` can be used to capture the traceback associated with the last exception. We must guarantee to call this before constructing another exception. In particular, this does not work:
raise MyException(sys.exc_info())
Instead you must capture the traceback first:
exc_info = sys.exc_info()
raise MyException(exc_info)
Marshalling backtraces
We need to be able to take an exception thrown from python code, gather
the backtrace, transmit it to an OCaml program (e.g. xenopsd) and glue
it onto the end of the OCaml backtrace. We will use a simple json marshalling
format for the raw backtrace data consisting of
- a string summary of the error (e.g. an exception name)
- a list of filenames
- a corresponding list of lines
(Note we don’t use the more natural list of pairs as this confuses the
“rpclib” code generating library)
In python:
```python
import json, sys, traceback

exc_info = sys.exc_info()  # the exception currently being handled
tb = traceback.extract_tb(exc_info[2])
files = [frame[0] for frame in tb]
lines = [frame[1] for frame in tb]
results = {
    "error": str(exc_info[1]),
    "files": files,
    "lines": lines,
}
print json.dumps(results)
```
In OCaml:
```ocaml
type error = {
  error : string;
  files : string list;
  lines : int list;
} with rpc

print_string (Jsonrpc.to_string (rpc_of_error ...))
```
Retrieving backtraces
Backtraces will be written to syslog as usual. However it will also be
possible to retrieve the information via the CLI to allow diagnostic
tools to be written more easily.
The CLI
We add a global CLI argument `--trace` which requests the backtrace be printed, if one is available:
# xe vm-start vm=hvm --trace
Error code: SR_BACKEND_FAILURE_202
Error parameters: , General backend error [opterr=exceptions must be old-style classes or derived from BaseException, not str],
Raised Server_error(SR_BACKEND_FAILURE_202, [ ; General backend error [opterr=exceptions must be old-style classes or derived from BaseException, not str]; ])
Backtrace:
0/50 EXT @ st30 Raised at file /opt/xensource/sm/SRCommand.py, line 110
1/50 EXT @ st30 Called from file /opt/xensource/sm/SRCommand.py, line 159
2/50 EXT @ st30 Called from file /opt/xensource/sm/SRCommand.py, line 263
3/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1486
4/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 83
5/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1519
6/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1567
7/50 EXT @ st30 Called from file /opt/xensource/sm/blktap2.py, line 1065
8/50 EXT @ st30 Called from file /opt/xensource/sm/EXTSR.py, line 221
9/50 xenopsd-xc @ st30 Raised by primitive operation at file "lib/storage.ml", line 32, characters 3-26
10/50 xenopsd-xc @ st30 Called from file "lib/task_server.ml", line 176, characters 15-19
11/50 xenopsd-xc @ st30 Raised at file "lib/task_server.ml", line 184, characters 8-9
12/50 xenopsd-xc @ st30 Called from file "lib/storage.ml", line 57, characters 1-156
13/50 xenopsd-xc @ st30 Called from file "xc/xenops_server_xen.ml", line 254, characters 15-63
14/50 xenopsd-xc @ st30 Called from file "xc/xenops_server_xen.ml", line 1643, characters 15-76
15/50 xenopsd-xc @ st30 Called from file "lib/xenctrl.ml", line 127, characters 13-17
16/50 xenopsd-xc @ st30 Re-raised at file "lib/xenctrl.ml", line 127, characters 56-59
17/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 937, characters 3-54
18/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1103, characters 4-71
19/50 xenopsd-xc @ st30 Called from file "list.ml", line 84, characters 24-34
20/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1098, characters 2-367
21/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1203, characters 3-46
22/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1441, characters 3-9
23/50 xenopsd-xc @ st30 Raised at file "lib/xenops_server.ml", line 1452, characters 9-10
24/50 xenopsd-xc @ st30 Called from file "lib/xenops_server.ml", line 1458, characters 48-60
25/50 xenopsd-xc @ st30 Called from file "lib/task_server.ml", line 151, characters 15-26
26/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 1719, characters 11-14
27/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
28/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 2005, characters 13-14
29/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
30/50 xapi @ st30 Raised at file "xapi_xenops.ml", line 1785, characters 15-16
31/50 xapi @ st30 Called from file "message_forwarding.ml", line 233, characters 25-44
32/50 xapi @ st30 Called from file "message_forwarding.ml", line 915, characters 15-67
33/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
34/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12
35/50 xapi @ st30 Called from file "message_forwarding.ml", line 1205, characters 21-199
36/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
37/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12
38/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
39/50 xapi @ st30 Raised at file "rbac.ml", line 236, characters 10-15
40/50 xapi @ st30 Called from file "server_helpers.ml", line 75, characters 11-41
41/50 xapi @ st30 Raised at file "cli_util.ml", line 78, characters 9-12
42/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
43/50 xapi @ st30 Raised at file "lib/pervasiveext.ml", line 26, characters 9-12
44/50 xapi @ st30 Called from file "cli_operations.ml", line 1889, characters 2-6
45/50 xapi @ st30 Re-raised at file "cli_operations.ml", line 1898, characters 10-11
46/50 xapi @ st30 Called from file "cli_operations.ml", line 1821, characters 14-18
47/50 xapi @ st30 Called from file "cli_operations.ml", line 2109, characters 7-526
48/50 xapi @ st30 Called from file "xapi_cli.ml", line 113, characters 18-56
49/50 xapi @ st30 Called from file "lib/pervasiveext.ml", line 22, characters 3-9
One can automatically set `--trace` for a whole shell session as follows:
export XE_EXTRA_ARGS="--trace"
The XenAPI
We already store error information in the XenAPI “Task” object and so we
can store backtraces at the same time. We shall add a field “backtrace”
which will have type “string” but which will contain s-expression encoded
backtrace data. Clients should not attempt to parse this string: its
contents may change in future. The reason it is different from the json
mentioned before is that it also contains host and process information
supplied by Xapi, and may be extended in future to contain other diagnostic
information.
The Xenopsd API
We already store error information in the xenopsd API "Task" objects; we can extend these to store the backtrace in an additional field ("backtrace").
This field will have type “string” but will contain s-expression encoded
backtrace data.
The SMAPIv1 API
Errors in SMAPIv1 are returned as XMLRPC “Faults” containing a code and
a status line. Xapi transforms these into XenAPI exceptions, usually of the form `SR_BACKEND_FAILURE_<code>`. We can extend the SM backends to use the XenAPI exception type directly, i.e. to marshal exceptions as dictionaries:
results = {
"Status": "Failure",
"ErrorDescription": [ code, param1, ..., paramN ]
}
We can then define a new backtrace-carrying error:

- code = `SR_BACKEND_FAILURE_WITH_BACKTRACE`
- param1 = json-encoded backtrace
- param2 = code
- param3 = reason

which is internally transformed into `SR_BACKEND_FAILURE_<code>`, and the backtrace is appended to the current Task backtrace. From the client's point of view the final exception should look the same, but Xapi will have a chance to see and log the whole backtrace.
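A rough OCaml sketch (not the actual xapi code) of this internal transformation, operating on an error code and its parameter list; `append_to_task_backtrace` is a hypothetical helper standing in for however xapi attaches the json-encoded backtrace to the current Task:

```ocaml
(* Sketch: unpack the backtrace-carrying error description and turn it back
   into the usual SR_BACKEND_FAILURE_<code> form. *)
let transform_sm_error (code, params) =
  match code, params with
  | "SR_BACKEND_FAILURE_WITH_BACKTRACE", backtrace_json :: code' :: reason ->
      append_to_task_backtrace backtrace_json;
      ("SR_BACKEND_FAILURE_" ^ code', reason)
  | _ -> (code, params)
```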
As a side-effect, it is possible for SM plugins to throw XenAPI errors directly,
without interpretation by Xapi.
| Design document | |
|---|---|
| Revision | v1 |
| Status | released (6.0) |
Bonding Improvements design
This document describes design details for the
PR-1006 requirements.
XAPI and XenAPI
Creating a Bond
Current Behaviour on Bond creation
Steps for a user to create a bond:
- Shut down all VMs with VIFs using the interfaces that will be bonded, in order to unplug those VIFs.
- Create a Network to be used by the bond: `Network.create`.
- Call `Bond.create` with a ref to this Network, a list of refs of slave PIFs, and a MAC address to use.
- Call `PIF.reconfigure_ip` to configure the bond master.
- Call `Host.management_reconfigure` if one of the slaves is the management interface. This command will call `interface-reconfigure` to bring up the master and bring down the slave PIFs, thereby activating the bond. Otherwise, call `PIF.plug` to activate the bond.
The `Bond.create` XenAPI call:

1. Remove duplicates in the list of slaves.
2. Validate the following:
   - Slaves must not be in a bond already.
   - Slaves must not be VLAN masters.
   - Slaves must be on the same host.
   - Network does not already have a PIF on the same host as the slaves.
   - The given MAC is valid.
3. Create the master PIF object.
   - The device name of this PIF is `bond`x, with x the smallest unused non-negative integer.
   - The MAC of the first-named slave is used if no MAC was specified.
4. Create the Bond object, specifying a reference to the master. The value of the `PIF.master_of` field on the master is dynamically computed on request.
5. Set the `PIF.bond_slave_of` fields of the slaves. The value of the `Bond.slaves` field is dynamically computed on request.
New Behaviour on Bond creation
Steps for a user to create a bond:
- Create a Network to be used by the bond: `Network.create`.
- Call `Bond.create` with a ref to this Network, a list of refs of slave PIFs, and a MAC address to use.

The new bond will automatically be plugged if one of the slaves was plugged.
In the following, for a host h, a VIF-to-move is a VIF associated
with a VM that is either
- running, suspended or paused on h, OR
- halted, and h is the only host that the VM can be started on.
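As a rough OCaml sketch of this "VIF-to-move" predicate (`vm_of_vif`, `runs_on`, `is_halted` and `possible_hosts` are hypothetical helpers, not existing xapi functions; the real implementation may differ):

```ocaml
(* Sketch: is [vif] a VIF-to-move for [host], per the definition above? *)
let is_vif_to_move ~host vif =
  let vm = vm_of_vif vif in
  (* running, suspended or paused on [host] ... *)
  runs_on ~host vm
  (* ... or halted, with [host] the only host it can be started on *)
  || (is_halted vm && possible_hosts vm = [ host ])
```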
The `Bond.create` XenAPI call is updated to do the following:

1. Remove duplicates in the list of slaves.
2. Validate the following, and raise an exception if any of these checks fails:
   - Slaves must not be in a bond already.
   - Slaves must not be VLAN masters.
   - Slaves must not be Tunnel access PIFs.
   - Slaves must be on the same host.
   - Network does not already have a PIF on the same host as the slaves.
   - The given MAC is valid.
3. Try unplugging all currently attached VIFs of the set of VIFs that need to be moved. Roll back and raise an exception if one of the VIFs cannot be unplugged (e.g. due to the absence of PV drivers in the VM).
4. Determine the primary slave: the management PIF (if among the slaves), or the first slave with IP configuration.
5. Create the master PIF object.
   - The device name of this PIF is `bond`x, with x the smallest unused non-negative integer.
   - The MAC of the primary slave is used if no MAC was specified.
   - Include the IP configuration of the primary slave.
   - If any of the slaves has `PIF.disallow_unplug = true`, this will be copied to the master.
6. Create the Bond object, specifying a reference to the master. The value of the `PIF.master_of` field on the master is dynamically computed on request. Also a reference to the primary slave is written to `Bond.primary_slave` on the new Bond object.
7. Set the `PIF.bond_slave_of` fields of the slaves. The value of the `Bond.slaves` field is dynamically computed on request.
8. Move VLANs, plus the VIFs-to-move on them, to the master.
   - If all VLANs on the slaves have different tags, all VLANs will be moved to the bond master, while the same Network is used. The network effectively moves up to the bond and therefore no VIFs need to be moved.
   - If multiple VLANs on different slaves have the same tag, they necessarily have different Networks as well. Only one VLAN with this tag is created on the bond master. All VIFs-to-move on the remaining VLAN networks are moved to the Network that was moved up.
9. Move Tunnels to the master. The tunnel Networks move up with the tunnels. As tunnel keys are different for all tunnel networks, there are no complications as in the VLAN case.
10. Move VIFs-to-move on the slaves to the master.
11. If one of the slaves is the current management interface, move management to the master; the master will automatically be plugged. If none of the slaves is the management interface, plug the master if any of the slaves was plugged. In both cases, the slaves will automatically be unplugged.
12. On all slaves, reset the IP configuration and set `disallow_unplug` to false.
Note: “moving” a VIF, VLAN or tunnel means “re-creating somewhere else,
and destroying the old one”.
Destroying a Bond
Current Behaviour on Bond destruction
Steps for a user to destroy a bond:
- If the management interface is on the bond, move it to another PIF using `PIF.reconfigure_ip` and `Host.management_reconfigure`. Otherwise, no `PIF.unplug` needs to be called on the bond master, as `Bond.destroy` does this automatically.
- Call `Bond.destroy` with a ref to the Bond object.
- If desired, bring up the former slave PIFs by calls to `PIF.plug` (this does not happen automatically).
The `Bond.destroy` XenAPI call:

1. Validate the following constraints:
   - No VLANs are attached to the bond master.
   - The bond master is not the management PIF.
2. Bring down the master PIF and clean up the underlying network devices.
3. Remove the Bond and master PIF objects.
New Behaviour on Bond destruction
Steps for a user to destroy a bond:
- Call `Bond.destroy` with a ref to the Bond object.
- If desired, move VIFs/VLANs/tunnels/management from the (former) primary slave to other PIFs.

The `Bond.destroy` XenAPI call is updated to do the following:

- Try unplugging all currently attached VIFs of the set of VIFs that need to be moved. Roll back and raise an exception if one of the VIFs cannot be unplugged (e.g. due to the absence of PV drivers in the VM).
- Copy the IP configuration of the master to the primary slave.
- Move VLANs, with their Networks, to the primary slave.
- Move Tunnels, with their Networks, to the primary slave.
- Move VIFs-to-move on the master to the primary slave.
- If the master is the current management interface, move management to the primary slave. The primary slave will automatically be plugged.
- If the master was plugged, plug the primary slave. This will automatically clean up the underlying devices of the bond.
- If the master has `PIF.disallow_unplug = true`, this will be copied to the primary slave.
- Remove the Bond and master PIF objects.
Using Bond Slaves
Current Behaviour for Bond Slaves
- It is possible to plug any existing PIF, even bond slaves. Any other PIFs that cannot be attached at the same time as the PIF that is being plugged are automatically unplugged.
- Similarly, it is possible to make a bond slave the management interface. Any other PIFs that cannot be attached at the same time as the PIF that is being plugged are automatically unplugged.
- It is possible to have a VIF on a Network associated with a bond slave. When the VIF's VM is started, or the VIF is hot-plugged, the PIF it relies on is automatically plugged, and any other PIFs that cannot be attached at the same time as this PIF are automatically unplugged.
- It is possible to have a VLAN on a bond slave, though the bond (master) and the VLAN may not be simultaneously attached. This is not currently enforced (which may be considered a bug).
New behaviour for Bond Slaves
- It is no longer possible to plug a bond slave. The exception
CANNOT_PLUG_BOND_SLAVE is raised when trying to do so.
- It is no longer possible to make a bond slave the management
interface. The exception CANNOT_PLUG_BOND_SLAVE is raised when
trying to do so.
- It is still possible to have a VIF on the Network of a bond slave.
However, it is not possible to start such a VIF’s VM on a host, if
this would need a bond slave to be plugged. Trying this will result
in a CANNOT_PLUG_BOND_SLAVE exception. Likewise, it is not
possible to hot-plug such a VIF.
- It is no longer possible to place a VLAN on a bond slave. The
exception CANNOT_ADD_VLAN_TO_BOND_SLAVE is raised when trying
to do so.
- It is no longer possible to place a tunnel on a bond slave. The
exception CANNOT_ADD_TUNNEL_TO_BOND_SLAVE is raised when trying
to do so.
Actions on Start-up
Current Behaviour on Start-up
When a pool slave starts up, bonds and VLANs on the pool master are
replicated on the slave:
- Create all VLANs that the master has, but the slave has not. VLANs
are identified by their tag, the device name of the slave PIF, and
the Networks of the master and slave PIFs.
- Create all bonds that the master has, but the slave has not. If the
interfaces needed for the bond are not all available on the slave, a
partial bond is created. If some of these interfaces are already
bonded on the slave, this bond is destroyed first.
New Behaviour on Start-up
- The current VLAN/tunnel/bond recreation code is retained, as it uses
the new Bond.create and Bond.destroy functions, and therefore does
what it needs to do.
- Before VLAN/tunnel/bond recreation, any violations of the rules
defined in R2 are rectified, by moving VIFs, VLANs, tunnels or
management up to bonds.
CLI
The behaviour of the `xe` CLI commands `bond-create`, `bond-destroy`, `pif-plug`, and `host-management-reconfigure` is changed to match their associated XenAPI calls.
XenCenter
XenCenter already automatically moves the management interface when a bond is created or destroyed. This is no longer necessary, as the `Bond.create`/`Bond.destroy` calls already do this. XenCenter only needs to copy any `PIF.other_config` keys that it needs between the primary slave and the bond master.
Manual Tests
- Create a bond of two interfaces…
- without VIFs/VLANs/management on them;
- with management on one of them;
- with a VLAN on one of them;
- with two VLANs on two different interfaces, having the same VLAN
tag;
- with a VIF associated with a halted VM on one of them;
- with a VIF associated with a running VM (with and without PV
drivers) on one of them.
- Destroy a bond of two interfaces…
- without VIFs/VLANs/management on it;
- with management on it;
- with a VLAN on it;
- with a VIF associated with a halted VM on it;
- with a VIF associated with a running VM (with and without PV
drivers) on it.
- In a pool of two hosts, having VIFs/VLANs/management on the
interfaces of the pool slave, create a bond on the pool master, and
restart XAPI on the slave.
- Restart XAPI on a host with a networking configuration that has
become illegal due to these requirements.
| Design document | |
|---|---|
| Revision | v2 |
| Status | proposed |
Code Coverage Profiling
We would like to add optional coverage profiling to existing OCaml
projects in the context of XenServer and XenAPI. This article
presents how we do it.
Binaries instrumented for coverage profiling in the XenServer project
need to run in an environment where several services act together as
they provide operating-system-level services. This makes it a little
harder than profiling code that can be profiled and executed in
isolation.
TL;DR
To build binaries with coverage profiling, do:
./configure --enable-coverage
make
Binaries will log coverage data to `/tmp/bisect*.out`, from which a coverage report can be generated in `coverage/`:
bisect-ppx-report -I _build -html coverage /tmp/bisect*.out
Profiling Framework Bisect-PPX
The open-source BisectPPX instrumentation framework uses extension
points (PPX) in the OCaml compiler to instrument code during
compilation. Instrumented code for a binary is then compiled as usual
and, during execution, logs data to in-memory data structures. Before an
instrumented binary terminates, it writes the logged data to a file.
This data can then be analysed with the bisect-ppx-report
tool, to
produce a summary of annotated code that highlights what part of a
codebase was executed.
BisectPPX has several desirable properties:

- a robust code base that is well tested
- it is easy to integrate into the compilation pipeline (see below)
- it is specific to the OCaml language; an expression-oriented language like OCaml doesn't fit traditional statement coverage well
- it is actively maintained
- it generates useful reports for interactive and non-interactive use that help to improve code coverage
Red parts indicate code that wasn’t executed whereas green parts were.
Hovering over a dark green spot reveals how often that point was
executed.
The individual steps of instrumenting code with BisectPPX are greatly
abstracted by OCamlfind (OCaml’s library manager) and OCamlbuild
(OCaml’s compilation manager):
# write code
vim example.ml
# build it with instrumentation from bisect_ppx
ocamlbuild -use-ocamlfind -pkg bisect_ppx -pkg unix example.native
# execute it - generates files ./bisect*.out
./example.native
# generate report
bisect-ppx-report -I _build -html coverage bisect000*
# view coverage/index.html
Summary:
- 'binding' points: 2/2 (100.00%)
- 'sequence' points: 10/10 (100.00%)
- 'match/function' points: 5/8 (62.50%)
- total: 17/20 (85.00%)
The fourth step generates an HTML report in `coverage/`. All it takes is to declare to OCamlbuild that a module depends on `bisect_ppx` and it will be instrumented during compilation. Behind the scenes, ocamlfind makes sure that the compiler uses a preprocessing step that instruments the code.
Signal Handling
During execution the code instrumentation leads to the collection of
data. This code registers a function with at_exit
that writes the data
to bisect*.out
when exit
is called. A binary can terminate without
calling exit
and in that case the file would not be written. It is
therefore important to make sure that exit
is called. If this does not
happen naturally, for example in the context of a daemon that is
terminated by receiving the TERM
signal, a signal handler must be
installed:
```ocaml
let stop signal =
  Printf.printf "caught signal %d\n" signal;
  exit 0

let () = Sys.set_signal Sys.sigterm (Sys.Signal_handle stop)
```
By default coverage data can only be dumped at exit, which is inconvenient if you have a test-suite
that needs to reuse a long running daemon, and starting/stopping it each time is not feasible.
In such cases we need an API to dump coverage at runtime, which is provided by bisect_ppx >= 1.3.0
.
However, each daemon will need to set up a way to listen for an event that triggers this coverage dump. Furthermore, it is desirable to compile runtime coverage dumping in conditionally, to be absolutely sure that production builds do not use coverage-preprocessed code.
Hence, instead of duplicating all this build logic in each daemon (`xapi`, `xenopsd`, etc.), we provide this functionality in a common library, `xapi-idl`, that:

- logs a message on startup so we know it is active
- sets the `BISECT_FILE` environment variable to dump coverage in the appropriate place
- listens on the `org.xen.xapi.coverage.<name>` message queue for runtime coverage dump commands:
  - sending `dump <Number>` will cause runtime coverage to be dumped to a file named `bisect-<name>-<random>.<Number>.out`
  - sending `reset` will cause the runtime coverage counters to be reset
Daemons that use Xcp_service.configure2
(e.g. xenopsd
) will benefit from this runtime trigger automatically,
provided they are themselves preprocessed with bisect_ppx
.
Since we are interested in collecting coverage data for system-wide test-suite runs we need a way to trigger
dumping of coverage data centrally, and a good candidate for that is xapi
as the top-level daemon.
It will call Xcp_coverage.dispatcher_init ()
, which listens on org.xen.xapi.coverage.dispatch
and
dispatches the coverage dump command to all message queues under org.xen.xapi.coverage.*
except itself.
On production and regular builds all of this is a no-op, ensured by using separate `lib/coverage/disabled.ml` and `lib/coverage/enabled.ml` files which implement the same interface, and by choosing which one to use at build time.
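A hypothetical sketch of that shared interface (the real `lib/coverage` interface may differ; `init` and `dispatcher_init` are the two entry points mentioned in this document):

```ocaml
(* Both lib/coverage/enabled.ml and lib/coverage/disabled.ml would satisfy
   this signature; the disabled variant implements every call as a no-op. *)
module type COVERAGE = sig
  (* Set BISECT_FILE and register runtime-dump listeners for this daemon. *)
  val init : string -> unit

  (* Used by xapi only: listen on org.xen.xapi.coverage.dispatch and forward
     dump/reset commands to the other daemons' queues. *)
  val dispatcher_init : unit -> unit
end
```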
Where Data is Written
By default, BisectPPX writes data in a binary’s current working
directory as bisectXXXX.out
. It doesn’t overwrite existing files and
files from several runs can be combined during analysis. However, this
name and the location can be inconvenient when multiple programs share a
directory.
BisectPPX’s default can be overridden with the BISECT_FILE
environment variable. This can happen on the command line:
BISECT_FILE=/tmp/example ./example.native
In the context of XenServer we could do this in startup scripts.
However, we added a bit of code

```ocaml
val Coverage.init : string -> unit
```

that sets the environment variable from inside the program. The files are written to a temporary directory (respecting `$TMP` or using `/tmp`), and the `string`-typed argument is included in the file name. To be effective, this function must be called before the program exits. For clarity it is called at the beginning of program execution.
Instrumenting an Oasis Project
While instrumentation is easy at the level of a small file or project, it is challenging in a bigger project. We decided to focus on projects that are built with the Oasis build and packaging manager. These have a well-defined structure and compilation process that is controlled by a central `_oasis` file. This file describes, for each library and binary, its dependencies at a package level. From this, Oasis generates a `configure` script and compilation rules for the OCamlbuild system. Oasis is designed so that the generated files can be shipped without requiring Oasis itself to be available.
Goals for instrumentation are:
- what files are instrumented should be obvious and easy to manage
- instrumentation must be optional, yet easy to activate
- avoid methods that require keeping several files in sync, like multiple `_oasis` files
- avoid separate Git branches for instrumented and non-instrumented code
In the ideal case, we could introduce a configuration switch `./configure --enable-coverage` that would prepare compilation for coverage instrumentation. While Oasis supports the creation of such switches, they cannot be used to control build dependencies like compiling a file with or without the package `bisect_ppx`. We have chosen a different method:

A `Makefile` target `coverage` augments the `_tags` file to include the rules in file `_tags.coverage` that cause files to be instrumented:
make coverage # prepare
make # build
leads to the execution of this code during preparation:
coverage: _tags _tags.coverage
test ! -f _tags.orig && mv _tags _tags.orig || true
cat _tags.coverage _tags.orig > _tags
The file _tags.coverage
contains two simple OCamlbuild rules that
could be tweaked to instrument only some files:
<**/*.ml{,i,y}>: pkg_bisect_ppx
<**/*.native>: pkg_bisect_ppx
When make coverage
is not called, these rules are not active and
hence, code is not instrumented for coverage. We believe that this
solution to control instrumentation meets the goals from above. In
particular, what files are instrumented and when is controlled by very
few lines of declarative code that lives in the main repository of a
project.
Project Layout
The crucial files in an Oasis-controlled project that is set up for
coverage analysis are:
./_oasis - make "profiling" a build dependency
./_tags.coverage - what files get instrumented
./profiling/coverage.ml - support file, sets env var
./Makefile - target 'coverage'
The _oasis
file bundles the files under profiling/
into an internal
library which executables then depend on:
# Support files for profiling
Library profiling
CompiledObject: best
Path: profiling
Install: false
Findlibname: profiling
Modules: Coverage
BuildDepends:
Executable set_domain_uuid
CompiledObject: best
Path: tools
ByteOpt: -warn-error +a-3
NativeOpt: -warn-error +a-3
MainIs: set_domain_uuid.ml
Install: false
BuildDepends:
xenctrl,
uuidm,
cmdliner,
profiling # <-- here
The Makefile
target coverage
primes the project for a profiling build:
# make coverage - prepares for building with coverage analysis
coverage: _tags _tags.coverage
test ! -f _tags.orig && mv _tags _tags.orig || true
cat _tags.coverage _tags.orig > _tags
| Design document | |
|---|---|
| Revision | v7 |
| Status | released (7.0) |

| Revision history | |
|---|---|
| v1 | Initial version |
| v2 | Add details about VM migration and import |
| v3 | Included and excluded use cases |
| v4 | Rolling Pool Upgrade use cases |
| v5 | Lots of changes to simplify the design |
| v6 | Use case refresh based on simplified design |
| v7 | RPU refresh based on simplified design |
CPU feature levelling 2.0
Executive Summary
The old XS 5.6-style Heterogeneous Pool feature that is based around hardware-level CPUID masking will be replaced by a safer and more flexible software-based levelling mechanism.
History
- Original XS 5.6 design: heterogeneous-pools
- Changes made in XS 5.6 FP1 for the DR feature (added CPUID checks upon migration)
- XS 6.1: migration checks extended for cross-pool scenario
High-level Interfaces and Behaviour
A VM can only be migrated safely from one host to another if both hosts offer the set of CPU features which the VM expects. If this is not the case, CPU features may appear or disappear as the VM is migrated, causing it to crash. The purpose of feature levelling is to hide features which the hosts do not have in common from the VM, so that it does not see any change in CPU capabilities when it is migrated.
Most pools start off with homogeneous hardware, but over time it may become impossible to source new hosts with the same specifications as the ones already in the pool. The main use of feature levelling is to allow such newer, more capable hosts to be added to an existing pool while preserving the ability to migrate existing VMs to any host in the pool.
Principles for Migration
The CPU levelling feature aims to both:
- Make VM migrations safe by ensuring that a VM will see the same CPU features before and after a migration.
- Make VMs as mobile as possible, so that they can be freely migrated around in a XenServer pool.
To make migrations safe:
- A migration request will be blocked if the destination host does not offer some of the CPU features that the VM currently sees.
- Any additional CPU features that the destination host is able to offer will be hidden from the VM.
Note: Due to the limitations of the old Heterogeneous Pools feature, we are not able to guarantee the safety of VMs that are migrated to a Levelling-v2 host from an older host, during a rolling pool upgrade. This is because such VMs may be using CPU features that were not captured in the old feature sets, of which we are therefore unaware. However, migrations between the same two hosts, but before the upgrade, may have already been unsafe. The promise is that we will not make migrations more unsafe during a rolling pool upgrade.
To make VMs mobile:
- A VM that is started in a XenServer pool will be able to see only CPU features that are common to all hosts in the pool. The set of common CPU features is referred to in this document as the pool CPU feature level, or simply the pool level.
Use Cases for Pools
A user wants to add a new host to an existing XenServer pool. The new host has all the features of the existing hosts, plus extra features which the existing hosts do not. The new host will be allowed to join the pool, but its extra features will be hidden from VMs that are started on the host or migrated to it. The join does not require any host reboots.
A user wants to add a new host to an existing XenServer pool. The new host does not have all the features of the existing ones. XenCenter warns the user that adding the host to the pool is possible, but it would lower the pool’s CPU feature level. The user accepts this and continues the join. The join does not require any host reboots. VMs that are started anywhere on the pool, from now on, will only see the features of the new host (the lowest common denominator), such that they are migratable to any host in the pool, including the new one. VMs that were running before the pool join will not be migratable to the new host, because these VMs may be using features that the new host does not have. However, after a reboot, such VMs will be fully mobile.
A user wants to add a new host to an existing XenServer pool. The new host does not have all the features of the existing ones, and at the same time, it has certain features that the pool does not have (the feature sets overlap). This is essentially a combination of the two use cases above, where the pool’s CPU feature level will be downgraded to the intersection of the feature sets of the pool and the new host. The join does not require any host reboots.
A user wants to upgrade or repair the hardware of a host in an existing XenServer pool. After upgrade the host has all the features it used to have, plus extra features which other hosts in the pool do not have. The extra features are masked out and the host resumes its place in the pool when it is booted up again.
A user wants to upgrade or repair the hardware of a host in an existing XenServer pool. After upgrade the host has fewer features than it used to have. When the host is booted up again, the pool CPU’s feature level will be automatically lowered, and the user will be alerted of this fact (through the usual alerting mechanism).
A user wants to remove a host from an existing XenServer pool. The host will be removed as normal after any VMs on it have been migrated away. The feature set offered by the pool will be automatically re-levelled upwards in case the host which was removed was the least capable in the pool, and additional features common to the remaining hosts will be unmasked.
Rolling Pool Upgrade
A VM which was running on the pool before the upgrade is expected to continue to run afterwards. However, when the VM is migrated to an upgraded host, some of the CPU features it had been using might disappear, either because they are not offered by the host or because the new feature-levelling mechanism hides them. To have the best chance for such a VM to successfully migrate (see the note under “Principles for Migration”), it will be given a temporary VM-level feature set providing all of the destination’s CPU features that were unknown to XenServer before the upgrade. When the VM is rebooted it will inherit the pool-level feature set.
A VM which is started during the upgrade will be given the current pool-level feature set. The pool-level feature set may drop after the VM is started, as more hosts are upgraded and re-join the pool, however the VM is guaranteed to be able to migrate to any host which has already been upgraded. If the VM is started on the master, there is a risk that it may only be able to run on that host.
To allow the VMs with grandfathered-in flags to be migrated around in the pool, the intra pool VM migration pre-checks will compare the VM’s feature flags to the target host’s flags, not the pool flags. This will maximise the chance that a VM can be migrated somewhere in a heterogeneous pool, particularly in the case where only a few hosts in the pool do not have features which the VMs require.
To allow cross-pool migration, including to a pool of a higher XenServer version, we will still check the VM's requirements against the pool-level features of the target pool. This is to avoid the possibility that we migrate a VM to an 'island' in the other pool, from which it cannot be migrated any further.
XenAPI Changes
Fields
- `host.cpu_info` is a field of type `(string -> string) map` that contains information about the CPUs in a host. It contains the following keys: `cpu_count`, `socket_count`, `vendor`, `speed`, `modelname`, `family`, `model`, `stepping`, `flags`, `features`, `features_after_reboot`, `physical_features` and `maskable`.
  - The following keys are specific to hardware-based CPU masking and will be removed: `features_after_reboot`, `physical_features` and `maskable`.
  - The `features` key will continue to hold the current CPU features that the host is able to use. In practice, these features will be available to Xen itself and dom0; guests may only see a subset. The current format is a string of four 32-bit words represented as four groups of 8 hexadecimal digits, separated by dashes. This will change to an arbitrary number of 32-bit words. Each bit at a particular position (starting from the left) still refers to a distinct CPU feature (`1`: feature is present; `0`: feature is absent), and feature strings may be compared between hosts. The old format simply becomes a special (4 word) case of the new format, and bits in the same position may be compared between old and new feature strings.
  - The new key `features_pv` will be added, representing the subset of `features` that the host is able to offer to a PV guest.
  - The new key `features_hvm` will be added, representing the subset of `features` that the host is able to offer to an HVM guest.
- A new field `pool.cpu_info` of type `(string -> string) map` (read only) will be added. It will contain:
  - `vendor`: The common CPU vendor across all hosts in the pool.
  - `features_pv`: The intersection of `features_pv` across all hosts in the pool, representing the feature set that a PV guest will see when started on the pool.
  - `features_hvm`: The intersection of `features_hvm` across all hosts in the pool, representing the feature set that an HVM guest will see when started on the pool.
  - `cpu_count`: the total number of CPU cores in the pool.
  - `socket_count`: the total number of CPU sockets in the pool.
- The `pool.other_config:cpuid_feature_mask` override key will no longer have any effect on pool join or VM migration.
- The field `VM.last_boot_CPU_flags` will be updated to the new format (see `host.cpu_info:features`). It will still contain the feature set that the VM was started with as well as the vendor (under the `features` and `vendor` keys respectively).
Messages
- `pool.join` currently requires that the CPU vendor and feature set (according to `host.cpu_info:vendor` and `host.cpu_info:features`) of the joining host are equal to those of the pool master. This requirement will be loosened to mandate only equality in CPU vendor:
  - The join will be allowed if `host.cpu_info:vendor` equals `pool.cpu_info:vendor`.
  - This means that xapi will additionally allow hosts that have a more extensive feature set than the pool (as long as the CPU vendor is common). Such hosts are transparently down-levelled to the pool level (without needing reboots).
  - This further means that xapi will additionally allow hosts that have a less extensive feature set than the pool (as long as the CPU vendor is common). In this case, the pool is transparently down-levelled to the new host's level (without needing reboots). Note that this does not affect any running VMs in any way; the mobility of running VMs will not be restricted, and they can still migrate to any host they could migrate to before. It does mean that those running VMs will not be migratable to the new host.
  - The current error raised in case of a CPU mismatch is `POOL_HOSTS_NOT_HOMOGENEOUS` with reason argument `"CPUs differ"`. This will remain the error that is raised if the pool join fails due to incompatible CPU vendors.
  - The `pool.other_config:cpuid_feature_mask` override key will no longer have any effect.
- `host.set_cpu_features` and `host.reset_cpu_features` will be removed: it is no longer possible to use the old method of CPU feature masking (CPU feature sets are controlled automatically by xapi). Calls will fail with `MESSAGE_REMOVED`.
- VM lifecycle operations will be updated internally to use the new feature fields, to ensure that:
  - Newly started VMs will be given CPU features according to the pool level for maximal mobility.
  - For safety, running VMs will maintain their feature set across migrations and suspend/resume cycles. CPU features will transparently be hidden from VMs.
  - Furthermore, migrate and resume will only be allowed if the target host's CPUs are capable enough, i.e. `host.cpu_info:vendor` = `VM.last_boot_CPU_flags:vendor` and `host.cpu_info:features_{pv,hvm}` ⊇ `VM.last_boot_CPU_flags:features`. A `VM_INCOMPATIBLE_WITH_THIS_HOST` error will be returned otherwise (as happens today).
  - For cross-pool migrations, to ensure maximal mobility in the target pool, a stricter condition will apply: the VM must satisfy the pool CPU level rather than just the target host's level: `pool.cpu_info:vendor` = `VM.last_boot_CPU_flags:vendor` and `pool.cpu_info:features_{pv,hvm}` ⊇ `VM.last_boot_CPU_flags:features`. A sketch of this feature-set subset check is given below.
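As an illustration of the ⊇ (subset) check on feature strings, here is a minimal OCaml sketch. It is not the actual xapi implementation: it assumes the dash-separated hexadecimal format described under `host.cpu_info:features`, and it zero-extends the shorter string so that old 4-word feature sets can be compared against newer, longer ones.

```ocaml
(* Sketch of a feature-set subset test: [is_subset vm host] is true when
   every feature bit in [vm] is also present in [host]. *)
let to_words s =
  String.split_on_char '-' s |> List.map (fun w -> Int64.of_string ("0x" ^ w))

(* Pad with zero words so that feature strings of different lengths
   can be compared bit-for-bit. *)
let zero_extend n words =
  words @ List.init (max 0 (n - List.length words)) (fun _ -> 0L)

let is_subset vm host =
  let vm = to_words vm and host = to_words host in
  let n = max (List.length vm) (List.length host) in
  List.for_all2
    (fun v h -> Int64.logand v h = v)
    (zero_extend n vm) (zero_extend n host)
```

`is_subset vm_features host_features` would then express `host.cpu_info:features_{pv,hvm}` ⊇ `VM.last_boot_CPU_flags:features`.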
CLI Changes
The following changes to the `xe` CLI will be made:

- `xe host-cpu-info` (as well as `xe host-param-list` and friends) will return the fields of `host.cpu_info` as described above.
- `xe host-set-cpu-features` and `xe host-reset-cpu-features` will be removed.
- `xe host-get-cpu-features` will still return the value of `host.cpu_info:features` for a given host.
Low-level implementation
Xenctrl
The old `xc_get_boot_cpufeatures` hypercall will be removed, and replaced by two new functions, which are available to xenopsd through the Xenctrl module:

```ocaml
external get_levelling_caps : handle -> int64 = "stub_xc_get_levelling_caps"

type featureset_index = Featureset_host | Featureset_pv | Featureset_hvm
external get_featureset : handle -> featureset_index -> int64 array = "stub_xc_get_featureset"
```

In particular, the `get_featureset` function will be used by xapi/xenopsd to ask Xen which are the widest sets of CPU features that it can offer to a VM (PV or HVM). I don't think there is a use for `get_levelling_caps` yet.
Xenopsd
- Update the type `Host.cpu_info`, which contains all the fields that need to go into the `host.cpu_info` field in the xapi DB. The type already exists but is unused. Add the function `HOST.get_cpu_info` to obtain an instance of the type. Some code from xapi and the `cpuid.ml` from xen-api-libs can be reused.
- Add a platform key `featureset` (`Vm.t.platformdata`), which xenopsd will write to xenstore along with the other platform keys (no code change needed in xenopsd). Xenguest will pick this up when a domain is created, and will apply the CPUID policy to the domain. This has the effect of masking out features that the host may have, but which have a `0` in the feature set bitmap.
- Review current cpuid-related functions in `xc/domain.ml`.
Xapi
Xapi startup
- Update the `Create_misc.create_host_cpu` function to use the new xenopsd call.
- If the host features fall below the pool level, e.g. due to a change in hardware: down-level the pool by updating `pool.cpu_info.features_{pv,hvm}`. Newly started VMs will inherit the new level; already running VMs will not be affected, but will not be able to migrate to this host.
- To notify the admin of this event, an API alert (message) will be set: `pool_cpu_features_downgraded`.
VM start
- Inherit the feature set from the pool (`pool.cpu_info.features_{pv,hvm}`) and set `VM.last_boot_CPU_flags` (`cpuid_helpers.ml`).
- The domain will be started with this CPU feature set enabled, by writing the feature set string to `platformdata` (see above); a small sketch of this follows.
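A minimal sketch of how the chosen feature set could be passed to the domain via `platformdata`. Assumptions: the platform key carries the dash-separated string format described earlier, and `feature_words` is the already-levelled set of 32-bit words (stored as int64s) for this VM; the real xapi/xenopsd code may differ.

```ocaml
(* Sketch: serialise the levelled feature words to the dash-separated string
   format and attach them to the VM's platformdata under "featureset". *)
let featureset_to_string words =
  words |> Array.to_list |> List.map (Printf.sprintf "%08Lx") |> String.concat "-"

let add_featureset ~feature_words platformdata =
  ("featureset", featureset_to_string feature_words)
  :: List.remove_assoc "featureset" platformdata
```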
VM migrate and resume
- There are already CPU compatibility checks on migration, both in-pool and cross-pool, as well as on resume. Xapi compares `VM.last_boot_CPU_flags` of the VM to-migrate with `host.cpu_info` of the receiving host. Migration is only allowed if the CPU vendors are the same, and `host.cpu_info:features` ⊇ `VM.last_boot_CPU_flags:features`. The check can be overridden by setting the `force` argument to `true`.
- For in-pool migrations, these checks will be updated to use the appropriate `features_pv` or `features_hvm` field.
- For cross-pool migrations, these checks will be updated to use `pool.cpu_info` (`features_pv` or `features_hvm`, depending on how the VM was booted) rather than `host.cpu_info`.
- If the above checks pass, then `VM.last_boot_CPU_flags` will be maintained, and the new domain will be started with the same CPU feature set enabled, by writing the feature set string to `platformdata` (see above).
- In case the VM is migrated to a host with a higher xapi software version (e.g. a migration from a host that does not have CPU levelling v2), the feature string may be longer. This may happen during a rolling pool upgrade or a cross-pool migration, or when a suspended VM is resumed after an upgrade. In this case, the following safety rules apply:
  - Only the existing (shorter) feature string will be used to determine whether the migration will be allowed. This is the best we can do, because we are unaware of the state of the extended feature set on the older host.
  - The existing feature set in `VM.last_boot_CPU_flags` will be extended with the extra bits in `host.cpu_info:features_{pv,hvm}`, i.e. the widest feature set that can possibly be granted to the VM (just in case the VM was using any of these features before the migration).
  - Strictly speaking, a migration of a VM from host A to B that was allowed before B was upgraded may no longer be allowed after the upgrade, due to stricter feature sets in the new implementation (from the `xc_get_featureset` hypercall). However, the CPU features that are switched off by the new implementation are features that a VM would not have been able to actually use. We therefore need a don't-care feature set (similar to the old `pool.other_config:cpuid_feature_mask` key) with bits that we may ignore in migration checks, and switch off after the migration. This will be a xapi config file option.
  - XXX: Can we actually block a cross-pool migration at the receiver end?
VM import
The `VM.last_boot_CPU_flags` field must be upgraded to the new format (only really needed for VMs that were suspended while exported; `preserve_power_state=true`), as described above.
Pool join
Update pool join checks according to the rules above (see `pool.join`), i.e. remove the CPU feature constraints.
Upgrade
- The pool level (`pool.cpu_info`) will be initialised when the pool master upgrades, and automatically adjusted if needed (downwards) when slaves are upgraded, by each upgraded host's startup sequence (as above under "Xapi startup").
- The `VM.last_boot_CPU_flags` fields of running and suspended VMs will be "upgraded" to the new format on demand, when a VM is migrated to or resumed on an upgraded host, as described above.
XenCenter integration
- Don’t explicitly down-level upon join anymore
- Become aware of new pool join rule
- Update Rolling Pool Upgrade
| Design document | |
|---|---|
| Revision | v1 |
| Status | proposed |
Distributed database
All hosts in a pool use the shared database by sending queries to
the pool master. This creates
- a performance bottleneck as the pool size increases
- a reliability problem when the master fails.
The reliability problem can be ameliorated by running with HA enabled,
but this is not always possible.
Both problems can be addressed by observing that the database objects
correspond to distinct physical objects where eventual consistency is
perfectly ok. For example if host ‘A’ is running a VM and changes the
VM’s name, it doesn’t matter if it takes a while before the change shows
up on host ‘B’. If host ‘B’ changes its network configuration then it
doesn’t matter how long it takes host ‘A’ to notice. We would still like
the metadata to be replicated to cope with failure, but we can allow
changes to be committed locally and synchronised later.
Note the one exception to this pattern: the current SM plugins use database
fields to implement locks. This should be shifted to a special-purpose
lock acquire/release API.
Using git via Irmin
A git repository is a database of key=value pairs with branching history.
If we placed our host and VM metadata in git then we could commit changes and `pull` and `push` them between replicas. The Irmin library provides an easy programming interface on top of git which we could link with the Xapi database layer.
Proposed new architecture
The diagram above shows two hosts: one a master and the other a regular host.
The XenAPI client has sent a request to the wrong host; normally this would
result in a HOST_IS_SLAVE
error being sent to the client. In the new
world, the host is able to process the request, only contacting the master
if it is necessary to acquire a lock. Starting a VM would require a lock; but
rebooting or migrating an existing VM would not. Assuming the lock can
be acquired, then the operation is executed locally with all state updates
being made to a git topic branch.
Roughly we would have 1 topic branch per
pending XenAPI Task. Once the Task completes successfully, the topic branch
(containing the new VM state) is merged back into master.
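As an illustration of the branch-per-Task idea, here is a minimal sketch (Python driving plain git via subprocess; the repository path, task id and the file-writing callback are hypothetical):

```python
import subprocess

def git(repo, *args):
    # Run a git command inside the metadata repository.
    subprocess.check_call(["git", "-C", repo] + list(args))

def commit_task(repo, task_id, write_updates):
    """Record a Task's state updates on a topic branch and, once the
    Task has completed successfully, merge the branch back into master."""
    branch = "task/%s" % task_id
    git(repo, "checkout", "-b", branch, "master")
    write_updates()                        # write the new VM state as files
    git(repo, "add", "-A")
    git(repo, "commit", "-m", "Task %s" % task_id)
    git(repo, "checkout", "master")
    git(repo, "merge", "--no-ff", branch)  # Task succeeded: merge into master
    git(repo, "branch", "-d", branch)
```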
Separately each
host will pull and push updates between each other for replication.
We would avoid merge conflicts by construction; either
- a host’s configuration will always be “owned” by the host and it will be
an error for anyone else to merge updates to it
- the master’s locking will guarantee that a VM is running on at most one
host at a time. It will be an error for anyone else to merge updates to it.
What we gain
We will gain the following
- the master will only be a bottleneck when the number of VM locks gets
really large;
- you will be able to connect XenCenter to hosts without a master and manage
them. Today such hosts are unmanageable.
- the database will have a history and you’ll be able to “go back in time”
either for debugging or to recover from mistakes
- bugs caused by concurrent threads (in separate Tasks) confusing each other
will be vanquished. A typical failure mode is: one active thread destroys
an object; a passive thread sees the object and then tries to read it
and gets a database failure instead. Since every thread is operating a
separate Task they will all have their own branch and will be isolated from
each other.
What we lose
We will lose the following
- the ability to use the Xapi database as a “lock”
- coherence between hosts: there will be no guarantee that an effect seen
by host ‘A’ will be seen immediately by host ‘B’. In particular this means
that clients should send all their commands and
event.from
calls to
the same host (although any host will do)
Stuff we need to build
A pull
/push
replicator: this would have to monitor the list
of hosts in the pool and distribute updates to them in some vaguely
efficient manner. Ideally we would avoid hassling the pool master and
use some more efficient topology: perhaps a tree?
A git diff
to XenAPI event converter: whenever a host pulls
updates from another it needs to convert the diff into a set of touched
objects for any event.from
to read. We could send the changeset hash
as the event.from
token.
Irmin nested views: since Tasks can be nested (and git branches can be
nested) we need to make sure that Irmin views can be nested.
We need to go through the xapi code and convert all mixtures of database
access and XenAPI updates into pure database calls. With the previous system
it was better to use a XenAPI call to remote large chunks of database effects to
the master than to perform them locally. It will now be better to run them
all locally and merge them at the end. Additionally since a Task will have
a local branch, it won’t be possible to see the state on a remote host
without triggering an early merge (which would harm efficiency).
We need to create a first-class locking API to use instead of the
VDI.sm_config
locks.
Prototype
A basic prototype has been created:
$ opam pin xen-api-client git://github.com/djs55/xen-api-client#improvements
$ opam pin add xapi-database git://github.com/djs55/xapi-database
$ opam pin add xapi git://github.com/djs55/xen-api#schema-sexp
The xapi-database
is a clone of the existing Xapi database code
configured to run as a separate process. There is
code to convert from XML to git
and
an implementation of the Xapi remote database API
which uses the following layout:
$ git clone /xapi.db db
Cloning into 'db'...
done.
$ cd db; ls
xapi
$ ls xapi
console host_metrics PCI pool SR user VM
host network PIF session tables VBD VM_metrics
host_cpu PBD PIF_metrics SM task VDI
$ ls xapi/pool
OpaqueRef:39adc911-0c32-9e13-91a8-43a25939110b
$ ls xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/
crash_dump_SR __mtime suspend_image_SR
__ctime name_description uuid
default_SR name_label vswitch_controller
ha_allow_overcommit other_config wlb_enabled
ha_enabled redo_log_enabled wlb_password
ha_host_failures_to_tolerate redo_log_vdi wlb_url
ha_overcommitted ref wlb_username
ha_plan_exists_for _ref wlb_verify_cert
master restrictions
$ ls xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/other_config/
cpuid_feature_mask memory-ratio-hvm memory-ratio-pv
$ cat xapi/pool/OpaqueRef\:39adc911-0c32-9e13-91a8-43a25939110b/other_config/cpuid_feature_mask
ffffff7f-ffffffff-ffffffff-ffffffff
Notice how:
- every object is a directory
- every key/value pair is represented as a file
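Given that layout, reading a field is just reading a file; a minimal sketch (the checkout path is hypothetical):

```python
import os

DB_ROOT = "/root/db/xapi"   # hypothetical clone of /xapi.db, as shown above

def read_field(cls, ref, field):
    # Every object is a directory; every key/value pair is a file.
    path = os.path.join(DB_ROOT, cls, ref, field)
    with open(path) as f:
        return f.read().strip()

# e.g. the pool-wide CPUID mask shown above:
# read_field("pool", "OpaqueRef:39adc911-0c32-9e13-91a8-43a25939110b",
#            "other_config/cpuid_feature_mask")
```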
Design document |
---|
Revision | v1 |
Status | released (6.0.2) |
Emergency Network Reset Design
This document describes design details for the PR-1032 requirements.
The design consists of four parts:
- A new XenAPI call
Host.reset_networking
, which removes all the
PIFs, Bonds, VLANs and tunnels associated with the given host, and a
call PIF.scan_bios
to bring back the PIFs with device names as
defined in the BIOS. - A
xe-reset-networking
script that can be executed on a XenServer
host, which prepares the reset and causes the host to reboot. - An xsconsole page that essentially does the same as
xe-reset-networking
. - A new item in the XAPI start-up sequence, which when triggered by
xe-reset-networking
, calls Host.reset_networking
and re-creates
the PIFs.
Command-Line Utility
The xe-reset-networking
script takes the following parameters:
DNS server for management interface. Optional; ignored if --mode=dhcp
.
The script takes the following steps after processing the given
parameters:
- Inform the user that the host will be restarted, and that any
running VMs should be shut down. Make the user confirm that they
really want to reset the networking by typing ‘yes’.
- Read
/etc/xensource/pool.conf
to determine whether the host is a
pool master or pool slave. - If a pool slave, update the IP address in the
pool.conf
file to
the one given in the -m
parameter, if present. - Shut down networking subsystem (
service network stop
). - If no management device is specified, take it from
/etc/firstboot.d/data/management.conf.
- If XAPI is running, stop it.
- Reconfigure the management interface and associated bridge by
interface-reconfigure --force
. - Update
MANAGEMENT_INTERFACE
and clear CURRENT_INTERFACES
in
/etc/xensource-inventory
. - Create the file
/tmp/network-reset
to trigger XAPI to complete the
network reset after the reboot. This file should contain the full
configuration details of the management interface as key/value pairs
(format: <key>=<value>\n
), and looks similar to the firstboot data
files. The file contains at least the keys DEVICE
and MODE
, and
IP
, NETMASK
, GATEWAY
, or DNS
when appropriate. - Reboot
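A minimal sketch of how the script could write the trigger file in the <key>=<value> format described above (the helper name and its argument handling are illustrative):

```python
RESET_FILE = "/tmp/network-reset"

def write_reset_file(device, mode, ip=None, netmask=None, gateway=None, dns=None):
    # DEVICE and MODE are always present; the other keys only when appropriate.
    config = {"DEVICE": device, "MODE": mode,
              "IP": ip, "NETMASK": netmask, "GATEWAY": gateway, "DNS": dns}
    with open(RESET_FILE, "w") as f:
        for key, value in config.items():
            if value is not None:
                f.write("%s=%s\n" % (key, value))

# e.g. write_reset_file("eth0", "static", ip="10.0.0.2", netmask="255.255.255.0")
```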
XAPI
XenAPI
A new hidden API call:
Host.reset_networking
- Parameter: host reference
host
- Calling this function removes all the PIF, Bond, VLAN and tunnel
objects associated with the given host from the master database.
All Network and VIF objects are maintained, as these do not
necessarily belong to a single host.
Start-up Sequence
After reboot, in the XAPI start-up sequence trigged by the presence of
/tmp/network-reset
:
- Read the desired management configuration from
/tmp/network-reset
. - Call
Host.reset_networking
with a ref to the localhost. - Call
PIF.scan
with a ref to the localhost to recreate the
(physical) PIFs. - Call
PIF.reconfigure_ip
to configure the management interface. - Call
Host.management_reconfigure
. - Delete
/tmp/network-reset
.
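The start-up steps above could look roughly as follows when expressed as a XenAPI client sketch (Python bindings, run in dom0; Host.reset_networking is the new hidden call proposed here, and the reset-file parsing mirrors the format described earlier):

```python
import os
import XenAPI

RESET_FILE = "/tmp/network-reset"

def complete_network_reset():
    # Parse the key=value trigger file written before the reboot.
    with open(RESET_FILE) as f:
        config = dict(line.strip().split("=", 1) for line in f if "=" in line)
    session = XenAPI.xapi_local()
    session.xenapi.login_with_password("", "")
    try:
        host = session.xenapi.session.get_this_host(session._session)
        session.xenapi.host.reset_networking(host)    # new hidden call (above)
        session.xenapi.PIF.scan(host)                 # recreate physical PIFs
        pifs = session.xenapi.PIF.get_all_records()
        pif = [ref for ref, r in pifs.items()
               if r["host"] == host and r["device"] == config["DEVICE"]][0]
        mode = {"dhcp": "DHCP", "static": "Static"}[config["MODE"].lower()]
        session.xenapi.PIF.reconfigure_ip(
            pif, mode, config.get("IP", ""), config.get("NETMASK", ""),
            config.get("GATEWAY", ""), config.get("DNS", ""))
        session.xenapi.host.management_reconfigure(pif)
    finally:
        os.remove(RESET_FILE)
        session.xenapi.session.logout()
```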
xsconsole
Add an “Emergency Network Reset” option under the “Network and
Management Interface” menu. Selecting this option will show some
explanation in the pane on the right-hand side. Pressing <Enter> will
bring up a dialogue to select the interfaces to use as management
interface after the reset. After choosing a device, the dialogue
continues with configuration options like in the “Configure Management
Interface” dialogue. After completing the dialogue, the same steps as
listed for xe-reset-networking
are executed.
Notes
- On a pool slave, the management interface should be the same as on
the master (the same device name, e.g. eth0).
- Resetting the networking configuration on the master should be
ideally be followed by resets of the pool slaves as well, in order
to synchronise their configuration (especially bonds/VLANs/tunnels).
Furthermore, in case the IP address of the master has changed, as a
result of a network reset or
Host.management_reconfigure
, pool
slaves may also use the network reset functionality to reconnect to
the master on its new IP.
Design document |
---|
Revision | v3 |
Status | proposed |
Review | #120 |
FCoE capable NICs
It is possible to identify the NICs of a host that can support FCoE.
This property will be listed in the PIF object under the capabilities field.
Introduction
- FCoE support on a NIC is a hardware property. With the help of dcbtool, we can identify which NICs support FCoE.
- The new field capabilities will be
Set(String)
in the PIF object. An FCoE-capable NIC will have the string “fcoe” in its PIF capabilities field. The capabilities
field will be read-only; it cannot be modified by the user.
PIF Object
New field:
- Field
PIF.capabilities
will be type Set(string)
. - The default value of the capabilities field will be an empty set.
Xapi Changes
- Set the “fcoe” capability depending on the output of the xcp-networkd call
get_capabilities
. - Field capabilities “fcoe” can be set during
introduce_internal
when creating a PIF. - The “fcoe” capability can be updated during
refresh_all
on xapi startup. - The above field will be set every time xapi restarts.
XCP-Networkd Changes
New function:
- Function
string list get_capabilities (string)
- Argument: device_name for the PIF.
- This function calls method
capable
exposed by fcoe_driver.py
as part of dom0. - It returns string list [“fcoe”] or [] depending on
capable
method output (see the sketch below).
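A sketch of how xcp-networkd's get_capabilities could be implemented by shelling out to the dom0 fcoe_driver.py helper (the script path and the command-line convention for invoking its capable method are assumptions for illustration):

```python
import subprocess

FCOE_DRIVER = "/opt/xensource/libexec/fcoe_driver.py"   # hypothetical path

def get_capabilities(device_name):
    """Return ["fcoe"] if fcoe_driver.py reports the NIC as FCoE-capable,
    otherwise []."""
    try:
        # Assumed command-line convention for invoking the `capable` method.
        output = subprocess.check_output([FCOE_DRIVER, "capable", device_name])
    except (OSError, subprocess.CalledProcessError):
        return []
    return ["fcoe"] if output.decode().strip().lower() == "true" else []
```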
Defaults, Installation and Upgrade
- Any newly introduced PIF will have its capabilities field as an empty set until the
fcoe_driver
method capable
states that FCoE is supported on the NIC. - This includes PIFs obtained after a fresh install of XenServer, as well as PIFs created using
PIF.introduce
then PIF.scan
. - During an upgrade Xapi Restart will call
refresh_all
which then populates the capabilities field as an empty set.
Command Line Interface
- The
PIF.capabilities
field is exposed through xe pif-list
and xe pif-param-list
as usual.
Design document |
---|
Revision | v1 |
Status | released (6.0) |
GPU pass-through support
This document contains the software design for GPU pass-through. This
code was originally included in the version of Xapi used in XenServer 6.0.
Overview
Rather than modelling GPU pass-through from a PCI perspective, and
having the user manipulate PCI devices directly, we are taking a
higher-level view by introducing a dedicated graphics model. The
graphics model is similar to the networking and storage model, in which
virtual and physical devices are linked through an intermediate
abstraction layer (e.g. the “Network” class in the networking model).
The basic graphics model is as follows:
- A host owns a number of physical GPU devices (pGPUs), each of
which is available for passing through to a VM.
- A VM may have a virtual GPU device (vGPU), which means it expects
to have access to a GPU when it is running.
- Identical pGPUs are grouped across a resource pool in GPU groups.
GPU groups are automatically created and maintained by XS.
- A GPU group connects vGPUs to pGPUs in the same way as VIFs are
connected to PIFs by Network objects: for a VM v having a vGPU on
GPU group p to run on host h, host h must have a pGPU in GPU
group p and pass it through to VM v.
- VM start and non-live migration rules are analogous to the network
API and follow the above rules.
- In case a VM that has a vGPU is started while no pGPU is available, an
exception will occur and the VM won’t start. As a result, in order
to guarantee that a VM always has access to a pGPU, the number of
vGPUs should not exceed the number of pGPUs in a GPU group.
Currently, the following restrictions apply:
- Hotplug is not supported.
- Suspend/resume and checkpointing (memory snapshots) are not
supported.
- Live migration (XenMotion) is not supported.
- No more than one GPU per VM will be supported.
- Only Windows guests will be supported.
XenAPI Changes
The design introduces a new generic class called PCI to capture state
and information about relevant PCI devices in a host. By default, xapi
would not create PCI objects for all PCI devices, but only for the ones
that are managed and configured by xapi; currently only GPU devices.
The PCI class has no fields specific to the type of the PCI device (e.g.
a graphics card or NIC). Instead, device specific objects will contain a
link to their underlying PCI device’s object.
The new XenAPI classes and changes to existing classes are detailed
below.
PCI class
Fields:
Name | Type | Description |
---|
uuid | string | Unique identifier/object reference. |
class_id | string | PCI class ID (hidden field) |
class_name | string | PCI class name (GPU, NIC, …) |
vendor_id | string | Vendor ID (hidden field). |
vendor_name | string | Vendor name. |
device_id | string | Device ID (hidden field). |
device_name | string | Device name. |
host | host ref | The host that owns the PCI device. |
pci_id | string | BDF (domain/Bus/Device/Function identifier) of the (physical) PCI function, e.g. “0000:00:1a.1”. The format is hhhh:hh:hh.h, where h is a hexadecimal digit. |
functions | int | Number of (physical + virtual) functions; currently fixed at 1 (hidden field). |
attached_VMs | VM ref set | List of VMs that have this PCI device “currently attached”, i.e. plugged, i.e. passed-through to (hidden field). |
dependencies | PCI ref set | List of dependent PCI devices: all of these need to be passed-thru to the same VM (co-location). |
other_config | (string -> string) map | Additional optional configuration (as usual). |
Hidden fields are only for use by xapi internally, and not visible to
XenAPI users.
Messages: none.
PGPU class
A physical GPU device (pGPU).
Fields:
Name | Type | Description |
---|
uuid | string | Unique identifier/object reference. |
PCI | PCI ref | Link to the underlying PCI device. |
other_config | (string -> string) map | Additional optional configuration (as usual). |
host | host ref | The host that owns the GPU. |
GPU_group | GPU_group ref | GPU group the pGPU is contained in. Can be Null. |
Messages: none.
GPU_group class
A group of identical GPUs across hosts. A VM that is associated with a
GPU group can use any of the GPUs in the group. A VM does not need to
install new GPU drivers if moving from one GPU to another one in the
same GPU group.
Fields:
Name | Type | Description |
---|
VGPUs | VGPU ref set | List of vGPUs in the group. |
uuid | string | Unique identifier/object reference. |
PGPUs | PGPU ref set | List of pGPUs in the group. |
other_config | (string -> string) map | Additional optional configuration (as usual). |
name_label | string | A human-readable name. |
name_description | string | A notes field containing human-readable description. |
GPU_types | string set | List of GPU types (vendor+device ID) that can be in this group (hidden field). |
Messages: none.
VGPU class
A virtual GPU device (vGPU).
Fields:
Name | Type | Description |
---|
uuid | string | Unique identifier/object reference. |
VM | VM ref | VM that owns the vGPU. |
GPU_group | GPU_group ref | GPU group the vGPU is contained in. |
currently_attached | bool | Reflects whether the virtual device is currently “connected” to a physical device. |
device | string | Order in which the devices are plugged into the VM. Restricted to “0” for now. |
other_config | (string -> string) map | Additional optional configuration (as usual). |
Messages:
Prototype | Description | |
---|
VGPU ref create (GPU_group ref, string, VM ref) | Manually assign the vGPU device to the VM given a device number, and link it to the given GPU group. | |
void destroy (VGPU ref) | Remove the association between the GPU group and the VM. | |
It is possible to assign more vGPUs to a group than the number of
pGPUs in the group. When a VM is started, a pGPU must be available; if
not, the VM will not start. Therefore, to guarantee that a VM has access
to a pGPU at any time, one must manually enforce that the number of
vGPUs in a GPU group does not exceed the number of pGPUs. XenCenter
might display a warning, or simply refuse to assign a vGPU, if this
constraint is violated. This is analogous to the handling of memory
availability in a pool: a VM may not be able to start if there is no
host having enough free memory.
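For example, a client could create a vGPU and perform the capacity check described above as follows (a sketch using the XenAPI Python bindings; the argument order follows the VGPU.create prototype given here, and the field getters follow the usual get_<field> convention):

```python
def assign_vgpu(session, vm, gpu_group, device="0"):
    # `session` is an authenticated XenAPI session (Python bindings).
    pgpus = session.xenapi.GPU_group.get_PGPUs(gpu_group)
    vgpus = session.xenapi.GPU_group.get_VGPUs(gpu_group)
    # Manual rule from above: do not hand out more vGPUs than there are pGPUs,
    # otherwise some VMs may later fail to start for lack of a physical GPU.
    if len(vgpus) >= len(pgpus):
        raise RuntimeError("GPU group already has as many vGPUs as pGPUs")
    # Prototype above: VGPU ref create (GPU_group ref, string, VM ref)
    return session.xenapi.VGPU.create(gpu_group, device, vm)
```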
VM class
Fields:
- Deprecate (unused)
PCI_bus
field - Add field
VGPU ref set VGPUs
: List of vGPUs. - Add field
PCI ref set attached_PCIs
: List of PCI devices that are
“currently attached” (plugged, passed-through) (hidden field).
host class
Fields:
- Add field
PCI ref set PCIs
: List of PCI devices. - Add field
PGPU ref set PGPUs
: List of physical GPU devices. - Add field
(string -> string) map chipset_info
, which contains at
least the key iommu
. If "true"
, this key indicates whether the
host has IOMMU/VT-d support built in, and this functionality is
enabled by Xen; the value will be "false"
otherwise.
Initialisation and Operations
Enabling IOMMU/VT-d
(This may not be needed in Xen 4.1. Confirm with Simon.)
Provide a command that does this:
/opt/xensource/libexec/xen-cmdline --set-xen iommu=1
- reboot
Xapi startup
Definitions:
- PCI devices are matched on the combination of their
pci_id
,
vendor_id
, and device_id
.
First boot and any subsequent xapi start:
Find out from dmesg whether IOMMU support is present and enabled in
Xen, and set host.chipset_info:iommu
accordingly.
Detect GPU devices currently present in the host. For each:
- If there is no matching PGPU object yet, create a PGPU object,
and add it to a GPU group containing identical PGPUs, or a new
group.
- If there is no matching PCI object yet, create one, and also
create or update the PCI objects for dependent devices.
Destroy all existing PCI objects of devices that are not currently
present in the host (i.e. objects for devices that have been
replaced or removed).
Destroy all existing PGPU objects of GPUs that are not currently
present in the host. Send a XenAPI alert to notify the user of this
fact.
Update the list of dependencies
on all PCI objects.
Sync VGPU.currently_attached
on all VGPU
objects.
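The “add it to a GPU group containing identical PGPUs, or a new group” step amounts to grouping on the GPU type; a minimal sketch of that grouping logic over hypothetical detected-device records:

```python
def group_pgpus(detected_gpus):
    """Group detected GPU devices by type (vendor ID + device ID), mirroring
    the GPU_group.GPU_types notion above. `detected_gpus` is a list of dicts
    with at least "pci_id", "vendor_id" and "device_id" (hypothetical shape)."""
    groups = {}
    for gpu in detected_gpus:
        gpu_type = "%s/%s" % (gpu["vendor_id"], gpu["device_id"])
        groups.setdefault(gpu_type, []).append(gpu["pci_id"])
    return groups

# Two identical boards end up in the same group, e.g.:
# group_pgpus([{"pci_id": "0000:05:00.0", "vendor_id": "10de", "device_id": "11b4"},
#              {"pci_id": "0000:06:00.0", "vendor_id": "10de", "device_id": "11b4"}])
```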
Upgrade
For any VMs that have VM.other_config:pci
set to use a GPU, create an
appropriate vGPU, and remove the other_config
option.
Generic PCI Interface
A generic PCI interface exposed to higher-level code, such as the
networking and GPU management modules within Xapi. This functionality
relies on Xenops.
The PCI module exposes the following functions:
- Check whether a PCI device has free (unassigned) functions. This is
the case if the number of assignments in
PCI.attached_VMs
is
smaller than PCI.functions
. - Plug a PCI function into a running VM.
- Raise exception if there are no free functions.
- Plug PCI device, as well as dependent PCI devices. The PCI
module must also tell device-specific modules to update the
currently_attached
field on dependent VGPU
objects etc. - Update
PCI.attached_VMs
.
- Unplug a PCI function from a running VM.
- Raise exception if the PCI function is not owned by (passed
through to) the VM.
- Unplug PCI device, as well as dependent PCI devices. The PCI
module must also tell device-specific modules to update the
currently_attached
field on dependent VGPU
objects etc. - Update
PCI.attached_VMs
.
Construction and Destruction
VGPU.create:
- Check license. Raise FEATURE_RESTRICTED if the GPU feature has not
been enabled.
- Raise INVALID_DEVICE if the given device number is not “0”, or
DEVICE_ALREADY_EXISTS if (indeed) the device already exists. This
is a convenient way of enforcing that only one vGPU per VM is
supported, for now.
- Create
VGPU
object in the DB. - Initialise
VGPU.currently_attached = false
. - Return a ref to the new object.
VGPU.destroy:
- Raise OPERATION_NOT_ALLOWED if
VGPU.currently_attached = true
and the VM is running. - Destroy
VGPU
object.
VM Operations
VM.start(_on):
- If
host.chipset_info:iommu = "false"
, raise VM_REQUIRES_IOMMU. - Raise FEATURE_REQUIRES_HVM (carrying the string “GPU passthrough
needs HVM”) if the VM is PV rather than HVM.
- For each of the VM’s vGPUs:
- Confirm that the given host has a pGPU in its associated GPU
group. If not, raise VM_REQUIRES_GPU.
- Consult the generic PCI module for all pGPUs in the group to
find out whether a suitable PCI function is available. If a
physical device is not available, raise VM_REQUIRES_GPU.
- Ask PCI module to plug an available pGPU into the VM’s domain
and set
VGPU.currently_attached
to true
. As a side-effect,
any dependent PCI devices would be plugged.
VM.shutdown:
- Ask PCI module to unplug all GPU devices.
- Set
VGPU.currently_attached
to false
for all the VM’s VGPUs.
VM.suspend, VM.resume(_on):
- Raise VM_HAS_PCI_ATTACHED if the VM has any plugged
VGPU
objects, as suspend/resume for VMs with GPUs is currently not
supported.
VM.pool_migrate:
- Raise VM_HAS_PCI_ATTACHED if the VM has any plugged
VGPU
objects, as live migration for VMs with GPUs is currently not
supported.
VM.clone, VM.copy, VM.snapshot:
- Copy
VGPU
objects along with the VM.
VM.import, VM.export:
- Include
VGPU
and GPU_group
objects in the VM export format.
VM.checkpoint
- Raise VM_HAS_PCI_ATTACHED if the VM has any plugged
VGPU
objects, as checkpointing for VMs with GPUs is currently not
supported.
Pool Join and Eject
Pool join:
For each PGPU
:
- Copy it to the pool.
- Add it to a
GPU_group
of identical PGPUs, or a new one.
Copy each VGPU
to the pool together with the VM that owns it, and
add it to the GPU group containing the same PGPU
as before the
join.
Step 1 is done automatically by the xapi startup code, and step 2 is
handled by the VM export/import code. Hence, no work needed.
Pool eject:
VGPU
objects will be automatically GC’ed when the VMs are removed.- Xapi’s startup code recreates the
PGPU
and GPU_group
objects.
Hence, no work needed.
Required Low-level Interface
Xapi needs a way to obtain a list of all PCI devices present on a host.
For each device, xapi needs to know:
- The PCI ID (BDF).
- The type of device (NIC, GPU, …) according to a well-defined and
stable list of device types (as in
/usr/share/hwdata/pci.ids
). - The device and vendor ID+name (currently, for PIFs, xapi looks up
the name in
/usr/share/hwdata/pci.ids
). - Which other devices/functions are required to be passed through to
the same VM (co-located), e.g. other functions of a compound PCI
device.
Command-Line Interface (xe)
- xe pgpu-list
- xe pgpu-param-list/get/set/add/remove/clear
- xe gpu-group-list
- xe gpu-group-param-list/get/set/add/remove/clear
- xe vgpu-list
- xe vgpu-create
- xe vgpu-destroy
- xe vgpu-param-list/get/set/add/remove/clear
- xe host-param-get param-name=chipset-info param-key=iommu
Design document |
---|
Revision | v3 |
Status | released (7.0) |
Revision history |
---|
v1 | Documented interface changes between xapi and xenopsd for vGPU |
v2 | Added design for storing vGPU-to-pGPU allocation in xapi database |
v3 | Marked new xapi DB fields as internal-only |
GPU support evolution
Introduction
As of XenServer 6.5, VMs can be provisioned with access to graphics processors
(either emulated or passed through) in four different ways. Virtualisation of
Intel graphics processors will exist as a fifth kind of graphics processing
available to VMs. These five situations all require the VM’s device model to be
created in subtly different ways:
Pure software emulation
- qemu is launched either with no special parameter, if the basic Cirrus
graphics processor is required, otherwise qemu is launched with the
-std-vga
flag.
Generic GPU passthrough
- qemu is launched with the
-priv
flag to turn on privilege separation - qemu can additionally be passed the
-std-vga
flag to choose the
corresponding emulated graphics card.
Intel integrated GPU passthrough (GVT-d)
- As well as the
-priv
flag, qemu must be launched with the -std-vga
and
-gfx_passthru
flags. The actual PCI passthrough is handled separately
via xen.
NVIDIA vGPU
- qemu is launched with the
-vgpu
flag - a secondary display emulator, demu, is launched with the following parameters:
--domain
- the VM’s domain ID--vcpus
- the number of vcpus available to the VM--gpu
- the PCI address of the physical GPU on which the emulated GPU will
run--config
- the path to the config file which contains detail of the GPU to
emulate
Intel vGPU (GVT-g)
- here demu is not used, but instead qemu is launched with five parameters:
-xengt
-vgt_low_gm_sz
- the low GM size in MiB-vgt_high_gm_sz
- the high GM size in MiB-vgt_fence_sz
- the number of fence registers-priv
xenopsd
To handle all these possibilities, we will add some new types to xenopsd’s
interface:
module Pci = struct
type address = {
domain: int;
bus: int;
device: int;
fn: int;
}
...
end
module Vgpu = struct
type gvt_g = {
physical_pci_address: Pci.address;
low_gm_sz: int64;
high_gm_sz: int64;
fence_sz: int;
}
type nvidia = {
physical_pci_address: Pci.address;
config_file: string
}
type implementation =
| GVT_g of gvt_g
| Nvidia of nvidia
type id = string * string
type t = {
id: id;
position: int;
implementation: implementation;
}
type state = {
plugged: bool;
emulator_pid: int option;
}
end
module Vm = struct
type igd_passthrough =
| GVT_d
type video_card =
| Cirrus
| Standard_VGA
| Vgpu
| Igd_passthrough of igd_passthrough
...
end
module Metadata = struct
type t = {
vm: Vm.t;
vbds: Vbd.t list;
vifs: Vif.t list;
pcis: Pci.t list;
vgpus: Vgpu.t list;
domains: string option;
}
end
The video_card
type is used to indicate to the function
Xenops_server_xen.VM.create_device_model_config
how the VM’s emulated graphics
card will be implemented. A value of Vgpu
indicates that the VM needs to be
started with one or more virtualised GPUs - the function will need to look at
the list of GPUs associated with the VM to work out exactly what parameters to
send to qemu.
If Vgpu.state.emulator_pid
of a plugged vGPU is None
, this indicates that
the emulation of the vGPU is being done by qemu rather than by a separate
emulator.
n.b. adding the vgpus
field to Metadata.t
will break backwards compatibility
with old versions of xenopsd, so some upgrade logic will be required.
This interface will allow us to support multiple vGPUs per VM in future if
necessary, although this may also require reworking the interface between
xenopsd, qemu and demu. For now, xenopsd will throw an exception if it is asked
to start a VM with more than one vGPU.
xapi
To support the above interface, xapi will convert all of a VM’s non-passthrough
GPUs into Vgpu.t
objects when sending VM metadata to xenopsd.
In contrast to GVT-d, which can only be run on an Intel GPU which has been
hidden from dom0, GVT-g will only be allowed to run on a GPU which has
not been hidden from dom0.
If a GVT-g-capable GPU is detected, and it is not hidden from dom0, xapi will
create a set of VGPU_type objects to represent the vGPU presets which can run on
the physical GPU. Exactly how these presets are defined is TBD, but a likely
solution is via a set of config files as with NVIDIA vGPU.
Allocation of vGPUs to physical GPUs
For NVIDIA vGPU, when starting a VM, each vGPU attached to the VM is assigned
to a physical GPU as a result of capacity planning at the pool level. The
resulting configuration is stored in the VM.platform dictionary, under
specific keys:
vgpu_pci_id
- the address of the physical GPU on which the vGPU will runvgpu_config
- the path to the vGPU config file which the emulator will use
Instead of storing the assignment in these fields, we will add a new
internal-only database field:
VGPU.scheduled_to_be_resident_on (API.ref_PGPU)
This will be set to the ref of the physical GPU on which the vGPU will run. From
here, xapi can easily obtain the GPU’s PCI address. Capacity planning will also
take into account which vGPUs are scheduled to be resident on a physical GPU,
which will avoid races resulting from many vGPU-enabled VMs being started at
once.
The path to the config file is already stored in the VGPU_type.internal_config
dictionary, under the key vgpu_config
. xapi will use this value directly
rather than copying it to VM.platform.
To support other vGPU implementations, we will add another internal-only
database field:
VGPU_type.implementation enum(Passthrough|Nvidia|GVT_g)
For the GVT_g
implementation, no config file is needed. Instead,
VGPU_type.internal_config
will contain three key-value pairs, with the keys
vgt_low_gm_sz
vgt_high_gm_sz
vgt_fence_sz
The values of these pairs will be used to construct a value of type
Xenops_interface.Vgpu.gvt_g
, which will be passed down to xenopsd.
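A sketch of the mapping xapi would perform for a GVT-g vGPU: take the three keys from VGPU_type.internal_config plus the scheduled pGPU's PCI address, and build the fields of Xenops_interface.Vgpu.gvt_g (represented here as a plain dict; the helper itself is illustrative):

```python
def gvt_g_of_vgpu(internal_config, pci_address):
    """internal_config: the VGPU_type.internal_config map described above.
    pci_address: BDF string of the physical GPU, e.g. "0000:00:02.0"."""
    domain, bus, rest = pci_address.split(":")
    device, fn = rest.split(".")
    return {
        "physical_pci_address": {
            "domain": int(domain, 16),
            "bus": int(bus, 16),
            "device": int(device, 16),
            "fn": int(fn, 16),
        },
        "low_gm_sz": int(internal_config["vgt_low_gm_sz"]),
        "high_gm_sz": int(internal_config["vgt_high_gm_sz"]),
        "fence_sz": int(internal_config["vgt_fence_sz"]),
    }
```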
Design document |
---|
Revision | v1 |
Status | released (6.5) |
GRO and other properties of PIFs
It has been possible to enable and disable GRO and other “ethtool” features on
PIFs for a long time, but there was never an official API for it. Now there is.
Introduction
The former way to enable GRO via the CLI is as follows:
xe pif-param-set uuid=<pif-uuid> other-config:ethtool-gro=on
xe pif-plug uuid=<pif-uuid>
The other-config
field is a grab-bag of options that are not clearly defined.
The options exposed through other-config
are mostly experimental features, and
the interface is not considered stable. Furthermore, the field is read/write
and does not have any input validation, and cannot trigger any actions
immediately. The latter is why it is necessary to call pif-plug
after setting
the ethtool-gro
key, in order to actually make things happen.
New API
New field:
- Field
PIF.properties
of type (string -> string) map
. - Physical and bond PIFs have a
gro
key in their properties
, with possible values on
and off
. There are currently no other properties defined. - VLAN and Tunnel PIFs do not have any properties. They implicitly inherit the properties from the PIF they are based upon (either a physical PIF or a bond).
- For backwards compatibility, if there is a
other-config:ethtool-gro
key present on the PIF, it will be treated as an override of the gro
key in PIF.properties
.
New function:
- Message
void PIF.set_property (PIF ref, string, string)
. - First argument: the reference of the PIF to act on.
- Second argument: the key to change in the
properties
field. - Third argument: the value to write.
- The function can only be used on physical PIFs that are not bonded, and on bond PIFs. Attempts to call the function on bond slaves, VLAN PIFs, or Tunnel PIFs, fail with
CANNOT_CHANGE_PIF_PROPERTIES
. - Calls with invalid keys or values fail with
INVALID_VALUE
. - When called on a bond PIF, the key in the
properties
of the associated bond slaves will also be set to same value. - The function automatically causes the settings to be applied to the network devices (no additional
plug
is needed). This includes any VLANs that are on top of the PIF to-be-changed, as well as any bond slaves.
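For example, turning GRO off on a physical or bond PIF via the new call, and reading the result back (a sketch using the XenAPI Python bindings; the host address, credentials and PIF UUID are placeholders):

```python
import XenAPI

session = XenAPI.Session("https://xenserver.example.com")   # placeholder host
session.xenapi.login_with_password("root", "password")      # placeholder creds
try:
    pif = session.xenapi.PIF.get_by_uuid("<pif-uuid>")       # placeholder uuid
    # Applies immediately; no pif-plug is required, and bond slaves /
    # VLANs on top of this PIF are updated as described above.
    session.xenapi.PIF.set_property(pif, "gro", "off")
    print(session.xenapi.PIF.get_properties(pif))            # {'gro': 'off'}
finally:
    session.xenapi.session.logout()
```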
Defaults, Installation and Upgrade
- Any newly introduced PIF will have its
properties
field set to "gro" -> "on"
. This includes PIFs obtained after a fresh installation of XenServer, as well as PIFs created using PIF.introduce
or PIF.scan
. In other words, GRO will be “on” by default. - An upgrade from a version of XenServer that does not have the
PIF.properties
field, will give every physical and bond PIF a properties
field set to "gro" -> "on"
. In other words, GRO will be “on” by default after an upgrade.
Bonding
- When creating a bond, the bond-slaves-to-be must all have equal
PIF.properties
. If not, the bond.create
call will fail with INCOMPATIBLE_BOND_PROPERTIES
. - When a bond is created successfully, the
properties
of the bond PIF will be equal to the properties of the bond slaves.
Command Line Interface
- The
PIF.properties
field is exposed through xe pif-list
and xe pif-param-list
as usual. - The
PIF.set_property
call is exposed through xe pif-param-set
. For example: xe pif-param-set uuid=<pif-uuid> properties:gro=off
.
Design document |
---|
Revision | v1 |
Status | released (5.6) |
Heterogeneous pools
Notes
- The
cpuid
instruction is used to obtain a CPU’s manufacturer,
family, model, stepping and features information. - The feature bitvector is 128 bits wide: 2 times 32 bits of base
features plus 2 times 32 bits of extended features, which are
referred to as
base_ecx
, base_edx
, ext_ecx
and ext_edx
(after the registers used by cpuid
to store the results). - The feature bits can be masked by Intel FlexMigration and AMD
Extended Migration. This means that features can be made to appear
as absent. Hence, a CPU can appear as a less-capable CPU.
- AMD Extended Migration is able to mask both base and extended
features.
- Intel FlexMigration on Core 2 CPUs (Penryn) is able to mask
only the base features (
base_ecx
and base_edx
). The
newer Nehalem and Westmere CPUs support extended-feature masking
as well.
- A process in dom0 (e.g. xapi) is able to call
cpuid
to obtain the
(possibly modified) CPU info, or can obtain this information from
Xen. Masking is done only by Xen at boot time, before any domains
are loaded. - To apply a feature mask, a dom0 process may specify the mask in the
Xen command line in the file
/boot/extlinux.conf
. After a reboot,
the mask will be enforced. - It is not possible to obtain the original features from a dom0
process, if the features have been masked. Before applying the first
mask, the process could remember/store the original feature vector,
or obtain the information from Xen.
- All CPU cores on a host can be assumed to be identical. Masking will
be done simultaneously on all cores in a host.
- Whether a CPU supports FlexMigration/Extended Migration can (only)
be derived from the family/model/stepping information.
- XS5.5 has an exception for the EST feature in base_ecx. This flag
is ignored on pool join.
Overview of XenAPI Changes
Fields
Currently, the datamodel has Host_cpu
objects for each CPU core in a
host. As they are all identical, we are considering keeping just one CPU
record in the Host
object itself, and deprecating the Host_cpu
class. For backwards compatibility, the Host_cpu
objects will remain
as they are in MNR, but may be removed in subsequent releases.
Hence, there will be a new field called Host.cpu_info
, a read-only
string-string map, containing the following fixed set of keys:
The maskable key indicates whether the CPU supports Intel FlexMigration or AMD Extended
Migration. There are three possible values: "no"
means that masking is
not possible, "base"
means that only base features can be masked, and
"full"
means that base as well as extended features can be masked.
Note: When the features
and features_after_reboot
are different,
XenCenter could display a warning saying that a reboot is needed to
enforce the feature masking.
The Pool.other_config:cpuid_feature_mask
key is recognised. If this
key is present and if it contains a value in the same format as
Host.cpu_info:features
, the value is used to mask the feature vectors
before comparisons during any pool join in the pool it is defined on.
This can be used to white-list certain feature flags, i.e. to ignore
them when adding a new host to a pool. The default is
ffffff7f-ffffffff-ffffffff-ffffffff
, which white-lists the EST feature
for compatibility with XS 5.5 and earlier.
Messages
New messages:
Host.set_cpu_features
- Parameters: Host reference
host
, new CPU feature vector
features
. - Roles: only Pool Operator and Pool Admin.
- Sets the feature vector to be used after a reboot
(
Host.cpu_info:features_after_reboot
), if features
is valid.
Host.reset_cpu_features
- Parameter: Host reference
host
. - Roles: only Pool Operator and Pool Admin.
- Removes the feature mask, such that after a reboot all features
of the CPU are enabled.
XAPI
Back-end
- Xen keeps the physical (unmasked) CPU features in memory when
it starts, before applying any masks. Xen exposes the physical
features, as well as the current (possibly masked) features, to
dom0/xapi via the function
xc_get_boot_cpufeatures
in libxc. - A dom0 script
/etc/xensource/libexec/xen-cmdline
, which provides a
future-proof way of modifying the Xen command-line key/value pairs.
This script has the following options, where mask
is one of
cpuid_mask_ecx
, cpuid_mask_edx
, cpuid_mask_ext_ecx
or
cpuid_mask_ext_edx
, and value
is 0xhhhhhhhh
(h
represents
a hex digit).:--list-cpuid-masks
--set-cpuid-masks mask=value mask=value
--delete-cpuid-masks mask mask
- A
restrict_cpu_masking
key has been added to the host licensing
restrictions map. This will be true
when the Host.edition
is
free
, and false
if it is enterprise
or platinum
.
Start-up
The Host.cpu_info
field is refreshed:
- The values for the keys
cpu_count
, vendor
, speed
, modelname
,
flags
, stepping
, model
, and family
are obtained from
/etc/xensource/boot_time_cpus
(and ultimately from
/proc/cpuinfo
). - The values of the
features
and physical_features
are obtained
from Xen and the features_after_reboot
key is made equal to the
features
field. - The value of the
maskable
key is determined by the CPU details.- for Intel Core2 (Penryn) CPUs:
family = 6 and (model = 1dh or (model = 17h and stepping >= 4))
(maskable = "base"
) - for Intel Nehalem/Westmere CPUs:
family = 6 and ((model = 1ah and stepping > 2) or model = 1eh or model = 25h or model = 2ch or model = 2eh or model = 2fh)
(maskable = "full"
) - for AMD CPUs:
family >= 10h
(maskable = "full"
)
Setting (Masking) and Resetting the CPU Features
- The
Host.set_cpu_features
call:- checks whether the license of the host is Enterprise or
Platinum; throws FEATURE_RESTRICTED if not.
- expects a string of 32 hexadecimal digits, optionally containing
spaces; throws INVALID_FEATURE_STRING if malformed.
- checks whether the given feature vector can be formed by masking
the physical feature vector; throws INVALID_FEATURE_STRING if
not. Note that on Intel Core 2 CPUs, it is only possible to
mask the base features!
- checks whether the CPU supports FlexMigration/Extended
Migration; throws CPU_FEATURE_MASKING_NOT_SUPPORTED if not.
- sets the value of
features_after_reboot
to the given feature
vector. - adds the new feature mask to the Xen command-line via the
xen-cmdline
script. The mask is represented by one or more of
the following key/value pairs (where h
represents a hex
digit):cpuid_mask_ecx=0xhhhhhhhh
cpuid_mask_edx=0xhhhhhhhh
cpuid_mask_ext_ecx=0xhhhhhhhh
cpuid_mask_ext_edx=0xhhhhhhhh
- The
Host.reset_cpu_features
call:- copies
physical_features
to features_after_reboot
. - removes the feature mask from the Xen command-line via the
xen-cmdline
script (if any).
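The feature-string handling above boils down to simple bit arithmetic; a sketch of the two central checks (parsing the 32-hex-digit vector, and verifying that a requested vector can be formed by masking the physical one):

```python
def parse_features(s):
    # Accept "ffffff7f-ffffffff-ffffffff-ffffffff", with spaces, or bare hex.
    digits = s.replace("-", "").replace(" ", "")
    if len(digits) != 32:
        raise ValueError("INVALID_FEATURE_STRING")
    return int(digits, 16)

def can_be_masked_to(physical, requested):
    # A vector is achievable iff masking only ever clears bits,
    # i.e. the requested vector is a subset of the physical one.
    return requested & physical == requested

physical  = parse_features("ffffff7f-ffffffff-ffffffff-ffffffff")
requested = parse_features("ffffff7f-ffffffff-ffffffff-fffffffe")
assert can_be_masked_to(physical, requested)        # only clears bit 0
assert not can_be_masked_to(requested, physical)    # would need to set a bit
```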
Pool Join and Eject
Pool.join
fails when the vendor
and feature
keys do not match,
and disregards any other key in Host.cpu_info
.- However, as XS5.5 disregards the EST flag, there is a new way to
disregard/ignore feature flags on pool join, by setting a mask
in
Pool.other_config:cpuid_feature_mask
. The value of this
field should have the same format as Host.cpu_info:features
.
When comparing the CPUID features of the pool and the joining
host for equality, this mask is applied before the comparison.
The default is ffffff7f-ffffffff-ffffffff-ffffffff
, which
defines the EST feature, bit 7 of the base ecx flags, as “don’t
care”.
Pool.eject
clears the database (as usual), and additionally
removes the feature mask from /boot/extlinux.conf
(if any).
CLI
New commands:
host-cpu-info
- Parameters:
uuid
(optional, uses localhost if absent). - Lists
Host.cpu_info
associated with the host.
host-get-cpu-features
- Parameters:
uuid
(optional, uses localhost if absent). - Returns the value of
Host.cpu_info:features]
associated with
the host.
host-set-cpu-features
- Parameters:
features
(string of 32 hexadecimal digits,
optionally containing spaces or dashes), uuid
(optional, uses
localhost if absent). - Calls
Host.set_cpu_features
.
host-reset-cpu-features
- Parameters:
uuid
(optional, uses localhost if absent). - Calls
Host.reset_cpu_features
.
The following commands will be deprecated: host-cpu-list
,
host-cpu-param-get
, host-cpu-param-list
.
WARNING:
If the user is able to set any mask they like, they may end up disabling
CPU features that are required by dom0 (and probably other guest OSes),
resulting in a kernel panic when the machine restarts. Hence, using the
set function is potentially dangerous.
It is apparently not easy to find out exactly which flags are safe to
mask and which aren’t, so we cannot prevent an API/CLI user from making
mistakes in this way. However, using XenCenter would always be safe, as
XC always copies feature masks from real hosts.
If a machine ends up in such a bad state, there is a way to get out of
it. At the boot prompt (before Xen starts), you can type “menu.c32”,
select a boot option and alter the Xen command-line to remove the
feature masks, after which the machine will again boot normally (note:
in our set-up, there is first a PXE boot prompt; the second prompt is
the one we mean here).
The API/CLI documentation should stress the potential danger of using
this functionality, and explain how to get out of trouble again.
Design document |
---|
Revision | v1 |
Status | confirmed |
Improving snapshot revert behaviour
Currently there is a XenAPI VM.revert
which reverts a “VM” to the state it
was in when a VM-level snapshot was taken. There is no VDI.revert
so
VM.revert
uses VDI.clone
to change the state of the disks.
The use of VDI.clone
has the side-effect of changing VDI refs and uuids.
This causes the following problems:
- It is difficult for clients
such as Apache CloudStack to keep track
of the disks it is actively managing
- VDI snapshot metadata (
VDI.snapshot_of
et al) has to be carefully
fixed up since all the old refs are now dangling
We will fix these problems by:
- adding a
VDI.revert
to the SMAPIv2 and calling this from VM.revert
- defining a new SMAPIv1 operation
vdi_revert
and a corresponding capability
VDI_REVERT
- the Xapi implementation of
VDI.revert
will first try the vdi_revert
,
and fall back to VDI.clone
if that fails - implement
vdi_revert
for common storage types, including File and LVM-based
SRs.
XenAPI changes
We will add the function VDI.revert
with arguments:
- in:
snapshot: Ref(VDI)
: the snapshot to which we want to revert - in:
driver_params: Map(String,String)
: optional extra parameters - out:
Ref(VDI)
the new VDI
The function will look up the VDI which this is a snapshot_of
, and change
the VDI to have the same contents as the snapshot. The snapshot will not be
modified. If the implementation is able to revert in-place, then the reference
returned will be the VDI this is a snapshot_of
; otherwise it is a reference
to a fresh VDI (created by the VDI.clone
fallback path)
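For example, reverting a VDI from one of its snapshots using the new call (a sketch via the XenAPI Python bindings; the session is assumed to be logged in already, and driver_params is left empty):

```python
def revert_vdi(session, snapshot_vdi):
    # `session` is an authenticated XenAPI session (Python bindings).
    # Returns the VDI carrying the reverted contents: either the original VDI
    # (in-place revert) or a fresh VDI created by the VDI.clone fallback path.
    reverted = session.xenapi.VDI.revert(snapshot_vdi, {})
    original = session.xenapi.VDI.get_snapshot_of(snapshot_vdi)
    if reverted == original:
        print("reverted in place; refs and uuid unchanged")
    return reverted
```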
References:
SMAPIv1 changes
We will define the function vdi_revert
with arguments:
- in:
sr_uuid
: the UUID of the SR containing both the VDI and the snapshot - in:
vdi_uuid
: the UUID of the snapshot whose contents should be duplicated - in:
target_uuid
: the UUID of the target whose contents should be replaced
The function will replace the contents of the target_uuid
VDI with the
contents of the vdi_uuid
VDI without changing the identity of the target
(i.e. name-label, uuid and location are guaranteed to remain the same).
The vdi_uuid
is preserved by this operation. The operation is obviously
idempotent.
Xapi changes
Xapi will
- use
VDI.revert
in the VM.revert
code-path - expose a new
xe vdi-revert
CLI command - implement the
VDI.revert
by calling the SMAPIv1 function and falling back
to VDI.clone
if a Not_implemented
exception is thrown
References:
SM changes
We will modify
- SRCommand.py and VDI.py to add a new
vdi_revert
function which throws
a ’not implemented’ exception - FileSR.py to implement
VDI.revert
using a variant of the existing
snapshot/clone machinery - EXTSR.py and NFSSR.py to advertise the
VDI_REVERT
capability - LVHDSR.py to implement
VDI.revert
using a variant of the existing
snapshot/clone machinery - LVHDoISCSISR.py and LVHDoHBASR.py to advertise the
VDI_REVERT
capability
Prototype code
Prototype code exists here:
Design document |
---|
Revision | v3 |
Status | released (6.5 sp1) |
Review | #33 |
Integrated GPU passthrough support
Introduction
Passthrough of discrete GPUs has been
available since XenServer 6.0.
With some extensions, we will also be able to support passthrough of integrated
GPUs.
- Whether an integrated GPU will be accessible to dom0 or available to
pass through to guests must be configurable via XenAPI.
- Passthrough of an integrated GPU requires an extra flag to be sent to qemu.
Host Configuration
New fields will be added (both read-only):
PGPU.dom0_access enum(enabled|disable_on_reboot|disabled|enable_on_reboot)
host.display enum(enabled|disable_on_reboot|disabled|enable_on_reboot)
as well as new API calls used to modify the state of these fields:
PGPU.enable_dom0_access
PGPU.disable_dom0_access
host.enable_display
host.disable_display
Each of these API calls will return the new state of the field e.g. calling
host.disable_display
on a host with display = enabled
will return
disable_on_reboot
.
Disabling dom0 access will modify the xen commandline (using the xen-cmdline
tool) such that dom0 will not be able to access the GPU on next boot.
Calling host.disable_display will modify the xen and dom0 commandlines such
that neither will attempt to send console output to the system display device.
A state diagram for the fields PGPU.dom0_access
and host.display
is shown
below:
While it is possible for these two fields to be modified independently, a
client must disable both the host display and dom0 access to the system display
device before that device can be passed through to a guest.
Note that when a client enables or disables either of these fields, the change
can be cancelled until the host is rebooted.
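For example, preparing an integrated GPU for passthrough (a sketch via the XenAPI Python bindings; both calls return the new state of the field, which will typically be a *_on_reboot value until the host is actually rebooted):

```python
def prepare_igd_for_passthrough(session, host, pgpu):
    # `session` is an authenticated XenAPI session (Python bindings).
    display_state = session.xenapi.host.disable_display(host)
    dom0_state = session.xenapi.PGPU.disable_dom0_access(pgpu)
    # Both must reach "disabled" before the device can be passed through;
    # until the reboot they will normally read "disable_on_reboot".
    if "on_reboot" in display_state or "on_reboot" in dom0_state:
        print("Reboot the host to complete the change "
              "(it can still be cancelled before rebooting).")
    return display_state, dom0_state
```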
Handling vga_arbiter
Currently, xapi will not create a PGPU object for the PCI device with address
reported by /dev/vga_arbiter
. This is to prevent a GPU in use by dom0 from
being passed through to a guest. This behaviour will be changed - instead
of not creating a PGPU object at all, xapi will create a PGPU, but its
supported_VGPU_types field will be empty.
However, the PGPU’s supported_VGPU_types will be populated as normal if:
- dom0 access to the GPU is disabled.
- The host’s display is disabled.
- The vendor ID of the device is contained in a whitelist provided by xapi’s
config file.
A read-only field will be added:
PGPU.is_system_display_device bool
This will be true for a PGPU iff /dev/vga_arbiter
reports the PGPU as the
system display device for the host on which the PGPU is installed.
Interfacing with xenopsd
When starting a VM attached to an integrated GPU, the VM config sent to xenopsd
will contain a video_card of type IGD_passthrough. This will override the type
determined from VM.platform:vga. xapi will consider a GPU to be integrated if
both:
- It resides on bus 0.
- The vendor ID of the device is contained in a whitelist provided by xapi’s
config file.
When xenopsd starts qemu for a VM with a video_card of type IGD_passthrough,
it will pass the flags “-std-vga” AND “-gfx_passthru”.
Design document |
---|
Revision | v1 |
Status | proposed |
Local database
All hosts in a pool use the shared database by sending queries to
the pool master. This creates a performance bottleneck as the pool
size increases. All hosts in a pool receive a database backup from
the master periodically, every couple of hours. This creates a
reliability problem as updates may be lost if the master fails during
the window before the backup.
The reliability problem can be avoided by running with HA or the redo
log enabled, but this is not always possible.
We propose to:
- adapt the existing event machinery to allow every host to maintain
an up-to-date database replica;
- actively cache the database locally on each host and satisfy read
operations from the cache. Most database operations are reads so
this should reduce the number of RPCs across the network.
In a later phase we can move to a completely
distributed database.
Replicating the database
We will create a database-level variant of the existing XenAPI event.from
API. The new RPC will block until a database event is generated, and then
the events will be returned using the existing “redo-log” event types. We
will add a few second delay into the RPC to batch the updates.
We will replace the pool database download logic with an event.from
-like
loop which fetches all the events from the master’s database and applies
them to the local copy. The first call will naturally return the full database
contents.
We will turn on the existing “in memory db cache” mechanism on all hosts,
not just the master. This will be where the database updates will go.
The result should be that every host will have a /var/xapi/state.db
file,
with writes going to the master first and then filtering down to all slaves.
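The replication loop on each host would then look roughly like this (a purely illustrative sketch: the database-level event.from-like RPC, its token and the apply step are the new pieces proposed above, so all names here are hypothetical):

```python
import time

def replicate_forever(master_rpc, local_db, token=""):
    """Keep the local replica up to date with the master's database."""
    while True:
        try:
            # Hypothetical database-level event.from: blocks until there are
            # new events, returns "redo-log"-style entries plus a new token.
            events, token = master_rpc.database_event_from(token)
            for event in events:
                local_db.apply(event)     # write into the in-memory db cache
        except Exception:
            time.sleep(5)                 # master unreachable: retry later
            # An empty token makes the next call return the full database.
            token = ""
```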
Using the replica as a cache
We will re-use the Disaster Recovery multiple
database mechanism to allow slaves to access their local database. We will
change the default database “context” to snapshot the local database,
perform reads locally and write-through to the master.
We will add an HTTP header to all forwarded XenAPI calls from the master which
will include the current database generation count. When a forwarded XenAPI
operation is received, the slave will deliberately wait until the local cache
is at least as new as this, so that we always use fresh metadata for XenAPI
calls (e.g. the VM.start uses the absolute latest VM memory size).
We will document the new database coherence policy, i.e. that writes on a host
will not immediately be seen by reads on another host. We believe that this
is only a problem when we are using the database for locking and are attempting
to hand over a lock to another host. We are already using XenAPI calls forwarded
to the master for some of this, but may need to do a bit more of this; in
particular the storage backends may need some updating.
Design document |
---|
Revision | v3 |
Status | proposed |
Revision history |
---|
v1 | Initial version |
v2 | Addition of `networkd_db` update for Upgrade |
v3 | More info on `networkd_db` and API Errors |
Management Interface on VLAN
This document describes design details for the
REQ-42: Support Use of VLAN on XAPI Management Interface.
XAPI and XCP-Networkd
Creating a VLAN
Creating a VLAN is already supported. Listing the steps to create a VLAN, which are used later in this document.
Steps:
- Check the PIFs created on a Host for physical devices
eth0
, eth1
.
xe pif-list params=uuid physical=true host-uuid=UUID
this will list pif-UUID
- Create a new network for the VLAN interface.
xe network-create name-label=VLAN1
It returns a new network-UUID
- Create a VLAN PIF.
xe vlan-create pif-uuid=pif-UUID network-uuid=network-UUID vlan=VLAN-ID
It returns a new VLAN PIF new-pif-UUID
- Plug the VLAN PIF.
xe pif-plug uuid=new-pif-UUID
- Configure IP on the VLAN PIF.
xe pif-reconfigure-ip uuid=new-pif-UUID mode= IP= netmask= gateway= DNS=
This will configure IP on the PIF; here mode
is mandatory and the other parameters are needed when selecting mode=static
Similarly, creating a VLAN PIF can be achieved by the corresponding XenAPI calls, for example:
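(A sketch using the XenAPI Python bindings; the host address, credentials, device name eth0 and VLAN ID 42 are placeholders.)

```python
import XenAPI

session = XenAPI.Session("https://xenserver.example.com")    # placeholder
session.xenapi.login_with_password("root", "password")       # placeholder
try:
    # Physical PIF for eth0 on this host (step 1 above).
    pifs = session.xenapi.PIF.get_all_records()
    phys_pif = [ref for ref, r in pifs.items()
                if r["physical"] and r["device"] == "eth0"][0]
    # network-create (step 2).
    net = session.xenapi.network.create(
        {"name_label": "VLAN1", "name_description": "", "MTU": "1500",
         "other_config": {}})
    # vlan-create (step 3); int64 arguments are passed as strings.
    vlan = session.xenapi.VLAN.create(phys_pif, "42", net)
    vlan_pif = session.xenapi.VLAN.get_untagged_PIF(vlan)
    # pif-plug (step 4) and pif-reconfigure-ip (step 5).
    session.xenapi.PIF.plug(vlan_pif)
    session.xenapi.PIF.reconfigure_ip(vlan_pif, "DHCP", "", "", "", "")
finally:
    session.xenapi.session.logout()
```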
Recognise VLAN config from management.conf
For a newly installed host, if the host installer was asked to put the management interface on a given VLAN,
we will expect a new entry VLAN=ID
under /etc/firstboot.d/data/management.conf
.
Listing current contents of management.conf which will be used later in the document.
LABEL
=eth0
-> Represents the physical device on which the management interface must reside.
MODE
=dhcp
||static
-> Represents IP configuration mode for the Management Interface. There can be other parameters like IP, NETMASK, GATEWAY and DNS when we have static
mode.
VLAN
=ID
-> New entry specifying the VLAN tag to be configured on device LABEL
.
The management interface is going to be configured on this VLAN ID with the specified mode.
Firstboot script needs to recognise the VLAN config
Firstboot script /etc/firstboot.d/30-prepare-networking
needs to be updated to configure the
management interface on the provided VLAN ID.
Steps to be followed:
PIF.scan
performed in the script must have created the PIFs for the underlying physical devices. - Get the PIF UUID for physical device
LABEL
. - Repeat the steps mentioned in
Creating a VLAN
, i.e. network-create, vlan-create and pif-plug. Now we have a new PIF for the VLAN. - Perform
pif-reconfigure-ip
for the new VLAN PIF. - Perform
host-management-reconfigure
using new VLAN PIF.
XCP-Networkd needs to recognise the VLAN config during startup
XCP-Networkd during first boot and boot after pool eject gets the initial network setup from the management.conf
and xensource-inventory
files to update networkd.db with the management interface info.
XCP-Networkd must honour the new VLAN config.
Steps to be followed:
- During startup
read_config
step tries to read the /var/lib/xcp/networkd.db
file which is not yet created just after host installation. - Since
networkd.db
read throws Read_Error
, it tries to read network.dbcache
which is also not available; hence it falls back to read_management_conf
. - There are two possible MODEs,
static
or dhcp
taken from management.conf. bridge_name
is taken as MANAGEMENT_INTERFACE
from xensource-inventory, further bridge_config
and interface_config
are build based on MODE.- Call
Bridge.make_config()
and Interface.make_config()
are performed with respective bridge_config
and interface_config
.
Updating networkd_db program
networkd_db
provides the management interface info to the host installer during upgrade.
It reads /var/lib/xcp/networkd.db
file to output the management interface information. Here we need to update networkd_db to output the VLAN information when a VLAN bridge is given as input.
Steps to be followed:
- Currently VLAN interface IP information is provided correctly on passing VLAN bridge as input.
networkd_db -iface xapi0
this will list mode
as dhcp or static, if mode=static then it will provide ipaddr
and netmask
too. - We need to udpate this program to provide VLAN ID and parent bridge info on passing VLAN bridge as input.
networkd_db -bridge xapi0
It should output the VLAN info like:
interfaces=
vlan=vlanID
parent=xenbr0
Using the parent bridge, the user can identify the physical interfaces.
Here we will extract VLAN and parent bridge from bridge_config
under networkd.db
.
Additional VLAN parameter for Emergency Network Reset
Detail design is mentioned on http://xapi-project.github.io/xapi/design/emergency-network-reset.html
For using xe-reset-networking
utility to configure the management interface on a VLAN, we need to add one more parameter --vlan=vlanID
to the utility.
There are certain parameters that need to be passed to this utility: --master, --device, --mode, --ip, --netmask, --gateway, --dns and the new one, --vlan.
VLAN parameter addition to xe-reset-networking
Steps to be followed:
- Check if
VLANID
is passed then let bridge=xapi0
. - Write the
bridge=xapi0
into the xensource-inventory file. This should work as Xapi checks available bridges while creating networks. - Write the
VLAN=vlanID
into management.conf
and /tmp/network-reset
. - Modify
check_network_reset
under xapi.ml to perform steps Creating a VLAN
and perform management_reconfigure
on vlan pif.
Step Creating a VLAN
must have created the VLAN record in the Xapi DB, similar to the firstboot script. - If no VLANID is specified then retain the current one; this utility must take the management interface info from
networkd_db
program and handle the VLAN config.
VLAN parameter addition to xsconsole Emergency Network Reset
This is under the Emergency Network Reset
option in the Network and Management Interface
menu.
Selecting this option will show some explanation in the pane on the right-hand side.
Pressing <Enter> will bring up a dialogue to select the interfaces to use as management interface after the reset.
After choosing a device, the dialogue continues with configuration options like in the Configure Management Interface
dialogue.
There will be an additional option for VLAN in the dialogue.
After completing the dialogue, the same steps as listed for xe-reset-networking are executed.
Updating Pool Join/Eject operations
Pool Join while Pool having Management Interface on a VLAN
Currently pool-join
fails if VLANs are present on the host joining a pool.
We need to allow pool-join only if the pool and the host joining the pool both have their management interface on the same VLAN.
Steps to be followed:
- Under
pre_join_checks
update function assert_only_physical_pifs
to check that the pool master's management_interface is on the same VLAN. - Call
Host.get_management_interface
on Pool master and get the vlanID, match it with localhost
management_interface VLAN ID.
If it matches then allow pool-join. - In case if there are multiple VLANs on host joining a pool, fail the pool-join gracefully.
- After the pool-join, Host xapi db will get sync from pool master xapi db, This will be fine to have management interface on VLAN.
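A minimal sketch of the intended VLAN check, written against the XenAPI Python bindings; the helper name and the error handling are illustrative, the real check would live in assert_only_physical_pifs.
def get_management_vlan(session, host):
    # Return the VLAN ID of the host's management PIF (-1 for a physical PIF).
    pif = session.xenapi.host.get_management_interface(host)
    return int(session.xenapi.PIF.get_VLAN(pif))

def assert_management_vlan_matches(master_session, local_session, master_host, local_host):
    master_vlan = get_management_vlan(master_session, master_host)
    local_vlan = get_management_vlan(local_session, local_host)
    if master_vlan != local_vlan:
        raise Exception("pool-join refused: management VLAN %d does not match master's %d"
                        % (local_vlan, master_vlan))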
Pool Eject while host ejected having Management Interface on a VLAN
Currently the management interface VLAN configuration of a host is not retained in the xensource-inventory or management.conf files.
We need to retain the vlanID in these config files.
Steps to be followed:
- Under the Pool.eject call, we need to update the write_first_boot_management_interface_configuration_file function. - Check whether the management_interface is on a VLAN and, if so, get the VLANID from the pif.
- Update the VLANID in the management.conf file and the bridge in the xensource-inventory file, so that the configuration is retained by XCP-Networkd on startup after the host is ejected.
Currently there is no pool-level API to reconfigure the management_interface for all of the hosts in a pool at once.
A new API, Pool.management_reconfigure, will be needed in order to reconfigure the management_interface on all hosts in a pool to the same network, either VLAN or physical.
Current behaviour to change the Management Interface on Host
Currently, calling Host.management_reconfigure with a VLAN pif-uuid can change the management_interface to the specified VLAN. The steps below describe the workflow of a management_interface reconfigure; the new API will use the Host.management_reconfigure call internally.
Steps performed during management_reconfigure:
- bring_pif_up is called for the pif.
- xensource-inventory is updated with the latest info of the interface.
- update-mh-info updates the management_mac in xenstore.
- The HTTP server is restarted; even though xapi listens on all IP addresses, this new interface, as _the_ management interface, is used by slaves to connect to the pool master.
- on_dom0_networking_change refreshes the console URIs for the new IP address.
- The xapi db is updated with the new management interface info.
The following steps must be performed manually on each host or pool as a prerequisite to using the new API.
We need to make sure that the new network which is going to become the management interface has PIFs configured on each host.
In the case of a physical network we assume the PIFs are already configured on each host; in the case of a VLAN network we need to create VLAN PIFs on each host.
We also assume that the VLAN is available on the switch/network.
Manual steps to be performed before calling the new API:
- Create a VLAN network on the pool via network.create; in the case of physical NICs the network must already be present.
- Create a VLAN PIF on each host via VLAN.create using the above network ref, the physical PIF ref and the vlanID (not needed in the case of a physical network). Alternatively, a single call to pool.create_VLAN providing the device and the above network will create VLAN PIFs for all hosts in the pool.
- Perform PIF.reconfigure_ip for each new network PIF on each host.
If the user wishes to change the management interface manually on each host in a pool, we should allow it, with the following guideline:
the user can individually change the management interface on each host by calling Host.management_reconfigure using PIFs on physical devices or VLAN PIFs.
This must be performed on the slaves first and lastly on the master, as changing the management_interface on the master will disconnect the slaves from the master, after which further Host.management_reconfigure calls cannot be performed until the master recovers the slaves via pool.recover_slaves.
API Details
Pool.management_reconfigure
- Parameter: a network reference network.
- Calling this function configures the management_interface on each host of the pool.
- For the network provided, it checks that PIFs are present on each host; in the case of a VLAN network it checks that VLAN PIFs on the provided network are present on each host of the pool.
- It checks that an IP is configured on the above PIFs on each host.
- If PIFs are not present, or no IP is configured on them, the call must fail gracefully, asking the user to configure them.
- Call Host.management_reconfigure on each slave, then lastly on the master.
- Call pool.recover_slaves on the master in order to recover slaves which might have lost their connection to the master. A sketch of this orchestration follows.
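A minimal sketch of the orchestration described above, written against the XenAPI Python bindings; the exception handling is simplified and the helper structure is an assumption.
def pool_management_reconfigure(session, network):
    hosts = session.xenapi.host.get_all()
    master = session.xenapi.pool.get_master(session.xenapi.pool.get_all()[0])
    # Check every host has a PIF with an IP on the requested network.
    pif_of_host = {}
    for pif in session.xenapi.network.get_PIFs(network):
        rec = session.xenapi.PIF.get_record(pif)
        pif_of_host[rec["host"]] = (pif, rec)
    for host in hosts:
        if host not in pif_of_host:
            raise Exception("REQUIRED_PIF_NOT_PRESENT")
        if pif_of_host[host][1]["IP"] == "":
            raise Exception("INTERFACE_HAS_NO_IP")
    # Reconfigure the slaves first, the master last, then recover the slaves.
    for host in [h for h in hosts if h != master] + [master]:
        session.xenapi.host.management_reconfigure(pif_of_host[host][0])
    session.xenapi.pool.recover_slaves()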
API errors
Possible API errors that may be raised by pool.management_reconfigure
:
INTERFACE_HAS_NO_IP: the specified PIF (pif parameter) has no IP configuration. The new API checks that all PIFs on the new network have an IP configured; the user may have forgotten to configure an IP on the PIF on one or more of the hosts in the pool.
New API error: REQUIRED_PIF_NOT_PRESENT: the specified network (network parameter) has no PIF present on a host in the pool; the user may have forgotten to create the VLAN PIF on one or more of the hosts in the pool.
CP-Tickets
- CP-14027
- CP-14028
- CP-14029
- CP-14030
- CP-14031
- CP-14032
- CP-14033
Design document |
---|
Revision | v2 |
Status | confirmed |
Revision history |
---|
v1 | Initial revision |
v2 | Short-term simplications and scope reduction |
Multiple Cluster Managers
Introduction
Xapi currently uses a cluster manager called xhad. Sometimes other software comes with its own built-in way of managing clusters, which would clash with xhad (example: xhad could choose to fence node ‘a’ while the other system could fence node ‘b’ resulting in a total failure). To integrate xapi with this other software we have 2 choices:
- modify the other software to take membership information from xapi; or
- modify xapi to take membership information from this other software.
This document proposes a way to do the latter.
XenAPI changes
New field
We will add the following new field:
pool.ha_cluster_stack
of type string
(read-only)- If HA is enabled, this field reflects which cluster stack is in use.
- Set to
"xhad"
on upgrade, which implies that so far we have used XenServer’s own cluster stack, called xhad
.
Cluster-stack choice
We assume for now that a particular cluster manager will be mandated (only) by certain types of clustered storage, recognisable by SR type (e.g. OCFS2 or Melio). The SR backend will be able to inform xapi if the SR needs a particular cluster stack, and if so, what is the name of the stack.
When pool.enable_ha
is called, xapi will determine which cluster stack to use based on the presence or absence of such SRs:
- If an SR that needs its own cluster stack is attached to the pool, then xapi will use that cluster stack.
- If no SR that needs a particular cluster stack is attached to the pool, then xapi will use
xhad
.
If multiple SRs that need a particular cluster stack exist, then the storage parts of xapi must ensure that no two such SRs are ever attached to a pool at the same time.
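A minimal sketch of this selection rule, assuming the storage backend exposes a required_cluster_stack property per SR (the attribute name is illustrative):
def choose_cluster_stack(attached_srs):
    # attached_srs: objects carrying an optional required_cluster_stack attribute,
    # as reported by the SR backend.
    required = {sr.required_cluster_stack for sr in attached_srs
                if getattr(sr, "required_cluster_stack", None)}
    if not required:
        return "xhad"                 # no SR mandates a stack: use the default
    if len(required) > 1:
        # The storage parts of xapi must prevent two such SRs being attached at once.
        raise Exception("INCOMPATIBLE_CLUSTER_STACK_ACTIVE")
    return required.pop()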
New errors
We will add the following API error that may be raised by pool.enable_ha
:
INCOMPATIBLE_STATEFILE_SR
: the specified SRs (heartbeat_srs
parameter) are not of the right type to hold the HA statefile for the cluster_stack
that will be used. For example, there is a Melio SR attached to the pool, and therefore the required cluster stack is the Melio one, but the given heartbeat SR is not a Melio SR. The single parameter will be the name of the required SR type.
The following new API error may be raised by PBD.plug
:
INCOMPATIBLE_CLUSTER_STACK_ACTIVE
: the operation cannot be performed because an incompatible cluster stack is active. The single parameter will be the name of the required cluster stack. This could happen (for example) if you tried to create an OCFS2 SR with XenServer HA already enabled.
Future extensions
In future, we may add a parameter to explicitly choose the cluster stack:
- New parameter to
pool.enable_ha
called cluster_stack
of type string
which will have the default value of empty string (meaning: let the implementation choose). - With the additional parameter,
pool.enable_ha
may raise two new errors:UNKNOWN_CLUSTER_STACK
:
The operation cannot be performed because the requested cluster stack does not exist. The user should check the name was entered correctly and, failing that, check to see if the software is installed. The exception will have a single parameter: the name of the cluster stack which was not found.CLUSTER_STACK_CONSTRAINT
: HA cannot be enabled with the provided cluster stack because some third-party software is already active which requires a different cluster stack setting. The two parameters are: a reference to an object (such as an SR) which has created the restriction, and the name of the cluster stack that this object requires.
Implementation
The xapi.conf
file will have a new field: cluster-stack-root
which will have the default value /usr/libexec/xapi/cluster-stack
. The existing xhad
scripts and tools will be moved to /usr/libexec/xapi/cluster-stack/xhad/
. A hypothetical cluster stack called foo
would be placed in /usr/libexec/xapi/cluster-stack/foo/
.
In Pool.enable_ha
with cluster_stack="foo"
we will verify that the subdirectory <cluster-stack-root>/foo
exists. If it does not exist, then the call will fail with UNKNOWN_CLUSTER_STACK
.
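A minimal sketch of that check, with the default cluster-stack-root taken from the proposed xapi.conf value:
import os

def assert_cluster_stack_exists(cluster_stack,
                                cluster_stack_root="/usr/libexec/xapi/cluster-stack"):
    # Pool.enable_ha with cluster_stack="foo" must find <cluster-stack-root>/foo.
    if not os.path.isdir(os.path.join(cluster_stack_root, cluster_stack)):
        raise Exception("UNKNOWN_CLUSTER_STACK: %s" % cluster_stack)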
Alternative cluster stacks will need to conform to the exact same interface as xhad.
Design document |
---|
Revision | v1 |
Status | proposed |
Multiple device emulators
Xen’s ioreq-server
feature allows for several device emulator
processes to be attached to the same domain, each emulating different
sets of virtual hardware. This makes it possible, for example, to
emulate network devices in a separate process for improved security
and isolation, or to provide special purpose emulators for particular
virtual hardware devices.
ioreq-server
is currently used in XenServer to support vGPU, where it
is configured via the legacy toolstack interface. These changes will make
multiple emulators usable in open source Xen via the new libxl interface.
libxl changes
The singleton device_model_version, device_model_stubdomain and
device_model fields in the b_info structure will be replaced by a list of
(version, stubdomain, model, arguments) tuples, one for each emulator.
libxl_domain_create_new() will be changed to spawn a new device model
for each entry in the list.
It may also be useful to spawn the device models separately and only
attach them during domain creation. This could be supported by
making each device_model entry a union of pid | parameter_tuple
.
If such an entry specifies a parameter tuple, it is processed as above;
if it specifies a pid, libxl_domain_create_new() attaches the existing device model with that pid instead.
QEMU changes
Patches to make QEMU register with Xen as an ioreq-server have been
submitted upstream, but not yet applied.
QEMU’s --machine none
and --nodefaults
options should make it
possible to create an empty machine and add just a host bus, PCI bus
and device. This has not yet been fully demonstrated, so QEMU changes
may be required.
Xen changes
- Until now,
ioreq-server
has only been used to connect one extra
device model, in addition to the default one. Multiple emulators
should work, but there is a chance that bugs will be discovered.
Interfacing with xenopsd
This functionality will only be available through the experimental
Xenlight-based xenopsd.
- the
VM_build
clause in the atomics_of_operation
function will be
changed to fill in the list of emulators to be created (or attached)
in the b_info struct
Host Configuration
vGPU support is implemented mostly in xenopsd, so no Xapi changes are
required to support vGPU through the generic device model mechanism.
Changes would be required if we decided to expose the additional device
models through the API, but in the near future it is more likely that
any additional device models will be dealt with entirely by xenopsd.
Design document |
---|
Revision | v1 |
Status | proposed |
OCFS2 storage
OCFS2 is a (host-)clustered filesystem which runs on top of a shared raw block
device. Hosts using OCFS2 form a cluster using a combination of network and
storage heartbeats and host fencing to avoid split-brain.
The following diagram shows the proposed architecture with xapi
:
Please note the following:
- OCFS2 is configured to use global heartbeats rather than per-mount heartbeats
because we quite often have many SRs and therefore many mountpoints
- The OCFS2 global heartbeat should be collocated on the same SR as the XenServer
HA SR so that we depend on fewer SRs (the storage is a single point of failure
for OCFS2)
- The OCFS2 global heartbeat should itself be a raw VDI within an LVHDSR.
- Every host can be in at-most-one OCFS2 cluster i.e. the host cluster membership
is a per-host thing rather than a per-SR thing. Therefore
xapi
will be
modified to configure the cluster and manage the cluster node numbers. - Every SR will be a filesystem mount, managed by a SM plugin called “OCFS2”.
- Xapi HA uses the
xhad
process which runs in userspace but in the realtime
scheduling class so it has priority over all other userspace tasks. xhad
sends heartbeats via the ha_statefile
VDI and via UDP, and uses the
Xen watchdog for host fencing. - OCFS2 HA uses the
o2cb
kernel driver which sends heartbeats via the
o2cb_statefile
and via TCP, fencing the host by panicking domain 0.
Managing O2CB
OCFS2 uses the O2CB “cluster stack” which is similar to our xhad
. To configure
O2CB we need to
- assign each host an integer node number (from zero)
- on pool/cluster join: update the configuration on every node to include the
new node. In OCFS2 this can be done online.
- on pool/cluster leave/eject: update the configuration on every node to exclude
the old node. In OCFS2 this needs to be done offline.
In the current Xapi toolstack there is a single global implicit cluster called a “Pool”
which is used for: resource locking; “clustered” storage repositories and fault handling (in HA). In the long term we will allow these types of clusters to be
managed separately or all together, depending on the sophistication of the
admin and the complexity of their environment. We will take a small step in that
direction by keeping the OCFS2 O2CB cluster management code at “arm’s length”
from the Xapi Pool.join code.
In
xcp-idl
we will define a new API category called “Cluster” (in addition to the
categories for
Xen domains
, ballooning
, stats
,
networking
and
storage
). These APIs will only be called by Xapi on localhost. In particular they will
not be called across-hosts and therefore do not have to be backward compatible.
These are “cluster plugin APIs”.
We will define the following APIs:
Plugin:Membership.create
: add a host to a cluster. On exit the local host cluster software
will know about the new host but it may need to be restarted before the
change takes effect- in:
hostname:string
: the hostname of the management domain - in:
uuid:string
: a UUID identifying the host - in:
id:int: the lowest available unique integer identifying the host, where an integer will never be re-used unless it is guaranteed that all nodes have forgotten any previous state associated with it (see the allocation sketch after this list) - in:
address:string list
: a list of addresses through which the host
can be contacted - out: Task.id
Plugin:Membership.destroy
: removes a named host from the cluster. On exit the local
host software will know about the change but it may need to be restarted
before it can take effect- in:
uuid:string
: the UUID of the host to remove
Plugin:Cluster.query
: queries the state of the cluster- out:
maintenance_required:bool
: true if there is some outstanding configuration
change which cannot take effect until the cluster is restarted. - out:
hosts
: a list of all known hosts together with a state including:
whether they are known to be alive or dead; or whether they are currently
“excluded” because the cluster software needs to be restarted
Plugin:Cluster.start
: turn on the cluster software and let the local host joinPlugin:Cluster.stop
: turn off the cluster software
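A minimal sketch of the "lowest available unique integer" rule used for node ids, assuming we track both the ids of current Membership records and ids that must never be re-used:
def allocate_node_id(current_ids, retired_ids):
    # current_ids: ids of existing Membership records
    # retired_ids: ids that must not be re-used because some node may remember them
    used = set(current_ids) | set(retired_ids)
    candidate = 0
    while candidate in used:
        candidate += 1
    return candidate

# Example: ids 0 and 1 in use, id 2 retired, so the next host gets id 3
assert allocate_node_id([0, 1], [2]) == 3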
Xapi will be modified to:
- add table
Cluster
which will have columnsname: string
: this is the name of the Cluster plugin (TODO: use same
terminology as SM?)configuration: Map(String,String)
: this will contain any cluster-global
information, overrides for default values etc.enabled: Bool
: this is true when the cluster “should” be running. It
may require maintenance to synchronise changes across the hosts.maintenance_required: Bool
: this is true when the cluster needs to
be placed into maintenance mode to resync its configuration
- add method
XenAPI:Cluster.enable
which sets enabled=true
and waits for all
hosts to report Membership.enabled=true
. - add method
XenAPI:Cluster.disable
which sets enabled=false
and waits for all
hosts to report Membership.enabled=false
. - add table
Membership
which will have columnsid: int
: automatically generated lowest available unique integer
starting from 0cluster: Ref(Cluster)
: the type of cluster. This will never be NULL.host: Ref(host)
: the host which is a member of the cluster. This may
be NULL.left: Date
: if not 1/1/1970 this means the time at which the host
left the cluster.maintenance_required: Bool
: this is true when the Host believes the
cluster needs to be placed into maintenance mode.
- add field
Host.memberships: Set(Ref(Membership))
- extend enum
vdi_type
to include o2cb_statefile
as well as ha_statefile
- add method
Pool.enable_o2cb
with arguments- in:
heartbeat_sr: Ref(SR)
: the SR to use for global heartbeats - in:
configuration: Map(String,String)
: available for future configuration tweaks - Like
Pool.enable_ha
this will find or create the heartbeat VDI, create the
Cluster
entry and the Membership
entries. All Memberships
will have
maintenance_required=true
reflecting the fact that the desired cluster
state is out-of-sync with the actual cluster state.
- add method
XenAPI:Membership.enable
- in:
self:Host
: the host to modify - in:
cluster:Cluster
: the cluster.
- add method
XenAPI:Membership.disable
- in:
self:Host
: the host to modify - in:
cluster:Cluster
: the cluster name.
- add a cluster monitor thread which
- watches the
Host.memberships
field and calls Plugin:Membership.create
and
Plugin:Membership.destroy
to keep the local cluster software up-to-date
when any host in the pool changes its configuration - calls
Plugin:Cluster.query
after a Plugin:Membership.create
or
Plugin:Membership.destroy
to see whether the
SR needs maintenance - when all hosts have a last start time later than a
Membership
record’s left
date, deletes the Membership
.
- modify
XenAPI:Pool.join
to resync with the master’s Host.memberships
list. - modify
XenAPI:Pool.eject
to- call
Membership.disable
in the cluster plugin to stop the o2cb
service - call
Membership.destroy
in the cluster plugin to remove every other host
from the local configuration - remove the
Host
metadata from the pool - set
XenAPI:Membership.left
to NOW()
- modify
XenAPI:Host.forget
to- remove the
Host
metadata from the pool - set
XenAPI:Membership.left
to NOW()
- set
XenAPI:Cluster.maintenance_required
to true
A Cluster plugin called “o2cb” will be added which
- on
Plugin:Membership.destroy
- comment out the relevant node id in cluster.conf
- set the ’needs a restart’ flag
- on
Plugin:Membership.create
- if the provided node id is too high: return an error. This means the
cluster needs to be rebooted to free node ids.
- if the node id is not too high: rewrite the cluster.conf using
the “online” tool.
- on
Plugin:Cluster.start
: find the VDI with type=o2cb_statefile
;
add this to the “static-vdis” list; chkconfig
the service on. We
will use the global heartbeat mode of o2cb
. - on
Plugin:Cluster.stop
: stop the service; chkconfig
the service off;
remove the “static-vdis” entry; leave the VDI itself alone - keeps track of the current ’live’ cluster.conf which allows it to
- report the cluster service as ’needing a restart’ (which implies
we need maintenance mode)
Summary of differences between this and xHA:
- we allow for the possibility that hosts can join and leave, without
necessarily taking the whole cluster down. In the case of
o2cb
we
should be able to have join
work live and only eject
requires
maintenance mode - rather than write explicit RPCs to update cluster configuration state
we instead use an event watch and resync pattern, which is hopefully
more robust to network glitches while a reconfiguration is in progress.
Managing xhad
We need to ensure o2cb
and xhad
do not conflict by fencing
hosts at the same time. We shall:
use the default o2cb
timeouts (hosts fence if no I/O in 60s): this
needs to be short because disk I/O on otherwise working hosts can
be blocked while another host is failing/ has failed.
make the xhad
host fence timeouts much longer: 300s. It’s much more
important that this is reliable than fast. We will make this change
globally and not just when using OCFS2.
In the xhad
config we will cap the HeartbeatInterval
and StatefileInterval
at 5s (the default otherwise would be 31s). This means that 60 heartbeat
messages have to be lost before xhad
concludes that the host has failed.
SM plugin
The SM plugin OCFS2
will be a file-based plugin.
TODO: which file format by default?
The SM plugin will first check whether the o2cb
cluster is active and fail
operations if it is not.
I/O paths
When either HA or OCFS O2CB “fences” the host it will look to the admin like
a host crash and reboot. We need to (in priority order)
- help the admin prevent fences by monitoring their I/O paths
and fixing issues before they lead to trouble
- when a fence/crash does happen, help the admin
- tell the difference between an I/O error (admin to fix) and a software
bug (which should be reported)
- understand how to make their system more reliable
Monitoring I/O paths
If heartbeat I/O fails for more than 60s when running o2cb
then the host will fence.
This can happen either
for a good reason: for example the host software may have deadlocked or someone may
have pulled out a network cable.
for a bad reason: for example a network bond link failure may have been ignored
and then the second link failed; or the heartbeat thread may have been starved of
I/O bandwidth by other processes
Since the consequences of fencing are severe – all VMs on the host crash simultaneously –
it is important to avoid the host fencing for bad reasons.
We should recommend that all users
- use network bonding for their network heartbeat
- use multipath for their storage heartbeat
Furthermore we need to help users monitor their I/O paths. It’s no good if they use
a bonded network but fail to notice when one of the paths have failed.
The current XenServer HA implementation generates the following I/O-related alerts:
HA_HEARTBEAT_APPROACHING_TIMEOUT
(priority 5 “informational”): when half the
network heartbeat timeout has been reached.HA_STATEFILE_APPROACHING_TIMEOUT
(priority 5 “informational”): when half the
storage heartbeat timeout has been reached.HA_NETWORK_BONDING_ERROR
(priority 3 “service degraded”): when one of the bond
links has failed.HA_STATEFILE_LOST
(priority 2 “service loss imminent”): when the storage heartbeat
has completely failed and only the network heartbeat is left.- MULTIPATH_PERIODIC_ALERT (priority 3 “service degrated”): when one of the multipath
links have failed.
Unfortunately alerts are triggered on “edges” i.e. when state changes, and not on “levels”
so it is difficult to see whether the link is currently broken.
We should define datasources suitable for use by xcp-rrdd to expose the current state
(and the history) of the I/O paths as follows:
pif_<name>_paths_failed
: the total number of paths which we know have failed.pif_<name>_paths_total
: the total number of paths which are configured.sr_<name>_paths_failed
: the total number of storage paths which we know have failed.sr_<name>_paths_total
: the total number of storage paths which are configured.
The pif
datasources should be generated by xcp-networkd
which already has a
network bond monitoring thread.
The sr
datasources should be generated by xcp-rrdd
plugins since there is no
storage daemon to generate them.
We should create RRDs using the MAX
consolidation function, otherwise information
about failures will be lost by averaging.
XenCenter (and any diagnostic tools) should warn when the system is at risk of fencing
in particular if any of the following are true:
pif_<name>_paths_failed
is non-zerosr_<name>_paths_failed
is non-zeropif_<name>_paths_total
is less than 2sr_<name>_paths_total
is less than 2
XenCenter (and any diagnostic tools) should warn if any of the following have been
true over the past 7 days:
pif_<name>_paths_failed
is non-zerosr_<name>_paths_failed
is non-zero
Heartbeat “QoS”
The network and storage paths used by heartbeats must remain responsive otherwise
the host will fence (i.e. the host and all VMs will crash).
Outstanding issue: how slow can multipathd
get? How does it scale with the number of
LUNs?
Post-crash diagnostics
When a host crashes the effect on the user is severe: all the VMs will also
crash. In cases where the host crashed for a bad reason (such as a single failure
after a configuration error) we must help the user understand how they can
avoid the same situation happening again.
We must make sure the crash kernel runs reliably when xhad
and o2cb
fence the host.
Xcp-rrdd will be modified to store RRDs in an mmap(2)’d file in the dom0
filesystem (rather than in-memory). Xcp-rrdd will call msync(2)
every 5s
to ensure the historical records have hit the disk. We should use the same
on-disk format as RRDtool (or as close to it as makes sense) because it has
already been optimised to minimise the amount of I/O.
Xapi will be modified to run a crash-dump analyser program xen-crash-analyse
.
xen-crash-analyse
will:
- parse the Xen and dom0 stacks and diagnose whether
- the dom0 kernel was panic’ed by
o2cb
- the Xen watchdog was fired by
xhad
- anything else: this would indicate a bug that should be reported
- in cases where the system was fenced by
o2cb
or xhad
then the analyser- will read the archived RRDs and look for recent evidence of a path failure
or of a bad configuration (i.e. one where the total number of paths is 1)
- will parse the
xhad.log
and look for evidence of heartbeats “approaching
timeout”
TODO: depending on what information we can determine from the analyser, we
will want to record some of it in the Host_crash_dump
database table.
XenCenter will be modified to explain why the host crashed and explain what
the user should do to fix it, specifically:
- if the host crashed for no obvious reason then consider this a software
bug and recommend a bugtool/system-status-report is taken and uploaded somewhere
- if the host crashed because of
o2cb
or xhad
then either- if there is evidence of path failures in the RRDs: recommend the user
increase the number of paths or investigate whether some of the equipment
(NICs or switches or HBAs or SANs) is unreliable
- if there is evidence of insufficient paths: recommend the user add more
paths
Network configuration
The documentation should strongly recommend
- the management network is bonded
- the management network is dedicated i.e. used only for management traffic
(including heartbeats)
- the OCFS2 storage is multipathed
xcp-networkd
will be modified to change the behaviour of the DHCP client.
Currently the dhclient
will wait for a response and eventually background
itself. This is a big problem since DHCP can reset the hostname, and this can
break o2cb
. Therefore we must insist that PIF.reconfigure_ip
becomes
fully synchronous, supporting timeout and cancellation. Once the call returns
– whether through success or failure – there must not be anything in the
background which will change the system’s hostname.
TODO: figure out whether we need to request “maintenance mode” for hostname
changes.
Maintenance mode
The purpose of “maintenance mode” is to take a host out of service and leave
it in a state where it’s safe to fiddle with it without affecting services
in VMs.
XenCenter currently does the following:
Host.disable
: prevents new VMs starting here- makes a list of all the VMs running on the host
Host.evacuate
: move the running VMs somewhere else
The problems with maintenance mode are:
- it’s not safe to fiddle with the host network configuration with storage
still attached. For NFS this risks deadlocking the SR. For OCFS2 this
risks fencing the host.
- it’s not safe to fiddle with the storage or network configuration if HA
is running because the host will be fenced. It’s not safe to disable fencing
unless we guarantee to reboot the host on exit from maintenance mode.
We should also
PBD.unplug
: all storage. This allows the network to be safely reconfigured.
If the network is reconfigured while NFS storage is plugged then the SR can
permanently deadlock; if the network is reconfigured while OCFS2 storage is
plugged then the host can crash.
TODO: should we add a Host.prepare_for_maintenance
(better name TBD)
to take care of all this without XenCenter having to script it. This would also
help CLI and powershell users do the right thing.
TODO: should we insist that the host is rebooted to leave maintenance
mode? This would make maintenance mode more reliable and allow us to integrate
maintenance mode with xHA (where maintenance mode is a “staged reboot”)
TODO: should we leave all clusters as part of maintenance mode? We
probably need to do this to avoid fencing.
Walk-through: adding OCFS2 storage
Assume you have an existing Pool of 2 hosts. First the client will set up
the O2CB cluster, choosing where to put the global heartbeat volume. The
client should check that the I/O paths have all been setup correctly with
bonding and multipath and prompt the user to fix any obvious problems.
Internally within Pool.enable_o2cb
Xapi will set up the cluster metadata
on every host in the pool:
At this point all hosts have in-sync cluster.conf
files but all cluster
services are disabled. We also have requires_maintenance=true
on all
Membership
entries and the global Cluster
has enabled=false
.
The client will now try to enable the cluster with Cluster.enable
:
Now all hosts are in the cluster and the SR can be created using the standard
SM APIs.
Walk-through: remove a host
Assume you have an existing Pool of 2 hosts with o2cb
clustering enabled
and at least one ocfs2
filesystem mounted. If the host is online then
XenAPI:Pool.eject
will:
Note that:
- All hosts will have modified their
o2cb
cluster.conf
to comment out
the former host - The
Membership
table still remembers the node number of the ejected host–
this cannot be re-used until the SR is taken down for maintenance. - All hosts can see the difference between their current
cluster.conf
and the one they would use if they restarted the cluster service, so all
hosts report that the cluster must be taken offline i.e. requires_maintenance=true
.
Summary of the impact on the admin
OCFS2 is fundamentally a different type of storage to all existing storage
types supported by xapi. OCFS2 relies upon O2CB, which provides
Host-level High Availability. All HA implementations
(including O2CB and xhad
) impose restrictions on the server admin to
prevent unnecessary host “fencing” (i.e. crashing). Once we have OCFS2 as
a feature, we will have to live with these restrictions which previously only
applied when HA was explicitly enabled. To reduce complexity we will not try
to enforce restrictions only when OCFS2 is being used or is likely to be used.
Impact even if not using OCFS2
- “Maintenance mode” now includes detaching all storage.
- Host network reconfiguration can only be done in maintenance mode
- XenServer HA enable takes longer
- XenServer HA failure detection takes longer
- Network configuration with DHCP must be fully synchronous i.e. it will block
until the DHCP server responds. On a timeout, the change will not be made.
Impact when using OCFS2
- Sometimes a host will not be able to join the pool without taking the
pool into maintenance mode
- Every VM will have to be XSM’ed (is that a verb?) to the new OCFS2 storage.
This means that VMs with more than 2 snapshots will have their snapshots
deleted; it means you need to provision another storage target, temporarily
doubling your storage needs; and it will take a long time.
- There will now be 2 different reasons why a host has fenced which the
admin needs to understand.
Design document |
---|
Revision | v1 |
Status | proposed |
patches in VDIs
“Patches” are signed binary blobs which can be queried and applied.
They are stored in the dom0 filesystem under /var/patch
. Unfortunately
the patches can be quite large – imagine a repo full of RPMs – and
the dom0 filesystem is usually quite small, so it can be difficult
to upload and apply some patches.
Instead of writing patches to the dom0 filesystem, we shall write them
to disk images (VDIs) instead. We can then take advantage of features like
- shared storage
- cross-host
VDI.copy
to manage the patches.
XenAPI changes
Add a field pool_patch.VDI
of type Ref(VDI)
. When a new patch is
stored in a VDI, it will be referenced here. Older patches and cleaned
patches will have invalid references here.
The HTTP handler for uploading patches will choose an SR to stream the
patch into. It will prefer to use the pool.default_SR
and fall back
to choosing an SR on the master whose driver supports the VDI_CLONE
capability: we want the ability to fast clone patches, one per host
concurrently installing them. A VDI will be created whose size is 4x
the apparent size of the patch, defaulting to 4GiB if we have no size
information (i.e. no content-length
header)
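A minimal sketch of the sizing rule; the 4x multiplier and the 4GiB default follow the text above, everything else is illustrative.
def patch_vdi_size(content_length=None):
    # Size the patch VDI at 4x the apparent patch size, defaulting to 4 GiB
    # when no content-length header was supplied.
    if content_length is None:
        return 4 << 30
    return 4 * int(content_length)

assert patch_vdi_size(None) == 4 * 1024 * 1024 * 1024
assert patch_vdi_size(100 * 1024 * 1024) == 400 * 1024 * 1024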
pool_patch.clean_on_host
will be deprecated. It will still try to
clean a patch from the local filesystem but this is pointless for
the new VDI patch uploads.
pool_patch.clean
will be deprecated. It will still try to clean a patch
from the local filesystem of the master but this is pointless for the
new VDI patch uploads.
pool_patch.pool_clean
will be deprecated. It will destroy any associated
patch VDI. Users will be encouraged to call VDI.destroy
instead.
Changes beneath the XenAPI
pool_patch
records will only be deleted if both the filename
field
refers to a missing file on the master and the VDI
field is a dangling
reference
Patches stored in VDIs will be stored within a filesystem, like we used
to do with suspend images. This is needed because (a) we want to execute
the patches and block devices cannot be executed; and (b) we can use
spare space in the VDI as temporary scratch space during the patch
application process. Within the VDI we will call patches patch
rather
than using a complicated filename.
When a host wishes to apply a patch it will call VDI.copy
to duplicate
the VDI to a locally-accessible SR, mount the filesystem and execute it.
If the patch is still in the master’s dom0 filesystem then it will fall
back to the HTTP handler.
Summary of the impact on the admin
- There will no longer be a size limit on hotfixes imposed by the mechanism
itself.
- There must be enough free space in an SR connected to the host to be able
to apply a patch on that host.
Design document |
---|
Revision | v1 |
Status | proposed |
PCI passthrough support
Introduction
GPU passthrough is already available in XAPI, this document proposes to also
offer passthrough for all PCI devices through XAPI.
Design proposal
New methods for PCI object:
PCI.enable_dom0_access
PCI.disable_dom0_access
PCI.get_dom0_access_status
: compares the outputs of /opt/xensource/libexec/xen-cmdline
and /proc/cmdline
to produce one of the four values that can be currently contained
in the PGPU.dom0_access
field:
- disabled
- disabled_on_reboot
- enabled
- enabled_on_reboot
How to determine the expected dom0 access state:
If the device id is present in both pciback.hide
of /proc/cmdline
and xen-cmdline
: enabled
If the device id is present in neither the pciback.hide of /proc/cmdline nor that of xen-cmdline: disabled
If the device id is present in the pciback.hide
of /proc/cmdline
but not in the one of xen-cmdline
: disabled_on_reboot
If the device id is not present in the pciback.hide
of /proc/cmdline
but is in the one of xen-cmdline
: enabled_on_reboot
A function rather than a field makes the data always accurate and even accounts for
changes made by users outside XAPI, directly through /opt/xensource/libexec/xen-cmdline.
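A minimal sketch of how PCI.get_dom0_access_status could combine the two sources, following the mapping above; the xen-cmdline invocation (--get-dom0 pciback.hide) and the pciback.hide=(...) format are assumptions for illustration.
import re
import subprocess

def hidden_devices(cmdline):
    # Extract the PCI ids listed in pciback.hide=(0000:01:00.0)(0000:02:00.0)...
    match = re.search(r"pciback\.hide=(\S+)", cmdline)
    return set(re.findall(r"\(([^)]*)\)", match.group(1))) if match else set()

def get_dom0_access_status(device_id):
    in_current = device_id in hidden_devices(open("/proc/cmdline").read())
    # Assumption: this invocation prints the pciback.hide value set for the next boot.
    pending = subprocess.check_output(
        ["/opt/xensource/libexec/xen-cmdline", "--get-dom0", "pciback.hide"]).decode()
    in_next = device_id in hidden_devices(pending)
    # Mapping as given in the text above.
    if in_current and in_next:
        return "enabled"
    if not in_current and not in_next:
        return "disabled"
    return "disabled_on_reboot" if in_current else "enabled_on_reboot"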
With these generic methods available, the following field and methods will be deprecated:
PGPU.enable_dom0_access
PGPU.disable_dom0_access
PGPU.dom0_access
(DB field)
They would still be usable and up to date with the same info as for the PCI methods.
Test cases
Design document |
---|
Revision | v1 |
Status | proposed |
Pool-wide SSH
Background
The SMAPIv3 plugin architecture requires that storage plugins are able to work
in the absence of xapi. Amongst other benefits, this allows them to be tested
in isolation, to be shared more widely than just within the XenServer
community, and to cause less load on xapi’s database.
However, many of the currently existing SMAPIv1 backends require inter-host
operations to be performed. This is achieved via the use of the Xen-API call
‘host.call_plugin’, which allows an API user to execute a pre-installed plugin
on any pool member. This is important for operations such as coalesce / snapshot
where the active data path for a VM somewhere in the pool needs to be refreshed
in order to complete the operation. In order to use this, the RPM in which the
SM backend lives is used to deliver a plugin script into /etc/xapi.d/plugins,
and this executes the required function when the API call is made.
In order to support these use-cases without xapi running, a new mechanism needs
to be provided to allow the execution of required functionality on remote hosts.
The canonical method for remotely executing scripts is ssh - the secure shell.
This design proposal is setting out how xapi might manage the public and
private keys to enable passwordless authentication of ssh sessions between all
hosts in a pool.
Modifications to the host
On firstboot (and after being ejected), the host should generate a
host key (already done I believe), and an authentication key for the
user (root/xapi?).
Modifications to xapi
Three new fields will be added to the host object:
host.ssh_public_host_key : string
: This is the host key that identifies the host
during the initial ssh key exchange protocol. This should be added to the
‘known_hosts’ field of any other host wishing to ssh to this host.
host.ssh_public_authentication_key : string
: This field is the public
key used for authentication when sshing from the root account on that host -
host A. This can be added to host B’s authorized_keys
file in order to
allow passwordless logins from host A to host B.
host.ssh_ready : bool
: A boolean flag indicating that the configuration
files in use by the ssh server/client on the host are up to date.
One new field will be added to the pool record:
pool.revoked_authentication_keys : string list
: This field records all
authentication keys that have been used by hosts in the past. It is updated
when a host is ejected from the pool.
Pool Join
On pool join, the master creates the record for the new host and populates the
two public key fields with values supplied by the joining host. It then sets
the ssh_ready
field on all other hosts to false
.
On each host in the pool, a thread is watching for updates to the
ssh_ready
value for the local host. When this is set to false, the host
then adds the keys from xapi’s database to the appropriate places in the ssh
configuration files and restarts sshd. Once this is done, the host sets the
ssh_ready
field to ’true’.
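A minimal sketch of this per-host resync thread, written against the XenAPI Python bindings; the new host fields do not exist yet, and the file handling, paths and sshd restart are simplified assumptions.
import subprocess
import time

def write_keys(path, keys):
    # Simplified: real code must preserve existing entries and use the correct
    # known_hosts/authorized_keys line formats.
    with open(path, "w") as f:
        f.write("\n".join(keys) + "\n")

def sync_ssh_config(session, localhost_ref):
    host_keys, auth_keys = [], []
    for host in session.xenapi.host.get_all():
        rec = session.xenapi.host.get_record(host)
        host_keys.append(rec["ssh_public_host_key"])             # proposed new field
        auth_keys.append(rec["ssh_public_authentication_key"])   # proposed new field
    write_keys("/root/.ssh/known_hosts", host_keys)
    write_keys("/root/.ssh/authorized_keys", auth_keys)
    subprocess.check_call(["systemctl", "restart", "sshd"])
    session.xenapi.host.set_ssh_ready(localhost_ref, True)       # proposed new field

def watch_ssh_ready(session, localhost_ref):
    # Simplified polling loop standing in for the event-watch thread.
    while True:
        if not session.xenapi.host.get_ssh_ready(localhost_ref):
            sync_ssh_config(session, localhost_ref)
        time.sleep(5)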
Pool Eject
On pool eject, the host’s ssh_public_host_key is lost, but the authentication key is added to a list of revoked keys on the pool object. This allows all other hosts to remove the key from the authorized_keys list when they next sync, which in the usual case is immediately after the database is modified, due to the event watch thread. If the host is offline though, the authorized_keys file will be updated the next time the host comes online.
Questions
- Do we want a new user? e.g. ‘xapi’ - how would we then use this user to execute privileged things? setuid binaries?
- Is keeping the revoked_keys list useful? If we ‘control the world’ of the authorized_keys file, we could just remove anything that’s currently in there that xapi doesn’t know about
Design document |
---|
Revision | v1 |
Status | proposed |
Process events from xenopsd in a timely manner
Background
There is a significant delay between the VM being unpaused and XAPI reporting it
as started during a bootstorm.
It can happen that the VM is able to send UDP packets already, but XAPI still reports it as not started for minutes.
XAPI currently processes all events from xenopsd in a single thread, the unpause
events get queued up behind a lot of other events generated by the already
running VMs.
We need to ensure that unpause events from xenopsd get processed in a timely
manner, even if XAPI is busy processing other events.
Timely processing of events
If we process the events in a Round-Robin fashion then unpause
events are reported in a timely fashion.
We need to ensure that events operating on the same VM are not processed in parallel.
Xenopsd already has code that does exactly this; the purpose of the xapi-work-queues refactoring PR is to
reuse this code in XAPI by creating a shared package between xenopsd and xapi: xapi-work-queues
.
xapi-work-queues
From the documentation of the new Worker Pool interface:
A worker pool has a limited number of worker threads.
Each worker pops one tagged item from the queue in a round-robin fashion.
While the item is executed the tag temporarily doesn’t participate in round-robin scheduling.
If during execution more items get queued with the same tag they get redirected to a private queue.
Once the item finishes execution the tag will participate in RR scheduling again.
This ensures that items with the same tag do not get executed in parallel,
and that a tag with a lot of items does not starve the execution of other tags.
The XAPI side of the changes will look like this
Known limitations: The active per-VM events should be a small number, this is already ensured in the push_with_coalesce
/ should_keep
code on the xenopsd side. Events to XAPI from xenopsd should already arrive coalesced.
Design document |
---|
Revision | v2 |
Status | released (xenserver 6.5 sp1) |
Review | #12 |
RDP control
Purpose
To administer guest VMs it can be useful to connect to them over Remote Desktop Protocol (RDP). XenCenter supports this; it has an integrated RDP client.
First it is necessary to turn on the RDP service in the guest.
This can be controlled from XenCenter. Several layers are involved. This description starts in the guest and works up the stack to XenCenter.
This feature was completed in the first quarter of 2015, and released in Service Pack 1 for XenServer 6.5.
The guest agent
The XenServer guest agent installed in Windows VMs can turn the RDP service on and off, and can report whether it is running.
The guest agent is at https://github.com/xenserver/win-xenguestagent
Interaction with the agent is done through some Xenstore keys:
The guest agent running in domain N writes two xenstore nodes when it starts up:
/local/domain/N/control/feature-ts = 1
/local/domain/N/control/feature-ts2 = 1
This indicates support for the rest of the functionality described below.
(The “…ts2” flag is new for this feature; older versions of the guest agent wrote the “…ts” flag and had support for only a subset of the functionality (no firewall modification), and had a bug in updating .../data/ts
.)
To indicate whether RDP is running, the guest agent writes the string “1” (running) or “0” (disabled) to xenstore node
/local/domain/N/data/ts
.
It does this on start-up, and also in response to the deletion of that node.
The guest agent also watches xenstore node /local/domain/N/control/ts
and it turns RDP on and off in response to “1” or “0” (respectively) being written to that node. The agent acknowledges the request by deleting the node, and afterwards it deletes local/domain/N/data/ts
, thus triggering itself to update that node as described above.
When the guest agent turns the RDP service on/off, it also modifies the standard Windows firewall to allow/forbid incoming connections to the RDP port. This is the same as the firewall change that happens automatically when the RDP service is turned on/off through the standard Windows GUI.
XAPI etc.
xenopsd sets up watches on xenstore nodes including the control
tree and data/ts
, and prompts xapi to react by updating the relevant VM guest metrics record, which is available through a XenAPI call.
XenAPI includes a new message (function call) which can be used to ask the guest agent to turn RDP on and off.
This is VM.call_plugin
(analogous to Host.call_plugin
) in the hope that it can be used for other purposes in the future, even though for now it does not really call a plugin.
To use it, supply plugin="guest-agent-operation"
and either fn="request_rdp_on"
or fn="request_rdp_off"
.
See http://xapi-project.github.io/xen-api/classes/vm.html
The function strings are named with “request” (rather than, say, “enable_rdp” or “turn_rdp_on”) to make it clear that xapi only makes a request of the guest: when one of these calls returns successfully this means only that the appropriate string (1 or 0) was written to the control/ts
node and it is up to the guest whether it responds.
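For example, a XenAPI client could ask the guest agent to turn RDP on as follows (the host address and credentials are placeholders); whether RDP actually starts is then reported back through data/ts and the VM guest metrics:
import XenAPI

session = XenAPI.Session("https://xenserver.example.com")
session.xenapi.login_with_password("root", "password")
try:
    vm = session.xenapi.VM.get_by_name_label("my-windows-vm")[0]
    # Ask the guest agent to turn RDP on; this only writes "1" to control/ts.
    session.xenapi.VM.call_plugin(vm, "guest-agent-operation", "request_rdp_on", {})
finally:
    session.xenapi.session.logout()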
XenCenter
Behaviour on older XenServer versions that do not support RDP control
Note that the current behaviour depends on some global options: “Enable Remote Desktop console scanning” and “Automatically switch to the Remote Desktop console when it becomes available”.
- When tools are not installed:
- As of XenCenter 6.5, the RDP button is absent.
- When tools are installed but RDP is not switched on in the guest:
- If “Enable Remote Desktop console scanning” is on:
- The RDP button is present but greyed out. (It seems to sometimes read “Switch to Remote Desktop” and sometimes read “Looking for guest console…”: I haven’t yet worked out the difference).
- We scan the RDP port to detect when RDP is turned on
- If “Enable Remote Desktop console scanning” is off:
- The RDP button is enabled and reads “Switch to Remote Desktop”
- When tools are installed and RDP is switched on in the guest:
- If “Enable Remote Desktop console scanning” is on:
- The RDP button is enabled and reads “Switch to Remote Desktop”
- If “Automatically switch” is on, we switch to RDP immediately we detect it
- If “Enable Remote Desktop console scanning” is off:
- As above, the RDP button is enabled and reads “Switch to Remote Desktop”
New behaviour on XenServer versions that support RDP control
- This new XenCenter behaviour is only for XenServer versions that support RDP control, with guests with the new guest agent: behaviour must be unchanged if the server or guest-agent is older.
- There should be no change in the behaviour for Linux guests, either PV or HVM varieties: this must be tested.
- We should never scan the RDP port; instead we should watch for a change in the relevant variable in guest_metrics.
- The XenCenter option “Enable Remote Desktop console scanning” should change to read “Enable Remote Desktop console scanning (XenServer 6.5 and earlier)”
- The XenCenter option “Automatically switch to the Remote Desktop console when it becomes available” should be enabled even when “Enable Remote Desktop console scanning” is off.
- When tools are not installed:
- As above, the RDP button should be absent.
- When tools are installed but RDP is not switched on in the guest:
- The RDP button should be enabled and read “Turn on Remote Desktop”
- If pressed, it should launch a dialog with the following wording: “Would you like to turn on Remote Desktop in this VM, and then connect to it over Remote Desktop? [Yes] [No]”
- That button should turn on RDP, wait for RDP to become enabled, and switch to an RDP connection. It should do this even if “Automatically switch” is off.
- When tools are installed and RDP is switched on in the guest:
- The RDP button should be enabled and read “Switch to Remote Desktop”
- If “Automatically switch” is on, we should switch to RDP immediately
- There is no need for us to provide UI to switch RDP off again
- We should also test the case where RDP has been switched on in the guest before the tools are installed.
Design document |
---|
Revision | v1 |
Status | released (7.0) |
RRDD archival redesign
Introduction
Current problems with rrdd:
- rrdd stores knowledge about whether it is running on a master or a slave
This determines the host to which rrdd will archive a VM’s rrd when the VM’s
domain disappears - rrdd will always try to archive to the master. However,
when a host joins a pool as a slave rrdd is not restarted so this knowledge is
out of date. When a VM shuts down on the slave rrdd will archive the rrd
locally. When starting this VM again the master xapi will attempt to push any
locally-existing rrd to the host on which the VM is being started, but since
no rrd archive exists on the master the slave rrdd will end up creating a new
rrd and the previous rrd will be lost.
- rrdd handles rebooting VMs unpredictably
When rebooting a VM, there is a chance rrdd will attempt to update that VM’s rrd
during the brief period when there is no domain for that VM. If this happens,
rrdd will archive the VM’s rrd to the master, and then create a new rrd for the
VM when it sees the new domain. If rrdd doesn’t attempt to update that VM’s rrd
during this period, rrdd will continue to add data for the new domain to the old
rrd.
Proposal
To solve these problems, we will remove some of the intelligence from rrdd and
make it into more of a slave process of xapi. This will entail removing all
knowledge from rrdd of whether it is running on a master or a slave, and also
modifying rrdd to only start monitoring a VM when it is told to, and only
archiving an rrd (to a specified address) when it is told to. This matches the
way xenopsd only manages domains which it has been told to manage.
Design
For most VM lifecycle operations, xapi and rrdd processes (sometimes across more
than one host) cooperate to start or stop recording a VM’s metrics and/or to
restore or backup the VM’s archived metrics. Below we will describe, for each
relevant VM operation, how the VM’s rrd is currently handled, and how we propose
it will be handled after the redesign.
VM.destroy
The master xapi makes a remove_rrd call to the local rrdd, which causes rrdd to
delete the VM’s archived rrd from disk. This behaviour will remain unchanged.
VM.start(_on) and VM.resume(_on)
The master xapi makes a push_rrd call to the local rrdd, which causes rrdd to
send any locally-archived rrd for the VM in question to the rrdd of the host on
which the VM is starting. This behaviour will remain unchanged.
VM.shutdown and VM.suspend
Every update cycle rrdd compares its list of registered VMs to the list of
domains actually running on the host. Any registered VMs which do not have a
corresponding domain have their rrds archived to the rrdd running on the host
believed to be the master. We will change this behaviour by stopping rrdd from
doing the archiving itself; instead we will expose a new function in rrdd’s
interface:
val archive_rrd : vm_uuid:string -> remote_address:string -> unit
This will cause rrdd to remove the specified rrd from its table of registered
VMs, and archive the rrd to the specified host. When a VM has finished shutting
down or suspending, the xapi process on the host on which the VM was running
will call archive_rrd to ask the local rrdd to archive back to the master rrdd.
VM.reboot
Removing rrdd’s ability to automatically archive the rrds for disappeared
domains will have the bonus effect of fixing how the rrds of rebooting VMs are
handled, as we don’t want the rrds of rebooting VMs to be archived at all.
VM.checkpoint
This will be handled automatically, as internally VM.checkpoint carries out a
VM.suspend followed by a VM.resume.
VM.pool_migrate and VM.migrate_send
The source host’s xapi makes a migrate_rrd call to the local rrd, with a
destination address and an optional session ID. The session ID is only required
for cross-pool migration. The local rrdd sends the rrd for that VM to the
destination host’s rrdd as an HTTP PUT. This behaviour will remain unchanged.
Design document |
---|
Revision | v1 |
Status | released (7.0) |
Revision history |
---|
v1 | Initial version |
RRDD plugin protocol v2
Motivation
rrdd plugins currently report datasources via a shared-memory file, using the
following format:
DATASOURCES
000001e4
dba4bf7a84b6d11d565d19ef91f7906e
{
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
"type": "absolute",
"units": "degC",
"value": "64.33"
"value_type": "float",
},
"cpu-temp-cpu1": {
"description": "Temperature of CPU 1",
"type": "absolute",
"units": "degC",
"value": "62.14"
"value_type": "float",
}
}
}
This format contains four main components:
DATASOURCES
This should always be present.
- The JSON data length, encoded as hexadecimal
000001e4
- The md5sum of the JSON data
dba4bf7a84b6d11d565d19ef91f7906e
- The JSON data itself, encoding the values and metadata associated with the
reported datasources.
Example
{
"timestamp": 1339685573.245,
"data_sources": {
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
"type": "absolute",
"units": "degC",
"value": "64.33"
"value_type": "float",
},
"cpu-temp-cpu1": {
"description": "Temperature of CPU 1",
"type": "absolute",
"units": "degC",
"value": "62.14"
"value_type": "float",
}
}
}
The disadvantage of this protocol is that rrdd has to parse the entire JSON
structure each tick, even though most of the time only the values will change.
For this reason a new protocol is proposed.
Protocol V2
value | bits | format | notes |
---|
header string | (string length)*8 | string | “DATASOURCES” as in the V1 protocol |
data checksum | 32 | int32 | binary-encoded crc32 of the concatenation of the encoded timestamp and datasource values |
metadata checksum | 32 | int32 | binary-encoded crc32 of the metadata string (see below) |
number of datasources | 32 | int32 | only needed if the metadata has changed - otherwise RRDD can use a cached value |
timestamp | 64 | double | Unix epoch |
datasource values | n * 64 | int64 | double | n is the number of datasources exported by the plugin, type dependent on the setting in the metadata for value_type [int64|float] |
metadata length | 32 | int32 | |
metadata | (string length)*8 | string | |
All integers/doubles are big-endian. The metadata will have the same JSON-based format as
in the V1 protocol, minus the timestamp and value
key-value pair for each
datasource.
field | values | notes | required |
---|
description | string | Description of the datasource | no |
owner | host | vm | sr | The object to which the data relates | no, default host |
value_type | int64 | float | The type of the datasource | yes |
type | absolute | derive | gauge | The type of measurement being sent. Absolute for counters which are reset on reading, derive stores the derivative of the recorded values (useful for metrics which continually increase like amount of data written since start), gauge for things like temperature | no, default absolute |
default | true | false | Whether the source is default enabled or not | no, default false |
units | | The units the data should be displayed in | no |
min | | The minimum value for the datasource | no, default -infinity |
max | | The maximum value for the datasource | no, default +infinity |
Example
{
"datasources": {
"memory_reclaimed": {
"description":"Host memory reclaimed by squeezed",
"owner":"host",
"value_type":"int64",
"type":"absolute",
"default":"true",
"units":"B",
"min":"-inf",
"max":"inf"
},
"memory_reclaimed_max": {
"description":"Host memory that could be reclaimed by squeezed",
"owner":"host",
"value_type":"int64",
"type":"absolute",
"default":"true",
"units":"B",
"min":"-inf",
"max":"inf"
},
"cpu-temp-cpu0": {
"description": "Temperature of CPU 0",
"owner":"host",
"value_type": "float",
"type": "absolute",
"default":"true",
"units": "degC",
"min":"-inf",
"max":"inf"
},
"cpu-temp-cpu1": {
"description": "Temperature of CPU 1",
"owner":"host",
"value_type": "float",
"type": "absolute",
"default":"true",
"units": "degC",
"min":"-inf",
"max":"inf"
}
}
}
The above formatting is not required, but added here for readability.
Reading algorithm
if header != expected_header:
raise InvalidHeader()
if data_checksum == last_data_checksum:
raise NoUpdate()
if data_checksum != crc32(encoded_timestamp_and_values):
raise InvalidChecksum()
if metadata_checksum == last_metadata_checksum:
for datasource, value in cached_datasources, values:
update(datasource, value)
else:
if metadata_checksum != crc32(metadata):
raise InvalidChecksum()
cached_datasources = create_datasources(metadata)
for datasource, value in zip(cached_datasources, values):
update(datasource, value)
This means that for a normal update, RRDD will only have to read the header plus
the first (4 + 4 + 4 + 8 + 8*n) bytes of data (the two 32-bit checksums, the 32-bit datasource count, the 64-bit timestamp and the values), where n is the number of
datasources exported by the plugin. If the metadata changes RRDD will have to
read all the data (and parse the metadata).
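To make the layout concrete, here is a minimal plugin-side sketch of producing a protocol V2 report, following the field order and types in the table above. The function name and argument shapes are illustrative assumptions, not part of any real rrdd API.
import json, struct, time, zlib

def encode_v2(metadata, values):
    # metadata: the JSON metadata dict; values: list of (value, value_type),
    # in the same order as the datasources appear in the metadata map
    payload = struct.pack(">d", time.time())                   # timestamp (a double, per the table above)
    for value, value_type in values:
        fmt = ">q" if value_type == "int64" else ">d"          # big-endian int64 or double
        payload += struct.pack(fmt, value)
    meta = json.dumps(metadata).encode()
    return (b"DATASOURCES"
            + struct.pack(">I", zlib.crc32(payload) & 0xffffffff)  # data checksum
            + struct.pack(">I", zlib.crc32(meta) & 0xffffffff)     # metadata checksum
            + struct.pack(">I", len(values))                       # number of datasources
            + payload                                              # timestamp + datasource values
            + struct.pack(">I", len(meta))                         # metadata length
            + meta)                                                # metadata JSON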
Design document |
---|
Revision | v1 |
Status | proposed |
Revision history |
---|
v1 | Initial version |
RRDD plugin protocol v3
Motivation
The rrdd plugin protocol v2 reports datasources via a shared-memory file; however, it has various limitations:
- metrics are identified only by name, so it is not possible to have several metrics that share the same name (e.g. vCPU usage per VM)
- only numeric metrics are supported; for example we can’t expose string metrics (e.g. CPU model)
These restrictions constrain plugins and limit OpenMetrics support in the metrics daemon.
Moreover, the protocol is not always practical for plugin developers and parser implementations:
- JSON implementations may not preserve insertion order on maps, which can cause issues when exposing datasource values, as the protocol is sensitive to the order of the metadata map
- the header length is not constant and depends on the datasource count, which complicates parsing
- it still requires a fairly involved parser to convert between bytes and numbers according to the metadata
A simpler protocol is therefore proposed, based on the OpenMetrics binary format, to ease plugin and parser implementations.
Protocol V3
For this protocol, we still use a shared-memory file, but significantly change the structure of the file.
value | bits | format | notes |
---|
header string | 12*8=96 | string | “OPENMETRICS1” which is one byte longer than “DATASOURCES”, intentionally made at 12 bytes for alignment purposes |
data checksum | 32 | uint32 | Checksum of the concatenation of the rest of the header (from timestamp) and the payload data |
timestamp | 64 | uint64 | Unix epoch |
payload length | 32 | uint32 | Payload length |
payload data | 8*(payload length) | binary | OpenMetrics encoded metrics data (protocol-buffers format) |
All values are big-endian.
The header size is constant (28 bytes), which implementations can rely on (e.g. to read the entire header in one go, or to simplify the use of memory mapping).
Unlike protocol v2, but as in protocol v1, metadata is included alongside the metrics in OpenMetrics format.
The owner attribute for a metric should instead be exposed using an OpenMetrics label (named owner).
Multiple metrics that share the same name should be exposed under the same Metric Family and be differentiated by labels (e.g. owner).
Reading algorithm
if header != expected_header:
raise InvalidHeader()
if data_checksum == last_data_checksum:
raise NoUpdate()
if timestamp == last_timestamp:
raise NoUpdate()
if data_checksum != crc32(concat_header_end_payload):
raise InvalidChecksum()
metrics = parse_openmetrics(payload_data)
for family in metrics:
if family_exists(family):
update_family(family)
else:
create_family(family)
track_removed_families(metrics)
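Because the header is fixed-size, a reader can parse it with a single unpack. The following is a minimal sketch of that step, assuming the field order in the table above; it is illustrative only and leaves the OpenMetrics payload parsing to an existing library.
import struct

V3_HEADER = struct.Struct(">12sIQI")   # header string, data checksum, timestamp, payload length

def parse_v3_header(buf):
    magic, checksum, timestamp, payload_len = V3_HEADER.unpack(buf[:V3_HEADER.size])
    if magic != b"OPENMETRICS1":
        raise ValueError("invalid header")
    return checksum, timestamp, payload_len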
Design document |
---|
Revision | v5 |
Status | proposed |
Review | #186 |
Revision history |
---|
v1 | Initial version |
v2 | Renaming VMSS fields and APIs. API message_create supersedes vmss_create_alerts. |
v3 | Remove VMSS alarm_config details and use existing pool wide alarm config |
v4 | Renaming field from retention-value to retained-snapshots and schedule-snapshot to scheduled-snapshot |
v5 | Add new API task_set_status |
Schedule Snapshot Design
The scheduled snapshot feature will utilize the existing architecture of VMPR. In terms of functionality, scheduled snapshot is basically VMPR without its archiving capability.
Introduction
- Schedule snapshot will be a new object in xapi as VMSS.
- A pool can have multiple VMSS.
- Multiple VMs can be a part of VMSS but a VM cannot be a part of multiple VMSS.
- A VMSS takes VM snapshots with type [snapshot, checkpoint, snapshot_with_quiesce].
- VMSS takes snapshots of VMs on configured intervals:
  - hourly -> every day, each hour, at minutes [0;15;30;45]
  - daily -> every day, at hour [0 to 23], minutes [0;15;30;45]
  - weekly -> on days [Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday], at hour [0 to 23], minutes [0;15;30;45]
- VMSS will have a limit on the number of retained VM snapshots, in the range [1 to 10].
Datapath Design
- There will be a cron job for VMSS.
- VMSS plugin will go through all the scheduled snapshot policies in the pool and check if any of them are due.
- If a snapshot is due, then go through all the VM objects in XAPI associated with this scheduled snapshot policy and create a new snapshot.
- If the snapshot operation fails, create a notification alert for the event and move to the next VM.
- Check if an older snapshot now needs to be deleted to comply with the retained snapshots defined in the scheduled policy.
- If we need to delete any existing snapshots, delete the oldest snapshot created via scheduled policy.
- Set the last-run timestamp in the scheduled policy.
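For illustration only, a sketch of the “is this policy due” check that such a plugin could perform, based on the frequency and schedule fields defined in the next section. The helper name and exact semantics are assumptions, not the actual VMSS plugin code.
from datetime import datetime, timezone

def is_due(frequency, schedule, now=None):
    # schedule keys follow the VMSS schedule map: "hour", "min", "days"
    now = now or datetime.now(timezone.utc)
    if now.minute != int(schedule.get("min", "0")):
        return False                                   # only fires on the configured minute
    if frequency == "hourly":
        return True                                    # every day, each hour
    if frequency == "daily":
        return now.hour == int(schedule["hour"])       # every day at the configured hour
    if frequency == "weekly":
        days = schedule.get("days", "").split(",")     # assumed comma-separated day names
        return now.strftime("%A") in days and now.hour == int(schedule["hour"])
    return False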
Xapi Changes
There is a new record for VM Scheduled Snapshot with new fields.
New fields:
- name-label (String): Name label for the VMSS.
- name-description (String): Name description for the VMSS.
- enabled (Bool): Enable/disable the VMSS taking snapshots.
- type (Enum [snapshot; checkpoint; snapshot_with_quiesce]): Type of snapshot the VMSS takes.
- retained-snapshots (Int64): Limit on the number of snapshots kept per VM; the maximum is 10 and the default is 7.
- frequency (Enum [hourly; daily; weekly]): Frequency of taking snapshots of VMs.
- schedule (Map(String,String)) with (key, value) pairs:
  - hour: 0 to 23
  - min: [0;15;30;45]
  - days: [Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday]
- last-run-time (Date): DateTime of the last execution of the VMSS.
- VMs (VM refs): List of VMs that are part of the VMSS.
New fields to the VM record:
- scheduled-snapshot (VMSS ref): The VMSS this VM is part of.
- is-vmss-snapshot (Bool): Whether this snapshot was created by a VMSS.
New APIs
- vmss_snapshot_now (Ref vmss, Pool_Operator) -> String : Executes the scheduled snapshot immediately.
- vmss_set_retained_snapshots (Ref vmss, Int value, Pool_Operator) -> unit : Set the number of retained snapshots for the VMSS; the maximum is 10.
- vmss_set_frequency (Ref vmss, String “value”, Pool_Operator) -> unit : Set the value of the VMSS frequency field.
- vmss_set_type (Ref vmss, String “value”, Pool_Operator) -> unit : Set the snapshot type of the VMSS type field.
- vmss_set_scheduled (Ref vmss, Map(String,String) “value”, Pool_Operator) -> unit : Set the VMSS schedule for taking snapshots.
- vmss_add_to_schedule (Ref vmss, String “key”, String “value”, Pool_Operator) -> unit : Add a key-value pair to the VMSS schedule.
- vmss_remove_from_schedule (Ref vmss, String “key”, Pool_Operator) -> unit : Remove a key from the VMSS schedule.
- vmss_set_last_run_time (Ref vmss, DateTime “value”, Local_Root) -> unit : Set the last run time for the VMSS.
- task_set_status (Ref task, status_type “value”, READ_ONLY) -> unit : Set the status of a task owned by the same user; a Pool_Operator can set the status of any task.
New CLIs
- vmss-create (required : “name-label”;“type”;“frequency”, optional : “name-description”;“enabled”;“schedule:”;“retained-snapshots”) -> unit : Creates VM scheduled snapshot.
- vmss-destroy (required : uuid) -> unit : Destroys a VM scheduled snapshot.
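For example, creating a daily policy with the CLI above might look like the following; the parameter syntax follows the usual xe map-parameter conventions and the values are purely illustrative:
xe vmss-create name-label=nightly type=snapshot frequency=daily schedule:hour=2 schedule:min=15 retained-snapshots=7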
Design document |
---|
Revision | v1 |
Status | released (7.6) |
SMAPIv3
Xapi accesses storage through “plugins” which currently use a protocol
called “SMAPIv1”. This protocol has a number of problems:
- the protocol has many missing features, and this leads to people using the XenAPI from within a plugin, which is racy, difficult to get right, unscalable and makes component testing impossible.
- the protocol expects plugin authors to have a deep knowledge of the Xen storage datapath (tapdisk, blkback etc) and the storage.
- the protocol is undocumented.
We shall create a new revision of the protocol (“SMAPIv3”) to address these
problems.
The following diagram shows the new control plane:
Requests from xapi are filtered through the existing storage_access
layer which is responsible for managing the mapping between VM VBDs and
VDIs.
Each plugin is represented by a named queue, with APIs for
- querying the state of each queue
- explicitly cancelling or replying to messages
Legacy SMAPIv1 plugins will be processed via the existing storage_access.SMAPIv1
module. Newer SMAPIv3 plugins will be handled by a new xapi-storage-script
service.
The SMAPIv3 APIs will be defined in an IDL format in a separate repo.
xapi-storage-script
The xapi-storage-script will run as a service and will
- use inotify to monitor a well-known path in dom0
- when a directory is created, check whether it contains storage plugins by executing a Plugin.query
- assuming the directory contains plugins, register the queue name and start listening for messages
- when messages from xapi or the CLI are received, generate the SMAPIv3 .json message and fork the relevant script.
SMAPIv3 IDL
The IDL will support
- documentation for all functions, parameters and results
- this will be extended to be a XenAPI-style versioning scheme in future
- generating hyperlinked HTML documentation, published on github
- generating libraries for python and OCaml
- the libraries will include marshalling, unmarshalling, type-checking
and command-line parsing and help generation
It will be possible to view the contents of the queue associated with any plugin, and see whether
- the queue is being served or not (perhaps the xapi-storage-script has crashed)
- there are unanswered messages (perhaps one of the messages has caused a deadlock in the implementation?)
It will be possible to
- delete/clear queues/messages
- download a message-sequence chart of the last N messages for inclusion in bugtools.
Anatomy of a plugin
The following diagram shows what a plugin would look like:
The SMAPIv3
Please read the current SMAPIv3 documentation.
Design document |
---|
Revision | v1 |
Status | proposed |
Specifying Emulated PCI Devices
Background and goals
At present (early March 2015) the datamodel defines a VM as having a “platform” string-string map, in which two keys are interpreted as specifying a PCI device which should be emulated for the VM. Those keys are “device_id” and “revision” (with int values represented as decimal strings).
Limitations:
- Hardcoded defaults are used for the vendor ID and all other parameters except device_id and revision.
- Only one emulated PCI device can be specified.
When instructing qemu to emulate PCI devices, qemu accepts twelve parameters for each device.
Future guest-agent features rely on additional emulated PCI devices. We cannot know in advance the full details of all the devices that will be needed, but we can predict some.
We need a way to configure VMs such that they will be given additional emulated PCI devices.
Design
In the datamodel, there will be a new type of object for emulated PCI devices.
Tentative name: “emulated_pci_device”
Fields to be passed through to qemu are the following, all static read-only, and all ints except devicename:
- devicename (string)
- vendorid
- deviceid
- command
- status
- revision
- classcode
- headertype
- subvendorid
- subsystemid
- interruptline
- interruptpin
We also need a “built_in” flag: see below.
Allow creation of these objects through the API (and CLI).
(It would be nice, but by no means essential, to be able to create one by specifying an existing one as a basis, along with one or more altered fields, e.g. “Make a new one just like that existing one except with interruptpin=9.”)
Create some of these devices to be defined as standard in XenServer, along the same lines as the VM templates. Those ones should have built_in=true.
Allow destruction of these objects through the API (and CLI), but not if they are in use or if they have built_in=true.
A VM will have a list of zero or more of these emulated-pci-device objects. (OPEN QUESTION: Should we forbid having more than one of a given device?)
Provide API (and CLI) commands to add and remove one of these devices from a VM (identifying the VM and device by uuid or other identifier such as name).
The CLI should allow performing this on multiple VMs in one go, based on a selector or filter for the VMs. We have this concept already in the CLI in commands such as vm-start.
In the function that adds an emulated PCI device to a VM, we must check if this is the first device to be added, and must refuse if the VM’s Virtual Hardware Platform Version is too low. (Or should we just raise the version automatically if needed?)
When starting a VM, check its list of emulated pci devices and pass the details through to qemu (via xenopsd).
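To make the proposed shape concrete, a sketch of the fields as a plain record; the field names are taken from the list above, but this is not the actual datamodel definition.
from dataclasses import dataclass

@dataclass(frozen=True)
class EmulatedPciDevice:
    devicename: str       # the only string field; all others are ints, as stated above
    vendorid: int
    deviceid: int
    command: int
    status: int
    revision: int
    classcode: int
    headertype: int
    subvendorid: int
    subsystemid: int
    interruptline: int
    interruptpin: int
    built_in: bool = False   # standard built-in devices cannot be destroyed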
Design document |
---|
Revision | v11 |
Status | confirmed |
Review | #139 |
Revision history |
---|
v1 | Initial version |
v2 | Added details about the VDI's binary format and size, and the SR capability name. |
v3 | Tar was not needed after all! |
v4 | Add details about discovering the VDI using a new vdi_type. |
v5 | Add details about the http handlers and interaction with xapi's database |
v6 | Add details about the framing of the data within the VDI |
v7 | Redesign semantics of the rrd_updates handler |
v8 | Redesign semantics of the rrd_updates handler (again) |
v9 | Magic number change in framing format of vdi |
v10 | Add details of new APIs added to xapi and xcp-rrdd |
v11 | Remove unneeded API calls |
SR-Level RRDs
Introduction
Xapi has RRDs to track VM- and host-level metrics. There is a desire to have SR-level RRDs as a new category, because SR stats are not specific to a certain VM or host. Examples are size and free space on the SR. While recording SR metrics is relatively straightforward within the current RRD system, the main question is where to archive them, which is what this design aims to address.
Stats Collection
All SR types, including the existing ones, should be able to have RRDs defined for them. Some RRDs, such as a “free space” one, may make sense for multiple (if not all) SR types. However, the way to measure something like free space will be SR specific. Furthermore, it should be possible for each type of SR to have its own specialised RRDs.
It follows that each SR will need its own xcp-rrdd
plugin, which runs on the SR master and defines and collects the stats. For the new thin-lvhd SR this could be xenvmd
itself. The plugin registers itself with xcp-rrdd
, so that the latter records the live stats from the plugin into RRDs.
Archiving
SR-level RRDs will be archived in the SR itself, in a VDI, rather than in the local filesystem of the SR master. This way, we don’t need to worry about master failover.
The VDI will be 4MB in size. This is a little more space than we would need for the RRDs we have in mind at the moment, but will give us enough headroom for the foreseeable future. It will not have a filesystem on it for simplicity and performance. There will only be one RRD archive file for each SR (possibly containing data for multiple metrics), which is gzipped by xcp-rrdd
, and can be copied onto the VDI.
There will be a simple framing format for the data on the VDI. This will be as follows:
Offset | Type | Name | Comment |
---|
0 | 32 bit network-order int | magic | Magic number = 0x7ada7ada |
4 | 32 bit network-order int | version | 1 |
8 | 32 bit network-order int | length | length of payload |
12 | gzipped data | data | |
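A minimal sketch of producing this framing (illustrative only; in the design the copy into the VDI is performed by xapi):
import gzip, struct

def frame_rrd_archive(rrd_bytes):
    data = gzip.compress(rrd_bytes)
    # magic, version and payload length as 32-bit network-order ints, followed by the gzipped data
    return struct.pack(">III", 0x7ada7ada, 1, len(data)) + data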
Xapi will be in charge of the lifecycle of this VDI, not the plugin or xcp-rrdd
, which will make it a little easier to manage them. Only xapi will attach/detach and read from/write to this VDI. We will keep xcp-rrdd
as simple as possible, and have it archive to its standard path in the local file system. Xapi will then copy the RRDs in and out of the VDI.
A new value "rrd"
in the vdi_type
enum of the datamodel will be defined, and the VDI.type
of the VDI will be set to that value. The storage backend will write the VDI type to the LVM metadata of the VDI, so that xapi can discover the VDI containing the SR-level RRDs when attaching an SR to a new pool. This means that SR-level RRDs are currently restricted to LVM SRs.
Because we will not write plugins for all SRs at once, and therefore do not need xapi to set up the VDI for all SRs, we will add an SR “capability” for the backends to be able to tell xapi whether it has the ability to record stats and will need storage for them. The capability name will be: SR_STATS
.
Management of the SR-stats VDI
The SR-stats VDI will be attached/detached on PBD.plug
/unplug
on the SR master.
On PBD.plug
on the SR master, if the SR has the stats capability, xapi:
- Creates a stats VDI if not already there (search for an existing one based on the VDI type).
- Attaches the stats VDI if it did already exist, and copies the RRDs to the local file system (standard location in the filesystem; asks xcp-rrdd where to put them).
- Informs xcp-rrdd about the RRDs so that it will load the RRDs and add newly recorded data to them (needs a function like push_rrd_local for VM-level RRDs).
- Detaches the stats VDI.
On PBD.unplug
on the SR master, if the SR has the stats capability xapi:
- Tells xcp-rrdd to archive the RRDs for the SR, which it will do to the local filesystem.
- Attaches the stats VDI, copies the RRDs into it, and detaches the VDI.
Periodic Archiving
Xapi’s periodic scheduler regularly triggers xcp-rrdd
to archive the host and VM RRDs. It will need to do this for the SR ones as well. Furthermore, xapi will need to attach the stats VDI and copy the RRD archives into it (as on PBD.unplug
).
Exporting
There will be a new handler for downloading an SR RRD:
http://<server>/sr_rrd?session_id=<SESSION HANDLE>&uuid=<SR UUID>
RRD updates are handled via a single handler for the host, VM and SR UUIDs
RRD updates for the host, VMs and SRs are handled by a single handler at
/rrd_updates
. Exactly what is returned will be determined by the parameters
passed to this handler.
Whether the host RRD updates are returned is governed by the presence of
host=true
in the parameters. host=<anything else>
or the absence of the
host
key will mean the host RRD is not returned.
Whether the VM RRD updates are returned is governed by the vm_uuid
key in the
URL parameters. vm_uuid=all
will return RRD updates for all VM RRDs.
vm_uuid=xxx
will return the RRD updates for the VM with uuid xxx
only.
If vm_uuid
is none
(or any other string which is not a valid VM UUID) then
the handler will return no VM RRD updates. If the vm_uuid
key is absent, RRD
updates for all VMs will be returned.
Whether the SR RRD updates are returned is governed by the sr_uuid
key in the
URL parameters. sr_uuid=all
will return RRD updates for all SR RRDs.
sr_uuid=xxx
will return the RRD updates for the SR with uuid xxx
only.
If sr_uuid
is none
(or any other string which is not a valid SR UUID) then
the handler will return no SR RRD updates. If the sr_uuid
key is absent, no
SR RRD updates will be returned.
It will be possible to mix and match these parameters; for example to return
RRD updates for the host and all VMs, the URL to use would be:
http://<server>/rrd_updates?session_id=<SESSION HANDLE>&start=10258122541&host=true&vm_uuid=all&sr_uuid=none
Or, to return RRD updates for all SRs but nothing else, the URL to use would be:
http://<server>/rrd_updates?session_id=<SESSION HANDLE>&start=10258122541&host=false&vm_uuid=none&sr_uuid=all
While behaviour is defined if any of the keys host
, vm_uuid
and sr_uuid
is
missing, this is for backwards compatibility and it is recommended that clients
specify each parameter explicitly.
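For example, a client could fetch SR-only updates like this; the sketch is illustrative, with error handling and session management omitted.
import urllib.request

def fetch_sr_rrd_updates(server, session_id, start):
    url = ("https://%s/rrd_updates?session_id=%s&start=%d"
           "&host=false&vm_uuid=none&sr_uuid=all" % (server, session_id, start))
    with urllib.request.urlopen(url) as response:   # assumes the server certificate is trusted
        return response.read()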
Database updating.
If the SR is presenting a data source called ‘physical_utilisation’,
xapi will record this periodically in its database. In order to do
this, xapi will fork a thread that, every n minutes (2 suggested, but
open to suggestions here), will query the attached SRs, then query
RRDD for the latest data source for these, and update the database.
The utilisation of VDIs will not be updated in this way until
scalability worries for RRDs are addressed.
Xapi will cache whether it is SR master for every attached SR and only
attempt to update if it is the SR master.
New APIs.
xcp-rrdd:
- Get the filesystem location where SR RRDs are archived: val sr_rrds_path : uid:string -> string
- Archive the SR RRDs to the filesystem: val archive_sr_rrd : sr_uuid:string -> unit
- Load the SR RRDs from the filesystem: val push_sr_rrd : sr_uuid:string -> unit
Design document |
---|
Revision | v3 |
Status | proposed |
thin LVHD storage
LVHD is a block-based storage system built on top of Xapi and LVM. LVHD
disks are represented as LVM LVs with vhd-format data inside. When a
disk is snapshotted, the LVM LV is “deflated” to the minimum-possible
size, just big enough to store the current vhd data. All other disks are
stored “inflated” i.e. consuming the maximum amount of storage space.
This proposal describes how we could add dynamic thin-provisioning to
LVHD such that
- disks only consume the space they need (plus an adjustable small
overhead)
- when a disk needs more space, the allocation can be done locally
in the common-case; in particular there is no network RPC needed
- when the resource pool master host has failed, allocations can still
continue, up to some limit, allowing time for the master host to be
recovered; in particular there is no need for very low HA timeouts.
- we can (in future) support in-kernel block allocation through the
device mapper dm-thin target.
The following diagram shows the “Allocation plane”:
All VM disk writes are channelled through tapdisk
which keeps track
of the remaining reserved space within the device mapper device. When
the free space drops below a “low-water mark”, tapdisk sends a message
to a local per-SR daemon called local-allocator
and requests more
space.
The local-allocator
maintains a free pool of blocks available for
allocation locally (hence the name). It will pick some blocks and
transactionally send the update to the xenvmd
process running
on the SRmaster via the shared ring (labelled ToLVM queue
in the diagram)
and update the device mapper tables locally.
There is one xenvmd
process per SR on the SRmaster. xenvmd
receives
local allocations from all the host shared rings (labelled ToLVM queue
in the diagram) and combines them together, appending them to a redo-log
also on shared storage. When xenvmd
notices that a host’s free space
(represented in the metadata as another LV) is low it allocates new free blocks
and pushes these to the host via another shared ring (labelled FromLVM queue
in the diagram).
The xenvmd
process maintains a cache of the current VG metadata for
fast query and update. All updates are appended to the redo-log to ensure
they operate in O(1) time. The redo log updates are periodically flushed
to the primary LVM metadata.
Since the operations are stored in the redo-log and will only be removed
after the real metadata has been written, the implication is that it is
possible for the operations to be performed more than once. This will
occur if the xenvmd process exits between flushing to the real metadata
and acknowledging the operations as completed. For this to work as expected,
every individual operation stored in the redo-log must be idempotent.
Note on running out of blocks
Note that, while the host has plenty of free blocks, local allocations should
be fast. If the master fails and the local free pool starts running out
and tapdisk
asks for more blocks, then the local allocator won’t be able
to provide them.
tapdisk
should start to slow
I/O in order to provide the local allocator more time.
Eventually, if tapdisk runs out of space before the local allocator can satisfy the request, then guest I/O will block. Note that Windows VMs will start to crash if guest I/O blocks for more than 70s. Linux VMs, whether PV or HVM, may hit the “blocked for more than 120 seconds” issue caused by slow I/O: slow I/O during dirty-page writeback/flush can cause memory starvation, after which other userland processes or kernel threads become blocked.
The following diagram shows the control-plane:
When thin-provisioning is enabled we will be modifying the LVM metadata at
an increased rate. We will cache the current metadata in the xenvmd
process
and funnel all queries through it, rather than “peeking” at the metadata
on-disk. Note it will still be possible to peek at the on-disk metadata but it
will be out-of-date. Peeking can still be used to query the PV state of the volume
group.
The xenvm
CLI uses a simple
RPC interface to query the xenvmd
process, tunnelled through xapi
over
the management network. The RPC interface can be used for
- activating volumes locally:
xenvm
will query the LV segments and program
device mapper - deactivating volumes locally
- listing LVs, PVs etc
Note that current LVHD requires the management network for these control-plane
functions.
When the SM backend wishes to query or update volume group metadata it should use the
xenvm
CLI while thin-provisioning is enabled.
The xenvmd
process shall use a redo-log to ensure that metadata updates are
persisted in constant time and flushed lazily to the regular metadata area.
Tunnelling through xapi will be done by POSTing to the localhost URI
/services/xenvmd/<SR uuid>
Xapi will then either proxy the request transparently to the SRmaster, or issue an
HTTP-level redirect that the xenvm CLI would need to follow.
If the xenvmd process is not running on the host on which it should
be, xapi will start it.
Components: roles and responsibilities
xenvmd
:
- one per plugged SRmaster PBD
- owns the LVM metadata
- provides a fast query/update API so we can (for example) create lots of LVs very fast
- allocates free blocks to hosts when they are running low
- receives block allocations from hosts and incorporates them in the LVM metadata
- can safely flush all updates and downgrade to regular LVM
xenvm
:
- a CLI which talks the xenvmd protocol to query / update LVs
- can be run on any host; calls (except “format” and “upgrade”) are forwarded by xapi
- can “format” a LUN to prepare it for xenvmd
- can “upgrade” a LUN to prepare it for xenvmd
local_allocator
:
- one per plugged PBD
- exposes a simple interface to tapdisk for requesting more space
- receives free block allocations via a queue on the shared disk from xenvmd
- sends block allocations to xenvmd and updates the device mapper target locally
tapdisk
:
- monitors the free space inside LVs and requests more space when running out
- slows down I/O when nearly out of space
xapi
:
- provides authenticated communication tunnels
- ensures the xenvmd daemons are only running on the correct hosts.
SM
:
- writes the configuration file for xenvmd (though doesn’t start it)
- has an on/off switch for thin-provisioning
- can use either normal LVM or the
xenvm
CLI
membership_monitor
- configures and manages the connections between
xenvmd
and the local_allocator
Queues on the shared disk
The local_allocator
communicates with xenvmd
via a pair
of queues on the shared disk. Using the disk rather than the network means
that VMs will continue to run even if the management network is not working.
In particular
- if the (management) network fails, VMs continue to run on SAN storage
- if a host changes IP address, nothing needs to be reconfigured
- if xapi fails, VMs continue to run.
Logical messages in the queues
The local_allocator
needs to tell the xenvmd
which blocks have
been allocated to which guest LV. xenvmd
needs to tell the
local_allocator
which blocks have become free. Since we are based on
LVM, a “block” is an extent, and an “allocation” is a segment i.e. the
placing of a physical extent at a logical extent in the logical volume.
The local_allocator
needs to send a message with logical contents:
- volume: a human-readable name of the LV
- segments: a list of LVM segments which say “place physical extent x at logical extent y using a linear mapping”.
Note this message is idempotent.
The xenvmd
needs to send a message with logical contents:
- extents: a list of physical extents which are free for the host to use
Although
for internal housekeeping xenvmd
will want to assign these
physical extents to logical extents within the host’s free LV, the
local_allocator
doesn’t need to know the logical extents. It only needs to know
the set of blocks which it is free to allocate.
Starting up the local_allocator
What happens when a local_allocator
(re)starts, after a
- process crash, respawn
- host crash, reboot?
When the local_allocator
starts up, there are 2 cases:
- the host has just rebooted, there are no attached disks and no running VMs
- the process has just crashed, there are attached disks and running VMs
Case 1 is uninteresting. In case 2 there may have been an allocation in
progress when the process crashed and this must be completed. Therefore
the operation is journalled in a local filesystem in a directory which
is deliberately deleted on host reboot (Case 1). The allocation operation
consists of:
- pushing the allocation to xenvmd on the SRmaster
- updating the device mapper
Note that both parts of the allocation operation are idempotent and hence
the whole operation is idempotent. The journalling will guarantee it executes
at-least-once.
When the local_allocator
starts up it needs to discover the list of
free blocks. Rather than have 2 code paths, it’s best to treat everything
as if it is a cold start (i.e. no local caches already populated) and to
ask the master to resync the free block list. The resync is performed by
executing a “suspend” and “resume” of the free block queue, and requiring
the remote allocator to:
- pop all block allocations and incorporate these updates
- send the complete set of free blocks “now” (i.e. while the queue is suspended) to the local allocator.
Starting xenvmd
xenvmd
needs to know
- the device containing the volume group
- the hosts to “connect” to via the shared queues
The device containing the volume group should be written to a config
file when the SR is plugged.
xenvmd
does not remember which hosts it is listening to across crashes,
restarts or master failovers. The membership_monitor
will keep the
xenvmd
list in sync with the PBD.currently_attached
fields.
Shutting down the local_allocator
The local_allocator
should be able to crash at any time and recover
afterwards. If the user requests a PBD.unplug
we can perform a
clean shutdown by:
- signalling xenvmd to suspend the block allocation queue
- arranging for the local_allocator to acknowledge the suspension and exit
- when xenvmd sees the acknowledgement, we know that the local_allocator is offline and it doesn’t need to poll the queue any more
xenvmd
can be terminated at any time and restarted, since all compound
operations are journalled.
Downgrade is a special case of shutdown.
To downgrade, we need to stop all hosts allocating and ensure all updates
are flushed to the global LVM metadata. xenvmd
can shutdown
by:
- shutting down all local_allocators (see previous section)
- flushing all outstanding block allocations to the LVM redo log
- flushing the LVM redo log to the global LVM metadata
Queues as rings
We can use a simple ring protocol to represent the queues on the disk.
Each queue will have a single consumer and single producer and reside within
a single logical volume.
To make diagnostics simpler, we can require the ring to only support push
and pop
of whole messages i.e. there can be no partial reads or partial
writes. This means that the producer
and consumer
pointers will always
point to valid message boundaries.
One possible format used by the prototype is as follows:
- sector 0: a magic string
- sector 1: producer state
- sector 2: consumer state
- sector 3…: data
Within the producer state sector we can have:
- octets 0-7: producer offset: a little-endian 64-bit integer
- octet 8: 1 means “suspend acknowledged”; 0 otherwise
Within the consumer state sector we can have:
- octets 0-7: consumer offset: a little-endian 64-bit integer
- octet 8: 1 means “suspend requested”; 0 otherwise
The consumer and producer pointers point to message boundaries. Each
message is prefixed with a 4 byte length and padded to the next 4-byte
boundary.
To push a message onto the ring we need to
- check whether the message is too big to ever fit: this is a permanent
error
- check whether the message is too big to fit given the current free
space: this is a transient error
- write the message into the ring
- advance the producer pointer
To pop a message from the ring we need to
- check whether there is unconsumed space: if not this is a transient
error
- read the message from the ring and process it
- advance the consumer pointer
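The following sketch shows the push operation over a bytearray standing in for the data area. It is illustrative only, and assumes free-running producer/consumer byte offsets and a little-endian 4-byte length prefix as described in the “Ring protocols” section later in this document.
import struct

def push(data_area, producer, consumer, payload):
    size = len(data_area)                                  # size of the data area in bytes
    record = struct.pack("<I", len(payload)) + payload
    record += b"\x00" * (-len(record) % 4)                 # pad to the next 4-byte boundary
    if len(record) > size:
        raise ValueError("message can never fit")          # permanent error
    if size - (producer - consumer) < len(record):
        raise BufferError("not enough free space yet")     # transient error
    for i, byte in enumerate(record):                      # write, wrapping around the data area
        data_area[(producer + i) % size] = byte
    return producer + len(record)                          # advanced producer pointer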
Journals as queues
When we journal an operation we want to guarantee to execute it never
or at-least-once. We can re-use the queue implementation by push
ing
a description of the work item to the queue and waiting for the
item to be pop
ped, processed and finally consumed by advancing the
consumer pointer. The journal code needs to check for unconsumed data
during startup, and to process it before continuing.
Suspending and resuming queues
During startup (resync the free blocks) and shutdown (flush the allocations)
we need to suspend and resume queues. The ring protocol can be extended
to allow the consumer to suspend the ring by:
- the consumer asserts the “suspend requested” bit
- the producer push function checks the bit and writes “suspend acknowledged”
- the producer also periodically polls the queue state and writes “suspend acknowledged” (to catch the case where no items are to be pushed)
- after the producer has acknowledged, it will guarantee to push no more items
- when the consumer polls the producer’s state and spots the “suspend acknowledged”, it concludes that the queue is now suspended.
The key detail is that the handshake on the ring causes the two sides
to synchronise and both agree that the ring is now suspended/ resumed.
Modelling the suspend/resume protocol
To check that the suspend/resume protocol works well enough to be used
to resynchronise the free blocks list on a slave, a simple
promela model was created. We model the queue state as
2 boolean flags:
bool suspend /* suspend requested */
bool suspend_ack /* suspend acknowledged */
and an abstract representation of the data within the ring:
/* the queue may have no data (none); a delta or a full sync.
the full sync is performed immediately on resume. */
mtype = { sync delta none }
mtype inflight_data = none
There is a “producer” and a “consumer” process which run forever,
exchanging data and suspending and resuming whenever they want.
The special data item sync
is only sent immediately after a resume
and we check that we never desynchronise with asserts:
:: (inflight_data != none) ->
/* In steady state we receive deltas */
assert (suspend_ack == false);
assert (inflight_data == delta);
inflight_data = none
i.e. when we are receiving data normally (outside of the suspend/resume
code) we aren’t suspended and we expect deltas, not full syncs.
The model-checker spin
verifies this property holds.
Interaction with HA
Consider what will happen if a host fails when HA is disabled:
- if the host is a slave: the VMs running on the host will crash but
no other host is affected.
- if the host is a master: allocation requests from running VMs will
continue provided enough free blocks are cached on the hosts. If a
host eventually runs out of free blocks, then guest I/O will start to
block and VMs may eventually crash.
Therefore we recommend that users enable HA and only disable it
for short periods of time. Note that, unlike other thin-provisioning
implementations, we will allow HA to be disabled.
Host-local LVs
When a host calls SMAPI sr_attach
, it will use xenvm
to tell xenvmd
on the
SRmaster to connect to the local_allocator
on the host. The xenvmd
daemon will create the volumes for queues and a volume to represent the
“free blocks” which a host is allowed to allocate.
Monitoring
The xenvmd process should export the following RRD datasources over shared memory:
- sr_<SR uuid>_<host uuid>_free: the number of free blocks in the local cache. It’s useful to look at this and verify that it doesn’t usually hit zero, since that’s when allocations will start to block. For this reason we should use the MIN consolidation function.
- sr_<SR uuid>_<host uuid>_requests: a counter of the number of satisfied allocation requests. If this number is too high then the quantum of allocation should be increased. For this reason we should use the MAX consolidation function.
- sr_<SR uuid>_<host uuid>_allocations: a counter of the number of bytes being allocated. If the allocation rate is too high compared with the number of free blocks divided by the HA timeout period, then the SRmaster-allocator should be reconfigured to supply more blocks to the host.
Modifications to tapdisk
TODO: to be updated by Germano
tapdisk
will be modified to
- on open: discover the current maximum size of the file/LV (for a file
we assume there is no limit for now)
- read a low-water mark value from a config file
/etc/tapdisk3.conf
- read a very-low-water mark value from a config file
/etc/tapdisk3.conf
- read a Unix domain socket path from a config file
/etc/tapdisk3.conf
- when there is less free space available than the low-water mark: connect
to Unix domain socket and write an “extend” request
- upon receiving the “extend” response, re-read the maximum size of the
file/LV
- when there is less free space available than the very-low-water mark:
start to slow I/O responses and write a single ’error’ line to the log.
The extend request
TODO: to be updated by Germano
The request has the following format:
Octet offsets | Name | Description |
---|
0,1 | tl | Total length (including this field) of message (in network byte order) |
2 | type | The value ‘0’ indicating an extend request |
3 | nl | The length of the LV name in octets, including NULL terminator |
4,..,4+nl-1 | name | The LV name |
4+nl,..,12+nl-1 | vdi_size | The virtual size of the logical VDI (in network byte order) |
12+nl,..,20+nl-1 | lv_size | The current size of the LV (in network byte order) |
20+nl,..,28+nl-1 | cur_size | The current size of the vhd metadata (in network byte order) |
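A sketch of encoding this request according to the table above; it is illustrative only (tapdisk itself is written in C, and the real message construction may differ).
import struct

def encode_extend_request(lv_name, vdi_size, lv_size, cur_size):
    name = lv_name.encode() + b"\x00"                     # LV name, including the NULL terminator
    body = struct.pack("!BB", 0, len(name)) + name        # type = 0 (extend), then the name length
    body += struct.pack("!QQQ", vdi_size, lv_size, cur_size)
    return struct.pack("!H", 2 + len(body)) + body        # total length includes this field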
The extend response
The response is a single byte value “0” which is a signal to re-examime
the LV size. The request will block indefinitely until it succeeds. The
request will block for a long time if
- the SR has genuinely run out of space. The admin should observe the
existing free space graphs/alerts and perform an SR resize.
- the master has failed and HA is disabled. The admin should re-enable
HA or fix the problem manually.
The local_allocator
There is one local_allocator
process per plugged PBD.
The process will be
spawned by the SM sr_attach
call, and shutdown from the sr_detach
call.
The local_allocator
accepts the following configuration (via a config file):
- socket: path to a local Unix domain socket. This is where the local_allocator listens for requests from tapdisk.
- allocation_quantum: number of megabytes to allocate to each tapdisk on request.
- local_journal: path to a block device or file used for local journalling. This should be deleted on reboot.
- free_pool: name of the LV used to store the host’s free blocks.
- devices: list of local block devices containing the PVs.
- to_LVM: name of the LV containing the queue of block allocations sent to xenvmd.
- from_LVM: name of the LV containing the queue of messages sent from xenvmd. There are two types of messages:
  - Free blocks to put into the free pool
  - Cap requests to remove blocks from the free pool.
When the local_allocator
process starts up it will read the host local
journal and
- re-execute any pending allocation requests from tapdisk
- suspend and resume the
from_LVM
queue to trigger a full retransmit
of free blocks from xenvmd
The procedure for handling an allocation request from tapdisk is:
- if there aren’t enough free blocks in the free pool, wait polling the
from_LVM
queue - choose a range of blocks to assign to the tapdisk LV from the free LV
- write the operation (i.e. exactly what we are about to do) to the journal.
This ensures that it will be repeated if the allocator crashes and restarts.
Note that, since the operation may be repeated multiple times, it must be
idempotent.
- push the block assignment to the
toLVM
queue - suspend the device mapper device
- add/modify the device mapper target
- resume the device mapper device
- remove the operation from the local journal (i.e. there’s no need to repeat
it now)
- reply to tapdisk
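An illustrative outline of this journal-then-apply sequence; the journal, queue, device-mapper and free-pool objects are assumed abstractions for the sketch, not real APIs.
def handle_extend_request(journal, to_lvm, devmapper, free_pool, lv_name, megabytes):
    segments = free_pool.choose(megabytes)           # pick blocks from the host's free LV
    op = {"volume": lv_name, "segments": segments}   # idempotent description of the work
    journal.write(op)                                # persist the intent before acting on it
    to_lvm.push(op)                                  # tell xenvmd about the allocation
    devmapper.suspend(lv_name)
    devmapper.reload(lv_name, segments)              # add/modify the device mapper target
    devmapper.resume(lv_name)
    journal.remove(op)                               # safe to forget: the work is done
    return op                                        # reply to tapdisk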
Shutting down the local-allocator
The SM sr_detach
called from PBD.unplug
will use the xenvm
CLI to request
that xenvmd
disconnects from a host. The procedure is:
- SM calls
xenvm disconnect host
xenvm
sends an RPC to xenvmd
tunnelled through xapi
xenvmd
suspends the to_LVM
queue- the
local_allocator
acknowledges the suspend and exits xenvmd
flushes all updates from the to_LVM
queue and stops listening
xenvmd
xenvmd
is a daemon running per SRmaster PBD, started in sr_attach
and
terminated in sr_detach
. xenvmd
has a config file containing:
socket
: Unix domain socket where xenvmd
listens for requests from
xenvm
tunnelled by xapi
host_allocation_quantum
: number of megabytes to hand to a host at a timehost_low_water_mark
: threshold below which we will hand blocks to a hostdevices
: local devices containing the PVs
xenvmd
continually
- peeks updates from all the
to_LVM
queues - calculates how much free space each host still has
- if the size of a host’s free pool drops below some threshold:
- if the size of a host’s free pool goes above some threshold:
- request a cap of the host’s free pool
- writes the change it is going to make to a journal stored in an LV
- pops the updates from the
to_LVM
queues - pushes the updates to the
from_LVM
queues - pushes updates to the LVM redo-log
- periodically flush the LVM redo-log to the LVM metadata area
The membership monitor
The role of the membership monitor is to keep the list of xenvmd
connections
in sync with the PBD.currently_attached
fields.
We shall
- install a
host-pre-declare-dead
script to use xenvm
to send an RPC
to xenvmd
to forcibly flush (without acknowledgement) the to_LVM
queue
and destroy the LVs. - modify XenAPI
Host.declare_dead
to call host-pre-declare-dead
before
the VMs are unlocked - add a
host-pre-forget
hook type which will be called just before a Host
is forgotten - install a
host-pre-forget
script to use xenvm
to call xenvmd
to
destroy the host’s local LVs
Modifications to LVHD SR
sr_attach should:
- if an SRmaster, update the MGT major version number to prevent old hosts from plugging in the SR
- write the xenvmd configuration file (on all hosts, not just the SRmaster)
- spawn local_allocator
sr_detach should:
- call xenvm to request the shutdown of local_allocator
vdi_deactivate should:
- call xenvm to request the flushing of all the to_LVM queues to the redo log
vdi_activate should:
- if necessary, call xenvm to deflate the LV to the minimum size (with some slack)
Note that it is possible to attach and detach the individual hosts in any order
but when the SRmaster is unplugged then there will be no “refilling” of the host
local free LVs; it will behave as if the master host has failed.
Modifications to xapi
- Xapi needs to learn how to forward xenvm connections to the SR master.
- Xapi needs to start and stop xenvmd at the appropriate times
- We must disable unplugging the PBDs for shared SRs on the pool master
if any other slave has its PBD plugged. This is actually fixing an
issue that exists today - LVHD SRs require the master PBD to be
plugged to do many operations.
- Xapi should provide a mechanism by which the xenvmd process can be killed
once the last PBD for an SR has been unplugged.
Enabling thin provisioning
Thin provisioning will be automatically enabled on upgrade. When the SRmaster
plugs in PBD
the MGT
major version number will be bumped to prevent old
hosts from plugging in the SR and getting confused.
When a VDI is activated, it will be deflated to the new low size.
Disabling thin provisioning
We shall make a tool which will
- allow someone to downgrade their pool after enabling thin provisioning
- allow developers to test the upgrade logic without fully downgrading their
hosts
The tool will
- check if there is enough space to fully inflate all non-snapshot leaves
- unplug all the non-SRmaster
PBD
s - unplug the SRmaster
PBD
. As a side-effect all pending LVM updates will be
written to the LVM metadata. - modify the
MGT
volume to have the lower metadata version - fully inflate all non-snapshot leaves
Walk-through: upgrade
Rolling upgrade should work in the usual way. As soon as the pool master has been
upgraded, hosts will be able to use thin provisioning when new VDIs are attached.
A VM suspend/resume/reboot or migrate will be needed to turn on thin provisioning
for existing running VMs.
Walk-through: downgrade
A pool may be safely downgraded to a previous version without thin provisioning
provided that the downgrade tool is run. If the tool hasn’t run then the old
pool will refuse to attach the SR because the metadata has been upgraded.
Walk-through: after a host failure
If HA is enabled:
xhad
elects a new master if necessaryXapi
on the master will start xenvmd processes for shared thin-lvhd SRs- the
xhad
tells Xapi
which hosts are alive and which have failed. Xapi
runs the host-pre-declare-dead
scripts for every failed host- the
host-pre-declare-dead
tells xenvmd
to flush the to_LVM
updates Xapi
unlocks the VMs and restarts them on new hosts.
If HA is not enabled:
- The admin should verify the host is definitely dead
- If the dead host was the master, a new master must be designated. This will
start the xenvmd processes for the shared thin-lvhd SRs.
- the admin must tell
Xapi
which hosts have failed with xe host-declare-dead
Xapi
runs the host-pre-declare-dead
scripts for every failed host- the
host-pre-declare-dead
tells xenvmd
to flush the to_LVM
updates Xapi
unlocks the VMs- the admin may now restart the VMs on new hosts.
Walk-through: co-operative master transition
The admin calls Pool.designate_new_master. This initiates a two-phase
commit of the new master. As part of this, the slaves will restart,
and on restart each host’s xapi will kill any xenvmd that should only
run on the pool master. The new designated master will then restart itself
and start up the xenvmd process on itself.
Future use of dm-thin?
Dm-thin also uses 2 local LVs: one for the “thin pool” and one for the metadata.
After replaying our journal we could potentially delete our host local LVs and
switch over to dm-thin.
Summary of the impact on the admin
- If the VM workload performs a lot of disk allocation, then the admin should
enable HA.
- The admin must not downgrade the pool without first cleanly detaching the
storage.
- Extra metadata is needed to track thin provisioning, reducing the amount of
space available for user volumes.
- If an SR is completely full then it will not be possible to enable thin
provisioning.
- There will be more fragmentation, but the extent size is large (4MiB) so it
shouldn’t be too bad.
Ring protocols
Each ring consists of 3 sectors of metadata followed by the data area. The
contents of the first 3 sectors are:
Sector, Octet offsets | Name | Type | Description |
---|
0,0-30 | signature | string | Signature (“mirage shared-block-device 1.0”) |
1,0-7 | producer | uint64 | Pointer to the end of data written by the producer |
1,8 | suspend_ack | uint8 | Suspend acknowledgement byte |
2,0-7 | consumer | uint64 | Pointer to the end of data read by the consumer |
2,8 | suspend | uint8 | Suspend request byte |
Note. producer and consumer pointers are stored in little endian
format.
The pointers are free running byte offsets rounded up to the next
4-byte boundary, and the position of the actual data is found by
finding the remainder when dividing by the size of the data area. The
producer pointer points to the first free byte, and the consumer
pointer points to the byte after the last data consumed. The actual
payload is preceded by a 4-byte length field, stored in little endian
format. When writing a 1 byte payload, the next value of the producer
pointer will therefore be 8 bytes on from the previous - 4 for the
length (which will contain [0x01,0x00,0x00,0x00]), 1 byte for the
payload, and 3 bytes padding.
A ring is suspended and resumed by the consumer. To suspend, the
consumer first checks that the producer and consumer agree on the
current suspend status. If they do not, the ring cannot be
suspended. The consumer then writes the byte 0x02 into byte 8 of
sector 2. The consumer must then wait for the producer to acknowledge
the suspend, which it will do by writing 0x02 into byte 8 of sector 1.
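An illustrative consumer-side suspend, assuming a dev object with read_sector/write_sector helpers (not a real API), following the handshake just described:
import time

def suspend(dev, poll_interval=0.1):
    producer_meta = bytearray(dev.read_sector(1))
    consumer_meta = bytearray(dev.read_sector(2))
    if producer_meta[8] != consumer_meta[8]:
        raise RuntimeError("producer and consumer disagree on suspend status")
    consumer_meta[8] = 0x02                          # write the suspend request
    dev.write_sector(2, bytes(consumer_meta))
    while dev.read_sector(1)[8] != 0x02:             # wait for the producer to acknowledge
        time.sleep(poll_interval)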
The FromLVM ring
Two different types of message can be sent on the FromLVM ring.
The FreeAllocation message contains the blocks for the free pool.
Example message:
(FreeAllocation((blocks((pv0(12326 12249))(pv0(11 1))))(generation 2)))
Pretty-printed:
(FreeAllocation
(
(blocks
(
(pv0(12326 12249))
(pv0(11 1))
)
)
(generation 2)
)
)
This is a message to add two new sets of extents to the free pool. A
span of length 12249 extents starting at extent 12326, and a span of
length 1 starting from extent 11, both within the physical volume
‘pv0’. The generation count of this message is ‘2’. The semantics of
the generation is that the local allocator must record the generation
of the last message it received since the FromLVM ring was resumed,
and ignore any message with a generation less than or equal to the last
message received.
The CapRequest message contains a request to cap the free pool at
a maximum size.
Example message:
(CapRequest((cap 6127)(name host1-freeme)))
Pretty-printed:
(CapRequest
(
(cap 6127)
(name host1-freeme)
)
)
This is a request to cap the free pool at a maximum size of 6127
extents. The ’name’ parameter reflects the name of the LV into which
the extents should be transferred.
The ToLVM Ring
The ToLVM ring only contains 1 type of message. Example:
((volume test5)(segments(((start_extent 1)(extent_count 32)(cls(Linear((name pv0)(start_extent 12328))))))))
Pretty-printed:
(
(volume test5)
(segments
(
(
(start_extent 1)
(extent_count 32)
(cls
(Linear
(
(name pv0)
(start_extent 12328)
)
)
)
)
)
)
)
This message is extending an LV named ’test5’ by giving it 32 extents
starting at extent 1, coming from PV ‘pv0’ starting at extent
12328. The ‘cls’ field should always be ‘Linear’ - this is the only
acceptable value.
Cap requests
Xenvmd will try to keep the free pools of the hosts within a range
set as a fraction of free space. There are 3 parameters adjustable
via the config file:
- low_water_mark_factor
- medium_water_mark_factor
- high_water_mark_factor
These three are all numbers between 0 and 1. Xenvmd will sum the free
size and the sizes of all hosts’ free pools to find the total
effective free size in the VG, F
. It will then subtract the sizes of
any pending desired space from in-flight create or resize calls s
. This
will then be divided by the number of hosts connected, n
, and
multiplied by the three factors above to find the 3 absolute values
for the high, medium and low watermarks.
{high, medium, low} * (F - s) / n
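For example, the three absolute values could be computed as follows (a direct transcription of the formula above):
def watermarks(total_free, pending, hosts, low_factor, medium_factor, high_factor):
    base = (total_free - pending) / hosts            # (F - s) / n
    return low_factor * base, medium_factor * base, high_factor * base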
When xenvmd notices that a host’s free pool size has dropped below
the low watermark, it will be topped up such that the size is equal
to the medium watermark. If xenvmd notices that a host’s free pool
size is above the high watermark, it will issue a ‘cap request’ to
the host’s local allocator, which will then respond by allocating
from its free pool into the fake LV, which xenvmd will then delete
as soon as it gets the update.
Xenvmd keeps track of the last update it has sent to the local
allocator, and will not resend the same request twice, unless it
is restarted.
Design document |
---|
Revision | v2 |
Status | released (22.6.0) |
TLS verification for intra-pool communications
Overview
Xenserver has used TLS-encrypted communications between xapi daemons in a pool since its first release.
However it does not use TLS certificates to authenticate the servers it connects to.
This allows possible attackers opportunities to impersonate servers when the pools’ management network is compromised.
In order to enable certificate verification, certificate exchange as well as proper set up to trust them must be provided by xapi.
This is currently done by allowing users to generate, sign and install the certificates themselves; and then enable the Common Criteria mode.
This requires a CA and has a high barrier of entry.
Using the same certificates for intra-host communication creates friction between what the user needs and what the host needs.
Instead of trying to reconcile these two uses with one set of certificates, hosts will serve two certificates: one for API calls from external clients, which is the one that can be changed by users; and one that is used for intra-pool communications.
The TLS server in the host can select which certificate to serve depending on the service name the client requests when opening a TLS connection.
This mechanism is called Server Name Indication, or SNI for short.
Last but not least the update bearing these changes must not disrupt pool operations while or after being applied.
Glossary
Term | Meaning |
---|
SNI | Server Name Indication. This TLS protocol extension allows a server to select a certificate during the initial TLS handshake depending on a client-provided name. This usually allows a single reverse-proxy to serve several HTTPS websites. |
Host certificate | Certificate that a host sends clients when the latter initiate a connection with the former. The clients may close the connection depending on the properties of this certificate and whether they have decided to trust it previously. |
Trusted certificate | Certificate that a computer uses to verify whether a host certificate is valid. If the host certificate’s chain of trust does not include a trusted certificate it will be considered invalid. |
Default Certificate | Xenserver hosts present this certificate to clients which do not request an SNI. Users are allowed to install their own custom certificate. |
Pool Certificate | Xenserver hosts present this certificate to clients which request xapi:pool as the SNI. They are used for host-to-host communications. |
Common Criteria | Common Criteria for Information Technology Security Evaluation is a certification on computer security. |
Certificates and Identity management
Currently Xenserver hosts generate self-signed certificates with the IP or FQDN as their subjects, users may also choose to install certificates.
When installing these certificates only the cryptographic algorithms used to generate the certificates (private key and hash) are validated and no properties about them are required.
This means that using user-installed certificates for intra-pool communication may prove difficult: restrictions regarding FQDNs and chain validation would need to be enforced before enabling TLS certificate checking, or pool communications would break down.
Instead a different certificate is used only for pool communication.
This decouples whatever requirements users might have for the certificates they install from the requirements needed for secure pool communication.
This has several benefits:
- Frees the pool from ensuring a sound hostname resolution on the internal communications.
- Allows the pool to rotate the certificates when it deems necessary. (in particular expiration, or forced invalidation)
- Hosts never share a host certificate, and their private keys never get transmitted.
In general, the project is able to more safely change the parameters of intra-pool communication without disrupting how users use custom certificates.
To be able to establish trust in a pool, hosts must distribute the certificates to the rest of the pool members.
Once that is done servers can verify whether they are connecting to another host in the pool by comparing the server certificate with the certificates in the trust root.
Certificate pinning is available and would allow more stringent checks, but it doesn’t seem a necessity: hosts in a pool already share a secret that allows them to have full control of the pool.
To be able to select a host certificate depending on whether the connection is intra-pool or comes from an API client, SNI will be used.
This allows clients to ask for a service when establishing a TLS connection, and allows the server to choose the certificate it wants to offer when negotiating the connection with the client.
The hosts will exploit this to request a particular service when they establish a connection with other hosts in the pool.
When initiating a connection to another host in the pool, a server will request a TLS connection with the server_name xapi:pool and the name_type DNS.
This goes against RFC 6066, as this server_name is not resolvable.
This still works because we control the implementation in both peers of the connection and can follow the same convention.
In addition connections to the WLB appliance will continue to be validated using the current scheme of user-installed CA certificates.
This means that hosts connecting to the appliance will need a special case to only trust user-installed certificates when establishing the connection.
Conversely pool connections will ignore these certificates.
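The mechanism can be illustrated from the client side. The sketch below (Python, purely illustrative: xapi actually delegates TLS handling to stunnel) shows how a connection can request the xapi:pool SNI name while relying only on the pool’s CA bundle for trust; the host IP and port are placeholders.

```python
import socket
import ssl

def open_pool_connection(host_ip, ca_bundle="/etc/stunnel/xapi-pool-ca-bundle.pem"):
    # Trust only the pool bundle listed in the table below.
    ctx = ssl.create_default_context(cafile=ca_bundle)
    # "xapi:pool" is not a resolvable DNS name, so hostname checking is
    # disabled; trust comes purely from the pool CA bundle.
    ctx.check_hostname = False
    sock = socket.create_connection((host_ip, 443))
    # The server_hostname value below is sent as the SNI name.
    return ctx.wrap_socket(sock, server_hostname="xapi:pool")
```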
Name | Filesystem location | User-configurable | Used for |
---|
Host Default | /etc/xensource/xapi-ssl.pem | yes (using API) | Hosts serve it to normal API clients |
Host Pool | /etc/xensource/xapi-pool-tls.pem | no | Hosts serve to clients requesting “xapi:pool” as the SNI |
Trusted Default | /etc/stunnel/certs/ | yes (using API) | Certificates that users can install for trusting appliances |
Trusted Pool | /etc/stunnel/certs-pool/ | no | Certificates that are managed by the pool for host-to-host communications |
Default Bundle | /etc/stunnel/xapi-stunnel-ca-bundle.pem | no | Bundle of certificates that hosts use to verify appliances (in particular WLB), this is kept in sync with “Trusted Default” |
Pool Bundle | /etc/stunnel/xapi-pool-ca-bundle.pem | no | Bundle of certificates that hosts use to verify other hosts on pool communications, this is kept in sync with “Trusted Pool” |
Cryptography of certificates
The certificates until now have been signed using sha256WithRSAEncryption:
- Pre-8.0 releases use 1024-bit RSA keys.
- 8.0, 8.1 and 8.2 use 2048-bit RSA keys.
The Default Certificates served to API clients will continue to use sha256WithRSAEncryption with 2048-bit RSA keys. The Pool certificates will use the same algorithms for consistency.
The self-signed certificates until now have used a mix of IP and hostname claims:
- All released versions:
- Subject and issuer have CN FQDN if the hostname is different from localhost, or CN management IP
- Subject Alternate Names extension contains all the domain names as DNS names
- Next release:
- Subject and issuer have CN management IP
- SAN extension contains all domain names as DNS names and the management IP as IP
The Pool certificates do not contain claims about IPs or hostnames, as these may change at runtime and, depending on how they are validated, could make pool communication more brittle.
Instead, the only claim they carry is that their Issuer and Subject are CN = host UUID, along with a serial number.
Self-signed certificates produced until now have had validity periods of 3650 days (~10 years).
The Pool certificates will have the same validity period.
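As an illustration only (xapi generates these certificates itself; this is not its implementation), a certificate with the properties above can be sketched with Python’s cryptography package: self-signed, CN set to the host UUID, 2048-bit RSA, sha256WithRSAEncryption, valid for 3650 days.

```python
import datetime

from cryptography import x509
from cryptography.x509.oid import NameOID
from cryptography.hazmat.primitives import hashes
from cryptography.hazmat.primitives.asymmetric import rsa

def make_pool_certificate(host_uuid: str):
    key = rsa.generate_private_key(public_exponent=65537, key_size=2048)
    # The only claim is CN = host UUID; no IP or hostname entries.
    name = x509.Name([x509.NameAttribute(NameOID.COMMON_NAME, host_uuid)])
    now = datetime.datetime.utcnow()
    cert = (
        x509.CertificateBuilder()
        .subject_name(name)
        .issuer_name(name)  # self-signed: issuer == subject
        .public_key(key.public_key())
        .serial_number(x509.random_serial_number())
        .not_valid_before(now)
        .not_valid_after(now + datetime.timedelta(days=3650))
        .sign(key, hashes.SHA256())
    )
    return key, cert
```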
Server Components
HTTPS Connections between hosts usually involve the xapi daemons and stunnel processes:
- When a xapi daemon needs to initiate a connection with another host it starts an HTTP connection with a local stunnel process.
- The stunnel processes wrap http connections inside a TLS connection, allowing HTTPS to be used when hosts communicate
This means that stunnel needs to be set up correctly to verify certificates when connecting to other hosts.
Some aspects like CA certificates are already managed, but certificate pinning is not.
Use Cases
There are several use cases that need to be modified in order to correctly manage trust between hosts.
Opening a connection with a pool host
This is the main use case for the feature, the rest of use cases that need changes are modified to support this one.
Currently a Xenserver host connecting to another host within the pool does not try to authenticate the receiving server when opening a TLS connection.
(The receiving server authenticates the originating server by xapi authentication, see below)
Stunnel will be configured to verify the peer certificate against the CA certificates that are present in the host.
The CA certificates must be correctly set up when a host joins the pool to correctly establish trust.
The previous behaviour for WLB must be kept, as the WLB connection must be checked against the user-installed CA certificates.
Receiving an incoming TLS connection
All incoming connections authenticate the client using credentials (username and password, or the pool secret); this does not need the addition of certificates.
The hosts must present the certificate file to incoming connections so the client can authenticate them.
This is already managed by xapi, it configures stunnel to present the configured host certificate.
The configuration has to be changed so that stunnel responds to SNI requests containing the string xapi:pool by serving the internal certificate instead of the client-installed one.
U1. Host Installation
On xapi startup, an additional certificate is now created for pool operations.
It’s added to the trusted pool certificates.
The certificate’s only claim is the host’s UUID.
No IP or hostname information is kept, as clients only check for the certificate’s presence in the trust root.
U2. Pool Join
This use-case is delicate as it is the point where trust is established between hosts.
This is done with a call from the joiner to the pool coordinator where the certificate of the coordinator is not verified.
In this call the joiner transmits its certificate to the coordinator and the coordinator returns a list of the pool members’ UUIDs and certificates.
This means that the policy used is trust on first use.
To deal with parallel pool joins, hosts download all the Pool certificates in the pool from the coordinator after all restarts.
The connection is initiated by a client, just like before; there is no change in the API, as all the information needed to start the join is already provided (pool username and password, IP of the coordinator).
sequenceDiagram
participant clnt as Client
participant join as Joiner
participant coor as Coordinator
participant memb as Member
clnt->>join: pool.join coordinator_ip coordinator_username coordinator_password
join->>coor:login_with_password coordinator_ip coordinator_username coordinator_password
Note over join: pre_join_checks
join->>join: remote_pool_has_tls_enabled = self_pool_has_tls_enabled
alt are different
Note over join: interrupt join, raise error
end
Note right of join: certificate distribution
coor-->>join:
join->>coor: pool.internal_certificate_list_content
coor-->>join:
join->>coor: pool.upload_identity_host_certificate joiner_certificate uuid
coor->>memb: pool.internal_certificates_sync
memb-->>coor:
loop for every <user CA certificate> in Joiner
join->>coor: Pool.install_ca_certificate <user CA certificate>
coor-->>join:
end
loop for every <user CRL> in Joiner
join->>coor: Pool.install_crl <user CRL>
coor-->>join:
end
join->>coor: host.add joiner
coor-->>join:
join->>join: restart_as_slave
join->>coor: pool.user_certificates_sync
join->>coor: host.copy_primary_host_certs
U3. Pool Eject
During pool eject the pool must remove the host certificate of the ejected member from the internal trust root, this must be done by the xapi daemon of the coordinator.
The ejected member will recreate both server certificates to replicate a new installation.
This can be triggered by deleting the certificates and their private keys on the host before rebooting; the current boot scripts automatically generate a new self-signed certificate if the file is not present.
Additionally, both the user and the internal trust roots will be cleared before rebooting as well.
U4. Pool Upgrade
When a pool has finished upgrading to the version with certificate checking, the database reflects that the feature is turned off; this is done as part of the database upgrade procedure in xen-api.
The internal certificate is created on restart.
It is added to the internal trusted certificates directory.
The distribution of certificates will happen afterwards, when TLS verification is turned on.
U5. Host certificate state inspection
In order to give API clients useful information about the validity of installed user-facing certificates, as well as the certificates used for internal purposes, 2 fields are added to certificate records in xapi’s datamodel and database:
- type: indicates which of the 3 kinds of certificate this is: a user-installed trusted CA certificate, a server certificate served to clients that do not use SNI, or a server certificate served when the SNI xapi:pool is used. The exact values are ca, host and host-internal, respectively.
- name: the human-readable name given by the user. This field is only present on trusted CA certificates and allows the pool operators to better recognise the certificates.
Additionally, now the _host field contains a null reference if the certificate is a corporate CA (a ca certificate).
The fields will get exposed in the CLI whenever a certificate record is listed; this needs xapi-cli-server to be modified to show the new fields.
U6. Migrating a VM to another pool
To enable a frictionless migration when pools have tls verification enabled, the host certificate of the host receiving the vm is sent to the sender.
This is done by adding the certificate of the receiving host, as well as that of its pool coordinator, to the return value of the migrate_receive function.
The sender can then add the certificate to the folder of CA certificates that stunnel uses to verify the server in a TLS connection.
When the transaction finishes, whether it fails or succeeds, the CA certificate is deleted.
The certificate is stored in a temporary location so xapi can clean up the file when it starts up, in case the host fences or power cycles while the migration is in progress.
Xapi invokes sparse_dd with the filename of the correct trusted bundle as a parameter so it can verify the vhd-server running on the other host.
Xapi also invokes xcp-rrdd to migrate the VM metrics.
xcp-rrdd is passed the 2 certificates to verify the remote hosts when sending the metrics.
Clients should not be aware of this change and require no change.
Xapi-cli-server, the server of xe embedded into xapi, connects to the remote coordinator using TLS to be able to initiate the migration.
Currently no verification is done. A certificate is required to initiate the connection to verify the remote server.
In u6.3 and u6.4 no changes seem necessary.
U7. Change a host’s name
The Pool certificates do not depend on hostnames.
Changing the hostnames does not affect TLS certificate verification in a pool.
U8. Installing a certificate (corporate CA)
Installation of corporate CA can be done with current API.
Certificates are added to the database as CA certificates.
U9. Resetting a certificate (to self-signed certificate)
This needs a reimplementation of the current API to reset the host certificate, this time allowing the operation to happen when the host is not in emergency mode and to be able to do it remotely.
U10. Enabling certificate verification
A new API call is introduced to enable tls certificate verification: Pool.enable_tls_verification.
This is used by the CLI command pool-enable-tls-verification.
The call causes the coordinator of the pool to install the Pool certificates of all the members in its internal trust root.
It then calls the API on each member to install all of these certificates.
After this public key exchange is done, TLS certificate verification is enabled on the members, with the coordinator being the last to enable it.
When there are issues that block enabling the feature, the call returns an error specific to that problem:
- HA must not be enabled, as it can interrupt the procedure when certificates are distributed
- Pool operations that can disrupt the certificate exchange block this operation: These operations are listed in here
- There was an issue with the certificate exchange in the pool.
The coordinator enabling verification last ensures that, if there is any issue enabling it, the coordinator can still connect to the members and roll back the setting.
A new field is added to the pool: tls_verification_enabled. This enables clients to query whether TLS verification is enabled.
U11. Disabling certificate verification
A new emergency command is added: host-emergency-disable-tls-verification.
This command disables tls-verification for the xapi daemon in a host.
This allows the host to communicate with other hosts in the pool.
After that, the admin can regenerate the certificates using the new host-refresh-server-certificate call on the hosts with invalid certificates; finally, they can re-enable TLS certificate checking using the call host-emergency-reenable-tls-verification.
The documentation will include instructions for administrators on how to reset certificates and manually installing the host certificates as CA certificates to recover pools.
This means they will not have to disable TLS and compromise on security.
U12. Being aware of certificate expiry
Stockholm hosts provide alerts 30 days before host certificates expire; this must be changed to also alert about users’ CA certificates expiring.
Pool certificates need to be cycled when the certificate expiry is approaching.
Alerts are introduced to warn the administrator that this task must be done, or risk disrupting the operation of the pool.
A new API is introduced to create certificates for all members in a pool and replace the existing internal certificates with these.
This call imposes the same requirements on a pool as the pool secret rotation: it cannot be run unless all the hosts are online, it can only be started by the coordinator, the coordinator is in a valid state, HA is disabled, no RPU is in progress, and no pool operations are in progress.
The API call is Pool.rotate_internal_certificates.
It is exposed by xe as pool-rotate-internal-certificates.
Changes
Xapi startup has to account for host changes that affect this feature and modify the filesystem and pool database accordingly.
- Public certificate changed: On first boot, after a pool join and when doing emergency repairs the server certificate record of the host may not match to the contents in the filesystem. A check is to be introduced that detects if the database does not associate a certificate with the host or if the certificate’s public key in the database and the filesystem are different. If that’s the case the database is updated with the certificate in the filesystem.
- Pool certificate not present: In the same way the public certificate served is generated on startup, the internal certificate must be generated if the certificate is not present in the filesystem.
- Pool certificate changed: On first boot, after a pool join and after having done emergency repairs, the internal server certificate record may not match the contents of the filesystem. A check is to be introduced that detects if the database does not associate a certificate with the host or if the certificate’s public key in the database and the filesystem are different. This check is made aware of whether the host is joining a pool or is on first boot; it does this by counting the number of hosts in the pool from the database. In the case where it’s joining a pool it simply updates the database record with the correct information from the filesystem, as the filesystem contents have been put in place before the restart. In the case of first boot the public part of the certificate is copied to the directory and the bundle for internally-trusted certificates: /etc/stunnel/certs-pool/ and /etc/stunnel/xapi-pool-ca-bundle.pem.
The xapi database records for certificates must be changed in accordance with the additions explained before.
API
Additions
- Pool.tls_verification_enabled: this is a field that indicates whether TLS verification is enabled.
- Pool.enable_tls_verification: this call is allowed for role _R_POOL_ADMIN. It’s not allowed to run if HA is enabled or pool operations are in progress. All the hosts in the pool transmit their certificate to the coordinator and the coordinator then distributes the certificates to all members of the pool. Once that is done the coordinator tries to initiate a session with all the pool members with TLS verification enabled. If it’s successful, TLS verification is enabled for the whole pool; otherwise the error COULD_NOT_VERIFY_HOST [member UUID] is emitted.
- TLS_VERIFICATION_ENABLE_IN_PROGRESS is a new error that is produced when trying to do other pool operations while enabling TLS verification is in progress
- Host.emergency_disable_tls_verification: this call is allowed for role _R_LOCAL_ROOT_ONLY: it’s an emergency command and acts locally. It forces connections in xapi to stop verifying the peers on outgoing connections. It generates an alert to warn the administrators of this uncommon state.
- Host.emergency_reenable_tls_verification: this call is allowed for role _R_LOCAL_ROOT_ONLY: it’s an emergency command and acts locally. It changes the configuration so xapi verifies connections by default after being switched off with the previous command.
- Pool.install_ca_certificate: rename of Pool.certificate_install, add the ca certificate to the database.
- Pool.uninstall_ca_certificate: rename of Pool.certificate_uninstall, removes the certificate from the database.
- Host.reset_server_certificate: replaces Host.emergency_reset_server_certificate, now it’s allowed for role _R_POOL_ADMIN. It adds a record for the generated Default Certificate to the database while removing the previous record, if any.
- Pool.rotate_internal_certificates: This call generates new Pool certificates, and substitutes the previous certificates with these. See the certificate expiry section for more details.
Modifications:
- Pool.join: certificates must be correctly distributed. API Error POOL_JOINING_HOST_TLS_VERIFICATION_MISMATCH is returned if the tls_verification of the two pools doesn’t match.
- Pool.eject: all certificates must be deleted from the ejected host’s filesystem and the ejected host’s certificate must be deleted from the pool’s trust root.
- Host.install_server_certificate: the certificate type host for the record must be added to denote it’s a Standard Certificate.
Deprecations:
- pool.certificate_install
- pool.certificate_uninstall
- pool.certificate_list
- pool.wlb_verify_cert: This setting is superseded by pool.enable_tls_verification. It cannot be removed, however. When updating from a previous version where this setting is on, TLS connections to WLB must still verify the external host. When the global setting is enabled this setting is ignored.
- host.emergency_reset_server_certificate: host.reset_server_certificate should be used instead as this call does not modify the database.
CLI
Following API additions:
- pool-enable-tls-verification
- pool-install-ca-certificate
- pool-uninstall-ca-certificate
- pool-internal-certificates-rotation
- host-reset-server-certificate
- host-emergency-disable-tls-verification (emits a warning when verification is off and the pool-level is on)
- host-emergency-reenable-tls-verification
And removals:
- host-emergency-server-certificate
Feature Flags
This feature needs clients to behave differently when initiating pool joins; to allow them to choose the correct behaviour, the toolstack will expose a new feature flag ‘Certificate_verification’. This flag will be part of the express edition as it’s meant to aid detection of the feature and not block access to it.
Alerts
Several alerts are introduced:
POOL_CA_CERTIFICATE_EXPIRING_30, POOL_CA_CERTIFICATE_EXPIRING_14, POOL_CA_CERTIFICATE_EXPIRING_07, POOL_CA_CERTIFICATE_EXPIRED: Similar to host certificates, now the user-installable pool’s CA certificates are monitored for expiry dates and alerts are generated about them. The body for this type of message is:
HOST_INTERNAL_CERTIFICATE_EXPIRING_30, HOST_INTERNAL_CERTIFICATE_EXPIRING_14, HOST_INTERNAL_CERTIFICATE_EXPIRING_07, HOST_INTERNAL_CERTIFICATE_EXPIRED: Similar to host certificates, the newly-introduced hosts’ internal server certificates are monitored for expiry dates and alerts are generated about them. The body for this type of message is:
TLS_VERIFICATION_EMERGENCY_DISABLED: The host is in emergency mode and is not enforcing tls verification anymore, the situation that forced the disabling must be fixed and the verification enabled ASAP.
FAILED_LOGIN_ATTEMPTS: An hourly alert that contains the number of failed login attempts and the 3 most common origins of these failed attempts. The body for this type of message is:
Design document |
---|
Revision | v1 |
Status | released (5.6 fp1) |
Tunnelling API design
To isolate network traffic between VMs (e.g. for security reasons) one can use
VLANs. The number of possible VLANs on a network, however, is limited, and
setting up a VLAN requires configuring the physical switches in the network.
GRE tunnels provide a similar, though more flexible solution. This document
proposes a design that integrates the use of tunnelling in the XenAPI. The
design relies on the recent introduction of the Open vSwitch, and
requires an Open vSwitch
(OpenFlow) controller
(further referred to as
the controller) to set up and maintain the actual GRE tunnels.
We suggest following the way VLANs are modelled in the datamodel. Introducing a
VLAN involves creating a Network object for the VLAN, that VIFs can connect to.
The VLAN.create
API call takes references to a PIF and Network to use and a
VLAN tag, and creates a VLAN object and a PIF object. We propose something
similar for tunnels; the resulting objects and relations for two hosts would
look like this:
PIF (transport) -- Tunnel -- PIF (access) \          / VIF
                                            Network -- VIF
PIF (transport) -- Tunnel -- PIF (access) /          \ VIF
XenAPI changes
New tunnel class
Fields
- string uuid (read-only)
- PIF ref access_PIF (read-only)
- PIF ref transport_PIF (read-only)
- (string -> string) map status (read/write); owned by the controller, containing at least the key active, and key and error when appropriate (see below)
- (string -> string) map other_config (read/write)
New fields in PIF class (automatically linked to the corresponding tunnel
fields):
- PIF ref set tunnel_access_PIF_of (read-only)
- PIF ref set tunnel_transport_PIF_of (read-only)
Messages
tunnel ref create (PIF ref, network ref)
void destroy (tunnel ref)
Backends
For clients to determine which network backend is in use (to decide whether tunnelling functionality is enabled) a key network_backend is added to the Host.software_version map on each host. The value of this key can be:
- bridge: the Linux bridging backend is in use;
- openvswitch: the Open vSwitch backend is in use.
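As an illustrative sketch only (using the Python XenAPI bindings, assumed available; the guard and the create call follow the messages defined above, and error handling is omitted), a client could create a tunnel like this:

```python
import XenAPI  # Python bindings shipped with the SDK

def create_tunnel(session, transport_pif, network):
    # Tunnelling is only meaningful with the Open vSwitch backend.
    host = session.xenapi.PIF.get_host(transport_pif)
    backend = session.xenapi.host.get_software_version(host).get("network_backend")
    if backend != "openvswitch":
        raise RuntimeError("tunnelling requires the openvswitch network backend")
    # tunnel ref create (PIF ref, network ref)
    return session.xenapi.tunnel.create(transport_pif, network)
```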
Notes
The user is responsible for creating tunnel and network objects, associating
VIFs with the right networks, and configuring the physical PIFs, all using
the XenAPI/CLI/XC.
The tunnel.status
field is owned by the controller. It
may be possible to define an RBAC role for the controller, such that only the
controller is able to write to it.
The tunnel.create message does not take a tunnel identifier (GRE key). The controller is responsible for assigning the right keys transparently. When a tunnel has been set up, the controller will write its key to tunnel.status:key, and it will set tunnel.status:active to "true" in the same field.
In case a tunnel could not be set up, an error code (to be defined) will be written to tunnel.status:error, and tunnel.status:active will be "false".
Xapi
tunnel.create
- Fails with OPENVSWITCH_NOT_ACTIVE if the Open vSwitch networking sub-system is not active (the host uses linux bridging).
- Fails with IS_TUNNEL_ACCESS_PIF if the specified transport PIF is a tunnel access PIF.
- Takes care of creating and connecting the new tunnel and PIF objects.
- Sets a random MAC on the access PIF.
- IP configuration of the tunnel access PIF is left blank. (The IP configuration on a PIF is normally used for the interface in dom0. In this case, there is no tunnel interface for dom0 to use. Such functionality may be added in future.)
- The tunnel.status:active field is initialised to "false", indicating that no actual tunnelling infrastructure has been set up yet.
- Calls PIF.plug on the new tunnel access PIF.
tunnel.destroy
- Calls PIF.unplug on the tunnel access PIF. Destroys the tunnel and tunnel access PIF objects.
PIF.plug on a tunnel access PIF
- Fails with TRANSPORT_PIF_NOT_CONFIGURED if the underlying transport PIF has PIF.ip_configuration_mode = None, as this interface needs to be configured for the tunnelling to work. Otherwise, the transport PIF will be plugged.
- Xapi requests interface-reconfigure to “bring up” the tunnel access PIF, which causes it to create a local bridge.
- No link will be made between the new bridge and the physical interface by interface-reconfigure. The controller is responsible for setting up these links. If the controller is not available, no links can be created, and the tunnel network degrades to an internal network (only intra-host connectivity).
- PIF.currently_attached is set to true.
PIF.unplug on a tunnel access PIF
- Xapi requests interface-reconfigure to “bring down” the tunnel PIF, which causes it to destroy the local bridge.
- PIF.currently_attached is set to false.
PIF.unplug on a tunnel transport PIF
- Calls PIF.unplug on the associated tunnel access PIF(s).
PIF.forget on a tunnel access or transport PIF
- Fails with PIF_TUNNEL_STILL_EXISTS.
VLAN.create
- Tunnels can only exist on top of physical/VLAN/Bond PIFs, and not the other way around. VLAN.create fails with IS_TUNNEL_ACCESS_PIF if given an underlying PIF that is a tunnel access PIF.
Pool join
- As for VLANs, when a host joins a pool, it will inherit the tunnels that are
present on the pool master.
- Any tunnels (tunnel and access PIF objects)
configured on the host are removed, which will leave their networks
disconnected (the networks become internal networks). As a joining host is
always a single host, there is no real use for having had tunnels on it, so
this probably will never be an issue.
The controller
- The controller tracks the tunnel class to determine which bridges/networks require GRE tunnelling.
  - On start-up, it calls tunnel.get_all to obtain the information about all tunnels.
  - It registers for events on the tunnel class to stay up-to-date.
- A tunnel network is organised as a star topology. The controller is free to decide which host will be the central host (“switching host”).
- If the current switching host goes down, a new one will be selected, and GRE tunnels will be reconstructed.
- The controller creates GRE tunnels connecting each existing Open vSwitch bridge that is associated with the same tunnel network, after assigning the network a unique GRE key.
- The controller destroys GRE tunnels if associated Open vSwitch bridges are destroyed. If the destroyed bridge was on the switching host, and other hosts are still using the same tunnel network, a new switching host will be selected, and GRE tunnels will be reconstructed.
- The controller sets tunnel.status:active to "true" for all tunnel links that have been set up, and "false" if links are broken.
- The controller writes an appropriate error code (to be defined) to tunnel.status:error in case something went wrong.
- When an access PIF is plugged, and the controller succeeds in setting up the tunnelling infrastructure, it writes the GRE key to tunnel.status:key on the associated tunnel object (at the same time tunnel.status:active will be set to "true").
- When the tunnel infrastructure is not up and running, the controller may remove the key tunnel.status:key (optional; the key should anyway be disregarded if tunnel.status:active is "false").
CLI
New xe
commands (analogous to xe vlan-
):
tunnel-create
tunnel-destroy
tunnel-list
tunnel-param-get
tunnel-param-list
Design document |
---|
Revision | v2 |
Status | released (8.2) |
User-installable host certificates
Introduction
It is often necessary to replace the TLS certificate used to secure
communications to XenServer hosts, for example to allow a XenAPI user such as
Citrix Virtual Apps and Desktops (CVAD) to validate that the host is genuine
and not impersonating the actual host.
Historically there has not been a supported mechanism to do this, and as a
result users have had to rely on guides written by third parties that show how
to manually replace the xapi-ssl.pem file on a host. This process is
error-prone, and if a mistake is made, can result in an unusable system.
This design provides a fully supported mechanism to allow replacing the
certificates.
Design proposal
It is expected that an API caller will provide, in a single API call, a private
key, and one or more certificates for use on the host. The key will be provided
in PKCS #8 format, and the certificates in X509 format, both in
base-64-encoded PEM containers.
Multiple certificates can be provided to cater for the case where an
intermediate certificate or certificates are required for the caller to be able
to verify the certificate back to a trusted root (best practice for Certificate
Authorities is to have an ‘offline’ root, and issue certificates from an
intermediate Certificate Authority). In this situation, it is expected (and
common practice among other tools) that the first certificate provided in the
chain is the host’s unique server certificate, and subsequent certificates form
the chain.
To detect mistakes a user may make, certain checks will be carried out on the
provided key and certificate(s) before they are used on the host. If all checks
pass, the key and certificate(s) will be written to the host, at which stage a
signal will be sent to stunnel that will cause it to start serving the new
certificate.
Certificate Installation
API Additions
Xapi must provide an API call through Host RPC API to install host
certificates:
let install_server_certificate = call
~lifecycle:[Published, rel_stockholm, ""]
~name:"install_server_certificate"
~doc:"Install the TLS server certificate."
~versioned_params:
[{ param_type=Ref _host; param_name="host"; param_doc="The host"
; param_release=stockholm_release; param_default=None}
;{ param_type=String; param_name="certificate"
; param_doc="The server certificate, in PEM form"
; param_release=stockholm_release; param_default=None}
;{ param_type=String; param_name="private_key"
; param_doc="The unencrypted private key used to sign the certificate, \
in PKCS#8 form"
; param_release=stockholm_release; param_default=None}
;{ param_type=String; param_name="certificate_chain"
; param_doc="The certificate chain, in PEM form"
; param_release=stockholm_release; param_default=Some (VString "")}
]
~allowed_roles:_R_POOL_ADMIN
()
This call should be implemented within xapi, using the already-existing crypto
libraries available to it.
Analogous to the API call, a new CLI call host-server-certificate-install must be introduced, which takes the parameters certificate, key and certificate-chain - these parameters are expected to be filenames, from which the key and certificate(s) must be read, and passed to the install_server_certificate RPC call.
The CLI will be defined as:
"host-server-certificate-install",
{
reqd=["certificate"; "private-key"];
optn=["certificate-chain"];
help="Install a server TLS certificate on a host";
implementation=With_fd Cli_operations.host_install_server_certificate;
flags=[ Host_selectors ];
};
Validation
Xapi must perform the following validation steps on the provided key and
certificate. If any validation step fails, the API call must return an error
with the specified error code, providing any associated text:
Private Key
Validate that it is a pem-encoded PKCS#8 key, use error
SERVER_CERTIFICATE_KEY_INVALID []
and exposed as
“The provided key is not in a pem-encoded PKCS#8 format.”
Validate that the algorithm of the key is RSA, use error
SERVER_CERTIFICATE_KEY_ALGORITHM_NOT_SUPPORTED, [<algorithm's ASN.1 OID>]
and exposed as “The provided key uses an unsupported algorithm.”
Validate that the key length is ≥ 2048, and ≤ 4096 bits, use error
SERVER_CERTIFICATE_KEY_RSA_LENGTH_NOT_SUPPORTED, [length]
and exposed as
“The provided RSA key does not have a length between 2048 and 4096.”
The library used does not support multi-prime RSA keys; when one is encountered, use error SERVER_CERTIFICATE_KEY_RSA_MULTI_NOT_SUPPORTED [] and exposed as “The provided RSA key is using more than 2 primes, expecting only 2”
Server Certificate
Validate that it is a pem-encoded X509 certificate, use error
SERVER_CERTIFICATE_INVALID []
and exposed as “The provided certificate is not
in a pem-encoded X509.”
Validate that the public key of the certificate matches the public key from
the private key, using error SERVER_CERTIFICATE_KEY_MISMATCH []
and exposing
it as “The provided key does not match the provided certificate’s public key.”
Validate that the certificate is currently valid. (ensure all time
comparisons are done using UTC, and any times presented in errors are using
ISO8601 format):
Ensure the certificate’s not_before
date is ≤ NOW
SERVER_CERTIFICATE_NOT_VALID_YET, [<NOW>; <not_before>]
and exposed as
“The provided certificate is not valid yet.”
Ensure the certificate’s not_after
date is > NOW
SERVER_CERTIFICATE_EXPIRED, [<NOW>; <not_after>]
and exposed as “The
provided certificate has expired.”
Validate that the certificate signature algorithm is SHA-256
SERVER_CERTIFICATE_SIGNATURE_NOT_SUPPORTED []
and exposed as
“The provided certificate is not using the SHA256 (SHA2) signature algorithm.”
- Validate that it is an X509 certificate, use
SERVER_CERTIFICATE_CHAIN_INVALID []
and exposed as “The provided
intermediate certificates are not in a pem-encoded X509.”
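Purely as an illustration (xapi performs these checks with its own crypto libraries; the error names are reused from the list above, and the use of Python’s cryptography package is an assumption), a few of the checks might look like:

```python
import datetime

from cryptography import x509
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import rsa

def validate_key_and_certificate(key_pem: bytes, cert_pem: bytes) -> None:
    key = serialization.load_pem_private_key(key_pem, password=None)
    if not isinstance(key, rsa.RSAPrivateKey):
        raise ValueError("SERVER_CERTIFICATE_KEY_ALGORITHM_NOT_SUPPORTED")
    if not 2048 <= key.key_size <= 4096:
        raise ValueError("SERVER_CERTIFICATE_KEY_RSA_LENGTH_NOT_SUPPORTED")
    cert = x509.load_pem_x509_certificate(cert_pem)
    # Public key of the certificate must match the provided private key.
    if cert.public_key().public_numbers() != key.public_key().public_numbers():
        raise ValueError("SERVER_CERTIFICATE_KEY_MISMATCH")
    now = datetime.datetime.utcnow()  # all comparisons in UTC
    if cert.not_valid_before > now:
        raise ValueError("SERVER_CERTIFICATE_NOT_VALID_YET")
    if cert.not_valid_after <= now:
        raise ValueError("SERVER_CERTIFICATE_EXPIRED")
    if not isinstance(cert.signature_hash_algorithm, hashes.SHA256):
        raise ValueError("SERVER_CERTIFICATE_SIGNATURE_NOT_SUPPORTED")
```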
Filesystem Interaction
If validation has been completed successfully, a temporary file must be created
with permissions 0x400 containing the key and certificate(s), in that order,
separated by an empty line.
This file must then be atomically moved to /etc/xensource/xapi-ssl.pem in
order to ensure the integrity of the contents. This may be done using rename
with the origin and destination in the same mount-point.
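A hedged sketch of this sequence in Python (xapi itself is OCaml; the 0o400 mode below reads the design’s “0x400” as octal 0400, owner read-only):

```python
import os
import tempfile

def install_pem(key_pem: str, certs_pem: str, dest="/etc/xensource/xapi-ssl.pem"):
    # Create the temporary file in the destination directory so the final
    # rename stays within one mount-point and is therefore atomic.
    fd, tmp_path = tempfile.mkstemp(dir=os.path.dirname(dest))
    try:
        os.fchmod(fd, 0o400)
        with os.fdopen(fd, "w") as f:
            # Key first, then certificate(s), separated by an empty line.
            f.write(key_pem + "\n" + certs_pem)
        os.rename(tmp_path, dest)
    except Exception:
        os.unlink(tmp_path)
        raise
```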
Alerting
A daily task must be added. This task must check the expiry date of the first
certificate present in /etc/xensource/xapi-ssl.pem, and if it is within 30
days of expiry, generate a message
to alert the administrator that the
certificate is due to expire shortly.
The body of the message should contain:
<body>
<message>
The TLS server certificate is expiring soon
</message>
<date>
<expiry date in ISO8601 'YYYY-MM-DDThh:mm:ssZ' format>
</date>
</body>
The priority of the message should be based on the number of days to expiry as
follows:
Number of days | Priority |
---|
0-7 | 1 |
8-14 | 2 |
14+ | 3 |
The other fields of the message should be:
Field | Value |
---|
name | HOST_SERVER_CERTIFICATE_EXPIRING |
class | Host |
obj-uuid | < Host UUID > |
Any existing HOST_SERVER_CERTIFICATE_EXPIRING
messages with this host’s UUID
should be removed to avoid a build-up of messages.
Additionally, the task may also produce messages for expired server
certificates which must use the name HOST_SERVER_CERTIFICATE_EXPIRED
.
This kind of message must contain the text “The TLS server certificate has expired.” as well as the expiry date, like the expiring messages.
They may also replace the existing expiring messages on a host.
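As an illustration only (the daily task is part of xapi; the boundaries follow the priority table above, and the handling of already-expired certificates is an assumption since its priority is not specified in this design), the days-to-priority mapping could look like:

```python
import datetime

def certificate_alert(not_after: datetime.datetime, now: datetime.datetime):
    """Return (message name, priority) or None when no alert is needed."""
    days = (not_after - now).days
    if days < 0:
        return ("HOST_SERVER_CERTIFICATE_EXPIRED", 1)   # priority assumed
    if days <= 7:
        return ("HOST_SERVER_CERTIFICATE_EXPIRING", 1)
    if days <= 14:
        return ("HOST_SERVER_CERTIFICATE_EXPIRING", 2)
    if days <= 30:
        return ("HOST_SERVER_CERTIFICATE_EXPIRING", 3)
    return None
```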
Currently xapi exposes a CLI command to print the certificate being used to
verify external hosts. We would like to also expose through the API and the
CLI useful metadata about the certificates in use by each host.
The new class is meant to cover server certificates and trusted certificates.
Schema
A new class, Certificate, will be added with the following schema:
Field | Type | Notes |
---|
uuid | | |
type | CA | Certificate trusted by all hosts |
| Host | Certificate that the host presents to normal clients |
name | String | Name, only present for trusted certificates |
host | Ref _host | Host where the certificate is installed |
not_before | DateTime | Date after which the certificate is valid |
not_after | DateTime | Date before which the certificate is valid |
fingerprint_sha256 | String | The certificate’s SHA256 fingerprint / hash |
fingerprint_sha1 | String | The certificate’s SHA1 fingerprint / hash |
CLI / API
There are currently-existing CLI parameters for certificates:
pool-certificate-{install,uninstall,list,sync}
,
pool-crl-{install,uninstall,list}
and host-get-server-certificate
.
The new command must show the metadata of installed server certificates in
the pool.
It must be able to show all of them in the same call, and be able to filter
the certificates per-host.
To make it easy to separate it from the previous calls and to reflect that
certificates are a class type in xapi the call will be named certificate-list
and it will accept the parameter host-uuid=<uuid>
.
Recovery mechanism
In the case a certificate is allowed to expire, TLS clients connecting to the host will refuse to establish the connection.
This means that the host cannot be managed using the xapi API (XenCenter, or a CVAD control plane).
There needs to be a mechanism to recover from this situation.
A CLI command must be provided to install a self-signed certificate, in the
same way it is generated during the setup process at the moment.
The command will be host-emergency-reset-server-certificate
.
This command is never to be forwarded to another host and will call openssl to create a new RSA private key.
The command must notify stunnel to make sure stunnel uses the newly-created
certificate.
Miscellaneous
The auto-generated xapi-ssl.pem
currently contains Diffie-Hellman (DH)
Parameters, specifically 512 bits worth. We no longer support any ciphers which
require DH parameters, so these are no longer needed, and it is acceptable for
them to be lost as part of installing a new certificate/key pair.
The generation should also be modified to avoid creating these for new
installations.
Design document |
---|
Revision | v1 |
Status | released (7.0) |
Review | #156 |
Revision history |
---|
v1 | Initial version |
VGPU type identifiers
Introduction
When xapi starts, it may create a number of VGPU_type objects. These act as
VGPU presets, and exactly which VGPU_type objects are created depends on the
installed hardware and in certain cases the presence of certain files in dom0.
When deciding which VGPU_type objects need to be created, xapi needs to
determine whether a suitable VGPU_type object already exists, as there should
never be duplicates. At the moment the combination of vendor name and model name
is used as a primary key, but this is not ideal as these values are subject to
change. We therefore need a way of creating a primary key to uniquely identify
VGPU_type objects.
Identifier
We will add a new read-only field to the database:
VGPU_type.identifier (string)
This field will contain a string representation of the parameters required to
uniquely identify a VGPU_type. The parameters required can be summed up with the
following OCaml type:
type nvidia_id = {
pdev_id : int;
psubdev_id : int option;
vdev_id : int;
vsubdev_id : int;
}
type gvt_g_id = {
pdev_id : int;
low_gm_sz : int64;
high_gm_sz : int64;
fence_sz : int64;
monitor_config_file : string option;
}
type t =
| Passthrough
| Nvidia of nvidia_id
| GVT_g of gvt_g_id
When converting this type to a string, the string will always be prefixed with
0001:
enabling future versioning of the serialisation format.
For passthrough, the string will simply be:
0001:passthrough
For NVIDIA, the string will be nvidia
followed by the four device IDs
serialised as four-digit hex values, separated by commas. If psubdev_id
is
None
, the empty string will be used e.g.
Nvidia {
pdev_id = 0x11bf;
psubdev_id = None;
vdev_id = 0x11b0;
vsubdev_id = 0x109d;
}
would map to
0001:nvidia,11bf,,11b0,109d
For GVT-g, the string will be gvt-g
followed by the physical device ID encoded
as four-digit hex, followed by low_gm_sz
, high_gm_sz
and fence_sz
encoded
as hex, followed by monitor_config_file
(or the empty string if it is None
)
e.g.
GVT_g {
pdev_id = 0x162a;
low_gm_sz = 128L;
high_gm_sz = 384L;
fence_sz = 4L;
monitor_config_file = None;
}
would map to
0001:gvt-g,162a,80,180,4,,
Having this string in the database will allow us to do a simple lookup to test
whether a certain VGPU_type already exists. Although it is not currently
required, this string can also be converted back to the type from which it was
generated.
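A minimal sketch of the serialisation for the NVIDIA case, written in Python purely for illustration (xapi itself implements this in OCaml), reproduces the example above:

```python
def nvidia_identifier(pdev_id, vdev_id, vsubdev_id, psubdev_id=None):
    # Device IDs are serialised as four-digit hex; psubdev_id may be absent.
    psub = f"{psubdev_id:04x}" if psubdev_id is not None else ""
    return f"0001:nvidia,{pdev_id:04x},{psub},{vdev_id:04x},{vsubdev_id:04x}"

assert nvidia_identifier(0x11bf, 0x11b0, 0x109d) == "0001:nvidia,11bf,,11b0,109d"
```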
When deciding whether to create VGPU_type objects, xapi will generate the
identifier string and use it to look for existing VGPU_type objects in the
database. If none are found, xapi will look for existing VGPU_type objects with
the tuple of model name and vendor name. If still none are found, xapi will
create a new VGPU_type object.
Design document |
---|
Revision | v1 |
Status | released (7.0) |
Virtual Hardware Platform Version
Background and goal
Some VMs can only be run on hosts of sufficiently recent versions.
We want a clean way to ensure that xapi only tries to run a guest VM on a host that supports the “virtual hardware platform” required by the VM.
Suggested design
- In the datamodel, VM has a new integer field “hardware_platform_version” which defaults to zero.
- In the datamodel, Host has a corresponding new integer-list field “virtual_hardware_platform_versions” which defaults to a list containing a single zero element (i.e. [0] or [0L] in OCaml notation). The zero represents the implicit version supported by older hosts that lack the code to handle the Virtual Hardware Platform Version concept.
- When a host boots it populates its own entry from a hardcoded value, currently [0; 1] i.e. a list containing the two integer elements 0 and 1. (Alternatively this could come from a config file.)
  - If this new version-handling functionality is introduced in a hotfix, at some point the pool master will have the new functionality while at least one slave does not. An old slave-host that does not yet have software to handle this feature will not set its DB entry, which will therefore remain as [0] (maintained in the DB by the master).
- The existing test for whether a VM can run on (or migrate to) a host must include a check that the VM’s virtual hardware platform version is in the host’s list of supported versions.
- When a VM is made to start using a feature that is available only in a certain virtual hardware platform version, xapi must set the VM’s hardware_platform_version to the maximum of that version-number and its current value (i.e. raise it if needed).
For the version we could consider some type other than integer, but a strict ordering is needed.
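A hedged, pure-logic sketch of the two rules above (field names follow the datamodel; this is illustrative, not xapi’s code):

```python
def vm_can_run_on(vm_hardware_platform_version: int,
                  host_virtual_hardware_platform_versions: list) -> bool:
    # The VM's version must be in the host's list of supported versions.
    return vm_hardware_platform_version in host_virtual_hardware_platform_versions

def raise_version_if_needed(current_version: int, feature_version: int) -> int:
    # Never lower the VM's version; only raise it when a feature requires it.
    return max(current_version, feature_version)
```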
First use-case
Version 1 denotes support for a certain feature:
When a VM starts, if a certain flag is set in VM.platform then XenServer will provide an emulated PCI device which will trigger the guest Windows OS to seek drivers for the device, or updates for those drivers. Thus updated drivers can be obtained through the standard Windows Update mechanism.
If the PCI device is removed, the guest OS will fail to boot. A VM using this feature must not be migrated to or started on a XenServer that lacks support for the feature.
Therefore at VM start, we can look at whether this feature is being used; if it is, then if the VM’s Virtual Hardware Platform Version is less than 1 we should raise it to 1.
Limitation
Consider a VM that requires version 1 or higher. Suppose it is exported, then imported into an old host that does not support this feature. Then the host will not check the versions but will attempt to run the VM, which will then have difficulties.
The only way to prevent this would be to make a backwards-incompatible change to the VM metadata (e.g. a new item in an enum) so that the old hosts cannot read it, but that seems like a bad idea.
Design document |
---|
Revision | v2 |
Status | proposed |
XenPrep
Background
Windows guests should have XenServer-specific drivers installed. As of mid-2015 these have always been installed and upgraded by an essentially manual process involving an ISO carrying the drivers. We have a plan to enable automation through the standard Windows Update mechanism. This will involve a new additional virtual PCI device being provided to the VM, to trigger Windows Update to fetch drivers for the device.
There are many existing Windows guests that have drivers installed already. These drivers must be uninstalled before the new drivers are installed (and ideally before the new PCI device is added). To make this easier, we are planning a XenAPI call that will cause the removal of the old drivers and the addition of the new PCI device.
Since this is only to help with updating old guests, the call may well be removed at some point in the future.
Brief high-level design
The XenAPI call will be called VM.xenprep_start
. It will update the VM record to note that the process has started, and will insert a special ISO into the VM’s virtual CD drive.
That ISO will contain a tool which will be set up to auto-run (if auto-run is enabled in the guest). The tool will:
- Lock the CD drive so other Windows programs cannot eject the disc.
- Uninstall the old drivers.
- Eject the CD to signal success.
- Shut down the VM.
XenServer will interpret the ejection of the CD as a success signal, and when the VM shuts down without the special ISO in the drive, XenServer will:
- Update the VM record:
  - Remove the mark that shows that the xenprep process is in progress
  - Give it the new PCI device: set VM.auto_update_drivers to true.
  - If VM.virtual_hardware_platform_version is less than 2, then set it to 2.
- Start the VM.
More details of the xapi-project parts
(The tool that runs in the guest is out of scope for this document.)
Start
The XenAPI call VM.xenprep_start
will throw a power-state error if the VM is not running.
For RBAC roles, it will be available to “VM Operator” and above.
It will:
- Insert the xenprep ISO into the VM’s virtual CD drive.
- Write
VM.other_config
key xenprep_progress=ISO_inserted
to record the fact that the xenprep process has been initiated.
If xenprep_start
is called on a VM already undergoing xenprep, the call will return successfully but will not do anything.
If the VM does not have an empty virtual CD drive, the call will fail with a suitable error.
Cancellation
While xenprep is in progress, any request to eject the xenprep ISO (except from inside the guest) will be rejected with a new error “VBD_XENPREP_CD_IN_USE”.
There will be a new XenAPI call VM.xenprep_abort
which will:
- Remove the xenprep_progress entry from VM.other_config.
- Make a best-effort attempt to eject the CD. (The guest might prevent ejection.)
This is not intended for cancellation while the xenprep tool is running, but rather for use before it starts, for example if auto-run is disabled or if the VM has a non-Windows OS.
Completion
Aim: when the guest shuts down after ejecting the CD, XenServer will start the guest again with the new PCI device.
Xapi works through the queue of events it receives from xenopsd. It is possible that by the time xapi processes the cd-eject event, the guest might have shut down already.
When the shutdown (not reboot) event is handled, we shall check whether we need to do anything xenprep-related. If
- The VM other_config map has xenprep_progress as either of ISO_inserted or shutdown, and
- The xenprep ISO is no longer in the drive
then we must (in the specified order)
- Update the VM record:
  - In VM.other_config set xenprep_progress=shutdown
  - If VM.virtual_hardware_platform_version is less than 2, then set it to 2.
  - Give it the new PCI device: set VM.auto_update_drivers to true.
- Initiate VM start.
- Remove xenprep_progress from VM.other_config
The most relevant code is probably the update_vm
function in ocaml/xapi/xapi_xenops.ml
in the xen-api
repo (or in some function called from there).
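Purely as an illustration of the decision and ordering above (not the real implementation, which lives in xapi’s OCaml code), a pure-logic sketch might look like:

```python
def xenprep_completion_actions(other_config: dict,
                               xenprep_iso_still_inserted: bool,
                               hw_platform_version: int):
    """Return the ordered list of actions to take when a VM shuts down."""
    if other_config.get("xenprep_progress") not in ("ISO_inserted", "shutdown"):
        return []
    if xenprep_iso_still_inserted:
        return []
    return [
        ("set_other_config", "xenprep_progress", "shutdown"),
        ("set_hardware_platform_version", max(hw_platform_version, 2)),
        ("set_auto_update_drivers", True),
        ("start_vm",),
        ("remove_other_config_key", "xenprep_progress"),
    ]
```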
Subsections of XenAPI
XenAPI Basics
This document contains a description of the Xen Management API - an interface for
remotely configuring and controlling virtualised guests running on a
Xen-enabled host.
The API is presented here as a set of Remote Procedure Calls (RPCs).
There are two supported wire formats, one based upon
XML-RPC
and one based upon JSON-RPC (v1.0 and v2.0 are both
recognized). No specific language bindings are prescribed, although examples
are given in the Python programming language.
Although we adopt some terminology from object-oriented programming,
future client language bindings may or may not be object oriented.
The API reference uses the terminology classes and objects.
For our purposes a class is simply a hierarchical namespace;
an object is an instance of a class with its fields set to
specific values. Objects are persistent and exist on the server-side.
Clients may obtain opaque references to these server-side objects and then
access their fields via get/set RPCs.
For each class we specify a list of fields along with their types and
qualifiers. A qualifier is one of:
RO/runtime: the field is Read Only. Furthermore, its value is
automatically computed at runtime. For example, current CPU load and disk IO
throughput.
RO/constructor: the field must be manually set when a new object is
created, but is then Read Only for the duration of the object’s life.
For example, the maximum memory addressable by a guest is set
before the guest boots.
RW: the field is Read/Write. For example, the name of a VM.
Types
The following types are used to specify methods and fields in the API Reference:
- string: Text strings.
- int: 64-bit integers.
- float: IEEE double-precision floating-point numbers.
- bool: Boolean.
- datetime: Date and timestamp.
- c ref: Reference to an object of class c.
- t set: Arbitrary-length set of values of type t.
- (k -> v) map: Mapping from values of type k to values of type v.
- e enum: Enumeration type with name e. Enums are defined in the API reference together with classes that use them.
Note that there are a number of cases where ref
s are doubly linked.
For example, a VM
has a field called VIFs
of type VIF ref set
;
this field lists the network interfaces attached to a particular VM.
Similarly, the VIF
class has a field called VM
of type VM ref
which references the VM to which the interface is connected.
These two fields are bound together, in the sense that
creating a new VIF causes the VIFs
field of the corresponding
VM object to be updated automatically.
The API reference lists explicitly the fields that are
bound together in this way. It also contains a diagram that shows
relationships between classes. In this diagram an edge signifies the
existence of a pair of fields that are bound together, using standard
crows-foot notation to signify the type of relationship (e.g.
one-many, many-many).
RPCs associated with fields
Each field, f, has an RPC accessor associated with it that returns f’s value:
- get_f (r): takes a ref, r, that refers to an object and returns the value of f.

Each field, f, with qualifier RW and whose outermost type is set has the following additional RPCs associated with it:
- add_f(r, v): adds a new element v to the set. Note that sets cannot contain duplicate values, hence this operation has no action in the case that v is already in the set.
- remove_f(r, v): removes element v from the set.

Each field, f, with qualifier RW and whose outermost type is map has the following additional RPCs associated with it:
- add_to_f(r, k, v): adds new pair k -> v to the mapping stored in f in object r. Attempting to add a new pair for duplicate key, k, fails with a MAP_DUPLICATE_KEY error.
- remove_from_f(r, k): removes the pair with key k from the mapping stored in f in object r.

Each field whose outermost type is neither set nor map, but whose qualifier is RW has an RPC accessor associated with it that sets its value:
- set_f(r, v): sets the field f on object r to value v.
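For example, using the Python bindings (the XenAPI module; host name and credentials below are placeholders), the field accessors for a VM’s name_label and other_config look like this:

```python
import XenAPI  # Python bindings for the XenAPI

session = XenAPI.Session("https://host.example.com")  # placeholder host
session.xenapi.login_with_password("user", "password")
try:
    vm = session.xenapi.VM.get_by_name_label("my-vm")[0]
    print(session.xenapi.VM.get_name_label(vm))              # get_f
    session.xenapi.VM.set_name_label(vm, "renamed-vm")       # set_f
    session.xenapi.VM.add_to_other_config(vm, "note", "x")   # add_to_f
    session.xenapi.VM.remove_from_other_config(vm, "note")   # remove_from_f
finally:
    session.xenapi.session.logout()
```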
RPCs associated with classes
Most classes have a constructor RPC named create
that
takes as parameters all fields marked RW and RO/constructor. The result
of this RPC is that a new persistent object is created on the server-side
with the specified field values.
Each class has a get_by_uuid(uuid)
RPC that returns the object
of that class that has the specified uuid
.
Each class that has a name_label
field has a
get_by_name_label(name_label)
RPC that returns a set of objects of that
class that have the specified name_label
.
Most classes have a destroy(r)
RPC that explicitly deletes
the persistent object specified by r
from the system. This is a
non-cascading delete - if the object being removed is referenced by another
object then the destroy
call will fail.
Apart from the RPCs enumerated above, most classes have additional RPCs
associated with them. For example, the VM
class has RPCs for cloning,
suspending, starting etc. Such additional RPCs are described explicitly
in the API reference.
Wire Protocol
API calls are sent over a network to a Xen-enabled host using an RPC protocol.
Here we describe how the higher-level types used in our API Reference are mapped
to primitive RPC types, covering the two supported wire formats
XML-RPC and JSON-RPC.
XML-RPC Protocol
We specify the signatures of API functions in the following style:
(VM ref set) VM.get_all()
This specifies that the function with name VM.get_all
takes
no parameters and returns a set
of VM ref
.
These types are mapped onto XML-RPC types in a straight-forward manner:
the types float
, bool
, datetime
, and string
map directly to the XML-RPC
<double>
, <boolean>
, <dateTime.iso8601>
, and <string>
elements.
all ref
types are opaque references, encoded as the
XML-RPC’s <string>
type. Users of the API should not make assumptions
about the concrete form of these strings and should not expect them to
remain valid after the client’s session with the server has terminated.
fields named uuid
of type string
are mapped to
the XML-RPC <string>
type. The string itself is the OSF
DCE UUID presentation format (as output by uuidgen
).
int
is assumed to be 64-bit in our API and is encoded as a string
of decimal digits (rather than using XML-RPC’s built-in 32-bit <i4>
type).
values of enum
types are encoded as strings. For example, the value
destroy
of enum on_normal_exit
, would be conveyed as:
<value><string>destroy</string></value>
- for all our types,
t
, our type t set
simply maps to XML-RPC’s <array>
type, so, for example, a value of type string set
would be transmitted like
this:
<array>
<data>
<value><string>CX8</string></value>
<value><string>PSE36</string></value>
<value><string>FPU</string></value>
</data>
</array>
- for types
k
and v
, our type (k -> v) map
maps onto an
XML-RPC <struct>
, with the key as the name of the struct. Note that the
(k -> v) map
type is only valid when k
is a string
, ref
, or
int
, and in each case the keys of the maps are stringified as
above. For example, the (string -> float) map
containing the mappings
Mike -> 2.3 and John -> 1.2 would be represented as:
<value>
<struct>
<member>
<name>Mike</name>
<value><double>2.3</double></value>
</member>
<member>
<name>John</name>
<value><double>1.2</double></value>
</member>
</struct>
</value>
- our
void
type is transmitted as an empty string.
XML-RPC Return Values and Status Codes
The return value of an RPC call is an XML-RPC <struct>
.
- The first element of the struct is named
Status
; it contains a string value
indicating whether the result of the call was a Success
or a Failure
.
If the Status
is Success
then the struct contains a second element named
Value
:
- The element of the struct named
Value
contains the function’s return value.
If the Status
is Failure
then the struct contains a second element named
ErrorDescription
:
- The element of the struct named
ErrorDescription
contains an array of string
values. The first element of the array is an error code; the rest of the
elements are strings representing error parameters relating to that code.
For example, an XML-RPC return value from the host.get_resident_VMs
function
may look like this:
<struct>
<member>
<name>Status</name>
<value>Success</value>
</member>
<member>
<name>Value</name>
<value>
<array>
<data>
<value>81547a35-205c-a551-c577-00b982c5fe00</value>
<value>61c85a22-05da-b8a2-2e55-06b0847da503</value>
<value>1d401ec4-3c17-35a6-fc79-cee6bd9811fe</value>
</data>
</array>
</value>
</member>
</struct>
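As an illustration (not part of the specification; host name and credentials are placeholders, and TLS certificate handling is omitted), the same call can be made from Python with the standard xmlrpc.client module, unpacking the Status/Value/ErrorDescription struct described above:

```python
import xmlrpc.client

proxy = xmlrpc.client.ServerProxy("https://host.example.com")  # placeholder host
# login_with_password itself returns a Status/Value struct.
session_ref = proxy.session.login_with_password("user", "password")["Value"]
response = proxy.host.get_resident_VMs(session_ref)
if response["Status"] == "Success":
    print(response["Value"])
else:
    print("Error:", response["ErrorDescription"])
```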
JSON-RPC Protocol
We specify the signatures of API functions in the following style:
(VM ref set) VM.get_all()
This specifies that the function with name VM.get_all
takes no parameters and
returns a set
of VM ref
. These types are mapped onto JSON-RPC types in the
following manner:
the types float
and bool
map directly to the JSON types number
and
boolean
, while datetime
and string
are represented as the JSON string
type.
all ref
types are opaque references, encoded as the JSON string
type.
Users of the API should not make assumptions about the concrete form of these
strings and should not expect them to remain valid after the client’s session
with the server has terminated.
fields named uuid
of type string
are mapped to the JSON string
type. The
string itself is the OSF DCE UUID presentation format (as output by uuidgen
).
int
is assumed to be 64-bit in our API and is encoded as a JSON number
without decimal point or exponent, preserved as a string.
values of enum
types are encoded as the JSON string
type. For example, the
value destroy
of enum on_normal_exit
, would be conveyed as:
"destroy"
- for all our types,
t
, our type t set
simply maps to the JSON array
type, so, for example, a value of type string set
would be transmitted like
this:
[ "CX8", "PSE36", "FPU" ]
- for types
k
and v
, our type (k -> v) map
maps onto a JSON object which
contains members with name k
and value v
. Note that the
(k -> v) map
type is only valid when k
is a string
, ref
, or
int
, and in each case the keys of the maps are stringified as
above. For example, the (string -> float) map
containing the mappings
Mike -> 2.3 and John -> 1.2 would be represented as:
{
"Mike": 2.3,
"John": 1.2
}
- our
void
type is transmitted as an empty string.
Both versions 1.0 and 2.0 of the JSON-RPC wire format are recognised and,
depending on your client library, you can use either of them.
JSON-RPC v1.0
JSON-RPC v1.0 Requests
An API call is represented by sending a single JSON object to the server, which
contains the members method
, params
, and id
.
method
: A JSON string
containing the name of the function to be invoked.
params
: A JSON array
of values, which represents the parameters of the
function to be invoked.
id
: A JSON string
or integer
representing the call id. Note that,
diverging from the JSON-RPC v1.0 specification, the API does not accept
notification requests (requests without responses), i.e. the id cannot be
null
.
For example, the body of a JSON-RPC v1.0 request to retrieve the resident VMs of
a host may look like this:
{
"method": "host.get_resident_VMs",
"params": [
"OpaqueRef:74f1a19cd-b660-41e3-a163-10f03e0eae67",
"OpaqueRef:08c34fc9-f418-4f09-8274-b9cb25cd8550"
],
"id": "xyz"
}
In the above example, the first element of the params
array is the reference
of the open session to the host, while the second is the host reference.
JSON-RPC v1.0 Return Values
The return value of a JSON-RPC v1.0 call is a single JSON object containing
the members result
, error
, and id
.
result
: If the call is successful, it is a JSON value (string
, array
etc.) representing the return value of the invoked function. If an error has
occurred, it is null
.
error
: If the call is successful, it is null
. If the call has failed, it is
a JSON array
of string
values. The first element of the array is an error
code; the remainder of the array are strings representing error parameters
relating to that code.
id
: The call id. It is a JSON string
or integer
and it is the same id
as the request it is responding to.
For example, a JSON-RPC v1.0 return value from the host.get_resident_VMs
function may look like this:
{
"result": [
"OpaqueRef:604f51e7-630f-4412-83fa-b11c6cf008ab",
"OpaqueRef:670d08f5-cbeb-4336-8420-ccd56390a65f"
],
"error": null,
"id": "xyz"
}
while the return value of the same call made on a logged out session may look
like this:
{
"result": null,
"error": [
"SESSION_INVALID",
"OpaqueRef:93f1a23cd-a640-41e3-b163-10f86e0eae67"
],
"id": "xyz"
}
JSON-RPC v2.0
JSON-RPC v2.0 Requests
An API call is represented by sending a single JSON object to the server, which
contains the members jsonrpc
, method
, params
, and id
.
jsonrpc
: A JSON string
specifying the version of the JSON-RPC protocol. It
is exactly “2.0”.
method
: A JSON string
containing the name of the function to be invoked.
params
: A JSON array
of values, which represents the parameters of the
function to be invoked. Although the JSON-RPC v2.0 specification allows this
member to be omitted, in practice all API calls accept at least one parameter.
id
: A JSON string
or integer
representing the call id. Note that,
diverging from the JSON-RPC v2.0 specification, it cannot be null. Neither can
it be omitted, because the API does not accept notification requests
(requests without responses).
For example, the body of a JSON-RPC v2.0 request to retrieve the VMs resident on
a host may look like this:
{
"jsonrpc": "2.0",
"method": "host.get_resident_VMs",
"params": [
"OpaqueRef:c90cd28f-37ec-4dbf-88e6-f697ccb28b39",
"OpaqueRef:08c34fc9-f418-4f09-8274-b9cb25cd8550"
],
"id": 3
}
As before, the first element of the params
array is the reference
of the open session to the host, while the second is the host reference.
JSON-RPC v2.0 Return Values
The return value of a JSON-RPC v2.0 call is a single JSON object containing the
members jsonrpc
, either result
or error
depending on the outcome of the
call, and id
.
jsonrpc
: A JSON string
specifying the version of the JSON-RPC protocol. It
is exactly “2.0”.
result
: If the call is successful, it is a JSON value (string
, array
etc.)
representing the return value of the invoked function. If an error has
occurred, it does not exist.
error
: If the call is successful, it does not exist. If the call has failed,
it is a single structured JSON object (see below).
id
: The call id. It is a JSON string
or integer
and it is the same id
as the request it is responding to.
The error
object contains the members code
, message
, and data
.
code
: The API does not make use of this member and only retains it for
compliance with the JSON-RPC v2.0 specification. It is a JSON integer
which has a non-zero value.
message
: A JSON string
representing an API error code.
data
: A JSON array of string
values representing error parameters
relating to the aforementioned API error code.
For example, a JSON-RPC v2.0 return value from the host.get_resident_VMs
function may look like this:
{
"jsonrpc": "2.0",
"result": [
"OpaqueRef:604f51e7-630f-4412-83fa-b11c6cf008ab",
"OpaqueRef:670d08f5-cbeb-4336-8420-ccd56390a65f"
],
"id": 3
}
while the return value of the same call made on a logged out session may look
like this:
{
"jsonrpc": "2.0",
"error": {
"code": 1,
"message": "SESSION_INVALID",
"data": [
"OpaqueRef:c90cd28f-37ec-4dbf-88e6-f697ccb28b39"
]
},
"id": 3
}
Errors
When a low-level transport error occurs, or a request is malformed at the HTTP
or RPC level, the server may send an HTTP 500 error response, or the client
may simulate the same. The client must be prepared to handle these errors,
though they may be treated as fatal.
For example, the following malformed request when using the XML-RPC protocol:
$ curl -D - -X POST https://server -H 'Content-Type: application/xml' \
-d '<?xml version="1.0"?>
<methodCall>
<methodName>session.logout</methodName>
</methodCall>'
results in the following response:
HTTP/1.1 500 Internal Error
content-length: 297
content-type:text/html
connection:close
cache-control:no-cache, no-store
<html><body><h1>HTTP 500 internal server error</h1>An unexpected error occurred;
please wait a while and try again. If the problem persists, please contact your
support representative.<h1> Additional information </h1>Xmlrpc.Parse_error("close_tag", "open_tag", _)</body></html>
When using the JSON-RPC protocol:
$ curl -D - -X POST https://server/jsonrpc -H 'Content-Type: application/json' \
-d '{
"jsonrpc": "2.0",
"method": "session.login_with_password",
"id": 0
}'
the response is:
HTTP/1.1 500 Internal Error
content-length: 308
content-type:text/html
connection:close
cache-control:no-cache, no-store
<html><body><h1>HTTP 500 internal server error</h1>An unexpected error occurred;
please wait a while and try again. If the problem persists, please contact your
support representative.<h1> Additional information </h1>Jsonrpc.Malformed_method_request("{jsonrpc=...,method=...,id=...}")</body></html>
All other failures are reported with a more structured error response, to
allow better automatic response to failures, proper internationalization of
any error message, and easier debugging.
On the wire, these are transmitted like this when using the XML-RPC protocol:
<struct>
<member>
<name>Status</name>
<value>Failure</value>
</member>
<member>
<name>ErrorDescription</name>
<value>
<array>
<data>
<value>MAP_DUPLICATE_KEY</value>
<value>Customer</value>
<value>eSpiel Inc.</value>
<value>eSpiel Incorporated</value>
</data>
</array>
</value>
</member>
</struct>
Note that the ErrorDescription
value is an array of string values. The
first element of the array is an error code; the remainder of the array are
strings representing error parameters relating to that code. In this case,
the client has attempted to add the mapping Customer ->
eSpiel Incorporated to a Map, but it already contains the mapping
Customer -> eSpiel Inc., hence the request has failed.
When using the JSON-RPC protocol v2.0, the above error is transmitted as:
{
"jsonrpc": "2.0",
"error": {
"code": 1,
"message": "MAP_DUPLICATE_KEY",
"data": [
"Customer",
"eSpiel Inc.",
"eSpiel Incorporated"
]
},
"id": 3
}
Finally, when using the JSON-RPC protocol v1.0:
{
"result": null,
"error": [
"MAP_DUPLICATE_KEY",
"Customer",
"eSpiel Inc.",
"eSpiel Incorporated"
],
"id": "xyz"
}
Each possible error code is documented in the last section of the API reference.
Note on References vs UUIDs
References are opaque types - encoded as XML-RPC and JSON-RPC strings on the
wire - understood only by the particular server which generated them. Servers
are free to choose any concrete representation they find convenient; clients
should not make any assumptions or attempt to parse the string contents.
References are not guaranteed to be permanent identifiers for objects; clients
should not assume that references generated during one session are valid for any
future session. References do not allow objects to be compared for equality. Two
references to the same object are not guaranteed to be textually identical.
UUIDs are intended to be permanent identifiers for objects. They are
guaranteed to be in the OSF DCE UUID presentation format (as output by uuidgen
).
Clients may store UUIDs on disk and use them to look up objects in subsequent sessions
with the server. Clients may also test equality on objects by comparing UUID strings.
The API provides mechanisms for translating between UUIDs and opaque references.
Each class that contains a UUID field provides:
- A get_by_uuid
method that takes a UUID and returns an opaque reference
to the server-side object that has that UUID;
- A get_uuid
function (a regular “field getter” RPC) that takes an opaque reference
and returns the UUID of the server-side object that is referenced by it.
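For instance, using the Python bindings described in the “Using the API” chapter later in this document (where session is a logged-in session object, and the UUID below is purely illustrative), the translation works in both directions:
vm_ref = session.xenapi.VM.get_by_uuid("81547a35-205c-a551-c577-00b982c5fe00")
vm_uuid = session.xenapi.VM.get_uuid(vm_ref)   # round-trips back to the original UUID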
Making RPC Calls
Transport Layer
The following transport layers are currently supported:
- HTTP/HTTPS for remote administration
- HTTP over Unix domain sockets for local administration
Session Layer
The RPC interface is session-based; before you can make arbitrary RPC calls
you must login and initiate a session. For example:
(session ref) session.login_with_password(string uname, string pwd,
string version, string originator)
where uname
and pwd
refer to your username and password, as defined by
the Xen administrator, while version
and originator
are optional. The
session ref
returned by session.login_with_password
is passed
to subsequent RPC calls as an authentication token. Note that a session
reference obtained by a login request to the XML-RPC backend can be used in
subsequent requests to the JSON-RPC backend, and vice-versa.
A session can be terminated with the session.logout
function:
void session.logout(session ref session_id)
Synchronous and Asynchronous Invocation
Each method call (apart from methods on the Session
and Task
objects and
“getters” and “setters” derived from fields) can be made either synchronously or
asynchronously. A synchronous RPC call blocks until the
return value is received; the return value of a synchronous RPC call is
exactly as specified above.
Only synchronous API calls are listed explicitly in this document.
All their asynchronous counterparts are in the special Async
namespace.
For example, the synchronous call VM.clone(...)
has an asynchronous
counterpart, Async.VM.clone(...)
, that is non-blocking.
Instead of returning its result directly, an asynchronous RPC call
returns an identifier of type task ref
which is subsequently used
to track the status of a running asynchronous RPC.
Note that an asynchronous call may fail immediately, before a task has even been
created. When using the XML-RPC wire protocol, this eventuality is represented
by wrapping the returned task ref
in an XML-RPC struct with Status
,
ErrorDescription
, and Value
fields, exactly as specified above; the
task ref
is provided in the Value
field if Status
is set to Success
.
When using the JSON-RPC protocol, the task ref
is wrapped in a response JSON
object as specified above and it is provided by the value of the result
member
of a successful call.
The RPC call
(task ref set) Task.get_all(session ref session_id)
returns a set of all task identifiers known to the system. The status (including any
returned result and error codes) of these can then be queried by accessing the
fields of the Task
object in the usual way. Note that, in order to get a
consistent snapshot of a task’s state, it is advisable to call the get_record
function.
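As a brief sketch of the asynchronous pattern, using the Python bindings described later in this document (the session and vm references are assumed to exist, and the field names follow the Task class in the API reference):
task = session.xenapi.Async.VM.clone(vm, "my clone")   # returns a task ref immediately
record = session.xenapi.task.get_record(task)          # consistent snapshot of the task state
print(record['status'], record['progress'])            # e.g. 'pending' and a value in 0-1
if record['status'] == 'success':
    session.xenapi.task.destroy(task)                  # clean up completed tasks once read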
Example interactive session
This section describes how an interactive session might look, using python
XML-RPC and JSON-RPC client libraries.
First, initialise python:
Using the XML-RPC Protocol
Import the library xmlrpc.client
and create a
python object referencing the remote server as shown below:
>>> import xmlrpc.client
>>> xen = xmlrpc.client.ServerProxy("https://localhost:443")
Note that you may need to disable SSL certificate validation to establish the
connection; this can be done as follows:
>>> import ssl
>>> ctx = ssl._create_unverified_context()
>>> xen = xmlrpc.client.ServerProxy("https://localhost:443", context=ctx)
Acquire a session reference by logging in with a username and password; the
session reference is returned under the key Value
in the resulting dictionary
(error-handling omitted for brevity):
>>> session = xen.session.login_with_password("user", "passwd",
... "version", "originator")['Value']
This is what the call looks like when serialized:
<?xml version='1.0'?>
<methodCall>
<methodName>session.login_with_password</methodName>
<params>
<param><value><string>user</string></value></param>
<param><value><string>passwd</string></value></param>
<param><value><string>version</string></value></param>
<param><value><string>originator</string></value></param>
</params>
</methodCall>
Next, the user may acquire a list of all the VMs known to the system (note the
call takes the session reference as the only parameter):
>>> all_vms = xen.VM.get_all(session)['Value']
>>> all_vms
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]
The VM references here have the form OpaqueRef:X
(though they may not be
that simple in reality) and you should treat them as opaque strings.
Templates are VMs with the is_a_template
field set to true
. We can
find the subset of template VMs using a command like the following:
>>> all_templates = list(filter(lambda x: xen.VM.get_is_a_template(session, x)['Value'],
...                             all_vms))
Once a reference to a VM has been acquired, a lifecycle operation may be invoked:
>>> xen.VM.start(session, all_templates[0], False, False)
{'Status': 'Failure', 'ErrorDescription': ['VM_IS_TEMPLATE', 'OpaqueRef:X']}
In this case the start
message has been rejected, because the VM is
a template, and so an error response has been returned. These high-level
errors are returned as structured data (rather than as XML-RPC faults),
allowing them to be internationalized.
Rather than querying fields individually, whole records may be returned at once.
To retrieve the record of a single object as a python dictionary:
>>> record = xen.VM.get_record(session, all_templates[0])['Value']
>>> record['power_state']
'Halted'
>>> record['name_label']
'Windows 10 (64-bit)'
To retrieve all the VM records in a single call:
>>> records = xen.VM.get_all_records(session)['Value']
>>> list(records.keys())
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]
>>> records['OpaqueRef:1']['name_label']
'Red Hat Enterprise Linux 7'
Using the JSON-RPC Protocol
For this example we are making use of the package jsonrpcclient
and the
requests
library due to their simplicity, although other packages can also be
used.
First, import the requests
and jsonrpcclient
libraries:
>>> import requests
>>> import jsonrpcclient
Now we construct a utility method to make using these libraries easier:
>>> def jsonrpccall(method, params):
... r = requests.post("https://localhost:443/jsonrpc",
... json=jsonrpcclient.request(method, params=params),
... verify=False)
... p = jsonrpcclient.parse(r.json())
... if isinstance(p, jsonrpcclient.Ok):
... return p.result
... raise Exception(p.message, p.data)
Acquire a session reference by logging in with a username and password:
>>> session = jsonrpccall("session.login_with_password",
... ("user", "password", "version", "originator"))
jsonrpcclient
uses the JSON-RPC protocol v2.0, so this is what the serialized
request looks like:
{
"jsonrpc": "2.0",
"method": "session.login_with_password",
"params": ["user", "passwd", "version", "originator"],
"id": 0
}
Next, the user may acquire a list of all the VMs known to the system (note the
call takes the session reference as the only parameter):
>>> all_vms = jsonrpccall("VM.get_all", (session,))
>>> all_vms
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]
The VM references here have the form OpaqueRef:X
(though they may not be
that simple in reality) and you should treat them as opaque strings.
Templates are VMs with the is_a_template
field set to true
. We can
find the subset of template VMs using a command like the following:
>>> all_templates = filter(
... lambda x: jsonrpccall("VM.get_is_a_template", (session, x)),
... all_vms)
Once a reference to a VM has been acquired, a lifecycle operation may be invoked:
>>> try:
... jsonrpccall("VM.start", (session, next(all_templates), False, False))
... except Exception as e:
... e
...
Exception('VM_IS_TEMPLATE', ['OpaqueRef:1', 'start'])
In this case the start
message has been rejected because the VM is
a template, hence an error response has been returned. These high-level
errors are returned as structured data, allowing them to be internationalized.
Rather than querying fields individually, whole records may be returned at once.
To retrieve the record of a single object as a python dictionary:
>>> record = jsonrpccall("VM.get_record", (session, next(all_templates)))
>>> record['power_state']
'Halted'
>>> record['name_label']
'Windows 10 (64-bit)'
To retrieve all the VM records in a single call:
>>> records = jsonrpccall("VM.get_all_records", (session,))
>>> list(records.keys())
['OpaqueRef:1', 'OpaqueRef:2', 'OpaqueRef:3', 'OpaqueRef:4' ]
>>> records['OpaqueRef:1']['name_label']
'Red Hat Enterprise Linux 7'
Overview of the XenAPI
This chapter introduces the XenAPI and its associated object model. The API has the following key features:
Management of all aspects of the XenServer Host.
The API allows you to manage VMs, storage, networking, host configuration and pools. Performance and status metrics can also be queried from the API.
Persistent Object Model.
The results of all side-effecting operations (e.g. object creation, deletion and parameter modifications) are persisted in a server-side database that is managed by the XenServer installation.
An event mechanism.
Through the API, clients can register to be notified when persistent (server-side) objects are modified. This enables applications to keep track of datamodel modifications performed by concurrently executing clients.
Synchronous and asynchronous invocation.
All API calls can be invoked synchronously (that is, block until completion); any API call that may be long-running can also be invoked asynchronously. Asynchronous calls return immediately with a reference to a task object. This task object can be queried (through the API) for progress and status information. When an asynchronously invoked operation completes, the result (or error code) is available from the task object.
Remotable and Cross-Platform.
The client issuing the API calls does not have to be resident on the host being managed; nor does it have to be connected to the host over ssh in order to execute the API. API calls make use of the XML-RPC protocol to transmit requests and responses over the network.
Secure and Authenticated Access.
The XML-RPC API server executing on the host accepts secure socket connections. This allows a client to execute the APIs over the https protocol. Further, all the API calls execute in the context of a login session generated through username and password validation at the server. This provides secure and authenticated access to the XenServer installation.
Getting Started with the API
We will start our tour of the API by describing the calls required to create a new VM on a XenServer installation, and take it through a start/suspend/resume/stop cycle. This is done without reference to code in any specific language; at this stage we just describe the informal sequence of RPC invocations that accomplish our “install and start” task.
Authentication: acquiring a session reference
The first step is to call Session.login_with_password(uname, pwd, version, originator)
. The API is session based, so before you can make other calls you will need to authenticate with the server. Assuming the username and password are authenticated correctly, the result of this call is a session reference. Subsequent API calls take the session reference as a parameter. In this way we ensure that only API users who are suitably authorized can perform operations on a XenServer installation. You can continue to use the same session for any number of API calls. When you have finished with the session, Citrix recommends that you call Session.logout(session)
to clean up: see later.
Acquiring a list of templates to base a new VM installation on
The next step is to query the list of “templates” on the host. Templates are specially-marked VM objects that specify suitable default parameters for a variety of supported guest types. (If you want to see a quick enumeration of the templates on a XenServer installation for yourself then you can execute the xe template-list
CLI command.) To get a list of templates from the API, we need to find the VM objects on the server that have their is_a_template
field set to true. One way to do this is by calling VM.get_all_records(session)
where the session parameter is the reference we acquired from our Session.login_with_password
call earlier. This call queries the server, returning a snapshot (taken at the time of the call) containing all the VM object references and their field values.
(Remember that at this stage we are not concerned about the particular mechanisms by which the returned object references and field values can be manipulated in any particular client language: that detail is dealt with by our language-specific API bindings and described concretely in the following chapter. For now it suffices just to assume the existence of an abstract mechanism for reading and manipulating objects and field values returned by API calls.)
Now that we have a snapshot of all the VM objects’ field values in the memory of our client application we can simply iterate through them and find the ones that have their “is_a_template
” set to true. At this stage let’s assume that our example application further iterates through the template objects and remembers the reference corresponding to the one that has its “name_label
” set to “Debian Etch 4.0” (one of the default Linux templates supplied with XenServer).
Installing the VM based on a template
Continuing through our example, we must now install a new VM based on the template we selected. The installation process requires 4 API calls:
First, we invoke the API call VM.clone(session, t_ref, "my first VM")
. This tells the server to clone the VM object referenced by t_ref
in order to make a new VM object. The return value of this call is the VM reference corresponding to the newly-created VM. Let’s call this new_vm_ref
.
Next, we need to specify the UUID of the Storage Repository where the VM’s
disks will be instantiated. We have to put this in the sr
attribute in
the disk provisioning XML stored under the “disks
” key in the
other_config
map of the newly-created VM. This field can be updated by
calling its getter (other_config <- VM.get_other_config(session, new_vm_ref)
) and then its setter (VM.set_other_config(session, new_vm_ref, other_config)
) with the modified other_config
map.
At this stage the object referred to by new_vm_ref
is still a template (just like the VM object referred to by t_ref
, from which it was cloned). To make new_vm_ref
into a VM object we need to call VM.provision(session, new_vm_ref)
. When this call returns the new_vm_ref
object will have had its is_a_template
field set to false, indicating that new_vm_ref
now refers to a regular VM ready for starting.
Note
The provision operation may take a few minutes, as it is during this call that the template’s disk images are created. In the case of the Debian template, the newly created disks are also at this stage populated with a Debian root filesystem.
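Although this walk-through is deliberately language-independent, the following minimal Python sketch (using the bindings covered in the “Using the API” chapter below) may make the four-call sequence concrete. The session, the chosen template reference t_ref and the target SR UUID sr_uuid are assumed to exist already, and the string substitution into the provisioning XML is an assumption that holds for the standard Debian template but may differ for other templates:
new_vm_ref = session.xenapi.VM.clone(t_ref, "my first VM")

# Point the disk-provisioning XML (stored under the 'disks' key) at our SR.
other_config = session.xenapi.VM.get_other_config(new_vm_ref)
other_config['disks'] = other_config['disks'].replace('sr=""', 'sr="%s"' % sr_uuid)
session.xenapi.VM.set_other_config(new_vm_ref, other_config)

# Turn the cloned template into a real VM; this creates its disk images.
session.xenapi.VM.provision(new_vm_ref)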
Taking the VM through a start/suspend/resume/stop cycle
Now that we have an object reference representing our newly-installed VM, it is trivial to take it through a few lifecycle operations:
To start our VM we can just call VM.start(session, new_vm_ref)
After it’s running, we can suspend it by calling VM.suspend(session, new_vm_ref)
,
and then resume it by calling VM.resume(session, new_vm_ref)
.
We can call VM.shutdown(session, new_vm_ref)
to shut down the VM cleanly.
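Expressed with the Python bindings, and assuming the session and new_vm_ref from the previous steps, the cycle is simply:
session.xenapi.VM.start(new_vm_ref, False, False)    # start_paused=False, force=False
session.xenapi.VM.suspend(new_vm_ref)
session.xenapi.VM.resume(new_vm_ref, False, False)   # start_paused=False, force=False
session.xenapi.VM.shutdown(new_vm_ref)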
Logging out
Once an application is finished interacting with a XenServer Host it is good practice to call Session.logout(session)
. This invalidates the session reference (so it cannot be used in subsequent API calls) and simultaneously deallocates server-side memory used to store the session object.
Although inactive sessions will eventually time out, the server has a hardcoded limit of 500 concurrent sessions for each username
or originator
. Once this limit has been reached, fresh logins will evict the session objects that have been used least recently, causing their associated session references to become invalid. For successful interoperability with other applications concurrently accessing the server, the best policy is:
Choose a string that identifies your application and its version.
Create a single session at start-of-day, using that identifying string for the originator
parameter to Session.login_with_password
.
Use this session throughout the application (note that sessions can be used across multiple separate client-server network connections) and then explicitly logout when possible.
If a poorly written client leaks sessions or otherwise exceeds the limit, then as long as the client uses an appropriate originator
argument, it will be easily identifiable from the XenServer logs and XenServer will destroy the longest-idle sessions of the rogue client only; this may cause problems for that client but not for other clients. If the misbehaving client did not specify an originator
, it would be harder to identify and would cause the premature destruction of sessions of any clients that also did not specify an originator.
Install and start example: summary
We have seen how the API can be used to install a VM from a XenServer template and perform a number of lifecycle operations on it. You will note that the number of calls we had to make in order to effect these operations was small:
One call to acquire a session: Session.login_with_password()
One call to query the VM (and template) objects present on the XenServer installation: VM.get_all_records()
. Recall that we used the information returned from this call to select a suitable template to install from.
Four calls to install a VM from our chosen template: VM.clone()
, followed
by the getter and setter of the other_config
field to specify where to
create the disk images of the template, and then VM.provision()
.
One call to start the resultant VM: VM.start()
(and similarly other single calls to suspend, resume and shutdown accordingly)
And then one call to log out: Session.logout()
The take-home message here is that, although the API as a whole is complex and fully featured, common tasks (such as creating and performing lifecycle operations on VMs) are very straightforward to perform, requiring only a small number of simple API calls. Keep this in mind while you study the next section which may, on first reading, appear a little daunting!
Object Model Overview
This section gives a high-level overview of the object model of the API. A more detailed description of the parameters and methods of each class outlined here can be found in the XenServer API Reference document.
We start by giving a brief outline of some of the core classes that make up the API. (Don’t worry if these definitions seem somewhat abstract in their initial presentation; the textual description in subsequent sections, and the code-sample walk through in the next Chapter will help make these concepts concrete.)
Class | Description |
---|---|
VM | A VM object represents a particular virtual machine instance on a XenServer Host or Resource Pool. Example methods include start , suspend , pool_migrate ; example parameters include power_state , memory_static_max , and name_label . (In the previous section we saw how the VM class is used to represent both templates and regular VMs) |
Host | A host object represents a physical host in a XenServer pool. Example methods include reboot and shutdown . Example parameters include software_version , hostname , and [IP] address . |
VDI | A VDI object represents a Virtual Disk Image. Virtual Disk Images can be attached to VMs, in which case a block device appears inside the VM through which the bits encapsulated by the Virtual Disk Image can be read and written. Example methods of the VDI class include “resize” and “clone”. Example fields include “virtual_size” and “sharable”. (When we called VM.provision on the VM template in our previous example, some VDI objects were automatically created to represent the newly created disks, and attached to the VM object.) |
SR | An SR (Storage Repository) aggregates a collection of VDIs and encapsulates the properties of physical storage on which the VDIs’ bits reside. Example parameters include type (which determines the storage-specific driver a XenServer installation uses to read/write the SR’s VDIs) and physical_utilisation ; example methods include scan (which invokes the storage-specific driver to acquire a list of the VDIs contained with the SR and the properties of these VDIs) and create (which initializes a block of physical storage so it is ready to store VDIs). |
Network | A network object represents a layer-2 network that exists in the environment in which the XenServer Host instance lives. Since XenServer does not manage networks directly this is a lightweight class that serves merely to model physical and virtual network topology. VM and Host objects that are attached to a particular Network object (by virtue of VIF and PIF instances – see below) can send network packets to each other. |
At this point, readers who are finding this enumeration of classes rather terse may wish to skip to the code walk-throughs of the next chapter: there are plenty of useful applications that can be written using only a subset of the classes already described! For those who wish to continue this description of classes in the abstract, read on.
On top of the classes listed above, there are 4 more that act as connectors, specifying relationships between VMs and Hosts, and Storage and Networks. The first 2 of these classes that we will consider, VBD and VIF, determine how VMs are attached to virtual disks and network objects respectively:
Class | Description |
---|---|
VBD | A VBD (Virtual Block Device) object represents an attachment between a VM and a VDI. When a VM is booted its VBD objects are queried to determine which disk images (VDIs) should be attached. Example methods of the VBD class include “plug” (which hot plugs a disk device into a running VM, making the specified VDI accessible therein) and “unplug” (which hot unplugs a disk device from a running guest); example fields include “device” (which determines the device name inside the guest under which the specified VDI will be made accessible). |
VIF | A VIF (Virtual network InterFace) object represents an attachment between a VM and a Network object. When a VM is booted its VIF objects are queried to determine which network devices should be created. Example methods of the VIF class include “plug” (which hot plugs a network device into a running VM) and “unplug” (which hot unplugs a network device from a running guest). |
The second set of “connector classes” that we will consider determine how Hosts are attached to Networks and Storage.
Class | Description |
---|---|
PIF | A PIF (Physical InterFace) object represents an attachment between a Host and a Network object. If a host is connected to a Network (over a PIF) then packets from the specified host can be transmitted/received by the corresponding host. Example fields of the PIF class include “device” (which specifies the device name to which the PIF corresponds – e.g. eth0) and “MAC” (which specifies the MAC address of the underlying NIC that a PIF represents). Note that PIFs abstract both physical interfaces and VLANs (the latter distinguished by the existence of a positive integer in the “VLAN” field). |
PBD | A PBD (Physical Block Device) object represents an attachment between a Host and an SR (Storage Repository) object. Fields include “currently-attached” (which specifies whether the chunk of storage represented by the specified SR object is currently available to the host) and “device_config” (which specifies storage-driver-specific parameters that determine how the low-level storage devices are configured on the specified host – e.g. in the case of an SR rendered on an NFS filer, device_config may specify the host-name of the filer and the path on the filer in which the SR files live). |
The figure above presents a graphical overview of the API classes involved in managing VMs, Hosts, Storage and Networking. From this diagram, the symmetry between storage and network configuration, and also the symmetry between virtual machine and host configuration is plain to see.
Working with VIFs and VBDs
In this section we walk through a few more complex scenarios, describing informally how various tasks involving virtual storage and network devices can be accomplished using the API.
Creating disks and attaching them to VMs
Let’s start by considering how to make a new blank disk image and attach it to a running VM. We will assume that we already have ourselves a running VM, and we know its corresponding API object reference (e.g. we may have created this VM using the procedure described in the previous section, and had the server return its reference to us.) We will also assume that we have authenticated with the XenServer installation and have a corresponding session reference
. Indeed in the rest of this chapter, for the sake of brevity, we will stop mentioning sessions altogether.
Creating a new blank disk image
The first step is to instantiate the disk image on physical storage. We do this by calling VDI.create()
. The VDI.create
call takes a number of parameters, including:
name_label
and name_description
: a human-readable name/description for the disk (e.g. for convenient display in the UI etc.). These fields can be left blank if desired.
SR
: the object reference of the Storage Repository representing the physical storage in which the VDI’s bits will be placed.
read_only
: setting this field to true indicates that the VDI can only be attached to VMs in a read-only fashion. (Attempting to attach a VDI with its read_only
field set to true in a read/write fashion results in error.)
Invoking the VDI.create
call causes the XenServer installation to create a blank disk image on physical storage, create an associated VDI object (the datamodel instance that refers to the disk image on physical storage) and return a reference to this newly created VDI object.
The way in which the disk image is represented on physical storage depends on the type of the SR in which the created VDI resides. For example, if the SR is of type “lvm” then the new disk image will be rendered as an LVM volume; if the SR is of type “nfs” then the new disk image will be a sparse VHD file created on an NFS filer. (You can query the SR type through the API using the SR.get_type()
call.)
Note
Some SR types might round up the virtual-size
value to make it divisible by a configured block size.
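As an illustrative sketch (using the Python bindings covered later in this document), a VDI.create call might look like the following. The record lists the fields discussed above plus a few common defaults, but the authoritative set of required fields is given by the VDI class in the API reference, and your API version may require others:
vdi_ref = session.xenapi.VDI.create({
    'name_label': 'my new disk',
    'name_description': 'created through the API',
    'SR': sr_ref,                                 # reference of the target Storage Repository
    'virtual_size': str(8 * 1024 * 1024 * 1024),  # 8 GiB; int fields travel as strings
    'type': 'user',
    'sharable': False,
    'read_only': False,
    'other_config': {},
})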
Attaching the disk image to a VM
So far we have a running VM (that we assumed the existence of at the start of this example) and a fresh VDI that we just created. Right now, these are both independent objects that exist on the XenServer Host, but there is nothing linking them together. So our next step is to create such a link, associating the VDI with our VM.
The attachment is formed by creating a new “connector” object called a VBD (Virtual Block Device). To create our VBD we invoke the VBD.create()
call. The VBD.create()
call takes a number of parameters including:
VM
- the object reference of the VM to which the VDI is to be attached
VDI
- the object reference of the VDI that is to be attached
mode
- specifies whether the VDI is to be attached in a read-only or a read-write fashion
userdevice
- specifies the block device inside the guest through which applications running inside the VM will be able to read/write the VDI’s bits.
type
- specifies whether the VDI should be presented inside the VM as a regular disk or as a CD. (Note that this particular field has more meaning for Windows VMs than it does for Linux VMs, but we will not explore this level of detail in this chapter.)
Invoking VBD.create
makes a VBD object on the XenServer installation and returns its object reference. However, this call in itself does not have any side-effects on the running VM (that is, if you go and look inside the running VM you will see that the block device has not been created). The fact that the VBD object exists but that the block device in the guest is not active, is reflected by the fact that the VBD object’s currently_attached
field is set to false.
For expository purposes, the figure above presents a graphical example that shows the relationship between VMs, VBDs, VDIs and SRs. In this instance a VM object has 2 attached VDIs: there are 2 VBD objects that form the connections between the VM object and its VDIs; and the VDIs reside within the same SR.
Hotplugging the VBD
If we rebooted the VM at this stage then, after rebooting, the block device corresponding to the VBD would appear: on boot, XenServer queries all VBDs of a VM and actively attaches each of the corresponding VDIs.
Rebooting the VM is all very well, but recall that we wanted to attach a newly created blank disk to a running VM. This can be achieved by invoking the plug
method on the newly created VBD object. When the plug
call returns successfully, the block device to which the VBD relates will have appeared inside the running VM – i.e. from the perspective of the running VM, the guest operating system is led to believe that a new disk device has just been hot plugged. Mirroring this fact in the managed world of the API, the currently_attached
field of the VBD is set to true.
Unsurprisingly, the VBD plug
method has a dual called “unplug
”. Invoking the unplug
method on a VBD object causes the associated block device to be hot unplugged from a running VM, setting the currently_attached
field of the VBD object to false accordingly.
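Putting the last few steps together as a hedged sketch with the Python bindings (vm_ref and vdi_ref are assumed to exist already; some API versions may require additional record fields beyond those shown):
vbd_ref = session.xenapi.VBD.create({
    'VM': vm_ref,
    'VDI': vdi_ref,
    'userdevice': '1',            # position of the block device inside the guest
    'mode': 'RW',
    'type': 'Disk',
    'bootable': False,
    'empty': False,
    'other_config': {},
    'qos_algorithm_type': '',
    'qos_algorithm_params': {},
})
session.xenapi.VBD.plug(vbd_ref)      # hot plug: currently_attached becomes true
# ...and later, to hot unplug the device from the running VM again:
session.xenapi.VBD.unplug(vbd_ref)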
Creating and attaching Network Devices to VMs
The API calls involved in configuring virtual network interfaces in VMs are similar in many respects to the calls involved in configuring virtual disk devices. For this reason we will not run through a full example of how one can create network interfaces using the API object-model; instead we will use this section just to outline briefly the symmetry between virtual networking device and virtual storage device configuration.
The networking analogue of the VBD class is the VIF class. Just as a VBD is the API representation of a block device inside a VM, a VIF (Virtual network InterFace) is the API representation of a network device inside a VM. Whereas VBDs associate VM objects with VDI objects, VIFs associate VM objects with Network objects. Just like VBDs, VIFs have a currently_attached
field that determines whether or not the network device (inside the guest) associated with the VIF is currently active or not. And as we saw with VBDs, at VM boot-time the VIFs of the VM are queried and a corresponding network device for each created inside the booting VM. Similarly, VIFs also have plug
and unplug
methods for hot plugging/unplugging network devices in/out of running VMs.
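By analogy with the VBD example, a sketch of creating and hot plugging a VIF might look like this (network_ref is assumed to exist; passing an empty MAC to let the server generate one is an assumption worth checking against the API reference for your version):
vif_ref = session.xenapi.VIF.create({
    'VM': vm_ref,
    'network': network_ref,
    'device': '0',                # becomes eth0 inside the guest
    'MAC': '',                    # empty string: let the server choose a MAC (assumption)
    'MTU': '1500',
    'other_config': {},
    'qos_algorithm_type': '',
    'qos_algorithm_params': {},
})
session.xenapi.VIF.plug(vif_ref)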
Host configuration for networking and storage
We have seen that the VBD and VIF classes are used to manage configuration of block devices and network devices (respectively) inside VMs. To manage host configuration of storage and networking there are two analogous classes: PBD (Physical Block Device) and PIF (Physical [network] InterFace).
Host storage configuration: PBDs
Let us start by considering the PBD class. A PBD.create()
call takes a number of parameters including:
Parameter | Description |
---|---|
host | physical machine on which the PBD is available |
SR | the Storage Repository that the PBD connects to |
device_config | a string-to-string map that is provided to the host’s SR-backend-driver, containing the low-level parameters required to configure the physical storage device(s) on which the SR is to be realized. The specific contents of the device_config field depend on the type of the SR to which the PBD is connected. (Executing xe sm-list will show a list of possible SR types; the configuration field in this enumeration specifies the device_config parameters that each SR type expects.) |
For example, imagine we have an SR object s of type “nfs” (representing a directory on an NFS filer within which VDIs are stored as VHD files); and let’s say that we want a host, h, to be able to access s. In this case we invoke PBD.create()
specifying host h, SR s, and a value for the device_config parameter that is the following map:
("server", "my_nfs_server.example.com"), ("serverpath", "/scratch/mysrs/sr1")
This tells the XenServer Host that SR s is accessible on host h, and further that to access SR s, the host needs to mount the directory /scratch/mysrs/sr1
on the NFS server named my_nfs_server.example.com
.
Like VBD objects, PBD objects also have a field called currently_attached
. Storage repositories can be attached and detached from a given host by invoking PBD.plug
and PBD.unplug
methods respectively.
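The NFS example above might be expressed as the following sketch with the Python bindings (host_ref and sr_ref are assumed to exist; the server name and path are the illustrative values used above):
pbd_ref = session.xenapi.PBD.create({
    'host': host_ref,
    'SR': sr_ref,
    'device_config': {
        'server': 'my_nfs_server.example.com',
        'serverpath': '/scratch/mysrs/sr1',
    },
    'other_config': {},
})
session.xenapi.PBD.plug(pbd_ref)      # attach SR s on host h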
Host networking configuration: PIFs
Host network configuration is specified by virtue of PIF objects. If a PIF object connects a network object, n, to a host object h, then the network corresponding to n is bridged onto a physical interface (or a physical interface plus a VLAN tag) specified by the fields of the PIF object.
For example, imagine a PIF object exists connecting host h to a network n, and that the device
field of the PIF object is set to eth0
. This means that all packets on network n are bridged to the NIC in the host corresponding to host network device eth0
.
XML-RPC notes
Datetimes
The API deviates from the XML-RPC specification in handling of datetimes. The API appends a “Z” to the end of datetime strings, which is meant to indicate that the time is expressed in UTC.
API evolution
All APIs evolve as bugs are fixed, new features are added, and features are
removed; the XenAPI is no exception. This document lists policies describing how the
XenAPI evolves over time.
The goals of XenAPI evolution are:
- to allow bugs to be fixed efficiently;
- to allow new, innovative features to be added easily;
- to keep old, unmodified clients working as much as possible; and
- where backwards-incompatible changes are to be made, publish this
information early to enable affected parties to give timely feedback.
Background
In this document, the term XenAPI refers to the XMLRPC-derived wire protocol
used by xapi. The XenAPI has objects which each have fields and
messages. The XenAPI is described in detail elsewhere.
XenAPI Lifecycle
graph LR
Prototype -->|1| Published -->|4| Deprecated -->|5| Removed
Published -->|2,3| Published
Each element of the XenAPI (objects, messages and fields) follows the lifecycle
diagram above. When an element is newly created and still in development,
it is in the Prototype state. Elements in this state may be stubs: the
interface is there and can be used by clients for prototyping their new
features, but the actual implementation is not yet ready.
When the element subsequently becomes ready for use (the stub is replaced by a
real implementation), it transitions to the Published state. This is the only
state in which the object, message or field should be used. From this point
onwards, the element needs to have clearly defined semantics that are available
for reference in the XenAPI documentation.
If the XenAPI element becomes Deprecated, it will still function as it did
before, but its use is discouraged. The final stage of the lifecycle is the
Removed state, in which the element is not available anymore.
The numbered state changes in the diagram have the following meaning:
- Publish: declare that the XenAPI element is ready for people to use.
- Extend: a backwards-compatible extension of the XenAPI, for example an
additional parameter in a message with an appropriate default value. If the
API is used as before, it still has the same effect.
- Change: a backwards-incompatible change. That is, the message now behaves
differently, or the field has different semantics. Such changes are
discouraged and should only be considered in special cases (always consider
whether deprecation is a better solution). The use of a message can for
example be restricted for security or efficiency reasons, or the behaviour
can be changed simply to fix a bug.
- Deprecate: declare that the use of this XenAPI element should be avoided from
now on. Reasons for doing this include: the element is redundant (it
duplicates functionality elsewhere), it is inconsistent with other parts of
the XenAPI, it is insecure or inefficient (for examples of deprecation
policies of other projects, see symbian, eclipse, and oval).
- Remove: the element is taken out of the public API and can no longer be used.
Each lifecycle transition must be accompanied by an explanation describing the
change and the reason for the change. This message should be enough to
understand the semantics of the XenAPI element after the change, and in the case
of backwards-incompatible changes or deprecation, it should give directions
about how to modify a client to deal with the change (for example, how to avoid
using the deprecated field or message).
Releases
Every release must be accompanied by release notes listing all objects, fields
and messages that are newly prototyped, published, extended, changed, deprecated
or removed in the release. Each item should have an explanation as implied
above, documenting the new or changed XenAPI element. The release notes for
every release shall be prominently displayed in the XenAPI HTML documentation.
Documentation
The XenAPI documentation will contain its complete lifecycle history for each
XenAPI element. Only the elements described in the documentation are
“official” and supported.
Each object, message and field in datamodel.ml
will have lifecycle
metadata attached to it, which is a list of transitions (transition type *
release * explanation string) as described above. Release notes are automatically generated from this data.
Using the API
This chapter describes how to use the XenServer Management API from real programs to manage XenServer Hosts and VMs. The chapter begins with a walk-through of a typical client application and demonstrates how the API can be used to perform common tasks. Example code fragments are given in python syntax but equivalent code in the other programming languages would look very similar. The language bindings themselves are discussed afterwards and the chapter finishes with walk-throughs of two complete examples.
Anatomy of a typical application
This section describes the structure of a typical application using the XenServer Management API. Most client applications begin by connecting to a XenServer Host and authenticating (e.g. with a username and password). Assuming the authentication succeeds, the server will create a “session” object and return a reference to the client. This reference will be passed as an argument to all future API calls. Once authenticated, the client may search for references to other useful objects (e.g. XenServer Hosts, VMs, etc.) and invoke operations on them. Operations may be invoked either synchronously or asynchronously; special task objects represent the state and progress of asynchronous operations. These application elements are all described in detail in the following sections.
Choosing a low-level transport
API calls can be issued over two transports: an SSL-encrypted TCP connection (HTTPS) or a local Unix domain socket.
The SSL-encrypted TCP transport is used for all off-host traffic while the Unix domain socket can be used from services running directly on the XenServer Host itself. In the SSL-encrypted TCP transport, all API calls should be directed at the Resource Pool master; failure to do so will result in the error HOST_IS_SLAVE
, which includes the IP address of the master as an error parameter.
Because the master host of a pool can change, especially if HA is enabled on a pool, clients must implement the following steps to detect a master host change and connect to the new master as required (a sketch of this logic follows the note below):
Subscribe to updates to the list of host servers, and maintain a current list of hosts in the pool
If the pool master fails to respond, attempt to connect to all hosts in the list until one responds
The first host to respond will return the HOST_IS_SLAVE
error message, which contains the identity of the new pool master (unless of course the host is the new master)
Connect to the new master
Note
As a special-case, all messages sent through the Unix domain socket are transparently forwarded to the correct node.
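The following sketch shows one way to implement the reconnection steps above using the Python bindings; the list of known host addresses and the credentials are assumed to be maintained by the client, and the version and originator strings mirror the login example later in this chapter:
import XenAPI

def connect_to_master(known_hosts, username, password):
    for address in known_hosts:
        try:
            session = XenAPI.Session("https://" + address)
            session.xenapi.login_with_password(username, password,
                                               "2.3", "My Widget v0.1")
            return session                      # login accepted: this host is the master
        except XenAPI.Failure as e:
            if e.details[0] == "HOST_IS_SLAVE":
                # The error parameter identifies the current master; log in there instead.
                session = XenAPI.Session("https://" + e.details[1])
                session.xenapi.login_with_password(username, password,
                                                   "2.3", "My Widget v0.1")
                return session
            raise
        except Exception:
            continue                            # host did not respond; try the next one
    raise RuntimeError("no pool member responded")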
Authentication and session handling
The vast majority of API calls take a session reference as their first parameter; failure to supply a valid reference will result in a SESSION_INVALID
error being returned. Acquire a session reference by supplying a username and password to the login_with_password
function.
Note
As a special-case, if this call is executed over the local Unix domain socket then the username and password are ignored and the call always succeeds.
Every session has an associated “last active” timestamp which is updated on every API call. The server software currently has a built-in limit of 500 active sessions and will remove those with the oldest “last active” field if this limit is exceeded for a given username
or originator
. In addition all sessions whose “last active” field is older than 24 hours are also removed. Therefore it is important to:
Specify an appropriate originator
when logging in; and
Remember to log out of active sessions to avoid leaking them; and
Be prepared to log in again to the server if a SESSION_INVALID
error is caught.
In the following Python fragment a connection is established over the Unix domain socket and a session is created:
import XenAPI
session = XenAPI.xapi_local()
try:
    session.xenapi.login_with_password("root", "", "2.3", "My Widget v0.1")
    ...
finally:
    session.xenapi.session.logout()
Finding references to useful objects
Once an application has authenticated the next step is to acquire references to objects in order to query their state or invoke operations on them. All objects have a set of “implicit” messages which include the following:
get_by_name_label
: return a list of all objects of a particular class with a particular label;
get_by_uuid
: return a single object named by its UUID;
get_all
: return a set of references to all objects of a particular class; and
get_all_records
: return a map of reference to records for each object of a particular class.
For example, to list all hosts:
hosts = session.xenapi.host.get_all()
To find all VMs with the name “my first VM”:
vms = session.xenapi.VM.get_by_name_label('my first VM')
Note
Object name_label
fields are not guaranteed to be unique and so the get_by_name_label
API call returns a set of references rather than a single reference.
In addition to the methods of finding objects described above, most objects also contain references to other objects within fields. For example it is possible to find the set of VMs running on a particular host by calling:
vms = session.xenapi.host.get_resident_VMs(host)
Invoking synchronous operations on objects
Once object references have been acquired, operations may be invoked on them. For example to start a VM:
session.xenapi.VM.start(vm, False, False)
All API calls are by default synchronous and will not return until the operation has completed or failed. For example in the case of VM.start
the call does not return until the VM has started booting.
Note
When the VM.start
call returns the VM will be booting. To determine when the booting has finished, wait for the in-guest agent to report internal statistics through the VM_guest_metrics
object.
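One way to act on this note, as a sketch under the assumption that a populated os_version map indicates the in-guest agent has reported in (the comparison against "OpaqueRef:NULL" is also an assumption about the server's encoding of null references):
import time

session.xenapi.VM.start(vm, False, False)
while True:
    gm = session.xenapi.VM.get_guest_metrics(vm)
    if gm != "OpaqueRef:NULL" and session.xenapi.VM_guest_metrics.get_os_version(gm):
        break                      # the guest agent has reported statistics
    time.sleep(1)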
Using Tasks to manage asynchronous operations
To simplify managing operations which take quite a long time (e.g. VM.clone
and VM.copy
) functions are available in two forms: synchronous (the default) and asynchronous. Each asynchronous function returns a reference to a task object which contains information about the in-progress operation including:
whether it is pending
whether it has succeeded or failed
progress (in the range 0-1)
the result or error code returned by the operation
An application which wanted to track the progress of a VM.clone
operation and display a progress bar would have code like the following:
vm = session.xenapi.VM.get_by_name_label('my vm')[0]
task = session.xenapi.Async.VM.clone(vm, 'my vm clone')
while session.xenapi.task.get_status(task) == "pending":
    progress = session.xenapi.task.get_progress(task)
    update_progress_bar(progress)
    time.sleep(1)
session.xenapi.task.destroy(task)
Note
Note that a well-behaved client should remember to delete tasks created by asynchronous operations when it has finished reading the result or error. If the number of tasks exceeds a built-in threshold then the server will delete the oldest of the completed tasks.
Subscribing to and listening for events
With the exception of the task and metrics classes, whenever an object is modified the server generates an event. Clients can subscribe to this event stream on a per-class basis and receive updates rather than resorting to frequent polling. Events come in three types:
add
- generated when an object has been created;
del
- generated immediately before an object is destroyed; and
mod
- generated when an object’s field has changed.
Events also contain a monotonically increasing ID, the name of the class of object and a snapshot of the object state equivalent to the result of a get_record()
.
Clients register for events by calling event.register()
with a list of class names or the special string “*”. Clients receive events by executing event.next()
which blocks until events are available and returns the new events.
Note
Since the queue of generated events on the server is of finite length a very slow client might fail to read the events fast enough; if this happens an EVENTS_LOST
error is returned. Clients should be prepared to handle this by re-registering for events and checking that the condition they are waiting for hasn’t become true while they were unregistered.
The following python code fragment demonstrates how to print a summary of every event generated by a system: (similar code exists in Xenserver-SDK/XenServerPython/samples/watch-all-events.py
)
fmt = "%8s %20s %5s %s"
session.xenapi.event.register(["*"])
while True:
    try:
        for event in session.xenapi.event.next():
            name = "(unknown)"
            if "snapshot" in event.keys():
                snapshot = event["snapshot"]
                if "name_label" in snapshot.keys():
                    name = snapshot["name_label"]
            print(fmt % (event['id'], event['class'], event['operation'], name))
    except XenAPI.Failure as e:
        if e.details == ["EVENTS_LOST"]:
            print("Caught EVENTS_LOST; should reregister")
Language bindings
C
The SDK includes the source to the C language binding in the directory XenServer-SDK/libxenserver/src
together with a Makefile which compiles the binding into a library. Every API object is associated with a header file which contains declarations for all that object’s API functions; for example the type definitions and functions required to invoke VM operations are all contained in xen_vm.h
.
C binding dependencies
The following simple examples are included with the C bindings:
test_vm_async_migrate
: demonstrates how to use asynchronous API calls to migrate running VMs from a slave host to the pool master.
test_vm_ops
: demonstrates how to query the capabilities of a host, create a VM, attach a fresh blank disk image to the VM and then perform various powercycle operations;
test_failures
: demonstrates how to translate error strings into enum_xen_api_failure, and vice versa;
test_event_handling
: demonstrates how to listen for events on a connection.
test_enumerate
: demonstrates how to enumerate the various API objects.
C#
The C# bindings are contained within the directory XenServer-SDK/XenServer.NET
and include project files suitable for building under Microsoft Visual Studio. Every API object is associated with one C# file; for example the functions implementing the VM operations are contained within the file VM.cs
.
C# binding dependencies
Three examples are included with the C# bindings in the directory XenServer-SDK/XenServer.NET/samples
as separate projects of the XenSdkSample.sln
solution:
GetVariousRecords
: logs into a XenServer Host and displays information about hosts, storage and virtual machines;
GetVmRecords
: logs into a XenServer Host and lists all the VM records;
VmPowerStates
: logs into a XenServer Host, finds a VM and takes it through the various power states. Requires a shut-down VM to be already installed.
Java
The Java bindings are contained within the directory XenServer-SDK/XenServerJava
. Every API object is associated with one Java file; for example the functions implementing the VM operations are contained within the file VM.java
.
Java binding dependencies
Running the main file XenServer-SDK/XenServerJava/samples/RunTests.java
will run a series of examples included in the same directory:
AddNetwork
: Adds a new internal network not attached to any NICs;
SessionReuse
: Demonstrates how a Session object can be shared between multiple Connections;
AsyncVMCreate
: Makes asynchronously a new VM from a built-in template, starts and stops it;
VdiAndSrOps
: Performs various SR and VDI tests, including creating a dummy SR;
CreateVM
: Creates a VM on the default SR with a network and DVD drive;
DeprecatedMethod
: Tests a warning is displayed when a deprecated API method is called;
GetAllRecordsOfAllTypes
: Retrieves all the records for all types of objects;
SharedStorage
: Creates a shared NFS SR;
StartAllVMs
: Connects to a host and tries to start each VM on it.
PowerShell
The PowerShell bindings are contained within the directory XenServer-SDK/XenServerPowerShell
. We provide the PowerShell module XenServerPSModule
and source code exposing the XenServer API as Windows PowerShell cmdlets.
PowerShell binding dependencies
These example scripts are included with the PowerShell bindings in the directory XenServer-SDK/XenServerPowerShell/samples
:
AutomatedTestCore.ps1
: demonstrates how to log into a XenServer host, create a storage repository and a VM, and then perform various powercycle operations;
HttpTest.ps1
: demonstrates how to log into a XenServer host, create a VM, and then perform operations such as VM importing and exporting, patch upload, and retrieval of performance statistics.
Python
The python bindings are contained within a single file: XenServer-SDK/XenServerPython/XenAPI.py
.
Python binding dependencies
|:--|:--|
|Platform supported:|Linux|
|Library:|XenAPI.py|
|Dependencies:|None|
The SDK includes 7 python examples:
fixpbds.py
- reconfigures the settings used to access shared storage;
install.py
- installs a Debian VM, connects it to a network, starts it up and waits for it to report its IP address;
license.py
- uploads a fresh license to a XenServer Host;
permute.py
- selects a set of VMs and uses XenMotion to move them simultaneously between hosts;
powercycle.py
- selects a set of VMs and powercycles them;
shell.py
- a simple interactive shell for testing;
vm_start_async.py
- demonstrates how to invoke operations asynchronously;
watch-all-events.py
- registers for all events and prints details when they occur.
Command Line Interface (CLI)
Besides using raw XML-RPC or one of the supplied language bindings, third-party software developers may integrate with XenServer Hosts by using the XE command line interface xe
. The xe CLI is installed by default on XenServer hosts; a stand-alone remote CLI is also available for Linux. On Windows, the xe.exe
CLI executable is installed along with XenCenter.
CLI dependencies
|:--|:--|
|Platform supported:|Linux and Windows|
|Library:|None|
|Binary:|xe (xe.exe on Windows)|
|Dependencies:|None|
The CLI allows almost every API call to be directly invoked from a script or other program, silently taking care of the required session management.
The XE CLI syntax and capabilities are described in detail in the XenServer Administrator’s Guide. For additional resources and examples, visit the Citrix Knowledge Center.
Note
When running the CLI from a XenServer Host console, tab-completion of both command names and arguments is available.
Complete application examples
This section describes two complete examples of real programs using the API.
Simultaneously migrating VMs using XenMotion
This python example (contained in XenServer-SDK/XenServerPython/samples/permute.py
) demonstrates how to use XenMotion to move VMs simultaneously between hosts in a Resource Pool. The example makes use of asynchronous API calls and shows how to wait for a set of tasks to complete.
The program begins with some standard boilerplate and imports the API bindings module
import sys, time
import XenAPI
Next the commandline arguments containing a server URL, username, password and a number of iterations are parsed. The username and password are used to establish a session which is passed to the function main
, which is called multiple times in a loop. Note the use of try: finally:
to make sure the program logs out of its session at the end.
if __name__ == "__main__":
if len(sys.argv) <> 5:
print "Usage:"
print sys.argv[0], " <url> <username> <password> <iterations>"
sys.exit(1)
url = sys.argv[1]
username = sys.argv[2]
password = sys.argv[3]
iterations = int(sys.argv[4])
# First acquire a valid session by logging in:
session = XenAPI.Session(url)
session.xenapi.login_with_password(username, password, "2.3",
"Example migration-demo v0.1")
try:
for i in range(iterations):
main(session, i)
finally:
session.xenapi.session.logout()
The main
function examines each running VM in the system, taking care to filter out control domains (which are part of the system and not controllable by the user). A list of running VMs and their current hosts is constructed.
def main(session, iteration):
# Find a non-template VM object
all = session.xenapi.VM.get_all()
vms = []
hosts = []
for vm in all:
record = session.xenapi.VM.get_record(vm)
if not(record["is_a_template"]) and \
not(record["is_control_domain"]) and \
record["power_state"] == "Running":
vms.append(vm)
hosts.append(record["resident_on"])
print "%d: Found %d suitable running VMs" % (iteration, len(vms))
Next the list of hosts is rotated:
# use a rotation as a permutation
hosts = [hosts[-1]] + hosts[:(len(hosts)-1)]
Each VM is then moved using XenMotion to the new host under this rotation (i.e. a VM running on host at position 2 in the list will be moved to the host at position 1 in the list etc.) In order to execute each of the movements in parallel, the asynchronous version of the VM.pool_migrate
is used and a list of task references constructed. Note the live
flag passed to the VM.pool_migrate
; this causes the VMs to be moved while they are still running.
tasks = []
for i in range(0, len(vms)):
vm = vms[i]
host = hosts[i]
task = session.xenapi.Async.VM.pool_migrate(vm, host, { "live": "true" })
tasks.append(task)
The list of tasks is then polled for completion:
finished = False
records = {}
while not(finished):
finished = True
for task in tasks:
record = session.xenapi.task.get_record(task)
records[task] = record
if record["status"] == "pending":
finished = False
time.sleep(1)
Once all tasks have left the pending state (i.e. they have successfully completed, failed or been cancelled) the tasks are polled once more to see if they all succeeded:
allok = True
for task in tasks:
record = records[task]
if record["status"] <> "success":
allok = False
If any one of the tasks failed then details are printed, an exception is raised and the task objects left around for further inspection. If all tasks succeeded then the task objects are destroyed and the function returns.
if not(allok):
print "One of the tasks didn't succeed at", \
time.strftime("%F:%HT%M:%SZ", time.gmtime())
idx = 0
for task in tasks:
record = records[task]
vm_name = session.xenapi.VM.get_name_label(vms[idx])
host_name = session.xenapi.host.get_name_label(hosts[idx])
print "%s : %12s %s -> %s [ status: %s; result = %s; error = %s ]" % \
(record["uuid"], record["name_label"], vm_name, host_name, \
record["status"], record["result"], repr(record["error_info"]))
idx = idx + 1
raise "Task failed"
else:
for task in tasks:
session.xenapi.task.destroy(task)
Cloning a VM using the XE CLI
This example is a bash
script which uses the XE CLI to clone a VM taking care to shut it down first if it is powered on.
The example begins with some boilerplate which first checks if the environment variable XE
has been set: if it has it assumes that it points to the full path of the CLI, else it is assumed that the XE CLI is on the current path. Next the script prompts the user for a server name, username and password:
# Allow the path to the 'xe' binary to be overridden by the XE environment variable
if [ -z "${XE}" ]; then
XE=xe
fi
if [ ! -e "${HOME}/.xe" ]; then
read -p "Server name: " SERVER
read -p "Username: " USERNAME
read -p "Password: " PASSWORD
XE="${XE} -s ${SERVER} -u ${USERNAME} -pw ${PASSWORD}"
fi
Next the script checks its commandline arguments. It requires exactly one: the UUID of the VM which is to be cloned:
if [ $# -ne 1 ]; then
  echo "usage: $0 <vm uuid>"
  exit 1
fi
vmuuid=$1
# Check if there's a VM by the uuid specified
${XE} vm-list params=uuid | grep -q " ${vmuuid}$"
if [ $? -ne 0 ]; then
echo "error: no vm uuid \"${vmuuid}\" found"
exit 2
fi
The script then checks the power state of the VM and if it is running, it attempts a clean shutdown. The event system is used to wait for the VM to enter state “Halted”.
Note
The XE CLI supports a command-line argument --minimal
which causes it to print its output without excess whitespace or formatting, ideal for use from scripts. If multiple values are returned they are comma-separated.
# Check the power state of the vm
name=$(${XE} vm-list uuid=${vmuuid} params=name-label --minimal)
state=$(${XE} vm-list uuid=${vmuuid} params=power-state --minimal)
wasrunning=0
# If the VM state is running, we shutdown the vm first
if [ "${state}" = "running" ]; then
${XE} vm-shutdown uuid=${vmuuid}
${XE} event-wait class=vm power-state=halted uuid=${vmuuid}
wasrunning=1
fi
The VM is then cloned and the new VM has its name_label
set to cloned_vm
.
# Clone the VM
newuuid=$(${XE} vm-clone uuid=${vmuuid} new-name-label=cloned_vm)
Finally, if the original VM had been running and was shutdown, both it and the new VM are started.
# If the VM state was running before cloning, we start it again
# along with the new VM.
if [ "$wasrunning" -eq 1 ]; then
${XE} vm-start uuid=${vmuuid}
${XE} vm-start uuid=${newuuid}
fi
XenAPI Reference
XenAPI Classes
Click on a class to view the associated fields and messages.
Classes, Fields and Messages
Classes have both fields and messages. Messages are either implicit or explicit where an implicit message is one of:
- a constructor (usually called "create");
- a destructor (usually called "destroy");
- "get_by_name_label";
- "get_by_uuid";
- "get_record";
- "get_all"; and
- "get_all_records".
Explicit messages include all the rest, more class-specific messages (e.g. "VM.start", "VM.clone")
Every field has at least one accessor depending both on its type and whether it is read-only or read-write. Accessors for a field named "X" would be a proper subset of:
- set_X: change the value of field X (only if it is read-write);
- get_X: retrieve the value of field X;
- add_X: add a key/value pair (for fields of type set);
- remove_X: remove a key (for fields of type set);
- add_to_X: add a key/value pair (for fields of type map); and
- remove_from_X: remove a key (for fields of type map).
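As an illustrative (non-normative) example, the accessors for the VM class appear in the Python bindings as follows, assuming an authenticated session and a VM reference vm:
print(session.xenapi.VM.get_name_label(vm))                # get_X
session.xenapi.VM.set_name_label(vm, "renamed vm")         # set_X (read-write field)
session.xenapi.VM.add_to_other_config(vm, "key", "value")  # add_to_X (map field)
session.xenapi.VM.remove_from_other_config(vm, "key")      # remove_from_X (map field)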
Subsections of XenAPI Reference
auth
blob
Bond
Certificate
Cluster
Cluster_host
console
crashdump
data_source
DR_task
event
Feature
GPU_group
host
host_cpu
host_crashdump
host_metrics
host_patch
LVHD
message
network
network_sriov
Observer
PBD
PCI
PGPU
PIF
PIF_metrics
pool
pool_patch
pool_update
probe_result
PUSB
PVS_cache_storage
PVS_proxy
PVS_server
PVS_site
Repository
role
SDN_controller
secret
session
SM
SR
sr_stat
subject
task
tunnel
USB_group
user
VBD
VBD_metrics
VDI
vdi_nbd_server_info
VGPU
VGPU_type
VIF
VIF_metrics
VLAN
VM
VM_appliance
VM_guest_metrics
VM_metrics
VMPP
VMSS
VTPM
VUSB
XenAPI Releases
Subsections of XenAPI Releases
XAPI 24.16.0
XAPI 24.14.0
XAPI 24.10.0
XAPI 24.3.0
XAPI 24.0.0
XAPI 23.30.0
XAPI 23.27.0
XAPI 23.25.0
XAPI 23.18.0
XAPI 23.14.0
XAPI 23.9.0
XAPI 23.1.0
XAPI 22.37.0
XAPI 22.33.0
XAPI 22.27.0
XAPI 22.26.0
XAPI 22.20.0
XAPI 22.19.0
XAPI 22.16.0
XAPI 22.12.0
XAPI 22.5.0
XAPI 21.4.0
XAPI 21.3.0
XAPI 21.2.0
XAPI 1.329.0
XAPI 1.318.0
XAPI 1.313.0
XAPI 1.307.0
XAPI 1.304.0
XAPI 1.303.0
XAPI 1.301.0
XAPI 1.298.0
XAPI 1.297.0
XAPI 1.294.0
XAPI 1.290.0
XAPI 1.271.0
XAPI 1.257.0
XAPI 1.250.0
XenServer 8 Preview
Citrix Hypervisor 8.2 Hotfix 2
Citrix Hypervisor 8.2
Citrix Hypervisor 8.1
Citrix Hypervisor 8.0
XenServer 7.6
XenServer 7.5
XenServer 7.4
XenServer 7.3
XenServer 7.2
XenServer 7.1
XenServer 7.0
XenServer 6.5 SP1 Hotfix 31
XenServer 6.5 SP1
XenServer 6.5
XenServer 6.2 SP1 Hotfix 11
XenServer 6.2 SP1 Hotfix 4
XenServer 6.2 SP1
XenServer 6.2 SP1 Tech-Preview
XenServer 6.2
XenServer 6.1
XenServer 6.0
XenServer 5.6 FP1
XenServer 5.6
XenServer 5.5
XenServer 5.0 Update 1
XenServer 5.0
XenServer 4.1.1
XenServer 4.1
XenServer 4.0
Topics
Subsections of Topics
API for configuring the udhcp server in Dom0
This API allows you to configure the DHCP service running on the Host
Internal Management Network (HIMN). The API configures a udhcp daemon
residing in Dom0 and alters the service configuration for any VM using
the network.
For this reason, callers who modify the default configuration should be aware that
their changes may have an adverse effect on other consumers of the HIMN.
Version history
Date State
---- ----
2013-3-15 Stable
Stable: this API is considered stable and unlikely to change between
software version and between hotfixes.
API description
The API for configuring the network is based on a series of other_config
keys that can be set by the caller on the HIMN XAPI network object. Once
any of the keys below have been set, the caller must ensure that any VIFs
attached to the HIMN are removed, destroyed, created and plugged.
ip_begin
The first IP address in the desired subnet that the caller wishes the
DHCP service to use.
ip_end
The last IP address in the desired subnet that the caller wishes the
DHCP service to use.
netmask
The subnet mask for each of the issued IP addresses.
ip_disable_gw
A boolean key for disabling the DHCP server from returning a default
gateway for VMs on the network. To disable returning the gateway address
set the key to True.
Note: By default, the DHCP server will issue a default gateway for
those requesting an address. Setting this key may disrupt applications
that require the default gateway for communicating with Dom0 and so
should be used with care.
Example code
An example python extract of setting the config for the network:
def get_himn_ref():
networks = session.xenapi.network.get_all_records()
for ref, rec in networks.iteritems():
if 'is_host_internal_management_network' \
in rec['other_config']:
return ref
raise Exception("Error: unable to find HIMN.")
himn_ref = get_himn_ref()
other_config = session.xenapi.network.get_other_config(himn_ref)
other_config['ip_begin'] = "169.254.0.1"
other_config['ip_end'] = "169.254.255.254"
other_config['netmask'] = "255.255.0.0"
session.xenapi.network.set_other_config(himn_ref, other_config)
An example for how to disable the server returning a default gateway:
himn_ref = get_himn_ref()
other_config = session.xenapi.network.get_other_config(himn_ref)
other_config['ip_disable_gw'] = 'True'  # other_config values must be strings
session.xenapi.network.set_other_config(himn_ref, other_config)
Guest agents
“Guest agents” are special programs which run inside VMs and which can be controlled
via the XenAPI.
One method of communication between XenAPI clients and guest agents is via Xenstore.
Adding Xenstore entries to VMs
Developers may wish to install guest agents into VMs which take special action based on the type of the VM. In order to communicate this information into the guest, a special Xenstore name-space known as vm-data
is available which is populated at VM creation time. It is populated from the xenstore-data
map in the VM record.
Set the xenstore-data
parameter in the VM record:
xe vm-param-set uuid=<vm uuid> xenstore-data:vm-data/foo=bar
Start the VM.
If it is a Linux-based VM, install the guest tools and use the xenstore-read
to verify that the node exists in Xenstore.
Note
Only prefixes beginning with vm-data
are permitted, and anything not in this name-space will be silently ignored when starting the VM.
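The same setting can be made through the API rather than xe; the sketch below (assuming an authenticated session and a halted VM reference vm) updates the VM's xenstore-data map directly:
# Add a key under the vm-data prefix before the VM is started.
session.xenapi.VM.add_to_xenstore_data(vm, "vm-data/foo", "bar")
print(session.xenapi.VM.get_xenstore_data(vm))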
Memory
Memory is used for many things:
- the hypervisor code: this is the Xen executable itself
- the hypervisor heap: this is needed for per-domain structures and per-vCPU
structures
- the crash kernel: this is needed to collect information after a host crash
- domain RAM: this is the memory the VM believes it has
- shadow memory: for HVM guests running on hosts without hardware assisted
paging (HAP) Xen uses shadow to optimise page table updates. For all guests
shadow is used during live migration for tracking the memory transfer.
- video RAM for the virtual graphics card
Some of these are constant (e.g. the hypervisor code) while some depend on the VM
configuration (e.g. domain RAM). Xapi calls the constant part the “host overhead” and
the part that varies with VM configuration the “VM overhead”.
These overheads are subtracted from free memory on the host when starting,
resuming and migrating VMs.
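Both kinds of overhead are exposed through the API; a small sketch (assuming an authenticated session) prints the values xapi has computed:
# "host overhead" and "VM overhead" are reported in bytes.
for host_ref in session.xenapi.host.get_all():
    print("%s host overhead: %s bytes" % (
        session.xenapi.host.get_name_label(host_ref),
        session.xenapi.host.get_memory_overhead(host_ref)))
for vm_ref in session.xenapi.VM.get_all():
    if not session.xenapi.VM.get_is_a_template(vm_ref):
        print("%s VM overhead: %s bytes" % (
            session.xenapi.VM.get_name_label(vm_ref),
            session.xenapi.VM.get_memory_overhead(vm_ref)))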
Metrics
xcp-rrdd
records statistics about the host and the VMs running on top.
The metrics are stored persistently for long-term access and analysis of
historical trends.
Statistics are stored in RRDs (Round Robin
Databases).
RRDs are fixed-size structures that store time series with decreasing time
resolution: the older the data point is, the longer the timespan it represents.
‘Data sources’ are sampled every few seconds and points are added to
the highest resolution RRD. Periodically each high-frequency RRD is
‘consolidated’ (e.g. averaged) to produce a data point for a lower-frequency
RRD.
RRDs are resident on the host on which the VM is running, or the pool
coordinator when the VM is not running.
The RRDs are backed up every day.
Granularity
Statistics are persisted for a maximum of one year, and are stored at
different granularities.
The average and most recent values are stored at intervals of:
- five seconds for the past ten minutes
- one minute for the past two hours
- one hour for the past week
- one day for the past year
RRDs are saved to disk as uncompressed XML. The size of each RRD when
written to disk ranges from 200KiB to approximately 1.2MiB when the RRD
stores the full year of statistics.
By default each RRD contains only averaged data to save storage space.
To record minimum and maximum values in future RRDs, set the Pool-wide flag:
xe pool-param-set uuid=<pool uuid> other-config:create_min_max_in_new_VM_RRDs=true
Downloading
Statistics can be downloaded over HTTP in XML or JSON format, for example
using wget
.
See rrddump and
rrdxport for information
about the XML format.
The JSON format has the same structure as the XML.
Parameters are appended to the URL following a question mark (?) and separated
by ampersands (&).
HTTP authentication can take the form of a username and password or a session
token in a URL parameter.
Statistics may be downloaded all at once, including all history, or as
deltas suitable for interactive graphing.
Downloading statistics all at once
To obtain a full dump of RRD data for a host use:
wget "http://hostname/host_rrd?session_id=OpaqueRef:43df3204-9360-c6ab-923e-41a8d19389ba"
where the session token has been fetched from the server using the API.
For example, using Python’s XenAPI library:
import XenAPI
username = "root"
password = "actual_password"
url = "http://hostname"
session = XenAPI.Session(url)
session.xenapi.login_with_password(username, password, "1.0", "session_getter")
session._session
A URL parameter is used to decide which format to return: XML is returned by
default, adding the parameter json
makes the server return JSON.
Starting from xapi version 23.17.0, the server uses the HTTP header Accept
to decide which format to return.
When both formats are accepted, for example using */*, JSON is returned.
The clients wget and curl send this Accept header value by default, so with these
newer versions their default behaviour changes, and the Accept header needs to be
overridden to make the server return XML.
In these newer versions the content type is also provided in the response headers.
The XML RRD data is in the format used by rrdtool and looks like this:
<?xml version="1.0"?>
<rrd>
<version>0003</version>
<step>5</step>
<lastupdate>1213616574</lastupdate>
<ds>
<name>memory_total_kib</name>
<type>GAUGE</type>
<minimal_heartbeat>300.0000</minimal_heartbeat>
<min>0.0</min>
<max>Infinity</max>
<last_ds>2070172</last_ds>
<value>9631315.6300</value>
<unknown_sec>0</unknown_sec>
</ds>
<ds>
<!-- other dss - the order of the data sources is important
and defines the ordering of the columns in the archives below -->
</ds>
<rra>
<cf>AVERAGE</cf>
<pdp_per_row>1</pdp_per_row>
<params>
<xff>0.5000</xff>
</params>
<cdp_prep> <!-- This is for internal use -->
<ds>
<primary_value>0.0</primary_value>
<secondary_value>0.0</secondary_value>
<value>0.0</value>
<unknown_datapoints>0</unknown_datapoints>
</ds>
...other dss - internal use only...
</cdp_prep>
<database>
<row>
<v>2070172.0000</v> <!-- columns correspond to the DSs defined above -->
<v>1756408.0000</v>
<v>0.0</v>
<v>0.0</v>
<v>732.2130</v>
<v>0.0</v>
<v>782.9186</v>
<v>0.0</v>
<v>647.0431</v>
<v>0.0</v>
<v>0.0001</v>
<v>0.0268</v>
<v>0.0100</v>
<v>0.0</v>
<v>615.1072</v>
</row>
...
</rra>
... other archives ...
</rrd>
To obtain a full dump of RRD data of a VM with uuid x
:
wget "http://hostname/vm_rrd?session_id=<token>&uuid=x"
Note that it is quite expensive to download full RRDs as they contain
lots of historical information. For interactive displays clients should
download deltas instead.
Downloading deltas
To obtain an update of all VM statistics on a host, the URL would be of
the form:
wget "https://hostname/rrd_updates?session_id=<token>&start=<secondsinceepoch>"
This request returns data in an rrdtool xport
style XML format, for every VM
resident on the particular host that is being queried.
To differentiate which column in the export is associated with which VM, the
legend
field is prefixed with the UUID of the VM.
An example rrd_updates
output:
<xport>
<meta>
<start>1213578000</start>
<step>3600</step>
<end>1213617600</end>
<rows>12</rows>
<columns>12</columns>
<legend>
<entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry> <!-- nb - each data source might have multiple entries for different consolidation functions -->
<entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
<entry>AVERAGE:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
<entry>MIN:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry>
<entry>MIN:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
<entry>MIN:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
<entry>MAX:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry>
<entry>MAX:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
<entry>MAX:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
<entry>LAST:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu1</entry>
<entry>LAST:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:cpu0</entry>
<entry>LAST:vm:ecd8d7a0-1be3-4d91-bd0e-4888c0e30ab3:memory</entry>
</legend>
</meta>
<data>
<row>
<t>1213617600</t>
<v>0.0</v> <!-- once again, the order of the columns is defined by the legend above -->
<v>0.0282</v>
<v>209715200.0000</v>
<v>0.0</v>
<v>0.0201</v>
<v>209715200.0000</v>
<v>0.0</v>
<v>0.0445</v>
<v>209715200.0000</v>
<v>0.0</v>
<v>0.0243</v>
<v>209715200.0000</v>
</row>
...
</data>
</xport>
To obtain host updates too, use the query parameter host=true
:
wget "http://hostname/rrd_updates?session_id=<token>&start=<secondssinceepoch>&host=true"
The step will decrease as the period decreases, which means that if you
request statistics for a shorter time period you will get more detailed
statistics.
To download updates containing only the averages, or minimums or maximums,
add the parameter cf=AVERAGE|MIN|MAX
(note case is important) e.g.
wget "http://hostname/rrd_updates?session_id=<token>&start=0&cf=MAX"
To request a different update interval, add the parameter interval=seconds
e.g.
wget "http://hostname/rrd_updates?session_id=<token>&start=0&interval=5"
Snapshots
Snapshots represent the state of a VM, or a disk (VDI) at a point in time.
They can be used for:
- backups (hourly, daily, weekly etc)
- experiments (take snapshot, try something, revert back again)
- golden images (install OS, get it just right, clone it 1000s of times)
Read more about Snapshots: the High-Level Feature.
Taking a VDI snapshot
To take a snapshot of a single disk (VDI):
snapshot_vdi <- VDI.snapshot(session_id, vdi, driver_params)
where vdi
is the reference to the disk to be snapshotted, and driver_params
is a list of string pairs providing optional backend implementation-specific hints.
The snapshot operation should be quick (i.e. it should never be implemented as
a slow disk copy) and the resulting VDI will have
Field name | Description |
---|
is_a_snapshot | a flag, set to true, indicating the disk is a snapshot |
snapshot_of | a reference to the disk the snapshot was created from |
snapshot_time | the time the snapshot was taken |
The resulting snapshot should be considered read-only. Depending on the backend
implementation it may be technically possible to write to the snapshot, but clients
must not do this. To create a writable disk from a snapshot, see “restoring from
a snapshot” below.
Note that the storage backend is free to implement this in different ways. We
do not assume the presence of a .vhd-formatted storage repository. Clients
must never assume anything about the backend implementation without checking
first with the maintainers of the backend implementation.
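In the Python bindings this looks roughly like the following, assuming an authenticated session and a VDI named "my data disk" (the name is a placeholder):
vdi = session.xenapi.VDI.get_by_name_label("my data disk")[0]
snapshot_vdi = session.xenapi.VDI.snapshot(vdi, {})          # empty driver_params
print(session.xenapi.VDI.get_is_a_snapshot(snapshot_vdi))    # True
print(session.xenapi.VDI.get_snapshot_of(snapshot_vdi))      # the original VDI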
Restoring to a VDI snapshot
To restore from a VDI snapshot first
new_vdi <- VDI.clone(session_id, snapshot_vdi, driver_params)
where snapshot_vdi
is a reference to the snapshot VDI, and driver_params
is a list of string pairs providing optional backend implementation-specific hints.
The clone operation should be quick (i.e. it should never be implemented as
a slow disk copy) and the resulting VDI will have
Field name | Description |
---|
is_a_snapshot | a flag, set to false, indicating the disk is not a snapshot |
snapshot_of | an invalid reference |
snapshot_time | an invalid time |
The resulting disk is writable and can be used by the client as normal.
Note that the “restored” VDI will have a different VDI.uuid
and reference to
the original VDI.
Taking a VM snapshot
A VM snapshot is a copy of the VM metadata and a snapshot of all the associated
VDIs at around the same point in time. To take a VM snapshot:
snapshot_vm <- VM.snapshot(session_id, vm, new_name)
where vm
is a reference to the existing VM and new_name
will be the name_label
of the resulting VM (snapshot) object. The resulting VM will have
Field name | Description |
---|
is_a_snapshot | a flag, set to true, indicating the VM is a snapshot |
snapshot_of | a reference to the VM the snapshot was created from |
snapshot_time | the time the snapshot was taken |
Note that the disks are snapshotted one by one, not all at the same instant.
Restoring to a VM snapshot
A VM snapshot can be reverted to a snapshot using
VM.revert(session_id, snapshot_ref)
where snapshot_ref
is a reference to the snapshot VM. Each VDI associated with
the VM before the snapshot will be destroyed and each VDI associated with the
snapshot will be cloned (see “Restoring to a VDI snapshot” above) and associated
with the VM. The resulting VM will have
Field name | Description |
---|
is_a_snapshot | a flag, set to false, indicating the VM is not a snapshot |
snapshot_of | an invalid reference |
snapshot_time | an invalid time |
Note that the VM.uuid
and reference are preserved, but the VDI.uuid
and
VDI references are not.
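Putting the two calls together, a minimal sketch (assuming an authenticated session and a VM named "my vm") is:
vm = session.xenapi.VM.get_by_name_label("my vm")[0]
snapshot = session.xenapi.VM.snapshot(vm, "before-change")
# ... modify the VM and its disks ...
session.xenapi.VM.revert(snapshot)   # the VM keeps its uuid; its VDIs get new ones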
Downloading a disk or snapshot
Disks can be downloaded in either raw or vhd format using an HTTP 1.0 GET
request as follows:
GET /export_raw_vdi?session_id=%s&task_id=%s&vdi=%s&format=%s[&base=%s] HTTP/1.0\r\n
Connection: close\r\n
\r\n
\r\n
where
- session_id is a currently logged-in session
- task_id is a Task reference which will be used to monitor the progress of this task and receive errors from it
- vdi is the reference of the VDI from which the data will be exported
- format is either vhd or raw
- (optional) base is the reference of a VDI which has already been exported; this export will then only contain the blocks which have changed since then.
Note that the vhd format allows the disk to be sparse i.e. only contain allocated
blocks. This helps reduce the size of the download.
The xapi-project/xen-api repo has a
python download example
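A minimal version of such a download, sketched under the assumption that session, hostname and vdi_ref are already available, is:
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

url = "http://%s/export_raw_vdi?session_id=%s&vdi=%s&format=vhd" % (
    hostname, session._session, vdi_ref)
with open("disk.vhd", "wb") as f:
    # Reading the whole body is fine for small disks; stream in chunks for large ones.
    f.write(urlopen(url).read())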
Uploading a disk or snapshot
Disks can be uploaded in either raw or vhd format using an HTTP 1.0 PUT
request as follows:
PUT /import_raw_vdi?session_id=%s&task_id=%s&vdi=%s&format=%s HTTP/1.0\r\n
Connection: close\r\n
\r\n
\r\n
where
- session_id is a currently logged-in session
- task_id is a Task reference which will be used to monitor the progress of this task and receive errors from it
- vdi is the reference of the VDI into which the data will be imported
- format is either vhd or raw
Note that you must create the disk (with the correct size) before importing
data to it. The disk doesn’t have to be empty; in fact, if restoring from a
series of incremental downloads, it makes sense to upload them all to the
same disk in order.
Example: incremental backup with xe
This section will show how easy it is to build an incremental backup
tool using these APIs. For simplicity we will use the xe
commands
rather than raw XMLRPC and HTTP.
For a VDI with uuid $VDI, take a snapshot:
FULL=$(xe vdi-snapshot uuid=$VDI)
Next perform a full backup into a file “full.vhd”, in vhd format:
xe vdi-export uuid=$FULL filename=full.vhd format=vhd --progress
If the SR was using the vhd format internally (this is the default)
then the full backup will be sparse and will only contain blocks that
have been written to.
After some time has passed and the VDI has been written to, take another
snapshot:
DELTA=$(xe vdi-snapshot uuid=$VDI)
Now we can backup only the disk blocks which have changed between the original
snapshot $FULL and the next snapshot $DELTA into a file called “delta.vhd”:
xe vdi-export uuid=$DELTA filename=delta.vhd format=vhd base=$FULL --progress
We now have 2 files on the local system:
- “full.vhd”: a complete backup of the first snapshot
- “delta.vhd”: an incremental backup of the second snapshot, relative to
the first
For example:
test $ ls -lh *.vhd
-rw------- 1 dscott xendev 213M Aug 15 10:39 delta.vhd
-rw------- 1 dscott xendev 8.0G Aug 15 10:39 full.vhd
To restore the original snapshot you must create an empty disk with the
correct size. To find the size of a .vhd file use qemu-img
as follows:
test $ qemu-img info delta.vhd
image: delta.vhd
file format: vpc
virtual size: 24G (25769705472 bytes)
disk size: 212M
Here the size is 25769705472 bytes.
Create a fresh VDI in SR $SR to restore the backup as follows:
SIZE=25769705472
RESTORE=$(xe vdi-create name-label=restored virtual-size=$SIZE sr-uuid=$SR type=user)
then import “full.vhd” into it:
xe vdi-import uuid=$RESTORE filename=full.vhd format=vhd --progress
Once “full.vhd” has been imported, the incremental backup can be restored
on top:
xe vdi-import uuid=$RESTORE filename=delta.vhd format=vhd --progress
Note there is no need to supply a “base” parameter when importing; Xapi will
treat the “vhd differencing disk” as a set of blocks and import them. It
is up to you to check you are importing them to the right place.
Now the VDI $RESTORE should have the same contents as $DELTA.
VM consoles
Most XenAPI graphical interfaces will want to gain access to the VM consoles, in order to render them to the user as if they were physical machines. There are several types of consoles available, depending on the type of guest or whether the physical host console is being accessed:
Types of consoles
Operating System | Text | Graphical | Optimized graphical |
---|
Windows | No | VNC, using an API call | RDP, directly from guest |
Linux | Yes, through VNC and an API call | No | VNC, directly from guest |
Physical Host | Yes, through VNC and an API call | No | No |
Hardware-assisted VMs, such as Windows, directly provide a graphical console over VNC. There is no text-based console, and guest networking is not necessary to use the graphical console. Once guest networking has been established, it is more efficient to set up Remote Desktop Access and use an RDP client to connect directly (this must be done outside of the XenAPI).
Paravirtual VMs, such as Linux guests, provide a native text console directly. XenServer provides a utility (called vncterm
) to convert this text-based console into a graphical VNC representation. Guest networking is not necessary for this console to function. As with Windows above, Linux distributions often configure VNC within the guest, and directly connect to it over a guest network interface.
The physical host console is only available as a vt100
console, which is exposed through the XenAPI as a VNC console by using vncterm
in the control domain.
RFB (Remote Framebuffer) is the protocol which underlies VNC, specified in The RFB Protocol. Third-party developers are expected to provide their own VNC viewers, and many freely available implementations can be adapted for this purpose. RFB 3.3 is the minimum version which viewers must support.
Retrieving VNC consoles using the API
VNC consoles are retrieved using a special URL passed through to the host agent. The sequence of API calls is as follows:
Client to Master/443: XML-RPC: Session.login_with_password()
.
Master/443 to Client: Returns a session reference to be used with subsequent calls.
Client to Master/443: XML-RPC: VM.get_by_name_label()
.
Master/443 to Client: Returns a reference to a particular VM (or the “control domain” if you want to retrieve the physical host console).
Client to Master/443: XML-RPC: VM.get_consoles()
.
Master/443 to Client: Returns a list of console objects associated with the VM.
Client to Master/443: XML-RPC: console.get_location()
.
Returns a URI describing where the requested console is located. The URIs are of the form: https://192.168.0.1/console?ref=OpaqueRef:c038533a-af99-a0ff-9095-c1159f2dc6a0
.
Client to 192.168.0.1: HTTP CONNECT “/console?ref=(…)”
The final HTTP CONNECT is slightly non-standard since the HTTP/1.1 RFC specifies that it should only be a host and a port, rather than a URL. Once the HTTP connect is complete, the connection can subsequently directly be used as a VNC server without any further HTTP protocol action.
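Expressed with the Python bindings, the same sequence is roughly the following sketch (assuming an authenticated session and a VM named "my vm"):
vm = session.xenapi.VM.get_by_name_label("my vm")[0]
for console in session.xenapi.VM.get_consoles(vm):
    if session.xenapi.console.get_protocol(console) == "rfb":
        print(session.xenapi.console.get_location(console))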
This scheme requires direct access from the client to the control domain’s IP, and will not work correctly if there are Network Address Translation (NAT) devices blocking such connectivity. You can use the CLI to retrieve the console URI from the client and perform a connectivity check.
Retrieve the VM UUID by running:
$ VM=$(xe vm-list params=uuid --minimal name-label=<name>)
Retrieve the console information:
$ xe console-list vm-uuid=$VM
uuid ( RO) : 8013b937-ff7e-60d1-ecd8-e52d66c5879e
vm-uuid ( RO): 2d7c558a-8f03-b1d0-e813-cbe7adfa534c
vm-name-label ( RO): 6
protocol ( RO): RFB
location ( RO): https://10.80.228.30/console?uuid=8013b937-ff7e-60d1-ecd8-e52d66c5879e
Use command-line utilities like ping
to test connectivity to the IP address provided in the location
field.
Disabling VNC forwarding for Linux VM
When creating and destroying Linux VMs, the host agent automatically manages the vncterm
processes which convert the text console into VNC. Advanced users who wish to directly access the text console can disable VNC forwarding for that VM. The text console can then only be accessed directly from the control domain, and graphical interfaces such as XenCenter will not be able to render a console for that VM.
Before starting the guest, set the following parameter on the VM record:
$ xe vm-param-set uuid=$VM other-config:disable_pv_vnc=1
Start the VM.
Use the CLI to retrieve the underlying domain ID of the VM with:
$ DOMID=$(xe vm-list params=dom-id uuid=$VM --minimal)
On the host console, connect to the text console directly by:
$ /usr/lib/xen/bin/xenconsole $DOMID
This configuration is an advanced procedure, and we do not recommend that the text console is directly used for heavy I/O operations. Instead, connect to the guest over SSH or some other network-based connection mechanism.
VM import/export
VMs can be exported to a file and later imported to any Xapi host. The export
protocol is a simple HTTP(S) GET, which should be sent to the Pool master.
Authorization is either via a pre-created session_id
or by HTTP basic
authentication (particularly useful on the command-line).
The VM to export is specified either by UUID or by reference. To keep track of
the export, a task can be created and passed in using its reference. Note that
Xapi may send an HTTP redirect if a different host has better access to the
disk data.
The following arguments are passed as URI query parameters or HTTP cookies:
Argument | Description |
---|
session_id | the reference of the session being used to authenticate; required only when not using HTTP basic authentication |
task_id | the reference of the task object with which to keep track of the operation; optional, required only if you have created a task object to keep track of the export |
ref | the reference of the VM; required only if not using the UUID |
uuid | the UUID of the VM; required only if not using the reference |
use_compression | an optional boolean “true” or “false” (defaulting to “false”). If “true” then the output will be gzip-compressed before transmission. |
For example, using the Linux command line tool cURL:
$ curl http://root:foo@myxenserver1/export?uuid=<vm_uuid> -o <exportfile>
will export the specified VM to the file exportfile
.
To export just the metadata, use the URI http://server/export_metadata
.
The import protocol is similar, using HTTP(S) PUT. The session_id
and task_id
arguments are as for the export. The ref
and uuid
are not used; a new reference and uuid will be generated for the VM. There are some additional parameters:
Argument | Description |
---|
restore | if true , the import is treated as replacing the original VM - the implication of this currently is that the MAC addresses on the VIFs are exactly as the export was, which will lead to conflicts if the original VM is still being run. |
force | if true , any checksum failures will be ignored (the default is to destroy the VM if a checksum error is detected) |
sr_id | the reference of an SR into which the VM should be imported. The default behavior is to import into the Pool.default_SR |
Note there is no need to specify whether the export is compressed, as Xapi
will automatically detect and decompress gzip-encoded streams.
For example, again using cURL:
curl -T <exportfile> http://root:foo@myxenserver2/import
will import the VM to the default SR on the server.
Note
Note that if no default SR has been set, and no sr_uuid
is specified, the error message DEFAULT_SR_NOT_FOUND
is returned.
Another example:
curl -T <exportfile> http://root:foo@myxenserver2/import?sr_id=<ref_of_sr>
will import the VM to the specified SR on the server.
To import just the metadata, use the URI http://server/import_metadata
This section describes the legacy VM import/export format and is for historical
interest only. It should be updated to describe the current format, see
issue 64
Xapi supports a human-readable legacy VM input format called XVA. This section describes the syntax and structure of XVA.
An XVA consists of a directory containing XML metadata and a set of disk images. A VM represented by an XVA is not intended to be directly executable. Data within an XVA package is compressed and intended for either archiving on permanent storage or for being transmitted to a VM server - such as a XenServer host - where it can be decompressed and executed.
XVA is a hypervisor-neutral packaging format; it should be possible to create simple tools to instantiate an XVA VM on any other platform. XVA does not specify any particular runtime format; for example disks may be instantiated as file images, LVM volumes, QCoW images, VMDK or VHD images. An XVA VM may be instantiated any number of times, each instantiation may have a different runtime format.
XVA does not:
specify any particular serialization or transport format
provide any mechanism for customizing VMs (or templates) on install
address how a VM may be upgraded post-install
define how multiple VMs, acting as an appliance, may communicate
These issues are all addressed by the related Open Virtual Appliance specification.
An XVA is a directory containing, at a minimum, a file called ova.xml
. This file describes the VM contained within the XVA and is described in Section 3.2. Disks are stored within sub-directories and are referenced from the ova.xml. The format of disk data is described later in Section 3.3.
The following terms will be used in the rest of the chapter:
HVM: a mode in which unmodified OS kernels run with the help of virtualization support in the hardware.
PV: a mode in which specially modified “paravirtualized” kernels run explicitly on top of a hypervisor without requiring hardware support for virtualization.
The “ova.xml” file contains the following elements:
<appliance version="0.1">
The number in the attribute “version” indicates the version of this specification to which the XVA is constructed; in this case version 0.1. Inside the <appliance> there is exactly one <vm> (in the OVA specification, multiple <vm>s are permitted).
Each <vm>
element describes one VM. The “name” attribute is for future internal use only and must be unique within the ova.xml file. The “name” attribute is permitted to be any valid UTF-8 string. Inside each <vm> tag are the following compulsory elements:
<label>... text ... </label>
A short name for the VM to be displayed in a UI.
<shortdesc> ... description ... </shortdesc>
A description for the VM to be displayed in the UI. Note that for both <label>
and <shortdesc>
contents, leading and trailing whitespace will be ignored.
<config mem_set="268435456" vcpus="1"/>
The <config>
element has attributes which describe the amount of memory in bytes (mem_set
) and number of CPUs (VCPUs) the VM should have.
Each <vm>
has zero or more <vbd>
elements representing block devices which look like the following:
<vbd device="sda" function="root" mode="w" vdi="vdi_sda"/>
The attributes have the following meanings:
device
: name of the physical device to expose to the VM. For linux guests we use “sd[a-z]” and for windows guests we use “hd[a-d]”.
function
: if marked as “root”, this disk will be used to boot the guest. (NB this does not imply the existence of the Linux root i.e. / filesystem) Only one device should be marked as “root”. See Section 3.4 describing VM booting. Any other string is ignored.
mode
: either “w” or “ro” if the device is to be read/write or read-only
vdi
: the name of the disk image (represented by a <vdi> element) to which this block device is connected
Each <vm>
may have an optional <hacks>
section like the following:
<hacks is_hvm="false" kernel_boot_cmdline="root=/dev/sda1 ro"/>
The <hacks>
element will be removed in future. The attribute is_hvm
is
either true
or false
, depending on whether the VM should be booted in HVM or not.
The kernel_boot_cmdline
contains additional kernel commandline arguments when
booting a guest using pygrub.
In addition to a <vm>
element, the <appliance>
will contain zero or more
<vdi>
elements like the following:
<vdi name="vdi_sda" size="5368709120" source="file://sda" type="dir-gzipped-chunks">
Each <vdi>
corresponds to a disk image. The attributes have the following meanings:
name
: name of the VDI, referenced by the vdi attribute of <vbd> elements. Any valid UTF-8 string is permitted.
size
: size of the required image in bytes
source
: a URI describing where to find the data for the image, only file:// URIs are currently permitted and must describe paths relative to the directory containing the ova.xml
type
: describes the format of the disk data
A single disk image encoding is specified in which has type “dir-gzipped-chunks”: Each image is represented by a directory containing a sequence of files as follows:
-rw-r--r-- 1 dscott xendev 458286013 Sep 18 09:51 chunk000000000.gz
-rw-r--r-- 1 dscott xendev 422271283 Sep 18 09:52 chunk000000001.gz
-rw-r--r-- 1 dscott xendev 395914244 Sep 18 09:53 chunk000000002.gz
-rw-r--r-- 1 dscott xendev 9452401 Sep 18 09:53 chunk000000003.gz
-rw-r--r-- 1 dscott xendev 1096066 Sep 18 09:53 chunk000000004.gz
-rw-r--r-- 1 dscott xendev 971976 Sep 18 09:53 chunk000000005.gz
-rw-r--r-- 1 dscott xendev 971976 Sep 18 09:53 chunk000000006.gz
-rw-r--r-- 1 dscott xendev 971976 Sep 18 09:53 chunk000000007.gz
-rw-r--r-- 1 dscott xendev 573930 Sep 18 09:53 chunk000000008.gz
Each file (named “chunkXXXXXXXXX.gz”) is a gzipped file containing exactly 1e9 bytes (1GB, not 1GiB) of raw block data. The small size was chosen to be safely under the maximum file size limits of several filesystems. If the files are gunzipped and then concatenated together, the original image is recovered.
Because the import and export of VMs can take some time to complete, an
asynchronous HTTP interface to the import and export operations is
provided. To perform an export using the XenServer API, construct
an HTTP GET call providing a valid session ID, task ID and VM UUID, as
shown in the following pseudo code:
task = Task.create()
result = HTTP.get(
server, 80, "/export?session_id=&task_id=&ref=");
For the import operation, use an HTTP PUT call as demonstrated in the
following pseudo code:
task = Task.create()
result = HTTP.put(
server, 80, "/import?session_id=&task_id=&ref=");
VM Lifecycle
The following figure shows the states that a VM can be in and the
API calls that can be used to move the VM between these states.
graph
halted-- start(paused) -->paused
halted-- start(not paused) -->running
running-- suspend -->suspended
suspended-- resume(not paused) -->running
suspended-- resume(paused) -->paused
suspended-- hard shutdown -->halted
paused-- unpause -->running
paused-- hard shutdown -->halted
running-- clean shutdown\n hard shutdown -->halted
running-- pause -->paused
halted-- destroy -->destroyed
VM boot parameters
The VM
class contains a number of fields that control the way in which the VM
is booted. With reference to the fields defined in the VM class (see later in
this document), this section outlines the boot options available and the
mechanisms provided for controlling them.
VM booting is controlled by setting one of the two mutually exclusive groups:
“PV” and “HVM”. If HVM_boot_policy
is an empty string, then paravirtual
domain building and booting will be used; otherwise the VM will be loaded as a
HVM domain, and booted using an emulated BIOS.
When paravirtual booting is in use, the PV_bootloader
field indicates the
bootloader to use. It may be “pygrub”, in which case the platform’s default
installation of pygrub will be used, or a full path within the control domain to
some other bootloader. The other fields, PV_kernel
, PV_ramdisk
, PV_args
,
and PV_bootloader_args
will be passed to the bootloader unmodified, and
interpretation of those fields is then specific to the bootloader itself,
including the possibility that the bootloader will ignore some or all of
those given values. Finally the paths of all bootable disks are added to the
bootloader commandline (a disk is bootable if its VBD has the bootable flag set).
There may be zero, one, or many bootable disks; the bootloader decides which
disk (if any) to boot from.
If the bootloader is pygrub, then the menu.lst is parsed, if present in the
guest’s filesystem, otherwise the specified kernel and ramdisk are used, or an
autodetected kernel is used if nothing is specified and autodetection is
possible. PV_args
is appended to the kernel command line, no matter which
mechanism is used for finding the kernel.
If PV_bootloader
is empty but PV_kernel
is specified, then the kernel and
ramdisk values will be treated as paths within the control domain. If both
PV_bootloader
and PV_kernel
are empty, then the behaviour is as if
PV_bootloader
were specified as “pygrub”.
When using HVM booting, HVM_boot_policy
and HVM_boot_params
specify the boot
handling. Only one policy is currently defined, “BIOS order”. In this case,
HVM_boot_params
should contain one key-value pair “order” = “N” where N is the
string that will be passed to QEMU.
Optionally HVM_boot_params
can contain another key-value pair “firmware”
with values “bios” or “uefi” (default is “bios” if absent).
By default Secure Boot is not enabled; when “uefi” firmware is selected, it can be
enabled by setting VM.platform["secureboot"]
to true.
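As an illustrative sketch (assuming an authenticated session and a halted VM reference vm), configuring HVM boot with UEFI firmware and Secure Boot might look like:
session.xenapi.VM.set_HVM_boot_policy(vm, "BIOS order")
session.xenapi.VM.set_HVM_boot_params(vm, {"order": "cd", "firmware": "uefi"})
platform = session.xenapi.VM.get_platform(vm)
platform["secureboot"] = "true"
session.xenapi.VM.set_platform(vm, platform)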
XenCenter
XenCenter uses some conventions on top of the XenAPI:
Internationalization for SR names
The SRs created at install time now have an other_config
key indicating how their names may be internationalized.
other_config["i18n-key"]
may be one of
local-hotplug-cd
local-hotplug-disk
local-storage
xenserver-tools
Additionally, other_config["i18n-original-value-<field name>"]
gives the value of that field when the SR was created. If XenCenter sees a record where SR.name_label
equals other_config["i18n-original-value-name_label"]
(that is, the record has not changed since it was created during XenServer installation), then internationalization will be applied. In other words, XenCenter will disregard the current contents of that field, and instead use a value appropriate to the user’s own language.
If you change SR.name_label
for your own purpose, then it no longer is the same as other_config["i18n-original-value-name_label"]
. Therefore, XenCenter does not apply internationalization, and instead preserves your given name.
Hiding objects from XenCenter
Networks, PIFs, and VMs can be hidden from XenCenter by adding the key HideFromXenCenter=true
to the other_config
parameter for the object. This capability is intended for ISVs who know what they are doing, not general use by everyday users. For example, you might want to hide certain VMs because they are cloned VMs that shouldn’t be used directly by general users in your environment.
In XenCenter, hidden Networks, PIFs, and VMs can be made visible, using the View menu.
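For example, a sketch that hides a VM and then makes it visible again (assuming an authenticated session and a VM reference vm):
session.xenapi.VM.add_to_other_config(vm, "HideFromXenCenter", "true")
# To make the VM visible to XenCenter again:
session.xenapi.VM.remove_from_other_config(vm, "HideFromXenCenter")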