Walkthrough: Starting a VM
A Xenopsd client wishes to start a VM. They must first tell Xenopsd the VM
configuration to use. A VM configuration is broken down into objects:
- VM: A device-less Virtual Machine
- VBD: A virtual block device for a VM
- VIF: A virtual network interface for a VM
- PCI: A virtual PCI device for a VM
Treating devices as first-class objects is convenient because we wish to expose
operations on the devices such as hotplug, unplug, eject (for removable media),
carrier manipulation (for network interfaces) etc.
The “add” functions in the Xenopsd interface cause Xenopsd to create the
objects. In the case of xapi, there is a set of functions which
convert between the XenAPI objects and the Xenopsd objects.
The two interfaces are slightly different because they have different expected
users:
- the XenAPI has many clients which are updated on long release cycles. The
main property needed is backwards compatibility, so that new releases of xapi
remain compatible with these older clients. Quite often, we will choose to
“grandfather in” some poorly designed interface simply because we wish to
avoid imposing churn on 3rd parties.
- the Xenopsd API clients are all open-source and are part of the xapi-project.
These clients can be updated as the API is changed. The main property needed
is to keep the interface clean, so that it properly hides the complexity
of dealing with Xen from other components.
The Xenopsd “VM.add” function has code like this:
let add' x =
  debug "VM.add %s" (Jsonrpc.to_string (rpc_of_t x));
  DB.write x.id x;
  let module B = (val get_backend () : S) in
  B.VM.add x;
  x.id
This function does 2 things:
- it stores the VM configuration in the “database”
- it tells the “backend” that the VM exists
The Xenopsd database is really a set of config files in the filesystem. All
objects belonging to a VM (recall we only have VMs, VBDs, VIFs, PCIs and not
stand-alone entities like disks) are placed into a subdirectory named after
the VM e.g.:
# ls /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701
config vbd.xvda vbd.xvdb
# cat /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701/config
{"id": "7b719ce6-0b17-9733-e8ee-dbc1e6e7b701", "name": "fedora",
...
}
Xenopsd doesn’t have as persistent a notion of a VM as xapi: it is expected that
all objects are deleted when the host is rebooted. However the objects should
be persisted over a simple Xenopsd restart, which is why the objects are stored
in the filesystem.
Aside: it would probably be more appropriate to store the metadata in Xenstore
since this has the exact object lifetime we need. This will require a more
performant Xenstore to realise.
Every running Xenopsd process is linked with a single backend. Currently backends
exist for:
- Xen via libxc, libxenguest and xenstore
- Xen via libxl, libxc and xenstore
- Xen via libvirt
- KVM by direct invocation of qemu
- Simulation for testing
From here we shall assume the use of the “Xen via libxc, libxenguest and xenstore” (a.k.a.
“Xenopsd classic”) backend.
The backend VM.add
function checks whether the VM we have to manage already exists – and if it does
then it ensures the Xenstore configuration is intact. This Xenstore configuration
is important because at any time a client can query the state of a VM with
VM.stat
and this relies on certain Xenstore keys being present.
Once the VM metadata has been registered with Xenopsd, the client can call
VM.start.
Like all potentially-blocking Xenopsd APIs, this function returns a Task id.
Please refer to the Task handling design for a general
overview of how tasks are handled.
Clients can poll the state of a task by calling TASK.stat
but most clients will prefer to use the event system instead.
Please refer to the Event handling design for a general
overview of how events are handled.
The event model is similar to the XenAPI: clients call a blocking
UPDATES.get
passing in a token which represents the point in time when the last UPDATES.get
returned. The call blocks until some objects have changed state, and these object
ids are returned (NB in the XenAPI the current object states are returned).
The client must then call the relevant “stat” function, in this
case TASK.stat.
The client will be able to see the task make progress and use this to – for example –
populate a progress bar in a UI. If the client needs to cancel the task then it
can call TASK.cancel;
again see the Task handling design to understand how this is
implemented.
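To make this concrete, here is a small sketch of such a client loop. The functions updates_get, task_stat and task_destroy stand in for the real UPDATES.get, TASK.stat and TASK.destroy RPCs; the actual signatures differ, so treat this purely as an illustration of the pattern:
type task_state = Pending of float | Completed | Failed of string

let rec wait_for_task ~updates_get ~task_stat ~task_destroy task_id token =
  (* updates_get blocks until some objects have changed since [token] and
     returns their ids together with a new token *)
  let changed_ids, next_token = updates_get token in
  if List.mem task_id changed_ids then (
    match task_stat task_id with
    | Completed | Failed _ ->
        (* the client owns the result, so it also destroys the Task *)
        task_destroy task_id
    | Pending _ ->
        wait_for_task ~updates_get ~task_stat ~task_destroy task_id next_token
  ) else
    wait_for_task ~updates_get ~task_stat ~task_destroy task_id next_token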
When the Task has completed successfully, then calls to *.stat will show:
- the power state is Paused
- exactly one valid Xen domain id
- all VBDs have active = plugged = true
- all VIFs have active = plugged = true
- all PCI devices have plugged = true
- at least one active console
- a valid start time
- valid “targets” for memory and vCPU
Note: before a Task completes, calls to *.stat will show partial updates. E.g.
the power state may be paused, but no disk may have been plugged.
UI clients must choose whether they are happy displaying this in-between state
or whether they wish to hide it and pretend the whole operation has happened
transactionally. In particular, when a client wishes to perform side-effects in
response to xenopsd
state changes (for example, to clean up an external resource
when a VIF becomes unplugged), it must be very careful to avoid responding
to these in-between states. Generally, it is safest to passively report these
values without driving things directly from them.
Note: the Xenopsd implementation guarantees that, if it is restarted at any point
during the start operation, on restart the VM state shall be “fixed” by either
(i) shutting down the VM; or (ii) ensuring the VM is intact and running.
In the case of xapi, every Xenopsd
Task id is bound one-to-one with a XenAPI task by the function
sync_with_task.
The function update_task
is called when xapi receives a notification that a Xenopsd Task has changed state,
and updates the corresponding XenAPI task.
Xapi launches exactly one thread per Xenopsd instance (“queue”) to monitor for
background events via the function
events_watch
while each thread performing a XenAPI call waits for its specific Task to complete
via the function
event_wait.
It is the responsibility of the client to call
TASK.destroy
when the Task is no longer needed. Xenopsd won’t destroy the task because it contains
the success/failure result of the operation which is needed by the client.
What happens when Xenopsd receives a VM.start request?
When Xenopsd receives the request it adds it to the appropriate per-VM queue
via the function
queue_operation.
To understand this and other internal details of Xenopsd, consult the
architecture description.
The queue_operation_int
function looks like this:
let queue_operation_int dbg id op =
  let task = Xenops_task.add tasks dbg (fun t -> perform op t; None) in
  Redirector.push id (op, task);
  task
The “task” is a record containing Task metadata plus a “do it now” function
which will be executed by a thread from the thread pool. The
module Redirector
takes care of:
- pushing operations to the right queue
- ensuring at most one worker thread is working on a VM’s operations
- reducing the queue size by coalescing items together
- providing a diagnostics interface
Once a thread from the worker pool becomes free, it will execute the “do it now”
function. In the example above this is perform op t where op is VM_start vm
and t is the Task. The function perform_exn has fragments like this:
| VM_start (id, force) -> (
    debug "VM.start %s (force=%b)" id force ;
    let power = (B.VM.get_state (VM_DB.read_exn id)).Vm.power_state in
    match power with
    | Running ->
        info "VM %s is already running" id
    | _ ->
        perform_atomics (atomics_of_operation op) t ;
        VM_DB.signal id
  )
Each “operation” (e.g. VM_start vm) is decomposed into “micro-ops” by the
function atomics_of_operation
where the micro-ops are small building-block actions common to the higher-level
operations. Each operation corresponds to a list of “micro-ops”, where there is
no if/then/else. Some of the “micro-ops” may be a no-op depending on the VM
configuration (for example a PV domain may not need a qemu). In the case of
VM_start vm, the Xenopsd server starts by calling the functions that decompose
the VM_hook_script, VM_create and VM_build micro-ops:
dequarantine_ops vgpus
; [
    VM_hook_script
      (id, Xenops_hooks.VM_pre_start, Xenops_hooks.reason__none)
  ; VM_create (id, None, None, no_sharept)
  ; VM_build (id, force)
  ]
This is the complete sequence of micro-ops:
1. run the “VM_pre_start” scripts
The VM_hook_script micro-op runs the corresponding “hook” scripts. The
code is all in the Xenops_hooks module and looks for scripts in the
hardcoded path /etc/xapi.d.
2. create a Xen domain
The VM_create micro-op calls the VM.create function in the backend.
In the classic Xenopsd backend, the VM.create_exn function must:
- check if we’re creating a domain for a fresh VM or resuming an existing one:
if it’s a resume then the domain configuration stored in the VmExtra database
table must be used
- ask squeezed to create a memory “reservation” big enough to hold the VM
memory. Unfortunately the domain cannot be created until the memory is free
because domain create often fails in low-memory conditions. This means the
“reservation” is associated with our “session” with squeezed; if Xenopsd
crashes and restarts the reservation will be freed automatically.
- create the Domain via the libxc hypercall
Xenctrl.domain_create
- call generate_create_info() for storing the platform data (vCPUs, etc) in the
domain's Xenstore tree. xenguest then uses this in the build phase (see below)
to build the domain.
- "transfer" the squeezed reservation to the domain such that squeezed will
free the memory if the domain is destroyed later
- compute and set an initial balloon target depending on the amount of memory
reserved (recall we ask for a range between dynamic_min and dynamic_max); see
the sketch after this list
- apply the “suppress spurious page faults” workaround if requested
- set the “machine address size”
- “hotplug” the vCPUs. This operates a lot like memory ballooning – Xen creates
lots of vCPUs and then the guest is asked to only use some of them. Every VM
therefore starts with the “VCPUs_max” setting and co-operative hotplug is
used to reduce the number. Note there is no enforcement mechanism: a VM which
cheats and uses too many vCPUs would have to be caught by looking at the
performance statistics.
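The initial balloon target mentioned in the list above is, conceptually, just the reserved amount clamped into the configured dynamic range. A minimal sketch of that idea (names and units are illustrative, not the actual xenopsd code):
(* Clamp the initially reserved memory into [dynamic_min, dynamic_max];
   all values are in KiB. Illustrative only. *)
let initial_balloon_target ~dynamic_min_kib ~dynamic_max_kib ~reserved_kib =
  min dynamic_max_kib (max dynamic_min_kib reserved_kib)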
3. build the domain
The build phase waits, if necessary, for the Xen memory scrubber to catch
up reclaiming memory, runs NUMA placement, sets vCPU affinity and invokes
xenguest to build the system memory layout of the domain.
See the walk-through of the VM_build μ-op for details.
4. mark each VBD as “active”
VBDs and VIFs are said to be “active” when they are intended to be used by a
particular VM, even if the backend/frontend connection hasn’t been established,
or has been closed. If someone calls VBD.stat
or VIF.stat
then
the result includes both “active” and “plugged”, where “plugged” is true if
the frontend/backend connection is established.
For example xapi will
set VBD.currently_attached
to “active || plugged”. The “active” flag is conceptually very similar to the
traditional “online” flag (which is not documented in the upstream Xen tree
as of Oct/2014 but really should be) except that on unplug, one would set
the “online” key to “0” (false) first before initiating the hotunplug. By
contrast the “active” flag is set to false after the unplug i.e. “set_active”
calls bracket plug/unplug. If the “active” flag were cleared before the unplug
attempt then as soon as the frontend/backend connection is removed clients
would see the VBD as completely dissociated from the VM – this would be misleading
because Xenopsd will not have had time to use the storage API to release locks
on the disks. By cleaning up before setting “active” to false, clients
can be assured that the disks are now free to be reassigned.
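As a tiny illustration of how a client can combine the two flags returned by the stat calls (the record type is illustrative, not the actual interface type):
type device_state = { active : bool; plugged : bool }

(* e.g. what xapi reports as VBD.currently_attached *)
let currently_attached s = s.active || s.plugged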
5. handle non-persistent disks
A non-persistent disk is one which is reset to a known-good state on every
VM start. The VBD_epoch_begin
is the signal to perform any necessary reset.
6. plug VBDs
The VBD_plug
micro-op will plug the VBD into the VM. Every VBD is plugged
in a carefully-chosen order.
Generally, plug order is important for all types of devices. For VBDs, we must
work around the deficiency in the storage interface where a VDI, once attached
read/only, cannot be attached read/write. Since it is legal to attach the same
VDI with multiple VBDs, we must plug them in such that the read/write VBDs
come first. From the guest’s point of view the order we plug them doesn’t
matter because they are indexed by the Xenstore device id (e.g. 51712 = xvda).
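A minimal sketch of this ordering constraint, using illustrative types rather than the real xenopsd records:
type mode = ReadOnly | ReadWrite
type vbd = { device_id : int; mode : mode }

(* Plug read/write VBDs before read-only ones so that a VDI shared between
   several VBDs is attached read/write first. *)
let vbd_plug_order vbds =
  let rw, ro = List.partition (fun v -> v.mode = ReadWrite) vbds in
  rw @ ro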
The function
VBD.plug
will
- call VDI.attach and VDI.activate in the storage API to make the devices
ready (start the tapdisk processes etc)
- add the Xenstore frontend/backend directories containing the block device
info
- add the extra xenstore keys returned by the VDI.attach call that are needed
for SCSIid passthrough which is needed to support VSS
- write the VBD information to the Xenopsd database so that future calls to
VBD.stat can be told about the associated disk (this is needed so clients
like xapi can cope with CD insert/eject etc)
- if the qemu is going to be in a different domain to the storage, a frontend
device in the qemu domain is created.
The Xenstore keys are written by the functions
Device.Vbd.add_async
and
Device.Vbd.add_wait.
In a Linux domain (such as dom0) when the backend directory is created, the kernel
creates a “backend device”. Creating any device will cause a kernel UEVENT to fire
which is picked up by udev. The udev rules run a script whose only job is to
stat(2) the device (from the “params” key in the backend) and write the major
and minor number to Xenstore for blkback to pick up. (Aside: FreeBSD doesn’t do
any of this, instead the FreeBSD kernel module simply opens the device in the
“params” key). The script also writes the backend key “hotplug-status=connected”.
We currently wait for this key to be written so that later calls to VBD.stat
will return with “plugged=true”. If the call returns before this key is written
then sometimes we receive an event, call VBD.stat and conclude erroneously
that a spontaneous VBD unplug occurred.
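Schematically, the wait looks like the loop below, where read stands for any function that reads a Xenstore path and returns its value if present; the real code uses Xenstore watches rather than polling:
(* Wait until the backend's hotplug-status key reports "connected", or give
   up at [deadline] (seconds since the epoch). Illustrative only. *)
let rec wait_connected ~(read : string -> string option) backend_path deadline =
  match read (backend_path ^ "/hotplug-status") with
  | Some "connected" -> true
  | _ when Unix.gettimeofday () > deadline -> false
  | _ -> Thread.delay 0.1 ; wait_connected ~read backend_path deadline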
7. mark each VIF as “active”
This is for the same reason as VBDs are marked “active”.
8. plug VIFs
Again, the order matters. Unlike VBDs,
there is no read/write versus read-only constraint and the devices
have unique indices (0, 1, 2, …) but Linux kernels have often (always?)
ignored the actual index and instead relied on the order of results from the
xenstore-ls
listing. The order that xenstored returns the items happens
to be the order the nodes were created so this means that (i) xenstored must
continue to store directories as ordered lists rather than maps (which would
be more efficient); and (ii) Xenopsd must make sure to plug the vifs in
the same order. Note that relying on ethX device numbering has always been a
bad idea but is still common. I bet if you change this, many tests will
suddenly start to fail!
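A corresponding sketch for VIFs, again with illustrative types: plug in ascending device position so that the Xenstore listing order (and hence the guest's ethX numbering) matches:
type vif = { position : int; bridge : string }

(* Plugging in position order keeps the xenstore-ls order stable. *)
let vif_plug_order vifs =
  List.sort (fun a b -> compare a.position b.position) vifs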
The function
VIF.plug_exn
will
- compute the port locking configuration required and write this to a well-known
location in the filesystem where it can be read from the udev scripts. This
really should be written to Xenstore instead, since this scheme doesn’t work
with driver domains.
- add the Xenstore frontend/backend directories containing the network device
info
- write the VIF information to the Xenopsd database so that future calls to
VIF.stat can be told about the associated network
- if the qemu is going to be in a different domain to the storage, a frontend
device in the qemu domain is created.
Similarly to the VBD case, the function
Device.Vif.add
will write the Xenstore keys and wait for the “hotplug-status=connected” key.
We do this because we cannot apply the port locking rules until the backend
device has been created, and we cannot know the rules have been applied
until after the udev script has written the key. If we didn’t wait for it then
the VM might execute without all the port locking properly configured.
9. create the device model
The VM_create_device_model
micro-op will create a qemu device model if
- the VM is HVM; or
- the VM uses a PV keyboard or mouse (since only qemu currently has backend
support for these devices).
The function
VM.create_device_model_exn
will
- (if using a qemu stubdom) create and build the qemu domain
- compute the necessary qemu arguments and launch it.
Note that qemu (aka the “device model”) is created after the VIFs and VBDs have
been plugged but before the PCI devices have been plugged. Unfortunately qemu
traditional infers the needed emulated hardware by inspecting the Xenstore
VBD and VIF configuration and assuming that we want one emulated device per
PV device, up to the natural limits of the emulated buses (i.e. there can be
at most 4 IDE devices: {primary,secondary}{master,slave}). Not only does this
create an ordering dependency that needn’t exist – and which impacts migration
downtime – but it also completely ignores the plain fact that, on a Xen system,
qemu can be in a different domain than the backend disk and network devices.
This hack only works because we currently run everything in the same domain.
There is an option (off by default) to list the emulated devices explicitly
on the qemu command-line. If we switch to this by default then we ought to be
able to start up qemu early, as soon as the domain has been created (qemu will
need to know the domain id so it can map the I/O request ring).
10. plug PCI devices
PCI devices are treated differently to VBDs and VIFs.
If we are attaching the device to an
HVM guest then instead of relying on the traditional Xenstore frontend/backend
state machine we instead send RPCs to qemu requesting they be hotplugged. Note
the domain is paused at this point, but qemu still supports PCI hotplug/unplug.
The reasons why this doesn’t follow the standard Xenstore model are known only
to the people who contributed this support to qemu.
Again the order matters because it determines the position of the virtual device
in the VM.
Note that Xenopsd doesn’t know anything about the PCI devices; concepts such
as “GPU groups” belong to higher layers, such as xapi.
11. mark the domain as alive
A design principle of Xenopsd is that it should tolerate failures such as being
suddenly restarted. It guarantees to always leave the system in a valid state,
in particular there should never be any “half-created VMs”. We achieve this for
VM start by exploiting the mechanism which is necessary for reboot. When a VM
wishes to reboot it causes the domain to exit (via SCHEDOP_shutdown) with a
“reason code” of “reboot”. When Xenopsd sees this event, a VM_check_state
operation is queued. This operation calls
VM.get_domain_action_request
to ask the question, “what needs to be done to make this VM happy now?”. The
implementation checks the domain state for shutdown codes and also checks a
special Xenopsd Xenstore key. When Xenopsd creates a Xen domain it sets this
key to “reboot” (meaning “please reboot me if you see me”) and when Xenopsd
finishes starting the VM it clears this key. This means that if Xenopsd crashes
while starting a VM, the new Xenopsd will conclude that the VM needs to be rebooted
and will clean up the current domain and create a fresh one.
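The decision logic can be sketched as follows; the types are illustrative and the real VM.get_domain_action_request in the Xen backend is more involved:
type action = Reboot | Leave_alone

(* [shutdown_reason] is the domain's shutdown code, if it has exited;
   [control_key] is the value of the special Xenstore key described above. *)
let action_on_check ~shutdown_reason ~control_key =
  match (shutdown_reason, control_key) with
  | Some "reboot", _ -> Reboot        (* the guest asked to be rebooted *)
  | _, Some "reboot" -> Reboot        (* VM.start never completed: rebuild *)
  | _ -> Leave_alone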
12. unpause the domain
A Xenopsd VM.start will always leave the domain paused, so strictly speaking
this is a separate “operation” queued by the client (such as xapi) after the
VM.start has completed. The function
VM.unpause
is reassuringly simple:
if di.Xenctrl.total_memory_pages = 0n then raise (Domain_not_built);
Domain.unpause ~xc di.Xenctrl.domid;
Opt.iter
  (fun stubdom_domid ->
    Domain.unpause ~xc stubdom_domid
  ) (get_stubdom ~xs di.Xenctrl.domid)
Building a VM
VM_build micro-op
Overview
On Xen, Xenctrl.domain_create creates an empty domain and
returns the domain ID (domid) of the new domain to xenopsd.
In the build
phase, the xenguest
program is called to create
the system memory layout of the domain, set vCPU affinity and a
lot more.
The VM_build
micro-op collects the VM build parameters and calls
VM.build,
which calls
VM.build_domain,
which calls
VM.build_domain_exn
which calls Domain.build:
flowchart
subgraph xenopsd VM_build[xenopsd: VM_build micro#8209;op]
direction LR
VM_build --> VM.build
VM.build --> VM.build_domain
VM.build_domain --> VM.build_domain_exn
VM.build_domain_exn --> Domain.build
click VM_build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
click VM.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
click VM.build_domain "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
click VM.build_domain_exn "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
click Domain.build "
https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
end
The function
VM.build_domain_exn
must:
Run pygrub (or eliloader) to extract the kernel and initrd, if necessary
Call
Domain.build to
- optionally run NUMA placement and
- invoke xenguest to set up the domain memory.
See the walk-through of the Domain.build function
for more details on this phase.
Apply the cpuid
configuration
Store the current domain configuration on disk – it’s important to know
the difference between the configuration you started with and the configuration
you would use after a reboot because some properties (such as maximum memory
and vCPUs) are fixed on create.
Domain.build
Overview
flowchart LR
subgraph xenopsd VM_build[
xenopsd thread pool with two VM_build micro#8209;ops:
During parallel VM_start, Many threads run this in parallel!
]
direction LR
build_domain_exn[
VM.build_domain_exn
from thread pool Thread #1
] --> Domain.build
Domain.build --> build_pre
build_pre --> wait_xen_free_mem
build_pre -->|if NUMA/Best_effort| numa_placement
Domain.build --> xenguest[Invoke xenguest]
click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
build_domain_exn2[
VM.build_domain_exn
from thread pool Thread #2] --> Domain.build2[Domain.build]
Domain.build2 --> build_pre2[build_pre]
build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem]
build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement]
Domain.build2 --> xenguest2[Invoke xenguest]
click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
end
VM.build_domain_exn calls Domain.build to:
- call build_pre to prepare the build of a VM:
  - If the xe config numa_placement is set to Best_effort, invoke the NUMA
placement algorithm.
- call the xenguest function to invoke the xenguest program to set up the
domain's system memory.
build_pre: Prepare building the VM
Domain.build
calls
build_pre
(which is also used for VM restore) to:
Call wait_xen_free_mem to wait (if necessary) for the Xen memory scrubber to
catch up reclaiming memory. It:
- calls Xenctrl.physinfo which returns:
  - free_pages - the free and already scrubbed pages (available)
  - scrub_pages - the not yet scrubbed pages (not yet available)
- repeats this until a timeout, as long as free_pages is lower than the
required pages, unless scrub_pages is 0 (no scrubbing left to do)
Note: free_pages
is system-wide memory, not memory specific to a NUMA node.
Because this is not NUMA-aware, in case of temporary node-specific memory shortage,
this check is not sufficient to prevent the VM from being spread over all NUMA nodes.
It is planned to resolve this issue by claiming NUMA node memory during NUMA placement.
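A schematic form of that loop is shown below; physinfo stands in for Xenctrl.physinfo and is assumed to return (free_pages, scrub_pages), while the real wait_xen_free_mem in the Xen backend differs in detail:
(* Keep waiting while memory is still being scrubbed; give up at [deadline]. *)
let rec wait_for_scrubbed_memory ~physinfo ~required_pages ~deadline =
  let free_pages, scrub_pages = physinfo () in
  if free_pages >= required_pages then true        (* enough scrubbed memory *)
  else if scrub_pages = 0 then false               (* nothing left to scrub *)
  else if Unix.gettimeofday () > deadline then false
  else (
    Thread.delay 0.25 ;
    wait_for_scrubbed_memory ~physinfo ~required_pages ~deadline
  )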
Call the hypercall to set the timer mode
Call the hypercall to set the number of vCPUs
Call the numa_placement
function
as described in the NUMA feature description
when the xe
configuration option numa_placement
is set to Best_effort
(except when the VM has a hard CPU affinity).
match !Xenops_server.numa_placement with
| Any ->
    ()
| Best_effort ->
    log_reraise (Printf.sprintf "NUMA placement") (fun () ->
        if has_hard_affinity then
          D.debug "VM has hard affinity set, skipping NUMA optimization"
        else
          numa_placement domid ~vcpus
            ~memory:(Int64.mul memory.xen_max_mib 1048576L)
    )
NUMA placement
build_pre passes the domid, the number of vCPUs and xen_max_mib to the
numa_placement
function to run the algorithm to find the best NUMA placement.
When it returns a NUMA node to use, it calls the Xen hypercalls
to set the vCPU affinity to this NUMA node:
let vm = NUMARequest.make ~memory ~vcpus in
let nodea =
  match !numa_resources with
  | None ->
      Array.of_list nodes
  | Some a ->
      Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
in
numa_resources := Some nodea ;
Softaffinity.plan ~vm host nodea
By using the default auto_node_affinity
feature of Xen,
setting the vCPU affinity causes the Xen hypervisor to activate
NUMA node affinity for memory allocations to be aligned with
the vCPU affinity of the domain.
Summary: This passes the information to the hypervisor that memory
allocation for this domain should preferably be done from this NUMA node.
Invoke the xenguest program
With the preparation in build_pre
completed, Domain.build
calls
the xenguest
function to invoke the xenguest program to build the domain.
Notes on future design improvements
The Xen domain feature flag
domain->auto_node_affinity
can be disabled by calling
xc_domain_node_setaffinity()
to set a specific NUMA node affinity in special cases:
This can be used, for example, when there might not be enough memory on the preferred
NUMA node, and there are other NUMA nodes (in the same CPU package) to use
(reference).
xenguest
As part of starting a new domain in VM_build, xenopsd calls xenguest.
When multiple domain build threads run in parallel,
multiple instances of xenguest also run in parallel:
flowchart
subgraph xenopsd VM_build[xenopsd VM_build micro#8209;ops]
direction LR
xenopsd1[Domain.build - Thread #1] --> xenguest1[xenguest #1]
xenopsd2[Domain.build - Thread #2] --> xenguest2[xenguest #2]
xenguest1 --> libxenguest
xenguest2 --> libxenguest2[libxenguest]
click xenopsd1 "../Domain.build/index.html"
click xenopsd2 "../Domain.build/index.html"
click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
libxenguest --> Xen[Xen<br>Hypervisor]
libxenguest2 --> Xen
end
About xenguest
xenguest
is called by the xenopsd Domain.build function
to perform the build phase for new VMs, which is part of the xenopsd
VM.start operation.
xenguest was created as a separate program due to issues with libxenguest:
- It wasn’t threadsafe: fixed, but it still uses a per-call global struct
- It had an incompatible licence, but now licensed under the LGPL.
Those were fixed, but we still shell out to xenguest, which is currently
carried in the patch queue for the Xen hypervisor packages, but could become
an individual package once planned changes to the Xen hypercalls are stabilised.
Over time, xenguest
has evolved to build more of the initial domain state.
Interface to xenguest
flowchart
subgraph xenopsd VM_build[xenopsd VM_build micro#8209;op]
direction TB
mode
domid
memmax
Xenstore
end
mode[--mode hvm_build] --> xenguest
domid --> xenguest
memmax --> xenguest
Xenstore[Xenstore platform data] --> xenguest
xenopsd
must pass this information to xenguest
to build a VM:
- The domain type to build for (HVM, PVH or PV).
  - It is passed using the command line option --mode hvm_build.
- The domid of the created empty domain,
- The amount of system memory of the domain,
- A number of other parameters that are domain-specific.
xenopsd
uses the Xenstore to provide platform data:
- the vCPU affinity
- the vCPU credit2 weight/cap parameters
- whether the NX bit is exposed
- whether the viridian CPUID leaf is exposed
- whether the system has PAE or not
- whether the system has ACPI or not
- whether the system has nested HVM or not
- whether the system has an HPET or not
When called to build a domain, xenguest
reads those and builds the VM accordingly.
Walkthrough of the xenguest build mode
flowchart
subgraph xenguest[xenguest #8209;#8209;mode hvm_build domid]
direction LR
stub_xc_hvm_build[stub_xc_hvm_build#40;#41;] --> get_flags[
get_flags#40;#41; <#8209; Xenstore platform data
]
stub_xc_hvm_build --> configure_vcpus[
configure_vcpus#40;#41; #8209;> Xen hypercall
]
stub_xc_hvm_build --> setup_mem[
setup_mem#40;#41; #8209;> Xen hypercalls to setup domain memory
]
end
Based on the given domain type, the xenguest
program calls dedicated
functions for the build process of the given domain type.
These are:
- stub_xc_hvm_build() for HVM,
- stub_xc_pvh_build() for PVH, and
- stub_xc_pv_build() for PV domains.
These domain build functions call these functions:
- get_flags() to get the platform data from the Xenstore
- configure_vcpus() which uses the platform data from the Xenstore to configure
vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap
(max % pCPU time for throttling)
- The setup_mem function for the given VM type.
The function hvm_build_setup_mem()
For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory
layout of the new domain, and for allocating and populating the memory of the
new domain. It must:
- Derive the e820 memory layout of the system memory of the domain
including memory holes depending on PCI passthrough and vGPU flags.
- Load the BIOS/UEFI firmware images
- Store the final MMIO hole parameters in the Xenstore
- Call the libxenguest function xc_dom_boot_mem_init() (see below)
- Call construct_cpuid_policy() to apply the CPUID featureset policy
The function xc_dom_boot_mem_init()
flowchart LR
subgraph xenguest
hvm_build_setup_mem[hvm_build_setup_mem#40;#41;]
end
subgraph libxenguest
hvm_build_setup_mem --> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;]
xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meminit_hvm#40;#41;]
click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank
click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank
end
hvm_build_setup_mem()
calls
xc_dom_boot_mem_init()
to allocate and populate the domain’s system memory.
It calls
meminit_hvm()
to loop over the vmemranges
of the domain for mapping the system RAM
of the guest from the Xen hypervisor heap. Its goals are:
- Attempt to allocate 1GB superpages when possible
- Fall back to 2MB pages when 1GB allocation failed
- Fall back to 4k pages when both failed
It uses the hypercall
XENMEM_populate_physmap
to perform memory allocation and to map the allocated memory
to the system RAM ranges of the domain.
https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071
XENMEM_populate_physmap:
- Uses construct_memop_from_reservation to convert the arguments for allocating
a page from struct xen_memory_reservation to struct memop_args.
- Sets flags and calls functions according to the arguments
- Allocates the requested page at the most suitable place
- depending on passed flags, allocate on a specific NUMA node
- else, if the domain has node affinity, on the affine nodes
- also in the most suitable memory zone within the NUMA node
- Falls back to less desirable places if this fails
- or fail for “exact” allocation requests
- When no pages of the requested size are free,
it splits larger superpages into pages of the requested size.
For more details on the VM build step involving xenguest
and the Xen side, see:
https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest
Walkthrough: Migrating a VM
At the end of this walkthrough, a sequence diagram of the overall process is included.
Invocation
The command to migrate the VM is dispatched
by the autogenerated dispatch_call
function from xapi/server.ml. For
more information about the generated functions, you can have a look at the
XAPI IDL model.
The command triggers the operation
VM_migrate
that uses many low-level atomic operations.
The migrate command has several parameters such as:
- Should it be started asynchronously,
- Should it be forwarded to another host,
- How arguments should be marshalled, and so on.
A new thread is created by xapi/server_helpers.ml
to handle the command asynchronously. The helper thread checks if
the command should be passed to the message forwarding
layer in order to be executed on another host (the destination) or locally (if
it is already at the destination host).
It will finally reach xapi/api_server.ml, which will take the action
of posting a command to the message broker, the message switch.
This is a JSON-RPC HTTP request sent on a Unix socket to communicate between
XAPI daemons. In the case of the migration, this message sent by XAPI will be
consumed by the xenopsd
daemon that will do the job of migrating the VM.
Overview
The migration is an asynchronous task and a thread is created to handle this task.
The task reference is returned to the client, which can then check
its status until completion.
As shown in the introduction, xenopsd
fetches the
VM_migrate
operation from the message broker.
All tasks specific to libxenctrl,
xenguest and Xenstore
are handled by the xenopsd
xc backend.
The entities that need to be migrated are: VDI, VIF, VGPU and PCI components.
During the migration process, the destination domain will be built with the same
UUID as the original VM, except that the last part of the UUID will be
XXXXXXXX-XXXX-XXXX-XXXX-000000000001. The original domain will be removed using
XXXXXXXX-XXXX-XXXX-XXXX-000000000000.
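An illustrative helper for this naming convention (not the actual xenopsd code) would simply swap the final UUID segment:
(* Replace the last "-"-separated segment of a UUID, e.g. to derive the
   temporary "...000000000001" UUID used while the incoming domain is built. *)
let with_final_segment uuid segment =
  match String.rindex_opt uuid '-' with
  | Some i -> String.sub uuid 0 (i + 1) ^ segment
  | None -> uuid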
Preparing VM migration
At specific places, xenopsd
can execute hooks to run scripts.
In case a pre-migrate script is in place, a command to run this script
is sent to the original domain.
Likewise, a command is sent to Qemu using the Qemu Machine Protocol (QMP)
to check that the domain can be suspended (see xenopsd/xc/device_common.ml).
After checking with Qemu that the VM can be suspended, the migration can begin.
As for hooks, commands to the source domain are sent using stunnel, a daemon which
is used as a wrapper to manage SSL-encrypted communication between two hosts in the same
pool. To import the metadata, an XML RPC command is sent to the original domain.
Once imported, it will give us a reference id and will allow building the new domain
on the destination using the temporary VM uuid XXXXXXXX-XXXX-XXXX-XXXX-000000000001
where XXX...
is the reference id of the original VM.
Memory setup
One of the first steps is the setup of the VM’s memory: the backend checks that there
is no ballooning operation in progress; if there is, the migration could fail.
Once memory has been checked, the daemon will get the state of the VM (running, halted, …) and
the backend retrieves the domain’s platform data (memory, vCPUs, etc) from the Xenstore.
Once this is complete, we can restore VIFs and create the domain.
The synchronisation of the memory is the first point of synchronisation and everything
is ready for VM migration.
Destination VM setup
After receiving memory we can set up the destination domain. If we have a vGPU we need to kick
off its migration process. We will need to wait for the acknowledgement that the
GPU entry has been successfully initialized before starting the main VM migration.
The receiver informs the sender using a handshake protocol
that everything is set up and ready for save/restore.
Destination VM restore
VM restore is a low-level atomic operation, VM.restore.
This operation is represented by a function call to the backend.
It uses xenguest, a low-level utility from the XAPI toolstack, to interact with the Xen hypervisor
and libxc
for sending a migration request to the emu-manager.
After sending the request, the results coming from emu-manager are collected
by the main thread, which blocks until results are received.
During the live migration, emu-manager helps in ensuring the correct state
transitions for the devices and handling the message passing for the VM as
it’s moved between hosts. This includes making sure that the state of the
VM’s virtual devices, like disks or network interfaces, is correctly moved over.
Destination VM rename
Once all operations are done, xenopsd
renames the target VM from its temporary
name to its real UUID. This operation is a low-level atomic
VM.rename
which takes care of updating the Xenstore on the destination host.
Restoring devices
Restoring devices starts by activating VBDs using the low-level atomic operation
VBD.set_active. It is an update of Xenstore. VBDs that are read-write must
be plugged before read-only ones. Once activated, the low-level atomic operation
VBD.plug
is called. VDIs are attached and activated.
The next devices are VIFs, which are set as active (VIF.set_active) and plugged (VIF.plug).
If there are VGPUs, we will set them as active now using the atomic operation VGPU.set_active.
Creating the device model
create_device_model
configures qemu-dm and starts it. This allows PCI devices to be managed.
PCI plug
PCI.plug
is executed by the backend. It plugs a PCI device and advertises it to QEMU if this option is set. This is
the case for NVIDIA SR-IOV vGPUs.
Unpause
The libxenctrl call
xc_domain_unpause()
unpauses the domain, and it starts running.
Cleanup
VM_set_domain_action_request
marks the domain as alive: In case xenopsd
restarts, it no longer reboots the VM.
See the chapter on marking domains as alive
for more information.
If a post-migrate script is in place, it is executed by the
Xenops_hooks.VM_post_migrate
hook.
The final step is a handshake to seal the success of the migration
and the old VM can now be cleaned up.
Synchronisation point 4
has been reached and the migration is complete.
Live migration flowchart
This flowchart gives a visual representation of the VM migration workflow:
sequenceDiagram
autonumber
participant tx as sender
participant rx0 as receiver thread 0
participant rx1 as receiver thread 1
participant rx2 as receiver thread 2
activate tx
tx->>rx0: VM.import_metadata
tx->>tx: Squash memory to dynamic-min
tx->>rx1: HTTP /migrate/vm
activate rx1
rx1->>rx1: VM_receive_memory<br/>VM_create (00000001)<br/>VM_restore_vifs
rx1->>tx: handshake (control channel)<br/>Synchronisation point 1
tx->>rx2: HTTP /migrate/mem
activate rx2
rx2->>tx: handshake (memory channel)<br/>Synchronisation point 1-mem
tx->>rx1: handshake (control channel)<br/>Synchronisation point 1-mem ACK
rx2->>rx1: memory fd
tx->>rx1: VM_save/VM_restore<br/>Synchronisation point 2
tx->>tx: VM_rename
rx1->>rx2: exit
deactivate rx2
tx->>rx1: handshake (control channel)<br/>Synchronisation point 3
rx1->>rx1: VM_rename<br/>VM_restore_devices<br/>VM_unpause<br/>VM_set_domain_action_request
rx1->>tx: handshake (control channel)<br/>Synchronisation point 4
deactivate rx1
tx->>tx: VM_shutdown<br/>VM_remove
deactivate tx
References
These pages might help for a better understanding of the XAPI toolstack: