VM_build micro-op
Overview
On Xen, Xenctrl.domain_create creates an empty domain and returns the domain ID (domid) of the new domain to xenopsd.
In the build phase, the xenguest program is called to create the system memory layout of the domain, set vCPU affinity, and more.
The VM_build micro-op collects the VM build parameters and calls VM.build, which calls VM.build_domain, which calls VM.build_domain_exn, which calls Domain.build:
flowchart
subgraph xenopsd VM_build[xenopsd: VM_build micro#8209;op]
direction LR
VM_build --> VM.build
VM.build --> VM.build_domain
VM.build_domain --> VM.build_domain_exn
VM.build_domain_exn --> Domain.build
click VM_build "https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
click VM.build "https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
click VM.build_domain "https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
click VM.build_domain_exn "https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
click Domain.build "https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
end
The function VM.build_domain_exn must:
- Run pygrub (or eliloader) to extract the kernel and initrd, if necessary
- Call Domain.build to
  - optionally run NUMA placement and
  - invoke xenguest to set up the domain memory.
  See the walk-through of the Domain.build function for more details on this phase.
- Apply the cpuid configuration
- Store the current domain configuration on disk – it is important to know the difference between the configuration you started with and the configuration you would use after a reboot, because some properties (such as maximum memory and vCPUs) are fixed on create.
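To illustrate the last point, here is a minimal OCaml sketch of persisting a build-time record. The record type, field names and file path are hypothetical and not xenopsd's actual persistence code; the point is only that values fixed at domain create time are written out separately, so a later reboot can distinguish them from values that may have changed since the VM was started.

```ocaml
(* Hypothetical sketch only: type, fields and path are illustrative,
   not xenopsd's actual on-disk metadata format. *)
type build_info = {max_memory_kib: int64; vcpus: int}

let persist_build_info ~domid (info : build_info) =
  (* hypothetical location; xenopsd keeps its own per-domain metadata *)
  let path = Printf.sprintf "/var/run/xenopsd/%d.build-info" domid in
  let oc = open_out path in
  Printf.fprintf oc "max_memory_kib=%Ld\nvcpus=%d\n" info.max_memory_kib info.vcpus ;
  close_out oc
```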
Domain.build
Overview
flowchart LR
subgraph xenopsd VM_build[
xenopsd thread pool with two VM_build micro#8209;ops:
During parallel VM_start, many threads run this in parallel!
]
direction LR
build_domain_exn[
VM.build_domain_exn
from thread pool Thread #1
] --> Domain.build
Domain.build --> build_pre
build_pre --> wait_xen_free_mem
build_pre -->|if NUMA/Best_effort| numa_placement
Domain.build --> xenguest[Invoke xenguest]
click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
build_domain_exn2[
VM.build_domain_exn
from thread pool Thread #2] --> Domain.build2[Domain.build]
Domain.build2 --> build_pre2[build_pre]
build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem]
build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement]
Domain.build2 --> xenguest2[Invoke xenguest]
click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
end
VM.build_domain_exn calls Domain.build to call:
- build_pre to prepare the build of a VM:
  - If the xe config option numa_placement is set to Best_effort, invoke the NUMA placement algorithm.
- xenguest to invoke the xenguest program to set up the domain's system memory.
build_pre: Prepare building the VM
Domain.build calls build_pre (which is also used for VM restore) to:
- Call wait_xen_free_mem to wait, if necessary, for the Xen memory scrubber to catch up reclaiming memory (a sketch of this polling loop follows further below). It:
  - calls Xenctrl.physinfo, which returns:
    - free_pages - the free and already scrubbed pages (available)
    - scrub_pages - the not yet scrubbed pages (not yet available)
  - repeats this until a timeout, as long as free_pages is lower than the required pages
    - unless scrub_pages is 0 (no scrubbing left to do)

  Note: free_pages is system-wide memory, not memory specific to a NUMA node.
  Because this check is not NUMA-aware, it is not sufficient to prevent the VM
  from being spread over all NUMA nodes in the case of a temporary node-specific
  memory shortage. It is planned to resolve this issue by claiming NUMA node
  memory during NUMA placement.
- Call the hypercall to set the timer mode
- Call the hypercall to set the number of vCPUs
- Call the numa_placement function as described in the NUMA feature description when the xe configuration option numa_placement is set to Best_effort (except when the VM has a hard CPU affinity):
match !Xenops_server.numa_placement with
| Any ->
()
| Best_effort ->
log_reraise (Printf.sprintf "NUMA placement") (fun () ->
if has_hard_affinity then
D.debug "VM has hard affinity set, skipping NUMA optimization"
else
numa_placement domid ~vcpus
~memory:(Int64.mul memory.xen_max_mib 1048576L)
)
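The wait_xen_free_mem step mentioned above boils down to a polling loop over Xenctrl.physinfo. The following is a hedged OCaml sketch of such a loop, not the actual xenopsd implementation; the function name, parameters, timeout handling and nativeint field conversions are illustrative assumptions.

```ocaml
(* Hedged sketch of a wait-for-scrubbed-memory loop; not the actual
   wait_xen_free_mem implementation. Poll Xenctrl.physinfo until enough
   scrubbed pages are free, stop early when nothing is left to scrub,
   or give up after a timeout. *)
let wait_for_free_pages xc ~required_pages ~timeout_s =
  let start = Unix.gettimeofday () in
  let rec loop () =
    let info = Xenctrl.physinfo xc in
    let free = Int64.of_nativeint info.Xenctrl.free_pages in
    let scrub = Int64.of_nativeint info.Xenctrl.scrub_pages in
    if free >= required_pages then
      true (* enough scrubbed memory is available *)
    else if scrub = 0L then
      false (* nothing left to scrub: waiting longer will not help *)
    else if Unix.gettimeofday () -. start > timeout_s then
      false (* timed out while the scrubber was still catching up *)
    else (
      Unix.sleepf 0.25 ; (* give the scrubber time to make progress *)
      loop ()
    )
  in
  loop ()
```

The early exit when scrub_pages is 0 matters: if nothing is left to scrub, waiting longer cannot produce more free memory.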
NUMA placement
build_pre passes the domid, the number of vCPUs and xen_max_mib to the numa_placement function to run the algorithm to find the best NUMA placement. When it returns a NUMA node to use, it calls the Xen hypercalls to set the vCPU affinity to this NUMA node:
let vm = NUMARequest.make ~memory ~vcpus in
let nodea =
match !numa_resources with
| None ->
Array.of_list nodes
| Some a ->
Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
in
numa_resources := Some nodea ;
Softaffinity.plan ~vm host nodea
By using the default auto_node_affinity feature of Xen, setting the vCPU affinity causes the Xen hypervisor to align the NUMA node affinity for memory allocations with the vCPU affinity of the domain.
Summary: This passes the information to the hypervisor that memory
allocation for this domain should preferably be done from this NUMA node.
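To make the affinity step concrete, here is a hedged OCaml sketch that applies a CPU mask to every vCPU of the domain using the plain Xenctrl.vcpu_setaffinity binding. The helper name and parameters are illustrative; whether this maps to hard or soft affinity depends on the binding used, and xenopsd itself derives the mask from the NUMA node chosen by Softaffinity.plan.

```ocaml
(* Illustrative only: apply the same pCPU mask to every vCPU of the domain.
   The helper name and parameters are hypothetical. *)
let set_node_affinity xc ~domid ~vcpus ~node_pcpus ~nr_pcpus =
  (* the mask is true for every pCPU that belongs to the chosen NUMA node *)
  let mask = Array.init nr_pcpus (fun cpu -> List.mem cpu node_pcpus) in
  for vcpu = 0 to vcpus - 1 do
    Xenctrl.vcpu_setaffinity xc domid vcpu mask
  done
```

With Xen's auto_node_affinity enabled (the default), setting the vCPU affinity in this way is enough for the hypervisor to derive a matching node affinity for memory allocations.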
Invoke the xenguest program
With the preparation in build_pre completed, Domain.build calls the xenguest function to invoke the xenguest program to build the domain.
Notes on future design improvements
The Xen domain feature flag domain->auto_node_affinity can be disabled by calling xc_domain_node_setaffinity() to set a specific NUMA node affinity in special cases.
This can be used, for example, when there might not be enough memory on the preferred NUMA node, and there are other NUMA nodes (in the same CPU package) to use (reference).
xenguest
As part of starting a new domain in VM_build, xenopsd calls xenguest.
When multiple domain build threads run in parallel, multiple instances of xenguest also run in parallel:
flowchart
subgraph xenopsd VM_build[xenopsd VM_build micro#8209;ops]
direction LR
xenopsd1[Domain.build - Thread #1] --> xenguest1[xenguest #1]
xenopsd2[Domain.build - Thread #2] --> xenguest2[xenguest #2]
xenguest1 --> libxenguest
xenguest2 --> libxenguest2[libxenguest]
click xenopsd1 "../Domain.build/index.html"
click xenopsd2 "../Domain.build/index.html"
click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
libxenguest --> Xen[Xen<br>Hypervisor]
libxenguest2 --> Xen
end
About xenguest
xenguest is called by the xenopsd Domain.build function to perform the build phase for new VMs, which is part of the xenopsd VM.start operation.
xenguest was created as a separate program due to issues with libxenguest:
- It wasn't threadsafe: fixed, but it still uses a per-call global struct.
- It had an incompatible licence: it is now licensed under the LGPL.
Those were fixed, but we still shell out to xenguest, which is currently carried in the patch queue for the Xen hypervisor packages, but could become an individual package once planned changes to the Xen hypercalls are stabilised.
Over time, xenguest has evolved to build more of the initial domain state.
Interface to xenguest
flowchart
subgraph xenopsd VM_build[xenopsd VM_build micro#8209;op]
direction TB
mode
domid
memmax
Xenstore
end
mode[--mode hvm_build] --> xenguest
domid --> xenguest
memmax --> xenguest
Xenstore[Xenstore platform data] --> xenguest
xenopsd must pass this information to xenguest to build a VM:
- The domain type to build for (HVM, PVH or PV), passed using the command line option --mode (e.g. --mode hvm_build for HVM)
- The domid of the created empty domain
- The amount of system memory of the domain
- A number of other parameters that are domain-specific
xenopsd uses the Xenstore to provide platform data:
- the vCPU affinity
- the vCPU credit2 weight/cap parameters
- whether the NX bit is exposed
- whether the viridian CPUID leaf is exposed
- whether the system has PAE or not
- whether the system has ACPI or not
- whether the system has nested HVM or not
- whether the system has an HPET or not
When called to build a domain, xenguest reads those and builds the VM accordingly.
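For illustration, here is a hedged OCaml sketch of how such an invocation could be wired up from xenopsd's side. The flag names and the helper are assumptions, not xenguest's documented interface; the command line carries only the domain type, domid and memory size, while the platform data listed above is read by xenguest from the Xenstore.

```ocaml
(* Hedged sketch only: flag names below are illustrative assumptions.
   Platform data (vCPU affinity, viridian, NX, PAE, ACPI, ...) is not
   passed on the command line; xenguest reads it from the Xenstore. *)
let invoke_xenguest ~domid ~mem_max_mib =
  let argv =
    [| "xenguest"
     ; "--mode"; "hvm_build"            (* domain type to build for *)
     ; "--domid"; string_of_int domid   (* the already created empty domain *)
     ; "--mem_max_mib"; Int64.to_string mem_max_mib |]
  in
  let pid =
    Unix.create_process argv.(0) argv Unix.stdin Unix.stdout Unix.stderr
  in
  (* xenopsd waits for the child to exit and checks its status *)
  match Unix.waitpid [] pid with
  | _, Unix.WEXITED 0 -> ()
  | _ -> failwith "xenguest failed"
```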
Walkthrough of the xenguest build mode
flowchart
subgraph xenguest[xenguest #8209;#8209;mode hvm_build domid]
direction LR
stub_xc_hvm_build[stub_xc_hvm_build#40;#41;] --> get_flags[
get_flags#40;#41; <#8209; Xenstore platform data
]
stub_xc_hvm_build --> configure_vcpus[
configure_vcpus#40;#41; #8209;> Xen hypercall
]
stub_xc_hvm_build --> setup_mem[
setup_mem#40;#41; #8209;> Xen hypercalls to setup domain memory
]
end
Based on the given domain type, the xenguest program calls a dedicated function for its build process. These are:
- stub_xc_hvm_build() for HVM,
- stub_xc_pvh_build() for PVH, and
- stub_xc_pv_build() for PV domains.
These domain build functions call these functions:
- get_flags() to get the platform data from the Xenstore
- configure_vcpus(), which uses the platform data from the Xenstore to configure vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap (max % of pCPU time for throttling)
- The setup_mem function for the given VM type.
The function hvm_build_setup_mem()
For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory layout of the new domain, allocating the required memory and populating it for the new domain. It must:
- Derive the e820 memory layout of the system memory of the domain, including memory holes depending on PCI passthrough and vGPU flags
- Load the BIOS/UEFI firmware images
- Store the final MMIO hole parameters in the Xenstore
- Call the libxenguest function xc_dom_boot_mem_init() (see below)
- Call construct_cpuid_policy() to apply the CPUID featureset policy
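As a rough illustration of the first step, here is a hedged OCaml sketch of a simplified e820-style split of guest RAM around an MMIO hole below 4 GiB. The types and the policy are assumptions for illustration only; the real derivation in xenguest is C code and also accounts for PCI passthrough and vGPU requirements.

```ocaml
(* Simplified sketch, not xenguest's actual e820 code: place as much RAM as
   fits below the MMIO hole, reserve the hole up to 4 GiB, and put the rest
   of the RAM above 4 GiB. *)
type e820_entry = {addr: int64; size: int64; typ: [`Ram | `Reserved]}

let four_gib = 0x1_0000_0000L

let derive_e820 ~memory_bytes ~mmio_hole_bytes =
  let lowmem_top = Int64.sub four_gib mmio_hole_bytes in
  let low = min memory_bytes lowmem_top in
  let high = Int64.sub memory_bytes low in
  [ {addr= 0L; size= low; typ= `Ram}
  ; {addr= lowmem_top; size= mmio_hole_bytes; typ= `Reserved} ]
  @ (if high > 0L then [{addr= four_gib; size= high; typ= `Ram}] else [])
```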
The function xc_dom_boot_mem_init()
flowchart LR
subgraph xenguest
hvm_build_setup_mem[hvm_build_setup_mem#40;#41;]
end
subgraph libxenguest
hvm_build_setup_mem --> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;]
xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meminit_hvm#40;#41;]
click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank
click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank
end
hvm_build_setup_mem() calls xc_dom_boot_mem_init() to allocate and populate the domain's system memory.
It calls meminit_hvm() to loop over the vmemranges of the domain for mapping the system RAM of the guest from the Xen hypervisor heap. Its goals are:
- Attempt to allocate 1GB superpages when possible
- Fall back to 2MB pages when 1GB allocation failed
- Fall back to 4k pages when both failed
It uses the hypercall XENMEM_populate_physmap to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.
XENMEM_populate_physmap
(https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071):
- Uses construct_memop_from_reservation to convert the arguments for allocating a page from struct xen_memory_reservation to struct memop_args
- Sets flags and calls functions according to the arguments
- Allocates the requested page at the most suitable place
  - depending on passed flags, allocate on a specific NUMA node
  - else, if the domain has node affinity, on the affine nodes
  - also in the most suitable memory zone within the NUMA node
- Falls back to less desirable places if this fails
  - or fails for “exact” allocation requests
- When no pages of the requested size are free, it splits larger superpages into pages of the requested size
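To illustrate the fallback order described above in a language-neutral way, here is a hedged OCaml sketch. The real logic lives in C in libxenguest and Xen; try_populate is a stand-in for issuing a XENMEM_populate_physmap request of a given order.

```ocaml
(* Sketch only: try allocation orders from largest to smallest, as
   meminit_hvm() does for 1 GiB (order 18), 2 MiB (order 9) and 4 KiB
   (order 0) pages. [try_populate] stands in for a XENMEM_populate_physmap
   request and returns true on success. *)
let populate_with_fallback ~(try_populate : order:int -> bool) =
  let rec go = function
    | [] -> Error "allocation failed even with 4 KiB pages"
    | order :: rest -> if try_populate ~order then Ok order else go rest
  in
  go [18; 9; 0]
```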
For more details on the VM build step involving xenguest and the Xen side, see:
https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest