Operation Walk-Throughs

Let’s trace through interesting operations to see how the whole system works.

  • Starting a VM

    Complete walkthrough of starting a VM, from receiving the request to unpause.

    • Building a VM

      After VM_create, VM_build builds the core of the domain (vCPUs, memory).

      • VM_build μ-op

        Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).

      • Domain.build

        Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.

      • xenguest

        Build the VM: allocate and populate the domain's system memory.

    • Migrating a VM

      Walkthrough of migrating a VM from one host to another.

      • Live Migration

        Sequence diagram of the process of Live Migration.

        Inspiration for other walk-throughs:

        • Shutting down a VM and waiting for it to happen
        • A VM wants to reboot itself
        • A disk is hotplugged
        • A disk refuses to hotunplug
        • A VM is suspended

        Subsections of Walk-throughs

        Walkthrough: Starting a VM

        A Xenopsd client wishes to start a VM. They must first tell Xenopsd the VM configuration to use. A VM configuration is broken down into objects:

        • VM: A device-less Virtual Machine
        • VBD: A virtual block device for a VM
        • VIF: A virtual network interface for a VM
        • PCI: A virtual PCI device for a VM

        Treating devices as first-class objects is convenient because we wish to expose operations on the devices such as hotplug, unplug, eject (for removable media), carrier manipulation (for network interfaces) etc.

        The “add” functions in the Xenopsd interface cause Xenopsd to create the objects.

        In the case of xapi, there are a set of functions which convert between the XenAPI objects and the Xenopsd objects. The two interfaces are slightly different because they have different expected users:

        • the XenAPI has many clients which are updated on long release cycles. The main property needed is backwards compatibility, so that new releases of xapi remain compatible with these older clients. Quite often, we will choose to “grandfather in” some poorly designed interface simply because we wish to avoid imposing churn on 3rd parties.
        • the Xenopsd API clients are all open-source and are part of the xapi-project. These clients can be updated as the API is changed. The main property needed is to keep the interface clean, so that it properly hides the complexity of dealing with Xen from other components.

        The Xenopsd “VM.add” function has code like this:

        	let add' x =
        		debug "VM.add %s" (Jsonrpc.to_string (rpc_of_t x));
        		DB.write x.id x;
        		let module B = (val get_backend () : S) in
        		B.VM.add x;
        		x.id

        This function does 2 things:

        • it stores the VM configuration in the “database”
        • it tells the “backend” that the VM exists

        The Xenopsd database is really a set of config files in the filesystem. All objects belonging to a VM (recall we only have VMs, VBDs, VIFs, PCIs and not stand-alone entities like disks) are placed into a subdirectory named after the VM e.g.:

        # ls /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701
        config	vbd.xvda  vbd.xvdb
        # cat /run/nonpersistent/xenopsd/xenlight/VM/7b719ce6-0b17-9733-e8ee-dbc1e6e7b701/config
        {"id": "7b719ce6-0b17-9733-e8ee-dbc1e6e7b701", "name": "fedora",
         ...
        }

        Xenopsd doesn’t have as persistent a notion of a VM as xapi, it is expected that all objects are deleted when the host is rebooted. However the objects should be persisted over a simple Xenopsd restart, which is why the objects are stored in the filesystem.

        Aside: it would probably be more appropriate to store the metadata in Xenstore since this has the exact object lifetime we need. This will require a more performant Xenstore to realise.

        Every running Xenopsd process is linked with a single backend. Currently backends exist for:

        • Xen via libxc, libxenguest and xenstore
        • Xen via libxl, libxc and xenstore
        • Xen via libvirt
        • KVM by direct invocation of qemu
        • Simulation for testing

        From here we shall assume the use of the “Xen via libxc, libxenguest and xenstore” (a.k.a. “Xenopsd classic”) backend.

        The backend VM.add function checks whether the VM we have to manage already exists – and if it does then it ensures the Xenstore configuration is intact. This Xenstore configuration is important because at any time a client can query the state of a VM with VM.stat and this relies on certain Xenstore keys being present.

        Once the VM metadata has been registered with Xenopsd, the client can call VM.start. Like all potentially-blocking Xenopsd APIs, this function returns a Task id. Please refer to the Task handling design for a general overview of how tasks are handled.

        Clients can poll the state of a task by calling TASK.stat but most clients will prefer to use the event system instead. Please refer to the Event handling design for a general overview of how events are handled.

        The event model is similar to the XenAPI: clients call a blocking UPDATES.get passing in a token which represents the point in time when the last UPDATES.get returned. The call blocks until some objects have changed state, and these object ids are returned (NB in the XenAPI the current object states are returned). The client must then call the relevant “stat” function, in this case TASK.stat.
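The protocol can be sketched as a small client loop. This is a Python sketch, not the real OCaml client: the two callables stand in for the UPDATES.get and TASK.stat RPCs, and the token handling is the part that matters:

```python
def watch(updates_get, task_stat, interesting_task):
    """Block on the update stream until `interesting_task` reaches a
    final state.  `updates_get(token)` blocks and returns
    (changed_ids, new_token); note it returns *ids*, not states, so the
    client must call the matching stat function itself."""
    token = ""                      # empty token = "from the beginning"
    while True:
        changed_ids, token = updates_get(token)   # blocking call
        if interesting_task in changed_ids:
            task = task_stat(interesting_task)    # fetch current state
            if task["state"] in ("Completed", "Failed"):
                return task
```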

        The client will be able to see the task make progress and use this to – for example – populate a progress bar in a UI. If the client needs to cancel the task then it can call TASK.cancel; again, see the Task handling design to understand how this is implemented.

        When the Task has completed successfully, then calls to *.stat will show:

        • the power state is Paused
        • exactly one valid Xen domain id
        • all VBDs have active = plugged = true
        • all VIFs have active = plugged = true
        • all PCI devices have plugged = true
        • at least one active console
        • a valid start time
        • valid “targets” for memory and vCPU

        Note: before a Task completes, calls to *.stat will show partial updates. E.g. the power state may be paused, but no disk may have been plugged. UI clients must choose whether they are happy displaying this in-between state or whether they wish to hide it and pretend the whole operation has happened transactionally. In particular, when a client wishes to perform side-effects in response to xenopsd state changes (for example, to clean up an external resource when a VIF becomes unplugged), it must be very careful to avoid responding to these in-between states. Generally, it is safest to passively report these values without driving things directly from them.

        Note: the Xenopsd implementation guarantees that, if it is restarted at any point during the start operation, on restart the VM state shall be “fixed” by either (i) shutting down the VM; or (ii) ensuring the VM is intact and running.

        In the case of xapi, every Xenopsd Task id is bound one-to-one with a XenAPI task by the function sync_with_task. The function update_task is called when xapi receives a notification that a Xenopsd Task has changed state, and updates the corresponding XenAPI task. Xapi launches exactly one thread per Xenopsd instance (“queue”) to monitor for background events via the function events_watch, while each thread performing a XenAPI call waits for its specific Task to complete via the function event_wait.

        It is the responsibility of the client to call TASK.destroy when the Task is no longer needed. Xenopsd won’t destroy the task because it contains the success/failure result of the operation which is needed by the client.

        What happens when Xenopsd receives a VM.start request?

        When Xenopsd receives the request it adds it to the appropriate per-VM queue via the function queue_operation. To understand this and other internal details of Xenopsd, consult the architecture description. The queue_operation_int function looks like this:

        let queue_operation_int dbg id op =
        	let task = Xenops_task.add tasks dbg (fun t -> perform op t; None) in
        	Redirector.push id (op, task);
        	task

        The “task” is a record containing Task metadata plus a “do it now” function which will be executed by a thread from the thread pool. The module Redirector takes care of:

        • pushing operations to the right queue
        • ensuring at most one worker thread is working on a VM’s operations
        • reducing the queue size by coalescing items together
        • providing a diagnostics interface
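The queueing invariants can be illustrated with a toy single-threaded model. This is a Python sketch, not the real Redirector module (which dispatches to a worker-thread pool), and the coalescing shown here is just one simple policy; the invariants are per-VM FIFO order and "at most one worker per VM":

```python
from collections import defaultdict, deque

class Redirector:
    """Toy model of the per-VM operation queues (illustrative sketch)."""

    def __init__(self):
        self.queues = defaultdict(deque)   # vm_id -> queue of (op, task)
        self.busy = set()                  # vm_ids with a worker attached

    def push(self, vm_id, op, task):
        q = self.queues[vm_id]
        # Coalesce: drop the new item if the tail is the same operation,
        # so repeated signals cannot grow the queue without bound.
        if q and q[-1][0] == op:
            return
        q.append((op, task))

    def pop(self, vm_id):
        # At most one worker may work on a VM's operations at a time.
        if vm_id in self.busy or not self.queues[vm_id]:
            return None
        self.busy.add(vm_id)
        return self.queues[vm_id].popleft()

    def finished(self, vm_id):
        self.busy.discard(vm_id)
```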

        Once a thread from the worker pool becomes free, it will execute the “do it now” function. In the example above this is perform op t where op is VM_start vm and t is the Task. The function perform_exn has fragments like this:

          | VM_start (id, force) -> (
              debug "VM.start %s (force=%b)" id force ;
              let power = (B.VM.get_state (VM_DB.read_exn id)).Vm.power_state in
              match power with
              | Running ->
                  info "VM %s is already running" id
              | _ ->
                  perform_atomics (atomics_of_operation op) t ;
                  VM_DB.signal id
            )

        Each “operation” (e.g. VM_start vm) is decomposed into “micro-ops” by the function atomics_of_operation, where the micro-ops are small building-block actions common to the higher-level operations. Each operation corresponds to a flat list of “micro-ops”, with no if/then/else control flow. Some of the “micro-ops” may be a no-op depending on the VM configuration (for example a PV domain may not need a qemu). In the case of VM_start vm, the list produced by atomics_of_operation begins with the VM_hook_script, VM_create and VM_build micro-ops:

                dequarantine_ops vgpus
              ; [
                  VM_hook_script
                    (id, Xenops_hooks.VM_pre_start, Xenops_hooks.reason__none)
                ; VM_create (id, None, None, no_sharept)
                ; VM_build (id, force)
                ]
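The shape of this decomposition can be sketched in Python (the real function is OCaml and emits many more micro-ops; this only illustrates the "flat list, no control flow" structure, with illustrative tuples standing in for the micro-op constructors):

```python
def atomics_of_operation(op, vm_id, force=False):
    """Decompose a high-level operation into a flat list of micro-ops.

    No if/then/else over VM state: the list is built unconditionally,
    and individual micro-ops may turn out to be no-ops for a given VM
    configuration (e.g. no qemu for a PV domain)."""
    if op == "VM_start":
        return (
            [("VM_hook_script", vm_id, "VM_pre_start")]
            + [("VM_create", vm_id)]
            + [("VM_build", vm_id, force)]
            # ... followed by set_active/plug micro-ops for VBDs and
            # VIFs, the device model, PCI devices, and finally the
            # "mark the domain as alive" step (elided here).
        )
    return []
```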

        This is the complete sequence of micro-ops:

        1. run the “VM_pre_start” scripts

        The VM_hook_script micro-op runs the corresponding “hook” scripts. The code is all in the Xenops_hooks module and looks for scripts in the hardcoded path /etc/xapi.d.

        2. create a Xen domain

        The VM_create micro-op calls the VM.create function in the backend. In the classic Xenopsd backend, the VM.create_exn function must

        1. check if we’re creating a domain for a fresh VM or resuming an existing one: if it’s a resume then the domain configuration stored in the VmExtra database table must be used
        2. ask squeezed to create a memory “reservation” big enough to hold the VM memory. Unfortunately the domain cannot be created until the memory is free because domain create often fails in low-memory conditions. This means the “reservation” is associated with our “session” with squeezed; if Xenopsd crashes and restarts the reservation will be freed automatically.
        3. create the Domain via the libxc hypercall Xenctrl.domain_create
        4. call generate_create_info() to store the platform data (vCPUs, etc) in the domain’s Xenstore tree. xenguest then uses this in the build phase (see below) to build the domain.
        5. “transfer” the squeezed reservation to the domain such that squeezed will free the memory if the domain is destroyed later
        6. compute and set an initial balloon target depending on the amount of memory reserved (recall we ask for a range between dynamic_min and dynamic_max)
        7. apply the “suppress spurious page faults” workaround if requested
        8. set the “machine address size”
        9. “hotplug” the vCPUs. This operates a lot like memory ballooning – Xen creates lots of vCPUs and then the guest is asked to only use some of them. Every VM therefore starts with the “VCPUs_max” setting and co-operative hotplug is used to reduce the number. Note there is no enforcement mechanism: a VM which cheats and uses too many vCPUs would have to be caught by looking at the performance statistics.
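The reservation dance in steps 2, 3 and 5 is the crash-safety core of VM.create_exn. A Python sketch (helper names are illustrative; the real code speaks squeezed's own RPC): until the reservation is transferred to the domid it belongs to our session with squeezed, so a xenopsd crash frees it automatically.

```python
def create_domain(squeezed, xenctrl, session, min_kib, max_kib):
    """Sketch: reserve memory, create the domain, transfer the reservation."""
    reservation = squeezed.reserve(session, min_kib, max_kib)
    try:
        domid = xenctrl.domain_create()
        # From now on the memory is accounted to the domain: squeezed
        # frees it when the domain is destroyed, not when we disconnect.
        squeezed.transfer(reservation, domid)
        return domid
    except Exception:
        squeezed.cancel(reservation)   # explicit cleanup on failure
        raise
```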

        3. build the domain

        The build phase waits, if necessary, for the Xen memory scrubber to catch up reclaiming memory, runs NUMA placement, sets vCPU affinity and invokes the xenguest to build the system memory layout of the domain. See the walk-through of the VM_build μ-op for details.

        4. mark each VBD as “active”

        VBDs and VIFs are said to be “active” when they are intended to be used by a particular VM, even if the backend/frontend connection hasn’t been established, or has been closed. If someone calls VBD.stat or VIF.stat then the result includes both “active” and “plugged”, where “plugged” is true if the frontend/backend connection is established. For example xapi will set VBD.currently_attached to “active || plugged”. The “active” flag is conceptually very similar to the traditional “online” flag (which is not documented in the upstream Xen tree as of Oct/2014 but really should be) except that on unplug, one would set the “online” key to “0” (false) first before initiating the hotunplug. By contrast the “active” flag is set to false after the unplug i.e. “set_active” calls bracket plug/unplug. If the “active” flag was set before the unplug attempt then as soon as the frontend/backend connection is removed clients would see the VBD as completely dissociated from the VM – this would be misleading because Xenopsd will not have had time to use the storage API to release locks on the disks. By cleaning up before setting “active” to false, clients can be assured that the disks are now free to be reassigned.

        5. handle non-persistent disks

        A non-persistent disk is one which is reset to a known-good state on every VM start. The VBD_epoch_begin is the signal to perform any necessary reset.

        6. plug VBDs

        The VBD_plug micro-op will plug the VBD into the VM. Every VBD is plugged in a carefully-chosen order. Generally, plug order is important for all types of devices. For VBDs, we must work around the deficiency in the storage interface where a VDI, once attached read/only, cannot be attached read/write. Since it is legal to attach the same VDI with multiple VBDs, we must plug them in such that the read/write VBDs come first. From the guest’s point of view the order we plug them doesn’t matter because they are indexed by the Xenstore device id (e.g. 51712 = xvda).
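The ordering constraint reduces to a sort. Since Python's sort is stable, a single "read/write before read/only" key expresses it while preserving the original device order within each group (sketch with illustrative field names, not the real VBD record):

```python
def plug_order(vbds):
    """Order VBDs so read/write plugs happen before read/only ones.

    Needed because a VDI attached read/only first cannot later be
    attached read/write; guests are unaffected since devices are
    indexed by the Xenstore device id (e.g. 51712 = xvda), not by
    plug order."""
    return sorted(vbds, key=lambda vbd: 0 if vbd["mode"] == "rw" else 1)
```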

        The function VBD.plug will

        • call VDI.attach and VDI.activate in the storage API to make the devices ready (start the tapdisk processes etc)
        • add the Xenstore frontend/backend directories containing the block device info
        • add the extra xenstore keys returned by the VDI.attach call that are needed for SCSIid passthrough which is needed to support VSS
        • write the VBD information to the Xenopsd database so that future calls to VBD.stat can be told about the associated disk (this is needed so clients like xapi can cope with CD insert/eject etc)
        • if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

        The Xenstore keys are written by the functions Device.Vbd.add_async and Device.Vbd.add_wait. In a Linux domain (such as dom0) when the backend directory is created, the kernel creates a “backend device”. Creating any device will cause a kernel UEVENT to fire which is picked up by udev. The udev rules run a script whose only job is to stat(2) the device (from the “params” key in the backend) and write the major and minor number to Xenstore for blkback to pick up. (Aside: FreeBSD doesn’t do any of this, instead the FreeBSD kernel module simply opens the device in the “params” key). The script also writes the backend key “hotplug-status=connected”. We currently wait for this key to be written so that later calls to VBD.stat will return with “plugged=true”. If the call returns before this key is written then sometimes we receive an event, call VBD.stat and conclude erroneously that a spontaneous VBD unplug occurred.
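The wait for the udev script can be sketched as follows. This is a Python sketch with an illustrative xenstore-read callable; the real code registers a Xenstore watch on the key rather than polling:

```python
import time

def wait_hotplug_connected(xs_read, backend_path, timeout=60.0, poll=0.1):
    """Wait for the udev script to write hotplug-status=connected.

    Returning early risks a later VBD.stat misreading the half-plugged
    device as a spontaneous unplug."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if xs_read(backend_path + "/hotplug-status") == "connected":
            return True
        time.sleep(poll)   # real code blocks on a Xenstore watch instead
    return False
```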

        7. mark each VIF as “active”

        This is for the same reason as VBDs are marked “active”.

        8. plug VIFs

        Again, the order matters. Unlike VBDs, there is no read/write read/only constraint and the devices have unique indices (0, 1, 2, …) but Linux kernels have often (always?) ignored the actual index and instead relied on the order of results from the xenstore-ls listing. The order that xenstored returns the items happens to be the order the nodes were created so this means that (i) xenstored must continue to store directories as ordered lists rather than maps (which would be more efficient); and (ii) Xenopsd must make sure to plug the vifs in the same order. Note that relying on ethX device numbering has always been a bad idea but is still common. I bet if you change this, many tests will suddenly start to fail!

        The function VIF.plug_exn will

        • compute the port locking configuration required and write this to a well-known location in the filesystem where it can be read from the udev scripts. This really should be written to Xenstore instead, since this scheme doesn’t work with driver domains.
        • add the Xenstore frontend/backend directories containing the network device info
        • write the VIF information to the Xenopsd database so that future calls to VIF.stat can be told about the associated network
        • if the qemu is going to be in a different domain to the storage, a frontend device in the qemu domain is created.

        Similarly to the VBD case, the function Device.Vif.add will write the Xenstore keys and wait for the “hotplug-status=connected” key. We do this because we cannot apply the port locking rules until the backend device has been created, and we cannot know the rules have been applied until after the udev script has written the key. If we didn’t wait for it then the VM might execute without all the port locking properly configured.

        9. create the device model

        The VM_create_device_model micro-op will create a qemu device model if

        • the VM is HVM; or
        • the VM uses a PV keyboard or mouse (since only qemu currently has backend support for these devices).
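The two conditions above reduce to a small predicate (sketch; the field names are illustrative, not the real Vm.t record):

```python
def needs_device_model(vm):
    """True iff a qemu device model must be created for this VM:
    HVM guests always need one; PV guests only if they use a PV
    keyboard or mouse (qemu is the only backend for those)."""
    return vm["hvm"] or vm["pv_keyboard"] or vm["pv_mouse"]
```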

        The function VM.create_device_model_exn will

        • (if using a qemu stubdom) it will create and build the qemu domain
        • compute the necessary qemu arguments and launch it.

        Note that qemu (aka the “device model”) is created after the VIFs and VBDs have been plugged but before the PCI devices have been plugged. Unfortunately qemu traditional infers the needed emulated hardware by inspecting the Xenstore VBD and VIF configuration and assuming that we want one emulated device per PV device, up to the natural limits of the emulated buses (i.e. there can be at most 4 IDE devices: {primary,secondary}{master,slave}). Not only does this create an ordering dependency that needn’t exist – and which impacts migration downtime – but it also completely ignores the plain fact that, on a Xen system, qemu can be in a different domain than the backend disk and network devices. This hack only works because we currently run everything in the same domain. There is an option (off by default) to list the emulated devices explicitly on the qemu command-line. If we switch to this by default then we ought to be able to start up qemu early, as soon as the domain has been created (qemu will need to know the domain id so it can map the I/O request ring).

        10. plug PCI devices

        PCI devices are treated differently to VBDs and VIFs. If we are attaching the device to an HVM guest then instead of relying on the traditional Xenstore frontend/backend state machine we instead send RPCs to qemu requesting they be hotplugged. Note the domain is paused at this point, but qemu still supports PCI hotplug/unplug. The reasons why this doesn’t follow the standard Xenstore model are known only to the people who contributed this support to qemu. Again the order matters because it determines the position of the virtual device in the VM.

        Note that Xenopsd doesn’t know anything about the PCI devices; concepts such as “GPU groups” belong to higher layers, such as xapi.

        11. mark the domain as alive

        A design principle of Xenopsd is that it should tolerate failures such as being suddenly restarted. It guarantees to always leave the system in a valid state, in particular there should never be any “half-created VMs”. We achieve this for VM start by exploiting the mechanism which is necessary for reboot. When a VM wishes to reboot it causes the domain to exit (via SCHEDOP_shutdown) with a “reason code” of “reboot”. When Xenopsd sees this event VM_check_state operation is queued. This operation calls VM.get_domain_action_request to ask the question, “what needs to be done to make this VM happy now?”. The implementation checks the domain state for shutdown codes and also checks a special Xenopsd Xenstore key. When Xenopsd creates a Xen domain it sets this key to “reboot” (meaning “please reboot me if you see me”) and when Xenopsd finishes starting the VM it clears this key. This means that if Xenopsd crashes while starting a VM, the new Xenopsd will conclude that the VM needs to be rebooted and will clean up the current domain and create a fresh one.
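The trick can be sketched in a few lines (Python sketch; the key name is illustrative, not the real Xenstore path). The same mechanism serves both guest-requested reboots and crash recovery during VM.start:

```python
def get_domain_action_request(shutdown_code, xenstore_key):
    """What needs to be done to make this VM happy now?"""
    if shutdown_code is not None:
        return shutdown_code       # guest exited: e.g. "reboot", "poweroff"
    if xenstore_key == "reboot":
        # Domain create set this key; VM.start clears it on success.
        # If it is still present, xenopsd died mid-start: clean up the
        # half-built domain and create a fresh one.
        return "reboot"
    return None                    # VM is happy, nothing to do

def start_vm(xs):
    xs["action-request"] = "reboot"   # before doing any real work
    # ... all the VM_start micro-ops run here ...
    del xs["action-request"]          # only reached on success
</```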

        12. unpause the domain

        A Xenopsd VM.start will always leave the domain paused, so strictly speaking this is a separate “operation” queued by the client (such as xapi) after the VM.start has completed. The function VM.unpause is reassuringly simple:

        		if di.Xenctrl.total_memory_pages = 0n then raise (Domain_not_built);
        		Domain.unpause ~xc di.Xenctrl.domid;
        		Opt.iter
        			(fun stubdom_domid ->
        				Domain.unpause ~xc stubdom_domid
        			) (get_stubdom ~xs di.Xenctrl.domid)

        Building a VM


        Walk-through documents for the VM_build phase:

        • VM_build μ-op

          Overview of the VM_build μ-op (runs after the VM_create μ-op created the domain).

        • Domain.build

          Prepare the build of a VM: Wait for scrubbing, do NUMA placement, run xenguest.

        • xenguest

          Build the VM: allocate and populate the domain's system memory.

        Subsections of Building a VM

        VM_build micro-op

        Overview

        On Xen, Xenctrl.domain_create creates an empty domain and returns the domain ID (domid) of the new domain to xenopsd.

        In the build phase, the xenguest program is called to create the system memory layout of the domain, set vCPU affinity and a lot more.

        The VM_build micro-op collects the VM build parameters and calls VM.build, which calls VM.build_domain, which calls VM.build_domain_exn which calls Domain.build:

        flowchart
        subgraph xenopsd VM_build[xenopsd: VM_build micro#8209;op]
        direction LR
        VM_build --> VM.build
        VM.build --> VM.build_domain
        VM.build_domain --> VM.build_domain_exn
        VM.build_domain_exn --> Domain.build
        click VM_build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/lib/xenops_server.ml#L2255-L2271" _blank
        click VM.build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2290-L2291" _blank
        click VM.build_domain "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2250-L2288" _blank
        click VM.build_domain_exn "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/xenops_server_xen.ml#L2024-L2248" _blank
        click Domain.build "
        https://github.com/xapi-project/xen-api/blob/83555067/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        end

        The function VM.build_domain_exn must:

        1. Run pygrub (or eliloader) to extract the kernel and initrd, if necessary

        2. Call Domain.build to

          • optionally run NUMA placement and
          • invoke xenguest to set up the domain memory.

          See the walk-through of the Domain.build function for more details on this phase.

        3. Apply the cpuid configuration

        4. Store the current domain configuration on disk – it’s important to know the difference between the configuration you started with and the configuration you would use after a reboot because some properties (such as maximum memory and vCPUs) are fixed on create.

        Domain.build

        Overview

        flowchart LR
        subgraph xenopsd VM_build[
          xenopsd thread pool with two VM_build micro#8209;ops:
          During parallel VM_start, many threads run this in parallel!
        ]
        direction LR
        build_domain_exn[
          VM.build_domain_exn
          from thread pool Thread #1
        ]  --> Domain.build
        Domain.build --> build_pre
        build_pre --> wait_xen_free_mem
        build_pre -->|if NUMA/Best_effort| numa_placement
        Domain.build --> xenguest[Invoke xenguest]
        click Domain.build "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        click build_domain_exn "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
        click wait_xen_free_mem "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
        click numa_placement "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
        click build_pre "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
        click xenguest "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
        
        build_domain_exn2[
          VM.build_domain_exn
          from thread pool Thread #2]  --> Domain.build2[Domain.build]
        Domain.build2 --> build_pre2[build_pre]
        build_pre2 --> wait_xen_free_mem2[wait_xen_free_mem]
        build_pre2 -->|if NUMA/Best_effort| numa_placement2[numa_placement]
        Domain.build2 --> xenguest2[Invoke xenguest]
        click Domain.build2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1111-L1210" _blank
        click build_domain_exn2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/xenops_server_xen.ml#L2222-L2225" _blank
        click wait_xen_free_mem2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L236-L272" _blank
        click numa_placement2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L862-L897" _blank
        click build_pre2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L899-L964" _blank
        click xenguest2 "https://github.com/xapi-project/xen-api/blob/master/ocaml/xenopsd/xc/domain.ml#L1139-L1146" _blank
        end

        VM.build_domain_exn calls Domain.build, which calls:

        • build_pre to prepare the build of a VM:
          • If the xe config numa_placement is set to Best_effort, invoke the NUMA placement algorithm.
        • xenguest to invoke the xenguest program to set up the domain’s system memory.

        build_pre: Prepare building the VM

        Domain.build calls build_pre (which is also used for VM restore) to:

        1. Call wait_xen_free_mem to wait, if necessary, for the Xen memory scrubber to catch up reclaiming memory. It:

          1. calls Xenctrl.physinfo, which returns:
            • hostinfo.free_pages - the free and already-scrubbed pages (available)
            • host.scrub_pages - the not-yet-scrubbed pages (not yet available)
          2. repeats this, until a timeout, as long as free_pages is lower than the required pages
            • unless scrub_pages is 0 (no scrubbing left to do)

          Note: free_pages is system-wide memory, not memory specific to a NUMA node. Because this is not NUMA-aware, in case of temporary node-specific memory shortage, this check is not sufficient to prevent the VM from being spread over all NUMA nodes. It is planned to resolve this issue by claiming NUMA node memory during NUMA placement.

        2. Call the hypercall to set the timer mode

        3. Call the hypercall to set the number of vCPUs

        4. Call the numa_placement function as described in the NUMA feature description when the xe configuration option numa_placement is set to Best_effort (except when the VM has a hard CPU affinity).

          match !Xenops_server.numa_placement with
          | Any ->
              ()
          | Best_effort ->
              log_reraise (Printf.sprintf "NUMA placement") (fun () ->
                  if has_hard_affinity then
                    D.debug "VM has hard affinity set, skipping NUMA optimization"
                  else
                    numa_placement domid ~vcpus
                      ~memory:(Int64.mul memory.xen_max_mib 1048576L)
              )
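The wait loop from step 1 can be sketched in Python (illustrative; the real wait_xen_free_mem is OCaml in xc/domain.ml, and physinfo here is any callable returning the free and to-be-scrubbed page counts):

```python
import time

def wait_xen_free_mem(physinfo, required_pages, timeout=64.0, poll=0.25):
    """Wait until Xen has scrubbed enough free pages for the domain.

    Gives up early if nothing is left to scrub (waiting cannot help),
    or when the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        free_pages, scrub_pages = physinfo()
        if free_pages >= required_pages:
            return True
        if scrub_pages == 0:
            return False   # no scrubbing in progress: waiting won't help
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```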

        NUMA placement

        build_pre passes the domid, the number of vCPUs and xen_max_mib to the numa_placement function to run the algorithm to find the best NUMA placement.

        When it returns a NUMA node to use, it calls the Xen hypercalls to set the vCPU affinity to this NUMA node:

          let vm = NUMARequest.make ~memory ~vcpus in
          let nodea =
            match !numa_resources with
            | None ->
                Array.of_list nodes
            | Some a ->
                Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
          in
          numa_resources := Some nodea ;
          Softaffinity.plan ~vm host nodea

        By using the default auto_node_affinity feature of Xen, setting the vCPU affinity causes the Xen hypervisor to activate NUMA node affinity for memory allocations to be aligned with the vCPU affinity of the domain.

        Summary: This passes the information to the hypervisor that memory allocation for this domain should preferably be done from this NUMA node.

        Invoke the xenguest program

        With the preparation in build_pre completed, Domain.build calls the xenguest function to invoke the xenguest program to build the domain.

        Notes on future design improvements

        The Xen domain feature flag domain->auto_node_affinity can be disabled by calling xc_domain_node_setaffinity() to set a specific NUMA node affinity in special cases:

        This can be used, for example, when there might not be enough memory on the preferred NUMA node, and there are other NUMA nodes (in the same CPU package) to use (reference).

        xenguest

        As part of starting a new domain in VM_build, xenopsd calls xenguest. When multiple domain build threads run in parallel, multiple instances of xenguest also run in parallel:

        flowchart
        subgraph xenopsd VM_build[xenopsd VM_build micro#8209;ops]
        direction LR
        xenopsd1[Domain.build - Thread #1] --> xenguest1[xenguest #1]
        xenopsd2[Domain.build - Thread #2] --> xenguest2[xenguest #2]
        xenguest1 --> libxenguest
        xenguest2 --> libxenguest2[libxenguest]
        click xenopsd1 "../Domain.build/index.html"
        click xenopsd2 "../Domain.build/index.html"
        click xenguest1 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
        click xenguest2 "https://github.com/xenserver/xen.pg/blob/XS-8/patches/xenguest.patch" _blank
        click libxenguest "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
        click libxenguest2 "https://github.com/xen-project/xen/tree/master/tools/libs/guest" _blank
        libxenguest --> Xen[Xen<br>Hypervisor]
        libxenguest2 --> Xen
        end

        About xenguest

        xenguest is called by the xenopsd Domain.build function to perform the build phase for new VMs, which is part of the xenopsd VM.start operation.

        xenguest was created as a separate program due to issues with libxenguest:

        • It wasn’t threadsafe: fixed, but it still uses a per-call global struct
        • It had an incompatible licence, but now licensed under the LGPL.

        Although these issues have since been fixed, we still shell out to xenguest. It is currently carried in the patch queue for the Xen hypervisor packages, but could become an individual package once planned changes to the Xen hypercalls are stabilised.

        Over time, xenguest has evolved to build more of the initial domain state.

        Interface to xenguest

        flowchart
        subgraph xenopsd VM_build[xenopsd&nbsp;VM_build&nbsp;micro#8209;op]
        direction TB
        mode
        domid
        memmax
        Xenstore
        end
        mode[--mode hvm_build] --> xenguest
        domid --> xenguest
        memmax --> xenguest
        Xenstore[Xenstore platform data] --> xenguest

        xenopsd must pass this information to xenguest to build a VM:

        • The domain type to build for (HVM, PVH or PV).
          • It is passed using the command line option --mode hvm_build.
        • The domid of the created empty domain,
        • The amount of system memory of the domain,
        • A number of other parameters that are domain-specific.

        xenopsd uses the Xenstore to provide platform data:

        • the vCPU affinity
        • the vCPU credit2 weight/cap parameters
        • whether the NX bit is exposed
        • whether the viridian CPUID leaf is exposed
        • whether the system has PAE or not
        • whether the system has ACPI or not
        • whether the system has nested HVM or not
        • whether the system has an HPET or not

        When called to build a domain, xenguest reads those and builds the VM accordingly.
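As an illustration of this flow, the sketch below turns a dictionary of platform data into boolean build flags. The key names and defaults here are assumptions for illustration; the real keys are read from the Xenstore by get_flags().

```python
# Illustrative parsing of Xenstore platform data into build flags.
# Key names and defaults are hypothetical, not xenguest's actual keys.

def parse_platform_flags(platform: dict) -> dict:
    def flag(key, default="0"):
        # Xenstore values are strings; "1" means enabled.
        return platform.get(key, default) == "1"
    return {
        "nx": flag("nx"),
        "viridian": flag("viridian"),
        "pae": flag("pae", "1"),
        "acpi": flag("acpi", "1"),
        "nested_hvm": flag("nested-virt"),
        "hpet": flag("hpet", "1"),
    }
```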

        Walkthrough of the xenguest build mode

        flowchart
        subgraph xenguest[xenguest&nbsp;#8209;#8209;mode&nbsp;hvm_build&nbsp;domid]
        direction LR
        stub_xc_hvm_build[stub_xc_hvm_build#40;#41;] --> get_flags[
            get_flags#40;#41;&nbsp;<#8209;&nbsp;Xenstore&nbsp;platform&nbsp;data
        ]
        stub_xc_hvm_build --> configure_vcpus[
            configure_vcpus#40;#41;&nbsp;#8209;>&nbsp;Xen&nbsp;hypercall
        ]
        stub_xc_hvm_build --> setup_mem[
            setup_mem#40;#41;&nbsp;#8209;>&nbsp;Xen&nbsp;hypercalls&nbsp;to&nbsp;setup&nbsp;domain&nbsp;memory
        ]
        end

        Based on the given domain type, the xenguest program calls a dedicated function for the build process of that domain type.

        These are:

        • stub_xc_hvm_build() for HVM,
        • stub_xc_pvh_build() for PVH, and
        • stub_xc_pv_build() for PV domains.

        These domain build functions call these functions:

        1. get_flags() to get the platform data from the Xenstore
        2. configure_vcpus() which uses the platform data from the Xenstore to configure vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap (max % pCPU time for throttling)
        3. The setup_mem function for the given VM type.
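The dispatch from the --mode option to the per-type build function can be sketched as a lookup table. The hvm_build mode name comes from the walkthrough above; the pvh_build and pv_build mode names, and the stub bodies, are assumptions for illustration.

```python
# Sketch of the mode dispatch: each --mode value selects a dedicated
# build function. Function names follow the walkthrough; the bodies are
# stubs, and the pvh_build/pv_build mode strings are assumed.

def stub_xc_hvm_build(domid): return ("hvm", domid)
def stub_xc_pvh_build(domid): return ("pvh", domid)
def stub_xc_pv_build(domid):  return ("pv", domid)

BUILDERS = {
    "hvm_build": stub_xc_hvm_build,
    "pvh_build": stub_xc_pvh_build,
    "pv_build": stub_xc_pv_build,
}

def build(mode, domid):
    return BUILDERS[mode](domid)
```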

        The function hvm_build_setup_mem()

        For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory layout of the new domain, allocating the required memory and populating it for the new domain. It must:

        1. Derive the e820 memory layout of the system memory of the domain including memory holes depending on PCI passthrough and vGPU flags.
        2. Load the BIOS/UEFI firmware images
        3. Store the final MMIO hole parameters in the Xenstore
        4. Call the libxenguest function xc_dom_boot_mem_init() (see below)
        5. Call construct_cpuid_policy() to apply the CPUID featureset policy

        The function xc_dom_boot_mem_init()

        flowchart LR
        subgraph xenguest
        hvm_build_setup_mem[hvm_build_setup_mem#40;#41;]
        end
        subgraph libxenguest
        hvm_build_setup_mem --> xc_dom_boot_mem_init[xc_dom_boot_mem_init#40;#41;]
        xc_dom_boot_mem_init -->|vmemranges| meminit_hvm[meminit_hvm#40;#41;]
        click xc_dom_boot_mem_init "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126" _blank
        click meminit_hvm "https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648" _blank
        end

        hvm_build_setup_mem() calls xc_dom_boot_mem_init() to allocate and populate the domain’s system memory.

        It calls meminit_hvm() to loop over the vmemranges of the domain for mapping the system RAM of the guest from the Xen hypervisor heap. Its goals are:

        • Attempt to allocate 1GB superpages when possible
        • Fall back to 2MB pages when 1GB allocation fails
        • Fall back to 4k pages when both fail

        It uses the hypercall XENMEM_populate_physmap to perform memory allocation and to map the allocated memory to the system RAM ranges of the domain.
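The fallback order described above can be sketched as follows. The allocate callback stands in for the XENMEM_populate_physmap hypercall and is hypothetical; extent orders 18, 9 and 0 correspond to 1GB, 2MB and 4k pages, counted in units of 4k pages.

```python
# Sketch of the allocation fallback in meminit_hvm(): try 1GB
# superpages, fall back to 2MB, then to 4k pages. allocate(order) is a
# hypothetical stand-in for the XENMEM_populate_physmap hypercall.

ORDERS = (18, 9, 0)  # 1GB, 2MB, 4k, as powers of two of 4k pages

def populate(range_pages, allocate):
    """Fill range_pages 4k pages, using the largest extent order that
    still fits the remaining range and that the allocator satisfies."""
    allocated = []
    remaining = range_pages
    while remaining > 0:
        for order in ORDERS:
            size = 1 << order
            if size <= remaining and allocate(order):
                allocated.append(order)
                remaining -= size
                break
        else:
            raise MemoryError("even 4k allocation failed")
    return allocated
```

For example, a 513-page range with an allocator that refuses 1GB extents ends up as one 2MB extent plus one 4k page.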

        https://github.com/xen-project/xen/blob/39c45c/xen/common/memory.c#L1022-L1071

        XENMEM_populate_physmap:

        1. Uses construct_memop_from_reservation to convert the arguments for allocating a page from struct xen_memory_reservation to struct memop_args.
        2. Sets flags and calls functions according to the arguments
        3. Allocates the requested page at the most suitable place
          • depending on passed flags, allocate on a specific NUMA node
          • else, if the domain has node affinity, on the affine nodes
          • also in the most suitable memory zone within the NUMA node
        4. Falls back to less desirable places if this fails
          • or fail for “exact” allocation requests
        5. When no pages of the requested size are free, it splits larger superpages into pages of the requested size.
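Step 5 resembles a buddy allocator: when no free block of the requested order exists, a larger block is split, leaving the unused halves free at smaller orders. Below is a minimal sketch of that idea, not Xen’s actual page allocator:

```python
# Buddy-style split, illustrating step 5: take the smallest free block
# of order >= the request and split it down, freeing one buddy per
# level. free maps order -> count of free blocks. Not Xen's allocator.

MAX_ORDER = 20  # assumed upper bound for this sketch

def buddy_alloc(free, order):
    o = order
    while free.get(o, 0) == 0:  # find a free block of sufficient order
        o += 1
        if o > MAX_ORDER:
            return False        # nothing large enough is free
    free[o] -= 1                # take that block
    while o > order:            # split down, freeing one half each time
        o -= 1
        free[o] = free.get(o, 0) + 1
    return True
```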

        For more details on the VM build step involving xenguest and Xen side see: https://wiki.xenproject.org/wiki/Walkthrough:_VM_build_using_xenguest

        Walkthrough: Migrating a VM

        At the end of this walkthrough, a sequence diagram of the overall process is included.

        Invocation

        The command to migrate the VM is dispatched by the autogenerated dispatch_call function from xapi/server.ml. For more information about the generated functions, have a look at the XAPI IDL model.

        The command triggers the operation VM_migrate, which uses many low-level atomic operations.

        The migrate command has several parameters such as:

        • Whether it should be started asynchronously,
        • Whether it should be forwarded to another host,
        • How its arguments should be marshalled, and so on.

        A new thread is created by xapi/server_helpers.ml to handle the command asynchronously. The helper thread checks if the command should be passed to the message forwarding layer in order to be executed on another host (the destination) or locally (if it is already at the destination host).

        It finally reaches xapi/api_server.ml, which posts the command to the message broker, the message switch. This is a JSON-RPC HTTP request sent over a Unix socket, used for communication between some of the XAPI daemons. In the case of migration, the message sent by XAPI is consumed by the xenopsd daemon, which does the job of migrating the VM.

        Overview

        The migration is an asynchronous task and a thread is created to handle this task. The task reference is returned to the client, which can then check its status until completion.

        As shown in the introduction, xenopsd fetches the VM_migrate operation from the message broker.

        All tasks specific to libxenctrl, xenguest and Xenstore are handled by the xenopsd xc backend.

        The entities that need to be migrated are: VDI, VIF, VGPU and PCI components.

        During the migration process, the destination domain is built with the same UUID as the original VM, except that the last part of the UUID is XXXXXXXX-XXXX-XXXX-XXXX-000000000001. The original domain, which keeps XXXXXXXX-XXXX-XXXX-XXXX-000000000000, is removed at the end.
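A tiny sketch of this UUID scheme (the helper is illustrative, not xenopsd code):

```python
# Illustrative helper for the migration UUID scheme: replace the final
# UUID field with the given suffix. Not part of xenopsd itself.

def migration_uuid(vm_uuid: str, suffix: str = "000000000001") -> str:
    prefix = vm_uuid.rsplit("-", 1)[0]
    return prefix + "-" + suffix
```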

        Preparing VM migration

        At specific places, xenopsd can execute hooks to run scripts. In case a pre-migrate script is in place, a command to run this script is sent to the original domain.

        Likewise, a command is sent to Qemu using the Qemu Machine Protocol (QMP) to check that the domain can be suspended (see xenopsd/xc/device_common.ml). After checking with Qemu that the VM can be suspended, the migration can begin.

        Importing metadata

        As with hooks, commands to the source domain are sent using stunnel, a daemon which is used as a wrapper to manage SSL-encrypted communication between two hosts in the same pool. To import the metadata, an XML-RPC command is sent to the original domain.

        Once imported, this gives us a reference ID and allows building the new domain on the destination using the temporary VM UUID XXXXXXXX-XXXX-XXXX-XXXX-000000000001, where XXX... is the reference ID of the original VM.

        Memory setup

        One of the first steps is the setup of the VM’s memory: The backend checks that there is no ballooning operation in progress. If there is, the migration could fail.

        Once the memory has been checked, the daemon gets the state of the VM (running, halted, …) and the backend retrieves the domain’s platform data (memory, vCPUs, etc.) from the Xenstore.

        Once this is complete, we can restore the VIFs and create the domain.

        The memory setup is the first synchronisation point; at this stage, everything is ready for the VM migration.

        Destination VM setup

        After receiving memory we can set up the destination domain. If we have a vGPU we need to kick off its migration process. We will need to wait for the acknowledgement that the GPU entry has been successfully initialized before starting the main VM migration.

        The receiver informs the sender using a handshake protocol that everything is set up and ready for save/restore.

        Destination VM restore

        VM restore is the low-level atomic operation VM.restore, represented by a function call to the backend. It uses xenguest, a low-level utility from the XAPI toolstack, to interact with the Xen hypervisor and libxc, and to send a migration request to the emu-manager.

        After sending the request, the results coming from emu-manager are collected by the main thread, which blocks until the results are received.

        During the live migration, emu-manager helps in ensuring the correct state transitions for the devices and handling the message passing for the VM as it’s moved between hosts. This includes making sure that the state of the VM’s virtual devices, like disks or network interfaces, is correctly moved over.

        Destination VM rename

        Once all operations are done, xenopsd renames the target VM from its temporary name to its real UUID. This operation is a low-level atomic VM.rename which takes care of updating the Xenstore on the destination host.

        Restoring devices

        Restoring devices starts by activating the VBDs using the low-level atomic operation VBD.set_active, which updates the Xenstore. VBDs that are read-write must be plugged before read-only ones. Once activated, the low-level atomic operation VBD.plug is called, and the VDIs are attached and activated.
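The ordering constraint (read-write VBDs before read-only ones) amounts to a stable partition of the VBD list; a minimal sketch, not xenopsd’s actual code:

```python
# Illustrative VBD plug ordering: read-write VBDs first, read-only
# ones after, preserving the relative order within each group.

def plug_order(vbds):
    """vbds: list of (name, mode) pairs, mode is 'rw' or 'ro'."""
    rw = [v for v in vbds if v[1] == "rw"]
    ro = [v for v in vbds if v[1] == "ro"]
    return rw + ro
```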

        The next devices are the VIFs, which are set active (VIF.set_active) and plugged (VIF.plug). If there are VGPUs, we set them active now using the atomic operation VGPU.set_active.

        Creating the device model

        create_device_model configures qemu-dm and starts it. This allows PCI devices to be managed.

        PCI plug

        PCI.plug is executed by the backend. It plugs a PCI device and advertises it to QEMU if this option is set, as is the case for NVIDIA SR-IOV vGPUs.

        Unpause

        The libxenctrl call xc_domain_unpause() unpauses the domain, and it starts running.

        Cleanup

        1. VM_set_domain_action_request marks the domain as alive: In case xenopsd restarts, it no longer reboots the VM. See the chapter on marking domains as alive for more information.

        2. If a post-migrate script is in place, it is executed by the Xenops_hooks.VM_post_migrate hook.

        3. The final step is a handshake to seal the success of the migration and the old VM can now be cleaned up.

        Synchronisation point 4 has been reached, and the migration is complete.

        Live migration sequence diagram

        This sequence diagram gives a visual representation of the VM migration workflow:

        sequenceDiagram
        autonumber
        participant tx as sender
        participant rx0 as receiver thread 0
        participant rx1 as receiver thread 1
        participant rx2 as receiver thread 2
        
        activate tx
        tx->>rx0: VM.import_metadata
        tx->>tx: Squash memory to dynamic-min
        
        tx->>rx1: HTTP /migrate/vm
        activate rx1
        rx1->>rx1: VM_receive_memory<br/>VM_create (00000001)<br/>VM_restore_vifs
        rx1->>tx: handshake (control channel)<br/>Synchronisation point 1
        
        tx->>rx2: HTTP /migrate/mem
        activate rx2
        rx2->>tx: handshake (memory channel)<br/>Synchronisation point 1-mem
        
        tx->>rx1: handshake (control channel)<br/>Synchronisation point 1-mem ACK
        
        rx2->>rx1: memory fd
        
        tx->>rx1: VM_save/VM_restore<br/>Synchronisation point 2
        tx->>tx: VM_rename
        rx1->>rx2: exit
        deactivate rx2
        
        tx->>rx1: handshake (control channel)<br/>Synchronisation point 3
        
        rx1->>rx1: VM_rename<br/>VM_restore_devices<br/>VM_unpause<br/>VM_set_domain_action_request
        
        rx1->>tx: handshake (control channel)<br/>Synchronisation point 4
        
        deactivate rx1
        
        tx->>tx: VM_shutdown<br/>VM_remove
        deactivate tx

        References

        These pages might help for a better understanding of the XAPI toolstack:

        Live Migration Sequence Diagram

        sequenceDiagram
        autonumber
        participant tx as sender
        participant rx0 as receiver thread 0
        participant rx1 as receiver thread 1
        participant rx2 as receiver thread 2
        
        activate tx
        tx->>rx0: VM.import_metadata
        tx->>tx: Squash memory to dynamic-min
        
        tx->>rx1: HTTP /migrate/vm
        activate rx1
        rx1->>rx1: VM_receive_memory<br/>VM_create (00000001)<br/>VM_restore_vifs
        rx1->>tx: handshake (control channel)<br/>Synchronisation point 1
        
        tx->>rx2: HTTP /migrate/mem
        activate rx2
        rx2->>tx: handshake (memory channel)<br/>Synchronisation point 1-mem
        
        tx->>rx1: handshake (control channel)<br/>Synchronisation point 1-mem ACK
        
        rx2->>rx1: memory fd
        
        tx->>rx1: VM_save/VM_restore<br/>Synchronisation point 2
        tx->>tx: VM_rename
        rx1->>rx2: exit
        deactivate rx2
        
        tx->>rx1: handshake (control channel)<br/>Synchronisation point 3
        
        rx1->>rx1: VM_rename<br/>VM_restore_devices<br/>VM_unpause<br/>VM_set_domain_action_request
        
        rx1->>tx: handshake (control channel)<br/>Synchronisation point 4
        
        deactivate rx1
        
        tx->>tx: VM_shutdown<br/>VM_remove
        deactivate tx