Xapi

Xapi is the xapi-project host and cluster manager.

Xapi is responsible for:

  • providing a stable interface (the XenAPI)
  • allowing one client to manage multiple hosts
  • hosting the “xe” CLI
  • authenticating users and applying role-based access control
  • locking resources (in particular disks)
  • allowing storage to be managed through plugins
  • planning and coping with host failures (“High Availability”)
  • storing VM and host configuration
  • generating alerts
  • managing software patching

Principles

  1. The XenAPI interface must remain backwards compatible, allowing older clients to continue working
  2. Xapi delegates all Xenstore/libxc/libxl access to Xenopsd, so Xapi could be run in an unprivileged helper domain
  3. Xapi delegates the low-level storage manipulation to SM plugins.
  4. Xapi delegates setting up host networking to xcp-networkd.
  5. Xapi delegates monitoring performance counters to xcp-rrdd.

Overview

The following diagram shows the internals of Xapi:

Internals of xapi

The top of the diagram shows the XenAPI clients: XenCenter, XenOrchestra, OpenStack and CloudStack, which use the XenAPI and HTTP GET/PUT over ports 80 and 443 to talk to xapi. The XenAPI calls (JSON-RPC or XML-RPC over HTTP POST) and HTTP GET/PUT requests are always authenticated, either using PAM (by default against the local passwd and group files) or through Active Directory.

The APIs are classified into categories:

  • coordinator-only: these are the majority of current APIs. The coordinator should be called and relied upon to forward the call to the right place with the right locks held.
  • normally-local: these are performance special cases such as disk import/export and console connection which are sent directly to hosts which have the most efficient access to the data.
  • emergency: these deal with scenarios where the coordinator is offline

If the incoming API call should be resent to the coordinator, then a XenAPI HOST_IS_SLAVE error message containing the coordinator’s IP address is sent to the client.
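For illustration, a client might handle this redirection roughly as follows. This is only a sketch: the call_xenapi and reconnect_to helpers are hypothetical, but the error name and its single parameter are as described above.

(* Sketch: reacting to HOST_IS_SLAVE on the client side. The helpers
   call_xenapi and reconnect_to are hypothetical placeholders. *)
let with_coordinator_redirect ~call_xenapi ~reconnect_to =
  try call_xenapi ()
  with Api_errors.Server_error ("HOST_IS_SLAVE", [coordinator_address]) ->
    (* Reconnect to the coordinator whose address came back in the error... *)
    reconnect_to coordinator_address ;
    (* ...and retry the original call there. *)
    call_xenapi ()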

Once past the initial checks, API calls enter the “message forwarding” layer which

  • locks resources (via the current_operations mechanism)
  • decides which host should execute the request.

If the request should run locally then a direct function call is used; otherwise the message forwarding code makes a synchronous API call to a specific other host. Note: Xapi currently employs a “thread per request” model, in which one full POSIX thread is created for every request. Even when a request is forwarded, the thread persists, blocked until the result becomes available.

If the XenAPI call is a VM lifecycle operation then it is converted into a Xenopsd API call and forwarded over a Unix domain socket. Xapi and Xenopsd have similar notions of cancellable asynchronous “tasks”, so the current Xapi task (all operations run in the context of a task) is bound to the Xenopsd task, so cancellation is passed through and progress updates are received.

If the XenAPI call is a storage operation then the “storage access” layer

  • verifies that the storage objects are in the correct state (SR attached/detached; VDI attached/activated read-only/read-write)
  • invokes the relevant operation in the Storage Manager API (SMAPI) v2 interface;
  • depending on the type of SR:
    • uses the SMAPIv2 to SMAPIv1 converter to generate the necessary command-line to talk to the SMAPIv1 plugin (EXT, NFS, LVM etc) and to execute it
    • uses the SMAPIv2 to SMAPIv3 converter daemon xapi-storage-script to execute the necessary SMAPIv3 command (GFS2)
  • persists the state of the storage objects (including the result of a VDI.attach call) to persistent storage

Internally the SMAPIv1 plugins use privileged access to the Xapi database to directly set fields (e.g. VDI.virtual_size) that would be considered read-only by other clients. The SMAPIv1 plugins also rely on Xapi for

  • knowledge of all hosts which may access the storage
  • locking of disks within the resource pool
  • safely executing code on other hosts via the “Xapi plugin” mechanism (see the sketch below)
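As a sketch of the last point: plugins installed under /etc/xapi.d/plugins can be invoked on a chosen host through the XenAPI host.call_plugin message. The plugin name, function and arguments below are invented for illustration.

(* Sketch: running a (hypothetical) plugin on another host in the pool. *)
let refresh_on_host rpc session_id host =
  Client.Host.call_plugin ~rpc ~session_id ~host
    ~plugin:"example-sm-helper" ~fn:"refresh" ~args:[("device", "sdb")]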

The Xapi database contains Host and VM metadata and is shared pool-wide. The coordinator keeps a copy in memory, and all other nodes remote their queries to the coordinator. The database associates each object with a generation count which is used to implement the XenAPI event.next and event.from APIs. The database is periodically flushed asynchronously to disk in XML format. If the “redo-log” is enabled then all database writes are additionally made synchronously as deltas to a shared block device. Without the redo-log, recent updates may be lost if Xapi is killed before a flush.
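As a rough illustration of how the generation count is used, a client watching for changes keeps passing the token it last received back into event.from. This is only a conceptual sketch: event_from stands in for the real XenAPI binding and handle_event is supplied by the caller.

(* Conceptual sketch of the event.from polling loop described above. *)
let rec watch ~event_from ~handle_event token =
  (* Ask for all events newer than `token`; the reply carries a new token. *)
  let events, next_token = event_from ~classes:["*"] ~token ~timeout:30.0 in
  List.iter handle_event events ;
  watch ~event_from ~handle_event next_token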

High-Availability refers to planning for host failure, monitoring host liveness and then following-through on the plans. Xapi defers to an external host liveness monitor called xhad. When xhad confirms that a host has failed – and has been isolated from the storage – then Xapi will restart any VMs which have failed and which have been marked as “protected” by HA. Xapi can also impose admission control to prevent the pool becoming too overloaded to cope with n arbitrary host failures.

The xe CLI is implemented in terms of the XenAPI, but for efficiency the implementation is linked directly into Xapi. The xe program remotes its command-line to Xapi, and Xapi sends back a series of simple commands (prompt for input; print line; fetch file; exit etc).

How to add....

Adding a Class to the API

This document describes how to add a new class to the data model that defines the XenServer API. It complements two other documents that describe how to extend an existing class: Adding a field to the API and Adding a function to the API.

As a running example, we will use the addition of a class that is part of the design for the PVS Direct feature. PVS Direct introduces proxies that serve VMs with disk images. This class was added via commit CP-16939 to Xen API.

Example: PVS_server

In the world of XenServer, each important concept, like a virtual machine, an interface, or a user, is represented by a class in the data model. A class defines methods and instance variables. At runtime, all class instances are held in an in-memory database. For example, part of PVS Direct is a class PVS_server, representing a resource that provides block-level data for virtual machines. The design document defines it to have the following important properties:

Fields

  • (string set) addresses (RO/constructor) IPv4 addresses of the server.

  • (int) first_port (RO/constructor) First UDP port accepted by the server.

  • (int) last_port (RO/constructor) Last UDP port accepted by the server.

  • (PVS_farm ref) farm (RO/constructor) Link to the farm that this server is included in. A PVS_server object must always have a valid farm reference; the PVS_server will be automatically GC’ed by xapi if the associated PVS_farm object is removed.

  • (string) uuid (RO/runtime) Unique identifier/object reference. Allocated by the server.

Methods (or Functions)

  • (PVS_server ref) introduce (string set addresses, int first_port, int last_port, PVS_farm ref farm) Introduce a new PVS server into the farm. Allowed at any time, even when proxies are in use. The proxies will be updated automatically.

  • (void) forget (PVS_server ref self) Remove a PVS server from the farm. Allowed at any time, even when proxies are in use. The proxies will be updated automatically.

Implementation Overview

The implementation of a class is distributed over several files:

  • ocaml/idl/datamodel.ml – central class definition
  • ocaml/idl/datamodel_types.ml – definition of releases
  • ocaml/xapi/cli_frontend.ml – declaration of CLI operations
  • ocaml/xapi/cli_operations.ml – implementation of CLI operations
  • ocaml/xapi/records.ml – getters and setters
  • ocaml/xapi/OMakefile – refers to xapi_pvs_farm.ml
  • ocaml/xapi/api_server.ml – refers to xapi_pvs_farm.ml
  • ocaml/xapi/message_forwarding.ml – forwarding of the new methods to the host that should run them
  • ocaml/xapi/xapi_pvs_farm.ml – implementation of methods, new file

Data Model

The data model ocaml/idl/datamodel.ml defines the class. To keep the name space tidy, most helper functions are grouped into an internal module:

(* datamodel.ml *)

let schema_minor_vsn = 103 (* line 21 -- increment this *)
let _pvs_farm = "PVS_farm" (* line 153 *)

module PVS_farm = struct (* line 8658 *)
  let lifecycle = [Prototyped, rel_dundee_plus, ""]

  let introduce = call
    ~name:"introduce"
    ~doc:"Introduce new PVS farm"
    ~result:(Ref _pvs_farm, "the new PVS farm")
    ~params:
    [ String,"name","name of the PVS farm"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let forget = call
    ~name:"forget"
    ~doc:"Remove a farm's meta data"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ]
    ~errs:[
      Api_errors.pvs_farm_contains_running_proxies;
      Api_errors.pvs_farm_contains_servers;
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()


  let set_name = call
    ~name:"set_name"
    ~doc:"Update the name of the PVS farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; String, "value", "name to be used"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let add_cache_storage = call
    ~name:"add_cache_storage"
    ~doc:"Add a cache SR for the proxies on the farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; Ref _sr, "value", "SR to be used"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let remove_cache_storage = call
    ~name:"remove_cache_storage"
    ~doc:"Remove a cache SR for the proxies on the farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; Ref _sr, "value", "SR to be removed"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let obj =
    let null_str = Some (VString "") in
    let null_set = Some (VSet []) in
    create_obj (* <---- creates class *)
    ~name: _pvs_farm
    ~descr:"machines serving blocks of data for provisioning VMs"
    ~doccomments:[]
    ~gen_constructor_destructor:false
    ~gen_events:true
    ~in_db:true
    ~lifecycle
    ~persist:PersistEverything
    ~in_oss_since:None
    ~messages_default_allowed_roles:_R_POOL_OP
    ~contents:
    [ uid     _pvs_farm ~lifecycle

    ; field   ~qualifier:StaticRO ~lifecycle
              ~ty:String "name" ~default_value:null_str
              "Name of the PVS farm. Must match name configured in PVS"

    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _sr)) "cache_storage" ~default_value:null_set
              ~ignore_foreign_key:true
              "The SR used by PVS proxy for the cache"

    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _pvs_server)) "servers"
              "The set of PVS servers in the farm"


    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _pvs_proxy)) "proxies"
              "The set of proxies associated with the farm"
    ]
    ~messages:
    [ introduce
    ; forget
    ; set_name
    ; add_cache_storage
    ; remove_cache_storage
    ]
    ()
end
let pvs_farm = PVS_farm.obj

The class is defined by a call to create_obj, which defines the fields and messages (methods) belonging to the class. Each field has a name, a type, and some meta information. Likewise, each message (or method) is created by a call to call, which describes its parameters.

The PVS_farm has additional getter and setter methods for accessing its fields. These are not declared here as part of the messages but are automatically generated.
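For example, code elsewhere in xapi (or an external client) can use the generated accessors. A small sketch, assuming the accessor names that follow from the field definitions above:

(* Sketch: using the automatically generated getters for PVS_farm. *)
let print_farm rpc session_id farm =
  let name    = Client.PVS_farm.get_name    ~rpc ~session_id ~self:farm in
  let servers = Client.PVS_farm.get_servers ~rpc ~session_id ~self:farm in
  Printf.printf "PVS farm %s has %d server(s)\n" name (List.length servers)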

To make sure the new class is actually used, it is important to enter it into two lists:

(* datamodel.ml *)
let all_system = (* line 8917 *)
  [
    ...
    vgpu_type;
    pvs_farm;
    ...
  ]

let expose_get_all_messages_for = [ (* line 9097 *)
  ...
  _pvs_farm;
  _pvs_server;
  _pvs_proxy;

When a field refers to another object that itself refers back to it, these two need to be entered into the all_relations list. For example, _pvs_server refers to a _pvs_farm value via "farm", which, in turn, refers to the _pvs_server value via its "servers" field.

let all_relations =
  [
    (* ... *)
    (_sr, "introduced_by"), (_dr_task, "introduced_SRs");
    (_pvs_server, "farm"), (_pvs_farm, "servers");
    (_pvs_proxy,  "farm"), (_pvs_farm, "proxies");
  ]

CLI Conventions

The CLI provides access to objects from the command line. The following conventions exist for naming fields:

  • A field in the data model uses an underscore (_) but a hyphen (-) in the CLI: what is cache_storage in the data model becomes cache-storage in the CLI.

  • When a field contains a reference or a set of references, like proxies, it is exposed as proxy-uuids in the CLI because references are always referred to by their UUID.

CLI Getters and Setters

All fields can be read from the CLI and some fields can also be set via the CLI. These getters and setters are mostly generated automatically and need to be connected to the CLI through a function in ocaml/xapi/records.ml. Note that field names here use the naming convention for the CLI:

(* ocaml/xapi/records.ml *)
let pvs_farm_record rpc session_id pvs_farm =
  let _ref = ref pvs_farm in
  let empty_record =
    ToGet (fun () -> Client.PVS_farm.get_record rpc session_id !_ref) in
  let record = ref empty_record in
  let x () = lzy_get record in
    { setref    = (fun r -> _ref := r ; record := empty_record)
    ; setrefrec = (fun (a,b) -> _ref := a; record := Got b)
    ; record    = x
    ; getref    = (fun () -> !_ref)
    ; fields=
      [ make_field ~name:"uuid"
        ~get:(fun () -> (x ()).API.pVS_farm_uuid) ()
      ; make_field ~name:"name"
        ~get:(fun () -> (x ()).API.pVS_farm_name)
        ~set:(fun name ->
          Client.PVS_farm.set_name rpc session_id !_ref name) ()
      ; make_field ~name:"cache-storage"
        ~get:(fun () -> (x ()).API.pVS_farm_cache_storage
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~add_to_set:(fun sr_uuid ->
          let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in
          Client.PVS_farm.add_cache_storage rpc session_id !_ref sr)
        ~remove_from_set:(fun sr_uuid ->
          let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in
          Client.PVS_farm.remove_cache_storage rpc session_id !_ref sr)
        ()
      ; make_field ~name:"server-uuids"
        ~get:(fun () -> (x ()).API.pVS_farm_servers
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~get_set:(fun () -> (x ()).API.pVS_farm_servers
          |> List.map get_uuid_from_ref)
        ()
      ; make_field ~name:"proxy-uuids"
        ~get:(fun () -> (x ()).API.pVS_farm_proxies
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~get_set:(fun () -> (x ()).API.pVS_farm_proxies
          |> List.map get_uuid_from_ref)
        ()
      ]
    }

CLI Interface to Methods

Methods accessible from the CLI are declared in ocaml/xapi/cli_frontend.ml. Each declaration refers to the real implementation of the method, like Cli_operations.PVS_farm.introduce:

(* cli_frontend.ml *)
let rec cmdtable_data : (string*cmd_spec) list =
  (* ... *)
  "pvs-farm-introduce",
  {
    reqd=["name"];
    optn=[];
    help="Introduce new PVS farm";
    implementation=No_fd Cli_operations.PVS_farm.introduce;
    flags=[];
  };
  "pvs-farm-forget",
  {
    reqd=["uuid"];
    optn=[];
    help="Forget a PVS farm";
    implementation=No_fd Cli_operations.PVS_farm.forget;
    flags=[];
  };

CLI Implementation of Methods

Each CLI operation that is not a getter or setter has an implementation in cli_operations.ml which is implemented in terms of the real implementation:

(* cli_operations.ml *)
module PVS_farm = struct
  let introduce printer rpc session_id params =
    let name  = List.assoc "name" params in
    let ref   = Client.PVS_farm.introduce ~rpc ~session_id ~name in
    let uuid  = Client.PVS_farm.get_uuid rpc session_id ref in
    printer (Cli_printer.PList [uuid])

  let forget printer rpc session_id params =
    let uuid  = List.assoc "uuid" params in
    let ref   = Client.PVS_farm.get_by_uuid ~rpc ~session_id ~uuid in
    Client.PVS_farm.forget rpc session_id ref
end

Fields that should show up in the CLI interface by default are declared in the gen_cmds value:

(* cli_operations.ml *)
let gen_cmds rpc session_id =
  let mk = make_param_funs in
  List.concat
  [ (*...*)
  ; Client.Pool.(mk get_all get_all_records_where
    get_by_uuid pool_record "pool" []
    ["uuid";"name-label";"name-description";"master"
    ;"default-SR"] rpc session_id)
  ; Client.PVS_farm.(mk get_all get_all_records_where
    get_by_uuid pvs_farm_record "pvs-farm" []
    ["uuid";"name";"cache-storage";"server-uuids"] rpc session_id)

Error messages

Error messages used by an implementation are introduced in two files:

(* ocaml/xapi-consts/api_errors.ml *)
let pvs_farm_contains_running_proxies = "PVS_FARM_CONTAINS_RUNNING_PROXIES"
let pvs_farm_contains_servers = "PVS_FARM_CONTAINS_SERVERS"
let pvs_farm_sr_already_added = "PVS_FARM_SR_ALREADY_ADDED"
let pvs_farm_sr_is_in_use = "PVS_FARM_SR_IS_IN_USE"
let sr_not_in_pvs_farm = "SR_NOT_IN_PVS_FARM"
let pvs_farm_cant_set_name = "PVS_FARM_CANT_SET_NAME"

(* ocaml/idl/datamodel.ml *)
  (* PVS errors *)
  error Api_errors.pvs_farm_contains_running_proxies ["proxies"]
    ~doc:"The PVS farm contains running proxies and cannot be forgotten." ();

  error Api_errors.pvs_farm_contains_servers ["servers"]
    ~doc:"The PVS farm contains servers and cannot be forgotten."
    ();

  error Api_errors.pvs_farm_sr_already_added ["farm"; "SR"]
    ~doc:"Trying to add a cache SR that is already associated with the farm"
    ();

  error Api_errors.sr_not_in_pvs_farm ["farm"; "SR"]
    ~doc:"The SR is not associated with the farm."
    ();

  error Api_errors.pvs_farm_sr_is_in_use ["farm"; "SR"]
    ~doc:"The SR is in use by the farm and cannot be removed."
    ();

  error Api_errors.pvs_farm_cant_set_name ["farm"]
    ~doc:"The name of the farm can't be set while proxies are active."
    ()

Method Implementation

The implementation of methods lives in a module in ocaml/xapi:

(* ocaml/xapi/api_server.ml *)
  module PVS_farm = Xapi_pvs_farm

The file below is typically a new file and needs to be added to ocaml/xapi/OMakefile.

(* ocaml/xapi/xapi_pvs_farm.ml *)
module D = Debug.Make(struct let name = "xapi_pvs_farm" end)
module E = Api_errors

let api_error msg xs = raise (E.Server_error (msg, xs))

let introduce ~__context ~name =
  let pvs_farm = Ref.make () in
  let uuid = Uuid.to_string (Uuid.make_uuid ()) in
  Db.PVS_farm.create ~__context
    ~ref:pvs_farm ~uuid ~name ~cache_storage:[];
  pvs_farm

(* ... *)

Messages received on a slave host may or may not be executed there. In the simple case, each method executes locally:

(* ocaml/xapi/message_forwarding.ml *)
module PVS_farm = struct
  let introduce ~__context ~name =
    info "PVS_farm.introduce %s" name;
    Local.PVS_farm.introduce ~__context ~name

  let forget ~__context ~self =
    info "PVS_farm.forget";
    Local.PVS_farm.forget ~__context ~self

  let set_name ~__context ~self ~value =
    info "PVS_farm.set_name %s" value;
    Local.PVS_farm.set_name ~__context ~self ~value

  let add_cache_storage ~__context ~self ~value =
    info "PVS_farm.add_cache_storage";
    Local.PVS_farm.add_cache_storage ~__context ~self ~value

  let remove_cache_storage ~__context ~self ~value =
    info "PVS_farm.remove_cache_storage";
    Local.PVS_farm.remove_cache_storage ~__context ~self ~value
end

Adding a field to the API

This page describes how to add a field to the XenAPI. A field is an attribute of a class that can be read through the API and used in API functions.

Bumping the database schema version

Whenever a field is added to or removed from the API, the database schema version needs to be increased. XAPI uses this version to detect that an automatic database upgrade is necessary, or that the new schema is incompatible with the existing database. If the schema version is not bumped, XAPI will start failing in unpredictable ways. Note that bumping the version is not necessary when adding functions, only when adding fields.

The current version number is kept at the top of the file ocaml/idl/datamodel_common.ml in the variables schema_major_vsn and schema_minor_vsn, of which only the latter should be incremented (the major version only exists for historical reasons). When moving to a new XenServer release, also update the variable last_release_schema_minor_vsn to the schema version of the last release. To keep track of the schema versions of recent XenServer releases, the file contains variables for these, such as miami_release_schema_minor_vsn. After starting a new version of Xapi on an existing server, the database is automatically upgraded if the schema version of the existing database matches the value of last_release_schema_*_vsn in the new Xapi.

As an example, the patch below shows how the schema version was bumped when the new API fields used for ActiveDirectory integration were added:

--- a/ocaml/idl/datamodel.ml  Tue Nov 11 16:17:48 2008 +0000
+++ b/ocaml/idl/datamodel.ml  Tue Nov 11 15:53:29 2008 +0000
@@ -15,17 +15,20 @@ open Datamodel_types
  open Datamodel_types

  (* IMPORTANT: Please bump schema vsn if you change/add/remove a _field_.
     You do not have to dump vsn if you change/add/remove a message *)

  let schema_major_vsn = 5
 -let schema_minor_vsn = 55
 +let schema_minor_vsn = 56

  (* Historical schema versions just in case this is useful later *)
  let rio_schema_major_vsn = 5
  let rio_schema_minor_vsn = 19

 +let miami_release_schema_major_vsn = 5
 +let miami_release_schema_minor_vsn = 35
 +
  (* the schema vsn of the last release: used to determine whether we can
     upgrade or not.. *)
  let last_release_schema_major_vsn = 5
 -let last_release_schema_minor_vsn = 35
 +let last_release_schema_minor_vsn = 55

Setting the schema hash

The file ocaml/idl/schematest.ml contains last_known_schema_hash. This needs to be updated to the new hash after the schema version has been bumped: run make test and the error message will report the correct hash.
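The relevant line looks roughly like this (the hash below is a placeholder, not a real value):

(* ocaml/idl/schematest.ml *)
let last_known_schema_hash = "0123456789abcdef0123456789abcdef"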

Adding the new field to some existing class

ocaml/idl/datamodel.ml

Add a new “field” line to the class in the file ocaml/idl/datamodel.ml or ocaml/idl/datamodel_[class].ml. The new field might require a suitable default value. This default value is used in case the user does not provide a value for the field.

A field has a number of parameters:

  • The lifecycle parameter, which shows how the field has evolved over time.
  • The qualifier parameter, which controls access to the field. The following values are possible:
    Value       Meaning
    StaticRO    Field is set statically at install-time.
    DynamicRO   Field is computed dynamically at run time.
    RW          Field is read/write.
  • The ty parameter for the type of the field.
  • The default_value parameter.
  • The name of the field.
  • A documentation string.

Example of a field in the pool class:

field ~lifecycle:[Published, rel_orlando, "Controls whether HA is enabled"]
      ~qualifier:DynamicRO ~ty:Bool
      ~default_value:(Some (VBool false)) "ha_enabled" "true if HA is enabled on the pool, false otherwise";

See datamodel_types.ml for information about other parameters.

Changing Constructors

Adding a field would change the constructors for the class – functions Db.*.create – and therefore, any references to these in the code need to be updated. In the example, the argument ~ha_enabled:false should be added to any call to Db.Pool.create.

Examples of these calls can be found in ocaml/tests/common/test_common.ml and ocaml/xapi/xapi_[class].ml.
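For illustration, a call site changes roughly as follows. Most labelled arguments are elided, so this is a sketch rather than compilable code.

(* Sketch: every Db.Pool.create call site gains the new labelled argument.
   Most labelled arguments are elided here, so this is not compilable as-is. *)
let create_test_pool ~__context ~pool_ref ~uuid =
  Db.Pool.create ~__context ~ref:pool_ref ~uuid
    ~ha_enabled:false (* <- new argument for the new field *)
    (* ... all other existing labelled arguments unchanged ... *)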

CLI Records

If you want this field to show up in the CLI (which you probably do), you will also need to modify the Records module, in the file ocaml/xapi-cli-server/records.ml. Find the record function for the class which you have modified and add a new entry to the fields list using make_field, which is defined in the same file.

The only required parameters are name and get (and unit, of course). If your field is a map or set, then you will need to pass in get_{map,set}, and optionally set_{map,set} if it is an RW field. The hidden parameter is useful if you don’t want this field to show up in a *_params_list call. As an example, here is a field that we’ve just added to the SM class:

make_field ~name:"versioned-capabilities"
           ~get:(fun () -> get_from_map (x ()).API.sM_versioned_capabilities)
           ~get_map:(fun () -> (x ()).API.sM_versioned_capabilities)
           ~hidden:true ();

Testing

The new fields can be tested by copying the newly compiled xapi binary to a test box. After the new xapi service is started, the file /var/log/xensource.log in the test box should contain a few lines reporting the successful upgrade of the metadata schema in the test box:

[...|xapi] Db has schema major_vsn=5, minor_vsn=57 (current is 5 58) (last is 5 57)
[...|xapi] Database schema version is that of last release: attempting upgrade
[...|sql] attempting to restore database from /var/xapi/state.db
[...|sql] finished parsing xml
[...|sql] writing db as xml to file '/var/xapi/state.db'.
[...|xapi] Database upgrade complete, restarting to use new db

Making this field accessible as a CLI attribute

XenAPI functions to get and set the value of the new field are generated automatically. It requires some extra work, however, to enable such operations in the CLI.

The CLI has commands such as host-param-list and host-param-get. To make a new field accessible by these commands, the file xapi-cli-server/records.ml needs to be edited. For the pool.ha-enabled field, the pool_record function in this file contains the following (note the convention to replace underscores by hyphens in the CLI):

let pool_record rpc session_id pool =
  ...
[
  ...
  make_field ~name:"ha-enabled" ~get:(fun () -> string_of_bool (x ()).API.pool_ha_enabled) ();
  ...
]}

NB: the ~get parameter must return a string, so include a suitable function to convert the field’s type into a string, e.g. string_of_bool.

See xapi-cli-server/records.ml for examples of handling field types other than Bool.

Adding a function to the API

This page describes how to add a function to XenAPI.

Add message to API

The file idl/datamodel.ml is a description of the API, from which the marshalling and handler code is generated.

In this file, the create_obj function is used to define a class which may contain fields and support operations (known as “messages”). For example, the identifier host is defined using create_obj to encapsulate the operations which can be performed on a host.

In order to add a function to the API, we need to add a message to an existing class. This entails adding a function in idl/datamodel.ml or one of the other datamodel files to describe the new message and adding it to the class’s list of messages. In this example, we are adding to idl/datamodel_host.ml.

The function to describe the new message will look something like the following:

let host_price_of = call ~flags:[`Session]
    ~name:"price_of"
    ~in_oss_since:None
    ~in_product_since:rel_orlando
    ~params:[(Ref _host, "host", "The host containing the price information");
             (String, "item", "The item whose price is queried")]
    ~result:(Float, "The price of the item")
    ~doc:"Returns the price of a named item."
    ~allowed_roles:_R_POOL_OP
    ()

By convention, the name of the function is formed from the name of the class and the name of the message: host and price_of, in the example. An entry for host_price_of is added to the messages of the host class:

let host =
    create_obj ...
        ~messages: [...
                    host_price_of;
                   ]
...

The parameters passed to call are all optional (except ~name and ~in_product_since).

  • The ~flags parameter is used to set conditions for the use of the message. For example, `Session is used to indicate that the call must be made in the presence of an existing session.

  • The value of the ~in_product_since parameter is a string, taken from idl/datamodel_types.ml, that indicates the XenServer release in which this message was first introduced.

  • The ~params parameter describes a list of the formal parameters of the message. Each parameter is described by a triple. The first component of the triple is the type (from type ty in idl/datamodel_types.ml); the second is the name of the parameter, and the third is a human-readable description of the parameter. The first triple in the list is conventionally the instance of the class on which the message will operate. In the example, this is a reference to the host.

  • Similarly, the ~result parameter describes the message’s return type; this is a single value rather than a list of values. If no ~result is specified, the default is unit.

  • The ~doc parameter describes what the message is doing.

  • The bool ~hide_from_docs parameter prevents the message from being included in the documentation when generated.

  • The bool ~pool_internal parameter is used to indicate if the message should be callable by external systems or only internal hosts.

  • The ~errs parameter is a list of possible exceptions that the message can raise.

  • The parameter ~lifecycle takes a list of (status, version, doc) tuples to indicate the lifecycle of the message. This takes over from ~in_oss_since, which indicated the release in which the message was introduced. NOTE: Leave this parameter empty; it will be populated on build.

  • The ~allowed_roles parameter is used for access control (see below).

Compiling xen-api.(hg|git) will cause the code corresponding to this message to be generated and output in ocaml/xapi/server.ml. In the example above, a section handling an incoming host.price_of call appears in ocaml/xapi/server.ml. However, the rest of the build will fail until a price_of function is implemented for the Host object.

Expected values in parameter ~in_product_since

In the example above, the value of the parameter ~in_product_since indicates that the message host_price_of was added during the rel_orlando release cycle. If a new release cycle is required, then it needs to be added in the file idl/datamodel_types.ml. The patch below shows how the new rel_george release identifier was added. Any class, message, etc. added during the rel_george release cycle should contain ~in_product_since:rel_george entries. (Note: the release and upgrade infrastructure can handle only one new rel_* identifier – in this case, rel_george – in each release.)

--- a/ocaml/idl/datamodel_types.ml Tue Nov 11 15:17:48 2008 +0000
+++ b/ocaml/idl/datamodel_types.ml Tue Nov 11 15:53:29 2008 +0000
@@ -27,14 +27,13 @@
 (* useful constants for product vsn tracking *)
 let oss_since_303 = Some "3.0.3"
+let rel_george = "george"
 let rel_orlando = "orlando"
 let rel_orlando_update_1 = "orlando-update-1"
 let rel_symc = "symc"
 let rel_miami = "miami"
 let rel_rio = "rio"
-let release_order = [rel_rio; rel_miami; rel_symc; rel_orlando; rel_orlando_update_1]
+let release_order = [rel_rio; rel_miami; rel_symc; rel_orlando; rel_orlando_update_1; rel_george]

Update expose_get_all_messages_for list

If you are adding a new class, do not forget to add your new class _name to the expose_get_all_messages_for list, at the bottom of datamodel.ml, in order to have automatically generated get_all and get_all_records functions attached to it.

Update the RBAC field containing the roles expected to use the new API call

After the RBAC integration, Xapi provides by default a set of static roles associated to the most common subject tasks.

The api calls associated with each role are defined by a new ~allowed_roles parameter in each api call, which specifies the list of static roles that should be able to execute the call. Each role in this list must be one of the following names, defined in datamodel.ml:

  • role_pool_admin
  • role_pool_operator
  • role_vm_power_admin
  • role_vm_admin
  • role_vm_operator
  • role_read_only

So, for instance,

~allowed_roles:[role_pool_admin; role_pool_operator] (* this is not the recommended usage, see example below *)

would be a valid list (though it is not the recommended way of using allowed_roles, see below), meaning that subjects belonging to either role_pool_admin or role_pool_operator can execute the api call.

The RBAC requirements define a policy where the roles in the list above are supposed to be totally-ordered by the set of api-calls associated with each of them. That means that any api-call allowed to role_pool_operator should also be in role_pool_admin; any api-call allowed to role_vm_power_admin should also be in role_pool_operator and also in role_pool_admin; and so on. Datamodel.ml provides shortcuts for expressing these totally-ordered set of roles policy associated with each api-call:

  • _R_POOL_ADMIN, equivalent to [role_pool_admin]
  • _R_POOL_OP, equivalent to [role_pool_admin; role_pool_operator]
  • _R_VM_POWER_ADMIN, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin]
  • _R_VM_ADMIN, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin]
  • _R_VM_OP, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin; role_vm_operator]
  • _R_READ_ONLY, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin; role_vm_operator; role_read_only]

The ~allowed_roles parameter should use one of the shortcuts in the list above, instead of directly using a list of roles, because the shortcuts above make sure that the roles in the list are in a total order regarding the api-calls permission sets. Creating an api-call with e.g. ~allowed_roles:[role_pool_admin; role_vm_admin] would be wrong, because that would mean that a pool_operator cannot execute an api-call that a vm_admin can, breaking the total-order policy expected in the RBAC 1.0 implementation. In the future, this requirement might be relaxed.

So, the example above should instead be used as:

~allowed_roles:_R_POOL_OP  (* recommended usage via pre-defined totally-ordered role lists *)

and so on.

How to determine the correct role of a new api-call:

  • if only xapi should execute the api-call, i.e. it is an internal call: _R_POOL_ADMIN
  • if it is related to subject, role, external-authentication: _R_POOL_ADMIN
  • if it is related to accessing Dom0 (via console, ssh, whatever): _R_POOL_ADMIN
  • if it is related to the pool object: _R_POOL_OP
  • if it is related to the host object, licenses, backups, physical devices: _R_POOL_OP
  • if it is related to managing VM memory, snapshot/checkpoint, migration: _R_VM_POWER_ADMIN
  • if it is related to creating, destroying, cloning, importing/exporting VMs: _R_VM_ADMIN
  • if it is related to starting, stopping, pausing etc VMs or otherwise accessing/manipulating VMs: _R_VM_OP
  • if it is related to being able to login, manipulate own tasks and read values only: _R_READ_ONLY

Update message forwarding

The “message forwarding” layer describes the policy of whether an incoming API call should be forwarded to another host (such as another member of the pool) or processed on the host which receives the call. This policy may be non-trivial to describe and so cannot be auto-generated from the data model.

In xapi/message_forwarding.ml, add a function to the relevant module to describe this policy. In the running example, we add the following function to the Host module:

let price_of ~__context ~host ~item =
    info "Host.price_of for item %s" item;
    let local_fn = Local.Host.price_of ~host ~item in
    do_op_on ~local_fn ~__context ~host
      (fun session_id rpc -> Client.Host.price_of ~rpc ~session_id ~host ~item)

After the ~__context parameter, the parameters of this new function should match the parameters we specified for the message. In this case, that is the host and the item to query the price of.

The do_op_on function takes a function to execute locally and a function to execute remotely and performs one of these operations depending on whether the given host is the local host.

The local function references Local.Host.price_of, which is a function we will write in the next step.

Implement the function

Now we write the function to perform the logic behind the new API call. For a host-based call, this will reside in xapi/xapi_host.ml. For other classes, other files with similar names are used.

We add the following function to xapi/xapi_host.ml:

let price_of ~__context ~host ~item =
    if item = "fish" then 3.14 else 0.00

We also need to add the function to the interface xapi/xapi_host.mli:

val price_of :
    __context:Context.t -> host:API.ref_host -> item:string -> float

Congratulations, you’ve added a function to the API!

Add the operation to the CLI

Edit xapi-cli-server/cli_frontend.ml. Add a block to the definition of cmdtable_data as in the following example:

"host-price-of",
{
  reqd=["host-uuid"; "item"];
  optn=[];
  help="Find out the price of an item on a certain host.";
  implementation= No_fd Cli_operations.host_price_of;
  flags=[];
};

Include here the following:

  • The names of required (reqd) and optional (optn) parameters.

  • A description to be displayed when calling xe help <cmd> in the help field.

  • The implementation should use With_fd if any communication with the client is necessary (for example, showing the user a warning, sending the contents of a file, etc.) Otherwise, No_fd can be used as above.

  • The flags field can be used to set special options:

    • Vm_selectors: adds a “vm” parameter for the name of a VM (rather than a UUID)
    • Host_selectors: adds a “host” parameter for the name of a host (rather than a UUID)
    • Standard: includes the command in the list of common commands displayed by xe help
    • Neverforward:
    • Hidden:
    • Deprecated of string list:

Now we must implement Cli_operations.host_price_of. This is done in xapi-cli-server/cli_operations.ml. This function typically extracts the parameters and forwards them to the internal implementation of the function. Other arbitrary code is permitted. For example:

let host_price_of printer rpc session_id params =
  let host = Client.Host.get_by_uuid rpc session_id (List.assoc "host-uuid" params) in
  let item = List.assoc "item" params in
  let price = string_of_float (Client.Host.price_of ~rpc ~session_id ~host ~item) in
  printer (Cli_printer.PList [price])

Tab Completion in the CLI

The CLI features tab completion for many of its commands’ parameters. Tab completion is implemented in the file ocaml/xe-cli/bash-completion, which is installed on the host as /etc/bash_completion.d/cli, and is done on a parameter-name rather than on a command-name basis. The main portion of the bash-completion file is a case statement that contains a section for each of the parameters that benefit from completion. There is also an entry that catches all parameter names ending at -uuid, and performs an automatic lookup of suitable UUIDs. The host-uuid parameter of our new host-price-of command therefore automatically gains completion capabilities.

Executing the CLI operation

Recompile xapi with the changes described above and install it on a test machine.

Execute the following command to see if the function exists:

xe help host-price-of

Invoke the function itself with the following command:

xe host-price-of host-uuid=<tab> item=fish

and you should find out the price of fish.

Adding a XenAPI extension

A XenAPI extension is a new RPC which is implemented as a separate executable (i.e. it is not part of xapi) but which still benefits from xapi parameter type-checking, multi-language stub generation, documentation generation, authentication etc. An extension can be backported to previous versions by simply adding the implementation, without having to recompile xapi itself.

A XenAPI extension is in two parts:

  1. a declaration in the xapi datamodel. This must use the ~forward_to:(Extension "filename") parameter. The filename must be unique, and should be the same as the XenAPI call name.
  2. an implementation executable in the dom0 filesystem with path /etc/xapi.d/extensions/filename

To define an extension

First write the declaration in the datamodel. The act of specifying the types and writing the documentation will help clarify the intended meaning of the call.
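A declaration might look roughly like the sketch below. The message name, class, parameters and role are invented for illustration; only the ~forward_to:(Extension ...) part is the essential ingredient described above.

(* Hypothetical datamodel declaration sketch for an extension. *)
let pool_hello = call
    ~name:"hello"
    ~doc:"Example extension returning a greeting"
    ~params:[(Ref _pool, "self", "The pool")]
    ~result:(String, "A greeting string")
    ~lifecycle:[]
    ~allowed_roles:_R_POOL_ADMIN
    ~forward_to:(Extension "Pool.hello")
    ()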

Second create a prototype of your implementation and put an executable file in /etc/xapi.d/extensions/filename. The calling convention is:

  • the file must be executable
  • xapi will parse the XMLRPC call arguments received over the network and check the session_id is valid
  • xapi will execute the named executable
  • the XMLRPC call arguments will be sent to the executable on stdin and stdin will be closed afterwards
  • the executable will run and print an XMLRPC response on stdout
  • xapi will read the response and return it to the client.

See the basic example.
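As a rough, hypothetical sketch (not the basic example referenced above), an implementation could be as small as the following OCaml program, compiled and installed as /etc/xapi.d/extensions/<filename>. The exact shape of the response envelope (a struct with Status and Value members) is an assumption here, based on the usual XenAPI result format.

(* Sketch of a minimal extension executable. *)
let () =
  (* xapi writes the XML-RPC call arguments to stdin and then closes it;
     a real implementation would parse them. This sketch discards them. *)
  let buf = Buffer.create 1024 in
  (try
     while true do
       Buffer.add_channel buf stdin 1
     done
   with End_of_file -> ()) ;
  ignore (Buffer.contents buf) ;
  (* Reply on stdout with an XML-RPC response. *)
  print_string
    (String.concat "\n"
       [ "<?xml version=\"1.0\"?>"
       ; "<methodResponse><params><param><value><struct>"
       ; "<member><name>Status</name><value>Success</value></member>"
       ; "<member><name>Value</name><value>hello</value></member>"
       ; "</struct></value></param></params></methodResponse>"
       ; "" ])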

Third, make a pull request containing only the datamodel definitions (it is not necessary to include the prototype too). This will attract review comments which will help you improve your API further. Once the pull request is merged, the API call name and extension are officially yours and you may use them on any xapi version which supports the extension mechanism.

Packaging your extension

Your extension /etc/xapi.d/extensions/filename (and dependencies) should be packaged for your target distribution (for XenServer dom0 this would be a CentOS RPM). Once the package is unpacked on the target machine, the extension should be immediately callable via the XenAPI, provided the xapi version supports the extension mechanism. Note the xapi version does not need to know about the specific extension in advance: it will always look in /etc/xapi.d/extensions/ for all RPC calls whose name it does not recognise.

Limitations

On type-checking

  • if the xapi version is new enough to know about your specific extension: xapi will type-check the call arguments for you
  • if the xapi version is too old to know about your specific extension: the extension will still be callable but the arguments will not be type-checked.

On access control

  • if the xapi version is new enough to know about your specific extension: you can declare that a user must have a particular role (e.g. ‘VM admin’)
  • if the xapi version is too old to know about your specific extension: the extension will still be callable but the client must have the ‘Pool admin’ role.

Since a xapi which knows about your specific extension is stricter than an older xapi, it’s a good idea to develop against the new xapi and then test older xapi versions later.

Database

Metadata-on-LUN

In the present version of XenServer, metadata changes resulting in writes to the database are not immediately persisted to non-volatile storage. Hence, in case of failure, up to five minutes’ worth of metadata changes could be lost. The Metadata-on-LUN feature addresses the issue by ensuring that all database writes are retained. This will be used to improve recovery from failure by storing incremental deltas which can be re-applied to an old version of the database to bring it more up-to-date. An implication of this is that clients will no longer be required to perform a ‘pool-sync-database’ to protect critical writes, because all writes will be implicitly protected.

This is implemented by saving descriptions of all persistent database writes to a LUN when HA is active. Upon xapi restart after failure, such as on master fail-over, these descriptions are read and parsed to restore the latest version of the database.

Layout on block device

It is useful to store the database on the block device as well as the deltas, so that it is unambiguous on recovery which version of the database the deltas apply to.

The content of the block device will be structured as shown in the table below. It consists of a header; the rest of the device is split into two halves.

Section               Length (bytes)    Description
Header                16                Magic identifier
                      1                 ASCII NUL
                      1                 Validity byte
First half database   36                UUID as ASCII string
                      16                Length of database as decimal ASCII
                      (as specified)    Database (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string
First half deltas     16                Length of database delta as decimal ASCII
                      (as specified)    Database delta (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string
Second half database  36                UUID as ASCII string
                      16                Length of database as decimal ASCII
                      (as specified)    Database (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string
Second half deltas    16                Length of database delta as decimal ASCII
                      (as specified)    Database delta (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string

After the header, one or both halves may be devoid of content. In a half which contains a database, there may be zero or more deltas (repetitions of the last three entries in each half).

The structure of the device is split into two halves to provide double-buffering. In case of failure during write to one half, the other half remains intact.

The magic identifier at the start of the device protects against attempting to treat a different device as a redo log.

The validity byte is a single ASCII character indicating the state of the two halves. It can take the following values:

Byte    Description
0       Neither half is valid
1       First half is valid
2       Second half is valid

The use of lengths preceding data sections permits convenient reading. The constant repetitions of the UUIDs act as nonces to protect against reading invalid data in the case of an incomplete or corrupt write.
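For example, a reader of the device might interpret the validity byte along these lines (a sketch only, assuming the byte is the ASCII digit shown in the table above):

(* Sketch: deciding which half of the device to read. *)
type half = Neither | First | Second

let half_of_validity_byte = function
  | '0' -> Neither
  | '1' -> First
  | '2' -> Second
  | _   -> Neither (* treat anything unexpected as "no valid half" *)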

Architecture

The I/O to and from the block device may involve long delays. For example, if there is a network problem, or the iSCSI device disappears, the I/O calls may block indefinitely. It is important to isolate this from xapi. Hence, I/O with the block device will occur in a separate process.

Xapi will communicate with the I/O process via a UNIX domain socket using a simple text-based protocol described below. The I/O process is designed to always accept xapi’s requests within a guaranteed upper limit on the delay, so xapi can communicate with the process using blocking I/O.

Xapi will interact with the I/O process in a best-effort fashion. If it cannot communicate with the process, or the process indicates that it has not carried out the requested command, xapi will continue execution regardless. Redo-log entries are idempotent (modulo the raising of exceptions in some cases) so it is of little consequence if a particular entry cannot be written but others can. If xapi notices that the process has died, it will attempt to restart it.

The I/O process keeps track of a pointer for each half indicating the position at which the next delta will be written in that half.

Protocol

Upon connection to the control socket, the I/O process will attempt to connect to the block device. Depending on whether this is successful or unsuccessful, one of two responses will be sent to the client.

  • connect|ack_ if it is successful; or

  • connect|nack|<length>|<message> if it is unsuccessful, perhaps because the block device does not exist or cannot be read from. The <message> is a description of the error; the <length> of the message is expressed using 16 digits of decimal ASCII.

The former message indicates that the I/O process is ready to receive commands. The latter message indicates that commands can not be sent to the I/O process.

There are three commands which xapi can send to the I/O process. These are described below, with a high level description of the operational semantics of the I/O process’ actions, and the corresponding responses. For ease of parsing, each command is ten bytes in length.

Write database

Xapi requests that a new database is written to the block device, and sends its content using the data socket.

Command:
: writedb___|<uuid>|<generation-count>|<length>
The UUID is expressed as 36 ASCII characters. The length of the data and the generation-count are expressed using 16 digits of decimal ASCII.
Semantics:
  1. Read the validity byte.
  2. If one half is valid, we will use the other half. If no halves are valid, we will use the first half.
  3. Read the data from the data socket and write it into the chosen half.
  4. Set the pointer for the chosen half to point to the position after the data.
  5. Set the validity byte to indicate the chosen half is valid.
Response:
: writedb|ack_ in case of successful write; or
writedb|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Write database delta

Xapi sends a description of a database delta to append to the block device.

Command:
: writedelta|<uuid>|<generation-count>|<length>|<data>
The UUID is expressed as 36 ASCII characters. The length of the data and the generation-count are expressed using 16 digits of decimal ASCII.
Semantics:
  1. Read the validity byte to establish which half is valid. If neither half is valid, return with a nack.
  2. If the half’s pointer is set, seek to that position. Otherwise, scan through the half and stop at the position after the last write.
  3. Write the entry.
  4. Update the half’s pointer to point to the position after the entry.
Response:
: writedelta|ack_ in case of successful append; or
writedelta|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.
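For illustration, a command with this framing could be assembled as follows. This is a sketch only; it assumes the 16-digit fields are zero-padded decimal ASCII, which is one natural reading of the description above.

(* Sketch: building a writedelta command for the control socket. *)
let writedelta_command ~uuid ~generation_count ~data =
  Printf.sprintf "writedelta|%s|%016Ld|%016d|%s"
    uuid generation_count (String.length data) data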

Read log

Xapi requests the contents of the log.

Command:

: read______

Semantics:
  1. Read the validity byte to establish which half is valid. If neither half is valid, return with an end.
  2. Attempt to read the database from the current half.
  3. If this is successful, continue in that half, reading entries up to the position of the half’s pointer. If the pointer is not set, read until a record of length zero is found or the end of the half is reached. Otherwise, if the attempt to read the database was not successful, switch to using the other half and try again from step 2.
  4. Finally output an end.
Response:
: read|nack_|<length>|<message> in case of error; or
read|db___|<generation-count>|<length>|<data> for a database record, then a sequence of zero or more
read|delta|<generation-count>|<length>|<data> for each delta record, then
read|end__
For each record, and for error messages, the length of the data or message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Re-initialise log

Xapi requests that the block device is re-initialised with a fresh redo-log.

Command:

: empty_____

Semantics:

: 1. Set the validity byte to indicate that neither half is valid.

Response:
: empty|ack_ in case of successful re-initialisation; or
empty|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Impact on xapi performance

The implementation of the feature causes a slow-down in xapi of around 6% in the general case. However, if the LUN becomes inaccessible this can cause a slow-down of up to 25% in the worst case.

The figure below shows the result of testing four configurations, counting the number of database writes effected through a command-line ‘xe pool-param-set’ call.

  • The first and second configurations are xapi without the Metadata-on-LUN feature, with HA disabled and enabled respectively.

  • The third configuration shows xapi with the Metadata-on-LUN feature using a healthy LUN to which all database writes can be successfully flushed.

  • The fourth configuration shows xapi with the Metadata-on-LUN feature using an inaccessible LUN for which all database writes fail.

Impact of feature on xapi database-writing performance. (Green points represent individual samples; red bars are the arithmetic means of samples.)

Testing strategy

The section above shows how xapi performance is affected by this feature. The sections below describe the dev-testing which has already been undertaken, and propose how this feature will impact on regression testing.

Dev-testing performed

A variety of informal tests have been performed as part of the development process:

  • Enable HA. Confirm the LUN starts being used to persist database writes.

  • Enable HA, then disable HA. Confirm the LUN stops being used.

  • Enable HA, kill xapi on the master, restart xapi on the master. Confirm that the last database write before the kill is successfully restored on restart.

  • Repeatedly enable and disable HA. Confirm that no file descriptors are leaked (verified by counting the number of descriptors in /proc/pid/fd/).

  • Enable HA, reboot the master. Due to HA, a slave becomes the master (or this can be forced using ‘xe pool-emergency-transition-to-master’). Confirm that the new master is able to restore the database from the LUN from the point the old master left off, and begins to write new changes to the LUN.

  • Enable HA, disable the iSCSI volume. Confirm that xapi continues to make progress, although database writes are not persisted.

  • Enable HA, disable and re-enable the iSCSI volume. Confirm that xapi begins to use the LUN when the iSCSI volume is re-enabled and that subsequent writes are persisted.

These tests have been undertaken using an iSCSI target VM and a real iSCSI volume on lannik. In these scenarios, disabling the iSCSI volume consists of stopping the VM and unmapping the LUN, respectively.

Proposed new regression test

A new regression test is proposed to confirm that all database writes are persisted across failure.

There are three types of database modification to test: row creation, field-write and row deletion. Although these three kinds of write could be tested in separate tests, the means of setting up the pre-conditions for a field-write and a row deletion require a row creation, so it is convenient to test them all in a single test.

  1. Start a pool containing three hosts.

  2. Issue a CLI command on the master to create a row in the database, e.g.

    xe network-create name-label=a.

  3. Forcefully power-cycle the master.

  4. On fail-over, issue a CLI command on the new master to check that the row creation persisted:

    xe network-list name-label=a,

    confirming that the returned string is non-empty.

  5. Issue a CLI command on the master to modify a field in the new row in the database:

    xe network-param-set uuid=<uuid> name-description=abcd,

    where <uuid> is the UUID returned from step 2.

  6. Forcefully power-cycle the master.

  7. On fail-over, issue a CLI command on the new master to check that the field-write persisted:

    xe network-param-get uuid=<uuid> param-name=name-description,

    where <uuid> is the UUID returned from step 2. The returned string should contain

    abcd.

  8. Issue a CLI command on the master to delete the row from the database:

    xe network-destroy uuid=<uuid>,

    where <uuid> is the UUID returned from step 2.

  9. Forcefully power-cycle the master.

  10. On fail-over, issue a CLI command on the new master to check that the row does not exist:

    xe network-list name-label=a,

    confirming that the returned string is empty.

Impact on existing regression tests

The Metadata-on-LUN feature should mean that there is no need to perform an ‘xe pool-sync-database’ operation in existing HA regression tests to ensure that database state persists on xapi failure.

Host memory accounting

Memory is used for many things:

  • the hypervisor code: this is the Xen executable itself
  • the hypervisor heap: this is needed for per-domain structures and per-vCPU structures
  • the crash kernel: this is needed to collect information after a host crash
  • domain RAM: this is the memory the VM believes it has
  • shadow memory: for HVM guests running on hosts without hardware assisted paging (HAP), Xen uses shadow page tables to optimise page table updates. For all guests, shadow memory is also used during live migration for tracking the memory transfer.
  • video RAM for the virtual graphics card

Some of these are constants (e.g. the hypervisor code) while some depend on the VM configuration (e.g. domain RAM). Xapi calls the constant part the “host overhead” and the part that varies with VM configuration the “VM overhead”. There is no low-level API to query this information, therefore xapi samples the host overhead at system boot time and models the per-VM overheads.

Host overhead

The host overhead is not managed by xapi, instead it is sampled. After the host boots and before any VMs start, xapi asks Xen how much memory the host has in total, and how much memory is currently free. Xapi subtracts the free from the total and stores this as the host overhead.
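
As a minimal sketch of this sampling (not xapi’s actual code; the two inputs stand in for the values xapi reads from Xen at boot, before any VM has started):

(* Host overhead as sampled at boot: everything Xen has already used
   before any VM has started. *)
let host_overhead_mib ~total_mib ~free_mib = Int64.sub total_mib free_mib

(* e.g. host_overhead_mib ~total_mib:262144L ~free_mib:261632L = 512L *)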

VM overhead

The inputs to the model are

  • VM.memory_static_max: the maximum amount of RAM the domain will be able to use
  • VM.HVM_shadow_multiplier: allows the shadow memory to be increased
  • VM.VCPUs_max: the maximum number of vCPUs the domain will be able to use

First the shadow memory is calculated, in MiB

Shadow memory in MiB

Second the VM overhead is calculated, in MiB

Memory overhead in MiB

Memory required to start a VM

If ballooning is disabled, the memory required to start a VM is the same as the VM overhead above.

If ballooning is enabled then the memory calculation above is modified to use the VM.memory_dynamic_max rather than the VM.memory_static_max.

Memory required to migrate a VM

If ballooning is disabled, the memory required to receive a migrating VM is the same as the VM overhead above.

If ballooning is enabled, then the VM will first be ballooned down to VM.memory_dynamic_min and then it will be migrated across. If the VM fails to balloon all the way down, then correspondingly more memory will be required on the receiving side.
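
The following sketch summarises which input feeds the model in each case. It is an illustration only: the record and function names are hypothetical, and vm_overhead_mib stands in for the overhead formula shown as an image above.

(* Sketch: which memory value the overhead model uses, per the rules above. *)
type vm_memory = {
  memory_static_max : int64;   (* bytes *)
  memory_dynamic_max : int64;  (* bytes *)
  memory_dynamic_min : int64;  (* bytes *)
}

(* Memory needed to start a VM: with ballooning we plan for dynamic_max,
   without ballooning for static_max. *)
let memory_required_to_start ~ballooning ~vm_overhead_mib vm =
  let base = if ballooning then vm.memory_dynamic_max else vm.memory_static_max in
  vm_overhead_mib ~memory:base

(* Memory needed to receive a migrating VM: with ballooning the sender first
   balloons the VM down towards dynamic_min (more is needed if that fails). *)
let memory_required_to_receive ~ballooning ~vm_overhead_mib vm =
  let base = if ballooning then vm.memory_dynamic_min else vm.memory_static_max in
  vm_overhead_mib ~memory:base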

Subsections of XAPI requests walk-throughs

From RPC migration request to xapi internals

Overview

In this document we will use the VM.pool_migrate request to illustrate the interaction between the various components of the XAPI toolstack during migration. The same scheme applies to other requests as well.

Not all parts of the Xapi toolstack are shown here, as not all are involved in the migration process. For instance, you won’t see squeezed or mpathalert, two daemons that belong to the toolstack but don’t participate in the migration of a VM.

Anatomy of a VM migration

  • Migration is initiated by a Xapi client that sends VM.pool_migrate, an XML-RPC request (see the sketch after this list).
  • The Xen API server handles this request and dispatches it to the server.
  • The server is generated using the XAPI IDL, and requests are wrapped within a context, either to be forwarded to a host or executed locally. Broadly, the context follows RBAC rules. The executed function is related to the message of the request (refer to the XenAPI Reference).
  • In the case of migration, refer to ocaml/idl/datamodel_vm.ml.
  • The server dispatches the operation to the server helpers, executing the operation synchronously or asynchronously and returning the RPC answer.
  • Message forwarding decides whether the operation must be executed by another host of the pool, and then either forwards the call or executes it locally.
  • When executed locally, the high-level migration operation is sent to the Xenopsd daemon by posting a message on a well-known queue on the message switch.
  • Xenopsd receives the command and splits it into several atomic operations that are run by the xenopsd backend.
  • Xenopsd and its backend can then access xenstore or issue hypercalls to interact with Xen and carry out the micro-operations.
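
For orientation, the initiating client call might look roughly as follows. This is a sketch using the generated OCaml bindings; the rpc function, session_id, vm and host values are assumed to already exist, the “live” option shown is illustrative, and the argument labels may differ slightly between versions.

(* Sketch: a XenAPI client kicking off the flow described above.
   [rpc], [session_id], [vm] and [host] are assumed to be set up already. *)
let start_migration ~rpc ~session_id ~vm ~host =
  Client.Client.VM.pool_migrate ~rpc ~session_id ~vm ~host
    ~options:[ ("live", "true") ]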

A diagram is worth a thousand words

flowchart TD
    %% First we are starting by a XAPI client that is sending an XML-RPC request
    client((Xapi client)) -. sends RPC XML request .-> xapi_server{"`Dispatch RPC **api_server.ml**`"}
    style client stroke:#CAFEEE,stroke-width:4px
    %% XAPI Toolstack internals
    subgraph "Xapi Toolstack (master of the pool)"
        style server stroke:#BAFA00,stroke-width:4px,stroke-dasharray: 5 5
        xapi_server --dispatch call (ie VM.pool_migrate)--> server("`Auto generated using *IDL* **server.ml**`")
        server --do_dispatch (ie VM.pool_migrate)--> server_helpers["`server helpers **server_helpers.ml**`"]
        server_helpers -- call management (ie xapi_vm_migrate.ml)--> message_forwarding["`check where to run the call **message_forwarding.ml**`"]
        message_forwarding -- execute locally --> vm_management["`VM Mgmt like **xapi_vm_migrate.ml**`"]
        vm_management -- Call --> xapi_xenops["`Transform xenops see (**xapi_xenops.ml**)`"]
        xapi_xenops <-- Post following IDL model (see xenops_interface.ml) --> msg_switch
        subgraph "Message Switch Daemon"
            msg_switch[["Queues"]]
        end
        subgraph "Xenopsd Daemon"
            msg_switch <-- Push/Pop on org.xen.xapi.xenopsd.classic --> xenopsd_server
            xenopsd_server["`Xenopsd *frontend* gets & splits high-level operations into atomics`"] o-- linked at compile time --o xenopsd_backend
        end
    end
    %% Xenopsd backend is accessing xen and xenstore
    xenopsd_backend["`Xenopsd *backend* Backend XC (libxenctrl)`"] -. access to .-> xen_hypervisor["Xen hypervisor & xenstore"]
    style xen_hypervisor stroke:#BEEF00,stroke-width:2px
    %% Can send request to the host where call must be executed
    message_forwarding -.forward call to .-> elected_host["Host where call must be executed"]
    style elected_host stroke:#B0A,stroke-width:4px

XAPI's Storage Layers

Info

The links on this page point to the source files of xapi v1.127.0 and xcp-idl v1.62.0, not to the latest source code.

At the beginning of 2023, significant changes were made to the layering. In particular, the wrapper code from storage_impl.ml has been pushed down the stack, below the mux, so that it only covers the SMAPIv1 backend and not SMAPIv3. Also, all of the code (from xcp-idl etc.) is now present in this repo (xen-api).

Xapi directly communicates only with the SMAPIv2 layer. There are no plugins directly implementing the SMAPIv2 interface, but the plugins in other layers are accessed through it:

graph TD
    A[xapi] --> B[SMAPIv2 interface]
    B --> C[SMAPIv2 <-> SMAPIv1 translation: storage_access.ml]
    B --> D[SMAPIv2 <-> SMAPIv3 translation: xapi-storage-script]
    C --> E[SMAPIv1 plugins]
    D --> F[SMAPIv3 plugins]

SMAPIv1

These are the files related to SMAPIv1 in xen-api/ocaml/xapi/:

  • sm.ml: OCaml “bindings” for the SMAPIv1 Python “drivers” (SM)
  • sm_exec.ml: support for implementing the above “bindings”. The parameters are converted to XML-RPC, passed to the relevant Python script (“driver”), and the standard output of the program is parsed as an XML-RPC response (we use xen-api-libs-transitional/http-svr/xMLRPC.ml for parsing XML-RPC). When adding new functionality, we can modify the call type to add parameters, but unless we are adding parameters common to all calls, we should just pass the new parameters in the args record.
  • smint.ml: Contains types, exceptions, … for the SMAPIv1 OCaml interface

SMAPIv2

These are the files related to SMAPIv2, which need to be modified to implement new calls:

  • xcp-idl/storage/storage_interface.ml: Contains the SMAPIv2 interface
  • xcp-idl/storage/storage_skeleton.ml: a stub SMAPIv2 storage server implementation that matches the SMAPIv2 storage server interface (this is verified by storage_skeleton_test.ml); each of its functions just raises a Storage_interface.Unimplemented error. This skeleton is used to automatically fill in the unimplemented methods of the storage servers below to satisfy the interface.
  • xen-api/ocaml/xapi/storage_access.ml: module SMAPIv1: a SMAPIv2 server that does SMAPIv2 -> SMAPIv1 translation. It passes the XML-RPC requests as the first command-line argument to the corresponding Python script, which returns an XML-RPC response on standard output.
  • xen-api/ocaml/xapi/storage_impl.ml: The Wrapper module wraps a SMAPIv2 server (Server_impl) and takes care of locking and datapaths (in case of multiple connections (=datapaths) from VMs to the same VDI, it will use the superstate computed by the Vdi_automaton in xcp-idl). It also implements some functionality, like the DP module, that is not implemented in lower layers.
  • xen-api/ocaml/xapi/storage_mux.ml: A SMAPIv2 server, which multiplexes between other servers. A different SMAPIv2 server can be registered for each SR. Then it forwards the calls for each SR to the “storage plugin” registered for that SR.

How SMAPIv2 works:

We use message-switch under the hood for RPC communication between xcp-idl components. The main Storage_mux.Server (basically Storage_impl.Wrapper(Mux)) is registered to listen on the “org.xen.xapi.storage” queue during xapi’s startup, and this is the main entry point for incoming SMAPIv2 function calls. Storage_mux does not really multiplex between different plugins right now: earlier during xapi’s startup, the same SMAPIv1 storage server module is registered on the various “org.xen.xapi.storage.<sr type>” queues for each supported SR type. (This will change with SMAPIv3, which is accessed via a SMAPIv2 plugin outside of xapi that translates between SMAPIv2 and SMAPIv3.) Then, in Storage_access.create_sr, which is called during SR.create, and also during PBD.plug, the relevant “org.xen.xapi.storage.<sr type>” queue needed for that PBD is registered with Storage_mux in Storage_access.bind for the SR of that PBD.
So basically what happens is that xapi registers itself as a SMAPIv2 server, and forwards incoming function calls to itself through message-switch, using its Storage_mux module. These calls are forwarded to xapi’s SMAPIv1 module doing SMAPIv2 -> SMAPIv1 translation.
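
The multiplexing idea can be pictured as a table from SR to the storage server registered for it. The following is a toy illustration only, not the real Storage_mux code:

(* Toy mux: SR -> registered plugin (modelled here as a function on requests).
   In the real system, registration happens at SR.create / PBD.plug time. *)
module Toy_mux = struct
  let plugins : (string, string -> string) Hashtbl.t = Hashtbl.create 16

  let register ~sr ~plugin = Hashtbl.replace plugins sr plugin

  (* Forward a call for [sr] to whatever was registered for it *)
  let call ~sr request =
    match Hashtbl.find_opt plugins sr with
    | Some plugin -> plugin request
    | None -> failwith ("no storage plugin registered for SR " ^ sr)
end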

Registration of the various storage servers

sequenceDiagram
    participant q as message-switch
    participant v1 as Storage_access.SMAPIv1
    participant svr as Storage_mux.Server
    Note over q, svr: xapi startup, "Starting SMAPIv1 proxies"
    q ->> v1:org.xen.xapi.storage.sr_type_1
    q ->> v1:org.xen.xapi.storage.sr_type_2
    q ->> v1:org.xen.xapi.storage.sr_type_3
    Note over q, svr: xapi startup, "Starting SM service"
    q ->> svr:org.xen.xapi.storage
    Note over q, svr: SR.create, PBD.plug
    svr ->> q:org.xapi.storage.sr_type_2

What happens when a SMAPIv2 “function” is called

graph TD
    call[SMAPIv2 call] --VDI.attach2--> org.xen.xapi.storage
    subgraph message-switch
        org.xen.xapi.storage
        org.xen.xapi.storage.SR_type_x
    end
    org.xen.xapi.storage --VDI.attach2--> Storage_impl.Wrapper
    subgraph xapi
        subgraph Storage_mux.server
            Storage_impl.Wrapper --> Storage_mux.mux
        end
        Storage_access.SMAPIv1
    end
    Storage_mux.mux --VDI.attach2--> org.xen.xapi.storage.SR_type_x
    org.xen.xapi.storage.SR_type_x --VDI.attach2--> Storage_access.SMAPIv1
    subgraph SMAPIv1
        driver_x[SMAPIv1 driver for SR_type_x]
    end
    Storage_access.SMAPIv1 --vdi_attach--> driver_x

Interface Changes, Backward Compatibility, & SXM

During SXM, xapi calls SMAPIv2 functions on a remote xapi. Therefore it is important to keep all those SMAPIv2 functions backward-compatible that we call remotely (e.g. Remote.VDI.attach), otherwise SXM from an older to a newer xapi will break.

Functionality implemented in SMAPIv2 layers

The layer between SMAPIv2 and SMAPIv1 is much fatter than the one between SMAPIv2 and SMAPIv3. The latter does not do much, apart from simple translation. However, the former has large portions of code in its intermediate layers, in addition to the basic SMAPIv2 <-> SMAPIv1 translation in storage_access.ml.

These are the three files in xapi that implement the SMAPIv2 storage interface, from higher to lower level:

  • storage_impl.ml
  • storage_mux.ml
  • storage_access.ml

Functionality implemented by higher layers is not implemented by the layers below it.

Extra functionality in storage_impl.ml

In addition to its usual functions, Storage_impl.Wrapper also implements the UPDATES and TASK SMAPIv2 APIs, without calling the wrapped module.

These are backed by the Updates, Task_server, and Scheduler modules from xcp-idl, instantiated in xapi’s Storage_task module. Migration code in Storage_mux will interact with these to update task progress. There is also an event loop in xapi that keeps calling UPDATES.get to keep the tasks in xapi’s database in sync with the storage manager’s tasks.
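
Schematically, that event loop is just a long-poll on UPDATES.get followed by a resync of the affected tasks; a toy sketch, with all names hypothetical:

(* Toy sketch of the polling loop: [updates_get] long-polls for task updates
   since [token]; [sync_task] refreshes the corresponding task in xapi's DB. *)
let rec storage_events_loop ~updates_get ~sync_task token =
  let updated_tasks, next_token = updates_get ~from:token in
  List.iter sync_task updated_tasks;
  storage_events_loop ~updates_get ~sync_task next_token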

Storage_impl.Wrapper also implements the legacy VDI.attach call by simply calling the newer VDI.attach2 call in the same module. In general, this is a good place to implement a compatibility layer for deprecated functionality removed from other layers, because this is the first module that intercepts a SMAPIv2 call.
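
Schematically, the compatibility pattern looks like this (a toy sketch; the signatures are invented for illustration and are not the real SMAPIv2 types):

(* Toy sketch: keep a deprecated entry point by delegating to its replacement. *)
module type ATTACH2 = sig
  val attach2 :
    dbg:string -> dp:string -> sr:string -> vdi:string -> read_write:bool -> string
end

module With_legacy_attach (M : ATTACH2) = struct
  include M

  (* Legacy VDI.attach, implemented in terms of the newer VDI.attach2 *)
  let attach ~dbg ~dp ~sr ~vdi ~read_write = M.attach2 ~dbg ~dp ~sr ~vdi ~read_write
end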

Extra functionality in storage_mux.ml

Storage_mux implements storage motion (SXM): it implements the DATA and DATA.MIRROR modules. Migration code will use the Storage_task module to run the operations and update the task’s progress.

It also implements the Policy module from the SMAPIv2 interface.

SMAPIv3

SMAPIv3 has a slightly different interface from SMAPIv2. The xapi-storage-script daemon is a SMAPIv2 plugin, separate from xapi, that does the SMAPIv2 ↔ SMAPIv3 translation. It keeps the plugins registered with xcp-idl (their message-switch queues) up to date as their files appear in or disappear from the relevant directory.

SMAPIv3 Interface

The SMAPIv3 interface is defined using an OCaml-based IDL from the ocaml-rpc library, and is in this repo: https://github.com/xapi-project/xapi-storage

From this interface we generate

  • OCaml RPC client bindings used in xapi-storage-script
  • The SMAPIv3 API reference
  • Python bindings, used by the SM scripts that implement the SMAPIv3 interface.
    • These bindings are built by running “make” in the root of xapi-storage, and appear in the _build/default/python/xapi/storage/api/v5 directory.
    • On a XenServer host, they are stored in the /usr/lib/python3.6/site-packages/xapi/storage/api/v5/ directory

SMAPIv3 Plugins

For SMAPIv3 we have volume plugins to manipulate SRs and volumes (=VDIs) in them, and datapath plugins for connecting to the volumes. Volume plugins tell us which datapath plugins we can use with each volume, and what to pass to the plugin. Both volume and datapath plugins implement some common functionality: the SMAPIv3 plugin interface.

How SMAPIv3 works:

The xapi-storage-script daemon detects volume and datapath plugins stored in subdirectories of the /usr/libexec/xapi-storage-script/volume/ and /usr/libexec/xapi-storage-script/datapath/ directories, respectively. When it finds a new datapath plugin, it adds the plugin to a lookup table and uses it the next time that datapath is required. When it finds a new volume plugin, it binds a new message-switch queue named after the plugin’s subdirectory to a new server instance that uses these volume scripts.

To invoke a SMAPIv3 method, it executes a program named <Interface name>.<function name> in the plugin’s directory, for example /usr/libexec/xapi-storage-script/volume/org.xen.xapi.storage.gfs2/SR.ls. The inputs to each script can be passed as command-line arguments and are type-checked using the generated Python bindings, and so are the outputs. The URIs of the SRs that xapi-storage-script knows about are stored in the /var/run/nonpersistent/xapi-storage-script/state.db file; these URIs can be used on the command line when an sr argument is expected.
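
A sketch of that invocation pattern (not xapi-storage-script itself; it assumes OCaml ≥ 4.14 for In_channel.input_all and ignores error handling):

(* Run <plugin dir>/<Interface>.<function> with its inputs as command-line
   arguments and return whatever it prints on standard output. *)
let call_volume_script ~plugin ~iface ~fn args =
  let dir = Filename.concat "/usr/libexec/xapi-storage-script/volume" plugin in
  let script = Filename.concat dir (iface ^ "." ^ fn) in
  let ic = Unix.open_process_in (Filename.quote_command script args) in
  let output = In_channel.input_all ic in
  ignore (Unix.close_process_in ic);
  output

(* e.g. call_volume_script ~plugin:"org.xen.xapi.storage.gfs2"
     ~iface:"SR" ~fn:"ls" [ "<sr uri>" ] *)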

Registration of the various SMAPIv3 plugins

sequenceDiagram
    participant q as message-switch
    participant v1 as (Storage_access.SMAPIv1)
    participant svr as Storage_mux.Server
    participant vol_dir as /../volume/
    participant dp_dir as /../datapath/
    participant script as xapi-storage-script
    Note over script, vol_dir: xapi-storage-script startup
    script ->> vol_dir: new subdir org.xen.xapi.storage.sr_type_4
    q ->> script: org.xen.xapi.storage.sr_type_4
    script ->> dp_dir: new subdir sr_type_4_dp
    Note over q, svr: xapi startup, "Starting SMAPIv1 proxies"
    q -->> v1:org.xen.xapi.storage.sr_type_1
    q -->> v1:org.xen.xapi.storage.sr_type_2
    q -->> v1:org.xen.xapi.storage.sr_type_3
    Note over q, svr: xapi startup, "Starting SM service"
    q ->> svr:org.xen.xapi.storage
    Note over q, svr: SR.create, PBD.plug
    svr ->> q:org.xapi.storage.sr_type_4

What happens when a SMAPIv3 “function” is called

graph TD
    call[SMAPIv2 call] --VDI.attach2--> org.xen.xapi.storage
    subgraph message-switch
        org.xen.xapi.storage
        org.xen.xapi.storage.SR_type_x
    end
    org.xen.xapi.storage --VDI.attach2--> Storage_impl.Wrapper
    subgraph xapi
        subgraph Storage_mux.server
            Storage_impl.Wrapper --> Storage_mux.mux
        end
        Storage_access.SMAPIv1
    end
    Storage_mux.mux --VDI.attach2--> org.xen.xapi.storage.SR_type_x
    org.xen.xapi.storage.SR_type_x -."VDI.attach2".-> Storage_access.SMAPIv1
    subgraph SMAPIv1
        driver_x[SMAPIv1 driver for SR_type_x]
    end
    Storage_access.SMAPIv1 -.vdi_attach.-> driver_x
    subgraph SMAPIv3
        xapi-storage-script --Datapath.attach--> v3_dp_plugin_x
        subgraph SMAPIv3 plugins
            v3_vol_plugin_x[volume plugin for SR_type_x]
            v3_dp_plugin_x[datapath plugin for SR_type_x]
        end
    end
    org.xen.xapi.storage.SR_type_x --VDI.attach2-->xapi-storage-script

Error reporting

In our SMAPIv1 OCaml “bindings” in xapi (xen-api/ocaml/xapi/sm_exec.ml), when we inspect the error codes returned from a call to SM, we translate some of the SMAPIv1/SM error codes to XenAPI errors, and for others, we just construct an error code of the form SR_BACKEND_FAILURE_<SM error number>.

The file xcp-idl/storage/storage_interface.ml defines a number of SMAPIv2 errors; ultimately, all errors from the various SMAPIv2 storage servers in xapi will be returned as one of these. Most of the errors aren’t converted into a specific exception in Storage_interface, but are simply wrapped with Storage_interface.Backend_error.

The Storage_access.transform_storage_exn function is used by the client code in xapi to translate the SMAPIv2 errors back into XenAPI errors; this unwraps the errors wrapped with Storage_interface.Backend_error.

Message Forwarding

In the message forwarding layer, we first check the validity of VDI operations using mark_vdi and mark_sr. These check that the operation is valid (for mark_vdi, using Xapi_vdi.check_operation_error, which also inspects the VDI’s current operations); if the operation is valid, it is added to the VDI’s current operations and update_allowed_operations is called. Then we forward the VDI operation to a suitable host that has a PBD plugged for the VDI’s SR.

Checking that the SR is attached

For the VDI operations, we check at two different places whether the SR is attached: first, at the Xapi level, in Xapi_vdi.check_operation_error (for the resize operation), and then, at the SMAPIv1 level, in Sm.assert_pbd_is_plugged. Sm.assert_pbd_is_plugged performs the same checks, and additionally checks that the PBD is attached to the local host, which Xapi_vdi.check_operation_error does not. This behaviour is correct, because Xapi_vdi.check_operation_error is called from the message forwarding layer, which forwards the call to a host that has the SR attached.

VDI Identifiers and Storage Motion

  • VDI “location”: this is the VDI identifier used by the SM backend. It is usually the UUID of the VDI, but for ISO SRs it is the name of the ISO.
  • VDI “content_id”: this is used for storage motion, to reduce the amount of data copied. When we copy over a VDI, the content_id will initially be the same. However, when we attach a VDI as read-write and then detach it, we blank its content_id (set it to a random UUID), because we may have written to it, so the content could be different.

Subsections of XAPI's Storage Layers

Storage migration

Overview

sequenceDiagram
    participant local_tapdisk as local tapdisk
    participant local_smapiv2 as local SMAPIv2
    participant xapi
    participant remote_xapi as remote xapi
    participant remote_smapiv2 as remote SMAPIv2 (might redirect)
    participant remote_tapdisk as remote tapdisk
    Note over xapi: Sort VDIs increasingly by size and then age
    loop VM's & snapshots' VDIs & suspend images
        xapi->>remote_xapi: plug dest SR to dest host and pool master
        alt VDI is not mirrored
            Note over xapi: We don't mirror RO VDIs & VDIs of snapshots
            xapi->>local_smapiv2: DATA.copy remote_sm_url
            activate local_smapiv2
            local_smapiv2-->>local_smapiv2: SR.scan
            local_smapiv2-->>local_smapiv2: VDI.similar_content
            local_smapiv2-->>remote_smapiv2: SR.scan
            Note over local_smapiv2: Find nearest smaller remote VDI remote_base, if any
            alt remote_base
                local_smapiv2-->>remote_smapiv2: VDI.clone
                local_smapiv2-->>remote_smapiv2: VDI.resize
            else no remote_base
                local_smapiv2-->>remote_smapiv2: VDI.create
            end
            Note over local_smapiv2: call copy'
            activate local_smapiv2
            local_smapiv2-->>remote_smapiv2: SR.list
            local_smapiv2-->>remote_smapiv2: SR.scan
            Note over local_smapiv2: create new datapaths remote_dp, base_dp, leaf_dp
            Note over local_smapiv2: find local base_vdi with same content_id as dest, if any
            local_smapiv2-->>remote_smapiv2: VDI.attach2 remote_dp dest
            local_smapiv2-->>remote_smapiv2: VDI.activate remote_dp dest
            opt base_vdi
                local_smapiv2-->>local_smapiv2: VDI.attach2 base_dp base_vdi
                local_smapiv2-->>local_smapiv2: VDI.activate base_dp base_vdi
            end
            local_smapiv2-->>local_smapiv2: VDI.attach2 leaf_dp vdi
            local_smapiv2-->>local_smapiv2: VDI.activate leaf_dp vdi
            local_smapiv2-->>remote_xapi: sparse_dd base_vdi vdi dest [NBD URI for dest & remote_dp]
            Note over remote_xapi: HTTP handler verifies credentials
            remote_xapi-->>remote_tapdisk: then passes connection to tapdisk's NBD server
            local_smapiv2-->>local_smapiv2: VDI.deactivate leaf_dp vdi
            local_smapiv2-->>local_smapiv2: VDI.detach leaf_dp vdi
            opt base_vdi
                local_smapiv2-->>local_smapiv2: VDI.deactivate base_dp base_vdi
                local_smapiv2-->>local_smapiv2: VDI.detach base_dp base_vdi
            end
            local_smapiv2-->>remote_smapiv2: DP.destroy remote_dp
            deactivate local_smapiv2
            local_smapiv2-->>remote_smapiv2: VDI.snapshot remote_copy
            local_smapiv2-->>remote_smapiv2: VDI.destroy remote_copy
            local_smapiv2->>xapi: task(snapshot)
            deactivate local_smapiv2
        else VDI is mirrored
            Note over xapi: We mirror RW VDIs of the VM
            Note over xapi: create new datapath dp
            xapi->>local_smapiv2: VDI.attach2 dp
            xapi->>local_smapiv2: VDI.activate dp
            xapi->>local_smapiv2: DATA.MIRROR.start dp remote_sm_url
            activate local_smapiv2
            Note over local_smapiv2: copy disk data & mirror local writes
            local_smapiv2-->>local_smapiv2: SR.scan
            local_smapiv2-->>local_smapiv2: VDI.similar_content
            local_smapiv2-->>remote_smapiv2: DATA.MIRROR.receive_start similars
            activate remote_smapiv2
            remote_smapiv2-->>local_smapiv2: mirror_vdi,mirror_dp,copy_diffs_from,copy_diffs_to,dummy_vdi
            deactivate remote_smapiv2
            local_smapiv2-->>local_smapiv2: DP.attach_info dp
            local_smapiv2-->>remote_xapi: connect to [NBD URI for mirror_vdi & mirror_dp]
            Note over remote_xapi: HTTP handler verifies credentials
            remote_xapi-->>remote_tapdisk: then passes connection to tapdisk's NBD server
            local_smapiv2-->>local_tapdisk: pass socket & dp to tapdisk of dp
            local_smapiv2-->>local_smapiv2: VDI.snapshot local_vdi [mirror:dp]
            local_smapiv2-->>local_tapdisk: [Python] unpause disk, pass dp
            local_tapdisk-->>remote_tapdisk: mirror new writes via NBD to socket
            Note over local_smapiv2: call copy' snapshot copy_diffs_to
            local_smapiv2-->>remote_smapiv2: VDI.compose copy_diffs_to mirror_vdi
            local_smapiv2-->>remote_smapiv2: VDI.remove_from_sm_config mirror_vdi base_mirror
            local_smapiv2-->>remote_smapiv2: VDI.destroy dummy_vdi
            local_smapiv2-->>local_smapiv2: VDI.destroy snapshot
            local_smapiv2->>xapi: task(mirror ID)
            deactivate local_smapiv2
            xapi->>local_smapiv2: DATA.MIRROR.stat
            activate local_smapiv2
            local_smapiv2->>xapi: dest_vdi
            deactivate local_smapiv2
        end
        loop until task finished
            xapi->>local_smapiv2: UPDATES.get
            xapi->>local_smapiv2: TASK.stat
        end
        xapi->>local_smapiv2: TASK.stat
        xapi->>local_smapiv2: TASK.destroy
    end
    opt for snapshot VDIs
        xapi->>local_smapiv2: SR.update_snapshot_info_src remote_sm_url
        activate local_smapiv2
        local_smapiv2-->>remote_smapiv2: SR.update_snapshot_info_dest
        deactivate local_smapiv2
    end
    Note over xapi: ...
    Note over xapi: reserve resources for the new VM in dest host
    loop all VDIs
        opt VDI is mirrored
            xapi->>local_smapiv2: DP.destroy dp
        end
    end
    opt post_detach_hook
        opt active local mirror
            local_smapiv2-->>remote_smapiv2: DATA.MIRROR.receive_finalize [mirror ID]
            Note over remote_smapiv2: destroy mirror dp
        end
    end
    Note over xapi: memory image migration by xenopsd
    Note over xapi: destroy the VM record

Receiving SXM

These are the remote calls in the above diagram sent from the remote host to the receiving end of storage motion:

  • Remote SMAPIv2 -> local SMAPIv2 RPC calls:
    • SR.list
    • SR.scan
    • SR.update_snapshot_info_dest
    • VDI.attach2
    • VDI.activate
    • VDI.snapshot
    • VDI.destroy
    • For copying:
      • For copying from base:
        • VDI.clone
        • VDI.resize
      • For copying without base:
        • VDI.create
    • For mirroring:
      • DATA.MIRROR.receive_start
      • VDI.compose
      • VDI.remove_from_sm_config
      • DATA.MIRROR.receive_finalize
  • HTTP requests to xapi:
    • Connecting to NBD URI via xapi’s HTTP handler

This is how xapi coordinates storage migration. We’ll go through it as a code walk-through of the two layers: xapi and storage-in-xapi (SMAPIv2).

Xapi code

The entry point is in xapi_vm_migrate.ml

The function takes several arguments:

  • a vm reference (vm)
  • a dictionary of (string * string) key-value pairs about the destination (dest). This is the result of a previous call to the destination pool, Host.migrate_receive
  • live, a boolean of whether we should live-migrate or suspend-resume,
  • vdi_map, a mapping of VDI references to destination SR references,
  • vif_map, a mapping of VIF references to destination network references,
  • vgpu_map, similar for VGPUs
  • options, another dictionary of options
let migrate_send'  ~__context ~vm ~dest ~live ~vdi_map ~vif_map ~vgpu_map ~options =
  SMPERF.debug "vm.migrate_send called vm:%s" (Db.VM.get_uuid ~__context ~self:vm);

  let open Xapi_xenops in

  let localhost = Helpers.get_localhost ~__context in
  let remote = remote_of_dest dest in

  (* Copy mode means we don't destroy the VM on the source host. We also don't
     	   copy over the RRDs/messages *)
  let copy = try bool_of_string (List.assoc "copy" options) with _ -> false in

It begins by getting the local host reference, deciding whether we’re copying or moving, and converting the input dest parameter from an untyped string association list to a typed record, remote, which is declared further up the file:

type remote = {
  rpc : Rpc.call -> Rpc.response;
  session : API.ref_session;
  sm_url : string;
  xenops_url : string;
  master_url : string;
  remote_ip : string; (* IP address *)
  remote_master_ip : string; (* IP address *)
  dest_host : API.ref_host;
}

this contains:

  • A function, rpc, for calling XenAPI RPCs on the destination
  • A session valid on the destination
  • A sm_url on which SMAPIv2 APIs can be called on the destination
  • A master_url on which XenAPI commands can be called (not currently used)
  • The IP address, remote_ip, of the destination host
  • The IP address, remote_master_ip, of the master of the destination pool

Next, we determine which VDIs to copy:

  (* The first thing to do is to create mirrors of all the disks on the remote.
     We look through the VM's VBDs and all of those of the snapshots. We then
     compile a list of all of the associated VDIs, whether we mirror them or not
     (mirroring means we believe the VDI to be active and new writes should be
     mirrored to the destination - otherwise we just copy it)
     We look at the VDIs of the VM, the VDIs of all of the snapshots, and any
     suspend-image VDIs. *)

  let vm_uuid = Db.VM.get_uuid ~__context ~self:vm in
  let vbds = Db.VM.get_VBDs ~__context ~self:vm in
  let vifs = Db.VM.get_VIFs ~__context ~self:vm in
  let snapshots = Db.VM.get_snapshots ~__context ~self:vm in
  let vm_and_snapshots = vm :: snapshots in
  let snapshots_vbds = List.flatten (List.map (fun self -> Db.VM.get_VBDs ~__context ~self) snapshots) in
  let snapshot_vifs = List.flatten (List.map (fun self -> Db.VM.get_VIFs ~__context ~self) snapshots) in

we now decide whether we’re intra-pool or not and, if we’re intra-pool, whether we’re migrating onto the same host (localhost migrate). Intra-pool is decided by attempting to look up the destination host reference in our own database: if the lookup succeeds, the destination host is in our pool.

  let is_intra_pool = try ignore(Db.Host.get_uuid ~__context ~self:remote.dest_host); true with _ -> false in
  let is_same_host = is_intra_pool && remote.dest_host == localhost in

  if copy && is_intra_pool then raise (Api_errors.Server_error(Api_errors.operation_not_allowed, [ "Copy mode is disallowed on intra pool storage migration, try efficient alternatives e.g. VM.copy/clone."]));

Having got all of the VBDs of the VM, we now need to find the associated VDIs, filtering out empty CDs, and decide whether we’re going to copy them or mirror them - read-only VDIs can be copied but RW VDIs must be mirrored.

  let vms_vdis = List.filter_map (vdi_filter __context true) vbds in

where vdi_filter is defined earlier:

(* We ignore empty or CD VBDs - nothing to do there. Possible redundancy here:
   I don't think any VBDs other than CD VBDs can be 'empty' *)
let vdi_filter __context allow_mirror vbd =
  if Db.VBD.get_empty ~__context ~self:vbd || Db.VBD.get_type ~__context ~self:vbd = `CD
  then None
  else
    let do_mirror = allow_mirror && (Db.VBD.get_mode ~__context ~self:vbd = `RW) in
    let vm = Db.VBD.get_VM ~__context ~self:vbd in
    let vdi = Db.VBD.get_VDI ~__context ~self:vbd in
    Some (get_vdi_mirror __context vm vdi do_mirror)

This in turn calls get_vdi_mirror which gathers together some important info:

let get_vdi_mirror __context vm vdi do_mirror =
  let snapshot_of = Db.VDI.get_snapshot_of ~__context ~self:vdi in
  let size = Db.VDI.get_virtual_size ~__context ~self:vdi in
  let xenops_locator = Xapi_xenops.xenops_vdi_locator ~__context ~self:vdi in
  let location = Db.VDI.get_location ~__context ~self:vdi in
  let dp = Storage_access.presentative_datapath_of_vbd ~__context ~vm ~vdi in
  let sr = Db.SR.get_uuid ~__context ~self:(Db.VDI.get_SR ~__context ~self:vdi) in
  {vdi; dp; location; sr; xenops_locator; size; snapshot_of; do_mirror}

The record is helpfully commented above:

type vdi_mirror = {
  vdi : [ `VDI ] API.Ref.t;           (* The API reference of the local VDI *)
  dp : string;                        (* The datapath the VDI will be using if the VM is running *)
  location : string;                  (* The location of the VDI in the current SR *)
  sr : string;                        (* The VDI's current SR uuid *)
  xenops_locator : string;            (* The 'locator' xenops uses to refer to the VDI on the current host *)
  size : Int64.t;                     (* Size of the VDI *)
  snapshot_of : [ `VDI ] API.Ref.t;   (* API's snapshot_of reference *)
  do_mirror : bool;                   (* Whether we should mirror or just copy the VDI *)
}

xenops_locator is <sr uuid>/<vdi uuid>, and dp is vbd/<domid>/<device> if the VM is running and vbd/<vm_uuid>/<vdi_uuid> if not.
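
In other words (illustrative only):

(* The identifier formats described above. *)
let xenops_locator ~sr_uuid ~vdi_uuid = sr_uuid ^ "/" ^ vdi_uuid

let dp ~vm_running ~domid ~device ~vm_uuid ~vdi_uuid =
  if vm_running then Printf.sprintf "vbd/%d/%s" domid device
  else Printf.sprintf "vbd/%s/%s" vm_uuid vdi_uuid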

So now we have a list of these records for all VDIs attached to the VM. For these we check explicitly that they’re all defined in the vdi_map, the mapping of VDI references to their destination SR references.

  check_vdi_map ~__context vms_vdis vdi_map;

We then figure out the VIF map:

 let vif_map =
    if is_intra_pool then vif_map
    else infer_vif_map ~__context (vifs @ snapshot_vifs) vif_map
  in

More sanity checks: We can’t do a storage migration if any of the VDIs is a reset-on-boot one - since the state will be lost on the destination when it’s attached:

(* Block SXM when VM has a VDI with on_boot=reset *)
  List.(iter (fun vconf ->
      let vdi = vconf.vdi in
      if (Db.VDI.get_on_boot ~__context ~self:vdi ==`reset) then
        raise (Api_errors.Server_error(Api_errors.vdi_on_boot_mode_incompatible_with_operation, [Ref.string_of vdi]))) vms_vdis) ;

We now consider all of the VDIs associated with the snapshots. As for the VM’s VBDs above, we end up with a vdi_mirror list. Note we pass false to the allow_mirror parameter of the get_vdi_mirror function as none of these snapshot VDIs will ever require mirroring.

let snapshots_vdis = List.filter_map (vdi_filter __context false) snapshots_vbds in

Finally we get all of the suspend-image VDIs from all snapshots as well as the actual VM, since it might be suspended itself:

  let suspends_vdis =
    List.fold_left
      (fun acc vm ->
         if Db.VM.get_power_state ~__context ~self:vm = `Suspended
         then
           let vdi = Db.VM.get_suspend_VDI ~__context ~self:vm in
           let sr = Db.VDI.get_SR ~__context ~self:vdi in
           if is_intra_pool && Helpers.host_has_pbd_for_sr ~__context ~host:remote.dest_host ~sr
           then acc
           else (get_vdi_mirror __context vm vdi false):: acc
         else acc)
      [] vm_and_snapshots in

Sanity check that we can see all of the suspend-image VDIs on this host:

 (* Double check that all of the suspend VDIs are all visible on the source *)
  List.iter (fun vdi_mirror ->
      let sr = Db.VDI.get_SR ~__context ~self:vdi_mirror.vdi in
      if not (Helpers.host_has_pbd_for_sr ~__context ~host:localhost ~sr)
      then raise (Api_errors.Server_error (Api_errors.suspend_image_not_accessible, [ Ref.string_of vdi_mirror.vdi ]))) suspends_vdis;

Next is a fairly complex piece that determines the destination SR for all of these VDIs. We don’t require API users to decide destinations for all of the VDIs of snapshots and hence we have to make some decisions here:

  let dest_pool = List.hd (XenAPI.Pool.get_all remote.rpc remote.session) in
  let default_sr_ref =
    XenAPI.Pool.get_default_SR remote.rpc remote.session dest_pool in
  let suspend_sr_ref =
    let pool_suspend_SR = XenAPI.Pool.get_suspend_image_SR remote.rpc remote.session dest_pool
    and host_suspend_SR = XenAPI.Host.get_suspend_image_sr remote.rpc remote.session remote.dest_host in
    if pool_suspend_SR <> Ref.null then pool_suspend_SR else host_suspend_SR in

  (* Resolve placement of unspecified VDIs here - unspecified VDIs that
            are 'snapshot_of' a specified VDI go to the same place. suspend VDIs
            that are unspecified go to the suspend_sr_ref defined above *)

  let extra_vdis = suspends_vdis @ snapshots_vdis in

  let extra_vdi_map =
    List.map
      (fun vconf ->
         let dest_sr_ref =
           let is_mapped = List.mem_assoc vconf.vdi vdi_map
           and snapshot_of_is_mapped = List.mem_assoc vconf.snapshot_of vdi_map
           and is_suspend_vdi = List.mem vconf suspends_vdis
           and remote_has_suspend_sr = suspend_sr_ref <> Ref.null
           and remote_has_default_sr = default_sr_ref <> Ref.null in
           let log_prefix =
             Printf.sprintf "Resolving VDI->SR map for VDI %s:" (Db.VDI.get_uuid ~__context ~self:vconf.vdi) in
           if is_mapped then begin
             debug "%s VDI has been specified in the map" log_prefix;
             List.assoc vconf.vdi vdi_map
           end else if snapshot_of_is_mapped then begin
             debug "%s Snapshot VDI has entry in map for it's snapshot_of link" log_prefix;
             List.assoc vconf.snapshot_of vdi_map
           end else if is_suspend_vdi && remote_has_suspend_sr then begin
             debug "%s Mapping suspend VDI to remote suspend SR" log_prefix;
             suspend_sr_ref
           end else if is_suspend_vdi && remote_has_default_sr then begin
             debug "%s Remote suspend SR not set, mapping suspend VDI to remote default SR" log_prefix;
             default_sr_ref
           end else if remote_has_default_sr then begin
             debug "%s Mapping unspecified VDI to remote default SR" log_prefix;
             default_sr_ref
           end else begin
             error "%s VDI not in VDI->SR map and no remote default SR is set" log_prefix;
             raise (Api_errors.Server_error(Api_errors.vdi_not_in_map, [ Ref.string_of vconf.vdi ]))
           end in
         (vconf.vdi, dest_sr_ref))
      extra_vdis in

At the end of this we’ve got all of the VDIs that need to be copied and destinations for all of them:

  let vdi_map = vdi_map @ extra_vdi_map in
  let all_vdis = vms_vdis @ extra_vdis in

  (* The vdi_map should be complete at this point - it should include all the
     VDIs in the all_vdis list. *)

Now we gather some final information together:

  assert_no_cbt_enabled_vdi_migrated ~__context ~vdi_map;

  let dbg = Context.string_of_task __context in
  let open Xapi_xenops_queue in
  let queue_name = queue_of_vm ~__context ~self:vm in
  let module XenopsAPI = (val make_client queue_name : XENOPS) in

  let remote_vdis = ref [] in

  let ha_always_run_reset = not is_intra_pool && Db.VM.get_ha_always_run ~__context ~self:vm in

  let cd_vbds = find_cds_to_eject __context vdi_map vbds in
  eject_cds __context cd_vbds;

check there’s no CBT (we can’t currently migrate the CBT metadata), construct our client for talking to Xenopsd, make a mutable list of remote VDIs (which I think is redundant right now), decide whether we need to do anything for HA (we disable HA protection for this VM on the destination until it’s fully migrated) and eject any CDs from the VM.

Up until now this has mostly been gathering info (aside from the ejecting CDs bit), but now we’ll start to do some actions, so we begin a try-catch block:

try

but we’ve still got a bit of thinking to do: we sort the VDIs to copy based on age/size:

    (* Sort VDIs by size in principle and then age secondly. This gives better
       chances that similar but smaller VDIs would arrive comparatively
       earlier, which can serve as base for incremental copying the larger
       ones. *)
    let compare_fun v1 v2 =
      let r = Int64.compare v1.size v2.size in
      if r = 0 then
        let t1 = Date.to_unix_time (Db.VDI.get_snapshot_time ~__context ~self:v1.vdi) in
        let t2 = Date.to_unix_time (Db.VDI.get_snapshot_time ~__context ~self:v2.vdi) in
        compare t1 t2
      else r in
    let all_vdis = all_vdis |> List.sort compare_fun in

    let total_size = List.fold_left (fun acc vconf -> Int64.add acc vconf.size) 0L all_vdis in
    let so_far = ref 0L in

OK, let’s copy/mirror:

    with_many (vdi_copy_fun __context dbg vdi_map remote is_intra_pool remote_vdis so_far total_size copy) all_vdis @@ fun all_map ->

The copy functions are written such that they take continuations. This is to make the error handling simpler - each individual component function can perform its setup and execute the continuation. In the event of an exception coming from the continuation it can then unroll its bit of state and rethrow the exception for the next layer to handle.

with_many is a simple helper function for nesting invocations of functions that take continuations. It has the delightful type:

('a -> ('b -> 'c) -> 'c) -> 'a list -> ('b list -> 'c) -> 'c
(* Helper function to apply a 'with_x' function to a list *)
let rec with_many withfn many fn =
  let rec inner l acc =
    match l with
    | [] -> fn acc
    | x::xs -> withfn x (fun y -> inner xs (y::acc))
  in inner many []

As an example of its operation, imagine our withfn is as follows:

let withfn x c =
  Printf.printf "Starting withfn: x=%d\n" x;
  try
    c (string_of_int x)
  with e ->
    Printf.printf "Handling exception for x=%d\n" x;
    raise e;;

applying this gives the output:

utop # with_many withfn [1;2;3;4] (String.concat ",");;
Starting withfn: x=1
Starting withfn: x=2
Starting withfn: x=3
Starting withfn: x=4
- : string = "4,3,2,1"

whereas raising an exception in the continuation results in the following:

utop # with_many withfn [1;2;3;4] (fun _ -> failwith "error");;
Starting withfn: x=1
Starting withfn: x=2
Starting withfn: x=3
Starting withfn: x=4
Handling exception for x=4
Handling exception for x=3
Handling exception for x=2
Handling exception for x=1
Exception: Failure "error".

All the real action is in vdi_copy_fun, which copies or mirrors a single VDI:

let vdi_copy_fun __context dbg vdi_map remote is_intra_pool remote_vdis so_far total_size copy vconf continuation =
  TaskHelper.exn_if_cancelling ~__context;
  let open Storage_access in
  let dest_sr_ref = List.assoc vconf.vdi vdi_map in
  let dest_sr_uuid = XenAPI.SR.get_uuid remote.rpc remote.session dest_sr_ref in

  (* Plug the destination shared SR into destination host and pool master if unplugged.
     Plug the local SR into destination host only if unplugged *)
  let dest_pool = List.hd (XenAPI.Pool.get_all remote.rpc remote.session) in
  let master_host = XenAPI.Pool.get_master remote.rpc remote.session dest_pool in
  let pbds = XenAPI.SR.get_PBDs remote.rpc remote.session dest_sr_ref in
  let pbd_host_pair = List.map (fun pbd -> (pbd, XenAPI.PBD.get_host remote.rpc remote.session pbd)) pbds in
  let hosts_to_be_attached = [master_host; remote.dest_host] in
  let pbds_to_be_plugged = List.filter (fun (_, host) ->
      (List.mem host hosts_to_be_attached) && (XenAPI.Host.get_enabled remote.rpc remote.session host)) pbd_host_pair in
  List.iter (fun (pbd, _) ->
      if not (XenAPI.PBD.get_currently_attached remote.rpc remote.session pbd) then
        XenAPI.PBD.plug remote.rpc remote.session pbd) pbds_to_be_plugged;

It begins by attempting to ensure the SRs we require are definitely attached on the destination host and on the destination pool master.

There’s now a little logic to support the case where we have cross-pool SRs and the VDI is already visible to the destination pool. Since this is outside our normal support envelope there is a key in xapi_globs that has to be set (via xapi.conf) to enable this:

  let rec dest_vdi_exists_on_sr vdi_uuid sr_ref retry =
    try
      let dest_vdi_ref = XenAPI.VDI.get_by_uuid remote.rpc remote.session vdi_uuid in
      let dest_vdi_sr_ref = XenAPI.VDI.get_SR remote.rpc remote.session dest_vdi_ref in
      if dest_vdi_sr_ref = sr_ref then
        true
      else
        false
    with _ ->
      if retry then
        begin
          XenAPI.SR.scan remote.rpc remote.session sr_ref;
          dest_vdi_exists_on_sr vdi_uuid sr_ref false
        end
      else
        false
  in

  (* CP-4498 added an unsupported mode to use cross-pool shared SRs - the initial
     use case is for a shared raw iSCSI SR (same uuid, same VDI uuid) *)
  let vdi_uuid = Db.VDI.get_uuid ~__context ~self:vconf.vdi in
  let mirror = if !Xapi_globs.relax_xsm_sr_check then
      if (dest_sr_uuid = vconf.sr) then
        begin
          (* Check if the VDI uuid already exists in the target SR *)
          if (dest_vdi_exists_on_sr vdi_uuid dest_sr_ref true) then
            false
          else
            failwith ("SR UUID matches on destination but VDI does not exist")
        end
      else
        true
    else
      (not is_intra_pool) || (dest_sr_uuid <> vconf.sr)
  in

The check also covers the case where we’re doing an intra-pool migration and not copying all of the disks, in which case we don’t need to do anything for that disk.

We now have a wrapper function that creates a new datapath and passes it to a continuation function. On error it handles the destruction of the datapath:

let with_new_dp cont =
    let dp = Printf.sprintf (if vconf.do_mirror then "mirror_%s" else "copy_%s") vconf.dp in
    try cont dp
    with e ->
      (try SMAPI.DP.destroy ~dbg ~dp ~allow_leak:false with _ -> info "Failed to cleanup datapath: %s" dp);
      raise e in

and now a helper that, given a remote VDI uuid, looks up the reference on the remote host and gives it to a continuation function. On failure of the continuation it will destroy the remote VDI:

  let with_remote_vdi remote_vdi cont =
    debug "Executing remote scan to ensure VDI is known to xapi";
    XenAPI.SR.scan remote.rpc remote.session dest_sr_ref;
    let query = Printf.sprintf "(field \"location\"=\"%s\") and (field \"SR\"=\"%s\")" remote_vdi (Ref.string_of dest_sr_ref) in
    let vdis = XenAPI.VDI.get_all_records_where remote.rpc remote.session query in
    let remote_vdi_ref = match vdis with
      | [] -> raise (Api_errors.Server_error(Api_errors.vdi_location_missing, [Ref.string_of dest_sr_ref; remote_vdi]))
      | h :: [] -> debug "Found remote vdi reference: %s" (Ref.string_of (fst h)); fst h
      | _ -> raise (Api_errors.Server_error(Api_errors.location_not_unique, [Ref.string_of dest_sr_ref; remote_vdi])) in
    try cont remote_vdi_ref
    with e ->
      (try XenAPI.VDI.destroy remote.rpc remote.session remote_vdi_ref with _ -> error "Failed to destroy remote VDI");
      raise e in

another helper to gather together info about a mirrored VDI:

let get_mirror_record ?new_dp remote_vdi remote_vdi_reference =
    { mr_dp = new_dp;
      mr_mirrored = mirror;
      mr_local_sr = vconf.sr;
      mr_local_vdi = vconf.location;
      mr_remote_sr = dest_sr_uuid;
      mr_remote_vdi = remote_vdi;
      mr_local_xenops_locator = vconf.xenops_locator;
      mr_remote_xenops_locator = Xapi_xenops.xenops_vdi_locator_of_strings dest_sr_uuid remote_vdi;
      mr_local_vdi_reference = vconf.vdi;
      mr_remote_vdi_reference = remote_vdi_reference } in

and finally the really important function:

let mirror_to_remote new_dp =
    let task =
      if not vconf.do_mirror then
        SMAPI.DATA.copy ~dbg ~sr:vconf.sr ~vdi:vconf.location ~dp:new_dp ~url:remote.sm_url ~dest:dest_sr_uuid
      else begin
        (* Though we have no intention of "write", here we use the same mode as the
           associated VBD on a mirrored VDIs (i.e. always RW). This avoids problem
           when we need to start/stop the VM along the migration. *)
        let read_write = true in
        (* DP set up is only essential for MIRROR.start/stop due to their open ended pattern.
           It's not necessary for copy which will take care of that itself. *)
        ignore(SMAPI.VDI.attach ~dbg ~dp:new_dp ~sr:vconf.sr ~vdi:vconf.location ~read_write);
        SMAPI.VDI.activate ~dbg ~dp:new_dp ~sr:vconf.sr ~vdi:vconf.location;
        ignore(Storage_access.register_mirror __context vconf.location);
        SMAPI.DATA.MIRROR.start ~dbg ~sr:vconf.sr ~vdi:vconf.location ~dp:new_dp ~url:remote.sm_url ~dest:dest_sr_uuid
      end in

    let mapfn x =
      let total = Int64.to_float total_size in
      let done_ = Int64.to_float !so_far /. total in
      let remaining = Int64.to_float vconf.size /. total in
      done_ +. x *. remaining in

    let open Storage_access in

    let task_result =
      task |> register_task __context
      |> add_to_progress_map mapfn
      |> wait_for_task dbg
      |> remove_from_progress_map
      |> unregister_task __context
      |> success_task dbg in

    let mirror_id, remote_vdi =
      if not vconf.do_mirror then
        let vdi = task_result |> vdi_of_task dbg in
        remote_vdis := vdi.vdi :: !remote_vdis;
        None, vdi.vdi
      else
        let mirrorid = task_result |> mirror_of_task dbg in
        let m = SMAPI.DATA.MIRROR.stat ~dbg ~id:mirrorid in
        Some mirrorid, m.Mirror.dest_vdi in

    so_far := Int64.add !so_far vconf.size;
    debug "Local VDI %s %s to %s" vconf.location (if vconf.do_mirror then "mirrored" else "copied") remote_vdi;
    mirror_id, remote_vdi in

This is the bit that actually starts the mirroring or copying. Before the call to mirror we call VDI.attach and VDI.activate locally to ensure that if the VM is shutdown then the detach/deactivate there doesn’t kill the mirroring process.

Note the parameters to the SMAPI call are sr and vdi, locating the local VDI and SM backend, new_dp, the datapath we’re using for the mirroring, url, which is the remote url on which SMAPI calls work, and dest, the destination SR uuid. These are also the arguments to copy above too.

There’s a little function to calculate the overall progress of the task, and the function waits until the completion of the task before it continues. The function success_task will raise an exception if the task failed. For DATA.mirror, completion implies both that the disk data has been copied to the destination and that all local writes are being mirrored to the destination. Hence more cleanup must be done on cancellation. In contrast, if the DATA.copy path had been taken then the operation at this point has completely finished.

The result of this function is an optional mirror id and the remote VDI uuid.

Next, there is a post_mirror function:

  let post_mirror mirror_id mirror_record =
    try
      let result = continuation mirror_record in
      (match mirror_id with
       | Some mid -> ignore(Storage_access.unregister_mirror mid);
       | None -> ());
      if mirror && not (Xapi_fist.storage_motion_keep_vdi () || copy) then
        Helpers.call_api_functions ~__context (fun rpc session_id ->
            XenAPI.VDI.destroy rpc session_id vconf.vdi);
      result
    with e ->
      let mirror_failed =
        match mirror_id with
        | Some mid ->
          ignore(Storage_access.unregister_mirror mid);
          let m = SMAPI.DATA.MIRROR.stat ~dbg ~id:mid in
          (try SMAPI.DATA.MIRROR.stop ~dbg ~id:mid with _ -> ());
          m.Mirror.failed
        | None -> false in
      if mirror_failed then raise (Api_errors.Server_error(Api_errors.mirror_failed,[Ref.string_of vconf.vdi]))
      else raise e in

This is poorly named - it is post mirror and copy. The aim of this function is to destroy the source VDIs on successful completion of the continuation function, which will have migrated the VM to the destination. In its exception handler it will stop the mirroring, but before doing so it will check to see if the mirroring process it was looking after has itself failed, and raise mirror_failed if so. This is because a failed mirror can result in a range of actual errors, and we decide here that the failed mirror was probably the root cause.

These functions are assembled together at the end of the vdi_copy_fun function:

   if mirror then
    with_new_dp (fun new_dp ->
        let mirror_id, remote_vdi = mirror_to_remote new_dp in
        with_remote_vdi remote_vdi (fun remote_vdi_ref ->
            let mirror_record = get_mirror_record ~new_dp remote_vdi remote_vdi_ref in
            post_mirror mirror_id mirror_record))
  else
    let mirror_record = get_mirror_record vconf.location (XenAPI.VDI.get_by_uuid remote.rpc remote.session vdi_uuid) in
    continuation mirror_record

again, mirror here is poorly named, and means mirror or copy.

Once all of the disks have been mirrored or copied, we jump back to the body of migrate_send. We split apart the mirror records according to the source of the VDI:

      let was_from vmap = List.exists (fun vconf -> vconf.vdi = vmap.mr_local_vdi_reference) in

      let suspends_map, snapshots_map, vdi_map = List.fold_left (fun (suspends, snapshots, vdis) vmap ->
          if was_from vmap suspends_vdis then  vmap :: suspends, snapshots, vdis
          else if was_from vmap snapshots_vdis then suspends, vmap :: snapshots, vdis
          else suspends, snapshots, vmap :: vdis
        ) ([],[],[]) all_map in

then we reassemble all_map from this, for some reason:

    let all_map = List.concat [suspends_map; snapshots_map; vdi_map] in

Now we need to update the snapshot-of links:

     (* All the disks and snapshots have been created in the remote SR(s),
       * so update the snapshot links if there are any snapshots. *)
      if snapshots_map <> [] then
        update_snapshot_info ~__context ~dbg ~url:remote.sm_url ~vdi_map ~snapshots_map;

I’m not entirely sure why this is done in this layer as opposed to in the storage layer.

A little housekeeping:

     let xenops_vdi_map = List.map (fun mirror_record -> (mirror_record.mr_local_xenops_locator, mirror_record.mr_remote_xenops_locator)) all_map in

      (* Wait for delay fist to disappear *)
      wait_for_fist __context Xapi_fist.pause_storage_migrate "pause_storage_migrate";

      TaskHelper.exn_if_cancelling ~__context;

the fist point here simply allows tests to insert a delay at this specific point.

We also check the task to see if we’ve been cancelled and raise an exception if so.

The VM metadata is now imported into the remote pool, with all the XenAPI level objects remapped:

let new_vm =
        if is_intra_pool
        then vm
        else
          (* Make sure HA replaning cycle won't occur right during the import process or immediately after *)
          let () = if ha_always_run_reset then XenAPI.Pool.ha_prevent_restarts_for ~rpc:remote.rpc ~session_id:remote.session ~seconds:(Int64.of_float !Xapi_globs.ha_monitor_interval) in
          (* Move the xapi VM metadata to the remote pool. *)
          let vms =
            let vdi_map =
              List.map (fun mirror_record -> {
                    local_vdi_reference = mirror_record.mr_local_vdi_reference;
                    remote_vdi_reference = Some mirror_record.mr_remote_vdi_reference;
                  })
                all_map in
            let vif_map =
              List.map (fun (vif, network) -> {
                    local_vif_reference = vif;
                    remote_network_reference = network;
                  })
                vif_map in
            let vgpu_map =
              List.map (fun (vgpu, gpu_group) -> {
                    local_vgpu_reference = vgpu;
                    remote_gpu_group_reference = gpu_group;
                  })
                vgpu_map
            in
            inter_pool_metadata_transfer ~__context ~remote ~vm ~vdi_map
              ~vif_map ~vgpu_map ~dry_run:false ~live:true ~copy
          in
          let vm = List.hd vms in
          let () = if ha_always_run_reset then XenAPI.VM.set_ha_always_run ~rpc:remote.rpc ~session_id:remote.session ~self:vm ~value:false in
          (* Reserve resources for the new VM on the destination pool's host *)
          let () = XenAPI.Host.allocate_resources_for_vm remote.rpc remote.session remote.dest_host vm true in
          vm in

More waiting for fist points:

     wait_for_fist __context Xapi_fist.pause_storage_migrate2 "pause_storage_migrate2";

      (* Attach networks on remote *)
      XenAPI.Network.attach_for_vm ~rpc:remote.rpc ~session_id:remote.session ~host:remote.dest_host ~vm:new_vm;

also make sure all the networks are plugged for the VM on the destination. Next we create the xenopsd-level vif map, equivalent to the vdi_map above:

  (* Create the vif-map for xenops, linking VIF devices to bridge names on the remote *)
      let xenops_vif_map =
        let vifs = XenAPI.VM.get_VIFs ~rpc:remote.rpc ~session_id:remote.session ~self:new_vm in
        List.map (fun vif ->
            let vifr = XenAPI.VIF.get_record ~rpc:remote.rpc ~session_id:remote.session ~self:vif in
            let bridge = Xenops_interface.Network.Local
                (XenAPI.Network.get_bridge ~rpc:remote.rpc ~session_id:remote.session ~self:vifr.API.vIF_network) in
            vifr.API.vIF_device, bridge
          ) vifs
      in

Now we destroy any extra mirror datapaths we set up previously:

     (* Destroy the local datapaths - this allows the VDIs to properly detach, invoking the migrate_finalize calls *)
      List.iter (fun mirror_record ->
          if mirror_record.mr_mirrored
          then match mirror_record.mr_dp with | Some dp ->  SMAPI.DP.destroy ~dbg ~dp ~allow_leak:false | None -> ()) all_map;

More housekeeping:

    SMPERF.debug "vm.migrate_send: migration initiated vm:%s" vm_uuid;

      (* In case when we do SXM on the same host (mostly likely a VDI
         migration), the VM's metadata in xenopsd will be in-place updated
         as soon as the domain migration starts. For these case, there
         will be no (clean) way back from this point. So we disable task
         cancellation for them here.
       *)
      if is_same_host then (TaskHelper.exn_if_cancelling ~__context; TaskHelper.set_not_cancellable ~__context);

Finally we get to the memory-image part of the migration:

      (* It's acceptable for the VM not to exist at this point; shutdown commutes with storage migrate *)
      begin
        try
          Xapi_xenops.Events_from_xenopsd.with_suppressed queue_name dbg vm_uuid
            (fun () ->
               let xenops_vgpu_map = (* can raise VGPU_mapping *)
                 infer_vgpu_map ~__context ~remote new_vm in
               migrate_with_retry
                 ~__context queue_name dbg vm_uuid xenops_vdi_map
                 xenops_vif_map xenops_vgpu_map remote.xenops_url;
               Xapi_xenops.Xenopsd_metadata.delete ~__context vm_uuid)
        with
        | Xenops_interface.Does_not_exist ("VM",_)
        | Xenops_interface.Does_not_exist ("extra",_) ->
          info "%s: VM %s stopped being live during migration"
            "vm_migrate_send" vm_uuid
        | VGPU_mapping(msg) ->
          info "%s: VM %s - can't infer vGPU map: %s"
            "vm_migrate_send" vm_uuid msg;
          raise Api_errors.
                  (Server_error
                     (vm_migrate_failed,
                      ([ vm_uuid
                       ; Helpers.get_localhost_uuid ()
                       ; Db.Host.get_uuid ~__context ~self:remote.dest_host
                       ; "The VM changed its power state during migration"
                       ])))
      end;

      debug "Migration complete";
      SMPERF.debug "vm.migrate_send: migration complete vm:%s" vm_uuid;

Now we tidy up after ourselves:

      (* So far the main body of migration is completed, and the rests are
         updates, config or cleanup on the source and destination. There will
         be no (clean) way back from this point, due to these destructive
         changes, so we don't want user intervention e.g. task cancellation.
       *)
      TaskHelper.exn_if_cancelling ~__context;
      TaskHelper.set_not_cancellable ~__context;
      XenAPI.VM.pool_migrate_complete remote.rpc remote.session new_vm remote.dest_host;

      detach_local_network_for_vm ~__context ~vm ~destination:remote.dest_host;
      Xapi_xenops.refresh_vm ~__context ~self:vm;

The function pool_migrate_complete is called on the destination host, and performs a few steps that would ordinarily happen during VM.start or similar:

let pool_migrate_complete ~__context ~vm ~host =
  let id = Db.VM.get_uuid ~__context ~self:vm in
  debug "VM.pool_migrate_complete %s" id;
  let dbg = Context.string_of_task __context in
  let queue_name = Xapi_xenops_queue.queue_of_vm ~__context ~self:vm in
  if Xapi_xenops.vm_exists_in_xenopsd queue_name dbg id then begin
    Cpuid_helpers.update_cpu_flags ~__context ~vm ~host;
    Xapi_xenops.set_resident_on ~__context ~self:vm;
    Xapi_xenops.add_caches id;
    Xapi_xenops.refresh_vm ~__context ~self:vm;
    Monitor_dbcalls_cache.clear_cache_for_vm ~vm_uuid:id
  end

More tidying up, remapping some remaining VBDs and clearing state on the sender:

      (* Those disks that were attached at the point the migration happened will have been
         remapped by the Events_from_xenopsd logic. We need to remap any other disks at
         this point here *)

      if is_intra_pool
      then
        List.iter
          (fun vm' ->
             intra_pool_vdi_remap ~__context vm' all_map;
             intra_pool_fix_suspend_sr ~__context remote.dest_host vm')
          vm_and_snapshots;

      (* If it's an inter-pool migrate, the VBDs will still be 'currently-attached=true'
         because we supressed the events coming from xenopsd. Destroy them, so that the
         VDIs can be destroyed *)
      if not is_intra_pool && not copy
      then List.iter (fun vbd -> Db.VBD.destroy ~__context ~self:vbd) (vbds @ snapshots_vbds);

      new_vm
    in

The remark about Events_from_xenopsd refers to a thread that watches for events emitted by xenopsd: xapi resynchronises its state with xenopsd’s for several fields for which xenopsd is considered the canonical source of truth. One of these is the exact VDI a VBD is associated with.

The suspend_SR field of the VM is set to the source’s value, so we reset that.

Now we move the RRDs:

  if not copy then begin
      Rrdd_proxy.migrate_rrd ~__context ~remote_address:remote.remote_ip ~session_id:(Ref.string_of remote.session)
        ~vm_uuid:vm_uuid ~host_uuid:(Ref.string_of remote.dest_host) ()
    end;

This can be done for intra- and inter-pool migrations in the same way, simplifying the logic.

However, messages and blobs only need to be migrated for inter-pool migrations:

   if not is_intra_pool && not copy then begin
      (* Replicate HA runtime flag if necessary *)
      if ha_always_run_reset then XenAPI.VM.set_ha_always_run ~rpc:remote.rpc ~session_id:remote.session ~self:new_vm ~value:true;
      (* Send non-database metadata *)
      Xapi_message.send_messages ~__context ~cls:`VM ~obj_uuid:vm_uuid
        ~session_id:remote.session ~remote_address:remote.remote_master_ip;
      Xapi_blob.migrate_push ~__context ~rpc:remote.rpc
        ~remote_address:remote.remote_master_ip ~session_id:remote.session ~old_vm:vm ~new_vm ;
      (* Signal the remote pool that we're done *)
    end;

Lastly, we destroy the VM record on the source:

    Helpers.call_api_functions ~__context (fun rpc session_id ->
        if not is_intra_pool && not copy then begin
          info "Destroying VM ref=%s uuid=%s" (Ref.string_of vm) vm_uuid;
          Xapi_vm_lifecycle.force_state_reset ~__context ~self:vm ~value:`Halted;
          List.iter (fun self -> Db.VM.destroy ~__context ~self) vm_and_snapshots
        end);
    SMPERF.debug "vm.migrate_send exiting vm:%s" vm_uuid;
    new_vm

The exception handler still has to clean up some state, but most of the cleanup is handled by the CPS functions declared above:

with e ->
    error "Caught %s: cleaning up" (Printexc.to_string e);

    (* We do our best to tidy up the state left behind *)
    Events_from_xenopsd.with_suppressed queue_name dbg vm_uuid (fun () ->
        try
          let _, state = XenopsAPI.VM.stat dbg vm_uuid in
          if Xenops_interface.(state.Vm.power_state = Suspended) then begin
            debug "xenops: %s: shutting down suspended VM" vm_uuid;
            Xapi_xenops.shutdown ~__context ~self:vm None;
          end;
        with _ -> ());

    if not is_intra_pool && Db.is_valid_ref __context vm then begin
      List.map (fun self -> Db.VM.get_uuid ~__context ~self) vm_and_snapshots
      |> List.iter (fun self ->
          try
            let vm_ref = XenAPI.VM.get_by_uuid remote.rpc remote.session self in
            info "Destroying stale VM uuid=%s on destination host" self;
            XenAPI.VM.destroy remote.rpc remote.session vm_ref
          with e -> error "Caught %s while destroying VM uuid=%s on destination host" (Printexc.to_string e) self)
    end;

    let task = Context.get_task_id __context in
    let oc = Db.Task.get_other_config ~__context ~self:task in
    if List.mem_assoc "mirror_failed" oc then begin
      let failed_vdi = List.assoc "mirror_failed" oc in
      let vconf = List.find (fun vconf -> vconf.location=failed_vdi) vms_vdis in
      debug "Mirror failed for VDI: %s" failed_vdi;
      raise (Api_errors.Server_error(Api_errors.mirror_failed,[Ref.string_of vconf.vdi]))
    end;
    TaskHelper.exn_if_cancelling ~__context;
    begin match e with
      | Storage_interface.Backend_error(code, params) -> raise (Api_errors.Server_error(code, params))
      | Storage_interface.Unimplemented(code) -> raise (Api_errors.Server_error(Api_errors.unimplemented_in_sm_backend, [code]))
      | Xenops_interface.Cancelled _ -> TaskHelper.raise_cancelled ~__context
      | _ -> raise e
    end

Failures during the migration can result in the VM being in a suspended state. There’s no point leaving it like this since there’s nothing that can be done to resume it, so we force shut it down.

We also try to remove the VM record from the destination if we managed to send it there.

Finally we check for mirror failure in the task - this flag is set by the events thread that watches for events from the storage layer, in storage_access.ml.
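
For reference, the write side of this flag amounts to adding a key to the task’s other_config. Below is a minimal sketch of what that looks like, assuming a task reference and the failed VDI’s location are in scope; the real code lives in storage_access.ml and may differ in detail.

(* Sketch only: mark the current task as having suffered a mirror failure, so
   that the exception handler above can find the flag and raise
   Api_errors.mirror_failed. *)
let note_mirror_failed ~__context ~failed_vdi_location =
  let task = Context.get_task_id __context in
  Db.Task.add_to_other_config ~__context ~self:task
    ~key:"mirror_failed" ~value:failed_vdi_location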

Storage code

The part of the code that is conceptually in the storage layer, but physically in xapi, is located in storage_migrate.ml. Logically this file has a few separate parts.

Let’s start by considering the way the storage APIs are intended to be used.

Copying a VDI

DATA.copy takes several parameters:

  • dbg - a debug string
  • sr - the source SR (a uuid)
  • vdi - the source VDI (a uuid)
  • dp - unused
  • url - a URL on which SMAPIv2 API calls can be made
  • dest - the destination SR into which the VDI should be copied

and returns a value of type Task.id. The API call is intended to be used asynchronously - i.e. the caller makes the call, receives the task ID back and polls or uses the event mechanism to wait until the task has completed. The task may be cancelled via the TASK.cancel API call. The result of the operation is obtained by calling TASK.stat, which returns a record:

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
		backtrace: string;
	}

Where the state field contains the result once the task has completed:

type async_result_t =
	| Vdi_info of vdi_info
	| Mirror_id of Mirror.id

type completion_t = {
	duration : float;
	result : async_result_t option
}

type state =
	| Pending of float
	| Completed of completion_t
	| Failed of Rpc.t

Once the result has been obtained from the task, the task should be destroyed via the TASK.destroy API call.
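
To make the intended usage pattern concrete, here is a minimal sketch of the poll-and-destroy sequence described above. It assumes a SMAPIv2 client module Client exposing DATA.copy, TASK.stat and TASK.destroy roughly as sketched in this section; the binding names and labelled arguments are illustrative rather than copied from the real interface files.

(* Sketch: start an asynchronous copy, poll TASK.stat until the task leaves
   the Pending state, read the result and then destroy the task. *)
let copy_and_wait ~dbg ~sr ~vdi ~url ~dest =
  let task_id = Client.DATA.copy ~dbg ~sr ~vdi ~dp:"unused" ~url ~dest in
  let rec wait () =
    match (Client.TASK.stat ~dbg ~task:task_id).state with
    | Pending _ -> Thread.delay 1.0; wait () (* or use the event mechanism *)
    | Completed { result = Some (Vdi_info vdi_info); _ } -> Ok vdi_info
    | Completed _ -> Error "task completed without a result"
    | Failed rpc -> Error (Jsonrpc.to_string rpc)
  in
  let result = wait () in
  (* The task must be destroyed once the result has been read *)
  Client.TASK.destroy ~dbg ~task:task_id;
  result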

The implementation uses the url parameter to make SMAPIv2 calls to the destination SR. This is used, for example, to invoke a VDI.create call if necessary. The URL contains an authentication token within it (valid for the duration of the XenAPI call that caused this DATA.copy API call).

The implementation tries to minimize the amount of data copied by looking for related VDIs on the destination SR. See below for more details.

Mirroring a VDI

DATA.MIRROR.start takes a similar set of parameters to that of copy:

  • dbg - a debug string
  • sr - the source SR (a uuid)
  • vdi - the source VDI (a uuid)
  • dp - the datapath on which the VDI has been attached
  • url - a URL on which SMAPIv2 API calls can be made
  • dest - the destination SR into which the VDI should be mirrored

Similar to copy above, this returns a task id. The task ‘completes’ once the mirror has been set up - that is, at any point afterwards we can detach the disk and the destination disk will be identical to the source. Unlike copy, the operation is ongoing after the API call completes, since new writes need to be mirrored to the destination. Therefore the completion type of the mirror operation is Mirror_id, which contains a handle on which further API calls related to the mirror can be made. For example MIRROR.stat, whose signature is:

MIRROR.stat: dbg:debug_info -> id:Mirror.id -> Mirror.t

The return type of this call is a record containing information about the mirror:

type state =
	| Receiving
	| Sending
	| Copying

type t = {
	source_vdi : vdi;
	dest_vdi : vdi;
	state : state list;
	failed : bool;
}

Note that state is a list since the initial phase of the operation requires both copying and mirroring.

Additionally the mirror can be cancelled using the MIRROR.stop API call.
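
Putting these calls together, a caller might drive the mirror API roughly as follows. This is a sketch under the same assumptions as the copy example above (an assumed Client module binding the SMAPIv2 calls, plus a hypothetical wait_for_mirror_id helper that polls the start task until it completes with a Mirror_id result).

(* Sketch: start a mirror, wait for it to be established, check its health
   via MIRROR.stat and finally stop it. *)
let mirror_check_and_stop ~dbg ~sr ~vdi ~dp ~url ~dest =
  let task_id = Client.DATA.MIRROR.start ~dbg ~sr ~vdi ~dp ~url ~dest in
  (* hypothetical helper: poll TASK.stat until a Mirror_id result appears *)
  let mirror_id = wait_for_mirror_id ~dbg task_id in
  let m = Client.DATA.MIRROR.stat ~dbg ~id:mirror_id in
  if m.failed then error "Mirror %s reports a failure" mirror_id;
  (* The mirror keeps running until it is explicitly stopped, e.g. once the
     VM has finished migrating *)
  Client.DATA.MIRROR.stop ~dbg ~id:mirror_id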

Code walkthrough

Let’s go through the implementation of copy:

DATA.copy

let copy ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
  debug "copy sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in

Here we are constructing a module Remote on which we can do SMAPIv2 calls directly on the destination.

  try

Wrap the whole function in an exception handler.

    (* Find the local VDI *)
    let vdis = Local.SR.scan ~dbg ~sr in
    let local_vdi =
      try List.find (fun x -> x.vdi = vdi) vdis
      with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

We first find the metadata for our source VDI by doing a local SMAPIv2 call SR.scan. This returns a list of VDI metadata, out of which we extract the VDI we’re interested in.

    try

Another exception handler. This looks redundant to me right now.

      let similar_vdis = Local.VDI.similar_content ~dbg ~sr ~vdi in
      let similars = List.map (fun vdi -> vdi.content_id) similar_vdis in
      debug "Similar VDIs to %s = [ %s ]" vdi (String.concat "; " (List.map (fun x -> Printf.sprintf "(vdi=%s,content_id=%s)" x.vdi x.content_id) similar_vdis));

Here we look for related VDIs locally using the VDI.similar_content SMAPIv2 API call. This searches for related VDIs and returns an ordered list where the most similar is first in the list. It returns both clones and snapshots, and hence is more general than simply following snapshot_of links.

      let remote_vdis = Remote.SR.scan ~dbg ~sr:dest in
      (** We drop cbt_metadata VDIs that do not have any actual data *)
      let remote_vdis = List.filter (fun vdi -> vdi.ty <> "cbt_metadata") remote_vdis in

      let nearest = List.fold_left
          (fun acc content_id -> match acc with
             | Some x -> acc
             | None ->
               try Some (List.find (fun vdi -> vdi.content_id = content_id && vdi.virtual_size <= local_vdi.virtual_size) remote_vdis)
               with Not_found -> None) None similars in

      debug "Nearest VDI: content_id=%s vdi=%s"
        (Opt.default "None" (Opt.map (fun x -> x.content_id) nearest))
        (Opt.default "None" (Opt.map (fun x -> x.vdi) nearest));

Here we look for VDIs on the destination with the same content_id as one of the locally similar VDIs. We will use this as a base image and only copy deltas to the destination. This is done by cloning the VDI on the destination and then using sparse_dd to find the deltas from our local disk to our local copy of the content_id disk, streaming these to the destination. Note that we need to ensure the base VDI is no larger than the one we want to copy, since we can’t resize disks downwards.

      let remote_base = match nearest with
        | Some vdi ->
          debug "Cloning VDI %s" vdi.vdi;
          let vdi_clone = Remote.VDI.clone ~dbg ~sr:dest ~vdi_info:vdi in
          if vdi_clone.virtual_size <> local_vdi.virtual_size then begin
            let new_size = Remote.VDI.resize ~dbg ~sr:dest ~vdi:vdi_clone.vdi ~new_size:local_vdi.virtual_size in
            debug "Resize remote VDI %s to %Ld: result %Ld" vdi_clone.vdi local_vdi.virtual_size new_size;
          end;
          vdi_clone
        | None ->
          debug "Creating a blank remote VDI";
          Remote.VDI.create ~dbg ~sr:dest ~vdi_info:{ local_vdi with sm_config = [] }  in

If we’ve found a base VDI we clone it and resize it immediately. If there’s nothing already on the destination that we can use, we just create a new VDI. Note that the calls to create and clone may well fail if the destination host is not the SR master. This is handled purely in the rpc function:

let rec rpc ~srcstr ~dststr url call =
  let result = XMLRPC_protocol.rpc ~transport:(transport_of_url url)
      ~srcstr ~dststr ~http:(xmlrpc ~version:"1.0" ?auth:(Http.Url.auth_of url) ~query:(Http.Url.get_query_params url) (Http.Url.get_uri url)) call
  in
  if not result.Rpc.success then begin
    debug "Got failure: checking for redirect";
    debug "Call was: %s" (Rpc.string_of_call call);
    debug "result.contents: %s" (Jsonrpc.to_string result.Rpc.contents);
    match Storage_interface.Exception.exnty_of_rpc result.Rpc.contents with
    | Storage_interface.Exception.Redirect (Some ip) ->
      let open Http.Url in
      let newurl =
        match url with
        | (Http h, d) ->
          (Http {h with host=ip}, d)
        | _ ->
          remote_url ip in
      debug "Redirecting to ip: %s" ip;
      let r = rpc ~srcstr ~dststr newurl call in
      debug "Successfully redirected. Returning";
      r
    | _ ->
      debug "Not a redirect";
      result
  end
  else result

Back to the copy function:

      let remote_copy = copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi:remote_base.vdi |> vdi_info in

This calls the actual data copy part. See below for more on that.

      let snapshot = Remote.VDI.snapshot ~dbg ~sr:dest ~vdi_info:remote_copy in
      Remote.VDI.destroy ~dbg ~sr:dest ~vdi:remote_copy.vdi;
      Some (Vdi_info snapshot)

Finally we snapshot the remote VDI to ensure we’ve got a VDI of type ‘snapshot’ on the destination, and we delete the non-snapshot VDI.

    with e ->
      error "Caught %s: copying snapshots vdi" (Printexc.to_string e);
      raise (Internal_error (Printexc.to_string e))
  with
  | Backend_error(code, params)
  | Api_errors.Server_error(code, params) ->
    raise (Backend_error(code, params))
  | e ->
    raise (Internal_error(Printexc.to_string e))

The exception handler does nothing - so we leak remote VDIs if the exception happens after we’ve done our cloning :-(

DATA.copy_into

Let’s now look at the data-copying part. This is common code shared between DATA.copy, DATA.copy_into and DATA.MIRROR.start, and hence has some duplication of the calls made above.

let copy_into ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi =
  copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi

copy_into is a stub that just calls copy'.

let copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi =
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in
  debug "copy local=%s/%s url=%s remote=%s/%s" sr vdi url dest dest_vdi;

This call takes roughly the same parameters as the DATA.copy call above, except that it specifies the destination VDI. Once again we construct a module to do remote SMAPIv2 calls.

  (* Check the remote SR exists *)
  let srs = Remote.SR.list ~dbg in
  if not(List.mem dest srs)
  then failwith (Printf.sprintf "Remote SR %s not found" dest);

Sanity check.

  let vdis = Remote.SR.scan ~dbg ~sr:dest in
  let remote_vdi =
    try List.find (fun x -> x.vdi = dest_vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Remote VDI %s not found" dest_vdi)
  in

Find the metadata of the destination VDI.

  let dest_content_id = remote_vdi.content_id in

If we’ve got a local VDI with the same content_id as the destination, we only need to copy the deltas, so we make a note of the destination content ID here.

  (* Find the local VDI *)
  let vdis = Local.SR.scan ~dbg ~sr in
  let local_vdi =
    try List.find (fun x -> x.vdi = vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

  debug "copy local=%s/%s content_id=%s" sr vdi local_vdi.content_id;
  debug "copy remote=%s/%s content_id=%s" dest dest_vdi remote_vdi.content_id;

Find the source VDI metadata.

  if local_vdi.virtual_size > remote_vdi.virtual_size then begin
    (* This should never happen provided the higher-level logic is working properly *)
    error "copy local=%s/%s virtual_size=%Ld > remote=%s/%s virtual_size = %Ld" sr vdi local_vdi.virtual_size dest dest_vdi remote_vdi.virtual_size;
    failwith "local VDI is larger than the remote VDI";
  end;

Sanity check - the remote VDI can’t be smaller than the source.

  let on_fail : (unit -> unit) list ref = ref [] in

We do some ugly error handling here by keeping a mutable list of operations to perform in the event of a failure.
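
The pattern is to push an undo action onto on_fail whenever a resource is created, and to run the recorded actions if anything later fails. The perform_cleanup_actions helper used at the bottom of the function is not shown in this walkthrough; a plausible sketch of it (an assumption, not the actual implementation) is:

(* Run each recorded cleanup action, logging and ignoring any exception so
   that one failing action does not prevent the others from running. *)
let perform_cleanup_actions actions =
  List.iter
    (fun f ->
       try f ()
       with e -> error "Ignoring exception in cleanup action: %s" (Printexc.to_string e))
    actions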

  let base_vdi =
    try
      let x = (List.find (fun x -> x.content_id = dest_content_id) vdis).vdi in
      debug "local VDI %s has content_id = %s; we will perform an incremental copy" x dest_content_id;
      Some x
    with _ ->
      debug "no local VDI has content_id = %s; we will perform a full copy" dest_content_id;
      None
  in

See if we can identify a local VDI with the same content_id as the destination. If not, no problem.

  try
    let remote_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in
    let base_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in
    let leaf_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in

Construct some datapaths - named reasons why the VDI is attached - that we will pass to VDI.attach/activate.

    let dest_vdi_url = Http.Url.set_uri remote_url (Printf.sprintf "%s/nbd/%s/%s/%s" (Http.Url.get_uri remote_url) dest dest_vdi remote_dp) |> Http.Url.to_string in

    debug "copy remote=%s/%s NBD URL = %s" dest dest_vdi dest_vdi_url;

Here we are constructing a URI that we use to connect to the destination xapi. The handler for this particular path will verify the credentials and then pass the connection on to tapdisk, which will behave as an NBD server. The VDI has to be attached and activated for this to work, unlike with the newer, smarter NBD handler in xapi-nbd. The handler for this URI is declared in this file.

    let id=State.copy_id_of (sr,vdi) in
    debug "Persisting state for copy (id=%s)" id;
    State.add id State.(Copy_op Copy_state.({
        base_dp; leaf_dp; remote_dp; dest_sr=dest; copy_vdi=remote_vdi.vdi; remote_url=url}));

Since we’re about to perform a long-running operation that is stateful, we persist the state here so that if xapi is restarted we can cancel the operation and not leak VDI attaches. Normally in xapi code we would be doing VBD.plug operations to persist the state in the xapi db, but this is storage code so we have to use a different mechanism.

    SMPERF.debug "mirror.copy: copy initiated local_vdi:%s dest_vdi:%s" vdi dest_vdi;

    Pervasiveext.finally (fun () ->
        debug "activating RW datapath %s on remote=%s/%s" remote_dp dest dest_vdi;
        ignore(Remote.VDI.attach ~dbg ~sr:dest ~vdi:dest_vdi ~dp:remote_dp ~read_write:true);
        Remote.VDI.activate ~dbg ~dp:remote_dp ~sr:dest ~vdi:dest_vdi;

        with_activated_disk ~dbg ~sr ~vdi:base_vdi ~dp:base_dp
          (fun base_path ->
             with_activated_disk ~dbg ~sr ~vdi:(Some vdi) ~dp:leaf_dp
               (fun src ->
                  let dd = Sparse_dd_wrapper.start ~progress_cb:(progress_callback 0.05 0.9 task) ?base:base_path true (Opt.unbox src)
                      dest_vdi_url remote_vdi.virtual_size in
                  Storage_task.with_cancel task
                    (fun () -> Sparse_dd_wrapper.cancel dd)
                    (fun () ->
                       try Sparse_dd_wrapper.wait dd
                       with Sparse_dd_wrapper.Cancelled -> Storage_task.raise_cancelled task)
               )
          );
      )
      (fun () ->
         Remote.DP.destroy ~dbg ~dp:remote_dp ~allow_leak:false;
         State.remove_copy id
      );

In this chunk of code we attach and activate the disk on the remote SR via the SMAPI, then locally attach and activate both the VDI we’re copying and the base image we’re copying deltas from (if we’ve got one). We then call sparse_dd to copy the data to the remote NBD URL. There is some logic to update progress indicators and to cancel the operation if the SMAPIv2 call TASK.cancel is called.

Once the operation has terminated (either on success, error or cancellation), we remove the local attach and activations in the with_activated_disk function and the remote attach and activation by destroying the datapath on the remote SR. We then remove the persistent state relating to the copy.

    SMPERF.debug "mirror.copy: copy complete local_vdi:%s dest_vdi:%s" vdi dest_vdi;

    debug "setting remote=%s/%s content_id <- %s" dest dest_vdi local_vdi.content_id;
    Remote.VDI.set_content_id ~dbg ~sr:dest ~vdi:dest_vdi ~content_id:local_vdi.content_id;
    (* PR-1255: XXX: this is useful because we don't have content_ids by default *)
    debug "setting local=%s/%s content_id <- %s" sr local_vdi.vdi local_vdi.content_id;
    Local.VDI.set_content_id ~dbg ~sr ~vdi:local_vdi.vdi ~content_id:local_vdi.content_id;
    Some (Vdi_info remote_vdi)

The last thing we do is to set the local and remote content_id. The local set_content_id is there because, if the content_id of the VDI is unset, it is constructed from the location in the storage_access.ml module of xapi (still part of the storage layer).

  with e ->
    error "Caught %s: performing cleanup actions" (Printexc.to_string e);
    perform_cleanup_actions !on_fail;
    raise e

Here we perform the list of cleanup operations - theoretically. It seems we never actually add anything to this list, so this is dead code.

DATA.MIRROR.start

let start' ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
  debug "Mirror.start sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  SMPERF.debug "mirror.start called sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in

  (* Find the local VDI *)
  let vdis = Local.SR.scan ~dbg ~sr in
  let local_vdi =
    try List.find (fun x -> x.vdi = vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

As with the previous calls, we make a Remote module for SMAPIv2 calls on the destination, and we find the local VDI metadata via SR.scan.

  let id = State.mirror_id_of (sr,local_vdi.vdi) in

Mirror ids are deterministically constructed.
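
For illustration, an id of roughly the following shape is enough: it can be recomputed from the (sr, vdi) pair alone, for example after a toolstack restart, so the mirror can always be found again. The exact format used by State.mirror_id_of may differ.

(* Illustrative only: derive a stable mirror id from the SR and VDI. *)
let mirror_id_of (sr, vdi) = Printf.sprintf "%s/%s" sr vdi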

  (* A list of cleanup actions to perform if the operation should fail. *)
  let on_fail : (unit -> unit) list ref = ref [] in

This on_fail list is actually used.

  try
    let similar_vdis = Local.VDI.similar_content ~dbg ~sr ~vdi in
    let similars = List.filter (fun x -> x <> "") (List.map (fun vdi -> vdi.content_id) similar_vdis) in
    debug "Similar VDIs to %s = [ %s ]" vdi (String.concat "; " (List.map (fun x -> Printf.sprintf "(vdi=%s,content_id=%s)" x.vdi x.content_id) similar_vdis));

As with copy, we look locally for similar VDIs. However, rather than using that information here, we pass it on to the destination SR via the receive_start internal SMAPIv2 call:

    let result_ty = Remote.DATA.MIRROR.receive_start ~dbg ~sr:dest ~vdi_info:local_vdi ~id ~similar:similars in
    let result = match result_ty with
        Mirror.Vhd_mirror x -> x
    in

This gives the destination SR a chance to say what sort of migration it can support. We only support Vhd_mirror style migrations which require the destination to support the compose SMAPIv2 operation. The type of x is a record:

type mirror_receive_result_vhd_t = {
	mirror_vdi : vdi_info;
	mirror_datapath : dp;
	copy_diffs_from : content_id option;
	copy_diffs_to : vdi;
	dummy_vdi : vdi;
}

Field descriptions:

  • mirror_vdi is the VDI to which new writes should be mirrored.
  • mirror_datapath is the remote datapath on which the VDI has been attached and activated. This is required to construct the remote NBD URL.
  • copy_diffs_from represents the source base VDI to be used for the non-mirrored data copy.
  • copy_diffs_to is the remote VDI to copy those diffs to.
  • dummy_vdi exists to prevent leaf-coalesce on the mirror_vdi.

    (* Enable mirroring on the local machine *)
    let mirror_dp = result.Mirror.mirror_datapath in

    let uri = (Printf.sprintf "/services/SM/nbd/%s/%s/%s" dest result.Mirror.mirror_vdi.vdi mirror_dp) in
    let dest_url = Http.Url.set_uri remote_url uri in
    let request = Http.Request.make ~query:(Http.Url.get_query_params dest_url) ~version:"1.0" ~user_agent:"smapiv2" Http.Put uri in
    let transport = Xmlrpc_client.transport_of_url dest_url in

This is where we connect to the NBD server on the destination.

    debug "Searching for data path: %s" dp;
    let attach_info = Local.DP.attach_info ~dbg:"nbd" ~sr ~vdi ~dp in
    debug "Got it!";

We need the local attach_info to find the local tapdisk so that we can send it the connected NBD socket.

    on_fail := (fun () -> Remote.DATA.MIRROR.receive_cancel ~dbg ~id) :: !on_fail;

This should probably be set directly after the call to receive_start.

    let tapdev = match tapdisk_of_attach_info attach_info with
      | Some tapdev ->
        debug "Got tapdev";
        let pid = Tapctl.get_tapdisk_pid tapdev in
        let path = Printf.sprintf "/var/run/blktap-control/nbdclient%d" pid in
        with_transport transport (with_http request (fun (response, s) ->
            debug "Here inside the with_transport";
            let control_fd = Unix.socket Unix.PF_UNIX Unix.SOCK_STREAM 0 in
            finally
              (fun () ->
                 debug "Connecting to path: %s" path;
                 Unix.connect control_fd (Unix.ADDR_UNIX path);
                 let msg = dp in
                 let len = String.length msg in
                 let written = Unixext.send_fd control_fd msg 0 len [] s in
                 debug "Sent fd";
                 if written <> len then begin
                   error "Failed to transfer fd to %s" path;
                   failwith "foo"
                 end)
              (fun () ->
                 Unix.close control_fd)));
        tapdev
      | None ->
        failwith "Not attached"
    in

Here we connect to the remote NBD server, then pass that connected fd to the local tapdisk that is using the disk. This fd is passed with a name that is later used to tell tapdisk to start using it - we use the datapath name for this.

    debug "Adding to active local mirrors: id=%s" id;
    let alm = State.Send_state.({
        url;
        dest_sr=dest;
        remote_dp=mirror_dp;
        local_dp=dp;
        mirror_vdi=result.Mirror.mirror_vdi.vdi;
        remote_url=url;
        tapdev;
        failed=false;
        watchdog=None}) in
    State.add id (State.Send_op alm);
    debug "Added";

As for copy, we persist some state to disk to record that we’re doing a mirror, so that we can undo any state changes after a toolstack restart.

    debug "About to snapshot VDI = %s" (string_of_vdi_info local_vdi);
    let local_vdi = add_to_sm_config local_vdi "mirror" ("nbd:" ^ dp) in
    let local_vdi = add_to_sm_config local_vdi "base_mirror" id in
    let snapshot =
    try
      Local.VDI.snapshot ~dbg ~sr ~vdi_info:local_vdi
    with
    | Storage_interface.Backend_error(code, _) when code = "SR_BACKEND_FAILURE_44" ->
      raise (Api_errors.Server_error(Api_errors.sr_source_space_insufficient, [ sr ]))
    | e ->
      raise e
    in
    debug "Done!";

    SMPERF.debug "mirror.start: snapshot created, mirror initiated vdi:%s snapshot_of:%s"
      snapshot.vdi local_vdi.vdi ;

    on_fail := (fun () -> Local.VDI.destroy ~dbg ~sr ~vdi:snapshot.vdi) :: !on_fail;

This bit inserts into sm_config the name of the fd we passed earlier for mirroring. This is interpreted by the Python SM backends and passed on to the tap-ctl invocation to unpause the disk. This causes all new writes to be mirrored via NBD to the file descriptor passed earlier.

    begin
      let rec inner () =
        debug "tapdisk watchdog";
        let alm_opt = State.find_active_local_mirror id in
        match alm_opt with
        | Some alm ->
          let stats = Tapctl.stats (Tapctl.create ()) tapdev in
          if stats.Tapctl.Stats.nbd_mirror_failed = 1 then
            Updates.add (Dynamic.Mirror id) updates;
          alm.State.Send_state.watchdog <- Some (Scheduler.one_shot scheduler (Scheduler.Delta 5) "tapdisk_watchdog" inner)
        | None -> ()
      in inner ()
    end;

This is the watchdog that runs tap-ctl stats every 5 seconds, watching mirror_failed for evidence of a failure in the mirroring code. If it detects one, the only thing it does is to notify that the state of the mirroring has changed. This will be picked up by the thread in xapi that is monitoring the state of the mirror, which will then issue a MIRROR.stat call that returns the state of the mirror, including the information that it has failed.

    on_fail := (fun () -> stop ~dbg ~id) :: !on_fail;
    (* Copy the snapshot to the remote *)
    let new_parent = Storage_task.with_subtask task "copy" (fun () ->
        copy' ~task ~dbg ~sr ~vdi:snapshot.vdi ~url ~dest ~dest_vdi:result.Mirror.copy_diffs_to) |> vdi_info in
    debug "Local VDI %s == remote VDI %s" snapshot.vdi new_parent.vdi;

This is where we copy the VDI returned by the snapshot invocation to the remote VDI called copy_diffs_to. We only copy deltas, but we rely on copy' to figure out which disk the deltas should be taken from, which it does via the content_id field.

    Remote.VDI.compose ~dbg ~sr:dest ~vdi1:result.Mirror.copy_diffs_to ~vdi2:result.Mirror.mirror_vdi.vdi;
    Remote.VDI.remove_from_sm_config ~dbg ~sr:dest ~vdi:result.Mirror.mirror_vdi.vdi ~key:"base_mirror";
    debug "Local VDI %s now mirrored to remote VDI: %s" local_vdi.vdi result.Mirror.mirror_vdi.vdi;

Once the copy has finished we invoke the compose SMAPIv2 call that composes the diffs from the mirror with the base image copied from the snapshot.

    debug "Destroying dummy VDI %s on remote" result.Mirror.dummy_vdi;
    Remote.VDI.destroy ~dbg ~sr:dest ~vdi:result.Mirror.dummy_vdi;
    debug "Destroying snapshot %s on src" snapshot.vdi;
    Local.VDI.destroy ~dbg ~sr ~vdi:snapshot.vdi;

    Some (Mirror_id id)

We can now destroy the dummy VDI on the remote (which will cause a leaf-coalesce in due course), and we destroy the local snapshot here (which will also cause a leaf-coalesce in due course, provided we don’t destroy it first). The return value from the function is the mirror_id that we can use to monitor the state of the mirror or to cancel it.

  with
  | Sr_not_attached(sr_uuid) ->
    error " Caught exception %s:%s. Performing cleanup." Api_errors.sr_not_attached sr_uuid;
    perform_cleanup_actions !on_fail;
    raise (Api_errors.Server_error(Api_errors.sr_not_attached,[sr_uuid]))
  | e ->
    error "Caught %s: performing cleanup actions" (Api_errors.to_string e);
    perform_cleanup_actions !on_fail;
    raise e

The exception handler just cleans up afterwards.

This is not the end of the story, since we need to detach the remote datapath being used for mirroring when we detach this end. The hook function is in storage_migrate.ml:

let post_detach_hook ~sr ~vdi ~dp =
  let open State.Send_state in
  let id = State.mirror_id_of (sr,vdi) in
  State.find_active_local_mirror id |>
  Opt.iter (fun r ->
      let remote_url = Http.Url.of_string r.url in
      let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in
      let t = Thread.create (fun () ->
          debug "Calling receive_finalize";
          log_and_ignore_exn
            (fun () -> Remote.DATA.MIRROR.receive_finalize ~dbg:"Mirror-cleanup" ~id);
          debug "Finished calling receive_finalize";
          State.remove_local_mirror id;
          debug "Removed active local mirror: %s" id
        ) () in
      Opt.iter (fun id -> Scheduler.cancel scheduler id) r.watchdog;
      debug "Created thread %d to call receive finalize and dp destroy" (Thread.id t))

This removes the persistent state and calls receive_finalize on the destination. The body of that function is:

let receive_finalize ~dbg ~id =
  let recv_state = State.find_active_receive_mirror id in
  let open State.Receive_state in Opt.iter (fun r -> Local.DP.destroy ~dbg ~dp:r.leaf_dp ~allow_leak:false) recv_state;
  State.remove_receive_mirror id

which removes the persistent state on the destination and destroys the datapath associated with the mirror.

There is also a pre-deactivate hook. The rationale is that we want to detect any write failures that occur right at the end of the SXM process. So if there is a mirror operation going on, before we deactivate we wait for tapdisk to flush its queue of outstanding requests, and then we query whether there has been a mirror failure. The code is just above the detach hook in storage_migrate.ml:

let pre_deactivate_hook ~dbg ~dp ~sr ~vdi =
  let open State.Send_state in
  let id = State.mirror_id_of (sr,vdi) in
  let start = Mtime_clock.counter () in
  let get_delta () = Mtime_clock.count start |> Mtime.Span.to_s in
  State.find_active_local_mirror id |>
  Opt.iter (fun s ->
      try
        (* We used to pause here and then check the nbd_mirror_failed key. Now, we poll
				   until the number of outstanding requests has gone to zero, then check the
				   status. This avoids confusing the backend (CA-128460) *)
        let open Tapctl in
        let ctx = create () in
        let rec wait () =
          if get_delta () > reqs_outstanding_timeout then raise Timeout;
          let st = stats ctx s.tapdev in
          if st.Stats.reqs_outstanding > 0
          then (Thread.delay 1.0; wait ())
          else st
        in
        let st = wait () in
        debug "Got final stats after waiting %f seconds" (get_delta ());
        if st.Stats.nbd_mirror_failed = 1
        then begin
          error "tapdisk reports mirroring failed";
          s.failed <- true
        end;
      with
      | Timeout ->
        error "Timeout out after %f seconds waiting for tapdisk to complete all outstanding requests" (get_delta ());
        s.failed <- true
      | e ->
        error "Caught exception while finally checking mirror state: %s"
          (Printexc.to_string e);
        s.failed <- true
    )

XE CLI architecture

Info

The links in this page point to the source files of xapi v1.132.0, not to the latest source code. Since then, the CLI server code in xapi has been moved to a library separate from the main xapi binary, with its own subdirectory ocaml/xapi-cli-server.

Architecture

  • The actual CLI is a very lightweight binary in ocaml/xe-cli

    • It is just a dumb client that does everything xapi tells it to do
    • This is a security issue
      • We must trust the xenserver that we connect to, because it can tell xe to read local files, download files, …
    • When it is first called, it takes the few command-line arguments it needs, and then passes the rest to xapi in an HTTP POST request
      • Each argument is in a separate line
    • Then it loops, doing what xapi tells it to do, until xapi tells it to exit or an exception happens
  • The protocol description is in ocaml/xapi-cli-protocol/cli_protocol.ml

    • This protocol allows one xe binary to talk to multiple versions of xapi, as long as their CLI protocol versions are compatible
    • and the CLI can be changed without updating the xe binary
    • and it is also more efficient than having the CLI make XenAPI calls directly (see the sketch after this list)
  • Xapi
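
The overall shape of the client side is a small command loop: read a command from xapi, act on it locally, send back a response, and repeat until told to exit. The sketch below uses illustrative types and a hypothetical read_whole_file helper rather than the real definitions in cli_protocol.ml.

(* Illustrative types only - the real wire format and constructors live in
   ocaml/xapi-cli-protocol/cli_protocol.ml. *)
type command = Print of string | LoadFile of string | Exit of int

let rec client_loop ~read_command ~send_response =
  match read_command () with
  | Print s ->
    print_endline s;
    client_loop ~read_command ~send_response
  | LoadFile path ->
    (* xe reads the local file and streams it back to xapi *)
    send_response (read_whole_file path);
    client_loop ~read_command ~send_response
  | Exit code ->
    exit code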

Walk-through: CLI handler in xapi (external calls)

Definitions for the HTTP handler

Constants.cli_uri = "/cli"

Datamodel.http_actions = [...;
  ("post_cli", (Post, Constants.cli_uri, false, [], _R_READ_ONLY, []));
...]

(* these public http actions will NOT be checked by RBAC *)
(* they are meant to be used in exceptional cases where RBAC is already *)
(* checked inside them, such as in the XMLRPC (API) calls *)
Datamodel.public_http_actions_with_no_rbac_check = [...
  "post_cli";  (* CLI commands -> calls XMLRPC *)
...]

Xapi.common_http_handlers = [...;
  ("post_cli", (Http_svr.BufIO Xapi_cli.handler));
...]

Xapi.server_init () =
  ...
  "Registering http handlers", [], (fun () -> List.iter Xapi_http.add_handler common_http_handlers);
  ...

Due to these definitions, Xapi_http.add_handler does not perform RBAC checks for post_cli. This means that the CLI handler does not use Xapi_http.assert_credentials_ok when a request comes in, as most other handlers do. The reason is that RBAC checking is delegated to the actual XenAPI calls made by the commands in Cli_operations.

The Xapi_http.add_handler call above therefore resolves to simply:

Http_svr.Server.add_handler server Http.Post "/cli" (Http_svr.BufIO Xapi_cli.handler)

…which means that the function Xapi_cli.handler is called directly when an HTTP POST request with path /cli comes in.

High-level request processing

Xapi_cli.handler:

  • Reads the body of the HTTP request, limited to Xapi_globs.http_limit_max_cli_size = 200 * 1024 characters.
  • Sends a protocol version string to the client: "XenSource thin CLI protocol" plus the binary-encoded major (0) and minor (2) version numbers.
  • Reads the protocol version from the client and exits with an error if it does not match the above.
  • Calls Xapi_cli.parse_session_and_args with the request’s body to extract the session reference, if present.
  • Calls Cli_frontend.parse_commandline to parse the rest of the command line from the body.
  • Calls Xapi_cli.exec_command to execute the command.
  • On error, calls exception_handler.

Xapi_cli.parse_session_and_args:

  • Is passed the request body and reads it line by line. Each line is considered an argument.
  • Removes any CR chars from the end of each argument.
  • If the first arg starts with session_id=, then the part after this prefix is taken to be a session reference.
  • Returns the session ref (if present) and the (remaining) list of args.

Cli_frontend.parse_commandline:

  • Returns the command name and an assoc list of parameter names and values. It handles --name and -flag arguments by turning them into key/value string pairs (see the sketch below).
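
As an illustration (this is not the actual Cli_frontend code), a command line such as vm-start uuid=<uuid> --force ends up as the command name "vm-start" plus the pairs ("uuid", "<uuid>") and ("force", "true"):

(* Illustrative only: split xe arguments into (name, value) pairs, turning
   "--flag" style arguments into ("flag", "true"). *)
let parse_commandline = function
  | [] -> failwith "no command given"
  | cmd :: args ->
    let param_of arg =
      match String.index_opt arg '=' with
      | Some i ->
        (String.sub arg 0 i, String.sub arg (i + 1) (String.length arg - i - 1))
      | None ->
        let name =
          if String.length arg > 2 && String.sub arg 0 2 = "--"
          then String.sub arg 2 (String.length arg - 2)
          else arg
        in
        (name, "true")
    in
    (cmd, List.map param_of args)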

Xapi_cli.exec_command:

  • Finds username/password params.
  • Gets the rpc function: this is the so-called “fake_rpc callback”, which does not use the network or HTTP at all, but goes straight to Api_server.callback1 (the XenAPI RPC entry point). This function is used by the CLI handler to do loopback XenAPI calls.
  • Logs the parsed xe command, omitting sensitive data.
  • Continues as Xapi_cli.do_rpcs
  • Looks up the command name in the command table from Cli_frontend (raises an error if not found).
  • Checks if all required params have been supplied (raises an error if not).
  • Checks that the host is a pool master (raises an error if not).
  • Depending on the command, a session.login_with_password or session.slave_local_login_with_password XenAPI call is made with the supplied username and password. If the authentication passes, then a session reference is returned for the RBAC role that belongs to the user. This session is used to do further XenAPI calls.
  • Next, the implementation of the command in Cli_operations is executed.

Command implementations

The various commands are implemented in cli_operations.ml. These functions are only called after user authentication has passed (see above). However, RBAC restrictions are only enforced inside any XenAPI calls that are made, and not on any of the other code in cli_operations.ml.

The type of each command implementation function is as follows (see cli_cmdtable.ml):

type op =
  Cli_printer.print_fn ->
  (Rpc.call -> Rpc.response) ->
  API.ref_session -> ((string*string) list) -> unit

So each function receives a printer for sending text output to the xe client, an rpc function and session reference for making XenAPI calls, and a list of key/value parameter pairs. Here is a typical example:

let bond_create printer rpc session_id params =
  let network = List.assoc "network-uuid" params in
  let mac = List.assoc_default "mac" params "" in
  let network = Client.Network.get_by_uuid rpc session_id network in
  let pifs = List.assoc "pif-uuids" params in
  let uuids = String.split ',' pifs in
  let pifs = List.map (fun uuid -> Client.PIF.get_by_uuid rpc session_id uuid) uuids in
  let mode = Record_util.bond_mode_of_string (List.assoc_default "mode" params "") in
  let properties = read_map_params "properties" params in
  let bond = Client.Bond.create rpc session_id network pifs mac mode properties in
  let uuid = Client.Bond.get_uuid rpc session_id bond in
  printer (Cli_printer.PList [ uuid])

  • The necessary parameters are looked up in params using List.assoc or similar.
  • UUIDs are translated into references by get_by_uuid XenAPI calls (note that the Client module is the XenAPI client, and functions in there require the rpc function and session reference).
  • Then the main API call is made (Client.Bond.create in this case).
  • Further API calls may be made to gather data for the client, which is then passed to the printer.

This is the common case for CLI operations: they do API calls based on the parameters that were passed in.

However, other commands are more complicated, for example vm_import/export and vm_migrate. These contain a lot more logic in the CLI commands, and also send commands to the client to instruct it to read or write files and/or do HTTP calls.

Yet other commands do not actually do any XenAPI calls, but instead get “helpful” information from other places. Example: diagnostic_gc_stats, which displays statistics from xapi’s OCaml GC.
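
As a sketch of what such a command can look like (this is not the actual diagnostic_gc_stats implementation), a CLI operation that never touches the XenAPI simply ignores the rpc function and session and prints locally gathered data:

(* Illustrative only: a cli_operations-style function that reports a few
   OCaml GC statistics without making any XenAPI calls. *)
let gc_stats_example printer _rpc _session_id _params =
  let s = Gc.stat () in
  printer
    (Cli_printer.PList
       [ Printf.sprintf "heap_words: %d" s.Gc.heap_words
       ; Printf.sprintf "live_words: %d" s.Gc.live_words
       ; Printf.sprintf "compactions: %d" s.Gc.compactions
       ])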

Tutorials

The following tutorials show how to extend the CLI (and XenAPI):