Xapi

Xapi is the xapi-project host and cluster manager.

Xapi is responsible for:

  • providing a stable interface (the XenAPI)
  • allowing one client to manage multiple hosts
  • hosting the “xe” CLI
  • authenticating users and applying role-based access control
  • locking resources (in particular disks)
  • allowing storage to be managed through plugins
  • planning and coping with host failures (“High Availability”)
  • storing VM and host configuration
  • generating alerts
  • managing software patching

Principles

  1. The XenAPI interface must remain backwards compatible, allowing older clients to continue working
  2. Xapi delegates all Xenstore/libxc/libxl access to Xenopsd, so Xapi could be run in an unprivileged helper domain
  3. Xapi delegates the low-level storage manipulation to SM plugins.
  4. Xapi delegates setting up host networking to xcp-networkd.
  5. Xapi delegates monitoring performance counters to xcp-rrdd.

Overview

The following diagram shows the internals of Xapi:

Internals of xapi

The top of the diagram shows the XenAPI clients: XenCenter, XenOrchestra, OpenStack and CloudStack, which use the XenAPI and HTTP GET/PUT over ports 80 and 443 to talk to xapi. The XenAPI calls (JSON-RPC or XML-RPC over HTTP POST) and HTTP GET/PUT requests are always authenticated, either using PAM (by default against the local passwd and group files) or through Active Directory.

The APIs are classified into categories:

  • coordinator-only: these are the majority of current APIs. The coordinator should be called and relied upon to forward the call to the right place with the right locks held.
  • normally-local: these are performance special cases such as disk import/export and console connection which are sent directly to hosts which have the most efficient access to the data.
  • emergency: these deal with scenarios where the coordinator is offline

If the incoming API call should be resent to the coordinator, then a XenAPI HOST_IS_SLAVE error message containing the coordinator’s IP address is sent to the client.
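For illustration, a client might handle this redirection roughly as follows. This is only a sketch: the call_xenapi and reconnect_to helpers are hypothetical, but the error name and its single parameter are as described above.

(* Sketch: reacting to HOST_IS_SLAVE on the client side. The helpers
   call_xenapi and reconnect_to are hypothetical placeholders. *)
let with_coordinator_redirect ~call_xenapi ~reconnect_to =
  try call_xenapi ()
  with Api_errors.Server_error ("HOST_IS_SLAVE", [coordinator_address]) ->
    (* Reconnect to the coordinator whose address came back in the error... *)
    reconnect_to coordinator_address ;
    (* ...and retry the original call there. *)
    call_xenapi ()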

Once past the initial checks, API calls enter the “message forwarding” layer which

  • locks resources (via the current_operations mechanism)
  • decides which host should execute the request.

If the request should run locally then a direct function call is used; otherwise the message forwarding code makes a synchronous API call to a specific other host. Note: Xapi currently employs a “thread per request” model, in which one full POSIX thread is created for every request. Even when a request is forwarded, the thread persists, blocked until the result becomes available.

If the XenAPI call is a VM lifecycle operation then it is converted into a Xenopsd API call and forwarded over a Unix domain socket. Xapi and Xenopsd have similar notions of cancellable asynchronous “tasks”, so the current Xapi task (all operations run in the context of a task) is bound to the Xenopsd task, so cancellation is passed through and progress updates are received.

If the XenAPI call is a storage operation then the “storage access” layer

  • verifies that the storage objects are in the correct state (SR attached/detached; VDI attached/activated read-only/read-write)
  • invokes the relevant operation in the Storage Manager API (SMAPI) v2 interface;
  • depending on the type of SR:
    • uses the SMAPIv2 to SMAPIv1 converter to generate the necessary command-line to talk to the SMAPIv1 plugin (EXT, NFS, LVM etc) and to execute it
    • uses the SMAPIv2 to SMAPIv3 converter daemon xapi-storage-script to execute the necessary SMAPIv3 command (GFS2)
  • persists the state of the storage objects (including the result of a VDI.attach call) to persistent storage

Internally the SMAPIv1 plugins use privileged access to the Xapi database to directly set fields (e.g. VDI.virtual_size) that would be considered read-only by other clients. The SMAPIv1 plugins also rely on Xapi for

  • knowledge of all hosts which may access the storage
  • locking of disks within the resource pool
  • safely executing code on other hosts via the “Xapi plugin” mechanism (see the sketch below)
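As a sketch of the last point: plugins installed under /etc/xapi.d/plugins can be invoked on a chosen host through the XenAPI host.call_plugin message. The plugin name, function and arguments below are invented for illustration.

(* Sketch: running a (hypothetical) plugin on another host in the pool. *)
let refresh_on_host rpc session_id host =
  Client.Host.call_plugin ~rpc ~session_id ~host
    ~plugin:"example-sm-helper" ~fn:"refresh" ~args:[("device", "sdb")]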

The Xapi database contains Host and VM metadata and is shared pool-wide. The coordinator keeps a copy in memory, and all other nodes remote their queries to the coordinator. The database associates each object with a generation count which is used to implement the XenAPI event.next and event.from APIs. The database is periodically flushed asynchronously to disk in XML format. If the “redo-log” is enabled then all database writes are additionally made synchronously as deltas to a shared block device. Without the redo-log, recent updates may be lost if Xapi is killed before a flush.
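As a rough illustration of how the generation count is used, a client watching for changes keeps passing the token it last received back into event.from. This is only a conceptual sketch: event_from stands in for the real XenAPI binding and handle_event is supplied by the caller.

(* Conceptual sketch of the event.from polling loop described above. *)
let rec watch ~event_from ~handle_event token =
  (* Ask for all events newer than `token`; the reply carries a new token. *)
  let events, next_token = event_from ~classes:["*"] ~token ~timeout:30.0 in
  List.iter handle_event events ;
  watch ~event_from ~handle_event next_token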

High-Availability refers to planning for host failure, monitoring host liveness and then following-through on the plans. Xapi defers to an external host liveness monitor called xhad. When xhad confirms that a host has failed – and has been isolated from the storage – then Xapi will restart any VMs which have failed and which have been marked as “protected” by HA. Xapi can also impose admission control to prevent the pool becoming too overloaded to cope with n arbitrary host failures.

The xe CLI is implemented in terms of the XenAPI, but for efficiency the implementation is linked directly into Xapi. The xe program remotes its command-line to Xapi, and Xapi sends back a series of simple commands (prompt for input; print line; fetch file; exit etc).

How to add....

Adding a Class to the API

This document describes how to add a new class to the data model that defines the XenServer API. It complements two other documents that describe how to extend an existing class: Adding a field to the API and Adding a function to the API.

As a running example, we will use the addition of a class that is part of the design for the PVS Direct feature. PVS Direct introduces proxies that serve VMs with disk images. This class was added via commit CP-16939 to Xen API.

Example: PVS_server

In the world of XenServer, each important concept, like a virtual machine, an interface, or a user, is represented by a class in the data model. A class defines methods and instance variables. At runtime, all class instances are held in an in-memory database. For example, part of PVS Direct is a class PVS_server, representing a resource that provides block-level data for virtual machines. The design document defines it to have the following important properties:

Fields

  • (string set) addresses (RO/constructor) IPv4 addresses of the server.

  • (int) first_port (RO/constructor) First UDP port accepted by the server.

  • (int) last_port (RO/constructor) Last UDP port accepted by the server.

  • (PVS_farm ref) farm (RO/constructor) Link to the farm that this server is included in. A PVS_server object must always have a valid farm reference; the PVS_server will be automatically GC’ed by xapi if the associated PVS_farm object is removed.

  • (string) uuid (RO/runtime) Unique identifier/object reference. Allocated by the server.

Methods (or Functions)

  • (PVS_server ref) introduce (string set addresses, int first_port, int last_port, PVS_farm ref farm) Introduce a new PVS server into the farm. Allowed at any time, even when proxies are in use. The proxies will be updated automatically.

  • (void) forget (PVS_server ref self) Remove a PVS server from the farm. Allowed at any time, even when proxies are in use. The proxies will be updated automatically.

Implementation Overview

The implementation of a class is distributed over several files:

  • ocaml/idl/datamodel.ml – central class definition
  • ocaml/idl/datamodel_types.ml – definition of releases
  • ocaml/xapi/cli_frontend.ml – declaration of CLI operations
  • ocaml/xapi/cli_operations.ml – implementation of CLI operations
  • ocaml/xapi/records.ml – getters and setters
  • ocaml/xapi/OMakefile – refers to xapi_pvs_farm.ml
  • ocaml/xapi/api_server.ml – refers to xapi_pvs_farm.ml
  • ocaml/xapi/message_forwarding.ml – forwarding of the new methods to the host that should run them
  • ocaml/xapi/xapi_pvs_farm.ml – implementation of methods, new file

Data Model

The data model ocaml/idl/datamodel.ml defines the class. To keep the name space tidy, most helper functions are grouped into an internal module:

(* datamodel.ml *)

let schema_minor_vsn = 103 (* line 21 -- increment this *)
let _pvs_farm = "PVS_farm" (* line 153 *)

module PVS_farm = struct (* line 8658 *)
  let lifecycle = [Prototyped, rel_dundee_plus, ""]

  let introduce = call
    ~name:"introduce"
    ~doc:"Introduce new PVS farm"
    ~result:(Ref _pvs_farm, "the new PVS farm")
    ~params:
    [ String,"name","name of the PVS farm"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let forget = call
    ~name:"forget"
    ~doc:"Remove a farm's meta data"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ]
    ~errs:[
      Api_errors.pvs_farm_contains_running_proxies;
      Api_errors.pvs_farm_contains_servers;
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()


  let set_name = call
    ~name:"set_name"
    ~doc:"Update the name of the PVS farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; String, "value", "name to be used"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let add_cache_storage = call
    ~name:"add_cache_storage"
    ~doc:"Add a cache SR for the proxies on the farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; Ref _sr, "value", "SR to be used"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let remove_cache_storage = call
    ~name:"remove_cache_storage"
    ~doc:"Remove a cache SR for the proxies on the farm"
    ~params:
    [ Ref _pvs_farm, "self", "this PVS farm"
    ; Ref _sr, "value", "SR to be removed"
    ]
    ~lifecycle
    ~allowed_roles:_R_POOL_OP
    ()

  let obj =
    let null_str = Some (VString "") in
    let null_set = Some (VSet []) in
    create_obj (* <---- creates class *)
    ~name: _pvs_farm
    ~descr:"machines serving blocks of data for provisioning VMs"
    ~doccomments:[]
    ~gen_constructor_destructor:false
    ~gen_events:true
    ~in_db:true
    ~lifecycle
    ~persist:PersistEverything
    ~in_oss_since:None
    ~messages_default_allowed_roles:_R_POOL_OP
    ~contents:
    [ uid     _pvs_farm ~lifecycle

    ; field   ~qualifier:StaticRO ~lifecycle
              ~ty:String "name" ~default_value:null_str
              "Name of the PVS farm. Must match name configured in PVS"

    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _sr)) "cache_storage" ~default_value:null_set
              ~ignore_foreign_key:true
              "The SR used by PVS proxy for the cache"

    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _pvs_server)) "servers"
              "The set of PVS servers in the farm"


    ; field   ~qualifier:DynamicRO ~lifecycle
              ~ty:(Set (Ref _pvs_proxy)) "proxies"
              "The set of proxies associated with the farm"
    ]
    ~messages:
    [ introduce
    ; forget
    ; set_name
    ; add_cache_storage
    ; remove_cache_storage
    ]
    ()
end
let pvs_farm = PVS_farm.obj

The class is defined by a call to create_obj, which defines the fields and messages (methods) belonging to the class. Each field has a name, a type, and some meta information. Likewise, each message (or method) is created by a call to call, which describes its parameters.

The PVS_farm has additional getter and setter methods for accessing its fields. These are not declared here as part of the messages but are automatically generated.
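For example, code elsewhere in xapi (or an external client) can use the generated accessors. A small sketch, assuming the accessor names that follow from the field definitions above:

(* Sketch: using the automatically generated getters for PVS_farm. *)
let print_farm rpc session_id farm =
  let name    = Client.PVS_farm.get_name    ~rpc ~session_id ~self:farm in
  let servers = Client.PVS_farm.get_servers ~rpc ~session_id ~self:farm in
  Printf.printf "PVS farm %s has %d server(s)\n" name (List.length servers)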

To make sure the new class is actually used, it is important to enter it into two lists:

(* datamodel.ml *)
let all_system = (* line 8917 *)
  [
    ...
    vgpu_type;
    pvs_farm;
    ...
  ]

let expose_get_all_messages_for = [ (* line 9097 *)
  ...
  _pvs_farm;
  _pvs_server;
  _pvs_proxy;

When a field refers to another object that itself refers back to it, these two need to be entered into the all_relations list. For example, _pvs_server refers to a _pvs_farm value via "farm", which, in turn, refers to the _pvs_server value via its "servers" field.

let all_relations =
  [
    (* ... *)
    (_sr, "introduced_by"), (_dr_task, "introduced_SRs");
    (_pvs_server, "farm"), (_pvs_farm, "servers");
    (_pvs_proxy,  "farm"), (_pvs_farm, "proxies");
  ]

CLI Conventions

The CLI provides access to objects from the command line. The following conventions exist for naming fields:

  • A field in the data model uses an underscore (_) but a hyphen (-) in the CLI: what is cache_storage in the data model becomes cache-storage in the CLI.

  • When a field contains a reference or a set of references, like proxies, it is exposed as proxy-uuids in the CLI because references are always referred to by their UUID.

CLI Getters and Setters

All fields can be read from the CLI and some fields can also be set via the CLI. These getters and setters are mostly generated automatically and need to be connected to the CLI through a function in ocaml/xapi/records.ml. Note that field names here use the naming convention for the CLI:

(* ocaml/xapi/records.ml *)
let pvs_farm_record rpc session_id pvs_farm =
  let _ref = ref pvs_farm in
  let empty_record =
    ToGet (fun () -> Client.PVS_farm.get_record rpc session_id !_ref) in
  let record = ref empty_record in
  let x () = lzy_get record in
    { setref    = (fun r -> _ref := r ; record := empty_record)
    ; setrefrec = (fun (a,b) -> _ref := a; record := Got b)
    ; record    = x
    ; getref    = (fun () -> !_ref)
    ; fields=
      [ make_field ~name:"uuid"
        ~get:(fun () -> (x ()).API.pVS_farm_uuid) ()
      ; make_field ~name:"name"
        ~get:(fun () -> (x ()).API.pVS_farm_name)
        ~set:(fun name ->
          Client.PVS_farm.set_name rpc session_id !_ref name) ()
      ; make_field ~name:"cache-storage"
        ~get:(fun () -> (x ()).API.pVS_farm_cache_storage
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~add_to_set:(fun sr_uuid ->
          let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in
          Client.PVS_farm.add_cache_storage rpc session_id !_ref sr)
        ~remove_from_set:(fun sr_uuid ->
          let sr = Client.SR.get_by_uuid rpc session_id sr_uuid in
          Client.PVS_farm.remove_cache_storage rpc session_id !_ref sr)
        ()
      ; make_field ~name:"server-uuids"
        ~get:(fun () -> (x ()).API.pVS_farm_servers
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~get_set:(fun () -> (x ()).API.pVS_farm_servers
          |> List.map get_uuid_from_ref)
        ()
      ; make_field ~name:"proxy-uuids"
        ~get:(fun () -> (x ()).API.pVS_farm_proxies
          |> List.map get_uuid_from_ref |> String.concat "; ")
        ~get_set:(fun () -> (x ()).API.pVS_farm_proxies
          |> List.map get_uuid_from_ref)
        ()
      ]
    }

CLI Interface to Methods

Methods accessible from the CLI are declared in ocaml/xapi/cli_frontend.ml. Each declaration refers to the real implementation of the method, like Cli_operations.PVS_farm.introduce:

(* cli_frontend.ml *)
let rec cmdtable_data : (string*cmd_spec) list =
  (* ... *)
  "pvs-farm-introduce",
  {
    reqd=["name"];
    optn=[];
    help="Introduce new PVS farm";
    implementation=No_fd Cli_operations.PVS_farm.introduce;
    flags=[];
  };
  "pvs-farm-forget",
  {
    reqd=["uuid"];
    optn=[];
    help="Forget a PVS farm";
    implementation=No_fd Cli_operations.PVS_farm.forget;
    flags=[];
  };

CLI Implementation of Methods

Each CLI operation that is not a getter or setter has an implementation in cli_operations.ml which is implemented in terms of the real implementation:

(* cli_operations.ml *)
module PVS_farm = struct
  let introduce printer rpc session_id params =
    let name  = List.assoc "name" params in
    let ref   = Client.PVS_farm.introduce ~rpc ~session_id ~name in
    let uuid  = Client.PVS_farm.get_uuid rpc session_id ref in
    printer (Cli_printer.PList [uuid])

  let forget printer rpc session_id params =
    let uuid  = List.assoc "uuid" params in
    let ref   = Client.PVS_farm.get_by_uuid ~rpc ~session_id ~uuid in
    Client.PVS_farm.forget rpc session_id ref
end

Fields that should show up in the CLI interface by default are declared in the gen_cmds value:

(* cli_operations.ml *)
let gen_cmds rpc session_id =
  let mk = make_param_funs in
  List.concat
  [ (*...*)
  ; Client.Pool.(mk get_all get_all_records_where
    get_by_uuid pool_record "pool" []
    ["uuid";"name-label";"name-description";"master"
    ;"default-SR"] rpc session_id)
  ; Client.PVS_farm.(mk get_all get_all_records_where
    get_by_uuid pvs_farm_record "pvs-farm" []
    ["uuid";"name";"cache-storage";"server-uuids"] rpc session_id)

Error messages

Error messages used by an implementation are introduced in two files:

(* ocaml/xapi-consts/api_errors.ml *)
let pvs_farm_contains_running_proxies = "PVS_FARM_CONTAINS_RUNNING_PROXIES"
let pvs_farm_contains_servers = "PVS_FARM_CONTAINS_SERVERS"
let pvs_farm_sr_already_added = "PVS_FARM_SR_ALREADY_ADDED"
let pvs_farm_sr_is_in_use = "PVS_FARM_SR_IS_IN_USE"
let sr_not_in_pvs_farm = "SR_NOT_IN_PVS_FARM"
let pvs_farm_cant_set_name = "PVS_FARM_CANT_SET_NAME"

(* ocaml/idl/datamodel.ml *)
  (* PVS errors *)
  error Api_errors.pvs_farm_contains_running_proxies ["proxies"]
    ~doc:"The PVS farm contains running proxies and cannot be forgotten." ();

  error Api_errors.pvs_farm_contains_servers ["servers"]
    ~doc:"The PVS farm contains servers and cannot be forgotten."
    ();

  error Api_errors.pvs_farm_sr_already_added ["farm"; "SR"]
    ~doc:"Trying to add a cache SR that is already associated with the farm"
    ();

  error Api_errors.sr_not_in_pvs_farm ["farm"; "SR"]
    ~doc:"The SR is not associated with the farm."
    ();

  error Api_errors.pvs_farm_sr_is_in_use ["farm"; "SR"]
    ~doc:"The SR is in use by the farm and cannot be removed."
    ();

  error Api_errors.pvs_farm_cant_set_name ["farm"]
    ~doc:"The name of the farm can't be set while proxies are active."
    ()

Method Implementation

The implementation of methods lives in a module in ocaml/xapi:

(* ocaml/xapi/api_server.ml *)
  module PVS_farm = Xapi_pvs_farm

The file below is typically a new file and needs to be added to ocaml/xapi/OMakefile.

(* ocaml/xapi/xapi_pvs_farm.ml *)
module D = Debug.Make(struct let name = "xapi_pvs_farm" end)
module E = Api_errors

let api_error msg xs = raise (E.Server_error (msg, xs))

let introduce ~__context ~name =
  let pvs_farm = Ref.make () in
  let uuid = Uuid.to_string (Uuid.make_uuid ()) in
  Db.PVS_farm.create ~__context
    ~ref:pvs_farm ~uuid ~name ~cache_storage:[];
  pvs_farm

(* ... *)

Messages received on a slave host may or may not be executed there. In the simple case, each method executes locally:

(* ocaml/xapi/message_forwarding.ml *)
module PVS_farm = struct
  let introduce ~__context ~name =
    info "PVS_farm.introduce %s" name;
    Local.PVS_farm.introduce ~__context ~name

  let forget ~__context ~self =
    info "PVS_farm.forget";
    Local.PVS_farm.forget ~__context ~self

  let set_name ~__context ~self ~value =
    info "PVS_farm.set_name %s" value;
    Local.PVS_farm.set_name ~__context ~self ~value

  let add_cache_storage ~__context ~self ~value =
    info "PVS_farm.add_cache_storage";
    Local.PVS_farm.add_cache_storage ~__context ~self ~value

  let remove_cache_storage ~__context ~self ~value =
    info "PVS_farm.remove_cache_storage";
    Local.PVS_farm.remove_cache_storage ~__context ~self ~value
end

Adding a field to the API

This page describes how to add a field to the XenAPI. A field is an attribute of a class that can be read through the API and used in API functions.

Bumping the database schema version

Whenever a field is added to or removed from the API, the database schema version needs to be increased. XAPI uses this version to detect that an automatic database upgrade is necessary, or that the new schema is incompatible with the existing database. If the schema version is not bumped, XAPI will start failing in unpredictable ways. Note that bumping the version is not necessary when adding functions, only when adding fields.

The current version number is kept at the top of the file ocaml/idl/datamodel_common.ml in the variables schema_major_vsn and schema_minor_vsn, of which only the latter should be incremented (the major version only exists for historical reasons). When moving to a new XenServer release, also update the variable last_release_schema_minor_vsn to the schema version of the last release. To keep track of the schema versions of recent XenServer releases, the file contains variables for these, such as miami_release_schema_minor_vsn. After starting a new version of Xapi on an existing server, the database is automatically upgraded if the schema version of the existing database matches the value of last_release_schema_*_vsn in the new Xapi.

As an example, the patch below shows how the schema version was bumped when the new API fields used for ActiveDirectory integration were added:

--- a/ocaml/idl/datamodel.ml  Tue Nov 11 16:17:48 2008 +0000
+++ b/ocaml/idl/datamodel.ml  Tue Nov 11 15:53:29 2008 +0000
@@ -15,17 +15,20 @@ open Datamodel_types
  open Datamodel_types

  (* IMPORTANT: Please bump schema vsn if you change/add/remove a _field_.
     You do not have to dump vsn if you change/add/remove a message *)

  let schema_major_vsn = 5
 -let schema_minor_vsn = 55
 +let schema_minor_vsn = 56

  (* Historical schema versions just in case this is useful later *)
  let rio_schema_major_vsn = 5
  let rio_schema_minor_vsn = 19

 +let miami_release_schema_major_vsn = 5
 +let miami_release_schema_minor_vsn = 35
 +
  (* the schema vsn of the last release: used to determine whether we can
     upgrade or not.. *)
  let last_release_schema_major_vsn = 5
 -let last_release_schema_minor_vsn = 35
 +let last_release_schema_minor_vsn = 55

Setting the schema hash

The file ocaml/idl/schematest.ml contains last_known_schema_hash. This needs to be updated to the new hash after the schema version has been bumped: run make test and the error message will report the correct hash.
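The relevant line looks roughly like this (the hash below is a placeholder, not a real value):

(* ocaml/idl/schematest.ml *)
let last_known_schema_hash = "0123456789abcdef0123456789abcdef"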

Adding the new field to some existing class

ocaml/idl/datamodel.ml

Add a new “field” line to the class in the file ocaml/idl/datamodel.ml or ocaml/idl/datamodel_[class].ml. The new field might require a suitable default value. This default value is used in case the user does not provide a value for the field.

A field has a number of parameters:

  • The lifecycle parameter, which shows how the field has evolved over time.
  • The qualifier parameter, which controls access to the field. The following values are possible:
    Value       Meaning
    StaticRO    Field is set statically at install-time.
    DynamicRO   Field is computed dynamically at run time.
    RW          Field is read/write.
  • The ty parameter for the type of the field.
  • The default_value parameter.
  • The name of the field.
  • A documentation string.

Example of a field in the pool class:

field ~lifecycle:[Published, rel_orlando, "Controls whether HA is enabled"]
      ~qualifier:DynamicRO ~ty:Bool
      ~default_value:(Some (VBool false)) "ha_enabled" "true if HA is enabled on the pool, false otherwise";

See datamodel_types.ml for information about other parameters.

Changing Constructors

Adding a field would change the constructors for the class – functions Db.*.create – and therefore, any references to these in the code need to be updated. In the example, the argument ~ha_enabled:false should be added to any call to Db.Pool.create.

Examples of these calls can be found in ocaml/tests/common/test_common.ml and ocaml/xapi/xapi_[class].ml.
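For illustration, a call site changes roughly as follows. Most labelled arguments are elided, so this is a sketch rather than compilable code.

(* Sketch: every Db.Pool.create call site gains the new labelled argument.
   Most labelled arguments are elided here, so this is not compilable as-is. *)
let create_test_pool ~__context ~pool_ref ~uuid =
  Db.Pool.create ~__context ~ref:pool_ref ~uuid
    ~ha_enabled:false (* <- new argument for the new field *)
    (* ... all other existing labelled arguments unchanged ... *)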

CLI Records

If you want this field to show up in the CLI (which you probably do), you will also need to modify the Records module, in the file ocaml/xapi-cli-server/records.ml. Find the record function for the class which you have modified and add a new entry to the fields list using make_field, which is defined in the same file.

The only required parameters are name and get (and unit, of course). If your field is a map or set, then you will need to pass in get_{map,set}, and optionally set_{map,set} if it is an RW field. The hidden parameter is useful if you don’t want this field to show up in a *_params_list call. As an example, here is a field that we’ve just added to the SM class:

make_field ~name:"versioned-capabilities"
           ~get:(fun () -> get_from_map (x ()).API.sM_versioned_capabilities)
           ~get_map:(fun () -> (x ()).API.sM_versioned_capabilities)
           ~hidden:true ();

Testing

The new fields can be tested by copying the newly compiled xapi binary to a test box. After the new xapi service is started, the file /var/log/xensource.log in the test box should contain a few lines reporting the successful upgrade of the metadata schema in the test box:

[...|xapi] Db has schema major_vsn=5, minor_vsn=57 (current is 5 58) (last is 5 57)
[...|xapi] Database schema version is that of last release: attempting upgrade
[...|sql] attempting to restore database from /var/xapi/state.db
[...|sql] finished parsing xml
[...|sql] writing db as xml to file '/var/xapi/state.db'.
[...|xapi] Database upgrade complete, restarting to use new db

Making this field accessible as a CLI attribute

XenAPI functions to get and set the value of the new field are generated automatically. It requires some extra work, however, to enable such operations in the CLI.

The CLI has commands such as host-param-list and host-param-get. To make a new field accessible by these commands, the file xapi-cli-server/records.ml needs to be edited. For the pool.ha-enabled field, the pool_record function in this file contains the following (note the convention to replace underscores by hyphens in the CLI):

let pool_record rpc session_id pool =
  ...
[
  ...
  make_field ~name:"ha-enabled" ~get:(fun () -> string_of_bool (x ()).API.pool_ha_enabled) ();
  ...
]}

NB: the ~get parameter must return a string, so include a suitable function to convert the field’s type into a string, e.g. string_of_bool.

See xapi-cli-server/records.ml for examples of handling field types other than Bool.

Adding a function to the API

This page describes how to add a function to XenAPI.

Add message to API

The file idl/datamodel.ml is a description of the API, from which the marshalling and handler code is generated.

In this file, the create_obj function is used to define a class which may contain fields and support operations (known as “messages”). For example, the identifier host is defined using create_obj to encapsulate the operations which can be performed on a host.

In order to add a function to the API, we need to add a message to an existing class. This entails adding a function in idl/datamodel.ml or one of the other datamodel files to describe the new message and adding it to the class’s list of messages. In this example, we are adding to idl/datamodel_host.ml.

The function to describe the new message will look something like the following:

let host_price_of = call ~flags:[`Session]
    ~name:"price_of"
    ~in_oss_since:None
    ~in_product_since:rel_orlando
    ~params:[(Ref _host, "host", "The host containing the price information");
             (String, "item", "The item whose price is queried")]
    ~result:(Float, "The price of the item")
    ~doc:"Returns the price of a named item."
    ~allowed_roles:_R_POOL_OP
    ()

By convention, the name of the function is formed from the name of the class and the name of the message: host and price_of, in the example. An entry for host_price_of is added to the messages of the host class:

let host =
    create_obj ...
        ~messages: [...
                    host_price_of;
                   ]
...

The parameters passed to call are all optional (except ~name and ~in_product_since).

  • The ~flags parameter is used to set conditions for the use of the message. For example, `Session is used to indicate that the call must be made in the presence of an existing session.

  • The value of the ~in_product_since parameter is a string, taken from idl/datamodel_types.ml, that indicates the XenServer release in which this message was first introduced.

  • The ~params parameter describes a list of the formal parameters of the message. Each parameter is described by a triple. The first component of the triple is the type (from type ty in idl/datamodel_types.ml); the second is the name of the parameter, and the third is a human-readable description of the parameter. The first triple in the list is conventionally the instance of the class on which the message will operate. In the example, this is a reference to the host.

  • Similarly, the ~result parameter describes the message’s return type; this is a single value rather than a list of values. If no ~result is specified, the default is unit.

  • The ~doc parameter describes what the message is doing.

  • The bool ~hide_from_docs parameter prevents the message from being included in the documentation when generated.

  • The bool ~pool_internal parameter is used to indicate if the message should be callable by external systems or only internal hosts.

  • The ~errs parameter is a list of possible exceptions that the message can raise.

  • The parameter ~lifecycle takes a list of (status, version, doc) tuples to indicate the lifecycle of the message. This takes over from ~in_oss_since, which indicated the release in which the message was introduced. NOTE: Leave this parameter empty; it will be populated on build.

  • The ~allowed_roles parameter is used for access control (see below).

Compiling xen-api.(hg|git) will cause the code corresponding to this message to be generated and output in ocaml/xapi/server.ml. In the example above, a section handling an incoming host.price_of call appears in ocaml/xapi/server.ml. However, the rest of the build will fail until a price_of function is implemented for the Host object.

Expected values in parameter ~in_product_since

In the example above, the value of the parameter ~in_product_since indicates that the message host_price_of was added during the rel_orlando release cycle. If a new release cycle is required, then it needs to be added in the file idl/datamodel_types.ml. The patch below shows how the new rel_george release identifier was added. Any class, message, etc. added during the rel_george release cycle should contain ~in_product_since:rel_george entries. (Note: the release and upgrade infrastructure can handle only one new rel_* identifier – in this case, rel_george – in each release.)

--- a/ocaml/idl/datamodel_types.ml Tue Nov 11 15:17:48 2008 +0000
+++ b/ocaml/idl/datamodel_types.ml Tue Nov 11 15:53:29 2008 +0000
@@ -27,14 +27,13 @@
 (* useful constants for product vsn tracking *)
 let oss_since_303 = Some "3.0.3"
+let rel_george = "george"
 let rel_orlando = "orlando"
 let rel_orlando_update_1 = "orlando-update-1"
 let rel_symc = "symc"
 let rel_miami = "miami"
 let rel_rio = "rio"
-let release_order = [rel_rio; rel_miami; rel_symc; rel_orlando; rel_orlando_update_1]
+let release_order = [rel_rio; rel_miami; rel_symc; rel_orlando; rel_orlando_update_1; rel_george]

Update expose_get_all_messages_for list

If you are adding a new class, do not forget to add your new class _name to the expose_get_all_messages_for list, at the bottom of datamodel.ml, in order to have automatically generated get_all and get_all_records functions attached to it.

Update the RBAC field containing the roles expected to use the new API call

After the RBAC integration, Xapi provides by default a set of static roles associated to the most common subject tasks.

The api calls associated with each role are defined by a new ~allowed_roles parameter in each api call, which specifies the list of static roles that should be able to execute the call. Each role in this list must be one of the following names, defined in datamodel.ml:

  • role_pool_admin
  • role_pool_operator
  • role_vm_power_admin
  • role_vm_admin
  • role_vm_operator
  • role_read_only

So, for instance,

~allowed_roles:[role_pool_admin; role_pool_operator] (* this is not the recommended usage, see example below *)

would be a valid list (though it is not the recommended way of using allowed_roles, see below), meaning that subjects belonging to either role_pool_admin or role_pool_operator can execute the api call.

The RBAC requirements define a policy where the roles in the list above are supposed to be totally-ordered by the set of api-calls associated with each of them. That means that any api-call allowed to role_pool_operator should also be in role_pool_admin; any api-call allowed to role_vm_power_admin should also be in role_pool_operator and also in role_pool_admin; and so on. Datamodel.ml provides shortcuts for expressing these totally-ordered set of roles policy associated with each api-call:

  • _R_POOL_ADMIN, equivalent to [role_pool_admin]
  • _R_POOL_OP, equivalent to [role_pool_admin; role_pool_operator]
  • _R_VM_POWER_ADMIN, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin]
  • _R_VM_ADMIN, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin]
  • _R_VM_OP, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin; role_vm_operator]
  • _R_READ_ONLY, equivalent to [role_pool_admin; role_pool_operator; role_vm_power_admin; role_vm_admin; role_vm_operator; role_read_only]

The ~allowed_roles parameter should use one of the shortcuts in the list above, instead of directly using a list of roles, because the shortcuts above make sure that the roles in the list are in a total order regarding the api-calls permission sets. Creating an api-call with e.g. ~allowed_roles:[role_pool_admin; role_vm_admin] would be wrong, because that would mean that a pool_operator cannot execute an api-call that a vm_admin can, breaking the total-order policy expected in the RBAC 1.0 implementation. In the future, this requirement might be relaxed.

So, the example above should instead be used as:

~allowed_roles:_R_POOL_OP  (* recommended usage via pre-defined totally-ordered role lists *)

and so on.

How to determine the correct role of a new api-call:

  • if only xapi should execute the api-call, i.e. it is an internal call: _R_POOL_ADMIN
  • if it is related to subject, role, external-authentication: _R_POOL_ADMIN
  • if it is related to accessing Dom0 (via console, ssh, whatever): _R_POOL_ADMIN
  • if it is related to the pool object: _R_POOL_OP
  • if it is related to the host object, licenses, backups, physical devices: _R_POOL_OP
  • if it is related to managing VM memory, snapshot/checkpoint, migration: _R_VM_POWER_ADMIN
  • if it is related to creating, destroying, cloning, importing/exporting VMs: _R_VM_ADMIN
  • if it is related to starting, stopping, pausing etc VMs or otherwise accessing/manipulating VMs: _R_VM_OP
  • if it is related to being able to login, manipulate own tasks and read values only: _R_READ_ONLY

Update message forwarding

The “message forwarding” layer describes the policy of whether an incoming API call should be forwarded to another host (such as another member of the pool) or processed on the host which receives the call. This policy may be non-trivial to describe and so cannot be auto-generated from the data model.

In xapi/message_forwarding.ml, add a function to the relevant module to describe this policy. In the running example, we add the following function to the Host module:

let price_of ~__context ~host ~item =
    info "Host.price_of for item %s" item;
    let local_fn = Local.Host.price_of ~host ~item in
    do_op_on ~local_fn ~__context ~host
      (fun session_id rpc -> Client.Host.price_of ~rpc ~session_id ~host ~item)

After the ~__context parameter, the parameters of this new function should match the parameters we specified for the message. In this case, that is the host and the item to query the price of.

The do_op_on function takes a function to execute locally and a function to execute remotely and performs one of these operations depending on whether the given host is the local host.

The local function references Local.Host.price_of, which is a function we will write in the next step.

Implement the function

Now we write the function to perform the logic behind the new API call. For a host-based call, this will reside in xapi/xapi_host.ml. For other classes, other files with similar names are used.

We add the following function to xapi/xapi_host.ml:

let price_of ~__context ~host ~item =
    if item = "fish" then 3.14 else 0.00

We also need to add the function to the interface xapi/xapi_host.mli:

val price_of :
    __context:Context.t -> host:API.ref_host -> item:string -> float

Congratulations, you’ve added a function to the API!

Add the operation to the CLI

Edit xapi-cli-server/cli_frontend.ml. Add a block to the definition of cmdtable_data as in the following example:

"host-price-of",
{
  reqd=["host-uuid"; "item"];
  optn=[];
  help="Find out the price of an item on a certain host.";
  implementation= No_fd Cli_operations.host_price_of;
  flags=[];
};

Include here the following:

  • The names of required (reqd) and optional (optn) parameters.

  • A description to be displayed when calling xe help <cmd> in the help field.

  • The implementation should use With_fd if any communication with the client is necessary (for example, showing the user a warning, sending the contents of a file, etc.) Otherwise, No_fd can be used as above.

  • The flags field can be used to set special options:

    • Vm_selectors: adds a “vm” parameter for the name of a VM (rather than a UUID)
    • Host_selectors: adds a “host” parameter for the name of a host (rather than a UUID)
    • Standard: includes the command in the list of common commands displayed by xe help
    • Neverforward:
    • Hidden:
    • Deprecated of string list:

Now we must implement Cli_operations.host_price_of. This is done in xapi-cli-server/cli_operations.ml. This function typically extracts the parameters and forwards them to the internal implementation of the function. Other arbitrary code is permitted. For example:

let host_price_of printer rpc session_id params =
  let host = Client.Host.get_by_uuid rpc session_id (List.assoc "host-uuid" params) in
  let item = List.assoc "item" params in
  let price = string_of_float (Client.Host.price_of ~rpc ~session_id ~host ~item) in
  printer (Cli_printer.PList [price])

Tab Completion in the CLI

The CLI features tab completion for many of its commands’ parameters. Tab completion is implemented in the file ocaml/xe-cli/bash-completion, which is installed on the host as /etc/bash_completion.d/cli, and is done on a parameter-name rather than on a command-name basis. The main portion of the bash-completion file is a case statement that contains a section for each of the parameters that benefit from completion. There is also an entry that catches all parameter names ending at -uuid, and performs an automatic lookup of suitable UUIDs. The host-uuid parameter of our new host-price-of command therefore automatically gains completion capabilities.

Executing the CLI operation

Recompile xapi with the changes described above and install it on a test machine.

Execute the following command to see if the function exists:

xe help host-price-of

Invoke the function itself with the following command:

xe host-price-of host-uuid=<tab> item=fish

and you should find out the price of fish.

Adding a XenAPI extension

A XenAPI extension is a new RPC which is implemented as a separate executable (i.e. it is not part of xapi) but which still benefits from xapi parameter type-checking, multi-language stub generation, documentation generation, authentication etc. An extension can be backported to previous versions by simply adding the implementation, without having to recompile xapi itself.

A XenAPI extension is in two parts:

  1. a declaration in the xapi datamodel. This must use the ~forward_to:(Extension "filename") parameter. The filename must be unique, and should be the same as the XenAPI call name.
  2. an implementation executable in the dom0 filesystem with path /etc/xapi.d/extensions/filename

To define an extension

First write the declaration in the datamodel. The act of specifying the types and writing the documentation will help clarify the intended meaning of the call.
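A declaration might look roughly like the sketch below. The message name, class, parameters and role are invented for illustration; only the ~forward_to:(Extension ...) part is the essential ingredient described above.

(* Hypothetical datamodel declaration sketch for an extension. *)
let pool_hello = call
    ~name:"hello"
    ~doc:"Example extension returning a greeting"
    ~params:[(Ref _pool, "self", "The pool")]
    ~result:(String, "A greeting string")
    ~lifecycle:[]
    ~allowed_roles:_R_POOL_ADMIN
    ~forward_to:(Extension "Pool.hello")
    ()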

Second create a prototype of your implementation and put an executable file in /etc/xapi.d/extensions/filename. The calling convention is:

  • the file must be executable
  • xapi will parse the XMLRPC call arguments received over the network and check the session_id is valid
  • xapi will execute the named executable
  • the XMLRPC call arguments will be sent to the executable on stdin and stdin will be closed afterwards
  • the executable will run and print an XMLRPC response on stdout
  • xapi will read the response and return it to the client.

See the basic example.
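As a rough, hypothetical sketch (not the basic example referenced above), an implementation could be as small as the following OCaml program, compiled and installed as /etc/xapi.d/extensions/<filename>. The exact shape of the response envelope (a struct with Status and Value members) is an assumption here, based on the usual XenAPI result format.

(* Sketch of a minimal extension executable. *)
let () =
  (* xapi writes the XML-RPC call arguments to stdin and then closes it;
     a real implementation would parse them. This sketch discards them. *)
  let buf = Buffer.create 1024 in
  (try
     while true do
       Buffer.add_channel buf stdin 1
     done
   with End_of_file -> ()) ;
  ignore (Buffer.contents buf) ;
  (* Reply on stdout with an XML-RPC response. *)
  print_string
    (String.concat "\n"
       [ "<?xml version=\"1.0\"?>"
       ; "<methodResponse><params><param><value><struct>"
       ; "<member><name>Status</name><value>Success</value></member>"
       ; "<member><name>Value</name><value>hello</value></member>"
       ; "</struct></value></param></params></methodResponse>"
       ; "" ])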

Third, make a pull request containing only the datamodel definitions (it is not necessary to include the prototype too). This will attract review comments which will help you improve your API further. Once the pull request is merged, the API call name and extension are officially yours and you may use them on any xapi version which supports the extension mechanism.

Packaging your extension

Your extension /etc/xapi.d/extensions/filename (and dependencies) should be packaged for your target distribution (for XenServer dom0 this would be a CentOS RPM). Once the package is unpacked on the target machine, the extension should be immediately callable via the XenAPI, provided the xapi version supports the extension mechanism. Note the xapi version does not need to know about the specific extension in advance: it will always look in /etc/xapi.d/extensions/ for all RPC calls whose name it does not recognise.

Limitations

On type-checking

  • if the xapi version is new enough to know about your specific extension: xapi will type-check the call arguments for you
  • if the xapi version is too old to know about your specific extension: the extension will still be callable but the arguments will not be type-checked.

On access control

  • if the xapi version is new enough to know about your specific extension: you can declare that a user must have a particular role (e.g. ‘VM admin’)
  • if the xapi version is too old to know about your specific extension: the extension will still be callable but the client must have the ‘Pool admin’ role.

Since a xapi which knows about your specific extension is stricter than an older xapi, it’s a good idea to develop against the new xapi and then test older xapi versions later.

Database

Metadata-on-LUN

In the present version of XenServer, metadata changes resulting in writes to the database are not immediately persisted to non-volatile storage. Hence, in case of failure, up to five minutes’ worth of metadata changes could be lost. The Metadata-on-LUN feature addresses the issue by ensuring that all database writes are retained. This will be used to improve recovery from failure by storing incremental deltas which can be re-applied to an old version of the database to bring it more up-to-date. An implication of this is that clients will no longer be required to perform a ‘pool-sync-database’ to protect critical writes, because all writes will be implicitly protected.

This is implemented by saving descriptions of all persistent database writes to a LUN when HA is active. Upon xapi restart after failure, such as on master fail-over, these descriptions are read and parsed to restore the latest version of the database.

Layout on block device

It is useful to store the database on the block device as well as the deltas, so that it is unambiguous on recovery which version of the database the deltas apply to.

The content of the block device will be structured as shown in the table below. It consists of a header; the rest of the device is split into two halves.

Section               Length (bytes)    Description
Header                16                Magic identifier
                      1                 ASCII NUL
                      1                 Validity byte
First half database   36                UUID as ASCII string
                      16                Length of database as decimal ASCII
                      (as specified)    Database (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string
First half deltas     16                Length of database delta as decimal ASCII
                      (as specified)    Database delta (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string
Second half database  36                UUID as ASCII string
                      16                Length of database as decimal ASCII
                      (as specified)    Database (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string
Second half deltas    16                Length of database delta as decimal ASCII
                      (as specified)    Database delta (binary data)
                      16                Generation count as decimal ASCII
                      36                UUID as ASCII string

After the header, one or both halves may be devoid of content. In a half which contains a database, there may be zero or more deltas (repetitions of the last three entries in each half).

The structure of the device is split into two halves to provide double-buffering. In case of failure during write to one half, the other half remains intact.

The magic identifier at the start of the device protects against attempting to treat a different device as a redo log.

The validity byte is a single ASCII character indicating the state of the two halves. It can take the following values:

Byte    Description
0       Neither half is valid
1       First half is valid
2       Second half is valid

The use of lengths preceding data sections permits convenient reading. The constant repetitions of the UUIDs act as nonces to protect against reading invalid data in the case of an incomplete or corrupt write.
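For example, a reader of the device might interpret the validity byte along these lines (a sketch only, assuming the byte is the ASCII digit shown in the table above):

(* Sketch: deciding which half of the device to read. *)
type half = Neither | First | Second

let half_of_validity_byte = function
  | '0' -> Neither
  | '1' -> First
  | '2' -> Second
  | _   -> Neither (* treat anything unexpected as "no valid half" *)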

Architecture

The I/O to and from the block device may involve long delays. For example, if there is a network problem, or the iSCSI device disappears, the I/O calls may block indefinitely. It is important to isolate this from xapi. Hence, I/O with the block device will occur in a separate process.

Xapi will communicate with the I/O process via a UNIX domain socket using a simple text-based protocol described below. The I/O process is designed to always accept xapi’s requests within a guaranteed upper limit on the delay, so xapi can communicate with the process using blocking I/O.

Xapi will interact with the I/O process in a best-effort fashion. If it cannot communicate with the process, or the process indicates that it has not carried out the requested command, xapi will continue execution regardless. Redo-log entries are idempotent (modulo the raising of exceptions in some cases) so it is of little consequence if a particular entry cannot be written but others can. If xapi notices that the process has died, it will attempt to restart it.

The I/O process keeps track of a pointer for each half indicating the position at which the next delta will be written in that half.

Protocol

Upon connection to the control socket, the I/O process will attempt to connect to the block device. Depending on whether this is successful or unsuccessful, one of two responses will be sent to the client.

  • connect|ack_ if it is successful; or

  • connect|nack|<length>|<message> if it is unsuccessful, perhaps because the block device does not exist or cannot be read from. The <message> is a description of the error; the <length> of the message is expressed using 16 digits of decimal ASCII.

The former message indicates that the I/O process is ready to receive commands. The latter message indicates that commands can not be sent to the I/O process.

There are three commands which xapi can send to the I/O process. These are described below, with a high level description of the operational semantics of the I/O process’ actions, and the corresponding responses. For ease of parsing, each command is ten bytes in length.

Write database

Xapi requests that a new database is written to the block device, and sends its content using the data socket.

Command:
: writedb___|<uuid>|<generation-count>|<length>
The UUID is expressed as 36 ASCII characters. The length of the data and the generation-count are expressed using 16 digits of decimal ASCII.
Semantics:
  1. Read the validity byte.
  2. If one half is valid, we will use the other half. If no halves are valid, we will use the first half.
  3. Read the data from the data socket and write it into the chosen half.
  4. Set the pointer for the chosen half to point to the position after the data.
  5. Set the validity byte to indicate the chosen half is valid.
Response:
: writedb|ack_ in case of successful write; or
writedb|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Write database delta

Xapi sends a description of a database delta to append to the block device.

Command:
: writedelta|<uuid>|<generation-count>|<length>|<data>
The UUID is expressed as 36 ASCII characters. The length of the data and the generation-count are expressed using 16 digits of decimal ASCII.
Semantics:
  1. Read the validity byte to establish which half is valid. If neither half is valid, return with a nack.
  2. If the half’s pointer is set, seek to that position. Otherwise, scan through the half and stop at the position after the last write.
  3. Write the entry.
  4. Update the half’s pointer to point to the position after the entry.
Response:
: writedelta|ack_ in case of successful append; or
writedelta|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.
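For illustration, a command with this framing could be assembled as follows. This is a sketch only; it assumes the 16-digit fields are zero-padded decimal ASCII, which is one natural reading of the description above.

(* Sketch: building a writedelta command for the control socket. *)
let writedelta_command ~uuid ~generation_count ~data =
  Printf.sprintf "writedelta|%s|%016Ld|%016d|%s"
    uuid generation_count (String.length data) data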

Read log

Xapi requests the contents of the log.

Command:

: read______

Semantics:
  1. Read the validity byte to establish which half is valid. If neither half is valid, return with an end.
  2. Attempt to read the database from the current half.
  3. If this is successful, continue in that half, reading entries up to the position of the half’s pointer. If the pointer is not set, read until a record of length zero is found or the end of the half is reached. Otherwise, if the attempt to read the database was not successful, switch to using the other half and try again from step 2.
  4. Finally output an end.
Response:
: read|nack_|<length>|<message> in case of error; or
read|db___|<generation-count>|<length>|<data> for a database record, then a sequence of zero or more
read|delta|<generation-count>|<length>|<data> for each delta record, then
read|end__
For each record, and for error messages, the length of the data or message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Re-initialise log

Xapi requests that the block device is re-initialised with a fresh redo-log.

Command:

: empty_____

Semantics:

: 1. Set the validity byte to indicate that neither half is valid.

Response:
: empty|ack_ in case of successful re-initialisation; or
empty|nack|<length>|<message> otherwise.
For error messages, the length of the message is expressed using 16 digits of decimal ASCII. In particular, the error message for timeouts is the string Timeout.

Impact on xapi performance

The implementation of the feature causes a slow-down in xapi of around 6% in the general case. However, if the LUN becomes inaccessible this can cause a slow-down of up to 25% in the worst case.

The figure below shows the result of testing four configurations, counting the number of database writes effected through a command-line ‘xe pool-param-set’ call.

  • The first and second configurations are xapi without the Metadata-on-LUN feature, with HA disabled and enabled respectively.

  • The third configuration shows xapi with the Metadata-on-LUN feature using a healthy LUN to which all database writes can be successfully flushed.

  • The fourth configuration shows xapi with the Metadata-on-LUN feature using an inaccessible LUN for which all database writes fail.

Impact of feature on xapi database-writing performance. (Green points represent individual samples; red bars are the arithmetic means of samples.)

Testing strategy

The section above shows how xapi performance is affected by this feature. The sections below describe the dev-testing which has already been undertaken, and propose how this feature will impact on regression testing.

Dev-testing performed

A variety of informal tests have been performed as part of the development process:

  • Enable HA. Confirm the LUN starts being used to persist database writes.

  • Enable HA, then disable HA. Confirm the LUN stops being used.

  • Enable HA, kill xapi on the master, restart xapi on the master. Confirm that the last database write before the kill is successfully restored on restart.

  • Repeatedly enable and disable HA. Confirm that no file descriptors are leaked (verified by counting the number of descriptors in /proc/pid/fd/).

  • Enable HA, reboot the master. Due to HA, a slave becomes the master (or this can be forced using ‘xe pool-emergency-transition-to-master’). Confirm that the new master is able to restore the database from the LUN from the point the old master left off, and begins to write new changes to the LUN.

  • Enable HA, disable the iSCSI volume. Confirm that xapi continues to make progress, although database writes are not persisted.

  • Enable HA, disable and re-enable the iSCSI volume. Confirm that xapi begins to use the LUN when the iSCSI volume is re-enabled and that subsequent writes are persisted.

These tests have been undertaken using an iSCSI target VM and a real iSCSI volume on lannik. In these scenarios, disabling the iSCSI volume consists of stopping the VM and unmapping the LUN, respectively.

Proposed new regression test

A new regression test is proposed to confirm that all database writes are persisted across failure.

There are three types of database modification to test: row creation, field-write and row deletion. Although these three kinds of write could be tested in separate tests, the means of setting up the pre-conditions for a field-write and a row deletion require a row creation, so it is convenient to test them all in a single test.

  1. Start a pool containing three hosts.

  2. Issue a CLI command on the master to create a row in the database, e.g.

    xe network-create name-label=a.

  3. Forcefully power-cycle the master.

  4. On fail-over, issue a CLI command on the new master to check that the row creation persisted:

    xe network-list name-label=a,

    confirming that the returned string is non-empty.

  5. Issue a CLI command on the master to modify a field in the new row in the database:

    xe network-param-set uuid=<uuid> name-description=abcd,

    where <uuid> is the UUID returned from step 2.

  6. Forcefully power-cycle the master.

  7. On fail-over, issue a CLI command on the new master to check that the field-write persisted:

    xe network-param-get uuid=<uuid> param-name=name-description,

    where <uuid> is the UUID returned from step 2. The returned string should contain

    abcd.

  8. Issue a CLI command on the master to delete the row from the database:

    xe network-destroy uuid=<uuid>,

    where <uuid> is the UUID returned from step 2.

  9. Forcefully power-cycle the master.

  10. On fail-over, issue a CLI command on the new master to check that the row does not exist:

    xe network-list name-label=a,

    confirming that the returned string is empty.

Impact on existing regression tests

The Metadata-on-LUN feature should mean that there is no need to perform an ‘xe pool-sync-database’ operation in existing HA regression tests to ensure that database state persists on xapi failure.

Host memory accounting

Memory is used for many things:

  • the hypervisor code: this is the Xen executable itself
  • the hypervisor heap: this is needed for per-domain structures and per-vCPU structures
  • the crash kernel: this is needed to collect information after a host crash
  • domain RAM: this is the memory the VM believes it has
  • shadow memory: for HVM guests running on hosts without hardware assisted paging (HAP), Xen uses shadow page tables to optimise page table updates. For all guests, shadow memory is also used during live migration for tracking the memory transfer.
  • video RAM for the virtual graphics card

Some of these are constants (e.g. the hypervisor code) while some depend on the VM configuration (e.g. domain RAM). Xapi calls the constant part the “host overhead” and the part that varies with VM configuration the “VM overhead”. There is no low-level API to query this information, therefore xapi samples the host overhead at system boot time and models the per-VM overheads.

Host overhead

The host overhead is not managed by xapi, instead it is sampled. After the host boots and before any VMs start, xapi asks Xen how much memory the host has in total, and how much memory is currently free. Xapi subtracts the free from the total and stores this as the host overhead.
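
As a minimal sketch of this sampling (not xapi’s actual code; the two inputs stand in for the values xapi reads from Xen at boot, before any VM has started):

(* Host overhead as sampled at boot: everything Xen has already used
   before any VM has started. *)
let host_overhead_mib ~total_mib ~free_mib = Int64.sub total_mib free_mib

(* e.g. host_overhead_mib ~total_mib:262144L ~free_mib:261632L = 512L *)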

VM overhead

The inputs to the model are

  • VM.memory_static_max: the maximum amount of RAM the domain will be able to use
  • VM.HVM_shadow_multiplier: allows the shadow memory to be increased
  • VM.VCPUs_max: the maximum number of vCPUs the domain will be able to use

First the shadow memory is calculated, in MiB

Shadow memory in MiB

Second the VM overhead is calculated, in MiB

Memory overhead in MiB

Memory required to start a VM

If ballooning is disabled, the memory required to start a VM is the same as the VM overhead above.

If ballooning is enabled then the memory calculation above is modified to use the VM.memory_dynamic_max rather than the VM.memory_static_max.

Memory required to migrate a VM

If ballooning is disabled, the memory required to receive a migrating VM is the same as the VM overhead above.

If ballooning is enabled, then the VM will first be ballooned down to VM.memory_dynamic_min and then it will be migrated across. If the VM fails to balloon all the way down, then correspondingly more memory will be required on the receiving side.
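
The following sketch summarises which input feeds the model in each case. It is an illustration only: the record and function names are hypothetical, and vm_overhead_mib stands in for the overhead formula shown as an image above.

(* Sketch: which memory value the overhead model uses, per the rules above. *)
type vm_memory = {
  memory_static_max : int64;   (* bytes *)
  memory_dynamic_max : int64;  (* bytes *)
  memory_dynamic_min : int64;  (* bytes *)
}

(* Memory needed to start a VM: with ballooning we plan for dynamic_max,
   without ballooning for static_max. *)
let memory_required_to_start ~ballooning ~vm_overhead_mib vm =
  let base = if ballooning then vm.memory_dynamic_max else vm.memory_static_max in
  vm_overhead_mib ~memory:base

(* Memory needed to receive a migrating VM: with ballooning the sender first
   balloons the VM down towards dynamic_min (more is needed if that fails). *)
let memory_required_to_receive ~ballooning ~vm_overhead_mib vm =
  let base = if ballooning then vm.memory_dynamic_min else vm.memory_static_max in
  vm_overhead_mib ~memory:base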

Subsections of XAPI requests walk-throughs

From RPC migration request to xapi internals

Overview

In this document we will use the VM.pool_migrate request to illustrate the interaction between the various components of the XAPI toolstack during migration. The same scheme applies to other requests as well.

Not all parts of the Xapi toolstack are shown here, as not all are involved in the migration process. For instance, you won’t see squeezed or mpathalert, two daemons that belong to the toolstack but don’t participate in the migration of a VM.

Anatomy of a VM migration

  • Migration is initiated by a Xapi client that sends VM.pool_migrate, an XML-RPC request (see the sketch after this list).
  • The Xen API server handles this request and dispatches it to the server.
  • The server is generated using the XAPI IDL, and requests are wrapped within a context, either to be forwarded to a host or executed locally. Broadly, the context follows RBAC rules. The executed function is related to the message of the request (refer to the XenAPI Reference).
  • In the case of migration, refer to ocaml/idl/datamodel_vm.ml.
  • The server dispatches the operation to the server helpers, executing the operation synchronously or asynchronously and returning the RPC answer.
  • Message forwarding decides whether the operation must be executed by another host of the pool, and then either forwards the call or executes it locally.
  • When executed locally, the high-level migration operation is sent to the Xenopsd daemon by posting a message on a well-known queue on the message switch.
  • Xenopsd receives the command and splits it into several atomic operations that are run by the xenopsd backend.
  • Xenopsd and its backend can then access xenstore or issue hypercalls to interact with Xen and carry out the micro-operations.
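
For orientation, the initiating client call might look roughly as follows. This is a sketch using the generated OCaml bindings; the rpc function, session_id, vm and host values are assumed to already exist, the “live” option shown is illustrative, and the argument labels may differ slightly between versions.

(* Sketch: a XenAPI client kicking off the flow described above.
   [rpc], [session_id], [vm] and [host] are assumed to be set up already. *)
let start_migration ~rpc ~session_id ~vm ~host =
  Client.Client.VM.pool_migrate ~rpc ~session_id ~vm ~host
    ~options:[ ("live", "true") ]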

A diagram is worth a thousand words

flowchart TD
    %% First we are starting by a XAPI client that is sending an XML-RPC request
    client((Xapi client)) -. sends RPC XML request .-> xapi_server{"`Dispatch RPC **api_server.ml**`"}
    style client stroke:#CAFEEE,stroke-width:4px
    %% XAPI Toolstack internals
    subgraph "Xapi Toolstack (master of the pool)"
        style server stroke:#BAFA00,stroke-width:4px,stroke-dasharray: 5 5
        xapi_server --dispatch call (ie VM.pool_migrate)--> server("`Auto generated using *IDL* **server.ml**`")
        server --do_dispatch (ie VM.pool_migrate)--> server_helpers["`server helpers **server_helpers.ml**`"]
        server_helpers -- call management (ie xapi_vm_migrate.ml)--> message_forwarding["`check where to run the call **message_forwarding.ml**`"]
        message_forwarding -- execute locally --> vm_management["`VM Mgmt like **xapi_vm_migrate.ml**`"]
        vm_management -- Call --> xapi_xenops["`Transform xenops see (**xapi_xenops.ml**)`"]
        xapi_xenops <-- Post following IDL model (see xenops_interface.ml) --> msg_switch
        subgraph "Message Switch Daemon"
            msg_switch[["Queues"]]
        end
        subgraph "Xenopsd Daemon"
            msg_switch <-- Push/Pop on org.xen.xapi.xenopsd.classic --> xenopsd_server
            xenopsd_server["`Xenopsd *frontend* gets & splits high-level operations into atomics`"] o-- linked at compile time --o xenopsd_backend
        end
    end
    %% Xenopsd backend is accessing xen and xenstore
    xenopsd_backend["`Xenopsd *backend* Backend XC (libxenctrl)`"] -. access to .-> xen_hypervisor["Xen hypervisor & xenstore"]
    style xen_hypervisor stroke:#BEEF00,stroke-width:2px
    %% Can send request to the host where call must be executed
    message_forwarding -.forward call to .-> elected_host["Host where call must be executed"]
    style elected_host stroke:#B0A,stroke-width:4px

XAPI's Storage Layers

Info

The links on this page point to the source files of xapi v1.127.0 and xcp-idl v1.62.0, not to the latest source code.

At the beginning of 2023, significant changes were made to the layering. In particular, the wrapper code from storage_impl.ml has been pushed down the stack, below the mux, so that it only covers the SMAPIv1 backend and not SMAPIv3. Also, all of the code (from xcp-idl etc.) is now present in this repo (xen-api).

Xapi directly communicates only with the SMAPIv2 layer. There are no plugins directly implementing the SMAPIv2 interface, but the plugins in other layers are accessed through it:

graph TD
    A[xapi] --> B[SMAPIv2 interface]
    B --> C[SMAPIv2 <-> SMAPIv1 translation: storage_access.ml]
    B --> D[SMAPIv2 <-> SMAPIv3 translation: xapi-storage-script]
    C --> E[SMAPIv1 plugins]
    D --> F[SMAPIv3 plugins]

SMAPIv1

These are the files related to SMAPIv1 in xen-api/ocaml/xapi/:

  • sm.ml: OCaml “bindings” for the SMAPIv1 Python “drivers” (SM)
  • sm_exec.ml: support for implementing the above “bindings”. The parameters are converted to XML-RPC, passed to the relevant Python script (“driver”), and the standard output of the program is parsed as an XML-RPC response (we use xen-api-libs-transitional/http-svr/xMLRPC.ml for parsing XML-RPC). When adding new functionality, we can modify the call type to add parameters, but unless we are adding parameters common to all calls, we should just pass the new parameters in the args record.
  • smint.ml: Contains types, exceptions, … for the SMAPIv1 OCaml interface

SMAPIv2

These are the files related to SMAPIv2, which need to be modified to implement new calls:

  • xcp-idl/storage/storage_interface.ml: Contains the SMAPIv2 interface
  • xcp-idl/storage/storage_skeleton.ml: a stub SMAPIv2 storage server implementation that matches the SMAPIv2 storage server interface (this is verified by storage_skeleton_test.ml); each of its functions just raises a Storage_interface.Unimplemented error. This skeleton is used to automatically fill in the unimplemented methods of the storage servers below to satisfy the interface.
  • xen-api/ocaml/xapi/storage_access.ml: module SMAPIv1: a SMAPIv2 server that does SMAPIv2 -> SMAPIv1 translation. It passes the XML-RPC requests as the first command-line argument to the corresponding Python script, which returns an XML-RPC response on standard output.
  • xen-api/ocaml/xapi/storage_impl.ml: The Wrapper module wraps a SMAPIv2 server (Server_impl) and takes care of locking and datapaths (in case of multiple connections (=datapaths) from VMs to the same VDI, it will use the superstate computed by the Vdi_automaton in xcp-idl). It also implements some functionality, like the DP module, that is not implemented in lower layers.
  • xen-api/ocaml/xapi/storage_mux.ml: A SMAPIv2 server, which multiplexes between other servers. A different SMAPIv2 server can be registered for each SR. Then it forwards the calls for each SR to the “storage plugin” registered for that SR.

How SMAPIv2 works:

We use message-switch under the hood for RPC communication between xcp-idl components. The main Storage_mux.Server (basically Storage_impl.Wrapper(Mux)) is registered to listen on the “org.xen.xapi.storage” queue during xapi’s startup, and this is the main entry point for incoming SMAPIv2 function calls. Storage_mux does not really multiplex between different plugins right now: earlier during xapi’s startup, the same SMAPIv1 storage server module is registered on the various “org.xen.xapi.storage.<sr type>” queues for each supported SR type. (This will change with SMAPIv3, which is accessed via a SMAPIv2 plugin outside of xapi that translates between SMAPIv2 and SMAPIv3.) Then, in Storage_access.create_sr, which is called during SR.create, and also during PBD.plug, the relevant “org.xen.xapi.storage.<sr type>” queue needed for that PBD is registered with Storage_mux in Storage_access.bind for the SR of that PBD.
So basically what happens is that xapi registers itself as a SMAPIv2 server, and forwards incoming function calls to itself through message-switch, using its Storage_mux module. These calls are forwarded to xapi’s SMAPIv1 module doing SMAPIv2 -> SMAPIv1 translation.
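
The multiplexing idea can be pictured as a table from SR to the storage server registered for it. The following is a toy illustration only, not the real Storage_mux code:

(* Toy mux: SR -> registered plugin (modelled here as a function on requests).
   In the real system, registration happens at SR.create / PBD.plug time. *)
module Toy_mux = struct
  let plugins : (string, string -> string) Hashtbl.t = Hashtbl.create 16

  let register ~sr ~plugin = Hashtbl.replace plugins sr plugin

  (* Forward a call for [sr] to whatever was registered for it *)
  let call ~sr request =
    match Hashtbl.find_opt plugins sr with
    | Some plugin -> plugin request
    | None -> failwith ("no storage plugin registered for SR " ^ sr)
end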

Registration of the various storage servers

sequenceDiagram
    participant q as message-switch
    participant v1 as Storage_access.SMAPIv1
    participant svr as Storage_mux.Server
    Note over q, svr: xapi startup, "Starting SMAPIv1 proxies"
    q ->> v1:org.xen.xapi.storage.sr_type_1
    q ->> v1:org.xen.xapi.storage.sr_type_2
    q ->> v1:org.xen.xapi.storage.sr_type_3
    Note over q, svr: xapi startup, "Starting SM service"
    q ->> svr:org.xen.xapi.storage
    Note over q, svr: SR.create, PBD.plug
    svr ->> q:org.xapi.storage.sr_type_2

What happens when a SMAPIv2 “function” is called

graph TD
    call[SMAPIv2 call] --VDI.attach2--> org.xen.xapi.storage
    subgraph message-switch
        org.xen.xapi.storage
        org.xen.xapi.storage.SR_type_x
    end
    org.xen.xapi.storage --VDI.attach2--> Storage_impl.Wrapper
    subgraph xapi
        subgraph Storage_mux.server
            Storage_impl.Wrapper --> Storage_mux.mux
        end
        Storage_access.SMAPIv1
    end
    Storage_mux.mux --VDI.attach2--> org.xen.xapi.storage.SR_type_x
    org.xen.xapi.storage.SR_type_x --VDI.attach2--> Storage_access.SMAPIv1
    subgraph SMAPIv1
        driver_x[SMAPIv1 driver for SR_type_x]
    end
    Storage_access.SMAPIv1 --vdi_attach--> driver_x

Interface Changes, Backward Compatibility, & SXM

During SXM, xapi calls SMAPIv2 functions on a remote xapi. Therefore it is important to keep all those SMAPIv2 functions backward-compatible that we call remotely (e.g. Remote.VDI.attach), otherwise SXM from an older to a newer xapi will break.

Functionality implemented in SMAPIv2 layers

The layer between SMAPIv2 and SMAPIv1 is much fatter than the one between SMAPIv2 and SMAPIv3. The latter does not do much, apart from simple translation. However, the former has large portions of code in its intermediate layers, in addition to the basic SMAPIv2 <-> SMAPIv1 translation in storage_access.ml.

These are the three files in xapi that implement the SMAPIv2 storage interface, from higher to lower level:

  • storage_impl.ml
  • storage_mux.ml
  • storage_access.ml

Functionality implemented by higher layers is not implemented by the layers below it.

Extra functionality in storage_impl.ml

In addition to its usual functions, Storage_impl.Wrapper also implements the UPDATES and TASK SMAPIv2 APIs, without calling the wrapped module.

These are backed by the Updates, Task_server, and Scheduler modules from xcp-idl, instantiated in xapi’s Storage_task module. Migration code in Storage_mux will interact with these to update task progress. There is also an event loop in xapi that keeps calling UPDATES.get to keep the tasks in xapi’s database in sync with the storage manager’s tasks.
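
Schematically, that event loop is just a long-poll on UPDATES.get followed by a resync of the affected tasks; a toy sketch, with all names hypothetical:

(* Toy sketch of the polling loop: [updates_get] long-polls for task updates
   since [token]; [sync_task] refreshes the corresponding task in xapi's DB. *)
let rec storage_events_loop ~updates_get ~sync_task token =
  let updated_tasks, next_token = updates_get ~from:token in
  List.iter sync_task updated_tasks;
  storage_events_loop ~updates_get ~sync_task next_token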

Storage_impl.Wrapper also implements the legacy VDI.attach call by simply calling the newer VDI.attach2 call in the same module. In general, this is a good place to implement a compatibility layer for deprecated functionality removed from other layers, because this is the first module that intercepts a SMAPIv2 call.
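
Schematically, the compatibility pattern looks like this (a toy sketch; the signatures are invented for illustration and are not the real SMAPIv2 types):

(* Toy sketch: keep a deprecated entry point by delegating to its replacement. *)
module type ATTACH2 = sig
  val attach2 :
    dbg:string -> dp:string -> sr:string -> vdi:string -> read_write:bool -> string
end

module With_legacy_attach (M : ATTACH2) = struct
  include M

  (* Legacy VDI.attach, implemented in terms of the newer VDI.attach2 *)
  let attach ~dbg ~dp ~sr ~vdi ~read_write = M.attach2 ~dbg ~dp ~sr ~vdi ~read_write
end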

Extra functionality in storage_mux.ml

Storage_mux implements storage motion (SXM): it implements the DATA and DATA.MIRROR modules. Migration code will use the Storage_task module to run the operations and update the task’s progress.

It also implements the Policy module from the SMAPIv2 interface.

SMAPIv3

SMAPIv3 has a slightly different interface from SMAPIv2. The xapi-storage-script daemon is a SMAPIv2 plugin, separate from xapi, that does the SMAPIv2 ↔ SMAPIv3 translation. It keeps the plugins registered with xcp-idl (their message-switch queues) up to date as their files appear in or disappear from the relevant directory.

SMAPIv3 Interface

The SMAPIv3 interface is defined using an OCaml-based IDL from the ocaml-rpc library, and is in this repo: https://github.com/xapi-project/xapi-storage

From this interface we generate

  • OCaml RPC client bindings used in xapi-storage-script
  • The SMAPIv3 API reference
  • Python bindings, used by the SM scripts that implement the SMAPIv3 interface.
    • These bindings are built by running “make” in the root of xapi-storage, and appear in the _build/default/python/xapi/storage/api/v5 directory.
    • On a XenServer host, they are stored in the /usr/lib/python3.6/site-packages/xapi/storage/api/v5/ directory

SMAPIv3 Plugins

For SMAPIv3 we have volume plugins to manipulate SRs and volumes (=VDIs) in them, and datapath plugins for connecting to the volumes. Volume plugins tell us which datapath plugins we can use with each volume, and what to pass to the plugin. Both volume and datapath plugins implement some common functionality: the SMAPIv3 plugin interface.

How SMAPIv3 works:

The xapi-storage-script daemon detects volume and datapath plugins stored in subdirectories of the /usr/libexec/xapi-storage-script/volume/ and /usr/libexec/xapi-storage-script/datapath/ directories, respectively. When it finds a new datapath plugin, it adds the plugin to a lookup table and uses it the next time that datapath is required. When it finds a new volume plugin, it binds a new message-switch queue named after the plugin’s subdirectory to a new server instance that uses these volume scripts.

To invoke a SMAPIv3 method, it executes a program named <Interface name>.<function name> in the plugin’s directory, for example /usr/libexec/xapi-storage-script/volume/org.xen.xapi.storage.gfs2/SR.ls. The inputs to each script can be passed as command-line arguments and are type-checked using the generated Python bindings, and so are the outputs. The URIs of the SRs that xapi-storage-script knows about are stored in the /var/run/nonpersistent/xapi-storage-script/state.db file; these URIs can be used on the command line when an sr argument is expected.
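
A sketch of that invocation pattern (not xapi-storage-script itself; it assumes OCaml ≥ 4.14 for In_channel.input_all and ignores error handling):

(* Run <plugin dir>/<Interface>.<function> with its inputs as command-line
   arguments and return whatever it prints on standard output. *)
let call_volume_script ~plugin ~iface ~fn args =
  let dir = Filename.concat "/usr/libexec/xapi-storage-script/volume" plugin in
  let script = Filename.concat dir (iface ^ "." ^ fn) in
  let ic = Unix.open_process_in (Filename.quote_command script args) in
  let output = In_channel.input_all ic in
  ignore (Unix.close_process_in ic);
  output

(* e.g. call_volume_script ~plugin:"org.xen.xapi.storage.gfs2"
     ~iface:"SR" ~fn:"ls" [ "<sr uri>" ] *)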

Registration of the various SMAPIv3 plugins

sequenceDiagram
    participant q as message-switch
    participant v1 as (Storage_access.SMAPIv1)
    participant svr as Storage_mux.Server
    participant vol_dir as /../volume/
    participant dp_dir as /../datapath/
    participant script as xapi-storage-script
    Note over script, vol_dir: xapi-storage-script startup
    script ->> vol_dir: new subdir org.xen.xapi.storage.sr_type_4
    q ->> script: org.xen.xapi.storage.sr_type_4
    script ->> dp_dir: new subdir sr_type_4_dp
    Note over q, svr: xapi startup, "Starting SMAPIv1 proxies"
    q -->> v1:org.xen.xapi.storage.sr_type_1
    q -->> v1:org.xen.xapi.storage.sr_type_2
    q -->> v1:org.xen.xapi.storage.sr_type_3
    Note over q, svr: xapi startup, "Starting SM service"
    q ->> svr:org.xen.xapi.storage
    Note over q, svr: SR.create, PBD.plug
    svr ->> q:org.xapi.storage.sr_type_4

What happens when a SMAPIv3 “function” is called

graph TD
    call[SMAPIv2 call] --VDI.attach2--> org.xen.xapi.storage
    subgraph message-switch
        org.xen.xapi.storage
        org.xen.xapi.storage.SR_type_x
    end
    org.xen.xapi.storage --VDI.attach2--> Storage_impl.Wrapper
    subgraph xapi
        subgraph Storage_mux.server
            Storage_impl.Wrapper --> Storage_mux.mux
        end
        Storage_access.SMAPIv1
    end
    Storage_mux.mux --VDI.attach2--> org.xen.xapi.storage.SR_type_x
    org.xen.xapi.storage.SR_type_x -."VDI.attach2".-> Storage_access.SMAPIv1
    subgraph SMAPIv1
        driver_x[SMAPIv1 driver for SR_type_x]
    end
    Storage_access.SMAPIv1 -.vdi_attach.-> driver_x
    subgraph SMAPIv3
        xapi-storage-script --Datapath.attach--> v3_dp_plugin_x
        subgraph SMAPIv3 plugins
            v3_vol_plugin_x[volume plugin for SR_type_x]
            v3_dp_plugin_x[datapath plugin for SR_type_x]
        end
    end
    org.xen.xapi.storage.SR_type_x --VDI.attach2-->xapi-storage-script

Error reporting

In our SMAPIv1 OCaml “bindings” in xapi (xen-api/ocaml/xapi/sm_exec.ml), when we inspect the error codes returned from a call to SM, we translate some of the SMAPIv1/SM error codes to XenAPI errors, and for others, we just construct an error code of the form SR_BACKEND_FAILURE_<SM error number>.

The file xcp-idl/storage/storage_interface.ml defines a number of SMAPIv2 errors; ultimately, all errors from the various SMAPIv2 storage servers in xapi will be returned as one of these. Most of the errors aren’t converted into a specific exception in Storage_interface, but are simply wrapped with Storage_interface.Backend_error.

The Storage_access.transform_storage_exn function is used by the client code in xapi to translate the SMAPIv2 errors back into XenAPI errors; this unwraps the errors wrapped with Storage_interface.Backend_error.

Message Forwarding

In the message forwarding layer, we first check the validity of VDI operations using mark_vdi and mark_sr. These check that the operation is valid (for mark_vdi, using Xapi_vdi.check_operation_error, which also inspects the VDI’s current operations); if the operation is valid, it is added to the VDI’s current operations and update_allowed_operations is called. Then we forward the VDI operation to a suitable host that has a PBD plugged for the VDI’s SR.

Checking that the SR is attached

For the VDI operations, we check at two different places whether the SR is attached: first, at the Xapi level, in Xapi_vdi.check_operation_error (for the resize operation), and then, at the SMAPIv1 level, in Sm.assert_pbd_is_plugged. Sm.assert_pbd_is_plugged performs the same checks, and additionally checks that the PBD is attached to the local host, which Xapi_vdi.check_operation_error does not. This behaviour is correct, because Xapi_vdi.check_operation_error is called from the message forwarding layer, which forwards the call to a host that has the SR attached.

VDI Identifiers and Storage Motion

  • VDI “location”: this is the VDI identifier used by the SM backend. It is usually the UUID of the VDI, but for ISO SRs it is the name of the ISO.
  • VDI “content_id”: this is used for storage motion, to reduce the amount of data copied. When we copy over a VDI, the content_id will initially be the same. However, when we attach a VDI as read-write and then detach it, we blank its content_id (set it to a random UUID), because we may have written to it, so the content could be different.

Subsections of XAPI's Storage Layers

Storage migration

Overview

sequenceDiagram
    participant local_tapdisk as local tapdisk
    participant local_smapiv2 as local SMAPIv2
    participant xapi
    participant remote_xapi as remote xapi
    participant remote_smapiv2 as remote SMAPIv2 (might redirect)
    participant remote_tapdisk as remote tapdisk
    Note over xapi: Sort VDIs increasingly by size and then age
    loop VM's & snapshots' VDIs & suspend images
        xapi->>remote_xapi: plug dest SR to dest host and pool master
        alt VDI is not mirrored
            Note over xapi: We don't mirror RO VDIs & VDIs of snapshots
            xapi->>local_smapiv2: DATA.copy remote_sm_url
            activate local_smapiv2
            local_smapiv2-->>local_smapiv2: SR.scan
            local_smapiv2-->>local_smapiv2: VDI.similar_content
            local_smapiv2-->>remote_smapiv2: SR.scan
            Note over local_smapiv2: Find nearest smaller remote VDI remote_base, if any
            alt remote_base
                local_smapiv2-->>remote_smapiv2: VDI.clone
                local_smapiv2-->>remote_smapiv2: VDI.resize
            else no remote_base
                local_smapiv2-->>remote_smapiv2: VDI.create
            end
            Note over local_smapiv2: call copy'
            activate local_smapiv2
            local_smapiv2-->>remote_smapiv2: SR.list
            local_smapiv2-->>remote_smapiv2: SR.scan
            Note over local_smapiv2: create new datapaths remote_dp, base_dp, leaf_dp
            Note over local_smapiv2: find local base_vdi with same content_id as dest, if any
            local_smapiv2-->>remote_smapiv2: VDI.attach2 remote_dp dest
            local_smapiv2-->>remote_smapiv2: VDI.activate remote_dp dest
            opt base_vdi
                local_smapiv2-->>local_smapiv2: VDI.attach2 base_dp base_vdi
                local_smapiv2-->>local_smapiv2: VDI.activate base_dp base_vdi
            end
            local_smapiv2-->>local_smapiv2: VDI.attach2 leaf_dp vdi
            local_smapiv2-->>local_smapiv2: VDI.activate leaf_dp vdi
            local_smapiv2-->>remote_xapi: sparse_dd base_vdi vdi dest [NBD URI for dest & remote_dp]
            Note over remote_xapi: HTTP handler verifies credentials
            remote_xapi-->>remote_tapdisk: then passes connection to tapdisk's NBD server
            local_smapiv2-->>local_smapiv2: VDI.deactivate leaf_dp vdi
            local_smapiv2-->>local_smapiv2: VDI.detach leaf_dp vdi
            opt base_vdi
                local_smapiv2-->>local_smapiv2: VDI.deactivate base_dp base_vdi
                local_smapiv2-->>local_smapiv2: VDI.detach base_dp base_vdi
            end
            local_smapiv2-->>remote_smapiv2: DP.destroy remote_dp
            deactivate local_smapiv2
            local_smapiv2-->>remote_smapiv2: VDI.snapshot remote_copy
            local_smapiv2-->>remote_smapiv2: VDI.destroy remote_copy
            local_smapiv2->>xapi: task(snapshot)
            deactivate local_smapiv2
        else VDI is mirrored
            Note over xapi: We mirror RW VDIs of the VM
            Note over xapi: create new datapath dp
            xapi->>local_smapiv2: VDI.attach2 dp
            xapi->>local_smapiv2: VDI.activate dp
            xapi->>local_smapiv2: DATA.MIRROR.start dp remote_sm_url
            activate local_smapiv2
            Note over local_smapiv2: copy disk data & mirror local writes
            local_smapiv2-->>local_smapiv2: SR.scan
            local_smapiv2-->>local_smapiv2: VDI.similar_content
            local_smapiv2-->>remote_smapiv2: DATA.MIRROR.receive_start similars
            activate remote_smapiv2
            remote_smapiv2-->>local_smapiv2: mirror_vdi,mirror_dp,copy_diffs_from,copy_diffs_to,dummy_vdi
            deactivate remote_smapiv2
            local_smapiv2-->>local_smapiv2: DP.attach_info dp
            local_smapiv2-->>remote_xapi: connect to [NBD URI for mirror_vdi & mirror_dp]
            Note over remote_xapi: HTTP handler verifies credentials
            remote_xapi-->>remote_tapdisk: then passes connection to tapdisk's NBD server
            local_smapiv2-->>local_tapdisk: pass socket & dp to tapdisk of dp
            local_smapiv2-->>local_smapiv2: VDI.snapshot local_vdi [mirror:dp]
            local_smapiv2-->>local_tapdisk: [Python] unpause disk, pass dp
            local_tapdisk-->>remote_tapdisk: mirror new writes via NBD to socket
            Note over local_smapiv2: call copy' snapshot copy_diffs_to
            local_smapiv2-->>remote_smapiv2: VDI.compose copy_diffs_to mirror_vdi
            local_smapiv2-->>remote_smapiv2: VDI.remove_from_sm_config mirror_vdi base_mirror
            local_smapiv2-->>remote_smapiv2: VDI.destroy dummy_vdi
            local_smapiv2-->>local_smapiv2: VDI.destroy snapshot
            local_smapiv2->>xapi: task(mirror ID)
            deactivate local_smapiv2
            xapi->>local_smapiv2: DATA.MIRROR.stat
            activate local_smapiv2
            local_smapiv2->>xapi: dest_vdi
            deactivate local_smapiv2
        end
        loop until task finished
            xapi->>local_smapiv2: UPDATES.get
            xapi->>local_smapiv2: TASK.stat
        end
        xapi->>local_smapiv2: TASK.stat
        xapi->>local_smapiv2: TASK.destroy
    end
    opt for snapshot VDIs
        xapi->>local_smapiv2: SR.update_snapshot_info_src remote_sm_url
        activate local_smapiv2
        local_smapiv2-->>remote_smapiv2: SR.update_snapshot_info_dest
        deactivate local_smapiv2
    end
    Note over xapi: ...
    Note over xapi: reserve resources for the new VM in dest host
    loop all VDIs
        opt VDI is mirrored
            xapi->>local_smapiv2: DP.destroy dp
        end
    end
    opt post_detach_hook
        opt active local mirror
            local_smapiv2-->>remote_smapiv2: DATA.MIRROR.receive_finalize [mirror ID]
            Note over remote_smapiv2: destroy mirror dp
        end
    end
    Note over xapi: memory image migration by xenopsd
    Note over xapi: destroy the VM record

Receiving SXM

These are the remote calls in the above diagram sent from the remote host to the receiving end of storage motion:

  • Remote SMAPIv2 -> local SMAPIv2 RPC calls:
    • SR.list
    • SR.scan
    • SR.update_snapshot_info_dest
    • VDI.attach2
    • VDI.activate
    • VDI.snapshot
    • VDI.destroy
    • For copying:
      • For copying from base:
        • VDI.clone
        • VDI.resize
      • For copying without base:
        • VDI.create
    • For mirroring:
      • DATA.MIRROR.receive_start
      • VDI.compose
      • VDI.remove_from_sm_config
      • DATA.MIRROR.receive_finalize
  • HTTP requests to xapi:
    • Connecting to NBD URI via xapi’s HTTP handler

This is how xapi coordinates storage migration. We’ll go through it as a code walk-through of the two layers: xapi and storage-in-xapi (SMAPIv2).

Xapi code

The entry point is in xapi_vm_migrate.ml

The function takes several arguments:

  • a vm reference (vm)
  • a dictionary of (string * string) key-value pairs about the destination (dest). This is the result of a previous call to the destination pool, Host.migrate_receive
  • live, a boolean of whether we should live-migrate or suspend-resume,
  • vdi_map, a mapping of VDI references to destination SR references,
  • vif_map, a mapping of VIF references to destination network references,
  • vgpu_map, similar for VGPUs
  • options, another dictionary of options
let migrate_send'  ~__context ~vm ~dest ~live ~vdi_map ~vif_map ~vgpu_map ~options =
  SMPERF.debug "vm.migrate_send called vm:%s" (Db.VM.get_uuid ~__context ~self:vm);

  let open Xapi_xenops in

  let localhost = Helpers.get_localhost ~__context in
  let remote = remote_of_dest dest in

  (* Copy mode means we don't destroy the VM on the source host. We also don't
     	   copy over the RRDs/messages *)
  let copy = try bool_of_string (List.assoc "copy" options) with _ -> false in

It begins by getting the local host reference, deciding whether we’re copying or moving, and converting the input dest parameter from an untyped string association list to a typed record, remote, which is declared further up the file:

type remote = {
  rpc : Rpc.call -> Rpc.response;
  session : API.ref_session;
  sm_url : string;
  xenops_url : string;
  master_url : string;
  remote_ip : string; (* IP address *)
  remote_master_ip : string; (* IP address *)
  dest_host : API.ref_host;
}

this contains:

  • A function, rpc, for calling XenAPI RPCs on the destination
  • A session valid on the destination
  • A sm_url on which SMAPIv2 APIs can be called on the destination
  • A master_url on which XenAPI commands can be called (not currently used)
  • The IP address, remote_ip, of the destination host
  • The IP address, remote_master_ip, of the master of the destination pool

Next, we determine which VDIs to copy:

  (* The first thing to do is to create mirrors of all the disks on the remote.
     We look through the VM's VBDs and all of those of the snapshots. We then
     compile a list of all of the associated VDIs, whether we mirror them or not
     (mirroring means we believe the VDI to be active and new writes should be
     mirrored to the destination - otherwise we just copy it)
     We look at the VDIs of the VM, the VDIs of all of the snapshots, and any
     suspend-image VDIs. *)

  let vm_uuid = Db.VM.get_uuid ~__context ~self:vm in
  let vbds = Db.VM.get_VBDs ~__context ~self:vm in
  let vifs = Db.VM.get_VIFs ~__context ~self:vm in
  let snapshots = Db.VM.get_snapshots ~__context ~self:vm in
  let vm_and_snapshots = vm :: snapshots in
  let snapshots_vbds = List.flatten (List.map (fun self -> Db.VM.get_VBDs ~__context ~self) snapshots) in
  let snapshot_vifs = List.flatten (List.map (fun self -> Db.VM.get_VIFs ~__context ~self) snapshots) in

we now decide whether we’re intra-pool or not and, if we’re intra-pool, whether we’re migrating onto the same host (localhost migrate). Intra-pool is decided by attempting to look up the destination host reference in our own database: if the lookup succeeds, the destination host is in our pool.

  let is_intra_pool = try ignore(Db.Host.get_uuid ~__context ~self:remote.dest_host); true with _ -> false in
  let is_same_host = is_intra_pool && remote.dest_host == localhost in

  if copy && is_intra_pool then raise (Api_errors.Server_error(Api_errors.operation_not_allowed, [ "Copy mode is disallowed on intra pool storage migration, try efficient alternatives e.g. VM.copy/clone."]));

Having got all of the VBDs of the VM, we now need to find the associated VDIs, filtering out empty CDs, and decide whether we’re going to copy them or mirror them - read-only VDIs can be copied but RW VDIs must be mirrored.

  let vms_vdis = List.filter_map (vdi_filter __context true) vbds in

where vdi_filter is defined earlier:

(* We ignore empty or CD VBDs - nothing to do there. Possible redundancy here:
   I don't think any VBDs other than CD VBDs can be 'empty' *)
let vdi_filter __context allow_mirror vbd =
  if Db.VBD.get_empty ~__context ~self:vbd || Db.VBD.get_type ~__context ~self:vbd = `CD
  then None
  else
    let do_mirror = allow_mirror && (Db.VBD.get_mode ~__context ~self:vbd = `RW) in
    let vm = Db.VBD.get_VM ~__context ~self:vbd in
    let vdi = Db.VBD.get_VDI ~__context ~self:vbd in
    Some (get_vdi_mirror __context vm vdi do_mirror)

This in turn calls get_vdi_mirror which gathers together some important info:

let get_vdi_mirror __context vm vdi do_mirror =
  let snapshot_of = Db.VDI.get_snapshot_of ~__context ~self:vdi in
  let size = Db.VDI.get_virtual_size ~__context ~self:vdi in
  let xenops_locator = Xapi_xenops.xenops_vdi_locator ~__context ~self:vdi in
  let location = Db.VDI.get_location ~__context ~self:vdi in
  let dp = Storage_access.presentative_datapath_of_vbd ~__context ~vm ~vdi in
  let sr = Db.SR.get_uuid ~__context ~self:(Db.VDI.get_SR ~__context ~self:vdi) in
  {vdi; dp; location; sr; xenops_locator; size; snapshot_of; do_mirror}

The record is helpfully commented above:

type vdi_mirror = {
  vdi : [ `VDI ] API.Ref.t;           (* The API reference of the local VDI *)
  dp : string;                        (* The datapath the VDI will be using if the VM is running *)
  location : string;                  (* The location of the VDI in the current SR *)
  sr : string;                        (* The VDI's current SR uuid *)
  xenops_locator : string;            (* The 'locator' xenops uses to refer to the VDI on the current host *)
  size : Int64.t;                     (* Size of the VDI *)
  snapshot_of : [ `VDI ] API.Ref.t;   (* API's snapshot_of reference *)
  do_mirror : bool;                   (* Whether we should mirror or just copy the VDI *)
}

xenops_locator is <sr uuid>/<vdi uuid>, and dp is vbd/<domid>/<device> if the VM is running and vbd/<vm_uuid>/<vdi_uuid> if not.
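
In other words (illustrative only):

(* The identifier formats described above. *)
let xenops_locator ~sr_uuid ~vdi_uuid = sr_uuid ^ "/" ^ vdi_uuid

let dp ~vm_running ~domid ~device ~vm_uuid ~vdi_uuid =
  if vm_running then Printf.sprintf "vbd/%d/%s" domid device
  else Printf.sprintf "vbd/%s/%s" vm_uuid vdi_uuid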

So now we have a list of these records for all VDIs attached to the VM. For these we check explicitly that they’re all defined in the vdi_map, the mapping of VDI references to their destination SR references.

  check_vdi_map ~__context vms_vdis vdi_map;

We then figure out the VIF map:

 let vif_map =
    if is_intra_pool then vif_map
    else infer_vif_map ~__context (vifs @ snapshot_vifs) vif_map
  in

More sanity checks: We can’t do a storage migration if any of the VDIs is a reset-on-boot one - since the state will be lost on the destination when it’s attached:

(* Block SXM when VM has a VDI with on_boot=reset *)
  List.(iter (fun vconf ->
      let vdi = vconf.vdi in
      if (Db.VDI.get_on_boot ~__context ~self:vdi ==`reset) then
        raise (Api_errors.Server_error(Api_errors.vdi_on_boot_mode_incompatible_with_operation, [Ref.string_of vdi]))) vms_vdis) ;

We now consider all of the VDIs associated with the snapshots. As for the VM’s VBDs above, we end up with a vdi_mirror list. Note we pass false to the allow_mirror parameter of the get_vdi_mirror function as none of these snapshot VDIs will ever require mirroring.

let snapshots_vdis = List.filter_map (vdi_filter __context false) snapshots_vbds in

Finally we get all of the suspend-image VDIs from all snapshots as well as the actual VM, since it might be suspended itself:

  let suspends_vdis =
    List.fold_left
      (fun acc vm ->
         if Db.VM.get_power_state ~__context ~self:vm = `Suspended
         then
           let vdi = Db.VM.get_suspend_VDI ~__context ~self:vm in
           let sr = Db.VDI.get_SR ~__context ~self:vdi in
           if is_intra_pool && Helpers.host_has_pbd_for_sr ~__context ~host:remote.dest_host ~sr
           then acc
           else (get_vdi_mirror __context vm vdi false):: acc
         else acc)
      [] vm_and_snapshots in

Sanity check that we can see all of the suspend-image VDIs on this host:

 (* Double check that all of the suspend VDIs are all visible on the source *)
  List.iter (fun vdi_mirror ->
      let sr = Db.VDI.get_SR ~__context ~self:vdi_mirror.vdi in
      if not (Helpers.host_has_pbd_for_sr ~__context ~host:localhost ~sr)
      then raise (Api_errors.Server_error (Api_errors.suspend_image_not_accessible, [ Ref.string_of vdi_mirror.vdi ]))) suspends_vdis;

Next is a fairly complex piece that determines the destination SR for all of these VDIs. We don’t require API users to decide destinations for all of the VDIs of snapshots and hence we have to make some decisions here:

  let dest_pool = List.hd (XenAPI.Pool.get_all remote.rpc remote.session) in
  let default_sr_ref =
    XenAPI.Pool.get_default_SR remote.rpc remote.session dest_pool in
  let suspend_sr_ref =
    let pool_suspend_SR = XenAPI.Pool.get_suspend_image_SR remote.rpc remote.session dest_pool
    and host_suspend_SR = XenAPI.Host.get_suspend_image_sr remote.rpc remote.session remote.dest_host in
    if pool_suspend_SR <> Ref.null then pool_suspend_SR else host_suspend_SR in

  (* Resolve placement of unspecified VDIs here - unspecified VDIs that
            are 'snapshot_of' a specified VDI go to the same place. suspend VDIs
            that are unspecified go to the suspend_sr_ref defined above *)

  let extra_vdis = suspends_vdis @ snapshots_vdis in

  let extra_vdi_map =
    List.map
      (fun vconf ->
         let dest_sr_ref =
           let is_mapped = List.mem_assoc vconf.vdi vdi_map
           and snapshot_of_is_mapped = List.mem_assoc vconf.snapshot_of vdi_map
           and is_suspend_vdi = List.mem vconf suspends_vdis
           and remote_has_suspend_sr = suspend_sr_ref <> Ref.null
           and remote_has_default_sr = default_sr_ref <> Ref.null in
           let log_prefix =
             Printf.sprintf "Resolving VDI->SR map for VDI %s:" (Db.VDI.get_uuid ~__context ~self:vconf.vdi) in
           if is_mapped then begin
             debug "%s VDI has been specified in the map" log_prefix;
             List.assoc vconf.vdi vdi_map
           end else if snapshot_of_is_mapped then begin
             debug "%s Snapshot VDI has entry in map for it's snapshot_of link" log_prefix;
             List.assoc vconf.snapshot_of vdi_map
           end else if is_suspend_vdi && remote_has_suspend_sr then begin
             debug "%s Mapping suspend VDI to remote suspend SR" log_prefix;
             suspend_sr_ref
           end else if is_suspend_vdi && remote_has_default_sr then begin
             debug "%s Remote suspend SR not set, mapping suspend VDI to remote default SR" log_prefix;
             default_sr_ref
           end else if remote_has_default_sr then begin
             debug "%s Mapping unspecified VDI to remote default SR" log_prefix;
             default_sr_ref
           end else begin
             error "%s VDI not in VDI->SR map and no remote default SR is set" log_prefix;
             raise (Api_errors.Server_error(Api_errors.vdi_not_in_map, [ Ref.string_of vconf.vdi ]))
           end in
         (vconf.vdi, dest_sr_ref))
      extra_vdis in

At the end of this we’ve got all of the VDIs that need to be copied and destinations for all of them:

  let vdi_map = vdi_map @ extra_vdi_map in
  let all_vdis = vms_vdis @ extra_vdis in

  (* The vdi_map should be complete at this point - it should include all the
     VDIs in the all_vdis list. *)

Now we gather some final information together:

  assert_no_cbt_enabled_vdi_migrated ~__context ~vdi_map;

  let dbg = Context.string_of_task __context in
  let open Xapi_xenops_queue in
  let queue_name = queue_of_vm ~__context ~self:vm in
  let module XenopsAPI = (val make_client queue_name : XENOPS) in

  let remote_vdis = ref [] in

  let ha_always_run_reset = not is_intra_pool && Db.VM.get_ha_always_run ~__context ~self:vm in

  let cd_vbds = find_cds_to_eject __context vdi_map vbds in
  eject_cds __context cd_vbds;

check there’s no CBT (we can’t currently migrate the CBT metadata), construct our client for talking to Xenopsd, make a mutable list of remote VDIs (which I think is redundant right now), decide whether we need to do anything for HA (we disable HA protection for this VM on the destination until it’s fully migrated) and eject any CDs from the VM.

Up until now this has mostly been gathering info (aside from the ejecting CDs bit), but now we’ll start to do some actions, so we begin a try-catch block:

try

but we’ve still got a bit of thinking to do: we sort the VDIs to copy based on age/size:

    (* Sort VDIs by size in principle and then age secondly. This gives better
       chances that similar but smaller VDIs would arrive comparatively
       earlier, which can serve as base for incremental copying the larger
       ones. *)
    let compare_fun v1 v2 =
      let r = Int64.compare v1.size v2.size in
      if r = 0 then
        let t1 = Date.to_unix_time (Db.VDI.get_snapshot_time ~__context ~self:v1.vdi) in
        let t2 = Date.to_unix_time (Db.VDI.get_snapshot_time ~__context ~self:v2.vdi) in
        compare t1 t2
      else r in
    let all_vdis = all_vdis |> List.sort compare_fun in

    let total_size = List.fold_left (fun acc vconf -> Int64.add acc vconf.size) 0L all_vdis in
    let so_far = ref 0L in

OK, let’s copy/mirror:

    with_many (vdi_copy_fun __context dbg vdi_map remote is_intra_pool remote_vdis so_far total_size copy) all_vdis @@ fun all_map ->

The copy functions are written such that they take continuations. This is to make the error handling simpler - each individual component function can perform its setup and execute the continuation. In the event of an exception coming from the continuation it can then unroll its bit of state and rethrow the exception for the next layer to handle.

with_many is a simple helper function for nesting invocations of functions that take continuations. It has the delightful type:

('a -> ('b -> 'c) -> 'c) -> 'a list -> ('b list -> 'c) -> 'c
(* Helper function to apply a 'with_x' function to a list *)
let rec with_many withfn many fn =
  let rec inner l acc =
    match l with
    | [] -> fn acc
    | x::xs -> withfn x (fun y -> inner xs (y::acc))
  in inner many []

As an example of its operation, imagine our withfn is as follows:

let withfn x c =
  Printf.printf "Starting withfn: x=%d\n" x;
  try
    c (string_of_int x)
  with e ->
    Printf.printf "Handling exception for x=%d\n" x;
    raise e;;

applying this gives the output:

utop # with_many withfn [1;2;3;4] (String.concat ",");;
Starting withfn: x=1
Starting withfn: x=2
Starting withfn: x=3
Starting withfn: x=4
- : string = "4,3,2,1"

whereas raising an exception in the continuation results in the following:

utop # with_many withfn [1;2;3;4] (fun _ -> failwith "error");;
Starting withfn: x=1
Starting withfn: x=2
Starting withfn: x=3
Starting withfn: x=4
Handling exception for x=4
Handling exception for x=3
Handling exception for x=2
Handling exception for x=1
Exception: Failure "error".

All the real action is in vdi_copy_fun, which copies or mirrors a single VDI:

let vdi_copy_fun __context dbg vdi_map remote is_intra_pool remote_vdis so_far total_size copy vconf continuation =
  TaskHelper.exn_if_cancelling ~__context;
  let open Storage_access in
  let dest_sr_ref = List.assoc vconf.vdi vdi_map in
  let dest_sr_uuid = XenAPI.SR.get_uuid remote.rpc remote.session dest_sr_ref in

  (* Plug the destination shared SR into destination host and pool master if unplugged.
     Plug the local SR into destination host only if unplugged *)
  let dest_pool = List.hd (XenAPI.Pool.get_all remote.rpc remote.session) in
  let master_host = XenAPI.Pool.get_master remote.rpc remote.session dest_pool in
  let pbds = XenAPI.SR.get_PBDs remote.rpc remote.session dest_sr_ref in
  let pbd_host_pair = List.map (fun pbd -> (pbd, XenAPI.PBD.get_host remote.rpc remote.session pbd)) pbds in
  let hosts_to_be_attached = [master_host; remote.dest_host] in
  let pbds_to_be_plugged = List.filter (fun (_, host) ->
      (List.mem host hosts_to_be_attached) && (XenAPI.Host.get_enabled remote.rpc remote.session host)) pbd_host_pair in
  List.iter (fun (pbd, _) ->
      if not (XenAPI.PBD.get_currently_attached remote.rpc remote.session pbd) then
        XenAPI.PBD.plug remote.rpc remote.session pbd) pbds_to_be_plugged;

It begins by attempting to ensure the SRs we require are definitely attached on the destination host and on the destination pool master.

There’s now a little logic to support the case where we have cross-pool SRs and the VDI is already visible to the destination pool. Since this is outside our normal support envelope there is a key in xapi_globs that has to be set (via xapi.conf) to enable this:

  let rec dest_vdi_exists_on_sr vdi_uuid sr_ref retry =
    try
      let dest_vdi_ref = XenAPI.VDI.get_by_uuid remote.rpc remote.session vdi_uuid in
      let dest_vdi_sr_ref = XenAPI.VDI.get_SR remote.rpc remote.session dest_vdi_ref in
      if dest_vdi_sr_ref = sr_ref then
        true
      else
        false
    with _ ->
      if retry then
        begin
          XenAPI.SR.scan remote.rpc remote.session sr_ref;
          dest_vdi_exists_on_sr vdi_uuid sr_ref false
        end
      else
        false
  in

  (* CP-4498 added an unsupported mode to use cross-pool shared SRs - the initial
     use case is for a shared raw iSCSI SR (same uuid, same VDI uuid) *)
  let vdi_uuid = Db.VDI.get_uuid ~__context ~self:vconf.vdi in
  let mirror = if !Xapi_globs.relax_xsm_sr_check then
      if (dest_sr_uuid = vconf.sr) then
        begin
          (* Check if the VDI uuid already exists in the target SR *)
          if (dest_vdi_exists_on_sr vdi_uuid dest_sr_ref true) then
            false
          else
            failwith ("SR UUID matches on destination but VDI does not exist")
        end
      else
        true
    else
      (not is_intra_pool) || (dest_sr_uuid <> vconf.sr)
  in

The check also covers the case where we’re doing an intra-pool migration and not copying all of the disks, in which case we don’t need to do anything for that disk.

We now have a wrapper function that creates a new datapath and passes it to a continuation function. On error it handles the destruction of the datapath:

let with_new_dp cont =
    let dp = Printf.sprintf (if vconf.do_mirror then "mirror_%s" else "copy_%s") vconf.dp in
    try cont dp
    with e ->
      (try SMAPI.DP.destroy ~dbg ~dp ~allow_leak:false with _ -> info "Failed to cleanup datapath: %s" dp);
      raise e in

and now a helper that, given a remote VDI uuid, looks up the reference on the remote host and gives it to a continuation function. On failure of the continuation it will destroy the remote VDI:

  let with_remote_vdi remote_vdi cont =
    debug "Executing remote scan to ensure VDI is known to xapi";
    XenAPI.SR.scan remote.rpc remote.session dest_sr_ref;
    let query = Printf.sprintf "(field \"location\"=\"%s\") and (field \"SR\"=\"%s\")" remote_vdi (Ref.string_of dest_sr_ref) in
    let vdis = XenAPI.VDI.get_all_records_where remote.rpc remote.session query in
    let remote_vdi_ref = match vdis with
      | [] -> raise (Api_errors.Server_error(Api_errors.vdi_location_missing, [Ref.string_of dest_sr_ref; remote_vdi]))
      | h :: [] -> debug "Found remote vdi reference: %s" (Ref.string_of (fst h)); fst h
      | _ -> raise (Api_errors.Server_error(Api_errors.location_not_unique, [Ref.string_of dest_sr_ref; remote_vdi])) in
    try cont remote_vdi_ref
    with e ->
      (try XenAPI.VDI.destroy remote.rpc remote.session remote_vdi_ref with _ -> error "Failed to destroy remote VDI");
      raise e in

another helper to gather together info about a mirrored VDI:

let get_mirror_record ?new_dp remote_vdi remote_vdi_reference =
    { mr_dp = new_dp;
      mr_mirrored = mirror;
      mr_local_sr = vconf.sr;
      mr_local_vdi = vconf.location;
      mr_remote_sr = dest_sr_uuid;
      mr_remote_vdi = remote_vdi;
      mr_local_xenops_locator = vconf.xenops_locator;
      mr_remote_xenops_locator = Xapi_xenops.xenops_vdi_locator_of_strings dest_sr_uuid remote_vdi;
      mr_local_vdi_reference = vconf.vdi;
      mr_remote_vdi_reference = remote_vdi_reference } in

and finally the really important function:

let mirror_to_remote new_dp =
    let task =
      if not vconf.do_mirror then
        SMAPI.DATA.copy ~dbg ~sr:vconf.sr ~vdi:vconf.location ~dp:new_dp ~url:remote.sm_url ~dest:dest_sr_uuid
      else begin
        (* Though we have no intention of "write", here we use the same mode as the
           associated VBD on a mirrored VDIs (i.e. always RW). This avoids problem
           when we need to start/stop the VM along the migration. *)
        let read_write = true in
        (* DP set up is only essential for MIRROR.start/stop due to their open ended pattern.
           It's not necessary for copy which will take care of that itself. *)
        ignore(SMAPI.VDI.attach ~dbg ~dp:new_dp ~sr:vconf.sr ~vdi:vconf.location ~read_write);
        SMAPI.VDI.activate ~dbg ~dp:new_dp ~sr:vconf.sr ~vdi:vconf.location;
        ignore(Storage_access.register_mirror __context vconf.location);
        SMAPI.DATA.MIRROR.start ~dbg ~sr:vconf.sr ~vdi:vconf.location ~dp:new_dp ~url:remote.sm_url ~dest:dest_sr_uuid
      end in

    let mapfn x =
      let total = Int64.to_float total_size in
      let done_ = Int64.to_float !so_far /. total in
      let remaining = Int64.to_float vconf.size /. total in
      done_ +. x *. remaining in

    let open Storage_access in

    let task_result =
      task |> register_task __context
      |> add_to_progress_map mapfn
      |> wait_for_task dbg
      |> remove_from_progress_map
      |> unregister_task __context
      |> success_task dbg in

    let mirror_id, remote_vdi =
      if not vconf.do_mirror then
        let vdi = task_result |> vdi_of_task dbg in
        remote_vdis := vdi.vdi :: !remote_vdis;
        None, vdi.vdi
      else
        let mirrorid = task_result |> mirror_of_task dbg in
        let m = SMAPI.DATA.MIRROR.stat ~dbg ~id:mirrorid in
        Some mirrorid, m.Mirror.dest_vdi in

    so_far := Int64.add !so_far vconf.size;
    debug "Local VDI %s %s to %s" vconf.location (if vconf.do_mirror then "mirrored" else "copied") remote_vdi;
    mirror_id, remote_vdi in

This is the bit that actually starts the mirroring or copying. Before the call to mirror we call VDI.attach and VDI.activate locally to ensure that if the VM is shutdown then the detach/deactivate there doesn’t kill the mirroring process.

Note the parameters to the SMAPI call are sr and vdi, locating the local VDI and SM backend, new_dp, the datapath we’re using for the mirroring, url, which is the remote url on which SMAPI calls work, and dest, the destination SR uuid. These are also the arguments to copy above too.

There’s a little function to calculate the overall progress of the task, and the function waits until the completion of the task before it continues. The function success_task will raise an exception if the task failed. For DATA.mirror, completion implies both that the disk data has been copied to the destination and that all local writes are being mirrored to the destination. Hence more cleanup must be done on cancellation. In contrast, if the DATA.copy path had been taken then the operation at this point has completely finished.

The result of this function is an optional mirror id and the remote VDI uuid.

Next, there is a post_mirror function:

  let post_mirror mirror_id mirror_record =
    try
      let result = continuation mirror_record in
      (match mirror_id with
       | Some mid -> ignore(Storage_access.unregister_mirror mid);
       | None -> ());
      if mirror && not (Xapi_fist.storage_motion_keep_vdi () || copy) then
        Helpers.call_api_functions ~__context (fun rpc session_id ->
            XenAPI.VDI.destroy rpc session_id vconf.vdi);
      result
    with e ->
      let mirror_failed =
        match mirror_id with
        | Some mid ->
          ignore(Storage_access.unregister_mirror mid);
          let m = SMAPI.DATA.MIRROR.stat ~dbg ~id:mid in
          (try SMAPI.DATA.MIRROR.stop ~dbg ~id:mid with _ -> ());
          m.Mirror.failed
        | None -> false in
      if mirror_failed then raise (Api_errors.Server_error(Api_errors.mirror_failed,[Ref.string_of vconf.vdi]))
      else raise e in

This is poorly named - it is post mirror and copy. The aim of this function is to destroy the source VDIs on successful completion of the continuation function, which will have migrated the VM to the destination. In its exception handler it will stop the mirroring, but before doing so it will check to see if the mirroring process it was looking after has itself failed, and raise mirror_failed if so. This is because a failed mirror can result in a range of actual errors, and we decide here that the failed mirror was probably the root cause.

These functions are assembled together at the end of the vdi_copy_fun function:

   if mirror then
    with_new_dp (fun new_dp ->
        let mirror_id, remote_vdi = mirror_to_remote new_dp in
        with_remote_vdi remote_vdi (fun remote_vdi_ref ->
            let mirror_record = get_mirror_record ~new_dp remote_vdi remote_vdi_ref in
            post_mirror mirror_id mirror_record))
  else
    let mirror_record = get_mirror_record vconf.location (XenAPI.VDI.get_by_uuid remote.rpc remote.session vdi_uuid) in
    continuation mirror_record

again, mirror here is poorly named, and means mirror or copy.

Once all of the disks have been mirrored or copied, we jump back to the body of migrate_send. We split apart the mirror records according to the source of the VDI:

      let was_from vmap = List.exists (fun vconf -> vconf.vdi = vmap.mr_local_vdi_reference) in

      let suspends_map, snapshots_map, vdi_map = List.fold_left (fun (suspends, snapshots, vdis) vmap ->
          if was_from vmap suspends_vdis then  vmap :: suspends, snapshots, vdis
          else if was_from vmap snapshots_vdis then suspends, vmap :: snapshots, vdis
          else suspends, snapshots, vmap :: vdis
        ) ([],[],[]) all_map in

then we reassemble all_map from this, for some reason:

    let all_map = List.concat [suspends_map; snapshots_map; vdi_map] in

Now we need to update the snapshot-of links:

     (* All the disks and snapshots have been created in the remote SR(s),
       * so update the snapshot links if there are any snapshots. *)
      if snapshots_map <> [] then
        update_snapshot_info ~__context ~dbg ~url:remote.sm_url ~vdi_map ~snapshots_map;

I’m not entirely sure why this is done in this layer as opposed to in the storage layer.

A little housekeeping:

     let xenops_vdi_map = List.map (fun mirror_record -> (mirror_record.mr_local_xenops_locator, mirror_record.mr_remote_xenops_locator)) all_map in

      (* Wait for delay fist to disappear *)
      wait_for_fist __context Xapi_fist.pause_storage_migrate "pause_storage_migrate";

      TaskHelper.exn_if_cancelling ~__context;

the fist point here simply allows tests to insert a delay at this specific point.

We also check the task to see if we’ve been cancelled and raise an exception if so.

The VM metadata is now imported into the remote pool, with all the XenAPI level objects remapped:

let new_vm =
        if is_intra_pool
        then vm
        else
          (* Make sure HA replaning cycle won't occur right during the import process or immediately after *)
          let () = if ha_always_run_reset then XenAPI.Pool.ha_prevent_restarts_for ~rpc:remote.rpc ~session_id:remote.session ~seconds:(Int64.of_float !Xapi_globs.ha_monitor_interval) in
          (* Move the xapi VM metadata to the remote pool. *)
          let vms =
            let vdi_map =
              List.map (fun mirror_record -> {
                    local_vdi_reference = mirror_record.mr_local_vdi_reference;
                    remote_vdi_reference = Some mirror_record.mr_remote_vdi_reference;
                  })
                all_map in
            let vif_map =
              List.map (fun (vif, network) -> {
                    local_vif_reference = vif;
                    remote_network_reference = network;
                  })
                vif_map in
            let vgpu_map =
              List.map (fun (vgpu, gpu_group) -> {
                    local_vgpu_reference = vgpu;
                    remote_gpu_group_reference = gpu_group;
                  })
                vgpu_map
            in
            inter_pool_metadata_transfer ~__context ~remote ~vm ~vdi_map
              ~vif_map ~vgpu_map ~dry_run:false ~live:true ~copy
          in
          let vm = List.hd vms in
          let () = if ha_always_run_reset then XenAPI.VM.set_ha_always_run ~rpc:remote.rpc ~session_id:remote.session ~self:vm ~value:false in
          (* Reserve resources for the new VM on the destination pool's host *)
          let () = XenAPI.Host.allocate_resources_for_vm remote.rpc remote.session remote.dest_host vm true in
          vm in

More waiting for fist points:

     wait_for_fist __context Xapi_fist.pause_storage_migrate2 "pause_storage_migrate2";

      (* Attach networks on remote *)
      XenAPI.Network.attach_for_vm ~rpc:remote.rpc ~session_id:remote.session ~host:remote.dest_host ~vm:new_vm;

also make sure all the networks are plugged for the VM on the destination. Next we create the xenopsd-level vif map, equivalent to the vdi_map above:

  (* Create the vif-map for xenops, linking VIF devices to bridge names on the remote *)
      let xenops_vif_map =
        let vifs = XenAPI.VM.get_VIFs ~rpc:remote.rpc ~session_id:remote.session ~self:new_vm in
        List.map (fun vif ->
            let vifr = XenAPI.VIF.get_record ~rpc:remote.rpc ~session_id:remote.session ~self:vif in
            let bridge = Xenops_interface.Network.Local
                (XenAPI.Network.get_bridge ~rpc:remote.rpc ~session_id:remote.session ~self:vifr.API.vIF_network) in
            vifr.API.vIF_device, bridge
          ) vifs
      in

Now we destroy any extra mirror datapaths we set up previously:

     (* Destroy the local datapaths - this allows the VDIs to properly detach, invoking the migrate_finalize calls *)
      List.iter (fun mirror_record ->
          if mirror_record.mr_mirrored
          then match mirror_record.mr_dp with | Some dp ->  SMAPI.DP.destroy ~dbg ~dp ~allow_leak:false | None -> ()) all_map;

More housekeeping:

    SMPERF.debug "vm.migrate_send: migration initiated vm:%s" vm_uuid;

      (* In case when we do SXM on the same host (mostly likely a VDI
         migration), the VM's metadata in xenopsd will be in-place updated
         as soon as the domain migration starts. For these case, there
         will be no (clean) way back from this point. So we disable task
         cancellation for them here.
       *)
      if is_same_host then (TaskHelper.exn_if_cancelling ~__context; TaskHelper.set_not_cancellable ~__context);

Finally we get to the memory-image part of the migration:

      (* It's acceptable for the VM not to exist at this point; shutdown commutes with storage migrate *)
      begin
        try
          Xapi_xenops.Events_from_xenopsd.with_suppressed queue_name dbg vm_uuid
            (fun () ->
               let xenops_vgpu_map = (* can raise VGPU_mapping *)
                 infer_vgpu_map ~__context ~remote new_vm in
               migrate_with_retry
                 ~__context queue_name dbg vm_uuid xenops_vdi_map
                 xenops_vif_map xenops_vgpu_map remote.xenops_url;
               Xapi_xenops.Xenopsd_metadata.delete ~__context vm_uuid)
        with
        | Xenops_interface.Does_not_exist ("VM",_)
        | Xenops_interface.Does_not_exist ("extra",_) ->
          info "%s: VM %s stopped being live during migration"
            "vm_migrate_send" vm_uuid
        | VGPU_mapping(msg) ->
          info "%s: VM %s - can't infer vGPU map: %s"
            "vm_migrate_send" vm_uuid msg;
          raise Api_errors.
                  (Server_error
                     (vm_migrate_failed,
                      ([ vm_uuid
                       ; Helpers.get_localhost_uuid ()
                       ; Db.Host.get_uuid ~__context ~self:remote.dest_host
                       ; "The VM changed its power state during migration"
                       ])))
      end;

      debug "Migration complete";
      SMPERF.debug "vm.migrate_send: migration complete vm:%s" vm_uuid;

Now we tidy up after ourselves:

      (* So far the main body of migration is completed, and the rests are
         updates, config or cleanup on the source and destination. There will
         be no (clean) way back from this point, due to these destructive
         changes, so we don't want user intervention e.g. task cancellation.
       *)
      TaskHelper.exn_if_cancelling ~__context;
      TaskHelper.set_not_cancellable ~__context;
      XenAPI.VM.pool_migrate_complete remote.rpc remote.session new_vm remote.dest_host;

      detach_local_network_for_vm ~__context ~vm ~destination:remote.dest_host;
      Xapi_xenops.refresh_vm ~__context ~self:vm;

The function pool_migrate_complete is called on the destination host, and performs a few steps that would ordinarily happen during VM.start or similar:

let pool_migrate_complete ~__context ~vm ~host =
  let id = Db.VM.get_uuid ~__context ~self:vm in
  debug "VM.pool_migrate_complete %s" id;
  let dbg = Context.string_of_task __context in
  let queue_name = Xapi_xenops_queue.queue_of_vm ~__context ~self:vm in
  if Xapi_xenops.vm_exists_in_xenopsd queue_name dbg id then begin
    Cpuid_helpers.update_cpu_flags ~__context ~vm ~host;
    Xapi_xenops.set_resident_on ~__context ~self:vm;
    Xapi_xenops.add_caches id;
    Xapi_xenops.refresh_vm ~__context ~self:vm;
    Monitor_dbcalls_cache.clear_cache_for_vm ~vm_uuid:id
  end

More tidying up, remapping some remaining VBDs and clearing state on the sender:

      (* Those disks that were attached at the point the migration happened will have been
         remapped by the Events_from_xenopsd logic. We need to remap any other disks at
         this point here *)

      if is_intra_pool
      then
        List.iter
          (fun vm' ->
             intra_pool_vdi_remap ~__context vm' all_map;
             intra_pool_fix_suspend_sr ~__context remote.dest_host vm')
          vm_and_snapshots;

      (* If it's an inter-pool migrate, the VBDs will still be 'currently-attached=true'
         because we supressed the events coming from xenopsd. Destroy them, so that the
         VDIs can be destroyed *)
      if not is_intra_pool && not copy
      then List.iter (fun vbd -> Db.VBD.destroy ~__context ~self:vbd) (vbds @ snapshots_vbds);

      new_vm
    in

The remark about Events_from_xenopsd refers to a thread that watches for events emitted by xenopsd: xapi resynchronises its state with xenopsd’s for several fields for which xenopsd is considered the canonical source of truth. One of these is the exact VDI a VBD is associated with.

The suspend_SR field of the VM is set to the source’s value, so we reset that.

Now we move the RRDs:

  if not copy then begin
      Rrdd_proxy.migrate_rrd ~__context ~remote_address:remote.remote_ip ~session_id:(Ref.string_of remote.session)
        ~vm_uuid:vm_uuid ~host_uuid:(Ref.string_of remote.dest_host) ()
    end;

This can be done for intra- and inter-pool migrations in the same way, simplifying the logic.

However, messages and blobs only need to be migrated for inter-pool migrations:

   if not is_intra_pool && not copy then begin
      (* Replicate HA runtime flag if necessary *)
      if ha_always_run_reset then XenAPI.VM.set_ha_always_run ~rpc:remote.rpc ~session_id:remote.session ~self:new_vm ~value:true;
      (* Send non-database metadata *)
      Xapi_message.send_messages ~__context ~cls:`VM ~obj_uuid:vm_uuid
        ~session_id:remote.session ~remote_address:remote.remote_master_ip;
      Xapi_blob.migrate_push ~__context ~rpc:remote.rpc
        ~remote_address:remote.remote_master_ip ~session_id:remote.session ~old_vm:vm ~new_vm ;
      (* Signal the remote pool that we're done *)
    end;

Lastly, we destroy the VM record on the source:

    Helpers.call_api_functions ~__context (fun rpc session_id ->
        if not is_intra_pool && not copy then begin
          info "Destroying VM ref=%s uuid=%s" (Ref.string_of vm) vm_uuid;
          Xapi_vm_lifecycle.force_state_reset ~__context ~self:vm ~value:`Halted;
          List.iter (fun self -> Db.VM.destroy ~__context ~self) vm_and_snapshots
        end);
    SMPERF.debug "vm.migrate_send exiting vm:%s" vm_uuid;
    new_vm

The exception handler still has to clean up some state, but most of the cleanup is handled by the CPS functions declared above:

with e ->
    error "Caught %s: cleaning up" (Printexc.to_string e);

    (* We do our best to tidy up the state left behind *)
    Events_from_xenopsd.with_suppressed queue_name dbg vm_uuid (fun () ->
        try
          let _, state = XenopsAPI.VM.stat dbg vm_uuid in
          if Xenops_interface.(state.Vm.power_state = Suspended) then begin
            debug "xenops: %s: shutting down suspended VM" vm_uuid;
            Xapi_xenops.shutdown ~__context ~self:vm None;
          end;
        with _ -> ());

    if not is_intra_pool && Db.is_valid_ref __context vm then begin
      List.map (fun self -> Db.VM.get_uuid ~__context ~self) vm_and_snapshots
      |> List.iter (fun self ->
          try
            let vm_ref = XenAPI.VM.get_by_uuid remote.rpc remote.session self in
            info "Destroying stale VM uuid=%s on destination host" self;
            XenAPI.VM.destroy remote.rpc remote.session vm_ref
          with e -> error "Caught %s while destroying VM uuid=%s on destination host" (Printexc.to_string e) self)
    end;

    let task = Context.get_task_id __context in
    let oc = Db.Task.get_other_config ~__context ~self:task in
    if List.mem_assoc "mirror_failed" oc then begin
      let failed_vdi = List.assoc "mirror_failed" oc in
      let vconf = List.find (fun vconf -> vconf.location=failed_vdi) vms_vdis in
      debug "Mirror failed for VDI: %s" failed_vdi;
      raise (Api_errors.Server_error(Api_errors.mirror_failed,[Ref.string_of vconf.vdi]))
    end;
    TaskHelper.exn_if_cancelling ~__context;
    begin match e with
      | Storage_interface.Backend_error(code, params) -> raise (Api_errors.Server_error(code, params))
      | Storage_interface.Unimplemented(code) -> raise (Api_errors.Server_error(Api_errors.unimplemented_in_sm_backend, [code]))
      | Xenops_interface.Cancelled _ -> TaskHelper.raise_cancelled ~__context
      | _ -> raise e
    end

Failures during the migration can result in the VM being in a suspended state. There’s no point leaving it like this since there’s nothing that can be done to resume it, so we force shut it down.

We also try to remove the VM record from the destination if we managed to send it there.

Finally we check for mirror failure in the task - this flag is set by the events thread that watches for events from the storage layer, in storage_access.ml.
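
For reference, the write side of this flag amounts to adding a key to the task’s other_config. Below is a minimal sketch of what that looks like, assuming a task reference and the failed VDI’s location are in scope; the real code lives in storage_access.ml and may differ in detail.

(* Sketch only: mark the current task as having suffered a mirror failure, so
   that the exception handler above can find the flag and raise
   Api_errors.mirror_failed. *)
let note_mirror_failed ~__context ~failed_vdi_location =
  let task = Context.get_task_id __context in
  Db.Task.add_to_other_config ~__context ~self:task
    ~key:"mirror_failed" ~value:failed_vdi_location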

Storage code

The part of the code that is conceptually in the storage layer, but physically in xapi, is located in storage_migrate.ml. Logically this file has a few separate parts.

Let’s start by considering the way the storage APIs are intended to be used.

Copying a VDI

DATA.copy takes several parameters:

  • dbg - a debug string
  • sr - the source SR (a uuid)
  • vdi - the source VDI (a uuid)
  • dp - unused
  • url - a URL on which SMAPIv2 API calls can be made
  • dest - the destination SR into which the VDI should be copied

and returns a value of type Task.id. The API call is intended to be used asynchronously - i.e. the caller makes the call, receives the task ID back and polls or uses the event mechanism to wait until the task has completed. The task may be cancelled via the TASK.cancel API call. The result of the operation is obtained by calling TASK.stat, which returns a record:

	type t = {
		id: id;
		dbg: string;
		ctime: float;
		state: state;
		subtasks: (string * state) list;
		debug_info: (string * string) list;
		backtrace: string;
	}

Where the state field contains the result once the task has completed:

type async_result_t =
	| Vdi_info of vdi_info
	| Mirror_id of Mirror.id

type completion_t = {
	duration : float;
	result : async_result_t option
}

type state =
	| Pending of float
	| Completed of completion_t
	| Failed of Rpc.t

Once the result has been obtained from the task, the task should be destroyed via the TASK.destroy API call.
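
To make the intended usage pattern concrete, here is a minimal sketch of the poll-and-destroy sequence described above. It assumes a SMAPIv2 client module Client exposing DATA.copy, TASK.stat and TASK.destroy roughly as sketched in this section; the binding names and labelled arguments are illustrative rather than copied from the real interface files.

(* Sketch: start an asynchronous copy, poll TASK.stat until the task leaves
   the Pending state, read the result and then destroy the task. *)
let copy_and_wait ~dbg ~sr ~vdi ~url ~dest =
  let task_id = Client.DATA.copy ~dbg ~sr ~vdi ~dp:"unused" ~url ~dest in
  let rec wait () =
    match (Client.TASK.stat ~dbg ~task:task_id).state with
    | Pending _ -> Thread.delay 1.0; wait () (* or use the event mechanism *)
    | Completed { result = Some (Vdi_info vdi_info); _ } -> Ok vdi_info
    | Completed _ -> Error "task completed without a result"
    | Failed rpc -> Error (Jsonrpc.to_string rpc)
  in
  let result = wait () in
  (* The task must be destroyed once the result has been read *)
  Client.TASK.destroy ~dbg ~task:task_id;
  result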

The implementation uses the url parameter to make SMAPIv2 calls to the destination SR. This is used, for example, to invoke a VDI.create call if necessary. The URL contains an authentication token within it (valid for the duration of the XenAPI call that caused this DATA.copy API call).

The implementation tries to minimize the amount of data copied by looking for related VDIs on the destination SR. See below for more details.

Mirroring a VDI

DATA.MIRROR.start takes a similar set of parameters to that of copy:

  • dbg - a debug string
  • sr - the source SR (a uuid)
  • vdi - the source VDI (a uuid)
  • dp - the datapath on which the VDI has been attached
  • url - a URL on which SMAPIv2 API calls can be made
  • dest - the destination SR into which the VDI should be mirrored

Similar to copy above, this returns a task id. The task ‘completes’ once the mirror has been set up - that is, at any point afterwards we can detach the disk and the destination disk will be identical to the source. Unlike copy, the operation is ongoing after the API call completes, since new writes need to be mirrored to the destination. Therefore the completion type of the mirror operation is Mirror_id, which contains a handle on which further API calls related to the mirror can be made. For example MIRROR.stat, whose signature is:

MIRROR.stat: dbg:debug_info -> id:Mirror.id -> Mirror.t

The return type of this call is a record containing information about the mirror:

type state =
	| Receiving
	| Sending
	| Copying

type t = {
	source_vdi : vdi;
	dest_vdi : vdi;
	state : state list;
	failed : bool;
}

Note that state is a list since the initial phase of the operation requires both copying and mirroring.

Additionally the mirror can be cancelled using the MIRROR.stop API call.
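
Putting these calls together, a caller might drive the mirror API roughly as follows. This is a sketch under the same assumptions as the copy example above (an assumed Client module binding the SMAPIv2 calls, plus a hypothetical wait_for_mirror_id helper that polls the start task until it completes with a Mirror_id result).

(* Sketch: start a mirror, wait for it to be established, check its health
   via MIRROR.stat and finally stop it. *)
let mirror_check_and_stop ~dbg ~sr ~vdi ~dp ~url ~dest =
  let task_id = Client.DATA.MIRROR.start ~dbg ~sr ~vdi ~dp ~url ~dest in
  (* hypothetical helper: poll TASK.stat until a Mirror_id result appears *)
  let mirror_id = wait_for_mirror_id ~dbg task_id in
  let m = Client.DATA.MIRROR.stat ~dbg ~id:mirror_id in
  if m.failed then error "Mirror %s reports a failure" mirror_id;
  (* The mirror keeps running until it is explicitly stopped, e.g. once the
     VM has finished migrating *)
  Client.DATA.MIRROR.stop ~dbg ~id:mirror_id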

Code walkthrough

Let’s go through the implementation of copy:

DATA.copy

let copy ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
  debug "copy sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in

Here we are constructing a module Remote on which we can do SMAPIv2 calls directly on the destination.

  try

Wrap the whole function in an exception handler.

    (* Find the local VDI *)
    let vdis = Local.SR.scan ~dbg ~sr in
    let local_vdi =
      try List.find (fun x -> x.vdi = vdi) vdis
      with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

We first find the metadata for our source VDI by doing a local SMAPIv2 call SR.scan. This returns a list of VDI metadata, out of which we extract the VDI we’re interested in.

    try

Another exception handler. This looks redundant to me right now.

      let similar_vdis = Local.VDI.similar_content ~dbg ~sr ~vdi in
      let similars = List.map (fun vdi -> vdi.content_id) similar_vdis in
      debug "Similar VDIs to %s = [ %s ]" vdi (String.concat "; " (List.map (fun x -> Printf.sprintf "(vdi=%s,content_id=%s)" x.vdi x.content_id) similar_vdis));

Here we look for related VDIs locally using the VDI.similar_content SMAPIv2 API call. This searches for related VDIs and returns an ordered list where the most similar is first in the list. It returns both clones and snapshots, and hence is more general than simply following snapshot_of links.

      let remote_vdis = Remote.SR.scan ~dbg ~sr:dest in
      (** We drop cbt_metadata VDIs that do not have any actual data *)
      let remote_vdis = List.filter (fun vdi -> vdi.ty <> "cbt_metadata") remote_vdis in

      let nearest = List.fold_left
          (fun acc content_id -> match acc with
             | Some x -> acc
             | None ->
               try Some (List.find (fun vdi -> vdi.content_id = content_id && vdi.virtual_size <= local_vdi.virtual_size) remote_vdis)
               with Not_found -> None) None similars in

      debug "Nearest VDI: content_id=%s vdi=%s"
        (Opt.default "None" (Opt.map (fun x -> x.content_id) nearest))
        (Opt.default "None" (Opt.map (fun x -> x.vdi) nearest));

Here we look for VDIs on the destination with the same content_id as one of the locally similar VDIs. We will use this as a base image and only copy deltas to the destination. This is done by cloning the VDI on the destination and then using sparse_dd to find the deltas from our local disk to our local copy of the content_id disk, streaming these to the destination. Note that we need to ensure the base VDI is no larger than the one we want to copy, since we can’t resize disks downwards.

      let remote_base = match nearest with
        | Some vdi ->
          debug "Cloning VDI %s" vdi.vdi;
          let vdi_clone = Remote.VDI.clone ~dbg ~sr:dest ~vdi_info:vdi in
          if vdi_clone.virtual_size <> local_vdi.virtual_size then begin
            let new_size = Remote.VDI.resize ~dbg ~sr:dest ~vdi:vdi_clone.vdi ~new_size:local_vdi.virtual_size in
            debug "Resize remote VDI %s to %Ld: result %Ld" vdi_clone.vdi local_vdi.virtual_size new_size;
          end;
          vdi_clone
        | None ->
          debug "Creating a blank remote VDI";
          Remote.VDI.create ~dbg ~sr:dest ~vdi_info:{ local_vdi with sm_config = [] }  in

If we’ve found a base VDI we clone it and resize it immediately. If there’s nothing already on the destination that we can use, we just create a new VDI. Note that the calls to create and clone may well fail if the destination host is not the SR master. This is handled purely in the rpc function:

let rec rpc ~srcstr ~dststr url call =
  let result = XMLRPC_protocol.rpc ~transport:(transport_of_url url)
      ~srcstr ~dststr ~http:(xmlrpc ~version:"1.0" ?auth:(Http.Url.auth_of url) ~query:(Http.Url.get_query_params url) (Http.Url.get_uri url)) call
  in
  if not result.Rpc.success then begin
    debug "Got failure: checking for redirect";
    debug "Call was: %s" (Rpc.string_of_call call);
    debug "result.contents: %s" (Jsonrpc.to_string result.Rpc.contents);
    match Storage_interface.Exception.exnty_of_rpc result.Rpc.contents with
    | Storage_interface.Exception.Redirect (Some ip) ->
      let open Http.Url in
      let newurl =
        match url with
        | (Http h, d) ->
          (Http {h with host=ip}, d)
        | _ ->
          remote_url ip in
      debug "Redirecting to ip: %s" ip;
      let r = rpc ~srcstr ~dststr newurl call in
      debug "Successfully redirected. Returning";
      r
    | _ ->
      debug "Not a redirect";
      result
  end
  else result

Back to the copy function:

      let remote_copy = copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi:remote_base.vdi |> vdi_info in

This calls the actual data copy part. See below for more on that.

      let snapshot = Remote.VDI.snapshot ~dbg ~sr:dest ~vdi_info:remote_copy in
      Remote.VDI.destroy ~dbg ~sr:dest ~vdi:remote_copy.vdi;
      Some (Vdi_info snapshot)

Finally we snapshot the remote VDI to ensure we’ve got a VDI of type ‘snapshot’ on the destination, and we delete the non-snapshot VDI.

    with e ->
      error "Caught %s: copying snapshots vdi" (Printexc.to_string e);
      raise (Internal_error (Printexc.to_string e))
  with
  | Backend_error(code, params)
  | Api_errors.Server_error(code, params) ->
    raise (Backend_error(code, params))
  | e ->
    raise (Internal_error(Printexc.to_string e))

The exception handler does nothing - so we leak remote VDIs if the exception happens after we’ve done our cloning :-(

DATA.copy_into

Let’s now look at the data-copying part. This is common code shared between DATA.copy, DATA.copy_into and DATA.MIRROR.start, and hence has some duplication of the calls made above.

let copy_into ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi =
  copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi

copy_into is a stub that just calls copy'.

let copy' ~task ~dbg ~sr ~vdi ~url ~dest ~dest_vdi =
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in
  debug "copy local=%s/%s url=%s remote=%s/%s" sr vdi url dest dest_vdi;

This call takes roughly the same parameters as the DATA.copy call above, except that it specifies the destination VDI. Once again we construct a module to do remote SMAPIv2 calls.

  (* Check the remote SR exists *)
  let srs = Remote.SR.list ~dbg in
  if not(List.mem dest srs)
  then failwith (Printf.sprintf "Remote SR %s not found" dest);

Sanity check.

  let vdis = Remote.SR.scan ~dbg ~sr:dest in
  let remote_vdi =
    try List.find (fun x -> x.vdi = dest_vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Remote VDI %s not found" dest_vdi)
  in

Find the metadata of the destination VDI.

  let dest_content_id = remote_vdi.content_id in

If we’ve got a local VDI with the same content_id as the destination, we only need to copy the deltas, so we make a note of the destination content ID here.

  (* Find the local VDI *)
  let vdis = Local.SR.scan ~dbg ~sr in
  let local_vdi =
    try List.find (fun x -> x.vdi = vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

  debug "copy local=%s/%s content_id=%s" sr vdi local_vdi.content_id;
  debug "copy remote=%s/%s content_id=%s" dest dest_vdi remote_vdi.content_id;

Find the source VDI metadata.

  if local_vdi.virtual_size > remote_vdi.virtual_size then begin
    (* This should never happen provided the higher-level logic is working properly *)
    error "copy local=%s/%s virtual_size=%Ld > remote=%s/%s virtual_size = %Ld" sr vdi local_vdi.virtual_size dest dest_vdi remote_vdi.virtual_size;
    failwith "local VDI is larger than the remote VDI";
  end;

Sanity check - the remote VDI can’t be smaller than the source.

  let on_fail : (unit -> unit) list ref = ref [] in

We do some ugly error handling here by keeping a mutable list of operations to perform in the event of a failure.
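
The pattern is to push an undo action onto on_fail whenever a resource is created, and to run the recorded actions if anything later fails. The perform_cleanup_actions helper used at the bottom of the function is not shown in this walkthrough; a plausible sketch of it (an assumption, not the actual implementation) is:

(* Run each recorded cleanup action, logging and ignoring any exception so
   that one failing action does not prevent the others from running. *)
let perform_cleanup_actions actions =
  List.iter
    (fun f ->
       try f ()
       with e -> error "Ignoring exception in cleanup action: %s" (Printexc.to_string e))
    actions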

  let base_vdi =
    try
      let x = (List.find (fun x -> x.content_id = dest_content_id) vdis).vdi in
      debug "local VDI %s has content_id = %s; we will perform an incremental copy" x dest_content_id;
      Some x
    with _ ->
      debug "no local VDI has content_id = %s; we will perform a full copy" dest_content_id;
      None
  in

See if we can identify a local VDI with the same content_id as the destination. If not, no problem.

  try
    let remote_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in
    let base_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in
    let leaf_dp = Uuid.string_of_uuid (Uuid.make_uuid ()) in

Construct some datapaths - named reasons why the VDI is attached - that we will pass to VDI.attach/activate.

    let dest_vdi_url = Http.Url.set_uri remote_url (Printf.sprintf "%s/nbd/%s/%s/%s" (Http.Url.get_uri remote_url) dest dest_vdi remote_dp) |> Http.Url.to_string in

    debug "copy remote=%s/%s NBD URL = %s" dest dest_vdi dest_vdi_url;

Here we are constructing a URI that we use to connect to the destination xapi. The handler for this particular path will verify the credentials and then pass the connection on to tapdisk, which will behave as an NBD server. The VDI has to be attached and activated for this to work, unlike with the newer, smarter NBD handler in xapi-nbd. The handler for this URI is declared in this file.

    let id=State.copy_id_of (sr,vdi) in
    debug "Persisting state for copy (id=%s)" id;
    State.add id State.(Copy_op Copy_state.({
        base_dp; leaf_dp; remote_dp; dest_sr=dest; copy_vdi=remote_vdi.vdi; remote_url=url}));

Since we’re about to perform a long-running operation that is stateful, we persist the state here so that if xapi is restarted we can cancel the operation and not leak VDI attaches. Normally in xapi code we would be doing VBD.plug operations to persist the state in the xapi db, but this is storage code so we have to use a different mechanism.

    SMPERF.debug "mirror.copy: copy initiated local_vdi:%s dest_vdi:%s" vdi dest_vdi;

    Pervasiveext.finally (fun () ->
        debug "activating RW datapath %s on remote=%s/%s" remote_dp dest dest_vdi;
        ignore(Remote.VDI.attach ~dbg ~sr:dest ~vdi:dest_vdi ~dp:remote_dp ~read_write:true);
        Remote.VDI.activate ~dbg ~dp:remote_dp ~sr:dest ~vdi:dest_vdi;

        with_activated_disk ~dbg ~sr ~vdi:base_vdi ~dp:base_dp
          (fun base_path ->
             with_activated_disk ~dbg ~sr ~vdi:(Some vdi) ~dp:leaf_dp
               (fun src ->
                  let dd = Sparse_dd_wrapper.start ~progress_cb:(progress_callback 0.05 0.9 task) ?base:base_path true (Opt.unbox src)
                      dest_vdi_url remote_vdi.virtual_size in
                  Storage_task.with_cancel task
                    (fun () -> Sparse_dd_wrapper.cancel dd)
                    (fun () ->
                       try Sparse_dd_wrapper.wait dd
                       with Sparse_dd_wrapper.Cancelled -> Storage_task.raise_cancelled task)
               )
          );
      )
      (fun () ->
         Remote.DP.destroy ~dbg ~dp:remote_dp ~allow_leak:false;
         State.remove_copy id
      );

In this chunk of code we attach and activate the disk on the remote SR via the SMAPI, then locally attach and activate both the VDI we’re copying and the base image we’re copying deltas from (if we’ve got one). We then call sparse_dd to copy the data to the remote NBD URL. There is some logic to update progress indicators and to cancel the operation if the SMAPIv2 call TASK.cancel is called.

Once the operation has terminated (either on success, error or cancellation), we remove the local attach and activations in the with_activated_disk function and the remote attach and activation by destroying the datapath on the remote SR. We then remove the persistent state relating to the copy.

    SMPERF.debug "mirror.copy: copy complete local_vdi:%s dest_vdi:%s" vdi dest_vdi;

    debug "setting remote=%s/%s content_id <- %s" dest dest_vdi local_vdi.content_id;
    Remote.VDI.set_content_id ~dbg ~sr:dest ~vdi:dest_vdi ~content_id:local_vdi.content_id;
    (* PR-1255: XXX: this is useful because we don't have content_ids by default *)
    debug "setting local=%s/%s content_id <- %s" sr local_vdi.vdi local_vdi.content_id;
    Local.VDI.set_content_id ~dbg ~sr ~vdi:local_vdi.vdi ~content_id:local_vdi.content_id;
    Some (Vdi_info remote_vdi)

The last thing we do is to set the local and remote content_id. The local set_content_id is there because, if the content_id of the VDI is unset, it is constructed from the location in the storage_access.ml module of xapi (still part of the storage layer).

  with e ->
    error "Caught %s: performing cleanup actions" (Printexc.to_string e);
    perform_cleanup_actions !on_fail;
    raise e

Here we perform the list of cleanup operations - theoretically. It seems we never actually add anything to this list, so this is dead code.

DATA.MIRROR.start

let start' ~task ~dbg ~sr ~vdi ~dp ~url ~dest =
  debug "Mirror.start sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  SMPERF.debug "mirror.start called sr:%s vdi:%s url:%s dest:%s" sr vdi url dest;
  let remote_url = Http.Url.of_string url in
  let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in

  (* Find the local VDI *)
  let vdis = Local.SR.scan ~dbg ~sr in
  let local_vdi =
    try List.find (fun x -> x.vdi = vdi) vdis
    with Not_found -> failwith (Printf.sprintf "Local VDI %s not found" vdi) in

As with the previous calls, we make a Remote module for SMAPIv2 calls on the destination, and we find the local VDI metadata via SR.scan.

  let id = State.mirror_id_of (sr,local_vdi.vdi) in

Mirror ids are deterministically constructed.
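
For illustration, an id of roughly the following shape is enough: it can be recomputed from the (sr, vdi) pair alone, for example after a toolstack restart, so the mirror can always be found again. The exact format used by State.mirror_id_of may differ.

(* Illustrative only: derive a stable mirror id from the SR and VDI. *)
let mirror_id_of (sr, vdi) = Printf.sprintf "%s/%s" sr vdi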

  (* A list of cleanup actions to perform if the operation should fail. *)
  let on_fail : (unit -> unit) list ref = ref [] in

This on_fail list is actually used.

  try
    let similar_vdis = Local.VDI.similar_content ~dbg ~sr ~vdi in
    let similars = List.filter (fun x -> x <> "") (List.map (fun vdi -> vdi.content_id) similar_vdis) in
    debug "Similar VDIs to %s = [ %s ]" vdi (String.concat "; " (List.map (fun x -> Printf.sprintf "(vdi=%s,content_id=%s)" x.vdi x.content_id) similar_vdis));

As with copy, we look locally for similar VDIs. However, rather than using that information here, we pass it on to the destination SR via the receive_start internal SMAPIv2 call:

    let result_ty = Remote.DATA.MIRROR.receive_start ~dbg ~sr:dest ~vdi_info:local_vdi ~id ~similar:similars in
    let result = match result_ty with
        Mirror.Vhd_mirror x -> x
    in

This gives the destination SR a chance to say what sort of migration it can support. We only support Vhd_mirror style migrations which require the destination to support the compose SMAPIv2 operation. The type of x is a record:

type mirror_receive_result_vhd_t = {
	mirror_vdi : vdi_info;
	mirror_datapath : dp;
	copy_diffs_from : content_id option;
	copy_diffs_to : vdi;
	dummy_vdi : vdi;
}

Field descriptions:

  • mirror_vdi is the VDI to which new writes should be mirrored.
  • mirror_datapath is the remote datapath on which the VDI has been attached and activated. This is required to construct the remote NBD URL.
  • copy_diffs_from represents the source base VDI to be used for the non-mirrored data copy.
  • copy_diffs_to is the remote VDI to copy those diffs to.
  • dummy_vdi exists to prevent leaf-coalesce on the mirror_vdi.

    (* Enable mirroring on the local machine *)
    let mirror_dp = result.Mirror.mirror_datapath in

    let uri = (Printf.sprintf "/services/SM/nbd/%s/%s/%s" dest result.Mirror.mirror_vdi.vdi mirror_dp) in
    let dest_url = Http.Url.set_uri remote_url uri in
    let request = Http.Request.make ~query:(Http.Url.get_query_params dest_url) ~version:"1.0" ~user_agent:"smapiv2" Http.Put uri in
    let transport = Xmlrpc_client.transport_of_url dest_url in

This is where we connect to the NBD server on the destination.

    debug "Searching for data path: %s" dp;
    let attach_info = Local.DP.attach_info ~dbg:"nbd" ~sr ~vdi ~dp in
    debug "Got it!";

We need the local attach_info to find the local tapdisk so that we can send it the connected NBD socket.

    on_fail := (fun () -> Remote.DATA.MIRROR.receive_cancel ~dbg ~id) :: !on_fail;

This should probably be set directly after the call to receive_start.

    let tapdev = match tapdisk_of_attach_info attach_info with
      | Some tapdev ->
        debug "Got tapdev";
        let pid = Tapctl.get_tapdisk_pid tapdev in
        let path = Printf.sprintf "/var/run/blktap-control/nbdclient%d" pid in
        with_transport transport (with_http request (fun (response, s) ->
            debug "Here inside the with_transport";
            let control_fd = Unix.socket Unix.PF_UNIX Unix.SOCK_STREAM 0 in
            finally
              (fun () ->
                 debug "Connecting to path: %s" path;
                 Unix.connect control_fd (Unix.ADDR_UNIX path);
                 let msg = dp in
                 let len = String.length msg in
                 let written = Unixext.send_fd control_fd msg 0 len [] s in
                 debug "Sent fd";
                 if written <> len then begin
                   error "Failed to transfer fd to %s" path;
                   failwith "foo"
                 end)
              (fun () ->
                 Unix.close control_fd)));
        tapdev
      | None ->
        failwith "Not attached"
    in

Here we connect to the remote NBD server, then pass that connected fd to the local tapdisk that is using the disk. This fd is passed with a name that is later used to tell tapdisk to start using it - we use the datapath name for this.

    debug "Adding to active local mirrors: id=%s" id;
    let alm = State.Send_state.({
        url;
        dest_sr=dest;
        remote_dp=mirror_dp;
        local_dp=dp;
        mirror_vdi=result.Mirror.mirror_vdi.vdi;
        remote_url=url;
        tapdev;
        failed=false;
        watchdog=None}) in
    State.add id (State.Send_op alm);
    debug "Added";

As for copy, we persist some state to disk to record that we’re doing a mirror, so that we can undo any state changes after a toolstack restart.

    debug "About to snapshot VDI = %s" (string_of_vdi_info local_vdi);
    let local_vdi = add_to_sm_config local_vdi "mirror" ("nbd:" ^ dp) in
    let local_vdi = add_to_sm_config local_vdi "base_mirror" id in
    let snapshot =
    try
      Local.VDI.snapshot ~dbg ~sr ~vdi_info:local_vdi
    with
    | Storage_interface.Backend_error(code, _) when code = "SR_BACKEND_FAILURE_44" ->
      raise (Api_errors.Server_error(Api_errors.sr_source_space_insufficient, [ sr ]))
    | e ->
      raise e
    in
    debug "Done!";

    SMPERF.debug "mirror.start: snapshot created, mirror initiated vdi:%s snapshot_of:%s"
      snapshot.vdi local_vdi.vdi ;

    on_fail := (fun () -> Local.VDI.destroy ~dbg ~sr ~vdi:snapshot.vdi) :: !on_fail;

This bit inserts into sm_config the name of the fd we passed earlier for mirroring. This is interpreted by the Python SM backends and passed on to the tap-ctl invocation to unpause the disk. This causes all new writes to be mirrored via NBD to the file descriptor passed earlier.

    begin
      let rec inner () =
        debug "tapdisk watchdog";
        let alm_opt = State.find_active_local_mirror id in
        match alm_opt with
        | Some alm ->
          let stats = Tapctl.stats (Tapctl.create ()) tapdev in
          if stats.Tapctl.Stats.nbd_mirror_failed = 1 then
            Updates.add (Dynamic.Mirror id) updates;
          alm.State.Send_state.watchdog <- Some (Scheduler.one_shot scheduler (Scheduler.Delta 5) "tapdisk_watchdog" inner)
        | None -> ()
      in inner ()
    end;

This is the watchdog that runs tap-ctl stats every 5 seconds, watching mirror_failed for evidence of a failure in the mirroring code. If it detects one, the only thing it does is to notify that the state of the mirroring has changed. This will be picked up by the thread in xapi that is monitoring the state of the mirror, which will then issue a MIRROR.stat call that returns the state of the mirror, including the information that it has failed.

    on_fail := (fun () -> stop ~dbg ~id) :: !on_fail;
    (* Copy the snapshot to the remote *)
    let new_parent = Storage_task.with_subtask task "copy" (fun () ->
        copy' ~task ~dbg ~sr ~vdi:snapshot.vdi ~url ~dest ~dest_vdi:result.Mirror.copy_diffs_to) |> vdi_info in
    debug "Local VDI %s == remote VDI %s" snapshot.vdi new_parent.vdi;

This is where we copy the VDI returned by the snapshot invocation to the remote VDI called copy_diffs_to. We only copy deltas, but we rely on copy' to figure out which disk the deltas should be taken from, which it does via the content_id field.

    Remote.VDI.compose ~dbg ~sr:dest ~vdi1:result.Mirror.copy_diffs_to ~vdi2:result.Mirror.mirror_vdi.vdi;
    Remote.VDI.remove_from_sm_config ~dbg ~sr:dest ~vdi:result.Mirror.mirror_vdi.vdi ~key:"base_mirror";
    debug "Local VDI %s now mirrored to remote VDI: %s" local_vdi.vdi result.Mirror.mirror_vdi.vdi;

Once the copy has finished we invoke the compose SMAPIv2 call that composes the diffs from the mirror with the base image copied from the snapshot.

    debug "Destroying dummy VDI %s on remote" result.Mirror.dummy_vdi;
    Remote.VDI.destroy ~dbg ~sr:dest ~vdi:result.Mirror.dummy_vdi;
    debug "Destroying snapshot %s on src" snapshot.vdi;
    Local.VDI.destroy ~dbg ~sr ~vdi:snapshot.vdi;

    Some (Mirror_id id)

We can now destroy the dummy VDI on the remote (which will cause a leaf-coalesce in due course), and we destroy the local snapshot here (which will also cause a leaf-coalesce in due course, provided we don’t destroy it first). The return value from the function is the mirror_id that we can use to monitor the state of the mirror or to cancel it.

  with
  | Sr_not_attached(sr_uuid) ->
    error " Caught exception %s:%s. Performing cleanup." Api_errors.sr_not_attached sr_uuid;
    perform_cleanup_actions !on_fail;
    raise (Api_errors.Server_error(Api_errors.sr_not_attached,[sr_uuid]))
  | e ->
    error "Caught %s: performing cleanup actions" (Api_errors.to_string e);
    perform_cleanup_actions !on_fail;
    raise e

The exception handler just cleans up afterwards.

This is not the end of the story, since we need to detach the remote datapath being used for mirroring when we detach this end. The hook function is in storage_migrate.ml:

let post_detach_hook ~sr ~vdi ~dp =
  let open State.Send_state in
  let id = State.mirror_id_of (sr,vdi) in
  State.find_active_local_mirror id |>
  Opt.iter (fun r ->
      let remote_url = Http.Url.of_string r.url in
      let module Remote = Client(struct let rpc = rpc ~srcstr:"smapiv2" ~dststr:"dst_smapiv2" remote_url end) in
      let t = Thread.create (fun () ->
          debug "Calling receive_finalize";
          log_and_ignore_exn
            (fun () -> Remote.DATA.MIRROR.receive_finalize ~dbg:"Mirror-cleanup" ~id);
          debug "Finished calling receive_finalize";
          State.remove_local_mirror id;
          debug "Removed active local mirror: %s" id
        ) () in
      Opt.iter (fun id -> Scheduler.cancel scheduler id) r.watchdog;
      debug "Created thread %d to call receive finalize and dp destroy" (Thread.id t))

This removes the persistent state and calls receive_finalize on the destination. The body of that function is:

let receive_finalize ~dbg ~id =
  let recv_state = State.find_active_receive_mirror id in
  let open State.Receive_state in Opt.iter (fun r -> Local.DP.destroy ~dbg ~dp:r.leaf_dp ~allow_leak:false) recv_state;
  State.remove_receive_mirror id

which removes the persistent state on the destination and destroys the datapath associated with the mirror.

There is also a pre-deactivate hook. The rationale is that we want to detect any write failures that occur right at the end of the SXM process. So if there is a mirror operation going on, before we deactivate we wait for tapdisk to flush its queue of outstanding requests, and then we query whether there has been a mirror failure. The code is just above the detach hook in storage_migrate.ml:

let pre_deactivate_hook ~dbg ~dp ~sr ~vdi =
  let open State.Send_state in
  let id = State.mirror_id_of (sr,vdi) in
  let start = Mtime_clock.counter () in
  let get_delta () = Mtime_clock.count start |> Mtime.Span.to_s in
  State.find_active_local_mirror id |>
  Opt.iter (fun s ->
      try
        (* We used to pause here and then check the nbd_mirror_failed key. Now, we poll
				   until the number of outstanding requests has gone to zero, then check the
				   status. This avoids confusing the backend (CA-128460) *)
        let open Tapctl in
        let ctx = create () in
        let rec wait () =
          if get_delta () > reqs_outstanding_timeout then raise Timeout;
          let st = stats ctx s.tapdev in
          if st.Stats.reqs_outstanding > 0
          then (Thread.delay 1.0; wait ())
          else st
        in
        let st = wait () in
        debug "Got final stats after waiting %f seconds" (get_delta ());
        if st.Stats.nbd_mirror_failed = 1
        then begin
          error "tapdisk reports mirroring failed";
          s.failed <- true
        end;
      with
      | Timeout ->
        error "Timeout out after %f seconds waiting for tapdisk to complete all outstanding requests" (get_delta ());
        s.failed <- true
      | e ->
        error "Caught exception while finally checking mirror state: %s"
          (Printexc.to_string e);
        s.failed <- true
    )

XE CLI architecture

Info

The links in this page point to the source files of xapi v1.132.0, not to the latest source code. Since then, the CLI server code in xapi has been moved to a library separate from the main xapi binary, with its own subdirectory ocaml/xapi-cli-server.

Architecture

  • The actual CLI is a very lightweight binary in ocaml/xe-cli

    • It is just a dumb client that does everything xapi tells it to do
    • This is a security issue
      • We must trust the xenserver that we connect to, because it can tell xe to read local files, download files, …
    • When it is first called, it takes the few command-line arguments it needs, and then passes the rest to xapi in an HTTP POST request
      • Each argument is in a separate line
    • Then it loops, doing what xapi tells it to do, until xapi tells it to exit or an exception happens
  • The protocol description is in ocaml/xapi-cli-protocol/cli_protocol.ml

    • This protocol allows one xe binary to talk to multiple versions of xapi, as long as their CLI protocol versions are compatible
    • and the CLI can be changed without updating the xe binary
    • and it is also more efficient than having the CLI make XenAPI calls directly (see the sketch after this list)
  • Xapi
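
The overall shape of the client side is a small command loop: read a command from xapi, act on it locally, send back a response, and repeat until told to exit. The sketch below uses illustrative types and a hypothetical read_whole_file helper rather than the real definitions in cli_protocol.ml.

(* Illustrative types only - the real wire format and constructors live in
   ocaml/xapi-cli-protocol/cli_protocol.ml. *)
type command = Print of string | LoadFile of string | Exit of int

let rec client_loop ~read_command ~send_response =
  match read_command () with
  | Print s ->
    print_endline s;
    client_loop ~read_command ~send_response
  | LoadFile path ->
    (* xe reads the local file and streams it back to xapi *)
    send_response (read_whole_file path);
    client_loop ~read_command ~send_response
  | Exit code ->
    exit code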

Walk-through: CLI handler in xapi (external calls)

Definitions for the HTTP handler

Constants.cli_uri = "/cli"

Datamodel.http_actions = [...;
  ("post_cli", (Post, Constants.cli_uri, false, [], _R_READ_ONLY, []));
...]

(* these public http actions will NOT be checked by RBAC *)
(* they are meant to be used in exceptional cases where RBAC is already *)
(* checked inside them, such as in the XMLRPC (API) calls *)
Datamodel.public_http_actions_with_no_rbac_check = [...
  "post_cli";  (* CLI commands -> calls XMLRPC *)
...]

Xapi.common_http_handlers = [...;
  ("post_cli", (Http_svr.BufIO Xapi_cli.handler));
...]

Xapi.server_init () =
  ...
  "Registering http handlers", [], (fun () -> List.iter Xapi_http.add_handler common_http_handlers);
  ...

Due to these definitions, Xapi_http.add_handler does not perform RBAC checks for post_cli. This means that the CLI handler does not use Xapi_http.assert_credentials_ok when a request comes in, as most other handlers do. The reason is that RBAC checking is delegated to the actual XenAPI calls made by the commands in Cli_operations.

The Xapi_http.add_handler call above therefore resolves to simply:

Http_svr.Server.add_handler server Http.Post "/cli" (Http_svr.BufIO Xapi_cli.handler)

…which means that the function Xapi_cli.handler is called directly when an HTTP POST request with path /cli comes in.

High-level request processing

Xapi_cli.handler:

  • Reads the body of the HTTP request, limited to Xapi_globs.http_limit_max_cli_size = 200 * 1024 characters.
  • Sends a protocol version string to the client: "XenSource thin CLI protocol" plus the binary-encoded major (0) and minor (2) version numbers.
  • Reads the protocol version from the client and exits with an error if it does not match the above.
  • Calls Xapi_cli.parse_session_and_args with the request’s body to extract the session reference, if present.
  • Calls Cli_frontend.parse_commandline to parse the rest of the command line from the body.
  • Calls Xapi_cli.exec_command to execute the command.
  • On error, calls exception_handler.

Xapi_cli.parse_session_and_args:

  • Is passed the request body and reads it line by line. Each line is considered an argument.
  • Removes any CR chars from the end of each argument.
  • If the first arg starts with session_id=, then the part after this prefix is taken to be a session reference.
  • Returns the session ref (if present) and the (remaining) list of args.

Cli_frontend.parse_commandline:

  • Returns the command name and an assoc list of parameter names and values. It handles --name and -flag arguments by turning them into key/value string pairs (see the sketch below).
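
As an illustration (this is not the actual Cli_frontend code), a command line such as vm-start uuid=<uuid> --force ends up as the command name "vm-start" plus the pairs ("uuid", "<uuid>") and ("force", "true"):

(* Illustrative only: split xe arguments into (name, value) pairs, turning
   "--flag" style arguments into ("flag", "true"). *)
let parse_commandline = function
  | [] -> failwith "no command given"
  | cmd :: args ->
    let param_of arg =
      match String.index_opt arg '=' with
      | Some i ->
        (String.sub arg 0 i, String.sub arg (i + 1) (String.length arg - i - 1))
      | None ->
        let name =
          if String.length arg > 2 && String.sub arg 0 2 = "--"
          then String.sub arg 2 (String.length arg - 2)
          else arg
        in
        (name, "true")
    in
    (cmd, List.map param_of args)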

Xapi_cli.exec_command:

  • Finds username/password params.
  • Gets the rpc function: this is the so-called “fake_rpc callback”, which does not use the network or HTTP at all, but goes straight to Api_server.callback1 (the XenAPI RPC entry point). This function is used by the CLI handler to do loopback XenAPI calls.
  • Logs the parsed xe command, omitting sensitive data.
  • Continues as Xapi_cli.do_rpcs
  • Looks up the command name in the command table from Cli_frontend (raises an error if not found).
  • Checks if all required params have been supplied (raises an error if not).
  • Checks that the host is a pool master (raises an error if not).
  • Depending on the command, a session.login_with_password or session.slave_local_login_with_password XenAPI call is made with the supplied username and password. If the authentication passes, then a session reference is returned for the RBAC role that belongs to the user. This session is used to do further XenAPI calls.
  • Next, the implementation of the command in Cli_operations is executed.

Command implementations

The various commands are implemented in cli_operations.ml. These functions are only called after user authentication has passed (see above). However, RBAC restrictions are only enforced inside any XenAPI calls that are made, and not on any of the other code in cli_operations.ml.

The type of each command implementation function is as follows (see cli_cmdtable.ml):

type op =
  Cli_printer.print_fn ->
  (Rpc.call -> Rpc.response) ->
  API.ref_session -> ((string*string) list) -> unit

So each function receives a printer for sending text output to the xe client, an rpc function and session reference for making XenAPI calls, and a list of key/value parameter pairs. Here is a typical example:

let bond_create printer rpc session_id params =
  let network = List.assoc "network-uuid" params in
  let mac = List.assoc_default "mac" params "" in
  let network = Client.Network.get_by_uuid rpc session_id network in
  let pifs = List.assoc "pif-uuids" params in
  let uuids = String.split ',' pifs in
  let pifs = List.map (fun uuid -> Client.PIF.get_by_uuid rpc session_id uuid) uuids in
  let mode = Record_util.bond_mode_of_string (List.assoc_default "mode" params "") in
  let properties = read_map_params "properties" params in
  let bond = Client.Bond.create rpc session_id network pifs mac mode properties in
  let uuid = Client.Bond.get_uuid rpc session_id bond in
  printer (Cli_printer.PList [ uuid])

  • The necessary parameters are looked up in params using List.assoc or similar.
  • UUIDs are translated into references by get_by_uuid XenAPI calls (note that the Client module is the XenAPI client, and functions in there require the rpc function and session reference).
  • Then the main API call is made (Client.Bond.create in this case).
  • Further API calls may be made to gather data for the client, which is then passed to the printer.

This is the common case for CLI operations: they do API calls based on the parameters that were passed in.

However, other commands are more complicated, for example vm_import/export and vm_migrate. These contain a lot more logic in the CLI commands, and also send commands to the client to instruct it to read or write files and/or do HTTP calls.

Yet other commands do not actually do any XenAPI calls, but instead get “helpful” information from other places. Example: diagnostic_gc_stats, which displays statistics from xapi’s OCaml GC.
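
As a sketch of what such a command can look like (this is not the actual diagnostic_gc_stats implementation), a CLI operation that never touches the XenAPI simply ignores the rpc function and session and prints locally gathered data:

(* Illustrative only: a cli_operations-style function that reports a few
   OCaml GC statistics without making any XenAPI calls. *)
let gc_stats_example printer _rpc _session_id _params =
  let s = Gc.stat () in
  printer
    (Cli_printer.PList
       [ Printf.sprintf "heap_words: %d" s.Gc.heap_words
       ; Printf.sprintf "live_words: %d" s.Gc.live_words
       ; Printf.sprintf "compactions: %d" s.Gc.compactions
       ])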

Tutorials

The following tutorials show how to extend the CLI (and XenAPI):