| Design document | |
|---|---|
| Revision | v1 |
| Status | proposed |
| Review | create new issue |
OCFS2 is a (host-)clustered filesystem which runs on top of a shared raw block device. Hosts using OCFS2 form a cluster using a combination of network and storage heartbeats and host fencing to avoid split-brain.
The following diagram shows the proposed architecture with `xapi`:
Please note the following:

- `xapi` will be modified to configure the cluster and manage the cluster node numbers.
- `xhad` is a process which runs in userspace but in the realtime scheduling class so it has priority over all other userspace tasks. `xhad` sends heartbeats via the `ha_statefile` VDI and via UDP, and uses the Xen watchdog for host fencing.
- `o2cb` is a kernel driver which sends heartbeats via the `o2cb_statefile` and via TCP, fencing the host by panicking domain 0.

OCFS2 uses the O2CB "cluster stack" which is similar to our `xhad`. To configure O2CB we need to:
In the current Xapi toolstack there is a single global implicit cluster called a "Pool" which is used for: resource locking, "clustered" storage repositories and fault handling (in HA). In the long term we will allow these types of clusters to be managed separately or all together, depending on the sophistication of the admin and the complexity of their environment. We will take a small step in that direction by keeping the OCFS2 O2CB cluster management code at "arm's length" from the Xapi `Pool.join` code.
In xcp-idl we will define a new API category called "Cluster" (in addition to the categories for Xen domains, ballooning, stats, networking and storage). These APIs will only be called by Xapi on localhost. In particular they will not be called across hosts and therefore do not have to be backward compatible. These are "cluster plugin APIs".
We will define the following APIs:
- `Plugin:Membership.create`: add a host to a cluster. On exit the local host cluster software will know about the new host but it may need to be restarted before the change takes effect.
  - `hostname:string`: the hostname of the management domain
  - `uuid:string`: a UUID identifying the host
  - `id:int`: the lowest available unique integer identifying the host, where an integer will never be re-used unless it is guaranteed that all nodes have forgotten any previous state associated with it
  - `address:string list`: a list of addresses through which the host can be contacted
- `Plugin:Membership.destroy`: removes a named host from the cluster. On exit the local host software will know about the change but it may need to be restarted before it can take effect.
  - `uuid:string`: the UUID of the host to remove
- `Plugin:Cluster.query`: queries the state of the cluster
  - `maintenance_required:bool`: true if there is some outstanding configuration change which cannot take effect until the cluster is restarted.
  - `hosts`: a list of all known hosts together with a state including: whether they are known to be alive or dead; or whether they are currently "excluded" because the cluster software needs to be restarted
- `Plugin:Cluster.start`: turn on the cluster software and let the local host join
- `Plugin:Cluster.stop`: turn off the cluster software
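To make the shape of these plugin calls concrete, here is a minimal OCaml sketch of the corresponding signatures. The type and module names are assumptions for illustration only; the real xcp-idl definitions would be generated from an IDL and may differ.

```ocaml
(* Sketch only: hypothetical OCaml signatures for the proposed cluster
   plugin APIs. Names and types are illustrative, not the real xcp-idl. *)

type host_state =
  | Alive
  | Dead
  | Excluded  (* present in the config but the cluster software needs a restart *)

type host = {
  uuid : string;            (* UUID identifying the host *)
  id : int;                 (* lowest available unique integer, never re-used *)
  hostname : string;        (* hostname of the management domain *)
  addresses : string list;  (* addresses through which the host can be contacted *)
  state : host_state;
}

type cluster_state = {
  maintenance_required : bool; (* a config change is pending a cluster restart *)
  hosts : host list;
}

module type MEMBERSHIP = sig
  (* Add a host to the cluster configuration; a restart may be needed
     before the change takes effect. *)
  val create :
    hostname:string -> uuid:string -> id:int -> address:string list -> unit

  (* Remove a named host from the cluster configuration. *)
  val destroy : uuid:string -> unit
end

module type CLUSTER = sig
  val query : unit -> cluster_state  (* inspect the current cluster state *)
  val start : unit -> unit           (* turn on the cluster software and join *)
  val stop : unit -> unit            (* turn off the cluster software *)
end
```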
Xapi will be modified to:

- add a new `Cluster` object which will have columns (see the datamodel sketch after this list):
  - `name: string`: this is the name of the Cluster plugin (TODO: use same terminology as SM?)
  - `configuration: Map(String,String)`: this will contain any cluster-global information, overrides for default values etc.
  - `enabled: Bool`: this is true when the cluster "should" be running. It may require maintenance to synchronise changes across the hosts.
  - `maintenance_required: Bool`: this is true when the cluster needs to be placed into maintenance mode to resync its configuration
- add `XenAPI:Cluster.enable` which sets `enabled=true` and waits for all hosts to report `Membership.enabled=true`.
- add `XenAPI:Cluster.disable` which sets `enabled=false` and waits for all hosts to report `Membership.enabled=false`.
- add a new `Membership` object which will have columns:
  - `id: int`: automatically generated lowest available unique integer starting from 0
  - `cluster: Ref(Cluster)`: the type of cluster. This will never be NULL.
  - `host: Ref(host)`: the host which is a member of the cluster. This may be NULL.
  - `left: Date`: if not 1/1/1970 this means the time at which the host left the cluster.
  - `maintenance_required: Bool`: this is true when the Host believes the cluster needs to be placed into maintenance mode.
- add a new field `Host.memberships: Set(Ref(Membership))`
- extend `vdi_type` to include `o2cb_statefile` as well as `ha_statefile`
- add `Pool.enable_o2cb` with arguments:
  - `heartbeat_sr: Ref(SR)`: the SR to use for global heartbeats
  - `configuration: Map(String,String)`: available for future configuration tweaks
  Like `Pool.enable_ha`, this will find or create the heartbeat VDI, create the `Cluster` entry and the `Membership` entries. All `Memberships` will have `maintenance_required=true` reflecting the fact that the desired cluster state is out-of-sync with the actual cluster state.
- add `XenAPI:Membership.enable` with arguments:
  - `self:Host`: the host to modify
  - `cluster:Cluster`: the cluster.
- add `XenAPI:Membership.disable` with arguments:
  - `self:Host`: the host to modify
  - `cluster:Cluster`: the cluster name.
- add code which watches the `Host.memberships` field and calls `Plugin:Membership.create` and `Plugin:Membership.destroy` to keep the local cluster software up-to-date when any host in the pool changes its configuration
- call `Plugin:Cluster.query` after a `Plugin:Membership.create` or `Plugin:Membership.destroy` to see whether the SR needs maintenance
- …the `Membership` record's `left` date, deletes the `Membership`.
- modify `XenAPI:Pool.join` to resync with the master's `Host.memberships` list.
- modify `XenAPI:Pool.eject` to:
  - call `Membership.disable` in the cluster plugin to stop the `o2cb` service
  - call `Membership.destroy` in the cluster plugin to remove every other host from the local configuration
  - remove the `Host` metadata from the pool
  - set `XenAPI:Membership.left` to `NOW()`
- modify `XenAPI:Host.forget` to:
  - remove the `Host` metadata from the pool
  - set `XenAPI:Membership.left` to `NOW()`
  - set `XenAPI:Cluster.maintenance_required` to true
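For concreteness, a rough sketch of how the proposed `Cluster` and `Membership` rows could be represented, using the columns listed above. The OCaml encoding is illustrative only: the real xapi datamodel is generated code and references would be opaque XenAPI refs rather than plain strings.

```ocaml
(* Illustrative only: a hand-written approximation of the proposed rows. *)

type cluster = {
  name : string;                           (* name of the Cluster plugin *)
  configuration : (string * string) list;  (* cluster-global overrides etc. *)
  enabled : bool;                          (* the cluster "should" be running *)
  cluster_maintenance_required : bool;     (* resync needed across hosts; in the
                                              datamodel this is also called
                                              maintenance_required *)
}

type membership = {
  id : int;                    (* lowest available unique integer from 0 *)
  cluster : string;            (* Ref(Cluster); never NULL *)
  host : string option;        (* Ref(host); may be NULL *)
  left : float;                (* epoch seconds; 0.0 means "never left" *)
  maintenance_required : bool; (* this host thinks maintenance is needed *)
}

(* Allocate the lowest unused node id. Departed hosts keep their Membership
   rows (with [left] set), so their ids stay in [existing] and are never
   re-used until those rows are finally deleted. *)
let next_id (existing : membership list) : int =
  let used = List.map (fun m -> m.id) existing in
  let rec go n = if List.mem n used then go (n + 1) else n in
  go 0
```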
A Cluster plugin called "o2cb" will be added which implements:

- `Plugin:Membership.destroy`
- `Plugin:Membership.create`
- `Plugin:Cluster.start`: find the VDI with `type=o2cb_statefile`; add this to the "static-vdis" list; `chkconfig` the service on. We will use the global heartbeat mode of `o2cb`.
- `Plugin:Cluster.stop`: stop the service; `chkconfig` the service off; remove the "static-vdis" entry; leave the VDI itself alone
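A minimal sketch of what the plugin's start/stop could look like in dom0, assuming the usual `static-vdis` and `chkconfig`/`service` helpers are available. The exact command invocations and the VDI lookup are assumptions for illustration, not the plugin's actual implementation.

```ocaml
(* Sketch only: start/stop for a hypothetical "o2cb" cluster plugin. *)

let run cmd =
  (* Fail loudly if a command exits non-zero. *)
  match Unix.system cmd with
  | Unix.WEXITED 0 -> ()
  | _ -> failwith (Printf.sprintf "command failed: %s" cmd)

(* [find_statefile_vdi ()] is assumed to return the uuid of the VDI with
   type=o2cb_statefile; in xapi this would be a database query. *)
let cluster_start ~find_statefile_vdi =
  let vdi_uuid = find_statefile_vdi () in
  (* Make the heartbeat VDI appear at a stable path at boot time. *)
  run (Printf.sprintf "static-vdis add %s o2cb-global-heartbeat" vdi_uuid);
  run "chkconfig o2cb on";   (* enable the service on future boots *)
  run "service o2cb start"   (* and start it now *)

let cluster_stop ~find_statefile_vdi =
  let vdi_uuid = find_statefile_vdi () in
  run "service o2cb stop";
  run "chkconfig o2cb off";
  (* Remove the static-vdis entry but leave the VDI itself alone. *)
  run (Printf.sprintf "static-vdis del %s" vdi_uuid)
```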
Summary of differences between this and xHA:

- with `o2cb` we should be able to have `join` work live and only `eject` requires maintenance mode

We need to ensure `o2cb` and `xhad` do not conflict by fencing hosts at the same time. We shall:
- use the default `o2cb` timeouts (hosts fence if no I/O in 60s): this needs to be short because disk I/O on otherwise working hosts can be blocked while another host is failing/has failed.
- make the `xhad` host fence timeouts much longer: 300s. It's much more important that this is reliable than fast. We will make this change globally and not just when using OCFS2.
In the `xhad` config we will cap the `HeartbeatInterval` and `StatefileInterval` at 5s (the default otherwise would be 31s). This means that 60 heartbeat messages (300s / 5s) have to be lost before `xhad` concludes that the host has failed.
The SM plugin `OCFS2` will be a file-based plugin.

TODO: which file format by default?

The SM plugin will first check whether the `o2cb` cluster is active and fail operations if it is not.
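As an illustration of that guard, here is a small sketch which treats the cluster as active when O2CB has registered a cluster in configfs. The configfs path and the choice to fail with an exception are assumptions, not the SM plugin's actual mechanism.

```ocaml
(* Sketch only: refuse to run an SR operation unless the o2cb cluster is
   online. The configfs path below is an assumption about O2CB's layout. *)

let o2cb_cluster_active () =
  let dir = "/sys/kernel/config/cluster" in
  Sys.file_exists dir && Sys.is_directory dir
  && Array.length (Sys.readdir dir) > 0

let require_o2cb_cluster () =
  if not (o2cb_cluster_active ()) then
    failwith "The o2cb cluster is not active: refusing to run this SR operation"

(* Every SM operation would call [require_o2cb_cluster] before touching disk:
   let sr_attach ... = require_o2cb_cluster (); ... *)
```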
When either HA or OCFS2's O2CB "fences" the host it will look to the admin like a host crash and reboot. We need to (in priority order):
If heartbeat I/O fails for more than 60s when running `o2cb` then the host will fence.
This can happen either:

- for a good reason: for example the host software may have deadlocked or someone may have pulled out a network cable.
- for a bad reason: for example a network bond link failure may have been ignored and then the second link failed; or the heartbeat thread may have been starved of I/O bandwidth by other processes.
Since the consequences of fencing are severe – all VMs on the host crash simultaneously – it is important to avoid the host fencing for bad reasons.
We should recommend that all users:
Furthermore we need to help users monitor their I/O paths. It's no good if they use a bonded network but fail to notice when one of the paths has failed.
The current XenServer HA implementation generates the following I/O-related alerts:
- `HA_HEARTBEAT_APPROACHING_TIMEOUT` (priority 5 "informational"): when half the network heartbeat timeout has been reached.
- `HA_STATEFILE_APPROACHING_TIMEOUT` (priority 5 "informational"): when half the storage heartbeat timeout has been reached.
- `HA_NETWORK_BONDING_ERROR` (priority 3 "service degraded"): when one of the bond links has failed.
- `HA_STATEFILE_LOST` (priority 2 "service loss imminent"): when the storage heartbeat has completely failed and only the network heartbeat is left.

Unfortunately alerts are triggered on "edges" i.e. when state changes, and not on "levels" so it is difficult to see whether the link is currently broken.
We should define datasources suitable for use by xcp-rrdd to expose the current state (and the history) of the I/O paths as follows:
- `pif_<name>_paths_failed`: the total number of paths which we know have failed.
- `pif_<name>_paths_total`: the total number of paths which are configured.
- `sr_<name>_paths_failed`: the total number of storage paths which we know have failed.
- `sr_<name>_paths_total`: the total number of storage paths which are configured.

The `pif` datasources should be generated by `xcp-networkd` which already has a network bond monitoring thread. The `sr` datasources should be generated by `xcp-rrdd` plugins since there is no storage daemon to generate them.

We should create RRDs using the `MAX` consolidation function, otherwise information about failures will be lost by averaging.
XenCenter (and any diagnostic tools) should warn when the system is at risk of fencing, in particular if any of the following are true:

- `pif_<name>_paths_failed` is non-zero
- `sr_<name>_paths_failed` is non-zero
- `pif_<name>_paths_total` is less than 2
- `sr_<name>_paths_total` is less than 2

XenCenter (and any diagnostic tools) should warn if any of the following have been true over the past 7 days:

- `pif_<name>_paths_failed` is non-zero
- `sr_<name>_paths_failed` is non-zero

The network and storage paths used by heartbeats must remain responsive otherwise the host will fence (i.e. the host and all VMs will crash).
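To make the warning rule above concrete, here is a small sketch of the check a diagnostic tool could apply to these datasources; the record type and function names are assumptions for illustration.

```ocaml
(* Sketch only: evaluate the "at risk of fencing" rule over the proposed
   path datasources. Types and names are illustrative. *)

type path_counters = {
  paths_failed : int; (* e.g. pif_<name>_paths_failed or sr_<name>_paths_failed *)
  paths_total : int;  (* e.g. pif_<name>_paths_total or sr_<name>_paths_total *)
}

(* A single PIF or SR is "at risk" if a path has failed or there is no
   redundancy (fewer than 2 configured paths). *)
let at_risk c = c.paths_failed > 0 || c.paths_total < 2

(* The host is at risk of fencing if any monitored network or storage
   object is at risk. *)
let host_at_risk_of_fencing ~pifs ~srs =
  List.exists at_risk pifs || List.exists at_risk srs
```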
Outstanding issue: how slow can `multipathd` get? How does it scale with the number of LUNs?
When a host crashes the effect on the user is severe: all the VMs will also crash. In cases where the host crashed for a bad reason (such as a single failure after a configuration error) we must help the user understand how they can avoid the same situation happening again.
We must make sure the crash kernel runs reliably when `xhad` and `o2cb` fence the host.
Xcp-rrdd will be modified to store RRDs in `mmap(2)`'d files in the dom0 filesystem (rather than in-memory). Xcp-rrdd will call `msync(2)` every 5s to ensure the historical records have hit the disk. We should use the same on-disk format as RRDtool (or as close to it as makes sense) because it has already been optimised to minimise the amount of I/O.
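A minimal sketch of the map-and-flush idea follows, assuming it is acceptable to approximate `msync(2)` with `Unix.fsync` on the underlying descriptor (stock OCaml has no msync binding; a real implementation would add a small C stub). It needs the `unix` and `threads` libraries.

```ocaml
(* Sketch only: keep time-series data in a file-backed mapping and flush it
   to disk every 5 seconds. fsync is used here as a stand-in for msync(2). *)

let with_mapped_rrd path size f =
  let fd = Unix.openfile path [ Unix.O_RDWR; Unix.O_CREAT ] 0o644 in
  Unix.ftruncate fd size;
  let map =
    Unix.map_file fd Bigarray.char Bigarray.c_layout true [| size |]
    |> Bigarray.array1_of_genarray
  in
  let rec flush_loop () =
    Unix.sleep 5;
    Unix.fsync fd;  (* persist pages dirtied through the mapping *)
    flush_loop ()
  in
  ignore (Thread.create flush_loop ());
  f map

(* Usage: [with_mapped_rrd "/var/lib/xcp/rrds.bin" (1 lsl 20) update_fn]
   where [update_fn] writes samples directly into the mapped array.
   The path and size are illustrative. *)
```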
Xapi will be modified to run a crash-dump analyser program `xen-crash-analyse`.

`xen-crash-analyse` will:

- determine whether the host was fenced by:
  - `o2cb`
  - `xhad`
- if the host was fenced by `o2cb` or `xhad` then the analyser will search the `xhad.log` and look for evidence of heartbeats "approaching timeout"

TODO: depending on what information we can determine from the analyser, we will want to record some of it in the `Host_crash_dump` database table.
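A small sketch of the kind of log scan the analyser could perform, assuming the relevant evidence appears as lines mentioning "approaching timeout" in `xhad.log`; the matched substring and log path are assumptions about the format.

```ocaml
(* Sketch only: scan an xhad log for heartbeat "approaching timeout" warnings. *)

let contains ~substring line =
  let n = String.length substring and m = String.length line in
  let rec go i = i + n <= m && (String.sub line i n = substring || go (i + 1)) in
  go 0

let approaching_timeout_lines path =
  let ic = open_in path in
  let rec loop acc =
    match input_line ic with
    | line ->
        if contains ~substring:"approaching timeout" line
        then loop (line :: acc)
        else loop acc
    | exception End_of_file -> close_in ic; List.rev acc
  in
  loop []

(* Usage: approaching_timeout_lines "/var/log/xha.log" returns the matching
   lines (the log path is an assumption). *)
```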
XenCenter will be modified to explain why the host crashed and explain what the user should do to fix it, specifically:

- if the host was fenced by `o2cb` or `xhad` then either …
The documentation should strongly recommend:
`xcp-networkd` will be modified to change the behaviour of the DHCP client. Currently the `dhclient` will wait for a response and eventually background itself. This is a big problem since DHCP can reset the hostname, and this can break `o2cb`. Therefore we must insist that `PIF.reconfigure_ip` becomes fully synchronous, supporting timeout and cancellation. Once the call returns, whether through success or failure, there must not be anything in the background which will change the system's hostname.
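A sketch of the synchronous-with-timeout behaviour, assuming the DHCP client is driven as a child process that must either complete or be killed before `PIF.reconfigure_ip` returns. The helper below is illustrative and not xcp-networkd's actual code.

```ocaml
(* Sketch only: run a command (e.g. a dhclient invocation) synchronously,
   killing it if it exceeds the timeout, so nothing is left in the
   background that could later change the hostname. *)

let run_with_timeout ~timeout_s prog args =
  let pid =
    Unix.create_process prog (Array.of_list (prog :: args))
      Unix.stdin Unix.stdout Unix.stderr
  in
  let deadline = Unix.gettimeofday () +. timeout_s in
  let rec wait () =
    match Unix.waitpid [ Unix.WNOHANG ] pid with
    | 0, _ ->
        if Unix.gettimeofday () > deadline then begin
          (* Timed out: kill the child and reap it so it cannot linger. *)
          Unix.kill pid Sys.sigkill;
          ignore (Unix.waitpid [] pid);
          Error `Timeout
        end else begin
          Unix.sleepf 0.1;
          wait ()
        end
    | _, Unix.WEXITED 0 -> Ok ()
    | _, _ -> Error `Failed
  in
  wait ()

(* e.g. run_with_timeout ~timeout_s:30.0 "/sbin/dhclient" ["-1"; "eth0"]
   ("-1" asks dhclient to try once and exit; the flags are illustrative). *)
```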
TODO: figure out whether we need to request “maintenance mode” for hostname changes.
The purpose of “maintenance mode” is to take a host out of service and leave it in a state where it’s safe to fiddle with it without affecting services in VMs.
XenCenter currently does the following:
- `Host.disable`: prevents new VMs starting here
- `Host.evacuate`: move the running VMs somewhere else

The problems with maintenance mode are:
We should also:

- `PBD.unplug`: all storage. This allows the network to be safely reconfigured. If the network is reconfigured when NFS storage is plugged then the SR can permanently deadlock; if the network is reconfigured when OCFS2 storage is plugged then the host can crash.

TODO: should we add a `Host.prepare_for_maintenance` (better name TBD) to take care of all this without XenCenter having to script it? This would also help CLI and PowerShell users do the right thing.
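For illustration, here is a sketch of what such a `Host.prepare_for_maintenance` sequence might script on top of the existing calls (`Host.disable`, `Host.evacuate`, `PBD.unplug`). The helper functions standing in for the XenAPI client bindings are assumptions of this sketch.

```ocaml
(* Sketch only: the sequence a prepare-for-maintenance call might perform.
   [disable_host], [evacuate_host], [list_pbds] and [unplug_pbd] stand in
   for the real XenAPI client calls. *)

let prepare_for_maintenance ~disable_host ~evacuate_host ~list_pbds ~unplug_pbd host =
  (* 1. Prevent new VMs from starting on this host. *)
  disable_host host;
  (* 2. Move the running VMs somewhere else. *)
  evacuate_host host;
  (* 3. Unplug all storage so the network can be safely reconfigured
        (avoids NFS deadlocks and OCFS2 fencing). *)
  List.iter unplug_pbd (list_pbds host)
```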
TODO: should we insist that the host is rebooted to leave maintenance mode? This would make maintenance mode more reliable and allow us to integrate maintenance mode with xHA (where maintenance mode is a “staged reboot”)
TODO: should we leave all clusters as part of maintenance mode? We probably need to do this to avoid fencing.
Assume you have an existing Pool of 2 hosts. First the client will set up the O2CB cluster, choosing where to put the global heartbeat volume. The client should check that the I/O paths have all been setup correctly with bonding and multipath and prompt the user to fix any obvious problems.
Internally within `Pool.enable_o2cb` Xapi will set up the cluster metadata on every host in the pool:

At this point all hosts have in-sync `cluster.conf` files but all cluster services are disabled. We also have `maintenance_required=true` on all `Membership` entries and the global `Cluster` has `enabled=false`.

The client will now try to enable the cluster with `Cluster.enable`:
Now all hosts are in the cluster and the SR can be created using the standard SM APIs.
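A sketch of the client-side sequence for this walk-through, written against the proposed (not yet existing) XenAPI calls; the function names below are hypothetical stand-ins for those calls.

```ocaml
(* Sketch only: the proposed client flow for enabling OCFS2 on a pool.
   [enable_o2cb], [cluster_enable] and [create_sr] are hypothetical
   stand-ins for the proposed XenAPI calls and the standard SM APIs. *)

let enable_ocfs2_on_pool ~enable_o2cb ~cluster_enable ~create_sr
    ~heartbeat_sr ~data_device =
  (* 1. Create the Cluster and Membership entries and the heartbeat VDI. *)
  let cluster = enable_o2cb ~heartbeat_sr ~configuration:[] in
  (* 2. Start the cluster software on every host and wait for them to join. *)
  cluster_enable cluster;
  (* 3. With the cluster up, the OCFS2 SR can be created as usual. *)
  create_sr ~sr_type:"ocfs2" ~device:data_device
```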
Assume you have an existing Pool of 2 hosts with `o2cb` clustering enabled and at least one `ocfs2` filesystem mounted. If the host is online then `XenAPI:Pool.eject` will:
Note that:

- the `o2cb` `cluster.conf` on the remaining hosts is updated to comment out the former host
- the `Membership` table still remembers the node number of the ejected host; this cannot be re-used until the SR is taken down for maintenance.
- there is now a difference between the hosts' active `cluster.conf` and the one they would use if they restarted the cluster service, so all hosts report that the cluster must be taken offline i.e. `maintenance_required=true`.
.OCFS2 is fundamentally a different type of storage to all existing storage
types supported by xapi. OCFS2 relies upon O2CB, which provides
Host-level High Availability. All HA implementations
(including O2CB and `xhad`) impose restrictions on the server admin to
prevent unnecessary host “fencing” (i.e. crashing). Once we have OCFS2 as
a feature, we will have to live with these restrictions which previously only
applied when HA was explicitly enabled. To reduce complexity we will not try
to enforce restrictions only when OCFS2 is being used or is likely to be used.