How to set up alarms
Introduction
In XAPI, alarms are triggered by a Python daemon located at /opt/xensource/bin/perfmon.
The daemon is managed as a systemd service and can be configured by setting parameters in /etc/sysconfig/perfmon.
It listens on an internal Unix socket to receive commands. Otherwise, it runs in a loop, periodically requesting metrics from XAPI. It can then be configured to generate events based on these metrics. It can monitor various types of XAPI objects, including VMs, SRs, and Hosts. The configuration for each object is defined by writing an XML string into the object’s other-config key.
The metrics used by perfmon are collected by the xcp-rrdd daemon. The xcp-rrdd daemon is a component of XAPI responsible for collecting metrics and storing them as Round-Robin Databases (RRDs).
A XAPI plugin also exists, providing the functions refresh and debug_mem, which send commands through the Unix socket. The refresh function is used when an other-config key is added or updated; it triggers the daemon to reread the monitored objects so that new alerts are taken into account. The debug_mem function logs the objects currently being monitored into /var/log/user.log as a dictionary.
Monitoring and alarms
Overview
- To get the metrics,
perfmonrequests XAPI by calling:http://localhost/rrd_updates?session_id=<ref>&start=1759912021&host=true&sr_uuid=all&cf=AVERAGE&interval=60 - Different consolidation functions can be used like AVERAGE, MIN, MAX or LAST. See the details in the next sections for specific objects and how to set it.
- Once retrieve,
perfmonwill check all its triggers and generate alarms if needed.
Specific XAPI objects
VMs
- To set an alarm on a VM, you need to write an XML string into the
other-configkey of the object. For example, to trigger an alarm when the CPU usage is higher than 50%, run:
xe vm-param-set uuid=<UUID> other-config:perfmon='<config> <variable> <name value="cpu_usage"/> <alarm_trigger_level value="0.5"/> </variable> </config>'- Then, you can either wait until the new configuration is read by the
perfmondaemon or force a refresh by running:
xe host-call-plugin host-uuid=<UUID> plugin=perfmon fn=refresh- Now, if you generate some load inside the VM and the CPU usage goes above 50%, the
perfmondaemon will create a message (a XAPI object) with the name ALARM. This message will include a priority, a timestamp, an obj-uuid and a body. To list all messages that are alarms, run:
xe message-list name=ALARM- You will see, for example:
uuid ( RO) : dadd7cbc-cb4e-5a56-eb0b-0bb31c102c94
name ( RO): ALARM
priority ( RO): 3
class ( RO): VM
obj-uuid ( RO): ea9efde2-d0f2-34bb-74cb-78c303f65d89
timestamp ( RO): 20251007T11:30:26Z
body ( RO): value: 0.986414
config:
<variable>
<name value="cpu_usage"/>
<alarm_trigger_level value="0.5"/>
</variable>where the body contains all the relevant information: the value that triggered the alarm and the configuration of your alarm.
When configuring you alarm, your XML string can:
- have multiple
<variable>nodes - use the following values for child nodes:
- name: what to call the variable (no default)
- alarm_priority: the priority of the messages generated (default ‘3’)
- alarm_trigger_level: level of value that triggers an alarm (no default)
- alarm_trigger_sense:‘high’ if alarm_trigger_level is a max, otherwise ’low’. (default ‘high’)
- alarm_trigger_period: num seconds of ‘bad’ values before an alarm is sent (default ‘60’)
- alarm_auto_inhibit_period: num seconds this alarm disabled after an alarm is sent (default ‘3600’)
- consolidation_fn: how to combine variables from rrd_updates into one value (default is ‘average’ for ‘cpu_usage’, ‘get_percent_fs_usage’ for ‘fs_usage’, ‘get_percent_log_fs_usage’ for ’log_fs_usage’,‘get_percent_mem_usage’ for ‘mem_usage’, & ‘sum’ for everything else)
- rrd_regex matches the names of variables from (xe vm-data-sources-list uuid=$vmuuid) used to compute value (only has defaults for “cpu_usage”, “network_usage”, and “disk_usage”)
- have multiple
Notice that
alarm_prioritywill be the priority of the generatedmessage, 0 being low priority.
SRs
- To set an alarm on an SR object, as with VMs, you need to write an XML string into the
other-configkey of the SR. For example, you can run:
xe sr-param-set uuid=<UUID> other-config:perfmon='<config><variable><name value="physical_utilisation"/><alarm_trigger_level value="0.8"/></variable></config>'- When configuring you alarm, the XML string supports the same child elements as for VMs
Hosts
- As with VMs ans SRs, alarms can be configured by writing an XML string into an
other-configkey. For example, you can run:
xe host-param-set uuid=<UUID> other-config:perfmon=\
'<config><variable><name value="cpu_usage"/><alarm_trigger_level value="0.5"/></variable></config>'The XML string can include multiple
nodes allowed The full list of supported child nodes is:
- name: what to call the variable (no default)
- alarm_priority: the priority of the messages generated (default ‘3’)
- alarm_trigger_level: level of value that triggers an alarm (no default)
- alarm_trigger_sense: ‘high’ if alarm_trigger_level is a max, otherwise ’low’. (default ‘high’)
- alarm_trigger_period: num seconds of ‘bad’ values before an alarm is sent (default ‘60’)
- alarm_auto_inhibit_period:num seconds this alarm disabled after an alarm is sent (default ‘3600’)
- consolidation_fn: how to combine variables from rrd_updates into one value (default is ‘average’ for ‘cpu_usage’ & ‘sum’ for everything else)
- rrd_regex matches the names of variables from (xe host-data-source-list uuid=
) used to compute value (only has defaults for “cpu_usage”, “network_usage”, “memory_free_kib” and “sr_io_throughput_total_xxxxxxxx”) where that last one ends with the first eight characters of the SR UUID)
As a special case for SR throughput, it is also possible to configure a Host by writing XML into the
other-configkey of an SR connected to it. For example:
xe sr-param-set uuid=$sruuid other-config:perfmon=\
'<config><variable><name value="sr_io_throughput_total_per_host"/><alarm_trigger_level value="0.01"/></variable></config>'- This only works for that specific variable name, and
rrd_regexmust not be specified. - Configuration done directly on the host (variable-name, sr_io_throughput_total_xxxxxxxx) takes priority.
Which metrics are available?
- Accepted name for metrics are:
- cpu_usage: matches RRD metrics with the pattern
cpu[0-9]+ - network_usage: matches RRD metrics with the pattern
vif_[0-9]+_[rt]x - disk_usage: match RRD metrics with the pattern
vbd_(xvd|hd)[a-z]+_(read|write) - fs_usage, log_fs_usage, mem_usage and memory_internal_free do not match anything by default.
- cpu_usage: matches RRD metrics with the pattern
- By using
rrd_regex, you can add your own expressions. To get a list of available metrics with their descriptions, you can call theget_data_sourcesmethod for VM, for SR and also for Host. - A python script is provided at the end to get data sources. Using the script we can, for example, see:
# ./get_data_sources.py --vm 5a445deb-0a8e-c6fe-24c8-09a0508bbe21
List of data sources related to VM 5a445deb-0a8e-c6fe-24c8-09a0508bbe21
cpu0 | CPU0 usage
cpu_usage | Domain CPU usage
memory | Memory currently allocated to VM
memory_internal_free | Memory used as reported by the guest agent
memory_target | Target of VM balloon driver
...
vbd_xvda_io_throughput_read | Data read from the VDI, in MiB/s
...- You can then set up an alarm when the data read from a VDI exceeds a certain level by doing:
xe vm-param-set uuid=5a445deb-0a8e-c6fe-24c8-09a0508bbe21 \
other-config:perfmon='<config><variable> \
<name value="disk_usage"/> \
<alarm_trigger_level value="10"/> \
<rrd_regex value="vbd_xvda_io_throughput_read"/> \
</variable> </config>'- Here is the script that allows you to get data sources:
#!/usr/bin/env python3
import argparse
import sys
import XenAPI
def pretty_print(data_sources):
if not data_sources:
print("No data sources.")
return
# Compute alignment for something nice
max_label_len = max(len(data["name_label"]) for data in data_sources)
for data in data_sources:
label = data["name_label"]
desc = data["name_description"]
print(f"{label:<{max_label_len}} | {desc}")
def list_vm_data(session, uuid):
vm_ref = session.xenapi.VM.get_by_uuid(uuid)
data_sources = session.xenapi.VM.get_data_sources(vm_ref)
print(f"\nList of data sources related to VM {uuid}")
pretty_print(data_sources)
def list_host_data(session, uuid):
host_ref = session.xenapi.host.get_by_uuid(uuid)
data_sources = session.xenapi.host.get_data_sources(host_ref)
print(f"\nList of data sources related to Host {uuid}")
pretty_print(data_sources)
def list_sr_data(session, uuid):
sr_ref = session.xenapi.SR.get_by_uuid(uuid)
data_sources = session.xenapi.SR.get_data_sources(sr_ref)
print(f"\nList of data sources related to SR {uuid}")
pretty_print(data_sources)
def main():
parser = argparse.ArgumentParser(
description="List data sources related to VM, host or SR"
)
parser.add_argument("--vm", help="VM UUID")
parser.add_argument("--host", help="Host UUID")
parser.add_argument("--sr", help="SR UUID")
args = parser.parse_args()
# Connect to local XAPI: no identification required to access local socket
session = XenAPI.xapi_local()
try:
session.xenapi.login_with_password("", "")
if args.vm:
list_vm_data(session, args.vm)
if args.host:
list_host_data(session, args.host)
if args.sr:
list_sr_data(session, args.sr)
except XenAPI.Failure as e:
print(f"XenAPI call failed: {e.details}")
sys.exit(1)
finally:
session.xenapi.session.logout()
if __name__ == "__main__":
main()