Measuring, Visualizing, and Optimizing the Energy Consumption of Computer Clusters

Bachelor's Thesis

Nils Steinger, 2017-06-16

This digital distribution contains the following materials related to the thesis:

Software Developed in the Context of this Thesis

The software listed below, and in appendix A of the thesis, was developed to complement the work described there.

Details on the required hardware setup can be found in the chapter "Measurement Technologies".

Intel RAPL Implementations

SNMP Agent Interface

power_gadget_snmp is based on Power Gadget, Intel's reference implementation of an interface to the RAPL energy measurement functionality of modern Intel "Core" and "Xeon" CPUs.

It implements the pass_persist interface used by the Net-SNMP SNMP agent to expose the RAPL measurement values (converted to Joules) via an SNMP subtree rooted at the OID .1.3.6.1.4.1.47670.5.
Its OIDs follow the format .1.3.6.1.4.1.47670.5.<cpu_number>.<rapl_domain>, where cpu_number is the one-based CPU number returned by the RAPL interface, and rapl_domain is one of 1, 2, 3, or 4, signifying the package, core, uncore, and dram RAPL domains, respectively.
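The OID scheme above can be illustrated with a short Python sketch. It is not part of power_gadget_snmp (which is written in C); the function names build_oid and parse_oid are hypothetical and only mirror the format described in the text.

```python
from typing import Tuple

BASE_OID = ".1.3.6.1.4.1.47670.5"

# Domain numbers as listed in the text above.
RAPL_DOMAINS = {1: "package", 2: "core", 3: "uncore", 4: "dram"}


def build_oid(cpu_number: int, rapl_domain: int) -> str:
    """Return the OID for a one-based CPU number and a RAPL domain number."""
    if cpu_number < 1:
        raise ValueError("cpu_number is one-based")
    if rapl_domain not in RAPL_DOMAINS:
        raise ValueError("rapl_domain must be 1, 2, 3, or 4")
    return f"{BASE_OID}.{cpu_number}.{rapl_domain}"


def parse_oid(oid: str) -> Tuple[int, str]:
    """Split an OID of this subtree into (cpu_number, domain name)."""
    prefix = BASE_OID + "."
    if not oid.startswith(prefix):
        raise ValueError("OID is outside the RAPL subtree")
    cpu_text, domain_text = oid[len(prefix):].split(".")
    return int(cpu_text), RAPL_DOMAINS[int(domain_text)]
```

For example, build_oid(1, 4) addresses the dram domain of the first CPU.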

Written in C, it can be compiled using the supplied Makefile.

Note that the RAPL interface requires the msr and cpuid kernel modules to be loaded.

Collectd Integration

collectd-plugin-intel_cpu_energy is a native plugin for the collectd "system statistics collection daemon".

It uses the same reference implementation as power_gadget_snmp to access the RAPL measurement values, converts them to Joules, then passes them on to collectd for further processing.
Its output follows the recommended collectd plugin structure, with the following mapping:

Therefore, the resulting query path will look similar to zeus01.intel_cpu_energy-cpu0.energy-package.

Written in C, it can be compiled using the supplied Makefile.

Note that the RAPL interface requires the msr and cpuid kernel modules to be loaded.
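The conversion to Joules mentioned above can be sketched as follows. This is a minimal illustration in Python, not the code used by the plugin or by Power Gadget: it assumes the usual RAPL layout in which bits 12:8 of the MSR_RAPL_POWER_UNIT register hold an exponent for the energy unit, and in which the energy status counters are 32 bits wide. The function names are hypothetical, and reading the registers themselves is left out.

```python
def energy_unit_joules(power_unit_msr: int) -> float:
    """Extract the energy unit from a raw MSR_RAPL_POWER_UNIT value.

    Bits 12:8 hold an exponent ESU; one counter increment equals
    1 / 2**ESU joules.
    """
    esu = (power_unit_msr >> 8) & 0x1F
    return 1.0 / (1 << esu)


def joules_between(previous_raw: int, current_raw: int, unit: float) -> float:
    """Energy consumed between two raw counter readings, in joules.

    The counter is assumed to be 32 bits wide, so the difference is taken
    modulo 2**32 to tolerate a single wraparound between the readings.
    """
    delta = (current_raw - previous_raw) & 0xFFFFFFFF
    return delta * unit
```

Taking the difference modulo 2**32 is a design choice of this sketch; whether the plugin handles counter wraparound in the same way is not stated here.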

Reading Measurements from Single-phase Energy Meters

The data acquisition and processing for the single-phase energy meters is split into multiple programs for increased flexibility.

Data Collection Daemon

A central daemon, emeterd, collects and counts pulses from the energy meters and stores the results in real time.
Two storage formats are used. First, all events (both pulses and starts or stops of the daemon) are recorded to raw.log as plain-text lines: daemon events are logged in the format <date>,<time>,event,[ start | stop ], while pulses are logged as <date>,<time>,<input pin number>,0. Second, the daemon maintains separate text files, one for each configured input pin, each containing a monotonically increasing count of the pulses received on that pin so far. These counter values are preserved across restarts of the daemon, and of the system it runs on.
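A minimal sketch of how a consumer might parse the raw.log lines described above. The names LogEntry and parse_line are hypothetical and not taken from the distribution. Date and time are kept as plain text, because their exact format is not specified here.

```python
from typing import NamedTuple, Optional


class LogEntry(NamedTuple):
    date: str            # kept as text; the exact format is not specified here
    time: str
    kind: str            # "start", "stop", or "pulse"
    pin: Optional[int]   # input pin number for pulses, None for daemon events


def parse_line(line: str) -> LogEntry:
    """Parse one comma-separated raw.log line into a LogEntry."""
    date, time, third, fourth = line.strip().split(",")
    if third == "event":
        if fourth not in ("start", "stop"):
            raise ValueError(f"unknown daemon event: {fourth!r}")
        return LogEntry(date, time, fourth, None)
    # Otherwise the third field is the input pin number of a pulse.
    return LogEntry(date, time, "pulse", int(third))
```

The timestamps passed to parse_line in any example are placeholders; only the field order is taken from the description.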

Refer to pins.txt for the mapping between GPIO pin numbers, meter numbers, and node names in our setup.

The emeterd.py executable file includes a standardized configuration block to be parsed by SysV-style init systems, so it can be directly installed as a system service.
A logrotate configuration file to regularly rotate raw.log is also included in the distribution.

Command-line Interface

emeter-live is an ncurses-based command-line utility that reads lines from raw.log (as written by emeterd), converts the timestamps and recorded pulses within into the most recently observed power consumption of each connected node, and displays that information both numerically and as a set of bar graphs.

emeter-live.py expects continuous input on its standard input stream, so emeter-live.sh is provided as a wrapper to automatically supply the required data.

Websocket-based Interface

emeter-ws operates in a manner similar to emeter-live in that it parses lines from raw.log to calculate current power consumption.
In this instance, however, the process is split into two parts: a daemon serves a web page via HTTP on port 9000 and simultaneously detects and broadcasts any changes to raw.log to all clients currently connected via the Websocket protocol.
These broadcasts are received by a JavaScript function integrated into the web page, which then calculates the current power consumption and displays it to the user (again both numerically and as bar graphs).
Additionally, the JavaScript function regularly checks whether the time passed since the last pulse on a certain pin is still consistent with the power consumption reported for that pin. When the time period exceeds the one previously observed (equating to a lower power consumption than before), it begins extrapolating the current consumption value based on the time period since the last pulse. As soon as the next pulse is received on that pin, the estimate is replaced with a newly calculated exact value and normal operation resumes.
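The calculation and the extrapolation rule described above can be sketched as follows. The sketch is written in Python for consistency with the other examples here, whereas the distributed implementation is a JavaScript function. The meter constant PULSES_PER_KWH is an assumed placeholder value, since the actual constant depends on the meters in use, and the function names are hypothetical.

```python
PULSES_PER_KWH = 1000  # assumption: placeholder meter constant, not from the thesis
JOULES_PER_PULSE = 3_600_000 / PULSES_PER_KWH  # 1 kWh = 3,600,000 J


def power_from_interval(seconds: float) -> float:
    """Average power in watts if one pulse took `seconds` to arrive."""
    return JOULES_PER_PULSE / seconds


def current_power(last_interval: float, since_last_pulse: float):
    """Return (watts, is_estimate) for one input pin.

    last_interval:    seconds between the two most recent pulses
    since_last_pulse: seconds elapsed since the most recent pulse
    """
    if since_last_pulse > last_interval:
        # The next pulse is overdue, so consumption must have dropped.
        # Extrapolate from the time elapsed so far.
        return power_from_interval(since_last_pulse), True
    return power_from_interval(last_interval), False
```

With the assumed constant, a 36-second interval corresponds to 100 W. If 72 seconds then pass without a further pulse, the sketch reports an estimate of 50 W until the next pulse arrives.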

A SysV-compatible service file is provided in emeter-ws.

SNMP Agent Interface

Finally, the distribution contains two Python scripts that implement the Net-SNMP SNMP agent's pass and pass_persist interfaces, respectively.
As explained in more detail by the snmpd.conf manual page, the pass_persist interface has the advantage of being more resource-efficient, since it re-uses the same interface process repeatedly, instead of launching a process for each individual SNMP query.

Both scripts use an SNMP subtree under the OID .1.3.6.1.4.1.47670.2 to provide the current contents of the counter files written by emeterd. The OIDs follow the format .1.3.6.1.4.1.47670.2.<meter_number>, using the meter numbering described in pins.txt.
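For illustration, the request handling of a pass_persist script might look roughly like the sketch below. It is not the script from the distribution. The counters argument is a hypothetical in-memory stand-in for the counter files written by emeterd, the SNMP type "counter" is an assumption, and ending the loop on a blank line is a choice made for this sketch. The exchange itself (PING and PONG, get, getnext, NONE) follows the snmpd.conf manual page.

```python
from typing import Dict, TextIO, Tuple

BASE_OID = ".1.3.6.1.4.1.47670.2"


def oid_key(oid: str) -> Tuple[int, ...]:
    """Turn a dotted OID into a tuple of integers so OIDs sort numerically."""
    return tuple(int(part) for part in oid.strip(".").split("."))


def answer(command: str, oid: str, counters: Dict[int, int]) -> str:
    """Build the reply to one get or getnext request.

    counters maps meter numbers to pulse counts (hypothetical stand-in
    for reading the counter files).
    """
    table = {f"{BASE_OID}.{meter}": count for meter, count in counters.items()}
    target = None
    if command == "get" and oid in table:
        target = oid
    elif command == "getnext":
        later = [o for o in table if oid_key(o) > oid_key(oid)]
        target = min(later, key=oid_key) if later else None
    if target is None:
        return "NONE\n"
    return f"{target}\ncounter\n{table[target]}\n"


def serve(requests: TextIO, replies: TextIO, counters: Dict[int, int]) -> None:
    """Answer requests until end of input; one process serves many queries."""
    while True:
        command = requests.readline().strip()
        if not command:
            break  # end of input or blank line: stop (assumption of this sketch)
        if command == "PING":
            replies.write("PONG\n")
        elif command == "set":
            requests.readline()  # OID
            requests.readline()  # type and value
            replies.write("not-writable\n")  # the counters are read-only
        else:
            oid = requests.readline().strip()
            replies.write(answer(command, oid, counters))
        replies.flush()
```

A real script would call serve(sys.stdin, sys.stdout, counters) after importing sys, and would re-read the counter files for each request. Keeping the loop in a single long-running process is what gives pass_persist its efficiency advantage over pass.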

SML Electricity Meter Interface

Both implementations of the SML electricity meter interface are based on the jSML Java library developed by the department "Intersectoral Energy Systems and Grid Integration" at the Fraunhofer Institute for Solar Energy Systems in Freiburg.

jSML supports a number of features, including both encoding and decoding SML message files.
The following two programs use its decoding functionality and format the results according to their intended use case.

https://www.openmuc.org/sml/

Command-line Interface

To facilitate debugging the optical interface on SML-capable electricity meters, a jSML example program was modified to receive SML messages via the serial interface of a Raspberry Pi single-board computer and format them in a human-readable way.

Collectd Integration

Expanding upon the jSML-based program modified for human-readable output, collectd-plugin-sml-electricity-meter integrates with collectd's Java plugin interface and submits all relevant SML message contents to collectd for further processing.
The fields in collectd's data packets are set as follows:

Refer to section "Electronic Electricity Meter EMH ED300L" and the glossary for details on the use of OBIS codes.

VerifierCloud Automatic Power Control

The VerifierCloud is a sophisticated system for queueing and distributing tasks, primarily verification runs, to nodes of a computer cluster.

The automatic power-off and power-on functionality described in this thesis was implemented in a separate feature branch "automatic-shutdown" so it could be tested without impacting productive use on the main cluster.

All information required for powering worker nodes off and on is collected automatically.
However, initiating a power-off sequence requires the worker process to have elevated privileges on the node it is running on. For this, the node's sudoers configuration needs to be amended to allow the worker process to execute the command sudo poweroff non-interactively (i.e. without entering a password).

After privilege escalation has been configured on the node, the user can then use the VerifierCloud command-line client to manually initiate a power-off of that node, as well as a subsequent power-on.

To enable the automatic power-off functionality, the worker needs to be started with a non-empty "shutdown-delay" value, either via the command-line client or from the master's WorkerInformation file.
The configuration syntax is unchanged and described in more detail in the official VerifierCloud documentation.