unagi.py

P2P-like system monitor

Last Modified: Wed Oct 26 01:35:24 EDT 2005 (10/26, 14:35 JST)

Overview

unagi.py is a system monitoring tool for a small, loosely coupled cluster environment where trusted users run various programs from time to time. It helps users to utilize the machine resources cooperatively by reporting the current status of a cluster. It runs on every machine in a cluster and shares the system information from all machines cooperatively. It also acts as a simple HTTP server through which a user can view a status report.

WARNING: unagi.py is *NOT* intended to run on a public server. For security reasons, machine status should not be disclosed to outside users.

Here is a sample status report generated for an 11-machine cluster (HTML, 220k).

unagi.py has two goals:

Robustness. unagi.py works in a P2P-like manner, so there is no central server. It exchanges machine information using UDP packets. When a machine is temporarily unavailable (due to shutdown or network problems), the other machines still hold the information they obtained from it earlier, so a user can see what was happening on the machine before it went down.
Minimal administrative work. unagi.py can automatically recognize when a new machine is added, so in a homogeneous environment where every machine runs the same OS, you don't need per-machine configuration. Also, unagi.py is a stand-alone Python script and its configuration can be embedded directly in the script file. When you add a new machine, you can simply copy the script from another machine and run it as a daemon at startup.

What was wrong with SNMP?

unagi.py is similar to SNMP in function, but it differs in two respects. First, SNMP requires a manager that gathers all the data and never fails. Since in our situation any machine might be shut down or disconnected at any time, I wanted every machine to watch the others, so that a user can always expect to reach some system information stored on some machine (unless all of them die at once). Second, I didn't like SNMP configuration because it is far too complicated. I wanted something that is robust enough and still easy to configure.

Supported OS

Because the way system information is gathered is OS-dependent, unagi.py currently runs on Linux only. It can report the following information:

Average CPU load (in a certain period).
Block(disk) and network I/O traffic.
Memory usage.
Busy processes.
Current users.
Recent syslog messages.

Changes

Oct 24, 2005: Version 0.39 released. (Bugfix)
Jun 8, 2005: Version 0.38 released. (Limited kernel-2.6 support)
Dec 3, 2004: Version 0.36 released.
Nov 14, 2004: First public release. (version 0.35)

Download and Configuration

You need Python 2.3 or newer to run this program. (Although it also runs on Python 1.5.2, a newer version is recommended because older Python versions lack some functions that are needed to drop privileges.)

Download unagi-0.39.py (32KBytes).

To run the program, first you need to modify the script to configure several parameters (although these parameters can be specified with command line options, too). Here is the part you need to change:

# your network information
SIGNATURE = 'some_string'
P2P_PORT = UDP_port_number_for_P2P_network     # >1024
HTTP_PORT = HTTP_port_number_for_status_report # >1024
P2P_SCAN_RANGES = [ address_ranges_for_initial_scanning ] # ex. [ "192.168.0.100-199" ]
P2P_ALLOW_RANGES = [ address_ranges_of_trusted_peers ]    # ex. [ "192.168." ]
HTTP_ALLOW_RANGES = [ address_ranges_of_http_clients ]    # ex. [ "192.168.", "127." ]
UNAGI_USER = 'username_to_run_the_script'      # make sure to modify /etc/passwd!

SIGNATURE is a short string identifier. All machines in a P2P network must share the same signature; a machine with a different signature is ignored by the other peers. Although its primary purpose is to distinguish multiple P2P networks, you need to set some string even if you are running only one network. You also need to use the same P2P_PORT and HTTP_PORT numbers on every machine in the network.

For P2P_SCAN_RANGES, P2P_ALLOW_RANGES, and HTTP_ALLOW_RANGES, you probably want to set the network address of your subnet. If you want to view the status report from outside the network, you need to add other IP addresses to HTTP_ALLOW_RANGES. You can also modify P2P_ALLOW_RANGES to allow a machine outside the subnet to access your P2P network, but do not allow every machine in the world to access it! This might result in a serious security breach. I strongly recommend filtering these ports from the outside network. If you are running the program within a private network, you do not need to worry much about this. Also remember that P2P communication across a router is less reliable because unagi.py uses UDP to communicate with other peers.

You should also limit P2P_SCAN_RANGES strictly to your own network, because at startup unagi.py scans all addresses in these ranges to find other machines. Note that P2P_SCAN_RANGES must include the machine on which the program itself is running.

Address ranges are specified as a Python list. Each list can contain one or more ranges, each of which is a string constant. The following formats are accepted:

"a.b.c.d" (specify the exact address)
"a.b.c.d-e" (specify addresses from a.b.c.d to a.b.c.e inclusively)
"a.b.c." (equivalent to "a.b.c.1-254")
"a.b." (specify addresses which start with a.b. -- only allowed in P2P_ALLOW_RANGES and HTTP_ALLOW_RANGES)

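The range formats above could be interpreted roughly as follows. This is only a sketch written for current Python 3, not the code in unagi.py; the function names expand_range and is_allowed are hypothetical, and treating any prefix shorter than three octets (such as "a.b." or "127.") as a plain prefix match is an assumption based on the configuration examples.

```python
def expand_range(spec):
    """Expand "a.b.c.d", "a.b.c.d-e", or "a.b.c." into a list of addresses."""
    if spec.endswith("."):
        if spec.count(".") != 3:
            # Broad prefixes like "a.b." are only meant for the ALLOW lists.
            raise ValueError("prefix too broad to expand: %r" % spec)
        # "a.b.c." is documented as equivalent to "a.b.c.1-254".
        prefix, first, last = spec, 1, 254
    else:
        head, _, tail = spec.rpartition(".")
        prefix = head + "."
        if "-" in tail:
            lo, hi = tail.split("-", 1)
            first, last = int(lo), int(hi)
        else:
            first = last = int(tail)
    return ["%s%d" % (prefix, n) for n in range(first, last + 1)]


def is_allowed(addr, ranges):
    """Return True if addr matches any range string in the list."""
    for spec in ranges:
        if spec.endswith(".") and spec.count(".") < 3:
            # Assumption: short prefixes are matched as plain string prefixes.
            if addr.startswith(spec):
                return True
        elif addr in expand_range(spec):
            return True
    return False
```

With these definitions, expand_range("192.168.0.100-199") yields the hundred addresses to scan, and is_allowed("127.0.0.1", ["192.168.", "127."]) accepts a local HTTP client.
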
unagi.py doesn't need root privileges to run, so it should usually run as a harmless user (e.g. "nobody") with minimal privileges. When you run the program as root, the script changes its process UID and GID if a username is given as the UNAGI_USER parameter. (There is a small problem with older Python versions: since Python 1.5.2 and older lack the setgroups function, the script cannot drop its supplementary group permissions, which is not secure. This is why I recommend using a newer version.)

However, the script does need permission to read several files and to execute the ps command to obtain system information. In particular, it needs permission to read the syslog output file (usually /var/log/messages), which is not world-readable in most Linux distributions, so you will need to grant the user permission to read this file. The safest way is to create a new group named something like "log" and make the syslog file group-readable by this group. Then create a user "unagi" that belongs to this group and run the script as this user. For example:

Add the following entry to /etc/group:

log:x:888:

Add the following entry to /etc/passwd:

unagi:x:10000:888::/:/sbin/nologin
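For reference, dropping privileges to such a user typically looks like the sketch below. It is not taken from unagi.py; the function name drop_privileges is hypothetical, and the code is written for current Python 3. It relies only on the standard os and pwd modules.

```python
import os
import pwd


def drop_privileges(username):
    """Switch the process to `username`; return False if not running as root."""
    if os.getuid() != 0:
        return False  # nothing to drop; keep running as the current user
    entry = pwd.getpwnam(username)
    # Order matters: groups and GID must be changed while still root,
    # because setuid() removes the right to change them afterwards.
    os.setgroups([entry.pw_gid])  # discard root's supplementary groups
    os.setgid(entry.pw_gid)
    os.setuid(entry.pw_uid)
    return True
```

With the entries above, the primary group of "unagi" is 888 ("log"), so the process keeps read access to the group-readable syslog file after the switch.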

After setting the configuration, make sure this script is launched at startup. You can also run the script directly from the shell for testing purposes.

Running on Linux, unagi.py refers to the following files:

/proc/uptime
/proc/loadavg
/proc/meminfo
/proc/stat
/proc/net/dev
/var/run/utmp
/var/log/messages
ps command (included in procps package)
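As an illustration of how two of these files can be read, here is a small sketch written for current Python 3. The function names are hypothetical and the parsing is deliberately minimal; the actual script may extract different fields.

```python
def parse_loadavg(text):
    """Return the 1-, 5-, and 15-minute load averages from /proc/loadavg text."""
    fields = text.split()
    return tuple(float(x) for x in fields[:3])


def parse_meminfo(text):
    """Return a dict of "Key: value" lines from /proc/meminfo text."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        parts = rest.split()
        if not sep or not parts:
            continue
        try:
            info[key.strip()] = int(parts[0])  # values are usually in kB
        except ValueError:
            continue  # skip header lines that are not "Key: number"
    return info


def read_file(path):
    """Read a small text file such as an entry under /proc."""
    with open(path) as f:
        return f.read()
```

On a Linux machine, parse_loadavg(read_file("/proc/loadavg")) returns the three load averages, and parse_meminfo(read_file("/proc/meminfo")).get("MemTotal") returns the total memory if that key is present.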

Command Line Syntax

unagi.py [-d] [-u update] [-p p2pport] [-n p2pallow] [-h httpport]
         [-a httpallow] [-s scanaddrs] [-S signature] [-H nhistory] [-U username]

unagi.py accepts the following command line options.

-d : Indicate debug mode. It produces verbose output. Two '-d' options increase the debug level.
-u update : Specify an update interval in seconds. (default: 600)
-p p2pport : Specify a UDP port number to use for P2P communication. (no default)
-n p2pallow : Specify an address range to accept P2P communication (can be specified multiple times).
-h httpport : Specify a TCP port number to publish a status report via HTTP. (no default)
-a httpallow : Specify an address range to accept HTTP requests (can be specified multiple times).
-s scanaddrs : Specify an address range to scan for peer machines at startup.
-S signature : Specify a signature string. (no default)
-H nhistory : Specify the number of past entries which the program preserves.
-U username : Specify a username to run the script. (default: "unagi") If unspecified, the program doesn't attempt to change its process user id.
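The option syntax above maps naturally onto the standard getopt module. The following is a sketch of such a parser for current Python 3, not the script's own code. The function name parse_args and the dictionary keys are hypothetical, and letting -s repeat like -n and -a is an assumption. Only the documented default for -u is filled in.

```python
import getopt


def parse_args(argv):
    """Parse unagi.py-style options (without the program name) into a dict."""
    opts, _ = getopt.getopt(argv, "du:p:n:h:a:s:S:H:U:")
    conf = {"debug": 0, "update": 600, "p2pallow": [], "httpallow": [],
            "scanaddrs": []}
    for flag, value in opts:
        if flag == "-d":
            conf["debug"] += 1               # each -d raises the debug level
        elif flag == "-u":
            conf["update"] = int(value)
        elif flag == "-p":
            conf["p2pport"] = int(value)
        elif flag == "-h":
            conf["httpport"] = int(value)
        elif flag == "-n":
            conf["p2pallow"].append(value)
        elif flag == "-a":
            conf["httpallow"].append(value)
        elif flag == "-s":
            conf["scanaddrs"].append(value)  # assumption: may repeat like -n/-a
        elif flag == "-S":
            conf["signature"] = value
        elif flag == "-H":
            conf["nhistory"] = int(value)
        elif flag == "-U":
            conf["username"] = value
    return conf
```

Note that -h takes an argument here (the HTTP port) rather than printing help, which is one reason getopt fits this syntax more directly than a parser that reserves -h.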

How It Works

unagi.py shares system information with other peers by periodically asking every machine for its status (default: every 10 minutes). Each peer keeps an in-memory database that contains the current status of each machine (including itself). Peers communicate with packet-based messages. When a peer receives a message from an unknown machine, it confirms the signature, recognizes the sender as a new peer, and registers it in the database. Occasionally (default: every 30 minutes) each peer broadcasts a list of all known peers to the entire P2P network so that other peers can learn about newly added machines. When a machine doesn't respond for a while (default: 30 minutes), the peer marks it as "down". A machine is eventually removed from the database if it doesn't respond for a long time (default: 10 days).
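The bookkeeping described above can be pictured with a small in-memory table. This is a sketch under assumptions, written for current Python 3: the class name PeerTable, its methods, and the constants are hypothetical, and only the 30-minute and 10-day thresholds come from the text.

```python
import time

DOWN_AFTER = 30 * 60               # mark "down" after 30 minutes of silence
FORGET_AFTER = 10 * 24 * 60 * 60   # forget a peer after 10 days of silence


class PeerTable:
    """Hypothetical in-memory peer database keyed by IP address."""

    def __init__(self):
        self.last_seen = {}  # address -> time of the last response
        self.status = {}     # address -> most recent status string

    def add(self, addr, now=None):
        """Register an unknown peer without touching any stored status."""
        self.last_seen.setdefault(addr, time.time() if now is None else now)

    def update(self, addr, status, now=None):
        """Record a status report and refresh the peer's timestamp."""
        self.last_seen[addr] = time.time() if now is None else now
        self.status[addr] = status

    def is_down(self, addr, now=None):
        """Return True if the peer has been silent longer than DOWN_AFTER."""
        now = time.time() if now is None else now
        return now - self.last_seen[addr] > DOWN_AFTER

    def expire(self, now=None):
        """Drop peers silent longer than FORGET_AFTER; return their addresses."""
        now = time.time() if now is None else now
        gone = [a for a, t in self.last_seen.items() if now - t > FORGET_AFTER]
        for addr in gone:
            del self.last_seen[addr]
            self.status.pop(addr, None)
        return gone
```

A peer that is "down" keeps its last status in the table, which is what lets a user see what the machine was doing before it disappeared.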

When it starts, unagi.py tries to drop privileges (if a username is given) and then binds two sockets: a UDP socket for communicating with other peers and a TCP socket for serving HTTP requests. It then scans a range of addresses (specified by P2P_SCAN_RANGES) to find other peers, sending an initial query packet to every machine in the range until it receives a response from another peer. Since the initial query range includes the machine itself, at least one machine should respond. After this, it enters an event loop and handles the following protocol.
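The startup scan might look something like the sketch below, written for current Python 3. The helper names open_udp and scan_for_peer are hypothetical, and waiting briefly for a reply after each query is a simplification chosen for this example; the actual script may send all queries first and collect replies in its event loop. The query and reply formats ("signature?" and "signature!...") are the ones defined in the Protocol section.

```python
import socket


def open_udp(port, host=""):
    """Create a UDP socket bound to the given port."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, port))
    return sock


def scan_for_peer(sock, addresses, port, signature, timeout=0.5):
    """Send an initial query to each address until some peer answers.

    Returns the IP address of the first peer that replies with an
    address broadcast, or None if nobody answers.
    """
    query = (signature + "?").encode("ascii")
    reply_prefix = (signature + "!").encode("ascii")
    sock.settimeout(timeout)
    for addr in addresses:
        try:
            sock.sendto(query, (addr, port))
            data, sender = sock.recvfrom(65535)
        except OSError:
            continue  # timeout or unreachable host: try the next address
        if data.startswith(reply_prefix):
            return sender[0]
    return None
```

Packets that do not start with the expected signature and "!" are simply skipped, which also covers the case where the socket receives its own query.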

Protocol

The communication protocol between peers is simple and stateless. Each message consists of a single UDP packet that contains a text string. A peer must either respond to a message as soon as it receives it or do nothing. Each message must start with the network signature string, which is checked whenever a message is received. There are four types of messages in the current protocol:

signature? : INITIAL-QUERY. Each peer sends this message only once when it starts. The receiver of this message must immediately respond with an address broadcast message.
signature!address1 address2 ... : ADDRESS-BROADCAST. This message contains a space-separated list of the IP addresses of all the peers that the sender currently knows. It is sent either in response to an initial query or as a periodic broadcast by the peer itself. The receiver can add a new host entry to its internal database if the machine is not yet known, but it should not update any system information in the database.
signature> : REQUEST-IF-UPDATED. The receiver of this message must send its system status information to the sender with a status-report message, but only if its information has been updated since the last time the sender asked. Otherwise the message is ignored.
signature<system-status-information : STATUS-REPORT. This message is sent when a peer receives a request-if-updated message and its system status has been updated since the last time the sender asked. The status information is a string whose fields are separated with tabs and spaces, and its format depends on the reporting system.
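The four message types can be told apart by the single character that follows the signature. The sketch below, written for current Python 3, shows one way to build and classify such messages. The function names and the KINDS table are hypothetical, and the status payload is treated as an opaque string because its layout is system-dependent.

```python
KINDS = {
    "?": "INITIAL-QUERY",
    "!": "ADDRESS-BROADCAST",
    ">": "REQUEST-IF-UPDATED",
    "<": "STATUS-REPORT",
}


def build_broadcast(signature, addresses):
    """Build an ADDRESS-BROADCAST message from a list of IP addresses."""
    return signature + "!" + " ".join(addresses)


def build_status(signature, info):
    """Build a STATUS-REPORT message carrying an opaque status string."""
    return signature + "<" + info


def parse_message(text, signature):
    """Return (kind, payload), or None if the signature or type is unknown."""
    if not text.startswith(signature):
        return None  # wrong network: ignore the packet
    rest = text[len(signature):]
    if not rest or rest[0] not in KINDS:
        return None
    kind, payload = KINDS[rest[0]], rest[1:]
    if kind == "ADDRESS-BROADCAST":
        return kind, payload.split()  # list of peer addresses
    return kind, payload
```

For example, parse_message("sig!10.0.0.1 10.0.0.2", "sig") yields the ADDRESS-BROADCAST kind with a two-element address list, while a packet carrying a different signature yields None and is dropped.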

By giving the debug option (-d), you can see which message is sent from one peer to another.

Future Work

Support other platforms.
Independence from the ps command.
HTTP authentication for outside clients.
Stricter checks for corrupted peer messages.
Documentation / comments in the code.

Terms and Conditions

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Yusuke Shinyama