An update on NTPmon

Over the past few weeks I've made some changes to my NTP health monitor NTPmon, and I want to explain them more fully and outline the roadmap from here.

2.0.x series

It was pointed out to me that it wasn't actually clear how to run NTPmon for anyone looking at it for the first time. Besides updating the documentation to fix this, I decided to dust off my long-neglected Debian packaging skills (which are much in need of improvement) and create a personal package archive (PPA) for installing NTPmon on Ubuntu and Debian. The PPA currently supports Ubuntu 20.04 (focal) and 22.04 (jammy), and Debian 11 (bullseye) and 12 (bookworm). I will endeavour to keep it up to date with the releases as they are tagged on GitHub.

I had previously not versioned NTPmon rigorously, but at some point in the past I had decided it was 1.0.0, so I started the packaged versions from 2.0.0. It took a few minor version numbers and some ugly release numbers for me to get the packaging right.

2.1.0

GitHub user flyingflo reported an incompatibility with icinga2's GraphiteWriter, and I released version 2.1.0 to integrate that fix.

3.0.x series

Shortly I'll be releasing 3.0.0, which will mark the start of a larger evolution of NTPmon into a more comprehensive telemetry suite for NTP.

More telemetry

I've been using different NTP implementations for some time, and I'm now running four different implementations live in my network:

Three of these are in the public NTP pool, and I hope to be adding the fourth (NTPsec) in the near future.

These are part of my larger ambition to maintain a sort of unofficial interoperability lab for NTP implementations and evaluate their performance and practicality in live networks, particularly as servers in the global NTP pool.

With that goal in mind, I need NTPmon to offer more in the realm of comparison between different implementations and the ability to compare measurements from various points on my network. So I've switched my metrics solution from Prometheus to InfluxDB and telegraf. Prometheus is a great tool and I'll still be using it for some things, but it's more a monitoring system akin to LibreNMS or Nagios implemented as a time series database than a dedicated TSDB in its own right. InfluxDB offers nanosecond resolution in timestamps, gathered from the point of measurement, which is much more appropriate for NTP analysis.

In addition to the overall health indicators, NTPmon will now emit statistics on individual NTP sources under the ntpmon_peer measurement name. The metrics included depend on the NTP implementation, but will include the time of the sample, the peer type, offset, delay, and dispersion for both chronyd and ntpd. The timestamp resolution on these measurements (only relevant for telegraf) is seconds for chronyd and milliseconds for ntpd.

Here's an example of what this looks like for chrony in telegraf mode (note how the ntpmon_peer lines include an exact timestamp reflecting the time of the message in the log file):

ntpmon_peer,mode=server,refid=2A03734F,rx_timestamp=kernel,source=2001:44b8:2100:3f11::7b:1,tx_timestamp=daemon,type=outlier delay=0.03426,dispersion=5.261e-08,offset=0.01728,root_delay=0.0005646,root_dispersion=0.0003357,score=1.0,authentication_fail=0i,bad_header=0i,bogus=0i,duplicate=0i,exceeded_max_delay=0i,exceeded_max_delay_dev_ratio=1i,exceeded_max_delay_ratio=0i,interleaved=0i,invalid=0i,leap=0i,local_poll=8i,remote_poll=8i,stratum=2i,sync_loop=0i,synchronized=1i 1703557302000000000
ntpmon, frequency=-7.391,offset=-0.0005147999999999999,reach=100.0,rootdelay=0.002461519,rootdisp=0.003245465,runtime=58755.045063495636,sysoffset=4e-09,stratum=2i
ntpmon_peers,peertype=backup count=0i
ntpmon_peers,peertype=excess count=0i
ntpmon_peers,peertype=false count=0i
ntpmon_peers,peertype=invalid count=0i
ntpmon_peers,peertype=outlier count=5i
ntpmon_peers,peertype=pps count=0i
ntpmon_peers,peertype=survivor count=5i
ntpmon_peers,peertype=sync count=1i
ntpmon_peer,mode=server,refid=6442F632,rx_timestamp=kernel,source=2001:44b8:2100:3f00::7b:5,tx_timestamp=daemon,type=outlier delay=0.02749,dispersion=1.412e-07,offset=0.01203,root_delay=0.001053,root_dispersion=0.006958,score=0.13,authentication_fail=0i,bad_header=0i,bogus=0i,duplicate=0i,exceeded_max_delay=0i,exceeded_max_delay_dev_ratio=1i,exceeded_max_delay_ratio=0i,interleaved=0i,invalid=0i,leap=0i,local_poll=10i,remote_poll=10i,stratum=2i,sync_loop=0i,synchronized=1i 1703557324000000000
ntpmon_peer,mode=server,refid=50505300,rx_timestamp=kernel,source=2001:44b8:2100:3f11::7b:6,tx_timestamp=daemon,type=sync delay=0.002295,dispersion=7.967e-08,offset=0.0007394,root_delay=1.526e-05,root_dispersion=3.052e-05,score=1.0,authentication_fail=0i,bad_header=0i,bogus=0i,duplicate=0i,exceeded_max_delay=0i,exceeded_max_delay_dev_ratio=0i,exceeded_max_delay_ratio=0i,interleaved=0i,invalid=0i,leap=0i,local_poll=8i,remote_poll=8i,stratum=1i,sync_loop=0i,synchronized=1i 1703557345000000000
ntpmon_peer,mode=server,refid=6442F632,rx_timestamp=kernel,source=150.101.186.48,tx_timestamp=daemon,type=outlier delay=0.02854,dispersion=1.412e-07,offset=0.01358,root_delay=0.001053,root_dispersion=0.007309,score=1.0,authentication_fail=0i,bad_header=0i,bogus=0i,duplicate=0i,exceeded_max_delay=0i,exceeded_max_delay_dev_ratio=1i,exceeded_max_delay_ratio=0i,interleaved=0i,invalid=0i,leap=0i,local_poll=8i,remote_poll=8i,stratum=2i,sync_loop=0i,synchronized=1i 1703557347000000000
ntpmon_peer,mode=server,refid=04B34211,rx_timestamp=kernel,source=2001:44b8:2100:3f00::7b:7,tx_timestamp=daemon,type=outlier delay=0.0244,dispersion=3.836e-06,offset=0.01071,root_delay=0.001526,root_dispersion=9.155e-05,score=1.0,authentication_fail=0i,bad_header=0i,bogus=0i,duplicate=0i,exceeded_max_delay=0i,exceeded_max_delay_dev_ratio=1i,exceeded_max_delay_ratio=0i,interleaved=0i,invalid=0i,leap=0i,local_poll=8i,remote_poll=8i,stratum=3i,sync_loop=0i,synchronized=1i 1703557348000000000
ntpmon_peer,mode=server,refid=2A03734F,rx_timestamp=kernel,source=2001:44b8:2100:3f00::7b:4,tx_timestamp=daemon,type=survivor delay=0.03447,dispersion=5.132e-08,offset=0.01674,root_delay=0.0004578,root_dispersion=0.0005798,score=1.0,authentication_fail=0i,bad_header=0i,bogus=0i,duplicate=0i,exceeded_max_delay=0i,exceeded_max_delay_dev_ratio=1i,exceeded_max_delay_ratio=0i,interleaved=0i,invalid=0i,leap=0i,local_poll=8i,remote_poll=8i,stratum=2i,sync_loop=0i,synchronized=1i 1703557348000000000
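
For anyone wanting to consume these lines outside of telegraf, here is a minimal Python sketch of splitting one record into its measurement name, tags, fields, and timestamp. It is not part of NTPmon: the function name parse_line is hypothetical, the sample is an abbreviated copy of one of the lines above, and the parser assumes that no tag or field value contains an escaped space or comma and that every field is either a float or an integer with the "i" suffix (which holds for the output shown here, but not for line protocol in general).

```python
from datetime import datetime, timezone


def parse_line(line):
    """Split one simple line-protocol record into its four parts.

    Assumes no escaped spaces/commas and only float or integer ("i") fields.
    """
    parts = line.strip().split(" ")
    head, field_str = parts[0], parts[1]
    # The timestamp is optional; when present it is nanoseconds since the epoch.
    timestamp = int(parts[2]) if len(parts) > 2 else None

    # The first comma-separated item is the measurement; the rest are tags.
    measurement, *tag_pairs = head.split(",")
    tags = dict(pair.split("=", 1) for pair in tag_pairs)

    fields = {}
    for pair in field_str.split(","):
        key, value = pair.split("=", 1)
        # Integer fields carry a trailing "i"; everything else here is a float.
        fields[key] = int(value[:-1]) if value.endswith("i") else float(value)
    return measurement, tags, fields, timestamp


# Abbreviated version of one of the ntpmon_peer lines above.
sample = (
    "ntpmon_peer,mode=server,source=150.101.186.48,type=outlier "
    "delay=0.02854,offset=0.01358,stratum=2i 1703557347000000000"
)
measurement, tags, fields, timestamp = parse_line(sample)

# Convert the nanosecond timestamp to a whole-second UTC datetime.
when = datetime.fromtimestamp(timestamp // 10**9, tz=timezone.utc)
print(measurement, tags["source"], fields["offset"], when.isoformat())
```

A real consumer would be better served by a proper line protocol parser; the point here is only to show how the tags (before the first space), fields (after it), and timestamp (last) are laid out.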

The metrics are derived from the chronyd measurements log or ntpd peerstats log, and are automatically detected by looking for these logs in their default locations of /var/log/chrony/measurements.log and /var/log/ntpstats/peerstats, respectively. If your system uses a different location, you can configure NTPmon to look for them elsewhere by using the --logfile command line option. If this logfile is missing, badly formatted, or cannot be parsed due to an internal bug, NTPmon will silently ignore it, so please ensure that you are alerting on the absence of these measurements if they are important for your purposes.
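
To illustrate the detection logic described above, here is a rough Python sketch. It is not NTPmon's actual code: the names DEFAULT_LOGS and find_peer_log are hypothetical, and only the two default paths are taken from the text. The exists parameter is there purely so the function can be exercised without touching the filesystem.

```python
import os

# Default log locations as described above; the dictionary itself is illustrative.
DEFAULT_LOGS = {
    "chronyd": "/var/log/chrony/measurements.log",
    "ntpd": "/var/log/ntpstats/peerstats",
}


def find_peer_log(override=None, exists=os.path.isfile):
    """Return (implementation, path) for the first usable log, or None.

    `override` plays the role of a user-supplied path (as with --logfile);
    the implementation is unknown in that case, so it is returned as None.
    """
    if override is not None:
        return (None, override) if exists(override) else None
    for implementation, path in DEFAULT_LOGS.items():
        if exists(path):
            return implementation, path
    # Nothing found: the caller gets no peer metrics, so alert on their absence.
    return None


result = find_peer_log()
if result is None:
    print("no peer statistics log found; ntpmon_peer metrics would be absent")
else:
    print("using %s (implementation: %s)" % (result[1], result[0]))
```

Because a missing or unreadable log produces no error, a None result in a sketch like this corresponds to exactly the situation where an alert on missing ntpmon_peer measurements is the only signal you will get.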

License change

From version 3.0.0 onwards, NTPmon will be licensed under the GNU Affero General Public License (AGPL), version 3 or later. I intend to write more about this later, but for the time being I'll say that I consider AGPL to be the GPL for the cloud era. I don't think it should cause any significant problems to any current NTPmon users, because I don't believe anyone is likely to offer any public service (commercial or otherwise) based on it. (Please contact me if you think otherwise. Note that you also have the option of using and/or forking version 2.1.0 or earlier without needing my permission.)

Other minor changes

  • The trace-related metrics have been deprecated for some time, and I've taken the opportunity to remove them with the 3.x major version bump.
  • I've also removed the juju layer, because removing it was necessary in order to change the license; in any case, I'm no longer using juju and don't have the necessary tools to maintain it. Please stick with the 2.1.x series if you still need this.
  • I've moved to 'black', the opinionated Python code formatter, to keep the style consistent across the code base.

Future rename

I've discovered over the last few months that the name "NTPmon" clashes with at least two other software packages: a full-screen utility shipped with NTPsec, and the shorthand name for the monitoring agent for the NTP Pool. So in future I'll be renaming NTPmon to something else, to avoid these name clashes and to reflect its growing capabilities. I'll be sure to post when that happens and maintain as many backwards-compatible mappings as possible.