The School for Sysadmins Who Can’t Timesync Good and Wanna Learn To Do Other Stuff Good Too, part 1 – the problem with NTP

(With apologies to Derek Zoolander and Justin Steven.  And to whoever had to touch the HP-UX NTP setup at Queensland Police after I left. And to anyone who prefers the American spelling “synchronization”.)

(This is the first of a series on NTP.  Part 2 is an overview of how NTP works.)

The problem with NTP

In my experience, Network Time Protocol (NTP) is one of the least well-understood of the fundamental Internet application-layer protocols, and very few IT professionals operate it effectively.  Part of the reason for this is that the documentation for NTP is highly technical and assumes a certain level of background knowledge.

I first encountered NTP more than 20 years ago, and my first efforts with it were an unmitigated disaster due to my ignorance of how the protocol was designed to function.  Since then virtually every IT environment I’ve encountered has had a less-than-optimal NTP setup.

I am still far from an expert on NTP, but I’ve learned quite a lot about operating it since my early days.  I hope this series of posts will help you develop a working knowledge of NTP faster and get the basics of NTP configuration right in your environment.

Why learn NTP?

Why bother learning this rather obscure corner of Internet lore?  I mean, the Internet mostly works, despite this alleged widespread lack of expertise in time sync, right?

Here are some of the reasons you might want to learn more about NTP:

  1. You run Ceph, Mongodb, Kerberos, or a similar distributed system, and you want it to actually work.
  2. You want your logs to match up across multiple systems, potentially on multiple continents.
  3. You like learning about new things and tinkering with embedded systems.
  4. You think bandwidth-efficient, high-precision time synchronisation is just a fun, nerdy problem.
  5. You think this is cool:

    A scenario where the latter behavior [the PPS driver disciplining the local clock in the absence of external sources] can be most useful is a planetary orbiter fleet, for instance in the vicinity of Mars, where contact between orbiters and Earth only one or two times per Sol (Mars day). These orbiters have a precise timing reference based on an Ultra Stable Oscillator (USO) with accuracy in the order of a Cesium oscillator. A PPS signal is derived from the USO and can be disciplined from Earth on rare occasion or from another orbiter via NTP. In the above scenario the PPS signal disciplines the spacecraft clock between NTP updates.

    (Personally, they had me at “planetary orbiter fleet”. 🙂 )

Caveats

In this series, I’ll describe a few best practices for setting up NTP in a standard 64-bit Ubuntu Linux 16.04 LTS environment.  Bear in mind this quite limited scope; this advice will not apply in all circumstances and intentionally ignores the less common use cases.  Further caveats:

    1. I have no looks.
    2. I am not an expert.   My descriptions of the algorithms are based on the documentation and operational experience.  I’m not a member of the NTP project; I’ve never submitted a patch; I’ve never compiled ntpd from source (I hate reading & writing C/C++).
    3. I’ve only worked with the reference implementation of NTP, and only on Linux, with only one reference clock driver (NMEA), and a limited range of configuration options.
    4. I will be glossing over a lot of detail.  Sometimes it’s because I don’t think it’s necessary in order to work with NTP successfully; sometimes it’s because I haven’t looked into that particular corner and so I don’t understand it; sometimes it’s because I have looked into that particular corner and I still don’t understand it. 🙂  But mostly it’s because I’m attempting to keep this series accessible for those who are newcomers.  If you’re an experienced NTP operator, you probably won’t find much of interest (if anything) until later in the series.
    5. We won’t cover much history or theory of time sync in this series.  If you’d like to know a little more about that, check out Julien Goodwin‘s previous LCA & SLUG talks:

Read on in part 2 – how NTP works.

ntpq: write to ::1 failed: Operation not permitted

 

The other day I got a bug report about check_ntpmon, which was reporting UNKNOWN status back to Nagios even though everything seemed to be working fine. A bit of debugging revealed that it was receiving the message on standard error:

ntpq: write to ::1 failed: Operation not permitted

This was a bit strange, because various links I found indicated that this message is usually due to firewalls:

But this host was not blocking anything, not to mention that check_ntpmon’s use of ntpq only ever uses the loopback interface, which is rarely ever touched by firewalls. A bit of further digging showed that indeed it was not the firewall, but a full conntrack table, with dmesg showing:

Aug 4 03:04:19 hostname kernel: [5226949.016837] nf_conntrack: table full, dropping packet

Increasing the conntrack limit fixed the problem.

(Just thought I’d document this here for posterity, since none of the links I found suggested this issue.)
Source: libertysys.com.au

Sun’s NTP documentation

 

It seems in the merging of Oracle.com and Sun.com, Sun’s blueprints site has disappeared from view.  A lot of it was very specific to Sun hardware & software, but it had much general applicability as well.  One of the treasures in their collection was a series of three documents on understanding NTP (Network Time Protocol), called “Using NTP to Control and Synchronize System Clocks”.  Thanks to the Internet Archive, they are still available, and i thought i’d throw up some links here so they won’t be forgotten:

The Wayback Machine at the Internet Archive is a bit touchy sometimes – i find that trying again after a few hours often works.