I presented my talk based on the blog posts here at Linux.conf.au 2017; thanks to everyone who responded – your feedback is appreciated. Here are some links for anyone who’s interested:
There won’t be much new in there for anyone who has been following along with the blog series, but it’s a little more succinct. Note that there are some “Deleted scenes” in the slides which cover some of the myths in a slightly different form.
On a different note, some kind folks in the Freenode #ntp IRC channel linked to this great talk by Bryan Fink at the Systems We Love conference last December. It was so much nerdier and more interestingly presented than my talk that I have a serious case of professional jealousy. 🙂
(See the NTP category for previous posts in this series.)
In this post, I’m going to address some of the more common myths about NTP and how to avoid the mistakes which they produce. Some of these myths are grounded in fact, and in many cases it’s fine to accept them if you don’t need highly accurate time and you know the consequences, but they are usually based on misconceptions about how NTP works which can lead to greater errors later.
Advance warning: This is a long post! It has been brewing for a while and has ended up being quite lengthy due to the amount of data I’ve collected and the number of references for and against each myth.
Myth: Using the local clock is good enough
Good enough for what? Good enough for keeping 3-4 machines at roughly the same time? Possibly. Good enough for keeping within, say, 1 second of the real time for an extended period? Well, no.
As an experiment, I set up two bare metal machines. The first was configured with a number of peers, including my local LAN pool, the Australian public pool, & its own local clock (fudged to stratum 1). The second was configured with just its own local clock, also fudged to stratum 1. Both machines had an existing drift file in place from a previous experiment. I let these systems run for a few days; I then reinstalled the second system, removing the drift file.
During all this time I had a VM on the first system configured with both bare metal servers and my local stratum 1 as sources. Here’s a graph of the peer offsets recorded by that VM:
As you can see, during the first part of the experiment, the 2nd bare metal server (configured with only its own local clock) performed reasonably, only falling behind in time a little. But after the drift file was removed, it’s an entirely different story. Both bare metal servers were old, cheap hardware, but the one which was configured with appropriate external sources maintained close sync with the stratum 1 source (its yellow line is hidden behind the blue line in the graph), while the one with only its local clock gained around 24 seconds in 9 days, or more than 2.5 seconds every day. That’s an eternity by NTP reckoning.
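To put the figures above in the units NTP itself uses, the drift can be converted into a frequency error in parts per million. This is only a back-of-the-envelope sketch: the 24 seconds and 9 days come from the experiment above, while the function name is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def drift_rate(gained_seconds: float, elapsed_days: float) -> tuple[float, float]:
    """Return (seconds gained per day, frequency error in ppm)."""
    per_day = gained_seconds / elapsed_days
    ppm = gained_seconds / (elapsed_days * SECONDS_PER_DAY) * 1_000_000
    return per_day, ppm


# Figures from the experiment: roughly 24 seconds gained over 9 days.
per_day, ppm = drift_rate(gained_seconds=24, elapsed_days=9)
print(f"{per_day:.2f} s/day, about {ppm:.1f} ppm")
```

That works out to roughly 2.67 seconds per day, or a frequency error of about 31 ppm.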
Reality: You can rely on the local clock for only a very short period of time, and only when the error rate of the local clock has already been established
And even then, there are better ways to do this. NTP has orphan mode, in which a group of peered servers agrees on an interim authoritative source if their connectivity to upstream (lower-stratum) sources is lost for a time. In short, there is no justification for using the local clock. (Julien Goodwin’s advice about local clocks was already outdated in 2011.)
Best practice: Disable the local clock and enable orphan mode
The local clock is disabled in the default configuration of most current Linux distributions. If you have an old install where you haven’t updated the configuration, check it now to make sure the local clock is disabled. Comment out or delete any lines in /etc/ntp.conf beginning with “server 127.127.1.” or “fudge 127.127.1.”.
To enable orphan mode, add “tos orphan N” to /etc/ntp.conf, where N is the stratum which the orphan server will take – 5 or 6 is usually a reasonable starting point, since servers higher than stratum 4 are rarely seen on the public Internet. You should configure orphan mode on a group of peered servers.
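If you have many machines to check, the two checks above are easy to script. The following is a minimal sketch rather than a full ntp.conf parser: the function name, the regular expressions, and the sample text are all assumptions made for illustration, and it only looks for the two patterns mentioned above.

```python
import re

# Local clock driver lines, e.g. "server 127.127.1.0" or "fudge 127.127.1.0 stratum 10".
LOCAL_CLOCK = re.compile(r"^\s*(server|fudge)\s+127\.127\.1\.")
# An orphan mode setting, e.g. "tos orphan 5".
ORPHAN = re.compile(r"^\s*tos\b.*\borphan\s+(\d+)")


def audit_ntp_conf(text: str) -> dict:
    """Report local clock lines (by line number) and the orphan stratum, if any."""
    local_clock_lines = []
    orphan_stratum = None
    for number, line in enumerate(text.splitlines(), start=1):
        if line.lstrip().startswith("#"):
            continue  # skip commented-out lines
        if LOCAL_CLOCK.match(line):
            local_clock_lines.append(number)
        match = ORPHAN.match(line)
        if match:
            orphan_stratum = int(match.group(1))
    return {"local_clock_lines": local_clock_lines, "orphan_stratum": orphan_stratum}


# A made-up configuration of the kind an old install might still carry.
SAMPLE = """\
driftfile /var/lib/ntp/ntp.drift
server 127.127.1.0
fudge 127.127.1.0 stratum 10
# tos orphan 5
server ntp1.example.org iburst
"""

report = audit_ntp_conf(SAMPLE)
print(report)
```

On a real system you would pass in the contents of /etc/ntp.conf instead of the sample text. Here the report flags lines 2 and 3 as local clock lines and shows that orphan mode is not enabled, because the only "tos orphan" line is commented out.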
Myth: You don’t need NTP in VMs
This myth is relatively common in Windows/VMware environments, but can also be seen in Linux-focused materials. At its core is basically the same assumption as the local clock myth: the local (virtual) clock is good enough. So in that sense, it’s not really a myth: you can get away without NTP if you’re happy to have time accurate to within a second/minute/hour/whatever-your-clock-happens-to-do.
Reality: If you need NTP on bare metal, you need it in VMs
Oliver Yang’s Pitfalls of TSC usage is an interesting read covering the characteristics of virtual clocks in various hypervisors. Spoiler: best case, they’re no better than the oscillator in your machine; worst case, they’re much worse.
I performed this experiment to demonstrate: using 8 identical VMs (running on the KVM hypervisor) running on the same Intel PC which was used in the local clock test above, I synced the time with NTP, then shut down NTP on 4 of the 8 VMs. I left them running for a few days, then measured the offset from my local stratum 1 server over a 1-day period. The host was synced using NTP throughout.
Here’s a graph of the system offsets for the 4 VMs which were running NTP (the blue & green shades) and the host (red):
Here are the 4 VMs without NTP:
As you can see, this is two orders of magnitude worse than the ones running NTP. In case the scale wasn’t obvious in the graph above, here’s a combined graph – the 4 NTP-synced VMs and the host are the smudge over the X axis:
The takeaway is simple: if you want accurate time in VMs, run ntpd in them.
But, why not just sync with the host regularly?
Let’s try that. Here’s a graph showing the same 4 VMs, with an “ntpd -gq” (which does a one-time set of the clock & then exits) run from /etc/cron.hourly/:
Compared to the NTP-synced VMs & the host, it’s very jumpy:
This is definitely a much-improved situation over just trusting the local virtual clock and within the tolerance of an application like Ceph (which needs < 50 ms). But in this case, the clock is stepping often, rather than slewing. That could be improved by running the sync more often, say, every 1-5 minutes, but in that case, why not just run ntpd? (For further discussion, see this Server Fault thread.)
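To get a feel for why the sync interval matters, here is a rough sketch of how much error a free-running clock accumulates between one-shot syncs. The 50 ms tolerance is the Ceph figure quoted above; the 10 ppm drift rate is a hypothetical value chosen purely for illustration, as are the function names.

```python
def max_error_between_syncs(drift_ppm: float, interval_seconds: float) -> float:
    """Worst-case accumulated error (in seconds) if the clock free-runs between syncs."""
    return drift_ppm * 1e-6 * interval_seconds


def max_tolerable_drift_ppm(tolerance_seconds: float, interval_seconds: float) -> float:
    """Largest drift rate (in ppm) that stays within the tolerance between syncs."""
    return tolerance_seconds / interval_seconds * 1e6


# Hypothetical clock drifting at 10 ppm, synced hourly.
hourly = max_error_between_syncs(drift_ppm=10, interval_seconds=3600)
# Drift that an hourly sync can absorb while staying under 50 ms.
limit = max_tolerable_drift_ppm(tolerance_seconds=0.050, interval_seconds=3600)
print(f"{hourly * 1000:.0f} ms per hour at 10 ppm; limit is {limit:.1f} ppm")
```

With an hourly sync, anything drifting faster than about 14 ppm would exceed a 50 ms tolerance before the next sync, which is part of why a continuously disciplined clock is the safer choice.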
Myth: Time sync in VMs doesn’t work
This myth is firmly grounded in the ghosts of Linux kernels past. Under kernel versions up to 2.6, and early VMware, Xen, and KVM hypervisors, clocks were problematic, such that clock drivers often needed to be specified on the kernel command line. (See, for example, the VMware knowledge base article on timekeeping best practices for Linux guests, and the kernel versions mentioned.)
Reality: In most cases, VMs can maintain excellent time sync
The virtual clock drivers of modern Linux kernels are mostly adequate to support NTP (although see the article by Oliver Yang linked above for caveats).
Frequency (error rate in parts-per-million):
Reachability of peers:
System peer offset:
(For a recent discussion on this, see this Server Fault thread.) It should be noted that due to the Great Snapchat NTP Pool Surge of 2016 (I wish we had a snappier name for that…), this VM was under much higher load than normal, and still managed to keep good time. Here are some graphs showing a 2-week period (ending on the same date as the above graphs) which illustrate the traffic increase.
This is not a particularly powerful VM nor does it run on particularly modern hardware, and yet its pool score remains steady:
So a VM can make a perfectly viable NTP server, given the right configuration.
Myth: You don’t need to be connected to the Internet to get accurate time
This is not strictly a myth, because you can get accurate time without being connected to the Internet. It is sometimes expressed something like this: “Time synchronisation is a critical service, and we can’t trust random servers on the public Internet to provide it.” If you work for a military or banking organisation, and expressing the myth like this is an excuse to get hardware receivers for a mix of GPS, radio, and atomic clocks, then perpetuating this myth is a good thing. 🙂
But usually this is just a twist on the “local clock is fine” myth, and the person promoting this approach wants to keep misguided security restrictions without investing in any additional NTP infrastructure. In this form, it is certainly a myth. (There are circumstances where something close to this can be made to work, by having a local PPS device coupled with an occasional sync with external sources to obtain the correct second, but for the majority of use cases, the complexity and risk of running such a setup greatly outweighs any perceived security benefits.)
Reality: You need a connection to multiple stratum 1 clocks
You can use stratum 1 servers on your own local network, or you can access public servers, but ultimately you need to have a reliable external reference. The most common and affordable of the options for a local stratum 1 source is GPS (usually provided with PPS), but PPS-only devices and Caesium & Rubidium clocks are not unheard of. (See Julien Goodwin’s talks for more.)
(Time for a quick shout-out to Leo Bodnar Electronics, whose NTP server seems like a really nifty little box at a sensible price: if your organisation is large enough to have bare metal servers in multiple data centres, Leo’s box makes having a GPS-based stratum 1 source in each DC easy and affordable.)
Best practice: Maintain connectivity to at least four stratum 1 servers
If you maintain your own data centres or other sites and have a partial view of the sky, maintaining a stratum 1 server synced from GPS isn’t difficult (especially given products like the LeoNTP server mentioned above). The NTP foundation maintains a list of stratum 1 servers, some of which allow public access. Many of the NTP pool servers (such as mine) are stratum 1 also. Or you might peer with a research organisation which has access to an atomic clock.
There is no need to connect directly to stratum 1 servers; most public pool servers are stratum 2 or 3, and as long as you have a sufficient variety of them, you’ll be connected to the stratum 1 servers indirectly.
Myth: You should only have one authoritative source of time
This myth results in the most common misconfiguration of NTP: not configuring enough time sources. On first glance, and without any further information about how NTP works, it is a natural assumption that one source of time would be the master, and all other sources would stem from that.
Reinforcing this myth is a saying which crops up occasionally, known as Segal’s law: “A man with a watch knows what time it is. A man with two watches is never sure.” This often seems to be quoted without the knowledge that it is an ironic poke at being fully reliant on one time source.
But this is not how time (or our measurement of it, to be more precise) works, and NTP’s foundational assumptions are designed to match reality: no one source of time can be trusted to be accurate. If you have 2 sources of time, they will disagree; if you have 3 sources of time, they will disagree; if you have 10 sources of time, they will still all disagree.
This is because of both the natural inaccuracies of clocks, and how the NTP polling process (described in the last post) works: network latencies between two hosts constantly vary based on system load, demand on the network, and even environmental factors. So both the sources themselves and NTP’s perception of them introduce inaccuracy. However, NTP’s intersection and clustering algorithms are designed to take these differences into account, and minimise their significance.
[Edit: There are some common variants of this myth, including “NTP is a consensus algorithm”, and “you need more than 2 sources for NTP in order to achieve quorum”. Reality #1: NTP is not a consensus algorithm in the vein of Raft or Paxos; the only places where NTP uses true consensus are in electing a parent in orphan mode when upstream connectivity is broken, and in deciding whether to honour leap second bits. Reality #2: There is no quorum, which means there’s nothing magical about using an odd number of servers, or needing a third source as a tie-break when two sources disagree. When you think about it for a minute, it makes sense that NTP is different: consensus algorithms are appropriate if you’re trying to agree on something like a value in a NoSQL database or which database server is the master, but in the time it would take a cluster of NTP servers to agree on a value for the current time, its value would have changed!]
Reality: the NTP algorithms work best when they have multiple sources
If the description of the intersection algorithm in the previous post wasn’t enough to convince you that you need more than one source, here’s another experiment I performed: I used the same 2 bare metal hosts which I used in the previously-described experiment, each using a single local (well-configured) source. I then configured 8 VMs on the 2 bare metal hosts: 4 used only their local bare metal server as a source, while the other 4 used my local LAN pool.
All of the VMs kept good time. Those which were hosted on the Intel Core 2 host had error rates which almost exactly mirrored their host’s. This seems to be because of the constant_tsc support on the Intel platform; my AMD CPU lacks this feature. Those VMs which were hosted on the AMD Athlon 64 X2 host actually had substantially better error rates than their host; I still don’t have an explanation for this.
All of the VMs maintained offsets below 100 microseconds from their peers, and the ones with only a single peer actually maintained a lower average offset from their peer than those with multiple peers. However, the VMs with multiple peers were lower in root delay by between 4 and 9%, and had a 77 to 79% lower root dispersion. (The root dispersion represents the largest likely discrepancy between the local clock and the root servers, and so is the best metric for overall clock synchronisation with UTC.) My current explanation of the lower root delay and dispersion (despite higher average and system peer offsets) is that the intersection and clustering algorithms were able to correct for outlying time samples. For full figures, see the table below.
|Metric||Hosts||AMD Athlon 64 X2||Intel Core 2 Duo|
(All of the averaged figures above use absolute value.)
Best practice: configure at least 4, and up to 10, sources
I’ve heard plenty of incorrect advice about this (including even Julien Goodwin’s 2011 and 2014 talks), which states that if you have too many time sources, NTP’s algorithms don’t act appropriately. I don’t really understand why this belief persists, because all of the data I’ve collected suggests that the more time sources you give your local NTP server, the better it performs (up to the documented limit of 10 – however, even that is an arbitrary figure). My best guess is that older versions of the NTP reference implementation were buggy in this respect.
The one circumstance I have seen where too many sources caused problems is when symmetric mode was used between a large number of peers (around 25-30), and these peers started to prefer one another over their lower-stratum sources. I was never able to reproduce the issue after reducing the amount of same-stratum peering.
Myth: you can’t get accurate time behind asymmetric links like ADSL
(This one comes from Julien Goodwin’s talk as well.)
That depends; define “accurate”. Can you get < 1ms offset? Probably not. But you can get pretty close; certainly less than 5 ms on average sustained over a long period, with a standard deviation around the same range. Here’s a graph from an experiment I did with 4 VMs on my home ADSL link over a 1 week period. I made no attempt to change my Internet usage, so this covers a period where my family was doing various normal Internet activities, such as watching videos, email, web browsing, and games.
Whilst the sort of offsets seen in the diagram above are undesirable for high-precision clients, they are certainly viable for many applications. My pool server, a Beagle Bone Black with a GPS cape (expansion board) also runs behind this ADSL link, and its pool score is rarely below 19:
It’s generally true that if you have a choice of NTP servers, you should select the ones with the lowest network delay to your location, but this is not the only relevant factor. During the above experiment I had a number of time sources with greater than 300 ms latency, and yet they maintained reasonable offset and jitter.
NTP also has a specific directive designed to help cope with asymmetric bandwidth, called the huff-n’-puff filter. This filter compensates for variation in a link by keeping history of the delay to a source over a period (2 hours is recommended), then using that history to inform its weighting of the samples returned by that source. I’ve never found it necessary to use this option.
Putting it all together: sample NTP configurations
So given all of the above advice about what not to do, what should an ideal NTP setup look like? As with many things in IT (and life), the answer is “it depends”. The focus of this blog series has been to increase awareness of the fundamentals of NTP so that you can make informed choices about your own configuration. Below I’ll describe a few different scenarios which will hopefully be sufficiently common to allow you to settle upon a sensible configuration for your environment.
Data centres with large numbers of virtual or bare metal clients
For serving accurate time to a large number of hosts in 3 or more data centres, minimal latency is preferred, so in this scenario, a preferred configuration would be to have 4 dedicated stratum 2 servers (either VMs or bare metal – the latter are preferred) in each DC, peered with each other, and synced to a number of stratum 1 sources. (See here and here for two similar recommendations.)
Ideally, a stratum 1 GPS or atomic clock would be in each data centre, but public stratum 1 servers could be used in lieu of these, if (in the case of GPS) view of the sky is a problem or external antenna access is impractical, or atomic clocks are unaffordable.
The advantage of this setup is that it minimises coupling between the clients (consumers of NTP) and the stratum 1 servers, meaning that if a stratum 1 server needs to be taken out of service or replaced, it has no operational impact on the clients.
This configuration also minimises NTP bandwidth usage between data centres (although, unless the number of clients is in the tens of millions, this is unlikely to be significant). It also ensures that latencies for the clients remain low, and makes the stratum 2 servers essentially disposable – they could be deployed with a configuration something like the following (assuming it’s in DC2):
driftfile /var/lib/ntp/ntp.drift
statistics loopstats peerstats clockstats
filegen loopstats file loopstats type day enable
filegen peerstats file peerstats type day enable
filegen clockstats file clockstats type day enable
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited
restrict source notrap nomodify noquery
restrict 127.0.0.1
restrict ::1
tos orphan 5
server ntp1.dc1.example.org iburst
server ntp1.dc2.example.org iburst
server ntp1.dc3.example.org iburst
server public-stratum1.example.net iburst
peer 0.ntp2.dc2.example.org iburst
peer 1.ntp2.dc2.example.org iburst
peer 2.ntp2.dc2.example.org iburst
peer 3.ntp2.dc2.example.org iburst
Clients could use a configuration like this:
driftfile /var/lib/ntp/ntp.drift
restrict -4 default kod notrap nomodify nopeer noquery limited
restrict -6 default kod notrap nomodify nopeer noquery limited
restrict source notrap nomodify noquery
restrict 127.0.0.1
restrict ::1
pool 0.ntp2.dc2.example.org iburst
pool 1.ntp2.dc2.example.org iburst
pool 2.ntp2.dc2.example.org iburst
pool 3.ntp2.dc2.example.org iburst
Any of the commonly-available Free Software automation tools could be used for deploying the stratum 2 servers and maintaining the client configurations. I’ve used juju & MAAS, puppet, and ansible to good effect for NTP configuration.
Distributed corporate network
A distributed corporate network is likely to have a number of (possibly smaller) data centres, along with a number of corporate/regional offices, and possibly smaller branches with local servers. In this case, you would probably start with a similar configuration to that described above for large data centres. The differences would be:
- Stratum 1 sources might be located in corporate/regional offices rather than the data centres (because getting a view of the sky sufficient to get GPS timing might be easier there), or the organisation might be entirely dependent on public servers. (For a corporation of any significant size, however, cost shouldn’t be a barrier to having at least one stratum 1 server feeding from GPS in each region.)
- Bandwidth between branch offices and the central data centres might be rather limited, so corporate/regional/branch servers might be stratum 3 servers, and clients would be configured to use them rather than the DC stratum 2 servers, easing load on the corporate WAN. If their Internet bandwidth is equal to or better than their WAN bandwidth, the stratum 3 servers could also use the public NTP pool.
- To minimise configuration differences between sites, clients could be configured to use a fixed DNS name which would be directed to the local server by a DNS override (see BIND response policy zones) or a fixed IP address which is routed to a loopback interface on the stratum 3 server via anycast routing.
Standalone cloud instance
If you’re using a public cloud instance and install NTP on an Ubuntu 16.04 LTS image, you’ll get a default configuration which uses the NTP pool and looks something like this. In the case of the major public cloud vendors, this is a reasonable default, but with some caveats:
- Google Compute Engine runs leap-smeared time on its local time servers. Leap-smearing spreads out leap seconds over the time before & after the actual leap second, meaning that the client clocks will slew through the leap second without noticing a jump. Because they are close to the instances and reliable, Google’s time servers are very likely to be selected by the intersection algorithm in favour of the pool servers. This means that your instances could track leap-smeared time rather than real time during a leap second. (I’ll have more data about this in a future post – I’ve set up some monitoring of Google’s time servers to run over the upcoming leap second interval.) Unless all of your systems (including clients) are tracking Google’s time servers (which they’re probably not), my recommendation is not to use Google’s time servers.
- Microsoft Hyper-V seems to have a less mature virtual clock driver than KVM and Xen, meaning that time synchronisation issues on Microsoft Azure are more common, and it doesn’t seem to have changed much in recent years. (I hope to have more data and possible workarounds on this in a future post as well.)
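The leap-smearing described in the Google Compute Engine caveat above can be illustrated with a small model. This is only a sketch under loud assumptions: the linear shape, the 24-hour window centred on the leap second, and the function names are all chosen for illustration, and none of them is taken from a description of how Google's servers actually shape their smear.

```python
def smeared_offset(seconds_from_leap: float, window: float = 86_400.0) -> float:
    """Portion of the leap second (0.0 to 1.0 s) already absorbed by a linear smear.

    seconds_from_leap is negative before the leap second and positive after it.
    The window length and linear shape are illustrative assumptions.
    """
    half = window / 2
    if seconds_from_leap <= -half:
        return 0.0
    if seconds_from_leap >= half:
        return 1.0
    return (seconds_from_leap + half) / window


def disagreement_with_stepped_clock(seconds_from_leap: float, window: float = 86_400.0) -> float:
    """Difference (in seconds) between a smeared clock and one that steps at the leap."""
    absorbed = smeared_offset(seconds_from_leap, window)
    # A non-smeared clock has absorbed nothing before the leap and the full second after it.
    stepped = 0.0 if seconds_from_leap < 0 else 1.0
    return abs(absorbed - stepped)


for hours in (-12, -6, 0, 6, 12):
    diff = disagreement_with_stepped_clock(hours * 3600)
    print(f"{hours:+3d} h from leap: {diff:.2f} s apart")
```

In this model the smeared and non-smeared clocks agree at the edges of the window and are up to half a second apart around the leap second itself, which is why mixing smeared and non-smeared sources is best avoided.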
Clustered/related cloud instances
In the case where you’re using a number of cloud VMs for related tasks in a distributed application, it’s likely that using the public NTP pool along with selective local peering is the best compromise between cost/complexity and accuracy. Because the main public pool and the vendor pools are allocated using GeoDNS, you will probably get a reasonable selection of servers from them, but in some cases using a country pool will give better results. Check your delay & offset figures to be sure.
Small business/home networks
This is probably a case where accuracy requirements are low enough and the cost of setting up solid infrastructure high enough that it simply isn’t worth using anything but the public NTP pool under most circumstances. If you’re using a dedicated server/VM or a full-featured router for connectivity rather than a consumer xDSL/fibre gateway, it would probably be desirable to configure that device as a pool client (it will probably end up at stratum 2 or 3), and point your local clients at that as a single source.
Hopefully between dispelling common myths and outlining common use cases, this post has given you enough background to help you make informed choices about your NTP infrastructure and configuration. For further (more authoritative) reading on this, see the recently-published BCP draft.
This will be the last post in this series for at least a few weeks as I focus on turning the material here into a presentable talk for the Linux.conf.au 2017 sysadmin miniconf. Hope to see you in Hobart!
Addendum: Other mistakes well worth not making
Here are a couple of other issues which cropped up as I wrote this post, but which didn’t fit anywhere else.
- Letting time zones confuse your thinking. NTP doesn’t care about time zones. In fact, the Linux kernel (and I’d guess most other kernels) doesn’t care about time zones either. There is only UTC: conversions to your local time are done in user space.
- Being a botnet enabler. NTP has been used in reflective DDoS attacks for quite some time. This seems to have gone out of vogue a little lately, and the default configuration for your distro should protect you from this, but you should still double-check that your configuration is up-to-date. The examples given above show a basic minimum set of restrictions which should prevent this.
(See the NTP category for the previous posts in this series.)
So now that we’ve configured NTP, how do we know it’s working? As Limoncelli et al. have said, “If you aren’t monitoring it, you aren’t managing it.” There are several tools which can be used to monitor and troubleshoot your NTP service.
ntpq is part of the NTP distribution, and is the most important monitoring and troubleshooting tool you’ll use; it is used on the NTP server to query various parameters about the operation of the local NTP server. (It can be used to query a remote NTP server, but this is prevented by the default configuration in order to limit NTP’s usefulness in reflective DDoS attacks; ntpq can also be used to adjust the configuration of NTP, but this is rarely used in practice.)
The command you’ll most frequently use to determine NTP’s health is ntpq -pn. The -p tells ntpq to print its list of peers, and the -n flag tells it to use numeric IP addresses rather than looking up DNS names. (You can leave off the -n if you like waiting for DNS lookups and complaining to people about their broken reverse lookup domains. Personally, I’m not a fan of either.) This can be run as a normal user on your NTP server; here’s what the output looks like:
$ ntpq -pn
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
+172.22.254.1    172.22.254.53    2 u  255 1024  177    0.527    0.082   2.488
*172.22.254.53   .NMEA.           1 u   37   64  376    0.598    0.150   2.196
-184.108.40.206  220.127.116.11   2 u  338 1024  377   45.129   -1.657  18.318
+18.104.22.168   22.214.171.124   2 u  576 1024  377   32.610   -0.345   4.734
+126.96.36.199   188.8.131.52     2 u  158 1024  377   54.957   -0.281   3.400
-2001:4478:fe00: 184.108.40.206   2 u  509 1024  377   36.336    7.210   6.654
-220.127.116.11  18.104.22.168    2 u  384 1024  377   36.832   -1.825   7.134
-2001:67c:1560:8 22.214.171.124   2 u  846 1024  377  370.902   -1.583   3.784
-126.96.36.199   188.8.131.52     2 u  772 1024  377  328.477   -1.623  51.695
Let’s run through the fields in order:
- remote: The IPv4 or IPv6 address of the peer.
- refid: The IPv4 or IPv6 address of the peer’s currently selected peer, or a textual identifier referring to the stratum 0 source in the case of stratum 1 peers.
- st: The NTP stratum of the peer. You’ll recall from previous parts of this series that the stratum of an NTP server is determined by the stratum of its time source, so in the example above we’re synced to a stratum 1 server, therefore the local server is stratum 2.
- t: The type of peer association; in the example above, all of the peers are of type unicast. Other possible types are broadcast and multicast; we’ll focus exclusively on unicast peers in this series; see [Mills] for more information on the other types.
- when: The elapsed time, in seconds, since the last poll of this peer.
- poll: The interval, in seconds, between polls of this peer. So if you run ntpq -pn multiple times, you’ll see the “when” field for each peer counting upwards until it reaches the “poll” field’s value. NTP will automatically adjust the poll interval based on the reliability of the peer; you can place limits on it with the minpoll and maxpoll directives in ntp.conf, but usually there’s no need to do this. The number is always a power of 2, and the default range is 64 (2^6) to 1024 (2^10) seconds (so, a bit over 1 minute to a bit over 17 minutes).
- reach: The reachability of the peer over the last 8 polls, represented as an octal (base 8) number. Each bit in the reach field represents one poll: if a reply was received, the bit is 1; if the peer failed to reply or the reply was lost, it is 0. So if the peer was 100% reachable over the last 8 polls, you’ll see the value 377 (binary 11 111 111) here. If 7 polls succeeded, then one failed, you’ll see 376 (binary 11 111 110). If one failed, then 5 succeeded, then one failed, then another succeeded, you’ll see 175 (binary 01 111 101). If they all failed, you’ll see 0. (I’m not sure why this is displayed in octal; hexadecimal would save a column and is more familiar to most programmers & sysadmins.)
- delay: The round-trip transit time, in milliseconds, that the poll took to be sent to and received from the peer. Low values mean that the peer is nearby (network-wise); high values mean the peer is further away.
- offset: The maximum probable offset, in milliseconds, of the peer’s clock from the local clock [RFC5905 s4], which ntpd calculates based on the round-trip delay. Obviously, lower is better, since that’s the whole point of NTP.
- jitter: The weighted RMS average of the differences in offsets in recent polls. Lower is better; this figure represents the estimated error in calculating the offset of that peer.
- There’s actually an unlabelled field right at the beginning of each row, before all the other information. It’s a one-character column called the tally code. It represents the current state of the peer from the perspective of the various NTP algorithms. The values you’re likely to see are:
- * system – this is the best of the candidates which survived the filtering, intersection, and clustering algorithms
- o PPS – this peer is preferred for pulse-per-second signalling
- # backup – more than enough sources were supplied and ntpd doesn’t need them all, so this peer was excluded from further calculations
- + candidate – this peer survived all of the testing algorithms and was used in calculating the correct time
- - outlier – this peer includes the true time but was discarded during the cluster algorithm
- x falseticker – this peer was outside the possible range and was discarded during the selection (intersection) algorithm
- [space] – invalid peer; might cause a synchronisation loop, have an incorrect stratum, or might be unreachable or too far away from the root servers
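The fields and tally codes described above are regular enough to pull apart in a few lines of code, which can be handy for ad hoc monitoring. The following is a minimal sketch, not a robust parser: the function names and dictionary keys are invented for illustration, and it assumes each peer line has exactly the ten whitespace-separated fields shown in the example output.

```python
# Tally codes, as described in the list above.
TALLY_CODES = {
    "*": "system",
    "o": "PPS",
    "#": "backup",
    "+": "candidate",
    "-": "outlier",
    "x": "falseticker",
    " ": "invalid",
}


def decode_reach(reach_octal: str) -> list[bool]:
    """Return the results of the last 8 polls, oldest first (True = reply received)."""
    value = int(reach_octal, 8)
    return [bool((value >> bit) & 1) for bit in range(7, -1, -1)]


def parse_peer_line(line: str) -> dict:
    """Split one peer line from 'ntpq -pn' into named fields."""
    tally = line[0]  # the unlabelled one-character column
    remote, refid, st, kind, when, poll, reach, delay, offset, jitter = line[1:].split()
    return {
        "tally": TALLY_CODES.get(tally, "unknown"),
        "remote": remote,
        "refid": refid,
        "stratum": int(st),
        "type": kind,
        "when": when,  # kept as text, on the assumption it is not always a plain integer
        "poll": int(poll),
        "reach": decode_reach(reach),
        "delay_ms": float(delay),
        "offset_ms": float(offset),
        "jitter_ms": float(jitter),
    }


peer = parse_peer_line(
    "*172.22.254.53   .NMEA.           1 u   37   64  376    0.598    0.150   2.196"
)
print(peer["tally"], peer["stratum"], peer["reach"])
```

For the system peer in the example output, the reach value 376 decodes to seven successful polls followed by one failure, matching the description of the reach field above.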
Aside: the anatomy of a poll, and the selection (intersection) algorithm
Before we dig into applying the above knowledge of the peer fields to our example, we need to take a quick side trip into two more bits of theory. Firstly, how NTP polls work. You can find more detail on this process in RFC5905, but in a nutshell, each poll uses 4 timers:
t1 – the time the poll request leaves the local system
t2 – the time the poll request arrives at the remote peer
t3 – the time the poll reply leaves the remote peer
t4 – the time the poll reply arrives at the local system
t1 & t4 are recorded by the local system and are relative to its clock; t2 & t3 are recorded by the peer and are relative to its clock. Here’s a graphical representation, adapted from [Mills]:
The total delay (the time taken for the request to get to and from the peer) is the overall time minus the processing time on the peer, i.e. (t4 – t1) – (t3 – t2). Because it can’t know the network topology or utilisation between the local system and the remote peer, NTP assumes that the delay in both directions is equal, i.e. that the peer’s reported times are in the middle of the round trip.
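To make the arithmetic concrete, here is a minimal Python sketch of the per-poll calculation. The delay formula is the one given above; the offset formula, ((t2 – t1) + (t3 – t4)) / 2, is the standard one from RFC5905 and follows directly from the symmetric-delay assumption. The function name and the sample timestamps are made up for illustration.

```python
def poll_delay_and_offset(t1, t2, t3, t4):
    """Return (delay, offset) for one poll, in the units of the inputs.

    t1, t4 are read from the local clock; t2, t3 from the peer's clock.
    The offset is only exact if the outbound and return paths take the
    same time, which is the assumption NTP makes.
    """
    delay = (t4 - t1) - (t3 - t2)          # round trip minus peer processing time
    offset = ((t2 - t1) + (t3 - t4)) / 2   # peer clock minus local clock
    return delay, offset


# Hypothetical poll, in milliseconds: 50 ms each way, 1 ms of processing
# on the peer, and a peer clock that is 10 ms ahead of the local clock.
delay, offset = poll_delay_and_offset(t1=0, t2=60, t3=61, t4=101)
# delay == 100, offset == 10.0
```

If the real path is asymmetric (say 80 ms out and 20 ms back), the delay is still measured correctly but the offset is wrong by half the asymmetry, which is why the delay bounds the possible error in the offset.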
NTP performs the above calculation for every poll of every peer. When the results from peers are available, NTP runs the selection (or intersection) algorithm. The intersection algorithm is a modified version of an algorithm first devised by Keith Marzullo, and is used to determine which of the peers are producing possible reports of the real time, and which are not.
The intersection algorithm attempts to find the largest possible agreement about the true time represented by its remote peers. It does this by finding the interval which includes the highest low point and the lowest high point of the greatest number of peers. (Read that again a couple of times to make sure it makes sense.) This agreement must include at least half of the total number of peers for NTP to consider it valid.
If you forget everything else about NTP, try to remember the intersection algorithm, because it helps to make sense of NTP’s best practices, which might otherwise seem pointless. There are various diagrammatic representations of the intersection algorithm around, including Wikipedia:
Or this one from Mills:
The intersection algorithm currently in use in NTPv4 is not perfectly represented by any of the above diagrams (since the current version requires that the midpoint of the round trip for all truechimers is included in the intersection), but they are useful nonetheless in helping to visualise the intersection algorithm.
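For readers who prefer code to diagrams, here is a minimal Python sketch of Marzullo's original algorithm. It is not the modified version used by NTPv4 (it ignores the midpoint requirement mentioned above), and the function name and sample intervals are hypothetical. Each source is represented as a (low, high) interval, i.e. its offset minus and plus its error bound.

```python
def marzullo(intervals):
    """Return (count, low, high): the interval covered by the most sources.

    `intervals` is a list of (low, high) pairs. This is the classic
    Marzullo sweep, not NTPv4's refined intersection algorithm.
    """
    edges = []
    for low, high in intervals:
        edges.append((low, -1))   # -1 marks the start of an interval
        edges.append((high, +1))  # +1 marks the end of an interval
    edges.sort()  # at equal values, starts (-1) sort before ends (+1)

    best = count = 0
    best_low = best_high = None
    for i, (value, kind) in enumerate(edges):
        count -= kind  # a start raises the overlap count, an end lowers it
        if count > best:
            best = count
            best_low = value
            best_high = edges[i + 1][0]  # the next edge closes this overlap
    return best, best_low, best_high


# Three sources which agree, plus one which is far away from the others.
sources = [(8, 12), (11, 13), (10, 12), (20, 22)]
result = marzullo(sources)
# result == (3, 11, 12): three of the four sources overlap on 11..12
```

In this example the fourth source does not overlap the agreed interval at all, so it would be the one marked as a falseticker.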
Interpreting ntpq -pn
So let’s look back at the example above and make a few observations about our ntpq -pn output:
- There are a couple of peers at the start of the list with RFC1918 addresses and very low delay (less than 1 ms). These are peers on my local network. The latter of these is a stratum 1 server using the NMEA driver, a reference clock which uses GPS for timing, but also includes a PPS signal for additional accuracy. (More on this time server in a later post.) Both of the LAN peers have missed a poll recently, but they’re still reliable enough and accurate enough that they are included in the calculations, and the local stratum 1 server is the selected sync peer.
- There are four other peers with delays in the 30-60 ms range; these are public servers in Australia.
- Then there are two other peers with delays in the 300-400 ms range; these are servers in Canonical’s network which I monitor; they live in our European data centres.
Note that all of these are still possible sources of accurate time – not one of them is excluded as an invalid peer (tally code space) or a falseticker (tally code x). We’ve also got pretty low jitter on most of them, so overall our NTP server is in good shape.
Other ntpq metrics
There are a couple more metrics of interest which we can get from ntpq:
- root delay – our delay, in milliseconds, to the stratum 0 time sources
- root dispersion – the maximum possible offset, in milliseconds, that our local clock could be from the stratum 0 time sources, given the characteristics of our peers.
- system offset – the local clock’s offset, in milliseconds, from NTP’s best estimate of the correct time, given our peers
- system jitter – as for peer jitter, this is the overall error in estimating the system offset
- system frequency (or drift) – the estimated error, in parts per million, of the local clock
Here’s an example of retrieving these metrics:
$ ntpq -nc readvar 0
associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync,
version="ntpd firstname.lastname@example.org Fri Jul 22 17:30:51 UTC 2016 (1)",
processor="x86_64", system="Linux/3.16.0-4-amd64", leap=00, stratum=2,
precision=-20, rootdelay=0.579, rootdisp=5.813, refid=172.22.254.53,
reftime=dbbeb981.38507158 Sat, Oct 29 2016 16:00:33.219,
clock=dbbeba0c.ae22daf2 Sat, Oct 29 2016 16:02:52.680, peer=514, tc=10,
mintc=3, offset=-0.102, frequency=6.245, sys_jitter=0.037,
clk_jitter=0.061, clk_wander=0.000
This tells ntpq to print the variables for peer association 0, which is the local system. (You can get similar individual figures for each active peer association; see the ntpq man page for details.)
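If you want to pull individual figures out of that output in a script, a regular expression is more robust than splitting on commas, because the reftime and clock values contain commas in their human-readable dates. The following is a minimal Python sketch; the function name is hypothetical and the sample text is an abbreviated copy of the output above.

```python
import re

# name=value pairs; a value is either a quoted string or a run of
# characters containing no comma or whitespace.
VAR_RE = re.compile(r'(\w+)=("[^"]*"|[^,\s]+)')


def parse_readvar(text):
    """Parse the text printed by `ntpq -nc readvar 0` into a dict of strings."""
    return {name: value.strip('"') for name, value in VAR_RE.findall(text)}


sample = (
    'associd=0 status=0615 leap_none, sync_ntp, 1 event, clock_sync, '
    'processor="x86_64", stratum=2, rootdelay=0.579, rootdisp=5.813, '
    'reftime=dbbeb981.38507158 Sat, Oct 29 2016 16:00:33.219, '
    'offset=-0.102, frequency=6.245, sys_jitter=0.037'
)
variables = parse_readvar(sample)
system_offset_ms = float(variables["offset"])
```

All values are kept as strings, so the caller decides which ones to convert to numbers.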
It probably should go without saying, but if ntpq doesn’t produce the kind of output you were expecting, check the system logs (/var/log/syslog on Ubuntu & other Debian derivatives, or /var/log/messages on Red Hat-based systems). If ntpd didn’t start for some reason, you’ll probably find the answer in the logs. If you’re experimenting with changes to your NTP configuration, you might want to have tail -F /var/log/syslog|grep ntp running in another window while you restart ntpd.
Other monitoring tools
- ntptrace – we mentioned this in the previous post. It’s rarely used nowadays since the default ntpd configuration prevents ntptrace from remote hosts, but can be helpful if you run a local reference clock which you’ve configured for remote query from authorised sources.
- ntpdate – set the time on a system not running ntpd using one or more NTP servers. This tool is deprecated (use ntpd -g instead), but it has one really helpful flag: -q (for query) – this does a dry run which goes through the process of contacting the NTP server(s), calculating the correct time, and comparing with the local clock, without actually changing the local time.
- /var/log/ntpstats/clockstats – this log file, if enabled, has some interesting data from your local reference clock. We’ll cover it in more detail in a later post.
So those are the basic tools for interactive monitoring and troubleshooting of NTP. Hopefully you’ll only have to use them when investigating an anomaly or fixing things if something goes wrong. So how do you know if that’s needed?
At work we use Nagios for alerting, so when I wanted to improve our NTP alerting, I went looking for Nagios plugins. I was disappointed with what I found, so I ended up writing my own check, ntpmon, which you can find at GitHub and Launchpad. The goal of ntpmon is to cover the most common use cases with reasonably comprehensive checks at the host level (as opposed to the individual peer level), and to have sensible, but reasonably stringent, defaults. Alerts should be actionable, so my aim is to produce a check which points people in the right direction to fix their NTP server.
Here’s a brief overview of the alternative Nagios checks:
Some NTP checks are provided with Nagios (you can find them in the monitoring-plugins-basic package in Ubuntu); check_ntp_peer has some good basic checks, but doesn’t check a wide enough variety of metrics, and is rather liberal in what it considers acceptable time synchronisation; check_ntp_time is rather strange in that it checks the clock offset between the local host and a given remote NTP server, rather than interrogating the local NTP server for its offset. Use check_ntp_peer if you are limited to only the built-in checks; it gets enough right to be better than nothing.
check_ntpd was the best of the checks I found before writing ntpmon. Use it if you prefer Perl over Python. Most of the remaining checks in the Nagios exchange category for NTP are either token gestures to say that NTP is monitored, or niche solutions.
For historical measurement and trending, there are a number of popular solutions, all with rather patchy NTP coverage:
collectd has an NTP plugin, which reports the frequency, system offset, and something else called “error”, the meaning of which is rather unclear to me, even after reading the source code and comparing the graphed values with known quantities from ntpmon. It also reports the offset, delay, and dispersion for each peer.
The Prometheus node_exporter includes NTP, but similar to check_ntp_time, it only reports the offset of the local clock from a configured peer, and that peer’s stratum. This seems of such minimal usefulness as not to be worth storing or graphing.
Telegraf has an ntpq input plugin, which offers a reasonably straightforward interface to the data for individual peers in ntpq’s results. It’s fairly young, and has at least a couple of glaring bugs, like getting the number of seconds in an hour wrong, and not converting reachability from an octal bitmap to a decimal counter.
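For reference, the reachability conversion is only a couple of lines. The reach column in ntpq is an 8-bit shift register printed in octal, with one bit per recent poll. The following Python sketch (with hypothetical function names) shows one way to turn it into a count or a percentage:

```python
def reach_to_count(reach_field):
    """Convert ntpq's reach column (octal text such as '377') into the
    number of successful polls out of the last eight."""
    bitmap = int(reach_field, 8)            # parse the text as octal
    return bin(bitmap & 0xFF).count("1")    # count the set bits


def reach_to_percent(reach_field):
    """The same figure expressed as a percentage of the last eight polls."""
    return reach_to_count(reach_field) / 8 * 100


# '377' is all eight polls answered; '357' means one poll was missed.
counts = [reach_to_count(r) for r in ("377", "357", "1", "0")]
# counts == [8, 7, 1, 0]
```

Treating the octal text as a decimal number (377 rather than 255) is the kind of mistake that makes graphs of this field meaningless.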
Given the limitations of the above solutions, and because I’m trying to strike a balance between minimalism and overwhelming & unactionable data, I extended ntpmon to support telemetry. This is available via the Nagios plugin through the standard reporting mechanism, and as a collectd exec plugin. I intend to add Telegraf and/or Prometheus support in the near future.
Here’s an example from the Nagios check:
$ /opt/ntpmon/check_ntpmon.py
OK: offset is -0.000870 | frequency=12.288000 offset=-0.000870 peers=10 reach=100.000000 result=0 rootdelay=0.001850 rootdisp=0.032274 runtime=120529 stratum=2 sync=1.000000 sysjitter=0.001121488 sysoffset=-0.000451404 tracehosts= traceloops=
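The part after the | character is Nagios performance data: space-separated label=value pairs. If you want to reuse those figures outside Nagios, a rough Python sketch like the following would do. It assumes plain numeric values with no units or thresholds, as in the output above, and treats empty values (such as tracehosts=) as missing; the function name is hypothetical.

```python
def parse_perfdata(line):
    """Split one line of Nagios plugin output into (status_text, metrics).

    Assumes every performance data item is label=number with no unit
    suffix or ;warn;crit fields; empty values become None.
    """
    status, _, perfdata = line.partition("|")
    metrics = {}
    for item in perfdata.split():
        label, _, value = item.partition("=")
        metrics[label] = float(value) if value else None
    return status.strip(), metrics


# Abbreviated copy of the check output shown above.
sample = (
    "OK: offset is -0.000870 | frequency=12.288000 offset=-0.000870 "
    "peers=10 reach=100.000000 stratum=2 tracehosts= traceloops="
)
status, metrics = parse_perfdata(sample)
```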
And here’s a glimpse of the collectd plugin in debug mode:
PUTVAL "localhost/ntpmon-frequency/frequency_offset" interval=60 N:12.288000000
PUTVAL "localhost/ntpmon-offset/time_offset" interval=60 N:-0.000915111
PUTVAL "localhost/ntpmon-peers/count" interval=60 N:10.000000000
PUTVAL "localhost/ntpmon-reachability/percent" interval=60 N:100.000000000
PUTVAL "localhost/ntpmon-rootdelay/time_offset" interval=60 N:0.001850000
PUTVAL "localhost/ntpmon-rootdisp/time_offset" interval=60 N:0.036504000
PUTVAL "localhost/ntpmon-runtime/duration" interval=60 N:120810.662998199
PUTVAL "localhost/ntpmon-stratum/count" interval=60 N:2.000000000
PUTVAL "localhost/ntpmon-syncpeers/count" interval=60 N:1.000000000
PUTVAL "localhost/ntpmon-sysjitter/time_offset" interval=60 N:0.001096107
PUTVAL "localhost/ntpmon-sysoffset/time_offset" interval=60 N:-0.000451404
PUTVAL "localhost/ntpmon-tracehosts/count" interval=60 N:2.000000000
PUTVAL "localhost/ntpmon-traceloops/count" interval=60 N:0.000000000
This post ended up being pretty long and detailed; hope it all makes sense. As always, contact me if you have questions or feedback.
I decided that the long dyndns URLs were a bit daggy, so it’s back to the old site name. Please let me know if you notice any issues with the changeover.