What’s the time, Mister Cloud? An introduction to and experimental comparison of time synchronisation in AWS and Azure, part 1

It's About Time

Time is something that many of us take for granted. Most of us don't think about it much, especially when it comes to computer systems - unless we are rudely awoken an hour too early because our mobile provider messed up our phone's time zone during a daylight saving transition.

In this three-part blog series, I'm going to provide an introduction to time synchronisation in AWS and Azure. We'll start here in part 1 by looking at the fundamentals of time and how we represent it. In part 2, we'll see how this applies to computer systems in practice, and in part 3, we'll explore the particular application to AWS and Azure virtual machines and containers.

My principal goals with this series are:

  1. to introduce those who work in cloud environments to the discipline of time synchronisation,
  2. to demonstrate that cloud environments are a practical place to run time sync, without any significant concerns about clock accuracy, and
  3. to help cloud professionals know how to configure time sync on their systems to meet their organisation's business needs and compliance obligations.

Prerequisites

You'll gain the most benefit from this series if you:

My plan is to start slow, explaining the fundamental concepts, but ramp up the nerd factor at each juncture. I know some things will need further explanation, but my goal is to pique your interest to dig deeper rather than explain everything in detail. For those who would like to explore further, I'll provide a full reference list at the end of part 3.

What is time synchronisation?

🕒-🖥️--🌏--🖥️-🕒

There are undoubtedly many technically superior definitions, but my layperson's definition is this: making sure that the time as reported on two different computers (possibly on opposite sides of the world) is the same at the same instant in actual time. This is much, much harder than it sounds.

Why should you care?

Anyone who runs systems that require millisecond-or-better time measurement will care about time synchronisation at some point. This includes things such as:

  • Reading logs - these are often reported nowadays at 1 millisecond resolution, so more than 1 millisecond offset between hosts can result in unexpectedly jumbled logs.

  • Distributed systems - for example, the Ceph distributed storage engine has a soft limit of 10 milliseconds and a hard limit of 50 milliseconds before it declares a storage cluster out of sync; many cloud services have similar or stricter tolerance requirements.

  • Compliance - certain regulations (the EU's are the one example I'm aware of) and certain industries (I've been told high-frequency trading is one of these) require that timestamps be within a defined offset from UTC.

  • One recent blog post suggests that clocks which are out by more than a few seconds could be used in a cache confusion attack on HTTP web servers.

Or maybe you just think it’s cool in its own right:

Credit: https://mars.nasa.gov/resources/273/

A scenario where the latter behavior can be most useful is a planetary orbiter fleet, for instance in the vicinity of Mars, where contact between orbiters and Earth occurs only one or two times per Sol (Mars day). These orbiters have a precise timing reference based on an Ultra Stable Oscillator (USO) with accuracy in the order of a Cesium oscillator. A PPS signal is derived from the USO and can be disciplined from Earth on rare occasion or from another orbiter via NTP. In the above scenario the PPS signal disciplines the spacecraft clock between NTP updates. -- https://doc.ntp.org/documentation/drivers/driver22/

(I’ll be honest: this is my main reason. They had my complete attention as soon as I read the phrase “planetary orbiter fleet”. 🤓)

What is good or bad, anyway?

This is not an existential question. 🤔 One of the things I've learned from my Mantel Group colleague Colin Andrews is that when learning something it's useful to think about a topic from first principles.

So what do we really want from a clock? What does it mean for a clock to be good or bad? If you have an analogue wristwatch or wall clock take a few moments to think about it. 🕰 ⌚ What do you want from it?

I would suggest our clock needs the following characteristics:

  1. intervals - seconds, minutes, and hours should be the correct length as determined by international standards; their length shouldn't change over time
  2. direction - the time should never go backwards
  3. start and finish point - our clock ticks should start at the same time as other people’s ticks
  4. calendar - if our clock has a calendar, it should be on the same day as everyone else’s

Technical terms

In time synchronisation we use the following terms to describe these clock characteristics:

  • frequency - the rate at which the clock ticks, which sets the interval between ticks (a positive frequency error means a clock runs fast, and a negative frequency error means it runs slow)
  • phase - whether our clock's ticks start at the same instant as other clocks' ticks
  • epoch - the reference point from which the ticks are counted

Looking at time

I often find representing something in graphical form helps me better understand how it works. Let's look at a few ways to visualise the clock characteristics we've just described.

For representing our wall clock or watch accuracy I’ve chosen a line graph, showing real time along the X axis and clock time along the Y axis. We could have chosen any units, but in this case we’ll think about the progress of seconds over 1 minute. Our clock only has a resolution of 1 second (it’s actually digital in the sense that it can only represent 60 discrete values), but real time is a lot more analogue than that. I’ve chosen milliseconds for my X axis resolution, although microseconds and nanoseconds are more often used in practice.

Frequency - an accurate clock

This is a graph of an accurate clock - it starts at zero and ticks over to 1 at the beginning of second number 1. It then stays on that second until exactly the beginning of second number 2, when the second hand moves to point to 2. It stays there until exactly the beginning of second number 3, when it ticks over to 3. And so on all the way to 60.

Graph of an accurate clock

Frequency - a clock which runs fast, consistently

In this next graph, our focus clock (in purple) is 5% fast, and has counted to 60 long before real time has advanced 60 seconds. If I hadn’t clipped the graph, we would see that it was part way through its 64th second by the time we get to 60 real seconds. (The previous accurate clock is the fainter green line.)

Graph of a consistently fast clock

Frequency - a clock which runs consistently slow

Our next clock is 5% slow (again in purple, with the fainter blue representing our fast clock), and is only part way through its 58th second when we get to 60 real seconds.

Graph of a consistently slow clock
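
The arithmetic behind the last two graphs can be checked with a short sketch. It assumes one particular reading of "5% fast" and "5% slow": each clock second lasts 950 or 1050 real milliseconds, which matches the counts quoted above. `displayed_second` is a hypothetical helper for illustration, not the code used to draw the graphs.

```python
def displayed_second(real_ms: int, tick_ms: int) -> int:
    """Whole seconds a clock has counted after real_ms real milliseconds,
    assuming every one of its seconds lasts tick_ms real milliseconds."""
    return real_ms // tick_ms


ONE_MINUTE_MS = 60_000

accurate = displayed_second(ONE_MINUTE_MS, 1000)  # exactly on the minute
fast = displayed_second(ONE_MINUTE_MS, 950)       # 63 completed: part way through its 64th
slow = displayed_second(ONE_MINUTE_MS, 1050)      # 57 completed: part way through its 58th

print(accurate, fast, slow)
```

Working in whole milliseconds keeps the division exact, which avoids floating-point surprises right at the tick boundaries.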

Frequency - an inconsistent clock

This clock has random changes in frequency added at selected points. Sometimes it even goes backwards. To create this graph, I adapted the famous fizzbuzz interview question: on seconds evenly divisible by 3 it changes the length of the second by a random amount, on those divisible by 5 by another random amount, and on those divisible by 15 by yet another. In all other cases it acts as a reliable clock. It actually gets to the end of the minute at roughly the right time, and if you weren’t watching it constantly, all you would notice is that it’s a little bit fast. But it’s a dreadfully inconsistent clock and you would never want to rely on it. I’ve never seen a wristwatch this bad.

Graph of an inconsistent clock

Frequency - an equally inconsistent clock (but can you tell?)

This was produced with the same algorithm as the previous graph, just with a smaller magnitude of changes (10% as large). But it’s equally unreliable if you drill down into the detail.

Graph of another inconsistent clock
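
For the curious, the fizzbuzz-style clock behind the last two graphs could be sketched roughly like this. This is a reconstruction from the description above, not the code that actually produced the graphs: the function name, the jitter ranges, and the fixed seed are all assumptions chosen for illustration.

```python
import random


def fizzbuzz_clock(magnitude: float, seconds: int = 60, seed: int = 42) -> list[float]:
    """Clock reading at the end of each real second (hypothetical model).

    On most seconds the clock advances by exactly 1. On seconds divisible
    by 3, 5, or 15 the advance is perturbed by a random amount, scaled by
    magnitude. The ranges below are assumed, not taken from the article.
    """
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    reading = 0.0
    readings = []
    for second in range(1, seconds + 1):
        step = 1.0  # a reliable clock advances one second per real second
        if second % 15 == 0:
            step += rng.uniform(-3.0, 3.0) * magnitude
        elif second % 5 == 0:
            step += rng.uniform(-2.0, 2.0) * magnitude
        elif second % 3 == 0:
            step += rng.uniform(-1.5, 1.5) * magnitude
        reading += step  # a negative step makes the clock go backwards
        readings.append(reading)
    return readings


wild = fizzbuzz_clock(magnitude=1.0)  # large, obvious jumps
mild = fizzbuzz_clock(magnitude=0.1)  # same pattern at 10% of the size
```

Because both runs share a seed, the second clock makes the same mistakes as the first at one tenth of the size, which is why it looks respectable on the graph while being just as inconsistent.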

Phase - a consistent clock, running 500ms late

Here's a clock that is totally consistent, but it started ticking half a second too late, so it is out of phase by 500 milliseconds.

Graph of an out-of-phase clock
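
A minimal sketch of the same idea in code, assuming the clock's seconds are the correct length but every tick lands 500 milliseconds after the true second boundary. `late_clock_reading` is a hypothetical helper for illustration.

```python
PHASE_ERROR_MS = 500  # ticks happen this long after the true second boundaries


def late_clock_reading(real_ms: int) -> int:
    """Second shown by a clock with correct tick length that started 500 ms late."""
    return max(0, real_ms - PHASE_ERROR_MS) // 1000


for real_ms in (999, 1000, 1499, 1500, 2500):
    print(real_ms, late_clock_reading(real_ms))
```

The reading only ticks over to 1 at 1500 real milliseconds, and the half-second gap never grows or shrinks. That constancy is what distinguishes a phase error from a frequency error.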

Epoch - a consistent clock, running 12 hours early

Here's a clock that is also consistent, but in a different epoch - 12 hours offset from the correct time. How would you determine whether this clock is accurate?

Graph of a clock in the wrong epoch

I confess: I cheated on this one and used the same graph as the first clock. With a wall clock or wristwatch there's no way to work out which epoch it is in, unless it also has an am/pm indicator and calendar.
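
The reason the two graphs are identical can be shown with a little modular arithmetic. This sketch assumes a 12-hour dial and measures time in seconds since midnight; the names are hypothetical.

```python
DIAL_SECONDS = 12 * 60 * 60  # a 12-hour dial repeats every 43,200 seconds


def dial_position(seconds_since_midnight: int) -> int:
    """Where the hands point, in seconds around a 12-hour dial."""
    return seconds_since_midnight % DIAL_SECONDS


correct_time = 21 * 3600 + 30 * 60         # 21:30:00
wrong_epoch = correct_time - DIAL_SECONDS  # 09:30:00, twelve hours early

print(dial_position(correct_time) == dial_position(wrong_epoch))
```

Both times land on the same dial position, so the hands alone cannot reveal the 12-hour error.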

Offset - difference from the real time

Now if we keep using graphs like this to represent time it won’t be very helpful, because time keeps marching on, and so all of the graphs would be small variations on “up and to the right”. So instead we usually graph time using offsets, which are shown in the graphs below.

The offset is just the time it says on the clock we’re measuring, minus the real time. So ideally that number should always be zero - a positive offset means our clock is ahead of the real time, and a negative offset means it is behind. Here's what our fast and slow clock offsets look like in graphical form:

Offset graph of fast, slow, and accurate clocks

As you can see, our consistently fast clock (purple) keeps getting further and further ahead of the real time, and our consistently slow clock (green) lags further and further behind.
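
As a rough sketch of where those straight lines come from, the offset can be computed directly as clock time minus real time. As an assumption for illustration, the fast and slow clocks are modelled as having seconds that last 0.95 and 1.05 real seconds respectively; `offset_seconds` is a hypothetical helper.

```python
def offset_seconds(real_seconds: float, tick_length: float) -> float:
    """Clock time minus real time, for a clock whose seconds each last
    tick_length real seconds."""
    clock_time = real_seconds / tick_length
    return clock_time - real_seconds


for t in (0, 15, 30, 45, 60):
    fast = offset_seconds(t, 0.95)
    slow = offset_seconds(t, 1.05)
    accurate = offset_seconds(t, 1.0)
    print(f"{t:>2}s  fast {fast:+.3f}  slow {slow:+.3f}  accurate {accurate:+.3f}")
```

With a constant frequency error the offset grows linearly, so after a minute the fast clock is roughly 3 seconds ahead and the slow clock nearly 3 seconds behind.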

Surprisingly, our inconsistent clocks stay somewhat closer to zero, because some of their random jumps are negative:

Offset graph of inconsistent clocks

The End of the Beginning

That's the end of part 1 of this series. Hopefully you've begun to think about time in a new light. In part 2 we'll get more practical and talk about the implementations of time on real computers.