Questioning platform engineering

(Yes, I know this is being published on April Fools' Day. I might be a fool, but this is not an April Fool's joke.)

This year's model

Platform teams and platform engineering are phenomena which have taken the cloud native world by storm recently, so much so that it has seemed a bit socially unacceptable to question them. Team Topologies has been every DevOps manager's favourite book for a while, and recently Gartner discovered platform engineering and declared it this year's cool kids' toy (sorry, strategic technology trend).

However, Sam Newman recently published a very funny, very snarky post on what's wrong with the platform engineering space, and others have started to respond, so now I feel like I can voice my objections in public. 😃

First some history and definitions, which will necessarily be dramatically oversimplified.

Which came first, the DevOps or the SRE?

DevOps started as a sort of grass-roots counter-cultural movement in the late 2000s, with the first DevOps Days conference showing up in 2009. I think I first heard of it in about 2011 or 2012, and the message of Patrick Debois and his collaborators really resonated with me: IT works better when those who develop a service and those who run it work together rather than at cross purposes, as was often the case in traditional IT shops.

Those who started their careers after that time or who have never worked in traditional IT shops might find it hard to believe, but at the time rigid role definitions in IT shops were accepted wisdom. I remember a CIO telling me that I couldn't possibly be a programmer and a sysadmin and a network engineer (despite the fact that I was already doing all of these things in his organisation), because that didn't fit his operational model. DevOps offered a tangible foil to traditional ITIL-driven IT, suggesting that maybe we could give a damn about our co-workers' lives, and maybe even work in the same team alongside them, towards the same goals, rather than just worrying about the things in our own silo.

Whilst Site Reliability Engineering (SRE) started at Google earlier than the first documented origins of DevOps, it seemed to me that it entered the public sphere later (it might be that I just wasn't aware of it, or maybe Google didn't start talking about it externally until later). I first came across it in some talks by Ben Sloss (this might have been one of them) where he explained the way services at Google are deployed: for the first 6 months a new service is run by the team that developed it, and once it qualifies as production-grade it is handed over to an operations team. That operations team is made up of more developers, who use things like automation, error budgets, postmortems, and metrics to maintain the service. You can read all about SRE in Google's books on the topic.

SRE always struck me as a step backwards from DevOps (despite the commendable innovations in operations which they produced), because the development of a product was still separated from the operation of that product rather than having an integrated team dealing with both. (Although, after reading Killed by Google one could be forgiven for wondering whether Google product development teams are simply disbanded, or reallocated to something else once their services are released to SRE. 😃)

Would you like abstractions with that?

More recently, platform engineering arrived on the scene, proclaiming that operations teams need to treat their infrastructure like a product that they are providing to developers, in order to shield developers from the underlying details of that infrastructure.

Often it is assumed that Kubernetes is the preferred infrastructure abstraction. The Kubernetes ecosystem seems to collect imperative/declarative abstractions like [insert witty proverbial layer cake comparison here] - we had Docker as a way to manage Linux processes, but it was imperative, so we added k8s manifests to spin up our containers declaratively. But managing the manifests in our k8s clusters was done imperatively, so we brought in Helm to bring them into line declaratively. However, installing Helm charts is an imperative process, so along came things like Helmfile to help out by making that declarative. And then we have Declarative GitOps Continuous Delivery, which I'm pretty sure involves yet another abstraction layer on the cake. Don't get me wrong: I think declarative languages - whether domain-specific like Terraform or generic like YAML - are ideal for infrastructure as code, but at the moment the CNCF landscape seems to treat anything imperative like a problem to solve with another declarative abstraction.
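
To make that layer cake a little more concrete, here's a minimal sketch of the first couple of layers (the image, names and replica count are purely illustrative, and the Helmfile/chart references in the trailing comment are hypothetical): the imperative docker run we started with, and the declarative Kubernetes Deployment that replaces it.

```yaml
# Layer 0, imperative: docker run -d --name web -p 8080:80 nginx:1.25
# Layer 1, declarative: the same container expressed as a Kubernetes Deployment,
# applied with `kubectl apply -f` and reconciled by the cluster.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
# Layer 2 and beyond, declarative-over-imperative: a helmfile.yaml entry such as
#   releases:
#     - name: web
#       chart: bitnami/nginx        # hypothetical chart reference
#       values:
#         - replicaCount: 2
# stands in for the imperative `helm install`, and a GitOps controller then
# reconciles all of the above from a Git repository.
```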

Why platform engineering might not be the best choice

So what's the problem with spinning up a combination of Kubernetes and our preferred CI/CD tool to create a bespoke corporate PaaS that helps our developers focus on code and not have to worry about infrastructure?
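
For concreteness, the glue I have in mind usually looks something like the following sketch - a hypothetical CI workflow (GitHub Actions syntax here; the registry, secrets and resource names are invented placeholders) that builds an image and pokes it into a cluster.

```yaml
# A sketch of bespoke "platform" glue: build, push, deploy.
# Registry name, secrets and resource names are placeholders, not a real setup;
# registry authentication and error handling are omitted.
name: deploy
on:
  push:
    branches: [main]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Build and push the image
        run: |
          docker build -t registry.example.com/web:${{ github.sha }} .
          docker push registry.example.com/web:${{ github.sha }}
      - name: Roll the new image out to the cluster
        run: |
          echo "${{ secrets.KUBECONFIG_DATA }}" > kubeconfig
          KUBECONFIG=kubeconfig kubectl set image deployment/web web=registry.example.com/web:${{ github.sha }}
```

Every one of those lines is glue that someone now owns, which is where the problems below start.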

  1. You can't do it as well as Heroku or Cloud Foundry

    This point was going to contain the word "probably", but seriously, unless you have a team of hyper-10x unicorns, you're never even going to get close. PaaS solutions in the marketplace have a head start of years and have solved problems that most of us haven't even thought about yet, let alone actually encountered in production. All abstractions involve a trade-off between ease of use and control (and often performance as well), and the established PaaS players have years of experience in fine-tuning their balance of those trade-offs. Being a sub-par version of Heroku isn't something to aspire to; if we really believe in platforms, why aren't we using mature platforms which can be bought for a whole lot less than a team of k8s experts?

  2. Like it or not, developers need to understand infrastructure

    The love of abstractions in the platform engineering space stems from a fairly simple assertion: developers don't like dealing with infrastructure. At first blush, this seems reasonable: why should someone whose job is just writing the code have to understand all the details of what that code runs on? To a certain extent, this is true: very few of us understand how modern superscalar CPUs actually operate at the microcode level or below. However, for most of us that's 3-4 abstraction layers below the one we work at. Knowing the abstraction layers one or two levels on either side of your main area of expertise is at the very least desirable, and in many cases indispensable. It would be very reasonable to expect, for example, a software engineer working on compiler design to have at least a passing familiarity with the major concepts of CPU architecture.

    I once heard it said that bug fixes are meaningless; deployed bug fixes are the only thing that matters. A software engineer's great work isn't useful if it isn't deployed in a production system, and when something goes wrong, having that engineer understand the moving parts in that production system is the surest path to a low mean time to recovery.

I think most of the current hype around platform engineering comes down to two things:

  1. Engineers gonna over-engineer

    We engineers love having our thing, tweaked until it's just so. (Most of my recent blog posts are outworkings of this tendency in myself.) I suspect that the tendency for this to run unchecked in many organisations was the catalyst for many of Sam Newman's comments.

  2. Pundits trying to bump up sales of their industry reports.

    Disagree with me? I dare you to read the Gartner article on platform engineering linked at the top of this post, substitute any other technology used by an IT development or operations team whenever you read the phrase "platform engineering", and tell me with a straight face that the substance of it changes. As far as I'm concerned, it is vacuous promises and keeping up with the Joneses until proven innocent.

DevOps is dead; long live DevOps!

Most organisations I have worked with or have otherwise been exposed to, despite how they like to present themselves, haven't really done DevOps. I get it; it's hard to pull off.

For one, it's hard to hire any skilled software or infrastructure engineer in the current climate, let alone ones who can embrace being out of their comfort zone and are committed to continually improving their understanding of what their customers and their suppliers do.

Not only that, but culture takes time to change, and this is always in conflict with the usual need for organisational 'leaders' to implement their mandatory churn every few months. (What's the private enterprise equivalent of Machinery of Government? Machinery of Corporate?)

That doesn't mean that DevOps isn't a superior paradigm, or that it shouldn't be our goal. Service-based multi-disciplinary teams focused on producing positive customer outcomes are the kind of teams that motivate engineers like me to show up and give our best each day.