Hey there, newsletter readers. I'm Nat Bennett, software engineer, and you're reading Simpler Machines. This week we're talking about pipelines, baby, and a system that I've been exploring recently for managing Github Actions, called Skylounge.
This week is a little bit unusual because this is based on some paid work I did last month for 33 Teams, a consulting agency I've worked with a few times in the past – I spent about 30 hours last month learning and evaluating Skylounge, in preparation for future client work. I mention this mostly for transparency but also to make sure it's clear that I (and/or a team of folks like me!) am available right now to build out automation/infrastructure/platform-y things. And if you're reading this, I'm interested in learning more about your problems in this area, even if you don't need consulting help right now.
So if at any point you're reading this and you're like "hey, that sounds like something we're doing right now," please don't hesitate to send me an e-mail.
I love writing CI/CD pipelines.
They're like Rube Goldberg machines. I can almost hear the satisfying ka-thunk when a script puts a tested, working application into production. When I worked on Cloud Foundry, I spent much of my time on release teams, where writing and maintaining automation was the core of my daily job. In my work with 33 Teams, I’m often the person on the team who’s the fastest and happiest writing Bash scripts.
So I was super interested in getting my hands on Skylounge. I’ve spent a few weeks working with it, and I can see the potential to solve some common problems I’ve encountered over the past few years. If you have complex workflows, or have to manage pipelines for many applications, you should check it out.
One thing I want to emphasize is that Skylounge works specifically with Github Actions. I'm going to be talking a lot about Github Actions here, because it's the main automation tool I've been using since I left Cloud Foundry. At Cloud Foundry I used Concourse, and it's still my first love, but these days I’m usually working with clients, so I’m usually not introducing new tools. Github Actions is available, familiar, and easy to get started with, so it’s a good choice for a lot of teams. I desperately miss Concourse’s resource abstraction, but Github Actions otherwise gets the job done: it runs scripts in isolated environments.
Writing workflow automation well is hard
My biggest problem with CI right now is totally tool agnostic. Regardless of where and how you’re running the scripts, building complicated workflows out of Bash scripts is still a relatively specialized skill, and it’s hard to get application developers to spend time on it. It tends to be highly error-driven, and requires command of some technology that application developers otherwise don’t encounter in their day-to-day work.
When I encounter an error while writing a script or some YAML for a pipeline, I can often recognize the general class of error it belongs to and make sense of it quickly. I can immediately recognize when a mysterious parsing error is probably a YAML whitespace issue, for instance, or a Bash shell expansion issue. I also have a lot of habits that eliminate these kinds of errors. I habitually quote strings and surround variables with curly braces, and I start all my scripts with
But even with that experience, I still find that writing automation like this is slow, fiddly, and hard to make predictions about. It's a lot of "get an error, solve that error, whoops there's a new error, repeat." I’ve seen developers without these habits and experience struggle to even understand what’s happening when they encounter errors in CI. Pipelines are often people’s first serious encounters with details of the filesystem, Bash’s error handling and process model, and all kinds of deliciously Linux-y “blub.” Sometimes people don’t even know what to Google, or have any idea what’s going wrong. What they’re looking at should work.
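Here is a minimal sketch of the kind of defensive Bash habits described above, and the class of error they prevent. The `set -euo pipefail` line is a common strict-mode convention and is shown as an assumption, not necessarily the exact line these scripts start with. The helper names (`artifact_path`, `count_args`) and the sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Strict mode (an assumed convention): exit on errors (-e), on unset
# variables (-u), and on a failure anywhere in a pipeline (pipefail).
set -euo pipefail

# Hypothetical helper: builds the path where a build artifact would be stored.
artifact_path() {
  local app_name="${1}"
  local version="${2}"
  # Curly braces mark where the variable name ends. Without them,
  # "$app_name_v" would be read as one (unset) variable named app_name_v.
  printf '%s\n' "artifacts/${app_name}_v${version}.tar.gz"
}

# Hypothetical helper: prints how many arguments it received.
count_args() {
  printf '%s\n' "${#}"
}

release_name="my app"            # a value containing a space

count_args "${release_name}"     # quoted: stays one argument, prints 1
count_args ${release_name}       # unquoted: split into two words, prints 2
artifact_path "billing" "1.4.2"  # prints artifacts/billing_v1.4.2.tar.gz
```

The unquoted call is the sort of thing that works fine until a value with a space in it shows up in CI, which is why quoting everything by habit saves so much debugging time.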
And while the microwaved-black-coffee-drinking sysadmin in me thinks it’s good for engineers to learn these things about tools they use every day (better in a CI pipeline than in production!), these problems often get in the way of shipping software. And they can get in the way of people investing the time to get themselves a good, tight feedback loop to production. Folks will end up running long, complicated test and deploy processes manually, because they can’t justify the time to figure out and set up a good CI system.
This all gets much worse in regulated or otherwise security sensitive environments. Teams won't set up even simple security scanning or dependency automation, because they don’t have command of their automation tools. Or they’ll put together pipelines using scripts and container images they pulled from who-knows-where because they're just trying to get something working, dang it.
If you’re a manager or a lead for a team in this position — especially if you’re a manager or a lead for a lot of them — you know how painful this is. You know the work in question isn't even really that hard – for someone who knows how to do it – and it's not that different from team to team. But you don't have a lot of the people who are good at it, and they have a lot of demands on their time. You can’t have them spend all day writing what’s basically the same automation, with small variations, for each individual team. You could hire a specialist like me, but that can be expensive and doesn’t scale well, especially once you start having to maintain those pipelines.
Skylounge automates the parts that are the same for all projects
So here's where we get back to Skylounge, and why I immediately went, "Oh, I want that." Skylounge is a Github application that combines templates from a shared organizational library with repository-specific configuration, builds Github Actions workflows, and makes pull requests against those repositories to create and update those workflows. When Skylounge detects changes to that repository-specific configuration or to the shared organizational templates, it opens a pull request on any affected repository to update the Github Actions workflows it manages.
It’s a tool that can make managing CI/CD pipelines easier, and let organizations leverage the folks like me — who are fast and confident writing CI pipelines, and even think it’s kind of fun — to build tooling that works for whole fleets of applications.
So here's an example. Say you have a lot of Java Spring apps, and you want to deploy them to Kubernetes. You can write a Skylounge blueprint for a Github workflow that builds a container and pushes it to your cluster, then add the skylounge.yml that configures that workflow to your repositories. Want to add a scanning step to that workflow later, change the base image, or maybe add a sidecar container to the application pods? All you have to do is make that change once, in the blueprint file, and Skylounge will open up PRs on all the applications using it. You won’t surprise the developers responsible for those applications by making changes out from under them (they’ll have a chance to review and understand the PRs), but you’ll also be able to manage those applications as a fleet.
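To make the general idea concrete, here is a rough Bash sketch of combining one shared template with per-repository values to produce a Github Actions workflow. This is not Skylounge's blueprint or skylounge.yml format, which isn't shown here. The `@@NAME@@` placeholder syntax, the function names, and the `./scripts/build-and-push.sh` script are all invented for illustration.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical shared template, standing in for an organization-wide blueprint.
workflow_template() {
  cat <<'EOF'
name: deploy-@@APP_NAME@@
on:
  push:
    branches: [@@BRANCH@@]
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./scripts/build-and-push.sh @@APP_NAME@@
EOF
}

# Fills the template with repository-specific values.
# Caveat: values containing "/" or "&" would need escaping before use in sed.
render_workflow() {
  local app_name="${1}"
  local branch="${2}"
  workflow_template |
    sed -e "s/@@APP_NAME@@/${app_name}/g" -e "s/@@BRANCH@@/${branch}/g"
}

# Render the workflow for one hypothetical repository.
render_workflow "billing" "main"
```

Changing the template once and re-rendering for every repository is the "manage as a fleet" part. Delivering each rendered file as a pull request, rather than pushing it directly, is what keeps application teams in the loop.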
Skylounge helps with platform onboarding
The place I'm especially interested in this is for building "platforms" in the "platform as a service" sense – systems that take infrastructure primitives (like Kubernetes) and wrap them in developer-friendly interfaces for provisioning and updating those primitives. Platforms are great for freeing up developer time from repetitive infrastructure tasks, but for them to do their job, you have to be able to get applications onto the platform, and you have to be able to update applications once they’re there. These are both harder problems than you might expect.
First, let’s talk about onboarding applications. Platform/infrastructure/DevOps teams, by the nature of their role, tend to have a much better handle on tools like Docker and why they’re valuable than developers do. They also tend to know a lot more about the options available. It’s easy to underestimate the amount of work that you’re asking a team to do when you’re asking them to start using a tool that you know well, and they don’t. It’s also easy to underestimate the degree to which you’re disrupting their workflow when you ask them to change, say, their application’s build process in order to get onto your platform. If you don’t have a plan for getting applications onto a platform, it’s easy to spend millions of dollars in engineer time building something that sees very little use.
Using a tool like Skylounge to create and manage CI templates resolves several common barriers to onboarding. It makes it easier to “meet teams where they’re at.” You can start by templating a workflow they’re already using and familiar with. Then, when you’ve got everyone’s CI under management, you can start incrementally moving them towards the workflow you want them to adopt. It also means DevOps specialists can do more of the technical heavy lifting.
Abandoned apps, the platform problem no one tells you about
Then once you get the applications on the platform, you have to maintain them. You’ll need to update dependencies when they have security vulnerabilities. Most especially, you’ll need to be able to update applications that no longer have dedicated teams maintaining them.
This one is a surprisingly big problem, and I think it sneaks up on people who haven’t worked with a platform before. But platforms make it easy for companies to deploy applications. And if you make it easy to deploy applications, then you’ll get more applications. A lot more. And then once those applications are out there, in production, making money or providing services, they’ll stay there. Often for decades, if things are going really well. And eventually they won’t really need any more updates or work, and something else will, and the team that’s dedicated to them will get rolled off.
I’ve seen situations where an engineering organization deployed an application platform, onboarded teams to it, and then within months had abandoned applications with no team attached running on it. It happens fast. If you’re looking at building a platform, you need to have a plan for maintaining these apps.
There are a lot of ways to tackle this problem, but Skylounge’s CI-management option is interesting because it essentially gives you a shim into the applications: a way to make a lot of different kinds of changes to an application and its runtime environment.
Let’s take CVE-2021-44228 as an example, a vulnerability that you might know as “the Log4j problem.” This is a vulnerability in a Java logging library that allowed attackers who could get an application to print a particular string to its logs to get that application to download and execute code from anywhere. (When it was first reported, vulnerable applications included basically every Minecraft server. It was an exciting week.) There are two ways to patch an application that’s vulnerable to the issue: You can update the logging library, or you can set an environment variable. The attack relied on the library’s message lookup feature, which, it turns out, it’s possible to turn off. So even if you couldn’t update the dependency, if you had access to the application’s runtime environment, you could set an environment variable that prevented the application from fetching and running the remote code.
Applications on container-orchestration systems like Kubernetes typically get their environment variables updated when they’re deployed. If you can make updates to all the deploy processes for all your applications from one place, you can set that environment variable for all your applications at once. You don’t have to make changes to the code. You don’t even have to have access to the code! Changing environment variables isn’t always safer than changing code, but it’s nice to have the option for those times when it is: throw a quick fix on everything now, and then start the process of testing and deploying the bigger change.
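As a rough sketch of what "set it everywhere at once" can look like on Kubernetes, the script below prints (rather than runs) a `kubectl set env` command for each namespace. The namespace names and the `mitigation_command` helper are hypothetical. `LOG4J_FORMAT_MSG_NO_LOOKUPS` is the environment variable that was widely used as the first Log4j mitigation. This is a direct-to-cluster illustration under those assumptions, not a description of how Skylounge applies changes.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Log4j 2.10+ honors this variable. Later advisories found it insufficient
# for some configurations, so treat it as a stopgap, not a full fix.
MITIGATION_VAR="LOG4J_FORMAT_MSG_NO_LOOKUPS=true"

# Hypothetical helper: prints the command that would set the variable on
# every Deployment in one namespace. Printing keeps this sketch a dry run.
mitigation_command() {
  local namespace="${1}"
  printf 'kubectl set env deployments --all %s --namespace %s\n' \
    "${MITIGATION_VAR}" "${namespace}"
}

# Hypothetical namespaces, one per application team.
for namespace in payments search; do
  mitigation_command "${namespace}"
done
```

Running the printed commands for real would trigger a new rollout of each affected Deployment. Making the same change in a shared deploy template instead means it also survives the next deploy.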
Skylounge doesn't hide anything from developers
The last thing I really like about Skylounge is that it’s accessible. It’s easy to install on top of any existing infrastructure choices. Even if the main blueprint that most applications use doesn’t work for a particular application, you can still write a one-off blueprint for a particular application, store the core of its deployment pipeline in your central library, and have everyone’s CI available for inspection and update by your infrastructure team.
And while it takes over Github Actions from application teams, it does so in a relatively transparent way: through pull requests. The automation it creates is right there in the repository, and the developers responsible for it need to approve the PRs it generates to get updates. This doesn’t guarantee they’ll understand what it does, but it does prevent the operations of the pipelines Skylounge manages from being a complete black box. If something goes wrong, they know where the files are and they can at least begin the process of debugging them.
Get in touch
If any of these problems sound familiar, I’d love to talk more, even if you’re not using Github Actions. As you may have picked up on by now, I love talking about, thinking about, and building automation to deploy and manage applications, and if that’s something you’re struggling with, I want to hear about it, and might even be able to help.