The problem
In this post, I go into detail about how I wrote the actual Terraform for implementing the solution we discussed before.
Remember this from last time?
```mermaid
---
title: Github Runners on Demand Workflow, showing Github Actions, Repo triggers, Cloudflare and Nomad (in Hashi@Home)
flowchartConfig:
  width: 100%
config:
  theme: base
  themeVariables:
    background: "#d8dee9"
    fontFamily: "Roboto Mono"
    primaryColor: "#88c0d0"
    secondaryColor: "#81a1c1"
    tertiaryColor: "#ebcb8b"
    primaryTextColor: "#2e3440"
    secondaryTextColor: "#3b4252"
    primaryBorderColor: "#7C0000"
    lineColor: "#F8B229"
    fontSize: "20px"
---
flowchart TB
  commit(Commit)
  repo(Git Repo)
  webhook(Repo Webhook)
  action_workflow(Github Action workflow)
  action_job(Github Action job)
  runner(Github Self-Hosted runner)
  listener(Webhook receiver)
  worker([Cloudflare Worker])
  tunnel(Cloudflare tunnel)
  cf_app(Cloudflare Application)
  nomad_job(Nomad Parametrized job)
  nomad(Nomad Dispatch API)

  %% triggers %%
  subgraph Triggers["fa:fa-github Triggers"]
    direction LR
    commit-- Push update to -->repo
    commit-- Trigger -->action_workflow
  end

  %% webhook flow %%
  subgraph Webhooks
    direction TB
    repo-- Emit -->webhook
    subgraph Cloudflare["fab:fa-cloudflare"]
      direction LR
      webhook-- POST -->listener
      listener-- Route -->cf_app
      zta-- Protect -->cf_app
      cf_app-- Invoke -->worker
      worker-- POST -->tunnel
    end
    subgraph Nomad
      direction LR
      tunnel-- Expose -->nomad
      nomad-- Schedule -->nomad_job
      nomad_job-- Instantiates -->runner
    end
  end

  %% actions flow %%
  subgraph Actions["fa:fa-github Actions"]
    action_workflow-- Schedules -->action_job
    action_job-- Requests -->runner
    runner-- Executes -->action_job
    action_job-- Notifies Status -->commit
  end
```
The flow of data is perhaps easier to visualise as a sequence diagram:
```mermaid
---
title: Simplified event flow
sequence:
  actorFontSize: "128px"
  actorFontFamily: "IBM Plex Mono"
  messageFontSize: "128px"
  messageFontFamily: "IBM Plex Mono"
config:
  theme: base
  themeVariables:
    background: "#d8dee9"
    primaryColor: "#88c0d0"
    secondaryColor: "#81a1c1"
    tertiaryColor: "#ebcb8b"
    primaryTextColor: "#2e3440"
    secondaryTextColor: "#3b4252"
    primaryBorderColor: "#7C0000"
    lineColor: "#F8B229"
    fontSize: "28px"
    fontFamily: "IBM Plex Mono"
---
sequenceDiagram
  autonumber
  actor User
  box github
    participant GithubRepo
    participant GithubWebhook
    participant GithubAction
    participant GithubCIJob
    participant GithubRunner
  end
  box nomad
    %% participant NomadTunnelJob
    participant NomadDispatchAPI
    participant NomadRunnerJob
  end
  User->>GithubRepo: commit
  GithubRepo->>GithubWebhook: trigger webhook
  GithubWebhook->>NomadDispatchAPI: deliver payload
  GithubWebhook->>GithubAction: queue workflow
  GithubAction->>GithubCIJob: queue job
  NomadDispatchAPI->>NomadRunnerJob: start
  activate NomadRunnerJob
  NomadRunnerJob->>GithubRunner: create
  activate GithubAction
  loop Alive
    NomadRunnerJob->>GithubRunner: Report Health
    GithubRunner->>GithubCIJob: Notify presence
    GithubAction->>NomadRunnerJob: schedule job
    NomadRunnerJob->>NomadRunnerJob: Run job
    GithubCIJob-->>GithubAction: update status
    NomadRunnerJob->>GithubAction: Terminate
  end
  deactivate NomadRunnerJob
  NomadRunnerJob->>GithubRepo: Remove Runner
  GithubAction->>User: show status
  deactivate GithubAction
```
As you can see, I've left out the Cloudflare resources here, assuming that the endpoint we need to POST to is resolvable by Github.
Don’t worry, it’s just to save space and keep the diagram readable - I’ll show in the next section exactly how Cloudflare fits into the picture.
Note, however, that there is a loop in the sequence while the runner is alive and processing a job. Once the job finishes, the loop exits and the runner is removed from the repository.
The runners are used once and destroyed.
Terraforming
We will be creating all of these resources with Terraform. Where should we start? When implementing something in Terraform, I usually begin by declaring the providers:
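Something like this, a minimal sketch assuming the four providers this post relies on (the version constraints are illustrative, not the ones actually pinned in the repo):

```hcl
terraform {
  required_providers {
    github = {
      source  = "integrations/github"
      version = "~> 6.0"
    }
    cloudflare = {
      source  = "cloudflare/cloudflare"
      version = "~> 4.0"
    }
    nomad = {
      source  = "hashicorp/nomad"
      version = "~> 2.0"
    }
    vault = {
      source  = "hashicorp/vault"
      version = "~> 4.0"
    }
  }
}
```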
Ok, we’ve got the tools, now to go about creating the resources.
Configuring providers
At the outset, the only things we really have are what is already present in Github (my repositories and Github's own state) and a Cloudflare account with a registered domain.
We could go about looking up data from e.g. Github by writing a declaration like:
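Something along these lines, where the repository name is a placeholder rather than one of my actual repositories:

```hcl
# Look up an existing repository by its full name.
data "github_repository" "this" {
  full_name = "example-user/example-repo"
}
```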
Astute readers might object that only public repositories will be found this way, since we haven't provided any means of authenticating to GitHub, and seasoned Terraformers would also ask things like "Where is the provider configuration?" and "Where is the backend configuration?"
Well, it turns out that I've actually written the Terraform as a module, which keeps the declarations abstract. The providers are configured in an instantiation of the module:
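Roughly like this; a hedged sketch in which the module path, owner, addresses and variable names are placeholders rather than the actual repo layout:

```hcl
provider "github" {
  owner = "example-user" # placeholder; the token comes from the GITHUB_TOKEN environment variable
}

provider "cloudflare" {
  # api_token is read from the CLOUDFLARE_API_TOKEN environment variable
}

provider "nomad" {
  address = "https://nomad.example.internal:4646" # placeholder for the Hashi@Home Nomad address
}

provider "vault" {
  # address and token are read from VAULT_ADDR and VAULT_TOKEN
}

module "github_runners" {
  source = "./modules/github-runners" # placeholder module path

  providers = {
    github     = github
    cloudflare = cloudflare
    nomad      = nomad
    vault      = vault
  }
}
```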
Several configuration parameters are not shown here, such as the Nomad and Vault tokens, because I usually already have them set in the environment[^1]. Now that the providers are configured, we can go about creating all of the resources we need.
The rest of the damn owl
You know what I really don't like? Those "tutorials" that start off really explicit and simple, you follow them nodding your head going "yeah, ok, I get it, I can do this", and then somewhere around step 3 they pull a magic trick with a wave of the hand and out pops a fully-formed masterpiece that you have no idea how to make. That's not what I'm trying to do here, so let's take a step back and try to reason about the rest of the damn owl[^2].
Github
Naively, we might assume that the first thing to do is register the webhook, but the first thing a webhook asks for when you create it is "where should I POST the event payload to?" So we'll need the domain and route first. Once we have those, we can register the webhook. When a webhook is registered, a webhook secret can be declared, which the receiving end should also have. Github uses this secret to sign the payload it sends, and the receiving end uses it to validate the payload, providing a means of ensuring message authenticity[^3].
So, in Github we'll need:

- A webhook with
  - a webhook secret
  - an endpoint
We’ll also need to look up:
- Specific repositories
- Github IP ranges
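Translating the lists above into Terraform, a hedged sketch might look like this; the repository reference, endpoint and secret variable are placeholders:

```hcl
# Github IP ranges, useful for restricting who may call the webhook endpoint later.
data "github_ip_ranges" "this" {}

# Register the webhook on the repository looked up earlier.
resource "github_repository_webhook" "runner" {
  repository = data.github_repository.this.name
  events     = ["workflow_job"] # the event signalling that a job needs a runner
  active     = true

  configuration {
    url          = "https://hooks.example.com/webhook" # placeholder worker route
    content_type = "json"
    secret       = var.webhook_secret # shared with the Cloudflare worker for payload validation
    insecure_ssl = false
  }
}
```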
Cloudflare
The Cloudflare edge servers will be able to receive that data, but the response will be a 500 at best, because there's nothing to serve the request. So the next thing we'll need is a worker attached to the route, able to respond when webhook payloads hit the URL[^4]. This would tell Github "ok, we've received your webhook, thank you. Carry on", but we'd still have to invoke the actual runner if required by the specific event.
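In Terraform terms, the worker and its route might look roughly like this (resource names follow the v4 Cloudflare provider; the script path, hostname and variables are placeholders):

```hcl
resource "cloudflare_worker_script" "webhook_listener" {
  account_id = var.cloudflare_account_id
  name       = "github-webhook-listener"
  content    = file("${path.module}/worker/index.js") # placeholder path to the worker script

  secret_text_binding {
    name = "WEBHOOK_SECRET" # the same secret registered on the Github webhook
    text = var.webhook_secret
  }
}

resource "cloudflare_worker_route" "webhook" {
  zone_id     = var.cloudflare_zone_id
  pattern     = "hooks.example.com/webhook" # placeholder route pattern
  script_name = cloudflare_worker_script.webhook_listener.name
}
```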
The worker script will deal with the business logic, including authenticating the payload using the shared secret mentioned above, which Github uses to sign the payload. If all goes well, the script is responsible for sending a dispatch POST to the Nomad API. Recall that the Nomad API is running in my local private network, so we need to create a tunnel for it, with an application to expose it to the Cloudflare edge. This application will receive and respond to requests specifically for Nomad, so I don't want to expose it to the world, only to the Cloudflare worker which deals with incoming Github webhooks. This is a machine-to-machine interaction, so the authentication mechanism will be a service token. We can then create an access rule which only allows requests that carry that token in their headers.
So, in Cloudflare we'll need:

- A worker with:
  - secrets bindings
  - a worker domain
- A Cloudflare Access application with:
  - an Access Group
  - an Access Policy
- A Cloudflare Tunnel with:
  - a tunnel configuration
- (optional) a KV namespace for metadata and job tracking
We’ll also need to look up:
- Accounts
- Domains
- Cloudflare Tunnel secrets[^5]
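A hedged sketch of the tunnel and the Access pieces from the list above (again using v4 provider resource names; hostnames and identifiers are placeholders):

```hcl
resource "cloudflare_tunnel" "nomad" {
  account_id = var.cloudflare_account_id
  name       = "nomad-api"
  secret     = var.tunnel_secret # base64-encoded; created out of band for now
}

resource "cloudflare_access_application" "nomad" {
  zone_id = var.cloudflare_zone_id
  name    = "Nomad API"
  domain  = "nomad.example.com" # placeholder hostname served by the tunnel
  type    = "self_hosted"
}

# Service token presented by the worker in its request headers.
resource "cloudflare_access_service_token" "worker" {
  account_id = var.cloudflare_account_id
  name       = "github-webhook-worker"
}

# Only requests bearing that service token may reach the Nomad API.
resource "cloudflare_access_policy" "nomad_service_token" {
  application_id = cloudflare_access_application.nomad.id
  zone_id        = var.cloudflare_zone_id
  name           = "Allow webhook worker"
  precedence     = 1
  decision       = "non_identity" # service tokens are authorised with a non_identity decision

  include {
    service_token = [cloudflare_access_service_token.worker.id]
  }
}
```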
Nomad
At last we can hit the Nomad API. We'll be using the Nomad parameterized job type, so that we can invoke ephemeral runner instances without having to register persistent runners in repositories. This is the key aspect that allows us to scale as needed, drop to zero capacity when idle, and save resources and money.
We will therefore need a Nomad job both for the tunnel mentioned above and for the Github runner parameterized job.
So, for Nomad we will need:
- a Nomad service job for the `cloudflared` tunnel
- a Nomad parameterized job for the Github runner
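From Terraform's point of view, both jobs are just `nomad_job` resources; a sketch, with placeholder jobspec paths:

```hcl
resource "nomad_job" "cloudflared_tunnel" {
  jobspec = file("${path.module}/jobs/cloudflared.nomad.hcl") # placeholder path
}

resource "nomad_job" "github_runner" {
  jobspec = file("${path.module}/jobs/github-runner.nomad.hcl") # placeholder path
}

# Inside the runner jobspec, the relevant part is the parameterized stanza,
# roughly (meta_required is an assumption about what the dispatcher passes):
#
#   job "github-runner" {
#     type = "batch"
#     parameterized {
#       payload       = "forbidden"
#       meta_required = ["GH_REPO_URL"]
#     }
#     ...
#   }
```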
Finally, we have all of the resources necessary and we can Terraform all the things!
For all of the code, see the repo.
Discussion
I’ve shown here a few of the gritty details of how to build this with Terraform.
There are a few rough edges still, some hardcoded information and a few resources which were created by hand, and not terraformed. From what I’ve read on the various fora, I believe these will be implemented soon in their respective providers.
I haven’t gone into detail regarding the worker itself or the Nomad job definition here - you can find them in the repo. I hope to discuss them in a future post.
I’ve tried very hard here to provide a linear description of how to go about building this. The actual experience was quite different to this - I spent a lot of time experimenting with Cloudflare resources before I got it right. I’m probably just slow, and once I finally got it properly implemented, it all made sense. The hardest part was the actual worker, but that’s just because I suck at writing Javascript… hey, I’m getting there.
I've used Hashi at Home services (Vault, Nomad and Consul) and a free Cloudflare account to do this, and honestly it's really nice to be able to run as many runners and CI actions as I want free of charge, by using hardware that I've already paid for. The only cost involved here was my fixed-line subscription and the computatoms I've got in the cluster. Running these in Github itself, I'd probably need the Enterprise subscription, since the Team subscription only gives me 1000 CI/CD minutes a month over the free account. It's neither here nor there whether I would actually save money[^6], but it certainly works. What is more, by implementing this solution, I've learned a lot about how Github runners actually work, some details of Nomad and Cloudflare, not to mention the Javascript I needed to know to write the worker.
Moral of the story
Making things yourself is important; you cannot learn if you don't do. And sometimes owning your own things is better than subscribing to other people's things.
References and footnotes
[^1]: I actually want a short-lived token for Nomad from the Vault Nomad secrets mount, but I haven't gotten that to work yet. I also want some form of short-lived token for Vault, but full disclosure, I'm using a root token at home 😱.

[^2]: I'm referring to the "How to draw an owl" meme.

[^3]: This is a crucial part of the security of the setup – without it, I could send any old data through the worker and do very malicious things to my Nomad cluster. I followed the official guide to implement the validation function in Javascript when writing the worker.

[^4]: It would be best to expose the worker only to known Github Actions endpoints. Github actually does expose its IP ranges, and the provider implements a data source for them. However, Cloudflare access rules can only be specified as /16 or /24 at the moment, which means having to expand the /23 and other CIDRs that Github returns to those sizes. Honestly, it felt like I'd have to do a hack, so I leave that for next time.

[^5]: This resource is not quite terraformable yet.

[^6]: For example, most of the runners run on a 4 CPU, 32 GB RAM Lenovo Thinkcentre I got refurbished for about 100 euros. That's just over 2 years of a Team membership… but I can use the machine to do lots of other things as well, which I can't really do with Github Actions. Owning things is actually pretty rad.