- What do I want from my runners
- Problem Statement
- Implementation
- Results
- Discussion
- References and Footnotes
What do I want from my runners
In the previous post, I considered using Hashi@Home as a place to run Self-Hosted GitHub runners. In this article, I wanted to go into detail about how I actually went about designing and implementing a solution1
Before discussing problems to address, perhaps it’s better to spend a few words describing how I would like my environment to behave when I’m done.
My story looks a bit like this:
As a po’ boy with a Nomad cluster
I want just enough self-hosted GitHub runners
So that I don’t have to pay for them in terms of time or money
Let’s take a closer look at the necessary features.
Features
We can define three features. Runners:
- Are On Demand
- Scale to Zero
- Are ephemeral
Using gherkin syntax, I'll try to be a bit more explicit about what is meant by these features; the scenarios below are indicative sketches rather than a real test suite.
First off, runners are on demand:
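```gherkin
# Rough sketch of the desired behaviour
Feature: Runners are created on demand
  Scenario: A workflow job requests a self-hosted runner
    Given a workflow job is queued on one of my repositories
    And the job requests a self-hosted runner
    When the webhook payload is delivered
    Then a runner is registered with that repository
```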
Furthermore, I want the runners to scale to zero:
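```gherkin
# Rough sketch of the desired behaviour
Feature: Runners scale to zero
  Scenario: No workflow jobs are queued
    Given no workflow jobs are queued on any of my repositories
    When I inspect the Nomad cluster
    Then no runners are present
    And I am paying for nothing
```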
I also want runners to be used once and then discarded, so that I don’t have to manually clean them up afterwards:
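```gherkin
# Rough sketch of the desired behaviour
Feature: Runners are ephemeral
  Scenario: A runner completes its job
    Given a runner has been created for a workflow job
    When the job completes
    Then the runner is deregistered from the repository
    And the runner is discarded without manual cleanup
```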
On a finer note, I also want runners to be scheduled only when the relevant runner label is present; when the relevant label is not present, we do not create a runner. I see this as being a part of the first feature:
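```gherkin
# Rough sketch of the desired behaviour
Feature: Runners are created only for matching labels
  Scenario: A workflow job does not request my runner label
    Given a workflow job is queued on one of my repositories
    And the job does not carry the relevant runner label
    When the webhook payload is delivered
    Then no runner is created
```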
Problem Statement
There were two main problems to address:
- Personal repository runners need a token scoped to a specific repository; this would create an unsustainable number of runners if they were provisioned statically.
- In order to provision them on-demand, we need to react to webhooks, which in turn requires that we expose an endpoint on a domain that GitHub can `POST` to.
Registering runners
The first is quite simple to address.
If we register a webhook2 for workflow job events on a repository, each delivery will contain the repository name as part of the payload3.
In order to register the runner with the repo, we first need a runner token, which can be obtained via an authenticated HTTP REST call to the registration endpoint: `/repos/:owner/:repo/actions/runners/registration-token`.
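As a rough sketch, that call looks something like this (assuming a token with the right scope sits in a `GITHUB_TOKEN` environment variable; the function name is mine):

```typescript
// Request a short-lived registration token for a repository's runners.
// Sketch only: assumes GITHUB_TOKEN holds a token with repo admin scope.
async function getRegistrationToken(owner: string, repo: string): Promise<string> {
  const response = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/actions/runners/registration-token`,
    {
      method: "POST",
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      },
    },
  );
  if (!response.ok) throw new Error(`Token request failed: ${response.status}`);
  const body = (await response.json()) as { token: string; expires_at: string };
  return body.token;
}
```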
However, in order to receive the webhook payload, we need somewhere to actually execute this call. True, we could probably set up a small action in GitHub itself to generate a new runner registration token and then pass it on to whatever will create the self-hosted runner, but that would just be kicking the can down the road – we'd eventually need a place to run a runner anyway!
So, now we start touching problem number 2 above. We need to:
- Register a webhook to send payload data to an endpoint
- Receive and perform computation on the payload, extracting (amongst other things) the repository name
- Obtain a Github token and call the runner registration endpoint
- Start the runner with the correct parameters: url, personal access token and labels
The first step may be large, but it is not complicated: we can get a list of repositories and, using the REST API, create a webhook on each – a sketch of that call follows below. However, in order to create a webhook we need to know its destination, and that second item in the list is the topic of the next subsection – where do we send our payload?
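For the record, creating the webhook itself looks roughly like this (`WEBHOOK_URL` and `WEBHOOK_SECRET` stand in for values we don't have yet):

```typescript
// Create a repository webhook that delivers workflow_job events.
// Sketch only: WEBHOOK_URL and WEBHOOK_SECRET are illustrative placeholders.
async function createWebhook(owner: string, repo: string): Promise<void> {
  const response = await fetch(`https://api.github.com/repos/${owner}/${repo}/hooks`, {
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "web", // the only valid name for repository webhooks
      active: true,
      events: ["workflow_job"],
      config: {
        url: process.env.WEBHOOK_URL, // the endpoint we still need to create
        content_type: "json",
        secret: process.env.WEBHOOK_SECRET, // used to sign deliveries
      },
    }),
  });
  if (!response.ok) throw new Error(`Webhook creation failed: ${response.status}`);
}
```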
Cloudflare to the rescue
As is well established by now, Hashi@Home is deployed … at home. Like, in my private network here in my home office. It is behind a fibre connection with no public domain resolution – I can't just tell Github "send your payload to my place"… that's like telling someone "take this to my buddy in New York".
So, I need now some way of exposing the services running on Hashi@Home to the wider internet, by making them resolvable via DNS. There are a few common tricks that folks have been known to play when exposing their private network to the internet, but I chose to go with Cloudflare4 5.
With a Cloudflare account, I can use the Developer Platform and Zero Trust products to6:
- register a DNS name on a domain that I own7
- attach a worker to the route
- configure a KV store to cache data for the workers
- create a tunnel and deploy it into the Nomad cluster at home
- protect the tunnel with access policies requiring a service token
This safely solves problem 2 above: webhook payloads are delivered to an endpoint which authenticates them and triggers a function which, based on some business logic, dispatches the job to start the runner, passing the relevant data from the payload to the Nomad Dispatch API endpoint.
Below is a schematic diagram of the flow of events:
```mermaid
---
title: Github Runners on Demand Workflow, showing Github Actions, Repo triggers, Cloudflare and Nomad (in Hashi@Home)
flowchartConfig:
  width: 100%
config:
  theme: base
  themeVariables:
    background: "#d8dee9"
    fontFamily: "Roboto Mono"
    primaryColor: "#88c0d0"
    secondaryColor: "#81a1c1"
    tertiaryColor: "#ebcb8b"
    primaryTextColor: "#2e3440"
    secondaryTextColor: "#3b4252"
    primaryBorderColor: "#7C0000"
    lineColor: "#F8B229"
    fontSize: "20px"
---
flowchart TB
    commit(Commit)
    repo(Git Repo)
    webhook(Repo Webhook)
    action_workflow(Github Action workflow)
    action_job(Github Action job)
    runner(Github Self-Hosted runner)
    listener(Webhook receiver)
    worker([Cloudflare Worker])
    tunnel(Cloudflare tunnel)
    cf_app(Cloudflare Application)
    nomad_job(Nomad Parametrized job)
    nomad(Nomad Dispatch API)

    %% triggers %%
    subgraph Triggers["fa:fa-github Triggers"]
        direction LR
        commit-- Push update to -->repo
        commit-- Trigger -->action_workflow
    end

    %% webhook flow %%
    subgraph Webhooks
        direction TB
        repo-- Emit -->webhook
        subgraph Cloudflare["fab:fa-cloudflare"]
            direction LR
            webhook-- POST -->listener
            listener-- Route -->cf_app
            zta-- Protect -->cf_app
            cf_app-- Invoke -->worker
            worker-- POST -->tunnel
        end
        subgraph Nomad
            direction LR
            tunnel-- Expose -->nomad
            nomad-- Schedule -->nomad_job
            nomad_job-- Instantiates -->runner
        end
    end

    %% actions flow %%
    subgraph Actions["fa:fa-github Actions"]
        action_workflow-- Schedules -->action_job
        action_job-- Requests -->runner
        runner-- Executes -->action_job
        action_job-- Notifies Status -->commit
    end
```
So, technically we have solved the problem 🎉 all we have to do now is implement it 🤔. Time to break out the ol’ toolbox 🧰!
Implementation
This section deals with the technical implementation of the solution to the problem described above. In this case, I’m implementing it myself, but often implementations are a team effort, with several engineers collaborating to create and integrate several parts.
Architecture
The first thing you need is agreement on the architecture of the thing you're building, often referred to as a model. Representing architecture as a diagram makes it a bit easier for the team to know who is responsible for what, and how the pieces integrate. However, if we adhere to the old adage that "a picture is worth a thousand words", we risk losing much of the rigour we gain with code by representing models as diagrams. Pictures may be interpreted differently by different people, or lack a means to express precise details of how parts are supposed to be built. What is more, a given architecture will be deployed differently in different environments.
You also don't want to overwhelm whoever is reading the design with irrelevant details of parts that don't concern them, so a hierarchical visualisation of the architecture would be the best thing.
We're going to use the C4 model to visualise how the workflow shown above is implemented. This is a four-level hierarchy showing just enough detail at each level to be able to collaborate effectively, and it is becoming widely used as of the time of writing.
Below we have a container diagram:

```mermaid
C4Component
    Person(user, "User", "Developer pushing code")
    Container_Boundary(cloudflare, "Cloudflare", "Edge Platform") {
        Component(zta, "ZTA", "Zero Trust Access", "Mechanism for authorising connections")
        Component(rbac, "RBAC", "", "Role-based access controls for expressing access policies")
        Component(worker, "Worker", "NodeJS", "Cloudflare worker")
        Component(kv, "KV", "", "Cloudflare Worker KV store")
        Component(domain, "Domain", "DNS", "Resolveable domain on which to expose services")
        Component(tunnel, "Tunnel", "", "Cloudflare Access Tunnel")
    }
    Container_Boundary(github, "Github", "SCM Platform") {
        Component(repo, "Repository", "Git", "Hosts and tracks changes to source code")
        Component(webhook, "Webhook", "REST", "Delivers payload based on predefined triggers")
        Component(actions, "Actions", "REST", "CI/CD Workflows as defined in repo")
    }
    Container_Boundary(hah, "Hashi@Home", "Compute services deployed at Home") {
        Component(nomadServer, "Nomad API", "REST", "Nomad endpoint")
        Component(nomadJob, "Nomad Runner Parametrised Job", "bash", "Nomad job to run runner registration and execution script")
        Component(nomadExec, "Nomad", "Nomad Driver", "Nomad Task Executor")
        Component(tunnelConnector, "Tunnel Connector", "cloudflared", "Cloudflare tunnel connector")
    }
    Rel(user, repo, "Pushes Code")
    Rel(repo, webhook, "Triggers webhook delivery")
    Rel(repo, actions, "Trigger workflow job")
    Rel(repo, domain, "Delivers Payload")
    Rel(rbac, zta, "Define Policy")
    Rel(nomadServer, tunnelConnector, "Run")
    Rel(tunnelConnector, tunnel, "Expose")
    Rel(worker, kv, "Lookup Data")
    Rel(domain, worker, "Triggers")
    Rel(worker, nomadServer, "Dispatch payload")
    Rel(nomadServer, nomadJob, "Trigger job")
    Rel(nomadServer, nomadExec, "Provision job runtime")
    Rel(nomadExec, actions, "Register runner")
```
This diagram shows the architecture up to the component level as well as the container boundaries for the software systems we will need to work with8 9.
For an actual deployment, let's create a more detailed diagram:

```mermaid
C4Deployment
    Deployment_Node(github, "Github", "SCM Platform") {
        Container(webhook, "Webhook", "REST", "Delivers workflow job payloads")
    }
    Deployment_Node(cloudflare, "Cloudflare", "Cloudflare Edge") {
        Container(dns_name, "DNS Name", "DNS", "Resolvable name for the webhook endpoint")
        Container(application, "Application", "Cloudflare Access", "Edge application protecting the endpoint")
        Container(cloudflare_tunnel, "Tunnel", "Cloudflare Tunnel", "Tunnel exposing private services")
        Deployment_Node(workers, "Workers", "Cloudflare Workers") {
            Container(worker_script, "Worker Script", "NodeJS", "Worker script to handle incoming webhook payloads")
        }
    }
    Deployment_Node(hah, "Hashi@Home", "Hashi@Home Services") {
        Container(nomad_server, "Nomad Server", "Member of Nomad Server Raft consensus", "Provides Nomad API")
        Deployment_Node(nomad_cluster, "Nomad Cluster", "Nomad Cluster", "Nomad execution Environment") {
            Container(tunnel_connector, "Tunnel Connector", "Nomad Job", "Provides connectivity to Cloudflare Edge")
            Container(parametrised_runner, "Parametrised Runner", "Parametrised Nomad Job", "Templated Nomad Job which can be instantiated.")
            Container(runner_instance, "Runner Instance", "Dispatched Nomad Job", "Specific instance of Github Runner with relevant payload")
        }
        Rel(webhook, dns_name, "Look up endpoint", "HTTPS")
        Rel(dns_name, application, "Route payload to Edge application", "")
        Rel(application, worker_script, "Invoke worker script", "")
        Rel(worker_script, parametrised_runner, "Dispatch with payload", "HTTPS")
        Rel(tunnel_connector, cloudflare_tunnel, "Expose", "cloudflared")
        Rel(parametrised_runner, runner_instance, "Execute", "Nomad Agent")
    }
```
I leave it to the reader to decide which of these diagrams is the more enlightening.
Deployment
Now it finally comes time to implement the solution as code. This was done with Terraform10, using the Github, Cloudflare, Nomad and other providers to create the relevant resources.
I will go into the implementation perhaps in a later post, but for the curious take a look at the Terraform module repo.
This module essentially creates the functional parts (github webhooks, cloudflare tunnel, cloudflare worker, nomad job), as well as the access policies necessary to protect the tunnel and allow access only to authorised calls.
With a `terraform apply` we deploy it all and wait for webhooks to send data from Github to start runners.
Results
With the solution deployed, we can see that there are webhooks registered11 on all my personal repos, set to trigger for specific events. The target of each hook is indeed the repository itself, solving the issue we had before of having to register large numbers of runners persistently, and each delivery carries a payload signature (`X-Hub-Signature`), which is verified in the worker before the payload is decoded and passed on to the Cloudflare application connected to the Cloudflare tunnel exposing the Nomad API.
When the payload hits the Cloudflare Worker it is verified, and if the event matches the right condition, a request is sent to the Nomad endpoint. A minimal sketch of the worker logic follows; apart from the `access_client_id` and `access_client_secret` secrets, the names (job name, variables, label) are illustrative:
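```typescript
// Sketch of the worker, not the real implementation: the job name
// ("github-runner"), webhook_secret and nomad_addr are illustrative.
export interface Env {
  webhook_secret: string; // shared secret used to sign webhook deliveries
  nomad_addr: string; // hostname of the tunnel exposing the Nomad API
  access_client_id: string; // Cloudflare Access service token id
  access_client_secret: string;
}

// Verify Github's HMAC SHA-256 payload signature.
async function verifySignature(secret: string, body: string, signature: string): Promise<boolean> {
  const key = await crypto.subtle.importKey(
    "raw",
    new TextEncoder().encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  );
  const mac = await crypto.subtle.sign("HMAC", key, new TextEncoder().encode(body));
  const hex = [...new Uint8Array(mac)].map((b) => b.toString(16).padStart(2, "0")).join("");
  return `sha256=${hex}` === signature;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const body = await request.text();
    const signature = request.headers.get("X-Hub-Signature-256") ?? "";
    if (!(await verifySignature(env.webhook_secret, body, signature))) {
      return new Response("invalid signature", { status: 401 });
    }

    // Only dispatch when a job is queued and asks for a self-hosted runner.
    const event = JSON.parse(body);
    if (event.action !== "queued" || !event.workflow_job?.labels?.includes("self-hosted")) {
      return new Response("ignored", { status: 200 });
    }

    // Dispatch the parametrised Nomad job through the Access-protected tunnel.
    const dispatch = await fetch(`https://${env.nomad_addr}/v1/job/github-runner/dispatch`, {
      method: "POST",
      headers: {
        "CF-Access-Client-Id": env.access_client_id,
        "CF-Access-Client-Secret": env.access_client_secret,
      },
      body: JSON.stringify({
        Payload: btoa(body), // Nomad dispatch payloads are base64-encoded
        Meta: { repository: event.repository.full_name },
      }),
    });
    return new Response(null, { status: dispatch.status });
  },
};
```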
Here, the `access_client_id` and `access_client_secret` are registered as secrets in the Cloudflare worker, so that we can authenticate the worker to the Cloudflare Access tunnel.
The `Payload` is the base64-encoded webhook payload sent to the Nomad dispatch endpoint. Also included in the body is a `Meta` key, which provides variables that we can pass to the job via the Nomad runtime environment (they surface in the task as `NOMAD_META_*` variables).
Discussion
For now you'll have to trust me, dear reader, that this whole shebang works; there's a new post coming soon going into depth on the implementation. I hope I've tickled your fancy, but I must confess that my goal here was not just to write a story about how I managed to run Github Actions at home.
Communicating design across the technical boundary
Designing and describing the architecture of a solution, and then implementing it are usually done by different people in a professional context. The former is done by a solutions architect, the latter by an engineer; it is quite rare that one has the luxury of being able to design and implement a solution in the actual workplace12.
This leaves the door open to misinterpretation and fuzziness about what is actually meant by a given diagram or description, creating at best a level of frustration and at worst a bunch of finger-pointing when things go wrong. It’s probably just another way that “DevOps” was supposed to save us, only in this case we’ve moved up the maturity path all the way to the design phase.
I’ve heard a bit about how the C4 model is supposed to be able to communicate effectively across technical boundaries and wanted to give it a real try in a nontrivial scenario I have control over.
Everything as code?
The second muscle I wanted to exercise was diagrams as code. Like anyone in an engineering-adjacent role, I've drawn a few diagrams in my day, and I've usually tried hard – especially when I needed to communicate something via them to someone else – to adopt good design principles. This often means applying consistent styling to diagrams, which can be very time consuming. What is more, there are often different audiences which need to be told the story in the diagram, and they will be interested in different details.
The quality and correctness of these diagrams is of the utmost importance, so the ability to separate the content from the view makes it possible to significantly reduce the amount of work that needs to be done to produce consistent visualisations. This ability is afforded by writing the diagrams as "source code" and then "compiling" them into a given visualisation using an appropriate tool. This is generally referred to as diagrams as code, and first appeared on the Thoughtworks Technology Radar in 2021:
… There are benefits to using these tools over the heavier alternatives, including easy version control and the ability to generate the DSLs from many sources…
I wanted something that I could embed in this very document, so that the diagram was part of the document, which led me to use MermaidJS. A similar approach would have been to use diagrams.net in a Jupyter notebook… but this is a Jekyll blog so, nah.
The diagrams generated by Mermaid, especially the C4 diagrams, are subpar, but I must admit that having the diagram readable in the source code and readable in the docs – and actually in the same place – is quite convincing. What the diagram loses in expressiveness when written as code, it gains in explicitness and semantic precision. I am looking forward to seeing how far I can push this with Asciidoctor and Structurizr, and comparing it with what (spoiler alert) can be done with Terraform graph.
Reducing rework
Taken individually, this may seem like sterile navel gazing, but if we can keep the design, code and docs in a single place, and communicate effectively to the target audience, I have a suspicion that this will help significantly in getting things done right the first time, without resorting to interminable meetings and begrudging rework.
References and Footnotes
- The goal here was more than just having a working solution. I used a few new tools which I wanted to practice, including mermaidjs for the diagrams in this article, the c4 model for representing architecture, as well as the actual technical services in Cloudflare, Nomad and Github. ↩
- See the Github docs for a full description of Github Webhooks. ↩
- The reasons for this basically came down to: I already have an account, I can do it for free, and I wanted to learn the Cloudflare One zero trust mechanism. ↩
- I have no affiliation with Cloudflare other than the same consumer agreement along with all the other shmoes. I pays my money, I gets the goods. ↩
- The combination of all of these services was really the main reason that I chose the platform. For more information on Cloudflare products, see the docs. ↩
- I chose the `brucellino.dev` domain because reasons. ↩
- Container, Component and Software System here are C4 terms. ↩
- The diagrams here were made with mermaidJS. If I'm being brutally honest, the C4 implementation is not great. Granted, it's still not fully implemented, but if someone had to give me this picture, I'd be more confused than I started out. ↩
- This was built in 2023 when nobody would have been surprised to hear that a platform service was implemented with Terraform. We'll see what 2024 has in store and perhaps the golden path here will deviate. ↩
- These are registered with a `:webhook_id` and can be seen at a given `:repo`, e.g. `https://github.com/<username>/:repo/settings/hook/:webhook_id`. ↩
- Not to mention the feature requirements gathering – this is why I included the user stories in the article. ↩