- What do I want from my runners
- Problem Statement
- Implementation
- Results
- Discussion
- References and Footnotes
What do I want from my runners
In the previous post, I considered using Hashi@Home as a place to run Self-Hosted GitHub runners. In this article, I wanted to go into detail about how I actually went about designing and implementing a solution1
Before discussing problems to address, perhaps it’s better to spend a few words describing how I would like my environment to behave when I’m done.
My story looks a bit like this:
As a po’ boy with a Nomad cluster
I want just enough self-hosted GitHub runners
So that I don’t have to pay for them in terms of time or money
Let’s take a closer look at the necessary features.
Features
We can define three features. Runners:
- Are On Demand
- Scale to Zero
- Are ephemeral
Using gherkin syntax, I'll try to be a bit more explicit about what is meant by these features; the scenarios below are indicative sketches rather than a real test suite.
First off, runners are on demand:
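```gherkin
# Rough sketch of the desired behaviour
Feature: Runners are created on demand
  Scenario: A workflow job requests a self-hosted runner
    Given a workflow job is queued on one of my repositories
    And the job requests a self-hosted runner
    When the webhook payload is delivered
    Then a runner is registered with that repository
```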
Furthermore, I want the runners to scale to zero:
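```gherkin
# Rough sketch of the desired behaviour
Feature: Runners scale to zero
  Scenario: No workflow jobs are queued
    Given no workflow jobs are queued on any of my repositories
    When I inspect the Nomad cluster
    Then no runners are present
    And I am paying for nothing
```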
I also want runners to be used once and then discarded, so that I don’t have to manually clean them up afterwards:
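```gherkin
# Rough sketch of the desired behaviour
Feature: Runners are ephemeral
  Scenario: A runner completes its job
    Given a runner has been created for a workflow job
    When the job completes
    Then the runner is deregistered from the repository
    And the runner is discarded without manual cleanup
```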
On a finer note, I also want runners to be scheduled only when the relevant runner label is present; when the relevant label is not present, we do not create a runner. I see this as being a part of the first feature:
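```gherkin
# Rough sketch of the desired behaviour
Feature: Runners are created only for matching labels
  Scenario: A workflow job does not request my runner label
    Given a workflow job is queued on one of my repositories
    And the job does not carry the relevant runner label
    When the webhook payload is delivered
    Then no runner is created
```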
Problem Statement
There were two main problems to address:
- Personal repository runners need a token scoped to a specific repository; this would create an unsustainable number of runners if they were provisioned statically.
- In order to provision them on-demand, we need to react to webhooks, which in turn requires that we expose an endpoint on a domain that GitHub can `POST` to.
Registering runners
The first is quite simple to address.
If we register a webhook2 for workflow job events on a repository, each delivery will contain the repository name as part of the payload3.
In order to register the runner with the repo, we first need a runner token, which can be obtained via an authenticated HTTP REST call to the registration endpoint: `/repos/:owner/:repo/actions/runners/registration-token`.
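As a rough sketch, that call looks something like this (assuming a token with the right scope sits in a `GITHUB_TOKEN` environment variable; the function name is mine):

```typescript
// Request a short-lived registration token for a repository's runners.
// Sketch only: assumes GITHUB_TOKEN holds a token with repo admin scope.
async function getRegistrationToken(owner: string, repo: string): Promise<string> {
  const response = await fetch(
    `https://api.github.com/repos/${owner}/${repo}/actions/runners/registration-token`,
    {
      method: "POST",
      headers: {
        Accept: "application/vnd.github+json",
        Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      },
    },
  );
  if (!response.ok) throw new Error(`Token request failed: ${response.status}`);
  const body = (await response.json()) as { token: string; expires_at: string };
  return body.token;
}
```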
However, in order to receive the webhook payload, we need somewhere to actually execute this call. True, we could probably set up a small action in GitHub itself to generate a new runner registration token and then pass it on to whatever will create the self-hosted runner, but that would just be kicking the can down the road – we'd eventually need a place to run a runner anyway!
So, now we start touching problem number 2 above. We need to:
- Register a webhook to send payload data to an endpoint
- Receive and perform computation on the payload, extracting (amongst other things) the repository name
- Obtain a Github token and call the runner registration endpoint
- Start the runner with the correct parameters: url, personal access token and labels
The first step may be large, but it is not complicated: we can get a list of repositories and, using the REST API, create a webhook on each – a sketch of that call follows below. However, in order to create a webhook we need to know its destination, and that second item in the list is the topic of the next subsection – where do we send our payload?
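For the record, creating the webhook itself looks roughly like this (`WEBHOOK_URL` and `WEBHOOK_SECRET` stand in for values we don't have yet):

```typescript
// Create a repository webhook that delivers workflow_job events.
// Sketch only: WEBHOOK_URL and WEBHOOK_SECRET are illustrative placeholders.
async function createWebhook(owner: string, repo: string): Promise<void> {
  const response = await fetch(`https://api.github.com/repos/${owner}/${repo}/hooks`, {
    method: "POST",
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${process.env.GITHUB_TOKEN}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      name: "web", // the only valid name for repository webhooks
      active: true,
      events: ["workflow_job"],
      config: {
        url: process.env.WEBHOOK_URL, // the endpoint we still need to create
        content_type: "json",
        secret: process.env.WEBHOOK_SECRET, // used to sign deliveries
      },
    }),
  });
  if (!response.ok) throw new Error(`Webhook creation failed: ${response.status}`);
}
```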
Cloudflare to the rescue
As is well established by now, Hashi@Home is deployed … at home. Like, in my private network here in my home office. It is behind a fibre connection with no public domain resolution – I can't just tell Github "send your payload to my place"… that's like telling someone "take this to my buddy in New York".
So, I need now some way of exposing the services running on Hashi@Home to the wider internet, by making them resolvable via DNS. There are a few common tricks that folks have been known to play when exposing their private network to the internet, but I chose to go with Cloudflare4 5.
With a Cloudflare account, I can use the Developer Platform and Zero Trust products to6:
- register a DNS name on a domain that I own7
- attach a worker to the route
- configure a KV store to cache data for the workers
- create a tunnel and deploy it into the Nomad cluster at home
- protect the tunnel with access policies requiring a service token
This safely solves problem 2 above: webhook payloads are delivered to an endpoint which authenticates them and triggers a function which, based on some business logic, dispatches the job to start the runner, passing the relevant data from the payload to the Nomad Dispatch API endpoint.
Below is a schematic diagram of the flow of events:
```mermaid
---
title: Github Runners on Demand Workflow, showing Github Actions, Repo triggers, Cloudflare and Nomad (in Hashi@Home)
flowchartConfig:
  width: 100%
config:
  theme: base
  themeVariables:
    background: "#d8dee9"
    fontFamily: "Roboto Mono"
    primaryColor: "#88c0d0"
    secondaryColor: "#81a1c1"
    tertiaryColor: "#ebcb8b"
    primaryTextColor: "#2e3440"
    secondaryTextColor: "#3b4252"
    primaryBorderColor: "#7C0000"
    lineColor: "#F8B229"
    fontSize: "20px"
---
flowchart TB
    commit(Commit)
    repo(Git Repo)
    webhook(Repo Webhook)
    action_workflow(Github Action workflow)
    action_job(Github Action job)
    runner(Github Self-Hosted runner)
    listener(Webhook receiver)
    worker([Cloudflare Worker])
    tunnel(Cloudflare tunnel)
    cf_app(Cloudflare Application)
    nomad_job(Nomad Parametrized job)
    nomad(Nomad Dispatch API)

    %% triggers %%
    subgraph Triggers["fa:fa-github Triggers"]
        direction LR
        commit-- Push update to -->repo
        commit-- Trigger -->action_workflow
    end

    %% webhook flow %%
    subgraph Webhooks
        direction TB
        repo-- Emit -->webhook
        subgraph Cloudflare["fab:fa-cloudflare"]
            direction LR
            webhook-- POST -->listener
            listener-- Route -->cf_app
            zta-- Protect -->cf_app
            cf_app-- Invoke -->worker
            worker-- POST -->tunnel
        end
        subgraph Nomad
            direction LR
            tunnel-- Expose -->nomad
            nomad-- Schedule -->nomad_job
            nomad_job-- Instantiates -->runner
        end
    end

    %% actions flow %%
    subgraph Actions["fa:fa-github Actions"]
        action_workflow-- Schedules -->action_job
        action_job-- Requests -->runner
        runner-- Executes -->action_job
        action_job-- Notifies Status -->commit
    end
```
So, technically we have solved the problem 🎉 all we have to do now is implement it 🤔. Time to break out the ol’ toolbox 🧰!
Implementation
This section deals with the technical implementation of the solution to the problem described above. In this case, I’m implementing it myself, but often implementations are a team effort, with several engineers collaborating to create and integrate several parts.
Architecture
The first thing you need is agreement on the architecture of the thing you're building, often referred to as a model. Representing architecture as a diagram makes it a bit easier for the team to know who is responsible for what, and how the pieces integrate. However, if we adhere to the old adage that "a picture is worth a thousand words", we risk losing much of the rigour we gain with code by representing models as diagrams. Pictures may be interpreted differently by different people, or lack a means to express precise details of how parts are supposed to be built. What is more, a given architecture will be deployed differently in different environments.
You also don't want to overwhelm whoever is reading the design with irrelevant details of parts that don't concern them, so a hierarchical visualisation of the architecture would be the best thing.
We're going to use the C4 model to visualise how the workflow shown above is implemented. This is a four-level hierarchy showing just enough detail at each level to be able to collaborate effectively, and it is becoming widely used as of the time of writing.
Below we have a container diagram:

```mermaid
C4Component
    Person(user, "User", "Developer pushing code")
    Container_Boundary(cloudflare, "Cloudflare", "Edge Platform") {
        Component(zta, "ZTA", "Zero Trust Access", "Mechanism for authorising connections")
        Component(rbac, "RBAC", "", "Role-based access controls for expressing access policies")
        Component(worker, "Worker", "NodeJS", "Cloudflare worker")
        Component(kv, "KV", "", "Cloudflare Worker KV store")
        Component(domain, "Domain", "DNS", "Resolveable domain on which to expose services")
        Component(tunnel, "Tunnel", "", "Cloudflare Access Tunnel")
    }
    Container_Boundary(github, "Github", "SCM Platform") {
        Component(repo, "Repository", "Git", "Hosts and tracks changes to source code")
        Component(webhook, "Webhook", "REST", "Delivers payload based on predefined triggers")
        Component(actions, "Actions", "REST", "CI/CD Workflows as defined in repo")
    }
    Container_Boundary(hah, "Hashi@Home", "Compute services deployed at Home") {
        Component(nomadServer, "Nomad API", "REST", "Nomad endpoint")
        Component(nomadJob, "Nomad Runner Parametrised Job", "bash", "Nomad job to run runner registration and execution script")
        Component(nomadExec, "Nomad", "Nomad Driver", "Nomad Task Executor")
        Component(tunnelConnector, "Tunnel Connector", "cloudflared", "Cloudflare tunnel connector")
    }
    Rel(user, repo, "Pushes Code")
    Rel(repo, webhook, "Triggers webhook delivery")
    Rel(repo, actions, "Trigger workflow job")
    Rel(repo, domain, "Delivers Payload")
    Rel(rbac, zta, "Define Policy")
    Rel(nomadServer, tunnelConnector, "Run")
    Rel(tunnelConnector, tunnel, "Expose")
    Rel(worker, kv, "Lookup Data")
    Rel(domain, worker, "Triggers")
    Rel(worker, nomadServer, "Dispatch payload")
    Rel(nomadServer, nomadJob, "Trigger job")
    Rel(nomadServer, nomadExec, "Provision job runtime")
    Rel(nomadExec, actions, "Register runner")
```
This diagram shows the architecture up to the component level as well as the container boundaries for the software systems we will need to work with8 9.
For an actual deployment, let's create a more detailed diagram:

```mermaid
C4Deployment
    Deployment_Node(github, "Github", "SCM Platform") {
        Container(webhook, "Webhook", "REST", "Delivers workflow job payloads")
    }
    Deployment_Node(cloudflare, "Cloudflare", "Cloudflare Edge") {
        Container(dns_name, "DNS Name", "DNS", "Resolvable name for the webhook endpoint")
        Container(application, "Application", "Cloudflare Access", "Edge application protecting the endpoint")
        Container(cloudflare_tunnel, "Tunnel", "Cloudflare Tunnel", "Tunnel exposing private services")
        Deployment_Node(workers, "Workers", "Cloudflare Workers") {
            Container(worker_script, "Worker Script", "NodeJS", "Worker script to handle incoming webhook payloads")
        }
    }
    Deployment_Node(hah, "Hashi@Home", "Hashi@Home Services") {
        Container(nomad_server, "Nomad Server", "Member of Nomad Server Raft consensus", "Provides Nomad API")
        Deployment_Node(nomad_cluster, "Nomad Cluster", "Nomad Cluster", "Nomad execution Environment") {
            Container(tunnel_connector, "Tunnel Connector", "Nomad Job", "Provides connectivity to Cloudflare Edge")
            Container(parametrised_runner, "Parametrised Runner", "Parametrised Nomad Job", "Templated Nomad Job which can be instantiated.")
            Container(runner_instance, "Runner Instance", "Dispatched Nomad Job", "Specific instance of Github Runner with relevant payload")
        }
        Rel(webhook, dns_name, "Look up endpoint", "HTTPS")
        Rel(dns_name, application, "Route payload to Edge application", "")
        Rel(application, worker_script, "Invoke worker script", "")
        Rel(worker_script, parametrised_runner, "Dispatch with payload", "HTTPS")
        Rel(tunnel_connector, cloudflare_tunnel, "Expose", "cloudflared")
        Rel(parametrised_runner, runner_instance, "Execute", "Nomad Agent")
    }
```
I leave it to the reader to decide which of these diagrams is the more enlightening.
Deployment
Now it finally comes time to implement the solution as code. This was done with Terraform10, using the Github, Cloudflare, Nomad and other providers to create the relevant resources.
I will go into the implementation perhaps in a later post, but for the curious take a look at the Terraform module repo.
This module essentially creates the functional parts (github webhooks, cloudflare tunnel, cloudflare worker, nomad job), as well as the access policies necessary to protect the tunnel and allow access only to authorised calls.
With a `terraform apply` we deploy it all and wait for webhooks to send data from Github to start runners.
Results
With the solution deployed, we can see that there are webhooks registered11 on all my personal repos, set to trigger for specific events. The target of each hook is indeed the repository itself, solving the issue we had before of having to register large numbers of runners persistently, and each delivery carries a payload signature (`X-Hub-Signature`), which is verified in the worker before the payload is decoded and passed on to the Cloudflare application connected to the Cloudflare tunnel exposing the Nomad API.
When the payload hits the Cloudflare Worker it is verified, and if the event matches the right condition, a request is sent to the Nomad endpoint. A minimal sketch of the worker logic follows; apart from the `access_client_id` and `access_client_secret` secrets, the names (job name, variables, label) are illustrative:
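```typescript
// Sketch of the worker, not the real implementation: the job name
// ("github-runner"), webhook_secret and nomad_addr are illustrative.
export interface Env {
  webhook_secret: string; // shared secret used to sign webhook deliveries
  nomad_addr: string; // hostname of the tunnel exposing the Nomad API
  access_client_id: string; // Cloudflare Access service token id
  access_client_secret: string;
}

// Verify Github's HMAC SHA-256 payload signature.
async function verifySignature(secret: string, body: string, signature: string): Promise<boolean> {
  const key = await crypto.subtle.importKey(
    "raw",
    new TextEncoder().encode(secret),
    { name: "HMAC", hash: "SHA-256" },
    false,
    ["sign"],
  );
  const mac = await crypto.subtle.sign("HMAC", key, new TextEncoder().encode(body));
  const hex = [...new Uint8Array(mac)].map((b) => b.toString(16).padStart(2, "0")).join("");
  return `sha256=${hex}` === signature;
}

export default {
  async fetch(request: Request, env: Env): Promise<Response> {
    const body = await request.text();
    const signature = request.headers.get("X-Hub-Signature-256") ?? "";
    if (!(await verifySignature(env.webhook_secret, body, signature))) {
      return new Response("invalid signature", { status: 401 });
    }

    // Only dispatch when a job is queued and asks for a self-hosted runner.
    const event = JSON.parse(body);
    if (event.action !== "queued" || !event.workflow_job?.labels?.includes("self-hosted")) {
      return new Response("ignored", { status: 200 });
    }

    // Dispatch the parametrised Nomad job through the Access-protected tunnel.
    const dispatch = await fetch(`https://${env.nomad_addr}/v1/job/github-runner/dispatch`, {
      method: "POST",
      headers: {
        "CF-Access-Client-Id": env.access_client_id,
        "CF-Access-Client-Secret": env.access_client_secret,
      },
      body: JSON.stringify({
        Payload: btoa(body), // Nomad dispatch payloads are base64-encoded
        Meta: { repository: event.repository.full_name },
      }),
    });
    return new Response(null, { status: dispatch.status });
  },
};
```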
Here, the `access_client_id` and `access_client_secret` are registered as secrets in the Cloudflare worker, so that we can authenticate the worker to the Cloudflare Access tunnel.
The `Payload` is the base64-encoded webhook payload sent to the Nomad dispatch endpoint. Also included in the body is a `Meta` key, which provides variables that we can pass to the job via the Nomad runtime environment (they surface in the task as `NOMAD_META_*` variables).
Discussion
For now you'll have to trust me, dear reader, that this whole shebang works; there's a new post coming soon going into depth on the implementation. I hope I've tickled your fancy, but I must confess that my goal here was not just to write a story about how I managed to run Github Actions at home.
Communicating design across the technical boundary
Designing and describing the architecture of a solution, and then implementing it are usually done by different people in a professional context. The former is done by a solutions architect, the latter by an engineer; it is quite rare that one has the luxury of being able to design and implement a solution in the actual workplace12.
This leaves the door open to misinterpretation and fuzziness about what is actually meant by a given diagram or description, creating at best a level of frustration and at worst a bunch of finger-pointing when things go wrong. It’s probably just another way that “DevOps” was supposed to save us, only in this case we’ve moved up the maturity path all the way to the design phase.
I’ve heard a bit about how the C4 model is supposed to be able to communicate effectively across technical boundaries and wanted to give it a real try in a nontrivial scenario I have control over.
Everything as code?
The second muscle I wanted to exercise was diagrams as code. Like anyone in an engineering-adjacent role, I've drawn a few diagrams in my day, and I've usually tried hard – especially when I needed to communicate something via them to someone else – to adopt good design principles. This often means applying consistent styling to diagrams, which can be very time consuming. What is more, there are often different audiences which need to be told the story in the diagram, and they will be interested in different details.
The quality and correctness of these diagrams is of the utmost importance, so the ability to separate the content from the view makes it possible to significantly reduce the amount of work that needs to be done to produce consistent visualisations. This ability is afforded by writing the diagrams as "source code" and then "compiling" them into a given visualisation using an appropriate tool. This is generally referred to as diagrams as code, and first appeared on the Thoughtworks Technology Radar in 2021:
… There are benefits to using these tools over the heavier alternatives, including easy version control and the ability to generate the DSLs from many sources…
I wanted something that I could embed in this very document, so that the diagram was part of the document, which led me to use MermaidJS. A similar approach would have been to use diagrams.net in a Jupyter notebook… but this is a Jekyll blog so, nah.
The diagrams generated by Mermaid, especially the C4 diagrams, are subpar, but I must admit that having the diagram readable in the source code and readable in the docs – and actually in the same place – is quite convincing. What the diagram loses in expressiveness when written as code, it gains in explicitness and semantic precision. I am looking forward to seeing how far I can push this with Asciidoctor and Structurizr, and comparing it with what (spoiler alert) can be done with Terraform graph.
Reducing rework
Taken individually, this may seem like sterile navel gazing, but if we can keep the design, code and docs in a single place, and communicate effectively to the target audience, I have a suspicion that this will help significantly in getting things done right the first time, without resorting to interminable meetings and begrudging rework.
References and Footnotes
- The goal here was more than just having a working solution. I used a few new tools which I wanted to practice, including mermaidjs for the diagrams in this article, the c4 model for representing architecture, as well as the actual technical services in Cloudflare, Nomad and Github. ↩
- See the Github docs for a full description of Github Webhooks. ↩
- The reasons for this basically came down to: I already have an account, I can do it for free, and I wanted to learn the Cloudflare One zero trust mechanism. ↩
- I have no affiliation with Cloudflare other than the same consumer agreement along with all the other shmoes. I pays my money, I gets the goods. ↩
- The combination of all of these services was really the main reason that I chose the platform. For more information on Cloudflare products, see the docs. ↩
- I chose the `brucellino.dev` domain because reasons. ↩
- Container, Component and Software System here are C4 terms. ↩
- The diagrams here were made with mermaidJS. If I'm being brutally honest, the C4 implementation is not great. Granted, it's still not fully implemented, but if someone had to give me this picture, I'd be more confused than I started out. ↩
- This was built in 2023 when nobody would have been surprised to hear that a platform service was implemented with Terraform. We'll see what 2024 has in store and perhaps the golden path here will deviate. ↩
- These are registered with a `:webhook_id` and can be seen at a given `:repo`, e.g. `https://github.com/<username>/:repo/settings/hook/:webhook_id`. ↩
- Not to mention the feature requirements gathering – this is why I included the user stories in the article. ↩