From four to five 9s of uptime by migrating to Kubernetes

When we launched User Management along with a free tier of up to 1 million MAUs, we faced several challenges using Heroku: the lack of an SLA, limited rollout functionality, and inadequate data locality options. To address these, we migrated to Kubernetes on EKS, developing a custom platform called Terrace to streamline deployment, secret management, and automated load balancing.

User Management and Scale

In 2023 we launched WorkOS User Management with a free tier that includes 1 million monthly active users. We quickly realized that Heroku couldn't support a vision that big. Here are the four biggest challenges we faced:

  • No SLA — Heroku does not provide an SLA, making it much more difficult for us to provide one to our customers.

  • Limited rollout functionality — With a 99.99% uptime target, we needed a platform that could both quickly roll out a new version and support more sophisticated rollout strategies, such as blue/green and canary deployments.

  • Security — Heroku’s Private Spaces run within AWS VPCs, but because the platform layer is managed by Heroku, it is very difficult for us to address attack vectors ourselves. See Heroku’s most recent severe incident.

  • Fine-grained data locality — While Heroku allows apps to be deployed in multiple regions, those regions will always lag behind the availability of the underlying Infrastructure as a Service (IaaS) provider, which is AWS.

From the initial stages of the migration, we sought a platform that we could customize to fit our needs. We knew the migration would be a risky and lengthy process, and we wanted to do it only once, so for any custom pieces we needed, we wanted to be able to either reach for open-source software or build them ourselves.

We evaluated various vendors, including ECS, OpenShift, Render, and fly.io, to see if they could meet our needs. While each had promising features, they either lacked one of our key requirements or didn't offer the customization tools we wanted. Ultimately, we chose Kubernetes on EKS as the underlying platform.

Our Own Kubernetes Flavor

Our main goal with this new platform was to create a seamless, efficient way to deploy new services. We wanted the "golden path" to be fast and easy, and the experience to remain user-friendly even when engineers deviated from it.

To provide this experience to our users, we built Terrace, a Heroku-like platform on top of EKS that allowed engineers to create their own apps in TypeScript without having to learn the nooks and crannies of Kubernetes.

The team had mixed experiences with Kubernetes, and one key lesson we learned was to avoid special clusters: treat them like cattle, not pets. Clusters can break, and when they do, the remediation that follows can be painful. To address this, we aimed to use Terraform for our clusters, making it easy, fast, and automated to bring them up or down. This included both the clusters themselves and their Operators and Controllers.

With clusters up and running, we identified the functionalities in Heroku that we wanted to replicate in Terrace.

Secret Injection

Heroku has the concept of Config Vars, which are key-value pairs injected as environment variables into each app. It can also synchronize these values from third-party vendors, such as AWS SSM, Doppler, and Vault. We wanted to maintain this ability, which can be summarized in two steps:

  1. Synchronize secrets through the integration

  2. Rollout new replicas with the updated secrets

First, we decided to use the External Secrets Operator, which allows us to pull secrets from multiple vendors and inject them as Secret resources in our clusters.
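
As a rough sketch of what this looks like, here is an ExternalSecret that pulls an app's config from AWS SSM Parameter Store and materializes it as a Secret in the cluster. The store, app, and parameter names below are hypothetical, and the exact API version depends on your External Secrets Operator release:

apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: my-app-config             # hypothetical app name
spec:
  refreshInterval: 1m
  secretStoreRef:
    kind: ClusterSecretStore
    name: aws-parameter-store     # a store pointing at AWS SSM, configured separately
  target:
    name: my-app-config           # the Secret resource created in the cluster
  dataFrom:
    # Pull every key stored under this (hypothetical) parameter as a JSON object
    - extract:
        key: /my-app/production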

Next, we went with Reloader. This Kubernetes Controller watches for changes in Secret resources and triggers rolling upgrades on the associated Workloads.

The final workflow should feel familiar to engineers, as it is similar to how Heroku operates. Engineers will update the secrets in the third-party vendor, which will then sync with the Secret resources in the clusters and trigger a rolling upgrade on the corresponding workload.
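
As a sketch of how the pieces connect (the names are hypothetical), the workload consumes the synced Secret as environment variables, much like Heroku's Config Vars, and carries a Reloader annotation so that any change to that Secret triggers a new rollout:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
  annotations:
    # Reloader restarts this Deployment's Pods whenever the referenced Secret changes
    secret.reloader.stakater.com/reload: "my-app-config"
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: web
          image: my-app:latest    # hypothetical image
          envFrom:
            # Inject the synced Secret's keys as environment variables
            - secretRef:
                name: my-app-config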

Automated Load Balancing, TLS and DNS

Choosing the right Ingress Controller is a common topic when migrating to Kubernetes, since there are many different implementations, each focusing on specific use cases (curated list by The Kubernetes Authors). We ended up choosing the AWS Application Load Balancer Controller for its robust EKS integration and its automated TLS management through AWS Certificate Manager. This meant we could spin up load balancers automatically, with their TLS certificates renewed for us.

On the DNS side, we wanted our CNAME records to be created automatically and kept in sync with whatever records the load balancers had, so we leveraged the ExternalDNS controller. It detects the hosts present in the rules of each Ingress resource and creates the appropriate DNS records in the configured DNS provider.
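
To illustrate how these two controllers work together, here is a hedged sketch of an Ingress with a hypothetical hostname and service name: the ALB controller provisions a load balancer and attaches a matching ACM certificate, and ExternalDNS creates the DNS record for the host:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: my-app
  annotations:
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS": 443}]'
    # With no certificate-arn annotation, the controller can discover a matching
    # ACM certificate based on the host below
spec:
  ingressClassName: alb
  rules:
    # ExternalDNS reads this host and creates the corresponding DNS record
    # pointing at the provisioned load balancer
    - host: my-app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: my-app
                port:
                  number: 80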

Machine Provisioning

Coming from a platform like Heroku that abstracts away the underlying infrastructure, we wanted to avoid thinking about provisioning machines to run our apps. For example, if we had to scale a service to a million replicas, the platform should do the heavy lifting and give us the hardware to run it. There were two options to address this: Cluster Autoscaler and Karpenter. We decided to go with the latter due to its more complete feature set and documentation.
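
For a sense of how Karpenter is configured, here is a rough sketch of a NodePool. The constraints are illustrative only, and the schema shown is the v1beta1 API, which varies across Karpenter versions:

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      requirements:
        # Illustrative constraints on the instances Karpenter may launch
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        apiVersion: karpenter.k8s.aws/v1beta1
        kind: EC2NodeClass
        name: default
  limits:
    cpu: 1000                     # cap the total CPU the pool can provision
  disruption:
    consolidationPolicy: WhenUnderutilized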

Application Management

Although the kubectl CLI is in a very good state today, managing complex apps with it can become very difficult. We evaluated two options for managing our apps: ArgoCD and Argonaut. We decided to go with ArgoCD due to its extension capabilities and maturity as a solution.

With the necessary tools in place, we faced another important decision: should our infrastructure team manage every service, or should we make it self-serve for other teams? To decide, we considered our team sizes and our automation goals. We chose the self-serve option, but with clearly defined escape hatches. This approach would create an abstraction layer for most engineers to work with, while still allowing them to manage their own Kubernetes tasks if needed.

The Path to Self Serving

The abstraction we chose was similar to Heroku’s Apps and Processes, where each App could contain multiple Processes that would be deployed to multiple clusters. Since our codebase is mostly TypeScript, we decided to stick with the same tech stack. Enter cdk8s, a tool that allows writing Kubernetes resources in multiple languages, including TypeScript.

Here's the flow we envisioned for generating the Kubernetes manifests: engineers define their Apps and Processes in TypeScript, cdk8s synthesizes those definitions into plain Kubernetes resources, and those resources are then handed off to be applied to our clusters.

This direction gave us more flexibility for building apps than other popular solutions like Helm: instead of just templating, we had the entire Node.js ecosystem at our disposal, so customizing each app was easy. This is especially true with cdk8s, since its abstraction allows us to attach any Kubernetes resource to its Chart object.

With our abstraction on top of cdk8s done, all that was left was a way to get the generated Kubernetes resources from our apps into ArgoCD, our application manager. We had two options here: pushing the generated manifests to a Git artifact repository on every deploy, or extending ArgoCD to understand cdk8s directly. Since automation was a general theme of this migration, we went with the latter. Luckily, ArgoCD was just moving its Sidecar Plugin functionality out of beta, and we took advantage of it.

For the Sidecar Plugin to work, we had to first build a Docker image with the proper Config Management Plugin file located at /home/argocd/cmp-server/config/plugin.yaml and add it as a sidecar container of ArgoCD’s Repo Server. Here’s an example plugin.yaml file:

apiVersion: argoproj.io/v1alpha1
kind: ConfigManagementPlugin
metadata:
  name: cdk8s
spec:
  version: v1.0
  init:
    command: [sh, -ce]
    args:
      # Install your dependencies here. Caching is highly recommended, since it will
      # run this step on every ArgoCD Sync.
      - npm install
  generate:
    command: [sh, -ce]
    args:
      - |
        # Synthesize your cdk8s Chart, discarding its stdout so that only errors
        # (written to stderr) reach ArgoCD if something goes wrong.
        cdk8s synth 1> /dev/null
        # Output the generated Kubernetes resources at the end. ArgoCD will consider
        # everything emitted to stdout as part of your ArgoCD app manifest.
        cat dist/app.k8s.yaml

Now we can build a Docker image that bundles the YAML file above:

# Pin this to the Node.js version your repository needs (the exact tag is omitted here)
FROM node:<version>-alpine

WORKDIR /home/argocd/cmp-server/config/
COPY plugin.yaml .

# Follow with any additional setup needed by your repository

Then add the Docker image above as a sidecar container of ArgoCD’s Repo Server:

containers:
  ...
  # Sidecar container definition
  - command:
    - /var/run/argocd/argocd-cmp-server
    # Reference the plugin image built above (omitted here)
    image:
    name: cdk8s
    securityContext:
      runAsNonRoot: true
      runAsUser: 999
    volumeMounts:
    # The init containers will inject the executables in this mounted volume
    - mountPath: /var/run/argocd
      name: var-files
    - mountPath: /home/argocd/cmp-server/plugins
      name: plugins
    - mountPath: /tmp
      name: cmp-tmp
# The var-files and plugins volumes are already defined in the default Repo Server
# manifest; only the plugin's temporary volume needs to be added here.
volumes:
- emptyDir: {}
  name: cmp-tmp

Now you are able to create ArgoCD Application resources using the new plugin. Here’s how one looks:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  source:
    path: # path to the cdk8s app within your repository (omitted here)
    plugin:
      name: cdk8s-v1.0 # This comes from the plugin definition's name and version
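
As a quick sketch (the variable name is hypothetical), an Application can also pass environment variables to the plugin, which the init and generate commands can read with an ARGOCD_ENV_ prefix:

apiVersion: argoproj.io/v1alpha1
kind: Application
spec:
  source:
    plugin:
      name: cdk8s-v1.0
      env:
        # Available to the plugin's commands as $ARGOCD_ENV_TERRACE_ENV
        - name: TERRACE_ENV
          value: production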

For further configuration on how to leverage parameters and environment variables in the plugin, check out ArgoCD’s docs.

Migration and Results

During the migration process, we ran two different types of load tests against our apps. We started with short bursts of requests and, after confirming everything worked properly in those scenarios, we performed soak tests that mimicked our production traffic. The idea was to understand how normal operations, such as deployments, including ones that could bring the service down, would impact system reliability.

During the soak tests, we discovered some configurations that needed tweaking. More importantly, we gained the ability to fully understand what was happening throughout our system. With Heroku, we struggled with observability because most of the stack was a black box to us. Now, we can identify and understand all the failure modes in our system.

After the migration, our uptime improved significantly, from four nines to consistently achieving five nines over 7 and 30-day periods across all services. This increase is directly related to our improved ability to gain insights from our systems and fine-tune them accordingly.

Another factor enhancing our system's reliability is the speed at which we can roll out changes. On Heroku, deploying a new version could take up to 12 minutes. In Terrace, new versions are rolled out in just 2-3 minutes. This speed difference can elevate uptime from three to four nines in a 7-day window.

The Future of Terrace

Reflecting on our initial goals, the only requirement we haven't yet addressed is deploying to different data localities. However, we have the capability to achieve this easily, so it will be our focus in the near future.

Currently, we use rolling updates for rollouts. While this approach allows us to move quickly, we want to ensure that critical paths in our product remain bug-free and reliable. To achieve this, we plan to experiment with blue/green and canary deployment strategies.

Additionally, our strategy with ArgoCD’s Sidecar Plugin has some downsides, particularly in ArgoCD’s UI. The UI can feel sluggish because it goes through the plugin flow on page loads within the Application view. To address this, we plan to build a Terrace Operator, which will take a single Custom Resource containing all the App and Process definitions and create all the resources necessary for them to work efficiently.