
Disaster Recovery


Run Pipekit workflows continuously across a primary and a secondary cluster in separate regions, with manual failover when the primary region is unavailable. This pattern fits operators with a contractual recovery time objective (RTO) and operators whose business-critical workflows need a disaster recovery plan for the loss of the primary cluster. Pipekit gates and coordinates the cluster-level routing through admin-only controls and queue cleanup; the surrounding infrastructure remains your responsibility as the operator.

Overview

The pattern uses two Kubernetes clusters connected to the same Pipekit organization, each running the Pipekit Agent. Only one cluster is active at a time. Workflow submissions target the active cluster by passing its name to the --cluster-name flag on pipekit submit. Failover is a short sequence of REST calls that flip the isActive flag on each cluster, run from a CI pipeline you trigger when the primary region is unhealthy.

Figure: Failover flips the active and inactive states of the two clusters via PUT /api/users/v1/clusters/{cluster-uuid}.

What Pipekit provides

Pipekit owns three pieces of the failover that you would otherwise have to build yourself.

  • Admin-only state changes. Only organization admins can toggle a cluster between active and inactive. The check is enforced at the control plane, not at the CI layer, so a leaked CI credential that lacks admin rights cannot flip cluster state.

  • Automatic queue cleanup on reactivation. When a cluster transitions from inactive back to active, Pipekit clears any workflows submitted while the cluster was inactive. Stale submissions that were enqueued before the failover do not replay when the primary comes back online.

  • A single audited toggle. The active state is one field on one resource; flipping it is one REST call. Operators do not have to coordinate routing changes across multiple Pipekit objects.

Prerequisites

  • Two Kubernetes clusters in separate regions, each able to run Argo Workflows.

  • The Pipekit Agent installed on both clusters, following Connect a Kubernetes Cluster to Pipekit.

  • Pipekit credentials for an organization admin (only admins can toggle cluster state). Store the username and password in a Kubernetes Secret on the cluster that runs the failover workflow. The failover example authenticates with the Pipekit CLI, following the container login pattern, so a short-lived bearer token does not need to be persisted anywhere.

  • A system to run the failover steps. Any CI tool that can call the Pipekit REST API works (Azure DevOps, GitHub Actions, GitLab CI), or you can use Pipekit itself by submitting an Argo Workflow to a third, always-active cluster.

This page covers Pipekit's role in the failover. Surrounding infrastructure such as database replication, secrets sync, DNS updates, and image registry availability is your responsibility. Coordinate the Pipekit steps with your wider DR plan.

Preparing the secondary cluster

Bring the secondary cluster up in its target region and register it under the same organization as the primary. Install the Pipekit Agent and confirm the green connection indicator in the Clusters view.

Set the secondary cluster inactive immediately after the Agent reports healthy. Run pipekit update cluster <dr-name> --status=inactive (or PUT /api/users/v1/clusters/{cluster-uuid} with {"isActive": false} against the REST API). This prevents accidental submissions while the primary is the active cluster.
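As a rough sketch of the REST variant, the request could be built as follows in Python. Only the path and the isActive body come from the text above; the base URL https://pipekit.example.com and the bearer-token Authorization header are placeholders assumed for illustration, so check them against your own Pipekit API configuration. The function builds the request without sending it.

```python
import json
import urllib.request

# Hypothetical base URL; substitute your Pipekit API host.
BASE_URL = "https://pipekit.example.com"


def build_cluster_state_request(
    cluster_uuid: str, is_active: bool, token: str
) -> urllib.request.Request:
    """Build (but do not send) the PUT request that sets a cluster's isActive flag."""
    body = json.dumps({"isActive": is_active}).encode("utf-8")
    return urllib.request.Request(
        url=f"{BASE_URL}/api/users/v1/clusters/{cluster_uuid}",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # Assumption: the API accepts a bearer token in this header.
            "Authorization": f"Bearer {token}",
        },
    )


if __name__ == "__main__":
    # Dry run: show what would be sent for the secondary cluster.
    req = build_cluster_state_request("<cluster-uuid>", False, "<token>")
    print(req.get_method(), req.full_url)
    print(req.data.decode("utf-8"))
```

Passing the result to urllib.request.urlopen would send it; keeping construction separate makes the request easy to inspect in a dry run before a real failover.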

Failing over to the secondary cluster

Trigger your CI pipeline when you decide to fail over. The pipeline runs three steps, using either the Pipekit CLI or the Pipekit REST API.

  1. Mark the primary cluster inactive. Run pipekit update cluster <primary-name> --status=inactive (or PUT /api/users/v1/clusters/{primary-uuid} with {"isActive": false} against the REST API). New submissions targeted at the primary are rejected with the error "trying to start workflow on inactive cluster".

  2. Mark the secondary cluster active. Run pipekit update cluster <dr-name> --status=active (or PUT /api/users/v1/clusters/{dr-uuid} with {"isActive": true}).

  3. Switch the value passed to --cluster-name on every pipekit submit invocation to the secondary cluster's name. The --cluster-name flag is required and has no environment-variable fallback, so every submission job in your pipeline must read the cluster name from a shared variable.
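The first two steps above can be sketched as a small Python helper that produces the CLI invocations in the listed order. The function name failover_commands and the dry-run printing are illustrative choices, not part of Pipekit; the only CLI syntax used is the pipekit update cluster form quoted in the steps. The sketch assumes the CLI is already authenticated when the commands are eventually run.

```python
import shlex
from typing import List


def failover_commands(from_cluster: str, to_cluster: str) -> List[List[str]]:
    """Return the two `pipekit update cluster` invocations in the order of steps 1 and 2."""
    return [
        # Step 1: stop the cluster being failed away from accepting submissions.
        ["pipekit", "update", "cluster", from_cluster, "--status=inactive"],
        # Step 2: let the cluster being failed over to accept submissions.
        ["pipekit", "update", "cluster", to_cluster, "--status=active"],
    ]


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    for argv in failover_commands("<primary-name>", "<dr-name>"):
        print(shlex.join(argv))
```

To execute rather than print, each argv list could be handed to subprocess.run(argv, check=True) so the sequence stops if the first command fails.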

Example: Argo Workflow

The Argo Workflow below performs the failover by calling pipekit update cluster against two clusters in sequence. Submit it to a third, always-active cluster that hosts your ops workflows. That cluster must not share a region with the primary or the secondary; if it does, it goes down with them.

Store the Pipekit admin credentials in a Kubernetes Secret named pipekit-credentials on the third cluster, with the username under the username key and the password under the password key. The workflow re-authenticates each run via pipekit login, so a short-lived bearer token is never persisted. See Used within a Workflow for the underlying pattern.

Submit the workflow to the ops cluster from any environment that has the Pipekit CLI.

The same workflow handles the reverse direction when invoked with the cluster names swapped. Wrapping it in a Pipe with a webhook or schedule, or invoking it from Azure DevOps, GitHub Actions, or GitLab CI, are all variations on the same shape.

Failing back

When the primary region recovers and its Pipekit Agent reports a green indicator in the Clusters view, run the reverse sequence.

  1. Confirm the primary cluster's Agent is connected.

  2. Run pipekit update cluster <primary-name> --status=active, then pipekit update cluster <dr-name> --status=inactive. Reactivating the primary triggers Pipekit's queue cleanup: any workflow, action, or lint messages that were queued for the primary before the failover are dropped, not replayed.

  3. Switch --cluster-name back to the primary cluster's name in every submission job.
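Because --cluster-name has to change in every submission job on both failover and failback, one option is a thin wrapper that reads the cluster name from a single shared variable. The sketch below assumes a hypothetical variable named PIPEKIT_ACTIVE_CLUSTER and a hypothetical helper submit_argv; any other arguments to pipekit submit are passed through unchanged and are not specified here.

```python
import os
from typing import List, Mapping, Sequence

# Hypothetical variable name; use whatever your CI system exposes to all jobs.
ACTIVE_CLUSTER_VAR = "PIPEKIT_ACTIVE_CLUSTER"


def submit_argv(
    extra_args: Sequence[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Build a `pipekit submit` argv whose --cluster-name comes from the shared variable."""
    cluster = env.get(ACTIVE_CLUSTER_VAR, "").strip()
    if not cluster:
        # Fail loudly rather than guess which cluster is currently active.
        raise RuntimeError(f"{ACTIVE_CLUSTER_VAR} is not set")
    return ["pipekit", "submit", "--cluster-name", cluster, *extra_args]


if __name__ == "__main__":
    # Dry run with an explicit environment instead of the real one.
    print(submit_argv([], {ACTIVE_CLUSTER_VAR: "<primary-name>"}))
```

With this shape, failing over or back means changing one variable in the CI system rather than editing each job.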

Runs that completed on the secondary cluster during the outage remain in its run history. There is no need to migrate them; they stay associated with the cluster that executed them.
