# Disaster Recovery

Run Pipekit workflows continuously across a primary and a secondary cluster in separate regions, with manual failover when the primary region is unavailable. This pattern fits operators who have a contractual recovery time objective or who run business-critical workflows that need a disaster recovery plan for the loss of the primary cluster. Pipekit gates cluster-level routing through admin-only controls and cleans up stale queue entries on reactivation; the surrounding infrastructure remains your responsibility as the operator.

## Overview

The pattern uses two Kubernetes clusters connected to the same Pipekit organization, each running the Pipekit Agent. Only one cluster is active at a time. Workflow submissions target the active cluster by passing its name to the `--cluster-name` flag on `pipekit submit`. Failover is a short sequence of REST calls that flip the `isActive` flag on each cluster, run from a CI pipeline you trigger when the primary region is unhealthy.

```mermaid
graph LR
    Operator[Operator or CI]
    CLI[Pipekit CLI]
    CP[Pipekit Control Plane]

    subgraph RegionA["Primary cluster (Region A, active)"]
        AgentA[Pipekit Agent]
        ArgoA[Argo Workflows]
    end

    subgraph RegionB["DR cluster (Region B, inactive)"]
        AgentB[Pipekit Agent]
        ArgoB[Argo Workflows]
    end

    Operator -->|pipekit submit --cluster-name| CLI
    CLI --> CP
    CP -->|active| AgentA
    CP -.->|inactive| AgentB
    AgentA --> ArgoA
    AgentB --> ArgoB
```

Failover flips each cluster between `active` and `inactive` via `PUT /api/users/v1/clusters/{cluster-uuid}`.

## What Pipekit provides

Pipekit owns three pieces of the failover that you would otherwise have to build yourself.

* **Admin-only state changes.** Only organization admins can toggle a cluster between `active` and `inactive`. The check is enforced at the control plane, not at the CI layer, so a leaked CI credential that lacks admin rights cannot flip cluster state.
* **Automatic queue cleanup on reactivation.** When a cluster transitions from `inactive` back to `active`, Pipekit clears the messages still queued for that cluster. Stale submissions that were enqueued before the failover do not replay when the primary comes back online.
* **A single audited toggle.** The active state is one field on one resource; flipping it is one REST call. Operators do not have to coordinate routing changes across multiple Pipekit objects.

## Prerequisites

* Two Kubernetes clusters in separate regions, each able to run Argo Workflows.
* The Pipekit Agent installed on both clusters, following [Connect a Kubernetes Cluster to Pipekit](/pipekit/clusters.md).
* Pipekit credentials for an organization admin (only admins can toggle cluster state). Store the username and password in a Kubernetes Secret on the cluster that runs the failover workflow. The failover example authenticates with the Pipekit CLI, following the [container login pattern](/cli.md#docker-container), so a short-lived bearer token does not need to be persisted anywhere.
* A system to run the failover steps. Any CI tool that can call the Pipekit REST API works (Azure DevOps, GitHub Actions, GitLab CI), or you can use Pipekit itself by submitting an Argo Workflow to a third, always-active cluster.

{% hint style="info" %}
This page covers Pipekit's role in the failover. Surrounding infrastructure such as database replication, secrets sync, DNS updates, and image registry availability is your responsibility. Coordinate the Pipekit steps with your wider DR plan.
{% endhint %}

## Preparing the secondary cluster

Bring the secondary cluster up in its target region and register it under the same organization as the primary. Install the Pipekit Agent and confirm the green connection indicator in the `Clusters` view.

Set the secondary cluster inactive immediately after the Agent reports healthy. Run `pipekit update cluster <dr-name> --status=inactive` (or PUT `/api/users/v1/clusters/{cluster-uuid}` with `{"isActive": false}` against the REST API). This prevents accidental submissions while the primary is the active cluster.
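For operators who prefer the REST route, the request described above can be sketched as a few shell functions. This is a minimal sketch, not a reference client: the path and the `isActive` body come from this page, while `PIPEKIT_API_URL`, `PIPEKIT_TOKEN`, the function names, and the `Authorization: Bearer` header format are assumptions to check against your own Pipekit setup.

```shell
# Map a status word to the request body described on this page.
cluster_state_payload() {
  case "$1" in
    active)   printf '{"isActive": true}\n' ;;
    inactive) printf '{"isActive": false}\n' ;;
    *) echo "unknown status: $1 (expected active or inactive)" >&2; return 1 ;;
  esac
}

# Build the resource path for a cluster UUID.
cluster_path() {
  printf '/api/users/v1/clusters/%s\n' "$1"
}

# Send the PUT request.
# Assumptions: PIPEKIT_API_URL is your control plane base URL and
# PIPEKIT_TOKEN is a bearer token that belongs to an organization admin.
set_cluster_state() {
  uuid="$1"
  status="$2"
  body="$(cluster_state_payload "$status")" || return 1
  curl --fail --silent --show-error -X PUT \
    -H "Authorization: Bearer ${PIPEKIT_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$body" \
    "${PIPEKIT_API_URL}$(cluster_path "$uuid")"
}
```

With both variables exported, `set_cluster_state <cluster-uuid> inactive` would park the secondary cluster. The `--fail` flag makes `curl` return a non-zero exit code on an HTTP error, so a rejected request stops a CI job instead of passing silently.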

## Failing over to the secondary cluster

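Trigger your CI pipeline when you decide to fail over. The pipeline runs three steps: two cluster state changes through the Pipekit CLI or REST API, then a change to the cluster name that your submission jobs pass to `pipekit submit`.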

1. Mark the primary cluster inactive. Run `pipekit update cluster <primary-name> --status=inactive` (or PUT `/api/users/v1/clusters/{primary-uuid}` with `{"isActive": false}` against the REST API). New submissions targeted at the primary are rejected with the error `trying to start workflow on inactive cluster`.
2. Mark the secondary cluster active. Run `pipekit update cluster <dr-name> --status=active` (or PUT `/api/users/v1/clusters/{dr-uuid}` with `{"isActive": true}`).
3. Switch the value passed to `--cluster-name` on every `pipekit submit` invocation to the secondary cluster's name. The `--cluster-name` flag is required and has no environment-variable fallback, so every submission job in your pipeline must read the cluster name from a shared variable.
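Step 3 is easier to apply consistently when every submission job goes through one small wrapper. The sketch below assumes a shared CI variable named `ACTIVE_CLUSTER`; that name and the `submit_to_active` function are hypothetical, and only `pipekit submit` and `--cluster-name` come from this page.

```shell
# Submit a workflow to whichever cluster the shared variable names.
# Assumption: ACTIVE_CLUSTER is a variable your CI system shares across jobs.
submit_to_active() {
  if [ -z "${ACTIVE_CLUSTER:-}" ]; then
    echo "ACTIVE_CLUSTER is not set; refusing to submit" >&2
    return 1
  fi
  workflow_file="$1"
  shift
  # Any remaining arguments are passed through to pipekit submit unchanged.
  pipekit submit "$workflow_file" --cluster-name "$ACTIVE_CLUSTER" "$@"
}
```

Because `--cluster-name` has no environment-variable fallback, the wrapper fails fast when the variable is missing rather than guessing a target. After a failover, changing `ACTIVE_CLUSTER` in one place redirects every job that calls `submit_to_active`.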

{% hint style="warning" %}
Setting a cluster inactive blocks new submissions only. Workflows already running on the primary continue until they complete or until the region becomes unreachable. Plan for in-flight work to remain on the primary; the failover does not drain it.
{% endhint %}

## Example: Argo Workflow

The Argo Workflow below performs the failover by calling `pipekit update cluster` against two clusters in sequence. Submit it to a third, always-active cluster that hosts your ops workflows. That cluster must not share a region with the primary or the secondary; if it does, it goes down with them.

Store the Pipekit admin credentials in a Kubernetes Secret named `pipekit-credentials` on the third cluster, with the username under the `username` key and the password under the `password` key. The workflow re-authenticates each run via `pipekit login`, so a short-lived bearer token is never persisted. See [Used within a Workflow](/cli.md#used-within-a-workflow) for the underlying pattern.
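One way to create that Secret is with `kubectl create secret generic`, sketched here as a shell function. The function name and the namespace argument are assumptions; the Secret name and its two keys match the description above. The Secret has to live in the namespace where the failover workflow runs.

```shell
# Create the pipekit-credentials Secret in the given namespace.
# Assumption: PIPEKIT_USERNAME and PIPEKIT_PASSWORD are already exported in
# the shell that runs this function, so they never appear in a script file.
create_pipekit_secret() {
  namespace="${1:-}"
  if [ -z "$namespace" ]; then
    echo "usage: create_pipekit_secret <namespace>" >&2
    return 2
  fi
  if [ -z "${PIPEKIT_USERNAME:-}" ] || [ -z "${PIPEKIT_PASSWORD:-}" ]; then
    echo "PIPEKIT_USERNAME and PIPEKIT_PASSWORD must both be set" >&2
    return 1
  fi
  kubectl create secret generic pipekit-credentials \
    --namespace "$namespace" \
    --from-literal=username="$PIPEKIT_USERNAME" \
    --from-literal=password="$PIPEKIT_PASSWORD"
}
```

Values passed with `--from-literal` are visible in the local process list while the command runs, so run this from a trusted workstation or substitute your usual secrets tooling.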

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: pipekit-failover-
spec:
  entrypoint: failover
  arguments:
    parameters:
      - name: from-cluster-name
      - name: to-cluster-name
  templates:
    - name: failover
      dag:
        tasks:
          - name: mark-from-inactive
            template: set-status
            arguments:
              parameters:
                - name: cluster-name
                  value: "{{workflow.parameters.from-cluster-name}}"
                - name: status
                  value: inactive
          - name: mark-to-active
            dependencies: [mark-from-inactive]
            template: set-status
            arguments:
              parameters:
                - name: cluster-name
                  value: "{{workflow.parameters.to-cluster-name}}"
                - name: status
                  value: active
    - name: set-status
      inputs:
        parameters:
          - name: cluster-name
          - name: status
      container:
        image: pipekit13/cli
        env:
          - name: PIPEKIT_USERNAME
            valueFrom:
              secretKeyRef:
                name: pipekit-credentials
                key: username
          - name: PIPEKIT_PASSWORD
            valueFrom:
              secretKeyRef:
                name: pipekit-credentials
                key: password
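        command: [sh, -c]
        args:
          - |
            # Stop at the first failure so a failed login never reaches the update.
            set -e
            # Credentials come from PIPEKIT_USERNAME and PIPEKIT_PASSWORD above.
            pipekit login
            pipekit update cluster "{{inputs.parameters.cluster-name}}" --status="{{inputs.parameters.status}}"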
```

Submit the workflow to the ops cluster from any environment that has the Pipekit CLI.

```bash
pipekit submit failover.yaml --cluster-name ops \
  -p from-cluster-name=<primary-name> \
  -p to-cluster-name=<dr-name>
```

The same workflow handles the reverse direction when invoked with the cluster names swapped. You can also wrap it in a Pipe with a webhook or schedule, or invoke it from Azure DevOps, GitHub Actions, or GitLab CI; each is a variation on the same shape.
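For CI jobs, the submit command above can be wrapped in a function that checks its arguments before anything is sent. This is a sketch under stated assumptions: `run_failover` and the `OPS_CLUSTER` variable are hypothetical names, and the `ops` default mirrors the example command.

```shell
# Submit the failover workflow to the ops cluster.
# Assumption: OPS_CLUSTER optionally overrides the ops cluster name.
run_failover() {
  from="${1:-}"
  to="${2:-}"
  if [ -z "$from" ] || [ -z "$to" ]; then
    echo "usage: run_failover <from-cluster-name> <to-cluster-name>" >&2
    return 2
  fi
  # Failing over from a cluster to itself would deactivate it and then
  # reactivate it, which is never the intent.
  if [ "$from" = "$to" ]; then
    echo "from and to cluster names must differ" >&2
    return 2
  fi
  pipekit submit failover.yaml --cluster-name "${OPS_CLUSTER:-ops}" \
    -p "from-cluster-name=$from" \
    -p "to-cluster-name=$to"
}
```

Calling `run_failover <primary-name> <dr-name>` fails over, and swapping the two arguments produces the reverse direction.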

## Failing back

When the primary region recovers and its Pipekit Agent reports a green indicator in the `Clusters` view, run the reverse sequence.

1. Confirm the primary cluster's Agent is connected.
2. Run `pipekit update cluster <primary-name> --status=active`, then `pipekit update cluster <dr-name> --status=inactive`. Reactivating the primary triggers Pipekit's queue cleanup: any workflow, action, or lint messages that were queued for the primary before the failover are dropped, not replayed.
3. Switch `--cluster-name` back to the primary cluster's name in every submission job.

Runs that completed on the secondary cluster during the outage remain in its run history. There is no need to migrate them; they stay associated with the cluster that executed them.
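The two state changes in step 2 can be scripted so that the secondary is only deactivated once the primary has been reactivated successfully. The function name `fail_back` is hypothetical; the `pipekit update cluster` commands are the ones shown in the steps above.

```shell
# Reactivate the primary, then deactivate the DR cluster.
fail_back() {
  primary="${1:-}"
  dr="${2:-}"
  if [ -z "$primary" ] || [ -z "$dr" ]; then
    echo "usage: fail_back <primary-name> <dr-name>" >&2
    return 2
  fi
  # If reactivation fails, leave the DR cluster active so submissions
  # still have a valid target.
  pipekit update cluster "$primary" --status=active || return 1
  pipekit update cluster "$dr" --status=inactive
}
```

Run it only after step 1 confirms the primary Agent is connected, then complete step 3 by switching `--cluster-name` back in your submission jobs.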

