Cloud + CI/CD lifecycle

SARC deploys the same stack to AWS, Azure, GCP, and local k3d from one repo via TARGET_CLOUD. Each cloud, and each CI/CD pipeline that feeds it, can be turned off and back on without losing the ability to rebuild. There are three levels of “off”: pause (cheapest to resume), teardown (fully removed), and disable-not-delete for out-of-band artifacts.

The full operator guide lives in the repo at docs/CLOUD-LIFECYCLE.md. This page is the summary.

Three levels of off

Level	What it does	Billing	Reversibility
Pause	Scale nodes to 0 (EKS/GKE) or `az aks stop` (AKS). The EKS/GKE control plane stays up; AKS deallocates it.	Control plane only (EKS/GKE)	Fast: `just cluster-start-<cloud>`, 3-5 min to Ready
Teardown	`terraform destroy` the whole `infra/<cloud>` stack.	Zero	Rebuild via `just bootstrap-<cloud>`, ~30 min
Disable-not-delete	For out-of-band artifacts (e.g. a GitLab-integration service account + token): `gcloud ... disable`.	Negligible	`gcloud ... enable`

Pick the lowest level that meets the goal. Pause is the default for routine weekend/overnight cost saving; the per-cloud stop/start recipes exist so you do not have to tear down and rebuild.

Pause / resume a cluster

just clusters-stop-all          # stop AWS + Azure + GCP + k3d (fail-tolerant)
just clusters-start-all         # start every cluster before a demo
just clusters-status            # power state of all four clusters

just cluster-stop-<cloud>       # <cloud> is aws | azure | gcp | k3d
just cluster-start-<cloud>

AWS EKS — nodes scaled to 0; control plane still billed.
Azure AKS — az aks stop deallocates control plane + nodes (cheapest pause).
GCP GKE — node pools to 0; regional control plane still billed.
k3d — local, no cloud cost.
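
As an illustration of what "fail-tolerant" means for the stop-all recipe, here is a minimal sketch in POSIX shell. It assumes only the per-cloud recipes listed above; the `run` helper, the `DRY_RUN` switch, and the `stop_all` function are hypothetical names introduced for this example, and the real clusters-stop-all recipe in the repo may be implemented differently.

```shell
#!/bin/sh
# Sketch only: stop every cluster and keep going when one cloud fails.
# Assumes the per-cloud just recipes exist as listed in this page.

# DRY_RUN=1 (the default in this sketch) prints commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

stop_all() {
  failed=0
  for cloud in aws azure gcp k3d; do
    # One failing cloud must not block the others.
    if ! run just "cluster-stop-${cloud}"; then
      echo "warn: could not stop ${cloud}" >&2
      failed=$((failed + 1))
    fi
  done
  # Report the resulting power state of all four clusters.
  run just clusters-status || true
  # Return status is the number of clouds that failed to stop (0 = all stopped).
  return "$failed"
}

stop_all
```

Run it with DRY_RUN=0 to execute the recipes for real; a non-zero exit status then tells a scheduler how many clouds still need attention.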

Teardown / rebuild

cd infra/<cloud>
terraform destroy            # frees the LoadBalancer first, then cluster, network, IAM, secrets
just bootstrap-<cloud>       # rebuild from clean state, ~30 min

A Kubernetes Service of type LoadBalancer (ingress-nginx) creates a cloud load balancer out of band, outside Terraform's state, and a leftover load balancer can block deletion of the network it sits in. In SARC the ingress-nginx release is itself Terraform-managed, so a normal terraform destroy uninstalls it first and frees the load balancer before the network is deleted. OpenShift (ROSA HCP) is teardown-only: it has no stop/start lifecycle.
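
The teardown and rebuild commands above can be wrapped in a small guard so a typo in the cloud name never reaches Terraform. This is a sketch under stated assumptions: that aws, azure, and gcp each have an infra/<cloud> directory, that it runs from the repo root, and that the helper names (`run`, `check_cloud`, `teardown`, `rebuild`) and the `DRY_RUN` switch are hypothetical. It uses Terraform's global `-chdir` option in place of `cd`.

```shell
#!/bin/sh
# Sketch only: guarded teardown and rebuild of one cloud.
# Assumption: aws, azure and gcp each have an infra/<cloud> directory.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands, anything else = execute

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

check_cloud() {
  case "$1" in
    aws|azure|gcp) return 0 ;;
    *) echo "error: unknown cloud '$1'" >&2; return 2 ;;
  esac
}

teardown() {
  check_cloud "$1" || return
  # -chdir points Terraform at infra/<cloud> without changing directory.
  run terraform -chdir="infra/$1" destroy
}

rebuild() {
  check_cloud "$1" || return
  run just "bootstrap-$1"
}

# Example (dry run by default): show what a GCP teardown and rebuild would run.
teardown gcp && rebuild gcp
```

Because the sketch does not pass -auto-approve, a real run still stops at Terraform's own confirmation prompt.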

Disable / enable CI/CD per platform and env

Nothing in the cloud-deploy path fires automatically — all cloud deploys are manual/dispatch — so disabling CI/CD is mostly about the few scheduled or push-triggered jobs.

GitLab: the GitHub + ADO mirror sync runs twice a day from mirror-sync.yml; the kill switch is the project CI/CD variable DISABLE_MIRRORS=true. The Azure + GCP terraform templates are pinned to when: never; AWS terraform apply is when: manual.
GitHub Actions: cloud deploys use workflow_dispatch, so they run only when invoked. Disable a workflow with gh workflow disable <wf.yml>; disable a specific cloud/env by protecting or removing its <cloud>-karc-<env> GitHub Environment or its OIDC secret.
Azure DevOps: set a pipeline to Disabled in the UI; environments (sarc-azure-*) are operator-bound in the ADO Library.
ArgoCD: production auto-sync is disabled by policy. Pause an app with argocd app set <app> --sync-policy none; re-enable by restoring the previous policy.
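
The GitHub and ArgoCD switches above lend themselves to a paired pause/resume helper. The sketch below is illustrative only: the workflow file name deploy-aws.yml and the application name sarc-aws are hypothetical placeholders, as are the `ci_pause` and `ci_resume` function names and the `DRY_RUN` switch. The GitLab variable and the Azure DevOps toggle are left out because this page describes them as settings rather than commands.

```shell
#!/bin/sh
# Sketch only: pause or resume the GitHub and ArgoCD side of one cloud.
# WORKFLOW and APP are hypothetical names; substitute the real ones.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

WORKFLOW="${WORKFLOW:-deploy-aws.yml}"   # hypothetical workflow file
APP="${APP:-sarc-aws}"                   # hypothetical ArgoCD application

ci_pause() {
  run gh workflow disable "$WORKFLOW"
  run argocd app set "$APP" --sync-policy none
}

# $1 is the sync policy the app had before the pause (for example automated).
# It defaults to none, so a resume never turns auto-sync on by accident.
ci_resume() {
  run gh workflow enable "$WORKFLOW"
  run argocd app set "$APP" --sync-policy "${1:-none}"
}

ci_pause
```

Passing the earlier policy to ci_resume explicitly keeps the "restore the policy" step a deliberate choice, which matters where production auto-sync is disabled by policy.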

Park a whole cloud, recoverably

Run terraform destroy in infra/<cloud> (or run just cluster-stop-<cloud> instead if you are returning soon).
Disable-not-delete any out-of-band integration service account + secret.
GitLab: set DISABLE_MIRRORS=true if replicas should not refresh.
GitHub: gh workflow disable the cloud’s deploy workflows, or protect its Environments.
ArgoCD: set the cloud’s apps to --sync-policy none.

Reverse each step to bring it back: just bootstrap-<cloud>, re-enable the SA + secret, unset DISABLE_MIRRORS, gh workflow enable, restore ArgoCD sync.
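
Which command brings a cloud back depends on how its cluster was parked: a paused cluster is restarted, a torn-down one is bootstrapped. A minimal sketch of that pairing follows, assuming it runs from the repo root; the `park` and `unpark` function names, the level argument, and the `DRY_RUN` switch are hypothetical, and the service-account, GitLab, GitHub, and ArgoCD steps are not automated here.

```shell
#!/bin/sh
# Sketch only: pair each way of parking a cloud's cluster with its matching undo.
# Only the cluster step is covered; the other park steps stay manual.

DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands instead of running them

run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "+ $*"
  else
    "$@"
  fi
}

# park <cloud> <pause|teardown>
park() {
  case "$2" in
    pause)    run just "cluster-stop-$1" ;;                # returning soon
    teardown) run terraform -chdir="infra/$1" destroy ;;   # fully removed
    *) echo "usage: park <cloud> <pause|teardown>" >&2; return 2 ;;
  esac
}

# unpark <cloud> <pause|teardown>: pass the level that was used to park.
unpark() {
  case "$2" in
    pause)    run just "cluster-start-$1" ;;
    teardown) run just "bootstrap-$1" ;;
    *) echo "usage: unpark <cloud> <pause|teardown>" >&2; return 2 ;;
  esac
}

# Example (dry run by default): pause Azure, then bring it back.
park azure pause
unpark azure pause
```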