MCP Gateway runbook

Operator runbook for the portal’s agentic, permissioned, audited write path. For platform / principal engineers; assumes familiarity with Kubernetes, Helm, and the Model Context Protocol.

What it is

The portal can take human-approved, audited write actions against GitHub (Phase 1) via that provider’s MCP server, without breaking the CI-attested compliance model. SARC has been an MCP server (read tools) since early 2026; this is the MCP client gateway.

Design highlights:

Hybrid execution: safe tracking-artifact writes (open an issue, comment, label) go in-portal via MCP; compliance-critical writes (change requests, deploy / release triggers, code / PR changes) stay on the CI-attested path.
Human-in-the-loop: the assistant proposes an action, the operator sees a dry-run preview, an ADMIN approves, then it executes.
Attribution: the tenant’s PAT executes the call; the approving user is stamped into the hash-chained AuditLog and into the created artifact body.

Architecture

portal pod  ──HTTP (per-request Bearer PAT)──▶  github-mcp-server sidecar  ──▶  GitHub API
   │                                                  (in-cluster Deployment)
   └─ MCP transport, registry, guardrails, gateway modules in the portal source
   └─ /api/mcp/{propose,execute} routes  ◀── operator UI panel

The prod portal image is Next.js standalone (non-root, no npx), so MCP servers run as separate in-cluster workloads reached over HTTP — never spawned from the portal pod. No PAT is baked into the sidecar; the per-tenant token arrives per-request in the Authorization header.
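As a rough illustration of "the per-tenant token arrives per-request", the sketch below builds (but does not send) the HTTP request the portal might issue to the sidecar. The `tools/call` JSON-RPC method and the dual Accept header come from the MCP specification; the helper name `buildToolCallRequest`, the example URL, and the return shape are hypothetical and not taken from the portal source. A real client would also perform the MCP initialize handshake first, which is omitted here.

```typescript
// Hypothetical shape for a prepared sidecar request; not the portal's real type.
interface ToolCallRequest {
  url: string;
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: prepares an MCP `tools/call` request for the sidecar.
// The tenant PAT appears only in the Authorization header of this one request,
// so nothing credential-bearing has to be configured on the sidecar itself.
function buildToolCallRequest(
  serverUrl: string, // assumed to be the value of MCP_GITHUB_SERVER_URL
  tenantPat: string,
  tool: string,
  args: Record<string, unknown>,
  id = 1,
): ToolCallRequest {
  if (!tenantPat) throw new Error("tenant PAT is not set");
  return {
    url: serverUrl,
    headers: {
      "Content-Type": "application/json",
      // MCP's Streamable HTTP transport expects clients to accept both types.
      Accept: "application/json, text/event-stream",
      Authorization: `Bearer ${tenantPat}`,
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id,
      method: "tools/call",
      params: { name: tool, arguments: args },
    }),
  };
}
```

Keeping the builder free of I/O makes it easy to unit-test that the token never leaks into the body or the URL.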

Operator surface: `/t/<tenant>/admin/mcp`

ADMIN-only standalone page that drives the propose → approve → execute flow against a real tenant + sidecar in one form. Reachable from:

Sidebar → ADMIN → MCP Gateway
Settings grid → MCP Gateway tile
Direct URL: https://<portal>/t/<tenant>/admin/mcp

What the page shows:

A prereqs strip (3 green checks expected): mcp.gateway flag, per-tenant agentDispatchEnabled, GitHub PAT set. If any check is red, the operator sees the warning before filling in the form.
A create-issue form (owner / repo / title / body).
The mounted MCP propose panel — drives propose → Approve. Renders the preview args + idempotency marker after Preview, and the new issue URL after Approve.

Recommended demo flow

Open /t/demo/admin/mcp as an ADMIN user.
Enter a target owner / repo the tenant PAT can write issues to.
Tweak the title to make it greppable, then click Open propose panel.
Click Preview action — confirms the args + marker that will be committed to the AuditLog.
Click Approve & execute — creates the GitHub issue and renders the link.

The propose → approve flow

POST /api/mcp/propose?tenant=<slug> with { provider, tool, args }. Side-effect-free. Returns { provider, tool, argsSha, marker, preview }. Available to any non-AUDITOR member.
Operator reviews the preview (the exact args + the artifact body that would be created, carrying an idempotency marker).
POST /api/mcp/execute?tenant=<slug> with { provider, tool, args, argsSha }. ADMIN only. The server recomputes argsSha(args) and rejects with 409 if it differs from the approved sha (anti-tamper). Guardrails are re-checked, then the write executes via the sidecar; the result (e.g. the new issue URL) is returned.
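The anti-tamper check in the execute step can be sketched as follows. This is a minimal illustration, not the portal's implementation: the runbook does not state how argsSha is computed, so SHA-256 over a key-sorted JSON serialization is an assumption, and `canonicalize` and `checkApprovedSha` are hypothetical helper names.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Serialize with object keys sorted so that key order cannot change the hash.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const keys = Object.keys(value).sort();
  const members = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k] ?? null)}`);
  return `{${members.join(",")}}`;
}

// ASSUMPTION: SHA-256 hex digest of the canonical form.
function argsSha(args: Json): string {
  return createHash("sha256").update(canonicalize(args)).digest("hex");
}

// Hypothetical helper: the status the execute route would answer with for the
// sha check alone. 409 means the args differ from what was previewed.
function checkApprovedSha(args: Json, approvedSha: string): 200 | 409 {
  return argsSha(args) === approvedSha ? 200 : 409;
}
```

Hashing a canonical form rather than the raw request body means a client that reorders keys between propose and execute is not falsely flagged, while any change to a value still produces a 409.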

Guardrails

classify(provider, tool) → safe | compliance-critical. Code / PR writes, deploy / release triggers, and CR mutations are compliance-critical.
assertWriteAllowed (fail-closed) throws 403 for compliance-critical tools AND for any tool not on the Phase-1 write allow-list. Unknown / renamed tools are rejected, never allowed by default. The allow-list is static code (never config) so the audit story holds.
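A fail-closed guard of this kind might look like the sketch below. The function names classify and assertWriteAllowed and the 403 / tool_not_allowed outcome come from this runbook; everything else is assumed for illustration, including the `GuardrailError` class, the `provider.tool` key format, and the contents of both sets (only issue_write is confirmed as a Phase-1 tool here).

```typescript
type Classification = "safe" | "compliance-critical";

// Hypothetical error type carrying the HTTP status the route would return.
class GuardrailError extends Error {
  readonly status = 403;
  readonly code: string;
  constructor(code: string) {
    super(code);
    this.code = code;
  }
}

// ASSUMPTION: illustrative entries only. Both sets are plain constants in
// code, mirroring the "static code, never config" rule.
const COMPLIANCE_CRITICAL: ReadonlySet<string> = new Set([
  "github.create_pull_request",
  "github.merge_pull_request",
  "github.push_files",
]);
const PHASE1_WRITE_ALLOWLIST: ReadonlySet<string> = new Set(["github.issue_write"]);

function classify(provider: string, tool: string): Classification {
  return COMPLIANCE_CRITICAL.has(`${provider}.${tool}`) ? "compliance-critical" : "safe";
}

// Fail-closed: a tool must be both not compliance-critical AND explicitly
// allow-listed. Unknown or renamed tools fall through to the throw.
function assertWriteAllowed(provider: string, tool: string): void {
  const key = `${provider}.${tool}`;
  if (classify(provider, tool) === "compliance-critical" || !PHASE1_WRITE_ALLOWLIST.has(key)) {
    throw new GuardrailError("tool_not_allowed");
  }
}
```

In this sketch classify alone would label an unknown tool as safe, so the allow-list membership test is what rejects renamed or newly added upstream tools.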

RBAC + audit

ADMIN approves and executes. AUDITOR and non-ADMIN members see the preview read-only (no Approve button).
Each execute writes exactly one hash-chained AuditLog row (action = mcp.tool.<provider>.<tool>), with argsSha + an ok flag. Raw args are never stored — only the hash.
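To make "hash-chained" concrete, here is a minimal in-memory sketch of appending and verifying rows. The action format, argsSha, and the ok flag are from this runbook; the row layout, the use of SHA-256, the all-zero genesis value, and the helper names `appendRow` and `verifyChain` are assumptions and may differ from the real AuditLog schema.

```typescript
import { createHash } from "node:crypto";

// Hypothetical row layout; the real AuditLog table likely has more columns.
interface AuditRow {
  action: string; // mcp.tool.<provider>.<tool>
  argsSha: string; // hash of the args; raw args are never stored
  ok: boolean;
  prevHash: string;
  hash: string;
}

// ASSUMPTION: the first row chains from a fixed all-zero value.
const GENESIS = "0".repeat(64);

function rowHash(prevHash: string, action: string, argsSha: string, ok: boolean): string {
  const payload = JSON.stringify([prevHash, action, argsSha, ok]);
  return createHash("sha256").update(payload).digest("hex");
}

// Appends exactly one row per execute attempt, successful or not.
function appendRow(chain: AuditRow[], provider: string, tool: string, argsSha: string, ok: boolean): AuditRow {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const action = `mcp.tool.${provider}.${tool}`;
  const row: AuditRow = { action, argsSha, ok, prevHash, hash: rowHash(prevHash, action, argsSha, ok) };
  chain.push(row);
  return row;
}

// Recomputes every hash; any edited or removed row breaks the chain after it.
function verifyChain(chain: AuditRow[]): boolean {
  let prev = GENESIS;
  for (const row of chain) {
    if (row.prevHash !== prev) return false;
    if (row.hash !== rowHash(row.prevHash, row.action, row.argsSha, row.ok)) return false;
    prev = row.hash;
  }
  return true;
}
```

Because each hash covers the previous row's hash, flipping an ok flag after the fact invalidates that row and every row after it.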

Enablement (per cloud / per tenant)

Deploy the sidecar: set mcp.github.enabled=true in the portal Helm values. Image pinned to github-mcp-server:v1.0.4.
Point the portal at the sidecar: set the portal env var MCP_GITHUB_SERVER_URL to the sidecar Service URL.
Enable the feature flags: mcp.gateway (defaults on; toggleable in the AI kill-switch matrix) AND the per-tenant agentDispatchEnabled switch. Both must be on; otherwise the routes return 503 (mcp.gateway off) or 403 (agentDispatchEnabled off).
Tenant credentials: the tenant’s GitHub PAT must be set with the scopes the target tools need (issues for issue_write).
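The enablement prerequisites above can be summarized as a small gate, sketched here under stated assumptions. The 503 and 403 statuses are from this runbook; the `GatewayState` shape, the helper names, and the order in which the two switches are checked are hypothetical. The runbook does not say what status a missing PAT or sidecar URL produces, so those appear only in the checklist, not in the status mapping.

```typescript
// Hypothetical snapshot of the switches and settings listed above.
interface GatewayState {
  gatewayFlagOn: boolean; // mcp.gateway feature flag
  agentDispatchEnabled: boolean; // per-tenant switch
  githubPatSet: boolean; // tenant GitHub PAT present
  sidecarUrl?: string; // value of MCP_GITHUB_SERVER_URL, if configured
}

interface PrereqCheck {
  name: string;
  ok: boolean;
}

// Status the routes would return based on the two switches alone.
// ASSUMPTION: the global flag is evaluated before the per-tenant switch.
function gateStatus(state: GatewayState): 200 | 403 | 503 {
  if (!state.gatewayFlagOn) return 503;
  if (!state.agentDispatchEnabled) return 403;
  return 200;
}

// Checklist an operator could walk through before attempting a propose.
function prereqChecks(state: GatewayState): PrereqCheck[] {
  return [
    { name: "mcp.gateway flag", ok: state.gatewayFlagOn },
    { name: "agentDispatchEnabled (per-tenant)", ok: state.agentDispatchEnabled },
    { name: "GitHub PAT set", ok: state.githubPatSet },
    { name: "MCP_GITHUB_SERVER_URL set", ok: Boolean(state.sidecarUrl) },
  ];
}
```

A helper like prereqChecks makes it quick to see which step of the enablement list was missed when a route returns an unexpected status.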

After flipping mcp.github.enabled=true via a values-file change that ArgoCD picks up, the portal pods do not restart automatically: the portal Deployment reads the ConfigMap via envFrom, and the chart has no checksum annotation to roll the pods when the ConfigMap changes. Trigger a restart manually:

kubectl rollout restart deploy/karc-portal -n karc-<env>

NetworkPolicy note: the sidecar’s NetworkPolicy restricts ingress to the portal pod and egress to DNS + 443. It is only enforced on a CNI that supports it — k3d’s default flannel does NOT enforce NetworkPolicy; use Calico / Cilium where the cross-namespace-denial guarantee matters.

Verifying it works — live smoke

Structural validators (helm lint, helm template, ado-validate, tsc, vitest) do not exercise the MCP runtime — they only prove the manifests render and the code compiles. Runtime correctness is verified only by the live smoke below, against a real cluster with the sidecar deployed.

On a staging tenant with a throwaway GitHub repo:

Flag on (mcp.gateway + agentDispatchEnabled), PAT set, sidecar running, MCP_GITHUB_SERVER_URL pointed at it.

Propose a GitHub issue_write (with method: "create", plus owner / repo / title / body); the preview renders with the marker. The /admin/mcp page builds these args for you. Raw POST shape:

{ "provider": "github", "tool": "issue_write",
  "args": { "method": "create", "owner": "...", "repo": "...",
            "title": "...", "body": "..." } }

ADMIN execute with the proposed argsSha → issue is created with the approver footer; the response carries the issue URL.
Confirm: audit chain valid; the AuditLog row stores the argsSha (not raw args); re-running with the same marker is idempotent (no duplicate issue).
Negative path: flag off → 503 (feature-disabled); a compliance-critical tool → 403 (tool_not_allowed).

Roadmap

Phase 2 — recipe wiring (problem-investigate-fix proposes issue + work-item + problem); ADO + GitLab sidecars; comment / label / cross-link tools.
Phase 3 — CR writes, deploy triggers, code changes. Each gated behind portal-side parity with CI Evidence Vault / Fides attestation + change-window enforcement.

Lessons from the first end-to-end install

The chart and the write allow-list were both written before either had been exercised end to end against a deployed sidecar. The first real install on AWS surfaced five distinct bugs (wrong image tag, missing env wiring, non-root mismatch, deprecated arg, stale tool name), all fixed in the same engagement.

The audit + gateway design held through all of this — every failed execute attempt wrote a clean AuditLog row with ok: false and the upstream error, proving the chain works as designed. The bugs were all chart-side or allowlist-side wiring errors, not gateway-design defects.

Lesson for the next provider sidecar (GitLab, ADO, ServiceNow): do an end-to-end install on a real cluster before merging the chart + allowlist work, not after. The unit tests + helm template + tsc covered everything they were designed to cover; the failures were all at the integration seam.