
MCP Gateway runbook

Operator runbook for the portal’s agentic, permissioned, audited write path. For platform / principal engineers; assumes familiarity with Kubernetes, Helm, and the Model Context Protocol.

The portal can take human-approved, audited write actions against GitHub (Phase 1) through the GitHub MCP server, without breaking the CI-attested compliance model. SARC has been an MCP server (read tools) since early 2026; this runbook covers the MCP client gateway.

Design highlights:

  • Hybrid execution: safe tracking-artifact writes (open an issue, comment, label) go in-portal via MCP; compliance-critical writes (change requests, deploy / release triggers, code / PR changes) stay on the CI-attested path.
  • Human-in-the-loop: the assistant proposes an action, the operator sees a dry-run preview, an ADMIN approves, then it executes.
  • Attribution: the tenant’s PAT executes the call; the approving user is stamped into the hash-chained AuditLog and into the created artifact body.

portal pod ──HTTP (per-request Bearer PAT)──▶ github-mcp-server sidecar ──▶ GitHub API
 │                                            (in-cluster Deployment)
 ├─ MCP transport, registry, guardrails, gateway modules in the portal source
 └─ /api/mcp/{propose,execute} routes ◀── operator UI panel

The prod portal image is Next.js standalone (non-root, no npx), so MCP servers run as separate in-cluster workloads reached over HTTP — never spawned from the portal pod. No PAT is baked into the sidecar; the per-tenant token arrives per-request in the Authorization header.
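
As a rough illustration of the per-request token model described above, the sketch below builds (but does not send) the HTTP request the portal could make to the sidecar. The JSON-RPC `tools/call` envelope and the `Accept` header follow the MCP specification; the function name `buildToolCall`, the request shape, and the id scheme are hypothetical and not taken from the portal source. The MCP `initialize` handshake is omitted for brevity.

```typescript
// Sketch only: constructs an MCP tools/call request for the sidecar.
// buildToolCall and McpHttpRequest are illustrative names, not portal APIs.

interface McpHttpRequest {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

function buildToolCall(
  serverUrl: string, // e.g. the value of MCP_GITHUB_SERVER_URL
  tenantPat: string, // per-tenant token, resolved for this request only
  tool: string,
  args: Record<string, unknown>,
  requestId: number,
): McpHttpRequest {
  // Refuse to call the sidecar without a token: nothing is baked into it.
  if (!tenantPat) {
    throw new Error("tenant PAT missing; refusing to call the sidecar");
  }
  return {
    url: serverUrl,
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Accept: "application/json, text/event-stream",
      Authorization: `Bearer ${tenantPat}`,
    },
    body: JSON.stringify({
      jsonrpc: "2.0",
      id: requestId,
      method: "tools/call",
      params: { name: tool, arguments: args },
    }),
  };
}
```

Because the token travels only in the `Authorization` header of each request, rotating a tenant PAT needs no sidecar redeploy under this model.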

The gateway has an ADMIN-only standalone page that drives the propose → approve → execute flow against a real tenant and sidecar from a single form. It is reachable from:

  • Sidebar → ADMIN → MCP Gateway
  • Settings grid → MCP Gateway tile
  • Direct URL: https://<portal>/t/<tenant>/admin/mcp

What the page shows:

  • A prereqs strip (3 green checks expected): the mcp.gateway flag, the per-tenant agentDispatchEnabled switch, and the GitHub PAT being set. If any check is red, the operator sees the warning before filling in the form.
  • A create-issue form (owner / repo / title / body).
  • The mounted MCP propose panel — drives propose → Approve. Renders the preview args + idempotency marker after Preview, and the new issue URL after Approve.

To exercise the flow from the page:

  1. Open /t/demo/admin/mcp as an ADMIN user.
  2. Enter a target owner / repo the tenant PAT can write issues to.
  3. Tweak the title to make it greppable, then click Open propose panel.
  4. Click Preview action. This confirms the args + marker that will be committed to the AuditLog.
  5. Click Approve & execute. This creates the GitHub issue and renders the link.

The same flow at the API level:

  1. POST /api/mcp/propose?tenant=<slug> with { provider, tool, args }. Side-effect-free. Returns { provider, tool, argsSha, marker, preview }. Available to any non-AUDITOR member.
  2. Operator reviews the preview (the exact args + the artifact body that would be created, carrying an idempotency marker <!-- sarc-mcp:<sha> -->).
  3. POST /api/mcp/execute?tenant=<slug> with { provider, tool, args, argsSha }. ADMIN only. The server recomputes argsSha(args) and rejects with 409 if it differs from the approved sha (anti-tamper). Guardrails are re-checked, then the write executes via the sidecar; the result (e.g. the new issue URL) is returned.
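
The anti-tamper check in step 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the portal's implementation: it assumes SHA-256 over a key-sorted JSON serialization, and the helper names `canonicalize`, `markerFor`, and `verifyApproved` are hypothetical. Only `argsSha`, the 409 status, and the marker format come from the flow above.

```typescript
import { createHash } from "node:crypto";

// Deterministic JSON: object keys are sorted recursively so that key order
// does not change the hash. Inputs are assumed to be parsed JSON values.
function canonicalize(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map(canonicalize).join(",")}]`;
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalize(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

// Assumption: SHA-256, hex-encoded. The real hash function may differ.
function argsSha(args: unknown): string {
  return createHash("sha256").update(canonicalize(args)).digest("hex");
}

// Idempotency marker embedded in the artifact body.
function markerFor(sha: string): string {
  return `<!-- sarc-mcp:${sha} -->`;
}

// Execute-side check: recompute the hash and compare with the approved one.
function verifyApproved(args: unknown, approvedSha: string): { status: 200 | 409 } {
  return { status: argsSha(args) === approvedSha ? 200 : 409 };
}
```

In this sketch, changing any field of `args` between propose and execute changes the recomputed hash, so the approved `argsSha` no longer matches and the request is rejected with 409.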

Guardrails:

  • classify(provider, tool) → safe | compliance-critical. Code / PR writes, deploy / release triggers, and CR mutations are compliance-critical.
  • assertWriteAllowed (fail-closed) throws 403 for compliance-critical tools AND for any tool not on the Phase-1 write allow-list. Unknown / renamed tools are rejected, never allowed by default. The allow-list is static code (never config) so the audit story holds.
  • ADMIN approves and executes. AUDITOR and non-ADMIN members see the preview read-only (no Approve button).
  • Each execute writes exactly one hash-chained AuditLog row (action = mcp.tool.<provider>.<tool>), with argsSha + an ok flag. Raw args are never stored — only the hash.
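
A fail-closed guardrail of the kind described above might look like the sketch below. The `classify` and `assertWriteAllowed` names, the 403 status, and the `tool_not_allowed` code come from this runbook; the `GuardrailError` class, the `provider:tool` key format, and the specific compliance-critical tool names are assumptions for illustration only.

```typescript
type Classification = "safe" | "compliance-critical";

// Hypothetical error type carrying the HTTP status the route would return.
class GuardrailError extends Error {
  readonly status = 403;
  readonly code: string;
  constructor(code: string, message: string) {
    super(message);
    this.name = "GuardrailError";
    this.code = code;
  }
}

// Static allow-list in code (never config). Phase 1 permits issue_write only.
const WRITE_ALLOW_LIST: ReadonlySet<string> = new Set(["github:issue_write"]);

// Illustrative entries; the real compliance-critical set is not shown here.
const COMPLIANCE_CRITICAL: ReadonlySet<string> = new Set([
  "github:create_pull_request",
  "github:merge_pull_request",
  "github:push_files",
]);

function classify(provider: string, tool: string): Classification {
  return COMPLIANCE_CRITICAL.has(`${provider}:${tool}`) ? "compliance-critical" : "safe";
}

// Fail-closed: anything compliance-critical OR not explicitly allowed throws.
function assertWriteAllowed(provider: string, tool: string): void {
  const key = `${provider}:${tool}`;
  if (classify(provider, tool) === "compliance-critical" || !WRITE_ALLOW_LIST.has(key)) {
    throw new GuardrailError("tool_not_allowed", `${key} is not permitted on the in-portal write path`);
  }
}
```

Note that an unknown tool classifies as "safe" in this sketch yet is still rejected, because the allow-list check is the final gate. That is what makes a renamed upstream tool fail closed rather than slip through.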

To enable the gateway:

  1. Deploy the sidecar: set mcp.github.enabled=true in the portal Helm values. Image pinned to github-mcp-server:v1.0.4.
  2. Point the portal at the sidecar: set the portal env var MCP_GITHUB_SERVER_URL to the sidecar Service URL.
  3. Enable the feature flag: mcp.gateway (defaults on; toggleable in the AI kill-switch matrix) AND the per-tenant agentDispatchEnabled switch. Both must be on; otherwise the routes return 503 (flag off) or 403 (per-tenant switch off).
  4. Tenant credentials: the tenant’s GitHub PAT must be set with the scopes the target tools need (issues for issue_write).
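
The gating in steps 3 and 4 can be pictured with the sketch below. It is illustrative only: the `GatewayPrereqs` shape, both function names, the order of the checks, and the `agent-dispatch-disabled` reason string are assumptions. The 503 / 403 statuses and the `feature-disabled` reason follow this runbook.

```typescript
interface GatewayPrereqs {
  gatewayFlag: boolean; // mcp.gateway feature flag
  agentDispatchEnabled: boolean; // per-tenant switch
  patSet: boolean; // tenant GitHub PAT present
}

type GateResult =
  | { ok: true }
  | { ok: false; status: 503 | 403; reason: string };

// Route gate: the global flag is checked first (503), then the tenant switch (403).
function checkRouteGate(p: GatewayPrereqs): GateResult {
  if (!p.gatewayFlag) {
    return { ok: false, status: 503, reason: "feature-disabled" };
  }
  if (!p.agentDispatchEnabled) {
    return { ok: false, status: 403, reason: "agent-dispatch-disabled" };
  }
  return { ok: true };
}

// Mirrors the three-check prereqs strip: returns the names of the red checks.
function failedPrereqs(p: GatewayPrereqs): string[] {
  const failed: string[] = [];
  if (!p.gatewayFlag) failed.push("mcp.gateway");
  if (!p.agentDispatchEnabled) failed.push("agentDispatchEnabled");
  if (!p.patSet) failed.push("GitHub PAT");
  return failed;
}
```

An empty array from `failedPrereqs` corresponds to three green checks on the admin page.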

After flipping mcp.github.enabled=true via a values-file change that ArgoCD picks up, the portal pods do not restart automatically: the portal Deployment reads its environment from a ConfigMap via envFrom, and the chart has no checksum annotation to roll the pods when that ConfigMap changes. Trigger a restart manually:

kubectl rollout restart deploy/karc-portal -n karc-<env>

NetworkPolicy note: the sidecar’s NetworkPolicy restricts ingress to the portal pod and egress to DNS + 443. It is only enforced on a CNI that supports it — k3d’s default flannel does NOT enforce NetworkPolicy; use Calico / Cilium where the cross-namespace-denial guarantee matters.

Structural validators (helm lint, helm template, ado-validate, tsc, vitest) do not exercise the MCP runtime — they only prove the manifests render and the code compiles. Runtime correctness is verified only by the live smoke below, against a real cluster with the sidecar deployed.

On a staging tenant with a throwaway GitHub repo:

  1. Flag on (mcp.gateway + agentDispatchEnabled), PAT set, sidecar running, MCP_GITHUB_SERVER_URL pointed at it.
  2. Propose a github issue_write (with method: "create", plus owner / repo / title / body) — preview renders with the marker. The /admin/mcp page builds these args for you. Raw POST shape:
    { "provider": "github", "tool": "issue_write",
      "args": { "method": "create", "owner": "...", "repo": "...",
                "title": "...", "body": "..." } }
  3. ADMIN execute with the proposed argsSha → issue is created with the approver footer; the response carries the issue URL.
  4. Confirm: audit chain valid; the AuditLog row stores the argsSha (not raw args); re-running with the same marker is idempotent (no duplicate issue).
  5. Negative path: flag off → 503 (feature-disabled); a compliance-critical tool → 403 (tool_not_allowed).
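
The idempotency check in step 4 relies on the marker carried in the artifact body. One way such a check could work is sketched below, purely as an assumption: how the gateway actually looks up existing artifacts is not described here, and the `Issue` shape, the footer wording, and the `createOnce` helper are all hypothetical.

```typescript
interface Issue {
  url: string;
  body: string;
}

function markerFor(sha: string): string {
  return `<!-- sarc-mcp:${sha} -->`;
}

// Appends an approver footer and the idempotency marker to the user's body.
// The footer wording is invented for illustration.
function buildIssueBody(userBody: string, sha: string, approver: string): string {
  return [userBody, "", `Approved by ${approver} via the MCP gateway`, markerFor(sha)].join("\n");
}

// Creates the issue only if no existing issue already carries the same marker.
function createOnce(
  existing: Issue[],
  userBody: string,
  sha: string,
  approver: string,
  create: (body: string) => Issue,
): { issue: Issue; created: boolean } {
  const marker = markerFor(sha);
  const hit = existing.find((i) => i.body.includes(marker));
  if (hit) {
    return { issue: hit, created: false };
  }
  return { issue: create(buildIssueBody(userBody, sha, approver)), created: true };
}
```

During the smoke test, the equivalent manual check is to search the throwaway repo for the marker string and confirm that exactly one issue carries it after a repeated execute.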

Later phases:

  • Phase 2 — recipe wiring (problem-investigate-fix proposes issue + work-item + problem); ADO + GitLab sidecars; comment / label / cross-link tools.
  • Phase 3 — CR writes, deploy triggers, code changes. Each gated behind portal-side parity with CI Kosli attestation + change-window enforcement.

The chart and the write allow-list were both written before being exercised end to end against a deployed sidecar. The first real install on AWS surfaced five distinct bugs (wrong image tag, missing env wiring, non-root mismatch, deprecated arg, stale tool name), all fixed in the same engagement.

The audit + gateway design held through all of this — every failed execute attempt wrote a clean AuditLog row with ok: false and the upstream error, proving the chain works as designed. The bugs were all chart-side or allowlist-side wiring errors, not gateway-design defects.
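
To make "hash-chained" concrete, the sketch below shows one common way to build and verify such a log, including a failed attempt recorded with ok: false. It is an assumption-laden illustration, not the portal's schema: the field names other than action, argsSha, and ok, the SHA-256 choice, and the all-zero genesis hash are hypothetical.

```typescript
import { createHash } from "node:crypto";

interface AuditEntry {
  action: string; // e.g. mcp.tool.github.issue_write
  argsSha: string; // hash of the args; raw args are never stored
  ok: boolean;
  error: string | null; // upstream error text for failed attempts
}

interface AuditRow extends AuditEntry {
  prevHash: string;
  hash: string;
}

// Assumed starting value for the first row's prevHash.
const GENESIS_HASH = "0".repeat(64);

// Each row's hash covers the previous row's hash plus its own fields.
function hashRow(prevHash: string, e: AuditEntry): string {
  const payload = JSON.stringify([prevHash, e.action, e.argsSha, e.ok, e.error]);
  return createHash("sha256").update(payload).digest("hex");
}

function appendRow(chain: readonly AuditRow[], entry: AuditEntry): AuditRow[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS_HASH;
  return [...chain, { ...entry, prevHash, hash: hashRow(prevHash, entry) }];
}

// Walks the chain from the genesis hash; any edited or reordered row breaks it.
function verifyChain(chain: readonly AuditRow[]): boolean {
  let prevHash = GENESIS_HASH;
  for (const row of chain) {
    if (row.prevHash !== prevHash || row.hash !== hashRow(prevHash, row)) {
      return false;
    }
    prevHash = row.hash;
  }
  return true;
}
```

Under this scheme a failed execute is just another row with ok set to false, so failures extend the chain rather than leaving gaps in it.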

Lesson for the next provider sidecar (GitLab, ADO, ServiceNow): do an end-to-end install on a real cluster before merging the chart + allowlist work, not after. The unit tests + helm template + tsc covered everything they were designed to cover; the failures were all at the integration seam.