MCP Gateway runbook
Operator runbook for the portal’s agentic, permissioned, audited write path. For platform / principal engineers; assumes familiarity with Kubernetes, Helm, and the Model Context Protocol.
What it is
The portal can take human-approved, audited write actions against GitHub (Phase 1) via that provider’s MCP server, without breaking the CI-attested compliance model. SARC has been an MCP server (read tools) since early 2026; this is the MCP client gateway.
Design highlights:
- Hybrid execution: safe tracking-artifact writes (open an issue, comment, label) go in-portal via MCP; compliance-critical writes (change requests, deploy / release triggers, code / PR changes) stay on the CI-attested path.
- Human-in-the-loop: the assistant proposes an action, the operator sees a dry-run preview, an ADMIN approves, then it executes.
- Attribution: the tenant’s PAT executes the call; the approving user is stamped into the hash-chained AuditLog and into the created artifact body.
Architecture
portal pod ──HTTP (per-request Bearer PAT)──▶ github-mcp-server sidecar ──▶ GitHub API
   │                                           (in-cluster Deployment)
   └─ MCP transport, registry, guardrails, gateway modules in the portal source
   └─ /api/mcp/{propose,execute} routes ◀── operator UI panel

The prod portal image is Next.js standalone (non-root, no npx), so MCP servers run as separate in-cluster workloads reached over HTTP, never spawned from the portal pod. No PAT is baked into the sidecar; the per-tenant token arrives per-request in the Authorization header.
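As a rough illustration of the per-request token handoff described above, the sketch below builds (but does not send) an HTTP request for a single MCP tool call. The helper name `buildToolCallRequest` and its parameters are hypothetical, not taken from the portal source. The JSON-RPC `tools/call` envelope follows the MCP specification, and the sketch leaves out the initialize handshake and any session handling a real client needs.

```typescript
// Hypothetical helper: shapes one MCP tool call as an HTTP request.
// Nothing here is the portal's actual transport code.
interface ToolCallRequest {
  url: string;
  init: { method: "POST"; headers: Record<string, string>; body: string };
}

function buildToolCallRequest(
  serverUrl: string, // value of MCP_GITHUB_SERVER_URL
  tenantPat: string, // per-tenant token, resolved for this request only
  tool: string,
  args: Record<string, unknown>,
  requestId: number,
): ToolCallRequest {
  return {
    url: serverUrl,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // MCP over HTTP may answer with plain JSON or an event stream.
        Accept: "application/json, text/event-stream",
        // The token travels with the request; the sidecar stores none.
        Authorization: `Bearer ${tenantPat}`,
      },
      body: JSON.stringify({
        jsonrpc: "2.0",
        id: requestId,
        method: "tools/call",
        params: { name: tool, arguments: args },
      }),
    },
  };
}
```

Because the token is only ever a function argument, a sketch like this keeps the sidecar stateless with respect to tenants: two consecutive calls can carry two different PATs.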
Operator surface: /t/<tenant>/admin/mcp
ADMIN-only standalone page that drives the propose → approve → execute flow against a real tenant + sidecar in one form. Reachable from:
- Sidebar → ADMIN → MCP Gateway
- Settings grid → MCP Gateway tile
- Direct URL: https://<portal>/t/<tenant>/admin/mcp
What the page shows:
- A prereqs strip (3 green checks expected): mcp.gateway flag, agentDispatchEnabled per-tenant, GitHub PAT set. If any is red, the operator sees the warning before they spend cycles filling the form.
- A create-issue form (owner / repo / title / body).
- The mounted MCP propose panel, which drives propose → Approve. It renders the preview args + idempotency marker after Preview, and the new issue URL after Approve.
Recommended demo flow
- Open /t/demo/admin/mcp as an ADMIN user.
- Enter a target owner / repo the tenant PAT can write issues to.
- Tweak the title to make it greppable, click Open propose panel.
- Click Preview action: confirms the args + marker that will be committed to the AuditLog.
- Click Approve & execute: creates the GitHub issue and renders the link.
The propose → approve flow
- POST /api/mcp/propose?tenant=<slug> with { provider, tool, args }. Side-effect-free. Returns { provider, tool, argsSha, marker, preview }. Available to any non-AUDITOR member.
- Operator reviews the preview (the exact args + the artifact body that would be created, carrying an idempotency marker <!-- sarc-mcp:<sha> -->).
- POST /api/mcp/execute?tenant=<slug> with { provider, tool, args, argsSha }. ADMIN only. The server recomputes argsSha(args) and rejects with 409 if it differs from the approved sha (anti-tamper). Guardrails are re-checked, then the write executes via the sidecar; the result (e.g. the new issue URL) is returned.
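The anti-tamper step can be sketched as follows. This is a minimal illustration, not the portal's implementation: it assumes args are canonicalized by sorting object keys and hashed with SHA-256, and that the marker embeds the full hex digest. The source specifies none of those details. `canonicalize`, `TamperError`, and `assertApprovedSha` are hypothetical names.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Assumption: key order must not change the hash, so object keys are sorted.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonicalize).join(",")}]`;
  const keys = Object.keys(value).sort();
  const fields = keys.map((k) => `${JSON.stringify(k)}:${canonicalize(value[k] as Json)}`);
  return `{${fields.join(",")}}`;
}

// Assumption: SHA-256 over the canonical form, hex encoded.
function argsSha(args: Json): string {
  return createHash("sha256").update(canonicalize(args)).digest("hex");
}

// Idempotency marker embedded in the created artifact body.
function marker(sha: string): string {
  return `<!-- sarc-mcp:${sha} -->`;
}

class TamperError extends Error {
  readonly status = 409;
}

// Execute-side check: recompute the hash and compare with the approved one.
function assertApprovedSha(args: Json, approvedSha: string): void {
  if (argsSha(args) !== approvedSha) {
    throw new TamperError("args do not match the approved argsSha");
  }
}
```

Under these assumptions, an operator who approves one preview cannot have a different payload executed: any change to the args between propose and execute changes the recomputed hash and surfaces as a 409.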
Guardrails
- classify(provider, tool) → safe | compliance-critical. Code / PR writes, deploy / release triggers, and CR mutations are compliance-critical.
- assertWriteAllowed (fail-closed) throws 403 for compliance-critical tools AND for any tool not on the Phase-1 write allow-list. Unknown / renamed tools are rejected, never allowed by default. The allow-list is static code (never config) so the audit story holds.
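A fail-closed guardrail of this shape might look like the sketch below. The function names `classify` and `assertWriteAllowed`, the `issue_write` tool, and the `tool_not_allowed` code come from this runbook; the set contents beyond `issue_write`, the `provider:tool` key format, and the `GuardrailError` class are assumptions made for illustration.

```typescript
type Classification = "safe" | "compliance-critical";

// Static, in-code lists (never loaded from config).
// Assumption: only issue_write is named in the runbook; the critical entries
// below are illustrative placeholders for code / PR write tools.
const PHASE1_WRITE_ALLOWLIST: ReadonlySet<string> = new Set(["github:issue_write"]);
const COMPLIANCE_CRITICAL: ReadonlySet<string> = new Set([
  "github:create_pull_request",
  "github:merge_pull_request",
]);

class GuardrailError extends Error {
  readonly status = 403;
  readonly code = "tool_not_allowed";
}

function classify(provider: string, tool: string): Classification {
  return COMPLIANCE_CRITICAL.has(`${provider}:${tool}`) ? "compliance-critical" : "safe";
}

// Fail-closed: a tool must be non-critical AND explicitly allow-listed.
function assertWriteAllowed(provider: string, tool: string): void {
  const key = `${provider}:${tool}`;
  if (classify(provider, tool) === "compliance-critical") {
    throw new GuardrailError(`${key} is compliance-critical`);
  }
  if (!PHASE1_WRITE_ALLOWLIST.has(key)) {
    throw new GuardrailError(`${key} is not on the Phase-1 write allow-list`);
  }
}
```

The second check is what makes the design fail-closed: a tool renamed upstream no longer matches any allow-list entry, so it is rejected rather than silently classified as safe and executed.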
RBAC + audit
- ADMIN approves and executes. AUDITOR and non-ADMIN members see the preview read-only (no Approve button).
- Each execute writes exactly one hash-chained AuditLog row (action = mcp.tool.<provider>.<tool>), with argsSha + an ok flag. Raw args are never stored, only the hash.
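To make "hash-chained" concrete, here is a minimal in-memory sketch of an append-only audit chain. Only the `action` format, `argsSha`, and the `ok` flag come from this runbook. The `actor` and `prevHash` fields, the all-zero genesis value, and the choice of SHA-256 over a JSON array are assumptions for illustration; the real AuditLog schema may differ.

```typescript
import { createHash } from "node:crypto";

interface AuditRow {
  action: string; // mcp.tool.<provider>.<tool>
  actor: string; // approving user (assumed field name)
  argsSha: string; // hash of the args; raw args are never stored
  ok: boolean;
  prevHash: string; // hash of the previous row (assumed field name)
  hash: string;
}

// Assumption: the first row chains from an all-zero hash.
const GENESIS = "0".repeat(64);

function rowHash(row: Omit<AuditRow, "hash">): string {
  const payload = JSON.stringify([row.prevHash, row.action, row.actor, row.argsSha, row.ok]);
  return createHash("sha256").update(payload).digest("hex");
}

// Returns a new chain with exactly one row appended.
function appendRow(
  chain: AuditRow[],
  provider: string,
  tool: string,
  actor: string,
  argsSha: string,
  ok: boolean,
): AuditRow[] {
  const last = chain[chain.length - 1];
  const prevHash = last ? last.hash : GENESIS;
  const partial = { action: `mcp.tool.${provider}.${tool}`, actor, argsSha, ok, prevHash };
  return [...chain, { ...partial, hash: rowHash(partial) }];
}

// A chain is valid when every row links to its predecessor and its own hash matches.
function verifyChain(chain: AuditRow[]): boolean {
  let prev = GENESIS;
  for (const row of chain) {
    if (row.prevHash !== prev || rowHash(row) !== row.hash) return false;
    prev = row.hash;
  }
  return true;
}
```

In a structure like this, editing or deleting any earlier row invalidates every later hash, which is what lets a failed execute (`ok: false`) stay on the record alongside successful ones.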
Enablement (per cloud / per tenant)
- Deploy the sidecar: set mcp.github.enabled=true in the portal Helm values. Image pinned to github-mcp-server:v1.0.4.
- Point the portal at the sidecar: set the portal env var MCP_GITHUB_SERVER_URL to the sidecar Service URL.
- Enable the feature flag: mcp.gateway (defaults on; toggleable in the AI kill-switch matrix) AND the per-tenant agentDispatchEnabled switch. Both must be on or the routes return 503 / 403 respectively.
- Tenant credentials: the tenant’s GitHub PAT must be set with the scopes the target tools need (issues for issue_write).
After flipping mcp.github.enabled=true via a values-file change that ArgoCD picks up, the portal pods do not restart on their own: the portal Deployment reads the ConfigMap through envFrom, and the chart has no checksum annotation to roll the pods when the ConfigMap changes. Trigger a restart manually:

    kubectl rollout restart deploy/karc-portal -n karc-<env>

NetworkPolicy note: the sidecar’s NetworkPolicy restricts ingress to the portal pod and egress to DNS + 443. It is only enforced on a CNI that supports it. k3d’s default flannel does NOT enforce NetworkPolicy; use Calico / Cilium where the cross-namespace-denial guarantee matters.
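The two-switch gating from the enablement steps can be pictured as a small pure function. This is a sketch under assumptions: the 503 for a disabled `mcp.gateway` flag and its `feature-disabled` code appear in this runbook, while the check order, the `agent-dispatch-disabled` code for the 403, and the names `GateInput` and `checkGate` are hypothetical.

```typescript
interface GateInput {
  gatewayFlag: boolean; // global mcp.gateway feature flag
  agentDispatchEnabled: boolean; // per-tenant switch
}

type GateResult =
  | { ok: true }
  | { ok: false; status: 503 | 403; error: string };

function checkGate(input: GateInput): GateResult {
  // Global flag off: the feature is unavailable for everyone.
  if (!input.gatewayFlag) {
    return { ok: false, status: 503, error: "feature-disabled" };
  }
  // Tenant switch off: the feature exists but this tenant may not use it.
  // Assumption: the error code below is illustrative only.
  if (!input.agentDispatchEnabled) {
    return { ok: false, status: 403, error: "agent-dispatch-disabled" };
  }
  return { ok: true };
}
```

Checking the global flag first means a kill-switch flip answers 503 for every tenant regardless of per-tenant state, which matches the negative-path expectation in the smoke test.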
Verifying it works — live smoke
Structural validators (helm lint, helm template, ado-validate, tsc, vitest) do not exercise the MCP runtime; they only prove the manifests render and the code compiles. Runtime correctness is verified only by the live smoke below, against a real cluster with the sidecar deployed.
On a staging tenant with a throwaway GitHub repo:
- Flag on (mcp.gateway + agentDispatchEnabled), PAT set, sidecar running, MCP_GITHUB_SERVER_URL pointed at it.
- Propose a github issue_write (with method: "create", plus owner / repo / title / body); the preview renders with the marker. The /admin/mcp page builds these args for you. Raw POST shape:
  { "provider": "github", "tool": "issue_write",
    "args": { "method": "create", "owner": "...", "repo": "...",
              "title": "...", "body": "..." } }
- ADMIN execute with the proposed argsSha → issue is created with the approver footer; the response carries the issue URL.
- Confirm: audit chain valid; the AuditLog row stores the argsSha (not raw args); re-running with the same marker is idempotent (no duplicate issue).
- Negative path: flag off → 503 (feature-disabled); a compliance-critical tool → 403 (tool_not_allowed).
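The happy path of the smoke run can also be scripted. The sketch below chains propose and execute through an injected transport function, so it can be dry-run against a stub before being pointed at a staging portal. The routes and request bodies follow this runbook; the `JsonResponse` wrapper, the assumption that both routes answer 200 on success, and the names `PostJson` and `runSmoke` are hypothetical.

```typescript
// Assumed response wrapper: HTTP status plus parsed JSON body.
interface JsonResponse {
  status: number;
  json: Record<string, unknown>;
}

// Injected transport, so the flow can be exercised without a live portal.
type PostJson = (path: string, body: unknown) => Promise<JsonResponse>;

interface IssueArgs {
  method: "create";
  owner: string;
  repo: string;
  title: string;
  body: string;
}

async function runSmoke(post: PostJson, tenant: string, args: IssueArgs): Promise<JsonResponse> {
  const call = { provider: "github", tool: "issue_write", args };
  const query = `tenant=${encodeURIComponent(tenant)}`;

  // Step 1: side-effect-free propose; the server returns the argsSha to approve.
  const proposed = await post(`/api/mcp/propose?${query}`, call);
  if (proposed.status !== 200) throw new Error(`propose returned ${proposed.status}`);
  const argsSha = proposed.json["argsSha"];
  if (typeof argsSha !== "string") throw new Error("propose response carried no argsSha");

  // Step 2: execute with the same args plus the approved hash (ADMIN session required).
  const executed = await post(`/api/mcp/execute?${query}`, { ...call, argsSha });
  if (executed.status !== 200) throw new Error(`execute returned ${executed.status}`);
  return executed;
}
```

A real run would pass a `post` implementation that wraps `fetch` with the operator's ADMIN session; the audit-chain and idempotency checks from the list above still need to be confirmed separately.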
Roadmap
- Phase 2: recipe wiring (problem-investigate-fix proposes issue + work-item + problem); ADO + GitLab sidecars; comment / label / cross-link tools.
- Phase 3: CR writes, deploy triggers, code changes. Each gated behind portal-side parity with CI Kosli attestation + change-window enforcement.
Lessons from the first end-to-end install
The chart and the write allow-list were both written before being exercised end-to-end against a deployed sidecar. The first real install on AWS surfaced 5 distinct bugs (wrong image tag, missing env wiring, non-root mismatch, deprecated arg, stale tool name), all fixed in the same engagement.
The audit + gateway design held through all of this — every failed execute attempt wrote a clean AuditLog row with ok: false and the upstream error, proving the chain works as designed. The bugs were all chart-side or allowlist-side wiring errors, not gateway-design defects.
Lesson for the next provider sidecar (GitLab, ADO, ServiceNow): do an end-to-end install on a real cluster before merging the chart + allowlist work, not after. The unit tests + helm template + tsc covered everything they were designed to cover; the failures were all at the integration seam.