Securing AI Coding Agent Workflows: Sandboxing, Permissioning, and Reviewing AI-Generated Code in Production Pipelines
AI coding agents are writing production code at scale. Industry surveys put adoption at 84% among professional developers, and that number is climbing. GitHub Copilot, Claude Code, Cursor, Amazon Q Developer, and a growing roster of agentic tools now generate everything from boilerplate utility functions to complete feature implementations. The velocity gains are real. The governance gap is also real.
Most teams adopted AI coding agents but did not adopt governance for AI coding agents. The tools shipped faster than the policies. The result is that AI-generated code flows into production through the same pull request workflows designed for human-written code, but without the same scrutiny. Human reviewers often skim AI-generated PRs because they assume the AI “got it right,” or because the volume of AI-generated changes overwhelms their review capacity. Securing AI coding agent workflows requires treating AI-generated PRs as a distinct trust tier, one with its own detection mechanisms, policy constraints, automated scanning, and review gates.
This is not about slowing things down. It is about building a governance layer that matches the speed and scale at which AI agents operate. Teams that get this right move faster because they have confidence in their safety net. Teams that skip it accumulate risk until something breaks in production.
Threat model for AI-generated code
Before building controls, you need to understand what can go wrong. AI-generated code introduces a specific set of risks that differ from the bugs and vulnerabilities humans typically create.
Secret leakage. AI agents can inadvertently include API keys, tokens, or credentials in generated code. They may copy patterns from training data that include placeholder secrets, or they may pull secrets from the context window (files the agent read during its session) and embed them in new code. Unlike a human who knows which strings are sensitive, an agent treats all context as potential material.
Dependency risks. Agents frequently suggest packages they have seen in training data, including packages that are deprecated, unmaintained, or have known vulnerabilities. They can also hallucinate package names that do not exist, creating an opportunity for dependency confusion attacks where an attacker publishes a malicious package under the hallucinated name.
Logic errors at scale. A human developer writing a subtle logic bug produces one instance. An AI agent applying the same flawed pattern across twenty files produces twenty instances. The velocity that makes agents useful also amplifies the impact of systematic errors.
Permission creep. Agents tend to request broad permissions when narrower ones would suffice. An agent writing a database migration might add ALTER TABLE and DROP TABLE privileges when it only needs CREATE TABLE. An agent configuring an IAM role might attach AdministratorAccess because it solves the immediate permission error without understanding the blast radius.
Supply chain injection. If an agent’s context can be influenced by external inputs, whether through prompt injection in documentation it reads, poisoned repository files, or compromised MCP servers, the generated code may contain intentionally malicious payloads. This is not theoretical. Research has demonstrated prompt injection attacks that cause agents to introduce backdoors in generated code.
For a broader treatment of the threat landscape, see our guide to securing agentic AI applications. For supply chain risks specifically, see Software Supply Chain Security in the Age of AI.
The four layers of AI code governance
Effective governance for AI-generated code operates at four layers, each addressing a different part of the problem. Skip any layer and you have a gap that the other layers cannot compensate for.
Layer 1: Detection
You cannot govern what you cannot identify. The first requirement is reliably detecting which PRs, commits, and code changes were generated by AI agents.
Detection methods include:
- Commit metadata: AI coding agents typically add `Co-authored-by` trailers with identifiable patterns (e.g., `Co-Authored-By: Claude <noreply@anthropic.com>`). Parse these systematically.
- PR labels: Many agent-driven workflows automatically add labels like `ai-generated`, `copilot`, or `claude`. Require these labels as part of your agent configuration.
- Bot authors: When agents operate through GitHub Apps or bot accounts, the PR author itself identifies the source.
- Branch naming conventions: Enforce naming patterns for branches created by agents (e.g., `ai/feature-name` or `agent/ticket-123`).
The goal is a boolean determination at the PR level: was this change produced by an AI agent? That flag drives everything downstream. If your detection has false negatives, undetected AI-generated code bypasses all your governance controls. Build redundancy into detection by checking multiple signals.
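As a sketch of how these signals combine, the check below flags a PR if any one signal matches. The `PrMetadata` shape and the default patterns are illustrative assumptions, not the actual ai-code-gate `detect.ts` API:

```typescript
// Multi-signal AI-PR detection sketch. Any single matching signal flags
// the PR; the redundancy reduces false negatives, which would otherwise
// bypass all downstream governance controls.
interface PrMetadata {
  labels: string[];
  coAuthorTrailers: string[]; // e.g. "Co-Authored-By: Claude <noreply@anthropic.com>"
  authorLogin: string;
  branchName: string;
}

const AI_LABELS = new Set(["ai-generated", "copilot", "claude"]);
const AI_COAUTHOR_HINTS = [/noreply@anthropic\.com/i, /\[bot\]/i, /copilot/i];
const AI_BRANCH_PREFIXES = ["ai/", "agent/", "copilot/", "claude/"];

function detectAiPr(pr: PrMetadata): boolean {
  if (pr.labels.some((l) => AI_LABELS.has(l.toLowerCase()))) return true;
  if (pr.coAuthorTrailers.some((t) => AI_COAUTHOR_HINTS.some((re) => re.test(t)))) return true;
  if (pr.authorLogin.endsWith("[bot]")) return true;
  return AI_BRANCH_PREFIXES.some((p) => pr.branchName.startsWith(p));
}
```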
Layer 2: Policy
Once you know a PR is AI-generated, apply a set of constraints that define what the agent was allowed to do. This is where you encode your organization’s risk tolerance.
Policy answers questions like:
- Which files and directories can the agent modify?
- Which files are off-limits (authentication, infrastructure-as-code, CI/CD pipelines)?
- How many files can a single AI PR touch?
- How many lines of code can a single AI PR add or modify?
- Does the agent need to include tests for any code it writes?
Policy violations are hard blocks. If an AI-generated PR modifies a file in the blocked_patterns list, the PR fails the governance check regardless of what the code looks like. This is intentional. Some areas of the codebase require human authorship not because the AI cannot write correct code, but because the risk of an undetected error in those areas is too high.
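The hard-block semantics can be sketched as follows. The tiny glob compiler here handles only `*` and `**` (and does not cover the zero-directory `**/` edge case); it is an illustrative stand-in for a real matcher such as minimatch, and the function names are assumptions:

```typescript
// Convert a simplified glob to a RegExp: "*" stops at "/", "**" crosses it.
function globToRegExp(glob: string): RegExp {
  const escaped = glob
    .replace(/[.+^${}()|[\]\\]/g, "\\$&") // escape regex metacharacters
    .replace(/\*\*/g, "\u0000")            // placeholder for "**"
    .replace(/\*/g, "[^/]*")               // "*" stops at path separators
    .replace(/\u0000/g, ".*");             // "**" crosses separators
  return new RegExp(`^${escaped}$`);
}

// Blocked patterns win over allowed ones; files matching neither list
// are also violations, because allowed_patterns is a whitelist.
function checkFileAgainstPolicy(
  path: string,
  allowed: string[],
  blocked: string[]
): "blocked" | "allowed" | "not-allowed" {
  if (blocked.some((g) => globToRegExp(g).test(path))) return "blocked";
  if (allowed.some((g) => globToRegExp(g).test(path))) return "allowed";
  return "not-allowed";
}
```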
Layer 3: Scanning
Automated security scanning runs against every AI-generated PR before it can proceed to review. This layer includes:
- Secret detection: scan for API keys, tokens, passwords, and other credentials in the diff. Tools like TruffleHog, Gitleaks, and GitHub’s built-in secret scanning catch most patterns.
- Dependency analysis: check any newly added dependencies for known vulnerabilities (CVEs), license compatibility issues, and maintenance status. Flag hallucinated packages that do not exist in the registry.
- Static analysis (SAST): run language-specific static analysis to catch common vulnerability patterns like SQL injection, path traversal, and cross-site scripting.
- Infrastructure-as-code scanning: if the PR modifies Terraform, CloudFormation, or Docker configuration, run tools like Checkov or Trivy to flag misconfigurations.
The key distinction from standard CI security scanning is that the thresholds for AI-generated code should be stricter. A medium-severity finding that would produce a warning on a human PR might produce a hard block on an AI-generated PR, because the probability that the AI introduced the vulnerability without understanding the implications is higher than for a human who chose to accept the risk consciously.
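A minimal sketch of this dual-threshold gating, with illustrative names (not part of any real scanner's API):

```typescript
// The same scan finding is treated more strictly on an AI-generated PR:
// medium severity warns a human author but hard-blocks an AI PR.
type Severity = "low" | "medium" | "high";

function shouldBlockMerge(severity: Severity, isAiGenerated: boolean): boolean {
  if (severity === "high") return true;            // always a hard block
  if (severity === "medium") return isAiGenerated; // warning for humans, block for AI
  return false;                                    // low: warn only, never block
}
```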
Layer 4: Review
The final layer is human review, but structured differently for AI-generated code. Instead of the standard “one approval” rule, AI-generated PRs go through a risk-tiered review process.
A risk score is computed based on factors like:
- number of files changed
- which directories and file types are affected
- whether the PR modifies security-sensitive areas
- whether scanning found any warnings (even non-blocking ones)
- the size and complexity of the diff
Low-risk changes (small scope, non-sensitive files, clean scans) can be auto-merged. Medium-risk changes require one human approval. High-risk changes require two approvals including someone from the security team. This tiering lets you maintain velocity for the 60–70% of AI-generated changes that are genuinely low-risk while concentrating review effort where it matters.
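The tiering can be sketched as a simple score-to-tier mapping. The thresholds (30 and 70) and approval counts here are illustrative defaults; in practice they come from your policy file:

```typescript
// Map a 0-100 risk score to a review tier with its merge requirements.
interface ReviewTier {
  name: "low" | "medium" | "high";
  approvals: number;
  autoMerge: boolean;
  requireSecurityTeam: boolean;
}

function assignTier(score: number): ReviewTier {
  if (score < 30) return { name: "low", approvals: 0, autoMerge: true, requireSecurityTeam: false };
  if (score < 70) return { name: "medium", approvals: 1, autoMerge: false, requireSecurityTeam: false };
  return { name: "high", approvals: 2, autoMerge: false, requireSecurityTeam: true };
}
```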
Policy-as-code for AI agents
The governance layers described above need a concrete implementation. The approach we recommend is a declarative configuration file that lives in the repository alongside the code it governs. We call this file .ai-code-gate.yml.
```yaml
detection:
  labels: ["ai-generated", "copilot", "claude"]
  co_authors: ["*[bot]@*", "*noreply@anthropic.com"]
policy:
  allowed_patterns:
    - "src/**/*.ts"
    - "src/**/*.tsx"
    - "tests/**"
  blocked_patterns:
    - "*.env*"
    - "**/auth/**"
    - "docker-compose*.yml"
    - ".github/workflows/**"
  scope_limits:
    max_files: 20
    max_lines_added: 500
review:
  risk_tiers:
    low:
      threshold: 30
      approvals: 0
      auto_merge: true
    medium:
      threshold: 70
      approvals: 1
    high:
      threshold: 100
      approvals: 2
      require_security_team: true
```
This configuration encodes your organization’s policies as structured data that your CI/CD pipeline can enforce automatically. Let’s walk through each section.
Detection rules
The detection block defines how the pipeline identifies AI-generated PRs. It checks PR labels and commit Co-authored-by trailers against patterns you define. Glob-style wildcards let you match broad patterns without maintaining an exhaustive list of every bot email address. When any detection signal matches, the PR is flagged as AI-generated and the remaining governance checks apply.
Allowed and blocked patterns
The allowed_patterns list is a whitelist of file paths the agent is permitted to modify. The blocked_patterns list is a blacklist that takes precedence. If a file matches both, the block wins.
This is conceptually similar to a CODEOWNERS file, but serves a different purpose. CODEOWNERS controls who must review a change. .ai-code-gate.yml controls whether the change is allowed to exist at all. An AI agent might be blocked from modifying .github/workflows/ regardless of who reviews it, because modifications to CI/CD pipelines by an AI introduce too much risk for your organization’s tolerance.
Scope limits
The scope_limits section caps the size of AI-generated PRs. A limit of 20 files and 500 added lines forces agents (and the humans configuring them) to break large changes into reviewable chunks. This is a guardrail against the common pattern where an agent is asked to “refactor the authentication module” and produces a 2,000-line PR that nobody actually reviews because it is too large to reason about.
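A sketch of the enforcement logic, assuming a `ScopeLimits` shape that mirrors the `scope_limits` block of the configuration (names are illustrative):

```typescript
// Reject AI-generated PRs that exceed the configured size caps.
interface ScopeLimits {
  maxFiles: number;
  maxLinesAdded: number;
}

function checkScope(filesChanged: number, linesAdded: number, limits: ScopeLimits): string[] {
  const violations: string[] = [];
  if (filesChanged > limits.maxFiles) {
    violations.push(`PR touches ${filesChanged} files (limit ${limits.maxFiles})`);
  }
  if (linesAdded > limits.maxLinesAdded) {
    violations.push(`PR adds ${linesAdded} lines (limit ${limits.maxLinesAdded})`);
  }
  return violations; // empty array means the PR is within scope
}
```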
Risk tiers
The review section defines three tiers based on a computed risk score (0–100). The score calculation considers file count, diff size, which directories are touched, and scanning results. Low-risk PRs auto-merge if scans are clean. Medium-risk PRs need one approval. High-risk PRs need two approvals and someone from the security team.
These thresholds are tunable. Start conservative and adjust as you build confidence in your detection and scanning layers. Most teams find that 60–70% of AI-generated PRs fall in the low-risk tier after a few weeks of tuning, which means governance adds almost no friction for the majority of changes.
Sandboxed execution
Static analysis catches a lot, but it cannot catch everything. Some classes of bugs only surface at runtime: environment variable misuse, incorrect API call sequences, race conditions, broken integrations. Sandboxed execution gives you a way to run AI-generated code before merging it, without exposing your production environment or CI infrastructure.
Why run code before merging
The traditional model is to merge code and then run it in a staging environment. For AI-generated code, you want to shift that left. Running the code in a sandbox before merge gives you a signal about runtime behavior that no static analysis tool can provide. Does the code actually start? Do the tests pass? Does it try to make unexpected network calls? Does it consume excessive memory or CPU?
This is especially important because AI agents do not always write correct code on the first attempt. They produce code that looks plausible but may have subtle issues that only appear when you try to execute it. Catching these before merge saves the time and context-switching cost of a revert.
Container-based sandboxes
The most practical approach is running AI-generated code in isolated Docker containers with constrained capabilities:
- Isolated filesystem: the container gets a copy of the repository with the PR changes applied, but cannot write back to the host or access other repositories.
- Network restrictions: the container has no outbound network access, or access limited to specific internal endpoints needed for tests. This prevents exfiltration of secrets or data, and blocks dependency installation from untrusted sources.
- Resource limits: CPU, memory, and execution time are capped. If the AI-generated code contains an infinite loop or a memory leak, the container is killed after the timeout rather than consuming shared CI resources.
- No privileged access: the container runs without root privileges and without access to the Docker socket. The AI-generated code cannot escape the sandbox.
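One way a sandbox-test step might assemble these constraints into `docker run` flags is sketched below. The base image, mount path, and test command are assumptions for a Node project; the wall-clock timeout would be enforced separately by killing the container process:

```typescript
// Build docker run arguments enforcing the sandbox constraints above:
// no network, capped resources, non-root, no privilege escalation.
function buildSandboxArgs(workspaceDir: string): string[] {
  return [
    "run", "--rm",
    "--network", "none",              // no outbound network access
    "--memory", "2g",                 // cap memory
    "--cpus", "2",                    // cap CPU
    "--pids-limit", "256",            // limit process count (fork bombs)
    "--security-opt", "no-new-privileges",
    "--user", "1000:1000",            // never run as root
    "--mount", `type=bind,source=${workspaceDir},target=/workspace`,
    "--workdir", "/workspace",
    "node:20-slim",                   // assumed base image for a Node project
    "npm", "test",
  ];
}
```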
Test execution in sandbox
Within the sandbox, you run the project’s test suite against the AI-generated changes. This serves two purposes: it validates that the new code works, and it validates that the new code does not break existing functionality. If the agent was supposed to include tests for its changes (a policy you can enforce in the policy section), the sandbox also verifies that those tests exist and pass.
Approaches compared
Several approaches exist for sandboxed execution, each with different trade-offs:
- Docker-based (DIY): you manage your own container images, security configuration, and cleanup. Maximum control, most operational overhead. Good for teams with existing container infrastructure.
- E2B: a managed service that provides sandboxed environments designed for AI agent code execution. Lower operational overhead, but adds an external dependency and cost.
- Kubernetes Jobs: ephemeral pods with security contexts, network policies, and resource quotas. Good for teams already running Kubernetes. Provides strong isolation through pod security standards.
For most teams, Docker-based sandboxes running in your existing CI infrastructure are the right starting point. They require no new external dependencies and integrate cleanly with GitHub Actions, GitLab CI, or whatever CI system you already use.
The audit trail
Every governance decision needs a record. When an incident occurs six months from now and you need to understand how a piece of AI-generated code reached production, you need to reconstruct the complete chain: who triggered the agent, what prompt was used, what code was generated, what scans ran, what the results were, who approved it, and when it was merged.
What to log
A complete audit trail for AI-generated code includes:
- Trigger event: who initiated the agent session, what ticket or task was referenced, what the original prompt or instruction was
- Generation context: which model was used, what files the agent read for context, what tools the agent invoked during generation
- Code diff: the exact changes the agent produced, stored as a versioned artifact
- Detection result: how the PR was identified as AI-generated, which signals matched
- Policy check result: which patterns were evaluated, whether any violations were found
- Scan results: full output from secret detection, dependency analysis, SAST, and any other scanners
- Risk score: the computed score, the factors that contributed to it, and the resulting tier assignment
- Review decisions: who reviewed, when, what comments were made, whether any review was overridden
- Merge event: when the code was merged, by whom, to which branch
Compliance relevance
If your organization operates under SOC 2, ISO 27001, FedRAMP, or similar compliance frameworks, AI-generated code creates specific documentation requirements. Auditors will ask how you ensure that AI-generated code meets the same security standards as human-written code. Having a structured audit trail that demonstrates detection, policy enforcement, scanning, and review is the difference between a smooth audit and a finding.
SOC 2 Type II in particular requires evidence that security controls operate consistently over time. Point-in-time screenshots do not suffice. You need continuous, automated evidence collection, which is exactly what a well-implemented audit trail provides.
Structured events vs. unstructured logs
Avoid dumping governance data into application logs as unstructured text. Instead, emit structured audit events with consistent schemas. Each event should be a JSON object with a defined type, timestamp, correlation ID (linking it to the PR), and event-specific payload.
Structured events are queryable, aggregatable, and integrable with your SIEM and compliance tooling. Unstructured logs require parsing, break when formats change, and make incident investigation painful. Invest in the structured approach from the start.
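A minimal event envelope might look like the following sketch. Field and event-type names here are illustrative assumptions; align them with your SIEM's schema:

```typescript
// Structured audit event with a consistent envelope: every event carries
// a type, an ISO 8601 timestamp, and a correlation ID linking it to a PR.
interface AuditEvent {
  type: string;          // e.g. "scan.completed", "review.approved"
  timestamp: string;     // ISO 8601
  correlationId: string; // links all events for one PR
  payload: Record<string, unknown>;
}

function makeAuditEvent(
  type: string,
  correlationId: string,
  payload: Record<string, unknown>
): AuditEvent {
  return { type, timestamp: new Date().toISOString(), correlationId, payload };
}

// Example: record a clean secret-scan result for a PR
const event = makeAuditEvent("scan.completed", "pr-42", {
  scanner: "gitleaks",
  findings: 0,
  blocking: false,
});
```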
The OWASP Top 10 for Agentic Applications lists insufficient logging and monitoring as a top risk for agentic systems. The OpenSSF AI Code Assistant Security Guide similarly emphasizes audit trails as a foundational control. These are not theoretical recommendations; they reflect observed incidents where lack of logging made it impossible to determine the scope of AI-related security events.
The ai-code-gate reference implementation
To make the governance framework concrete and immediately usable, we built ai-code-gate, a reference implementation that you can install into any GitHub repository.
Repository structure
The repo is organized as a set of GitHub Actions composite actions that chain together into a complete governance pipeline:
```text
ai-code-gate/
  .github/
    actions/
      detect-ai-pr/       # Identify AI-generated PRs
      policy-check/       # Enforce file and scope policies
      security-scan/      # Run secret, dependency, and SAST scans
      sandbox-test/       # Execute tests in isolated Docker container
      risk-assessment/    # Compute risk score and assign tier
    workflows/
      ai-code-gate.yml    # Main workflow that chains all actions
  src/
    detect.ts             # AI PR detection engine
    policy.ts             # Policy loading and diff validation
    risk.ts               # Risk scoring and tier assignment
  examples/
    .ai-code-gate.yml     # Default configuration for adopters
    sample-app/           # Demo Express API for testing the pipeline
  .ai-code-gate.yml       # Policy for this repo itself
  README.md
```
Workflow architecture
The main workflow is triggered on pull_request events and runs detection first, then fans out policy checks, security scans, and sandbox tests in parallel before converging on risk assessment:
```yaml
name: AI Code Gate

on:
  pull_request:
    types: [opened, synchronize, reopened]

jobs:
  detect:
    runs-on: ubuntu-latest
    outputs:
      is_ai_pr: ${{ steps.detect.outputs.is_ai_pr }}
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - uses: ./.github/actions/detect-ai-pr
        id: detect

  policy-check:
    needs: detect
    if: needs.detect.outputs.is_ai_pr == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/policy-check

  security-scan:
    needs: detect
    if: needs.detect.outputs.is_ai_pr == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/security-scan

  sandbox-test:
    needs: detect
    if: needs.detect.outputs.is_ai_pr == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/sandbox-test

  risk-assessment:
    needs: [detect, policy-check, security-scan, sandbox-test]
    if: always() && needs.detect.outputs.is_ai_pr == 'true'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: ./.github/actions/risk-assessment
```
The workflow short-circuits for non-AI PRs (the if conditions on each job). Human-authored PRs pass through without any additional checks, so adopting ai-code-gate adds zero overhead to your existing human workflows.
Risk scoring
The risk calculator evaluates multiple factors and produces a weighted score from 0 to 100:
```typescript
interface RiskFactors {
  filesChanged: number;
  linesAdded: number;
  sensitivePathsModified: string[];
  newDependenciesAdded: number;
  scanFindings: ScanFinding[];
  hasTests: boolean;
}

function calculateRiskScore(factors: RiskFactors): number {
  let score = 0;

  // File count contributes up to 20 points
  score += Math.min(20, factors.filesChanged * 2);

  // Lines added contribute up to 20 points
  score += Math.min(20, Math.floor(factors.linesAdded / 50));

  // Sensitive paths are heavily weighted
  score += factors.sensitivePathsModified.length * 15;

  // New dependencies add risk
  score += factors.newDependenciesAdded * 5;

  // Scan findings add directly to risk
  for (const finding of factors.scanFindings) {
    score += finding.severity === "high" ? 20 : finding.severity === "medium" ? 10 : 5;
  }

  // Having tests reduces risk
  if (factors.hasTests) {
    score = Math.max(0, score - 10);
  }

  return Math.min(100, score);
}
```
The weights are configurable. The defaults are designed to be conservative: a PR that touches authentication code or adds new dependencies will land in the medium or high tier even if it is small. You tune the weights based on your own risk profile after observing the score distribution across your first few weeks of AI-generated PRs.
Incremental adoption
You do not need to deploy all five actions at once. The recommended adoption path:
- Week 1: deploy detection only. Label AI-generated PRs but take no blocking action. Observe how many PRs are detected and verify accuracy.
- Week 2: add policy checks in warning mode. Flag policy violations in PR comments but do not block merge.
- Week 3: add security scanning. Again, report findings without blocking.
- Week 4: enable blocking mode for policy violations and high-severity scan findings.
- Week 5+: enable risk-tiered review gates. Tune thresholds based on observed data.
This gradual rollout builds confidence and avoids the organizational friction of deploying a blocking governance system on day one. By the time you enable blocking, everyone has seen the system in action and understands what it checks and why.
What is next
If you want to implement this framework hands-on, follow the companion tutorial: Lock Down AI Coding Agent Pipelines. It walks through installing ai-code-gate, configuring policies for your codebase, and validating each governance layer with test PRs.
The governance landscape for AI-generated code is evolving rapidly. Directions we expect to see mature over the next 12 months include AI-generated test coverage requirements (agents must produce tests that achieve a minimum coverage threshold for the code they write), automated rollback triggers (if AI-generated code causes an anomaly in production metrics, automatically revert the merge), and model-specific policy tuning (different risk profiles for different AI models based on observed quality differences).
The fundamental insight is that AI coding agents are not going away, and their capabilities are only increasing. The organizations that will benefit most are those that build governance infrastructure now, while the volume of AI-generated code is manageable enough to iterate on policies and tooling. Waiting until AI agents write the majority of your code to start thinking about governance means retroactively applying controls to a codebase where you cannot confidently distinguish AI-written code from human-written code.
For more on the topics covered in this guide:
- AI Coding Agents in 2026: How MCP and Tool-Augmented Models Are Changing Development covers the capabilities and architecture of modern AI coding agents.
- How to Secure Agentic AI Applications: The 2026 Playbook provides a broader security framework for AI-powered systems.
- Software Supply Chain Security in the Age of AI addresses dependency and build pipeline risks.
- Harden Your CI/CD Pipeline with Sigstore, SLSA, and SBOMs covers complementary CI/CD hardening techniques.
Get the AI Code Gate Pipeline →
Get the free AI Code Governance Checklist →
Frequently asked questions
How do you detect AI-generated pull requests?
Detection uses multiple signals layered together: Co-authored-by commit trailers from tools like GitHub Copilot and Claude Code, PR labels such as ai-generated or copilot, known bot author accounts, and branch naming conventions like copilot/ or claude/ prefixes. Combining these signals reduces false negatives. The detection step runs first in the pipeline and gates whether the remaining AI-specific checks are triggered.
What should you scan AI-generated code for?
AI-generated code should be scanned for four categories: secrets and credentials (using tools like Gitleaks), dependency vulnerabilities and hallucinated package names (using npm audit, pip-audit, or similar), static analysis findings (using Semgrep or similar SAST tools), and infrastructure-as-code misconfigurations (for Terraform, CloudFormation, or Dockerfiles). Thresholds for AI-generated code should be stricter than for human-written code because the volume and speed of AI output amplifies the impact of systematic errors.
How do risk-tiered review gates work?
Risk-tiered review gates assign a composite risk score to each AI-generated pull request based on factors like the number of files changed, whether sensitive paths are touched, security scan findings, and policy violations. The score maps to a tier: low-risk PRs may auto-merge after passing all scans, medium-risk PRs require one human approval, and high-risk PRs require two approvals including a security reviewer. The tier thresholds and scoring weights are configurable in the policy file.