Most teams running AI in production treat their CI pipeline the same way they did before a language model ever touched their codebase. They lint. They run tests. They deploy. Then they wonder why their AI-generated code ships subtle defects that no individual stage was designed to catch. The problem is not that the pipeline is broken. The problem is that it was built for a world where every line of code was written by a human who understood the entire system.

When AI generates code, it introduces failure modes that traditional CI was never designed to detect. Style violations that are technically valid but violate brand constraints. Authorization policies that compile but leak data across tenants. Prose content embedded in code that contains characters your brand guidelines explicitly forbid. A standard lint, test, deploy pipeline catches syntax errors and logic bugs. It does not catch governance violations. And in an AI-augmented codebase, governance violations are the most dangerous class of defect because they are invisible to every tool in the standard chain.

The Standard Pipeline and Where It Falls Short

A typical modern CI pipeline for a Next.js application looks something like this: ESLint runs first, catching syntax issues and enforcing formatting rules. TypeScript compilation follows, verifying type safety. Unit and integration tests run next, exercising business logic. If everything passes, the build deploys. This is a solid foundation, and it catches a meaningful percentage of defects before they reach production.

But consider what it misses. An AI assistant generates a Supabase RLS policy that grants access to all rows when a certain condition is met. The policy is syntactically valid SQL. TypeScript has no opinion on it. Your unit tests, unless you wrote explicit cross-tenant isolation tests, will not flag it. The code ships. A tenant can now read another tenant's data. Your pipeline gave you a green checkmark while a data breach was deploying.

Or consider a subtler case. Your security policy requires that no API keys appear in client-side bundles. An AI model generates a utility function that hardcodes a service key directly into a React component because the model optimized for "make it work" rather than "make it safe." ESLint does not flag hardcoded strings. TypeScript has no opinion on secret management. Your tests verify that the function returns the right data, not where the credentials live. The code ships with a secret embedded in a publicly downloadable JavaScript bundle.

These are not hypothetical scenarios. These are the exact failure modes we encountered while building a production platform with AI assistance. And they are the reason we built a 7-stage pipeline that treats governance as a first-class CI concern.

The 7-Stage Pipeline

Our CI workflow runs as a GitHub Actions pipeline with seven distinct stages. Some run in parallel where dependencies allow. Others run strictly sequentially because their inputs depend on prior stages. Here is the full structure:

# .github/workflows/ci.yml
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx eslint . --max-warnings 0

  typecheck:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx tsc --noEmit --strict

  test:
    runs-on: ubuntu-latest
    needs: [lint, typecheck]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx vitest run --reporter=verbose

  governance:
    runs-on: ubuntu-latest
    needs: [lint, typecheck]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx tsx scripts/governance-check.ts

  backup-verify:
    runs-on: ubuntu-latest
    needs: [test, governance]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx tsx backup/verify-backups.ts
    env:
      SUPABASE_URL: ${{ secrets.SUPABASE_URL }}
      SUPABASE_SERVICE_ROLE_KEY: ${{ secrets.SUPABASE_SERVICE_ROLE_KEY }}

  e2e:
    runs-on: ubuntu-latest
    needs: [test, governance]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx playwright install --with-deps chromium
      - run: npx playwright test

  deploy-gate:
    runs-on: ubuntu-latest
    needs: [backup-verify, e2e]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
          cache: 'npm'
      - run: npm ci
      - run: npx tsx scripts/deploy-gate.ts

Notice the dependency graph. Lint and typecheck run in parallel because they are independent. Tests and governance both depend on lint and typecheck passing first, but run in parallel with each other. Backup verification and E2E tests depend on both tests and governance. The deploy gate runs last, only on the main branch, and only if every prior stage has passed.

Stages 1 and 2: Lint and Typecheck as Foundation

The first two stages are standard, but we run them with zero tolerance. ESLint is configured with --max-warnings 0, meaning warnings are treated as errors. This is critical when AI generates code because AI assistants frequently produce code that triggers warnings rather than errors. A function declared but never used. An import pulled in but never referenced. A variable shadowing an outer scope. These are all warnings by default in most ESLint configurations, and AI-generated code produces them at a higher rate than human-written code because the model does not have full project context when generating a single file.

TypeScript runs in strict mode with --noEmit. Strict mode enables strictNullChecks, noImplicitAny, strictFunctionTypes, and several other flags that catch type-level bugs. The --noEmit flag tells TypeScript to check types without producing output files, which is all we need in CI. This stage catches a surprising number of AI-generated defects. Models frequently produce code that works in isolation but breaks when integrated because it assumes a property exists on a type when it is actually optional, or it passes arguments in the wrong order to a function with similar parameter types.
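As an illustration (not taken from the source codebase), here is the kind of optional-property bug strict mode reliably flags:

```typescript
// Illustrative example: under --strict, accessing an optional property
// without a guard is a compile error. AI-generated code frequently
// assumes `plan` always exists; strictNullChecks forces the guard.
interface User {
  id: string;
  plan?: { tier: string };
}

function tierLabel(user: User): string {
  // `user.plan.tier` fails strictNullChecks; optional chaining with
  // a fallback satisfies both the compiler and the runtime.
  return user.plan?.tier ?? 'free';
}
```

Without `--strict`, the unguarded version compiles and throws at runtime the first time a user without a plan hits the code path.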

Stage 3: Unit and Integration Tests

We use Vitest as our test runner. The test suite covers three distinct layers. Unit tests verify pure functions in isolation: prompt builders, scoring algorithms, route validators, data transformers. Integration tests verify that modules work together: API route handlers with their middleware, database operations with their RLS policies, AI pipelines with their prompt chains. Contract tests, described next, pin the structural guarantees of AI-facing pipelines.

For AI-augmented codebases, integration tests deserve special attention. A pure function that builds a prompt template can be unit tested by verifying its output string. But the behavior of that prompt in the context of a full pipeline, where its output feeds into a classifier that routes to different handlers, requires integration-level coverage. We structure these tests using a pattern we call contract testing:

describe('Proposal generation pipeline', () => {
  it('returns structured JSON matching the proposal schema', async () => {
    const input = createMockClientProfile({
      industry: 'SaaS',
      revenue: 2_000_000,
      challenge: 'churn reduction'
    });

    const result = await generateProposal(input);

    // Contract: structure matters more than content
    expect(result).toMatchObject({
      sections: expect.arrayContaining([
        expect.objectContaining({
          title: expect.any(String),
          body: expect.any(String),
          confidence: expect.any(Number)
        })
      ]),
      metadata: expect.objectContaining({
        generatedAt: expect.any(String),
        modelVersion: expect.any(String)
      })
    });

    // Governance: no hardcoded credentials in output
    const fullText = JSON.stringify(result);
    expect(fullText).not.toMatch(/sk[-_]live|sk[-_]test|AKIA[A-Z0-9]/);
  });
});

Notice the last assertion. It checks for patterns that match API keys and secret tokens in the generated output. This is a governance constraint enforced at the test level, bridging the gap between traditional testing and the governance stage that follows.
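The secret pattern in that assertion also appears in the governance stage. One way to keep the two from drifting apart, sketched here with a hypothetical module rather than the author's actual layout, is to extract the pattern into a shared helper:

```typescript
// Sketch of a shared secret-detection module used by both the Vitest
// assertion and the governance check, so the two cannot drift apart.
// The pattern covers Stripe-style keys, AWS access key prefixes, and
// inline password literals.
export const SECRET_PATTERN =
  /sk[-_](live|test)|AKIA[A-Z0-9]|password\s*[:=]\s*['"]/;

// Return the 1-based line numbers in `content` that match the pattern.
export function findSecretLines(content: string): number[] {
  return content
    .split('\n')
    .flatMap((line, i) => (SECRET_PATTERN.test(line) ? [i + 1] : []));
}
```

Both the test suite and the governance script can then import one pattern, and tightening it in one place tightens it everywhere.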

Stage 4: Governance Checks

This is the stage that separates an AI-aware pipeline from a standard one. The governance check script runs a series of invariant verifications that are specific to our platform's rules. These are not tests in the traditional sense. They are structural assertions about the codebase itself.

// scripts/governance-check.ts
import { glob } from 'glob';
import { readFileSync } from 'fs';

interface GovernanceResult {
  rule: string;
  passed: boolean;
  violations: string[];
}

const results: GovernanceResult[] = [];

// GOV-001: No hardcoded secrets in client-side code
const clientFiles = await glob('src/app/**/*.{ts,tsx}');
for (const file of clientFiles) {
  const content = readFileSync(file, 'utf-8');
  const lines = content.split('\n');
  const violations: string[] = [];

  lines.forEach((line, i) => {
    if (/sk[-_](live|test)|AKIA[A-Z0-9]|password\s*[:=]\s*['"]/.test(line)) {
      violations.push(`${file}:${i + 1}: ${line.trim()}`);
    }
  });

  if (violations.length > 0) {
    results.push({
      rule: 'GOV-001: No hardcoded secrets',
      passed: false,
      violations
    });
  }
}

// GOV-002: RLS bypass detection
const supabaseFiles = await glob('src/**/*.{ts,tsx}');
for (const file of supabaseFiles) {
  const content = readFileSync(file, 'utf-8');
  const violations: string[] = [];

  // [\s\S] lets the match span multi-line createClient(...) calls, and
  // the case-insensitive flag catches SERVICE_ROLE env var names; the
  // length bound keeps the match from spanning the whole file
  if (/createClient[\s\S]{0,300}?service_role/i.test(content)) {
    // Service role usage must be in admin-only files
    if (!file.includes('/admin/') && !file.includes('/server/')) {
      violations.push(
        `${file}: service_role client outside admin scope`
      );
    }
  }

  if (violations.length > 0) {
    results.push({
      rule: 'GOV-002: RLS bypass scope',
      passed: false,
      violations
    });
  }
}

// GOV-003: Protected table verification
const migrationFiles = await glob('supabase/migrations/*.sql');
for (const file of migrationFiles) {
  const content = readFileSync(file, 'utf-8');
  const protectedTables = [
    'projects', 'users',
    'payments', 'audit_log'
  ];

  const violations: string[] = [];
  for (const table of protectedTables) {
    // Match the DROP statement itself, not just any co-occurrence of
    // "DROP TABLE" and the table name elsewhere in the migration
    const dropPattern = new RegExp(
      `DROP\\s+TABLE\\s+(IF\\s+EXISTS\\s+)?"?${table}"?\\b`,
      'i'
    );
    if (dropPattern.test(content)) {
      violations.push(
        `${file}: DROP on protected table ${table}`
      );
    }
  }

  if (violations.length > 0) {
    results.push({
      rule: 'GOV-003: Protected tables',
      passed: false,
      violations
    });
  }
}

// Report results
const failures = results.filter(r => !r.passed);
if (failures.length > 0) {
  console.error('Governance check FAILED:');
  for (const f of failures) {
    console.error(`\n  ${f.rule}`);
    f.violations.forEach(v => console.error(`    ${v}`));
  }
  process.exit(1);
}

console.log('All governance checks passed.');

The governance check script is extensible. Each invariant is a discrete check with a rule identifier, a pass/fail result, and a list of specific violations with file paths and line numbers. When a check fails, the CI output tells you exactly which rule was violated, in which file, on which line. There is no ambiguity.
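As the rule count grows, the per-rule boilerplate can be factored into a small registry. This is a refactoring sketch built on the same GovernanceResult shape, not the author's actual structure:

```typescript
// Sketch of an extensible rule registry for a governance-check script.
// GovernanceResult mirrors the shape used above; the registry itself
// is an assumption, not the author's actual code.
interface GovernanceResult {
  rule: string;
  passed: boolean;
  violations: string[];
}

type RuleFn = () => Promise<string[]>; // returns violation strings

const rules: Array<{ id: string; fn: RuleFn }> = [];

function registerRule(id: string, fn: RuleFn): void {
  rules.push({ id, fn });
}

// Run every registered rule; a rule passes when it reports no violations.
async function runAllRules(): Promise<GovernanceResult[]> {
  const results: GovernanceResult[] = [];
  for (const { id, fn } of rules) {
    const violations = await fn();
    results.push({ rule: id, passed: violations.length === 0, violations });
  }
  return results;
}

// Example rule: always passes (reports no violations).
registerRule('GOV-EXAMPLE', async () => []);
```

Each new invariant then becomes one `registerRule` call, and the reporting loop at the bottom of the script stays untouched.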

Why does this matter specifically for AI-generated code? Because AI models do not internalize your governance rules the way a senior engineer would. A human developer who has been told "no secrets in client code" will remember. An AI model will comply with the instruction in the current prompt, but a different prompt, a different session, or a different context window will introduce secrets again. The governance check is the structural backstop that catches drift regardless of which session or which model produced the code.

RLS Bypass Detection

The RLS bypass check deserves special attention. In a Supabase application, the service role key bypasses all Row Level Security policies. This is necessary for admin operations, but it is catastrophic if it leaks into client-facing code. Our governance check verifies that any file importing a service role client is located within an /admin/ or /server/ directory. If a developer (or an AI) creates a service role client in a client-facing API route, the pipeline fails before it can deploy.
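The path scoping reduces to a small predicate. A minimal sketch, with the directory names taken from the check above:

```typescript
// Minimal sketch of the GOV-002 path rule: service role clients are
// permitted only in files under /admin/ or /server/ directories.
// A file like src/app/api/reports/route.ts would be flagged.
function isAllowedServiceRolePath(file: string): boolean {
  return file.includes('/admin/') || file.includes('/server/');
}
```

Keeping the rule this blunt is deliberate: a directory convention is easy for both humans and AI assistants to follow, and easy for CI to verify mechanically.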

Stage 5: Backup Verification

This stage is unique to platforms where data integrity is a business requirement. The backup verification script compares the current database state against a stored baseline. It verifies that all expected tables exist, that their column counts match the baseline, and that row counts have not dropped below a configured threshold. The excerpt below shows the existence and row count checks.

// backup/verify-backups.ts
import { createClient } from '@supabase/supabase-js';
import baseline from './backup-baselines.json';

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!
);

for (const table of baseline.tables) {
  // Verify table exists. With head: true the query returns no rows,
  // so the row total arrives in `count`, not in `data`.
  const { count, error } = await supabase
    .from(table.name)
    .select('*', { count: 'exact', head: true });

  if (error) {
    console.error(`Table ${table.name}: MISSING`);
    process.exit(1);
  }

  // Verify row count has not regressed
  const rowCount = count ?? 0;
  if (rowCount < table.minimumRows) {
    console.error(
      `Table ${table.name}: ${rowCount} rows, ` +
      `expected >= ${table.minimumRows}`
    );
    process.exit(1);
  }
}

This catches a specific class of regression: migrations that accidentally drop data, alter column types in destructive ways, or remove tables that downstream features depend on. When AI generates migration files, this stage provides a safety net that verifies the migration did not silently destroy production data.
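The script reads backup/backup-baselines.json, which the source does not show. Inferring from the fields the script accesses (`tables[i].name` and `.minimumRows`), the baseline plausibly looks like this, expressed as a typed constant for clarity; the table names and thresholds are illustrative:

```typescript
// Hypothetical shape of backup/backup-baselines.json, inferred from
// the fields the verification script reads. Values are illustrative.
interface TableBaseline {
  name: string;        // table that must exist in production
  minimumRows: number; // fail CI if the live row count drops below this
}

const baseline: { tables: TableBaseline[] } = {
  tables: [
    { name: 'projects', minimumRows: 50 },
    { name: 'users', minimumRows: 100 },
    { name: 'audit_log', minimumRows: 1000 },
  ],
};
```

Updating the baseline is a deliberate, reviewed act: a migration that legitimately shrinks a table requires a matching baseline change in the same pull request.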

Stages 6 and 7: E2E Tests and the Deploy Gate

Playwright E2E tests exercise the full application from a browser. They verify that pages render, forms submit, navigation works, and authenticated flows complete successfully. In an AI-augmented codebase, E2E tests serve as the final reality check: regardless of what the unit tests say, does the application actually work when a real user interacts with it?

The deploy gate is the final gatekeeper. It runs only on the main branch and performs one last round of pre-deployment checks against the merged tree.

The deploy gate exists because merges can introduce defects that neither branch's CI run detected. Two branches that pass all checks independently can conflict when merged. The deploy gate catches these conflicts at the last possible moment, before the deployment reaches production.
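The source does not show scripts/deploy-gate.ts. A minimal sketch of what such a gate could do, re-running a fast subset of checks against the merged main branch; the command list and the RUN_DEPLOY_GATE flag are assumptions, not the author's actual gate:

```typescript
// Hypothetical sketch of scripts/deploy-gate.ts: re-run a fast subset
// of checks against the merged tree. The commands are illustrative.
import { execSync } from 'node:child_process';

interface GateCheck {
  name: string;
  cmd: string;
}

// Pure aggregation logic: run each check, collect the names that fail.
function runGate(
  checks: GateCheck[],
  run: (cmd: string) => void
): string[] {
  const failures: string[] = [];
  for (const check of checks) {
    try {
      run(check.cmd);
    } catch {
      failures.push(check.name);
    }
  }
  return failures;
}

const checks: GateCheck[] = [
  { name: 'typecheck', cmd: 'npx tsc --noEmit --strict' },
  { name: 'governance', cmd: 'npx tsx scripts/governance-check.ts' },
];

// RUN_DEPLOY_GATE is a hypothetical opt-in flag so that importing this
// module (for example, in tests) does not shell out.
if (process.env.RUN_DEPLOY_GATE === '1') {
  const failures = runGate(checks, cmd =>
    execSync(cmd, { stdio: 'inherit' })
  );
  if (failures.length > 0) {
    console.error(`deploy-gate FAILED: ${failures.join(', ')}`);
    process.exit(1);
  }
  console.log('deploy-gate passed.');
}
```

The key property is that every command runs against the post-merge state, so defects that only exist in the combination of two branches still fail the build.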

Failure Modes by Stage

Each stage in the pipeline is designed to catch a specific category of defect. Understanding these categories helps you reason about what your pipeline covers and where gaps might exist:

1. Lint: style violations, unused code, formatting drift.
2. Typecheck: type mismatches, null and undefined access, wrong argument shapes.
3. Tests: logic bugs in individual units and broken contracts between modules.
4. Governance: policy violations such as leaked secrets and out-of-scope service role usage.
5. Backup verification: destructive migrations and silent data loss.
6. E2E: integration failures visible only in a real browser session.
7. Deploy gate: defects introduced by the merge itself.

A standard CI pipeline catches defect categories 1 through 3 and sometimes 6. An AI-aware pipeline must also catch categories 4 and 5, because AI-generated code introduces governance and data integrity risks at a rate that human-written code does not. The deploy gate (category 7) is valuable regardless of whether AI is involved, but it becomes critical when AI can generate and commit code in rapid iteration cycles.

Lessons from Production

After running this pipeline across 1,100+ tests and 25+ database migrations, three patterns have become clear.

First, governance checks catch more defects per month than any other stage. Not because the other stages are weak, but because governance violations are the defect type that AI produces most frequently. Models forget constraints between sessions. They optimize for correctness within a single file while violating system-wide invariants. The governance stage exists specifically to catch this class of error, and it does so reliably.

Second, parallel execution matters. Running lint and typecheck in parallel saves roughly 40 seconds per pipeline run. Running tests and governance in parallel saves another 30 seconds. Over hundreds of commits, those savings compound into hours of developer time recovered. Structure your dependency graph to maximize parallelism wherever stages are truly independent.

Third, the deploy gate has prevented three production incidents that would have been costly. In each case, the merge introduced a defect that neither branch's CI run caught. The deploy gate's redundant checks, which feel wasteful when everything is green, are the exact checks that save you when something is not.

The purpose of CI is not to make you feel confident. It is to make you correctly informed. A green pipeline that misses an entire category of defect is worse than no pipeline at all, because it gives you false confidence to ship.

If you are building with AI, your pipeline needs to evolve with your tooling. Add governance checks. Add data verification. Add the stages that catch what AI introduces. Because the cost of catching a governance violation in CI is a failed build. The cost of catching it in production is a customer trust incident. The math is not complicated.