The Standard RAG Pipeline and Where It Breaks
Retrieval-Augmented Generation promised to solve the hallucination problem. Instead of relying purely on parametric memory baked into model weights, RAG retrieves relevant documents and injects them into the context window before generation. The theory is sound. The implementation, in nearly every production system we have audited, is not.
The standard pipeline looks deceptively simple. User query comes in. The query gets embedded into a vector. A similarity search runs against a vector store. Top-k chunks come back. Those chunks get concatenated into a prompt alongside the user's question. The LLM generates a response grounded in the retrieved context. Ship it.
This pipeline fails in at least five distinct ways, and most teams only discover them after deployment when users start reporting answers that look authoritative but are factually wrong.
- Retrieval misalignment: The embedding model captures semantic similarity, not factual relevance. A chunk about "revenue growth strategies" matches a query about "revenue decline causes" because the vector space clusters them together.
- Context contamination: Multiple retrieved chunks may contain contradictory information from different time periods or different authors. The LLM picks whichever fits the narrative it is constructing.
- Synthesis hallucination: The model correctly retrieves factual chunks but then synthesizes a conclusion that none of the individual chunks actually support. Each fact is real. The inference connecting them is fabricated.
- Chunk boundary loss: When documents get split into chunks, critical context at the boundary between two chunks disappears. A paragraph explaining an exception to a rule lands in a different chunk than the rule itself.
- Attribution collapse: The generated response blends information from multiple sources into a single fluent paragraph with no indication of which claim came from which source. Verification becomes impossible.
The core problem is that standard RAG optimizes for fluency, not for verifiability. An accountable RAG system inverts that priority.
Hallucination in Retrieval Contexts
There is a widespread misconception that RAG eliminates hallucination. It does not. It changes the character of hallucination. Without retrieval, LLMs fabricate facts from their training data distribution. With retrieval, LLMs fabricate connections between real facts. This second type is harder to detect because the individual components are verifiably true.
Consider a knowledge base containing two chunks. Chunk A states: "Company X achieved 40% revenue growth in Q3 2025." Chunk B states: "The SaaS market experienced a downturn in Q3 2025." A standard RAG system, asked "How did Company X perform relative to the market?", might generate: "Company X achieved exceptional performance, growing revenue 40% while the broader SaaS market declined, suggesting their product-market fit insulated them from macro headwinds."
Every surface-level fact is grounded. But the causal inference about product-market fit and insulation from macro headwinds exists nowhere in the source material. The model constructed a plausible narrative and dressed it in the authority of retrieved facts. This is grounded hallucination, and it is the primary failure mode of production RAG systems.
The most dangerous hallucination is one that cites real sources. It inherits the credibility of the citation while fabricating the conclusion.
Citation Architecture: Tracking Every Claim to Its Source
An accountable RAG system requires citation architecture that operates at three levels: chunk identification, claim extraction, and citation binding. Each level introduces traceability that standard pipelines lack entirely.
Chunk Identification
Every chunk in your vector store needs a stable, globally unique identifier that traces back to the original document, the specific location within that document, and the version of the document at ingestion time. This is not optional metadata. It is the foundation of the entire accountability chain.
interface SourceChunk {
  chunk_id: string;         // deterministic hash of content + position
  document_id: string;      // stable reference to source document
  document_version: string; // version or last-modified timestamp
  start_offset: number;     // character offset in original document
  end_offset: number;       // character offset end
  section_path: string[];   // hierarchical path: ["Chapter 3", "Section 2", "Paragraph 4"]
  ingested_at: string;      // ISO timestamp of ingestion
  content: string;          // the actual chunk text
  confidence: number;       // retrieval similarity score (0-1)
}
The section_path field is critical. When a user wants to verify a citation, telling them "this came from document X" is not useful. Telling them "this came from Chapter 3, Section 2, Paragraph 4 of document X, version 2.1" gives them a precise location to check.
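A deterministic chunk_id can be derived from the identifying fields themselves, so re-ingesting an unchanged document yields identical IDs and existing citations never silently remap. A minimal sketch (the parameter names mirror the SourceChunk fields above; the 16-character truncation is a convenience, not a requirement):

```typescript
import { createHash } from "node:crypto";

// Deterministic chunk ID: a hash of document identity, version, position,
// and content. Re-ingesting unchanged content yields the same ID; any edit
// to the chunk produces a new one, so a stale citation surfaces as BROKEN
// instead of silently pointing at different text.
function makeChunkId(
  documentId: string,
  documentVersion: string,
  startOffset: number,
  content: string
): string {
  return createHash("sha256")
    .update(`${documentId}\u0000${documentVersion}\u0000${startOffset}\u0000${content}`)
    .digest("hex")
    .slice(0, 16); // keep the full digest if collision resistance matters more than brevity
}
```

The null-byte separator prevents ambiguous concatenations (e.g. document "a" + version "1b" colliding with document "a1" + version "b").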
Claim Extraction
Decompose the LLM's output into individual claims, each a discrete, verifiable statement. This decomposition can happen as a post-processing step over the generated text or, better, be built into the generation prompt so the model emits citations as it writes.
// System prompt fragment for claim-level citation
const CITATION_PROMPT = `
For every factual claim in your response, append an inline citation
using the format [src:CHUNK_ID]. If a claim synthesizes information
from multiple sources, cite all of them: [src:CHUNK_A,CHUNK_B].
If you draw a conclusion not directly stated in any source,
mark it as [inference] and explain your reasoning.
If the retrieved context does not contain sufficient information
to answer confidently, respond with: "The available sources do
not contain enough information to answer this question reliably."
Do not guess. Do not extrapolate.
`;
This prompt structure creates three categories of output: cited claims (traceable to specific chunks), declared inferences (explicitly marked as model reasoning), and abstentions (honest acknowledgment of insufficient evidence). Each category has a different trust level, and your UI should reflect that distinction visually.
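A minimal parser for these markers might look like the following sketch. The ParsedClaim shape is an assumption, and the sentence-splitting regex is deliberately naive; a production parser would handle abbreviations, lists, and claims spanning multiple sentences.

```typescript
interface ParsedClaim {
  text: string;          // claim with citation markers stripped
  citations: string[];   // chunk IDs from [src:...] markers
  isInference: boolean;  // true when the model marked it [inference]
}

// Split the response into sentences, then pull citation markers out of each.
// Naive sentence splitting -- illustration only.
function extractClaims(responseText: string): ParsedClaim[] {
  const sentences = responseText.match(/[^.!?]+[.!?]/g) ?? [];
  return sentences.map(sentence => {
    const citations = [...sentence.matchAll(/\[src:([^\]]+)\]/g)]
      .flatMap(m => m[1].split(","))
      .map(id => id.trim());
    return {
      text: sentence.replace(/\[src:[^\]]+\]|\[inference\]/g, "").trim(),
      citations,
      isInference: /\[inference\]/.test(sentence),
    };
  });
}
```

Each parsed claim then flows into the validation pass: cited claims get entailment-checked, declared inferences get flagged in the UI, and anything with no citation and no inference marker is itself suspicious.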
Citation Binding and Validation
The citation is only useful if you validate it. After generation, run a verification pass that checks whether each cited chunk actually supports the claim it is attached to. This is where most teams skip a step and pay for it later.
async function validateCitations(
  response: GeneratedResponse
): Promise<ValidationResult[]> {
  const results: ValidationResult[] = [];
  for (const claim of response.claims) {
    for (const citationId of claim.citations) {
      const sourceChunk = await chunkStore.get(citationId);
      if (!sourceChunk) {
        results.push({
          claim: claim.text,
          citation: citationId,
          status: 'BROKEN',
          reason: 'Referenced chunk not found in store'
        });
        continue;
      }
      // Use a smaller, faster model for entailment checking:
      // premise = source chunk, hypothesis = the claim citing it
      const entailment = await checkEntailment(
        sourceChunk.content,
        claim.text
      );
      results.push({
        claim: claim.text,
        citation: citationId,
        status: entailment.supported ? 'VERIFIED' : 'UNSUPPORTED',
        confidence: entailment.score,
        reason: entailment.explanation
      });
    }
  }
  return results;
}
The checkEntailment function performs Natural Language Inference (NLI) to determine whether the source chunk logically entails the claim. You can use a fine-tuned NLI model (like DeBERTa trained on MNLI) or a smaller LLM prompted for entailment classification. The key requirement: this validation model must be separate from the generation model. You cannot ask the same model that generated the claim to objectively judge whether the claim is supported.
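The shape of that check might look like the following sketch. The NliFn backend type, the 0.8 acceptance threshold, and the stub backend are all assumptions for illustration; the stub must be replaced with a real NLI model before any of this means anything.

```typescript
type NliLabel = "entailment" | "neutral" | "contradiction";
type NliFn = (premise: string, hypothesis: string) =>
  Promise<{ label: NliLabel; score: number }>;

interface EntailmentResult {
  supported: boolean;
  score: number;
  explanation: string;
}

// Illustration-only stub: treats a hypothesis as entailed when the premise
// contains it verbatim. Swap in a DeBERTa-MNLI endpoint or a small LLM
// prompted for three-way classification -- never the generation model itself.
const nli: NliFn = async (premise, hypothesis) => ({
  label: premise.includes(hypothesis) ? "entailment" : "neutral",
  score: 0.9,
});

async function checkEntailment(
  premise: string,     // source chunk content
  hypothesis: string,  // the claim citing it
  threshold = 0.8      // assumed cutoff; tune on a labeled evaluation set
): Promise<EntailmentResult> {
  const { label, score } = await nli(premise, hypothesis);
  return {
    supported: label === "entailment" && score >= threshold,
    score,
    explanation: `NLI label "${label}" at score ${score.toFixed(2)}`,
  };
}
```

Note that only the "entailment" label counts as support; "neutral" is treated the same as "contradiction", because a claim the source merely fails to contradict is still unverified.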
Embedding Strategies for Precision vs. Recall
The embedding model determines what gets retrieved. In an accountable system, retrieval precision matters more than recall. Retrieving ten chunks where three are relevant and seven are noise creates more hallucination surface area than retrieving three chunks that are all highly relevant.
Three strategies improve retrieval precision in practice.
Hypothetical Document Embeddings (HyDE)
Instead of embedding the raw user query, generate a hypothetical answer first, then embed that answer. The hypothetical answer lives in the same semantic space as the document chunks (declarative statements), while the user query often lives in a different space (questions, commands). HyDE bridges that gap. The trade-off: it adds one LLM call to the retrieval path, increasing latency by 200 to 500 milliseconds depending on the model.
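In sketch form, with the model and vector-store clients injected as plain functions (none of these names come from a specific SDK):

```typescript
// HyDE: generate a hypothetical answer, then embed that instead of the
// raw query. The declarative hypothetical lands nearer to document chunks
// in vector space than an interrogative query does.
async function hydeSearch<Chunk>(
  query: string,
  llm: (prompt: string) => Promise<string>,
  embed: (text: string) => Promise<number[]>,
  search: (vector: number[], k: number) => Promise<Chunk[]>,
  k = 5
): Promise<Chunk[]> {
  const hypothetical = await llm(
    `Write a short factual passage that would answer: "${query}". ` +
    `Write it as if it appeared in a reference document.`
  );
  // The hypothetical may contain fabricated details; that is acceptable
  // here because it serves only as a retrieval probe and is never shown
  // to the user or to the generation step as evidence.
  return search(await embed(hypothetical), k);
}
```

The fabrication tolerance is the key design point: HyDE uses hallucination as a search heuristic while keeping it strictly out of the evidence chain.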
Multi-vector Retrieval
Embed each chunk at multiple granularities: the full chunk, a summary of the chunk, and extracted key entities. At query time, search across all three vector spaces and merge results with reciprocal rank fusion. Chunks that appear in the top results across multiple representations are more likely to be genuinely relevant, not just semantically similar on one dimension.
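The fusion step itself is a few lines. A sketch over chunk-ID rankings, using the conventional smoothing constant k = 60 from the original RRF formulation:

```typescript
// Reciprocal rank fusion: merge ranked lists from multiple vector spaces.
// An item's fused score is the sum of 1 / (k + rank) across every list it
// appears in, so chunks ranked well in several representations rise to the top.
function reciprocalRankFusion(
  rankings: string[][], // each inner array: chunk IDs, best first
  k = 60
): string[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```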
Parent Document Retrieval
Store small chunks for embedding precision but retrieve the parent document section for context completeness. When a 200-token chunk matches, pull the full 2000-token section it belongs to. This preserves the contextual information lost at chunk boundaries without sacrificing embedding precision. The implementation requires maintaining a mapping from chunk IDs to parent section IDs in your metadata store.
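The lookup can be sketched as follows; the two Map stores stand in for whatever metadata and document stores you actually run:

```typescript
// Parent-document retrieval: match on small chunks, return full parent
// sections, deduplicated so sibling chunks from the same section do not
// inflate the context with repeated text.
function retrieveParentSections(
  matchedChunkIds: string[],
  chunkToParent: Map<string, string>,  // chunk ID -> parent section ID
  sectionStore: Map<string, string>    // parent section ID -> section text
): string[] {
  const parentIds = new Set<string>();
  for (const chunkId of matchedChunkIds) {
    const parentId = chunkToParent.get(chunkId);
    if (parentId !== undefined) parentIds.add(parentId);
  }
  return [...parentIds]
    .map(id => sectionStore.get(id))
    .filter((s): s is string => s !== undefined);
}
```

The deduplication matters: when two 200-token chunks from the same section both match, you want the 2000-token parent once, not twice.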
Rule of thumb: if your top-5 retrieved chunks contain fewer than 3 that are directly relevant to the query, your embedding strategy needs work. Measure retrieval precision weekly on a labeled evaluation set.
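Measuring that is a one-liner once you have labels. A sketch of precision@k for a single query; average the result across your labeled evaluation set:

```typescript
// Precision@k: the fraction of retrieved chunk IDs that labelers marked
// relevant for this query. The rule of thumb above corresponds to
// requiring precision@5 >= 0.6.
function precisionAtK(
  retrieved: string[],   // top-k chunk IDs for one query
  relevant: Set<string>  // labeled relevant chunk IDs for that query
): number {
  if (retrieved.length === 0) return 0;
  const hits = retrieved.filter(id => relevant.has(id)).length;
  return hits / retrieved.length;
}
```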
Re-ranking for Relevance
Vector similarity is a coarse filter. It narrows millions of chunks to hundreds. Re-ranking is the precision layer that narrows hundreds to the handful that actually matter.
A cross-encoder re-ranker takes the query and each candidate chunk as a pair and produces a relevance score that accounts for the full interaction between query and document, not just their independent embeddings. This is computationally expensive (you cannot pre-compute it), which is why it runs on the shortlist, not the full corpus.
async function retrieveWithReranking(
  query: string,
  topK: number = 5,
  initialPool: number = 50
): Promise<SourceChunk[]> {
  // Stage 1: Coarse retrieval via vector similarity
  const candidates = await vectorStore.search(query, initialPool);

  // Stage 2: Cross-encoder re-ranking
  const scored = await Promise.all(
    candidates.map(async (chunk) => ({
      chunk,
      relevance: await crossEncoder.score(query, chunk.content)
    }))
  );

  // Stage 3: Filter, sort, and carry the re-ranker score forward so
  // downstream confidence checks use it, not the coarse similarity
  return scored
    .filter(s => s.relevance > 0.7) // hard relevance threshold
    .sort((a, b) => b.relevance - a.relevance)
    .slice(0, topK)
    .map(s => ({ ...s.chunk, confidence: s.relevance }));
}
The hard relevance threshold at 0.7 is important. Standard top-k retrieval always returns k results, even if none of them are relevant. In an accountable system, returning zero results and admitting ignorance is better than returning marginally relevant chunks that invite hallucination. Set the threshold empirically using your evaluation set and tune it until the false-positive rate on irrelevant retrievals drops below 5%.
Confidence Thresholds and the Value of "I Don't Know"
The most important capability in an accountable RAG system is the ability to decline to answer. This requires confidence scoring at multiple levels.
- Retrieval confidence: Are the retrieved chunks relevant? Measured by the re-ranker score. If the best chunk scores below 0.7, the system does not have relevant context.
- Coverage confidence: Do the retrieved chunks cover the full scope of the question? A question with three sub-parts where only one has supporting chunks should surface that gap explicitly.
- Entailment confidence: Does the generated response follow from the retrieved chunks? Measured by the post-generation validation pass.
- Consistency confidence: If you run the same query three times, do you get materially the same answer? High variance across runs indicates the model is interpolating rather than grounding.
interface ConfidenceReport {
  retrieval_score: number;    // avg re-ranker score of used chunks
  coverage_score: number;     // fraction of query aspects covered
  entailment_score: number;   // avg entailment score across claims
  consistency_score: number;  // semantic similarity across N runs
  overall: number;            // weighted composite
  decision: 'ANSWER' | 'PARTIAL' | 'ABSTAIN';
}

function computeDecision(report: ConfidenceReport): ConfidenceReport['decision'] {
  if (report.overall >= 0.85) return 'ANSWER';
  if (report.overall >= 0.60) return 'PARTIAL';
  return 'ABSTAIN';
}
When the decision is PARTIAL, the system should answer what it can and explicitly state what it cannot. When the decision is ABSTAIN, the system should explain why: which aspect of the query lacked supporting evidence. This transforms the system from an oracle (always answers, sometimes wrong) into an advisor (answers when confident, flags uncertainty when not).
An AI system that says "I don't know" when it genuinely doesn't know is more trustworthy than one that always produces a fluent answer. Fluency is not fidelity.
Audit Trails for Every AI Response
Every response generated by an accountable RAG system must produce an audit record. This record is not for debugging. It is for accountability. When a user, a compliance officer, or a downstream system questions a response, the audit trail must reconstruct the full chain: what was asked, what was retrieved, what was generated, what was validated, and what confidence level was assigned.
interface AuditRecord {
  request_id: string;
  timestamp: string;
  user_query: string;
  query_embedding_model: string;
  retrieval_results: {
    chunk_id: string;
    similarity_score: number;
    reranker_score: number;
    was_used_in_generation: boolean;
  }[];
  generation_model: string;
  generation_prompt_hash: string; // hash of the full prompt template
  generated_response: string;
  claims: {
    text: string;
    citations: string[];
    entailment_scores: number[];
    validation_status: 'VERIFIED' | 'UNSUPPORTED' | 'BROKEN';
  }[];
  confidence_report: ConfidenceReport;
  response_decision: 'ANSWER' | 'PARTIAL' | 'ABSTAIN';
  latency_ms: number;
  token_usage: { prompt: number; completion: number; total: number; };
}
Store these records in an append-only log. Not in your application database. Not in your LLM provider's logs. In a storage system you control, with retention policies appropriate for your compliance requirements. For regulated industries (healthcare, finance, legal), retention typically means years, not days.
Querying the Audit Trail
The audit trail is only valuable if it is queryable. Build indexes on at least these dimensions: time range, user ID, confidence band (high, medium, low), validation status (all verified, any unsupported, any broken), and document source. When an issue surfaces, you need to answer questions like: "Show me all responses from the last 30 days that cited document X and had at least one unsupported claim." If your audit store cannot answer that query in seconds, it is not serving its purpose.
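As a sketch, that example query reduces to the following filter. The AuditRow shape is a slimmed-down stand-in for the full AuditRecord, and in production this runs against indexed storage rather than an in-memory scan:

```typescript
// Slimmed-down audit record shape for this sketch.
interface AuditRow {
  timestamp: string; // ISO 8601, so lexical comparison matches chronological order
  claims: { citations: string[]; validation_status: string }[];
}

// "All responses since `sinceIso` that cited document X and had at least
// one unsupported claim." Document X is identified by the set of chunk IDs
// that belong to it.
function unsupportedCitingDocument(
  records: AuditRow[],
  chunkIdsForDocument: Set<string>,
  sinceIso: string
): AuditRow[] {
  return records.filter(r =>
    r.timestamp >= sinceIso &&
    r.claims.some(c => c.citations.some(id => chunkIdsForDocument.has(id))) &&
    r.claims.some(c => c.validation_status === "UNSUPPORTED")
  );
}
```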
Putting It Together: A Complete Accountable Retrieval Pass
Here is the full orchestration of an accountable RAG query, from user input to validated, auditable response.
async function accountableQuery(
  userQuery: string,
  userId: string
): Promise<AccountableResponse> {
  const requestId = crypto.randomUUID();
  const start = performance.now();

  // 1. Retrieve with re-ranking
  const chunks = await retrieveWithReranking(userQuery, 5, 50);

  // 2. Check retrieval confidence
  const avgRetrievalScore = chunks.reduce(
    (sum, c) => sum + c.confidence, 0
  ) / (chunks.length || 1);
  if (chunks.length === 0 || avgRetrievalScore < 0.5) {
    return buildAbstainResponse(requestId, userQuery,
      'Insufficient relevant sources found');
  }

  // 3. Generate with citation instructions
  const response = await generate({
    systemPrompt: CITATION_PROMPT,
    context: formatChunksForPrompt(chunks),
    query: userQuery,
    model: 'gpt-4o'
  });

  // 4. Parse claims and citations
  const claims = extractClaims(response.text);

  // 5. Validate every citation
  const validations = await validateCitations({ claims });

  // 6. Compute confidence
  const confidence = computeConfidence(
    avgRetrievalScore, claims, validations
  );

  // 7. Write audit record
  await auditLog.append({
    request_id: requestId,
    timestamp: new Date().toISOString(),
    user_query: userQuery,
    query_embedding_model: EMBEDDING_MODEL_ID, // your embedding model identifier
    retrieval_results: chunks.map(c => ({
      chunk_id: c.chunk_id,
      similarity_score: c.confidence,
      reranker_score: c.confidence, // track the two scores separately if your store keeps both
      was_used_in_generation: true
    })),
    generation_model: 'gpt-4o',
    generation_prompt_hash: hashPrompt(CITATION_PROMPT), // stable hash of the prompt template
    generated_response: response.text,
    claims: validations, // map into the AuditRecord claims shape as needed
    confidence_report: confidence,
    response_decision: confidence.decision,
    latency_ms: performance.now() - start,
    token_usage: response.usage
  });

  // 8. Return with transparency metadata
  return {
    text: response.text,
    citations: validations,
    confidence: confidence,
    decision: confidence.decision,
    request_id: requestId
  };
}
This adds latency. A standard RAG call takes 1 to 3 seconds. An accountable RAG call with re-ranking, citation extraction, and entailment validation takes 3 to 8 seconds. That latency is the cost of trust. For internal knowledge systems, compliance tools, and any application where accuracy has financial or legal consequences, it is a cost worth paying.
The question is never "can we make the AI faster?" The question is "can we afford to ship an answer we cannot verify?" In regulated environments, in customer-facing products, in any system where wrong answers carry consequences, the answer is no. Build the verification layer. Pay the latency cost. Sleep better.
What Accountability Costs and What It Buys
Building citation-backed RAG is more expensive than standard RAG. You pay for re-ranker inference, entailment validation, audit storage, and the engineering time to maintain the pipeline. The retrieval path is longer. The infrastructure is more complex. The monitoring surface area is larger.
What you get in return: every answer your system produces can be traced to specific sources. Every unsupported inference is flagged. Every response has a confidence score. Every interaction is auditable. Users learn to trust the system because the system earns trust through transparency, not through fluency.
Standard RAG systems degrade silently. When the knowledge base drifts, when embeddings become stale, when new documents contradict old ones, the system keeps generating confident-sounding answers. Accountable RAG systems degrade visibly. Confidence scores drop. Entailment failures spike. Abstention rates climb. These signals are not bugs. They are the system telling you it needs attention. That visibility is the entire point.