A pharmaceutical quality director asked us a question we couldn't answer with a standard RAG system:
"What is FDA currently focused on in drug manufacturing — and do we have procedures that address it?"
That question has two parts. The first is about an external corpus — 870+ FDA warning letters spanning 2019 to present. The second is about an internal corpus — the company's own SOPs, policies, and quality system documents.
A single-corpus RAG system can answer either question. It can't answer both in the same breath.
That limitation is what led us to build the Pharma Intelligence Copilot.
The Problem with One Corpus
Most document intelligence systems are built around a single knowledge base. The user asks a question, the system retrieves relevant chunks, the model generates an answer grounded in those chunks.
This works well when the problem is lookup — "What does our data integrity policy say about audit trail requirements?" — but it breaks down when the problem is comparative: "Are our procedures addressing what FDA is actually enforcing?"
That's the question every pharma quality team needs to answer before an inspection. And it requires reasoning across two distinct, independently maintained corpora:
- FDA enforcement data — what regulators have actually cited, in their own words, across hundreds of real enforcement actions
- Internal quality documents — what the company's own procedures say, at the section level, with full document context
Retrieving from each separately gives you two lists of text. What you need is analysis that connects them.
The Architecture
We built around a concept we're calling two-corpus RAG. The system maintains two independent Pinecone namespaces:
- fda-warning-letters — 8,388 chunks from 870 CDER/CBER warning letters (2019–present), each tagged with violation categories, company, issuing office, product type, and letter date
- internal-docs — chunks from the company's quality system documents, stored in Box, with section-level context preserved
For each of ten violation categories — data integrity, OOS investigations, lab controls, change control, CAPA, supplier qualification, documentation, validation, microbiology, and annual product review — the system runs parallel retrieval against both namespaces using category-optimized search queries.
The retrieved chunks from both corpora go into a single prompt that instructs Claude to produce a structured risk signal: enforcement frequency, trend direction, quoted enforcement language, document coverage assessment, and a specific review question for the quality team.
That's the core insight. The model isn't answering one question with one corpus. It's synthesizing evidence from two distinct bodies of text to produce a single comparative judgment.
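A minimal sketch of that assembly step. The type and function names here are hypothetical (the post doesn't show the actual code), but the shape follows the description above: evidence from both namespaces goes into one prompt, and the prompt handles the case where a corpus returned nothing.

```typescript
// Hypothetical chunk shape; the real system's types are not shown in the post.
interface Chunk {
  text: string;
  source: string;
}

// Assemble a single prompt from evidence retrieved from both namespaces.
function buildSignalPrompt(category: string, fda: Chunk[], internal: Chunk[]): string {
  const fdaBlock = fda.length
    ? fda.map((c) => c.text).join("\n---\n")
    : "(no enforcement chunks retrieved)";
  const docBlock = internal.length
    ? internal.map((c) => c.text).join("\n---\n")
    : "(no internal document chunks retrieved)";
  return [
    `Violation category: ${category}`,
    "FDA enforcement evidence:",
    fdaBlock,
    "Internal document evidence:",
    docBlock,
    "Produce a structured risk signal comparing the two bodies of evidence.",
  ].join("\n\n");
}
```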
Scraping and Categorizing 870 Warning Letters
The FDA warning letter corpus is the foundation. We built a scraper against FDA's DataTables AJAX endpoint (not a static file — FDA's listing page uses a Drupal DataTables interface that paginates via AJAX requests).
872 letters later, we ran each through a Claude Haiku categorization pass — five letters per API call, producing a JSON array of which of the ten violation categories appeared in each letter. At five letters per call, that's about 175 API calls to categorize the entire corpus.
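The batching itself is simple. A sketch (the function name is ours, not from the codebase):

```typescript
// Split the letter corpus into batches for the categorization calls.
// Five letters per call keeps each prompt small and the call count manageable.
function batchLetters<T>(letters: T[], size = 5): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < letters.length; i += size) {
    batches.push(letters.slice(i, i + size));
  }
  return batches;
}
```

With 872 letters, this yields 175 batches (174 full batches of five plus one of two), matching the call count above.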
The categorization metadata becomes Pinecone vector metadata, enabling filtered retrieval: "give me warning letter chunks specifically about data integrity" rather than just "give me the most semantically similar chunks to this query." Category filtering meaningfully improves retrieval precision for the risk scan.
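In Pinecone terms, that means attaching a metadata filter to the query. A sketch of the query options; the metadata field name `categories` is an assumption, since the post doesn't show the index schema:

```typescript
// Build Pinecone query options that restrict retrieval to chunks
// tagged with a given violation category.
// The "categories" field name is an assumption about the metadata schema.
function categoryQuery(vector: number[], category: string, topK = 8) {
  return {
    vector,
    topK,
    includeMetadata: true,
    filter: { categories: { $in: [category] } },
  };
}
```

The `$in` operator matches chunks whose category list contains the requested category, so a letter citing both data integrity and CAPA surfaces in either filtered scan.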
The chunk structure:
[LETTER-1] Company: Acme Pharma | Date: 2023-04-12 | Office: CDER | Type: Drug
FDA's review found that your firm failed to establish adequate procedures for
reviewing and approving changes to established specifications...
The Box Integration
The internal document corpus lives in Box — which is exactly where pharma quality teams keep their controlled documents. We built a server-to-server JWT connector using box-node-sdk that downloads and re-ingests documents on demand.
This matters for the demo story, but it also matters architecturally. The system doesn't require you to migrate your documents somewhere new. Your team continues working in Box. The AI layer connects to where the documents already live.
We also built a webhook endpoint: when a file changes in Box, the webhook triggers re-ingestion of that specific file. Existing chunks for the file are deleted and replaced. The internal corpus stays current without manual intervention.
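The decision logic at the webhook boundary can be sketched as a pure function. The payload shape below is a subset of what Box actually sends, and the trigger list is an assumption about which events the system listens for:

```typescript
// Minimal subset of a Box webhook event payload.
interface BoxWebhookEvent {
  trigger: string;
  source: { id: string; type: string };
}

// Decide whether an incoming event should trigger re-ingestion,
// and if so, which file to re-ingest. Triggers listed are an assumption.
function fileToReingest(event: BoxWebhookEvent): string | null {
  const reingestTriggers = ["FILE.UPLOADED", "FILE.RESTORED"];
  if (event.source.type !== "file") return null;
  return reingestTriggers.includes(event.trigger) ? event.source.id : null;
}
```

The handler that consumes this would delete the file's existing chunks from the internal-docs namespace, then re-chunk and upsert the fresh download.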
Section-level chunking preserves document context:
[Document: OOS Investigation Procedure QA-011 v2.3]
[Section: 5.2 Phase 2 Investigation]
When an OOS result cannot be attributed to laboratory error in Phase 1,
a Phase 2 investigation shall be initiated within 3 business days...
The section prefix means that when a chunk is retrieved, the model knows which document and which section it came from — not just what it says.
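The prefixing step is trivial but load-bearing. A sketch matching the format shown above (function name is ours):

```typescript
// Prepend document and section context to a chunk body before embedding,
// in the "[Document: ...] / [Section: ...]" format shown above.
function prefixChunk(doc: string, section: string, body: string): string {
  return `[Document: ${doc}]\n[Section: ${section}]\n${body}`;
}
```

Because the prefix is embedded along with the body, queries that mention a document or section name also get a semantic boost toward the right chunks.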
Ten Categories in Parallel
A full risk scan covers all ten violation categories. Running them sequentially at ~5 seconds per category (two retrievals plus one LLM call) would take nearly a minute, so we run them in batches of three.
The implementation is a server-sent events endpoint backed by an async generator. The generator yields events as each category completes:
data: {"type":"signal","data":{...}}
data: {"type":"progress","completed":3,"total":10,"currentCategory":"CAPA"}
data: {"type":"signal","data":{...}}
The UI renders signals as they arrive. You watch the scan fill in — each card appearing with its coverage assessment, enforcement citations, and review prompt. The streaming isn't cosmetic; it's architecturally necessary when the full scan takes 50+ seconds.
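A sketch of the generator and the SSE framing, under stated assumptions: the event shapes mirror the wire examples above, and `runCategory` stands in for the real retrieval-plus-LLM step, which the post doesn't show.

```typescript
// Event shapes mirroring the wire format shown above.
type ScanEvent =
  | { type: "signal"; data: unknown }
  | { type: "progress"; completed: number; total: number; currentCategory: string };

// Serialize one event in server-sent-events wire format.
function sseFrame(event: ScanEvent): string {
  return `data: ${JSON.stringify(event)}\n\n`;
}

// Run categories in batches of three, yielding events as results arrive.
// runCategory is a stand-in for the real two-retrievals-plus-LLM step.
async function* scanStream(
  categories: string[],
  runCategory: (c: string) => Promise<unknown>,
): AsyncGenerator<ScanEvent> {
  let completed = 0;
  for (let i = 0; i < categories.length; i += 3) {
    const batch = categories.slice(i, i + 3);
    const signals = await Promise.all(batch.map(runCategory));
    for (const data of signals) {
      completed += 1;
      yield { type: "signal", data };
    }
    yield {
      type: "progress",
      completed,
      total: categories.length,
      currentCategory: batch[batch.length - 1],
    };
  }
}
```

The HTTP handler just iterates the generator and writes each `sseFrame` to the response stream.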
The completed scan report is cached in Redis, which enables the follow-up chat feature. After the scan, the user can open any signal and ask questions. The chat endpoint loads the cached report, builds a context-aware system prompt that includes the specific signal's enforcement data and document evidence, and streams a response. The model knows exactly what was found for that category — it's not a generic chatbot.
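A sketch of the prompt-scoping step. The cached-signal shape and wording below are illustrative; the post only describes the approach, not the actual prompt:

```typescript
// Hypothetical shape of one cached signal from the scan report.
interface CachedSignal {
  category: string;
  enforcementEvidence: string;
  documentEvidence: string;
}

// Build a system prompt scoped to a single signal, so the chat is
// grounded in what the scan actually found for that category.
function chatSystemPrompt(signal: CachedSignal): string {
  return [
    "You are a research copilot. Discuss only the risk signal below.",
    `Category: ${signal.category}`,
    `FDA enforcement evidence:\n${signal.enforcementEvidence}`,
    `Internal document evidence:\n${signal.documentEvidence}`,
    "Frame outputs as research signals requiring human review, not compliance findings.",
  ].join("\n\n");
}
```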
The Copilot Posture
Pharma is a regulated industry where the wrong word matters. A system that says "your procedure has a gap" is making a compliance determination — something that requires human expertise, not an LLM.
We made a deliberate language choice throughout: the system produces research signals, not findings. It says "may warrant review," not "is deficient." Every output includes an explicit human review required marker.
This isn't just legal caution. It's correct framing for what the system actually is. The AI can identify that FDA has cited a specific type of audit trail failure in 14 recent warning letters and that your data integrity policy doesn't appear to address the cited language. That's a research signal. Whether it represents a gap in your quality system is a judgment that requires reading the full documents, understanding your manufacturing context, and applying regulatory expertise.
The system surfaces the signal. The human makes the call.
What We Learned
Two-corpus RAG is meaningfully harder than single-corpus RAG. The retrieval queries need to be designed to work against both corpora simultaneously. The prompt needs to handle the case where one corpus returns nothing useful for a category. The output structure needs to capture coverage as a spectrum — addressed, partial, unclear, not-found — rather than a binary.
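The coverage spectrum and the missing-corpus fallback can be sketched together. The fallback rules here are our illustration of the idea, not the system's actual logic:

```typescript
// Coverage as a four-level spectrum rather than a boolean.
type Coverage = "addressed" | "partial" | "unclear" | "not-found";

// Fallback when one corpus returns nothing useful for a category.
// These rules are illustrative, not the system's actual logic.
function coverageFallback(hasFdaEvidence: boolean, hasDocEvidence: boolean): Coverage | null {
  if (hasFdaEvidence && hasDocEvidence) return null; // enough evidence; let the model judge
  if (hasFdaEvidence && !hasDocEvidence) return "not-found"; // enforcement exists, no matching procedures
  return "unclear"; // no enforcement context to compare against
}
```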
The Box integration was straightforward technically but important for demo credibility. Walking into a pharma meeting with a system that reads from a competitor's cloud storage provider is very different from walking in with one that connects to Box, where the quality team already works.
The hardest part was the copilot language. It took several iterations to find framing that's genuinely useful — specific enough to surface real research signals — without crossing into compliance assessment. The line is: report what FDA has said, report what the document says, describe how they relate. Stop before telling the user what to do about it.
The Pharma Intelligence Copilot is live. Try the enforcement trends Q&A or run a full risk scan against Meridian Biosciences' quality system. If you're in pharma and want to talk about building something similar for your organization, we'd like to hear about it.