Product Overview

Agentic RAG System for FDA Drug Label Data

What it is: A system that lets you ask any question about drug labels in natural language and get answers traceable to FDA-approved text.

Not a curated database. Not a prediction model. A retrieval system over 54,483 authoritative FDA labels.

54,483 Drug Labels · 60 Data Fields · 78.5% Dedup Rate · ~62m Pipeline Build

Data Foundation

Data source: FDA Structured Product Labeling (SPL) via OpenFDA — the legally required text that appears on drug packaging. Unlike curated databases, MedCopilot retrieves from the authoritative source directly.
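
The raw shards are plain JSON archives, so ingest needs nothing beyond the Python standard library. A minimal sketch of pulling one shard, assuming a bulk-download URL of the form OpenFDA publishes in its download manifest (the exact file name here is illustrative):

```python
import io
import json
import urllib.request
import zipfile

# Illustrative shard URL; the authoritative list of drug-label shards is
# published in OpenFDA's download manifest (api.fda.gov/download.json).
SHARD_URL = "https://download.open.fda.gov/drug/label/drug-label-0001-of-0013.json.zip"

with urllib.request.urlopen(SHARD_URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

# Each shard is one JSON file with a "results" array of SPL documents.
with archive.open(archive.namelist()[0]) as f:
    shard = json.load(f)

for label in shard["results"][:3]:
    print(label.get("openfda", {}).get("brand_name"), label.get("effective_time"))
```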

Data Pipeline

253,426 Raw: 13 JSON shards from OpenFDA
→ 211,821 Cleaned: rows with ≥10 populated fields
→ 54,483 Deduplicated: SimHash removes repackagers and revisions
Deduplication detail: Two-pass SimHash clustering — first by indications_and_usage (99.5% coverage, 32% distinct), then by dosage_and_administration (99.2% coverage, 39% distinct). Uses LSH bands + Union-Find for scalable clustering.
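
A compact sketch of that machinery, assuming 64-bit fingerprints split into 4 LSH bands of 16 bits (the band width, tokenizer, and omitted Hamming-distance verification are all simplifications):

```python
import hashlib
from collections import defaultdict

BITS, BANDS = 64, 4  # 4 bands x 16 bits: near-duplicates collide in >= 1 band

def simhash(text: str) -> int:
    # Weighted bit vote over token hashes; similar texts end up with
    # fingerprints at a small Hamming distance.
    votes = [0] * BITS
    for tok in text.lower().split():
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def find(parent: list[int], x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def dedup_clusters(texts: list[str]) -> list[int]:
    fps = [simhash(t) for t in texts]
    parent = list(range(len(texts)))
    width = BITS // BANDS
    buckets: dict[tuple[int, int], list[int]] = defaultdict(list)
    for idx, fp in enumerate(fps):
        for b in range(BANDS):
            buckets[(b, (fp >> (b * width)) & ((1 << width) - 1))].append(idx)
    # Union-Find over band collisions; a production pass would verify the
    # Hamming distance between fingerprints before merging.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(parent, other)] = find(parent, members[0])
    return [find(parent, i) for i in range(len(texts))]  # cluster id per row
```

Running this first over indications_and_usage, then over dosage_and_administration, mirrors the two-pass scheme described above.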

The 5-Step Agentic Pipeline

1. Interpret: extract keywords, normalize drug names
2. Retrieve: Scout → Plan → Execute SQL FTS
3. Summarize: LLM compresses long fields
4. Answer: persona-aware + citations
5. Follow-up: suggest related questions
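
The five steps compose as a linear chain. A structural sketch with each step injected as a callable (these function signatures are illustrative, not the project's actual module API):

```python
from typing import Callable

def run_pipeline(
    question: str,
    persona: str,
    interpret: Callable[[str], list[str]],              # step 1
    retrieve: Callable[[list[str]], list[dict]],        # step 2
    summarize: Callable[[list[dict]], list[dict]],      # step 3
    answer: Callable[[str, list[dict], str], tuple],    # step 4
    follow_up: Callable[[str, list[dict]], list[str]],  # step 5
) -> dict:
    keywords = interpret(question)   # keywords + normalized drug names
    rows = retrieve(keywords)        # Scout -> Plan -> Execute SQL FTS
    rows = summarize(rows)           # compress fields that exceed token budgets
    text, citations = answer(question, rows, persona)
    return {
        "answer": text,
        "citations": citations,      # [ROW ID, field_name] pairs
        "follow_ups": follow_up(question, rows),
    }
```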

Self-Correction System

If the LLM cannot answer from the initial results, the system automatically retries with a broader scope

Attempt 1 (PREFERRED): 15 core clinical fields; LLM-planned AND/OR query
Retry 1 (EXTENDED): 18 fields (+ pharmacokinetics); forced OR (broad)
Retry 2 (ALL): all 60 text columns (full label); forced OR + Researcher persona
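
A minimal sketch of the escalation loop, assuming a tier-aware search callable and an LLM answerability check (the tier metadata mirrors the table above; everything else is illustrative):

```python
from typing import Callable

TIERS = [
    ("tsv_preferred", "planned"),  # Attempt 1: 15 fields, LLM-planned AND/OR
    ("tsv_extended", "or"),        # Retry 1: 18 fields, forced OR
    ("tsv_all", "or"),             # Retry 2: 60 columns, forced OR + Researcher
]

def retrieve_with_retries(
    keywords: list[str],
    search: Callable[[str, list[str], str], list[dict]],  # (index, keywords, op)
    can_answer: Callable[[list[dict]], bool],             # LLM judges the rows
) -> tuple[list[dict], int]:
    rows: list[dict] = []
    for attempt, (index, operator) in enumerate(TIERS):
        rows = search(index, keywords, operator)
        if rows and can_answer(rows):
            return rows, attempt
    return rows, len(TIERS) - 1  # fall through: surface the broadest results
```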

How a Query Flows Through the Pipeline

"atorvastatin grapefruit interaction"

1. Interpret: extract atorvastatin and grapefruit; intent: drug-food interaction.
2. Scout: count how many labels contain each term.
3. Plan: construct the tsquery 'atorvastatin' & 'grapefruit'.
4. Execute: FTS returns matching labels ranked by ts_rank_cd() (PostgreSQL) or BM25 (SQLite); see the SQL sketch after this list.
5. Summarize: long fields (drug_interactions, warnings) are LLM-summarized if they exceed token limits.
6. Answer: LLM generates a response using the retrieved data, with citations in the format [ROW ID, field_name].
7. Follow-ups: LLM suggests related questions based on context.
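
On the PostgreSQL backend, the Execute step for this query reduces to ordinary FTS SQL. A sketch, with labels and tsv_preferred standing in for the actual table and column names:

```python
import psycopg2  # assumption: any DB-API driver works; psycopg2 shown here

SQL = """
SELECT id, ts_rank_cd(tsv_preferred, q) AS rank
FROM labels, to_tsquery('english', %s) AS q
WHERE tsv_preferred @@ q
ORDER BY rank DESC
LIMIT 20;
"""

conn = psycopg2.connect("dbname=medcopilot")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute(SQL, ("atorvastatin & grapefruit",))
    for row_id, rank in cur.fetchall():
        print(row_id, round(rank, 4))
```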

Design Approach

🔍

SQL Full-Text Search

Every query translates to a visible tsquery string. The UI trace shows keywords, hit counts, retries, and timing — no hidden embeddings or unexplainable ML ranking.

📝

Summarization, Not Truncation

Long drug-label fields are compressed using LLM summarization rather than truncation, which preserves information that would otherwise be lost by cutting mid-sentence.
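
A sketch of that gate, with the token counter, the LLM call, and the 800-token budget all as assumptions:

```python
from typing import Callable

def compress_field(
    text: str,
    count_tokens: Callable[[str], int],   # e.g. a tiktoken encoder (assumption)
    summarize_llm: Callable[[str], str],  # LLM call with a compression prompt
    budget: int = 800,                    # illustrative per-field token budget
) -> str:
    # Short fields pass through verbatim; only oversized fields are
    # compressed, so nothing is cut mid-sentence.
    if count_tokens(text) <= budget:
        return text
    return summarize_llm(
        "Compress this drug-label field, preserving every clinically "
        "relevant fact:\n\n" + text
    )
```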

🎯

Scout → Plan → Execute

Before searching, count hits per keyword. The LLM then constructs a boolean query informed by these term frequencies, avoiding both over-restrictive and over-broad queries.
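
The Scout step itself is a cheap per-term count run before any ranking (table and column names assumed as before):

```python
SCOUT_SQL = (
    "SELECT count(*) FROM labels "
    "WHERE tsv_preferred @@ to_tsquery('english', %s);"
)

def scout(cur, keywords: list[str]) -> dict[str, int]:
    # Per-keyword hit counts give the planner term frequencies before it
    # commits to AND vs OR in the final tsquery.
    counts: dict[str, int] = {}
    for kw in keywords:
        cur.execute(SCOUT_SQL, (kw,))
        counts[kw] = cur.fetchone()[0]
    return counts
```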

📊

Three-Tier Indexing

Pre-computed tsvector indexes: tsv_preferred (15 fields), tsv_extended (18), tsv_all (60). Enables precision/recall trade-off per query.
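
Each tier is a precomputed tsvector column backed by a GIN index. A DDL sketch for the preferred tier, with the field list abbreviated and names assumed as before:

```python
# Generated tsvector column plus GIN index for the attempt-1 tier; the
# extended and all tiers repeat this pattern over 18 and 60 columns.
DDL = """
ALTER TABLE labels ADD COLUMN tsv_preferred tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english',
      coalesce(indications_and_usage, '') || ' ' ||
      coalesce(dosage_and_administration, '') || ' ' ||
      coalesce(drug_interactions, ''))  -- remaining core fields elided
  ) STORED;

CREATE INDEX idx_tsv_preferred ON labels USING gin (tsv_preferred);
"""
```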

Persona-Aware Responses

The same question gets different responses depending on who's asking

👩‍⚕️ Clinician / Pharmacist: clinical terminology; direct and professional, with dosing details. Use case: point-of-care reference.

🎓 Expert / Specialist: technical register; comprehensive, preserving complexity. Use case: complex clinical scenarios.

🔬 Analyst / Researcher: exhaustive; all variations reported, raw data preserved. Use case: regulatory analysis, systematic reviews.

🏠 Patient / Consumer: plain language; clear explanations of what your label says. Use case: understanding prescription labels.
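
Persona handling can be as simple as a style directive merged into the answer prompt. An illustrative sketch (the directive strings are placeholders, not the system's actual prompts):

```python
PERSONA_STYLE = {
    "clinician":  "Use clinical terminology. Be direct; include dosing details.",
    "expert":     "Be comprehensive and technical; preserve complexity.",
    "researcher": "Be exhaustive; report all variations and keep raw values.",
    "patient":    "Use plain language; explain the label clearly.",
}

def build_system_prompt(persona: str) -> str:
    base = ("Answer only from the retrieved FDA label rows and cite every "
            "claim as [ROW ID, field_name].")
    return base + " " + PERSONA_STYLE[persona]
```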

Value Proposition

For Healthcare Organizations

  • Authoritative source: Answers grounded exclusively in FDA-approved label data
  • Auditable: Every answer includes [ROW ID, field_name] citations; every query is a visible SQL tsquery
  • Self-correcting: System automatically retries with broader scope rather than silently failing

For Technical Teams

  • Simple infrastructure: SQL only (PostgreSQL or SQLite) — no vector databases, no embedding model serving
  • Debuggable: Complete pipeline trace showing keywords, retrieval attempts, summarizations, timing
  • Maintainable: Clean separation of concerns (interpretation, retrieval, summarization, generation)

For End Users

  • Ask questions, don't navigate menus: "amoxicillin pediatric dosing" — no dropdowns, no category drilling
  • Persona-appropriate responses: Same question → different detail for clinician, expert, researcher, or patient
  • Full label scope: Not limited to interactions — access all 60 text columns via natural language

Technical Architecture

System Components

Database: PostgreSQL / SQLite
Search: ts_rank_cd / BM25
LLM: Azure OpenAI
UI: Streamlit
Data format: Parquet → SQL

Model Configuration

Models are configurable per task:

  • Lightweight: interpret, plan, follow-ups
  • Summarization: optimized for compression
  • Answer generation: higher-capability models
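
In configuration terms this is a small routing table. A sketch with placeholder Azure OpenAI deployment names (not the project's actual settings):

```python
# Placeholder routing: cheap models for high-volume steps, a
# higher-capability model for final answer generation.
MODEL_CONFIG = {
    "interpret":  {"deployment": "gpt-4o-mini", "temperature": 0.0},
    "plan":       {"deployment": "gpt-4o-mini", "temperature": 0.0},
    "follow_ups": {"deployment": "gpt-4o-mini", "temperature": 0.7},
    "summarize":  {"deployment": "gpt-4o-mini", "temperature": 0.2},
    "answer":     {"deployment": "gpt-4o",      "temperature": 0.2},
}
```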

System Capabilities

54K Labels · 5 Pipeline Steps · 3 Retry Tiers · 4 Personas · 15 Prompts · 3 Index Tiers

Features

  • Complete data pipeline (JSON → Parquet → PostgreSQL)
  • Two-pass SimHash deduplication
  • Three-tier tsvector indexing
  • 5-step agentic RAG pipeline
  • 3-tier self-correction system
  • Persona-aware prompting
  • Conversation history support
  • Streamlit UI with debug panel

What MedCopilot Is NOT

Understanding the boundaries clarifies where MedCopilot fits

Not a predictor

DDI-GPT predicts undocumented interactions; Decagon predicts polypharmacy effects. MedCopilot retrieves documented information from FDA labels.

Not a drug knowledgebase

DrugBank curates drug targets, protein interactions, chemical structures, and clinical trials. MedCopilot focuses on FDA label text retrieval.

Not a curated monograph system

Lexidrug and Micromedex have human-written summaries. MedCopilot generates answers from authoritative source text.

Not a clinical decision engine

MedWise scores polypharmacy risk. MedCopilot answers questions about what's in drug labels.

See Market Landscape for detailed comparison.