Product Overview

Agentic RAG System for FDA Drug Label Data

What it is: A system that lets you ask any question about drug labels in natural language and get answers traceable to FDA-approved text.

Not a curated database. Not a prediction model. A retrieval system over 54,483 authoritative FDA labels.

54,483 Drug Labels · 60 Data Fields · 78.5% Dedup Rate · ~62m Pipeline Build

Data Foundation

Data source: FDA Structured Product Labeling (SPL) via OpenFDA — the legally required text that appears on drug packaging. Unlike curated databases, MedCopilot retrieves from the authoritative source directly.
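
The raw shards are plain JSON archives, so ingest needs nothing beyond the Python standard library. A minimal sketch of pulling one shard, assuming a bulk-download URL of the form OpenFDA publishes in its download manifest (the exact file name here is illustrative):

```python
import io
import json
import urllib.request
import zipfile

# Illustrative shard URL; the authoritative list of drug-label shards is
# published in OpenFDA's download manifest (api.fda.gov/download.json).
SHARD_URL = "https://download.open.fda.gov/drug/label/drug-label-0001-of-0013.json.zip"

with urllib.request.urlopen(SHARD_URL) as resp:
    archive = zipfile.ZipFile(io.BytesIO(resp.read()))

# Each shard is one JSON file with a "results" array of SPL documents.
with archive.open(archive.namelist()[0]) as f:
    shard = json.load(f)

for label in shard["results"][:3]:
    print(label.get("openfda", {}).get("brand_name"), label.get("effective_time"))
```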

Data Pipeline

253,426 Raw: 13 JSON shards from OpenFDA
→ 211,821 Cleaned: rows with ≥10 populated fields
→ 54,483 Deduplicated: SimHash removes repackagers and revisions
Deduplication detail: Two-pass SimHash clustering — first by indications_and_usage (99.5% coverage, 32% distinct), then by dosage_and_administration (99.2% coverage, 39% distinct). Uses LSH bands + Union-Find for scalable clustering.
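
A compact sketch of that machinery, assuming 64-bit fingerprints split into 4 LSH bands of 16 bits (the band width, tokenizer, and omitted Hamming-distance verification are all simplifications):

```python
import hashlib
from collections import defaultdict

BITS, BANDS = 64, 4  # 4 bands x 16 bits: near-duplicates collide in >= 1 band

def simhash(text: str) -> int:
    # Weighted bit vote over token hashes; similar texts end up with
    # fingerprints at a small Hamming distance.
    votes = [0] * BITS
    for tok in text.lower().split():
        h = int.from_bytes(hashlib.blake2b(tok.encode(), digest_size=8).digest(), "big")
        for i in range(BITS):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i, v in enumerate(votes) if v > 0)

def find(parent: list[int], x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def dedup_clusters(texts: list[str]) -> list[int]:
    fps = [simhash(t) for t in texts]
    parent = list(range(len(texts)))
    width = BITS // BANDS
    buckets: dict[tuple[int, int], list[int]] = defaultdict(list)
    for idx, fp in enumerate(fps):
        for b in range(BANDS):
            buckets[(b, (fp >> (b * width)) & ((1 << width) - 1))].append(idx)
    # Union-Find over band collisions; a production pass would verify the
    # Hamming distance between fingerprints before merging.
    for members in buckets.values():
        for other in members[1:]:
            parent[find(parent, other)] = find(parent, members[0])
    return [find(parent, i) for i in range(len(texts))]  # cluster id per row
```

Running this first over indications_and_usage, then over dosage_and_administration, mirrors the two-pass scheme described above.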

The 5-Step Agentic Pipeline

1. Interpret: extract keywords, normalize drug names
2. Retrieve: Scout → Plan → Execute SQL FTS
3. Summarize: LLM compresses long fields
4. Answer: persona-aware + citations
5. Follow-up: suggest related questions
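
The five steps compose as a linear chain. A structural sketch with each step injected as a callable (these function signatures are illustrative, not the project's actual module API):

```python
from typing import Callable

def run_pipeline(
    question: str,
    persona: str,
    interpret: Callable[[str], list[str]],              # step 1
    retrieve: Callable[[list[str]], list[dict]],        # step 2
    summarize: Callable[[list[dict]], list[dict]],      # step 3
    answer: Callable[[str, list[dict], str], tuple],    # step 4
    follow_up: Callable[[str, list[dict]], list[str]],  # step 5
) -> dict:
    keywords = interpret(question)   # keywords + normalized drug names
    rows = retrieve(keywords)        # Scout -> Plan -> Execute SQL FTS
    rows = summarize(rows)           # compress fields that exceed token budgets
    text, citations = answer(question, rows, persona)
    return {
        "answer": text,
        "citations": citations,      # [ROW ID, field_name] pairs
        "follow_ups": follow_up(question, rows),
    }
```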

Self-Correction System

If the LLM cannot answer from the initial results, the system automatically retries with a broader scope

Attempt 1 (PREFERRED): 15 core clinical fields; LLM-planned AND/OR query
Retry 1 (EXTENDED): 18 fields (+ pharmacokinetics); forced OR (broad)
Retry 2 (ALL): all 60 text columns (full label); forced OR + Researcher persona
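
A minimal sketch of the escalation loop, assuming a tier-aware search callable and an LLM answerability check (the tier metadata mirrors the table above; everything else is illustrative):

```python
from typing import Callable

TIERS = [
    ("tsv_preferred", "planned"),  # Attempt 1: 15 fields, LLM-planned AND/OR
    ("tsv_extended", "or"),        # Retry 1: 18 fields, forced OR
    ("tsv_all", "or"),             # Retry 2: 60 columns, forced OR + Researcher
]

def retrieve_with_retries(
    keywords: list[str],
    search: Callable[[str, list[str], str], list[dict]],  # (index, keywords, op)
    can_answer: Callable[[list[dict]], bool],             # LLM judges the rows
) -> tuple[list[dict], int]:
    rows: list[dict] = []
    for attempt, (index, operator) in enumerate(TIERS):
        rows = search(index, keywords, operator)
        if rows and can_answer(rows):
            return rows, attempt
    return rows, len(TIERS) - 1  # fall through: surface the broadest results
```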

How a Query Flows Through the Pipeline

"atorvastatin grapefruit interaction"

1. Interpret: extract atorvastatin and grapefruit; intent: drug-food interaction.
2. Scout: count how many labels contain each term.
3. Plan: construct the tsquery 'atorvastatin' & 'grapefruit'.
4. Execute: FTS returns matching labels ranked by ts_rank_cd() (PostgreSQL) or BM25 (SQLite); see the SQL sketch after this list.
5. Summarize: long fields (drug_interactions, warnings) are LLM-summarized if they exceed token limits.
6. Answer: LLM generates a response using the retrieved data, with citations in the format [ROW ID, field_name].
7. Follow-ups: LLM suggests related questions based on context.
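
On the PostgreSQL backend, the Execute step for this query reduces to ordinary FTS SQL. A sketch, with labels and tsv_preferred standing in for the actual table and column names:

```python
import psycopg2  # assumption: any DB-API driver works; psycopg2 shown here

SQL = """
SELECT id, ts_rank_cd(tsv_preferred, q) AS rank
FROM labels, to_tsquery('english', %s) AS q
WHERE tsv_preferred @@ q
ORDER BY rank DESC
LIMIT 20;
"""

conn = psycopg2.connect("dbname=medcopilot")  # illustrative DSN
with conn, conn.cursor() as cur:
    cur.execute(SQL, ("atorvastatin & grapefruit",))
    for row_id, rank in cur.fetchall():
        print(row_id, round(rank, 4))
```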

Design Approach

🔍

SQL Full-Text Search

Every query translates to a visible tsquery string. The UI trace shows keywords, hit counts, retries, and timing — no hidden embeddings or unexplainable ML ranking.

📝

Summarization, Not Truncation

Long drug-label fields are compressed using LLM summarization rather than truncation, which preserves information that would otherwise be lost by cutting mid-sentence.
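
A sketch of that gate, with the token counter, the LLM call, and the 800-token budget all as assumptions:

```python
from typing import Callable

def compress_field(
    text: str,
    count_tokens: Callable[[str], int],   # e.g. a tiktoken encoder (assumption)
    summarize_llm: Callable[[str], str],  # LLM call with a compression prompt
    budget: int = 800,                    # illustrative per-field token budget
) -> str:
    # Short fields pass through verbatim; only oversized fields are
    # compressed, so nothing is cut mid-sentence.
    if count_tokens(text) <= budget:
        return text
    return summarize_llm(
        "Compress this drug-label field, preserving every clinically "
        "relevant fact:\n\n" + text
    )
```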

🎯

Scout → Plan → Execute

Before searching, count hits per keyword. The LLM then constructs a boolean query informed by these term frequencies, avoiding both over-restrictive and over-broad queries.
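
The Scout step itself is a cheap per-term count run before any ranking (table and column names assumed as before):

```python
SCOUT_SQL = (
    "SELECT count(*) FROM labels "
    "WHERE tsv_preferred @@ to_tsquery('english', %s);"
)

def scout(cur, keywords: list[str]) -> dict[str, int]:
    # Per-keyword hit counts give the planner term frequencies before it
    # commits to AND vs OR in the final tsquery.
    counts: dict[str, int] = {}
    for kw in keywords:
        cur.execute(SCOUT_SQL, (kw,))
        counts[kw] = cur.fetchone()[0]
    return counts
```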

📊

Three-Tier Indexing

Pre-computed tsvector indexes: tsv_preferred (15 fields), tsv_extended (18), tsv_all (60). Enables precision/recall trade-off per query.
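
Each tier is a precomputed tsvector column backed by a GIN index. A DDL sketch for the preferred tier, with the field list abbreviated and names assumed as before:

```python
# Generated tsvector column plus GIN index for the attempt-1 tier; the
# extended and all tiers repeat this pattern over 18 and 60 columns.
DDL = """
ALTER TABLE labels ADD COLUMN tsv_preferred tsvector
  GENERATED ALWAYS AS (
    to_tsvector('english',
      coalesce(indications_and_usage, '') || ' ' ||
      coalesce(dosage_and_administration, '') || ' ' ||
      coalesce(drug_interactions, ''))  -- remaining core fields elided
  ) STORED;

CREATE INDEX idx_tsv_preferred ON labels USING gin (tsv_preferred);
"""
```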

Persona-Aware Responses

The same question gets different responses depending on who's asking

👩‍⚕️ Clinician / Pharmacist: clinical terminology; direct and professional, with dosing details. Use case: point-of-care reference.

🎓 Expert / Specialist: technical register; comprehensive, preserving complexity. Use case: complex clinical scenarios.

🔬 Analyst / Researcher: exhaustive; all variations reported, raw data preserved. Use case: regulatory analysis, systematic reviews.

🏠 Patient / Consumer: plain language; clear explanations of what your label says. Use case: understanding prescription labels.
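
Persona handling can be as simple as a style directive merged into the answer prompt. An illustrative sketch (the directive strings are placeholders, not the system's actual prompts):

```python
PERSONA_STYLE = {
    "clinician":  "Use clinical terminology. Be direct; include dosing details.",
    "expert":     "Be comprehensive and technical; preserve complexity.",
    "researcher": "Be exhaustive; report all variations and keep raw values.",
    "patient":    "Use plain language; explain the label clearly.",
}

def build_system_prompt(persona: str) -> str:
    base = ("Answer only from the retrieved FDA label rows and cite every "
            "claim as [ROW ID, field_name].")
    return base + " " + PERSONA_STYLE[persona]
```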

Value Proposition

For Healthcare Organizations

  • Authoritative source: Answers grounded exclusively in FDA-approved label data
  • Auditable: Every answer includes [ROW ID, field_name] citations; every query is a visible SQL tsquery
  • Self-correcting: System automatically retries with broader scope rather than silently failing

For Technical Teams

  • Simple infrastructure: SQL only (PostgreSQL or SQLite) — no vector databases, no embedding model serving
  • Debuggable: Complete pipeline trace showing keywords, retrieval attempts, summarizations, timing
  • Maintainable: Clean separation of concerns (interpretation, retrieval, summarization, generation)

For End Users

  • Ask questions, don't navigate menus: "amoxicillin pediatric dosing" — no dropdowns, no category drilling
  • Persona-appropriate responses: Same question → different detail for clinician, expert, researcher, or patient
  • Full label scope: Not limited to interactions — access all 60 text columns via natural language

Technical Architecture

System Components

Database: PostgreSQL / SQLite
Search: ts_rank_cd / BM25
LLM: Azure OpenAI
UI: Streamlit
Data format: Parquet → SQL

Model Configuration

Models are configurable per task:

  • Lightweight: interpret, plan, follow-ups
  • Summarization: optimized for compression
  • Answer generation: higher-capability models
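
In configuration terms this is a small routing table. A sketch with placeholder Azure OpenAI deployment names (not the project's actual settings):

```python
# Placeholder routing: cheap models for high-volume steps, a
# higher-capability model for final answer generation.
MODEL_CONFIG = {
    "interpret":  {"deployment": "gpt-4o-mini", "temperature": 0.0},
    "plan":       {"deployment": "gpt-4o-mini", "temperature": 0.0},
    "follow_ups": {"deployment": "gpt-4o-mini", "temperature": 0.7},
    "summarize":  {"deployment": "gpt-4o-mini", "temperature": 0.2},
    "answer":     {"deployment": "gpt-4o",      "temperature": 0.2},
}
```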

System Capabilities

54K Labels · 5 Pipeline Steps · 3 Retry Tiers · 4 Personas · 15 Prompts · 3 Index Tiers

Features

  • Complete data pipeline (JSON → Parquet → PostgreSQL)
  • Two-pass SimHash deduplication
  • Three-tier tsvector indexing
  • 5-step agentic RAG pipeline
  • 3-tier self-correction system
  • Persona-aware prompting
  • Conversation history support
  • Streamlit UI with debug panel

What MedCopilot Is NOT

Understanding the boundaries clarifies where MedCopilot fits

Not a predictor

DDI-GPT predicts undocumented interactions; Decagon predicts polypharmacy effects. MedCopilot retrieves documented information from FDA labels.

Not a drug knowledgebase

DrugBank curates drug targets, protein interactions, chemical structures, and clinical trials. MedCopilot focuses on FDA label text retrieval.

Not a curated monograph system

Lexidrug and Micromedex have human-written summaries. MedCopilot generates answers from authoritative source text.

Not a clinical decision engine

MedWise scores polypharmacy risk. MedCopilot answers questions about what's in drug labels.

See Market Landscape for detailed comparison.