How to Parse FDA Structured Product Labeling (SPL) for Drug Interactions
A developer guide to extracting drug interaction data from FDA Structured Product Labeling. Covers the SPL format, querying openFDA for label data, which label fields contain interaction information, parsing challenges with unstructured text, and how AI extraction turns label prose into structured interaction data.
What is FDA Structured Product Labeling
Structured Product Labeling (SPL) is the document format the FDA requires for submitting drug labeling information electronically. Every prescription drug, over-the-counter medication, and biological product approved in the United States has an SPL document that contains the official product labeling, commonly known as the package insert or prescribing information.
SPL documents are XML files that use the HL7 Clinical Document Architecture (CDA) standard. Each document is identified by a unique spl_set_id that remains stable across versions as the label is updated, and each version has its own spl_id and effective_time timestamp. This versioning is important for drug interaction work because labels are updated when new interactions are discovered, dosing recommendations change, or safety signals emerge from post-market surveillance.
The authoritative repository for SPL documents is DailyMed, maintained by the National Library of Medicine. DailyMed provides both web-based access and bulk download of all SPL documents. The openFDA Drug Label API provides a more developer-friendly JSON-based interface to the same underlying data, with full-text search capabilities and a REST API design that is easier to integrate than raw XML processing.
Label sections that contain drug interaction information
FDA prescription drug labeling follows a standardized section structure defined by the Physician Labeling Rule (PLR). Drug interaction information can appear in several sections, and a thorough extraction pipeline must check all of them. Relying on a single section will miss interactions documented elsewhere in the label.
Section 7, titled Drug Interactions, is the primary location for drug interaction information. This section is specifically designated for describing clinically significant interactions with other drugs, foods, and laboratory tests. It typically includes the interacting substance, the mechanism of interaction, the clinical effect, and management recommendations. In openFDA data, this corresponds to the drug_interactions field.
The Warnings and Precautions section (Section 5) often contains interaction-related information framed as clinical warnings. For example, a label might warn about increased bleeding risk when the drug is combined with anticoagulants, without repeating the full interaction detail from Section 7. In openFDA, PLR-format labels expose this section as the warnings_and_cautions field, while older-format and over-the-counter labels use the warnings field instead, so check both.
The Contraindications section (Section 4) lists conditions and co-administered drugs that absolutely prohibit use of the medication. When a drug combination is contraindicated, it typically appears here with a brief statement and may or may not be repeated in Section 7 with additional detail. The openFDA field is contraindications.
The Boxed Warning section, when present, contains the most serious safety information including life-threatening drug interactions. This is the boldly bordered warning at the top of the label, reserved for interactions and risks of the highest severity. The openFDA field is boxed_warning.
Some labels also include a drug_interactions_table field that contains tabular interaction data when the manufacturer has formatted interactions as a table rather than prose. This structured format is easier to parse but is used by only a subset of labels.
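As a rough sketch of checking every relevant section rather than just one, the function below gathers text from each interaction-bearing field of a single openFDA label record. The field list follows the sections described above, with warnings_and_cautions included alongside warnings on the assumption that a label may carry either; the function name and the sample record are made up for illustration.

```python
from typing import Dict, List

# Fields that may carry interaction information, roughly most severe first.
INTERACTION_FIELDS: List[str] = [
    "boxed_warning",
    "contraindications",
    "warnings",
    "warnings_and_cautions",
    "drug_interactions",
    "drug_interactions_table",
]


def collect_interaction_text(label: Dict) -> Dict[str, str]:
    """Return {field: joined text} for every interaction-relevant field present."""
    sections: Dict[str, str] = {}
    for field in INTERACTION_FIELDS:
        value = label.get(field)
        if not value:
            continue  # the label has no such section
        # openFDA returns each section as a list of strings.
        parts = value if isinstance(value, list) else [value]
        sections[field] = "\n".join(str(part) for part in parts)
    return sections


# Made-up record shaped like one entry of the openFDA "results" array.
sample_label = {
    "contraindications": ["Example: do not use with Drug Y."],
    "drug_interactions": ["Example: Drug Z may raise exposure.", "Monitor closely."],
    "description": ["Not interaction-related."],
}
sections = collect_interaction_text(sample_label)
print(list(sections))
```

Keeping the field name attached to each chunk of text lets later stages weight a boxed warning differently from a routine Section 7 entry.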
Querying openFDA for label data
The openFDA Drug Label API at https://api.fda.gov/drug/label.json provides the most practical access to SPL data for interaction extraction. The API supports search queries across all label fields and returns results as JSON with the full label text organized by section.
The most reliable query strategy for a specific drug is to search by RxCUI identifier. First resolve the drug name to an RxCUI using the RxNorm API, then query openFDA with: search=openfda.rxcui:{rxcui}&sort=effective_time:desc&limit=1. This returns the most recent label version for the drug, sorted by effective time to ensure you get the current label rather than an older version.
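A minimal sketch of building that query URL in Python, using only the standard library. The function name is hypothetical and the RxCUI shown is just an example value; in practice it would come from the RxNorm lookup. No request is sent here, the code only assembles the URL.

```python
from urllib.parse import urlencode

OPENFDA_LABEL_URL = "https://api.fda.gov/drug/label.json"


def build_rxcui_query(rxcui: str) -> str:
    """Build the URL for the most recent label matching one RxCUI."""
    params = {
        "search": f"openfda.rxcui:{rxcui}",
        "sort": "effective_time:desc",  # newest label version first
        "limit": 1,
    }
    # safe=":" keeps the field:value colons readable instead of %3A.
    return f"{OPENFDA_LABEL_URL}?{urlencode(params, safe=':')}"


print(build_rxcui_query("6809"))  # example RxCUI value
```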
If RxCUI search returns no results, fall back to searching by other identifiers in this order: openfda.product_ndc for NDC code lookups, openfda.brand_name for brand name searches (use double quotes for exact matching), and openfda.generic_name for generic name searches. Each fallback level is less precise, so validate that the returned label matches the intended drug.
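The fallback order can be sketched as a function that turns whatever identifiers you have into an ordered list of search expressions to try in turn. The function and parameter names are hypothetical, and quoting every value (not only brand names) is an assumption made here for simplicity.

```python
from typing import List, Optional


def fallback_searches(
    *,
    rxcui: Optional[str] = None,
    ndc: Optional[str] = None,
    brand: Optional[str] = None,
    generic: Optional[str] = None,
) -> List[str]:
    """Return openFDA search expressions, most precise identifier first."""
    candidates = [
        ("openfda.rxcui", rxcui),
        ("openfda.product_ndc", ndc),
        ("openfda.brand_name", brand),
        ("openfda.generic_name", generic),
    ]
    # Double quotes request exact matching; skip identifiers we do not have.
    return [f'{field}:"{value}"' for field, value in candidates if value]


print(fallback_searches(brand="Example Brand", rxcui="6809"))
```

A caller would issue each expression in order and stop at the first one that returns a result, then confirm that the returned label really is the intended drug.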
The openFDA response includes a results array where each entry contains the full label text organized by field name. The fields you need for interaction extraction are drug_interactions, warnings (or warnings_and_cautions on PLR-format labels), contraindications, boxed_warning, and drug_interactions_table. Each field, when present, contains an array of strings (usually a single string with the full section text); a field is simply absent when the label has no such section. The openfda nested object contains the structured metadata including spl_set_id, rxcui, brand_name, and generic_name.
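To make the response shape concrete, here is a small sketch that pulls the identifying metadata out of a parsed response. It assumes the values inside the openfda object are lists of strings and that effective_time sits at the top level of each result as a YYYYMMDD string; the sample response uses placeholder values, not real label data.

```python
from typing import Dict, List


def summarize_results(response: Dict) -> List[Dict]:
    """Extract identifying metadata from each entry in the results array."""
    summaries = []
    for result in response.get("results", []):
        openfda = result.get("openfda", {})
        set_ids = openfda.get("spl_set_id") or [None]
        summaries.append({
            "spl_set_id": set_ids[0],
            "effective_time": result.get("effective_time"),
            "rxcui": openfda.get("rxcui", []),
            "generic_name": openfda.get("generic_name", []),
            "has_drug_interactions": bool(result.get("drug_interactions")),
        })
    return summaries


# Placeholder response; identifiers and text are invented for illustration.
sample_response = {
    "results": [{
        "effective_time": "20240115",
        "drug_interactions": ["Example interaction text."],
        "openfda": {
            "spl_set_id": ["example-set-id"],
            "rxcui": ["000000"],
            "generic_name": ["EXAMPLE DRUG"],
        },
    }]
}
summary = summarize_results(sample_response)
print(summary[0]["spl_set_id"], summary[0]["effective_time"])
```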
The parsing challenge: unstructured clinical prose
Here is where the hard part begins. FDA label text is written by pharmaceutical companies as clinical narrative intended for healthcare professionals. It is not structured data. The same type of interaction information can be expressed in dramatically different ways across different labels, and extracting consistent, structured interaction pairs from this text is a substantial natural language processing challenge.
Consider these three real examples from different labels, all describing interactions with warfarin. Label A states: 'Co-administration with warfarin may result in increased INR and prothrombin time; monitor coagulation parameters frequently.' Label B states: 'Concomitant use with coumarin anticoagulants increases the risk of bleeding.' Label C states: 'Drug X is a CYP2C9 inhibitor. Warfarin is a CYP2C9 substrate. Increased warfarin exposure may occur.' All three describe the same clinical interaction, but they use different drug identifiers (warfarin vs coumarin anticoagulants), different outcome descriptions (increased INR vs bleeding risk vs increased exposure), and different levels of guidance: only Label A includes a management recommendation.
Additional parsing challenges include references to drug classes rather than specific drugs ('concurrent use with other CNS depressants'), conditional interactions that depend on dose or patient factors ('in patients with renal impairment, co-administration may...'), interactions described through mechanism without explicit outcome statements ('Drug X inhibits CYP3A4, the primary metabolic pathway of Drug Y'), and negation patterns where the label states that an expected interaction was studied and not found ('no clinically significant interaction was observed with...').
These challenges are why building a reliable interaction extraction pipeline from raw FDA label data is a multi-month engineering project rather than a weekend hack. The text is technically accessible through openFDA, but transforming it into structured interaction data with consistent drug identification, severity classification, and clinical recommendations requires sophisticated natural language processing.
Deterministic extraction: what you can parse with rules
Before reaching for AI-based extraction, there is meaningful signal that can be extracted with deterministic text processing. Rule-based extraction works well for interactions described with explicit keywords and structured patterns, and it should form the first pass in any extraction pipeline.
Contraindicated interactions are the easiest to detect deterministically. Scan for keywords like 'contraindicated,' 'must not,' 'do not use,' and 'should not be administered' followed by drug name patterns. FDA labels are remarkably consistent in using these explicit terms for the most serious interactions because regulatory requirements mandate clear language for contraindications.
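A minimal sketch of that keyword scan, using the four phrases listed above. Since the drug name can sit on either side of the keyword ('use with Drug Y is contraindicated' versus 'do not use with Drug Y'), this version keeps a window of context around each hit rather than only the text that follows; the window size and function name are arbitrary choices for illustration.

```python
import re
from typing import Dict, List

CONTRA_PATTERN = re.compile(
    r"\b(contraindicated|must not|do not use|should not be administered)\b",
    re.IGNORECASE,
)


def find_contraindication_hits(text: str, window: int = 80) -> List[Dict[str, str]]:
    """Return each keyword hit with surrounding context for later drug matching."""
    hits = []
    for match in CONTRA_PATTERN.finditer(text):
        start = max(0, match.start() - window)
        end = min(len(text), match.end() + window)
        hits.append({
            "keyword": match.group(0).lower(),
            "context": text[start:end].strip(),
        })
    return hits


sample = "Use with Drug Y is contraindicated. Do not use with alcohol."
hits = find_contraindication_hits(sample)
print([hit["keyword"] for hit in hits])
```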
Sentence boundary detection, combined with keyword triggering, can identify interaction-relevant sentences within longer section text. Look for trigger words and phrases: 'concomitant,' 'co-administration,' 'concurrent use,' 'drug interaction,' 'when used with,' 'in combination with,' and 'coadministered.' Sentences containing these triggers are likely interaction descriptions and can be extracted for further processing.
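The trigger approach might look like the sketch below. The sentence splitter is deliberately naive (it breaks after ., ! or ? when followed by whitespace and a capital letter), so abbreviations and decimal-heavy dosing text will trip it up; a production pipeline would likely need something sturdier. The trigger list is the one given above.

```python
import re
from typing import List

TRIGGERS = [
    "concomitant",
    "co-administration",
    "concurrent use",
    "drug interaction",
    "when used with",
    "in combination with",
    "coadministered",
]
TRIGGER_RE = re.compile("|".join(re.escape(t) for t in TRIGGERS), re.IGNORECASE)

# Naive boundary: sentence punctuation, whitespace, then an uppercase letter.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def interaction_sentences(section_text: str) -> List[str]:
    """Return the sentences that contain at least one trigger phrase."""
    sentences = SENTENCE_SPLIT_RE.split(section_text)
    return [s.strip() for s in sentences if TRIGGER_RE.search(s)]


sample = (
    "Drug X is renally cleared. "
    "Concomitant use with warfarin may increase INR. "
    "Store at room temperature."
)
print(interaction_sentences(sample))
```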
Drug name recognition within interaction sentences requires matching against a known drug vocabulary. RxNorm provides the most comprehensive vocabulary for this purpose. Match candidate drug names in the text against RxNorm concepts to identify the interacting drug(s) mentioned in each sentence. This is more reliable than arbitrary named entity recognition because drug names are well-defined and enumerable.
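Here is one way vocabulary matching could be sketched. The vocabulary is a tiny hand-written dictionary standing in for a real RxNorm-derived name list, and its identifier values are placeholders, not actual RxCUIs. Whole-word matching avoids hits inside longer words.

```python
import re
from typing import Dict, List


def match_drug_names(sentence: str, vocabulary: Dict[str, str]) -> List[Dict[str, str]]:
    """Find vocabulary drug names that appear as whole words in a sentence."""
    lowered = sentence.lower()
    found = []
    for name in sorted(vocabulary):  # sorted for deterministic output
        pattern = r"\b" + re.escape(name.lower()) + r"\b"
        if re.search(pattern, lowered):
            found.append({"name": name, "id": vocabulary[name]})
    return found


# Placeholder vocabulary; a real one would be loaded from RxNorm.
toy_vocabulary = {
    "warfarin": "PLACEHOLDER-ID-1",
    "ketoconazole": "PLACEHOLDER-ID-2",
    "aspirin": "PLACEHOLDER-ID-3",
}
sentence = "Co-administration with warfarin or ketoconazole may increase exposure."
matches = match_drug_names(sentence, toy_vocabulary)
print([m["name"] for m in matches])
```

A linear scan like this is fine for a demonstration, but with a full-size vocabulary you would probably want a trie or an Aho-Corasick style matcher instead of one regex search per name.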
The limitation of deterministic extraction is coverage. Rule-based approaches reliably extract interactions that are described with standard keywords and explicit drug names, but they miss interactions described through mechanism-only language, drug class references, or unusual phrasing. As a rough guide, deterministic extraction captures on the order of 40 to 60 percent of interactions in a typical label, though the share varies with the label and the rule set, and the remainder requires more flexible parsing.
AI-powered extraction: filling the gaps
The interactions that deterministic rules miss are exactly the ones where large language models excel. LLMs can interpret mechanism-based interaction descriptions ('Drug X inhibits CYP3A4'), resolve drug class references to specific drugs ('other QT-prolonging agents'), understand conditional language ('in patients with hepatic impairment'), and distinguish positive interaction findings from negative findings ('no clinically significant interaction was observed').
RxLabelGuard uses AWS Bedrock with Claude models for the AI extraction pass. The label text is provided to the model with a structured prompt that requests extraction into a defined JSON schema: target drug name, target drug type (specific drug, drug class, or food), RxCUI if identifiable, severity level, mechanism of interaction, clinical recommendation, and the evidence snippet from the source text.
The AI pass runs after the deterministic pass, processing label text that was flagged as potentially containing interactions but could not be fully parsed by rules. This two-pass architecture has three advantages: it reduces AI processing costs by handling the easy cases deterministically, it provides a performance baseline that the AI pass can be evaluated against, and it ensures that clearly stated interactions like explicit contraindications are never missed due to model variability.
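The control flow of a two-pass design can be sketched generically, as below. This is not RxLabelGuard's actual code: both extractors are passed in as plain callables, and the toy stand-ins at the bottom exist only to show which sentences reach which pass.

```python
from typing import Callable, Dict, List, Optional

RuleExtractor = Callable[[str], Optional[Dict]]
AiExtractor = Callable[[List[str]], List[Dict]]


def two_pass_extract(
    sentences: List[str],
    rule_extract: RuleExtractor,
    ai_extract: AiExtractor,
) -> List[Dict]:
    """Run rules first; send only the unresolved sentences to the AI pass."""
    results: List[Dict] = []
    unresolved: List[str] = []
    for sentence in sentences:
        hit = rule_extract(sentence)
        if hit is not None:
            results.append({**hit, "source": "rules"})
        else:
            unresolved.append(sentence)
    if unresolved:  # skip the model call entirely when rules covered everything
        for hit in ai_extract(unresolved):
            results.append({**hit, "source": "ai"})
    return results


# Toy stand-ins for the two passes (illustrative only).
def toy_rules(sentence: str) -> Optional[Dict]:
    if "contraindicated" in sentence.lower():
        return {"severity": "contraindicated", "evidence_snippet": sentence}
    return None


def toy_ai(sentences: List[str]) -> List[Dict]:
    return [{"severity": "unknown", "evidence_snippet": s} for s in sentences]


out = two_pass_extract(
    ["Use with Drug Y is contraindicated.", "Drug X inhibits CYP3A4."],
    toy_rules,
    toy_ai,
)
print([r["source"] for r in out])
```

Tagging each result with the pass that produced it makes it straightforward to measure the AI pass against the rule-based baseline.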
The structured output format is critical. Without a defined schema, LLMs tend to produce verbose natural language descriptions that are no easier to consume programmatically than the original label text. By requiring specific fields (target_name, severity, mechanism, recommendation, evidence_snippet), the extraction produces data that can be directly stored, indexed, and served through the API.
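One way to enforce such a schema on the consuming side is sketched below. The field names come from the list above; the target_type encodings, the sample severity value, and the rule that the evidence snippet must appear verbatim in the source text are assumptions added for illustration, not a description of how RxLabelGuard validates output.

```python
import json
from dataclasses import dataclass
from typing import List, Optional

REQUIRED_FIELDS = ("target_name", "severity", "mechanism",
                   "recommendation", "evidence_snippet")
TARGET_TYPES = {"drug", "drug_class", "food"}  # hypothetical encodings


@dataclass
class Interaction:
    target_name: str
    target_type: str
    rxcui: Optional[str]
    severity: str
    mechanism: str
    recommendation: str
    evidence_snippet: str


def parse_model_output(raw_json: str, source_text: str) -> List[Interaction]:
    """Keep only records that are complete and grounded in the label text."""
    valid = []
    for rec in json.loads(raw_json):
        if any(not rec.get(name) for name in REQUIRED_FIELDS):
            continue  # missing or empty required field
        if rec.get("target_type") not in TARGET_TYPES:
            continue
        if rec["evidence_snippet"] not in source_text:
            continue  # snippet is not a verbatim quote from the label
        valid.append(Interaction(
            target_name=rec["target_name"],
            target_type=rec["target_type"],
            rxcui=rec.get("rxcui"),
            severity=rec["severity"],
            mechanism=rec["mechanism"],
            recommendation=rec["recommendation"],
            evidence_snippet=rec["evidence_snippet"],
        ))
    return valid


source_text = "Drug X is a CYP2C9 inhibitor. Increased warfarin exposure may occur."
base = {"target_type": "drug", "rxcui": None, "severity": "moderate",
        "mechanism": "CYP2C9 inhibition", "recommendation": "Monitor INR."}
raw = json.dumps([
    {**base, "target_name": "warfarin",
     "evidence_snippet": "Increased warfarin exposure may occur."},
    {**base, "target_name": "Drug Q",
     "evidence_snippet": "This sentence is not in the label."},
])
parsed = parse_model_output(raw, source_text)
print([p.target_name for p in parsed])
```

The verbatim-snippet check is a cheap guard against a model inventing an interaction: anything that cannot be traced back to the label text is dropped.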
Caching and freshness: keeping extracted data current
FDA labels are updated periodically as new safety information emerges, new interactions are discovered through post-market surveillance, and dosing recommendations change. The update frequency varies: some labels are revised multiple times per year while others remain unchanged for years. An extraction pipeline must track label freshness and re-extract when updates occur.
RxLabelGuard caches extracted interaction data keyed by the spl_set_id, which remains stable across label versions, along with the effective_time of the label version that was processed. When a new interaction check is requested, the system compares the cached effective_time against the most recent label version available from openFDA. If a newer version exists, the label is re-fetched and re-processed.
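The version comparison itself is small. The sketch below assumes effective_time is stored as a fixed-width YYYYMMDD string, so plain string comparison orders versions correctly; the CachedLabel type and function name are hypothetical and not taken from RxLabelGuard.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CachedLabel:
    spl_set_id: str       # stable across label versions
    effective_time: str   # assumed YYYYMMDD, version that was processed
    interactions: List[dict] = field(default_factory=list)


def needs_reextraction(cached: Optional[CachedLabel], latest_effective_time: str) -> bool:
    """True when nothing is cached or a newer label version is available."""
    if cached is None:
        return True
    # Fixed-width YYYYMMDD strings sort chronologically.
    return latest_effective_time > cached.effective_time


cached = CachedLabel(spl_set_id="example-set-id", effective_time="20230101")
print(needs_reextraction(cached, "20240101"))
```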
This caching strategy balances freshness against processing cost. Re-extracting every label on every request would be prohibitively expensive in both API calls to openFDA and AI processing costs for extraction. Caching with version-based invalidation ensures that consumers always receive data based on the current label version without redundant processing.
For teams building their own extraction pipeline, a reasonable caching approach is to set a check interval of 24 to 48 hours for label freshness verification (a single limit=1 openFDA query, from which only the effective_time needs to be compared), and trigger full re-extraction only when the effective_time has changed. Store both the raw label text and the extracted results so that extraction logic changes can be applied retroactively without re-fetching labels from openFDA.
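The interval gating could be sketched as follows. The freshness lookup is passed in as a callable so that no openFDA request happens inside the check interval; the function name, the returned action strings, and the 24-hour default are illustrative choices.

```python
from datetime import datetime, timedelta, timezone
from typing import Callable


def freshness_action(
    last_checked: datetime,
    cached_effective_time: str,
    fetch_latest_effective_time: Callable[[], str],
    now: datetime,
    check_interval: timedelta = timedelta(hours=24),
) -> str:
    """Decide whether to serve from cache or re-extract the label."""
    if now - last_checked < check_interval:
        return "use_cache"  # checked recently; do not call openFDA at all
    latest = fetch_latest_effective_time()
    if latest != cached_effective_time:
        return "reextract"
    return "use_cache"  # label unchanged; caller should record the new check time


now = datetime(2024, 3, 1, 12, 0, tzinfo=timezone.utc)
action = freshness_action(
    last_checked=now - timedelta(hours=30),
    cached_effective_time="20240101",
    fetch_latest_effective_time=lambda: "20240201",  # stand-in for an openFDA call
    now=now,
)
print(action)
```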
Build versus buy: when to use a managed service
Building a complete SPL extraction pipeline from scratch is a substantial engineering investment. The component list includes openFDA query logic, rate limiting and retry handling, drug name resolution via RxNorm, label section parsing, deterministic interaction extraction rules, AI-powered extraction for unstructured text, severity classification, result caching with version-based invalidation, and ongoing monitoring and quality assurance.
For teams whose core product is drug interaction detection or clinical pharmacology, building this pipeline in-house provides maximum control over data quality and extraction logic. The raw data is freely available, and the engineering investment creates a defensible capability that is difficult for competitors to replicate quickly.
For teams that need drug interaction checking as one feature among many, such as an EHR module, a pharmacy management system, or a patient-facing medication safety app, building the full extraction pipeline is rarely the best use of engineering resources. A managed API that handles extraction, caching, and severity scoring reduces the drug interaction feature from a multi-month project to a single API integration.
RxLabelGuard exists precisely for this second category. It encapsulates the entire SPL extraction pipeline behind a simple REST API, allowing development teams to add drug interaction checking to their applications without building their own label parsing, AI extraction, drug resolution, and caching infrastructure.
Practical first steps for evaluation
If you want to understand the raw data before deciding on build versus buy, start by querying a few labels directly through the openFDA API. Pick a drug you are familiar with and query: curl 'https://api.fda.gov/drug/label.json?search=openfda.generic_name:metformin&limit=1'. Examine the drug_interactions, warnings, and contraindications fields in the response. Read through the text and mentally extract the interaction pairs, severity levels, and recommendations.
This exercise will give you a concrete sense of the parsing complexity. Some labels have cleanly organized interaction sections with explicit drug names and clear management recommendations. Others contain dense pharmacological prose that requires clinical knowledge to interpret. The variation across labels is the fundamental challenge that any extraction pipeline, whether rule-based, AI-powered, or manual, must handle.
After reviewing a handful of labels, you will be well positioned to estimate the engineering effort required to build a custom extraction pipeline and to evaluate whether a managed service provides sufficient quality and coverage for your use case.
Medical disclaimer
This information is derived from FDA Structured Product Labeling and is provided for informational purposes only. It should not be used as a substitute for professional medical advice, diagnosis, or treatment. Always consult a qualified healthcare provider.
References
- openFDA Drug Label Endpoint (U.S. Food and Drug Administration (FDA); accessed Mar 6, 2026)
- openFDA Drug Label Searchable Fields (U.S. Food and Drug Administration (FDA); accessed Mar 6, 2026)
- DailyMed Web Services (U.S. National Library of Medicine (NLM); accessed Mar 6, 2026)
- The FDA Announces New Prescription Drug Information Format (U.S. Food and Drug Administration (FDA); accessed Mar 6, 2026)
- DailyMed SPL Resources (U.S. National Library of Medicine (NLM); accessed Mar 22, 2026)