HL7 Clinical Document Architecture (CDA) was a significant advance in clinical document standardisation when it was introduced: it provided a structured XML representation for clinical documents — discharge summaries, progress notes, operative reports — that included both human-readable narrative sections and structured coded entries for key clinical data elements. For many organisations, CDA remains the standard format for exchanging clinical documents between systems. Understanding its semantic limitations is essential for designing realistic clinical document intelligence systems.

What CDA Provides Structurally

CDA provides document-level metadata (document type, creation date, author, patient identifier), section-level organisation (each section carries a coded identifier, typically a LOINC code, that states its purpose), and entry-level coded data (structured entries within sections can reference SNOMED CT, LOINC, or RxNorm codes for diagnoses, observations, and medications). In a well-implemented CDA document, the diagnosis section contains both a human-readable narrative and a coded entry with a SNOMED CT identifier for each diagnosis. Such a coded entry can be processed directly by machine, without NLP.
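The three structural layers can be read with a standard XML parser. The following is a minimal sketch, assuming a heavily simplified CDA fragment. The sample document, the function name extract_sections, and the shape of the returned dictionaries are illustrative assumptions, and real documents carry far more structure (templates, nested sections, several entry types).

```python
import xml.etree.ElementTree as ET

# CDA elements live in the HL7 v3 namespace.
HL7 = "urn:hl7-org:v3"
NS = {"hl7": HL7}

# Simplified, hand-written sample; not taken from any real system.
SAMPLE_CDA = """<ClinicalDocument xmlns="urn:hl7-org:v3"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
  <component><structuredBody><component>
    <section>
      <code code="11450-4" codeSystem="2.16.840.1.113883.6.1"
            displayName="Problem list"/>
      <title>Problems</title>
      <text>Type 2 diabetes mellitus. Mild persistent asthma.</text>
      <entry>
        <observation classCode="OBS" moodCode="EVN">
          <value xsi:type="CD" code="44054006"
                 codeSystem="2.16.840.1.113883.6.96"
                 displayName="Diabetes mellitus type 2"/>
        </observation>
      </entry>
    </section>
  </component></structuredBody></component>
</ClinicalDocument>"""


def extract_sections(xml_text):
    """Return one dict per section: its code, narrative text, coded entries."""
    root = ET.fromstring(xml_text)
    sections = []
    for section in root.iterfind(".//hl7:section", NS):
        code_el = section.find("hl7:code", NS)
        text_el = section.find("hl7:text", NS)
        entries = []
        for entry in section.findall("hl7:entry", NS):
            # Collect every coded <value> below this entry.
            for value in entry.iter(f"{{{HL7}}}value"):
                if value.get("code"):
                    entries.append({
                        "code": value.get("code"),
                        "system": value.get("codeSystem"),
                        "display": value.get("displayName"),
                    })
        sections.append({
            "section_code": code_el.get("code") if code_el is not None else None,
            "narrative": "".join(text_el.itertext()).strip() if text_el is not None else "",
            "entries": entries,
        })
    return sections


sections = extract_sections(SAMPLE_CDA)
```

In this sample the narrative mentions two problems but only one has a coded entry, which is the kind of gap discussed in the next section.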

Where CDA Semantics Break Down

The limitation of CDA as a semantic interoperability standard is that the quality of the coded entries depends entirely on the implementation: the standard defines where coded entries belong, but how thoroughly a producing system populates them varies widely. In practice, CDA documents produced by clinical systems range from fully coded (every clinical entity has a structured coded entry) to partially coded (some sections have entries, others have only narrative) to nominally coded (entries are present but use local codes rather than standard terminologies). For pharmaceutical knowledge mining, it is rarely safe to assume that CDA entries are complete and reliable: they must be validated against the narrative text, and the narrative must be processed by NLP to extract clinical information that the coded entries do not capture.
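A first-pass triage of incoming documents along this spectrum can be sketched as follows. This is a coarse heuristic under stated assumptions: the function names and the input shape (a mapping from section name to the code system OIDs of its entries) are hypothetical, and the check cannot tell whether every clinical entity in the narrative has an entry, so "fully coded" here only means that every section has entries and all of them use a standard terminology.

```python
# OIDs commonly used in CDA to identify standard terminologies.
STANDARD_SYSTEMS = {
    "2.16.840.1.113883.6.96": "SNOMED CT",
    "2.16.840.1.113883.6.1": "LOINC",
    "2.16.840.1.113883.6.88": "RxNorm",
}


def section_coding_level(entry_systems):
    """Classify one section from the codeSystem OIDs of its entries."""
    if not entry_systems:
        return "narrative-only"
    if all(oid in STANDARD_SYSTEMS for oid in entry_systems):
        return "standard"
    return "local"


def document_coding_level(sections):
    """Classify a document; `sections` maps section name -> list of OIDs."""
    levels = {section_coding_level(oids) for oids in sections.values()}
    if "local" in levels:
        return "nominally coded"   # at least one entry uses a local code
    if levels == {"standard"}:
        return "fully coded"       # every section has standard-coded entries
    if "standard" in levels:
        return "partially coded"   # some sections are narrative only
    return "uncoded"               # no entries anywhere


# Hypothetical example: problems are coded, medications are narrative only.
example = {
    "problems": ["2.16.840.1.113883.6.96"],
    "medications": [],
}
level = document_coding_level(example)
```

A triage label of this kind is only a routing hint: even documents labelled "fully coded" still need their entries checked against the narrative.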

The Pragmatic Integration Approach

The pragmatic approach for pharmaceutical organisations working with CDA documents is a hybrid extraction strategy: extract all available coded entries from the CDA structure, validate them against the narrative using NLP, and use NLP to extract clinical information from sections where coded entries are absent or unreliable. This strategy captures the value of whatever structured coding exists in the documents while ensuring that the knowledge graph is not limited to the fraction of clinical information that the source system happened to code. The CDA structure also provides valuable metadata (document type, section organisation, author identity) that improves the precision of NLP-based extraction by constraining what kind of clinical information to expect in each section.
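The hybrid strategy can be sketched for a single section as follows. This is an illustration under loud assumptions, not a reference implementation: toy_nlp is a hypothetical stand-in for a real clinical NLP pipeline (here a three-term lexicon), and the names Fact, EXPECTED_KIND, and hybrid_extract, along with the provenance labels, are invented for the example.

```python
from dataclasses import dataclass

# Hypothetical lexicon standing in for a clinical NLP model:
# surface term -> (code, kind of clinical information).
TOY_LEXICON = {
    "type 2 diabetes": ("44054006", "diagnosis"),
    "asthma": ("195967001", "diagnosis"),
    "metformin": ("6809", "medication"),
}

# Section code -> kind of information expected there (illustrative subset).
EXPECTED_KIND = {
    "11450-4": "diagnosis",   # LOINC: problem list
    "10160-0": "medication",  # LOINC: history of medication use
}


@dataclass(frozen=True)
class Fact:
    code: str
    kind: str
    source: str  # "coded+narrative", "coded-only", or "nlp-only"


def toy_nlp(narrative, expected_kind):
    """Return {code: kind} for lexicon terms found in the narrative,
    restricted to the kind the section is expected to contain."""
    text = narrative.lower()
    return {
        code: kind
        for term, (code, kind) in TOY_LEXICON.items()
        if term in text and (expected_kind is None or kind == expected_kind)
    }


def hybrid_extract(section_code, narrative, coded_entries):
    """Combine coded entries with NLP mentions, recording provenance."""
    expected = EXPECTED_KIND.get(section_code)
    mentions = toy_nlp(narrative, expected)
    facts = []
    # Steps 1 and 2: keep each coded entry, noting whether the narrative confirms it.
    for code in coded_entries:
        source = "coded+narrative" if code in mentions else "coded-only"
        facts.append(Fact(code, expected or "unknown", source))
    # Step 3: add narrative mentions that no coded entry covers.
    for code, kind in mentions.items():
        if code not in coded_entries:
            facts.append(Fact(code, kind, "nlp-only"))
    return facts


facts = hybrid_extract(
    "11450-4",
    "Type 2 diabetes mellitus. Mild persistent asthma. Takes metformin.",
    ["44054006"],
)
```

Recording the provenance of each fact lets downstream consumers of the knowledge graph weight corroborated entries more heavily than entries seen only in one channel. The section code also filters out the metformin mention here, because a medication is not expected in a problem list under the assumed mapping.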