Clinical NLP · Graph Dataset · Substance Use Disorder

Dataset Analysis Report

same_graph_test_bothmasked.json & same_graph_train_bothmasked.json — Full combined analysis

Total patients: 7,628 (Test: 2,640 · Train: 4,988)
Total edges: 190,092 (similarity graph connections)
Total words: 2.80M (across all clinical notes)
Total chars: 19.4M (across all clinical notes)

1 · Corpus overview & label distribution

Overall label split (combined)

Inpatient (IP): 2,609 (34.2%) · Outpatient (OP): 5,019 (65.8%)

Split across train / test files

Figure (IP vs OP): Train: IP 1,676 · OP 3,312; Test: IP 933 · OP 1,707

The dataset is class-imbalanced at roughly 1:2 (IP:OP). The ratio is similar in both splits (33.6% IP in train, 35.3% IP in test), which is consistent with a stratified or near-stratified partition.
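The class balance per split can be recomputed directly from the counts reported above. This is a minimal sketch; the variable and function names are illustrative and not taken from the original analysis code.

```python
# Label counts per split, as reported in this section.
counts = {
    "train": {"IP": 1676, "OP": 3312},
    "test": {"IP": 933, "OP": 1707},
}


def ip_share(split):
    """Fraction of notes labelled IP within one split."""
    return split["IP"] / (split["IP"] + split["OP"])


shares = {name: ip_share(split) for name, split in counts.items()}
gap = abs(shares["train"] - shares["test"])

for name, share in shares.items():
    print(f"{name}: IP share = {share:.1%}")
print(f"absolute gap between splits = {gap:.1%}")
```

The gap between splits is under two percentage points. That is compatible with stratification but does not prove it, since a purely random split of this size could produce a similar difference.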

2 · Note length — tokens, words & sentences

Combined dataset summary statistics

Avg words / note: 367.2 (σ = 253.8)
Median words: 309 (long right tail)
Avg sentences: 48.3 (σ = 35.7)
Avg chars / note: 2,545 (σ = 1,760)

Word count distribution — all notes

Figure (all patients): distribution of word counts across 7,628 notes

Sentence count distribution — all notes

Figure (all patients): distribution of sentence counts

3 & 4 · IP vs OP — length & severity

IP avg words: 438.1 (σ = 318.0 · median = 355)
OP avg words: 330.3 (σ = 203.3 · median = 292)
Length ratio IP:OP: 1.33× (IP notes are 33% longer on avg)

Word count by label (bucketed)

Figure (IP vs OP): IP has fewer short notes and more long notes; OP is more concentrated in the medium range

Severity score distribution

Figure (IP vs OP): IP avg severity 8.03, OP avg severity 6.27

Summary: IP vs OP length & clinical stats

Metric | IP (n=2,609) | OP (n=5,019)
Mean words | 438.1 | 330.3
Median words | 355 | 292
Std words | 318.0 | 203.3
Mean sentences | 58.5 | 43.0
Mean chars | 3,053 | 2,281
Max words (note) | 2,609 | 2,501
Avg severity score | 8.03 | 6.27
Score ≥ 10 (%) | 33.4% | 19.1%
Score ≥ 15 (%) | 12.0% | 4.4%
Avg Treatment_decision tokens | 5.42 | 3.16

Train vs test file characteristics

Metric | Train | Test
Nodes | 4,988 | 2,640
Edges | 120,492 | 69,600
IP % | 33.6% | 35.3%
OP % | 66.4% | 64.7%
Mean words | 338.5 | 421.3
Median words | 291 | 348
Mean sentences | 44.1 | 56.2
Mean chars | 2,325 | 2,961

Test set notes are notably longer on average (421 vs 339 words). This may reflect more complex multi-visit collations, and could affect model behaviour at inference time.

IP notes are consistently longer, denser, and higher-severity than OP notes. This is clinically expected: inpatient cases involve more events (admission, detox, stabilisation), multiple clinicians, and higher acuity — all reflected in longer collated notes.

5 & 6 · IP vs OP — clinical patterns & severity indicators

Pattern strength ratio (IP% ÷ OP%) — features that distinguish inpatient admission

Ratio > 1.0 means more prevalent in IP. Features sorted by ratio descending.

Features with highest IP/OP ratios: social withdrawal 2.0x, paranoia 1.78x, memory issues 1.67x

Paranoia (1.78×), memory issues (1.67×), hallucinations (1.57×), nausea/vomiting (1.44×), and irritability (1.30×) are the strongest IP-associated symptom patterns. These reflect acute psychiatric and neurological complications requiring inpatient management.

7 · Borderline & ambiguous cases

Very short IP notes: 50 (1.9% of IP, <50 words)
Very long OP notes: 260 (5.2% of OP, >700 words)
Low-severity IP: 341 (13.1% of IP, score ≤ 2)
High-severity OP: 555 (11.1% of OP, score ≥ 12)

An estimated 10–13% of notes may constitute "borderline" cases: IP notes with minimal clinical documentation or low severity (13.1% of IP score ≤ 2) and OP notes with complex, high-severity presentations (11.1% of OP score ≥ 12). These represent real-world label ambiguity and will likely be the hardest cases for any classifier.
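The four borderline criteria listed above can be expressed as a small flagging function. This is a sketch only: the `Note` record and its field names are assumptions made for illustration, since the JSON schema of the graph files is not described in this report. The thresholds are the ones stated above.

```python
from dataclasses import dataclass


@dataclass
class Note:
    # Hypothetical record layout; not the actual schema of the JSON files.
    label: str      # "IP" or "OP"
    n_words: int    # word count of the collated note
    severity: int   # severity score as used in this report


def borderline_reasons(note: Note) -> list[str]:
    """Return the borderline criteria (from this section) that a note meets."""
    reasons = []
    if note.label == "IP":
        if note.n_words < 50:
            reasons.append("very_short_ip")
        if note.severity <= 2:
            reasons.append("low_severity_ip")
    elif note.label == "OP":
        if note.n_words > 700:
            reasons.append("very_long_op")
        if note.severity >= 12:
            reasons.append("high_severity_op")
    return reasons


example = Note(label="OP", n_words=820, severity=14)
print(borderline_reasons(example))
```

A note can meet more than one criterion, so the four counts above may overlap and should not simply be summed.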

Example short IP note (50 words)

"40-year-old married gentleman presented with history of alcohol and tobacco dependence syndrome for the past 17 years. Presented with withdrawal hallucinosis, multimodal in nature. Pulse rate is 112/min. Coarse tremors present. Plan: Treatment_decision1. Once the patient is stable, discharge..."

Severity overlap zone

At severity score = 8 (IP median), both classes overlap heavily:

Score threshold | IP above (%) | OP above (%)
≥ 5 | 71.9% | 61.3%
≥ 10 | 33.4% | 19.1%
≥ 15 | 12.0% | 4.4%
≥ 20 | 3.6% | 0.7%
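To make the overlap concrete, the table can be turned into the accuracy of a single-threshold rule ("predict IP when the score is at or above the threshold") and compared with always predicting OP. This is a worked calculation from the percentages and class sizes reported in this document, not a result from the original analysis; the function name is illustrative.

```python
N_IP, N_OP = 2609, 5019

# (threshold, % of IP at or above, % of OP at or above), from the table above.
rows = [(5, 71.9, 61.3), (10, 33.4, 19.1), (15, 12.0, 4.4), (20, 3.6, 0.7)]


def rule_accuracy(ip_above_pct, op_above_pct):
    """Accuracy of the rule 'predict IP when score >= threshold'."""
    correct_ip = N_IP * ip_above_pct / 100        # IP notes flagged as IP
    correct_op = N_OP * (1 - op_above_pct / 100)  # OP notes left as OP
    return (correct_ip + correct_op) / (N_IP + N_OP)


majority_baseline = N_OP / (N_IP + N_OP)  # always predict OP

accuracies = {t: rule_accuracy(ip, op) for t, ip, op in rows}
for t, acc in accuracies.items():
    print(f"score >= {t}: accuracy = {acc:.1%}")
print(f"always OP: accuracy = {majority_baseline:.1%}")
```

Under these figures no single threshold beats the majority baseline by more than about one percentage point, which is one way of quantifying how heavily the two classes overlap on severity alone.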

8 · Substance mention analysis

Substance prevalence — all notes

Figure (IP % vs OP %): alcohol is nearly universal at 99%, tobacco 85%, benzodiazepines 11%, cannabis 8%

What the data reveals

Substance | All (n) | IP% | OP% | IP ratio
Alcohol | 7,581 | 99.0% | 99.6% | 0.99×
Tobacco/nicotine | 6,467 | 83.7% | 85.3% | 0.98×
Benzodiazepines | 824 | 13.9% | 9.2% | 1.51×
Cannabis | 623 | 13.3% | 5.5% | 2.42×
Stimulants | 290 | 6.1% | 2.6% | 2.35×
Opioids | 205 | 5.9% | 1.0% | 5.90×
Sedatives | 89 | 2.1% | 0.7% | 3.00×
Inhalants | 54 | 1.3% | 0.4% | 3.25×

Opioid mention is the single strongest individual substance predictor of IP admission (5.9× more common in IP). Cannabis (2.4×), stimulants (2.4×), inhalants (3.3×), and sedatives (3.0×) also strongly discriminate. Alcohol and tobacco are near-universal and thus uninformative for classification.
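The IP ratio used throughout this report is the IP prevalence divided by the OP prevalence. A minimal sketch that recomputes and ranks the ratios from the table above (the dictionary layout is an illustrative choice):

```python
# substance: (IP %, OP %), copied from the substance table above.
prevalence = {
    "Alcohol": (99.0, 99.6),
    "Tobacco/nicotine": (83.7, 85.3),
    "Benzodiazepines": (13.9, 9.2),
    "Cannabis": (13.3, 5.5),
    "Stimulants": (6.1, 2.6),
    "Opioids": (5.9, 1.0),
    "Sedatives": (2.1, 0.7),
    "Inhalants": (1.3, 0.4),
}


def ip_ratio(ip_pct, op_pct):
    """Prevalence ratio: values above 1.0 mean more common in IP notes."""
    return ip_pct / op_pct


ranked = sorted(
    ((name, ip_ratio(ip, op)) for name, (ip, op) in prevalence.items()),
    key=lambda item: item[1],
    reverse=True,
)

for name, ratio in ranked:
    print(f"{name:<18} {ratio:.2f}x")
```

Ratios for rare substances rest on small counts (for example 54 inhalant mentions in total), so they are likely to be less stable than the ratios for common substances.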

9 · Duration of use & quantity patterns

IP avg daily quantity: 15.4 units (median = 15, max = 72 units/day)
OP avg daily quantity: 15.1 units (median = 12, max = 96 units/day)
Quantity mentions extracted: 7,196 (IP: 2,741 · OP: 4,455)

Duration of use (months, from text)

Metric | IP (n=495) | OP (n=842)
Mean duration | 93.2 months | 126.2 months
Median duration | 48 months (4y) | 96 months (8y)

OP patients show longer documented durations of use (median 8y vs 4y for IP). This likely reflects that OP notes accumulate more longitudinal history, while IP notes focus on acute presentation. Duration alone is not a reliable IP predictor.

Quantity (units/day) distribution

Figure (IP vs OP): IP and OP have a similar mean quantity of around 15 units per day

10 · Multi-substance co-use

Number of substances co-mentioned (per note)

Figure (IP vs OP): most notes mention 2 substances (IP 1,580 · OP 3,644)

Top substance co-occurrence pairs

Pair | Count
Alcohol + Tobacco | 6,462
Alcohol + Benzodiazepines | 819
Benzodiazepines + Tobacco | 716
Alcohol + Cannabis | 622
Cannabis + Tobacco | 605
Alcohol + Stimulants | 289
Stimulants + Tobacco | 262
Alcohol + Opioids | 204
Opioids + Tobacco | 192
Cannabis + Stimulants | 136

IP notes show markedly higher rates of 4+ substance co-use: 8.5% of IP vs 2.7% of OP mention four or more substances, roughly a 3.1× difference. This polysubstance pattern is a strong indicator of admission complexity.
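Counting co-mentioned substances per note could be done with a keyword lexicon, as sketched below. The lexicon here is hypothetical and deliberately tiny: the report does not state which terms or matching rules were used, and this sketch does not handle negation ("denies cannabis use") or misspellings.

```python
import re

# Hypothetical lexicon for illustration; not the term list used in this report.
LEXICON = {
    "alcohol": ["alcohol", "whisky", "beer"],
    "tobacco": ["tobacco", "nicotine", "cigarette"],
    "cannabis": ["cannabis", "ganja"],
    "opioids": ["opioid", "heroin", "tramadol"],
}


def substances_mentioned(text):
    """Return the set of lexicon substances with at least one term in the text."""
    lowered = text.lower()
    found = set()
    for substance, terms in LEXICON.items():
        # Match at a word start so that plurals such as "cigarettes" still hit.
        if any(re.search(r"\b" + re.escape(term), lowered) for term in terms):
            found.add(substance)
    return found


note = "History of alcohol and tobacco dependence; occasional cannabis use."
mentioned = substances_mentioned(note)
print(sorted(mentioned), "polysubstance (4+):", len(mentioned) >= 4)
```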

11 · Symptom analysis & co-occurrence

Symptom prevalence across cohort

Figure (IP % vs OP %): withdrawal and craving are most common; hallucinations and memory issues are more elevated in IP

Top symptom co-occurrences (all patients)

Symptom pair | Count
Craving + Withdrawal | 5,459
Tremors + Withdrawal | 4,001
Craving + Tremors | 3,758
Sleep disturbance + Withdrawal | 2,687
Seizures + Withdrawal | 2,677
Craving + Sleep disturbance | 2,598
Craving + Seizures | 2,407
Sleep disturbance + Tremors | 2,094
Seizures + Tremors | 1,891
Irritability + Withdrawal | 1,829
Craving + Irritability | 1,784
Anxiety + Withdrawal | 1,561
Anxiety + Craving | 1,534
Depression + Withdrawal | 1,282

12 · Which symptoms are predictive of IP admission?

Paranoia 1.78x, memory 1.67x, hallucinations 1.57x, nausea 1.44x

Most predictive (IP-enriched): Paranoia (1.78×), memory/blackout issues (1.67×), auditory/visual hallucinations (1.57×), and nausea/vomiting (1.44×) are the strongest individual symptom predictors of inpatient admission.

Near-universal (non-discriminating): Withdrawal (85%), craving (79%), and tremors (58%) are so prevalent across both classes that they add little discriminative signal on their own. Their combinations matter more.

13–16 · Temporal patterns, relapse, & event sequences

Avg relapse mentions: 2.19 (σ = 3.54, all patients)
IP relapse avg: 2.85 (σ = 4.40)
OP relapse avg: 1.85 (σ = 2.94)
Full event sequences: 1,006 (13.2% of all notes)

Relapse mention frequency (IP vs OP)

Figure (IP vs OP): IP 2.85 avg relapse mentions, OP 1.85 avg relapse mentions

Temporal pattern notes

Pattern | IP | OP
Notes with 0 relapses | 47.7% | 53.0%
Notes with 5+ relapses | 22.0% | 13.1%
Avg abstinence interval | 6.1 days | 6.0 days
Abstinence mentions (n) | 1,195 | 1,404
Full sequence notes | ~13% | ~13%

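IP notes average 54% more relapse mentions than OP notes (2.85 vs 1.85), and the share of notes with 5+ relapse mentions is about 68% higher in IP (22.0% vs 13.1%). Both are consistent with more severe, cyclical SUD patterns requiring inpatient intervention.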
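The relative IP vs OP differences in relapse mentions can be recomputed from the figures reported in this section. The helper name is an illustrative assumption.

```python
def pct_higher(a, b):
    """Relative excess of a over b, in percent."""
    return (a / b - 1) * 100


# Mean relapse mentions per note: IP 2.85 vs OP 1.85.
mean_mentions_gap = pct_higher(2.85, 1.85)

# Share of notes with 5+ relapse mentions: IP 22.0% vs OP 13.1%.
five_plus_gap = pct_higher(22.0, 13.1)

print(f"mean relapse mentions: IP is {mean_mentions_gap:.0f}% higher")
print(f"notes with 5+ relapses: IP is {five_plus_gap:.0f}% higher")
```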


Canonical event sequence identified (in 1,006 notes, 13.2%)

Abstinence → Relapse / Lapse → Detoxification → Follow-up / Review

The abstinence → relapse → detox → follow-up cycle is the canonical clinical trajectory identified in the dataset. Notes encoding the full cycle tend to be markedly longer (avg ~500+ words) and are typical of complex multi-visit cases; the temporal pattern table above reports them at a similar rate (~13%) in both IP and OP.

17–18 · Behavioral indicators & classification potential

Behavioral indicator prevalence (IP vs OP %)

Figure (IP vs OP): social withdrawal 2.0×, delusional thinking 1.68×, and legal issues 1.17× are elevated in IP

Feature discriminability table

Feature | IP% | OP% | Ratio
Social withdrawal | 4.2% | 2.1% | 2.00×
Delusional thinking | 12.0% | 7.2% | 1.68×
Socio-occupational dysfunction | 49.2% | 40.4% | 1.22×
Violence/aggression | 10.3% | 8.6% | 1.20×
Legal issues | 6.1% | 5.2% | 1.17×
Family discord | 32.3% | 34.2% | 0.94×
Use despite harm | 20.4% | 22.1% | 0.92×
Loss of control | 42.6% | 48.6% | 0.88×
Tolerance | 53.4% | 61.3% | 0.87×

Interestingly, standard dependence markers like "loss of control" and "tolerance" are more common in OP notes — possibly because OP clinicians document them more thoroughly in structured assessments, while IP notes focus on acute management.


Can these features alone classify IP vs OP?

Strong signal features: opioids, paranoia, hallucinations, memory, social withdrawal
Moderate signal: multi-substance, nausea/vomiting, delusional thinking
Weak/inverted signal: alcohol, tobacco, tolerance, loss of control (LOC), craving, withdrawal

Based on feature ratios alone, a rule-based classifier would likely achieve only moderate performance. The strongest predictor combination is opioid mention + hallucinations + paranoia + high relapse count. Lexical features alone are estimated to reach 65–72% accuracy, which is at most a few points above the 65.8% obtained by always predicting OP; the graph structure (similarity edges) is expected to provide the key additional signal for GNN-based models.
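A rule-based classifier of the kind described here could be as simple as counting how many of the strong-signal indicators a note shows. The sketch below is hypothetical: the feature dictionary layout, the "5+ relapse mentions" cut-off for a high relapse count, and the `min_hits` threshold are all assumptions, and no accuracy is claimed for it.

```python
def rule_based_label(features, min_hits=2):
    """Predict 'IP' when at least `min_hits` strong-signal indicators are present.

    `features` is a hypothetical per-note dict, for example:
    {"opioids": True, "hallucinations": False, "paranoia": True,
     "relapse_mentions": 6}
    """
    hits = sum(
        [
            bool(features.get("opioids", False)),
            bool(features.get("hallucinations", False)),
            bool(features.get("paranoia", False)),
            features.get("relapse_mentions", 0) >= 5,  # assumed "high" cut-off
        ]
    )
    return "IP" if hits >= min_hits else "OP"


print(rule_based_label({"opioids": True, "relapse_mentions": 7}))
print(rule_based_label({"hallucinations": True}))
```

Defaulting to OP mirrors the class imbalance: a rule only helps if the notes it flags are IP more often than not.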

19 · Masking analysis — does it reduce leakage?

Masked entity types (combined)

Person 14,444 · company 8,746 · address 6,414 · dates 2,748 · groups 300 · languages 225

Masking statistics by class

Metric | IP | OP
Avg masked tokens / note | 5.50 | 4.03
Total masked tokens | 14,360 | 20,226
Notes with any masking | 93.4% | 92.1%
Treatment_decision tokens / note | 5.42 | 3.16
Notes with Treatment_decision | 90.8% | 81.9%

Person, address, company, and date masking successfully removes patient/clinician identifiers — preventing memorisation of specific individuals or institutions as IP/OP labels.

Leakage risk: Treatment_decision tokens (e.g. Treatment_decision1–10) are 71% more frequent in IP notes on average. A model can trivially detect this pattern as a proxy for admission severity.


What the masking covers vs what it misses

Covered (low leakage risk):

Person names · Addresses · Company/hospital names · Specific dates · Languages · Group identifiers

Not fully masked (residual leakage risk):

Treatment decision count · Note length · Multi-substance count · Admission-specific language · Ward/unit references · Discharge planning phrases

The masking strategy is robust for PII/PHI removal. However, structural leakage remains: IP notes are longer, contain more Treatment_decision placeholders, and use admission-specific vocabulary (discharge, ward, detox unit). A model can learn these structural cues even without entity names. To fully eliminate leakage, Treatment_decision tokens should be masked uniformly, and note length normalisation should be considered.
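One possible reading of "masked uniformly" plus length normalisation is sketched below: every numbered Treatment_decision placeholder is replaced by a single generic token that is kept at most once per note, and the note is then truncated to a fixed word budget. The `[TREATMENT]` token, the function name, and the 300-word budget are assumptions for illustration; this is not the masking pipeline used to build the dataset.

```python
import re

# Matches Treatment_decision, Treatment_decision1, Treatment_decision10, ...
TD_PATTERN = re.compile(r"\bTreatment_decision\d*\b")


def neutralise_structure(text, max_words=300):
    """Collapse Treatment_decision placeholders and cap the note length."""
    seen = False

    def _replace(match):
        nonlocal seen
        if seen:
            return ""            # drop repeats so the count no longer leaks
        seen = True
        return "[TREATMENT]"     # generic token without the numeric suffix

    collapsed = TD_PATTERN.sub(_replace, text)
    words = collapsed.split()    # also normalises whitespace left by removals
    return " ".join(words[:max_words])


raw = "Plan: Treatment_decision1. Then Treatment_decision2 and review."
print(neutralise_structure(raw))
```

Keeping one token still reveals whether a note contains any treatment decision (90.8% of IP vs 81.9% of OP), so removing the placeholder entirely would be the stricter option. Truncation is only one form of length normalisation and discards content from long notes.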

Graph structure — edge weight & connectivity

Total edges: 190,092 (both files combined)
Edge weight range: 0.80–1.00 (mean = 0.821)
Mean node degree: 67.3 (max = 776)
Same-label edges: 62.6% (IP-IP: 6.9% · OP-OP: 55.7%)

Edge weight distribution

Figure: 91% of edges have weights in 0.80–0.85; 8.5% in 0.85–0.90

Cross-label connectivity

Edge type | Count | %
OP-OP | 105,847 | 55.7%
IP-OP (cross) | 71,141 | 37.4%
IP-IP | 13,104 | 6.9%

37.4% of edges are cross-label (IP ↔ OP). This high cross-label similarity is expected given that all patients are SUD cases with overlapping symptom language, but it leaves the graph only moderately homophilous (62.6% same-label edges), which is a challenging setting for GNN classifiers.
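Edge homophily (the share of edges joining same-label nodes) can be compared with what label-blind wiring would give. The sketch below uses the edge counts and class sizes reported in this document. The baseline assumes both endpoints of an edge are drawn independently in proportion to class size, which ignores degree differences between classes, so it is a rough reference point only.

```python
# Edge counts by label pair, from the cross-label connectivity table.
edge_counts = {"OP-OP": 105_847, "IP-OP": 71_141, "IP-IP": 13_104}
n_ip, n_op = 2609, 5019

total_edges = sum(edge_counts.values())
edge_homophily = (edge_counts["OP-OP"] + edge_counts["IP-IP"]) / total_edges

# Expected same-label share if endpoints were sampled by class proportion only.
p_ip = n_ip / (n_ip + n_op)
random_baseline = p_ip**2 + (1 - p_ip) ** 2

print(f"edge homophily  = {edge_homophily:.3f}")
print(f"random baseline = {random_baseline:.3f}")
print(f"excess          = {edge_homophily - random_baseline:.3f}")
```

Under this simplified baseline the graph is about 8 percentage points more homophilous than chance, so the similarity edges carry some label signal, but far less than in strongly homophilous benchmark graphs.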

Analysis performed on same_graph_test_bothmasked.json (2,640 nodes, 69,600 edges) and same_graph_train_bothmasked.json (4,988 nodes, 120,492 edges)