Clinical NLP · Graph Dataset · Substance Use Disorder

Dataset Analysis Report

same_graph_test_bothmasked.json & same_graph_train_bothmasked.json — Full combined analysis

Total patients

7,628

Test: 2,640 · Train: 4,988

Total edges

190,092

Similarity graph connections

Total words

2.80M

Across all clinical notes

Total chars

19.4M

Across all clinical notes

1 · Corpus overview & label distribution

Overall label split (combined)

Inpatient (IP): 2,609 (34.2%)
Outpatient (OP): 5,019 (65.8%)

[Chart: label split across train / test files (series: IP, OP)]

The dataset is class-imbalanced at roughly 1:2 (IP:OP). The ratio is similar in both splits (33.6% IP in train, 35.3% in test), which is consistent with a stratified partition.
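The imbalance sets a floor for any classifier: always predicting the majority class is already right about two times in three. The sketch below derives that baseline from the reported counts and also computes inverse-frequency class weights. The weighting scheme is an illustrative assumption about how one might compensate for the imbalance, not something the dataset prescribes.

```python
# Class counts as reported for the combined dataset.
counts = {"IP": 2609, "OP": 5019}
total = sum(counts.values())

# Share of each class, and the accuracy of always predicting the majority class.
shares = {label: n / total for label, n in counts.items()}
majority_baseline = max(shares.values())

# Inverse-frequency ("balanced") weights: total / (n_classes * class_count).
# Illustrative assumption only; the report does not say how models were trained.
weights = {label: total / (len(counts) * n) for label, n in counts.items()}

print({k: round(v, 3) for k, v in shares.items()})
print("majority baseline:", round(majority_baseline, 3))
print({k: round(v, 2) for k, v in weights.items()})
```

Any reported accuracy is easier to interpret when read against this majority baseline of about 65.8%.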

2 · Note length — tokens, words & sentences

Combined dataset summary statistics

Avg words / note

367.2

σ = 253.8

Median words

309

Long right tail

Avg sentences

48.3

σ = 35.7

Avg chars / note

2,545

σ = 1,760

[Chart: word count distribution, all notes]

[Chart: sentence count distribution, all notes]

3 & 4 · IP vs OP — length & severity

IP avg words

438.1

σ = 318.0 · median = 355

OP avg words

330.3

σ = 203.3 · median = 292

Length ratio IP:OP

1.33×

IP notes are 33% longer on avg

[Chart: word count by label, bucketed (series: IP, OP)]

[Chart: severity score distribution (series: IP, OP)]

Summary: IP vs OP length & clinical stats

Metric	IP (n=2,609)	OP (n=5,019)
Mean words	438.1	330.3
Median words	355	292
Std words	318.0	203.3
Mean sentences	58.5	43.0
Mean chars	3,053	2,281
Max words (note)	2,609	2,501
Avg severity score	8.03	6.27
Score ≥ 10 (%)	33.4%	19.1%
Score ≥ 15 (%)	12.0%	4.4%
Avg Treatment_decision tokens	5.42	3.16

Train vs test file characteristics

Metric	Train	Test
Nodes	4,988	2,640
Edges	120,492	69,600
IP %	33.6%	35.3%
OP %	66.4%	64.7%
Mean words	338.5	421.3
Median words	291	348
Mean sentences	44.1	56.2
Mean chars	2,325	2,961

Test set notes are notably longer on average (421 vs 339 words). This may reflect more complex multi-visit collations, and could affect model behaviour at inference time.
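As a consistency check, the per-split means can be pooled back into the combined figure. The sketch below weights each split's mean by its node count and also expresses the test/train gap as a ratio. It uses only numbers from the table above; the helper name `pooled_mean` is introduced here for illustration.

```python
# (node count, mean words per note) for each split, from the table above.
splits = {"train": (4988, 338.5), "test": (2640, 421.3)}

def pooled_mean(groups):
    """Weighted mean of group means, weighting each group by its size."""
    total_n = sum(n for n, _ in groups.values())
    return sum(n * mean for n, mean in groups.values()) / total_n

combined = pooled_mean(splits)
ratio = splits["test"][1] / splits["train"][1]

print("pooled mean words:", round(combined, 1))
print("test/train length ratio:", round(ratio, 2))
```

The pooled value agrees with the 367.2 words per note reported for the combined dataset, and the ratio puts test notes at roughly 24% longer than train notes on average.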

IP notes are consistently longer, denser, and higher-severity than OP notes. This is clinically expected: inpatient cases involve more events (admission, detox, stabilisation), multiple clinicians, and higher acuity — all reflected in longer collated notes.

5 & 6 · IP vs OP — clinical patterns & severity indicators

[Chart: pattern strength ratio (IP% ÷ OP%) per feature, sorted by ratio descending. A ratio > 1.0 means the feature is more prevalent in IP.]

Paranoia (1.78×), memory issues (1.67×), hallucinations (1.57×), nausea/vomiting (1.44×), and irritability (1.30×) are the strongest IP-associated symptom patterns. They likely reflect acute psychiatric and neurological complications that require inpatient management.

7 · Borderline & ambiguous cases

Very short IP notes

1.9% of IP — <50 words

Very long OP notes

260

5.2% of OP — >700 words

Low-severity IP

341

13.1% — score ≤ 2

High-severity OP

555

11.1% — score ≥ 12

An estimated 10–13% of notes may constitute "borderline" cases — IP notes with minimal clinical documentation or OP notes with complex, high-severity presentations. These represent real-world label ambiguity and will be the hardest cases for any classifier.
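The four borderline criteria above can be expressed as a small flagging function. This is a minimal sketch: the `Note` fields (`label`, `n_words`, `severity`) are hypothetical names, since the report does not describe the JSON schema, and the thresholds are the ones quoted in the cards above.

```python
from dataclasses import dataclass

@dataclass
class Note:
    label: str      # "IP" or "OP" (hypothetical field name)
    n_words: int    # word count of the collated note
    severity: int   # severity score as used in this report

def borderline_reasons(note: Note) -> list[str]:
    """Return the borderline criteria a note meets (empty list if none)."""
    reasons = []
    if note.label == "IP":
        if note.n_words < 50:
            reasons.append("very short IP note")
        if note.severity <= 2:
            reasons.append("low-severity IP")
    elif note.label == "OP":
        if note.n_words > 700:
            reasons.append("very long OP note")
        if note.severity >= 12:
            reasons.append("high-severity OP")
    return reasons

examples = [Note("IP", 42, 1), Note("OP", 820, 6), Note("OP", 300, 5)]
for example in examples:
    print(example, borderline_reasons(example))
```

Flagging these notes separately makes it possible to report classifier performance on the borderline subset, where errors are expected to concentrate.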

Example short IP note (50 words)

"40-year-old married gentleman presented with history of alcohol and tobacco dependence syndrome for the past 17 years. Presented with withdrawal hallucinosis, multimodal in nature. Pulse rate is 112/min. Coarse tremors present. Plan: Treatment_decision1. Once the patient is stable, discharge..."

Severity overlap zone

At severity score = 8 (IP median), both classes overlap heavily:

Score threshold	IP at or above (%)	OP at or above (%)
≥ 5	71.9%	61.3%
≥ 10	33.4%	19.1%
≥ 15	12.0%	4.4%
≥ 20	3.6%	0.7%
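One way to quantify the overlap is to treat each row as a decision rule, "predict IP if score ≥ threshold", so that the IP column is the true positive rate and the OP column is the false positive rate. The sketch below computes accuracy and Youden's J for each rule from the rounded percentages in the table, so the results are approximate.

```python
N_IP, N_OP = 2609, 5019

# Fraction of each class at or above each severity threshold (from the table above).
at_or_above = {
    5: (0.719, 0.613),
    10: (0.334, 0.191),
    15: (0.120, 0.044),
    20: (0.036, 0.007),
}

def rule_metrics(tpr, fpr):
    """Accuracy and Youden's J for the rule 'predict IP if score >= threshold'."""
    true_pos = tpr * N_IP
    true_neg = (1.0 - fpr) * N_OP
    accuracy = (true_pos + true_neg) / (N_IP + N_OP)
    return accuracy, tpr - fpr

results = {t: rule_metrics(*rates) for t, rates in at_or_above.items()}
for threshold, (accuracy, youden_j) in results.items():
    print(f">= {threshold}: accuracy={accuracy:.3f}, J={youden_j:.3f}")
```

With these figures, the best single-threshold accuracy is about 67% (at ≥ 15), barely above the 65.8% obtained by always predicting OP, and J peaks at 0.143 (at ≥ 10). Severity score alone therefore separates the classes only weakly.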

8 · Substance mention analysis

[Chart: substance prevalence, all notes (series: IP %, OP %)]

What the data reveals

Substance	All	IP%	OP%	IP ratio
Alcohol	7,581	99.0%	99.6%	0.99×
Tobacco/nicotine	6,467	83.7%	85.3%	0.98×
Benzodiazepines	824	13.9%	9.2%	1.51×
Cannabis	623	13.3%	5.5%	2.42×
Stimulants	290	6.1%	2.6%	2.35×
Opioids	205	5.9%	1.0%	5.90×
Sedatives	89	2.1%	0.7%	3.00×
Inhalants	54	1.3%	0.4%	3.25×

Opioid mention is the single strongest individual substance predictor of IP admission (5.9× more common in IP), although it appears in only 5.9% of IP notes. Inhalants (3.3×), sedatives (3.0×), cannabis (2.4×), and stimulants (2.4×) also discriminate strongly. Alcohol (over 99% of notes) and tobacco (about 85%) are too common in both classes to be informative for classification.
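The "IP ratio" column is simply IP prevalence divided by OP prevalence. The sketch below recomputes it from the table and adds the absolute difference in percentage points, an extra measure introduced here because ratios for rare features can look large while affecting few notes. Inputs are the rounded percentages, so the last digit may differ from the table.

```python
# (IP %, OP %) prevalence per substance, from the table above.
prevalence = {
    "Alcohol": (99.0, 99.6),
    "Tobacco/nicotine": (83.7, 85.3),
    "Benzodiazepines": (13.9, 9.2),
    "Cannabis": (13.3, 5.5),
    "Stimulants": (6.1, 2.6),
    "Opioids": (5.9, 1.0),
    "Sedatives": (2.1, 0.7),
    "Inhalants": (1.3, 0.4),
}

# Prevalence ratio (IP% / OP%) and absolute difference in percentage points.
ratios = {name: ip / op for name, (ip, op) in prevalence.items()}
differences = {name: ip - op for name, (ip, op) in prevalence.items()}

ranked = sorted(ratios, key=ratios.get, reverse=True)
for name in ranked:
    print(f"{name:18s} ratio={ratios[name]:.2f}x  diff={differences[name]:+.1f} pp")
```

Opioids rank first by ratio, while cannabis shows the largest absolute gap (7.8 percentage points), so the two measures highlight different features.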

9 · Duration of use & quantity patterns

IP avg daily quantity

15.4 units

median = 15, max = 72 units/day

OP avg daily quantity

15.1 units

median = 12, max = 96 units/day

Quantity mentions extracted

7,196

IP: 2,741 · OP: 4,455

Duration of use (months, from text)

Metric	IP (n=495)	OP (n=842)
Mean duration	93.2 months	126.2 months
Median duration	48 months (4y)	96 months (8y)

OP patients show longer documented durations of use (median 8y vs 4y for IP), based on the 495 IP and 842 OP notes from which a duration could be extracted. This likely reflects that OP notes accumulate more longitudinal history, while IP notes focus on acute presentation. Duration alone is not a reliable IP predictor.

[Chart: quantity (units/day) distribution (series: IP, OP)]

10 · Multi-substance co-use

[Chart: number of substances co-mentioned per note (series: IP, OP)]

Top substance co-occurrence pairs

Pair	Count
Alcohol + Tobacco	6,462
Alcohol + Benzodiazepines	819
Benzodiazepines + Tobacco	716
Alcohol + Cannabis	622
Cannabis + Tobacco	605
Alcohol + Stimulants	289
Stimulants + Tobacco	262
Alcohol + Opioids	204
Opioids + Tobacco	192
Cannabis + Stimulants	136

IP notes mention four or more substances roughly three times as often as OP notes (8.5% of IP vs 2.7% of OP). This polysubstance pattern is a strong indicator of admission complexity.
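Per-note substance counts and pair counts of this kind can be produced with keyword matching. The sketch below is an assumption about the general approach, not the method used for this report: the lexicon is hypothetical and deliberately tiny, and plain keyword matching ignores negation ("denies cannabis use" would still count).

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical keyword lexicon; the report does not state which terms were matched.
LEXICON = {
    "Alcohol": r"\balcohol\b|\bwhisky\b|\bbeer\b",
    "Tobacco": r"\btobacco\b|\bnicotine\b|\bcigarettes?\b",
    "Cannabis": r"\bcannabis\b|\bganja\b",
    "Opioids": r"\bopioids?\b|\bheroin\b|\btramadol\b",
}
PATTERNS = {name: re.compile(rx, re.IGNORECASE) for name, rx in LEXICON.items()}

def substances_in(note: str) -> set[str]:
    """Substances with at least one keyword hit in the note (negation is ignored)."""
    return {name for name, pattern in PATTERNS.items() if pattern.search(note)}

def pair_counts(notes) -> Counter:
    """Count notes in which each unordered pair of substances co-occurs."""
    counts = Counter()
    for note in notes:
        counts.update(combinations(sorted(substances_in(note)), 2))
    return counts

notes = [
    "History of alcohol and tobacco dependence for 17 years.",
    "Uses cannabis daily, also alcohol and cigarettes.",
    "Tramadol misuse alongside alcohol.",
]
print([len(substances_in(n)) for n in notes])
print(pair_counts(notes).most_common(3))
```

Sorting the substances before calling `combinations` keeps each pair in a single canonical order, so "Alcohol + Tobacco" and "Tobacco + Alcohol" are never counted separately.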

11 · Symptom analysis & co-occurrence

[Chart: symptom prevalence across cohort (series: IP %, OP %)]

Top symptom co-occurrences (all patients)

Symptom pair	Count
Craving + Withdrawal	5,459
Tremors + Withdrawal	4,001
Craving + Tremors	3,758
Sleep disturbance + Withdrawal	2,687
Seizures + Withdrawal	2,677
Craving + Sleep disturbance	2,598
Craving + Seizures	2,407
Sleep disturbance + Tremors	2,094
Seizures + Tremors	1,891
Irritability + Withdrawal	1,829
Craving + Irritability	1,784
Anxiety + Withdrawal	1,561
Anxiety + Craving	1,534
Depression + Withdrawal	1,282

12 · Which symptoms are predictive of IP admission?

Most predictive (IP-enriched): Paranoia (1.78×), memory/blackout issues (1.67×), auditory/visual hallucinations (1.57×), and nausea/vomiting (1.44×) are the strongest individual symptom predictors of inpatient admission.

Near-universal (non-discriminating): Withdrawal (85%), craving (79%), and tremors (58%) are so prevalent across both classes that they add little discriminative signal on their own. Their combinations matter more.

13–16 · Temporal patterns, relapse, & event sequences

Avg relapse mentions

2.19

σ = 3.54 — all patients

IP relapse avg

2.85

σ = 4.40

OP relapse avg

1.85

σ = 2.94

Full event sequences

1,006

13.2% of all notes

[Chart: relapse mention frequency (series: IP, OP)]

Temporal pattern notes

Pattern	IP	OP
Notes with 0 relapses	47.7%	53.0%
Notes with 5+ relapses	22.0%	13.1%
Avg abstinence interval	6.1 days	6.0 days
Abstinence mentions (n)	1,195	1,404
Full sequence notes	~13%	~13%

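IP notes average 54% more relapse mentions than OP notes (2.85 vs 1.85), and the share of notes with 5+ relapse mentions is about 68% higher in IP (22.0% vs 13.1%). Both figures are consistent with more severe, cyclical SUD patterns requiring inpatient intervention.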

Canonical event sequence identified (in 1,006 notes, 13.2%)

Abstinence → Relapse / Lapse → Detoxification → Follow-up / Review

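The abstinence → relapse → detox → follow-up cycle is the canonical clinical trajectory identified in the dataset, although the complete cycle appears in only 13.2% of notes. Notes encoding the full cycle tend to be considerably longer (avg ~500+ words), as expected for complex multi-visit cases. The table above shows the full sequence in roughly 13% of both IP and OP notes, so its presence alone does not separate the two classes.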
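Detecting the full cycle amounts to checking that cues for the four stages occur in order within a note. The sketch below does this with a left-to-right regex scan. The cue patterns are hypothetical, since the report does not list the phrases it matched, and a word such as "review" can appear in unrelated contexts, so a real extractor would need a more careful lexicon.

```python
import re

# Hypothetical cue patterns for each stage, in the canonical order.
STAGES = [
    ("abstinence", re.compile(r"\babstinen(?:t|ce)\b", re.IGNORECASE)),
    ("relapse", re.compile(r"\b(?:re)?laps(?:e|ed|es)\b", re.IGNORECASE)),
    ("detox", re.compile(r"\bdetox(?:ification)?\b", re.IGNORECASE)),
    ("follow_up", re.compile(r"\bfollow[- ]?up\b|\breview\b", re.IGNORECASE)),
]

def has_full_sequence(text: str) -> bool:
    """True if cues for all four stages occur in order somewhere in the text."""
    position = 0
    for _, pattern in STAGES:
        # Search only after the end of the previous stage's match.
        match = pattern.search(text, position)
        if match is None:
            return False
        position = match.end()
    return True

in_order = ("Maintained abstinence for 3 months, then relapsed. "
            "Admitted for detoxification. Advised follow-up in 2 weeks.")
out_of_order = "Admitted for detox after relapse; abstinent since."

print(has_full_sequence(in_order))
print(has_full_sequence(out_of_order))
```

Taking the earliest match for each stage is sufficient here: if any ordered set of matches exists, the greedy left-to-right scan will find one.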

17–18 · Behavioral indicators & classification potential

[Chart: behavioral indicator prevalence (series: IP %, OP %)]

Feature discriminability table

Feature	IP%	OP%	Ratio
Social withdrawal	4.2%	2.1%	2.00×
Delusional thinking	12.0%	7.2%	1.68×
Socio-occupational dysfunction	49.2%	40.4%	1.22×
Violence/aggression	10.3%	8.6%	1.20×
Legal issues	6.1%	5.2%	1.17×
Family discord	32.3%	34.2%	0.94×
Use despite harm	20.4%	22.1%	0.92×
Loss of control	42.6%	48.6%	0.88×
Tolerance	53.4%	61.3%	0.87×

Interestingly, standard dependence markers like "loss of control" and "tolerance" are more common in OP notes — possibly because OP clinicians document them more thoroughly in structured assessments, while IP notes focus on acute management.

Can these features alone classify IP vs OP?

Strong signal features

Opioids, paranoia, hallucinations, memory, social withdrawal

Moderate signal

Multi-substance, nausea/vomiting, delusional thinking

Weak/inverted signal

Alcohol, tobacco, tolerance, loss of control (LOC), craving, withdrawal

Based on feature ratios alone, a rule-based classifier would achieve moderate performance. The strongest combination of predictors is likely opioid mention + hallucinations + paranoia + high relapse count. Lexical features alone likely achieve 65–72% accuracy. For reference, always predicting OP already yields 65.8% accuracy, so the lower end of that range is no better than the majority baseline. The graph structure (similarity edges) provides the key additional signal for GNN-based models.
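To make "classification from feature ratios" concrete, the sketch below combines a few of the reported prevalences in a naive Bayes style score. It assumes the features are conditionally independent given the label, which is almost certainly not true for clinical notes, so the output is a rough illustration and not a calibrated probability. The feature keys are names chosen here; the prevalences come from the tables in this report.

```python
import math

PRIOR_IP = 2609 / 7628  # share of IP notes in the combined dataset

# (prevalence in IP, prevalence in OP) as fractions, from the tables in this report.
FEATURES = {
    "opioids": (0.059, 0.010),
    "social_withdrawal": (0.042, 0.021),
    "delusional_thinking": (0.120, 0.072),
    "tolerance": (0.534, 0.613),
}

def ip_probability(present: set[str]) -> float:
    """Naive Bayes estimate of P(IP | features), assuming independent features."""
    log_odds = math.log(PRIOR_IP / (1.0 - PRIOR_IP))
    for name, (p_ip, p_op) in FEATURES.items():
        if name in present:
            log_odds += math.log(p_ip / p_op)
        else:
            log_odds += math.log((1.0 - p_ip) / (1.0 - p_op))
    return 1.0 / (1.0 + math.exp(-log_odds))

for case in (set(), {"tolerance"}, {"opioids"}, {"opioids", "delusional_thinking"}):
    print(sorted(case), round(ip_probability(case), 3))
```

In this sketch an opioid mention is enough to push the estimate above 0.5, whereas a tolerance mention lowers it slightly, matching the "inverted signal" grouping above.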

19 · Masking analysis — does it reduce leakage?

[Chart: masked entity types (combined)]

Masking statistics by class

Metric	IP	OP
Avg masked tokens / note	5.50	4.03
Total masked tokens	14,360	20,226
Notes with any masking	93.4%	92.1%
Treatment_decision tokens / note	5.42	3.16
Notes with Treatment_decision	90.8%	81.9%

Person, address, company, and date masking successfully removes patient and clinician identifiers, which prevents a model from memorising specific individuals or institutions as proxies for the IP/OP label.

Leakage risk: Treatment_decision tokens (e.g. Treatment_decision1–10) are 71% more frequent in IP notes on average (5.42 vs 3.16 per note). A model can trivially detect this pattern as a proxy for admission severity.

What the masking covers vs what it misses

Covered (low leakage risk):

Person names, addresses, company/hospital names, specific dates, languages, group identifiers

Not fully masked (residual leakage risk):

Treatment decision count, note length, multi-substance count, admission-specific language, ward/unit references, discharge planning phrases

The masking strategy is robust for PII/PHI removal. However, structural leakage remains: IP notes are longer, contain more Treatment_decision placeholders, and use admission-specific vocabulary (discharge, ward, detox unit). A model can learn these structural cues even without entity names. To fully eliminate leakage, Treatment_decision tokens should be masked uniformly, and note length normalisation should be considered.
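The Treatment_decision leakage can be measured and partly reduced with simple pattern matching. The sketch below counts the placeholders in a note and strips their numeric suffix, which otherwise encodes a running count. The pattern is inferred from the examples in this report (Treatment_decision1 to Treatment_decision10), and stripping the suffix is only one possible reading of "masked uniformly": the number of placeholders per note stays visible afterwards.

```python
import re

# Placeholder format inferred from the report's examples, e.g. "Treatment_decision1".
TD_PATTERN = re.compile(r"\bTreatment_decision\d+\b")

def count_treatment_tokens(text: str) -> int:
    """Number of numbered Treatment_decision placeholders in the text."""
    return len(TD_PATTERN.findall(text))

def strip_treatment_index(text: str) -> str:
    """Replace every numbered placeholder with the same unnumbered token."""
    return TD_PATTERN.sub("Treatment_decision", text)

note = "Plan: Treatment_decision1. Later Treatment_decision10 given."
print(count_treatment_tokens(note))
print(strip_treatment_index(note))
```

Comparing the distribution of `count_treatment_tokens` between IP and OP notes is a quick way to confirm whether the count still acts as a label proxy after any re-masking.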

Graph structure — edge weight & connectivity

Total edges

190,092

Both files combined

Edge weight range

0.80–1.00

mean = 0.821

Mean node degree

67.3

max = 776

Same-label edges

62.6%

IP-IP: 6.9% · OP-OP: 55.7%

[Chart: edge weight distribution]

Cross-label connectivity

Edge type	Count	%
OP — OP	105,847	55.7%
IP — OP (cross)	71,141	37.4%
IP — IP	13,104	6.9%

37.4% of edges are cross-label (IP ↔ OP). This high cross-label similarity is expected given that all patients are SUD cases with overlapping symptom language, but it leaves the graph only moderately homophilous (62.6% same-label edges), which is challenging for GNN classifiers that rely on neighbours sharing a label.
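The same-label share can be read as an edge homophily ratio and compared with what label proportions alone would produce. The sketch below computes both from the counts in the table. The baseline assumes edges connect nodes at random according to class share and ignores degree differences between classes, so it is only a rough reference point.

```python
# Edge counts by label pair and node counts by label, as reported above.
edge_counts = {("OP", "OP"): 105_847, ("IP", "OP"): 71_141, ("IP", "IP"): 13_104}
node_counts = {"IP": 2609, "OP": 5019}

total_edges = sum(edge_counts.values())
same_label_edges = sum(n for (a, b), n in edge_counts.items() if a == b)

# Edge homophily: fraction of edges whose endpoints share a label.
edge_homophily = same_label_edges / total_edges

# Rough baseline: probability that two nodes drawn by class share have the same label.
n_nodes = sum(node_counts.values())
random_baseline = sum((n / n_nodes) ** 2 for n in node_counts.values())

print("edge homophily:", round(edge_homophily, 3))
print("random-mixing baseline:", round(random_baseline, 3))
```

Under this rough baseline, about 55% of edges would be same-label by chance, so the observed 62.6% indicates only a modest preference for same-label neighbours.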

Analysis performed on same_graph_test_bothmasked.json (2,640 nodes, 69,600 edges) and same_graph_train_bothmasked.json (4,988 nodes, 120,492 edges)