Tilelli Med · PrimeKG follow-up — addressing the "biokg is old" feedback

Why we ran this

The 10 May write-up reported a strong ternary result on OGBL-biokg, the biomedical link-prediction dataset in Stanford's Open Graph Benchmark. Its underlying sources date to 2016. A reviewer flagged the benchmark as dated and asked whether the result holds on a newer graph.

PrimeKG is the de-facto modern replacement. It was assembled by Marinka Zitnik's group at Harvard in 2022–2023, with relations explicitly designed for precision-medicine work: indication, contraindication, off-label use, drug_drug (synergistic interactions), drug_protein, disease_phenotype_positive/negative, pathway_protein, anatomy_protein_present, exposure_disease, and 21 others.

The headline table

Filtered MRR on the canonical 10,000-triple held-out test split, with 500 type-constrained negatives per query (the same protocol we used on OGBL-biokg).

Model	Test MRR	H@10	Storage	Compression vs float
Random init — float teacher	0.2884	0.502	506 MB	1×
Random init — ternary B=256	0.2939	0.506	158 MB	3.2×
Random init — ternary B=128	0.2933	0.502	95 MB	5.3×
Random init — ternary B=1	0.2935	0.500	32 MB	15.8×
SMILES warm — float teacher	0.2903	0.503	506 MB	1×
SMILES warm — ternary B=128	0.2972	0.507	95 MB	5.3×
SMILES warm — ternary B=1	0.2946	0.501	32 MB	15.8×
SMILES warm (seed 43) — float	0.2911	0.507	506 MB	1×
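
The protocol behind this table (filtered ranking against type-constrained negatives) can be sketched as follows. This is a minimal illustration, not the evaluation code used for the runs: the function names, the toy ids, and the relation label are hypothetical, and splitting ties evenly is a choice made for the sketch.

```python
import random

def sample_typed_negatives(head, rel, true_tail, tails_of_type, known_triples, k, rng):
    """Draw up to k corrupted tails of the same node type as the true tail.

    Candidates that would form a known true triple are skipped; that is the
    "filtered" part of filtered MRR.
    """
    pool = [t for t in tails_of_type
            if t != true_tail and (head, rel, t) not in known_triples]
    return rng.sample(pool, min(k, len(pool)))

def reciprocal_rank(true_score, negative_scores):
    """Reciprocal rank of the true tail among itself plus its negatives.

    Ties are split evenly between the optimistic and pessimistic rank
    (an assumption of this sketch).
    """
    higher = sum(s > true_score for s in negative_scores)
    ties = sum(s == true_score for s in negative_scores)
    return 1.0 / (1 + higher + 0.5 * ties)

# Toy query: all ids below are made up for illustration.
rng = random.Random(0)
drug_ids = [f"drug_{i}" for i in range(1000)]
known = {("disease_0", "treated_by", "drug_3"),
         ("disease_0", "treated_by", "drug_7")}
negatives = sample_typed_negatives(
    "disease_0", "treated_by", "drug_3", drug_ids, known, 500, rng)
```

Test MRR is then the mean of `reciprocal_rank` over all test queries, and H@10 is the fraction of queries whose rank is 10 or better.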

The two real findings

★ Finding 1 — replicated

The ternary student beats its float teacher.

At every compression level (3.2× through 15.8×), the ternary student outperforms its float teacher. The best run is SMILES warm B=128 (5.3× compression) at 0.2972 MRR, against 0.2903 for its float teacher.
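
The claim can be checked directly against the headline table. The short script below recomputes each student's MRR gain over its own teacher and the compression ratio from the storage column; the numbers are copied from the table, and the dictionary layout is just a convenience for this sketch.

```python
FLOAT_MB = 506  # storage of the float teacher, from the headline table

# Test MRR and storage (MB) per run, copied from the headline table.
runs = {
    "random": {"teacher": 0.2884,
               "students": {"B=256": (0.2939, 158),
                            "B=128": (0.2933, 95),
                            "B=1": (0.2935, 32)}},
    "warm": {"teacher": 0.2903,
             "students": {"B=128": (0.2972, 95),
                          "B=1": (0.2946, 32)}},
}

def student_gains(runs):
    """Return (init, student, MRR gain over teacher, compression ratio) rows."""
    rows = []
    for init, run in runs.items():
        for name, (mrr, megabytes) in run["students"].items():
            gain = round(mrr - run["teacher"], 4)
            ratio = round(FLOAT_MB / megabytes, 1)
            rows.append((init, name, gain, ratio))
    return rows

for row in student_gains(runs):
    print(row)
```

Every gain comes out positive, ranging from +0.0043 (warm B=1) to +0.0069 (warm B=128).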

This is the same effect we observed on OGBL-biokg (10 May write-up). We attribute it to the same mechanism: the float teacher mildly overfits (best validation MRR is at epoch 5; epoch 25 is lower), and the ternary quantization step acts as a regularizer that recovers generalization. The result now holds across two benchmarks whose source data were assembled roughly seven years apart.
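
To make the regularization argument concrete, here is a minimal sketch of one common ternarization scheme. This write-up does not spell out the quantizer we used, so the threshold rule below (zero out weights whose magnitude is under 0.7 times the row's mean magnitude, then share one scale across the survivors) is an assumption swapped in for illustration; `ternarize`, `dequantize`, and `threshold_factor` are hypothetical names.

```python
def ternarize(row, threshold_factor=0.7):
    """Map a row of floats to codes in {-1, 0, +1} plus one shared scale.

    Threshold-based ternarization, used here as an illustrative stand-in:
    small weights collapse to 0, the rest keep only their sign.
    """
    mean_abs = sum(abs(w) for w in row) / len(row)
    delta = threshold_factor * mean_abs
    codes = [1 if w > delta else -1 if w < -delta else 0 for w in row]
    kept = [abs(w) for w, c in zip(row, codes) if c != 0]
    scale = sum(kept) / len(kept) if kept else 0.0
    return codes, scale

def dequantize(codes, scale):
    """Reconstruct the approximate float row from codes and scale."""
    return [c * scale for c in codes]

codes, scale = ternarize([0.9, -0.05, 0.4, -0.8, 0.02])
approx = dequantize(codes, scale)
```

In this toy row the two smallest weights are dropped and the remaining three share a single magnitude. Discarding fine-grained weight detail in this way is the kind of capacity reduction that could plausibly act as a regularizer.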

○ Finding 2 — modest, replicated

The chemistry warm-start gives a small lift.

Initialising the drug rows of the entity table from Morgan fingerprints (radius 2, 2048 bits → PCA-512) improves aggregate float MRR by +0.0023: 0.2884 with random init versus 0.2907 with the warm start, the latter being the mean of two seeds (0.2903 and 0.2911). 60.4% of PrimeKG drugs were warm-started; the rest stayed random-init.
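
The warm-start step itself is simple once the reduced fingerprint vectors exist. The sketch below shows only that last step, under stated assumptions: `warm_vectors` stands in for the PCA-reduced fingerprints (computing them is out of scope here), the Gaussian scale of 0.1 is arbitrary, and all names are hypothetical rather than taken from our training code.

```python
import random

def init_entity_table(num_entities, dim, warm_vectors, rng, scale=0.1):
    """Random-init every row, then overwrite rows that have a warm vector.

    warm_vectors maps entity index -> list of `dim` floats, for example a
    PCA-reduced fingerprint for a drug with a known structure.
    """
    table = [[rng.gauss(0.0, scale) for _ in range(dim)]
             for _ in range(num_entities)]
    for idx, vec in warm_vectors.items():
        if len(vec) != dim:
            raise ValueError(f"warm vector for entity {idx} has wrong length")
        table[idx] = list(vec)
    return table

def warm_coverage(drug_indices, warm_vectors):
    """Fraction of drug rows that received a warm start."""
    return sum(i in warm_vectors for i in drug_indices) / len(drug_indices)

# Toy example: 5 drugs, 2 of which have a fingerprint-derived vector.
rng = random.Random(0)
warm = {0: [1.0, 2.0], 2: [3.0, 4.0]}
table = init_entity_table(num_entities=5, dim=2, warm_vectors=warm, rng=rng)
coverage = warm_coverage([0, 1, 2, 3, 4], warm)
```

Drugs without a usable structure simply keep their random row, which is how a partial coverage figure such as 60.4% arises.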

The lift is concentrated in contraindication (H@10 +3 pp, 0.74 → 0.77) and is neutral on indication (already near-ceiling at MRR=0.80). Put plainly: the chemistry prior helps the model say "this drug should NOT be used for this disease" slightly more than "this drug should treat this disease."

Drug-repurposing eval (the actual usefulness signal)

For each held-out drug-disease edge in PrimeKG's test split, we score every drug as a candidate completion and report the rank of the gold drug. The candidate pool contains 7,957 drugs:

Relation	n test	Random / Float MRR	Random / Float H@10	Warm / Float MRR	Warm / Float H@10
indication	26	0.802	0.93	0.796	0.93
off-label use	6	1.000	1.00	0.667	1.00
contraindication	72	0.337	0.74	0.338	0.77

Read: for the 26 held-out indications, the gold drug typically lands at rank 1 or 2 out of 7,957 candidates (MRR=0.80, H@10=0.93). The model recovers the correct drug even though the (drug, indication, disease) edge was withheld from training.
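
For readers less familiar with the metrics, the ranking step and the two summary numbers can be sketched as below. The function names and the example ranks are made up for illustration; they are not the ranks from our runs.

```python
def gold_rank(scores, gold):
    """1-based rank of the gold drug.

    scores maps drug id -> model score for one (disease, relation) query,
    higher meaning more plausible. Ties are resolved in favour of the gold
    drug, a simplification for this sketch.
    """
    gold_score = scores[gold]
    return 1 + sum(s > gold_score for d, s in scores.items() if d != gold)

def mrr_and_hits(ranks, k=10):
    """Mean reciprocal rank and Hits@k over a list of 1-based ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= k for r in ranks) / len(ranks)
    return mrr, hits

# One toy query with three candidate drugs.
rank = gold_rank({"drug_a": 0.2, "drug_b": 0.9, "drug_c": 0.5}, gold="drug_c")

# Five hypothetical queries: four near the top, one far down the list.
mrr, hits_at_10 = mrr_and_hits([1, 1, 2, 1, 40])
```

Because MRR averages 1/rank, a single miss at rank 40 contributes almost nothing, so a value near 0.80 requires most gold drugs to sit at rank 1 or 2.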

Caveat: test sample sizes for the rare relations (indication n=26, off-label n=6) are small because PrimeKG itself has few of these edges, so the corresponding numbers should be read as indicative only. For off-label use especially, a difference between conditions measured on 6 samples (MRR 1.000 vs 0.667) is not statistically meaningful.

Reproducibility

Code: intended for Apache-2.0 release; checkpoints on request.
Total external compute: 6 GPU·hours, ~$5.20 across RunPod A40 secure cloud + Vast.ai RTX 4090 community cloud.
Recipe: ComplEx-N3 d=512, reciprocal relations, 1-N cross-entropy loss, N3 regularization λ=1e-2, Adagrad lr=0.1, batch 2048, BF16, 25 epochs. Note: the model peaks at epoch 5, so future runs need only 10 epochs with a cosine LR schedule, which lowers the cost of reproducing the result; we are building this into the next batch.
Eval protocol: standard OGB filtered MRR + Hits@K with 500 type-constrained negatives per query, replicated for PrimeKG using its node-type taxonomy.
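
For orientation, the two core pieces of the ComplEx-N3 recipe (the scoring function and the N3 penalty) can be written in a few lines. This is a plain-Python sketch on toy vectors, not our training code: the function names are hypothetical, and constant factors and batch averaging of the penalty differ between implementations.

```python
def complex_score(head, rel, tail):
    """ComplEx triple score: Re(sum_k h_k * r_k * conj(t_k)).

    Each argument is a list of Python complex numbers of equal length.
    """
    return sum((h * r * t.conjugate()).real
               for h, r, t in zip(head, rel, tail))

def n3_penalty(head, rel, tail, lam=1e-2):
    """N3 regularizer: lam times the sum of cubed complex moduli."""
    return lam * sum(abs(x) ** 3 for emb in (head, rel, tail) for x in emb)

# Toy one-dimensional embeddings, chosen so the arithmetic is easy to follow.
h, r, t = [1 + 1j], [2 + 0j], [1 + 1j]
score = complex_score(h, r, t)
penalty = n3_penalty(h, r, t)
```

With reciprocal relations, each training triple (h, r, t) is also presented as (t, r_inverse, h) with a separate embedding for the inverse relation, and the 1-N loss scores the query against every candidate entity at once.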

Research preview. Not medical advice. Predictions are intended for hypothesis generation by qualified researchers, not for direct clinical use.