Tilelli / Tilelli Med / PrimeKG follow-up
A reviewer noted that OGBL-biokg dates to 2016 sources. We ran the identical pipeline on PrimeKG (Chandak, Huang & Zitnik, Nature Sci Data 2023) — 4.83 million edges, 30 relation types, 129,375 entities, built for precision-medicine drug discovery. The ternary student again outperforms its own float teacher.
PrimeKG 2023 · ComplEx-N3 · Ternary {−1, 0, +1} · SMILES warm-start · Not medical advice
The 10 May write-up reported a strong ternary result on OGBL-biokg, the Stanford Open Graph Benchmark dataset for biomedical link prediction. A reviewer flagged that the benchmark is built from 2016-era sources and asked whether the result holds on a newer graph.
PrimeKG is the de facto modern replacement. It was assembled by Marinka Zitnik's group at Harvard in 2022–2023, with relations explicitly designed for precision-medicine work: indication, contraindication, off-label use, drug_drug (synergistic interactions), drug_protein, disease_phenotype_positive, disease_phenotype_negative, pathway_protein, anatomy_protein_present, exposure_disease, and 20 others.
The table below reports filtered MRR and Hits@10 (H@10) on the canonical 10,000-triple held-out test split, with 500 type-constrained negatives per query (the same protocol as OGBL).
| Model | TEST MRR | H@10 | Storage | Compression |
|---|---|---|---|---|
| Random init — float teacher | 0.2884 | 0.502 | 506 MB | 1× |
| Random init — ternary B=256 | 0.2939 | 0.506 | 158 MB | 3.2× |
| Random init — ternary B=128 | 0.2933 | 0.502 | 95 MB | 5.3× |
| Random init — ternary B=1 | 0.2935 | 0.500 | 32 MB | 15.8× |
| SMILES warm — float teacher | 0.2903 | 0.503 | 506 MB | 1× |
| SMILES warm — ternary B=128 | 0.2972 | 0.507 | 95 MB | 5.3× |
| SMILES warm — ternary B=1 | 0.2946 | 0.501 | 32 MB | 15.8× |
| SMILES warm (seed 43) — float | 0.2911 | 0.507 | 506 MB | 1× |
At every compression level tested (3.2× through 15.8×), the ternary student outperforms its float teacher. The best configuration is the warm-started ternary student at B=128 (5.3× compression), with 0.2972 MRR against 0.2903 for its warm float teacher (+0.0069).
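For readers who want the metric spelled out, the following is a minimal sketch of MRR and Hits@K over sampled negatives, in the spirit of the protocol behind the table above. It assumes the negatives passed in are already filtered (known true triples removed) and type-constrained. The function name `mrr_and_hits`, the toy scores, and the tie-handling rule (ties counted in the positive's favor) are illustrative assumptions, not the pipeline's actual code.

```python
import numpy as np

def mrr_and_hits(pos_scores, neg_scores, k=10):
    """Return (MRR, Hits@k, ranks) for one positive per query.

    pos_scores: shape (n_queries,)       score of the true triple
    neg_scores: shape (n_queries, n_neg) scores of its sampled negatives
    Assumption: negatives are already filtered and type-constrained.
    """
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Rank 1 means no negative scored strictly higher than the positive.
    ranks = 1 + (neg > pos[:, None]).sum(axis=1)
    mrr = float((1.0 / ranks).mean())
    hits = float((ranks <= k).mean())
    return mrr, hits, ranks

# Toy example: 3 queries with 4 negatives each (made-up scores).
pos = [0.9, 0.5, 0.2]
neg = [
    [0.1, 0.2, 0.3, 0.4],  # positive beats all negatives -> rank 1
    [0.6, 0.7, 0.1, 0.2],  # two negatives score higher   -> rank 3
    [0.9, 0.8, 0.7, 0.6],  # all negatives score higher   -> rank 5
]
mrr, hits_at_3, ranks = mrr_and_hits(pos, neg, k=3)
print(ranks.tolist(), round(mrr, 4), round(hits_at_3, 4))
```

In the setting described here, `neg_scores` would have 500 columns per query and `k` would be 10.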
This is the same effect we observed on OGBL-biokg (10 May write-up). The mechanism is the same: the float teacher mildly overfits (validation MRR peaks at epoch 5 and is lower by epoch 25), and the ternary quantization step acts as a regularizer that recovers generalization. The result now holds across two benchmarks built from source data roughly seven years apart.
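To make "ternary quantization step" concrete, here is a minimal sketch of one common way to map float embeddings onto {−1, 0, +1} with a float scale per block. It is not the pipeline's quantizer: the threshold rule (0.7 × mean |w|) is borrowed from the Ternary Weight Networks heuristic, and `block_size`, `threshold_ratio`, and both function names are hypothetical. How the B in the table above maps onto this grouping is not specified in this section.

```python
import numpy as np

def ternarize(weights, block_size=128, threshold_ratio=0.7):
    """Quantize each block of a row to {-1, 0, +1} plus one float scale.

    Assumption: a TWN-style threshold, not the write-up's actual method.
    """
    w = np.asarray(weights, dtype=np.float32)
    rows, dim = w.shape
    assert dim % block_size == 0, "row width must be a multiple of block_size"
    blocks = w.reshape(rows, dim // block_size, block_size)
    # Entries with magnitude below the threshold are zeroed.
    delta = threshold_ratio * np.abs(blocks).mean(axis=-1, keepdims=True)
    codes = np.where(blocks > delta, 1, np.where(blocks < -delta, -1, 0)).astype(np.int8)
    # Scale = mean magnitude of the entries that survived thresholding.
    kept = np.abs(codes).sum(axis=-1, keepdims=True)
    scale = (np.abs(blocks) * np.abs(codes)).sum(axis=-1, keepdims=True) / np.maximum(kept, 1)
    return codes.reshape(rows, dim), scale.squeeze(-1).astype(np.float32)

def dequantize(codes, scale, block_size=128):
    """Rebuild a float table from ternary codes and per-block scales."""
    rows, dim = codes.shape
    blocks = codes.reshape(rows, dim // block_size, block_size).astype(np.float32)
    return (blocks * scale[..., None]).reshape(rows, dim)

# Toy table: 4 rows of width 256, i.e. two blocks of 128 per row.
rng = np.random.default_rng(0)
table = rng.normal(size=(4, 256)).astype(np.float32)
codes, scale = ternarize(table, block_size=128)
recon = dequantize(codes, scale, block_size=128)
print(codes.dtype, scale.shape, float(np.linalg.norm(table - recon)))
```

Because small entries are forced to zero and the rest share one magnitude per block, the student has far fewer degrees of freedom than the float teacher, which is consistent with the regularization reading given above.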
Initialising the drug rows of the entity table from Morgan fingerprints (radius 2, 2048 bits → PCA-512) improves the float teacher's aggregate MRR by +0.0023 (0.2884 random-init → 0.2907 warm-start, where 0.2907 is the mean of the two warm seeds, 0.2903 and 0.2911). 60.4% of PrimeKG drugs were warm-started; the rest stayed random-init.
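The warm start described above can be sketched roughly as follows. The sketch assumes the fingerprints are already computed (for example with a cheminformatics toolkit) and supplied as a 0/1 matrix; it uses a plain SVD-based PCA and rescales the projection to the spread of the random init. The rescaling step, the function names, and the toy sizes are assumptions, and how the 512 components fill a ComplEx row is not stated in this write-up, so the row is treated here as a real vector whose width equals the number of components.

```python
import numpy as np

def pca_project(x, n_components):
    """Project rows of x onto their top principal components (via SVD)."""
    x = np.asarray(x, dtype=np.float64)
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def warm_start_rows(entity_table, drug_rows, fingerprints):
    """Overwrite the given rows with PCA-reduced fingerprints.

    Rows not listed in drug_rows keep their random initialisation.
    Assumption: the projection is rescaled to the table's overall std.
    """
    table = np.array(entity_table, dtype=np.float64, copy=True)
    target_std = table.std()
    proj = pca_project(fingerprints, table.shape[1])
    proj = proj * (target_std / (proj.std() + 1e-12))
    table[drug_rows] = proj
    return table

# Toy sizes: 100 entities of width 8; 40 of them are drugs with 64-bit fingerprints.
rng = np.random.default_rng(0)
entity_table = rng.normal(scale=0.1, size=(100, 8))
fingerprints = (rng.random((40, 64)) < 0.1).astype(np.float64)
drug_rows = np.arange(40)
warm_table = warm_start_rows(entity_table, drug_rows, fingerprints)
print(warm_table.shape, float(warm_table[:40].std()), float(entity_table.std()))
```

In the setting described here, the fingerprint matrix would have 2048 columns and the projection would keep 512 components.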
The lift is concentrated in contraindication (H@10 +3 percentage points) and is neutral on indication (already near-ceiling at MRR=0.80). Put plainly: the chemistry prior helps the model say "this drug should NOT be used for this disease" slightly more than "this drug should treat this disease."
For each held-out drug-disease edge in PrimeKG's test split, we rank every drug as a candidate completion and report the rank of the gold drug. The candidate pool contains 7,957 drugs:
| Relation | n test | Random / Float MRR | Random / Float H@10 | Warm / Float MRR | Warm / Float H@10 |
|---|---|---|---|---|---|
| indication | 26 | 0.802 | 0.93 | 0.796 | 0.93 |
| off-label use | 6 | 1.000 | 1.00 | 0.667 | 1.00 |
| contraindication | 72 | 0.337 | 0.74 | 0.338 | 0.77 |
Read: for the 26 held-out indications, the gold drug typically lands at rank 1 or 2 out of 7,957 candidates (MRR=0.80, H@10=0.93). The model is finding the right drug pattern even though the (drug, indication, disease) edge was filtered out of training.
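The ranking procedure behind these per-relation numbers can be sketched as follows, using the standard ComplEx score Re(⟨h, r, conj(t)⟩). The sketch assumes the drug sits in the head position and the disease in the tail, that other known-true drugs for the same (relation, disease) pair are masked out before ranking, and that embeddings are stored as complex arrays; the function names and the toy sizes are hypothetical.

```python
import numpy as np

def complex_score(head, rel, tail):
    """ComplEx score Re(<h, r, conj(t)>), broadcasting over leading axes."""
    return np.real(np.sum(head * rel * np.conj(tail), axis=-1))

def gold_drug_rank(drug_emb, rel_emb, disease_emb, gold_idx, known_true=()):
    """Rank of the gold drug among all candidate drugs (1 = best).

    known_true: indices of other drugs already known to be correct for
    this (relation, disease); they are excluded from the ranking.
    """
    scores = complex_score(drug_emb, rel_emb, disease_emb)  # shape (n_drugs,)
    mask = np.zeros(scores.shape[0], dtype=bool)
    mask[np.asarray(list(known_true), dtype=int)] = True
    mask[gold_idx] = False  # never filter out the gold drug itself
    scores = np.where(mask, -np.inf, scores)
    return int(1 + (scores > scores[gold_idx]).sum())

# Toy example: 50 candidate drugs, 4 complex dimensions, random embeddings.
rng = np.random.default_rng(0)
drug_emb = rng.normal(size=(50, 4)) + 1j * rng.normal(size=(50, 4))
rel_emb = rng.normal(size=4) + 1j * rng.normal(size=4)
disease_emb = rng.normal(size=4) + 1j * rng.normal(size=4)

order = np.argsort(-complex_score(drug_emb, rel_emb, disease_emb))
best, second = int(order[0]), int(order[1])
raw_rank = gold_drug_rank(drug_emb, rel_emb, disease_emb, second)
filtered_rank = gold_drug_rank(drug_emb, rel_emb, disease_emb, second, known_true=[best])
print(raw_rank, filtered_rank)
```

In the evaluation described above, `drug_emb` would have 7,957 rows, one per candidate drug.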
Caveat: test sample sizes for the rare relations (indication n=26, off-label n=6) are small because PrimeKG itself has few of these edges, so the corresponding numbers should be read as indicative only. For off-label use especially, a difference between conditions measured on 6 test edges (1.000 vs 0.667 MRR) is not statistically meaningful.
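One way to see why six test edges say so little is a percentile bootstrap over reciprocal ranks. The sketch below uses hypothetical ranks, chosen only so that they average to an MRR of about 0.667; they are not the actual test ranks, and the function name and resampling settings are likewise assumptions.

```python
import numpy as np

def bootstrap_mrr_ci(ranks, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for MRR from a list of ranks."""
    rr = 1.0 / np.asarray(ranks, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the queries with replacement, n_boot times.
    idx = rng.integers(0, rr.size, size=(n_boot, rr.size))
    means = rr[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(rr.mean()), float(lo), float(hi)

# HYPOTHETICAL ranks for 6 test edges (not taken from the experiment).
hypothetical_ranks = [1, 1, 1, 3, 3, 3]
mrr, ci_lo, ci_hi = bootstrap_mrr_ci(hypothetical_ranks)
print(f"MRR={mrr:.3f}, 95% bootstrap interval=({ci_lo:.3f}, {ci_hi:.3f})")
```

With so few samples the interval spans a large share of the possible MRR range, which is the reason the off-label row is not interpreted further.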