Tilelli / Tilelli Med
Compressed 15.8× from the float baseline while scoring above the published ComplEx and TransE baselines on the OGBL-biokg leaderboard, packed into a 24 MB binary that runs through a 17 KB C99 runtime, small enough for a $2 microcontroller. The model independently surfaced Rosiglitazone, Sitagliptin, Gliclazide, Tolbutamide, and Miglitol for type-2 diabetes, with those exact drug-disease pairs filtered out of training. Replicated on PrimeKG (2023): the ternary student again outperforms its own float teacher.
Open Graph Benchmark · ComplEx-N3 · Ternary {−1, 0, +1} · Not medical advice
Rosiglitazone (brand name Avandia) is an FDA-approved oral antidiabetic. The triple (Rosiglitazone, drug-disease, type-2 diabetes) was filtered out of the train, validation, and test sets before the candidate sweep. The model recovered it from the surrounding graph structure: shared drug–protein targets, shared side-effect profiles, shared mechanism families. This is the kind of pattern-finding a compressed KGE is supposed to do, and the model did it.
A knowledge-graph embedding model trained on the public OGBL-biokg benchmark — Stanford's release of ~94,000 biomedical entities (drugs, proteins, diseases, side-effects, biological functions) and 4.8 million relations from public literature. The architecture is ComplEx with N3 regularization and reciprocal relations (Lacroix et al. 2018), trained from scratch.
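ComplEx scores a triple as the real part of a trilinear product over complex-valued embeddings. A minimal NumPy sketch of that score, with each embedding stored as concatenated real and imaginary halves (function and variable names are illustrative, not Tilelli's API):

```python
import numpy as np

def complex_score(e_s, rel, e_o):
    """ComplEx triple score: Re(<e_s, rel, conj(e_o)>).

    Each argument is a real vector of even length 2d holding a
    d-dimensional complex embedding as [real half | imaginary half].
    """
    sr, si = np.split(e_s, 2)   # subject: real, imaginary parts
    rr, ri = np.split(rel, 2)   # relation
    orr, oi = np.split(e_o, 2)  # object
    # Re((sr + i·si)(rr + i·ri) · conj(orr + i·oi)), expanded term by term:
    return np.sum(sr * rr * orr + si * rr * oi + sr * ri * oi - si * ri * orr)
```

The reciprocal-relations trick from Lacroix et al. adds a second relation embedding per relation so that tail and head prediction are both answered as tail queries; the N3 regularizer penalizes the cubed absolute values of the embeddings during training. Neither changes the score function above.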
Our contribution is the ternary compression: each entity embedding is reduced from 32-bit floating point to a three-valued {−1, 0, +1} representation with a small per-block scale. At block size 128, that's 5.3× compression of the entity tables. The compressed model still scores 0.752 filtered MRR — above the published TransE leaderboard baseline (0.745). To our knowledge it's the first three-valued knowledge-graph embedding to do so on this benchmark.
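Per-block ternarization can be sketched as below. The 0.7·mean(|w|) threshold and the least-squares scale (mean absolute value of the kept entries) are the standard ternary-weight-network heuristic; the note does not state Tilelli's exact thresholding rule, so treat both as assumptions.

```python
import numpy as np

def ternarize_row(w, block=128):
    """Quantize one float row to {-1, 0, +1} with one scale per block.

    Threshold: 0.7 * mean(|w|) per block (TWN-style heuristic, an
    assumption here). Scale: mean |w| over the surviving entries,
    which is the least-squares fit for a fixed ternary pattern.
    Returns (int8 codes, float32 per-block scales).
    """
    w = np.asarray(w, dtype=np.float32)
    q = np.zeros(len(w), dtype=np.int8)
    scales = []
    for start in range(0, len(w), block):
        b = w[start:start + block]
        thresh = 0.7 * np.mean(np.abs(b))
        mask = np.abs(b) > thresh            # entries kept as ±1
        q[start:start + block] = (np.sign(b) * mask).astype(np.int8)
        scales.append(np.abs(b[mask]).mean() if mask.any() else 0.0)
    return q, np.array(scales, dtype=np.float32)
```

Dequantization is just `q[i] * scale[i // block]` per entry, so the runtime never needs the float table.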
We ran the candidate-prediction pipeline against 10 curated diseases spanning four categories. For each, the model ranks every drug in the graph as a candidate completion of (drug, drug-disease, this disease) — after filtering out drugs already linked to that disease in training. We then cross-check the top 20 against ChEMBL and Open Targets for independent evidence.
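The sweep-and-filter step reduces to a short loop. A hedged sketch, with `score_fn` standing in for the real model's tail-prediction score (the actual pipeline's interfaces are not shown in this note):

```python
def rank_candidates(score_fn, drug_ids, disease_id, known_links, top_k=20):
    """Score every drug as a completion of (drug, drug-disease, disease),
    drop drugs already linked to the disease in training, return top-k.

    score_fn(drug, disease) -> float is a placeholder for the model;
    known_links is the set of (drug, disease) pairs seen in training.
    """
    scored = [(d, score_fn(d, disease_id)) for d in drug_ids
              if (d, disease_id) not in known_links]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The filtering is what makes the Rosiglitazone result meaningful: the known pair never enters the candidate list as a trivial recall, so a high rank has to come from the surrounding graph structure.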
The model works well on dense cardio-metabolic sub-graphs — exactly where OGBL-biokg has rich coverage from decades of cardiovascular and diabetes research. It falls apart on sparser sub-graphs — oncology, psychiatric, autoimmune, respiratory. This isn't a flaw to hide. It's a property of the input graph and a useful map of where the method is and isn't trustworthy.
With the demo expanded to 56 diseases, the model surfaces the right drugs in several oncology categories as well, not just metabolic ones.
These are existing drugs whose actual indications the model recovered without being directly trained on the (drug, treats, this-cancer) triple. The corroboration matching for the score-only diseases is looser than for the 10 curated ones — read the percentages as "the literature already discusses this drug in related indications," not strict-match.
The per-row ternary model (15.8× compression on the entity tables) packs to a 24 MB .tmed binary that runs through a 17 KB statically-compiled C99 runtime — no Python, no PyTorch, no malloc. Linear scan over 93,773 entities for one query: ~870 ms on x86_64, projected 30–60 seconds on a $2 Cortex-M4F MCU with the model in a $0.50 external serial flash chip. Bit-equivalent to the Python reference. Top-6 picks for T2D include Saxagliptin, Gliclazide, Sitagliptin, Miglitol, and Tolbutamide — five FDA-approved oral antidiabetics surfaced from the graph alone.
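The linear scan amounts to an integer {−1, 0, +1} dot product per block, rescaled by that block's factor, so no dequantized float table is ever materialized. A Python sketch of the arithmetic (the real .tmed layout is bit-packed and the C99 runtime works very differently in the details; this only mirrors the math):

```python
import numpy as np

def ternary_scan(q_entities, scales, query, block=128):
    """Score every ternary entity row against one float query vector.

    q_entities: (n, d) int8 codes in {-1, 0, +1}
    scales:     (n, d // block) float32 per-block scales
    query:      (d,) float32; d assumed to be a multiple of block
    Returns (n,) scores equal to dequantize(q) @ query, computed
    block by block without building the dequantized table.
    """
    n, d = q_entities.shape
    out = np.zeros(n, dtype=np.float32)
    for start in range(0, d, block):
        qb = q_entities[:, start:start + block].astype(np.float32)
        out += scales[:, start // block] * (qb @ query[start:start + block])
    return out
```

On the MCU the inner product is a popcount-style integer accumulation with one multiply per block, which is why the scan fits in a 17 KB runtime with no heap.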
This is benchmark performance plus an external corroboration check. It is not a discovery of new medicines. OGBL-biokg is built from public literature — a high MRR means the model captures associations already implicit in the published record. Real drug discovery requires wet-lab assays, ADMET screening, selectivity studies, and clinical trials. None of that has happened here.
The candidates fall into three buckets; the point of the demo is to let a clinician see the distribution and form their own view.
Working name: Tilelli Med. A formal trademark search is pending before public launch. The code is intended for release under Apache-2.0. We are looking for clinical collaborators: read the technical note in Methods, or the French version of this page at /med/index.fr.html. A replication on PrimeKG (2023) is written up in the PrimeKG follow-up.