Build the model the right way
Our first technical milestone is a proprietary foundation model for histopathology, built to balance training data, context length, and compute, and designed to generalize across hospitals.
HistFM: the histology foundation model
Whole-slide images are enormous, often gigapixel-scale at full resolution. Capturing both micro-scale cellular detail and macro-scale tissue organization requires long-context modeling.
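To make the scale concrete, here is a rough back-of-the-envelope sketch of how many tiles a single slide produces. The slide dimensions and the 256-pixel tile size are illustrative assumptions, not HistFM settings, and `tile_grid` is a hypothetical helper.

```python
def tile_grid(width_px, height_px, tile_px=256, stride_px=None):
    """Count tiles in a sliding-window tiling without padding.

    Returns (columns, rows, total). A stride equal to the tile size
    (the default) gives non-overlapping tiles.
    """
    stride = stride_px or tile_px
    if width_px < tile_px or height_px < tile_px:
        return 0, 0, 0
    cols = (width_px - tile_px) // stride + 1
    rows = (height_px - tile_px) // stride + 1
    return cols, rows, cols * rows


# Assumed example: a 100,000 x 80,000 pixel slide cut into 256 x 256 tiles.
cols, rows, total = tile_grid(100_000, 80_000, tile_px=256)
print(cols, rows, total)  # 390 312 121680
```

Even before background filtering, a slide of this assumed size yields over a hundred thousand tiles, which is why the slide-level model has to handle very long sequences.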
Training substrate
We start with public datasets that cover diverse tissue types and cancer cohorts, then expand with partner data:
- TCGA for tumor diversity and paired outcomes.
- GTEx for normal tissue context.
- CPTAC for proteogenomics (critical for multimodal alignment).
Architecture philosophy
We separate the problem into (1) local tile representation and (2) slide-level aggregation with long-range context. This modularity lets us iterate faster and evaluate each component independently.
- Strong tile encoders for cellular texture and morphology.
- Long-context slide encoder for spatial organization and microenvironment structure.
- Multimodal objectives that tie the embedding to molecular measurements such as proteomics.
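The two-stage split can be sketched in a few lines. This is a minimal illustration, not the HistFM architecture: softmax attention pooling stands in for the long-context slide encoder, and `attention_pool`, `encode_slide`, and the `query` vector are hypothetical names. In a real system the tile encoder and the query would be learned.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def attention_pool(tile_embeddings: Sequence[Vector], query: Vector) -> Vector:
    """Aggregate tile embeddings into one slide vector with softmax attention."""
    if not tile_embeddings:
        raise ValueError("need at least one tile embedding")
    # Dot-product score of each tile against the query vector.
    scores = [sum(q * e for q, e in zip(query, emb)) for emb in tile_embeddings]
    # Numerically stable softmax over tiles.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    weights = [x / norm for x in exps]
    dim = len(tile_embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, tile_embeddings))
        for d in range(dim)
    ]


def encode_slide(tiles: Sequence, tile_encoder: Callable[[object], Vector],
                 query: Vector) -> Vector:
    """Stage 1: embed each tile independently. Stage 2: aggregate."""
    embeddings = [tile_encoder(tile) for tile in tiles]
    return attention_pool(embeddings, query)
```

Because the tile encoder is passed in as a plain function, either stage can be swapped or evaluated on its own, which is the practical benefit of the modular design.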
Proteomics-aware learning
Vision-only models can be powerful, but therapy response is mediated by molecular function. Aligning histology with proteomics pushes the representation to encode protein-level signal, which makes it more useful for reasoning about mechanism.
Multi-task objective
Alongside self-supervised objectives, we add supervised regression heads to predict proteomic abundance on CPTAC subsets. The goal is not a single perfect predictor; it’s a richer embedding space.
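One simple way to combine the two signals is a weighted sum, sketched below. The weighting scheme, the mean-squared-error regression loss, and the names `multitask_loss` and `regression_weight` are assumptions for illustration; the source does not specify how the objectives are combined. Slides without proteomic labels contribute only the self-supervised term.

```python
from typing import Optional, Sequence


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference between predicted and measured abundance."""
    if len(pred) != len(target) or not pred:
        raise ValueError("pred and target must be non-empty and equal length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(ssl_loss: float,
                   protein_pred: Sequence[float],
                   protein_true: Optional[Sequence[float]] = None,
                   regression_weight: float = 0.5) -> float:
    """Self-supervised loss plus a weighted proteomic regression term.

    When a slide has no proteomic measurements (protein_true is None),
    only the self-supervised loss is returned.
    """
    if protein_true is None:
        return ssl_loss
    return ssl_loss + regression_weight * mean_squared_error(protein_pred, protein_true)


# Assumed toy values: SSL loss 1.0, two proteins, one predicted off by 2.
print(multitask_loss(1.0, [1.0, 2.0], [1.0, 4.0], regression_weight=0.5))  # 2.0
```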
External validation
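We hold out entire cohorts and institutions to prevent leakage. CPTAC is particularly useful as a high-quality external test bed, provided the CPTAC cohorts used for evaluation are kept out of every training stage, including the proteomic regression heads.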
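A holdout of this kind can be sketched as a group-wise split with an explicit leakage check. The record fields (`institution`, `patient_id`) and the function name are hypothetical; real metadata schemas will differ.

```python
from typing import Dict, Iterable, List, Set, Tuple

Record = Dict[str, str]


def split_by_institution(records: Iterable[Record],
                         held_out: Set[str]) -> Tuple[List[Record], List[Record]]:
    """Send every slide from a held-out institution to the test set.

    Raises ValueError if any patient appears on both sides, since a
    patient seen at two sites would leak across the split.
    """
    records = list(records)
    train = [r for r in records if r["institution"] not in held_out]
    test = [r for r in records if r["institution"] in held_out]
    overlap = {r["patient_id"] for r in train} & {r["patient_id"] for r in test}
    if overlap:
        raise ValueError(f"patients on both sides of the split: {sorted(overlap)}")
    return train, test


# Assumed toy metadata for three slides from two sites.
slides = [
    {"slide_id": "s1", "patient_id": "p1", "institution": "site_a"},
    {"slide_id": "s2", "patient_id": "p2", "institution": "site_a"},
    {"slide_id": "s3", "patient_id": "p3", "institution": "site_b"},
]
train, test = split_by_institution(slides, held_out={"site_b"})
print(len(train), len(test))  # 2 1
```

Splitting at the institution level rather than the slide level means scanner, staining, and protocol differences stay unseen at training time, which is what the external evaluation is meant to probe.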
Path to causality
The model suggests resistance pathways; the lab tests key hypotheses. This is how the platform earns trust — and generates proprietary data.
Benchmarks & evaluation
We evaluate on tasks that test robustness, generalization, and molecular inference — not just in-distribution accuracy.
Representative benchmarks
- PANDA — prostate cancer grading.
- UBC-OCEAN — ovarian cancer subtyping.
- TCGA-NSCLC — lung cancer subtyping.
- Camelyon17-WILDS — distribution shift across hospitals.
- MHIST — colorectal polyp classification.
The exact benchmark set may evolve as we align with the most informative public evaluations.
Molecular prediction
For multimodal grounding, we use gene-expression and molecular-prediction suites (e.g., HEST) and evaluate whether multimodal training improves out-of-distribution performance.
- Strict train/test splits to avoid cohort leakage.
- Confidence calibration and uncertainty reporting.
- Population-level bias checks when metadata allows it.
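For the calibration item, one common summary metric is expected calibration error (ECE), sketched here with equal-width confidence bins. The choice of ECE and of ten bins is an assumption for illustration; the source does not name a specific calibration metric.

```python
from typing import Sequence


def expected_calibration_error(confidences: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Weighted average gap between confidence and accuracy per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("inputs must be non-empty and equal length")
    total = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 falls into the last bin rather than overflowing.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece
```

A model that reports 90% confidence but is right 75% of the time in that bin would score an ECE of about 0.15 on those predictions; a well-calibrated model scores near zero.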
Safety & regulatory posture
We begin as a research and decision-support platform, not a diagnostic device. Clinical claims require prospective validation and the right regulatory pathway.
Human in the loop
Outputs are designed to support clinicians with evidence and uncertainty — not replace judgement.
Auditability
Every prediction should be traceable: data provenance, model versioning, and evaluation context.
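As a minimal sketch of what a traceable prediction could look like, the record below bundles the three elements named above with the output itself and derives a stable fingerprint. All field names and example values are hypothetical, not an actual schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class PredictionRecord:
    slide_id: str
    data_source: str      # provenance: where the slide came from
    model_version: str    # exact model build that produced the output
    eval_context: str     # benchmark or study under which it was produced
    prediction: float
    uncertainty: float

    def fingerprint(self) -> str:
        """SHA-256 of the record with sorted keys, stable across runs."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


# Assumed example values.
record = PredictionRecord(
    slide_id="slide-0001",
    data_source="partner-site-a",
    model_version="histfm-0.1.0",
    eval_context="retrospective-holdout",
    prediction=0.82,
    uncertainty=0.07,
)
print(record.fingerprint()[:12])
```

Storing the fingerprint alongside each output makes it possible to detect whether a logged prediction was later altered or re-run under a different model version.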
Clinical validation
Prospective studies and lab validation are the bridge between retrospective signal and real-world care.