Merlin — 3D CT Vision-Language

Version 1.0 · CT · Image Captioning · human · Licence MIT · Released 2026-06-27

Merlin — 3D CT Vision-Language (version 1.0) is a peer-reviewed CT Image Captioning model for human medical imaging. It runs on managed, EU-hosted cloud GPUs through Nalvera.AI: upload a scan, run the model, and download reproducible, open-format results, with no local GPU or CUDA setup.

Model card

Description
Merlin is Stanford's vision-language foundation model for 3D chest/abdomen CT. From a single CT it produces a whole-volume image embedding (2048-d) for retrieval and downstream modelling, scores the scan against a free-text description (image–text contrastive, 512-d), or predicts EHR phenotype findings across 1,692 phecodes as a ranked text report. Inference-only on this platform; outputs are vectors (.npy) or text (.txt), not segmentation masks.
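As an illustration of the retrieval use of the 2048-d image embedding, the sketch below ranks a small gallery of embeddings by cosine similarity to a query. It is a minimal example under stated assumptions, not the platform's own tooling: the vectors are synthetic stand-ins, the helper name `cosine_rank` is hypothetical, and it assumes each downloaded .npy file holds one embedding that can be flattened to length 2048.

```python
import numpy as np

EMBED_DIM = 2048  # whole-volume image embedding size stated in the description


def cosine_rank(query, gallery):
    """Rank gallery rows by descending cosine similarity to the query vector."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    scores = g @ q                 # one cosine similarity per gallery row
    order = np.argsort(-scores)    # indices of best matches first
    return order, scores


# Synthetic stand-ins for exported embeddings. With real outputs, each row
# would come from something like np.load(path).reshape(-1), where the file
# layout is an assumption and should be checked against the actual download.
rng = np.random.default_rng(0)
gallery = rng.normal(size=(5, EMBED_DIM))
query = gallery[3] + 0.01 * rng.normal(size=EMBED_DIM)  # near-duplicate of row 3

order, scores = cosine_rank(query, gallery)
print("best match:", order[0])
print("similarities:", np.round(scores, 3))
```

Cosine similarity is used here because it ignores vector length and compares direction only; whether it is the best metric for a given retrieval task is something to validate on your own data.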
Training data
Merlin was trained on a large Stanford cohort of abdominal/chest CT paired with EHR phecode labels and radiology reports: 6+ million CT images across 15,331 studies, 1.8+ million EHR phecode labels, and 6+ million radiology-report tokens. It was validated on an internal test set (5,137 CTs) and externally on 44,098 CTs from three external sites plus public datasets. Adult human anatomy.
Intended use
Research feature extraction and triage support: cross-study retrieval and similarity from image embeddings, zero-shot image–text scoring against a clinical description, and exploratory phenotype findings from CT. NOT for primary diagnosis or treatment decisions.
Known limitations
CT only (chest/abdomen distribution). Embeddings and phenotype probabilities reflect the training population and can be biased or wrong on out-of-distribution anatomy, contrast phases, paediatric scans, severe pathology, or non-standard fields of view. Image–text similarity is a relative score, not a calibrated probability. The model resamples every scan to a voxel spacing of 1.5×1.5×3 mm on a fixed 224×224×160 voxel grid, so very large or off-centre fields of view may be cropped.
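The resampling figures above imply a fixed physical extent, which can be checked against a scan before upload. The sketch below is a rough pre-check under stated assumptions: the function names are hypothetical, axes are taken in (x, y, z) order with z as the slice direction, and any axis whose physical extent exceeds the model grid is treated as at risk of cropping. It does not reproduce the model's actual preprocessing.

```python
TARGET_SPACING_MM = (1.5, 1.5, 3.0)  # resampled voxel spacing from the model card
TARGET_SHAPE = (224, 224, 160)       # resampled grid size from the model card


def model_extent_mm():
    """Physical extent covered by the model grid, per axis, in millimetres."""
    return tuple(s * n for s, n in zip(TARGET_SPACING_MM, TARGET_SHAPE))


def axes_at_risk(scan_shape, scan_spacing_mm):
    """Per axis, True if the scan is physically larger than the model grid."""
    extent = model_extent_mm()
    return tuple(n * s > e for n, s, e in zip(scan_shape, scan_spacing_mm, extent))


print(model_extent_mm())                                # (336.0, 336.0, 480.0)
print(axes_at_risk((512, 512, 300), (0.8, 0.8, 2.0)))   # (True, True, True)
print(axes_at_risk((512, 512, 200), (0.6, 0.6, 2.0)))   # (False, False, False)
```

In the first example a 512×512×300 scan at 0.8×0.8×2.0 mm spans about 410×410×600 mm, which is larger than the 336×336×480 mm grid on every axis.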
Input modality
CT
Species
human
Output format
.txt (text report) or .npy (embedding vectors)

Research use only — not a medical device. Outputs are not validated for clinical diagnosis or treatment.