Research · Medical Imaging

From Pixels to Diagnosis

Using machine learning to classify medical image sequences

Antonio Badilla-Olivas, Enrique Vílchez-Lizano, Brandon Mora-Umaña, Kenneth Villalobos-Solís, Adrián Lara Petitdemange
Work done in late 2024 as part of the course Research in Computer Science, Bachelor in Computer Science, Universidad de Costa Rica

The question

When a patient arrives with a traumatic brain injury, clinicians rely on CT scans to detect intracranial hemorrhage. Each scan produces roughly 30 to 60 slices. A radiologist reads them as a sequence, picking up on how structures change from one slice to the next.

Can a machine learning model do the same? And does it matter whether the model sees the slices as a sequence or one at a time?

Two approaches, same patients

We tested two models on 75 patients from the CT-ICH dataset, 36 with hemorrhage and 39 without, running ten independent training runs per model.

Model       Mean accuracy   Precision   Recall
ViViT       72%             73%         62%
ConvNeXT    60%             67%*        13%

ViViT is a video vision transformer. It takes 18 slices as one sequence and outputs a single prediction. ConvNeXT is a convolutional image model. It classifies each slice alone, then votes: if any slice is positive, the patient is positive.
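The slice-level voting rule described above is simple enough to state in code. This is a minimal sketch, assuming binary 0/1 slice labels; the function names are ours, not from the released code:

```python
def convnext_patient_vote(slice_predictions):
    """Patient-level decision from per-slice ConvNeXT outputs.

    slice_predictions: list of 0/1 labels, one per CT slice.
    A single positive slice flags the whole patient as positive.
    """
    return int(any(slice_predictions))

# A hemorrhage visible on just one of 18 slices still flags the patient
print(convnext_patient_vote([0] * 17 + [1]))  # -> 1
print(convnext_patient_vote([0] * 18))        # -> 0
```

ViViT needs no such rule: it consumes all 18 slices at once and emits a single patient-level prediction directly.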

Why video models might work

A CT scan is not a bag of independent images. It is a spatially ordered sequence. Hemorrhage has structure: it spans multiple slices, changes shape, and sits in specific anatomical locations. A radiologist uses all of that context. We hypothesise that video models can too.

[Diagram: slices S1, S2, S3, … fed into ViViT as one sequence, which outputs a single prediction: Positive]

ViViT processes the entire sequence as a single input. Its transformer attention mechanism can relate any slice to any other, potentially capturing both spatial patterns within each slice and temporal patterns across the sequence. In principle, it could learn where hemorrhage starts, how it evolves, and where it ends.

ConvNeXT sees each slice in isolation. It can detect hemorrhage in a single image, but it cannot reason about patterns that span multiple slices. The voting mechanism is a workaround, not a substitute for sequential understanding.

Our hypothesis is that the advantage of video models goes beyond avoiding class imbalance. They may also pick up on spatial and temporal cues that image models are structurally unable to see. This study does not prove that claim, but the results are consistent with it.

The imbalance problem

Beyond architecture, the two models face very different training distributions from the same data.

Positive samples during training

ViViT:    48% positive across ~53 patient sequences (train split)
ConvNeXT: ≈11% positive across ≈2,000 individual slices (train split)

ConvNeXT trains on individual slices, roughly 89% of which are negative. It learns to say "no hemorrhage" almost every time. High accuracy, terrible recall. ViViT trains on patient-level sequences where the split is nearly even.

This imbalance is not a flaw in the experiment. It is what happens when you decompose sequential data into individual frames for an image model. The skew is inherent to the paradigm.
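The skew can be reproduced with back-of-the-envelope numbers. Only the headline fractions (≈48% of sequences vs ≈11% of slices positive, over ~53 training patients) come from the study; the per-patient slice counts below are illustrative assumptions:

```python
def positive_fraction(labels):
    """Fraction of positive labels in a list of 0/1 labels."""
    return sum(labels) / len(labels)

# Patient level: roughly balanced (illustrative: 26 of 53 training
# patients positive, ~49%).
patient_labels = [1] * 26 + [0] * 27

# Slice level: each patient contributes many slices, but hemorrhage
# appears on only a handful even for positive patients (the 37 slices
# per patient and 8 positive slices below are illustrative assumptions).
slice_labels = []
for y in patient_labels:
    slices = [0] * 37
    if y:
        slices[:8] = [1] * 8  # hemorrhage visible on ~8 of 37 slices
    slice_labels.extend(slices)

print(round(positive_fraction(patient_labels), 2))  # -> 0.49
print(round(positive_fraction(slice_labels), 2))    # -> 0.11
```

Decomposing the same 53 patients into slices turns a nearly balanced problem into an 89/11 one, which is exactly the distribution ConvNeXT trains on.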

Run-by-run results

[Scatter plot: each dot is one of the 10 independent training runs, shown for ViViT and ConvNeXT]

Where predictions land

Mean confusion matrix across 10 runs (ViViT).

            Pred −        Pred +
True −      11.2 (TN)     3.0 (FP)
True +      4.1 (FN)      6.7 (TP)
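The headline metrics can be re-derived from the mean matrix. One subtlety worth flagging: the study reports the mean of per-run precisions (0.73), which need not equal the precision computed from the mean matrix, so a small discrepancy on that metric is expected:

```python
# Mean ViViT confusion-matrix cells across the 10 runs
tn, fp, fn, tp = 11.2, 3.0, 4.1, 6.7

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # -> ~0.72, matches the report
recall    = tp / (tp + fn)                   # -> ~0.62, matches the report
precision = tp / (tp + fp)                   # -> ~0.69, vs 0.73 reported
                                             #    (mean of per-run precisions)
print(round(accuracy, 2), round(recall, 2), round(precision, 2))
```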

Statistical tests

Metric      ViViT   ConvNeXT   p-value   Significant?
Accuracy    0.72    0.60       < 0.001   Yes
Recall      0.62    0.13       < 0.001   Yes
Precision   0.73    0.67*      0.68      No

ViViT produced fewer than half as many false negatives as ConvNeXT (4.1 vs 9.6 per run, on average) and detected positive cases nearly five times as often (6.7 vs 1.4 true positives). In a medical setting, a missed hemorrhage can be fatal. That gap in recall matters.
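Tests like the ones in the table can be approximated with a simple two-sided permutation test on the per-run means. The per-run accuracies below are hypothetical stand-ins (only the means, 0.72 and 0.60, come from the results), and the test itself is a generic sketch, not the study's exact procedure:

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical per-run accuracies centered on the reported means
vivit    = [0.74, 0.70, 0.73, 0.71, 0.75, 0.69, 0.72, 0.73, 0.70, 0.73]
convnext = [0.61, 0.58, 0.62, 0.59, 0.63, 0.57, 0.60, 0.61, 0.58, 0.61]

p = permutation_test(vivit, convnext)
print(p)  # far below 0.05: the accuracy gap is unlikely to be chance
```

With ten runs per model and non-overlapping accuracy ranges, almost no relabeling of runs reproduces a 0.12 gap, hence the tiny p-value.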

* ConvNeXT precision values for runs 2 and 7 were corrupted due to logging issues and treated as 0.

Computational cost

Model       Parameters   Train / run   Throughput
ViViT       87.6M        ∼39 min       0.49 samples/sec
ConvNeXT    27.8M        ∼42 min       8.6 samples/sec

ViViT has lower throughput (full sequences are expensive), but total training time is comparable. Both ran on a Tesla V100S-PCIE-32GB, 10 replicates of 20 epochs each.
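As a sanity check, the reported ViViT throughput is roughly consistent with the reported wall-clock time, assuming ~53 training sequences per epoch (the train-split size mentioned above) and ignoring validation and overhead:

```python
# Back-of-the-envelope check: 20 epochs over ~53 sequences at the
# reported 0.49 samples/sec should take on the order of the reported
# ~39 minutes per run.
epochs = 20
sequences_per_epoch = 53          # approximate train-split size
throughput = 0.49                 # ViViT samples/sec (reported)

train_seconds = epochs * sequences_per_epoch / throughput
train_minutes = train_seconds / 60
print(round(train_minutes, 1))    # -> ~36 min, close to the reported ~39
```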

Limitations

What comes next