Research · Medical Imaging

From Pixels to Diagnosis

Using machine learning to classify medical image sequences

Antonio Badilla-Olivas, Enrique Vílchez-Lizano, Brandon Mora-Umaña, Kenneth Villalobos-Solís, Adrián Lara Petitdemange
Work done in late 2024 as part of the course Research in Computer Science, Bachelor in Computer Science, Universidad de Costa Rica

The question

When a patient arrives with a traumatic brain injury, clinicians rely on CT scans to detect intracranial hemorrhage. Each scan produces roughly 30 to 60 slices. A radiologist reads them as a sequence, picking up on how structures change from one slice to the next.

Can a machine learning model do the same? And does it matter whether the model sees the slices as a sequence or one at a time?

Two approaches, same patients

We tested two models on 75 patients from the CT-ICH dataset, 36 with hemorrhage and 39 without, running ten independent training runs per model.

Model       Mean accuracy   Precision   Recall
ViViT       72%             73%         62%
ConvNeXT    60%             67%*        13%

ViViT is a video vision transformer. It takes 18 slices as one sequence and outputs a single prediction. ConvNeXT is a convolutional image model. It classifies each slice alone, then votes: if any slice is positive, the patient is positive.
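The slice-level voting rule described above is simple enough to state in code. This is a minimal sketch, assuming binary 0/1 slice labels; the function names are ours, not from the released code:

```python
def convnext_patient_vote(slice_predictions):
    """Patient-level decision from per-slice ConvNeXT outputs.

    slice_predictions: list of 0/1 labels, one per CT slice.
    A single positive slice flags the whole patient as positive.
    """
    return int(any(slice_predictions))

# A hemorrhage visible on just one of 18 slices still flags the patient
print(convnext_patient_vote([0] * 17 + [1]))  # -> 1
print(convnext_patient_vote([0] * 18))        # -> 0
```

ViViT needs no such rule: it consumes all 18 slices at once and emits a single patient-level prediction directly.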

Why video models might work

A CT scan is not a bag of independent images. It is a spatially ordered sequence. Hemorrhage has structure: it spans multiple slices, changes shape, and sits in specific anatomical locations. A radiologist uses all of that context. We hypothesise that video models can too.

[Diagram: slices S1, S2, S3, … fed into ViViT as one sequence, which outputs a single prediction: Positive]

ViViT processes the entire sequence as a single input. Its transformer attention mechanism can relate any slice to any other, potentially capturing both spatial patterns within each slice and temporal patterns across the sequence. In principle, it could learn where hemorrhage starts, how it evolves, and where it ends.

ConvNeXT sees each slice in isolation. It can detect hemorrhage in a single image, but it cannot reason about patterns that span multiple slices. The voting mechanism is a workaround, not a substitute for sequential understanding.

Our hypothesis is that the advantage of video models goes beyond avoiding class imbalance. They may also pick up on spatial and temporal cues that image models are structurally unable to see. This study does not prove that claim, but the results are consistent with it.

The imbalance problem

Beyond architecture, the two models face very different training distributions from the same data.

Positive samples during training

ViViT:    48% positive across ~53 patient sequences (train split)
ConvNeXT: ≈11% positive across ≈2,000 individual slices (train split)

ConvNeXT trains on individual slices, roughly 89% of which are negative. It learns to say "no hemorrhage" almost every time. High accuracy, terrible recall. ViViT trains on patient-level sequences where the split is nearly even.

This imbalance is not a flaw in the experiment. It is what happens when you decompose sequential data into individual frames for an image model. The skew is inherent to the paradigm.
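The skew can be reproduced with back-of-the-envelope numbers. Only the headline fractions (≈48% of sequences vs ≈11% of slices positive, over ~53 training patients) come from the study; the per-patient slice counts below are illustrative assumptions:

```python
def positive_fraction(labels):
    """Fraction of positive labels in a list of 0/1 labels."""
    return sum(labels) / len(labels)

# Patient level: roughly balanced (illustrative: 26 of 53 training
# patients positive, ~49%).
patient_labels = [1] * 26 + [0] * 27

# Slice level: each patient contributes many slices, but hemorrhage
# appears on only a handful even for positive patients (the 37 slices
# per patient and 8 positive slices below are illustrative assumptions).
slice_labels = []
for y in patient_labels:
    slices = [0] * 37
    if y:
        slices[:8] = [1] * 8  # hemorrhage visible on ~8 of 37 slices
    slice_labels.extend(slices)

print(round(positive_fraction(patient_labels), 2))  # -> 0.49
print(round(positive_fraction(slice_labels), 2))    # -> 0.11
```

Decomposing the same 53 patients into slices turns a nearly balanced problem into an 89/11 one, which is exactly the distribution ConvNeXT trains on.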

Run-by-run results

[Scatter plot: each dot is one of the 10 independent training runs, shown for ViViT and ConvNeXT]

Where predictions land

Mean confusion matrix across 10 runs (ViViT).

            Pred −        Pred +
True −      11.2 (TN)     3.0 (FP)
True +      4.1 (FN)      6.7 (TP)
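The headline metrics can be re-derived from the mean matrix. One subtlety worth flagging: the study reports the mean of per-run precisions (0.73), which need not equal the precision computed from the mean matrix, so a small discrepancy on that metric is expected:

```python
# Mean ViViT confusion-matrix cells across the 10 runs
tn, fp, fn, tp = 11.2, 3.0, 4.1, 6.7

accuracy  = (tp + tn) / (tp + tn + fp + fn)  # -> ~0.72, matches the report
recall    = tp / (tp + fn)                   # -> ~0.62, matches the report
precision = tp / (tp + fp)                   # -> ~0.69, vs 0.73 reported
                                             #    (mean of per-run precisions)
print(round(accuracy, 2), round(recall, 2), round(precision, 2))
```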

Statistical tests

Metric      ViViT   ConvNeXT   p-value   Significant?
Accuracy    0.72    0.60       < 0.001   Yes
Recall      0.62    0.13       < 0.001   Yes
Precision   0.73    0.67*      0.68      No

ViViT produced fewer than half as many false negatives as ConvNeXT (4.1 vs 9.6 per run, on average) and detected positive cases nearly five times as often (6.7 vs 1.4 true positives). In a medical setting, a missed hemorrhage can be fatal. That gap in recall matters.
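Tests like the ones in the table can be approximated with a simple two-sided permutation test on the per-run means. The per-run accuracies below are hypothetical stand-ins (only the means, 0.72 and 0.60, come from the results), and the test itself is a generic sketch, not the study's exact procedure:

```python
import random

def permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided permutation test on the difference of sample means."""
    rng = random.Random(seed)
    observed = abs(sum(a) / len(a) - sum(b) / len(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if abs(sum(pa) / len(pa) - sum(pb) / len(pb)) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical per-run accuracies centered on the reported means
vivit    = [0.74, 0.70, 0.73, 0.71, 0.75, 0.69, 0.72, 0.73, 0.70, 0.73]
convnext = [0.61, 0.58, 0.62, 0.59, 0.63, 0.57, 0.60, 0.61, 0.58, 0.61]

p = permutation_test(vivit, convnext)
print(p)  # far below 0.05: the accuracy gap is unlikely to be chance
```

With ten runs per model and non-overlapping accuracy ranges, almost no relabeling of runs reproduces a 0.12 gap, hence the tiny p-value.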

* ConvNeXT precision values for runs 2 and 7 were corrupted due to logging issues and treated as 0.

Computational cost

Model       Parameters   Train / run   Throughput
ViViT       87.6M        ∼39 min       0.49 samples/sec
ConvNeXT    27.8M        ∼42 min       8.6 samples/sec

ViViT has lower throughput (full sequences are expensive), but total training time is comparable. Both ran on a Tesla V100S-PCIE-32GB, 10 replicates of 20 epochs each.
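As a sanity check, the reported ViViT throughput is roughly consistent with the reported wall-clock time, assuming ~53 training sequences per epoch (the train-split size mentioned above) and ignoring validation and overhead:

```python
# Back-of-the-envelope check: 20 epochs over ~53 sequences at the
# reported 0.49 samples/sec should take on the order of the reported
# ~39 minutes per run.
epochs = 20
sequences_per_epoch = 53          # approximate train-split size
throughput = 0.49                 # ViViT samples/sec (reported)

train_seconds = epochs * sequences_per_epoch / throughput
train_minutes = train_seconds / 60
print(round(train_minutes, 1))    # -> ~36 min, close to the reported ~39
```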

Limitations

What comes next