When we train a neural network to diagnose diseases from medical images, we often celebrate high accuracy scores and move on. Let's analyze a ResNet-50 on chest X-ray classification to reveal a more nuanced story, one that every AI practitioner in healthcare should understand.
The Deceptive Simplicity of “Good Performance”
Training a deep learning model for medical image classification seems straightforward: get data, train model, achieve high metrics, deploy. But this workflow obscures critical questions: What has the model actually learned? When will its predictions fail? Can we trust it with real patients?
To understand these questions better, I analyzed the performance of a standard ResNet-50 on the NIH ChestX-ray14 dataset: 112,120 chest X-ray images labeled with 14 different pathologies. The goal wasn’t to achieve state-of-the-art performance, but rather to understand what a baseline model learns and, crucially, what we can and cannot infer from its behavior.
The Setup: Deliberately Minimal
The training approach was intentionally simple:
- Standard ResNet-50 architecture with random initialization (no ImageNet pretraining)
- Basic Adam optimizer with fixed learning rate
- No fancy data augmentation or hyperparameter tuning
- Single training run on one GPU
Why so minimal? Because extensive tuning can obscure which design choices actually matter and make results harder to reproduce. The focus here was on rigorous evaluation, not leaderboard competition.
Performance Isn’t a Single Number
The model achieved a macro-average AUROC of 0.744 across all 14 pathologies. But that single number hides substantial variation from one pathology to the next:
High performers (AUROC > 0.80):
- Cardiomegaly: 0.852
- Edema: 0.825
- Pneumothorax: 0.816
Low performers (AUROC < 0.70):
- Nodule: 0.642
- Pneumonia: 0.668
- Infiltration: 0.684
Why such a spread? The high performers share something in common: they produce clear, high-contrast visual patterns. An enlarged heart creates an obvious change in the cardiac silhouette, pulmonary edema shows widespread white opacities, and pneumothorax displays a sharp demarcation at the lung edge.
The low performers? They’re subtle, variable, and easily confused with other conditions. Nodules can be tiny and hidden behind ribs, Pneumonia manifests in diverse patterns that overlap with other diseases, and Infiltration is notoriously ambiguous even for radiologists.
The key insight: even with identical architecture and training, different diseases have drastically different learnability. “Model performance” is not a single scalar; it’s a distribution that reflects task complexity.
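To make “performance is a distribution” concrete, here is a small self-contained sketch that computes AUROC per pathology and reports the spread rather than only the macro average. The scores and labels are toy data for illustration, not the study’s:

```python
# AUROC via the Mann-Whitney U statistic: the probability that a randomly
# chosen positive case scores higher than a randomly chosen negative case.
def auroc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy per-pathology predictions (invented numbers, not the study's data).
results = {
    "Cardiomegaly": ([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]),  # cleanly separated
    "Nodule":       ([0.6, 0.4, 0.5, 0.3], [1, 0, 0, 1]),  # barely separable
}

per_label = {name: auroc(s, y) for name, (s, y) in results.items()}
macro = sum(per_label.values()) / len(per_label)
print(per_label)                  # the distribution the macro average hides
print(f"macro-AUROC = {macro:.3f}")
```

Reporting `per_label` alongside `macro` is exactly the practice the per-pathology tables above follow.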
Looking Inside: The Grad-CAM Revelations
To understand what the model was actually looking at, I used Grad-CAM visualizations—heatmaps showing which image regions most influenced each prediction.
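As a reminder of the mechanics, the Grad-CAM recipe pools the gradient of the target logit over each feature map to get a per-channel weight, then takes a ReLU of the weighted sum of activations. A generic NumPy sketch of that core step (not the exact implementation used in this analysis):

```python
import numpy as np

def grad_cam(activations, gradients):
    """Compute a Grad-CAM heatmap.

    activations: (C, H, W) feature maps from the chosen conv layer.
    gradients:   (C, H, W) gradients of the target class logit w.r.t.
                 those feature maps (typically captured via a backward hook).
    Returns an (H, W) heatmap normalized to [0, 1].
    """
    # Channel weights: global-average-pool the gradients spatially.
    weights = gradients.mean(axis=(1, 2))                               # (C,)
    # Weighted sum of feature maps, ReLU to keep only positive evidence.
    cam = np.maximum((weights[:, None, None] * activations).sum(axis=0), 0.0)
    if cam.max() > 0:
        cam = cam / cam.max()  # normalize for visualization
    return cam

# Toy example: channel 0 fires at the top-left and has positive gradient,
# so the heatmap should highlight exactly that location.
acts = np.zeros((2, 4, 4)); acts[0, 0, 0] = 1.0
grads = np.zeros((2, 4, 4)); grads[0] = 1.0
heatmap = grad_cam(acts, grads)
print(heatmap[0, 0])  # 1.0: the highlighted region
```

In practice the low-resolution heatmap is upsampled and overlaid on the input X-ray; the interpretation caveats discussed below apply regardless of those details.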
The Good News
For high-performing pathologies, attention patterns looked anatomically plausible:
- Cardiomegaly: Model focused on the central thoracic region where the heart is located
- Edema: Diffuse attention spanning bilateral lung fields, matching the widespread nature of edema
- Pneumothorax: Concentrated attention in upper chest regions
This seems encouraging: the model appears to be looking at the right places.
The Concerning News
For low-performing pathologies, attention patterns were problematic:
- Infiltration: Heavy attention on image corners and edges—likely learning from artifacts rather than pathology
- Pneumonia: Extrathoracic bias, potentially relying on hospital-specific tags or positioning artifacts
- Nodule: Central mediastinal focus instead of localized lung regions, suggesting the model hasn’t learned what a nodule actually is
The Interpretability Trap
Here’s where things get philosophically interesting. Beware of what might be called the “Interpretability Trap”—a flawed reasoning chain that goes:
- Model produces plausible-looking attention map
- Human observer recognizes anatomically relevant region
- Observer concludes model learned “correct” reasoning
- Observer’s confidence in model increases ← The Trap
The problem? Plausible attention is necessary but not sufficient for valid reasoning. A model might highlight the correct anatomy while actually exploiting spurious correlations invisible to human observers. It might be detecting the presence of medical equipment, patient positioning, or hospital-specific imaging protocols rather than the disease itself.
The crucial caveat: Grad-CAM shows correlation, not causation. It reveals what the model associates with labels in the training distribution, not necessarily the biological cause of pathology.
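One cheap diagnostic in this spirit is an occlusion probe: mask the region the attention map highlights and check whether the score actually drops. If masking an image corner moves the prediction as much as masking the lungs does, the model is probably leaning on artifacts. A minimal sketch, where `model` stands for any callable mapping an image to a scalar score (an assumption of this example):

```python
import numpy as np

def occlusion_effect(model, image, top, left, size):
    """Score change when a (size x size) patch is zeroed out.

    model: any callable image -> scalar score (illustrative stand-in).
    A large drop means the patch carried evidence the model relied on.
    """
    occluded = image.copy()
    occluded[top:top + size, left:left + size] = 0.0
    return model(image) - model(occluded)

# Toy "model" that only reads the top-left corner: a stand-in for a
# classifier that latched onto a corner artifact instead of anatomy.
corner_reader = lambda img: float(img[:8, :8].mean())

img = np.ones((64, 64))
corner_drop = occlusion_effect(corner_reader, img, 0, 0, 8)    # masks its cue
center_drop = occlusion_effect(corner_reader, img, 28, 28, 8)  # masks "anatomy"
print(corner_drop, center_drop)  # corner-dependent model exposed
```

Unlike a pretty heatmap, this test makes a falsifiable prediction about model behavior, which is the diagnostic (not defensive) use of explanations argued for below.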
What We Still Don’t Know
We should always be explicit about our work’s limitations:
Not evaluated:
- Performance on other chest X-ray datasets
- Behavior across patient demographics
- Robustness to adversarial examples or distribution shifts
- Agreement with radiologist interpretations
- Temporal stability as medical practices evolve
Cannot conclude:
- The model learned clinically valid reasoning
- Performance will generalize to new hospitals or populations
- Grad-CAM validates model correctness
- The model is safe for clinical deployment
The Real Lesson
This work isn’t just about chest X-rays or ResNet-50. It’s a template for how we should approach AI in high-stakes domains:
- Be explicit about methodology: Document every choice and its rationale
- Report heterogeneous performance: Aggregate metrics hide crucial variation
- Use explanations diagnostically, not defensively: Grad-CAM helps identify failures, not justify deployment
- Define the scope of valid inference: State clearly what you can and cannot conclude
- Prioritize robustness over optimization: Real-world reliability matters more than benchmark numbers
Moving Forward
The path from “model works in lab” to “model safe for patients” requires much more than good test metrics:
- Rigorous out-of-distribution testing across multiple datasets
- Subgroup analysis to detect performance disparities
- Temporal validation to ensure stability over time
- Clinical validation with radiologist gold standards
- Adversarial testing to probe failure modes
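Subgroup analysis, for instance, can reuse an ordinary AUROC routine stratified by a metadata field. A toy sketch (the site names and numbers are invented for illustration):

```python
def auroc(scores, labels):
    """Probability that a random positive outranks a random negative."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy records: (score, label, subgroup). Invented data for illustration.
records = [
    (0.9, 1, "site_A"), (0.2, 0, "site_A"), (0.8, 1, "site_A"), (0.3, 0, "site_A"),
    (0.6, 1, "site_B"), (0.5, 0, "site_B"), (0.4, 1, "site_B"), (0.7, 0, "site_B"),
]

# Group scores and labels by subgroup, then evaluate each group separately.
by_group = {}
for score, label, group in records:
    by_group.setdefault(group, ([], []))
    by_group[group][0].append(score)
    by_group[group][1].append(label)

per_group = {g: auroc(s, y) for g, (s, y) in by_group.items()}
gap = max(per_group.values()) - min(per_group.values())
print(per_group, f"gap = {gap:.2f}")  # a large gap flags a disparity
```

A model with a strong overall AUROC can still show a large `gap`, which is precisely the disparity that aggregate metrics conceal.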
High AUROC scores are necessary but far from sufficient. The hard questions aren’t about accuracy—they’re about robustness, fairness, interpretability, and ultimately trust.
Conclusion
This work aims to reveal how much we still don’t understand about what our models have learned and when they’ll fail.
The most valuable contribution isn’t the highest performance metric. It’s the framework for rigorous evaluation and honest uncertainty quantification. In healthcare AI, knowing the boundaries of our knowledge isn’t just good science—it’s an ethical imperative. Therefore, before deploying any AI system in clinical settings, we must answer not just “Does it work?” but “How does it work?”, “When does it fail?”, and “How do we know?”
Read the full technical report: What a Standard ResNet Learns on NIH Chest X-rays—and What We Can (and Can’t) Infer From It