1. Introduction
Chest X-ray imaging is one of the most widely used and cost-effective diagnostic tools for identifying thoracic abnormalities, including pulmonary infections, pleural conditions, and cardiovascular diseases. Despite its clinical importance, accurate interpretation of chest radiographs is challenging due to subtle visual patterns, variability in image quality, and increasing workload on radiologists. These challenges motivate the development of automated deep learning systems that can assist in reliable and scalable diagnosis.
In this work, we address the task of classifying chest X-ray images into 20 thoracic pathology categories, including a No Finding class. Although the dataset is provided in a one-hot encoded format, exploratory data analysis confirms that each image is associated with exactly one active label, establishing this as a multi-class classification problem rather than a multi-label setting. The dataset consists of over 51,000 training images and 17,000 test images, with all images having consistent spatial dimensions, enabling a stable and uniform preprocessing pipeline.
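The one-label-per-image check described above can be sketched as follows. This is a minimal illustration rather than the code used in this work; the function name `onehot_to_class_indices` and the toy label rows are assumptions made for the example.

```python
def onehot_to_class_indices(rows):
    """Convert one-hot label rows to class indices.

    Raises ValueError if any row does not have exactly one active label,
    which would indicate a multi-label (or unlabeled) sample.
    """
    indices = []
    for i, row in enumerate(rows):
        active = [j for j, value in enumerate(row) if value == 1]
        if len(active) != 1:
            raise ValueError(f"row {i} has {len(active)} active labels, expected 1")
        indices.append(active[0])
    return indices


# Toy example with 4 classes (the real dataset has 20).
toy_labels = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
class_ids = onehot_to_class_indices(toy_labels)
print(class_ids)  # [2, 0, 3]
```

If every row passes this check, the one-hot matrix can be replaced by a single integer label per image, which is the form expected by a standard multi-class loss.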
A key challenge observed in the dataset is severe class imbalance. The No Finding class dominates the distribution with a significantly higher number of samples, while several pathological conditions are underrepresented, with some classes containing fewer than 50 instances. Additionally, analysis of pixel intensity distributions reveals that most diagnostic information is concentrated in a narrow mid-gray range, making contrast and normalization critical for effective feature learning. The dataset also exhibits potential spurious correlations arising from variations in imaging conditions, which may lead models to learn non-clinical shortcuts if not handled carefully.
Another important aspect of this problem is the asymmetric cost associated with prediction errors. In clinical practice, failing to detect a disease (false negative) is considerably more critical than incorrectly predicting its presence (false positive). This is reflected in the competition's evaluation metric, which imposes a substantially higher penalty on false negatives. Therefore, model development must prioritize sensitivity to pathological cases while maintaining balanced performance across all classes.
The objective of this study is to develop a robust deep learning-based classification framework that effectively captures discriminative features from chest X-ray images while addressing class imbalance and minimizing high-cost errors. The proposed approach focuses on optimizing performance under the given asymmetric scoring function, ensuring that the model does not bias toward dominant classes and remains sensitive to rare but clinically significant conditions.
2. Methodology
To address the chest X-ray classification task, a deep learning-based approach was adopted using convolutional neural networks with transfer learning. The primary model used in this work is ConvNeXt-Tiny, a modern convolutional architecture that adopts design principles from transformer-based models while retaining the efficiency of CNNs. ConvNeXt was chosen for its strong performance on image classification benchmarks and its ability to capture hierarchical visual features, which is particularly important for identifying subtle patterns in medical images.
The model was initialized with ImageNet-pretrained weights and trained using a staged fine-tuning strategy, where the backbone was initially frozen to stabilize optimization and subsequently unfrozen to enable full adaptation to the target dataset.
2.1 Data Preprocessing
All images were resized to a fixed resolution (224×224 for Model A and 256×256 for Model B; both models are described in Section 3.3) to ensure compatibility with the model architecture and to maintain computational efficiency. Since pretrained weights were used, ImageNet normalization was applied to align the input distribution with the pretraining domain.
Given the nature of medical images, where important features are often subtle intensity variations, normalization plays a critical role in improving model sensitivity. Additionally, all images were converted from grayscale to three-channel format to match the expected input of pretrained models.
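The channel replication and ImageNet normalization described above can be sketched in plain Python as follows. The mean and standard deviation values are the commonly used ImageNet statistics. The function name `gray_to_normalized_rgb` and the 2×2 toy image are assumptions for illustration; a real pipeline would apply the same arithmetic through a library transform on full-size arrays.

```python
# Commonly used ImageNet per-channel statistics (RGB order, inputs scaled to [0, 1]).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def gray_to_normalized_rgb(gray):
    """Replicate a grayscale image into 3 channels and apply ImageNet normalization.

    gray: 2D list (H x W) of pixel values in [0, 255].
    Returns a nested list of shape 3 x H x W.
    """
    channels = []
    for mean, std in zip(IMAGENET_MEAN, IMAGENET_STD):
        # Each output channel holds the same gray values, normalized
        # with that channel's own mean and standard deviation.
        channel = [[(pixel / 255.0 - mean) / std for pixel in row] for row in gray]
        channels.append(channel)
    return channels


toy_image = [[0, 128], [255, 64]]  # hypothetical 2x2 grayscale image
normalized = gray_to_normalized_rgb(toy_image)
print(len(normalized), len(normalized[0]), len(normalized[0][0]))  # 3 2 2
```

Because the three channels start from identical gray values, they differ after normalization only through the per-channel mean and standard deviation.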
2.2 Data Augmentation
To improve generalization and reduce overfitting, a set of controlled data augmentation techniques was applied using the Albumentations library. These augmentations were carefully chosen to preserve clinical relevance while introducing variability:
- Horizontal flipping (to simulate anatomical variations)
- ShiftScaleRotate (small rotations, scaling, and translations)
- Brightness and contrast adjustments
Aggressive augmentations were avoided to prevent distortion of medically relevant features.
2.3 Handling Class Imbalance
The dataset exhibits significant class imbalance, with certain classes being heavily underrepresented. To address this, a WeightedRandomSampler was used during training to ensure that minority classes are sampled more frequently. This approach helps the model learn from rare classes without discarding valuable majority class data.
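The idea behind weighted sampling can be sketched without any deep learning framework: each sample is given a weight inversely proportional to the size of its class, and indices are then drawn with replacement according to those weights. The inverse-frequency weighting, the toy class counts, and the function name below are assumptions for illustration; the exact weights used in this work are not specified.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Return one weight per sample, equal to 1 / (size of that sample's class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Hypothetical imbalanced label list: 90 / 9 / 1 samples for classes 0 / 1 / 2.
toy_labels = [0] * 90 + [1] * 9 + [2] * 1
weights = inverse_frequency_weights(toy_labels)

# Draw indices with replacement, proportionally to the per-sample weights.
rng = random.Random(0)
drawn = rng.choices(range(len(toy_labels)), weights=weights, k=3000)
drawn_counts = Counter(toy_labels[i] for i in drawn)
print(sorted(drawn_counts.items()))  # roughly 1000 draws per class
```

With these weights the total weight of every class is equal, so each class is drawn about equally often. The single sample of class 2 is therefore repeated many times per epoch, which is one reason to pair this kind of sampling with augmentation.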
2.4 Loss Function and Optimization
The model was trained using CrossEntropyLoss with label smoothing, which helps improve generalization and reduces overconfidence in predictions. The optimizer used was AdamW, which provides better regularization through decoupled weight decay. A learning rate scheduler (ReduceLROnPlateau) was employed to adapt the learning rate based on validation performance.
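To make the effect of label smoothing concrete, the following sketch computes a smoothed cross-entropy for a single sample in plain Python. It assumes the common formulation in which the target distribution is (1 − ε) times the one-hot vector plus ε/K spread over all K classes. The function names and the example logits are illustrative.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def smoothed_cross_entropy(logits, target, smoothing=0.05):
    """Cross-entropy against a label-smoothed target distribution."""
    k = len(logits)
    log_probs = log_softmax(logits)
    loss = 0.0
    for j, log_p in enumerate(log_probs):
        # Smoothed target: eps / K on every class, plus (1 - eps) on the true class.
        q = smoothing / k + ((1.0 - smoothing) if j == target else 0.0)
        loss -= q * log_p
    return loss


confident_logits = [5.0, 0.0, 0.0]  # hypothetical, confidently correct prediction
hard_loss = smoothed_cross_entropy(confident_logits, target=0, smoothing=0.0)
soft_loss = smoothed_cross_entropy(confident_logits, target=0, smoothing=0.05)
print(hard_loss < soft_loss)  # True
```

The smoothed loss is higher for a confidently correct prediction. This penalty on pushing probabilities to the extremes is how label smoothing discourages overconfidence.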
Mixed precision training was utilized to improve computational efficiency and enable faster training on GPU hardware.
2.5 Regularization and Training Stability
Several techniques were incorporated to stabilize training and prevent overfitting:
- Dropout in the classification head
- Gradient clipping to avoid exploding gradients
- Early stopping based on validation score
- Learning rate reduction on plateau
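Two of the techniques listed above, learning rate reduction on plateau and early stopping, follow the same bookkeeping pattern and can be sketched together. The class name `PlateauController`, the patience values, the reduction factor, and the toy scores are all assumptions for illustration, not the settings used in this work. The score is treated as "higher is better".

```python
class PlateauController:
    """Tracks a validation score, halves the LR on plateau, and signals early stop."""

    def __init__(self, lr, factor=0.5, lr_patience=2, stop_patience=4, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience      # non-improving epochs before an LR cut
        self.stop_patience = stop_patience  # non-improving epochs before stopping
        self.min_lr = min_lr
        self.best = None
        self.bad_epochs = 0     # epochs since the best score
        self.bad_since_cut = 0  # epochs since the best score or the last LR cut

    def step(self, score):
        """Record one epoch's validation score; return True if training should stop."""
        if self.best is None or score > self.best:
            self.best = score
            self.bad_epochs = 0
            self.bad_since_cut = 0
            return False
        self.bad_epochs += 1
        self.bad_since_cut += 1
        if self.bad_since_cut >= self.lr_patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_since_cut = 0
        return self.bad_epochs >= self.stop_patience


controller = PlateauController(lr=8e-5)
toy_scores = [0.10, 0.20, 0.30, 0.30, 0.29, 0.30, 0.28, 0.31]  # made-up values
stopped_at = None
for epoch, score in enumerate(toy_scores):
    if controller.step(score):
        stopped_at = epoch
        break
print(stopped_at, controller.lr)
```

In this toy run the score stops improving after the third epoch, the learning rate is halved twice, and training stops before the final value in the list is seen. In a real training loop the checkpoint with the best score would be saved on each improvement and restored after stopping.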
2.6 Inference Strategy
During inference, a strict multi-class prediction strategy was followed using softmax activation and argmax selection to ensure that exactly one class is predicted per image. To further improve performance, Test-Time Augmentation (TTA) was applied by averaging predictions from original and horizontally flipped images.
Additionally, temperature scaling was used to calibrate prediction confidence before final class selection, which helps improve robustness under the competition's asymmetric scoring metric.
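The inference steps above can be sketched as follows. The order of operations (temperature-scaled softmax per view, then averaging, then argmax) is an assumption, as are the function names and the example logits. Note that dividing logits by a positive temperature preserves their ordering, so for a single view the argmax is unchanged. Temperature can influence the selected class only through the averaging of probabilities across views.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_tta(view_logits, temperature=1.0):
    """Average temperature-scaled probabilities over views, then take the argmax.

    view_logits: one list of class logits per view
    (for example, the original image and its horizontal flip).
    Returns (predicted_class, averaged_probabilities).
    """
    num_classes = len(view_logits[0])
    per_view = [softmax(logits, temperature) for logits in view_logits]
    mean_probs = [
        sum(p[j] for p in per_view) / len(per_view) for j in range(num_classes)
    ]
    predicted = max(range(num_classes), key=mean_probs.__getitem__)
    return predicted, mean_probs


# Hypothetical logits for the original and the flipped view of one image.
original_view = [2.0, 0.5, -1.0]
flipped_view = [1.5, 1.0, -0.5]
pred, probs = predict_with_tta([original_view, flipped_view], temperature=1.5)
print(pred)  # 0
```

Raising the temperature flattens the averaged distribution, which lowers the reported confidence of the winning class.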
3. Experiments and Results
To systematically improve model performance under the competition's asymmetric scoring metric, a series of controlled experiments was conducted. These experiments focused on optimizing model architecture, training dynamics, input resolution, and inference calibration. The objective was to identify configurations that improve sensitivity to minority classes while maintaining overall stability.
3.1 Validation Strategy
The dataset was split into training and validation sets using an 80:20 stratified split, ensuring that class distribution was preserved across both sets. This was particularly important due to the severe class imbalance observed in the dataset.
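A stratified 80:20 split can be sketched in plain Python by splitting each class separately. The function name, the seed, and the rule of keeping at least one sample on each side for classes with two or more samples are assumptions for illustration. In practice a library routine would typically be used instead.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split sample indices into train/validation sets, class by class."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for y in sorted(by_class):
        members = by_class[y]
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        if len(members) >= 2:
            # Keep at least one sample of the class on each side of the split.
            n_val = min(max(n_val, 1), len(members) - 1)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced labels: 100 / 10 / 3 samples for classes 0 / 1 / 2.
toy_labels = [0] * 100 + [1] * 10 + [2] * 3
train_idx, val_idx = stratified_split(toy_labels)
val_counts = Counter(toy_labels[i] for i in val_idx)
print(sorted(val_counts.items()))  # [(0, 20), (1, 2), (2, 1)]
```

Splitting per class guarantees that even the rarest classes appear in the validation set, which a purely random split does not.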
Model performance was evaluated using:
- Competition Score (primary metric)
- Macro F1-score
- Validation Loss
The competition score was used as the main criterion for:
- Model selection
- Learning rate scheduling
- Early stopping
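Of the metrics listed above, macro F1 is the one that most directly exposes neglect of rare classes, because every class contributes equally regardless of its size. The sketch below computes it from scratch. The function name, the convention of scoring a class with no true or predicted samples as 0, and the toy labels are assumptions for illustration. The competition score itself is not reproduced here, since its exact definition is not given in this report.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / num_classes


# Toy case: a model that always predicts the majority class.
y_true = [0] * 8 + [1] + [2]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                             # 0.8
print(round(macro_f1(y_true, y_pred, 3), 3))  # 0.296
```

The majority-only predictor reaches 80% accuracy but a macro F1 below 0.3, which illustrates why accuracy alone is a poor guide on an imbalanced dataset.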
3.2 Experimental Setup and Parameter Exploration
A series of experiments was conducted by varying key hyperparameters and training strategies. The exploration focused on identifying stable and performant configurations.
Learning Rate Exploration
| Learning Rate | Observation |
|---|---|
| 3e-4 | Unstable training, poor convergence |
| 1e-4 | Moderate performance |
| 8e-5 | Stable training, best baseline (Model A) |
| 5e-5 | Improved fine-tuning stability (Model B) |
| <1e-5 | Very slow convergence |
Insight: ConvNeXt demonstrated optimal performance in the range of 5e-5 to 1e-4, balancing convergence speed and stability.
Input Resolution Study
| Image Size | Training Speed | Performance | Observation |
|---|---|---|---|
| 224 | Fast | Baseline | Stable but limited feature capture |
| 256 | Moderate | Best | Improved representation and generalization |
| ≥320 | Slow | Degraded | Overfitting and instability |
Insight: Increasing resolution from 224 → 256 provided the most significant performance gain.
Label Smoothing Analysis
| Model | Label Smoothing | Effect |
|---|---|---|
| Model A | 0.05 | Improved stability, reduced overconfidence |
| Model B | 0.03 | Balanced stability and prediction sharpness |
Insight: Label smoothing improved generalization but required careful tuning to avoid overly soft predictions.
Data Augmentation Variations
| Configuration | Augmentation Strength | Result |
|---|---|---|
| Baseline | Mild | Stable training |
| Increased | Moderate | Improved generalization |
| Aggressive | High | Degraded performance |
Insight: Medical images require subtle augmentations, as aggressive transformations can distort clinically relevant features.
Inference Calibration (Temperature Scaling)
| Model | Temperature | Effect |
|---|---|---|
| Model A | 1.5 | Smoother predictions, reduced overconfidence |
| Model B | 1.2 | Sharper predictions, better class separation |
Insight: Temperature scaling significantly improved alignment with the asymmetric penalty structure, balancing false positives and false negatives.
3.3 Model Development and Comparison
Two primary models were developed through iterative refinement.
Model A (Baseline)
- Image size: 224
- Learning rate: 8e-5
- Label smoothing: 0.05
- Moderate augmentation
- Temperature: 1.5
This model established a stable baseline with consistent convergence and reasonable performance.
Model B (Improved Model)
- Image size: 256
- Learning rate: 1e-4
- Label smoothing: 0.03
- Refined augmentation strategy
- Temperature: 1.2
Model B was developed by building upon Model A and introducing targeted improvements aimed at enhancing feature representation and prediction sharpness.
3.4 Training Dynamics
The training process revealed important patterns:
- Training loss decreased steadily, indicating effective optimization
- Validation score plateaued early, suggesting difficulty in learning rare classes
- Improvements in accuracy did not always correspond to improvements in competition score
This behavior is explained by:
- Severe class imbalance
- High penalty for false negatives
Key Insight: Improving accuracy alone is insufficient; optimizing for the competition metric requires careful calibration and sensitivity to minority classes.
3.5 Results
Quantitative Comparison
| Model | Image Size | Learning Rate | Label Smoothing | Validation Score | Kaggle Score |
|---|---|---|---|---|---|
| Model A | 224 | 8e-5 | 0.05 | ~ -5.40 | -4.968755 |
| Model B | 256 | 1e-4 | 0.03 | ~ -4.88 | -4.94581 |
Performance Metrics
| Model | Accuracy | Macro F1 Score | Observation |
|---|---|---|---|
| Model A | Moderate | ~0.13 | Stable but less sensitive to rare classes |
| Model B | Higher | ~0.135 | Improved balance and generalization |
3.6 Key Observations and Insights
1. Input Resolution Impact
Increasing image size from 224 to 256 provided the largest improvement in performance, enabling better feature extraction.
2. Calibration is Critical
Temperature scaling had a significant impact on the competition score by improving the balance between false positives and false negatives.
3. Label Smoothing Trade-off
While label smoothing improved stability, reducing it allowed sharper predictions and better performance in later stages.
4. Effect of Class Imbalance
The model showed a natural bias toward dominant classes, highlighting the importance of sampling strategies and calibration.
5. Metric-Specific Optimization
The asymmetric scoring function required focusing on minimizing false negatives rather than maximizing overall accuracy.
6. Validation vs Leaderboard Gap
Differences between validation and Kaggle scores suggest a slight distribution mismatch between the validation split and the test set, a common challenge in medical datasets.
4. Conclusion
In this work, a deep learning-based pipeline was developed for multi-class classification of chest X-ray images into 20 categories (thoracic pathologies plus a No Finding class). A ConvNeXt-Tiny architecture with ImageNet pretraining was employed, and performance was improved through staged fine-tuning, careful hyperparameter tuning, and calibrated inference strategies. The approach incorporated stratified data splitting, controlled data augmentation, and imbalance-aware sampling to ensure stable and representative learning across all classes.
Through systematic experimentation, it was observed that model performance was highly sensitive to input resolution, learning rate, and prediction calibration. Increasing image resolution from 224 to 256 yielded the most significant improvement by enhancing feature representation. Additionally, temperature scaling during inference played a critical role in aligning predictions with the asymmetric evaluation metric, improving the balance between false positives and false negatives.
A key insight from this study is that optimizing for standard metrics such as accuracy or F1-score does not necessarily translate to improved performance under domain-specific evaluation criteria. In this case, minimizing false negatives was essential due to the higher penalty associated with missed diagnoses. Furthermore, attempts to aggressively address class imbalance using class weighting were found to be unstable, highlighting the importance of calibrated and controlled approaches.
Overall, the results demonstrate that effective performance in medical image classification tasks requires not only strong model architectures but also careful consideration of dataset characteristics, evaluation metrics, and training dynamics. Future improvements could include ensemble methods and more advanced calibration techniques to further enhance robustness and generalization.
5. References
- Kaggle Competition: Chest X-ray Multi-class Classification Challenge. https://www.kaggle.com/competitions/26-t-1-dl-gen-ainppe-1/overview
- Dataset Source. https://www.kaggle.com/competitions/26-t-1-dl-gen-ainppe-1/data
- Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (ConvNeXt). https://arxiv.org/abs/2201.03545
- Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. https://www.image-net.org/
- Buslaev, A., Parinov, A., Khvedchenya, E., Iglovikov, V. I., & Kalinin, A. A. (2020). Albumentations: Fast and Flexible Image Augmentations. https://albumentations.ai/docs/