Chest X-ray Imaging Report

1. Introduction

Chest X-ray imaging is one of the most widely used and cost-effective diagnostic tools for identifying thoracic abnormalities, including pulmonary infections, pleural conditions, and cardiovascular diseases. Despite its clinical importance, accurate interpretation of chest radiographs is challenging due to subtle visual patterns, variability in image quality, and increasing workload on radiologists. These challenges motivate the development of automated deep learning systems that can assist in reliable and scalable diagnosis.

In this work, we address the task of classifying chest X-ray images into 20 thoracic pathology categories, including a No Finding class. Although the dataset is provided in a one-hot encoded format, exploratory data analysis confirms that each image is associated with exactly one active label, establishing this as a multi-class classification problem rather than a multi-label setting. The dataset consists of over 51,000 training images and 17,000 test images, with all images having consistent spatial dimensions, enabling a stable and uniform preprocessing pipeline.
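The one-hot check described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study: the helper name `one_hot_to_index` and the 4-column toy rows are assumptions standing in for the real 20-column label table.

```python
def one_hot_to_index(row):
    """Return the index of the single active label; raise if the row is not one-hot."""
    if any(v not in (0, 1) for v in row):
        raise ValueError(f"row contains values other than 0/1: {row}")
    active = [i for i, v in enumerate(row) if v == 1]
    if len(active) != 1:
        raise ValueError(f"row is not one-hot: {row}")
    return active[0]


# Hypothetical 4-class rows standing in for the 20-column label table.
rows = [
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 1],
]
labels = [one_hot_to_index(r) for r in rows]
print(labels)  # [2, 0, 3]
```

If every row passes this check, the labels can be stored as integer class indices, which is the form a multi-class cross-entropy loss expects.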

A key challenge in the dataset is severe class imbalance. The No Finding class accounts for far more samples than any other class, while several pathological conditions are underrepresented, with some classes containing fewer than 50 instances. Additionally, analysis of pixel intensity distributions reveals that most diagnostic information is concentrated in a narrow mid-gray range, making contrast and normalization critical for effective feature learning. The dataset also exhibits potential spurious correlations arising from variations in imaging conditions, which may lead models to learn non-clinical shortcuts if not handled carefully.

Another important aspect of this problem is the asymmetric cost associated with prediction errors. In clinical practice, failing to detect a disease (false negative) is considerably more critical than incorrectly predicting its presence (false positive). This is reflected in the competition's evaluation metric, which imposes a substantially higher penalty on false negatives. Therefore, model development must prioritize sensitivity to pathological cases while maintaining balanced performance across all classes.
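To make the asymmetry concrete, the sketch below scores predictions with a heavier penalty for missed disease than for false alarms. The exact competition metric is not reproduced in this report, so the function name `asymmetric_score`, the penalty values (5, 1, 1), and the choice of class 0 as No Finding are all hypothetical and serve only to illustrate the idea.

```python
def asymmetric_score(y_true, y_pred, normal_class=0,
                     fn_penalty=5.0, fp_penalty=1.0, other_penalty=1.0):
    """Mean negative penalty; 0 is a perfect score. All penalty values are hypothetical."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        if t == p:
            continue
        if t != normal_class and p == normal_class:
            total += fn_penalty      # disease present but predicted as normal
        elif t == normal_class and p != normal_class:
            total += fp_penalty      # normal image flagged as diseased
        else:
            total += other_penalty   # disease detected but wrong category
    return -total / len(y_true)


y_true = [0, 1, 2, 0]
y_pred = [0, 0, 2, 1]   # one missed disease, one false alarm
score = asymmetric_score(y_true, y_pred)
print(score)  # -1.5
```

Under a scheme like this, a model that always predicts the dominant class can reach high accuracy yet score poorly, which is why sensitivity to pathological cases matters more than raw accuracy.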

The objective of this study is to develop a robust deep learning-based classification framework that effectively captures discriminative features from chest X-ray images while addressing class imbalance and minimizing high-cost errors. The proposed approach focuses on optimizing performance under the given asymmetric scoring function, ensuring that the model does not bias toward dominant classes and remains sensitive to rare but clinically significant conditions.

2. Methodology

To address the chest X-ray classification task, a deep learning-based approach was adopted using convolutional neural networks with transfer learning. The primary model used in this work is ConvNeXt-Tiny, a modern convolutional architecture that adopts design principles from transformer-based models while retaining the efficiency of CNNs. ConvNeXt was chosen for its strong performance on image classification benchmarks and its ability to capture hierarchical visual features, which is particularly important for identifying subtle patterns in medical images.

The model was initialized with ImageNet-pretrained weights and trained using a staged fine-tuning strategy, where the backbone was initially frozen to stabilize optimization and subsequently unfrozen to enable full adaptation to the target dataset.

2.1 Data Preprocessing

All images were resized to a fixed resolution (224×224 for Model A and 256×256 for Model B) to ensure compatibility with the model architecture and to maintain computational efficiency. Since pretrained weights were used, ImageNet normalization was applied to align the input distribution with the pretraining domain.

Given the nature of medical images, where important features are often subtle intensity variations, normalization plays a critical role in improving model sensitivity. Additionally, all images were converted from grayscale to three-channel format to match the expected input of pretrained models.
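The channel replication and ImageNet normalization steps can be illustrated as follows. This is a library-free sketch under the assumption that pixels are 8-bit values in 0..255; the function name `gray_to_normalized_rgb` is hypothetical, resizing is omitted, and the mean and standard deviation are the standard ImageNet statistics.

```python
# Standard ImageNet per-channel statistics (R, G, B) for inputs scaled to [0, 1].
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def gray_to_normalized_rgb(gray):
    """Replicate a 2D grayscale image (0..255) into 3 channels and normalize each one."""
    channels = []
    for mean, std in zip(IMAGENET_MEAN, IMAGENET_STD):
        # Same grayscale values in every channel; only the statistics differ.
        channels.append([[(px / 255.0 - mean) / std for px in row] for row in gray])
    return channels


image = [[0, 128], [255, 64]]          # toy 2x2 grayscale image
tensor = gray_to_normalized_rgb(image)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # 3 2 2
```

The result is in channel-first layout (3 x H x W), the layout PyTorch image models expect.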

2.2 Data Augmentation

To improve generalization and reduce overfitting, a set of controlled data augmentation techniques was applied using the Albumentations library. These augmentations were carefully chosen to preserve clinical relevance while introducing variability:

Horizontal flipping (to simulate anatomical variations)
ShiftScaleRotate (small rotations, scaling, and translations)
Brightness and contrast adjustments

Aggressive augmentations were avoided to prevent distortion of medically relevant features.

2.3 Handling Class Imbalance

The dataset exhibits significant class imbalance, with certain classes being heavily underrepresented. To address this, a WeightedRandomSampler was used during training to ensure that minority classes are sampled more frequently. This approach helps the model learn from rare classes without discarding valuable majority class data.
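A common way to derive the per-sample weights for such a sampler is inverse class frequency. The sketch below is an assumption about how the weights could be computed, not the exact scheme used here; it simulates the draw with the standard library's `random.choices`, whereas in PyTorch the same weight list would be passed to `torch.utils.data.WeightedRandomSampler`. The toy class counts are hypothetical.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """One weight per sample, equal to 1 / (number of samples in its class)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# Hypothetical imbalanced toy labels: 90 / 9 / 1 samples per class.
labels = [0] * 90 + [1] * 9 + [2] * 1
weights = inverse_frequency_weights(labels)

# Sampling with replacement using these weights draws each class about equally often.
rng = random.Random(0)
drawn = rng.choices(labels, weights=weights, k=3000)
print(Counter(drawn))
```

Because sampling is done with replacement, rare-class images are repeated many times per epoch, which is one reason augmentation is usually paired with this strategy.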

2.4 Loss Function and Optimization

The model was trained using CrossEntropyLoss with label smoothing, which helps improve generalization and reduces overconfidence in predictions. The optimizer used was AdamW, which provides better regularization through decoupled weight decay. A learning rate scheduler (ReduceLROnPlateau) was employed to adapt the learning rate based on validation performance.
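For reference, label smoothing replaces the one-hot target with a mixture of the one-hot vector and a uniform distribution over the K classes. The sketch below is a plain-Python illustration of that definition (the same target construction PyTorch's `CrossEntropyLoss(label_smoothing=...)` uses); the function name and example logits are hypothetical.

```python
import math


def smoothed_cross_entropy(logits, target, smoothing=0.05):
    """Cross-entropy against the target (1 - s) * one_hot + s / K * uniform."""
    k = len(logits)
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))  # stable log-sum-exp
    log_probs = [x - log_z for x in logits]
    loss = 0.0
    for i, lp in enumerate(log_probs):
        q = smoothing / k + ((1.0 - smoothing) if i == target else 0.0)
        loss -= q * lp
    return loss


logits = [2.0, 0.0, 0.0]
hard_loss = smoothed_cross_entropy(logits, target=0, smoothing=0.0)
soft_loss = smoothed_cross_entropy(logits, target=0, smoothing=0.05)
print(hard_loss, soft_loss)
```

For a confident, correct prediction the smoothed loss is larger than the unsmoothed one, which is the mechanism that discourages overconfident logits.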

Mixed precision training was utilized to improve computational efficiency and enable faster training on GPU hardware.

2.5 Regularization and Training Stability

Several techniques were incorporated to stabilize training and prevent overfitting:

Dropout in the classification head
Gradient clipping to avoid exploding gradients
Early stopping based on validation score
Learning rate reduction on plateau
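The interaction between plateau-based learning rate reduction and early stopping can be sketched with a small tracker for a higher-is-better validation score. This is a simplified illustration: the class name `PlateauTracker`, the factor, and both patience values are assumptions, and its logic is intentionally simpler than PyTorch's `ReduceLROnPlateau`. Dropout and gradient clipping are not shown.

```python
class PlateauTracker:
    """Tracks a higher-is-better validation score (e.g. a negative penalty score)."""

    def __init__(self, lr, factor=0.5, lr_patience=2, stop_patience=5):
        self.lr = lr
        self.factor = factor
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float("-inf")
        self.bad_epochs = 0

    def step(self, score):
        """Update state with this epoch's score; return True when training should stop."""
        if score > self.best:
            self.best = score
            self.bad_epochs = 0
            return False
        self.bad_epochs += 1
        # Cut the learning rate after every `lr_patience` consecutive epochs without improvement.
        if self.bad_epochs % self.lr_patience == 0:
            self.lr *= self.factor
        return self.bad_epochs >= self.stop_patience


tracker = PlateauTracker(lr=8e-5)
stopped_at = None
for epoch, score in enumerate([-5.6, -5.4, -5.5, -5.45, -5.41, -5.42, -5.43]):
    if tracker.step(score):
        stopped_at = epoch
        break
print(stopped_at, tracker.best, tracker.lr)
```

In this toy run the score never beats -5.4 after the second epoch, so the learning rate is halved twice before training stops.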

2.6 Inference Strategy

During inference, a strict multi-class prediction strategy was followed using softmax activation and argmax selection to ensure that exactly one class is predicted per image. To further improve performance, Test-Time Augmentation (TTA) was applied by averaging predictions from original and horizontally flipped images.

Additionally, temperature scaling was used to calibrate prediction confidence before final class selection, which helps improve robustness under the competition's asymmetric scoring metric.
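The inference steps can be sketched as temperature-scaled softmax, averaging over the two views, and argmax. The function names and example logits below are hypothetical. One point worth noting: dividing a single logit vector by a positive temperature never changes its argmax, so in a pipeline like this the temperature can alter the final class only through the averaging of probabilities across TTA views.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax of logits / temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_tta(logits_original, logits_flipped, temperature=1.5):
    """Average the calibrated probabilities of both views, then pick exactly one class."""
    p_orig = softmax(logits_original, temperature)
    p_flip = softmax(logits_flipped, temperature)
    avg = [(a + b) / 2.0 for a, b in zip(p_orig, p_flip)]
    predicted = max(range(len(avg)), key=avg.__getitem__)
    return predicted, avg


# Hypothetical logits for the original and horizontally flipped image (3 classes).
predicted_class, probs = predict_with_tta([2.0, 1.0, 0.1], [1.5, 1.8, 0.2], temperature=1.5)
print(predicted_class, probs)
```

Here the two views disagree on the top class, and the averaged probabilities resolve the disagreement in favour of the view with the larger margin.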

3. Experiments and Results

To systematically improve model performance under the competition's asymmetric scoring metric, a series of controlled experiments was conducted. These experiments focused on optimizing model architecture, training dynamics, input resolution, and inference calibration. The objective was to identify configurations that improve sensitivity to minority classes while maintaining overall stability.

3.1 Validation Strategy

The dataset was split into training and validation sets using an 80:20 stratified split, ensuring that class distribution was preserved across both sets. This was particularly important due to the severe class imbalance observed in the dataset.
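A stratified split can be sketched by shuffling and splitting the indices of each class separately. This is a minimal standard-library illustration; the function name, seed, and toy class counts are assumptions, and in practice scikit-learn's `train_test_split` with its `stratify` argument is a common way to obtain the same effect.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, val_fraction=0.2, seed=42):
    """Split indices so each class contributes about val_fraction of its samples to validation."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    rng = random.Random(seed)
    train_idx, val_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        # Keep at least one validation sample for any class with two or more images.
        n_val = max(1, round(len(indices) * val_fraction)) if len(indices) > 1 else 0
        val_idx.extend(indices[:n_val])
        train_idx.extend(indices[n_val:])
    return sorted(train_idx), sorted(val_idx)


# Hypothetical imbalanced toy labels: 100 / 10 / 5 samples per class.
labels = [0] * 100 + [1] * 10 + [2] * 5
train_idx, val_idx = stratified_split(labels)
val_counts = Counter(labels[i] for i in val_idx)
print(len(train_idx), len(val_idx), dict(val_counts))
```

Splitting per class guarantees that even the rarest classes appear in the validation set, which a purely random 80:20 split does not.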

Model performance was evaluated using:

Competition Score (primary metric)
Macro F1-score
Validation Loss

The competition score was used as the main criterion for:

Model selection
Learning rate scheduling
Early stopping
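Macro F1 averages the per-class F1 scores with equal weight, so rare classes count as much as No Finding. The sketch below is a plain-Python illustration with hypothetical toy labels; it assigns an F1 of 0 to any class that has no true and no predicted samples.

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores."""
    f1_scores = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / num_classes


# A model that always predicts the majority class: 80% accuracy, low macro F1.
y_true = [0] * 8 + [1, 2]
y_pred = [0] * 10
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred, num_classes=3)
print(accuracy, round(score, 4))  # 0.8 0.2963
```

This toy case shows how a majority-biased model can look acceptable on accuracy while macro F1 stays low, which is why both are tracked alongside the competition score.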

3.2 Experimental Setup and Parameter Exploration

A series of experiments was conducted by varying key hyperparameters and training strategies. The exploration focused on identifying configurations that train stably and score well.

Learning Rate Exploration

Learning Rate	Observation
3e-4	Unstable training, poor convergence
1e-4	Moderate performance
8e-5	Stable training, best baseline (Model A)
5e-5	Improved fine-tuning stability (Model B)
<1e-5	Very slow convergence

Insight: ConvNeXt demonstrated optimal performance in the range of 5e-5 to 1e-4, balancing convergence speed and stability.

Input Resolution Study

Image Size	Training Speed	Performance	Observation
224	Fast	Baseline	Stable but limited feature capture
256	Moderate	Best	Improved representation and generalization
≥320	Slow	Degraded	Overfitting and instability

Insight: Increasing resolution from 224 → 256 provided the most significant performance gain.

Label Smoothing Analysis

Model	Label Smoothing	Effect
Model A	0.05	Improved stability, reduced overconfidence
Model B	0.03	Balanced stability and prediction sharpness

Insight: Label smoothing improved generalization but required careful tuning to avoid overly soft predictions.

Data Augmentation Variations

Configuration	Augmentation Strength	Result
Baseline	Mild	Stable training
Increased	Moderate	Improved generalization
Aggressive	High	Degraded performance

Insight: Medical images require subtle augmentations, as aggressive transformations can distort clinically relevant features.

Inference Calibration (Temperature Scaling)

Model	Temperature	Effect
Model A	1.5	Smoother predictions, reduced overconfidence
Model B	1.2	Sharper predictions than Model A, better class separation

Insight: Temperature scaling significantly improved alignment with the asymmetric penalty structure, balancing false positives and false negatives.

3.3 Model Development and Comparison

Two primary models were developed through iterative refinement.

Model A (Baseline)

Image size: 224
Learning rate: 8e-5
Label smoothing: 0.05
Moderate augmentation
Temperature: 1.5

This model established a stable baseline with consistent convergence and reasonable performance.

Model B (Improved Model)

Image size: 256
Learning rate: 1e-4
Label smoothing: 0.03
Refined augmentation strategy
Temperature: 1.2

Model B was developed by building upon Model A and introducing targeted improvements aimed at enhancing feature representation and prediction sharpness.

3.4 Training Dynamics

The training process revealed important patterns:

Training loss decreased steadily, indicating effective optimization
Validation score plateaued early, suggesting difficulty in learning rare classes
Improvements in accuracy did not always correspond to improvements in competition score

This behavior is explained by:

Severe class imbalance
High penalty for false negatives

Key Insight: Improving accuracy alone is insufficient; optimizing for the competition metric requires careful calibration and sensitivity to minority classes.

3.5 Results

Quantitative Comparison

Model	Image Size	Learning Rate	Label Smoothing	Validation Score	Kaggle Score
Model A	224	8e-5	0.05	~ -5.40	-4.968755
Model B	256	1e-4	0.03	~ -4.88	-4.94581

Performance Metrics

Model	Accuracy	Macro F1 Score	Observation
Model A	Moderate	~0.13	Stable but less sensitive to rare classes
Model B	Higher	~0.135	Improved balance and generalization

3.6 Key Observations and Insights

1. Input Resolution Impact

Increasing image size from 224 to 256 provided the largest improvement in performance, enabling better feature extraction.

2. Calibration is Critical

Temperature scaling had a significant impact on the competition score by improving the balance between false positives and false negatives.

3. Label Smoothing Trade-off

While label smoothing improved stability, reducing it allowed sharper predictions and better performance in later stages.

4. Effect of Class Imbalance

The model showed a natural bias toward dominant classes, highlighting the importance of sampling strategies and calibration.

5. Metric-Specific Optimization

The asymmetric scoring function required focusing on minimizing false negatives rather than maximizing overall accuracy.

6. Validation vs Leaderboard Gap

Differences between validation and Kaggle scores suggest a slight distribution mismatch between the validation split and the test set, a common challenge in medical datasets.

4. Conclusion

In this work, a deep learning-based pipeline was developed for multi-class classification of chest X-ray images across 20 thoracic pathologies. A ConvNeXt-Tiny architecture with ImageNet pretraining was employed, and performance was improved through staged fine-tuning, careful hyperparameter tuning, and calibrated inference strategies. The approach incorporated stratified data splitting, controlled data augmentation, and imbalance-aware sampling to ensure stable and representative learning across all classes.

Through systematic experimentation, it was observed that model performance was highly sensitive to input resolution, learning rate, and prediction calibration. Increasing image resolution from 224 to 256 yielded the most significant improvement by enhancing feature representation. Additionally, temperature scaling during inference played a critical role in aligning predictions with the asymmetric evaluation metric, improving the balance between false positives and false negatives.

A key insight from this study is that optimizing for standard metrics such as accuracy or F1-score does not necessarily translate to improved performance under domain-specific evaluation criteria. In this case, minimizing false negatives was essential due to the higher penalty associated with missed diagnoses. Furthermore, attempts to aggressively address class imbalance using class weighting were found to be unstable, highlighting the importance of calibrated and controlled approaches.

Overall, the results demonstrate that effective performance in medical image classification tasks requires not only strong model architectures but also careful consideration of dataset characteristics, evaluation metrics, and training dynamics. Future improvements could include ensemble methods and more advanced calibration techniques to further enhance robustness and generalization.

5. References

Kaggle Competition: Chest X-ray Multi-class Classification Challenge. https://www.kaggle.com/competitions/26-t-1-dl-gen-ainppe-1/overview
Dataset Source. https://www.kaggle.com/competitions/26-t-1-dl-gen-ainppe-1/data
Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s (ConvNeXt). https://arxiv.org/abs/2201.03545
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A Large-Scale Hierarchical Image Database. https://www.image-net.org/
Buslaev, A., Parinov, A., Khvedchenya, E., Iglovikov, V. I., & Kalinin, A. A. (2020). Albumentations: Fast and Flexible Image Augmentations. https://albumentations.ai/docs/