Radiology and Deep Learning

Detecting pneumonia opacities from chest X-Ray images using deep learning.

One day back in August, I was catching up with my best friend from high school, who is now a radiology resident. One thing led to another, and we started talking about our shared interest in artificial intelligence and machine learning and their possible applications in radiology. A couple of months after our talk, I stumbled upon a Kaggle challenge hosted by the Radiological Society of North America (RSNA). It was a competition that we could work on together, so I immediately called my friend. Joined by my brother, we formed a team to compete in this Kaggle challenge. After a month of hard work, we ended up finishing in the top 3%. In this blog post, I’d like to detail what we did during that exciting month.

Pneumonia and Lung Opacities

The goal of the competition was to develop a model that can detect lung opacities caused by pneumonia and draw bounding boxes around them. When a person develops pneumonia, the infected areas in their lungs show up as bright regions against the dark healthy regions in chest X-Rays, and these regions are called “opacities”. Detecting them is often the first step in diagnosing someone with pneumonia.

Example lung opacities. Taken from

One thing to note here is that not all opacities are caused by pneumonia, and this is one of the biggest challenges in detecting pneumonia opacities. As a result, in practice, radiologists use patient data to make the final assessment. For example, if a patient’s chest X-Ray shows opacities, and the patient has a fever and a cough, the patient most likely has pneumonia. You can read more about lung opacities in this Kaggle kernel.

Object Detection

The task of identifying pneumonia opacities and drawing bounding boxes around them can be reformulated as an object detection task. In object detection, we want to identify all the objects in a picture and draw bounding boxes around them, and you can easily see how object detection models can be used to detect pneumonia opacities. There are numerous object detection models, and they all share the general idea of dividing up the input image into grids or regions and trying to detect objects and generate coordinates for bounding boxes for each grid or region. I watched Andrew Ng’s online lectures to learn the basics before I jumped into actually implementing an object detection model.
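To make the grid idea a little more concrete, here is a minimal sketch of how a grid-based detector might decide which cell is responsible for a box: the cell that contains the box center. The function name, the 1024x1024 image size, the 8x8 grid and the example box are all hypothetical values chosen for illustration, and real models differ in the details.

```python
def grid_cell_for_box(box, image_size, grid_size):
    """Return the (row, col) of the grid cell that contains the box center.

    box is (x_min, y_min, width, height) in pixels.
    image_size is (image_width, image_height) in pixels.
    grid_size is the number of cells along each side of the grid.
    """
    x_min, y_min, width, height = box
    image_width, image_height = image_size
    center_x = x_min + width / 2
    center_y = y_min + height / 2
    # Clamp so that a center lying exactly on the far edge stays inside the grid.
    col = min(int(center_x / image_width * grid_size), grid_size - 1)
    row = min(int(center_y / image_height * grid_size), grid_size - 1)
    return row, col


# Hypothetical example: one box on a 1024x1024 image split into an 8x8 grid.
print(grid_cell_for_box((264, 152, 213, 379), (1024, 1024), 8))
```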

Object detection example. Taken from

When we entered the competition, there was only one month left before the due date. As a result, my strategy was to train as many object detection models as possible using existing implementations and ensemble the results at the end, instead of trying to implement custom object detection models. I ended up successfully training 7 models (YOLO v3 Darknet, Mask RCNN ResNet 50, Mask RCNN ResNet 101, SSD MobileNet v2, Faster RCNN Inception v2, Faster RCNN ResNet 50 and Faster RCNN ResNet101) and ensembling their results.

Google Object Detection API

My first objective in this competition was to find a mature library that implements one (or more) object detection models, and I hit the jackpot with Google Object Detection API. It is “an open source framework built on top of TensorFlow that makes it easy to construct, train and deploy object detection models.” It is maintained by Google itself, comes with a number of state-of-the-art object detection models already implemented, and lets you train these models easily with some configuration changes. You can see the details on how I trained the models here. I trained SSD MobileNet v2, Faster RCNN Inception v2, Faster RCNN ResNet 50 and Faster RCNN ResNet101 using this API. Out of the four models I trained, Faster RCNN ResNet101 turned out to perform the best. I did try to train a bigger model, Faster RCNN ResNet152, but it did not perform better than Faster RCNN ResNet101. I suspect this is due to the relatively small size of the training data.

Matterport Mask RCNN

The next model I trained was Mask RCNN. I used the implementation by Matterport, and referenced this Kaggle kernel heavily. You can see the details about how I trained the models here. Mask RCNN models turned out to be more “sensitive” to the opacities than the Faster RCNN models, detecting more opacities but also producing more false positives. Since Mask RCNN is an extended version of Faster RCNN with an additional branch that produces pixel masks for segmentation, my suspicion is that Mask RCNN’s segmentation branch somehow makes the whole model more sensitive to opacities. It would be great to investigate this phenomenon further in the future.

Darknet YOLO v3

The last model I trained was YOLO v3 with Darknet. I used the official implementation by the author of the YOLO paper. You can read more about the model on his website. It is implemented in C without using popular deep learning libraries like TensorFlow or PyTorch, so it was more difficult to train than the other models, and this Kaggle kernel helped a ton. You can check out the details of how I trained the model here.

Training the Models

On paper, training deep learning models seems simple enough: define the neural network, feed it training data, and validate it. Unfortunately, as with all things in life, it is not as easy as it looks. In this section, I would like to discuss some of the technical challenges my team and I had to overcome in training these models.


Unfortunately, the only person on the team who had a GPU powerful enough to train deep neural networks was me, and I only have a GeForce GTX 1060. As a result, we had to find more GPUs so that we could train models in parallel for faster iteration.

The first option we tried was Paperspace. Unfortunately, I was not satisfied with its offerings. It is definitely cheaper than AWS EC2 On-Demand instances, but you have to pay extra to give your VM a public IP (no extra cost on EC2). As a result, I had to use their clunky web-based terminal instead of SSH to interact with my VMs in order to avoid paying for a public IP. Also, every time I launched a new VM, I was charged the full monthly hard disk fee, which would have been prorated on AWS. That was an unpleasant surprise, to say the least. Moreover, AWS Spot Instances are cheaper than Paperspace VMs. For these reasons, I would not recommend Paperspace as it stands right now.

After Paperspace, my radiologist friend put us in touch with the group leader of the Big Data in Radiology Research Group at UCSF, who generously agreed to let us use some of the GPUs the group owns. One problem with the setup was that the GPU servers could only be accessed through the UCSF VPN, which meant that my friend’s laptop was the only device that could access them. After some trial and error, we ended up using TeamViewer to remotely log in to my friend’s laptop and then SSH into the GPU servers from the Ubuntu VirtualBox VM we installed on his laptop. It was clunky and slow, but it worked for us. Furthermore, in order to save setup time, I created Docker images using nvidia-docker and ran them instead of manually setting up the training pipeline.

For bigger models like Faster RCNN ResNet 101, I decided to use AWS Spot instances. My brother, who is a master’s student at Columbia, applied for AWS Educate, which gives students $100 in AWS credits. In order to minimize the time on expensive GPU instances, I launched cheaper generic instances to preprocess the data on an EBS volume, then mounted the volume on a GPU instance to train the model. I was worried about my Spot instances being interrupted in the middle of training, but in practice, it never happened. I used AWS Deep Learning AMI and nvidia-docker to save additional setup time.


For bigger models like Faster RCNN ResNet 152, I tried to speed up the training by using the Adam optimizer, but it ended up actually hurting the validation accuracy. I did some research into this topic and found a blog post that discussed this phenomenon in more detail. For now, I think it is wiser to avoid Adam unless absolutely necessary and to stick to more classic optimization techniques such as stochastic gradient descent.

Data Augmentation

I also observed that using more data augmentation does not necessarily help with training. When I was training a Faster RCNN model with the ResNet152 feature extractor, I thought it might help to use additional data augmentation in order to feed the model more data. Unfortunately, the validation accuracy ended up being lower than that of the model trained without the additional data augmentation. After some research, I found a blog post that explained how for certain datasets where the images are consistent, such as a self-driving car dataset and a chest X-Ray dataset, it is better not to add too much data augmentation and to let the model “overtrain” to the images. Moreover, some data augmentation techniques can actually alter the ground truth bounding boxes, and need to be used with caution. For example, if we adjust the hue of our chest X-Ray training images, the opacities may expand or shrink, and we’d have to adjust the ground truth bounding boxes accordingly.
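A simpler case of an augmentation that changes the ground truth is a horizontal flip: the image is mirrored, so every box has to be mirrored too. The sketch below is only an illustration of that bookkeeping, not something taken from our training pipeline. It assumes boxes are stored as (x_min, y_min, width, height) in pixels and represents the image as a plain list of pixel rows to stay dependency-free.

```python
def hflip_with_boxes(image, boxes):
    """Mirror an image left-to-right and mirror its bounding boxes to match.

    image is a list of rows, each row a list of pixel values.
    boxes is a list of (x_min, y_min, width, height) tuples in pixels.
    """
    image_width = len(image[0])
    flipped = [row[::-1] for row in image]
    # A box spanning [x, x + w) ends up spanning [W - (x + w), W - x) after the flip.
    new_boxes = [(image_width - (x + w), y, w, h) for (x, y, w, h) in boxes]
    return flipped, new_boxes


# Hypothetical 4x6 image with one bright pixel and a 1x1 box around it.
image = [[0] * 6 for _ in range(4)]
image[1][1] = 1
flipped, new_boxes = hflip_with_boxes(image, [(1, 1, 1, 1)])
print(new_boxes)
```

If the boxes were left untouched after the flip, they would point at the wrong lung.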

Validating the Models

Validation image example. Detected bounding boxes are green and ground truth boxes are black.

In order to validate my model, I randomly picked 20% of the training data as my validation data, and stopped the training when the validation accuracy stopped improving. After the training was done, I had to pick the right confidence threshold for bounding boxes. There was a trade-off: if the confidence threshold is too low, the model would produce too many false positives; if it’s too high, the model would miss too many true positives. Originally, I wanted to conduct systematic and rigorous analyses such as plotting the validation accuracy at different confidence thresholds, but we just did not have enough time to do it. So, we decided to rely on my radiologist friend’s intuition to decide on the best confidence threshold. I would first generate validation images with detected boxes in green and ground truth boxes in black. Then, my radiologist friend would look through them, run some basic statistical analyses and decide on a sensible confidence threshold. Finally, we’d make additional submissions on Kaggle with different thresholds around the selected confidence threshold to search for the best one.
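The more systematic analysis I had in mind could be sketched roughly as follows. This is not code we ran during the competition. It assumes boxes are (x_min, y_min, width, height) tuples, counts a detection as correct when it overlaps an unmatched ground truth box with an IoU of at least 0.5, and uses F1 as a stand-in for accuracy. The function names, the 0.5 cutoff and the choice of F1 are my assumptions for the sake of the example, not the competition’s official metric.

```python
def iou(a, b):
    """Intersection over union of two (x_min, y_min, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def f1_at_threshold(predictions, ground_truth, threshold, min_iou=0.5):
    """F1 score over the validation set at one confidence threshold.

    predictions maps image_id -> list of (confidence, box).
    ground_truth maps image_id -> list of boxes.
    """
    tp = fp = fn = 0
    for image_id in set(predictions) | set(ground_truth):
        kept = [p for p in predictions.get(image_id, []) if p[0] >= threshold]
        kept.sort(key=lambda p: p[0], reverse=True)  # most confident first
        unmatched = list(ground_truth.get(image_id, []))
        for _, box in kept:
            best = max(unmatched, key=lambda gt: iou(box, gt), default=None)
            if best is not None and iou(box, best) >= min_iou:
                unmatched.remove(best)  # each ground truth box matches once
                tp += 1
            else:
                fp += 1
        fn += len(unmatched)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def sweep_thresholds(predictions, ground_truth, thresholds):
    """Return (threshold, F1) pairs, ready to plot or to take the max of."""
    return [(t, f1_at_threshold(predictions, ground_truth, t)) for t in thresholds]


# Hypothetical validation data: two images, one of them with no detections.
predictions = {"a": [(0.9, (10, 10, 100, 100)), (0.3, (500, 500, 50, 50))]}
ground_truth = {"a": [(12, 12, 100, 100)], "b": [(0, 0, 10, 10)]}
for threshold, score in sweep_thresholds(predictions, ground_truth, [0.2, 0.5, 0.95]):
    print(threshold, round(score, 3))
```

Picking the threshold with the highest score on held-out data would replace the manual inspection step, although a radiologist’s review of the resulting images is still a useful sanity check.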


After finding the best confidence threshold for each model, I tried ensembling all the models for the last bit of boost on our accuracy. My hypothesis was that because the models produced bounding boxes that were quite different from one another, especially in terms of their confidence levels, ensembling them would improve the accuracy. I used the code from a GitHub repo I found on Google to ensemble the models. It takes in a list of detected boxes from each model and averages out the box sizes, positions and confidence levels according to the given weights. I wrote a script to preprocess the outputs of our models to feed to the ensemble script. Then, we followed the same model validation process described above to determine the right weights. Ultimately, the best weights turned out to be the test accuracy scores of the models, i.e. the more accurate a model is, the more weight it receives. The accuracy gain from the ensemble process was not huge, but it was enough to push us higher in the ranking. For example, our best score without ensembling was 0.19971, while the ensembled model gave us a score of 0.20125, which was our best score.
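The averaging step can be illustrated with a short sketch. This is not the code from the repo I used. It only shows the weighted average for a group of boxes that are assumed to have already been matched to the same opacity, one box per model, and it leaves out the matching logic entirely. The function name, the box layout (x_min, y_min, width, height, confidence) and the example weights are hypothetical.

```python
def weighted_average_box(detections, weights):
    """Weighted average of matched detections, one per model.

    detections is a list of (x_min, y_min, width, height, confidence).
    weights is a list of non-negative per-model weights of the same length.
    """
    if len(detections) != len(weights):
        raise ValueError("need exactly one weight per detection")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    # Average each of the five fields independently.
    return tuple(
        sum(w * d[i] for d, w in zip(detections, weights)) / total
        for i in range(5)
    )


# Hypothetical example: the first model is trusted three times as much.
merged = weighted_average_box(
    [(100, 100, 200, 200, 0.8), (120, 110, 180, 220, 0.6)],
    [3, 1],
)
print(merged)
```

With weights set to each model’s score, a more accurate model pulls the merged box and its confidence closer to its own prediction.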

Possible Improvements

After the ensemble, our model was good enough to be in the top 3% of the competition. However, there are some improvements we wanted to make, but could not due to the time constraint, or just simply did not know about:

  1. I wanted to train a separate classifier for pneumonia to reduce the number of false positives, but unfortunately we did not have enough time. We observed that our object detection models frequently detected opacities that were not caused by pneumonia. This was probably because identifying and localizing pneumonia opacities specifically is difficult, as opacities caused by other diseases look similar to the ones caused by pneumonia. However, the task of classifying a chest X-Ray image as pneumonia or non-pneumonia, which is an image classification task, is fundamentally easier, as it removes the burden of having to localize pneumonia opacities. For example, CheXNet, a DenseNet classifier trained on chest X-Ray images, reported an F1 score of 0.435, outperforming most of the radiologists who participated in the study. As a result, if we had used a classifier to adjust the confidence levels of the boxes, we could have had a significantly lower number of false positives. The winning team also took this approach.
  2. I should have also relied more on the validation set to determine our confidence thresholds rather than relying solely on my radiologist friend’s intuitions. As I mentioned earlier, the more systematic approach would be to first draw an accuracy-confidence-threshold curve to determine the best threshold. This would have helped us pick a more generalized confidence threshold that would work well on any test set.
  3. We did not spend enough time detecting systematic errors in our models. For example, the winning team found that their models consistently produced bounding boxes that were significantly bigger than the ground-truth boxes due to the way the organizers of the competition labeled the images (they took the intersections of the readings from different radiologists, thereby decreasing the size of the ground-truth boxes). So, the winning team simply added a post-processing step to decrease the size of their detected boxes and noticed a significant improvement in their accuracy score. If we had taken a more systematic approach, such as grouping the errors produced by our models and looking for a pattern among them, we could have noticed the same systematic error.
  4. I could have been more thorough on cross validation. When I trained the model, I split the training data into validation and training only once. But after reading the winning team’s solution, I realized that the standard strategy is to create multiple cross validation folds, train a model for each fold, then ensemble the resulting models. This would have helped our model become more general.
  5. We could have spent more time researching the state-of-the-art object detection models. It turns out that Faster RCNN is not the most accurate object detection architecture anymore, and a new architecture called RetinaNet has been developed. The winning team used it as part of their ensemble. Adding it to our ensemble would likely have boosted our accuracy score.
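The box-shrinking post-processing step described in item 3 could look roughly like the sketch below. It is my own illustration, not the winning team’s code. It assumes boxes are (x_min, y_min, width, height) tuples and shrinks each box around its center. The function name and the 0.9 scale factor are hypothetical, and in practice the factor would have to be tuned on validation data.

```python
def shrink_box(box, scale):
    """Shrink a (x_min, y_min, width, height) box around its center.

    scale is the fraction of the original width and height to keep,
    so 1.0 leaves the box unchanged and 0.5 halves each side.
    """
    x_min, y_min, width, height = box
    new_width = width * scale
    new_height = height * scale
    # Shift the corner by half of the removed size so the center stays put.
    new_x_min = x_min + (width - new_width) / 2
    new_y_min = y_min + (height - new_height) / 2
    return (new_x_min, new_y_min, new_width, new_height)


# Hypothetical example: keep 90% of each side of one detected box.
print(shrink_box((100, 100, 200, 100), 0.9))
```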

As the last hour of the competition approached, I kept refreshing the leaderboard to see if the final ranking was released. Even though I did not expect much going into this competition, I still hoped that we would do well. When I finally saw that we had finished in the top 3%, I was beyond ecstatic. All the late nights I spent monitoring the training scripts after coming back home from full days of work finally paid off. I am so glad that I decided to participate in this competition, as it gave me my first opportunity to use machine learning techniques to solve real-life problems. I hope I will not forget the excitement and joy I felt during this competition, and that I will keep exploring the world of artificial intelligence and machine learning.

PhD Student at UMich Researching NLP and Cognitive Architectures • Previously Real-time Distributed System Engineer turned NLP Research Engineer at ASAPP
