What to do when we have mismatched training and validation sets?

Deep learning algorithms require huge amounts of training data. This pushes us to put more and more labeled data into our training set, even when it does not come from the distribution we are actually interested in.

For example, let’s say we are building a cat classifier for door camera devices. We get 10,000 images from this camera. These images have a low resolution. We get another 200,000 cat pictures by crawling the web. These images are properly framed and have a higher resolution.

[Image: Low-Resolution Cat Images]

[Image: High-Resolution Cat Images]

We are only interested in our final model doing well on images from the door camera. But we have only 10,000 of these images, and using them alone would give us a very small training set. We will have to use the 200,000 images obtained from the web. Since the web images belong to a different distribution, we have a data mismatch problem.

In his course Structuring Machine Learning Projects, from the deeplearning.ai specialization on Coursera, Andrew Ng covers best practices for handling mismatched training and validation/dev sets. In this blog post, I summarize what I learned from that course.

Setting Up Training and Dev Set

As discussed earlier, we will be using images from the door camera as well as from the web. Let’s see how we can set up our training, dev and test sets. (I use the terms validation and dev interchangeably here.) We have 10,000 cat images from the door camera and 200,000 images from the web.

Option 1 (Bad Idea)

One option is to randomly shuffle images from both sources and divide them into the training, dev and test sets. We would then have 205,000 randomly shuffled images in the training set and 2,500 randomly shuffled images each in the dev and test sets.

The advantage of this approach is that our training, dev and test sets all come from the same distribution. The major disadvantage is that most of our dev set now consists of web images. The dev set tells our model which target to aim for, and in this case we would be aiming at the wrong target (we care about the door camera images).
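To see how lopsided this split is, here is a minimal Python sketch of Option 1. The path lists are illustrative placeholders, not real data:

```python
import random

random.seed(0)

# Illustrative placeholder paths for the two sources (names are assumptions).
camera = [f"camera/{i}.jpg" for i in range(10_000)]
web = [f"web/{i}.jpg" for i in range(200_000)]

# Option 1: shuffle everything together, then slice off dev and test.
pool = camera + web
random.shuffle(pool)
train, dev, test = pool[:205_000], pool[205_000:207_500], pool[207_500:]

# Most dev images end up coming from the web, not from the door camera.
web_share = sum(p.startswith("web/") for p in dev) / len(dev)
print(f"web images in dev: {web_share:.0%}")  # roughly 95%
```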

Option 2 (Good Idea)

A much better option is to build the dev and test sets from the door camera images only (2,500 each). The training set gets all the remaining images (205,000 in total). This way, we make sure we are aiming at the correct target – the images from the door camera.
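Here is the same sketch adapted for Option 2, again with placeholder path lists:

```python
import random

random.seed(0)

# Same illustrative placeholders as above.
camera = [f"camera/{i}.jpg" for i in range(10_000)]
web = [f"web/{i}.jpg" for i in range(200_000)]

# Option 2: dev and test are drawn only from the target distribution.
random.shuffle(camera)
dev, test = camera[:2_500], camera[2_500:5_000]

# All remaining images (5,000 camera + 200,000 web) form the training set.
train = camera[5_000:] + web
print(len(train), len(dev), len(test))  # 205000 2500 2500
```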

The Problem with Mismatched Training and Dev Set

The disadvantage of the second approach above is that we have mismatched training and dev sets. Let’s discuss why this would be a problem. Suppose the mismatched training and dev sets give the following errors –

Training Error – 1%

Dev Error – 10%

We might be tempted to think that we are overfitting the training set and that this is a high variance problem. But this might not always be true. Since our training and dev sets come from different distributions, it is possible that the dev set is simply harder than the training set.

With two moving pieces here, we cannot tell whether the dev error is higher because of high variance or because of data mismatch.

The Training-Dev Set

To untangle the two moving pieces of variance and data mismatch, we can use a training-dev set. The training-dev set has the same distribution as the training set, but it is not used for training. In our example, the dev and test sets still hold 2,500 door camera images each, and we carve a random slice of 2,500 images out of the 205,000-image training set to act as the training-dev set, leaving 202,500 images for actual training.
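Continuing the sketch from above, carving out the training-dev set might look like this (the placeholder list simply stands in for the mixed training pool):

```python
import random

random.seed(0)

# Continuing the Option 2 sketch: train holds 205,000 mixed images
# (placeholder paths again; the real list would come from the split above).
train = [f"mixed/{i}.jpg" for i in range(205_000)]

# Carve a random slice out of the training set. It shares the training
# distribution but is never shown to the model during training.
random.shuffle(train)
train_dev, train = train[:2_500], train[2_500:]
print(len(train), len(train_dev))  # 202500 2500
```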

By using the training-dev set, we can tell whether our model is suffering from a variance problem or a data mismatch problem. Let’s take a couple of examples to make this clearer –

Training Error – 1%

Training-Dev Error – 9%

Dev Error – 10%

Since the training-dev set has the same distribution as the training set and the training-dev error is higher, it is safe to assume that we have a variance problem.

Training Error – 1%

Training-Dev Error – 1.5%

Dev Error – 10%

Here, the training error and the training-dev error are very close, but the dev error is high. This shows that our model is suffering from a data mismatch problem.

In case of data mismatch, we can use the training-dev error to pinpoint the type of problem we have.
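To make these rules of thumb concrete, here is a tiny helper that applies them. The 2% gap threshold is an arbitrary illustration, not a number from the course:

```python
def diagnose(train_err, train_dev_err, dev_err, gap=0.02):
    """Attribute the dev-set gap to variance and/or data mismatch,
    depending on where the error jumps. The 2% threshold is illustrative."""
    findings = []
    if train_dev_err - train_err > gap:
        findings.append("variance: error jumps within the training distribution")
    if dev_err - train_dev_err > gap:
        findings.append("data mismatch: error jumps across distributions")
    return findings or ["no large gaps: neither problem is evident"]

print(diagnose(0.01, 0.09, 0.10))   # -> variance
print(diagnose(0.01, 0.015, 0.10))  # -> data mismatch
```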

How to address the data mismatch problem?

There aren’t any fully systematic approaches here, but we can try the following options –

  • We can carry out manual error analysis (manually examining a few samples) to figure out how the training and dev/test sets differ.
  • We can try to make the training data more similar to the dev/test data.
  • We can collect more data similar to the dev/test sets. Artificial data synthesis can be used for this (see the sketch after this list).
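As a sketch of the synthesis idea, here is one way to make crisp web images look more like low-resolution door camera frames, using Pillow. The function name and all parameters (target size, blur radius, JPEG quality) are assumptions for illustration:

```python
from PIL import Image, ImageFilter

def degrade_to_camera_look(src_path, dst_path, size=(96, 96)):
    """Make a crisp web image resemble a low-resolution door camera frame:
    downsample, blur slightly, then recompress with heavy JPEG loss.
    Every parameter here is an illustrative guess, not a tuned value."""
    img = Image.open(src_path).convert("RGB")
    img = img.resize(size, Image.BILINEAR)          # throw away resolution
    img = img.filter(ImageFilter.GaussianBlur(1))   # soften edges
    img.save(dst_path, "JPEG", quality=30)          # add compression artifacts

# degrade_to_camera_look("web/42.jpg", "synthetic/42.jpg")
```

One caveat Ng raises in the course: if the synthesized data covers only a small slice of the real distribution (say, a single blur level or noise type), the model can end up overfitting to that slice.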

I hope you enjoyed this. Thanks for reading. 🙂