CS 462 Homework 2

The goal of this homework is to acclimate you to working with PyTorch and examining training results. There is also a lesson about training and data quality, which you will hopefully take to heart. Nearly any dataset collected in the real world will have labelling errors, either from humans or from sensor failure. A neural network is quite powerful, so it is capable of finding ways to predict the labels anyway. This can lead to a neural network that appears to work, but that produces incorrect outputs on images that match a particular bad data sample in the training set.

You will be given a training dataset with some number of evil twins: samples that look the same as others in the data, but which have incorrect labels. The task of this homework is to detect the incorrectly labelled samples, distinguishing them from the correctly labelled ones that have the same appearance. This falls under the category of work called anomaly detection; it is generally a difficult task.

Materials

You will be given four files in a tar archive:

The gz (gzipped) files contain the dataset and labels, with the evil twins mixed in. The train_linear_mnist.py script is an example training script. You will modify this training script to accept two arguments, “--data” and “--labels”, and print out your best guess of the indexes of the mislabelled data. The mnist_util.py file has utilities to read the data.
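As a minimal sketch of the two required arguments, the standard argparse module is one way to accept them. The function name parse_args and the help strings are illustrative assumptions, not part of the provided script.

```python
import argparse


def parse_args(argv=None):
    """Parse the two required command-line arguments (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="Flag suspected mislabelled samples.")
    parser.add_argument("--data", required=True, help="path to the gzipped image file")
    parser.add_argument("--labels", required=True, help="path to the gzipped label file")
    return parser.parse_args(argv)


# Passing a list simulates the command line; call parse_args() in the real script.
args = parse_args(["--data", "bad_digits_ubyte.gz", "--labels", "bad_labels_ubyte.gz"])
print(args.data, args.labels)
```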

To make debugging possible, the indexes of the fifteen evil twins in this dataset are:

In this case, they are all present in the training data, but you shouldn’t assume that is true in the final testing data that will be used for grading.

We will test your homework on the CS ilab machines. If you want to test on those machines as well, you will need to activate an environment that has torch installed, for example by using the command:

source /common/system/venv/python312/bin/activate

Desired Output

Modify the provided training code, or write new training code, so that it identifies suspected evil twins.

Print out your estimates for the mislabelled data in a comma-separated list.
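One possible way to produce that list is sketched below. The helper name format_indices is hypothetical, and sorting and de-duplicating the indices is an assumption based on the ascending order shown in the example output.

```python
def format_indices(indices):
    """Return the indices as a sorted, comma-separated string (hypothetical helper)."""
    return ", ".join(str(i) for i in sorted(set(indices)))


print(format_indices([12, 3, 2, 3]))  # prints: 2, 3, 12
```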

Deliverables

Name your program “hw02.py” and submit it through Canvas.

Author Block

Your code must contain an author block at the top of the file. The author block must match the following format:

# Author: your name
# Netid: your netid
# Aid: If in doubt, list anything you referenced: a stack overflow post, chatgpt, gemini, your friend Bob, etc. If referring to an LLM, you must include a link to the particular session. Just saying "I asked chatgpt" isn't a get-out-of-trouble-free statement. If two submissions are nearly identical and nothing is listed, I will have to assume that there was copying with an attempt at obfuscation.

Examples

In this example, the prediction is perfect.

$ python3 hw02.py  --data bad_digits_ubyte.gz --labels bad_labels_ubyte.gz
2, 3, 12, 19, 35, 50, 61, 66, 85, 101, 129, 138, 147, 157, 163

Grading

Each missed evil twin and each non-evil sample incorrectly identified as an evil twin counts as one error. No points will be deducted for the first 3 errors; 5% will be deducted for each error after that.
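To check your own output against the known debugging indices, the error count described above can be sketched with set differences. The function name count_errors is hypothetical; the "actual" list is the perfect prediction from the Examples section, and the "predicted" list is the clean-training run from the Advice section.

```python
def count_errors(predicted, actual):
    """Count missed evil twins plus wrongly flagged samples (hypothetical helper)."""
    predicted, actual = set(predicted), set(actual)
    missed = actual - predicted        # evil twins that were not reported
    false_alarms = predicted - actual  # non-evil samples that were reported
    return len(missed) + len(false_alarms)


actual = [2, 3, 12, 19, 35, 50, 61, 66, 85, 101, 129, 138, 147, 157, 163]
predicted = [2, 3, 12, 19, 35, 50, 61, 66, 79, 85, 101, 129, 138, 147, 157, 163]
print(count_errors(predicted, actual))  # prints: 1 (sample 79 is a false alarm)
```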

Your scripts will be run with the timeout command, preventing them from running for longer than 3 minutes.

timeout -k 1s 3m python3 hw02.py --data bad_digits_ubyte.gz --labels bad_labels_ubyte.gz

Verify that your code does not time out by running it on the ilab machine.
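One way to stay under the limit is to track elapsed time inside the script and stop starting new training rounds once another one is unlikely to fit. The sketch below is an assumption about how you might structure this: the name run_until_deadline, the 30-second safety margin, and the stand-in train_once callable are all hypothetical.

```python
import time

TIME_LIMIT_S = 180.0    # the 3-minute limit enforced by the timeout command
SAFETY_MARGIN_S = 30.0  # assumed reserve for loading data and printing results


def run_until_deadline(train_once, budget_s=TIME_LIMIT_S - SAFETY_MARGIN_S):
    """Call train_once() repeatedly while another call is expected to fit in budget_s."""
    start = time.monotonic()
    results = []
    slowest = 0.0  # duration of the slowest round seen so far
    while (time.monotonic() - start) + slowest < budget_s:
        round_start = time.monotonic()
        results.append(train_once())
        slowest = max(slowest, time.monotonic() - round_start)
    return results


# Demo with a stand-in for a training round and a tiny budget.
demo = run_until_deadline(lambda: time.sleep(0.01), budget_s=0.1)
print(f"completed {len(demo)} rounds")
```

Using the slowest round so far as the estimate for the next one is a conservative choice; timing on your own machine may differ from the ilab machines, so the margin is worth checking there.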

Advice

Linear networks can theoretically learn any input, but in this case we are giving them conflicting signals: the evil twins have the same input as their counterparts, but a different target label. It is not impossible to distinguish them, however, because the correct labels are correlated with the rest of the (non-evil) dataset. If we could train without any of the twins, evil or otherwise, our trained classifier would be more likely to predict the “true” labels and we would see prediction errors on the evil twins.

For example, one attempt at training with a clean dataset and testing on the dataset with the evil twins fails only on the evil twins and sample 79 (a non-evil sample):

Final accuracy 149/165 (0.903030276298523)
Final failures at indices [2, 3, 12, 19, 35, 50, 61, 66, 79, 85, 101, 129, 138, 147, 157, 163]

If we train and test with the evil twins in our dataset though, we will see more classification failures from the training script:

Final accuracy 144/165 (0.8727272748947144)
Final failures at indices [2, 3, 12, 19, 32, 35, 50, 61, 62, 63, 66, 101, 102, 107, 129, 138, 147, 153, 157, 160, 163]

This occurs because the evil twins provide contradictory data, creating failures. The non-evil twins are more correlated with the rest of the dataset though, so over time we would expect them to be classified correctly more often. Results are stochastic, of course, so we could run this multiple times and combine the results.

Final accuracy 150/165 (0.9090909361839294)
Final failures at indices [2, 3, 12, 19, 35, 50, 61, 62, 66, 101, 103, 129, 138, 157, 163]

Final accuracy 150/165 (0.9090909361839294)
Final failures at indices [2, 12, 19, 35, 50, 61, 66, 70, 85, 101, 129, 138, 147, 157, 163]

Final accuracy 137/165 (0.8303030133247375)
Final failures at indices [2, 10, 12, 17, 19, 35, 50, 58, 61, 62, 66, 69, 70, 76, 83, 88, 101, 103, 110, 115, 127, 129, 138, 140, 149, 152, 157, 163]

Final accuracy 150/165 (0.9090909361839294)
Final failures at indices [2, 3, 12, 19, 35, 50, 61, 66, 85, 101, 129, 138, 147, 157, 163]

The last run actually failed only at the evil twins, so it is possible to get lucky. The next-to-last run was terrible.

When something is stochastic, how do we make it more deterministic? With more samples.
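As a sketch of combining runs, the four failure lists above can be tallied with a counter, keeping only the indices that fail in a strict majority of runs. The majority threshold is an assumption for illustration; it is a tunable choice, not a prescribed rule.

```python
from collections import Counter

# Failure indices from the four runs shown above.
runs = [
    [2, 3, 12, 19, 35, 50, 61, 62, 66, 101, 103, 129, 138, 157, 163],
    [2, 12, 19, 35, 50, 61, 66, 70, 85, 101, 129, 138, 147, 157, 163],
    [2, 10, 12, 17, 19, 35, 50, 58, 61, 62, 66, 69, 70, 76, 83, 88, 101, 103,
     110, 115, 127, 129, 138, 140, 149, 152, 157, 163],
    [2, 3, 12, 19, 35, 50, 61, 66, 85, 101, 129, 138, 147, 157, 163],
]

# Count how many runs misclassified each sample.
votes = Counter(index for run in runs for index in run)

# Assumed rule: flag a sample if it failed in more than half of the runs.
min_votes = len(runs) // 2 + 1
suspects = sorted(index for index, count in votes.items() if count >= min_votes)
print(suspects)
```

With only these four runs, the strict majority flags twelve indices, all of them evil twins, and misses 3, 85, and 147, which each failed in just two runs. That is a small illustration of why more runs help.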

Data Splits

Since you are racing the clock to determine which samples are the evil twins, you should not attempt to train on the entire dataset each time. Instead, pick a subset of it, perhaps 2/3 of the total data, and train a model with those samples. Since all of the “good” samples are correlated, the samples you drop are similar to ones you keep, so you are not losing much unique information. If you manage to drop an “evil” sample, that only increases the chance that it will be misclassified when you run evaluation.
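The subset idea can be sketched as follows: each round trains a fresh model on a random 2/3 of the samples, evaluates on all of them, and adds to a per-sample failure count. This is only an illustration under stated assumptions. The function name failure_counts, the single linear layer, the optimizer settings, and the random stand-in data are all hypothetical and are not taken from train_linear_mnist.py.

```python
import torch
from torch import nn


def failure_counts(x, y, num_classes, rounds=5, train_fraction=2 / 3, epochs=20, lr=0.1):
    """Count how often each sample is misclassified across subset-trained models.

    x: float tensor of shape (N, D); y: long tensor of shape (N,).
    """
    n = x.shape[0]
    counts = torch.zeros(n, dtype=torch.long)
    subset_size = int(n * train_fraction)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(rounds):
        # Pick a fresh random subset and a fresh model for every round.
        subset = torch.randperm(n)[:subset_size]
        model = nn.Linear(x.shape[1], num_classes)
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(epochs):
            optimizer.zero_grad()
            loss = loss_fn(model(x[subset]), y[subset])
            loss.backward()
            optimizer.step()
        # Evaluate on the full dataset, including the held-out samples.
        with torch.no_grad():
            predictions = model(x).argmax(dim=1)
        counts += (predictions != y).long()
    return counts


# Random stand-in data so the sketch runs without the homework files.
torch.manual_seed(0)
x = torch.randn(30, 8)
y = torch.randint(0, 3, (30,))
counts = failure_counts(x, y, num_classes=3)
suspects = torch.argsort(counts, descending=True)[:5].tolist()
print(suspects)
```

Samples with the highest counts are the strongest candidates; how many to report and how many rounds fit in the time limit are left for you to tune.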

Tuning Your Script

You do not need to run for the default number of epochs currently in the code. Feel free to change it.

You also do not need to train with the network at the current size. Feel free to change that as well.

You may also attempt to examine the loss of individual samples, but the loss is correlated with the classification failure, so it is not guaranteed to provide more information.
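If you do want per-sample losses, PyTorch's cross_entropy accepts reduction="none", which returns one loss value per sample instead of the mean. The logits and labels below are made-up values for illustration only.

```python
import torch
import torch.nn.functional as F

# Made-up model outputs for three samples and three classes.
logits = torch.tensor([[4.0, 0.0, 0.0],
                       [0.0, 4.0, 0.0],
                       [4.0, 0.0, 0.0]])
labels = torch.tensor([0, 1, 2])  # the third label disagrees with the model

# reduction="none" keeps one loss value per sample.
per_sample_loss = F.cross_entropy(logits, labels, reduction="none")

# Rank samples from highest loss (most suspicious) to lowest.
ranked = torch.argsort(per_sample_loss, descending=True)
print(per_sample_loss, ranked)
```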