Introduction to Machine Learning




Features, Neural Networks I

Due: Tuesday, March 11, 2025 at 09:00 PM

Welcome to Lab 5, where we will explore how to represent input features and start thinking about neural networks. The following resources may be useful:

  • Course notes for Feature Representation and Neural Networks

  • Lecture 5

A Colab notebook for question 1 of this lab is here. You may not need it to answer the questions, but it contains several code blocks for fitting classifiers and plotting.

1) Corners vs. Lines§

In this question, we consider classifying "lines" versus "corners". The "lines" will be assigned class 0 and the "corners" will be assigned class 1. We will first explore what happens when we use the same featurization that we used for images in lab 4, and then investigate new ways to featurize these images. The dataset is shown below.

1.1) Classifying using the pixel occupancy representation§

We begin by trying the pixel occupancy representation from lab 4 (i.e., each image is represented as a 9-dimensional vector, where an entry is 1 if the corresponding pixel is occupied and 0 if it is not). We run a classification algorithm on the dataset to find \theta, \theta_0. In the plot below, we show the value of \theta^T x + \theta_0 (horizontal axis) for each of our examples on a 1D line. In addition to plotting the values of \theta^T x + \theta_0, we also plot which input shapes generate each output \theta^T x + \theta_0 value.
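The pixel occupancy featurization can be sketched in a few lines. This is a minimal illustration, not the lab's notebook code; the `line` and `corner` shapes below are hypothetical examples of what 3-by-3 images in the dataset might look like, not the actual dataset.

```python
import numpy as np

def pixel_occupancy(image):
    """Flatten a 3x3 binary image into a 9-dimensional vector.

    Entry i is 1 if the corresponding pixel is dark (occupied),
    and 0 otherwise.
    """
    image = np.asarray(image)
    assert image.shape == (3, 3)
    return image.reshape(9).astype(float)

# Hypothetical examples: a horizontal "line" and an L-shaped "corner".
line = [[1, 1, 1],
        [0, 0, 0],
        [0, 0, 0]]
corner = [[1, 1, 0],
          [1, 0, 0],
          [0, 0, 0]]

print(pixel_occupancy(line))    # [1. 1. 1. 0. 0. 0. 0. 0. 0.]
print(pixel_occupancy(corner))  # [1. 1. 0. 1. 0. 0. 0. 0. 0.]
```

A linear classifier on this representation assigns one weight per pixel, so its score depends only on which individual pixels are dark, not on how they are arranged.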

[Figure: lines_vs_corners_unreg_new.png — values of \theta^T x + \theta_0 for each example on a 1D line, with the corresponding input shapes]

Observe that the horizontal and vertical axes divide up the space into four quadrants:

  • samples where label = 1 and predicted class = 1

  • samples where label = 1 and predicted class = 0

  • samples where label = 0 and predicted class = 1

  • samples where label = 0 and predicted class = 0

Corners classified correctly appear in the upper right part of the plot and lines classified correctly appear on the bottom left part of the plot.

Additionally, shapes that generate the same output value \theta^T x + \theta_0 are drawn with the same background color and are located at the same horizontal position. So, this plot provides more information than a typical confusion matrix.

1.1.1) §

Study the quadrants of the chart and think carefully about what they mean. Which shapes are being misclassified? Which shapes are being classified correctly?

1.1.2) §

Why do you think the learning algorithm might have found this hypothesis? In particular, what could be contributing to the misclassified shapes being misclassified? (Hint: look at the dark pixels in the misclassified shapes; in which class do they appear more often?)

1.2) Row and column sums§

One idea is to construct features representing the total number of dark pixels in each row and each column. Below, we plot these sums adjacent to the gameboard for some example shapes (see the legend below the images to decode the colors in the row and column sums.)
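The row- and column-sum featurization can be sketched as follows. This is an illustrative implementation, and the `line` and `corner` images are hypothetical examples, not taken from the actual dataset.

```python
import numpy as np

def row_col_sums(image):
    """6-d feature vector of dark-pixel counts, in the order
    [top row, middle row, bottom row,
     left column, middle column, right column]."""
    image = np.asarray(image)
    rows = image.sum(axis=1)   # sum across each row
    cols = image.sum(axis=0)   # sum down each column
    return np.concatenate([rows, cols]).astype(float)

# Hypothetical examples of a line and a corner.
line = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
corner = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]

print(row_col_sums(line))    # [3. 0. 0. 1. 1. 1.]
print(row_col_sums(corner))  # [2. 1. 0. 2. 1. 0.]
```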

1.2.1) §

Write down the 6-dimensional feature for the image on the far left (line) as a list of numbers; each of the features is between 0 and 3 (the order should be [top row, middle row, bottom row, left column, middle column, right column]):  

1.2.2) §

Write down the 6-dimensional feature for the image on the far right (corner) as a list of numbers; each of the features is between 0 and 3 (the order should be [top row, middle row, bottom row, left column, middle column, right column]):  

1.2.3) §

In terms of the row and column sums only, how would you describe a rule (in English) that would correctly tell the difference between line and corner pieces?

1.3) Row and column sums: classification results§

Using these features (instead of the 9-dimensional features from question 1.1), we trained a classifier and obtained the following results:

Did the classifier discover the rule you articulated in question 1.2.3? Explain how your rule from question 1.2.3 can be encoded using a linear classifier or explain why it is not possible to do so.

1.4) Hip to be square§

What if we instead use features that are the squares of the number of dark pixels in a row, and squares of the number of dark pixels in a column? Let's plot these features adjacent to the gameboard to see some examples from the data.
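The squared-sum featurization is a small change to the previous one: square each row and column count. Again, this is only a sketch, with hypothetical example images.

```python
import numpy as np

def squared_row_col_sums(image):
    """6-d feature vector of squared dark-pixel counts, in the order
    [top row, middle row, bottom row,
     left column, middle column, right column]."""
    image = np.asarray(image)
    sums = np.concatenate([image.sum(axis=1), image.sum(axis=0)])
    return (sums ** 2).astype(float)

# Hypothetical examples of a line and a corner.
line = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
corner = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]

print(squared_row_col_sums(line))    # [9. 0. 0. 1. 1. 1.]
print(squared_row_col_sums(corner))  # [4. 1. 0. 4. 1. 0.]
```

Note that squaring makes a single row or column containing all 3 dark pixels contribute 9 rather than 3, which changes what a linear classifier can express.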

1.4.1) §

Write down the 6-dimensional feature for the image on the far left (line) as a list of numbers, each of which is between 0 and 9 (the order should be [top row, middle row, bottom row, left column, middle column, right column]):  

1.4.2) §

Write down the 6-dimensional feature for the image on the far right (corner) as a list of numbers, each of which is between 0 and 9 (the order should be [top row, middle row, bottom row, left column, middle column, right column]):  

1.5) §

Do you expect these new features to be able to separate lines from corners using a linear classifier? If so, explain the rule in English. If not, explain why not.

1.6) Separate ways§

We trained a classifier using the squared feature set with \lambda = 0.01 and got these parameters:

\theta = \begin{bmatrix} -4.31 \\ -4.24 \\ -4.31 \\ -4.31 \\ -4.24 \\ -4.31 \end{bmatrix}, \theta_0 = 47.58

and this result!
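We can plug the reported parameters into the decision rule \theta^T \phi(x) + \theta_0 to check how they behave. The `line` and `corner` images below are hypothetical examples (the actual dataset is not reproduced here); the squared row/column-sum featurization follows the description in question 1.4.

```python
import numpy as np

# Parameters reported above for the squared feature set.
theta = np.array([-4.31, -4.24, -4.31, -4.31, -4.24, -4.31])
theta_0 = 47.58

def squared_row_col_sums(image):
    """Squared dark-pixel counts: 3 rows, then 3 columns."""
    image = np.asarray(image)
    sums = np.concatenate([image.sum(axis=1), image.sum(axis=0)])
    return (sums ** 2).astype(float)

# Hypothetical examples of a line and a corner.
line = [[1, 1, 1], [0, 0, 0], [0, 0, 0]]
corner = [[1, 1, 0], [1, 0, 0], [0, 0, 0]]

for name, img in [("line", line), ("corner", corner)]:
    score = theta @ squared_row_col_sums(img) + theta_0
    # line: score ≈ -4.07 -> class 0; corner: score ≈ 4.62 -> class 1
    print(name, round(float(score), 2), "-> class", int(score > 0))
```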

1.6.1) §

Explain what this classifier is doing, in words. Does it intuitively make sense to you? How does it relate to the rule you came up with when thinking about 1.5?

1.6.2) §

Now that we have a perfect classifier that tells lines from corners, will this be a good detector for lines vs. other patterns we haven't seen yet? Come up with a couple of patterns that don't look like a line but would be classified as a line by our current classifier.

Checkoff 1:
Have a check-off conversation with a staff member, to explain your answers.

2) Activation and Loss Function Applications§

One important part of designing a neural network application is understanding the problem domain and choosing:

  • A representation for the input
  • The number of output units and what range of values they can take on
  • The loss function to try to minimize, based on actual and desired outputs

In terms of the input, there are various ways to represent different types of data. In question 1, we explored various featurizations for how to represent 3-by-3 images. A different example would be a text document, which can be represented using a bag-of-words approach: each word in the vocabulary corresponds to one dimension, and a document is represented by the count of each vocabulary word (equivalently, the sum of one-hot vectors, one per word).
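A bag-of-words encoding can be sketched in a few lines. The vocabulary and document below are toy examples chosen for illustration.

```python
from collections import Counter

def bag_of_words(document, vocabulary):
    """Represent a document as word counts over a fixed vocabulary
    (equivalently, the sum of one-hot vectors, one per word)."""
    counts = Counter(document.lower().split())
    return [counts[w] for w in vocabulary]

vocab = ["the", "cat", "sat", "mat"]  # toy vocabulary
print(bag_of_words("The cat sat on the mat", vocab))  # [2, 1, 1, 1]
```

Words outside the vocabulary (here, "on") are simply dropped, and word order is discarded.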

In this problem, we will concentrate on the output and loss function. These should generally be chosen jointly.

As a reminder, here are some different loss functions and activation functions we have learned about:

  • Activation functions: linear, ReLU, sigmoid, softmax

  • Loss functions: negative log likelihood (NLL), NLL multiclass (NLLM, a.k.a. cross-entropy loss), squared (or mean squared)
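As a concrete reference for the definitions above, here is a minimal sketch of sigmoid with binary NLL and softmax with multiclass NLL (cross-entropy). These are standard formulas, written out directly rather than taken from any particular library.

```python
import numpy as np

def sigmoid(z):
    """Squash a real score into (0, 1), interpretable as a probability."""
    return 1.0 / (1.0 + np.exp(-z))

def nll(y, p):
    """Binary negative log likelihood for label y in {0, 1}."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def softmax(z):
    """Turn a vector of scores into a probability distribution."""
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / e.sum()

def nllm(y_onehot, p):
    """Multiclass NLL (cross-entropy) for a one-hot label vector."""
    return -np.sum(y_onehot * np.log(p))

p = sigmoid(0.0)                 # 0.5
print(nll(1, p))                 # -log(0.5) ≈ 0.693
probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs.sum())               # sums to 1
print(nllm(np.array([1, 0, 0]), probs))
```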

For each of the following application domains, specify good choices for:

  • The number of units in the output layer

  • The activation function(s) on the output layer

  • The loss function

When you choose to use multiple output units, be very clear on the details of how you are applying the activation and the loss.

2.1) §

Map the words on the front page of the New York Times to the predicted (numerical) change in the stock market average.

2.2) §

Map a satellite image centered on a particular location to a value that can be interpreted as the probability it will rain at that location sometime in the next four hours.

2.3) §

Map the words in an email message to one (and exactly one) of a user’s fixed set of email folders in which the message should be filed.

2.4) §

Map the words of a document into a vector of outputs, where each index represents a topic, and has value 1 if the document addresses that topic and 0 otherwise. Each document may contain multiple topics, so in the training data, the output vectors may have multiple 1 values.

3) Fairness through Unawareness§

According to Glassdoor, corporate job openings receive 250 resumes on average, and some receive many, many more. Screening resumes is an arduous and repetitive process. This seems like a great task to apply all of our classification algorithms! But what data should we give to a model to find the best resumes?

3.1) §

A typical resume includes a person’s name, contact details, education, work history, extracurriculars, skills, awards, and languages, all of which can be used as features for an ML model.

In the United States, it’s illegal to discriminate based on a job applicant’s race, religion, sex, or on a number of other protected characteristics. A typical resume does not include these attributes; hence, any model you train will be unaware of these features. However, it is still possible for a machine learning model to discriminate based on a protected characteristic when screening resumes. Why is this possible?

3.2) §

According to an ACLU report, Amazon started using ML to screen resumes, and found that their model showed bias against women. In particular, they observed that their resume screening tool penalized resumes that included the word “women’s” (as in “Women’s Chess Club Captain”) or resumes from women’s colleges. How could you conceal the features from a resume to prevent this? Describe some rules you might write for concealing these features.

Is it possible to conceal these features without also removing information that they are allowed to use?

3.3) §

After you conceal resume features, will the model be “fair” and why? What does “fairness” mean in this context?

There are many different definitions of fairness, and we will discuss these definitions in the coming weeks. One definition of fairness is “fairness through unawareness,” wherein you make sure protected features are not input to a model. Does a model trained on raw resume data meet this definition of “fairness through unawareness”? How about a model trained on concealed resume data?

Checkoff 2:
Have a check-off conversation with a staff member, to explain your answers.

Food for Thought

While waiting for your checkoff...

"After an audit of the algorithm, the resume screening company found that the algorithm found two factors to be most indicative of job performance: their name was Jared, and whether they played high school lacrosse."

Ref: Companies are on the hook if their hiring algorithms are biased

Another cool article on the power of demographic information is “Simple Demographics Often Identify People Uniquely” by Latanya Sweeney.
