Introduction to Machine Learning
Due: Tuesday, February 11, 2025 at 09:00 PM
Lab attendance check
Type in your section passcode to get attendance credit (normally within the first 15 minutes of your scheduled section; today only, the passcode is accepted at any time during the session).
Passcode:
Instructions
In 6.390 a large part of the learning happens by discussing lab questions with partners. Please form a group of 2-3 students (you may self-select partners) and create your group using the mechanism below. Note, the group creation form will not accept groups larger than three students.
Create/join a group
One person (and only one) should create a group. Everyone else should enter the group name below (in the form groupname_0000) to join it. To join another group or leave your current group, reload the page.
Note that this lab builds on the notes Chapter 1 - Intro to ML and Chapter 2 - Regression for background and terminology. You'll be expected to read those notes before you start on Homework 1. In fact, Homework 1 will ask you to implement code that we're only describing or using here (you don't get to see all that code in this lab) -- so this lab will help get you ready for the homework!
It is good practice for you and your team to write down short answers to the lab questions as you work through them so that you are prepared for a check-off conversation at the end (and you can reference those notes throughout the course!).
1) Problem Setup§
During Recitation 01, we investigated our first problem class: a supervised learning problem called regression. This setup includes the following properties (a small numpy sketch of one way to store such a data set follows the list):
- We are given the training data \mathcal{D}_{train}=\left\{\left(x^{(1)}, y^{(1)}\right), \ldots,\left(x^{(n)}, y^{(n)}\right)\right\}
- x^{(i)}\in \mathbb{R}^d
- y^{(i)}\in \mathbb{R}
- Our goal is to predict a value y^{(n+1)} from the input x^{(n+1)}
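For concreteness, here is a small, hypothetical sketch (the specific numbers are made up, not part of the lab) of one way such a training data set could be stored in numpy, matching the convention used by the code later in this lab: inputs stacked as rows of an n x d array, outputs as an n x 1 array.
import numpy as np

# Hypothetical toy training set with n = 4 points and d = 2 features.
X_train = np.array([[1.0, 2.0],
                    [0.5, 1.5],
                    [2.0, 0.0],
                    [1.5, 3.0]])   # row i is x^(i) in R^2
Y_train = np.array([[3.1],
                    [2.0],
                    [1.9],
                    [4.2]])        # row i is the corresponding label y^(i)

print(X_train.shape, Y_train.shape)   # (4, 2) (4, 1)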
1.1) §
Using the properties listed above, what aspect of this problem makes it a supervised learning problem, rather than an unsupervised learning problem? (If you need a refresher, see this section of the notes.)
1.2) §
Using the properties enumerated above, what aspect of this problem setup makes it a regression problem? For comparison, what is one example of a supervised classification problem?
2) Linear Regression§
Following from Recitation 01, we will continue to investigate the linear hypothesis class: \mathcal{H} = \{h(x;\theta, \theta_0) = \theta^\top x + \theta_0 | \theta \in \mathbb{R}^d, \theta_0 \in \mathbb{R} \} (and x is defined as in the problem setup). (We also observed how we can define x_{aug} and \theta_{aug} to create a more compact version of this hypothesis; you will not need that notation in this lab, but if you had any lingering questions about this, feel free to discuss now with your group and/or during your checkoff!)
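As a quick illustration (the parameter values and the helper name h below are arbitrary, made up for this sketch), here is how one particular hypothesis from \mathcal{H} could be evaluated in numpy for d = 2:
import numpy as np

theta = np.array([[2.0],
                  [-1.0]])     # theta in R^2, stored as a 2 x 1 column vector
theta_0 = 3.0

def h(x, th, th0):
    # h(x; theta, theta_0) = theta^T x + theta_0
    return (th.T @ x).item() + th0

x = np.array([[1.0],
              [4.0]])          # one input point with two features
print(h(x, theta, theta_0))    # 2*1 + (-1)*4 + 3 = 1.0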
2.1) §
How can we define a specific hypothesis within the linear hypothesis class \mathcal{H} defined above? What is one example of a linear hypothesis for input data x \in \mathbb{R}^2 (or an input data point with two features)?
2.2) §
What are some reasons we might choose to use a linear hypothesis class for our data? (For simplicity, you may find it useful to think about this in the case where x has a single feature, so our input is one number. Under the linear hypothesis class, we can only fit lines to this data. When would a line be a useful choice, and why?)
3) Random Hypotheses§
Now, we will test out our first learning algorithm on our supervised linear regression problem! For this "guess-and-check" style algorithm, we will generate a lot of random hypotheses within the linear hypothesis class -- i.e., several models with linear parametrization, but each with randomly generated parameters. Or more specifically for our d=1 starting case, we will generate a lot of lines, each with randomly generated slope and intercept.
We will then see which of these randomly generated models has the smallest training error on this data, and return that one as our answer (i.e., the "best" hypothesis that we found). We don't recommend this method in practice, but it gets us started and yields some useful insights.
Here is the top-level code for the algorithm random_regress(X, Y, k), defined below:
- X is an n \times d matrix with n training input points in a d-dimensional space.
- Y is an n \times 1 vector with n training output values.
- k is the number of random hypotheses to try.
At a high level, the random_regress function below takes in the training data and the number of random hypotheses to try (k), generates the random hypotheses to examine, computes the mean squared error (MSE) for each hypothesis, and finally returns the specific hypothesis that achieves the lowest MSE (as well as the value of that minimal error).
As a reminder, MSE is a descriptive name for the average squared loss over a data set. We previously used MSE for our training error (\mathcal{E}_{train}) when evaluated over \mathcal{D}_{train}. Consider one hypothesis (out of the k possible hypotheses). Suppose the parameters of this hypothesis are \theta and \theta_0; then its MSE over the n training points is defined as
\mathcal{E}_{train}(\theta, \theta_0) = \frac{1}{n}\sum_{i=1}^{n}\left(\theta^\top x^{(i)} + \theta_0 - y^{(i)}\right)^2
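As a sketch (our own helper name, not staff code), this MSE for a single hypothesis could be computed in numpy as:
import numpy as np

def mse_single(X, Y, theta, theta_0):
    # X is n x d, Y is n x 1, theta is d x 1, theta_0 is a scalar (or 1 x 1).
    preds = X @ theta + theta_0        # n x 1 vector of predictions
    return np.mean((preds - Y) ** 2)   # average squared error over the n points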
The code below uses normal Python and numpy. To compute the MSE of each randomly generated hypothesis, this code uses the function lin_reg_err, which is very similar to one of the functions you will write in Homework 1. (Can you spot the difference? Hint: it is always useful to pay attention to the function specification!) You do not need to know the exact implementation of the lin_reg_err function below in order to complete the lab, but it is a helpful exercise to think about how you would implement it.
import numpy as np

def random_regress(X, Y, k):
    n, d = X.shape
    # generate k random hypotheses: column j of ths (d x k) and th0s (1 x k)
    # holds the parameters of the j-th hypothesis
    ths = np.random.randn(d, k)
    th0s = np.random.randn(1, k)
    # compute the mean squared error of each hypothesis on the data set
    errors = lin_reg_err(X, Y, ths, th0s)
    # find the index of the hypothesis with the lowest error
    i = np.argmin(errors)
    # return the theta and theta0 parameters that define that hypothesis
    theta, theta0 = ths[:,i:i+1], th0s[:,i:i+1]
    return (theta, theta0), errors[i]
You'll get more practice with numpy in the homework, but in the meantime you might find the numpy reference for np.random.randn helpful.
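To see the whole pipeline end to end, here is one possible, unofficial sketch of lin_reg_err (the staff implementation is not shown in this lab and may differ from this), along with a tiny made-up d = 1 data set to exercise random_regress:
import numpy as np

def lin_reg_err(X, Y, ths, th0s):
    # X is n x d, Y is n x 1, ths is d x k, th0s is 1 x k.
    preds = X @ ths + th0s                      # n x k: column j holds hypothesis j's predictions
    return np.mean((preds - Y) ** 2, axis=0)    # length-k array of MSEs, one per hypothesis

# Tiny made-up data set, roughly following y = 2x + 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
Y = np.array([[1.1], [2.9], [5.2], [6.8]])

(theta, theta0), err = random_regress(X, Y, k=1000)
print(theta, theta0, err)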
3.1) §
Talk through the code for random_regress with your group, and try to get a basic understanding of what is going on. Put yourself in the help queue to ask if you have any questions!
What distribution are the ths and th0s variables sampled from? What does this tell us about the values we can expect to see for our randomly generated parameters? (Hint: use the np.random.randn numpy reference.)
3.2) §
We'll start by studying example executions of this algorithm with a data set where d = 1, n = 10 for different values of k (so the horizontal axis is the input x with a single feature, and the vertical axis is the output y). The hypothesis with minimal MSE, of the k that were tested, is shown in red.
Here is the data set and one random hypothesis.
[Figure: the data set (d = 1, n = 10) with one random linear hypothesis]
- Is this a "good" hypothesis?
- For all possible hypotheses within the linear hypothesis class (not limited to the k that were tested), which one would achieve the minimal MSE, roughly?
3.3) §
Now, here are some executions for different values of k (shown in red is the hypothesis with the lowest MSE, among the k tested).
[Figures: the best of the k random hypotheses, for four increasing values of k]
- What happens as we increase k? Comparing the four "best" linear regressors found by the random regression algorithm for the different values of k, which one does your group think is "best of the best"?
- How does it match your initial guess about the best hypothesis?
- Will this method eventually get arbitrarily close to the best solution? What do you think about the efficiency of this method?
4) Evaluation§
In considering a machine learning algorithm, we are sometimes interested in how different settings of the algorithm affect its performance. In Part 3 above, we gained some informal intuition about the impact of k, for a case when we had n = 10 training points. In this part, we'll more formally evaluate the impact of training set size n, when k is large.
We have an experimental procedure, procedure(n, m, d, k); the implementation in Python is not provided here, but it works as follows:
- Generate a random regression problem with d input dimensions, which we'll use as a test for the algorithm. We do this by randomly drawing some hidden parameters \theta and \theta_0 which the learning algorithm will not get to observe.
- Generate a data set (X, Y) with (n+m) data points. The X values are drawn uniformly from the range [0, 1]. The output vector Y contains \theta^\top x^{(i)} + \theta_0 + \epsilon^{(i)} for each column vector x^{(i)} in X, where \epsilon^{(i)} is some small random noise.
- Randomly split the data set into two parts: the training set containing n data points, and the validation set containing m data points.
- Run the random_regress(X, Y, k) algorithm on the training set, \mathcal{D}_{train}, to get estimated \theta^* and \theta_0^*.
- Compute the MSE of the hypothesis defined by \theta^*, \theta_0^* on the validation set, \mathcal{D}_{val} (which the learning algorithm didn't get to see).
(A rough code sketch of this procedure follows the list.)
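Here is one rough, unofficial way procedure could be implemented, assuming the random_regress function from Section 3; the actual staff implementation is not shown, and details such as the noise level here are made-up assumptions:
import numpy as np

def procedure(n, m, d, k, noise=0.1):
    # 1) hidden "true" parameters that the learning algorithm never sees
    true_th = np.random.randn(d, 1)
    true_th0 = np.random.randn(1, 1)
    # 2) (n + m) data points: inputs uniform on [0, 1], outputs linear plus small noise
    X = np.random.uniform(0, 1, size=(n + m, d))
    Y = X @ true_th + true_th0 + noise * np.random.randn(n + m, 1)
    # 3) randomly split into a training set (n points) and a validation set (m points)
    idx = np.random.permutation(n + m)
    X_train, Y_train = X[idx[:n]], Y[idx[:n]]
    X_val, Y_val = X[idx[n:]], Y[idx[n:]]
    # 4) run the learning algorithm on the training set only
    (th, th0), train_err = random_regress(X_train, Y_train, k)
    # 5) evaluate the learned hypothesis on the held-out validation set
    val_err = np.mean((X_val @ th + th0 - Y_val) ** 2)
    return train_err, val_err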
4.1) §
Talk through this algorithm with your partner to be sure it makes sense. For instance, what should be the dimensions of \theta^* and \theta_0^*?
4.2) §
We will study some plots that show the performance of this algorithm and try to explain some observations we make of them.
We fix m = 100, d = 10, and k = 10000, and we run our procedure for different values of training set size n, with n ranging from 2 to 40.
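Below is a rough sketch of how such an experiment could be run and plotted, assuming the (unofficial) procedure sketch from above and matplotlib; the staff's actual experiment code may differ (for example, by averaging over several runs per value of n):
import matplotlib.pyplot as plt

ns = list(range(2, 41))
train_errs, val_errs = [], []
for n in ns:
    tr_err, va_err = procedure(n, m=100, d=10, k=10000)
    train_errs.append(tr_err)
    val_errs.append(va_err)

plt.plot(ns, train_errs, 'b-', label='training MSE')
plt.plot(ns, val_errs, 'r--', label='validation MSE')
plt.xlabel('training set size n')
plt.ylabel('MSE')
plt.legend()
plt.show()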
The blue (solid lower) curve reports the mean squared error on the training set and the red (dashed upper) curve reports it on the validation set.
[Figure: training MSE (blue, solid) and validation MSE (red, dashed) as a function of training set size n]
For each of the following observations, come up with an explanation; you may use the Potential Explanations (below) as a set of possible ideas for answers.
4.2.1) §
With small n relative to d, and large k, the training and validation errors are substantially different, with validation error much higher. Why?
4.2.2) §
As the amount of training data increases, the training error increases slightly. Why?
4.2.3) §
As the amount of training data increases, the validation error goes down a lot. Why?
Potential Explanations
A: As the size of the training data increases, the learned hypothesis is more likely to be close to the true underlying hypothesis.
B: The variation of the errors (with respect to varying k) has to do with the random sampling of hypotheses -- sometimes when we run the learning algorithm (we run it independently once per experiment) it happens to get a good hypothesis, and sometimes it doesn't.
C: Because a small value of k is not guaranteed to generate a good hypothesis.
D: Because the error of a hypothesis on two different, but large, data sets from the same source is likely to be very similar, independent of whether it's a good or bad hypothesis.
E: Because with just a few points in a high-dimensional space, there are lots of different hypotheses that will have low error on the training set. This means we can generally find a hypothesis that goes near them. But when we draw a new set of points, they might be very far away from our hypothesis.
F: Because it is easier to get a small training error when you have a lot of parameters relative to the amount of training data.
Checkoff
Notes on the checkoff:
- If you are doing the checkoff during lab time, only one person per team needs to get into the queue (using the Ask for Checkoff button below).
- You do not have to finish the checkoff during lab time - lab checkoffs are typically due the following Tuesday at 9pm.
- Outside of lab time, checkoffs may only be completed during office hours. Attend any office hours section and use the Ask for checkoff button below.
- During office hours, only the group members present will get the checkoff. You may each get your checkoff individually in different sections (i.e., if your entire group does not finish during lab time, you do not need to get the checkoff as a group during office hours, but you may not get the checkoff for group members who are not present).
Have a check-off conversation with a staff member; this is mainly for you to get used to the checkoff mechanism we'll use for the rest of the semester. We also love to chat a bit about the lab and all the questions above.
Survey
(The form below is to help us improve/calibrate for future assignments; submission is encouraged but not required. Thanks!)