CSC411 Fall 2018 Homework 2
Homework 2
Deadline: Wednesday, Oct. 3, at 11:59pm.
Submission: You need to submit one file through MarkUs¹:
Your answers to Questions 1, 2, and 3 as a PDF file titled hw2_writeup.pdf. You can produce the file however you like (e.g. LaTeX, Microsoft Word, scanner), as long as it is readable.
Neatness Point: One of the 10 points will be given for neatness. You will receive this point as
long as we don’t have a hard time reading your solutions or understanding the structure of your code.
Late Submission: 10% of the marks will be deducted for each day late, up to a maximum of 3
days. After that, no submissions will be accepted.
Weekly homeworks are individual work. See the Course Information handout² for detailed policies.
1. [4pts] Information Theory. The goal of this question is to help you become more familiar
with the basic equalities and inequalities of information theory. They appear in many contexts
in machine learning and elsewhere, so having some experience with them is quite helpful. We
review some concepts from information theory, and ask you a few questions.
Recall the definition of the entropy of a discrete random variable $X$ with probability mass function $p$: $H(X) = \sum_x p(x) \log_2 \frac{1}{p(x)}$. Here the summation is over all possible values of $x \in \mathcal{X}$, which (for simplicity) we assume is finite. For example, $\mathcal{X}$ might be $\{1, 2, \ldots, N\}$.
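For intuition only (not part of the required proof), here is a minimal Python/NumPy sketch that evaluates this formula on a made-up probability mass function; the vector p below is an arbitrary example.

import numpy as np

# Entropy H(X) = sum_x p(x) * log2(1 / p(x)) for an example pmf.
p = np.array([0.5, 0.25, 0.125, 0.125])  # hypothetical pmf over X = {1, 2, 3, 4}
H = np.sum(p * np.log2(1.0 / p))
print(H)  # 1.75 bits for this particular p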
(a) [1pt] Prove that the entropy H(X) is non-negative.
An important concept in information theory is the relative entropy or the KL-divergence of two distributions $p$ and $q$. It is defined as
\[ \mathrm{KL}(p\|q) = \sum_x p(x) \log_2 \frac{p(x)}{q(x)}. \]
The KL-divergence is one of the most commonly used measures of difference (or divergence) between two distributions, and it regularly appears in information theory, machine learning, and statistics. For this question, you may assume $p(x) > 0$ and $q(x) > 0$ for all $x$.
If two distributions are close to each other, their KL divergence is small. If they are exactly
the same, their KL divergence is zero. KL divergence is not a true distance metric (since it
isn’t symmetric and doesn’t satisfy the triangle inequality), but we often use it as a measure
of dissimilarity between two probability distributions.
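As a numerical illustration only (part (b) still requires an analytic argument), the following sketch evaluates the KL-divergence for two made-up distributions; p and q are arbitrary example vectors.

import numpy as np

# KL(p || q) = sum_x p(x) * log2(p(x) / q(x)) for two example pmfs.
p = np.array([0.8, 0.1, 0.1])             # hypothetical distribution p
q = np.array([1.0 / 3, 1.0 / 3, 1.0 / 3])  # hypothetical distribution q
kl_pq = np.sum(p * np.log2(p / q))
kl_qp = np.sum(q * np.log2(q / p))
print(kl_pq, kl_qp)  # both non-negative; note they differ, so KL is not symmetric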
(b) [2pt] Prove that $\mathrm{KL}(p\|q)$ is non-negative. Hint: you may want to use Jensen's Inequality, which is described in the Appendix.
(c) [1pt] The Information Gain or Mutual Information between $X$ and $Y$ is $I(Y; X) = H(Y) - H(Y|X)$. Show that
\[ I(Y; X) = \mathrm{KL}\big(p(x, y) \,\|\, p(x)p(y)\big), \]
where $p(x) = \sum_y p(x, y)$ is the marginal distribution of $X$.
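The following sketch, using a made-up 2x2 joint table, numerically checks that the two expressions for I(Y; X) agree; it is an illustration of the identity, not the requested derivation.

import numpy as np

# Example joint pmf p(x, y) over a 2x2 space (rows index x, columns index y).
p_xy = np.array([[0.3, 0.2],
                 [0.1, 0.4]])   # hypothetical joint distribution
p_x = p_xy.sum(axis=1)          # marginal p(x)
p_y = p_xy.sum(axis=0)          # marginal p(y)

# I(Y; X) written as KL(p(x, y) || p(x) p(y)).
kl_form = np.sum(p_xy * np.log2(p_xy / np.outer(p_x, p_y)))

# I(Y; X) written as H(Y) - H(Y | X).
H_y = -np.sum(p_y * np.log2(p_y))
H_y_given_x = -np.sum(p_xy * np.log2(p_xy / p_x[:, None]))
print(kl_form, H_y - H_y_given_x)  # the two values agree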
¹ https://markus.teach.cs.toronto.edu/csc411-2018-09
² http://www.cs.toronto.edu/~rgrosse/courses/csc411_f18/syllabus.pdf
2. [2pts] Benefit of Averaging. Consider $m$ estimators $h_1, \ldots, h_m$, each of which accepts an input $x$ and produces an output $y$, i.e., $y_i = h_i(x)$. These estimators might be generated through a Bagging procedure, but that is not necessary to the result that we want to prove. Consider the squared error loss function $L(y, t) = \frac{1}{2}(y - t)^2$. Show that the loss of the average estimator
\[ \bar{h}(x) = \frac{1}{m} \sum_{i=1}^m h_i(x), \]
is smaller than the average loss of the estimators. That is, for any $x$ and $t$, we have
\[ L(\bar{h}(x), t) \leq \frac{1}{m} \sum_{i=1}^m L(h_i(x), t). \]
Hint: you may want to use Jensen’s Inequality, which is described in the Appendix.
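As an illustration of the claim (the question still asks for a proof valid for any x and t), the sketch below compares the two sides of the inequality for made-up predictions and a made-up target.

import numpy as np

# Made-up predictions h_1(x), ..., h_m(x) for a single input x, and a target t.
preds = np.array([2.0, 3.5, 1.0, 4.0])   # hypothetical h_i(x) values
t = 2.5

def loss(y):
    # L(y, t) = (1/2)(y - t)^2
    return 0.5 * (y - t) ** 2

avg_pred_loss = loss(preds.mean())   # loss of the average estimator
avg_of_losses = loss(preds).mean()   # average loss of the estimators
print(avg_pred_loss, avg_of_losses)  # first value <= second value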
3. [3pts] AdaBoost. The goal of this question is to show that the AdaBoost algorithm changes the weights in order to force the weak learner to focus on difficult data points. Here we consider the case where the target labels are from the set $\{-1, +1\}$ and the weak learner also returns a classifier whose outputs belong to $\{-1, +1\}$ (instead of $\{0, 1\}$). Consider the $t$-th iteration of AdaBoost, where the weak learner is
\[ h_t \leftarrow \operatorname*{argmin}_{h \in \mathcal{H}} \sum_{i=1}^N w_i \, \mathbb{I}\{h(x^{(i)}) \neq t^{(i)}\}, \]
the $w$-weighted classification error is
\[ \mathrm{err}_t = \frac{\sum_{i=1}^N w_i \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^N w_i}, \]
and the classifier coefficient is $\alpha_t = \frac{1}{2} \log \frac{1 - \mathrm{err}_t}{\mathrm{err}_t}$. (Here, $\log$ denotes the natural logarithm.)
AdaBoost changes the weight of each sample depending on whether the weak learner $h_t$ classifies it correctly or incorrectly. The updated weight for sample $i$ is denoted by $w_i'$ and is
\[ w_i' \leftarrow w_i \exp\!\left(-\alpha_t \, t^{(i)} h_t(x^{(i)})\right). \]
Show that the error w.r.t. $(w_1', \ldots, w_N')$ is exactly $\frac{1}{2}$. That is, show that
\[ \mathrm{err}_t' = \frac{\sum_{i=1}^N w_i' \, \mathbb{I}\{h_t(x^{(i)}) \neq t^{(i)}\}}{\sum_{i=1}^N w_i'} = \frac{1}{2}. \]
Note that here we use the weak learner of iteration $t$ and evaluate it according to the new weights, which will be used to learn the $(t+1)$-st weak learner. What is the interpretation of this result?
Tips:
• Start from $\mathrm{err}_t'$ and divide the summation into the two sets $E = \{i : h_t(x^{(i)}) \neq t^{(i)}\}$ and its complement $E^c = \{i : h_t(x^{(i)}) = t^{(i)}\}$.
• Note that $\frac{\sum_{i \in E} w_i}{\sum_{i=1}^N w_i} = \mathrm{err}_t$.
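For intuition about the reweighting step (the question asks for an algebraic derivation), here is a small numerical sketch with made-up labels, weak-learner outputs, and weights.

import numpy as np

# Made-up binary labels in {-1, +1}, weak-learner outputs, and current weights.
t = np.array([1, 1, -1, -1, 1])            # targets t^(i)
h = np.array([1, -1, -1, 1, 1])            # weak learner outputs h_t(x^(i))
w = np.array([0.3, 0.1, 0.25, 0.15, 0.2])  # current weights w_i

mistakes = (h != t).astype(float)
err = np.sum(w * mistakes) / np.sum(w)     # weighted error err_t
alpha = 0.5 * np.log((1 - err) / err)      # classifier coefficient alpha_t
w_new = w * np.exp(-alpha * t * h)         # updated weights w'_i
err_new = np.sum(w_new * mistakes) / np.sum(w_new)
print(err, err_new)                        # err_new comes out to 0.5 (up to floating point)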
Appendix: Convexity and Jensen's Inequality. Here, we give some background on convexity which you may find useful for some of the questions in this assignment. You may assume anything given here.
Convexity is an important concept in mathematics with many uses in machine learning. We briefly define convex sets and convex functions and some of their properties here. These properties are useful in solving some of the questions in the rest of this homework. If you are interested in learning more about convexity, refer to Boyd and Vandenberghe, Convex Optimization, 2004.
A set $C$ is convex if the line segment between any two points in $C$ lies within $C$, i.e., if for any $x_1, x_2 \in C$ and for any $0 \leq \lambda \leq 1$, we have
\[ \lambda x_1 + (1 - \lambda) x_2 \in C. \]
For example, a cube or a sphere in $\mathbb{R}^d$ is a convex set, but a cross (a shape like X) is not.
A function $f : \mathbb{R}^d \to \mathbb{R}$ is convex if its domain is a convex set and if for all $x_1, x_2$ in its domain, and for any $0 \leq \lambda \leq 1$, we have
\[ f(\lambda x_1 + (1 - \lambda) x_2) \leq \lambda f(x_1) + (1 - \lambda) f(x_2). \]
This inequality means that the line segment between $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the graph of $f$. A convex function looks like $\smile$. We say that $f$ is concave if $-f$ is convex. A concave function looks like $\frown$.
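A small numerical illustration of this defining inequality, using the example convex function f(x) = x^2 and two arbitrary points:

import numpy as np

# Check f(lam*x1 + (1 - lam)*x2) <= lam*f(x1) + (1 - lam)*f(x2)
# over a grid of lambda values, for the example convex function f(x) = x**2.
def f(x):
    return x ** 2

x1, x2 = -1.0, 3.0                  # arbitrary example points
lams = np.linspace(0.0, 1.0, 11)
lhs = f(lams * x1 + (1 - lams) * x2)
rhs = lams * f(x1) + (1 - lams) * f(x2)
print(np.all(lhs <= rhs))           # True: the chord lies above the graph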
Some examples of convex and concave functions are (you do not need to use most of them in your homework, but knowing them is useful):
• Powers: $x^p$ is convex on the set of positive real numbers when $p \geq 1$ or $p \leq 0$. It is concave for $0 \leq p \leq 1$.
• Exponential: $e^{ax}$ is convex on $\mathbb{R}$, for any $a \in \mathbb{R}$.
• Logarithm: $\log(x)$ is concave on the set of positive real numbers.
• Norms: Every norm on $\mathbb{R}^d$ is convex.
• Max function: $f(x) = \max\{x_1, x_2, \ldots, x_d\}$ is convex on $\mathbb{R}^d$.
• Log-sum-exp: The function $f(x) = \log(e^{x_1} + \ldots + e^{x_d})$ is convex on $\mathbb{R}^d$.
An important property of convex and concave functions, which you may need to use in your homework, is Jensen's inequality. Jensen's inequality states that if $\phi(x)$ is a convex function of $x$, we have
\[ \phi(\mathbb{E}[X]) \leq \mathbb{E}[\phi(X)]. \]
In words, if we apply a convex function to the expectation of a random variable, the result is less than or equal to the expected value of that convex function applied to the random variable. If the function is concave, the direction of the inequality is reversed.
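A minimal numerical sketch of Jensen's inequality, using the convex function phi(x) = e^x and an arbitrary discrete random variable:

import numpy as np

# X takes values xs with probabilities ps; phi is the convex function exp.
xs = np.array([-1.0, 0.0, 2.0])   # example support of X
ps = np.array([0.2, 0.5, 0.3])    # example probabilities (sum to 1)
phi = np.exp

lhs = phi(np.sum(ps * xs))        # phi(E[X])
rhs = np.sum(ps * phi(xs))        # E[phi(X)]
print(lhs, rhs)                   # lhs <= rhs, as Jensen's inequality states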
Jensen's inequality has a physical interpretation: Consider a set $X = \{x_1, \ldots, x_N\}$ of points on $\mathbb{R}$. Corresponding to each point, we have a probability $p(x_i)$. If we interpret the probability as mass, and we put an object with mass $p(x_i)$ at location $(x_i, \phi(x_i))$, then the centre of gravity of these objects, which is in $\mathbb{R}^2$, is located at the point $(\mathbb{E}[X], \mathbb{E}[\phi(X)])$. If $\phi$ is convex ($\smile$), the centre of gravity lies above the curve $x \mapsto \phi(x)$, and vice versa for a concave function ($\frown$).