Machine Learning Study Guide
William Watson
Johns Hopkins University
billwatson@jhu.edu

Contents

1 Linear Algebra and Calculus
  1.1 General Notation
    1.1.1 Identity Matrix
    1.1.2 Diagonal Matrix
    1.1.3 Orthogonal Matrix
  1.2 Matrix Operations
    1.2.1 Vector-Vector Products
    1.2.2 Vector-Matrix Products
    1.2.3 Matrix-Matrix Products
    1.2.4 The Transpose
    1.2.5 The Trace
    1.2.6 The Inverse
    1.2.7 The Determinant
  1.3 Matrix Properties
    1.3.1 Norms
    1.3.2 Linear Dependence and Rank
    1.3.3 Span, Range, and Nullspace
    1.3.4 Symmetric Matrices
    1.3.5 Positive Semidefinite Matrices
    1.3.6 Eigendecomposition
    1.3.7 Singular Value Decomposition
    1.3.8 The Moore-Penrose Pseudoinverse
  1.4 Matrix Calculus
    1.4.1 The Gradient
    1.4.2 The Hessian
    1.4.3 Gradient Properties
2 Convex Optimization
  2.1 Convexity
    2.1.1 Convex Sets
    2.1.2 Convex Functions
    2.1.3 First-Order Conditions
    2.1.4 Second-Order Conditions
    2.1.5 Jensen's Inequality
    2.1.6 Sublevel Sets
  2.2 Convex Optimization
    2.2.1 Global Optimality
    2.2.2 Gradient Descent
    2.2.3 Newton's Algorithm
  2.3 Lagrange Duality and KKT Conditions
    2.3.1 The Lagrangian
    2.3.2 Primal and Dual Problems
    2.3.3 Strong and Weak Duality
    2.3.4 Complementary Slackness
    2.3.5 The KKT Conditions
3 Probability and Statistics
  3.1 Basics
    3.1.1 Axioms of Probability
    3.1.2 Permutation
    3.1.3 Combination
  3.2 Conditional Probability
    3.2.1 Bayes Rule
    3.2.2 Independence
  3.3 Random Variables
    3.3.1 Cumulative Distribution Function (CDF)
    3.3.2 Probability Density Function (PDF)
    3.3.3 Discrete PDF/CDF
    3.3.4 Continuous PDF/CDF
    3.3.5 Expectation
    3.3.6 Variance and Standard Deviation
  3.4 Discrete Random Variables
    3.4.1 Bernoulli
    3.4.2 Binomial
    3.4.3 Geometric
    3.4.4 Poisson
  3.5 Continuous Random Variables
    3.5.1 Uniform
    3.5.2 Exponential
    3.5.3 Gaussian (Normal)
  3.6 Jointly Distributed Random Variables
    3.6.1 Marginal Density
    3.6.2 Cumulative Distribution
    3.6.3 Conditional Density
    3.6.4 Independence
    3.6.5 Expectation
    3.6.6 Covariance
    3.6.7 Correlation
  3.7 Parameter Estimation
    3.7.1 Definitions
    3.7.2 Bias
    3.7.3 Mean and Central Limit Theorem
    3.7.4 Variance
  3.8 Probability Bounds and Inequalities
    3.8.1 Markov
    3.8.2 Chebyshev
    3.8.3 Chernoff
    3.8.4 Hoeffding
4 Information Theory
5 Learning Theory
  5.1 Bias and Variance
  5.2 Notation
    5.2.1 Union Bound
    5.2.2 Hoeffding Inequality For Bernoulli Variables
  5.3 Training Error
  5.4 Probably Approximately Correct (PAC)
  5.5 Hypothesis Classes
    5.5.1 Shattering
    5.5.2 Upper Bound Theorem
    5.5.3 VC Dimension
    5.5.4 Vapnik Theorem
6 Linear Regression
  6.1 LMS Algorithm
  6.2 The Normal Equations
  6.3 Probabilistic Interpretation
  6.4 Locally Weighted Linear Regression
7 Logistic Regression
  7.1 The Logistic Function
  7.2 Cost Function
  7.3 Gradient Descent
  7.4 Newton-Raphson Algorithm
8 Softmax Regression
  8.1 Softmax Function
  8.2 MLE and Cost Function
  8.3 Gradient Descent
9 Generalized Linear Models
  9.1 Exponential Family
  9.2 Assumptions of GLMs
  9.3 Examples
    9.3.1 Ordinary Least Squares
    9.3.2 Logistic Regression
    9.3.3 Softmax Regression
10 Perceptron
11 Support Vector Machines
12 Margin Classification
13 Generative Learning: Gaussian Discriminant Analysis
  13.1 Assumptions
  13.2 Estimation
  13.3 Prediction
14 Generative Learning: Naive Bayes
  14.1 Assumptions
  14.2 Estimation
    14.2.1 Laplace Smoothing
  14.3 Prediction
15 Tree-based Methods
  15.1 Decision Trees
  15.2 Random Forest
  15.3 Boosting
16 K-Nearest Neighbors
  16.1 Classification
  16.2 Regression
17 K-Means Clustering
  17.1 Algorithm
  17.2 Hierarchical Clustering
  17.3 Clustering Metrics
18 Expectation-Maximization
  18.1 Mixture of Gaussians
  18.2 Factor Analysis
19 Principal Component Analysis
  19.1 Eigenvalues, Eigenvectors, and the Spectral Theorem
  19.2 Algorithm
  19.3 Algorithm: SVD
20 Independent Component Analysis
21 Reinforcement Learning
  21.1 Markov Decision Processes
  21.2 Policy and Value Functions
  21.3 Value Iteration Algorithm
  21.4 Q-Learning
22 Probabilistic Graphical Models
  22.1 Bayesian Networks
    22.1.1 Hidden Markov Models
  22.2 Markov Random Fields
  22.3 Conditional Random Fields
23 Deep Learning: Basics
  23.1 Basics
  23.2 Activation Functions
  23.3 Loss Functions
  23.4 Backpropagation
  23.5 Regularization Methods
  23.6 Optimization Algorithms
  23.7 Convolutional Networks
  23.8 Recurrent Networks
    23.8.1 Elman RNN
    23.8.2 Long Short-Term Memory
    23.8.3 Gated Recurrent Unit
    23.8.4 Bidirectional RNNs
    23.8.5 Vanishing/Exploding Gradient
    23.8.6 Gradient Clipping
24 Deep Learning: Advanced
  24.1 Autoencoders
    24.1.1 Variational Autoencoders
  24.2 Generative Adversarial Networks
  24.3 Encoder-Decoder Models
    24.3.1 Encoders
    24.3.2 Decoders
  24.4 Attention Models
    24.4.1 Context Vector
    24.4.2 Concat (Bahdanau) Attention
    24.4.3 Luong Attention Mechanisms
  24.5 Word Embeddings
    24.5.1 N-Gram Models
    24.5.2 Continuous Bag-of-Words Models (CBOW)
    24.5.3 Skip-Gram Model
    24.5.4 Negative Sampling


1 Linear Algebra and Calculus

1.1 General Notation

A vector x ∈ Rn has n entries, and xi ∈ R is the i-th entry:


x = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} ∈ R^n    (1)

We denote a matrix A ∈ Rm×n with m rows and n columns, and Aij ∈ R is the entry in the i-th row and j-th
column:


A = \begin{bmatrix} A_{11} & \cdots & A_{1n} \\ \vdots & \ddots & \vdots \\ A_{m1} & \cdots & A_{mn} \end{bmatrix} ∈ R^{m×n}    (2)
Vectors can be viewed as an n × 1 matrix.
1.1.1 Identity Matrix

The identity matrix I ∈ Rn×n is a square matrix with ones along the diagonal and zero everywhere else:


I = \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{bmatrix}    (3)

For all matrices A ∈ Rn×n we have A × I = I × A = A.
1.1.2 Diagonal Matrix

A diagonal matrix D ∈ Rn×n is a square matrix with nonzero values along the diagonal and zero everywhere else:


D = \begin{bmatrix} d_1 & 0 & \cdots & 0 \\ 0 & \ddots & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & d_n \end{bmatrix}    (4)

The diagonal matrix D is also written as diag(d1 , . . . , dn ).
1.1.3 Orthogonal Matrix

Two vectors x, y ∈ Rn are orthogonal if xT y = 0. A vector x ∈ Rn is normalized if ||x||2 = 1. A square matrix
U ∈ Rn×n is orthogonal if all its columns are orthogonal to each other and are normalized. Hence:
U^T U = I = U U^T    (5)

Hence the inverse of an orthogonal matrix is its transpose.

1.2 Matrix Operations

1.2.1 Vector-Vector Products

Given two vectors x, y ∈ Rn , the inner product is:
x^T y = \sum_{i=1}^{n} x_i y_i ∈ R    (6)

The outer product for vectors x ∈ R^m, y ∈ R^n is:

xy^T = \begin{bmatrix} x_1 y_1 & \cdots & x_1 y_n \\ \vdots & \ddots & \vdots \\ x_m y_1 & \cdots & x_m y_n \end{bmatrix} ∈ R^{m×n}    (7)

1.2.2 Vector-Matrix Products

The product of a matrix A ∈ R^{m×n} and a vector x ∈ R^n is a vector y = Ax ∈ R^m. If we write A by its rows, Ax is expressed as:

y = Ax = \begin{bmatrix} a_1^T \\ a_2^T \\ \vdots \\ a_m^T \end{bmatrix} x = \begin{bmatrix} a_1^T x \\ a_2^T x \\ \vdots \\ a_m^T x \end{bmatrix}    (8)

Here, the i-th entry of y is the inner product of the i-th row of A and x: y_i = a_i^T x. If we write A in column form:

y = Ax = \begin{bmatrix} a_1 & a_2 & \cdots & a_n \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = a_1 x_1 + a_2 x_2 + \cdots + a_n x_n    (9)
Here, y is a linear combination of the columns of A, where the coefficients of the linear combination are given by
the entries of x.
1.2.3 Matrix-Matrix Products

Given a matrix A ∈ R^{m×n} and a matrix B ∈ R^{n×p}, we can define C = AB as follows:

C = AB = \begin{bmatrix} a_1^T b_1 & a_1^T b_2 & \cdots & a_1^T b_p \\ a_2^T b_1 & a_2^T b_2 & \cdots & a_2^T b_p \\ \vdots & \vdots & \ddots & \vdots \\ a_m^T b_1 & a_m^T b_2 & \cdots & a_m^T b_p \end{bmatrix}    (10)

Hence, each (i, j)-th entry of C is equal to the inner product of the i-th row of A and the j-th column of B.
Compactly:

C_{ij} = a_i^T b_j = \sum_{k=1}^{n} a_{ik} b_{kj}    (11)
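As a quick numerical check, here is a minimal NumPy sketch of these products (the data and names are illustrative, not from the guide):

    import numpy as np

    rng = np.random.default_rng(0)
    x, y = rng.normal(size=5), rng.normal(size=5)
    A, B = rng.normal(size=(4, 5)), rng.normal(size=(5, 3))

    # Inner product (Eq. 6) and outer product (Eq. 7)
    assert np.isclose(x @ y, np.sum(x * y))
    assert np.allclose(np.outer(x, y), x[:, None] * y[None, :])

    # Matrix-vector product (Eq. 9): a linear combination of the columns of A
    assert np.allclose(A @ x, sum(A[:, i] * x[i] for i in range(5)))

    # Matrix-matrix product (Eq. 11): C_ij is <i-th row of A, j-th column of B>
    C = A @ B
    assert np.isclose(C[1, 2], A[1, :] @ B[:, 2])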


1.2.4 The Transpose

The transpose of a matrix A ∈ R^{m×n} is the matrix A^T ∈ R^{n×m} whose entries are:

(A^T)_{ij} = A_{ji}    (12)

Properties of the transpose:
1. (A^T)^T = A
2. (AB)^T = B^T A^T
3. (A + B)^T = A^T + B^T
1.2.5 The Trace

The trace of a square matrix A ∈ R^{n×n} is denoted tr(A). It is the sum of the diagonal elements of the matrix:

tr A = \sum_{i=1}^{n} A_{ii}    (13)

Properties of the trace:
1. For A ∈ Rn×n , tr A = tr AT
2. For A, B ∈ Rn×n , tr(A + B) = tr A + tr B
3. For A ∈ Rn×n , t ∈ R, tr(tA) = t · tr A
4. For A, B such that AB is square, tr AB = tr BA
5. For A, B, C such that ABC is square, tr ABC = tr BCA = tr CAB, and so on
1.2.6 The Inverse

The inverse of a matrix A is noted A−1 and is the unique matrix such that:
A^{-1} A = I = A A^{-1}    (14)

Not all square matrices are invertible. In addition, assuming A, B ∈ Rn×n are non-singular:
1. (A−1 )−1 = A
2. (AB)−1 = B −1 A−1
3. (A−1 )T = (AT )−1
1.2.7 The Determinant

The determinant of a square matrix A ∈ Rn×n , noted |A| or det(A) is expressed recursively in terms of A\i,\j ,
which is the matrix A without its i-th row and j-th column, as follows:
det(A) = |A| = \sum_{j=1}^{n} (−1)^{i+j} A_{i,j} |A_{\setminus i, \setminus j}|    (15)

Remark that A is invertible if and only if |A| ≠ 0. Also, |AB| = |A||B| and |A^T| = |A|.

1.3 Matrix Properties

1.3.1 Norms

A norm of a vector x is any function f : Rn → R that satisfies 4 properties:
1. Non-negativity: For all x ∈ Rn , f (x) ≥ 0
2. Definiteness: f (x) = 0 if and only if x = 0
3. Homogeneity: For all x ∈ Rn , t ∈ R, f (tx) = |t|f (x)
4. Triangle Inequality: For all x, y ∈ Rn , f (x + y) ≤ f (x) + f (y)
However, most norms used come from the family of ℓ_p norms:

ℓ_p = ||x||_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}    (16)

The p-norm is used in Hölder's inequality. The Manhattan norm, used in LASSO regularization, is:

ℓ_1 = ||x||_1 = \sum_{i=1}^{n} |x_i|    (17)

The Euclidean norm ℓ_2 is used in ridge regularization and distance measures:

ℓ_2 = ||x||_2 = \sqrt{ \sum_{i=1}^{n} x_i^2 }    (18)

Finally, the infinity norm is used in uniform convergence:

ℓ_∞ = ||x||_∞ = \max_i |x_i|    (19)

The Frobenius norm of a matrix is analogous to the ℓ_2 norm of a vector:

||A||_F = \sqrt{ \sum_{ij} A_{ij}^2 }    (20)

The dot product of two vectors can be expressed in terms of norms:
x^T y = ||x||_2 ||y||_2 \cos θ    (21)

where θ is the angle between vectors x and y.
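A quick NumPy sketch of these norms (the vector is an arbitrary illustration):

    import numpy as np

    x = np.array([3.0, -4.0])
    assert np.isclose(np.linalg.norm(x, 1), 7.0)        # Manhattan norm
    assert np.isclose(np.linalg.norm(x, 2), 5.0)        # Euclidean norm
    assert np.isclose(np.linalg.norm(x, np.inf), 4.0)   # infinity norm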
1.3.2 Linear Dependence and Rank

A set of vectors {x1 , x2 , . . . , xn } ⊂ Rm is linearly independent if no vector can be represented as a linear combination
of the remaining vectors. Conversely, if one vector in the set can be represented as a linear combination of the
remaining vectors, then the vectors are said to be linearly dependent. Formally, for scalar values α1 , . . . , αn−1 ∈ R:
x_n = \sum_{i=1}^{n−1} α_i x_i    (22)

The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns.
This is equivalent to the maximum number of linearly independent columns of A. If rank(A) = min(m, n), then A
is said to be full rank.

1.3.3 Span, Range, and Nullspace

The span of a set of vectors {x1 , x2 , . . . , xn } is the set of all vectors that can be expressed as a linear combination
of {x1 , x2 , . . . , xn }. Formally,
span({x_1, ..., x_n}) = \left\{ v : v = \sum_{i=1}^{n} α_i x_i, \; α_i ∈ R \right\}    (23)

The projection of a vector y ∈ Rn onto the span of {x1 , x2 , . . . , xn } is the vector v ∈ span ({x1 , . . . xn }) such that v
is as close as possible to y as measured by the euclidean norm:
Proj(y; {x_1, ..., x_n}) = \argmin_{v ∈ span({x_1,...,x_n})} ||y − v||_2    (24)

The range of a matrix A ∈ Rm×n is the span of the columns of A:
R(A) = {v ∈ R^m : v = Ax, x ∈ R^n}    (25)

The nullspace of a matrix A ∈ Rm×n is the set of all vectors that equal 0 when multiplied by A:
N(A) = {x ∈ R^n : Ax = 0}    (26)

1.3.4 Symmetric Matrices

A square matrix A ∈ Rn×n is symmetric if A = AT . It is anti-symmetric if A = −AT . For any matrix A ∈ Rn×n
the matrix A + AT is symmetric and the matrix A − AT is anti-symmetric. From this, any square matrix A ∈ Rn×n
can be represented as a sum of a symmetric matrix and an anti-symmetric matrix:
A = \underbrace{\frac{A + A^T}{2}}_{\text{Symmetric}} + \underbrace{\frac{A − A^T}{2}}_{\text{Antisymmetric}}    (27)

1.3.5 Positive Semidefinite Matrices

Given a square matrix A ∈ Rn×n and a vector x ∈ Rn , the scalar value xT Ax is called the quadratic form:
x^T Ax = \sum_{i=1}^{n} \sum_{j=1}^{n} A_{ij} x_i x_j    (28)

A symmetric matrix A is:
1. Positive definite (PD) if for all nonzero vectors, xT Ax > 0
2. Positive semidefinite (PSD) if for all nonzero vectors, xT Ax ≥ 0
3. Negative definite (ND) if for all nonzero vectors, xT Ax < 0
4. Negative semidefinite (NSD) if for all nonzero vectors, xT Ax ≤ 0
5. Indefinite if it is neither PSD nor NSD, i.e. if there exist x_1, x_2 such that x_1^T A x_1 > 0 and x_2^T A x_2 < 0
Given any matrix A ∈ R^{m×n}, the Gram matrix G = A^T A is always PSD. If m ≥ n and A is full rank, then G is PD.
1.3.6 Eigendecomposition

Given a square matrix A ∈ R^{n×n}, we say that λ ∈ C is an eigenvalue of A and v ∈ C^n is the corresponding eigenvector if:

Av = λv,  v ≠ 0    (29)
The trace of A is equal to the sum of its eigenvalues:
tr A = \sum_{i=1}^{n} λ_i    (30)

The determinant of A is equal to the product of its eigenvalues:
|A| = \prod_{i=1}^{n} λ_i    (31)

Suppose matrix A has n linearly independent eigenvectors {v1 , . . . , vn } with corresponding eigenvalues {λ1 , . . . , λn }.
Let V be a matrix with one eigenvector per column: V = [v_1, ..., v_n]. Let Λ = diag(λ_1, ..., λ_n). Then the eigendecomposition of A is given by:

A = V Λ V^{-1}    (32)

When A is symmetric, the eigenvalues of A are real, the eigenvectors of A are orthonormal, and hence V
is an orthogonal matrix (henceforth renamed U ). Hence A = U ΛU T . A matrix whose eigenvalues are all positive
is called positive definite. A matrix whose eigenvalues are all positive or zero valued is called positive semidefinite.
Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero
valued, it is negative semidefinite. In addition, assuming sorted eigenvalues, for the following optimization problem:
\max_{x ∈ R^n} x^T Ax  subject to ||x||_2^2 = 1    (33)

The solution is v_1, the eigenvector corresponding to λ_1. For the minimization problem:

\min_{x ∈ R^n} x^T Ax  subject to ||x||_2^2 = 1    (34)

the optimal solution for x is v_n, the eigenvector corresponding to eigenvalue λ_n.
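A short NumPy sketch verifying the symmetric eigendecomposition and these trace, determinant, and extremal-eigenvector facts (np.linalg.eigh returns eigenvalues in ascending order; the random matrix is my own illustration):

    import numpy as np

    rng = np.random.default_rng(0)
    B = rng.normal(size=(4, 4))
    A = (B + B.T) / 2                                   # symmetrize A

    lam, U = np.linalg.eigh(A)                          # eigenvalues, orthonormal U
    assert np.allclose(A, U @ np.diag(lam) @ U.T)       # A = U Λ U^T
    assert np.isclose(np.trace(A), lam.sum())           # tr A = Σ λ_i
    assert np.isclose(np.linalg.det(A), lam.prod())     # |A| = Π λ_i

    # Unit-norm maximizer/minimizer of x^T A x are the extreme eigenvectors
    v_max, v_min = U[:, -1], U[:, 0]
    assert np.isclose(v_max @ A @ v_max, lam[-1])
    assert np.isclose(v_min @ A @ v_min, lam[0])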
1.3.7 Singular Value Decomposition

The singular value decomposition provides a way to factorize a m × n matrix A into singular vectors and singular
values. It is defined as:
A = U D V^T    (35)
Suppose A is an m × n matrix. Then U ∈ R^{m×m}, D ∈ R^{m×n}, and V ∈ R^{n×n}. The matrices U and V are orthogonal, and the matrix D is diagonal. The elements along the diagonal of D are known as the singular values of matrix A. The columns of U are known as the left-singular vectors, while the columns of V are the right-singular vectors. The left-singular vectors of A are the eigenvectors of AA^T. The right-singular vectors of A are the eigenvectors of A^T A. We can use the SVD to partially generalize matrix inversion to nonsquare matrices.
1.3.8 The Moore-Penrose Pseudoinverse

Matrix inversion is not defined for matrices that are not square. Note that for nonsingular A:
A A^{-1} A = A    (36)


However, if the inverse is not defined, we seek to find a matrix A+ such that:
A A^+ A = A    (37)

The Moore-Penrose pseudoinverse A^+ is defined as follows:

A^+ = \lim_{α → 0} (A^T A + αI)^{-1} A^T    (38)

Practical algorithms use the singular value decomposition of A such that:
A^+ = V D^+ U^T    (39)

where U, D, V are from the SVD of A, and the pseudoinverse D^+ is obtained by taking the reciprocal of the nonzero diagonal elements of D (and transposing the result). If A has more columns than rows, then using the pseudoinverse to solve a linear equation Ax = y provides one of many solutions, namely the solution x = A^+ y with minimal Euclidean norm ||x||_2. When A has more rows than columns, the pseudoinverse gives us the x for which Ax is as close as possible to y, i.e. minimizing ||Ax − y||_2.

1.4 Matrix Calculus

1.4.1 The Gradient

Let f : R^{m×n} → R be a function and A ∈ R^{m×n} be a matrix. The gradient of f with respect to A is an m × n matrix, noted ∇_A f(A), such that:

∇_A f(A) ∈ R^{m×n} = \begin{bmatrix} \frac{∂f(A)}{∂A_{11}} & \frac{∂f(A)}{∂A_{12}} & \cdots & \frac{∂f(A)}{∂A_{1n}} \\ \frac{∂f(A)}{∂A_{21}} & \frac{∂f(A)}{∂A_{22}} & \cdots & \frac{∂f(A)}{∂A_{2n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{∂f(A)}{∂A_{m1}} & \frac{∂f(A)}{∂A_{m2}} & \cdots & \frac{∂f(A)}{∂A_{mn}} \end{bmatrix}    (40)

Or compactly, for each (i, j) entry:

∇_A f(A)_{ij} = \frac{∂f(A)}{∂A_{ij}}    (41)

Similarly, the gradient of f with respect to a vector x ∈ R^n is:

∇_x f(x) = \begin{bmatrix} \frac{∂f(x)}{∂x_1} \\ \frac{∂f(x)}{∂x_2} \\ \vdots \\ \frac{∂f(x)}{∂x_n} \end{bmatrix}    (42)

1.4.2 The Hessian

Let f : R^n → R be a function and x ∈ R^n be a vector. The Hessian of f with respect to x is an n × n symmetric matrix, noted H = ∇_x^2 f(x), such that:

∇_x^2 f(x) ∈ R^{n×n} = \begin{bmatrix} \frac{∂^2 f(x)}{∂x_1^2} & \frac{∂^2 f(x)}{∂x_1 ∂x_2} & \cdots & \frac{∂^2 f(x)}{∂x_1 ∂x_n} \\ \frac{∂^2 f(x)}{∂x_2 ∂x_1} & \frac{∂^2 f(x)}{∂x_2^2} & \cdots & \frac{∂^2 f(x)}{∂x_2 ∂x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{∂^2 f(x)}{∂x_n ∂x_1} & \frac{∂^2 f(x)}{∂x_n ∂x_2} & \cdots & \frac{∂^2 f(x)}{∂x_n^2} \end{bmatrix}    (43)

Or compactly:

∇_x^2 f(x)_{ij} = \frac{∂^2 f(x)}{∂x_i ∂x_j}    (44)

Note that the Hessian is only defined when f(x) is real-valued.

1.4.3 Gradient Properties

For matrices A, B, C and vectors x, b:
1. ∇_x b^T x = b
2. ∇_x x^T Ax = 2Ax (if A symmetric)
3. ∇_x^2 x^T Ax = 2A (if A symmetric)
4. ∇_A tr(AB) = B^T
5. ∇_{A^T} f(A) = (∇_A f(A))^T
6. ∇_A tr(ABA^T C) = CAB + C^T AB^T
7. ∇_A |A| = |A| (A^{-1})^T

2 Convex Optimization

2.1 Convexity

2.1.1 Convex Sets

A set C is convex if, for any x, y ∈ C and α ∈ R with 0 ≤ α ≤ 1, we have
αx + (1 − α)y ∈ C    (45)

This point is known as a convex combination of x and y.
2.1.2 Convex Functions

A function f : Rn → R is convex if its domain D(f ) is a convex set, and if, for all x, y ∈ D(f ) and α ∈ R, 0 ≤ α ≤ 1,
we have:
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)    (46)

A function is strictly convex if, for x ≠ y and 0 < α < 1:

f(αx + (1 − α)y) < αf(x) + (1 − α)f(y)    (47)

2.1.3 First-Order Conditions

Suppose a function f : Rn → R is differentiable, then f is convex if and only if D(f ) is a convex set and for all
x, y ∈ D(f ) we have:
f(y) ≥ f(x) + ∇_x f(x)^T (y − x)    (48)
This is called the first-order approximation to the function f at point x. This is approximating f with its tangent
line at point x. The first order condition for convexity says that f is convex if and only if the tangent line is a
global underestimator of the function f . In other words, if we take our function and draw a tangent line at any
point, then every point on this line will lie below the corresponding point on f .
2.1.4 Second-Order Conditions

Suppose a function f : Rn → R is twice differentiable, i.e. the hessian ∇2x f (x) is defined for all points x in the
domain of f . Then f is convex if and only if D(f ) is a convex set and its hessian is PSD:
∇_x^2 f(x) ⪰ 0    (49)

2.1.5 Jensen's Inequality

Suppose we start with the inequality in the basic definition of a convex function:
f(αx + (1 − α)y) ≤ αf(x) + (1 − α)f(y)    (50)

We can extend this to convex combinations of more than one point:

f\left( \sum_{i=1}^{k} α_i x_i \right) ≤ \sum_{i=1}^{k} α_i f(x_i)    (51)

for \sum_{i=1}^{k} α_i = 1 and all α_i ≥ 0. This can be extended to integrals:

f\left( \int p(x) x \, dx \right) ≤ \int p(x) f(x) \, dx    (52)

for \int p(x) dx = 1 and p(x) ≥ 0 for all x. This implies p(x) is a probability density, and we can write our inequality in terms of expectations:

f(E[x]) ≤ E[f(x)]    (53)

This is known as Jensen's inequality.
2.1.6 Sublevel Sets

Given a convex function f : Rn → R and a real number α ∈ R, the α-sublevel set is:
{x ∈ D(f) : f(x) ≤ α}    (54)

In other words, the α-sublevel set is the set of all points x such that f (x) ≤ α.

2.2 Convex Optimization

A convex optimization problem seeks to optimize a convex function f with respect to a variable x and possible
constraints.
minimize  f(x)
subject to  g_i(x) ≤ 0,  i = 1, ..., m    (55)
            h_i(x) = 0,  i = 1, ..., p
where f is a convex function, gi are convex functions, and hi are affine functions. The optimal value p∗ is equal to
the minimum possible value of the objective function in the feasible region:
p* = \min {f(x) : g_i(x) ≤ 0, i = 1, ..., m, h_i(x) = 0, i = 1, ..., p}    (56)

The optimal point x∗ is a point such that f (x∗ ) = p∗ .
2.2.1 Global Optimality

A point x is locally optimal if it is feasible and if there exists some R > 0 such that all feasible points z with ||x − z||_2 ≤ R satisfy f(x) ≤ f(z). This means that for x to be locally optimal, there exist no nearby points that have a lower objective value. A point x is globally optimal if it is feasible and for all feasible points z, f(x) ≤ f(z). For a convex optimization problem, all locally optimal points are globally optimal.
2.2.2 Gradient Descent

Using α ∈ R as the learning rate, we can update a set of parameters θ with respect to minimizing a function f as
follows:
θ := θ − α ∇f(θ)    (57)

Stochastic gradient descent (SGD) updates the parameters based on each training example, while batch gradient descent updates them on a batch of training examples.
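A minimal sketch of this update on a toy quadratic (the objective, step size, and iteration count are my own illustrative choices):

    import numpy as np

    def grad_descent(grad, theta, alpha=0.1, steps=100):
        """Iterate theta := theta - alpha * grad(theta)."""
        for _ in range(steps):
            theta = theta - alpha * grad(theta)
        return theta

    # Minimize f(theta) = ||theta - 3||^2, whose gradient is 2 (theta - 3)
    theta_star = grad_descent(lambda t: 2 * (t - 3.0), np.zeros(2))
    print(theta_star)    # approaches [3, 3]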
2.2.3 Newton's Algorithm

Newton's algorithm is a numerical method that uses information from the second derivative to find θ such that f′(θ) = 0:

θ := θ − \frac{f′(θ)}{f″(θ)}    (58)

For multidimensional parameters:

θ := θ − α H^{-1} ∇_θ f(θ)    (59)

where H is the Hessian matrix of second partial derivatives:

H_{ij} = \frac{∂^2 f(θ)}{∂θ_i ∂θ_j}    (60)
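A one-dimensional sketch of the Newton update (the toy objective is my own choice):

    def newton(df, d2f, theta, steps=20):
        """Iterate theta := theta - f'(theta) / f''(theta)."""
        for _ in range(steps):
            theta = theta - df(theta) / d2f(theta)
        return theta

    # Minimize f(theta) = theta^4 - 3 theta: f' = 4 theta^3 - 3, f'' = 12 theta^2
    print(newton(lambda t: 4 * t**3 - 3, lambda t: 12 * t**2, theta=1.0))
    # converges to (3/4)^(1/3) ≈ 0.909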

2.3 Lagrange Duality and KKT Conditions

Generic differentiable convex optimization problems are of the form:
minimize  f(x)
subject to  g_i(x) ≤ 0,  i = 1, ..., m    (61)
            h_i(x) = 0,  i = 1, ..., p

where x ∈ Rn is the optimization variable, f : Rn → R and gi : Rn → R are differentiable convex functions, and
hi : Rn → R are affine functions.
2.3.1 The Lagrangian

Given a convex constrained minimization problem, the generalized lagrangian is a function L : Rn × Rm × Rp → R
defined as:
L(x, α, β) = f(x) + \sum_{i=1}^{m} α_i g_i(x) + \sum_{i=1}^{p} β_i h_i(x)    (62)

We refer to x as the primal variables of the Lagrangian. The α and β are referred to as the dual variables or Lagrange multipliers. The key idea behind Lagrangian duality is that for any convex optimization problem, there always exist settings of the dual variables such that the unconstrained minimum of the Lagrangian with respect to the primal variables (keeping the dual variables fixed) coincides with the solution of the original constrained minimization problem.
2.3.2 Primal and Dual Problems

The primal problem is:


\min_x \underbrace{ \max_{α,β : α_i ≥ 0 \, ∀i} L(x, α, β) }_{θ_P(x)} = \min_x θ_P(x)    (63)

Here, θ_P : R^n → R is the primal objective, and the right-hand side is the primal problem. In addition, p* = θ_P(x*) is the optimal value of the primal objective.

The dual problem is:

\max_{α,β : α_i ≥ 0 \, ∀i} \underbrace{ \min_x L(x, α, β) }_{θ_D(α,β)} = \max_{α,β : α_i ≥ 0 \, ∀i} θ_D(α, β)    (64)

Here, θ_D : R^m × R^p → R is the dual objective, and the right-hand side is the dual problem. In addition, d* = θ_D(α*, β*) is the optimal value of the dual objective.
2.3.3 Strong and Weak Duality

Weak duality: for any pair of primal and dual problems, with d* the optimal objective value of the dual problem and p* the optimal objective value of the primal problem, we have:

d* ≤ p*    (65)

Strong duality: for any pair of primal and dual problems which satisfy certain technical conditions called constraint qualifications:

d* = p*    (66)
2.3.4 Complementary Slackness

If strong duality holds, then α_i* g_i(x*) = 0 for each i = 1, ..., m. This property is known as complementary slackness. This implies the following:

α_i* > 0 ⟹ g_i(x*) = 0    (67)

g_i(x*) < 0 ⟹ α_i* = 0    (68)

2.3.5 The KKT Conditions

Suppose that x∗ ∈ Rn , α∗ ∈ Rm and β ∗ ∈ Rp satisfy the following conditions:
1. Primal feasibility: gi (x∗ ) ≤ 0, for i = 1, . . . , m and hi (x∗ ) = 0 for i = 1, . . . , p
2. Dual feasibility: αi∗ ≥ 0 for i = 1, . . . , m
3. Complementary slackness: αi∗ gi (x∗ ) = 0 for i = 1, . . . , m
4. Lagrangian stationarity: ∇_x L(x*, α*, β*) = 0
Then x∗ is primal optimal and (α∗ , β ∗ ) are dual optimal. Furthermore, if strong duality holds, then any primal
optimal x∗ and dual optimal (α∗ , β ∗ ) must satisfy the conditions 1 through 4. These conditions are known as the
Karush-Kuhn-Tucker (KKT) conditions.

3 Probability and Statistics

3.1 Basics

The set of all possible outcomes of an experiment is known as the sample space and denoted by S. Any subset E
of the sample space is known as an event. An event is a set consisting of possible outcomes of the experiment.
3.1.1 Axioms of Probability

For each event E, we denote P(E) as the probability of event E occurring. P(E) satisfies the following properties:
1. Every probability is between 0 and 1 inclusive:

0 ≤ P(E) ≤ 1    (69)

2. The probability that at least one event in the sample space will occur is 1:
P(S) = 1    (70)

3. For any sequence of mutually exclusive events E1 , . . . , En we have:
P\left( \bigcup_{i=1}^{n} E_i \right) = \sum_{i=1}^{n} P(E_i)    (71)

3.1.2 Permutation

A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such
arrangements is given by P (n, r):
P(n, r) = \frac{n!}{(n − r)!}    (72)
3.1.3 Combination

A combination is an arrangement of r objects from a pool of n objects, where order does not matter. The number
of such arrangements is given by C(n, r):
C(n, r) = \frac{P(n, r)}{r!} = \frac{n!}{r!(n − r)!}    (73)

Note that for 0 ≤ r ≤ n we have P (n, r) ≥ C(n, r).

3.2 Conditional Probability

Let B be an event with non-zero probability. The conditional probability of any event A given B is:

P(A|B) = \frac{P(A ∩ B)}{P(B)}    (74)

3.2.1 Bayes Rule

For events A, B such that P(B) > 0, we have:

P(A|B) = \frac{P(B|A) P(A)}{P(B)}    (75)
From Bayes rule, we have:
P(A ∩ B) = P(A|B) P(B) = P(A) P(B|A)    (76)

Let {A_i, i ∈ [1, n]} be such that for all i, A_i ≠ ∅. We say that {A_i} is a partition if we have:

∀i ≠ j, A_i ∩ A_j = ∅  and  \bigcup_{i=1}^{n} A_i = S    (77)


Remark that for any event B in the sample space, we have:
P(B) = \sum_{i=1}^{n} P(B|A_i) P(A_i)    (78)

Let {A_i, i ∈ [1, n]} be a partition of the sample space. We can extend Bayes rule as:

P(A_k|B) = \frac{P(B|A_k) P(A_k)}{\sum_{i=1}^{n} P(B|A_i) P(A_i)}    (79)

3.2.2 Independence

Two events A and B are independent if and only if we have:
P(A ∩ B) = P(A) P(B)    (80)

3.3 Random Variables

A random variable X is a function that maps every element in the sample space to the real line.
3.3.1 Cumulative Distribution Function (CDF)

The cumulative distribution function F, which is monotonically non-decreasing and is such that \lim_{x→−∞} F(x) = 0 and \lim_{x→∞} F(x) = 1, is defined as:

F(x) = P(X ≤ x)    (81)

In addition:

P(a < X ≤ b) = F(b) − F(a)    (82)

3.3.2 Probability Density Function (PDF)

The probability density function is the derivative of the CDF. It has the following properties:
1. f(x) ≥ 0
2. \int_{−∞}^{∞} f(x) dx = 1
3. \int_{x ∈ A} f(x) dx = P(X ∈ A)
3.3.3 Discrete PDF/CDF

If X is discrete, by denoting f as the PDF and F as the CDF, we have:
F(x) = \sum_{x_i ≤ x} P(X = x_i)    (83)

f(x_j) = P(X = x_j)    (84)

And the following properties for the PDF:

0 ≤ f(x_j) ≤ 1    (85)

\sum_{j} f(x_j) = 1    (86)

3.3.4 Continuous PDF/CDF

If X is continuous, by denoting f as the PDF and F as the CDF, we have:
F(x) = \int_{−∞}^{x} f(y) dy    (87)

f(x) = \frac{dF}{dx}    (88)

And the following properties for the PDF:

f(x) ≥ 0    (89)

\int_{−∞}^{∞} f(x) dx = 1    (90)

3.3.5 Expectation

The expected value of a random variable, also known as the mean value or first moment, is denoted as E[X] or µ.
It is the value obtained by averaging the results of a random variable. We use (D) for discrete, (C) for continuous.
(D)  E[X] = \sum_{i=1}^{n} x_i f(x_i)   and   (C)  E[X] = \int_{−∞}^{+∞} x f(x) dx    (91)

The expected value of a function of a random variable g(X) is:
(D)  E[g(X)] = \sum_{i=1}^{n} g(x_i) f(x_i)   and   (C)  E[g(X)] = \int_{−∞}^{+∞} g(x) f(x) dx    (92)

The k-th moment, noted E[X^k], is the value of X^k that we expect to observe on average over infinitely many trials. The k-th moment is a case of the previous definition with g : X ↦ X^k.

(D)  E[X^k] = \sum_{i=1}^{n} x_i^k f(x_i)   and   (C)  E[X^k] = \int_{−∞}^{+∞} x^k f(x) dx    (93)

A characteristic function ψ(ω) is derived from a probability density function f (x) and is defined as:
(D)  ψ(ω) = \sum_{i=1}^{n} f(x_i) e^{iωx_i}   and   (C)  ψ(ω) = \int_{−∞}^{+∞} f(x) e^{iωx} dx    (94)

Remark that eiωx = cos(ωx) + i sin(ωx). The k-th moment can also be computed with the characteristic function
as:


E[X^k] = \frac{1}{i^k} \left[ \frac{∂^k ψ}{∂ω^k} \right]_{ω=0}    (95)
3.3.6 Variance and Standard Deviation

The variance of a random variable, often noted Var(X) or σ 2 , is a measure of the spread of its distribution function.
It is determined as follows:
Var[X] = E[(X − E[X])^2] = E[X^2] − E[X]^2    (96)

The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows:

σ = \sqrt{Var[X]}    (97)
Note that the variance for any constant a is Var[a] = 0, and Var[af (X)] = a2 Var[f (X)].
3.4 Discrete Random Variables

3.4.1 Bernoulli

For X ∼ Bernoulli(p), we define a binary event with a probability of p for a true event, and a false event with
probability of q = 1 − p.
P(X = x) = \begin{cases} q = 1 − p & \text{if } x = 0 \\ p & \text{if } x = 1 \end{cases}    (98)

It can also be expressed as:

f(k; p) = p^k (1 − p)^{1−k}  for k ∈ {0, 1}    (99)

Other properties:

E[X] = p    (100)
Var[X] = pq = p(1 − p)    (101)
ψ(ω) = (1 − p) + p e^{iω}    (102)

3.4.2 Binomial

For X ∼ Binomial(n, p), X is the number of true events in n independent experiments, each with true probability p and false probability q = 1 − p.

P(X = x) = \binom{n}{x} p^x (1 − p)^{n−x}    (103)

Other properties:

E[X] = np    (104)
Var[X] = npq = np(1 − p)    (105)
ψ(ω) = (p e^{iω} + (1 − p))^n    (106)

3.4.3 Geometric

For X ∼ Geometric(p), X is the number of experiments, each with success probability p, until the first true event (the number of trials to get one success).

P(X = x) = p(1 − p)^{x−1}    (107)

Other properties:

E[X] = \frac{1}{p}    (108)
Var[X] = \frac{1 − p}{p^2}    (109)
ψ(ω) = \frac{p e^{iω}}{1 − (1 − p) e^{iω}}    (110)

3.4.4 Poisson

For X ∼ Poisson(λ), with λ > 0, X follows a probability distribution over the nonnegative integers used for modeling the frequency of rare events.

P(X = x) = \frac{λ^x}{x!} e^{−λ}    (111)

Other properties:

E[X] = λ    (112)
Var[X] = λ    (113)
ψ(ω) = e^{λ(e^{iω} − 1)}    (114)

3.5 Continuous Random Variables

3.5.1 Uniform

For X ∼ Uniform(a, b), we assign equal probability density to every value between a and b.

f(x) = \begin{cases} \frac{1}{b − a} & \text{if } a ≤ x ≤ b \\ 0 & \text{otherwise} \end{cases}    (115)

Other properties:

E[X] = \frac{a + b}{2}    (116)
Var[X] = \frac{(b − a)^2}{12}    (117)
ψ(ω) = \frac{e^{iωb} − e^{iωa}}{(b − a) iω}    (118)
3.5.2 Exponential

For X ∼ Exponential(λ), with λ > 0, X follows a decaying probability density over the nonnegative reals.

f(x) = \begin{cases} λ e^{−λx} & \text{if } x ≥ 0 \\ 0 & \text{otherwise} \end{cases}    (119)

Other properties:

E[X] = \frac{1}{λ}    (120)
Var[X] = \frac{1}{λ^2}    (121)
ψ(ω) = \frac{1}{1 − iω/λ}    (122)

3.5.3 Gaussian (Normal)

For X ∼ Normal(µ, σ), also denoted X ∼ N(µ, σ):

f(x) = \frac{1}{\sqrt{2π} σ} e^{−\frac{1}{2} \left( \frac{x − µ}{σ} \right)^2}    (123)

Other properties:

E[X] = µ    (124)
Var[X] = σ^2    (125)
ψ(ω) = e^{iωµ − \frac{1}{2} ω^2 σ^2}    (126)
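As a sanity check on these means and variances, here is a small sampling sketch (the parameters and sample size are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p, lam, mu, sigma = 20, 0.3, 4.0, 1.0, 2.0
    cases = [
        ("binomial",    rng.binomial(n, p, 100_000),       n * p,   n * p * (1 - p)),
        ("poisson",     rng.poisson(lam, 100_000),         lam,     lam),
        ("exponential", rng.exponential(1 / lam, 100_000), 1 / lam, 1 / lam ** 2),
        ("normal",      rng.normal(mu, sigma, 100_000),    mu,      sigma ** 2),
    ]
    for name, draws, mean, var in cases:
        print(f"{name}: mean {draws.mean():.3f} vs {mean:.3f}, "
              f"var {draws.var():.3f} vs {var:.3f}")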

3.6 Jointly Distributed Random Variables

The joint probability distribution of two random variables X and Y , denoted as fXY is defined as:
(D)  f_{XY}(x_i, y_j) = P(X = x_i and Y = y_j)    (127)

(C)  f_{XY}(x, y) Δx Δy = P(x ≤ X ≤ x + Δx and y ≤ Y ≤ y + Δy)    (128)

Again, denote (D) as the discrete case, and (C) as the continuous case.
3.6.1 Marginal Density

The marginal density for a random variable X is:
(D)  f_X(x_i) = \sum_{j} f_{XY}(x_i, y_j)   and   (C)  f_X(x) = \int_{−∞}^{+∞} f_{XY}(x, y) dy    (129)

3.6.2 Cumulative Distribution

The cumulative distribution FXY is:
(D)  F_{XY}(x, y) = \sum_{x_i ≤ x} \sum_{y_j ≤ y} f_{XY}(x_i, y_j)   and   (C)  F_{XY}(x, y) = \int_{−∞}^{x} \int_{−∞}^{y} f_{XY}(x′, y′) dx′ dy′    (130)

3.6.3 Conditional Density

The conditional density of X with respect to Y , denoted fX|Y is defined as:
f_{X|Y} = \frac{f_{XY}(x, y)}{f_Y(y)}    (131)

3.6.4 Independence

Two random variables X and Y are independent if:
f_{XY}(x, y) = f_X(x) f_Y(y)    (132)

3.6.5 Expectation

Let X, Y be two random variables and g : R^2 → R a function of these two variables. Then the expected value of g is:

(D)  E[g(X, Y)] = \sum_{i} \sum_{j} g(x_i, y_j) f(x_i, y_j)   and   (C)  E[g(X, Y)] = \int_{−∞}^{∞} \int_{−∞}^{∞} g(x, y) f(x, y) dy dx    (133)

3.6.6 Covariance

The covariance of two random variables X and Y, denoted σ^2_{XY} or Cov[X, Y], is defined as:

Cov[X, Y] ≜ σ^2_{XY} = E[(X − µ_X)(Y − µ_Y)] = E[XY] − µ_X µ_Y    (134)

where µ_X = E[X] and µ_Y = E[Y] respectively. If X and Y are independent, the covariance is 0.

Var[X + Y] = Var[X] + Var[Y] + 2 Cov[X, Y]    (135)

3.6.7 Correlation

By noting σX , σY as the standard deviations of X and Y , we define the correlation between the random variables
X and Y as ρXY :
ρ_{XY} = \frac{σ^2_{XY}}{σ_X σ_Y}    (136)

The correlation for X, Y satisfies ρ_{XY} ∈ [−1, 1]. If X, Y are independent, then ρ_{XY} = 0.

3.7 Parameter Estimation

3.7.1 Definitions

3.7.2 Bias

3.7.3 Mean and Central Limit Theorem

3.7.4 Variance

3.8 Probability Bounds and Inequalities

This section looks at various bounds that define how likely a random variable is to be close to its expectation.
3.8.1 Markov

Let X ≥ 0 be a non-negative random variable. Then for all a > 0:

P(X ≥ a) ≤ \frac{E[X]}{a}    (137)

3.8.2 Chebyshev

Let X be any random variable with finite expected value µ = E[X] and finite non-zero variance σ 2 . Then for all
k > 0:
P(|X − E[X]| ≥ kσ) ≤ \frac{1}{k^2}    (138)

3.8.3 Chernoff

Recall that the moment generating function for a random variable X is:

M_X(λ) := E[e^{λX}]    (139)

Then the Chernoff bound for a random variable X, obtained by applying the Markov inequality to e^{λX}, for every λ > 0, is:

P(X ≥ a) = P(e^{λX} ≥ e^{λa}) ≤ \frac{E[e^{λX}]}{e^{λa}}    (140)

For the multiplicative Chernoff bound, suppose X_1, ..., X_n are independent random variables taking values in {0, 1}. Let X denote their sum and µ = E[X] the sum's expected value. Then for any 0 < δ < 1,

P(X > (1 + δ)µ) < \left( \frac{e^δ}{(1 + δ)^{(1+δ)}} \right)^µ ≤ e^{−δ^2 µ / 3}    (141)

P(X < (1 − δ)µ) < \left( \frac{e^{−δ}}{(1 − δ)^{(1−δ)}} \right)^µ ≤ e^{−δ^2 µ / 2}    (142)

For δ ≥ 0,

P(X ≥ (1 + δ)µ) ≤ e^{−δ^2 µ / (2 + δ)}    (143)

P(X ≤ (1 − δ)µ) ≤ e^{−δ^2 µ / 2}    (144)

3.8.4 Hoeffding

Let X_1, ..., X_n be independent random variables such that a_i ≤ X_i ≤ b_i, and let S_n = \sum_i X_i be their sum. Then the Hoeffding inequality is:

P(S_n − E[S_n] ≥ t) ≤ \exp\left( − \frac{2t^2}{\sum_i (b_i − a_i)^2} \right)    (145)

P(|S_n − E[S_n]| ≥ t) ≤ 2 \exp\left( − \frac{2t^2}{\sum_i (b_i − a_i)^2} \right)    (146)

The Hoeffding lemma states that for a real-valued random variable X with expected value E[X] = 0 and such that a ≤ X ≤ b, for all λ ∈ R we have:

E[e^{λX}] ≤ \exp\left( \frac{λ^2 (b − a)^2}{8} \right)    (147)

4 Information Theory

Information Theory revolves around quantifying how much information is present in a signal. The basic intuition lies in the fact that learning that an unlikely event has occurred is more informative than learning that a likely event has occurred. The basics are:
1. Likely events should have low information content, and in the extreme case, events that are guaranteed to
happen should have no information content whatsoever.
2. Less likely events should have higher information content.
3. Independent events should have additive information.

We satisfy all three properties by defining self-information of an event x for a probability distribution P as:

I(x) = − \log P(x)    (148)

We can quantify the amount of uncertainty in a distribution using Shannon entropy:

H(P) = E_{x∼P}[I(x)] = −E_{x∼P}[\log P(x)]    (149)

Which in the discrete setting is written as:

H(P) = − \sum_{x} P(x) \log P(x)    (150)

In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn
from that distribution. It gives a lower bound on the number of bits needed on average to encode symbols drawn
from a distribution P . If we have two separate probability distributions P (x) and Q(x) over the same random
variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence:

D_{KL}(P‖Q) = E_{x∼P}\left[ \log \frac{P(x)}{Q(x)} \right] = E_{x∼P}[\log P(x) − \log Q(x)] = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}    (151)

In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols
drawn from probability distribution P , when we use a code that was designed to minimize the length of messages
drawn from probability distribution Q. The KL divergence is always non-negative, and is 0 if and only if P and Q
are the same. We can relate the KL divergence to cross-entropy.

H(P, Q) = H(P) + D_{KL}(P‖Q) = −E_{x∼P}[\log Q(x)] = − \sum_{x} P(x) \log Q(x)    (152)

Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not
participate in the omitted term (entropy is constant).
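A small sketch checking the identity H(P, Q) = H(P) + D_KL(P‖Q) on a toy discrete distribution (the distributions are my own illustration):

    import numpy as np

    P = np.array([0.5, 0.25, 0.25])
    Q = np.array([0.4, 0.4, 0.2])

    entropy = -np.sum(P * np.log(P))            # H(P)
    kl = np.sum(P * np.log(P / Q))              # D_KL(P || Q)
    cross_entropy = -np.sum(P * np.log(Q))      # H(P, Q)

    assert np.isclose(cross_entropy, entropy + kl)
    assert kl >= 0                              # KL divergence is non-negative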

5 Learning Theory

5.1 Bias and Variance

5.2 Notation

5.2.1 Union Bound

5.2.2 Hoeffding Inequality For Bernoulli Variables

5.3 Training Error

For a given classifier h, we define the training error ε̂(h), also known as the empirical risk or empirical error, to be:

ε̂(h) = \frac{1}{m} \sum_{i=1}^{m} 1\{h(x^{(i)}) ≠ y^{(i)}\}    (153)

5.4 Probably Approximately Correct (PAC)

PAC learning is a framework that relies on a specific set of assumptions about the data and the learner.

5.5 Hypothesis Classes

5.5.1 Shattering

5.5.2 Upper Bound Theorem

5.5.3 VC Dimension

5.5.4 Vapnik Theorem

6 Linear Regression

Linear Regression seeks to approximate a real valued label y as a linear function of x:
h_θ(x) = θ_0 + θ_1 x_1 + \cdots + θ_n x_n    (154)

The θi ’s are the parameters, or weights. If we include the intercept term via x0 = 1, we can write our model more
compactly as:
h(x) = \sum_{i=0}^{n} θ_i x_i = θ^T x    (155)

Here n is the number of input variables, or features. In linear regression, we seek to make h(x) as close to y as possible for a set of training examples. We define the cost function as:

J(θ) = \frac{1}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) − y^{(i)} \right)^2    (156)

6.1 LMS Algorithm

We seek to find a set of θ that minimizes J(θ) via a search algorithm that starts at some initial guess for our parameters and takes incremental steps to make J(θ) smaller until convergence. This is known as gradient descent:

θ_j := θ_j − α \frac{∂}{∂θ_j} J(θ)    (157)


Here, α is the learning rate. We can derive the partial derivative as:
\frac{∂}{∂θ_j} J(θ) = \frac{∂}{∂θ_j} \frac{1}{2} (h(x) − y)^2
                    = 2 · \frac{1}{2} (h(x) − y) · \frac{∂}{∂θ_j} (h(x) − y)
                    = (h(x) − y) · \frac{∂}{∂θ_j} \left( \sum_{i=0}^{n} θ_i x_i − y \right)
                    = (h(x) − y) x_j    (158)

Hence, for a single example (stochastic gradient descent):



θ_j := θ_j + α \left( y^{(i)} − h(x^{(i)}) \right) x_j^{(i)}    (159)

This is called the LMS update rule. For a batched version, we can evaluate the gradient on a set of examples (batch
gradient descent), or the full set (gradient descent).
θ_j := θ_j + α \sum_{i=1}^{m} \left( y^{(i)} − h(x^{(i)}) \right) x_j^{(i)}    (160)
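A compact sketch of the LMS rule of Eq. (159) (toy data and hyperparameters are my own; the feature matrix includes the intercept column x_0 = 1):

    import numpy as np

    def lms_sgd(X, y, alpha=0.01, epochs=50):
        """Stochastic gradient descent with the LMS update."""
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                theta += alpha * (y_i - x_i @ theta) * x_i
        return theta

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 200)
    X = np.column_stack([np.ones_like(x), x])          # intercept column x_0 = 1
    y = 1 + 2 * x + 0.1 * rng.normal(size=200)         # y = 1 + 2x plus noise
    print(lms_sgd(X, y))                               # approximately [1, 2]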

6.2 The Normal Equations

We can also directly minimize J without using an iterative algorithm. We define X as the matrix of all samples, of size m by n, and let ⃗y be an m-dimensional vector of all target values. We can define our cost function J as:

J(θ) = \frac{1}{2} (Xθ − ⃗y)^T (Xθ − ⃗y) = \frac{1}{2} \sum_{i=1}^{m} \left( h(x^{(i)}) − y^{(i)} \right)^2    (161)

We then take the derivative and find its roots.

∇_θ J(θ) = ∇_θ \frac{1}{2} (Xθ − ⃗y)^T (Xθ − ⃗y)
         = \frac{1}{2} ∇_θ \left( θ^T X^T Xθ − θ^T X^T ⃗y − ⃗y^T Xθ + ⃗y^T ⃗y \right)
         = \frac{1}{2} ∇_θ \left( \tr θ^T X^T Xθ − 2 \tr ⃗y^T Xθ \right)
         = \frac{1}{2} \left( X^T Xθ + X^T Xθ − 2 X^T ⃗y \right)
         = X^T Xθ − X^T ⃗y    (162)

To minimize J, we set its derivatives to zero, and obtain the normal equations:
X^T Xθ = X^T ⃗y    (163)

Which solves θ for a value that minimizes J(θ) in closed form:
θ = (X^T X)^{-1} X^T ⃗y    (164)
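The same fit via the normal equations, as a sketch (np.linalg.solve is used instead of forming the inverse explicitly, a standard numerical choice; data is the same toy setup as above):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 200)
    X = np.column_stack([np.ones_like(x), x])          # intercept column
    y = 1 + 2 * x + 0.1 * rng.normal(size=200)

    theta = np.linalg.solve(X.T @ X, X.T @ y)          # solves X^T X θ = X^T y
    print(theta)                                       # approximately [1, 2]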

6.3 Probabilistic Interpretation

Why does linear regression use the least-squares cost function? Assume that the target variables and inputs are
related via:
y^{(i)} = θ^T x^{(i)} + ε^{(i)}    (165)

Here, ε^{(i)} is an error term for noise. We assume each ε^{(i)} is independently and identically distributed according to a Gaussian distribution with mean zero and some variance σ^2: ε^{(i)} ∼ N(0, σ^2). So the density for any sample x^{(i)} with label y^{(i)} is y^{(i)} | x^{(i)}; θ ∼ N(θ^T x^{(i)}, σ^2). This implies:

p(y^{(i)} | x^{(i)}; θ) = \frac{1}{\sqrt{2π} σ} \exp\left( − \frac{(y^{(i)} − θ^T x^{(i)})^2}{2σ^2} \right)    (166)
The probability of a dataset X is quantified by a likelihood function:
L(θ) = L(θ; X, ⃗y) = p(⃗y | X; θ)    (167)

Since we assume independence on each noise term (and samples), we can write the likelihood function as:
L(θ) = \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}; θ) = \prod_{i=1}^{m} \frac{1}{\sqrt{2π} σ} \exp\left( − \frac{(y^{(i)} − θ^T x^{(i)})^2}{2σ^2} \right)    (168)

To get the best choice of parameters θ, we perform maximum likelihood estimation such that L(θ) is maximized.
Usually we take the negative log and minimize:
ℓ(θ) = − \log L(θ)
     = − \log \prod_{i=1}^{m} \frac{1}{\sqrt{2π} σ} \exp\left( − \frac{(y^{(i)} − θ^T x^{(i)})^2}{2σ^2} \right)
     = − \sum_{i=1}^{m} \log \frac{1}{\sqrt{2π} σ} \exp\left( − \frac{(y^{(i)} − θ^T x^{(i)})^2}{2σ^2} \right)    (169)
     = − m \log \frac{1}{\sqrt{2π} σ} + \frac{1}{σ^2} · \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} − θ^T x^{(i)} \right)^2

Hence, maximizing L(θ) is the same as minimizing the negative log likelihood ℓ(θ), which for linear regression is the least squares cost function:

\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} − θ^T x^{(i)} \right)^2    (170)
Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ. This is thus one set of assumptions under which least-squares regression can be
justified as performing maximum likelihood estimation. Note that θ is independent of σ 2 .

6.4 Locally Weighted Linear Regression

Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example
in its cost function by w(i) (x), which is defined with parameter τ ∈ R as:


w^{(i)}(x) = \exp\left( − \frac{(x^{(i)} − x)^2}{2τ^2} \right)    (171)
Hence, in LWR, we do the following:
1. Fit θ to minimize \sum_i w^{(i)} \left( y^{(i)} − θ^T x^{(i)} \right)^2
2. Output θ^T x
This is a non-parametric algorithm, where non-parametric refers to the fact that the amount of information we
need to represent the hypothesis h grows linearly with the size of the training set.
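A sketch of LWR prediction at a single query point, solving the weighted normal equations (X^T W X)θ = X^T W y (the dataset and bandwidth τ are my own illustrative choices):

    import numpy as np

    def lwr_predict(X, y, x_query, tau=0.5):
        """Locally weighted linear regression prediction at x_query."""
        w = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
        W = np.diag(w)
        theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
        return x_query @ theta

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 100)
    X = np.column_stack([np.ones_like(x), x])          # intercept column
    y = np.sin(x) + 0.1 * rng.normal(size=100)
    print(lwr_predict(X, y, np.array([1.0, 0.5])))     # near sin(0.5) ≈ 0.48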

7 Logistic Regression

We can extend this learning to classification problems, where we have binary labels y that are either 0 or 1.

7.1 The Logistic Function

For logistic regression, our new hypothesis for estimating the class of a sample x is:

h(x) = g(θ^T x) = \frac{1}{1 + e^{−θ^T x}}    (172)

where g(z) is the logistic or sigmoid function:
g(z) = \frac{1}{1 + e^{−z}}    (173)

The sigmoid function is bounded between 0 and 1, and tends towards 1 as z → ∞. It tends towards 0 when
z → −∞. A useful property of the sigmoid function is the form of its derivative:
g′(z) = \frac{d}{dz} \frac{1}{1 + e^{−z}}
      = \frac{e^{−z}}{(1 + e^{−z})^2}
      = \frac{1}{1 + e^{−z}} \left( 1 − \frac{1}{1 + e^{−z}} \right)
      = g(z)(1 − g(z))    (174)

7.2 Cost Function

To fit θ for a set of training examples, we assume that:
P(y = 1 | x; θ) = h(x)
P(y = 0 | x; θ) = 1 − h(x)    (175)

This can be written more compactly as:

p(y | x; θ) = (h(x))^y (1 − h(x))^{1−y}    (176)

Assuming m training examples generated independently, we define the likelihood function of the parameters as:

L(θ) = p(⃗y | X; θ) = \prod_{i=1}^{m} p(y^{(i)} | x^{(i)}; θ) = \prod_{i=1}^{m} \left( h(x^{(i)}) \right)^{y^{(i)}} \left( 1 − h(x^{(i)}) \right)^{1−y^{(i)}}    (177)


And taking the negative log likelihood to minimize:
ℓ(θ) = − \log L(θ)
     = − \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 − y^{(i)}) \log\left( 1 − h(x^{(i)}) \right) \right]    (178)
     = − \sum_{i=1}^{m} \left[ y^{(i)} θ^T x^{(i)} − \log\left( 1 + e^{θ^T x^{(i)}} \right) \right]

This is known as the binary cross-entropy loss function.

7.3 Gradient Descent

Let's start by working with just one training example (x, y), and take derivatives to derive the stochastic gradient ascent rule:

\frac{∂}{∂θ_j} ℓ(θ) = − \left( y \frac{1}{g(θ^T x)} − (1 − y) \frac{1}{1 − g(θ^T x)} \right) \frac{∂}{∂θ_j} g(θ^T x)
                    = − \left( y \frac{1}{g(θ^T x)} − (1 − y) \frac{1}{1 − g(θ^T x)} \right) g(θ^T x) \left( 1 − g(θ^T x) \right) \frac{∂}{∂θ_j} θ^T x    (179)
                    = − \left( y (1 − g(θ^T x)) − (1 − y) g(θ^T x) \right) x_j
                    = − (y − h(x)) x_j

This therefore gives us the stochastic gradient ascent rule:



θ_j := θ_j + α \left( y^{(i)} − h(x^{(i)}) \right) x_j^{(i)}    (180)

We must use gradient descent for logistic regression since there is no closed form solution for this problem.
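A compact sketch of logistic regression trained with the stochastic update of Eq. (180) (toy separable data; all names are my own):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def logistic_sgd(X, y, alpha=0.1, epochs=100):
        theta = np.zeros(X.shape[1])
        for _ in range(epochs):
            for x_i, y_i in zip(X, y):
                theta += alpha * (y_i - sigmoid(x_i @ theta)) * x_i
        return theta

    rng = np.random.default_rng(0)
    X = np.column_stack([np.ones(200), rng.normal(size=200)])
    y = (X[:, 1] > 0).astype(float)                    # linearly separable labels
    theta = logistic_sgd(X, y)
    print(((sigmoid(X @ theta) > 0.5) == y).mean())    # training accuracy near 1.0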

7.4 Newton-Raphson Algorithm

The Hessian matrix for logistic regression is:
\frac{∂^2 ℓ(θ)}{∂θ ∂θ^T} = \sum_{i=1}^{m} x^{(i)} x^{(i)T} · h(x^{(i)}) · \left( 1 − h(x^{(i)}) \right)    (181)
(181)

Hence the second order update is:

θ := θ − \left( \frac{∂^2 ℓ(θ)}{∂θ ∂θ^T} \right)^{-1} \frac{∂ℓ(θ)}{∂θ}    (182)

(182)

Softmax Regression

A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there
are more than 2 outcome classes.

8.1 Softmax Function

The softmax function creates a probability distribution over a set of k classes for a training example x, with θ_k denoting the set of parameters to be optimized for the k-th class.

p(y = k | x; θ) = \frac{\exp(θ_k^T x)}{\sum_j \exp(θ_j^T x)}    (183)

8.2 MLE and Cost Function

We can write the maximum likelihood function for softmax regression as:
L(θ) = \prod_{i=1}^{m} \prod_{k} p(y = k | x_i; θ)^{1\{y_i = k\}}    (184)

Where 1{yi = k} is the indicator function which is 1 if its argument is true, 0 otherwise. By taking the negative
log likelihood:
ℓ(θ) = − \log L(θ)
     = − \log \prod_{i=1}^{m} \prod_{k} p(y = k | x_i; θ)^{1\{y_i = k\}}
     = − \sum_{i=1}^{m} \sum_{k} 1\{y_i = k\} · \left[ θ_k^T x_i − \log \sum_j \exp(θ_j^T x_i) \right]    (185)
     = \sum_{i=1}^{m} \left[ −θ_{y_i}^T x_i + \log \sum_j \exp(θ_j^T x_i) \right]

This is known as the cross-entropy loss function.

8.3 Gradient Descent

To perform gradient descent, we must take the derivative of our cost function; note that the derivative for the correct class is different from that of the other classes.

∇_{θ_j} ℓ(θ) = ∇_{θ_j} \sum_{i=1}^{m} \left( −θ_{y_i}^T x_i + \log \sum_k \exp(θ_k^T x_i) \right)
             = \sum_{i=1}^{m} \left( ∇_{θ_j} (−θ_{y_i}^T x_i) + ∇_{θ_j} \log \sum_k \exp(θ_k^T x_i) \right)
             = \sum_{i=1}^{m} \left( 1\{y_i = j\} · (−x_i) + \frac{\exp(θ_j^T x_i)}{\sum_k \exp(θ_k^T x_i)} · x_i \right)    (186)
             = \sum_{i=1}^{m} \left( \frac{\exp(θ_j^T x_i)}{\sum_k \exp(θ_k^T x_i)} − 1\{y_i = j\} \right) · x_i

And our update equation for the j-th parameter weights is:
θ_j := θ_j − α \sum_{i=1}^{m} \left( \frac{\exp(θ_j^T x_i)}{\sum_k \exp(θ_k^T x_i)} − 1\{y_i = j\} \right) · x_i    (187)

Note that since each class has a set of weights, our gradient is a matrix known as the Jacobian J, with k classes each with n feature weights:

J_θ = \left[ \frac{∂ℓ(θ)}{∂θ_1} \cdots \frac{∂ℓ(θ)}{∂θ_k} \right] = \begin{bmatrix} \frac{∂ℓ(θ)}{∂θ_{11}} & \cdots & \frac{∂ℓ(θ)}{∂θ_{k1}} \\ \vdots & \ddots & \vdots \\ \frac{∂ℓ(θ)}{∂θ_{1n}} & \cdots & \frac{∂ℓ(θ)}{∂θ_{kn}} \end{bmatrix}    (188)
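A vectorized sketch of the softmax gradient of Eq. (186) (shapes and toy data are my own; Theta stores one weight column per class):

    import numpy as np

    def softmax(Z):
        Z = Z - Z.max(axis=1, keepdims=True)           # numerical stability
        e = np.exp(Z)
        return e / e.sum(axis=1, keepdims=True)

    def softmax_grad(Theta, X, y, k):
        """Gradient of the cross-entropy loss; Theta is (n, k), X is (m, n)."""
        P = softmax(X @ Theta)                         # (m, k) class probabilities
        Y = np.eye(k)[y]                               # one-hot labels, (m, k)
        return X.T @ (P - Y)                           # (n, k) Jacobian

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 3))
    y = (X[:, 0] > 0).astype(int) + 2 * (X[:, 1] > 0)  # four separable classes
    Theta = np.zeros((3, 4))
    for _ in range(200):                               # plain gradient descent
        Theta -= 0.01 * softmax_grad(Theta, X, y, k=4)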

9 Generalized Linear Models

9.1 Exponential Family

9.2 Assumptions of GLMs

9.3 Examples

9.3.1 Ordinary Least Squares

9.3.2 Logistic Regression

9.3.3 Softmax Regression

10 Perceptron

11 Support Vector Machines

12 Margin Classification

13 Generative Learning: Gaussian Discriminant Analysis

A generative model learns how the data is generated by estimating P(x|y), which is then used to estimate P(y|x) via Bayes rule.

13.1 Assumptions

Gaussian Discriminant Analysis assumes the following:
1. y ∼ Bernoulli(φ)
2. x|y = 0 ∼ N (µ0 , Σ)
3. x|y = 1 ∼ N (µ1 , Σ)
Writing out the distributions, we have:

P(y) = φ^y · (1 − φ)^{1−y}    (189)

P(x | y = k) = \frac{1}{(2π)^{n/2} |Σ|^{1/2}} \exp\left( − \frac{1}{2} (x − µ_k)^T Σ^{-1} (x − µ_k) \right)    (190)

13.2 Estimation

The log-likelihood of the data is given by:
ℓ(φ, µ_0, µ_1, Σ) = \log \prod_{i=1}^{m} P(x^{(i)}, y^{(i)}; φ, µ_0, µ_1, Σ)
                  = \log \prod_{i=1}^{m} P(x^{(i)} | y^{(i)}; φ, µ_0, µ_1, Σ) · P(y^{(i)}; φ)    (191)

By maximizing ` with respect to the parameters, we get the following MLE of the parameters:
φ = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = 1\}    (192)

µ_k = \frac{ \sum_{i=1}^{m} 1\{y^{(i)} = k\} x^{(i)} }{ \sum_{i=1}^{m} 1\{y^{(i)} = k\} }    (193)

Σ = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} − µ_{y^{(i)}})(x^{(i)} − µ_{y^{(i)}})^T    (194)

13.3 Prediction

To classify a point x, we find the class y = k which maximizes the posterior probability:

ŷ = \arg\max_k P(y = k) · P(x | y = k)    (195)

14 Generative Learning: Naive Bayes

14.1 Assumptions

The Naive Bayes model supposes that the features of each data point are all conditionally independent given the class:

P(x|y) = P(x_1, x_2, ..., x_n | y) = P(x_1|y) P(x_2|y) \cdots P(x_n|y) = \prod_{i=1}^{n} P(x_i | y)    (196)

14.2 Estimation

We can write the likelihood of the data as:
L(φ_y, φ_{j=l|y=k}) = \prod_{i=1}^{m} P(x^{(i)}, y^{(i)})    (197)

with classes denoted by k and features xj = l (xj takes on value l). By maximizing the estimates, we get:
φ_y = P(y = k) = \frac{1}{m} \sum_{i=1}^{m} 1\{y^{(i)} = k\}    (198)

φ_{j=l|y=k} = P(x_j = l | y = k) = \frac{ \sum_{i=1}^{m} 1\{y^{(i)} = k ∧ x_j^{(i)} = l\} }{ \sum_{i=1}^{m} 1\{y^{(i)} = k\} }    (199)

14.2.1 Laplace Smoothing

Laplace smoothing allows unseen data to have nonzero probability, rather than automatically destroying the prediction process by setting some class probabilities to 0. We can replace our feature estimates with the following:

φ_j = \frac{ \sum_{i=1}^{m} 1\{x^{(i)} = j\} + 1 }{ m + k }    (200)

Or more generically:

φ_{j=l|y=k} = \frac{ \sum_{i=1}^{m} 1\{x_j^{(i)} = l ∧ y^{(i)} = k\} + 1 }{ \sum_{i=1}^{m} 1\{y^{(i)} = k\} + |K| }    (201)

where |K| is the number of classes.

14.3 Prediction

To classify a point x, we find the class y = k which maximizes the probability:
ŷ = \arg\max_k P(y = k) · \prod_{j=1}^{n} P(x_j | y = k)    (202)
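A sketch of Naive Bayes with Laplace smoothing for binary features (my own illustration; the smoothing denominator here adds the number of values a feature can take, which coincides with |K| = 2 in this binary toy case):

    import numpy as np

    def fit_nb(X, y, n_classes, n_values=2):
        """Estimate P(y=k) and smoothed P(x_j=l | y=k) as in Eqs. 198 and 201."""
        prior = np.array([(y == k).mean() for k in range(n_classes)])
        cond = np.zeros((n_classes, X.shape[1], n_values))
        for k in range(n_classes):
            Xk = X[y == k]
            for l in range(n_values):
                cond[k, :, l] = ((Xk == l).sum(axis=0) + 1) / (len(Xk) + n_values)
        return prior, cond

    def predict_nb(prior, cond, x):
        """arg max_k of log P(y=k) + sum_j log P(x_j | y=k), the log of Eq. 202."""
        n = len(x)
        scores = [np.log(prior[k]) + np.log(cond[k, np.arange(n), x]).sum()
                  for k in range(len(prior))]
        return int(np.argmax(scores))

    rng = np.random.default_rng(0)
    X = rng.integers(0, 2, size=(100, 5))
    y = (X[:, 0] | X[:, 1]).astype(int)                # toy labels
    prior, cond = fit_nb(X, y, n_classes=2)
    print(predict_nb(prior, cond, np.array([1, 0, 0, 1, 0])))   # predicts 1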

15 Tree-based Methods

15.1 Decision Trees

Decision trees are created via the Classification and Regression Trees (CART) training algorithm. They can be represented as binary trees where each node partitions the data space according to a threshold, chosen to minimize either Gini impurity or entropy between the two child nodes.

15.2 Random Forest

Random forests are a tree-based technique that uses a large number of decision trees built from randomly selected sets of features. Contrary to a single decision tree, a random forest is highly uninterpretable, but its generally good performance makes it a popular algorithm. It averages the decisions across several trees to combat overfitting. Random forests are a type of ensemble method.

15.3 Boosting

The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are
summed up below:
1. Adaptive boosting: Known as adaboost, it places high weights on errors to improve at the next boosting step.
2. Gradient boosting: Weak learners are trained on the remaining errors.
Boosting requires training several models to improve upon previous ones.

16 K-Nearest Neighbors

The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of
a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification
and regression settings. The higher the parameter k, the higher the bias, and the lower the parameter k, the higher
the variance.

16.1 Classification

In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors,
with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer,
typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.

16.2 Regression

In k-NN regression, the output is the property value for the object. This value is the average of the values of its k
nearest neighbors.
y = \frac{1}{k} \sum_{x_i ∈ N_k(x)} y_i    (203)

where Nk (x) is the k nearest points around x.
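A short sketch of k-NN classification and regression in plain NumPy (the data and k are illustrative):

    import numpy as np

    def knn_predict(X_train, y_train, x, k=3, regression=False):
        """Predict from the k nearest training points (Euclidean distance)."""
        dists = np.linalg.norm(X_train - x, axis=1)
        neighbors = y_train[np.argsort(dists)[:k]]
        if regression:
            return neighbors.mean()                    # Eq. 203
        values, counts = np.unique(neighbors, return_counts=True)
        return values[counts.argmax()]                 # majority vote

    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 2))
    y = (X[:, 0] > 0).astype(int)
    print(knn_predict(X, y, np.array([0.5, 0.0]), k=5))    # most likely 1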

17 K-Means Clustering

Clustering seeks to group similar points of data together in a cluster. We denote c^{(i)} as the cluster for data point i and µ_j as the center for cluster j. We denote k as the number of clusters and n as the dimension of our data.

17.1 Algorithm

After randomly initializing the cluster centroids µ1 , µ2 , . . . , µk ∈ Rn , repeat until convergence:
1. For every data point i:

c^{(i)} = \arg\min_j ||x^{(i)} − µ_j||^2    (204)

2. For each cluster j:

µ_j = \frac{ \sum_{i=1}^{m} 1\{c^{(i)} = j\} x^{(i)} }{ \sum_{i=1}^{m} 1\{c^{(i)} = j\} }    (205)

The first step is known as cluster assignment, and the second updates the cluster center (i.e. the average of all
points in the cluster). In order to see if it converges, use the distortion function:
J(c, µ) = \sum_{i=1}^{m} ||x^{(i)} − µ_{c^{(i)}}||^2    (206)

The distortion function J is non-convex, and coordinate descent of J is not guaranteed to converge to the global
minimum (i.e. susceptible to local optima).
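A minimal k-means sketch (my own illustration; centroids are initialized from random data points, and the well-separated toy blobs keep every cluster non-empty):

    import numpy as np

    def kmeans(X, k, iters=100, seed=0):
        rng = np.random.default_rng(seed)
        mu = X[rng.choice(len(X), size=k, replace=False)]    # initial centroids
        for _ in range(iters):
            # Cluster assignment step (Eq. 204)
            c = np.argmin(np.linalg.norm(X[:, None] - mu[None], axis=2), axis=1)
            # Centroid update step (Eq. 205)
            new_mu = np.array([X[c == j].mean(axis=0) for j in range(k)])
            if np.allclose(new_mu, mu):
                break
            mu = new_mu
        return c, mu

    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
    c, mu = kmeans(X, k=2)
    print(mu)    # centers near [0, 0] and [4, 4]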

17.2 Hierarchical Clustering

Hierarchical clustering is a clustering algorithm with an agglomerative hierarchical approach that builds nested
clusters in a successive manner. The types are:
1. Ward Linkage: minimize within cluster distance
2. Average Linkage: minimize average distance between cluster pairs
3. Complete Linkage: minimize maximum distance between cluster pairs

17.3 Clustering Metrics

In an unsupervised learning setting, it is often hard to assess the performance of a model since we don’t have the
ground truth labels as was the case in the supervised learning setting.
Silhouette coefficient By noting a and b the mean distance between a sample and all other points in the same
class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single
sample is defined as follows:
s = \frac{b − a}{\max(a, b)}    (207)
33

Machine Learning Study Guide

Calinskli-Harabaz Index By noting k the number of clusters, Bk and Wk the between and within-clustering
dispersion matricies defined as:
k
X
Bk =
nc(i) (µc(i) − µ)(µc(i) − µ)T
(208)
j=1

Wk =

m
X

(x(i) − µc(i) )(x(i) − µc(i) )T

(209)

i=1

the Calinksli-Harabaz index s(k) indicated how well a clustering model defines its clusters, such that higher scores
indicate more dense and well separated cluster assignments. It is defined as:
s(k) = \frac{Tr(B_k)}{Tr(W_k)} × \frac{N − k}{k − 1}    (210)
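Both metrics are available in scikit-learn; a brief sketch on toy data (the data and cluster labels here are stand-ins):

import numpy as np
from sklearn.metrics import silhouette_score, calinski_harabasz_score

X = np.random.rand(100, 2)                   # toy data, stand-in
labels = np.random.randint(0, 3, size=100)   # stand-in cluster assignments
s = silhouette_score(X, labels)              # mean of Eq. (207) over samples
ch = calinski_harabasz_score(X, labels)      # Eq. (210)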

18 Expectation-Maximization

18.1 Mixture of Gaussians

18.2 Factor Analysis

19 Principal Component Analysis

Principal Component Analysis is a dimensionality reduction technique that finds the variance-maximizing directions onto which to project the data.

19.1 Eigenvalues, Eigenvectors, and the Spectral Theorem

Recall that for a given matrix A ∈ Rn×n , λ is said to be an eigenvalue of A if there exists a vector z ∈ Rn \{0},
called an eigenvector, such that:
Az = λz    (211)
The spectral theorem states that if a matrix A ∈ R^{n×n} is symmetric, then A is diagonalizable by a real orthogonal matrix U ∈ R^{n×n}. With Λ = diag(λ_1, . . . , λ_n), we have:

A = U Λ U^T    (212)

Note that the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A.

19.2 Algorithm

The PCA procedure projects the data onto k dimensions by maximizing the variance of the data as follows:
1. Normalize the data to have mean 0, standard deviation 1:
x^{(i)} ← \frac{x^{(i)} − µ}{σ}    (213)

where µ and σ^2 are:

µ = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}    (214)

σ^2 = \frac{1}{m} \sum_{i=1}^{m} (x^{(i)} − µ)^2    (215)

2. Compute covariance matrix Σ, which is symmetric with real eigenvalues.
Σ = \frac{1}{m} \sum_{i=1}^{m} x^{(i)} x^{(i)T} ∈ R^{n×n}    (216)

3. Compute u1 , . . . , uk ∈ Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the
k largest eigenvalues.
4. Project the data onto span_R(u_1, . . . , u_k) to create a vector y^{(i)} from point x^{(i)}:

y^{(i)} = U^T x^{(i)} = [u_1^T x^{(i)}, u_2^T x^{(i)}, . . . , u_k^T x^{(i)}]^T ∈ R^k    (217)
This procedure maximizes the variance among all k-dimensional spaces.
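A minimal NumPy sketch of the four steps (a sketch that assumes no feature has zero variance; the function name is ours):

import numpy as np

def pca(X, k):
    # Step 1: normalize each feature to zero mean, unit variance, Eq. (213)
    X = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix, Eq. (216)
    Sigma = X.T @ X / len(X)
    # Step 3: eigenvectors of the symmetric Sigma, sorted by eigenvalue
    vals, vecs = np.linalg.eigh(Sigma)
    U_k = vecs[:, np.argsort(vals)[::-1][:k]]
    # Step 4: project onto the top-k principal directions, Eq. (217)
    return X @ U_k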

19.3 Algorithm: SVD

Eigenvalue decomposition is defined only for square matrices, so in practice the SVD of the data matrix X is often used instead.
1. Zero-mean the data:

x^{(i)} ← x^{(i)} − µ    (218)

where µ is:

µ = \frac{1}{m} \sum_{i=1}^{m} x^{(i)}    (219)

2. Compute the singular value decomposition of X_µ, our zero-meaned data matrix:

X_µ = U S V^T    (220)

3. Take the first k right singular vectors (the first k columns of V, which are the principal eigenvectors of Σ) as our transform matrix, denoted V_k.
4. Project the data to create a matrix Y from data matrix X_µ:

Y = X_µ V_k    (221)
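A minimal NumPy sketch of the SVD route (np.linalg.svd returns V^T directly; the function name is ours):

import numpy as np

def pca_svd(X, k):
    X_mu = X - X.mean(axis=0)            # zero-mean the data, Eq. (218)
    # Rows of Vt are the right singular vectors of X_mu, Eq. (220)
    U, S, Vt = np.linalg.svd(X_mu, full_matrices=False)
    V_k = Vt[:k].T                       # top-k principal directions
    return X_mu @ V_k                    # projection, Eq. (221)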

20 Independent Component Analysis

21 Reinforcement Learning

21.1 Markov Decision Processes

21.2 Policy and Value Functions

21.3 Value Iteration Algorithm

21.4 Q-Learning

22 Probabilistic Graphical Models

Probabilistic Graphical Models are models for which a graph can express the conditional dependence structure
between random variables. We will define the basics used to understand PGMs here.

For starters, a model is a set of probability distributions that may have generated data. For example, a model could
be the set of normal distributions





\left\{ p(x) : p(x) = \frac{1}{\sqrt{2πσ^2}} \exp\left( −\frac{(x − µ)^2}{2σ^2} \right), \; µ ∈ R, \; σ^2 ∈ R^{+} \right\}    (222)

A graphical model is defined to be a pair (G, P) where G is a graph and P is a set of distributions which factorizes
according to G. A graph G = (V, E) consists of a set of vertices V and a set of edges E. Typically the vertices are
random variables and the edges relate the random variables.

22.1 Bayesian Networks

Bayesian Networks are a type of PGM constructed from directed acyclic graphs. A directed acyclic graph (DAG) is a graph G = (V, E) such that E contains only directed edges (→) and no node X_i has a sequence of directed edges leading back to itself (i.e., the graph contains no directed cycles).
Due to the structure of the DAG, we can factorize the joint distribution p(x) with respect to the graph as:

p(x) = \prod_{i=1}^{n} p(x_i | pa(x_i, G))    (223)

Here, pa(xi , G) are the parents of xi with respect to graph G.
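A tiny illustration of the factorization in Eq. (223) for the chain DAG x1 → x2 → x3, with hypothetical conditional probability tables:

# Chain DAG x1 -> x2 -> x3 with hypothetical CPTs (all variables binary)
p_x1 = {0: 0.6, 1: 0.4}
p_x2_given_x1 = {0: {0: 0.7, 1: 0.3}, 1: {0: 0.2, 1: 0.8}}
p_x3_given_x2 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.5, 1: 0.5}}

def joint(x1, x2, x3):
    # Eq. (223): product of p(x_i | pa(x_i, G)) over the nodes
    return p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]

# The factorized joint sums to 1 over all assignments
assert abs(sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1)) - 1.0) < 1e-9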
Using this factorization, we can specify conditional independencies from the graph’s structure. We define three
useful set construction operations for these networks:

an(x_i, G) ≡ {x_j | x_j → · · · → x_i ∈ G}   (ancestors of x_i)
de(x_i, G) ≡ {x_j | x_j ← · · · ← x_i ∈ G}   (descendants of x_i)
nd(x_i, G) ≡ {x_j | x_j ∉ de(x_i, G)}   (non-descendants of x_i)    (224)

A distribution p(x) satisfies the local Markov property wrt DAG G if:

x_i ⊥⊥ nd^∗(x_i, G) | pa(x_i, G)    (225)

where nd^∗(x_i, G) ≡ nd(x_i, G) \ pa(x_i, G).
For the global Markov property, let A, B, C be disjoint subsets of X. A distribution p(X) satisfies the global Markov property wrt DAG G if:

A ⊥_d B | C =⇒ A ⊥⊥ B | C    (226)

where ⊥_d denotes d-separation in G.


22.1.1 Hidden Markov Models

22.2 Markov Random Fields

22.3 Conditional Random Fields

23 Deep Learning: Basics

23.1 Basics

23.2 Activation Functions

23.3 Loss Functions

23.4 Backpropagation

23.5 Regularization Methods

23.6 Optimization Algorithms

23.7 Convolutional Networks

23.8 Recurrent Networks

Recurrent networks allow us to work with sequences, where x_t is the input at time t, h_t is the hidden state at time t, and h_{t−1} is the previous hidden state. RNNs propagate information through the sequence via the hidden state h. Hidden states default to zero.
23.8.1 Elman RNN

The Elman RNN is the simplest recurrent layer, yet it is prone to both the vanishing and exploding gradient problems.

h_t = tanh(W_{ih} x_t + b_{ih} + W_{hh} h_{t−1} + b_{hh})    (227)

Here, the W are weight matrices and the b are bias vectors, one each for the input and hidden vectors.
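A minimal NumPy sketch of Eq. (227) unrolled over a sequence (the function name and argument layout are ours, not from the source):

import numpy as np

def elman_rnn(xs, W_ih, b_ih, W_hh, b_hh):
    h = np.zeros(W_hh.shape[0])          # hidden state defaults to zero
    hs = []
    for x_t in xs:                       # apply Eq. (227) at each time step
        h = np.tanh(W_ih @ x_t + b_ih + W_hh @ h + b_hh)
        hs.append(h)
    return np.stack(hs)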
23.8.2 Long Short-Term Memory

LSTM RNNs mitigate the vanishing gradient problem. They involve a more complex set of equations corresponding to gates that control the amount of information to propagate through the sequence. Here, i is the input gate, f is the forget gate, g is the cell gate, and o is the output gate. LSTMs carry two hidden inputs, the hidden state h_t and the cell state c_t.
i_t = σ(W_{ii} x_t + b_{ii} + W_{hi} h_{t−1} + b_{hi})
f_t = σ(W_{if} x_t + b_{if} + W_{hf} h_{t−1} + b_{hf})
g_t = tanh(W_{ig} x_t + b_{ig} + W_{hg} h_{t−1} + b_{hg})
o_t = σ(W_{io} x_t + b_{io} + W_{ho} h_{t−1} + b_{ho})
c_t = f_t ∘ c_{t−1} + i_t ∘ g_t
h_t = o_t ∘ tanh(c_t)    (228)

Here, σ is the sigmoid function.
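A minimal NumPy sketch of one LSTM step per Eq. (228); the packing of parameters into dicts keyed by gate name is a convenience of the sketch, not from the source:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W[name] = (W_i*, W_h*) and b[name] = (b_i*, b_h*) for each gate
    def gate(name, act):
        W_i, W_h = W[name]
        b_i, b_h = b[name]
        return act(W_i @ x_t + b_i + W_h @ h_prev + b_h)
    i_t = gate("i", sigmoid)             # input gate
    f_t = gate("f", sigmoid)             # forget gate
    g_t = gate("g", np.tanh)             # cell gate
    o_t = gate("o", sigmoid)             # output gate
    c_t = f_t * c_prev + i_t * g_t       # new cell state
    h_t = o_t * np.tanh(c_t)             # new hidden state
    return h_t, c_t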

23.8.3 Gated Recurrent Unit

GRU RNNs are another RNN layer designed to mitigate the vanishing gradient problem. Here, r is the reset gate, z is the update gate, and n is the new gate; σ is the sigmoid function.

r_t = σ(W_{ir} x_t + b_{ir} + W_{hr} h_{t−1} + b_{hr})
z_t = σ(W_{iz} x_t + b_{iz} + W_{hz} h_{t−1} + b_{hz})
n_t = tanh(W_{in} x_t + b_{in} + r_t ∘ (W_{hn} h_{t−1} + b_{hn}))
h_t = (1 − z_t) ∘ n_t + z_t ∘ h_{t−1}    (229)

23.8.4 Bidirectional RNNs

Bidirectional RNNs run two separate RNN layers on the forward and reversed sequence, and then concatenate the two outputs into one output vector.
\overrightarrow{h_f} = RNN(\overrightarrow{x})
\overleftarrow{h_r} = RNN(\overleftarrow{x})
h_o = [\overrightarrow{h_f} ; \overleftarrow{h_r}]    (230)

Where RNN can be any of the 3 previously mentioned layers.
23.8.5 Vanishing/Exploding Gradient

The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. They arise because the gradient is a product over time steps, so it can shrink or grow exponentially with the length of the sequence, which makes long-term dependencies difficult to capture.
23.8.6 Gradient Clipping

Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value of the gradient, this phenomenon is controlled in practice:

∇L_{clipped} = min(∇L, C)    (231)

for some maximum value C.
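A minimal NumPy sketch of Eq. (231), together with the clip-by-norm variant commonly used in practice (the norm variant is our addition, not from the source):

import numpy as np

def clip_gradient(grad, C):
    # Elementwise cap at C, per Eq. (231)
    return np.minimum(grad, C)

def clip_by_norm(grad, C):
    # Common variant: rescale so the gradient norm is at most C
    norm = np.linalg.norm(grad)
    return grad if norm <= C else grad * (C / norm)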

24 Deep Learning: Advanced

24.1 Autoencoders

24.1.1 Variational Autoencoders

24.2 Generative Adversarial Networks

24.3 Encoder-Decoder Models

Encoder-decoder models allow for many-to-many RNN sequence learning. The idea is to take a source sequence, encode it to a vectorized output, and pass the hidden state (and possibly the encodings themselves) over to the decoder to generate output. These models come up in machine translation, allowing models to translate a source language into a target language.

24.3.1 Encoders

24.3.2 Decoders

24.4 Attention Models

Attention models were introduced by Bahdanau, and later extended by Luong. They are a means to alter the representation of a set of encodings so as to pay more attention to certain sequence elements than to others. In an encoder-decoder model without attention, the decoder relies exclusively on the hidden state to hold all information. Attention, by contrast, uses the hidden state to create a context vector over the encodings.
24.4.1 Context Vector

Attention mechanisms rely on one simple concept: producing context vectors. Given a set of encodings h = [h_1, · · · , h_j, · · · , h_n] from the encoder, and the current hidden state s_{i−1}, we compute a score for each encoding h_j, denoted score(s_{i−1}, h_j). These scores are discussed in later sections. With a score computed for each encoding, we apply a softmax across all scores:

a(s_{i−1}, h_j) = \frac{\exp(score(s_{i−1}, h_j))}{\sum_{j'} \exp(score(s_{i−1}, h_{j'}))}    (232)

These softmax probabilities are the attention weights, and our context vector c_i is the weighted sum of the encodings under these weights:

c_i = \sum_{j'} a(s_{i−1}, h_{j'}) · h_{j'}    (233)

These context vectors are appended to the embedding of the token fed into the decoder.
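A minimal NumPy sketch of Eqs. (232)-(233), assuming the raw scores have already been computed by one of the mechanisms below (the function name is ours):

import numpy as np

def context_vector(scores, H):
    # scores: (n,) raw score(s_{i-1}, h_j) values; H: (n, d) encodings
    a = np.exp(scores - scores.max())    # numerically stable softmax, Eq. (232)
    a = a / a.sum()
    return a @ H                         # weighted sum of encodings, Eq. (233)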
24.4.2 Concat (Bahdanau) Attention

The concat attention scoring mechanism has two learnable weights, v_a and W_a, and is defined as:

score(s_{i−1}, h_j) = v_a^T · tanh(W_a · [s_{i−1}; h_j])    (234)

24.4.3 Luong Attention Mechanisms

Luong defined several global attention mechanisms:

score(s_{i−1}, h_j) = \begin{cases} s_{i−1}^T · W_a · h_j & \text{(general)} \\ s_{i−1}^T · h_j & \text{(dot)} \end{cases}    (235)

Luong also explored local attention weights, with monotonic alignment and predictive alignment. The difference in local attention is that the context vector is derived from source hidden states within the window [p_t − D, p_t + D], for some D and decoder time step t (i.e., i − 1). In monotonic alignment, the aligned position is p_t = t. In predictive alignment, the aligned position is:

p_t = S · σ(v_p^T tanh(W_p s_{i−1}))    (236)

where v_p and W_p are learnable parameters and S is the source sentence length. In addition, the weights are Gaussian, centered around p_t:

a(s_{i−1}, h_j) = \frac{\exp(score(s_{i−1}, h_j))}{\sum_{j'} \exp(score(s_{i−1}, h_{j'}))} · \exp\left( −\frac{(s − p_t)^2}{2(D/2)^2} \right)    (237)
Here, s is an integer within the window centered at pt .

24.5 Word Embeddings

Word embeddings seek to create a dense, real-valued vector representation of a single word in a Euclidean space. Word vectors are stored as weight matrices where each row corresponds to a word, accessed through a numerical token representing the word itself. Several models have been developed to train these word embedding weights to encode lexical meaning.
24.5.1 N-Gram Models

A simple approach is to take a context of n words that appear before our target word, and train a model to predict a word w_i that appears in a training corpus. We average the input word embeddings, apply a linear layer to project to the vocabulary size, and maximize the probability of predicting the target word (i.e., apply a softmax to the output vector).
max p(w_i | w_{i−n}, . . . , w_{i−1}) = max \frac{\exp f(w_{i−n}, . . . , w_{i−1})_i}{\sum_{j=1}^{|V|} \exp f(w_{i−n}, . . . , w_{i−1})_j}    (238)

where our model is:




f(w_{i−n}, . . . , w_{i−1}) = W_d · \left( \frac{1}{n} \sum_{j=1}^{n} W_e[w_{i−j}] \right) + b_d ∈ R^{|V|}    (239)
Here, |V| is the size of the vocabulary, W_d ∈ R^{|V|×d} is a projection matrix used for training along with bias b_d, and W_e ∈ R^{|V|×d} are the word embeddings we seek to train; d is the embedding dimension.
Now we can write the negative log likelihood as:
min −log p(w_i | w_{i−n}, . . . , w_{i−1}) = −f(w_{i−n}, . . . , w_{i−1})_i + log \sum_{j=1}^{|V|} \exp f(w_{i−n}, . . . , w_{i−1})_j    (240)
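A minimal NumPy sketch of Eqs. (239)-(240), assuming W_e and W_d have the shapes given above (the function names are hypothetical):

import numpy as np

def ngram_logits(context_ids, W_e, W_d, b_d):
    # Eq. (239): average the n context embeddings, project to |V| logits
    avg = W_e[context_ids].mean(axis=0)          # shape (d,)
    return W_d @ avg + b_d                       # shape (|V|,)

def ngram_nll(context_ids, target_id, W_e, W_d, b_d):
    # Eq. (240): negative log likelihood via a stable log-sum-exp
    f = ngram_logits(context_ids, W_e, W_d, b_d)
    return -f[target_id] + f.max() + np.log(np.exp(f - f.max()).sum())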

24.5.2 Continuous Bag-of-Words Models (CBOW)

The continuous bag of words model, or CBOW, seeks to predict a center word given the surrounding context words. For instance, given a context size of n, for a target center word wi , we are given context words
wi−n , . . . wi−1 , wi+1 , . . . wi+n and attempt to predict wi . Hence:
max p(w_i | w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n}) = max \frac{\exp f(w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n})_i}{\sum_{j=1}^{|V|} \exp f(w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n})_j}    (241)
Where our model is:

f(w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n}) = W_d · \left( \frac{1}{2n} \sum_{j=−n, j≠0}^{n} W_e[w_{i+j}] \right) + b_d ∈ R^{|V|}    (242)

Here, |V| is the size of the vocabulary, W_d ∈ R^{|V|×d} is a projection matrix used for training along with bias b_d, and W_e ∈ R^{|V|×d} are the word embeddings we seek to train; d is the embedding dimension.
Now we can write the negative log likelihood as:
min −log p(w_i | w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n}) = −f(w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n})_i + log \sum_{j=1}^{|V|} \exp f(w_{i−n}, . . . , w_{i−1}, w_{i+1}, . . . , w_{i+n})_j    (243)

24.5.3 Skip-Gram Model

24.5.4 Negative Sampling
