Study Guide
User Manual:
Open the PDF directly: View PDF .
Page Count: 41
Download | |
Open PDF In Browser | View PDF |
Machine Learning Study Guide William Watson Johns Hopkins University billwatson@jhu.edu Contents 2.3 1 Linear Algebra and Calculus 1.1 General Notation . . . . . . . . . . . . . 1.1.1 Indentity Matrix . . . . . . . . . 1.1.2 Diagonal Matrix . . . . . . . . . 1.1.3 Orthogonal Matrix . . . . . . . . 1.2 Matrix Operations . . . . . . . . . . . . 1.2.1 Vector-Vector Products . . . . . 1.2.2 Vector-Matrix Products . . . . . 1.2.3 Matrix-Matrix Products . . . . . 1.2.4 The Transpose . . . . . . . . . . 1.2.5 The Trace . . . . . . . . . . . . . 1.2.6 The Inverse . . . . . . . . . . . . 1.2.7 The Determinant . . . . . . . . . 1.3 Matrix Properties . . . . . . . . . . . . . 1.3.1 Norms . . . . . . . . . . . . . . . 1.3.2 Linear Dependence and Rank . . 1.3.3 Span, Range, and Nullspace . . . 1.3.4 Symmetric Matrices . . . . . . . 1.3.5 Positive Semidefinite Matrices . . 1.3.6 Eigendecomposition . . . . . . . 1.3.7 Singlular Value Decomposition . 1.3.8 The Moore-Penrose Pseudoinverse 1.4 Matrix Calculus . . . . . . . . . . . . . . 1.4.1 The Gradient . . . . . . . . . . . 1.4.2 The Hessian . . . . . . . . . . . . 1.4.3 Gradient Properties . . . . . . . 4 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 8 8 8 9 9 9 10 10 10 11 2 Convex Optimization 2.1 Convexity . . . . . . . . . . . . . 2.1.1 Convex Sets . . . . . . . . 2.1.2 Convex Functions . . . . 2.1.3 First-Order Conditions . . 2.1.4 Second-Order Conditions 2.1.5 Jensen’s Inequality . . . . 2.1.6 Sublevel Sets . . . . . . . 2.2 Convex Optimization . . . . . . . 2.2.1 Global Optimality . . . . 2.2.2 Gradient Descent . . . . . 2.2.3 Newton’s Algorithm . . . 11 11 11 11 11 12 12 12 12 12 13 13 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1 Lagrange Duality and KKT Conditions 2.3.1 The Lagrangian . . . . . . . . . . 2.3.2 Primal and Dual Problems . . . 2.3.3 Strong and Weak Duality . . . . 2.3.4 Complementary Slackness . . . . 2.3.5 The KKT Conditions . . . . . . 13 13 13 14 14 14 3 Probability and Statistics 3.1 Basics . . . . . . . . . . . . . . . . . . . 3.1.1 Axioms of Probability . . . . . . 3.1.2 Permutation . . . . . . . . . . . 3.1.3 Combination . . . . . . . . . . . 3.2 Conditional Probability . . . . . . . . . 3.2.1 Bayes Rule . . . . . . . . . . . . 3.2.2 Independence . . . . . . . . . . . 3.3 Random Variables . . . . . . . . . . . . 3.3.1 Cumulative Distribution Function (CDF) . . . . . . . . . . . . 3.3.2 Probability Density Function (PDF) . . . . . . . . . . . . . . . 3.3.3 Discrete PDF/CDF . . . . . . . 3.3.4 Continuous PDF/CDF . . . . . . 3.3.5 Expectation . . . . . . . . . . . . 3.3.6 Variance and Standard Deviation 3.4 Discrete Random Variables . . . . . . . 3.4.1 Bernoulli . . . . . . . . . . . . . 3.4.2 Binomial . . . . . . . . . . . . . 3.4.3 Geometric . . . . . . . . . . . . . 3.4.4 Poisson . . . . . . . . . . . . . . 3.5 Continuous Random Variables . . . . . . 3.5.1 Uniform . . . . . . . . . . . . . . 3.5.2 Exponential . . . . . . . . . . . . 3.5.3 Gaussian (Normal) . . . . . . . . 3.6 Jointly Distributed Random Variables . 3.6.1 Marginal Density . . . . . . . . . 3.6.2 Cumulative Distribution . . . . . 3.6.3 Conditional Density . . . . . . . 3.6.4 Independence . . . . . . . . . . . 3.6.5 Expectation . . . . . . . . . . . . 3.6.6 Covariance . . . . . . . . . . . . 3.6.7 Correlation . . . . . . . . . . . . 14 14 15 15 15 15 15 16 16 16 16 16 17 17 17 18 18 18 18 19 19 19 19 20 20 20 20 20 20 21 21 21 Machine Learning Study Guide 3.7 3.8 Parameter Estimation . . . . . . . . . . 3.7.1 Definitions . . . . . . . . . . . . 3.7.2 Bias . . . . . . . . . . . . . . . . 3.7.3 Mean and Central Limit Theorem 3.7.4 Variance . . . . . . . . . . . . . . Probability Bounds and Inequalities . . 3.8.1 Markov . . . . . . . . . . . . . . 3.8.2 Chebyshev . . . . . . . . . . . . 3.8.3 Chernoff . . . . . . . . . . . . . . 3.8.4 Hoeffding . . . . . . . . . . . . . 21 21 21 21 21 21 21 21 22 22 4 Information Theory 22 5 Learning Theory 5.1 Bias and Variance . . . . . . . . . . . . 5.2 Notation . . . . . . . . . . . . . . . . . . 5.2.1 Union Bound . . . . . . . . . . . 5.2.2 Hoeffding Inequality For Bernoulli Variables . . . . . . . . 5.3 Training Error . . . . . . . . . . . . . . 5.4 Probably Approximately Correct (PAC) 5.5 Hypothesis Classes . . . . . . . . . . . . 5.5.1 Shattering . . . . . . . . . . . . . 5.5.2 Upper Bound Theorem . . . . . 5.5.3 VC Dimension . . . . . . . . . . 5.5.4 Vapnik Theorem . . . . . . . . . 24 24 24 24 6 Linear Regression 6.1 LMS Algorithm . . . . . . . . . . . . 6.2 The Normal Equations . . . . . . . . 6.3 Probabilistic Interpretation . . . . . 6.4 Locally Weighted Linear Regression . . . . . . . . 24 24 25 26 26 7 Logistic Regression 7.1 The Logistic Function . . . 7.2 Cost Function . . . . . . . . 7.3 Gradient Descent . . . . . . 7.4 Newton-Raphson Algorithm . . . . . . . . 27 27 27 28 28 8 Softmax Regression 8.1 Softmax Function . . . . . . . . . . . . . 8.2 MLE and Cost Function . . . . . . . . . 8.3 Gradient Descent . . . . . . . . . . . . . 28 28 29 29 9 Generalized Linear Models 9.1 Exponentional Family . . . . . 9.2 Assumptions of GLMs . . . . . 9.3 Examples . . . . . . . . . . . . 9.3.1 Ordinary Least Squares 9.3.2 Logistic Regression . . . 9.3.3 Softmax Regression . . 30 30 30 30 30 30 30 2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24 24 24 24 24 24 24 24 10 Perceptron 30 11 Support Vector Machines 30 12 Margin Classification 30 13 Generative Learning: nant Analysis 13.1 Assumptions . . . 13.2 Estimation . . . . 13.3 Prediction . . . . . Gaussian Discrimi. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14 Generative Learning: Naive Bayes 14.1 Assumptions . . . . . . . . . . . . 14.2 Estimation . . . . . . . . . . . . . 14.2.1 Laplace Smoothing . . . . . 14.3 Prediction . . . . . . . . . . . . . . 30 30 30 31 . . . . 31 31 31 31 32 15 Tree-based Methods 15.1 Decision Trees . . . . . . . . . . . . . . . 15.2 Random Forest . . . . . . . . . . . . . . 15.3 Boosting . . . . . . . . . . . . . . . . . . 32 32 32 32 16 K-Nearest Neighbors 16.1 Classification . . . . . . . . . . . . . . . 16.2 Regression . . . . . . . . . . . . . . . . . 32 32 32 17 K-Means Clustering 17.1 Algorithm . . . . . . . . . . . . . . . . . 17.2 Hierarchical Clustering . . . . . . . . . . 17.3 Clustering Metrics . . . . . . . . . . . . 33 33 33 33 18 Expectation-Maximization 18.1 Mixture of Gaussians . . . . . . . . . . . 18.2 Factor Analysis . . . . . . . . . . . . . . 34 34 34 19 Principal Component Analysis 19.1 Eigenvalues, Eigenvectors, and the Spectral Theorem . . . . . . . . . . . . . . . 19.2 Algorithm . . . . . . . . . . . . . . . . . 19.3 Algorithm: SVD . . . . . . . . . . . . . 34 20 Independent Component Analysis 35 21 Reinforcement Learning 21.1 Markov Decision Processes . 21.2 Policy and Value Functions 21.3 Value Iteration Algorithm . 21.4 Q-Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34 34 35 35 35 35 35 35 Machine Learning Study Guide 22 Probabilistic Graphical Models 22.1 Bayesian Networks . . . . . . . 22.1.1 Hidden Markov Models 22.2 Markov Random Fields . . . . 22.3 Conditional Random Fields . . . . . . . . . . 35 36 37 37 37 23 Deep Learning: Basics 23.1 Basics . . . . . . . . . . . . . . . . . . 23.2 Activation Functions . . . . . . . . . . 23.3 Loss Functions . . . . . . . . . . . . . 23.4 Backpropagation . . . . . . . . . . . . 23.5 Regularization Methods . . . . . . . . 23.6 Optimization Algorithms . . . . . . . 23.7 Convolutional Networks . . . . . . . . 23.8 Recurrent Networks . . . . . . . . . . 23.8.1 Elman RNN . . . . . . . . . . . 23.8.2 Long Short-Term Memory . . . 23.8.3 Gated Recurrent Unit . . . . . 23.8.4 Bidirectional RNNs . . . . . . 23.8.5 Vanishing/Exploding Gradient 23.8.6 Gradient Clipping . . . . . . . . . . . . . . . . . . . . . 37 37 37 37 37 37 37 37 37 37 37 38 38 38 38 . . . . . . . . . . . . 24 Deep Learning: Advanced 24.1 Autoencoders . . . . . . . . . . . . . . . 24.1.1 Variational Autoencoders . . . . 24.2 General Adversarial Networks . . . . . . 24.3 Encoder-Decoder Models . . . . . . . . . 24.3.1 Encoders . . . . . . . . . . . . . 24.3.2 Decoders . . . . . . . . . . . . . 24.4 Attention Models . . . . . . . . . . . . . 24.4.1 Context Vector . . . . . . . . . . 24.4.2 Concat (Bahdanau) Attention . . 24.4.3 Luong Attention Mechanisms . . 24.5 Word Embeddings . . . . . . . . . . . . 24.5.1 N-Gram Models . . . . . . . . . 24.5.2 Continuous Bag-of-Words Models (CBOW) . . . . . . . . . . . . 24.5.3 Skip-Gram Model . . . . . . . . 24.5.4 Negative Sampling . . . . . . . . 38 38 38 38 38 39 39 39 39 39 39 40 40 40 41 41 3 Machine Learning Study Guide 1 1.1 Linear Algebra and Calculus General Notation A vector x ∈ Rn has n entries, and xi ∈ R is the i-th entry: x1 x2 x = . ∈ Rn .. xn (1) We denote a matrix A ∈ Rm×n with m rows and n columns, and Aij ∈ R is the entry in the i-th row and j-th column: A11 · · · A1n .. ∈ Rm×n (2) A = ... . Am1 · · · Amn Vectors can be viewed as a n × 1 matrix. 1.1.1 Indentity Matrix The identity matrix I ∈ Rn×n is a square matrix with ones along the diagonal and zero everywhere else: 1 I= 0 .. . 0 0 .. . .. . ··· ··· .. . .. . 0 0 .. . 0 1 (3) For all matrices A ∈ Rn×n we have A × I = I × A = A. 1.1.2 Diagonal Matrix A diagonal matrix D ∈ Rn×n is a square matrix with nonzero values along the diagonal and zero everywhere else: d1 0 D= . .. 0 0 .. . .. . ··· ··· .. . .. . 0 0 .. . 0 dn (4) The diagonal matrix D is also written as diag(d1 , . . . , dn ). 1.1.3 Orthogonal Matrix Two vectors x, y ∈ Rn are orthogonal if xT y = 0. A vector x ∈ Rn is normalized if ||x||2 = 1. A square matrix U ∈ Rn×n is orthogonal if all its columns are orthogonal to each other and are normalized. Hence: UT U = I = UUT Hence the inverse of an orthogonal matrix is its transpose.m 4 (5) Machine Learning Study Guide 1.2 1.2.1 Matrix Operations Vector-Vector Products Given two vectors x, y ∈ Rn , the inner product is: xT y = n X xi yi ∈ R (6) i=1 The outer product for a vector x ∈ Rm , y ∈ Rn is: x1 y1 .. T xy = . xm y1 1.2.2 x1 yn .. m×n ∈R . xm yn ··· .. . ··· (7) Vector-Matrix Products The product of a matrix A ∈ Rm×n and vector x ∈ Rn is a expressed as: − aT1 − − aT2 − y = Ax = .. . − aTm − vector y = Ax ∈ Rm . If we write A by the rows, Ax is x = aT1 x aT2 x .. . aTm x (8) Here, the i-th entry of y is the inner product of the i-th row of A and x, yi = aTi x. If we write A is column form: | y = Ax = a1 | | a2 | ··· | an | x1 x2 .. . = a1 x1 + a2 x2 + . . . + an xn (9) xn Here, y is a linear combination of the columns of A, where the coefficients of the linear combination are given by the entries of x. 1.2.3 Matrix-Matrix Products Given a matrix A ∈ m × n and matrix B ∈ n × p, we can define C = AB as follows: C = AB = aT1 b1 aT2 b1 .. . aT1 b2 aT2 b2 .. . ··· ··· .. . aT1 bp aT2 bp .. . aTm b1 aTm b2 ··· aTm bp (10) Hence, each (i, j)-th entry of C is equal to the inner product of the i-th row of A and the j-th column of B. Compactly: n X Cij = aTi bj = aik bkj (11) k=1 5 Machine Learning Study Guide 1.2.4 The Transpose The transpose of a matrix A ∈ Rm×n is AT ∈ Rn×m matrix whose entries are: (AT )ij = Aji (12) Properties of the transpose: 1. (AT )T = A 2. (AB)T = B T AT 3. (A + B)T = AT + B T 1.2.5 The Trace The trace of a square matrix A ∈ Rn×m is denoted tr(A). It is the sum of diagonal elements in the matrix: tr A = n X Aii (13) i=1 Properties of the trace: 1. For A ∈ Rn×n , tr A = tr AT 2. For A, B ∈ Rn×n , tr(A + B) = tr A + tr B 3. For A ∈ Rn×n , t ∈ R, tr(tA) = t · tr A 4. For A, B such that AB is square, tr AB = tr BA 5. For A, B, C such that ABC is square, tr ABC = tr BCA = tr CAB, and so on 1.2.6 The Inverse The inverse of a matrix A is noted A−1 and is the unique matrix such that: A−1 A = I = AA−1 (14) Not all square matrices are invertible. In addition, assuming A, B ∈ Rn×n are non-singular: 1. (A−1 )−1 = A 2. (AB)−1 = B −1 A−1 3. (A−1 )T = (AT )−1 1.2.7 The Determinant The determinant of a square matrix A ∈ Rn×n , noted |A| or det(A) is expressed recursively in terms of A\i,\j , which is the matrix A without its i-th row and j-th column, as follows: det(A) = |A| = n X (−1)i+j Ai,j |A\i,\j | j=1 Remark that A is invertible if and only if |A| = 6 0. Also, |AB| = |A||B| and |AT | = |A|. 6 (15) Machine Learning Study Guide 1.3 1.3.1 Matrix Properties Norms A norm of a vector x is any function f : Rn → R that satisfies 4 properties: 1. Non-negativity: For all x ∈ Rn , f (x) ≥ 0 2. Definiteness: f (x) = 0 if and only if x = 0 3. Homogeneity: For all x ∈ Rn , t ∈ R, f (tx) = |t|f (x) 4. Triangle Inequality: For all x, y ∈ Rn , f (x + y) ≤ f (x) + f (y) However, most norms used come from the family of `p norms: `p = ||x||p = n X ! p1 xpi (16) i=1 The p-norm is used in Holder’s inequality. The Manhattan norm, used in LASSO regularization, is: `1 = ||x||1 = n X |xi | (17) i=1 The euclidean norm, `2 is used in ridge regularization and distance measures: v u n uX `2 = ||x||2 = t x2i (18) i=1 Finally, the infinity norm is used in uniform convergence: `∞ = ||x||∞ = max |xi | i The Frobenius norm of a matrix is analogous to the `2 norm of a vector: sX A2ij ||A||F = (19) (20) ij The dot product of two vectors can be expressed in terms of norms: xT y = ||x||2 ||y||2 cos θ (21) where θ is the angle between vectors x and y. 1.3.2 Linear Dependence and Rank A set of vectors {x1 , x2 , . . . , xn } ⊂ Rm is linearly independent if no vector can be represented as a linear combination of the remaining vectors. Conversely, if one vector in the set can be represented as a linear combination of the remaining vectors, then the vectors are said to be linearly dependent. Formally, for scalar values α1 , . . . , αn−1 ∈ R: xn = n−1 X α i xi (22) i=1 The rank of a given matrix A is noted rank(A) and is the dimension of the vector space generated by its columns. This is equivalent to the maximum number of linearly independent columns of A. If rank(A) = min(m, n), then A is said to be full rank.m 7 Machine Learning Study Guide 1.3.3 Span, Range, and Nullspace The span of a set of vectors {x1 , x2 , . . . , xn } is the set of all vectors that can be expressed as a linear combination of {x1 , x2 , . . . , xn }. Formally, ( span ({x1 , . . . xn }) = v:v= n X ) αi xi , αi ∈ R (23) i=1 The projection of a vector y ∈ Rn onto the span of {x1 , x2 , . . . , xn } is the vector v ∈ span ({x1 , . . . xn }) such that v is as close as possible to y as measured by the euclidean norm: Proj (y; {x1 , . . . xn }) = argminv∈span({x1 ,...,xn }) ky − vk2 (24) The range of a matrix A ∈ Rm×n is the span of the columns of A: R(A) = {v ∈ Rm : v = Ax, x ∈ Rn } (25) The nullspace of a matrix A ∈ Rm×n is the set of all vectors that equal 0 when multiplied by A: N (A) = {x ∈ Rn : Ax = 0} 1.3.4 (26) Symmetric Matrices A square matrix A ∈ Rn×n is symmetric if A = AT . It is anti-symmetric if A = −AT . For any matrix A ∈ Rn×n the matrix A + AT is symmetric and the matrix A − AT is anti-symmetric. From this, any square matrix A ∈ Rn×n can be represented as a sum of a symmetric matrix and an anti-symmetric matrix: A= A − AT A + AT + 2 } 2 } | {z | {z Symmetric 1.3.5 (27) Antisymmetric Positive Semidefinite Matrices Given a square matrix A ∈ Rn×n and a vector x ∈ Rn , the scalar value xT Ax is called the quadratic form: xT Ax = n X n X Aij xi xj (28) i=1 j=1 A symmetric matrix A is: 1. Positive definite (PD) if for all nonzero vectors, xT Ax > 0 2. Positive semidefinite (PSD) if for all nonzero vectors, xT Ax ≥ 0 3. Negative definite (ND) if for all nonzero vectors, xT Ax < 0 4. Negative semidefinite (NSD) if for all nonzero vectors, xT Ax ≤ 0 5. Indefinite if it is neiter PSD nor NSD, i.e. if there exists a x1 , x2 such that xT1 Ax1 > 0 and xT2 Ax2 < 0 Given any matrix A ∈ Rm×n , the gram matrix G = AT A is always PSD. If m ≥ n and A is full rank, then G is PD. 8 Machine Learning Study Guide 1.3.6 Eigendecomposition Given a square matrix A ∈ Rn×n , we say that λ ∈ C is an eigenvalue of A and v ∈ C is the corresponding eigenvector if Av = λv, v 6= 0 (29) The trace of A is equal to the sum of its eigenvalues: tr A = n X λi (30) i=1 The determinant of A is equal to the product of its eigenvalues: |A| = n Y λi (31) i=1 Suppose matrix A has n linearly independent eigenvectors {v1 , . . . , vn } with corresponding eigenvalues {λ1 , . . . , λn }. Let V be a matrix with one eigenvalue per column: V = [v1 , . . . , vn ]. Let Λ = diag(λ1 , . . . , λn ). Then the eigendecomposition of A is given by: A = V ΛV −1 (32) For when A is symmetric, then the eigenvalues of A are real, the eigenvectors of A are orthonormal, and hence V is an orthogonal matrix (henceforth renamed U ). Hence A = U ΛU T . A matrix whose eigenvalues are all positive is called positive definite. A matrix whose eigenvalues are all positive or zero valued is called positive semidefinite. Likewise, if all eigenvalues are negative, the matrix is negative definite, and if all eigenvalues are negative or zero valued, it is negative semidefinite. In addition, assuming sorted eigenvalues, for the following optimization problem: max xT Ax x∈Rn subject to kxk22 = 1 (33) The solution is v1 the eigenvector corresponding to λ1 . For the minimization problem: min xT Ax x∈Rn subject to kxk22 = 1 (34) the optimal solution for x is vn , the eigenvector corresponding to eigenvalue λn . 1.3.7 Singlular Value Decomposition The singular value decomposition provides a way to factorize a m × n matrix A into singular vectors and singular values. It is defined as: A = U DV T (35) Suppose A ∈ Rm×n matrix. Then U ∈ Rm×m , D ∈ Rm×n , V ∈ Rn×n matrices. The matricies U and V are orthogonal, and matrix D is diagonal. The elements along the diagonal of D are known as the singular values of matrix A. The columns of U are known as the left-singular vectors while the columns of V are the right-singular vectors. The left-singular vectors of A are the eigenvectors of AAT . The right-singular vectors of A are the eigenvectors of AT A. We can use SVD to partially generalize matrix inversion to nonsquare matrices. 1.3.8 The Moore-Penrose Pseudoinverse Matrix inversion is not defined for matrices that are not square. Note that for nonsingular A: AA−1 A = A (36) 9 Machine Learning Study Guide However, if the inverse is not defined, we seek to find a matrix A+ such that: AA+ A = A (37) The moore-penrose pseudoinverse A+ is defined as follows: A+ = lim (AT A + αI)−1 AT (38) α→0 Practical algorithms use the singular value decomposition of A such that: A+ = V D + U T (39) where U, D, V are from the SVD of A, and the pseudoinverse of D+ is obtained by taking the reciprocal of the nonzero diagonal elements of D. If A has more columns than rows, then using the pseudoinverse to solve a linear equation Ax = y provides one of many solutions, but provides x = A+ y with minimal euclidean norm ||x||2 . When A has more rows than columns, the pseudoinverse gives us the x for which Ax is as close as possible to y, i.e. minimizing ||Ax − y||2 . 1.4 1.4.1 Matrix Calculus The Gradient Let f : Rm×n → R be a function and A ∈ Rm×n be a matrix. The gradient of f with respect to A is a m × n matrix noted as ∇A f (A) such that: ∂f (A) ∂f (A) (A) · · · ∂f ∂A11 ∂A12 ∂A1 ∂f (A) ∂f (A) (A) · · · ∂f ∂A21 ∂A22 ∂A2n m×n (40) ∇A f (A) ∈ R = .. .. .. .. . . . . ∂f (A) ∂f (A) ∂f (A) · · · ∂A ∂Am1 ∂Am2 mn Or compactly for each ij entry: ∇A f (A)ij = ∂f (A) ∂Aij (41) However, the gradient of a vector x ∈ Rn is: ∇x f (x) = 1.4.2 ∂f (x) ∂x1 ∂f (x) ∂x2 .. . ∂f (x) ∂xn (42) The Hessian Let f : Rn → R be a function and x ∈ Rn be a vector. The hessian of f with respect to x is a n × n symmetric matrix noted as H = ∇2x f (x) such that: ∇2x f (x) 10 ∈R n×n = ∂ 2 f (x) ∂x21 ∂ 2 f (x) ∂x2 ∂x1 .. . ∂ 2 f (x) ∂xn ∂x1 ∂ 2 f (x) ∂x2 ∂ 2 f (x) ∂x22 ··· .. . ··· .. . ∂ 2 f (x) ∂xn ∂x2 ··· ∂ 2 f (x) ∂x1 ∂xn ∂ 2 f (x) ∂x2 ∂xn .. . ∂ 2 f (x) ∂x2n (43) Machine Learning Study Guide Or compactly: ∇2x f (x)ij = ∂ 2 f (x) ∂xi ∂xj (44) Note that the hessian is only defined when f (x) is real-valued. 1.4.3 Gradient Properties For matrices A, B, C and vectors x, b: 1. ∇x bT x = b 2. ∇x xT Ax = 2Ax (if A symmetric) 3. ∇2x xT Ax = 2A (if A symmetric) 4. ∇A tr(AB) = B T 5. ∇AT f (A) = (∇A f (A)) T 6. ∇A tr(ABAT C) = CAB + C T AB T 7. ∇A |A| = |A|(A−1 )T 2 2.1 2.1.1 Convex Optimization Convexity Convex Sets A set C is convex if, for any x, y ∈ C and α ∈ R with 0 ≤ α ≤ 1, we have αx + (1 − α)y ∈ C (45) This point is known as a convex combination of x and y. 2.1.2 Convex Functions A function f : Rn → R is convex if its domain D(f ) is a convex set, and if, for all x, y ∈ D(f ) and α ∈ R, 0 ≤ α ≤ 1, we have: f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y) (46) A function is strictly convex if: f (αx + (1 − α)y) < αf (x) + (1 − α)f (y) 2.1.3 (47) First-Order Conditions Suppose a function f : Rn → R is differentiable, then f is convex if and only if D(f ) is a convex set and for all x, y ∈ D(f ) we have: f (y) ≥ f (x) + ∇x f (x)T (y − x) (48) This is called the first-order approximation to the function f at point x. This is approximating f with its tangent line at point x. The first order condition for convexity says that f is convex if and only if the tangent line is a global underestimator of the function f . In other words, if we take our function and draw a tangent line at any point, then every point on this line will lie below the corresponding point on f . 11 Machine Learning Study Guide 2.1.4 Second-Order Conditions Suppose a function f : Rn → R is twice differentiable, i.e. the hessian ∇2x f (x) is defined for all points x in the domain of f . Then f is convex if and only if D(f ) is a convex set and its hessian is PSD: ∇2x f (y) 0 2.1.5 (49) Jensen’s Inequality Suppose we start with the inequality in the basic definition of a convex function: f (αx + (1 − α)y) ≤ αf (x) + (1 − α)f (y) We can extend this to convex combinations of more than one point: ! k k X X f α i xi ≤ αi f (xi ) i=1 For P k (50) (51) i=1 α = 1 and all α ≥ 0. This can be extended to integrals: Z Z f p(x)xdx ≤ p(x)f (x)dx (52) R For p(x)dx = 1 and p(x) ≥ 0∀x. This implies p(x) is a probability density, and we can write our inequality in terms of expectations: f (E[x]) ≤ E[f (x)] (53) And is known as Jensen’s inequality. 2.1.6 Sublevel Sets Given a convex function f : Rn → R and a real number α ∈ R, the α-sublevel set is: {x ∈ D(f ) : f (x) ≤ α} (54) In other words, the α-sublevel set is the set of all points x such that f (x) ≤ α. 2.2 Convex Optimization A convex optimization problem seeks to optimize a convex function f with respect to a variable x and possible constraints. minimize f (x) subject to gi (x) ≤ 0, i = 1, . . . , m (55) hi (x) = 0, i = 1, . . . , p where f is a convex function, gi are convex functions, and hi are affine functions. The optimal value p∗ is equal to the minimum possible value of the objective function in the feasible region: p? = min {f (x) : gi (x) ≤ 0, i = 1, . . . , m, hi (x) = 0, i = 1, . . . , p} (56) The optimal point x∗ is a point such that f (x∗ ) = p∗ . 2.2.1 Global Optimality A point x is locally optimal if it is feasible and if there exists some R > 0 such that all feasible points z with ||x − z||2 ≤ R, satisfy f (x) ≤ f (z). This means that for x to be locally optimal, there exists no nerby points that have a lower objective value. A point x is globally optimal if it is feadible and for all feasible points z, f (x) ≤ f (z). For a convex optimization problem, all locally optimal points are globally optimal. 12 Machine Learning Study Guide 2.2.2 Gradient Descent Using α ∈ R as the learning rate, we can update a set of parameters θ with respect to minimizing a function f as follows: θ := θ − α∇f (θ) (57) Stochastic gradient descent (SGD) is updating the parameter based on each training example, and batch gradient descent is on a batch of training examples. 2.2.3 Newton’s Algorithm Newton’s algorithm is a numerical method using information from the second derivative to find θ such that f 0 (θ) = 0. θ := θ − f 0 (θ) f 00 (θ) (58) For multidimensional parameters: θ := θ − αH −1 ∇θ f (θ) (59) Where H is the hessian matrix of second partial derivatives. Hij = 2.3 ∂ 2 f (θ) ∂θi ∂θj (60) Lagrange Duality and KKT Conditions Generic differentiable convex optimization problems are of the form: minimize subject to f (x) gi (x) ≤ 0, hi (x) = 0, i = 1, . . . , m i = 1, . . . , p (61) where x ∈ Rn is the optimization variable, f : Rn → R and gi : Rn → R are differentiable convex functions, and hi : Rn → R are affine functions. 2.3.1 The Lagrangian Given a convex constrained minimization problem, the generalized lagrangian is a function L : Rn × Rm × Rp → R defined as: p m X X L(x, α, β) = f (x) + αi gi (x) + βi hi (x) (62) i=1 i=1 We refer to x as the primal variables of the Lagrangian. The α and β are refered to as the dual variables or lagrange multipliers. The key idea behind Lagrangian duality is for any convex optimization problem, there always exist settings of the dual variables such that the unconstrained minimum of the Lagrangian with respect to the primal variables (keeping the dual variables fixed) coincides with the solution of the original constrained minimization problem. 2.3.2 Primal and Dual Problems The primal problem is: min x L(x, α, β) = min θP (x) x α,β:αi ≥0,∀i | {z } max (63) θP(x) 13 Machine Learning Study Guide Here, θP : Rn → R is the primal objective, and the right hand side is the primal problem. In addition, p∗ = θP (x∗ ) is the optimal value of the primal objective. The dual problem is: h i max θD (x) (64) max min L(x, α, β) = x α,β:αi ≥0,∀i {z } α,β:αi ≥0,∀i | θD(x) Here, θD : Rm × Rp → R is the dual objective, and the right hand side is the dual problem. In addition, s∗ = θD (α∗ , β ∗ ) is the optimal value of the dual objective. 2.3.3 Strong and Weak Duality Weak duality for any pair of primal and dual problems, with d∗ the optimal objective value of the dual problem and p∗ as the optimal objective value of the primal problem we have: d∗ ≤ p ∗ (65) Strong duality is when for any pair of primal and dual problems which satisfy certain technical conditions called constraint qualifications, then: d∗ = p ∗ (66) 2.3.4 Complementary Slackness If strong duality holds, then αi∗ g(x∗i ) = 0 for each i = 1, . . . , m. This property is known as complementary slackness. This implies the following: αi∗ > 0 =⇒ gi (x∗ ) = 0 (67) gi (x∗ ) < 0 2.3.5 =⇒ αi∗ = 0 (68) The KKT Conditions Suppose that x∗ ∈ Rn , α∗ ∈ Rm and β ∗ ∈ Rp satisfy the following conditions: 1. Primal feasibility: gi (x∗ ) ≤ 0, for i = 1, . . . , m and hi (x∗ ) = 0 for i = 1, . . . , p 2. Dual feasibility: αi∗ ≥ 0 for i = 1, . . . , m 3. Complementary slackness: αi∗ gi (x∗ ) = 0 for i = 1, . . . , m 4. Lagrangian stationarity: ∇x L(x∗ , α∗ , β ∗ ) = ~0 Then x∗ is primal optimal and (α∗ , β ∗ ) are dual optimal. Furthermore, if strong duality holds, then any primal optimal x∗ and dual optimal (α∗ , β ∗ ) must satisfy the conditions 1 through 4. These conditions are known as the Karush-Kuhn-Tucker (KKT) conditions. 3 3.1 Probability and Statistics Basics The set of all possible outcomes of an experiment is known as the sample space and denoted by S. Any subset E of the sample space is known as an event. An event is a set consisting of possible outcomes of the experiment. 14 Machine Learning Study Guide 3.1.1 Axioms of Probability For each event E, we denote P (E) as the probability of event E occuring. P (E) satisfies the following properties: 1. Every probability is between 0 and 1 included: 0 ≤ P (E) ≤ 1 (69) 2. The probability that at least one event in the sample space will occur is 1: P (S) = 1 (70) 3. For any sequence of mutually exclusive events E1 , . . . , En we have: ! n n [ X P Ei = P (Ei ) i=1 3.1.2 (71) i=1 Permutation A permutation is an arrangement of r objects from a pool of n objects, in a given order. The number of such arrangements is given by P (n, r): n! (72) P (n, r) = (n − r)! 3.1.3 Combination A combination is an arrangement of r objects from a pool of n objects, where order does not matter. The number of such arrangements is given by C(n, r): C(n, r) = n! P (n, r) = r! r!(n − r)! (73) Note that for 0 ≤ r ≤ n we have P (n, r) ≥ C(n, r). 3.2 Conditional Probability Let B be an event with non-zero probability, the conditional proability of any event A given B is: P (A ∩ B) P (B) (74) P (B|A)P (A) P (B) (75) P (A|B) = 3.2.1 Bayes Rule For events A, B such that P (B) > 0, we have: P (A|B) = From Bayes rule, we have: P (A ∩ B) = P (A|B)P (B) = P (A)P (B|A) (76) Let {Ai , i ∈ [1, n]} be such that for all i, Ai 6= ∅. We say that {Ai } is a partition if we have: ∀i 6= j, Ai ∩ Aj = ∅ and n [ Ai = S (77) i=1 15 Machine Learning Study Guide Remark that for any event B in the sample space, we have: P (B) = n X P (B|Ai )P (Ai ) (78) i=1 Let {Ai , i ∈ [1, n]} be a partition of the sample space, we can extend bayes rule as: P (Ak |B) = P (B|Ak )P (Ak ) n X P (B|Ai )P (Ai ) (79) i=1 3.2.2 Independence Two events A and B are independent if and only if we have: P (A ∩ B) = P (A)P (B) 3.3 (80) Random Variables A random variable X is a function that maps every element in a sample space to a real line. 3.3.1 Cumulative Distribution Function (CDF) The cumulative distribution function F , which is monotonically non-decreasing and is such that limx→−∞ F (X) = 0 and limx→∞ F (X) = 1 is defined as: F (x) = P (X ≤ x) (81) In addition: P (a < X ≤ b) = F (b) − F (a) 3.3.2 (82) Probability Density Function (PDF) The probability density function is the derivative of the CDF. It has the following properties: 1. f (x) ≥ 0 R∞ 2. −∞ f (x) = 1 R 3. x∈A f (x)dx = P (X ∈ A) 3.3.3 Discrete PDF/CDF If X is discrete, by denoting f as the PDF and F as the CDF, we have: X F (X) = P (X = xi ) (83) xi ≤x f (xj ) = P (X = xj ) (84) 0 ≤ f (xj ) ≤ 1 (85) X (86) And the following properties for the PDF: j 16 f (xj ) = 1 Machine Learning Study Guide 3.3.4 Continuous PDF/CDF If X is continuous, by denoting f as the PDF and F as the CDF, we have: Z x f (y)dy F (X) = (87) −∞ dF dx (88) f (x) ≥ 0 (89) f (x) = And the following properties for the PDF: Z ∞ f (x)dx = 1 (90) −∞ 3.3.5 Expectation The expected value of a random variable, also known as the mean value or first moment, is denoted as E[X] or µ. It is the value obtained by averaging the results of a random variable. We use (D) for discrete, (C) for continuous. Z +∞ n X (D) E[X] = xi f (xi ) and (C) E[X] = xf (x)dx (91) −∞ i=1 The expected value of a function of a random variable g(X) is: (D) E[g(X)] = n X Z g(xi )f (xi ) i=1 and +∞ (C) E[g(X)] = g(x)f (x)dx (92) −∞ The k-th moment, noted E[X k ] is the value of X k that we expect to observe on average on infinitely many trials. The k-th moment is a case of the previous definition with g : X 7→ X k . Z +∞ n X (D) E[X k ] = xki f (xi ) and (C) E[X k ] = xk f (x)dx (93) i=1 −∞ A characteristic function ψ(ω) is derived from a probability density function f (x) and is defined as: Z +∞ n X iωxi (D) ψ(ω) = f (xi )e and (C) ψ(ω) = f (x)eiωx dx i=1 (94) −∞ Remark that eiωx = cos(ωx) + i sin(ωx). The k-th moment can also be computed with the characteristic function as: 1 ∂kψ (95) E[X k ] = k i ∂ω k ω=0 3.3.6 Variance and Standard Deviation The variance of a random variable, often noted Var(X) or σ 2 , is a measure of the spread of its distribution function. It is determined as follows: Var[X] = E[(X − E[X])2 ] = E[X 2 ] − E[X]2 (96) The standard deviation of a random variable, often noted σ, is a measure of the spread of its distribution function which is compatible with the units of the actual random variable. It is determined as follows: p σ = Var[X] (97) Note that the variance for any constant a is Var[a] = 0, and Var[af (X)] = a2 Var[f (X)]. 17 Machine Learning Study Guide 3.4 3.4.1 Discrete Random Variables Bernoulli For X ∼ Bernoulli(p), we define a binary event with a probability of p for a true event, and a false event with probability of q = 1 − p. ( q = 1 − p if x = 0 P (X = x) = (98) p if x = 1 It can also be expressed as: f (x; p) = pk (1 − p)1−k for k ∈ {0, 1} (99) Other properties: 3.4.2 E[X] = p (100) Var[X] = pq = p(1 − p) (101) ψ(ω) = (1 − p) + peiω (102) Binomial For X ∼ Binomial(n, p), the number of true events in n independent experiments, with true probability of p, false probability of q = 1 − p. n x P (X = x) = p (1 − p)n−x (103) x Other properties: 3.4.3 E[X] = np (104) Var[X] = npq = np(1 − p) (105) ψ(ω) = (peiω + (1 − p))n (106) Geometric For X ∼ Geometric(p), is the number of experiments with true probability of p until the first true event (number of trials to get one success). P (X = x) = p(1 − p)x−1 (107) Other properties: E[X] = Var[X] = ψ(ω) = 18 1 p 1−p p2 peiω 1 − (1 − p)eiω (108) (109) (110) Machine Learning Study Guide 3.4.4 Poisson For X ∼ P oisson(λ), for λ > 0, a probability distribution over the nonnegative integers used for modeling the frequency of rare events. λx −λ P (X = x) = e (111) x! Other properties: E[X] = λ (112) Var[X] = λ (113) ψ(ω) = eλ(e 3.5 3.5.1 iω −1) (114) Continuous Random Variables Uniform For X ∼ U nif orm(a, b), we have equal probability density to every value between a and b. ( f (x) = 1 b−a if a ≤ x ≤ b otherwise (115) a+b 2 (116) (b − a)2 12 (117) eiωb − eiωa (b − a)iω (118) 0 Other properties: E[X] = Var[X] = ψ(ω) = 3.5.2 Exponential For X ∼ Exponential(λ), λ > 0, is the decaying probability density over the nonnegative reals. ( λe−λx f (x) = 0 if x ≥ 0 otherwise (119) Other properties: E[X] = Var[X] = ψ(ω) = 1 λ (120) 1 λ2 (121) 1 1 − iω λ (122) 19 Machine Learning Study Guide 3.5.3 Gaussian (Normal) For X ∼ N ormal(µ, σ), denoted also X ∼ N (µ, σ). f (x) = √ 1 x−µ 2 1 e− 2 ( σ ) 2πσ (123) Other properties: E[X] = µ (124) Var[X] = σ 2 (125) 1 ψ(ω) = eiωµ− 2 ω 3.6 2 σ2 (126) Jointly Distributed Random Variables The joint probability distribution of two random variables X and Y , denoted as fXY is defined as: (D) (C) fXY (xi , yj ) = P (X = xi and Y = yj ) (127) fXY (x, y)∆x∆y = P (x 6 X 6 x + ∆x and y 6 Y 6 y + ∆y) (128) Again, denote (D) as the discrete case, and (C) as the continuous case. 3.6.1 Marginal Density The marginal density for a random variable X is: (D) fX (xi ) = X Z fXY (xi , yj ) and fXY (x, y)dy (129) −∞ j 3.6.2 +∞ (C) fX (x) = Cumulative Distribution The cumulative distribution FXY is: (D) FXY (x, y) = X X Z fXY (xi , yj ) and −∞ xi 6x yj 6y 3.6.3 x Z y (C) FXY (x, y) = fXY (x0 , y 0 )dx0 dy 0 (130) −∞ Conditional Density The conditional density of X with respect to Y , denoted fX|Y is defined as: fX|Y = 3.6.4 fXY (x, y) fY (y) (131) Independence Two random variables X and Y are independent if: fXY (x, y) = fX (x)fY (y) 20 (132) Machine Learning Study Guide 3.6.5 Expectation Given two random variables X, Y and g : R2 → R is a function of these two variables. Then the expected value of g is: Z ∞Z ∞ XX g(xi , yi )f (xi , yi )dydx (133) (D) E[g(X, Y )] = g(xi , yi )f (xi , yi ) and (C) E[g(X, Y )] = i 3.6.6 −∞ j −∞ Covariance 2 The covariance of two random variables X and Y , denoted σXY or as Cov[X, Y ] is defined as: 2 Cov[X, Y ] , σXY = E[(X − µX )(Y − µY )] = E[XY ] − µX µY (134) Where µX = E[X] and µY = E[Y ] respectively. If X and Y are independent, the covariance is 0. Var[X + Y ] = Var[X] + Var[Y ] + 2Cov[X, Y ] 3.6.7 (135) Correlation By noting σX , σY as the standard deviations of X and Y , we define the correlation between the random variables X and Y as ρXY : σ2 ρXY = XY (136) σX σY Correlation for X, Y is ρXY ∈ [−1, 1]. If X, Y independent, then ρXY = 0 3.7 Parameter Estimation 3.7.1 Definitions 3.7.2 Bias 3.7.3 Mean and Central Limit Theorem 3.7.4 Variance 3.8 Probability Bounds and Inequalities This section looks at various bounds that define how likely a random variable is to be close to its expectation. 3.8.1 Markov Let X ≥ 0 be a non-negative random variable. Then for all a ≥ 0: P (X ≥ a) ≤ 3.8.2 E[X] a (137) Chebyshev Let X be any random variable with finite expected value µ = E[X] and finite non-zero variance σ 2 . Then for all k > 0: 1 P (|X − E[X]| ≥ kσ) ≤ 2 (138) k 21 Machine Learning Study Guide 3.8.3 Chernoff Recall that the moment generating function for a random variable X is: MX (λ) := E[eλX ] (139) Then the chernoff bound for a random variable X, obtained by applying the markov inequality to eλX , for every λ > 0: E[eλX ] (140) P (X ≥ a) = P (eλX ≥ eλa ) ≤ eλa For the multiplicative chernoff bound, suppose X1 , . . . , Xn are independent random variables taking values in {0, 1}. Let X denote the sum, µ = E[X] denote the sum’s expected value. Then for any 0 < δ < 1, µ 2 eδ − δ 3µ P (X > (1 + δ)µ) < ≤ e (141) (1 + δ)(1+δ) P (X < (1 − δ)µ) < e−δ (1 − δ)(1−δ) µ ≤ e− δ2 µ 2 (142) For δ ≥ 0, δ2 µ P (X ≥ (1 + δ)µ) ≤ e− 2+δ P (X ≤ (1 − δ)µ) ≤ e− 3.8.4 (143) δ2 µ 2 (144) Hoeffding Let X1 , . . . , Xn be independent random variables such that ai ≤ Xi ≤ bi . The sum of these variables Sn = then the Hoeffding inequality is: 2t2 P P (Sn − E [Sn ] ≥ t) ≤ exp − 2 i (bi − ai ) 2t2 P (|Sn − E [Sn ]| ≥ t) ≤ 2 exp − P 2 i (bi − ai ) P i Xi , (145) (146) The hoeffding lemma states that for a real-valued random variable X with expected value E[X] = 0 and such that a ≤ X ≤ b, then for all λ ∈ R we have, 2 λ (b − a)2 λX E[e ] ≤ exp (147) 8 4 Information Theory Information Theory revolves around quantifying how much information is present in a signal. The basic intuition lies in the fact that learning an unlikely event has occured is more informative than learning that a likely event has occured. The basics are: 1. Likely events should have low information content, and in the extreme case, events that are guaranteed to happen should have no information content whatsoever. 2. Less likely events should have higher information content. 3. Independent events should have additive information. 22 Machine Learning Study Guide We satisfy all three properties by defining self-information of an event x for a probability distribution P as: I(x) = − log P (x) (148) We can quantify the amount of uncertainty in a distribution using Shannon entropy: H(P ) = Ex∼P [I(x)] = −Ex∼P [log P (x)] (149) Which in the discrete setting is written as: H(P ) = − X P (x) log P (x) (150) x In other words, the Shannon entropy of a distribution is the expected amount of information in an event drawn from that distribution. It gives a lower bound on the number of bits needed on average to encode symbols drawn from a distribution P . If we have two separate probability distributions P (x) and Q(x) over the same random variable x, we can measure how different these two distributions are using the Kullback-Leibler (KL) divergence: DKL (P kQ) = Ex∼P P (x) log Q(x) = Ex∼P [log P (x) − log Q(x)] X = P (x) x (151) log P (x) log Q(x) In the case of discrete variables, it is the extra amount of information needed to send a message containing symbols drawn from probability distribution P , when we use a code that was designed to minimize the length of messages drawn from probability distribution Q. The KL divergence is always non-negative, and is 0 if and only if P and Q are the same. We can relate the KL divergence to cross-entropy. H(P, Q) = H(P ) + DKL (P kQ) = − Ex∼P [log Q(x)] = − X (152) P (x) log Q(x) x Minimizing the cross-entropy with respect to Q is equivalent to minimizing the KL divergence, because Q does not participate in the omitted term (entropy is constant). 23 Machine Learning Study Guide 5 Learning Theory 5.1 Bias and Variance 5.2 Notation 5.2.1 Union Bound 5.2.2 Hoeffding Inequality For Bernoulli Variables 5.3 Training Error For a given classifier h, we define the training error b (h), also known as the empirical risk or empirical error, to be: m b (h) = 5.4 1 X 1 (i) (i) m i=1 {h(x )6=y } (153) Probably Approximately Correct (PAC) PAC learning is a framework with the following set of assumptions: 5.5 Hypothesis Classes 5.5.1 Shattering 5.5.2 Upper Bound Theorem 5.5.3 VC Dimension 5.5.4 Vapnik Theorem 6 Linear Regression Linear Regression seeks to approximate a real valued label y as a linear function of x: hθ (x) = θ0 + θ1 · x1 + · · · + θn · xn (154) The θi ’s are the parameters, or weights. If we include the intercept term via x0 = 1, we can write our model more compactly as: n X h(x) = θ i · xi = θ T x (155) i=0 Here n is the number of input variables, or features. In Linear Regression, we seek to make h(x) as close to y for a set of training examples. We define the cost function as: m J(θ) = 6.1 2 1 X (i) h x − y (i) 2 i=1 (156) LMS Algorithm We seek to find a set of θ such that we minimize J(θ) via a search algorithm that starts at some initial guess for our parameters and takes incremental steps to make J(θ) smaller until convergence. This is know as gradient descent: θj := θj − α 24 ∂ J(θ) ∂θj (157) Machine Learning Study Guide Here, α is the learning rate. We can derive the partial derivative as: ∂ J(θ) ∂θj = = = = ∂ 1 2 (h(x) − y) ∂θj 2 1 ∂ 2 · (h(x) − y) · (h(x) − y) 2 ∂θj ! n X ∂ θi xi − y (h(x) − y) · ∂θj i=0 (158) (h(x) − y) xj Hence, for a single example (stochastic gradient descent): (i) θj := θj + α y (i) − h x(i) xj (159) This is called the LMS update rule. For a batched version, we can evaluate the gradient on a set of examples (batch gradient descent), or the full set (gradient descent). θj := θj + α m X (i) y (i) − h x(i) xj (160) i=1 6.2 The Normal Equations We can also directly minimize J without using an iterative algorithm. We define X as the matrix of all samples of size m by n. We let ~y be a m dimensional vector of all target values. We can define our cost function J as: m J(θ) = 2 1 1 X (i) (Xθ − ~y )T (Xθ − ~y ) = h x − y (i) 2 2 i=1 (161) We then take the derivative and find its roots. ∇θ J(θ) 1 = ∇θ (Xθ − ~y )T (Xθ − ~y ) 2 1 = ∇θ θT X T Xθ − θT X T ~y − ~y T Xθ + ~y T ~y 2 1 = ∇θ tr θT X T Xθ − 2 tr ~y T Xθ 2 1 = X T Xθ + X T Xθ − 2X T ~y 2 = X T Xθ − X T ~y (162) To minimize J, we set its derivatives to zero, and obtain the normal equations: X T Xθ = X T ~y (163) Which solves θ for a value that minimizes J(θ) in closed form: θ = XT X −1 X T ~y (164) 25 Machine Learning Study Guide 6.3 Probabilistic Interpretation Why does linear regression use the least-squares cost function? Assume that the target variables and inputs are related via: y (i) = θT x(i) + (i) (165) Here, (i) is an error term for noise. We assume each (i) is independently and identically distributed according to a Gaussian distribution with mean zero and some variance σ 2 . Hence, (i) ∼ N 0, σ 2 , so the density for any sample x(i) with label y (i) is y (i) |x(i) ; θ ∼ N θT x(i) , σ 2 . This implies: ! (i) T (i) 2 y − θ x 1 exp − (166) p y (i) |x(i) ; θ = √ 2σ 2 2πσ The probability of a dataset X is quantified by a likelihood function: L(θ) = L(θ; X, ~y ) = p(~y |X; θ) (167) Since we assume independence on each noise term (and samples), we can write the likelihood function as: m Y L(θ) = p y (i) |x(i) ; θ i=1 m Y y (i) − θT x(i) 1 √ = exp − 2σ 2 2πσ i=1 2 ! (168) To get the best choice of parameters θ, we perform maximum likelihood estimation such that L(θ) is maximized. Usually we take the negative log and minimize: `(θ) = − log L(θ) m Y y (i) − θT x(i) 1 √ = − log exp − 2σ 2 2πσ i=1 2 ! 2 ! y (i) − θT x(i) 1 log √ =− exp − 2σ 2 2πσ i=1 m 2 1 1 X (i) 1 + 2· = −m log √ y − θT x(i) 2πσ σ 2 i=1 m X (169) Hence, maximizing L(θ) is the same as minimizing the negative log likelihood `(θ), which for linear regression is the least squares cost function: m 2 1 X (i) y − θT x(i) (170) 2 i=1 Under the previous probabilistic assumptions on the data, least-squares regression corresponds to finding the maximum likelihood estimate of θ. This is thus one set of assumptions under which least-squares regression can be justified as performing maximum likelihood estimation. Note that θ is independent of σ 2 . 6.4 Locally Weighted Linear Regression Locally Weighted Regression, also known as LWR, is a variant of linear regression that weights each training example in its cost function by w(i) (x), which is defined with parameter τ ∈ R as: (x(i) − x)2 (i) (171) w (x) = exp − 2τ 2 Hence, in LWR, we do the following: 26 Machine Learning Study Guide 1. Fit θ to minimize P i w(i) y (i) − θT x(i) 2 2. Output θT x This is a non-parametric algorithm, where non-parametric refers to the fact that the amount of information we need to represent the hypothesis h grows linearly with the size of the training set. 7 Logistic Regression We can extend this learning to classification problems, where we have binary labels y that are either 0 or 1. 7.1 The Logistic Function For logistic regression, our new hypothesis for estimating the class of a sample x is: h(x) = g θT x = 1 1 + e−θT x (172) where g(z) is the logistic or sigmoid function: g(z) = 1 1 + e−z (173) The sigmoid function is bounded between 0 and 1, and tends towards 1 as z → ∞. It tends towards 0 when z → −∞. A useful property of the sigmoid function is the form of its derivative: g 0 (z) d 1 dz 1 + e−z 1 = = 2 e−z ) 7.2 (1 + 1 1 · 1 − (1 + e−z ) (1 + e−z ) = = e−z (174) g(z)(1 − g(z)) Cost Function To fit θ for a set of training examples, we assume that: P (y = 1|x; θ) = h(x) (175) P (y = 0|x; θ) = 1 − h(x) This can be written more compactly as: y 1−y p(y|x; θ) = (h(x)) (1 − h(x)) (176) Assume m training examples generated independently, we define the likelihood function of the parameters as: L(θ) = = = p (~y |X; θ) m Y p y (i) |x(i) ; θ i=1 m Y (177) h x(i) y(i) 1 − h x(i) 1−y(i) i=1 27 Machine Learning Study Guide And taking the negative log likelihood to minimize: `(θ) = = − log L(θ) m X − y (i) log h x(i) + 1 − y (i) log 1 − h x(i) (178) i=1 = − m X y (i) θT x(i) − log 1 + eθ T (i) x i=1 This is known as the binary cross-entropy loss function. 7.3 Gradient Descent Lets start by working with just one training example (x,y), and take derivatives to derive the stochastic gradient ascent rule: ∂ 1 ∂ 1 − (1 − y) `(θ) = − y g θT x T T ∂θj g (θ x) 1 − g (θ x) ∂θj ∂ T 1 1 − (1 − y) g θT x 1 − g θT x = − y θ x (179) T T g (θ x) 1 − g (θ x) ∂θj = − y 1 − g θT x − (1 − y)g θT x xj = − (y − h(x)) xj This therefore gives us the stochastic gradient ascent rule: (i) θj := θj + α y (i) − h x(i) xj (180) We must use gradient descent for logistic regression since there is no closed form solution for this problem. 7.4 Newton-Raphson Algorithm The Hessian matrix for logistic regression is: m X ∂ 2 `(θ) =− x(i) x(i)T · h x(i) · 1 − h x(i) ∂θ∂θ i=1 (181) Hence the second order update is: θ := θ − 8 ∂ 2 `(θ) ∂θ∂θ −1 ∂`(θ) ∂θ (182) Softmax Regression A softmax regression, also called a multiclass logistic regression, is used to generalize logistic regression when there are more than 2 outcome classes. 8.1 Softmax Function The softmax function creates a probability distribution over a set of k classes for a training example x, with θk denoting the set of parameters to be optimzed for the k-th class. exp θkT x p(y = k|x; θ) = P (183) T j exp θj x 28 Machine Learning Study Guide 8.2 MLE and Cost Function We can write the maximum likelihood function for softmax regression as: L(θ) = m Y Y p(y = k|x; θ)1{yi =k} (184) i=1 k Where 1{yi = k} is the indicator function which is 1 if its argument is true, 0 otherwise. By taking the negative log likelihood: `(θ) = − log L(θ) m Y Y = − log p(y = k|x; θ)1{yi =k} i=1 k = − m X X i=1 = 1{yi = k} · θkT xi − log X exp θjT xi (185) j k m X X −θyTi xi + log exp θjT xi i=1 j This is known as the cross-entropy loss function. 8.3 Gradient Descent To perform gradient descent, we must take the derivative of our cost function, but it is important to note that the derivative for the correct class is different than the other classes. !! m X X T T ∇θj `(θ) = ∇θj −θyi xi + log exp θk xi i=1 = = = m X i=1 m X k !! ∇θ j −θyTi xi + ∇θ j 1{yi = j} · (−xi ) + i=1 m X i=1 log X exp k exp(θjT xi ) P T k exp(θk xi ) θkT xi (186) · xi ! exp(θjT xi ) P − 1{yi = j} · xi T k exp(θk xi ) And our update equation for the j-th parameter weights is: θj := θj − α ! exp(θjT xi ) P − 1{yi = j} · xi T k exp(θk xi ) Note that since each class has a set of weights, our gradient is a matrix known each with n feature weights. ∂`(θ) · · · ∂`(θ) ∂θk1 h i ∂θ11 . .. ∂`(θ) ∂`(θ) . . . = . Jθ = · · · ∂θk . . ∂θ1 ∂`(θ) ∂`(θ) · · · ∂θkn ∂θ1n (187) as the jacobian J, with k classes (188) 29 Machine Learning Study Guide 9 Generalized Linear Models 9.1 Exponentional Family 9.2 Assumptions of GLMs 9.3 Examples 9.3.1 Ordinary Least Squares 9.3.2 Logistic Regression 9.3.3 Softmax Regression 10 Perceptron 11 Support Vector Machines 12 Margin Classification 13 Generative Learning: Gaussian Discriminant Analysis A generative model learns who the data is generated by estimating P (x|y) which is then used to estimate P (y|x) via Bayes rule. 13.1 Assumptions Gaussian Discriminant Analysis assumes the following: 1. y ∼ Bernoulli(φ) 2. x|y = 0 ∼ N (µ0 , Σ) 3. x|y = 1 ∼ N (µ1 , Σ) Writing out the distributions be have: P (y) = φy · (1 − φ)1−y P (x|y = k) = 13.2 1 (2π)n/2 |Σ|1/2 (189) 1 T −1 exp − (x − µk ) Σ (x − µk ) 2 (190) Estimation The log-likelihood of the data is given by: `(φ, µ0 , µ1 , Σ) = = log log m Y i=1 m Y P (x(i) , y (i) ; φ, µ0 , µ1 , Σ) (191) (i) (i) (i) P (x |y ; φ, µ0 , µ1 , Σ) · P (y ; φ) i=1 By maximizing ` with respect to the parameters, we get the following MLE of the parameters: m φ= 30 1 X 1 (i) m i=1 {y =1} (192) Machine Learning Study Guide Pm (i) i=1 1{y (i) =k} x µk = P m i=1 1{y (i) =k} (193) m Σ= 13.3 1 X (i) (x − µy(i) )(x(i) − µy(i) )T m i=1 (194) Prediction To classify a point x, we find the class y = k which maximizes the probability: Y ŷ = arg max P (y = k) · P (xj |y = k) k 14 14.1 (195) j Generative Learning: Naive Bayes Assumptions The naive bayes model supposes that the features of each data point are all independent: P (x|y) = P (x1 , x2 , ..., xn |y) = P (x1 |y)P (x2 |y)...P (xn |y) = n Y P (xi |y) (196) i=1 14.2 Estimation We can write the likelihood of the data as: L(φy , φj=l|y=k ) = m Y P (x(i) , y (i) ) (197) i=1 with classes denoted by k and features xj = l (xj takes on value l). By maximizing the estimates, we get: m φy = P (y = k) = 1 X 1 (i) m i=1 {y =k} (198) Pm φj=l|y=k = P (xj = l|y = k) = 14.2.1 1{y(i) =k ∧ x(i) =l} j Pm 1 (i) i=1 {y =k} i=1 (199) Laplace Smoothing Laplace smoothing allows for unseen data to have a probability, and not automatically destory the prediciton process by setting all classes to 0. We can replace our feature estimates with the following: Pm 1{x(i) =j} + 1 φj = i=1 (200) m+k Or more generically: Pm φj=l|y=k = i=1 1{x(i) =l∧y (i) =k} + 1 Pm j i=1 1{y (i) =k} + |K| (201) where |K| is the number of classes 31 Machine Learning Study Guide 14.3 Prediction To classify a point x, we find the class y = k which maximizes the probability: ŷ = arg max P (y = k) · k 15 15.1 n Y P (xj |y = k) (202) i=j Tree-based Methods Decision Trees Decision Trees are created via the Classification and Regression Trees (CART) training algorithm. They can be represent as binary trees where at each node the partition the data space according to a threshold to minimize either gini impurity or entropy between the two child nodes. 15.2 Random Forest Random forests are a tree-based technique that uses a high number of decision trees built out of randomly selected sets of features. Contrary to the simple decision tree, it is highly uninterpretable but its generally good performance makes it a popular algorithm. It averages the decisions across several trees to combat overfitting. Random forests are a type of ensemble methods. 15.3 Boosting The idea of boosting methods is to combine several weak learners to form a stronger one. The main ones are summed up below: 1. Adaptive boosting: Known as adaboost, it places high weights on errors to improve at the next boosting step. 2. Gradient boosting: Weak learners are trained on the remaining errors. Boosting requires training several models to improve upon previous ones. 16 K-Nearest Neighbors The k-nearest neighbors algorithm, commonly known as k-NN, is a non-parametric approach where the response of a data point is determined by the nature of its k neighbors from the training set. It can be used in both classification and regression settings. The higher the parameter k, the higher the bias, and the lower the parameter k, the higher the variance. 16.1 Classification In k-NN classification, the output is a class membership. An object is classified by a majority vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor. 16.2 Regression In k-NN regression, the output is the property value for the object. This value is the average of the values of its k nearest neighbors. 1 X yi (203) y= k xi ∈Nk (x) where Nk (x) is the k nearest points around x. 32 Machine Learning Study Guide 17 K-Means Clustering CLustering seeks to group similiar points of data together in a cluster. We denote c(i) as the cluster for data point i and µj as the center for cluster j. We denote k as the number of clusters and n as the dimension of our data. 17.1 Algorithm After randomly initializing the cluster centroids µ1 , µ2 , . . . , µk ∈ Rn , repeat until convergence: 1. For every data point i: c(i) = arg min||x(i) − µj ||2 (204) j 2. For each cluster j: m X µj = 1{c(i) =j} x(i) i=1 m X (205) 1{c(i) =j} i=1 The first step is known as cluster assignment, and the second updates the cluster center (i.e. the average of all points in the cluster). In order to see if it converges, use the distortion function: J(c, µ) = m X ||x(i) − µc(i) ||2 (206) i=1 The distortion function J is non-convex, and coordinate descent of J is not guaranteed to converge to the global minimum (i.e. susceptible to local optima). 17.2 Hierarchical Clustering Hierarchical clustering is a clustering algorithm with an agglomerative hierarchical approach that builds nested clusters in a successive manner. The types are: 1. Ward Linkage: minimize within cluster distance 2. Average Linkage: minimize average distance between cluster pairs 3. Complete Linkage: minimize maximum distance between cluster pairs 17.3 Clustering Metrics In an unsupervised learning setting, it is often hard to assess the performance of a model since we don’t have the ground truth labels as was the case in the supervised learning setting. Silhouette coefficient By noting a and b the mean distance between a sample and all other points in the same class, and between a sample and all other points in the next nearest cluster, the silhouette coefficient s for a single sample is defined as follows: b−a (207) s= max(a, b) 33 Machine Learning Study Guide Calinskli-Harabaz Index By noting k the number of clusters, Bk and Wk the between and within-clustering dispersion matricies defined as: k X Bk = nc(i) (µc(i) − µ)(µc(i) − µ)T (208) j=1 Wk = m X (x(i) − µc(i) )(x(i) − µc(i) )T (209) i=1 the Calinksli-Harabaz index s(k) indicated how well a clustering model defines its clusters, such that higher scores indicate more dense and well separated cluster assignments. It is defined as: s(k) = 18 N −k Tr(Bk ) × Tr(Wk ) k−1 (210) Expectation-Maximization 18.1 Mixture of Gaussians 18.2 Factor Analysis 19 Principal Component Analysis Principal Component Analysis is a dimension reduction technique that finds the variance maximizing the directions onto which to project the data. 19.1 Eigenvalues, Eigenvectors, and the Spectral Theorem Recall that for a given matrix A ∈ Rn×n , λ is said to be an eigenvalue of A if there exists a vector z ∈ Rn \{0}, called an eigenvector, such that: Az = λz (211) The spectral theorem states that given matrix A ∈ Rn×n , if A is symmetric then A is diagonalizable by a real orthogonal matrix U ∈ Rn×n . Note Λ = diag(λ1 , . . . , λn ). Then ∃ Λ such that: A = U ΛU T (212) Note that the eigenvector associated with the largest eigenvalue is called the principal eigenvector of matrix A. 19.2 Algorithm The PCA procedure projects the data onto k dimensions by maximizing the variance of the data as follows: 1. Normalize the data to have mean 0, standard deviation 1: x(i) ← x(i) − µ σ (213) where µ and σ 2 are: m µ= 1 X (i) x m i=1 (214) m σ2 = 34 1 X (i) (x − µ)2 m i=1 (215) Machine Learning Study Guide 2. Compute covariance matrix Σ, which is symmetric with real eigenvalues. m Σ= 1 X (i) (i)T x x ∈ Rn×n m i=1 (216) 3. Compute u1 , . . . , uk ∈ Rn the k orthogonal principal eigenvectors of Σ, i.e. the orthogonal eigenvectors of the k largest eigenvalues. 4. Project the data on spanR (u1 , ..., uk ) to create a vector y (i) from point x(i) : T (i) u1 x uT2 x(i) y (i) = U T x(i) = ∈ Rk .. . (217) uTk x(i) This procedure maximizes the variance among all k-dimensional spaces. 19.3 Algorithm: SVD Eigenvalue decomposition is deifned for square matricies, and using the SVD of a data matrix X is often used in practice. 1. Zero-mean the data to have mean 0: x(i) ← x(i) − µ (218) where µ is: m µ= 1 X (i) x m i=1 (219) 2. Computed the singular value decomposition of Xµ , our zero-meaned data matrix: Xµ = U SV (220) 3. Take the first k columns of U as our transform matrix, denoted Uk . 4. Project the data to create a matrix Y from data matrix Xµ : Y = XUk 20 Independent Component Analysis 21 Reinforcement Learning 21.1 Markov Decision Processes 21.2 Policy and Value Functions 21.3 Value Iteration Algorithm 21.4 Q-Learning 22 (221) Probabilistic Graphical Models Probabilistic Graphical Models are models for which a graph can express the conditional dependence structure between random variables. We will define the basics used to understand PGMs here. 35 Machine Learning Study Guide For starters, a model is a set of probability distributions that may have generated data. For example, a model could be the set of normal distributions (x − µ)2 2 + , µ ∈ R, σ ∈ R exp − p(x) : p(x) = √ 2σ 2 2πσ 2 1 (222) A graphical model is defined to be a pair (G, P) where G is a graph and P is a set of distributions which factorizes according to G. A graph G = (V, E) consists of a set of vertices V and a set of edges E. Typically the vertices are random variables and the edges relate the random variables. 22.1 Bayesian Networks Bayesian Networks are a type of PGM constructed from directed acyclic graphs. A directed acyclic graph (DAG) is a graph G = (V, E) such that E contains only directed edges (→) and there does not exist a sequence of directed edges from Xi to Xi for all nodes in the graph. Due to the structure of the DAG, we can factorize the joint distribution p(x) with respect to the graph as: p(x) = n Y p(xi | pa(xi , G)) (223) i=1 Here, pa(xi , G) are the parents of xi with respect to graph G. Using this factorization, we can specify conditional independencies from the graph’s structure. We define three useful set construction operations for these networks: an(xi , G) ≡ {xj | xj → · · · → xi ∈ G} de(xi , G) ≡ {xj | xj ← · · · ← xi ∈ G} nd(xi , G) ≡ {xj | xj ∈ / de(xi , G)} (ancestors of xi ) (descendants of xi ) (non-descendants of xi ) (224) A distribution p(x) satisfies the local Markov property wrt DAG G if: xi ⊥ ⊥ nd∗ (xi , G) | pa(xi , G) (225) where nd∗ (xi , G) ≡ nd(xi , G) \ pa(xi , G) For the global markov property, let A, B, C be disjoint subsets of X. A distribution p(X) satisfies the global markov property wrt DAG G if: ⊥ B |C A ⊥d B | C =⇒ A ⊥ 36 (226) Machine Learning Study Guide 22.1.1 Hidden Markov Models 22.2 Markov Random Fields 22.3 Conditional Random Fields 23 Deep Learning: Basics 23.1 Basics 23.2 Activation Functions 23.3 Loss Functions 23.4 Backpropagation 23.5 Regularization Methods 23.6 Optimization Algorithms 23.7 Convolutional Networks 23.8 Recurrent Networks Recurrent networks allow us to work with sequences, where xt is the input at time t, ht is the hidden state at time t, and ht−1 is the previous hidden state. RNNs allow us to propagate information through the hidden state h. Hidden states defualt to zero. 23.8.1 Elman RNN The Elman RNN is the simplest layers, yet prone to both the vanishing and exploding gradient problem. ht = tanh (Wih xt + bih + Whh ht−1 + bhh ) (227) Here, W are weight matricies, and b are bias vectors, one each for the input and hidden vectors. 23.8.2 Long Short-Term Memory LSTM RNNs mitigate the vanishing gradient problem. They involve a more complex set of equations corresponding to gates that control the amount of infomation to propagate through the sequence. Here, i is the input gate, f is the forget gate, g is the cell gate, and o is the output gate. LSTMs have two hidden inputs, hidden state ht and cell state ct . it = σ (Wii xt + bii + Whi ht−1 + bhi ) ft = σ (Wif xt + bif + Whf ht−1 + bhf ) gt = ot = σ (Wio xt + bio + Who ht−1 + bho ) tanh (Wig xt + big + Whg ht−1 + bhg ) ct = ft ◦ ct−1 + it ◦ gt ht = ot ◦ tanh (ct ) (228) Here, σ is the sigmoid function. 37 Machine Learning Study Guide 23.8.3 Gated Recurrent Unit GRU RNNs are another rnn layer designed to mitigate the vanishing gradient problem. Here, we have the r reset gate, z update gate, and n is the new gate. σ is the sigmoid function. 23.8.4 rt = σ (Wir xt + bir + Whr ht−1 + bhr ) zt = σ (Wiz xt + biz + Whz ht−1 + bhz ) nt = tanh (Win xt + bin + rt ◦ (Whn ht−1 + bhn )) ht = (1 − zt ) ◦ nt + zt ◦ ht−1 (229) Bidirectional RNNs Bidirectional RNNs run two separate RNN layers on the forward and reverse sequence, and then concatenates them into one output vector. − → − hf = RNN (→ x) ← − hr = −) RNN (← x ho = h− → − →i hf ; hr (230) Where RNN can be any of the 3 previously mentioned layers. 23.8.5 Vanishing/Exploding Gradient The vanishing and exploding gradient phenomena are often encountered in the context of RNNs. The reason why they happen is that it is difficult to capture long term dependencies because of multiplicative gradient that can be exponentially decreasing/increasing with respect to the number of layers. 23.8.6 Gradient Clipping Gradient clipping is a technique used to cope with the exploding gradient problem sometimes encountered when performing backpropagation. By capping the maximum value for the gradient, this phenomenon is controlled in practice. ∇Lclipped = min (∇L, C) (231) For some max value C. 24 24.1 24.1.1 Deep Learning: Advanced Autoencoders Variational Autoencoders 24.2 General Adversarial Networks 24.3 Encoder-Decoder Models Encoder-decoder models allow for many to many RNN sequence learning. The idea for them is two take a source sequence, encode it to a vectorized output, and pass the hidden state over to the decoder (and possible the encodings themselves) to generate output. These models come up in machine tranlsations, allowing models to translate a source language into a target language. 38 Machine Learning Study Guide 24.3.1 Encoders 24.3.2 Decoders 24.4 Attention Models Attention models were discussed by Bahdanau, and later Luong. They are a means to alter the representation of a set of encodings to pay attention to certain sequence elements more so than others. In an encoder-decoder model with out attention, the decoder relies exclusivly on the hidden state to hold all information. Attention, by contrast, uses the hidden state to create a context vector of the encodings. 24.4.1 Context Vector Attention mechanisms rely on one simple concept: producing context vectors. Hence, given a set of encodeings h = [hi , · · · , hj , · · · hn ] from the encoder, and the current hidden state si−1 , we compute a score for each encoding hj , denoted score(si−1 , hj ). These scores are discussed in later sections. With a score computed for each encoding, we apply a softmax across all scores: exp(score(si−1 , hj )) 0 j 0 exp(score(si−1 , hj )) (232) a(si−1 , hj ) = P These softmax proabilities are attention weights, and our context vector ci is the weighted sum of the encodings given these weights: X ci = a(si−1 , hj 0 ) · hj 0 (233) j0 These context vectors are appended to the embedding of the token set into the decoder. 24.4.2 Concat (Bahdanau) Attention The concat attention scoring mechanism has two learnable weights, va and Wa and is defined as: score(si−1 , hj ) = vaT · tanh (Wa · [si−1 ; hj ]) 24.4.3 (234) Luong Attention Mechanisms Luong defined several global attention mechanisms, defined here: ( sTi−1 · Wa · hj score(si−1 , hj ) = sTi−1 · hj general dot (235) Luong also explored local attention weights, with monotonic alignment and predictive alignment. The difference in local attention is that the context vector is derived on source hidden states within the window [pt − D, pt + D] for some D and decoder time step t (i.e. i − 1). In monotonic alignment, aligned position pt = t. In predicitve alignment, the aligned position is: pt = S · σ vpT tanh (Wp si−1 ) (236) where vp and Wp are learnable parameters, S is the source sentence length. In addition, the weights are gaussian centered around pt : ! (s − pt )2 exp(score(si−1 , hj )) · exp − (237) a(si−1 , hj ) = P 2 0 2 D j 0 exp(score(si−1 , hj )) 2 Here, s is an integer within the window centered at pt . 39 Machine Learning Study Guide 24.5 Word Embeddings Word Embeddings seek to create a dense vector representation of real values to represent a single word in a euclidean space. Word vectors are weight matrices where each row corresponds to a word, acessed through a numerical token representing the word itself. However, several models have been developed to train these word embedding weights to encode lexical meaning. 24.5.1 N-Gram Models A simple apporach is to take a context of size n words that appear before our target word, and attempt to train a model to predict a word wi that appears in a training corpus. We effectively average the input word embeddings, apply a linear layer to project the data to the vocab size, and maximize the probability of predicitng the target word (i.e. apply a softmax of the output vector). exp f (wi−n , . . . , wi−1 )i max p(wi | wi−n , . . . , wi−1 ) = max P|V | j=1 exp f (wi−n , . . . , wi−1 )j Where our model is: (238) n X 1 f (wi−n , . . . , wi−1 ) = Wd · We [wi−j ] + bd ∈ R|V | n j=1 (239) Here, |V | is the size of the vocabulary, Wd ∈ Rd×|V | is a projection matrix used for training along with bias bd , and We ∈ R|V |×d are the word embeddings we seek to train. d is the embedding dimension. Now we can write the negative log likelihood as: min − log p(wi | wi−n , . . . , wi−1 ) = −f (wi−n , . . . , wi−1 )i + log |V | X exp f (wi−n , . . . , wi−1 )j (240) j=1 24.5.2 Continuous Bag-of-Words Models (CBOW) The continuous bag of words model, or CBOW, seeks to predict a center word given the surrounding context words. For instance, given a context size of n, for a target center word wi , we are given context words wi−n , . . . wi−1 , wi+1 , . . . wi+n and attempt to predict wi . Hence: exp f (wi−n , . . . , wi−1 , wi+1 , . . . , wi+n )i max p(wi | wi−n , . . . , wi−1 , wi+1 , . . . , wi+n ) = max P|V | j=1 exp f (wi−n , . . . , wi−1 , wi+1 , . . . , wi+n )j (241) Where our model is: f (wi−n , . . . , wi−1 , wi+1 , . . . , wi+n ) = Wd · 1 2∗n n X We [wi+j ] + bd ∈ R|V | (242) j=−n,j6=0 Here, |V | is the size of the vocabulary, Wd ∈ Rd×|V | is a projection matrix used for training along with bias bd , and We ∈ R|V |×d are the word embeddings we seek to train. d is the embedding dimension. Now we can write the negative log likelihood as: min − log p(wi | wi−n , . . . , wi−1 , wi+1 , . . . , wi+n ) = − f (wi−n , . . . , wi−1 , wi+1 , . . . , wi+n )i + log |V | X j=1 40 exp f (wi−n , . . . , wi−1 , wi+1 , . . . , wi+n )j (243) Machine Learning Study Guide 24.5.3 Skip-Gram Model 24.5.4 Negative Sampling 41
Source Exif Data:
File Type : PDF File Type Extension : pdf MIME Type : application/pdf PDF Version : 1.5 Linearized : No Page Count : 41 Page Mode : UseOutlines Author : Title : Subject : Creator : LaTeX with hyperref package Producer : pdfTeX-1.40.18 Create Date : 2019:02:17 20:43:04-05:00 Modify Date : 2019:02:17 20:43:04-05:00 Trapped : False PTEX Fullbanner : This is pdfTeX, Version 3.14159265-2.6-1.40.18 (TeX Live 2017) kpathsea version 6.2.3EXIF Metadata provided by EXIF.tools