Solution Manual For PRML
Solution Manual for Pattern Recognition and Machine Learning
Edited by Zhengqi Gao
School of Information Science and Technology, Fudan University
Nov. 2017

0.1 Introduction

Problem 1.1 Solution
We set the derivative of the error function $E$ with respect to the vector $\mathbf{w}$ to zero, i.e. $\partial E/\partial\mathbf{w} = 0$; the solution $\mathbf{w} = \{w_i\}$ of this equation minimizes $E$. Equivalently, we compute the derivative of $E$ with respect to every $w_i$ and set each of them to zero. Based on (1.1) and (1.2) we obtain:
$$\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N}\{y(x_n,\mathbf{w})-t_n\}\,x_n^i = 0$$
$$\Rightarrow\quad \sum_{n=1}^{N} y(x_n,\mathbf{w})\,x_n^i = \sum_{n=1}^{N} x_n^i t_n \quad\Rightarrow\quad \sum_{n=1}^{N}\Big(\sum_{j=0}^{M} w_j x_n^j\Big)x_n^i = \sum_{n=1}^{N} x_n^i t_n \quad\Rightarrow\quad \sum_{j=0}^{M}\Big(\sum_{n=1}^{N} x_n^{\,i+j}\Big)w_j = \sum_{n=1}^{N} x_n^i t_n$$
If we denote $A_{ij} = \sum_{n=1}^{N} x_n^{\,i+j}$ and $T_i = \sum_{n=1}^{N} x_n^i t_n$, the equation above is exactly (1.122). Therefore the problem is solved.

Problem 1.2 Solution
This problem is similar to Prob. 1.1; the only difference is the last term on the right-hand side of (1.4), the penalty term. We proceed as in Prob. 1.1:
$$\frac{\partial \widetilde{E}}{\partial w_i} = \sum_{n=1}^{N}\{y(x_n,\mathbf{w})-t_n\}\,x_n^i + \lambda w_i = 0 \quad\Rightarrow\quad \sum_{j=0}^{M}\sum_{n=1}^{N} x_n^{\,i+j}w_j + \lambda w_i = \sum_{n=1}^{N} x_n^i t_n \quad\Rightarrow\quad \sum_{j=0}^{M}\Big\{\sum_{n=1}^{N} x_n^{\,i+j} + \delta_{ji}\lambda\Big\}w_j = \sum_{n=1}^{N} x_n^i t_n$$
where
$$\delta_{ji} = \begin{cases} 0 & j\neq i \\ 1 & j=i \end{cases}$$

Problem 1.3 Solution
This problem can be solved by Bayes' theorem. The probability of selecting an apple, $P(a)$, is
$$P(a) = P(a|r)P(r) + P(a|b)P(b) + P(a|g)P(g) = \frac{3}{10}\times 0.2 + \frac{1}{2}\times 0.2 + \frac{3}{10}\times 0.6 = 0.34$$
Based on Bayes' theorem, the probability that a selected orange came from the green box is
$$P(g|o) = \frac{P(o|g)P(g)}{P(o)}$$
We first calculate the probability of selecting an orange, $P(o)$:
$$P(o) = P(o|r)P(r) + P(o|b)P(b) + P(o|g)P(g) = \frac{4}{10}\times 0.2 + \frac{1}{2}\times 0.2 + \frac{3}{10}\times 0.6 = 0.36$$
Therefore
$$P(g|o) = \frac{P(o|g)P(g)}{P(o)} = \frac{\frac{3}{10}\times 0.6}{0.36} = 0.5$$

Problem 1.4 Solution
This problem relies on the chain rule of calculus. We calculate the derivative of $p_y(y)$ with respect to $y$, according to (1.27):
$$\frac{dp_y(y)}{dy} = \frac{d\big(p_x(g(y))\,|g'(y)|\big)}{dy} = \frac{dp_x(g(y))}{dy}\,|g'(y)| + p_x(g(y))\,\frac{d|g'(y)|}{dy} \qquad (*)$$
The first term above can be expanded further:
$$\frac{dp_x(g(y))}{dy}\,|g'(y)| = \frac{dp_x(g(y))}{dg(y)}\,\frac{dg(y)}{dy}\,|g'(y)| \qquad (**)$$
If $\hat{x}$ is the maximum of the density over $x$, we have
$$\frac{dp_x(x)}{dx}\Big|_{\hat{x}} = 0$$
Therefore, when $y=\hat{y}$ such that $\hat{x}=g(\hat{y})$, the right-hand side of $(**)$ vanishes, so the first term of $(*)$ equals zero; however, because of the second term of $(*)$, the derivative of $p_y$ at $\hat{y}$ need not vanish. Under a linear transformation, e.g. $x = ay+b$, the second term of $(*)$ does vanish. A simple counterexample for the nonlinear case:
$$p_x(x) = 2x,\ x\in[0,1] \quad\Rightarrow\quad \hat{x}=1$$
Given the change of variable $x=\sin(y)$, we have $p_y(y) = 2\sin(y)|\cos(y)|$ for $y\in[0,\frac{\pi}{2}]$, which simplifies to
$$p_y(y) = \sin(2y),\ y\in\Big[0,\frac{\pi}{2}\Big] \quad\Rightarrow\quad \hat{y}=\frac{\pi}{4}$$
It is then obvious that $\hat{x}\neq\sin(\hat{y})$, i.e. the mode is not invariant under this nonlinear change of variable.

Problem 1.5 Solution
This problem uses the linearity of expectation:
$$\mathrm{var}[f] = E\big[(f(x)-E[f(x)])^2\big] = E\big[f(x)^2 - 2f(x)E[f(x)] + E[f(x)]^2\big] = E[f(x)^2] - 2E[f(x)]^2 + E[f(x)]^2$$
$$\Rightarrow\quad \mathrm{var}[f] = E[f(x)^2] - E[f(x)]^2$$

Problem 1.6 Solution
Based on (1.41), we only need to prove that when $x$ and $y$ are independent, $E_{x,y}[xy] = E[x]E[y]$. Because $x$ and $y$ are independent, we have $p(x,y) = p_x(x)p_y(y)$. Therefore
$$\iint xy\,p(x,y)\,dx\,dy = \iint xy\,p_x(x)p_y(y)\,dx\,dy = \Big(\int x\,p_x(x)\,dx\Big)\Big(\int y\,p_y(y)\,dy\Big) \quad\Rightarrow\quad E_{x,y}[xy] = E[x]E[y]$$
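A quick numerical check of the normal equations derived in Prob. 1.1 (a minimal NumPy sketch, not part of the original solution; the data, noise level and polynomial order are arbitrary test values): solving $\sum_j A_{ij}w_j = T_i$ should reproduce an ordinary least-squares polynomial fit.

```python
import numpy as np

# Check Problem 1.1: the normal equations sum_j A_ij w_j = T_i with
# A_ij = sum_n x_n^(i+j) and T_i = sum_n x_n^i t_n give the least-squares fit.
rng = np.random.default_rng(0)
M = 3                                        # polynomial order (assumed test value)
x = rng.uniform(-1, 1, size=20)              # synthetic inputs (assumed)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)

i = np.arange(M + 1)
A = np.array([[np.sum(x ** (p + q)) for q in i] for p in i])
T = np.array([np.sum((x ** p) * t) for p in i])
w_normal = np.linalg.solve(A, T)

# Reference: least-squares solution for the same polynomial design matrix
Phi = np.vander(x, M + 1, increasing=True)   # columns 1, x, x^2, ...
w_lstsq, *_ = np.linalg.lstsq(Phi, t, rcond=None)

print(np.allclose(w_normal, w_lstsq))        # True
```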
Problem 1.7 Solution
This problem uses integration by substitution. With $x = r\cos\theta$, $y = r\sin\theta$:
$$I^2 = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty}\exp\Big(-\frac{x^2}{2\sigma^2}-\frac{y^2}{2\sigma^2}\Big)\,dx\,dy = \int_0^{2\pi}\int_0^{+\infty}\exp\Big(-\frac{r^2}{2\sigma^2}\Big)\,r\,dr\,d\theta$$
Based on the fact that
$$\int_0^{+\infty}\exp\Big(-\frac{r^2}{2\sigma^2}\Big)\,r\,dr = -\sigma^2\exp\Big(-\frac{r^2}{2\sigma^2}\Big)\Big|_0^{+\infty} = -\sigma^2\big(0-(-1)\big) = \sigma^2$$
we obtain $I^2 = \int_0^{2\pi}\sigma^2\,d\theta = 2\pi\sigma^2$, i.e. $I = \sqrt{2\pi}\,\sigma$. Next we show that the Gaussian distribution $N(x|\mu,\sigma^2)$ is normalized, i.e. $\int_{-\infty}^{+\infty} N(x|\mu,\sigma^2)\,dx = 1$. With the substitution $y = x-\mu$:
$$\int_{-\infty}^{+\infty} N(x|\mu,\sigma^2)\,dx = \int_{-\infty}^{+\infty}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}\,dx = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty}\exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}\,dy = 1$$

Problem 1.8 Solution
The first part uses the result of Prob. 1.7. With $y = x-\mu$:
$$\int_{-\infty}^{+\infty} N(x|\mu,\sigma^2)\,x\,dx = \frac{1}{\sqrt{2\pi\sigma^2}}\int_{-\infty}^{+\infty}\exp\Big\{-\frac{y^2}{2\sigma^2}\Big\}(y+\mu)\,dy = \mu + 0 = \mu$$
since the integrand $y\exp(-y^2/2\sigma^2)$ is odd. For the second part, a hint is already given in the problem statement: using the product rule $\frac{d(fg)}{dx} = f\frac{dg}{dx} + g\frac{df}{dx}$ and differentiating both sides of (1.127) with respect to $\sigma^2$, we obtain
$$\int_{-\infty}^{+\infty}\Big(-\frac{1}{2\sigma^2}+\frac{(x-\mu)^2}{2\sigma^4}\Big)\,N(x|\mu,\sigma^2)\,dx = 0$$
Provided that $\sigma\neq 0$, this gives
$$\int_{-\infty}^{+\infty}(x-\mu)^2\,N(x|\mu,\sigma^2)\,dx = \sigma^2\int_{-\infty}^{+\infty} N(x|\mu,\sigma^2)\,dx = \sigma^2$$
which is exactly (1.51), since by definition $\mathrm{var}[x] = \int_{-\infty}^{+\infty}(x-E[x])^2\,N(x|\mu,\sigma^2)\,dx$ and we have already shown $E[x]=\mu$. Therefore $\mathrm{var}[x]=\sigma^2$, and finally $E[x^2] = \mathrm{var}[x]+E[x]^2 = \sigma^2+\mu^2$.

Problem 1.9 Solution
Here we focus on (1.52), since (1.42) is its one-dimensional special case. The maximum of a distribution is its mode. From (1.52), using $\partial(\mathbf{x}^T A\mathbf{x})/\partial\mathbf{x} = (A+A^T)\mathbf{x}$:
$$\frac{\partial N(\mathbf{x}|\boldsymbol\mu,\Sigma)}{\partial\mathbf{x}} = -\frac{1}{2}\big[\Sigma^{-1}+(\Sigma^{-1})^T\big](\mathbf{x}-\boldsymbol\mu)\,N(\mathbf{x}|\boldsymbol\mu,\Sigma) = -\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\,N(\mathbf{x}|\boldsymbol\mu,\Sigma)$$
where we used $(\Sigma^{-1})^T = \Sigma^{-1}$. Hence the derivative vanishes only when $\mathbf{x}=\boldsymbol\mu$. Note: strictly, one should also check the Hessian matrix to confirm that this is a maximum; however the first derivative has only one root, and by the description in the problem this stationary point must be the maximum.
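The normalization and the first two moments of the Gaussian derived in Probs. 1.7 and 1.8 can be checked numerically with a simple Riemann sum (a sketch, not part of the original solution; the values of $\mu$ and $\sigma$ are arbitrary):

```python
import numpy as np

# Check Problems 1.7-1.8: N(x|mu, sigma^2) integrates to 1, has mean mu and
# second moment mu^2 + sigma^2.
mu, sigma = 1.5, 0.7                      # arbitrary test values (assumed)
x = np.linspace(mu - 10 * sigma, mu + 10 * sigma, 200001)
dx = x[1] - x[0]
p = np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

print(np.sum(p) * dx)                     # ~ 1.0
print(np.sum(x * p) * dx)                 # ~ mu
print(np.sum(x ** 2 * p) * dx)            # ~ mu^2 + sigma^2
```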
Problem 1.10 Solution
We solve this problem using the definitions of expectation and variance together with independence, i.e. $p(x,z) = p(x)p(z)$:
$$E[x+z] = \iint (x+z)\,p(x)p(z)\,dx\,dz = \Big(\int p(z)\,dz\Big)\int x\,p(x)\,dx + \Big(\int p(x)\,dx\Big)\int z\,p(z)\,dz = E[x]+E[z]$$
For the variance, expanding the square and using $\iint (x+z)\,p(x,z)\,dx\,dz = E[x+z]$:
$$\mathrm{var}[x+z] = \iint\big(x+z-E[x+z]\big)^2 p(x,z)\,dx\,dz = \iint (x+z)^2 p(x,z)\,dx\,dz - E^2[x+z]$$
Expanding $(x+z)^2 = x^2+2xz+z^2$ and using independence,
$$\mathrm{var}[x+z] = E[x^2]+E[z^2]+2E[x]E[z] - \big(E[x]+E[z]\big)^2 = E[x^2]-E^2[x] + E[z^2]-E^2[z] = \mathrm{var}[x]+\mathrm{var}[z]$$

Problem 1.11 Solution
Since the maximum likelihood solutions $\mu_{ML}$ and $\sigma^2_{ML}$ decouple, we first find $\mu_{ML}$. Setting
$$\frac{d}{d\mu}\ln p(\mathbf{x}|\mu,\sigma^2) = \frac{1}{\sigma^2}\sum_{n=1}^{N}(x_n-\mu) = 0 \quad\Rightarrow\quad \mu_{ML} = \frac{1}{N}\sum_{n=1}^{N} x_n$$
Similarly, setting
$$\frac{d}{d\sigma^2}\ln p(\mathbf{x}|\mu,\sigma^2) = \frac{1}{2\sigma^4}\Big(\sum_{n=1}^{N}(x_n-\mu)^2 - N\sigma^2\Big) = 0 \quad\Rightarrow\quad \sigma^2_{ML} = \frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2$$

Problem 1.12 Solution
The result for $E[\mu_{ML}]$ is straightforward, using that the $x_n$ are i.i.d. draws from $N(\mu,\sigma^2)$:
$$E[\mu_{ML}] = E\Big[\frac{1}{N}\sum_{n=1}^{N} x_n\Big] = \frac{1}{N}\sum_{n=1}^{N} E[x_n] = \mu$$
For $E[\sigma^2_{ML}]$ we use (1.56) and the identity given in the problem, $E[x_n x_m] = \mu^2 + I_{nm}\sigma^2$:
$$E[\sigma^2_{ML}] = E\Big[\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu_{ML})^2\Big] = \frac{1}{N}\sum_{n=1}^{N}E[x_n^2] - \frac{2}{N}\sum_{n=1}^{N}E[x_n\mu_{ML}] + E[\mu_{ML}^2]$$
The first term equals $\mu^2+\sigma^2$. For the remaining terms, substituting $\mu_{ML} = \frac{1}{N}\sum_n x_n$,
$$\frac{2}{N}\sum_{n=1}^{N}E[x_n\mu_{ML}] = \frac{2}{N^2}E\Big[\Big(\sum_{n=1}^{N}x_n\Big)^2\Big] = 2\Big(\mu^2+\frac{\sigma^2}{N}\Big), \qquad E[\mu_{ML}^2] = \frac{1}{N^2}E\Big[\Big(\sum_{n=1}^{N}x_n\Big)^2\Big] = \mu^2+\frac{\sigma^2}{N}$$
where $E\big[(\sum_n x_n)^2\big] = N(N\mu^2+\sigma^2)$. Therefore
$$E[\sigma^2_{ML}] = \mu^2+\sigma^2 - \Big(\mu^2+\frac{\sigma^2}{N}\Big) = \Big(\frac{N-1}{N}\Big)\sigma^2$$

Problem 1.13 Solution
This problem can be solved by the same method as Prob. 1.12, except that the known mean $\mu$ is used in place of $\mu_{ML}$:
$$E[\sigma^2_{ML}] = E\Big[\frac{1}{N}\sum_{n=1}^{N}(x_n-\mu)^2\Big] = \frac{1}{N}\sum_{n=1}^{N}E[x_n^2] - \frac{2\mu}{N}\sum_{n=1}^{N}E[x_n] + \mu^2 = (\mu^2+\sigma^2) - 2\mu^2 + \mu^2 = \sigma^2$$
Note: the essential difference between Prob. 1.12 and Prob. 1.13 is whether the mean of the Gaussian is known beforehand (Prob. 1.13) or estimated from the data (Prob. 1.12). The difference shows up in the expectations
$$E[\mu^2] = \mu^2 \ \ (\mu\text{ is a fixed constant}), \qquad E[\mu_{ML}^2] = \frac{1}{N^2}E\Big[\Big(\sum_{n=1}^{N}x_n\Big)^2\Big] = \frac{1}{N^2}\big(N^2\mu^2+N\sigma^2\big) = \mu^2+\frac{\sigma^2}{N}$$
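The bias of the maximum likelihood variance estimator obtained in Prob. 1.12 is easy to confirm by simulation (a sketch, not part of the original solution; all parameter values are arbitrary):

```python
import numpy as np

# Check Problem 1.12: the ML variance estimator is biased,
# E[sigma_ML^2] = (N - 1) * sigma^2 / N.
rng = np.random.default_rng(1)
mu, sigma, N, trials = 0.0, 2.0, 5, 200000     # assumed test setup
samples = rng.normal(mu, sigma, size=(trials, N))
sigma2_ml = samples.var(axis=1)                # ddof=0: divides by N, uses sample mean
print(sigma2_ml.mean())                        # ~ (N-1)/N * sigma^2 = 3.2
print((N - 1) / N * sigma ** 2)                # 3.2
```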
Problem 1.14 Solution
This problem is analogous to the fact that any function $f(x)$ can be written as the sum of an odd function and an even function. If we let
$$w_{ij}^S = \frac{w_{ij}+w_{ji}}{2}, \qquad w_{ij}^A = \frac{w_{ij}-w_{ji}}{2}$$
these clearly satisfy the constraints described in the problem:
$$w_{ij} = w_{ij}^S + w_{ij}^A, \qquad w_{ij}^S = w_{ji}^S, \qquad w_{ij}^A = -w_{ji}^A$$
To prove (1.132), expand the quadratic term:
$$\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}x_ix_j = \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}^S x_ix_j + \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}^A x_ix_j$$
so we only need to prove that the second term vanishes. A simple trick: we show that twice the second term equals zero:
$$2\sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}^A x_ix_j = \sum_{i=1}^{D}\sum_{j=1}^{D}\big(w_{ij}^A - w_{ji}^A\big)x_ix_j = \sum_{i=1}^{D}\sum_{j=1}^{D} w_{ij}^A x_ix_j - \sum_{j=1}^{D}\sum_{i=1}^{D} w_{ji}^A x_jx_i = 0$$
Therefore we may take the coefficient matrix to be symmetric, as described in the problem. Given the symmetry, the whole matrix is determined once $w_{ij}$ is specified for $i = 1,2,\dots,D$ and $i\le j$. Hence the number of independent parameters is
$$D + (D-1) + \dots + 1 = \frac{D(D+1)}{2}$$
Note: intuitively, once the upper triangular part of a symmetric matrix is given, the whole matrix is determined.

Problem 1.15 Solution
This problem is a more general form of Prob. 1.14, and the same idea applies: we find a way to use $w_{i_1i_2\dots i_M}$ to represent $\widetilde{w}_{i_1i_2\dots i_M}$. We begin by introducing a mapping function
$$F(x_{i_1}x_{i_2}\dots x_{i_M}) = x_{j_1}x_{j_2}\dots x_{j_M} \quad\text{s.t.}\quad \bigcup_{k=1}^{M}x_{i_k} = \bigcup_{k=1}^{M}x_{j_k}, \quad j_1\ge j_2\ge\dots\ge j_M$$
Although it is cumbersome to write $F$ formally, it does a simple job: it rearranges the factors of a term in decreasing order of their subscripts. A few examples with $D=5$, $M=4$:
$$F(x_5x_2x_3x_2)=x_5x_3x_2x_2,\quad F(x_1x_3x_3x_2)=x_3x_3x_2x_1,\quad F(x_1x_4x_2x_3)=x_4x_3x_2x_1,\quad F(x_1x_1x_5x_2)=x_5x_2x_1x_1$$
Since $F$ does not change the value of a term but only rearranges it, we have
$$\sum_{i_1=1}^{D}\sum_{i_2=1}^{D}\dots\sum_{i_M=1}^{D} w_{i_1i_2\dots i_M}x_{i_1}x_{i_2}\dots x_{i_M} = \sum_{j_1=1}^{D}\sum_{j_2=1}^{j_1}\dots\sum_{j_M=1}^{j_{M-1}} \widetilde{w}_{j_1j_2\dots j_M}x_{j_1}x_{j_2}\dots x_{j_M}$$
where
$$\widetilde{w}_{j_1j_2\dots j_M} = \sum_{w\in\Omega} w, \qquad \Omega = \big\{\, w_{i_1i_2\dots i_M} \mid F(x_{i_1}x_{i_2}\dots x_{i_M}) = x_{j_1}x_{j_2}\dots x_{j_M} \,\big\}$$
This proves (1.134). Mathematical induction will be used to prove (1.135), starting from $D=1$. When $D=1$, (1.134) degenerates to the single term $\widetilde{w}x_1^M$, which has one coefficient regardless of the value of $M$; hence $n(1,M)=1$ for all $M$, and in particular $n(1,M)=n(1,M-1)$, so (1.135) holds for $D=1$. Now suppose (1.135) holds for $D$; we show it also holds for $D+1$. Beginning from (1.134),
$$\sum_{i_1=1}^{D+1}\sum_{i_2=1}^{i_1}\dots\sum_{i_M=1}^{i_{M-1}} \widetilde{w}_{i_1i_2\dots i_M}x_{i_1}x_{i_2}\dots x_{i_M} \qquad (*)$$
we split $(*)$ into two parts according to the first summation: the first part consists of $i_1 = 1,2,\dots,D$ and the second part of $i_1 = D+1$. After this division, the first part corresponds to $n(D,M)$ independent parameters and the second part to $n(D+1,M-1)$.
Therefore we obtain the recursion
$$n(D+1,M) = n(D,M) + n(D+1,M-1)$$
Given that (1.135) holds for $D$, i.e. $n(D,M) = \sum_{i=1}^{D} n(i,M-1)$, substituting it into the recursion gives
$$n(D+1,M) = \sum_{i=1}^{D} n(i,M-1) + n(D+1,M-1) = \sum_{i=1}^{D+1} n(i,M-1)$$
which completes the induction.

We prove (1.136) in a different but simple way. Written in terms of combinations, (1.136) reads
$$\sum_{i=1}^{D} C_{i+M-2}^{M-1} = C_{D+M-1}^{M}$$
First, expand the summation:
$$C_{M-1}^{M-1} + C_{M}^{M-1} + \dots + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}$$
Second, rewrite the first term on the left as $C_M^M$, which is valid because $C_M^M = C_{M-1}^{M-1} = 1$; in other words, we only need to prove
$$C_{M}^{M} + C_{M}^{M-1} + \dots + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}$$
Third, we use the property $C_N^r = C_{N-1}^r + C_{N-1}^{r-1}$: repeatedly combining the first two terms on the left ($C_M^M + C_M^{M-1} = C_{M+1}^M$, then $C_{M+1}^M + C_{M+1}^{M-1} = C_{M+2}^M$, and so on) eventually yields the right-hand side.

(1.137) gives the closed form of $n(D,M)$, and we need all the conclusions above to prove it. Some intuition first: when $M=0$, (1.134) consists only of a constant term, so $n(D,0)=1$; when $M=1$, clearly $n(D,1)=D$, because (1.134) then has exactly $D$ terms; when $M=2$, the problem degenerates to Prob. 1.14, so $n(D,2)=\frac{D(D+1)}{2}$. Now suppose (1.137) holds for $M-1$; we prove that it also holds for $M$:
$$n(D,M) = \sum_{i=1}^{D} n(i,M-1) = \sum_{i=1}^{D} C_{i+M-2}^{M-1} = C_{D+M-1}^{M}$$
where the first equality is (1.135), the second uses (1.137) for $M-1$, and the third is (1.136). This completes the proof.

Problem 1.16 Solution
This problem can be solved in the same way as Prob. 1.15. The expression containing all independent terms up to order $M$, corresponding to $N(D,M)$, is obtained by adding a summation over the order $m$ to (1.134):
$$\sum_{m=0}^{M}\sum_{i_1=1}^{D}\sum_{i_2=1}^{i_1}\dots\sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1i_2\dots i_m}x_{i_1}x_{i_2}\dots x_{i_m} \qquad (*)$$
(1.138) is intuitively clear if we view $m$ as a looping variable iterating over all orders up to $M$, where each order $m$ contributes $n(D,m)$ independent parameters. To prove (1.138) formally we use mathematical induction. When $M=1$, $(*)$ degenerates to two parts, $m=0$ corresponding to $n(D,0)$ and $m=1$ corresponding to $n(D,1)$, so $N(D,1)=n(D,0)+n(D,1)$. Suppose (1.138) holds for $M$. Writing all independent terms up to order $M+1$ based on $(*)$ and splitting the summation over $m$ into the parts $m=0,1,\dots,M$ and $m=M+1$ (the same technique as in Prob. 1.15), the first part corresponds to $N(D,M)$ and the second to $n(D,M+1)$. Hence
$$N(D,M+1) = N(D,M) + n(D,M+1) = \sum_{m=0}^{M} n(D,m) + n(D,M+1) = \sum_{m=0}^{M+1} n(D,m)$$
To prove (1.139) we again use the telescoping technique of Prob. 1.15 rather than induction, starting from the already proved (1.138) and then using (1.137):
$$N(D,M) = \sum_{m=0}^{M} C_{D+m-1}^{m} = C_{D-1}^{0} + C_{D}^{1} + C_{D+1}^{2} + \dots + C_{D+M-1}^{M} = \big(C_{D}^{0} + C_{D}^{1}\big) + C_{D+1}^{2} + \dots + C_{D+M-1}^{M} = \big(C_{D+1}^{1} + C_{D+1}^{2}\big) + \dots + C_{D+M-1}^{M} = \dots = C_{D+M}^{M}$$
where we replaced $C_{D-1}^{0}$ by $C_{D}^{0}$ (both equal 1) and then repeatedly applied $C_{N}^{r} + C_{N}^{r+1} = C_{N+1}^{r+1}$.

Here, as asked by the problem, we examine the growth of $N(D,M)$. Since $D$ and $M$ appear symmetrically in $C_{D+M}^{M}$, we only need to show that for $D\gg M$ it grows like $D^M$; the case $M\gg D$ then follows by symmetry. Using Stirling's approximation $n!\approx n^n e^{-n}$,
$$N(D,M) = \frac{(D+M)!}{D!\,M!} \approx \frac{(D+M)^{D+M}}{D^D M^M} = \frac{1}{M^M}\Big(1+\frac{M}{D}\Big)^D (D+M)^M \approx \frac{e^M}{M^M}(D+M)^M = \frac{e^M}{M^M}D^M\Big(1+\frac{M}{D}\Big)^M \approx \frac{e^M e^{M^2/D}}{M^M}D^M \approx \frac{e^M}{M^M}D^M$$
where we used $\lim_{n\to+\infty}(1+\frac{1}{n})^n = e$ and $e^{M^2/D}\approx e^0 = 1$ when $D\gg M$. Hence, viewing $e^M/M^M$ as a constant, $N(D,M)$ grows like $D^M$ when $D\gg M$, and by symmetry like $M^D$ when $M\gg D$. Finally, we are asked to evaluate $N(10,3)$ and $N(100,3)$:
$$N(10,3) = C_{13}^{3} = 286, \qquad N(100,3) = C_{103}^{3} = 176851$$

Problem 1.17 Solution
Using integration by parts,
$$\Gamma(x+1) = \int_0^{+\infty} u^x e^{-u}\,du = \Big[-u^x e^{-u}\Big]_0^{+\infty} + \int_0^{+\infty} x\,u^{x-1}e^{-u}\,du = \Big[-u^x e^{-u}\Big]_0^{+\infty} + x\,\Gamma(x)$$
so we only need to show that the boundary term vanishes. At $u=0$ we have $-u^x e^{-u}=0$, and applying L'Hospital's rule repeatedly,
$$\lim_{u\to+\infty}\frac{u^x}{e^u} = 0$$
Hence $\Gamma(x+1) = x\,\Gamma(x)$. From the definition of $\Gamma(x)$ we also have
$$\Gamma(1) = \int_0^{+\infty} e^{-u}\,du = \Big[-e^{-u}\Big]_0^{+\infty} = -(0-1) = 1$$
Therefore, when $x$ is an integer,
$$\Gamma(x+1) = x\,\Gamma(x) = x(x-1)\,\Gamma(x-1) = \dots = x!\;\Gamma(1) = x!$$

Problem 1.18 Solution
Based on (1.124) and (1.126), substituting $x = \sqrt{2}\,\sigma y$, it is straightforward to obtain
$$\int_{-\infty}^{+\infty} e^{-x_i^2}\,dx_i = \sqrt{\pi}$$
so the left-hand side of (1.142) equals $\pi^{D/2}$. For the right-hand side, with $u = r^2$:
$$S_D\int_0^{+\infty} e^{-r^2}r^{D-1}\,dr = \frac{S_D}{2}\int_0^{+\infty} e^{-u}u^{\frac{D}{2}-1}\,du = \frac{S_D}{2}\,\Gamma\Big(\frac{D}{2}\Big)$$
Equating the two sides gives
$$\pi^{D/2} = \frac{S_D}{2}\,\Gamma\Big(\frac{D}{2}\Big) \quad\Rightarrow\quad S_D = \frac{2\pi^{D/2}}{\Gamma(D/2)}$$
$S_D$ is the surface area of the unit sphere in dimension $D$; more generally, the surface area of a sphere of radius $r$ in dimension $D$ is $S_D\,r^{D-1}$, i.e. proportional to the $(D-1)$-th power of the radius, reducing to $S_D$ when $r=1$. Using the relation $\frac{dV}{dr} = S$ between the volume and the surface area of a sphere of arbitrary radius in dimension $D$,
$$V = \int S\,dr = \int S_D\,r^{D-1}\,dr = \frac{S_D}{D}\,r^D$$
This is the volume of a sphere of radius $r$ in dimension $D$; setting $r=1$ gives $V_D = S_D/D$. For $D=2$ and $D=3$:
$$V_2 = \frac{S_2}{2} = \frac{1}{2}\cdot\frac{2\pi}{\Gamma(1)} = \pi, \qquad V_3 = \frac{S_3}{3} = \frac{1}{3}\cdot\frac{2\pi^{3/2}}{\Gamma(3/2)} = \frac{1}{3}\cdot\frac{2\pi^{3/2}}{\sqrt{\pi}/2} = \frac{4}{3}\pi$$
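The closed forms just obtained are easy to check numerically; the following sketch (not part of the original solution; the helper names are ours) evaluates $N(D,M) = C_{D+M}^{M}$ for the two cases asked for in Prob. 1.16, and $S_D$ from (1.143) for $D=2,3$:

```python
from math import comb, pi, gamma

# Check Problem 1.16: N(D, M) = C(D + M, M)
def N(D, M):
    return comb(D + M, M)

print(N(10, 3), N(100, 3))        # 286 176851

# Check (1.143): S_D = 2 * pi^(D/2) / Gamma(D/2)
def S(D):
    return 2 * pi ** (D / 2) / gamma(D / 2)

print(S(2), S(3))                 # 2*pi (circumference) and 4*pi (surface area)
```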
Problem 1.19 Solution
A hint was already given in the solution of Prob. 1.18; let us make it explicit: the volume of a sphere of radius $r$ in dimension $D$ is $V_D\,r^D$, proportional to the $D$-th power of the radius (whereas the surface area is proportional to $r^{D-1}$). Hence
$$\frac{\text{volume of sphere}}{\text{volume of cube}} = \frac{V_D\,a^D}{(2a)^D} = \frac{S_D}{2^D D} = \frac{\pi^{D/2}}{2^{D-1}D\,\Gamma(\frac{D}{2})} \qquad (*)$$
where we used (1.143). To show that $(*)$ converges to $0$ as $D\to+\infty$, rewrite it as
$$(*) = \frac{2}{D}\cdot\Big(\frac{\pi}{4}\Big)^{D/2}\cdot\frac{1}{\Gamma(\frac{D}{2})}$$
Each of the three factors converges to $0$ as $D\to+\infty$ (note $\pi/4 < 1$), and therefore so does their product. The last part is simple:
$$\frac{\text{centre to one corner}}{\text{centre to one side}} = \frac{\sqrt{a^2 D}}{a} = \sqrt{D}, \qquad \lim_{D\to+\infty}\sqrt{D} = +\infty$$

Problem 1.20 Solution
The probability density inside a thin shell of radius $r$ and thickness $\epsilon$ can be treated as constant, and a sphere of radius $r$ in dimension $D$ has surface area $S_D r^{D-1}$ (proved in Prob. 1.18). Hence
$$\int_{\text{shell}} p(\mathbf{x})\,d\mathbf{x} = \frac{\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{D/2}}\int_{\text{shell}} d\mathbf{x} = \frac{\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{D/2}}\,S_D r^{D-1}\epsilon$$
so we may write, recovering (1.148),
$$p(r) = \frac{S_D r^{D-1}\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{D/2}}$$
Its derivative with respect to $r$ is
$$\frac{dp(r)}{dr} = \frac{S_D}{(2\pi\sigma^2)^{D/2}}\,r^{D-2}\exp\Big(-\frac{r^2}{2\sigma^2}\Big)\Big(D-1-\frac{r^2}{\sigma^2}\Big) \qquad (*)$$
Setting the derivative to zero gives the unique stationary point $\hat{r} = \sqrt{D-1}\,\sigma$ (since $r\in[0,+\infty)$). For $r<\hat{r}$ the derivative is positive, so $p(r)$ increases with $r$; for $r>\hat{r}$ it is negative, so $p(r)$ decreases. Therefore $\hat{r}$ is the unique maximum, and for $D\gg 1$ we have $\hat{r}\approx\sqrt{D}\,\sigma$. Next,
$$\frac{p(\hat{r}+\epsilon)}{p(\hat{r})} = \frac{(\hat{r}+\epsilon)^{D-1}\exp\big(-\frac{(\hat{r}+\epsilon)^2}{2\sigma^2}\big)}{\hat{r}^{D-1}\exp\big(-\frac{\hat{r}^2}{2\sigma^2}\big)} = \exp\Big(-\frac{2\epsilon\hat{r}+\epsilon^2}{2\sigma^2}+(D-1)\ln\Big(1+\frac{\epsilon}{\hat{r}}\Big)\Big)$$
Expanding the exponent with Taylor's theorem and using $D-1 = \hat{r}^2/\sigma^2$:
$$-\frac{2\epsilon\hat{r}+\epsilon^2}{2\sigma^2}+(D-1)\ln\Big(1+\frac{\epsilon}{\hat{r}}\Big) \approx -\frac{2\epsilon\hat{r}+\epsilon^2}{2\sigma^2}+(D-1)\Big(\frac{\epsilon}{\hat{r}}-\frac{\epsilon^2}{2\hat{r}^2}\Big) = -\frac{2\epsilon\hat{r}+\epsilon^2}{2\sigma^2}+\frac{2\hat{r}\epsilon-\epsilon^2}{2\sigma^2} = -\frac{\epsilon^2}{\sigma^2}$$
Therefore $p(\hat{r}+\epsilon) = p(\hat{r})\exp(-\frac{\epsilon^2}{\sigma^2})$. Note: this conclusion differs from (1.149), but I do not think there is any mistake in the deduction above. Finally, from (1.147):
$$p(\mathbf{x})\Big|_{\mathbf{x}=0} = \frac{1}{(2\pi\sigma^2)^{D/2}}, \qquad p(\mathbf{x})\Big|_{\|\mathbf{x}\|^2=\hat{r}^2} = \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\Big(-\frac{\hat{r}^2}{2\sigma^2}\Big) \approx \frac{1}{(2\pi\sigma^2)^{D/2}}\exp\Big(-\frac{D}{2}\Big)$$

Problem 1.21 Solution
The first part is simple: given $b\ge a\ge 0$,
$$(ab)^{1/2} - a = a^{1/2}\big(b^{1/2}-a^{1/2}\big) \ge 0$$
Based on (1.78),
$$p(\text{mistake}) = p(\mathbf{x}\in R_1, C_2) + p(\mathbf{x}\in R_2, C_1) = \int_{R_1} p(\mathbf{x},C_2)\,d\mathbf{x} + \int_{R_2} p(\mathbf{x},C_1)\,d\mathbf{x}$$
Recall that the decision rule minimizing the misclassification rate assigns $\mathbf{x}$ to class $C_1$ whenever $p(\mathbf{x},C_1) > p(\mathbf{x},C_2)$; hence in the decision region $R_1$ we have $p(\mathbf{x},C_1) > p(\mathbf{x},C_2)$. Using the inequality just proved,
$$\int_{R_1} p(\mathbf{x},C_2)\,d\mathbf{x} \le \int_{R_1}\{p(\mathbf{x},C_1)\,p(\mathbf{x},C_2)\}^{1/2}\,d\mathbf{x}$$
The same holds in the decision region $R_2$, and therefore
$$p(\text{mistake}) \le \int \{p(\mathbf{x},C_1)\,p(\mathbf{x},C_2)\}^{1/2}\,d\mathbf{x}$$

Problem 1.22 Solution
We need to understand (1.81) thoroughly. When $L_{kj} = 1 - I_{kj}$:
$$\sum_k L_{kj}\,p(C_k|\mathbf{x}) = \sum_k p(C_k|\mathbf{x}) - p(C_j|\mathbf{x}) = 1 - p(C_j|\mathbf{x})$$
For a given $\mathbf{x}$, the first term on the right-hand side is a constant equal to 1, whichever class $C_j$ we assign $\mathbf{x}$ to. Therefore, to minimize the loss we maximize $p(C_j|\mathbf{x})$, i.e. we assign $\mathbf{x}$ to the class with the largest posterior probability. The interpretation of this loss matrix is simple: a correct label incurs no loss, while any incorrect label incurs the same unit loss regardless of which wrong class is chosen. The loss matrix is shown below for an intuitive view:
$$L = \begin{pmatrix} 0 & 1 & \cdots & 1 \\ 1 & 0 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 0 \end{pmatrix}$$
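A tiny numerical illustration of Prob. 1.22 (not part of the original solution; the posterior values are made up): with $L_{kj} = 1 - I_{kj}$, the class minimizing the expected loss coincides with the class maximizing the posterior.

```python
import numpy as np

# With L_kj = 1 - I_kj, minimizing sum_k L_kj p(C_k|x) over j is equivalent
# to maximizing the posterior p(C_j|x).
posterior = np.array([0.2, 0.5, 0.3])       # p(C_k|x), made-up example values
K = len(posterior)
L = 1.0 - np.eye(K)                         # loss matrix with L_kj = 1 - I_kj
expected_loss = L.T @ posterior             # entry j: sum_k L_kj p(C_k|x)
print(expected_loss)                        # [0.8 0.5 0.7] = 1 - posterior
print(np.argmin(expected_loss) == np.argmax(posterior))   # True
```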
Problem 1.23 Solution
$$E[L] = \sum_k\sum_j\int_{R_j} L_{kj}\,p(\mathbf{x},C_k)\,d\mathbf{x} = \sum_k\sum_j\int_{R_j} L_{kj}\,p(C_k)\,p(\mathbf{x}|C_k)\,d\mathbf{x}$$
If we define a new loss matrix $L^{\star}_{kj} = L_{kj}\,p(C_k)$, this becomes
$$E[L] = \sum_k\sum_j\int_{R_j} L^{\star}_{kj}\,p(\mathbf{x}|C_k)\,d\mathbf{x}$$

Problem 1.24 Solution
The statement of this problem is a little confusing. What it really means is that $\lambda$ is a parameter governing the loss, just as $\theta$ governs the posterior probability $p(C_k|\mathbf{x})$ when the reject option is introduced. From the viewpoint of $\lambda$ and the loss, the decision rule with a reject option can be written as
$$\text{choose}\ \begin{cases} \text{class } C_j & \text{if } \min_l \sum_k L_{kl}\,p(C_k|\mathbf{x}) < \lambda \\ \text{reject} & \text{otherwise} \end{cases}$$
where $C_j$ is the class attaining the minimum. If $L_{kj} = 1 - I_{kj}$, then according to what we proved in Prob. 1.22,
$$\sum_k L_{kj}\,p(C_k|\mathbf{x}) = \sum_k p(C_k|\mathbf{x}) - p(C_j|\mathbf{x}) = 1 - p(C_j|\mathbf{x})$$
so the criterion above is equivalent to requiring the largest posterior probability to exceed $1-\lambda$:
$$\min_l \sum_k L_{kl}\,p(C_k|\mathbf{x}) < \lambda \quad\Longleftrightarrow\quad \max_l p(C_l|\mathbf{x}) > 1-\lambda$$
From the viewpoint of $\theta$ and the posterior probability, we assign a class to $\mathbf{x}$ (i.e. do not reject) when $\max_l p(C_l|\mathbf{x}) > \theta$. Comparing the two views, $\lambda$ and $\theta$ are related by
$$\lambda + \theta = 1$$

Problem 1.25 Solution
We can prove this by handling one dimension at a time, exactly as in (1.87)-(1.89), because the total loss decomposes into a sum of independent losses over the individual target dimensions. Here is a more direct argument. In this case the expected loss is
$$E[L] = \iint \|\mathbf{y}(\mathbf{x})-\mathbf{t}\|^2\,p(\mathbf{x},\mathbf{t})\,d\mathbf{t}\,d\mathbf{x}$$
Following the same steps as in (1.87)-(1.89),
$$\frac{\partial E[L]}{\partial\mathbf{y}(\mathbf{x})} = 2\int\{\mathbf{y}(\mathbf{x})-\mathbf{t}\}\,p(\mathbf{x},\mathbf{t})\,d\mathbf{t} = 0 \quad\Rightarrow\quad \mathbf{y}(\mathbf{x}) = \frac{\int \mathbf{t}\,p(\mathbf{x},\mathbf{t})\,d\mathbf{t}}{p(\mathbf{x})} = E_{\mathbf{t}}[\mathbf{t}|\mathbf{x}]$$

Problem 1.26 Solution
The derivation is identical to the one leading to (1.90), so we do not repeat it here. What should be emphasized is that $E[t|\mathbf{x}]$ is a function of $\mathbf{x}$, not of $t$; this is what makes the cross term vanish after integrating over $t$, which is how (1.90) is obtained. Note: there is a typo in (1.90) -- the second term on the right-hand side is misprinted; compare with (3.37) on page 148. It should read
$$E[L] = \int \{y(\mathbf{x})-E[t|\mathbf{x}]\}^2\,p(\mathbf{x})\,d\mathbf{x} + \iint \{E[t|\mathbf{x}]-t\}^2\,p(\mathbf{x},t)\,d\mathbf{x}\,dt$$

Problem 1.27 Solution
We use the calculus of variations:
$$\frac{\partial E[L_q]}{\partial y(\mathbf{x})} = \int q\,|y(\mathbf{x})-t|^{q-1}\,\mathrm{sign}\big(y(\mathbf{x})-t\big)\,p(\mathbf{x},t)\,dt = 0 \quad\Rightarrow\quad \int_{-\infty}^{y(\mathbf{x})}|y(\mathbf{x})-t|^{q-1}\,p(t|\mathbf{x})\,dt = \int_{y(\mathbf{x})}^{+\infty}|y(\mathbf{x})-t|^{q-1}\,p(t|\mathbf{x})\,dt$$
where we used $p(\mathbf{x},t) = p(t|\mathbf{x})p(\mathbf{x})$ and the property of the sign function. Hence when $q=1$ the equation reduces to
$$\int_{-\infty}^{y(\mathbf{x})} p(t|\mathbf{x})\,dt = \int_{y(\mathbf{x})}^{+\infty} p(t|\mathbf{x})\,dt$$
i.e. the optimal $y(\mathbf{x})$ is the conditional median. The case $q\to 0$ is non-trivial. We rewrite (1.91) as
$$E[L_q] = \int\Big\{\int |y(\mathbf{x})-t|^{q}\,p(t|\mathbf{x})\,dt\Big\}p(\mathbf{x})\,d\mathbf{x} \qquad (*)$$
To minimize $E[L_q]$ we only need to minimize the inner integral of $(*)$ for each $\mathbf{x}$:
$$\int |y(\mathbf{x})-t|^{q}\,p(t|\mathbf{x})\,dt \qquad (**)$$
When $q\to 0$, $|y(\mathbf{x})-t|^{q}$ is close to 1 everywhere except in a small neighbourhood around $t = y(\mathbf{x})$ (this can be seen from Figure 1.29). Therefore
$$(**) \approx \int_U p(t|\mathbf{x})\,dt - \int_\epsilon\big(1-|y(\mathbf{x})-t|^{q}\big)\,p(t|\mathbf{x})\,dt \approx \int_U p(t|\mathbf{x})\,dt - \int_\epsilon p(t|\mathbf{x})\,dt$$
where $\epsilon$ denotes the small neighbourhood and $U$ the whole space in which $t$ lies.
Note that $y(\mathbf{x})$ does not affect the first term, only the second (because the choice of $y(\mathbf{x})$ determines the location of $\epsilon$). Hence we should place $\epsilon$ where $p(t|\mathbf{x})$ attains its largest value, i.e. at the mode, because this yields the largest reduction. It is therefore natural to choose $y(\mathbf{x})$ equal to the value of $t$ that maximizes $p(t|\mathbf{x})$ for every $\mathbf{x}$, i.e. the conditional mode.

Problem 1.28 Solution
This problem concerns the definition of the information content $h(x)$ (in information theory it is often denoted $I(\cdot)$; here we keep $h(\cdot)$ for consistency). Since $h(\cdot)$ is a monotonic function of the probability $p(x)$, we can write
$$h(x) = f(p(x))$$
i.e. the information gained from observing a particular value of a random variable $x$ is related to its probability $p(x)$ through some mapping $f(\cdot)$. Suppose $C$ is the intersection of two independent events $A$ and $B$. The information from observing $C$ is the combined information from observing both $A$ and $B$:
$$h(C) = h(A\cap B) = h(A) + h(B) \qquad (*)$$
Because $A$ and $B$ are independent, $P(C) = P(A)P(B)$, and applying $f(\cdot)$ to both sides gives $f(P(C)) = f(P(A)P(B))$. Combining this with $(*)$,
$$f(P(A)) + f(P(B)) = f\big(P(A)\,P(B)\big)$$
so $f$ satisfies the functional equation $f(xy) = f(x) + f(y)$. Note: what the problem really asks is the form of this function (the $h(p)$ of the problem statement corresponds to our $f$). From $f(xy) = f(x)+f(y)$, choosing $x=y$ gives $f(x^2) = 2f(x)$, and by induction $f(x^{n+1}) = f(x^n)+f(x) = (n+1)f(x)$, so $f(x^n) = nf(x)$ for all $n\in\mathbb{N}$. For a positive integer $m$, writing $x^n = (x^{n/m})^m$ gives $f(x^n) = m f(x^{n/m})$, and since $f(x^n) = nf(x)$ we obtain
$$f\big(x^{n/m}\big) = \frac{n}{m}f(x)$$
For an arbitrary positive real $x$, choose two sequences of positive rationals $\{y_N\}$ increasing to $x$ and $\{z_N\}$ decreasing to $x$. Since $f(\cdot)$ is monotonic,
$$y_N f(p) = f(p^{y_N}) \le f(p^{x}) \le f(p^{z_N}) = z_N f(p)$$
and letting $N\to+\infty$ gives $f(p^x) = xf(p)$ for all $x\in\mathbb{R}^{+}$. Setting $p=e$ gives $f(e^x) = xf(e)$, and writing $y = e^x$,
$$f(y) = f(e)\,\ln(y)$$
where $f(e)$ is a constant once $f(\cdot)$ is fixed. Therefore $f(x)\propto\ln x$, i.e. $h(p)\propto\ln p$.
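The conclusions of Prob. 1.27 for $q=2$ and $q=1$ can be illustrated by a small simulation (a sketch, not part of the original solution; the choice of a Gamma distribution and all numerical values are arbitrary): the empirical minimizer of $E[|y-t|^q]$ over a grid lands near the sample mean for $q=2$ and near the sample median for $q=1$.

```python
import numpy as np

# Minimizer of E[|y - t|^q]: mean for q = 2, median for q = 1
# (and, as argued above, it tends to the mode as q -> 0).
rng = np.random.default_rng(2)
t = rng.gamma(shape=2.0, scale=1.0, size=5000)   # skewed test distribution (assumed)
y_grid = np.linspace(0.0, 8.0, 401)

for q in [2.0, 1.0]:
    risk = np.mean(np.abs(y_grid[:, None] - t[None, :]) ** q, axis=1)
    print(q, y_grid[np.argmin(risk)])            # ~ mean for q=2, ~ median for q=1

print("mean =", t.mean(), " median =", np.median(t))
```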
Problem 1.29 Solution
This problem is a little tricky. The entropy of an $M$-state discrete random variable $x$ can be written as
$$H[x] = -\sum_{i=1}^{M}\lambda_i\ln\lambda_i$$
where $\lambda_i$ is the probability of state $i$. Choosing the concave function $f(\cdot)=\ln(\cdot)$, Jensen's inequality (1.115) reads
$$\ln\Big(\sum_{i=1}^{M}\lambda_i x_i\Big) \ge \sum_{i=1}^{M}\lambda_i\ln(x_i)$$
Choosing $x_i = 1/\lambda_i$ and simplifying gives
$$\ln M \ge -\sum_{i=1}^{M}\lambda_i\ln\lambda_i = H[x]$$

Problem 1.30 Solution
By definition,
$$\ln\frac{p(x)}{q(x)} = \ln\frac{s}{\sigma} - \Big[\frac{(x-\mu)^2}{2\sigma^2} - \frac{(x-m)^2}{2s^2}\Big]$$
We use the following properties of $p(x)=N(x|\mu,\sigma^2)$:
$$\int N(x|\mu,\sigma^2)\,dx = 1, \qquad \int x\,N(x|\mu,\sigma^2)\,dx = \mu, \qquad \int x^2\,N(x|\mu,\sigma^2)\,dx = \mu^2+\sigma^2$$
It follows that
$$KL(p\|q) = -\int p(x)\ln\frac{q(x)}{p(x)}\,dx = \int N(x|\mu,\sigma^2)\,\ln\frac{p(x)}{q(x)}\,dx = \ln\frac{s}{\sigma} + \frac{\sigma^2+(\mu-m)^2}{2s^2} - \frac{1}{2}$$
We discuss this result in more detail. Firstly, if the KL divergence is defined with base-2 logarithms, as in information theory, the first term becomes $\log_2\frac{s}{\sigma}$ instead of $\ln\frac{s}{\sigma}$. Secondly, if we denote $x = \frac{s}{\sigma}$, the KL divergence can be rewritten as
$$KL(p\|q) = \ln x + \frac{1}{2x^2} - \frac{1}{2} + a, \qquad \text{where } a = \frac{(\mu-m)^2}{2s^2}$$
Setting the derivative with respect to $x$ to zero,
$$\frac{d(KL)}{dx} = \frac{1}{x} - x^{-3} = 0 \quad\Rightarrow\quad x = 1 \quad (\because\ s,\sigma>0)$$
For $x<1$ the derivative is negative and for $x>1$ it is positive, so $x=1$ is the global minimum, at which $KL(p\|q)=a$. Moreover, $a$ attains its minimum value $0$ when $\mu=m$. In this way we have shown that the KL divergence between two Gaussian distributions is non-negative, and equals zero only when the two Gaussians are identical, i.e. have the same mean and variance.

Problem 1.31 Solution
We evaluate $H[\mathbf{x}]+H[\mathbf{y}]-H[\mathbf{x},\mathbf{y}]$ by definition. First, using $p(\mathbf{x},\mathbf{y}) = p(\mathbf{x})p(\mathbf{y}|\mathbf{x})$, $\int p(\mathbf{x},\mathbf{y})\,d\mathbf{y} = p(\mathbf{x})$ and (1.111):
$$H[\mathbf{x},\mathbf{y}] = -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x},\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} = -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} - \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y}|\mathbf{x})\,d\mathbf{x}\,d\mathbf{y} = H[\mathbf{x}] + H[\mathbf{y}|\mathbf{x}]$$
(this, incidentally, also solves Prob. 1.37). Continuing,
$$H[\mathbf{x}]+H[\mathbf{y}]-H[\mathbf{x},\mathbf{y}] = H[\mathbf{y}]-H[\mathbf{y}|\mathbf{x}] = -\iint p(\mathbf{x},\mathbf{y})\ln\frac{p(\mathbf{x})p(\mathbf{y})}{p(\mathbf{x},\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y} = KL\big(p(\mathbf{x},\mathbf{y})\,\|\,p(\mathbf{x})p(\mathbf{y})\big) = I(\mathbf{x},\mathbf{y}) \ge 0$$
where we used $p(\mathbf{y}) = \int p(\mathbf{x},\mathbf{y})\,d\mathbf{x}$ and $\frac{p(\mathbf{x})p(\mathbf{y})}{p(\mathbf{x},\mathbf{y})} = \frac{p(\mathbf{y})}{p(\mathbf{y}|\mathbf{x})}$. By the property of the KL divergence, equality holds if and only if $\mathbf{x}$ and $\mathbf{y}$ are statistically independent; in that case one can also verify directly that
$$H[\mathbf{x},\mathbf{y}] = -\iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x})\,d\mathbf{x}\,d\mathbf{y} - \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} = H[\mathbf{x}] + H[\mathbf{y}]$$
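The closed-form KL divergence of Prob. 1.30 can be verified by Monte Carlo (a sketch, not part of the original solution; all parameter values are arbitrary):

```python
import numpy as np

# Check Problem 1.30: for p = N(mu, sigma^2) and q = N(m, s^2),
# KL(p||q) = ln(s/sigma) + (sigma^2 + (mu - m)^2) / (2 s^2) - 1/2.
mu, sigma, m, s = 0.5, 1.0, -0.3, 2.0        # arbitrary test values (assumed)
rng = np.random.default_rng(3)
x = rng.normal(mu, sigma, size=1_000_000)

log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
log_q = -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)
print(np.mean(log_p - log_q))                                            # MC estimate
print(np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5)   # closed form
```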
Problem 1.32 Solution
This is straightforward from the definition; note that when we change variables inside an integral we must introduce the Jacobian determinant. With $\mathbf{y} = A\mathbf{x}$, i.e. $\partial\mathbf{y}/\partial\mathbf{x} = A$, we have $p(\mathbf{x}) = p(\mathbf{y})\,|\partial\mathbf{y}/\partial\mathbf{x}| = p(\mathbf{y})\,|A|$ and $\int p(\mathbf{x})\,d\mathbf{x} = 1$. Hence
$$H[\mathbf{y}] = -\int p(\mathbf{y})\ln p(\mathbf{y})\,d\mathbf{y} = -\int \frac{p(\mathbf{x})}{|A|}\,\ln\frac{p(\mathbf{x})}{|A|}\,\Big|\frac{\partial\mathbf{y}}{\partial\mathbf{x}}\Big|\,d\mathbf{x} = -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} + \int p(\mathbf{x})\ln|A|\,d\mathbf{x} = H[\mathbf{x}] + \ln|A|$$

Problem 1.33 Solution
From the definition of conditional entropy,
$$H[y|x] = -\sum_{x_i}\sum_{y_j} p(x_i,y_j)\ln p(y_j|x_i)$$
Since $0\le p(y_j|x_i)\le 1$ and $0\le p(x_i,y_j)\le 1$, every term satisfies $-p(x_i,y_j)\ln p(y_j|x_i)\ge 0$ when $0<p(y_j|x_i)\le 1$; and when $p(y_j|x_i)=0$, using $\lim_{p\to 0}p\ln p = 0$ we have $-p(x_i,y_j)\ln p(y_j|x_i) = -p(x_i)\,p(y_j|x_i)\ln p(y_j|x_i) = 0$ (treating $p(x_i)$ as a constant factor). So no term can be negative, and $H[y|x]=0$ requires every term to vanish, i.e. for each possible value $x_i$,
$$-\sum_{y_j} p(x_i,y_j)\ln p(y_j|x_i) = 0 \qquad (*)$$
Suppose that for some $x_i$ there were more than one value $y_j$ with $p(y_j|x_i)\neq 0$ (and hence $p(x_i,y_j)\neq 0$, since both values are "possible"). Because $0\le p(y_j|x_i)\le 1$ and $\sum_j p(y_j|x_i)=1$, at least two of these values would satisfy $0<p(y_j|x_i)<1$, which would make $(*)$ strictly positive. Therefore, for each possible value of $x$ there is exactly one $y$ with $p(y|x)\neq 0$; in other words, $y$ is determined by $x$, i.e. $y$ is a function of $x$. Note: this result is intuitive -- if $y$ is a function of $x$, then once $x$ is observed the value of $y$ is known, so observing $y$ provides no additional information.

Problem 1.34 Solution
This problem is somewhat involved, so we explain it in detail. According to Appendix D, relation (D.3) states
$$F[y(x)+\epsilon\eta(x)] = F[y(x)] + \epsilon\int\frac{\partial F}{\partial y}\,\eta(x)\,dx \qquad (*)$$
where $y(x)$ maps an input $x$ to an output value $y$, and the functional $F[y(x)]$ maps the function $y(x)$ to a value. Consider first the functional
$$I[p(x)] = \int p(x)f(x)\,dx$$
Under a small variation $p(x)\to p(x)+\epsilon\eta(x)$ we obtain
$$I[p(x)+\epsilon\eta(x)] = \int p(x)f(x)\,dx + \epsilon\int\eta(x)f(x)\,dx$$
and comparing with $(*)$ we conclude $\dfrac{\partial I}{\partial p(x)} = f(x)$. Similarly, consider the functional
$$J[p(x)] = \int p(x)\ln p(x)\,dx$$
Under a small variation $p(x)\to p(x)+\epsilon\eta(x)$, and noting that $\epsilon\eta(x)$ is much smaller than $p(x)$ so that a Taylor expansion about $p(x)$ gives $\ln\big(p(x)+\epsilon\eta(x)\big) = \ln p(x) + \frac{\epsilon\eta(x)}{p(x)} + O(\epsilon^2)$, we have
$$J[p(x)+\epsilon\eta(x)] = \int p(x)\ln p(x)\,dx + \epsilon\int\eta(x)\big(\ln p(x)+1\big)\,dx + O(\epsilon^2)$$
so $\dfrac{\partial J}{\partial p(x)} = \ln p(x) + 1$. Returning to (1.108): using $\partial J/\partial p(x)$ and $\partial I/\partial p(x)$, we take the functional derivative of the expression just before (1.108) and set it to zero:
$$-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3(x-\mu)^2 = 0$$
Rearranging this gives (1.108).
From (1.108) we see that $p(x)$ takes the form of a Gaussian distribution. Writing it in Gaussian form and comparing with a Gaussian of mean $\mu$ and variance $\sigma^2$, we require
$$\exp(-1+\lambda_1) = \frac{1}{(2\pi\sigma^2)^{1/2}}, \qquad \exp\big(\lambda_2 x + \lambda_3(x-\mu)^2\big) = \exp\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}$$
Therefore
$$\lambda_1 = 1 - \frac{1}{2}\ln(2\pi\sigma^2), \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2}$$

Problem 1.35 Solution
If $p(x) = N(x|\mu,\sigma^2)$, its entropy is
$$H[x] = -\int p(x)\ln p(x)\,dx = -\int p(x)\ln\frac{1}{\sqrt{2\pi\sigma^2}}\,dx - \int p(x)\Big\{-\frac{(x-\mu)^2}{2\sigma^2}\Big\}\,dx = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{\sigma^2}{2\sigma^2} = \frac{1}{2}\big\{1+\ln(2\pi\sigma^2)\big\}$$
where we used the properties $\int p(x)\,dx = 1$ and $\int (x-\mu)^2 p(x)\,dx = \sigma^2$.

Problem 1.36 Solution
We should first make clear that if the second derivative is strictly positive, the function must be strictly convex, but the converse does not hold. For example, $f(x)=x^4$ and $g(x)=x^2$ with $x\in\mathbb{R}$ are both strictly convex by definition, yet the second derivative of $f$ at $x=0$ equals $0$ (see the keyword "convex function" on Wikipedia, or page 71 of Convex Optimization by Boyd and Vandenberghe, for more details). Hence, more precisely, we prove that convexity is equivalent to a non-negative second derivative. Using the Taylor expansions
$$f(x+\epsilon) = f(x) + \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 + \dots, \qquad f(x-\epsilon) = f(x) - \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 - \dots$$
we obtain
$$f''(x) = \lim_{\epsilon\to 0}\frac{f(x+\epsilon)+f(x-\epsilon)-2f(x)}{\epsilon^2}$$
If $f(x)$ is convex, then
$$f(x) = f\Big(\frac{1}{2}(x+\epsilon)+\frac{1}{2}(x-\epsilon)\Big) \le \frac{1}{2}f(x+\epsilon)+\frac{1}{2}f(x-\epsilon)$$
and hence $f''(x)\ge 0$. The converse is a little more involved. Using the Lagrange form of Taylor's theorem,
$$f(x) = f(x_0) + f'(x_0)(x-x_0) + \frac{f''(x^{\star})}{2}(x-x_0)^2$$
where $x^{\star}$ lies between $x$ and $x_0$. By hypothesis $f''\ge 0$, so the last term is non-negative for all $x$. Let $x_0 = \lambda x_1 + (1-\lambda)x_2$. Setting $x=x_1$ gives
$$f(x_1) \ge f(x_0) + (1-\lambda)(x_1-x_2)f'(x_0) \qquad (*)$$
and setting $x=x_2$ gives
$$f(x_2) \ge f(x_0) + \lambda(x_2-x_1)f'(x_0) \qquad (**)$$
Multiplying $(*)$ by $\lambda$ and $(**)$ by $1-\lambda$ and adding them, we obtain
$$\lambda f(x_1) + (1-\lambda)f(x_2) \ge f\big(\lambda x_1 + (1-\lambda)x_2\big)$$

Problem 1.37 Solution
See Prob. 1.31.

Problem 1.38 Solution
When $M=2$, (1.115) reduces to (1.114). Suppose (1.115) holds for $M$; we prove that it also holds for $M+1$:
$$f\Big(\sum_{m=1}^{M+1}\lambda_m x_m\Big) = f\Big(\lambda_{M+1}x_{M+1} + (1-\lambda_{M+1})\sum_{m=1}^{M}\frac{\lambda_m}{1-\lambda_{M+1}}x_m\Big) \le \lambda_{M+1}f(x_{M+1}) + (1-\lambda_{M+1})\,f\Big(\sum_{m=1}^{M}\frac{\lambda_m}{1-\lambda_{M+1}}x_m\Big) \le \lambda_{M+1}f(x_{M+1}) + (1-\lambda_{M+1})\sum_{m=1}^{M}\frac{\lambda_m}{1-\lambda_{M+1}}f(x_m) = \sum_{m=1}^{M+1}\lambda_m f(x_m)$$
where the first inequality is (1.114) and the second is the induction hypothesis (the coefficients $\lambda_m/(1-\lambda_{M+1})$ sum to one). Hence Jensen's inequality (1.115) is proved.
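The Gaussian entropy formula of Prob. 1.35 is easy to confirm by numerical integration (a sketch, not part of the original solution; $\mu$ and $\sigma$ are arbitrary test values):

```python
import numpy as np

# Check Problem 1.35: the differential entropy of N(x|mu, sigma^2)
# equals (1/2) * (1 + ln(2*pi*sigma^2)).
mu, sigma = 0.0, 1.3                     # assumed test values
x = np.linspace(mu - 12 * sigma, mu + 12 * sigma, 400001)
dx = x[1] - x[0]
p = np.exp(-(x - mu)**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
print(-np.sum(p * np.log(p)) * dx)       # numerical entropy
print(0.5 * (1 + np.log(2 * np.pi * sigma**2)))
```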
Problem 1.39 Solution
All quantities follow directly from the definitions, using the joint distribution given in the problem (each of the three non-zero entries has probability $1/3$):
$$H[x] = -\sum_i p(x_i)\ln p(x_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365$$
$$H[y] = -\sum_i p(y_i)\ln p(y_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365$$
$$H[x,y] = -\sum_{i,j} p(x_i,y_j)\ln p(x_i,y_j) = -3\cdot\frac{1}{3}\ln\frac{1}{3} = 1.0986$$
$$H[x|y] = -\sum_{i,j} p(x_i,y_j)\ln p(x_i|y_j) = -\frac{1}{3}\ln 1 - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} = 0.4621$$
$$H[y|x] = -\sum_{i,j} p(x_i,y_j)\ln p(y_j|x_i) = -\frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln 1 = 0.4621$$
$$I[x,y] = -\sum_{i,j} p(x_i,y_j)\ln\frac{p(x_i)p(y_j)}{p(x_i,y_j)} = -\frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{1}{3}}{1/3} - \frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{2}{3}}{1/3} - \frac{1}{3}\ln\frac{\frac{1}{3}\cdot\frac{2}{3}}{1/3} = 0.1744$$
Their relations are given below (diagrams omitted):
$$I[x,y] = H[x]-H[x|y] = H[y]-H[y|x], \qquad H[x,y] = H[y|x]+H[x] = H[x|y]+H[y]$$

Problem 1.40 Solution
$f(x)=\ln x$ is a strictly concave function, so Jensen's inequality gives
$$f\Big(\sum_{m=1}^{M}\lambda_m x_m\Big) \ge \sum_{m=1}^{M}\lambda_m f(x_m)$$
Choosing $\lambda_m = \frac{1}{M}$, $m=1,2,\dots,M$, we obtain
$$\ln\Big(\frac{x_1+x_2+\dots+x_M}{M}\Big) \ge \frac{1}{M}\big[\ln(x_1)+\ln(x_2)+\dots+\ln(x_M)\big] = \ln\big(\sqrt[M]{x_1x_2\cdots x_M}\big)$$
Since $\ln x$ is strictly increasing, this yields
$$\frac{x_1+x_2+\dots+x_M}{M} \ge \sqrt[M]{x_1x_2\cdots x_M}$$

Problem 1.41 Solution
From the definition of $I[\mathbf{x},\mathbf{y}]$, i.e. (1.120), and using $p(\mathbf{x},\mathbf{y}) = p(\mathbf{y})p(\mathbf{x}|\mathbf{y})$ and $\int p(\mathbf{x},\mathbf{y})\,d\mathbf{y} = p(\mathbf{x})$:
$$I[\mathbf{x},\mathbf{y}] = -\iint p(\mathbf{x},\mathbf{y})\ln\frac{p(\mathbf{x})p(\mathbf{y})}{p(\mathbf{x},\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y} = -\iint p(\mathbf{x},\mathbf{y})\ln\frac{p(\mathbf{x})}{p(\mathbf{x}|\mathbf{y})}\,d\mathbf{x}\,d\mathbf{y} = -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} + \iint p(\mathbf{x},\mathbf{y})\ln p(\mathbf{x}|\mathbf{y})\,d\mathbf{x}\,d\mathbf{y} = H[\mathbf{x}] - H[\mathbf{x}|\mathbf{y}]$$
The same procedure proves $I[\mathbf{x},\mathbf{y}] = H[\mathbf{y}]-H[\mathbf{y}|\mathbf{x}]$ if we substitute $p(\mathbf{x})p(\mathbf{y}|\mathbf{x})$ for $p(\mathbf{x},\mathbf{y})$ in the second step.

0.2 Probability Distributions

Problem 2.1 Solution
By definition,
$$\sum_{x_i=0,1} p(x_i) = \mu + (1-\mu) = 1$$
$$E[x] = \sum_{x_i=0,1} x_i\,p(x_i) = 0\cdot(1-\mu) + 1\cdot\mu = \mu$$
$$\mathrm{var}[x] = \sum_{x_i=0,1}(x_i-E[x])^2 p(x_i) = (0-\mu)^2(1-\mu) + (1-\mu)^2\mu = \mu(1-\mu)$$
$$H[x] = -\sum_{x_i=0,1} p(x_i)\ln p(x_i) = -\mu\ln\mu - (1-\mu)\ln(1-\mu)$$

Problem 2.2 Solution
The same method as in Prob. 2.1 applies:
$$\sum_{x_i=-1,1} p(x_i) = \frac{1-\mu}{2} + \frac{1+\mu}{2} = 1$$
$$E[x] = -1\cdot\frac{1-\mu}{2} + 1\cdot\frac{1+\mu}{2} = \mu$$
$$\mathrm{var}[x] = \sum_{x_i=-1,1}(x_i-E[x])^2 p(x_i) = (-1-\mu)^2\,\frac{1-\mu}{2} + (1-\mu)^2\,\frac{1+\mu}{2} = 1-\mu^2$$
$$H[x] = -\sum_{x_i=-1,1} p(x_i)\ln p(x_i) = -\frac{1-\mu}{2}\ln\frac{1-\mu}{2} - \frac{1+\mu}{2}\ln\frac{1+\mu}{2}$$

Problem 2.3 Solution
(2.262) is an important property of combinations, which we have used before, e.g. in Prob. 1.15. We use the "old-fashioned" notation $C_N^m$ for the number of ways of choosing $m$ objects from a total of $N$, i.e.
$$C_N^m = \frac{N!}{m!\,(N-m)!}$$
Evaluating the left-hand side of (2.262):
$$C_N^m + C_N^{m-1} = \frac{N!}{m!\,(N-m)!} + \frac{N!}{(m-1)!\,(N-m+1)!} = \frac{N!}{(m-1)!\,(N-m)!}\Big(\frac{1}{m}+\frac{1}{N-m+1}\Big) = \frac{(N+1)!}{m!\,(N+1-m)!} = C_{N+1}^m$$
To prove (2.263) we prove the more general form
$$(x+y)^N = \sum_{m=0}^{N} C_N^m\,x^m y^{N-m} \qquad (*)$$
which reduces to (2.263) when $y=1$. We proceed by induction: $(*)$ obviously holds for $N=1$. We assume that it holds for $N$ and prove that it also holds for $N+1$.
$$\begin{aligned}(x+y)^{N+1} &= (x+y)\sum_{m=0}^{N} C_N^m x^m y^{N-m} = \sum_{m=0}^{N} C_N^m x^{m+1} y^{N-m} + \sum_{m=0}^{N} C_N^m x^m y^{N+1-m} \\ &= \sum_{m=1}^{N+1} C_N^{m-1} x^m y^{N+1-m} + \sum_{m=0}^{N} C_N^m x^m y^{N+1-m} \\ &= \sum_{m=1}^{N}\big(C_N^{m-1}+C_N^m\big)x^m y^{N+1-m} + x^{N+1} + y^{N+1} = \sum_{m=0}^{N+1} C_{N+1}^m x^m y^{N+1-m}\end{aligned}$$
This proves $(*)$. Setting $y=1$ in $(*)$ proves (2.263), and setting $x=\mu$, $y=1-\mu$ proves (2.264).

Problem 2.4 Solution
The solution is essentially given in the problem statement, but we derive it in a more direct way, starting from the definition:
$$E[m] = \sum_{m=0}^{N} m\,C_N^m\,\mu^m(1-\mu)^{N-m} = \sum_{m=1}^{N}\frac{N!}{(m-1)!\,(N-m)!}\,\mu^m(1-\mu)^{N-m} = N\mu\sum_{m=1}^{N} C_{N-1}^{m-1}\,\mu^{m-1}(1-\mu)^{N-m} = N\mu\big[\mu+(1-\mu)\big]^{N-1} = N\mu$$
A few details: the $m=0$ term does not contribute, so the summation starts from $m=1$; in the second-to-last step we re-indexed with $k=m-1$; and in the last step we used (2.264). With the expectation in hand, the variance follows from $\mathrm{var}[m] = E[m^2]-E[m]^2$:
$$\mathrm{var}[m] = \sum_{m=0}^{N} m^2 C_N^m\,\mu^m(1-\mu)^{N-m} - N\mu\sum_{m=0}^{N} m\,C_N^m\,\mu^m(1-\mu)^{N-m} = N\mu\sum_{m=1}^{N} m\,\mu^{m-1}(1-\mu)^{N-m}\big(C_{N-1}^{m-1}-\mu\,C_N^m\big)$$
Here we use a little trick, $-\mu = -1+(1-\mu)$, together with the property $C_N^m = C_{N-1}^m + C_{N-1}^{m-1}$, which gives $C_{N-1}^{m-1}-\mu C_N^m = (1-\mu)C_N^m - C_{N-1}^m$. Therefore
$$\mathrm{var}[m] = N\mu\Big\{\sum_{m=1}^{N} m\,C_N^m\,\mu^{m-1}(1-\mu)^{N-m+1} - \sum_{m=1}^{N} m\,C_{N-1}^m\,\mu^{m-1}(1-\mu)^{N-m}\Big\} = N\mu\Big\{N(1-\mu)\big[\mu+(1-\mu)\big]^{N-1} - (N-1)(1-\mu)\big[\mu+(1-\mu)\big]^{N-2}\Big\} = N\mu(1-\mu)$$

Problem 2.5 Solution
A hint is already given in the problem statement; we improve it slightly by introducing $t = y+x$ and $x = t\mu$ simultaneously, i.e. we make the change of variables
$$\begin{cases} t = x+y \\ \mu = \frac{x}{x+y}\end{cases} \qquad\text{and}\qquad \begin{cases} x = t\mu \\ y = t(1-\mu)\end{cases}$$
Note that $t\in[0,+\infty)$, $\mu\in(0,1)$, and that changing variables in the integral introduces the Jacobian determinant
$$\frac{\partial(x,y)}{\partial(\mu,t)} = \begin{vmatrix} \frac{\partial x}{\partial\mu} & \frac{\partial x}{\partial t} \\ \frac{\partial y}{\partial\mu} & \frac{\partial y}{\partial t}\end{vmatrix} = \begin{vmatrix} t & \mu \\ -t & 1-\mu\end{vmatrix} = t$$
Now we can calculate the integral:
$$\begin{aligned}\Gamma(a)\Gamma(b) &= \int_0^{+\infty} e^{-x}x^{a-1}\,dx\int_0^{+\infty} e^{-y}y^{b-1}\,dy = \int_0^{+\infty}\int_0^{+\infty} e^{-x-y}x^{a-1}y^{b-1}\,dy\,dx \\ &= \int_0^{1}\int_0^{+\infty} e^{-t}(t\mu)^{a-1}\big(t(1-\mu)\big)^{b-1}\,t\,dt\,d\mu = \int_0^{+\infty} e^{-t}t^{a+b-1}\,dt\cdot\int_0^{1}\mu^{a-1}(1-\mu)^{b-1}\,d\mu = \Gamma(a+b)\int_0^{1}\mu^{a-1}(1-\mu)^{b-1}\,d\mu\end{aligned}$$
Therefore
$$\int_0^{1}\mu^{a-1}(1-\mu)^{b-1}\,d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}$$

Problem 2.6 Solution
We solve this problem directly from the definition.
$$E[\mu] = \int_0^1 \mu\,\mathrm{Beta}(\mu|a,b)\,d\mu = \int_0^1\frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\,\mu^{a}(1-\mu)^{b-1}\,d\mu = \frac{\Gamma(a+b)\Gamma(a+1)}{\Gamma(a+1+b)\Gamma(a)}\int_0^1\mathrm{Beta}(\mu|a+1,b)\,d\mu = \frac{\Gamma(a+b)}{\Gamma(a+1+b)}\cdot\frac{\Gamma(a+1)}{\Gamma(a)} = \frac{a}{a+b}$$
where we used the property $\Gamma(z+1) = z\,\Gamma(z)$. The variance is obtained similarly; we first evaluate $E[\mu^2]$:
$$E[\mu^2] = \int_0^1 \mu^2\,\mathrm{Beta}(\mu|a,b)\,d\mu = \frac{\Gamma(a+b)\Gamma(a+2)}{\Gamma(a+2+b)\Gamma(a)}\int_0^1\mathrm{Beta}(\mu|a+2,b)\,d\mu = \frac{\Gamma(a+b)}{\Gamma(a+2+b)}\cdot\frac{\Gamma(a+2)}{\Gamma(a)} = \frac{a(a+1)}{(a+b)(a+b+1)}$$
Then, using $\mathrm{var}[\mu] = E[\mu^2]-E[\mu]^2$:
$$\mathrm{var}[\mu] = \frac{a(a+1)}{(a+b)(a+b+1)} - \Big(\frac{a}{a+b}\Big)^2 = \frac{ab}{(a+b)^2(a+b+1)}$$

Problem 2.7 Solution
The maximum likelihood estimate of $\mu$, i.e. (2.8), can be written as
$$\mu_{ML} = \frac{m}{m+l}$$
where $m$ is the number of observed 'heads' and $l$ the number of observed 'tails'. The prior mean of $\mu$ is given by (2.15) and the posterior mean by (2.20). We must therefore show that $(m+a)/(m+a+l+b)$ lies between $m/(m+l)$ and $a/(a+b)$. This follows from the identity
$$\frac{m+a}{m+a+l+b} = \lambda\,\frac{a}{a+b} + (1-\lambda)\,\frac{m}{m+l}, \qquad \text{where } \lambda = \frac{a+b}{m+l+a+b}\in(0,1)$$
which exhibits the posterior mean as a convex combination of the prior mean and the maximum likelihood estimate. Note: one can also prove the statement more simply by showing that
$$\Big(\frac{m+a}{m+a+l+b}-\frac{a}{a+b}\Big)\cdot\Big(\frac{m+a}{m+a+l+b}-\frac{m}{m+l}\Big) \le 0$$
which follows by reducing the fractions to a common denominator.

Problem 2.8 Solution
We work from the definitions:
$$E_y\big[E_x[x|y]\big] = \int E_x[x|y]\,p(y)\,dy = \iint x\,p(x|y)\,p(y)\,dx\,dy = \iint x\,p(x,y)\,dx\,dy = \int x\,p(x)\,dx = E[x]$$
(2.271) is more involved, so we calculate each term separately. For the first term on the right-hand side,
$$E_y\big[\mathrm{var}_x[x|y]\big] = \iint\big(x-E_x[x|y]\big)^2 p(x,y)\,dx\,dy = \int x^2 p(x)\,dx - 2\iint x\,E_x[x|y]\,p(x,y)\,dx\,dy + \int E_x[x|y]^2\,p(y)\,dy$$
The middle term can be simplified further:
$$\iint x\,E_x[x|y]\,p(x,y)\,dx\,dy = \int E_x[x|y]\,p(y)\Big(\int x\,p(x|y)\,dx\Big)dy = \int E_x[x|y]^2\,p(y)\,dy$$
Therefore
$$E_y\big[\mathrm{var}_x[x|y]\big] = \int x^2 p(x)\,dx - \int E_x[x|y]^2\,p(y)\,dy \qquad (*)$$
For the second term, using $E_y[E_x[x|y]] = E[x]$:
$$\mathrm{var}_y\big[E_x[x|y]\big] = \int\big(E_x[x|y]-E[x]\big)^2 p(y)\,dy = \int E_x[x|y]^2\,p(y)\,dy - 2E[x]\int E_x[x|y]\,p(y)\,dy + E[x]^2 = \int E_x[x|y]^2\,p(y)\,dy - E[x]^2 \qquad (**)$$
Adding $(*)$ and $(**)$ gives
$$E_y\big[\mathrm{var}_x[x|y]\big] + \mathrm{var}_y\big[E_x[x|y]\big] = E[x^2] - E[x]^2 = \mathrm{var}[x]$$

Problem 2.9 Solution
This problem is involved, but the hint is already given in the problem statement. We begin by integrating (2.272) over $\mu_{M-1}$. (Note: by integrating out $\mu_{M-1}$ we obtain a Dirichlet distribution over $M-1$ variables.)
$$p_{M-1}(\mu_1,\dots,\mu_{M-2}) = C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\int_0^{1-\mu_1-\dots-\mu_{M-2}}\mu_{M-1}^{\alpha_{M-1}-1}\Big(1-\sum_{j=1}^{M-1}\mu_j\Big)^{\alpha_M-1}\,d\mu_{M-1}$$
We change variable to
$$t = \frac{\mu_{M-1}}{1-\mu_1-\dots-\mu_{M-2}}$$
The reason is that $\mu_{M-1}\in[0,\,1-\mu_1-\dots-\mu_{M-2}]$, so after this change of variable $t\in[0,1]$. The expression then simplifies to
$$p_{M-1} = C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\Big(1-\sum_{j=1}^{M-2}\mu_j\Big)^{\alpha_{M-1}+\alpha_M-1}\int_0^1 t^{\alpha_{M-1}-1}(1-t)^{\alpha_M-1}\,dt = C_M\prod_{k=1}^{M-2}\mu_k^{\alpha_k-1}\Big(1-\sum_{j=1}^{M-2}\mu_j\Big)^{\alpha_{M-1}+\alpha_M-1}\,\frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)}$$
Comparing this expression with a normalized Dirichlet distribution over $M-1$ variables, and supposing that (2.272) holds for $M-1$, we obtain
$$C_M\,\frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1}+\alpha_M)} = \frac{\Gamma(\alpha_1+\alpha_2+\dots+\alpha_M)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_{M-2})\Gamma(\alpha_{M-1}+\alpha_M)}$$
and therefore
$$C_M = \frac{\Gamma(\alpha_1+\alpha_2+\dots+\alpha_M)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}$$
as required.

Problem 2.10 Solution
Based on the definition of expectation and (2.38):
$$E[\mu_j] = \int \mu_j\,\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)\,d\boldsymbol\mu = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\int\mu_j\prod_{k=1}^{K}\mu_k^{\alpha_k-1}\,d\boldsymbol\mu = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0+1)}\cdot\frac{\Gamma(\alpha_j+1)}{\Gamma(\alpha_j)} = \frac{\alpha_j}{\alpha_0}$$
The variance is obtained in the same way, starting from $E[\mu_j^2]$:
$$E[\mu_j^2] = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\cdots\Gamma(\alpha_K)}\int\mu_j^2\prod_{k=1}^{K}\mu_k^{\alpha_k-1}\,d\boldsymbol\mu = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_0+2)}\cdot\frac{\Gamma(\alpha_j+2)}{\Gamma(\alpha_j)} = \frac{\alpha_j(\alpha_j+1)}{\alpha_0(\alpha_0+1)}$$
Hence
$$\mathrm{var}[\mu_j] = E[\mu_j^2]-E[\mu_j]^2 = \frac{\alpha_j(\alpha_j+1)}{\alpha_0(\alpha_0+1)} - \Big(\frac{\alpha_j}{\alpha_0}\Big)^2 = \frac{\alpha_j(\alpha_0-\alpha_j)}{\alpha_0^2(\alpha_0+1)}$$
The covariance is similar. For $j\neq l$,
$$\mathrm{cov}[\mu_j\mu_l] = \int(\mu_j-E[\mu_j])(\mu_l-E[\mu_l])\,\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)\,d\boldsymbol\mu = \frac{\Gamma(\alpha_0)\Gamma(\alpha_j+1)\Gamma(\alpha_l+1)}{\Gamma(\alpha_j)\Gamma(\alpha_l)\Gamma(\alpha_0+2)} - E[\mu_j]E[\mu_l] = \frac{\alpha_j\alpha_l}{\alpha_0(\alpha_0+1)} - \frac{\alpha_j\alpha_l}{\alpha_0^2} = -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0+1)} \qquad (j\neq l)$$
Note: when $j=l$, $\mathrm{cov}[\mu_j\mu_l]$ reduces to $\mathrm{var}[\mu_j]$; however we cannot simply set $l=j$ in the expression for $\mathrm{cov}[\mu_j\mu_l]$ to obtain the correct result, because in that case $\int\mu_j\mu_l\,\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)\,d\boldsymbol\mu$ reduces to $\int\mu_j^2\,\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)\,d\boldsymbol\mu$.
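The Dirichlet moments derived in Prob. 2.10 can be checked by sampling (a sketch, not part of the original solution; the concentration parameters are arbitrary test values):

```python
import numpy as np

# Check Problem 2.10: Dirichlet mean, variance and covariance.
alpha = np.array([2.0, 3.0, 5.0])             # assumed test concentration parameters
a0 = alpha.sum()
rng = np.random.default_rng(4)
mu = rng.dirichlet(alpha, size=500_000)

print(mu.mean(axis=0), alpha / a0)                                 # means
print(mu.var(axis=0), alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))   # variances
print(np.cov(mu[:, 0], mu[:, 1])[0, 1],                            # cov(mu_1, mu_2)
      -alpha[0] * alpha[1] / (a0**2 * (a0 + 1)))
```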
Problem 2.11 Solution
From the definition of expectation and (2.38), first denote
$$K(\boldsymbol\alpha) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)}$$
Then
$$\frac{\partial\,\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)}{\partial\alpha_j} = \frac{\partial K(\boldsymbol\alpha)}{\partial\alpha_j}\prod_{i=1}^{K}\mu_i^{\alpha_i-1} + \ln\mu_j\cdot\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)$$
Integrating both sides over $\boldsymbol\mu$, the left-hand side becomes $\frac{\partial}{\partial\alpha_j}\int\mathrm{Dir}(\boldsymbol\mu|\boldsymbol\alpha)\,d\boldsymbol\mu = \frac{\partial 1}{\partial\alpha_j} = 0$, while the right-hand side becomes
$$\frac{\partial K(\boldsymbol\alpha)}{\partial\alpha_j}\,\frac{1}{K(\boldsymbol\alpha)} + E[\ln\mu_j] = \frac{\partial\ln K(\boldsymbol\alpha)}{\partial\alpha_j} + E[\ln\mu_j]$$
Therefore
$$E[\ln\mu_j] = -\frac{\partial\ln K(\boldsymbol\alpha)}{\partial\alpha_j} = -\frac{\partial}{\partial\alpha_j}\Big\{\ln\Gamma(\alpha_0) - \sum_{i=1}^{K}\ln\Gamma(\alpha_i)\Big\} = \frac{\partial\ln\Gamma(\alpha_j)}{\partial\alpha_j} - \frac{\partial\ln\Gamma(\alpha_0)}{\partial\alpha_0}\,\frac{\partial\alpha_0}{\partial\alpha_j} = \psi(\alpha_j) - \psi(\alpha_0)$$

Problem 2.12 Solution
Since $\int_a^b\frac{1}{b-a}\,dx = 1$, the distribution is normalized. Its mean is
$$E[x] = \int_a^b\frac{x}{b-a}\,dx = \frac{x^2}{2(b-a)}\Big|_a^b = \frac{a+b}{2}$$
and its variance is
$$\mathrm{var}[x] = E[x^2]-E[x]^2 = \int_a^b\frac{x^2}{b-a}\,dx - \Big(\frac{a+b}{2}\Big)^2 = \frac{x^3}{3(b-a)}\Big|_a^b - \Big(\frac{a+b}{2}\Big)^2 = \frac{(b-a)^2}{12}$$

Problem 2.13 Solution
This problem is an extension of Prob. 1.30, and the same procedure applies. We begin with
$$\ln\frac{p(\mathbf{x})}{q(\mathbf{x})} = \frac{1}{2}\ln\frac{|L|}{|\Sigma|} + \frac{1}{2}(\mathbf{x}-\mathbf{m})^T L^{-1}(\mathbf{x}-\mathbf{m}) - \frac{1}{2}(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)$$
For $\mathbf{x}\sim p(\mathbf{x}) = N(\boldsymbol\mu,\Sigma)$ we use the properties
$$\int p(\mathbf{x})\,d\mathbf{x} = 1, \qquad E[\mathbf{x}] = \boldsymbol\mu, \qquad E\big[(\mathbf{x}-\mathbf{a})^T A(\mathbf{x}-\mathbf{a})\big] = \mathrm{tr}(A\Sigma) + (\boldsymbol\mu-\mathbf{a})^T A(\boldsymbol\mu-\mathbf{a})$$
and obtain
$$KL(p\|q) = \frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2}E\big[(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\big] + \frac{1}{2}E\big[(\mathbf{x}-\mathbf{m})^T L^{-1}(\mathbf{x}-\mathbf{m})\big] = \frac{1}{2}\Big[\ln\frac{|L|}{|\Sigma|} - D + \mathrm{tr}\{L^{-1}\Sigma\} + (\mathbf{m}-\boldsymbol\mu)^T L^{-1}(\mathbf{m}-\boldsymbol\mu)\Big]$$

Problem 2.14 Solution
The hint given in the problem leads to a rather tedious calculation, so here we use a simpler method based on the properties of the Kullback-Leibler divergence. Let $g(\mathbf{x})$ be a Gaussian PDF with mean $\boldsymbol\mu$ and covariance $\Sigma$, and $f(\mathbf{x})$ an arbitrary PDF with the same mean and covariance. Then
$$0 \le KL(f\|g) = -\int f(\mathbf{x})\ln\frac{g(\mathbf{x})}{f(\mathbf{x})}\,d\mathbf{x} = -H(f) - \int f(\mathbf{x})\ln g(\mathbf{x})\,d\mathbf{x} \qquad (*)$$
We calculate the second term, using $\int f(\mathbf{x})\,d\mathbf{x} = 1$ and the identity $E\big[(\mathbf{x}-\mathbf{a})^T A(\mathbf{x}-\mathbf{a})\big] = \mathrm{tr}(A\Sigma) + (\boldsymbol\mu-\mathbf{a})^T A(\boldsymbol\mu-\mathbf{a})$ under $f$ (which holds since $f$ has mean $\boldsymbol\mu$ and covariance $\Sigma$; see also Prob. 2.15):
$$\int f(\mathbf{x})\ln g(\mathbf{x})\,d\mathbf{x} = \int f(\mathbf{x})\ln\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\,d\mathbf{x} - \frac{1}{2}E_f\big[(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\big] = \ln\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} - \frac{1}{2}\mathrm{tr}\{I_D\} = -\Big\{\frac{1}{2}\ln|\Sigma| + \frac{D}{2}\big(1+\ln(2\pi)\big)\Big\} = -H(g)$$
Substituting this into $(*)$ gives $H(g)\ge H(f)$. In other words, an arbitrary PDF $f(\mathbf{x})$ with the same mean and covariance as a Gaussian $g(\mathbf{x})$ cannot have larger entropy than the Gaussian.
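The maximum-entropy property of Prob. 2.14 can be illustrated in one dimension by comparing closed-form entropies of a few densities with the same variance (a sketch, not part of the original solution; the value of $\sigma$ and the choice of comparison densities are ours):

```python
import numpy as np

# Among densities with a fixed variance sigma^2, the Gaussian has the largest
# differential entropy. Compare with a Laplace and a uniform of equal variance.
sigma = 1.7                                       # assumed test value
h_gauss = 0.5 * np.log(2 * np.pi * np.e * sigma**2)
h_laplace = 1 + np.log(2 * (sigma / np.sqrt(2)))  # Laplace scale b = sigma / sqrt(2)
h_uniform = np.log(np.sqrt(12) * sigma)           # uniform of width sqrt(12) * sigma
print(h_gauss, h_laplace, h_uniform)              # h_gauss is the largest
```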
Problem 2.15 Solution
We have already used the result of this problem in Prob. 2.14; now we prove it. Suppose $\mathbf{x}\sim p(\mathbf{x}) = N(\boldsymbol\mu,\Sigma)$:
$$H[\mathbf{x}] = -\int p(\mathbf{x})\ln p(\mathbf{x})\,d\mathbf{x} = -\int p(\mathbf{x})\ln\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\,d\mathbf{x} + \frac{1}{2}E\big[(\mathbf{x}-\boldsymbol\mu)^T\Sigma^{-1}(\mathbf{x}-\boldsymbol\mu)\big] = \frac{1}{2}\ln\big((2\pi)^D|\Sigma|\big) + \frac{1}{2}\mathrm{tr}\{I_D\} = \frac{1}{2}\ln|\Sigma| + \frac{D}{2}\big(1+\ln(2\pi)\big)$$
where we used $\int p(\mathbf{x})\,d\mathbf{x} = 1$ and $E\big[(\mathbf{x}-\mathbf{a})^T A(\mathbf{x}-\mathbf{a})\big] = \mathrm{tr}(A\Sigma) + (\boldsymbol\mu-\mathbf{a})^T A(\boldsymbol\mu-\mathbf{a})$. Note: this calculation was essentially already carried out in Prob. 2.14; simply replace the integrand $f(\mathbf{x})\ln g(\mathbf{x})$ there by $g(\mathbf{x})\ln g(\mathbf{x})$ and the same steps go through.

Problem 2.16 Solution
Let us first establish a more general result about the probability density function of the sum of two independent random variables. Let $X$ and $Y$ be random variables and $Z = X+Y$, with $f(\cdot)$ denoting a PDF and $F(\cdot)$ a cumulative distribution function (CDF). For any real $z$,
$$F_Z(z) = P(Z<z) = \iint_{x+y\le z} f_{X,Y}(x,y)\,dx\,dy = \int_{-\infty}^{+\infty}\Big[\int_{-\infty}^{z-y} f_{X,Y}(x,y)\,dx\Big]dy$$
Fixing $z$ and $y$, the change of variable $x = u-y$ in the inner integral and a reordering of the integrals give
$$F_Z(z) = \int_{-\infty}^{+\infty}\Big[\int_{-\infty}^{z} f_{X,Y}(u-y,y)\,du\Big]dy = \int_{-\infty}^{z}\Big[\int_{-\infty}^{+\infty} f_{X,Y}(u-y,y)\,dy\Big]du$$
where $f_{X,Y}(\cdot)$ is the joint PDF of $X$ and $Y$. Comparing with the definition of the CDF, $F_Z(z) = \int_{-\infty}^{z} f_Z(u)\,du$, we obtain
$$f_Z(u) = \int_{-\infty}^{+\infty} f_{X,Y}(u-y,y)\,dy$$
If $X$ and $Y$ are independent, i.e. $f_{X,Y}(x,y) = f_X(x)f_Y(y)$, this simplifies to
$$f_Z(u) = \int_{-\infty}^{+\infty} f_X(u-y)\,f_Y(y)\,dy, \qquad\text{i.e.}\quad f_Z = f_X * f_Y$$
So the PDF of the sum of two independent random variables is the convolution of their PDFs; in this problem, where $x = x_1+x_2$, the PDF of $x$ is the convolution of the PDFs of $x_1$ and $x_2$. To find the entropy of $x$ we use a shortcut based on (2.113)-(2.117). With
$$p(x_2) = N(\mu_2,\tau_2^{-1}), \qquad p(x|x_2) = N(\mu_1+x_2,\tau_1^{-1})$$
we identify $x_2$ here with $\mathbf{x}$ in (2.113) and $x$ here with $\mathbf{y}$ in (2.114). By (2.115), $p(x)$ is again a Gaussian, and since the entropy of a Gaussian depends only on its variance there is no need to compute the mean. By (2.115) the variance of $x$ is $\tau_1^{-1}+\tau_2^{-1}$, so its entropy is
$$H[x] = \frac{1}{2}\Big[1+\ln 2\pi\big(\tau_1^{-1}+\tau_2^{-1}\big)\Big]$$

Problem 2.17 Solution
This is an extension of Prob. 1.14, and the same procedure applies. Suppose an arbitrary precision matrix $\Lambda$ is written as $\Lambda^S+\Lambda^A$, where
$$\Lambda^S_{ij} = \frac{\Lambda_{ij}+\Lambda_{ji}}{2}, \qquad \Lambda^A_{ij} = \frac{\Lambda_{ij}-\Lambda_{ji}}{2}$$
It is then immediate that $\Lambda^S_{ij} = \Lambda^S_{ji}$ and $\Lambda^A_{ij} = -\Lambda^A_{ji}$.
If we expand the quadratic form in the exponent, we obtain:

$$(\mathbf{x}-\boldsymbol{\mu})^T \Lambda (\mathbf{x}-\boldsymbol{\mu}) = \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\Lambda_{ij}(x_j-\mu_j) \qquad (*)$$

It is then straightforward that:

$$(*) = \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\Lambda^S_{ij}(x_j-\mu_j) + \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\Lambda^A_{ij}(x_j-\mu_j) = \sum_{i=1}^{D}\sum_{j=1}^{D}(x_i-\mu_i)\Lambda^S_{ij}(x_j-\mu_j)$$

since the antisymmetric part cancels in the double sum. Therefore, we can assume the precision matrix to be symmetric, and so is the covariance matrix.

Problem 2.18 Solution

We will just follow the hint given in the problem. Firstly, we take the complex conjugate on both sides of (2.45):

$$\Sigma \mathbf{u}_i = \lambda_i \mathbf{u}_i \;\Rightarrow\; \Sigma \bar{\mathbf{u}}_i = \bar{\lambda}_i \bar{\mathbf{u}}_i$$

where we have used the fact that $\Sigma$ is a real matrix, i.e. $\bar{\Sigma} = \Sigma$. Then, using that $\Sigma$ is symmetric, i.e. $\Sigma^T = \Sigma$:

$$\bar{\mathbf{u}}_i^T \Sigma \mathbf{u}_i = \bar{\mathbf{u}}_i^T (\Sigma \mathbf{u}_i) = \lambda_i \bar{\mathbf{u}}_i^T \mathbf{u}_i, \qquad \bar{\mathbf{u}}_i^T \Sigma \mathbf{u}_i = (\Sigma \bar{\mathbf{u}}_i)^T \mathbf{u}_i = \bar{\lambda}_i \bar{\mathbf{u}}_i^T \mathbf{u}_i$$

Since $\mathbf{u}_i \neq 0$, we have $\bar{\mathbf{u}}_i^T \mathbf{u}_i \neq 0$. Thus $\bar{\lambda}_i = \lambda_i$, which means $\lambda_i$ is real. Next we prove that two eigenvectors corresponding to different eigenvalues are orthogonal:

$$\lambda_i \langle \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \lambda_i \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \Sigma \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \mathbf{u}_i, \Sigma^T \mathbf{u}_j \rangle = \lambda_j \langle \mathbf{u}_i, \mathbf{u}_j \rangle$$

where we have used $\Sigma^T = \Sigma$ and the fact that, for an arbitrary real matrix $A$ and vectors $\mathbf{x}, \mathbf{y}$, we have $\langle A\mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{x}, A^T \mathbf{y} \rangle$. Provided $\lambda_i \neq \lambda_j$, we have $\langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0$, i.e. $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal. Finally, if we normalise every eigenvector so that its Euclidean norm equals 1, (2.46) follows directly. By normalisation we mean multiplying the eigenvector by a real number $a$ so that its length becomes 1; note that this rescaling leaves the corresponding eigenvalue unchanged, since $\Sigma(a\mathbf{u}_i) = a\Sigma\mathbf{u}_i = \lambda_i (a\mathbf{u}_i)$.

Problem 2.19 Solution

For every real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen to be orthonormal. Thus a real symmetric matrix $\Sigma$ can be decomposed as $\Sigma = U \Lambda U^T$, where $U$ is an orthogonal matrix whose columns are the eigenvectors $\mathbf{u}_k$, and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $\Sigma$. Hence for an arbitrary vector $\mathbf{x}$ we have:

$$\Sigma \mathbf{x} = U \Lambda U^T \mathbf{x} = U \Lambda \begin{pmatrix} \mathbf{u}_1^T \mathbf{x} \\ \vdots \\ \mathbf{u}_D^T \mathbf{x} \end{pmatrix} = U \begin{pmatrix} \lambda_1 \mathbf{u}_1^T \mathbf{x} \\ \vdots \\ \lambda_D \mathbf{u}_D^T \mathbf{x} \end{pmatrix} = \Big( \sum_{k=1}^{D} \lambda_k \mathbf{u}_k \mathbf{u}_k^T \Big) \mathbf{x}$$

which proves (2.48). And since $\Sigma^{-1} = U \Lambda^{-1} U^T$, the same procedure can be used to prove (2.49).

Problem 2.20 Solution

Since $\mathbf{u}_1, \mathbf{u}_2, ..., \mathbf{u}_D$ constitute a basis for $\mathbb{R}^D$, we can expand $\mathbf{a}$ as:

$$\mathbf{a} = a_1 \mathbf{u}_1 + a_2 \mathbf{u}_2 + ... + a_D \mathbf{u}_D$$

We substitute this expression into $\mathbf{a}^T \Sigma \mathbf{a}$ and use the orthonormality property $\mathbf{u}_i^T \mathbf{u}_j = \delta_{ij}$ together with $\Sigma \mathbf{u}_j = \lambda_j \mathbf{u}_j$, which gives:

$$\mathbf{a}^T \Sigma \mathbf{a} = (a_1 \mathbf{u}_1^T + ... + a_D \mathbf{u}_D^T)\,\Sigma\,(a_1 \mathbf{u}_1 + ... + a_D \mathbf{u}_D) = (a_1 \mathbf{u}_1^T + ... + a_D \mathbf{u}_D^T)(a_1 \lambda_1 \mathbf{u}_1 + ... + a_D \lambda_D \mathbf{u}_D) = \lambda_1 a_1^2 + \lambda_2 a_2^2 + ... + \lambda_D a_D^2$$

Since $\mathbf{a}$ is real, this expression is strictly positive for any nonzero $\mathbf{a}$ if all eigenvalues are strictly positive. Conversely, if some eigenvalue $\lambda_i$ is zero or negative, there exists a vector $\mathbf{a}$ (e.g. $\mathbf{a} = \mathbf{u}_i$) for which the expression is not greater than 0. Thus, having all eigenvalues strictly positive is a necessary and sufficient condition for a real symmetric matrix to be positive definite.

Problem 2.21 Solution

It is straightforward. For a symmetric matrix $\Lambda$ of size $D \times D$, once the lower triangular part is specified, the whole matrix is determined by symmetry. Hence the number of independent parameters is $D + (D-1) + ... + 1$, which equals $D(D+1)/2$.
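The facts established in Prob. 2.18 to Prob. 2.20, namely real eigenvalues, orthonormal eigenvectors, the expansions (2.48) and (2.49), and the positive-definiteness criterion, can be verified numerically for a concrete matrix. This is an illustrative sketch only, assuming NumPy; the size D = 4, the seed and the construction of the test matrix are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4

# A random symmetric positive-definite "covariance" matrix
A = rng.normal(size=(D, D))
Sigma = A @ A.T + D * np.eye(D)

# Prob. 2.18 / 2.19: real eigenvalues, orthonormal eigenvectors, spectral expansions
lam, U = np.linalg.eigh(Sigma)             # eigh is the routine for symmetric matrices
assert np.allclose(U.T @ U, np.eye(D))     # eigenvectors are orthonormal, cf. (2.46)
assert np.allclose(Sigma, (U * lam) @ U.T)                 # (2.48): Sigma = sum_k lam_k u_k u_k^T
assert np.allclose(np.linalg.inv(Sigma), (U / lam) @ U.T)  # (2.49): Sigma^{-1} = sum_k lam_k^{-1} u_k u_k^T

# Prob. 2.20: a^T Sigma a = sum_k lam_k a_k^2 > 0 for any nonzero a
a = rng.normal(size=D)
assert np.allclose(a @ Sigma @ a, np.sum(lam * (U.T @ a) ** 2))
print("all eigen-identities verified; eigenvalues:", lam)
```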
Problem 2.22 Solution Suppose A is a symmetric matrix, and we need to prove that A −1 is also symmetric, i.e., A −1 = ( A −1 )T . Since identity matrix I is also symmetric, we have : A A −1 = ( A A −1 )T And since AB T = B T A T holds for arbitrary matrix A and B, we will obtain : T A A −1 = ( A −1 ) A T Since A = A T , we substitute the right side: T A A −1 = ( A −1 ) A And note that A A −1 = A −1 A = I , we rearrange the order of the left side : T A −1 A = ( A −1 ) A Finally, by multiplying A −1 to both sides, we can obtain: T A −1 A A −1 = ( A −1 ) A A −1 Using A A −1 = I , we will get what we are asked : A −1 = ( A −1 ) T Problem 2.23 Solution Let’s reformulate the problem. What the problem wants us to prove is that if ( x − µ)T Σ−1 ( x − µ) = r 2 , where r 2 is a constant, we will have the volume of the hyperellipsoid decided by the equation above will equal to VD |Σ|1/2 r D . Note that the center of this hyperellipsoid locates at µ, and a translation operation won’t change its volume, thus we only need to prove that the volume of a hyperellipsoid decided by xT Σ−1 x = r 2 , whose center locates at 0 equals to VD |Σ|1/2 r D . This problem can be viewed as two parts. Firstly, let’s discuss about VD , the volume of a unit sphere in dimension D . The expression of VD has already be given in the solution procedure of Prob.1.18, i.e., (1.144) : VD = SD 2πD /2 = D Γ( D2 + 1) And also in the procedure, we show that a D dimensional sphere with radius r , i.e., xT x = r 2 , has volume V ( r ) = VD r D . We move a step forward: we 47 perform a linear transform using matrix Σ1/2 , i.e., yT y = r 2 , where y = Σ1/2 x. After the linear transformation, we actually get a hyperellipsoid whose center locates at 0, and its volume is given by multiplying V ( r ) with the determinant of the transformation matrix, which gives |Σ|1/2 VD r D , just as required. Problem 2.24 Solution We just following the hint, and firstly let’s calculate : [ A C ] [ × B D M −D −1 CM − MBD −1 −1 D + D −1 CMBD −1 ] The result can also be partitioned into four blocks. The block located at left top equals to : AM − BD −1 CM = ( A − BD −1 C )( A − BD −1 C )−1 = I Where we have taken advantage of (2.77). And the right top equals to : − AMBD −1 + BD −1 + BD −1 CMBD −1 = ( I − AM + BD −1 CM )BD −1 = 0 Where we have used the result of the left top block. And the left bottom equals to : CM − DD −1 CM = 0 And the right bottom equals to : −CMBD −1 + DD −1 + DD −1 CMDD −1 = I we have proved what we are asked. Note: if you want to be more precise, you should also multiply the block matrix on the right side of (2.76) and then prove that it will equal to a identity matrix. However, the procedure above can be also used there, so we omit the proof and what’s more, if two arbitrary square matrix X and Y satisfied X Y = I , it can be shown that Y X = I also holds. Problem 2.25 Solution We will take advantage of the result of (2.94)-(2.98). 
Let’s first begin by grouping xa and x b together, and then we rewrite what has been given as : ( x= xa,b xc ) ( µ= µa,b µc ) [ Σ= Σ(a,b)(a,b) Σ(a,b) c Then we take advantage of (2.98), we can obtain : p( xa,b ) = N ( xa,b |µa,b , Σ(a,b)(a,b) ) Σ(a,b) c Σ cc ] 48 Where we have defined: ) ( µa µa,b = µb [ Σ(a,b)(a,b) = Σaa Σba Σab Σbb ] Since now we have obtained the joint contribution of xa and x b , we will take advantage of (2.96) (2.97) to obtain conditional distribution, which gives: 1 p( xa | x b ) = N ( x|µa|b , Λ− aa ) Where we have defined 1 µa|b = µa − Λ− aa Λab ( x b − µ b ) 1 And the expression of Λ− aa and Λab can be given by using (2.76) and (2.77) once we notice that the following relation exits: [ Λaa Λba Λab Λbb ] [ = Σaa Σba Σab Σbb ]−1 Problem 2.26 Solution This problem is quite straightforward, if we just follow the hint. ( ) ( A + BCD ) A −1 − A −1 B(C −1 + D A −1 B)−1 D A −1 = A A −1 − A A −1 B(C −1 + D A −1 B)−1 D A −1 + BCD A −1 − BCD A −1 B(C −1 + D A −1 B)−1 D A −1 = I − B(C −1 + D A −1 B)−1 D A −1 + BCD A −1 + B(C −1 + D A −1 B)−1 D A −1 − BCD A −1 =I Where we have taken advantage of − BCD A −1 B(C −1 + D A −1 B)−1 D A −1 = −BC (−C −1 + C −1 + D A −1 B)(C −1 + D A −1 B)−1 D A −1 = (−BC )(−C −1 )(C −1 + D A −1 B)−1 D A −1 + (−BC )(C −1 + D A −1 B)(C −1 + D A −1 B)−1 D A −1 = B(C −1 + D A −1 B)−1 D A −1 − BCD A −1 Here we will also directly calculate the inverse matrix instead to give another solution. Let’s first begin by introducing two useful formulas. ( I + P )−1 = ( I + P )−1 ( I + P − P ) = I − ( I + P )−1 P And since P + PQP = P ( I + QP ) = ( I + PQ )P 49 The second formula is : ( I + PQ )−1 P = P ( I + QP )−1 And now let’s directly calculate ( A + BCD )−1 : ( A + BCD )−1 = [ A ( I + A −1 BCD )]−1 = ( I + A −1 BCD )−1 A −1 [ ] = I − ( I + A −1 BCD )−1 A −1 BCD A −1 = A −1 − ( I + A −1 BCD )−1 A −1 BCD A −1 Where we have assumed that A is invertible and also used the first formula we introduced. Then we also assume that C is invertible and recursively use the second formula : ( A + BCD )−1 = A −1 − ( I + A −1 BCD )−1 A −1 BCD A −1 = A −1 − A −1 ( I + BCD A −1 )−1 BCD A −1 = = A −1 − A −1 B( I + CD A −1 B)−1 CD A −1 [ ]−1 CD A −1 A −1 − A −1 B C (C −1 + D A −1 B) = A −1 − A −1 B(C −1 + D A −1 B)−1 C −1 CD A −1 = A −1 − A −1 B(C −1 + D A −1 B)−1 D A −1 Just as required. Problem 2.27 Solution The same procedure used in Prob.1.10 can be used here similarly. ∫ ∫ E[ x + z ] = ( x + z) p( x, z) dx dz ∫ ∫ = ( x + z) p( x) p( z) dx dz ∫ ∫ ∫ ∫ = xp( x) p( z) dx dz + z p( x) p( z) dx dz ∫ ∫ ∫ ∫ = ( p( z) dz) xp( x) dx + ( p( x) dx) z p( z) dz ∫ ∫ = xp( x) dx + z p( z) dz = E[ x ] + E[ z ] And for covariance matrix, we will use matrix integral : ∫ ∫ cov[ x + z] = ( x + z − E[ x + z] )( x + z − E[ x + z] )T p( x, z) dx dz Also the same procedure can be used here. We omit the proof for simplicity. 50 Problem 2.28 Solution It is quite straightforward when we compare the problem with (2.94)(2.98). We treat x in (2.94) as z in this problem, xa in (2.94) as x in this problem, x b in (2.94) as y in this problem. In other words, we rewrite the problem in the form of (2.94)-(2.98), which gives : ( z= x y ) ( E( z ) = µ Aµ + b ) [ cov( z) = Λ−1 A Λ−1 Λ−1 A T −1 L + A Λ−1 A T ] By using (2.98), we can obtain: p( x) = N ( x|µ, Λ−1 ) And by using (2.96) and (2.97), we can obtain : p( y| x) = N ( y|µ y| x , Λ−yy1 ) Where Λ yy can be obtained by the right bottom part of (2.104),which gives Λ yy = L−1 , and you can also calculate it using (2.105) combined with (2.78) and (2.79). 
Finally, the conditional mean is given by (2.97):

$$\boldsymbol{\mu}_{y|x} = A\boldsymbol{\mu} + \mathbf{b} - L^{-1}(-LA)(\mathbf{x} - \boldsymbol{\mu}) = A\mathbf{x} + \mathbf{b}$$

Problem 2.29 Solution

It is straightforward. Firstly, we calculate the top left block:

$$\text{top left} = \big[ (\Lambda + A^T L A) - (-A^T L)(L^{-1})(-LA) \big]^{-1} = \Lambda^{-1}$$

And then the top right block:

$$\text{top right} = -\Lambda^{-1}(-A^T L)L^{-1} = \Lambda^{-1} A^T$$

And then the bottom left block:

$$\text{bottom left} = -L^{-1}(-LA)\Lambda^{-1} = A\Lambda^{-1}$$

Finally the bottom right block:

$$\text{bottom right} = L^{-1} + L^{-1}(-LA)\Lambda^{-1}(-A^T L)L^{-1} = L^{-1} + A\Lambda^{-1}A^T$$

Problem 2.30 Solution

It is straightforward by multiplying (2.105) and (2.107), which gives:

$$\begin{pmatrix} \Lambda^{-1} & \Lambda^{-1}A^T \\ A\Lambda^{-1} & L^{-1} + A\Lambda^{-1}A^T \end{pmatrix} \begin{pmatrix} \Lambda\boldsymbol{\mu} - A^T L \mathbf{b} \\ L\mathbf{b} \end{pmatrix} = \begin{pmatrix} \boldsymbol{\mu} \\ A\boldsymbol{\mu} + \mathbf{b} \end{pmatrix}$$

Just as required in the problem.

Problem 2.31 Solution

According to the problem, we can write two expressions:

$$p(\mathbf{x}) = \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_x, \Sigma_x), \qquad p(\mathbf{y}|\mathbf{x}) = \mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_z + \mathbf{x}, \Sigma_z)$$

By comparing the expressions above with (2.113)-(2.117), we can write down the expression for $p(\mathbf{y})$:

$$p(\mathbf{y}) = \mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_x + \boldsymbol{\mu}_z, \Sigma_x + \Sigma_z)$$

Problem 2.32 Solution

Let's make this problem clearer. The derivation in the main text, i.e. (2.101)-(2.110), first defines a new random variable $\mathbf{z}$ corresponding to the joint distribution; then, by completing the square with respect to $\mathbf{z}$, i.e. (2.103), it obtains the precision matrix $R$ by comparison with the PDF of a multivariate Gaussian; it then inverts the precision matrix to obtain the covariance matrix, and finally uses the linear term, i.e. (2.106), to calculate the mean. In this problem we are asked to proceed from another perspective: we write down the joint distribution $p(\mathbf{x}, \mathbf{y})$ and then integrate over $\mathbf{x}$ to obtain the marginal distribution $p(\mathbf{y})$. Let's begin by writing the quadratic form in the exponential of $p(\mathbf{x}, \mathbf{y})$:

$$-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Lambda (\mathbf{x} - \boldsymbol{\mu}) - \frac{1}{2}(\mathbf{y} - A\mathbf{x} - \mathbf{b})^T L (\mathbf{y} - A\mathbf{x} - \mathbf{b})$$

We extract those terms involving $\mathbf{x}$:

$$= -\frac{1}{2}\mathbf{x}^T(\Lambda + A^T L A)\mathbf{x} + \mathbf{x}^T\big[\Lambda\boldsymbol{\mu} + A^T L(\mathbf{y} - \mathbf{b})\big] + \text{const} = -\frac{1}{2}(\mathbf{x} - \mathbf{m})^T(\Lambda + A^T L A)(\mathbf{x} - \mathbf{m}) + \frac{1}{2}\mathbf{m}^T(\Lambda + A^T L A)\mathbf{m} + \text{const}$$

where we have defined:

$$\mathbf{m} = (\Lambda + A^T L A)^{-1}\big[\Lambda\boldsymbol{\mu} + A^T L(\mathbf{y} - \mathbf{b})\big]$$

If we now integrate over $\mathbf{x}$, the first term integrates to a constant, and extracting the terms involving $\mathbf{y}$ from the remaining parts we obtain:

$$= -\frac{1}{2}\mathbf{y}^T\big[L - LA(\Lambda + A^T L A)^{-1}A^T L\big]\mathbf{y} + \mathbf{y}^T\Big\{\big[L - LA(\Lambda + A^T L A)^{-1}A^T L\big]\mathbf{b} + LA(\Lambda + A^T L A)^{-1}\Lambda\boldsymbol{\mu}\Big\}$$

We first read off the precision matrix from the quadratic term and then take advantage of (2.289), which gives (2.110). Finally, using the linear term combined with the covariance matrix already obtained, we obtain (2.109).

Problem 2.33 Solution

According to Bayes' theorem, we can write $p(\mathbf{x}|\mathbf{y}) = p(\mathbf{x}, \mathbf{y}) / p(\mathbf{y})$, where the joint distribution $p(\mathbf{x}, \mathbf{y})$ is already known from (2.105) and (2.108), and the marginal distribution $p(\mathbf{y})$ from Prob. 2.32. We can then follow the same procedure as in Prob. 2.32, i.e. first obtain the covariance matrix from the quadratic term and then obtain the mean from the linear term. The details are omitted here.
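The marginal distribution (2.109) and (2.110), which Prob. 2.32 re-derives by direct integration, can also be checked by simulation. The sketch below is not part of the original solution; it assumes NumPy, and all parameter values are arbitrary. It draws x from p(x) = N(x|µ, Λ⁻¹), then y from p(y|x) = N(y|Ax + b, L⁻¹), and compares the empirical mean and covariance of y with Aµ + b and L⁻¹ + AΛ⁻¹Aᵀ.

```python
import numpy as np

rng = np.random.default_rng(2)
Dx, Dy, N = 3, 2, 200000

# Arbitrary parameters of the linear-Gaussian model (2.99)-(2.100):
#   p(x) = N(x | mu, Lambda^{-1}),   p(y|x) = N(y | A x + b, L^{-1})
mu = rng.normal(size=Dx)
b = rng.normal(size=Dy)
A = rng.normal(size=(Dy, Dx))
P = rng.normal(size=(Dx, Dx)); Lambda = P @ P.T + Dx * np.eye(Dx)
Q = rng.normal(size=(Dy, Dy)); L = Q @ Q.T + Dy * np.eye(Dy)

# Ancestral sampling from the joint p(x, y)
x = rng.multivariate_normal(mu, np.linalg.inv(Lambda), size=N)
y = x @ A.T + b + rng.multivariate_normal(np.zeros(Dy), np.linalg.inv(L), size=N)

# Marginal p(y) predicted by (2.109)-(2.110)
mean_pred = A @ mu + b
cov_pred = np.linalg.inv(L) + A @ np.linalg.inv(Lambda) @ A.T

print(np.max(np.abs(y.mean(axis=0) - mean_pred)))          # close to 0 up to Monte Carlo error
print(np.max(np.abs(np.cov(y, rowvar=False) - cov_pred)))  # close to 0 up to Monte Carlo error
```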
Problem 2.34 Solution Let’s follow the hint by firstly calculating the derivative of (2.118) with respect to Σ and let it equal to 0 : − N 1 ∂ ∑ N ∂ ln|Σ| − ( xn − µ)T Σ−1 ( xn − µ) = 0 2 ∂Σ 2 ∂Σ n=1 By using (C.28), the first term can be reduced to : − N N ∂ N ln|Σ| = − (Σ−1 )T = − Σ−1 2 ∂Σ 2 2 Provided with the result that the optimal covariance matrix is the sample covariance, we denote sample matrix S as : S= N 1 ∑ ( xn − µ)( xn − µ)T N n=1 We rewrite the second term : second term = − N 1 ∂ ∑ ( xn − µ)T Σ−1 ( xn − µ) 2 ∂Σ n=1 N ∂ T r [Σ−1 S ] 2 ∂Σ N −1 Σ S Σ−1 2 = − = Where we have taken advantage of the following property, combined with the fact that S and Σ is symmetric. (Note : this property can be found in The Matrix Cookbook.) ∂ ∂X T r ( A X −1 B) = −( X −1 BA X −1 )T = −( X −1 )T A T B T ( X −1 )T Thus we obtain : − N −1 N −1 Σ + Σ S Σ−1 = 0 2 2 53 Obviously, we obtain Σ = S , just as required. Problem 2.35 Solution The proof of (2.62) is quite clear in the main text, i.e., from page 82 to page 83 and hence we won’t repeat it here. Let’s prove (2.124). We first begin by proving (2.123) : E[µ ML ] = N 1 ∑ 1 E[ · Nµ = µ xn ] = N n=1 N Where we have taken advantage of the fact that xn is independently and identically distributed (i.i.d). Then we use the expression in (2.122) : E[Σ ML ] = N 1 ∑ E[ ( xn − µ ML )( xn − µ ML )T ] N n=1 = N 1 ∑ E[( xn − µ ML )( xn − µ ML )T ] N n=1 = N 1 ∑ E[( xn − µ ML )( xn − µ ML )T ] N n=1 = N 1 ∑ E[ xn xn T − 2µ ML xn T + µ ML µT ML ] N n=1 = N N N 1 ∑ 1 ∑ 1 ∑ E[ x n x n T ] − 2 E[µ ML xn T ] + E[µ ML µT ML ] N n=1 N n=1 N n=1 By using (2.291), the first term will equal to : first term = 1 · N (µµT + Σ) = µµT + Σ N The second term will equal to : second term = −2 N 1 ∑ E[µ ML xn T ] N n=1 = −2 N N 1 ∑ 1 ∑ xm ) xn T ] E[ ( N n=1 N m=1 = −2 N N ∑ 1 ∑ E[ x m x n T ] N 2 n=1 m=1 N N ∑ 1 ∑ (µµT + I nm Σ) N 2 n=1 m=1 1 = −2 2 ( N 2 µµT + N Σ) N 1 = −2(µµT + Σ) N = −2 54 Similarly, the third term will equal to : third term = N 1 ∑ E[µ ML µT ML ] N n=1 = N N N 1 ∑ 1 ∑ 1 ∑ E[( x j) · ( x i )] N n=1 N j=1 N i=1 = N N N ∑ ∑ 1 ∑ E [( x ) · ( x i )] j N 3 n=1 j=1 i =1 N 1 ∑ ( N 2 µµT + N Σ) N 3 n=1 1 = µµT + Σ N = Finally, we combine those three terms, which gives: E[Σ ML ] = N −1 Σ N Note: the same procedure from (2.59) to (2.62) can be carried out to prove (2.291) and the only difference is that we need to introduce index m and n to represent the samples. (2.291) is quite straightforward if we see it in this way: If m = n, which means xn and xm are actually the same sample, (2.291) will reduce to (2.262) (i.e. the correlation between different dimensions exists) and if m ̸= n, which means xn and xm are different samples, also i.i.d, then no correlation should exist, we can guess E[ xn xm T ] = µµT in this case. Problem 2.36 Solution Let’s follow the hint. However, firstly we will find the sequential expression based on definition, which will make the latter process on finding coefficient a N −1 more easily. Suppose we have N observations in total, and then we can write: N) σ2( ML = = = N 1 ∑ N) 2 ( xn − µ(ML ) N n=1 [ ] −1 1 N∑ N) 2 N) 2 ( xn − µ(ML ) + ( x N − µ(ML ) N n=1 −1 N − 1 1 N∑ 1 N) 2 N) 2 ( xn − µ(ML ) + ( x N − µ(ML ) N N − 1 n=1 N N − 1 2( N −1) 1 N) 2 σ ML + ( x N − µ(ML ) N N [ ] 1 N −1) (N ) 2 2( N −1) = σ2( + ( x − µ ) − σ N ML ML ML N = 55 And then let us write the expression for σ ML . 
} { ¯ N 1 ∑ ∂ ¯ lnp ( x | µ , σ ) =0 ¯ n 2 σ ML N n=1 ∂σ By exchanging the summation and the derivative, and letting N → +∞, we can obtain : [ ] N 1 ∑ ∂ ∂ lim lnp ( x | µ , σ ) = E lnp ( x | µ , σ ) n x n N →+∞ N n=1 ∂σ2 ∂σ2 Comparing it with (2.127), we can obtain the sequential formula to estimate σ ML : N) σ2( ML N −1) = σ2( + a N −1 ML ∂ N −1) ∂σ2( ML [ N −1) + a N −1 − = σ2( ML N) N −1) lnp( x N |µ(ML , σ(ML ) (∗) 1 N −1) 2σ2( ML + N) 2 ( x N − µ(ML ) ] N −1) 2σ4( ML N) Where we use σ2( to represent the N th estimation of σ2ML , i.e., the estiML mation of σ2ML after the N th observation. What’s more, if we choose : a N −1 = N −1) 2σ4( ML N Then we will obtain : N) N −1) σ2( = σ2( + ML ML ] 1 [ 2( N −1) N) 2 −σ ML + ( x N − µ(ML ) N We can see that the results are the same. An important thing should be N) noticed : In maximum likelihood, when estimating variance σ2( , we will ML N) N) first estimate mean µ(ML , and then we we will calculate variance σ2( . ML In other words, they are decoupled. It is the same in sequential method. For instance, if we want to estimate both mean and variance sequentially, N −1) after observing the N th sample (i.e., x N ), firstly we can use µ(ML together N) with (2.126) to estimate µ(ML and then use the conclusion in this problem N) N) N −1) to obtain σ(ML . That is why in (∗) we write lnp( x N |µ(ML , σ(ML ) instead of N −1) ( N −1) lnp( x N |µ(ML , σ ML ). Problem 2.37 Solution (Wait for revising) We follow the same procedure in Prob.2.36 to solve this problem. Firstly, 56 we can obtain the sequential formula based on definition. N) Σ(ML N 1 ∑ N) N) T ( xn − µ(ML )( xn − µ(ML ) N n=1 [ ] −1 1 N∑ (N ) (N ) T (N ) (N ) T ( xn − µ ML )( xn − µ ML ) + ( x N − µ ML )( x N − µ ML ) N n=1 = = N − 1 ( N −1) 1 N) N) T Σ ML + ( x N − µ(ML )( x N − µ(ML ) N N [ ] 1 N −1) N) N) T N −1) = Σ(ML + ( x N − µ(ML )( x N − µ(ML ) − Σ(ML N = If we use Robbins-Monro sequential estimation formula, i.e., (2.135), we can obtain : N) Σ(ML N −1) = Σ(ML + a N −1 N −1) + a N −1 = Σ(ML ∂ N −1) ∂Σ(ML ∂ N −1) ∂Σ(ML N) N −1) lnp( x N |µ(ML , Σ(ML ) N −1) N) ) , Σ(ML lnp( x N |µ(ML [ ] 1 1 N −1) N −1) −1 N −1) T N −1) N −1) −1 N −1) −1 + a N −1 − [Σ(ML ] = Σ(ML ) [Σ(ML )( xn − µ(ML ] + [Σ(ML ] ( xn − µ(ML 2 2 Where we have taken advantage of the procedure we carried out in Prob.2.34 to calculate the derivative, and if we choose : a N −1 = 2 2( N −1) Σ N ML We can see that the equation above will be identical with our previous conclusion based on definition. Problem 2.38 Solution It is straightforward. Based on (2.137), (2.138) and (2.139), we focus on the exponential term of the posterior distribution p(µ| X ), which gives : − N 1 1 ∑ 1 ( xn − µ)2 − 2 (µ − µ0 )2 = − 2 (µ − µ N )2 2 2σ n=1 2σ 0 2σ N We rewrite the left side regarding to µ. quadratic term = −( N 1 + ) µ2 2 2σ 2σ20 ∑N linear term = ( n=1 x n σ2 + µ0 σ20 )µ 57 We also rewrite the right side regarding to µ, and hence we will obtain : N 1 1 −( 2 + ) µ2 = − 2 µ2 , ( 2 2σ 2σ 0 2σ N Then we will obtain : 1 1 = σ2N σ20 ∑N n=1 x n σ2 + n=1 x n ∑N σ2N = µN = ( n=1 x n σ2 ·( 1 + σ20 )µ = µN σ2N µ + µ0 σ20 = N · µ ML , we can write : ) N µ ML σ20 + µ0 σ2 · σ2 + N σ20 σ2 = σ20 N −1 N µ ML µ0 ) ·( + 2) 2 2 σ σ σ0 σ20 σ2 = µ0 N σ2 ∑N And with the prior knowledge that + N σ20 + σ2 σσ20 µ0 + N σ20 N σ20 + σ2 µ ML Problem 2.39 Solution Let’s follow the hint. 1 σ2N = 1 σ20 + N N −1 1 1 1 1 + 2 = 2 + + 2 = 2 2 2 σ σ σ σ σ0 σ N −1 However, it is complicated to derive a sequential formula for µ N directly. 
Based on (2.142), we see that the denominator in (2.141) can be eliminated if we multiply 1/σ2N on both side of (2.141). Therefore we will derive a sequential formula for µ N /σ2N instead. µN σ2N = = = = = σ2 + N σ20 σ20 σ2 σ2 + N σ20 σ20 σ2 µ0 σ20 µ0 + ( ( σ2 N σ20 + σ2 σ2 N σ20 + σ2 N) N µ(ML σ2 ∑ N −1 + σ20 µ N −1 σ2N −1 n=1 σ2 + xN σ2 = xn µ0 σ20 + + xN σ2 µ0 + µ0 + ∑N N σ20 N σ20 + σ2 N σ20 N σ20 + σ2 n=1 x n σ2 N) µ(ML ) N) µ(ML ) 58 Another possible solution is also given in the problem. We solve it by completing the square. − 1 1 1 ( x N − µ)2 − 2 (µ − µ N −1 )2 = − 2 (µ − µ N )2 2 2σ 2σ N −1 2σ N By comparing the quadratic and linear term regarding to µ, we can obtain: 1 1 1 = 2 + 2 2 σ σN σ N −1 And : µN σ2N = xN µ N −1 + 2 2 σ σ N −1 It is the same as previous result. Note: after obtaining the N th observation, we will firstly use the sequential formula to calculate σ2N , and then µ N . This is because the sequential formula for µ N is dependent on σ2N . Problem 2.40 Solution Based on Bayes Theorem, we can write : p(µ| X ) ∝ p( X |µ) p(µ) We focus on the exponential term on the right side and then rearrange it regarding to µ. [ ] N ∑ 1 1 T −1 right = − ( xn − µ) Σ ( xn − µ) − (µ − µ0 )T Σ0 −1 (µ − µ0 ) 2 2 n=1 [ ] N ∑ 1 1 T −1 = − ( xn − µ) Σ ( xn − µ) − (µ − µ0 )T Σ0 −1 (µ − µ0 ) 2 n=1 2 N ∑ 1 1 −1 = − µ (Σ0 −1 + N Σ−1 ) µ + µT (Σ− xn ) + const 0 µ0 + Σ 2 n=1 Where ’const’ represents all the constant terms independent of µ. According to the quadratic term, we can obtain the posterior covariance matrix. Σ−N1 = Σ0 −1 + N Σ−1 Then using the linear term, we can obtain : 1 −1 Σ−N1 µ N = (Σ− 0 µ0 + Σ N ∑ xn ) n=1 Finally we obtain posterior mean : 1 −1 µ N = (Σ0 −1 + N Σ−1 )−1 (Σ− 0 µ0 + Σ N ∑ n=1 xn ) 59 Which can also be written as : µ N = (Σ0 −1 + N Σ−1 )−1 (Σ0 −1 µ0 + Σ−1 N µ ML ) Problem 2.41 Solution Let’s compute the integral of (2.146) over λ. ∫ +∞ ∫ +∞ ba 1 a a−1 b λ exp(− bλ) d λ = λa−1 exp(− bλ) d λ Γ(a) Γ(a) 0 0 ∫ +∞ ba 1 u = ( )a−1 exp(− u) du Γ(a) 0 b b ∫ +∞ 1 = u a−1 exp(− u) du Γ(a) 0 1 = · Γ(a) = 1 Γ(a) Where we first perform change of variable bλ = u, and then take advantage of the definition of gamma function: ∫ +∞ Γ( x) = u x−1 e−u du 0 Problem 2.42 Solution We first calculate its mean. ∫ +∞ 1 a a−1 b λ exp(− bλ) d λ = λ Γ ( a) 0 = = = ∫ +∞ ba λa exp(− bλ) d λ Γ(a) 0 ∫ +∞ ba u 1 ( )a exp(− u) du Γ(a) 0 b b ∫ +∞ 1 u a exp(− u) du Γ(a) · b 0 1 a · Γ(a + 1) = Γ(a) · b b Where we have taken advantage of the property Γ(a + 1) = aΓ(a). Then we calculate E[λ2 ]. ∫ +∞ ∫ +∞ ba 2 1 a a−1 λ b λ exp(− bλ) d λ = λa+1 exp(− bλ) d λ Γ(a) Γ(a) 0 0 ∫ +∞ u 1 ba ( )a+1 exp(− u) du = Γ(a) 0 b b ∫ +∞ 1 a+1 = u exp(− u) du Γ(a) · b2 0 1 a(a + 1) = · Γ(a + 2) = Γ(a) · b2 b2 60 Therefore, according to var [λ] = E[λ2 ] − E[λ]2 , we can obtain : var [λ] = E[λ2 ] − E[λ]2 = a 2 a a(a + 1) −( ) = 2 2 b b b For the mode of a gamma distribution, we need to find where the maximum of the PDF occurs, and hence we will calculate the derivative of the gamma distribution with respect to λ. [ ] d 1 a a−1 1 a a−2 b λ exp(− bλ) = [(a − 1) − bλ] b λ exp(− bλ) d λ Γ(a) Γ(a) It is obvious that Gam(λ|a, b) has its maximum at λ = (a − 1)/ b. In other words, the gamma distribution Gam(λ|a, b) has mode (a − 1)/ b. Problem 2.43 Solution Let’s firstly calculate the following integral. ∫ +∞ ∫ +∞ xq | x| q exp(− 2 ) dx exp(− 2 ) dx = 2 2σ 2σ −∞ −∞ 1 ∫ +∞ (2σ2 ) q 1q −1 exp(− u) = 2 u du q 0 1 ∫ 1 (2σ2 ) q +∞ −1 exp(− u) u q dx = 2 q 0 1 (2σ2 ) q 1 = 2 Γ( ) q q And then it is obvious that (2.293) is normalized. 
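Before turning to the log likelihood (continued below), here is a quick numerical check that the density whose normalising constant was just computed does integrate to one. This is only a sketch assuming SciPy; the values σ² = 1.7 and q = 1.5 are arbitrary, and setting q = 1 or q = 2 recovers the Laplace and Gaussian special cases respectively.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

def gen_gauss(x, sigma2=1.7, q=1.5):
    # Density with the normaliser derived above:
    #   q / (2 (2 sigma^2)^{1/q} Gamma(1/q)) * exp(-|x|^q / (2 sigma^2))
    norm = q / (2.0 * (2.0 * sigma2) ** (1.0 / q) * gamma(1.0 / q))
    return norm * np.exp(-np.abs(x) ** q / (2.0 * sigma2))

val, err = quad(gen_gauss, -np.inf, np.inf)
print(val)   # approximately 1.0, confirming the normalisation
```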
Next, we consider about the log likelihood function. Since ϵ = t − y( x, w) and ϵ ∼ p(ϵ|σ2 , q), we can write: ln p(t| X , w, σ2 ) = N ∑ n=1 ( ) ln p y( xn , w) − t n |σ2 , q = − [ ] N 1 ∑ q q | y ( x , w ) − t | + N · ln n n 2σ2 n=1 2(2σ2 )1/ q Γ(1/ q) = − N 1 ∑ N | y( xn , w) − t n | q − ln(2σ2 ) + const 2 q 2σ n=1 Problem 2.44 Solution Here we use a simple method to solve this problem by taking advantage of (2.152) and (2.153). By writing the prior distribution in the form of (2.153), i.e., p(µ, λ|β, c, d ), we can easily obtain the posterior distribution. p(µ, λ| X ) ∝ p( X |µ, λ) · p(µ, λ) [ ] [ ] N +β N x2 N ∑ ∑ λµ2 n 1/2 ∝ λ exp(− xn )λµ − ( d + ) exp ( c + )λ 2 n=1 2 n=1 61 Therefore, we can see that the posterior distribution has parameters: β′ = ∑ ∑ x2 β + N , c′ = c + nN=1 xn , d ′ = d + nN=1 2n . And since the prior distribution is actually the product of a Gaussian distribution and a Gamma distribution: [ ] p(µ, λ|µ0 , β, a, b) = N µ|µ0 , (βλ)−1 Gam(λ|a, b) Where µ0 = c/β, a = 1 + β/2, b = d − c2 /2β. Hence the posterior distribution can also be written as the product of a Gaussian distribution and a Gamma distribution. [ ] p(µ, λ| X ) = N µ|µ′0 , (β′ λ)−1 Gam(λ|a′ , b′ ) Where we have defined: µ′0 = c′ /β′ = ( c + N ∑ n=1 / xn ) ( N + β) / a′ = 1 + β′ /2 = 1 + ( N + β) 2 b′ = d ′ − c′ /2β′ = d + 2 N x2 ∑ n n=1 2 − (c + N ∑ n=1 / xn )2 (2(β + N )) Problem 2.45 Solution Let’s begin by writing down the dependency of the prior distribution W (Λ|W, v) and the likelihood function p( X |µ, Λ) on Λ. p( X |µ, Λ) ∝ |Λ| N /2 exp And if we denote S= N [∑ ] 1 − ( xn − µ)T Λ( xn − µ) n=1 2 N 1 ∑ ( xn − µ)( xn − µ)T N n=1 Then we can rewrite the equation above as: [ 1 ] p( X |µ, Λ) ∝ |Λ| N /2 exp − Tr(S Λ) 2 Just as what we have done in Prob.2.34, and comparing this problem with Prob.2.34, one important thing should be noticed: since S and Λ are both ( ) symmetric, we have: Tr(S Λ) = Tr (S Λ)T = Tr(ΛT S T ) = Tr(ΛS ). And we can also write down the prior distribution as: [ 1 ] W (Λ|W, v) ∝ |Λ|(v−D −1)/2 exp − Tr(W −1 Λ) 2 Therefore, the posterior distribution can be obtained: p(Λ| X ,W, v) ∝ p( X |µ, Λ) · W (Λ|W, v) { 1 [ ]} ∝ |Λ|( N +v−D −1)/2 exp − Tr (W −1 + S )Λ 2 62 Therefore, p(Λ| X ,W, v) is also a Wishart distribution, with parameters: vN = N + v W N = (W −1 + S )−1 Problem 2.46 Solution It is quite straightforward. ∫ ∞ p( x|µ, a, b) = N ( x|µ, τ−1 )Gam(τ|a, b) d τ ∫ = = 0 { τ } b a exp(− bτ)τa−1 τ 1/2 ( ) exp − ( x − µ)2 d τ Γ(a) 2π 2 0 ∫ ∞ { } a b 1 τ ( )1/2 τa−1/2 exp − bτ − ( x − µ)2 d τ Γ(a) 2π 2 0 ∞ And if we make change of variable: z = τ[ b + ( x − µ)2 /2], the integral above can be written as: ∫ { } b a 1 1/2 ∞ a−1/2 τ p( x|µ, a, b) = ( ) τ exp − bτ − ( x − µ)2 d τ Γ(a) 2π 2 0 ]a−1/2 ∫ ∞[ a b 1 z 1 = dz ( )1/2 exp {− z} 2 /2 Γ(a) 2π b + ( x − µ ) b + ( x − µ)2 /2 0 [ ]a+1/2 ∫ ∞ b a 1 1/2 1 = ( ) z a−1/2 exp {− z} dz Γ(a) 2π b + ( x − µ)2 /2 0 [ ]−a−1/2 b a 1 1/2 ( x − µ)2 = ( ) b+ Γ(a + 1/2) Γ(a) 2π 2 And if we substitute a = v/2 and b = v/2λ, we will obtain (2.159). Problem 2.47 Solution We focus on the dependency of (2.159) on x. [ ]−v/2−1/2 λ( x − µ)2 St( x|µ, λ, v) ∝ 1+ v [ ] −v − 1 λ( x − µ)2 ∝ exp ln(1 + ) 2 v [ ] −v − 1 λ( x − µ)2 −2 ∝ exp ( + O (v )) 2 v ] [ λ( x − µ)2 (v → ∞) ≈ exp − 2 Where we have used Taylor Expansion: ln(1 + ϵ) = ϵ + O (ϵ2 ). We see that this, up to an overall constant, is a Gaussian distribution with mean µ and precision λ. 63 Problem 2.48 Solution The same steps in Prob.2.46 can be used here. 
∫ +∞ ¯ ¯ ¯v v ¯ St( x µ, Λ, v) = N ( x ¯ µ, (ηΛ)−1 ) · Gam(η ¯ , ) d η 2 2 0 { } ∫ +∞ 1 1 vη 1 v v/2 1/2 T = |ηΛ| exp − ( x − µ) (ηΛ)( x − µ) − ( ) ηv/2−1 d η D /2 2 2 Γ(v/2) 2 (2π) 0 } { v/2 1/2 ∫ +∞ (v/2) |Λ| vη D /2+v/2−1 1 T = ( x − µ ) ( η Λ )( x − µ ) − η dη exp − 2 2 (2π)D /2 Γ(v/2) 0 Where we have taken advantage of the property: |ηΛ| = ηD |Λ|, and if we denote: η ∆2 = ( x − µ)T Λ( x − µ) and z = (∆2 + v) 2 The expression above can be reduced to : ∫ ¯ 2 z D /2+v/2−1 (v/2)v/2 |Λ|1/2 +∞ 2 exp ( − z )( St( x ¯ µ, Λ, v) = ) · 2 dz 2 (2π)D /2 Γ(v/2) 0 ∆ +v ∆ +v D /2+v/2 ∫ +∞ (v/2)v/2 |Λ|1/2 2 exp(− z) · z D /2+v/2−1 dz = ) ( (2π)D /2 Γ(v/2) ∆2 + v 0 D /2+v/2 (v/2)v/2 |Λ|1/2 2 = ( ) Γ(D /2 + v/2) (2π)D /2 Γ(v/2) ∆2 + v And if we rearrange the expression above, we will obtain (2.162) just as required. Problem 2.49 Solution Firstly, we notice that if and only if x = µ, ∆2 equals to 0, so that St( x|µ, Λ, v) achieves its maximum. In other words, the mode of St( x|µ, Λ, v) is µ. Then we consider about its mean E[ x]. ∫ E[ x ] = St( x|µ, Λ, v) · x dx x∈RD [∫ +∞ ] ∫ ¯ ¯v v −1 ¯ ¯ N ( x µ, (ηΛ) ) · Gam(η , ) d η x dx = 2 2 x∈RD 0 ∫ ∫ +∞ ¯ ¯ v v = xN ( x ¯ µ, (ηΛ)−1 ) · Gam(η ¯ , ) d η dx D 2 2 x∈R 0 ] ∫ +∞ [∫ ¯ ¯v v −1 ¯ ¯ xN ( x µ, (ηΛ) ) dx · Gam(η , ) d η = 2 2 x∈RD 0 ∫ +∞ [ ¯v v ] = µ · Gam(η ¯ , ) d η 2 2 0 ∫ +∞ ¯v v = µ Gam(η ¯ , ) d η = µ 2 2 0 Where we have taken the following property: ∫ ¯ xN ( x ¯ µ, (ηΛ)−1 ) dx = E[ x] = µ x∈RD 64 Then we calculate E[ xxT ]. The steps above can also be used here. ∫ T E[ xx ] = St( x|µ, Λ, v) · xxT dx x∈RD ] [∫ +∞ ∫ ¯ ¯v v T −1 ¯ ¯ N ( x µ, (ηΛ) ) · Gam(η , ) d η xx dx = 2 2 x∈RD 0 ∫ ∫ +∞ ¯v v ¯ = xxT N ( x ¯ µ, (ηΛ)−1 ) · Gam(η ¯ , ) d η dx 2 2 x∈RD 0 ] ∫ +∞ [∫ ¯ ¯v v −1 T ¯ ¯ = xx N ( x µ, (ηΛ) ) dx · Gam(η , ) d η 2 2 0 x∈RD ∫ +∞ [ ] ¯ v v = E[µµT ] · Gam(η ¯ , ) d η 2 2 0 ∫ +∞ [ ] ¯v v = µµT + (ηΛ)−1 Gam(η ¯ , ) d η 2 2 0 ∫ +∞ ¯ v v (ηΛ)−1 · Gam(η ¯ , ) d η = µµT + 2 2 0 ∫ +∞ v v/2 v/2−1 v 1 T −1 ( ) η exp(− η) d η = µµ + (ηΛ) · Γ ( v /2) 2 2 0 ∫ v 1 v v/2 +∞ v/2−2 T −1 = µµ + Λ η exp(− η) d η ( ) Γ(v/2) 2 2 0 If we denote: z = T vη 2 , E[ xx ] = µµ T = µµT = µµT = µµT = µµT the equation above can be reduced to : ∫ v v/2 +∞ 2 z v/2−2 2 1 +Λ ( ) ( ) exp(− z) dz Γ(v/2) 2 v v 0 ∫ +∞ v 1 z v/2−2 exp(− z) dz · + Λ−1 Γ(v/2) 2 0 Γ(v/2 − 1) v + Λ−1 · Γ(v/2) 2 1 v + Λ−1 v/2 − 1 2 v + Λ−1 v−2 −1 Where we have taken advantage of the property: Γ( x + 1) = xΓ( x), and [ ] since we have cov[ x] = E ( x − E[ x])( x − E[ x])T , together with E[ x] = µ, we can obtain: v cov[ x] = Λ−1 v−2 Problem 2.50 Solution 65 The same steps in Prob.2.47 can be used here. [ ]−D /2−v/2 ∆2 St( x|µ, Λ, v) ∝ 1+ v ] [ ∆2 ∝ exp (−D /2 − v/2) · ln(1 + ) v [ ] 2 D+v ∆ −2 ∝ exp − ·( + O (v )) 2 v ∆2 ≈ exp(− ) (v → ∞) 2 Where we have used Taylor Expansion: ln(1 + ϵ) = ϵ + O (ϵ2 ). And since ∆2 = ( x−µ)T Λ( x−µ), we see that this, up to an overall constant, is a Gaussian distribution with mean µ and precision Λ. Problem 2.51 Solution We first prove (2.177). Since we have exp( i A )· exp(− i A ) = 1, and exp( i A ) = cosA + isinA . We can obtain: ( cosA + isinA ) · ( cosA − isinA ) = 1 Which gives cos2 A + sin2 A = 1. And then we prove (2.178) using the hint. cos( A − B) = ℜ[ exp( i ( A − B))] / = ℜ[ exp( i A ) exp( iB)] cosA + isinA = ℜ[ ] cosB + isinB ( cosA + isinA )( cosB − isinB) = ℜ[ ] ( cosB + isinB)( cosB − isinB) = ℜ[( cosA + isinA )( cosB − isinB)] = cosAcosB + sinAsinB It is quite similar for (2.183). 
sin( A − B) = ℑ[ exp( i ( A − B))] = ℑ[( cosA + isinA )( cosB − isinB)] = sinAcosB − cosAsinB Problem 2.52 Solution Let’s follow the hint. We first derive an approximation for exp[ mcos(θ − 66 θ0 )]. { [ ]} (θ − θ0 )2 4 exp m 1 − + O ((θ − θ0 ) ) 2 { } (θ − θ0 )2 exp m − m − mO ((θ − θ0 )4 ) 2 } { { } (θ − θ0 )2 exp( m) · exp − m · exp − mO ((θ − θ0 )4 ) 2 exp { mcos(θ − θ0 )} = = = It is same for exp( mcosθ ) : exp { mcosθ } = exp( m) · exp(− m θ2 2 { } ) · exp − mO (θ 4 ) Now we rearrange (2.179): p (θ | θ 0 , m ) = = = = 1 exp { mcos(θ − θ0 )} 2π I 0 ( m ) 1 exp { mcos(θ − θ0 )} ∫ 2π 0 exp { mcosθ } d θ } { { } 2 exp( m) · exp − m (θ−2θ0 ) · exp − mO ((θ − θ0 )4 ) { } ∫ 2π θ2 4 0 exp( m) · exp(− m 2 ) · exp − mO (θ ) d θ { } 1 (θ − θ 0 )2 exp − m ∫ 2π 2 2 exp(− m θ ) d θ 0 2 Where we have taken advantage of the following fact: } { } { (when m → ∞) exp − mO ((θ − θ0 )4 ) ≈ exp − mO (θ 4 ) Therefore, it is straightforward that when m → ∞, (2.179) reduces to a Gaussian Distribution with mean θ0 and precision m. Problem 2.53 Solution Let’s rearrange (2.182) according to (2.183). N ∑ n=1 sin(θ − θ0 ) = = N ∑ n=1 ( sinθn cosθ0 − cosθn sinθ0 ) cosθ0 N ∑ n=1 sinθn − sinθ0 N ∑ n=1 cosθn Where we have used (2.183), and then together with (2.182), we can obtain : N N ∑ ∑ cosθn = 0 sinθn − sinθ0 cosθ0 n=1 n=1 67 Which gives: θ0ML = tan −1 {∑ } n sinθ n ∑ n cosθ n Problem 2.54 Solution We calculate the first and second derivative of (2.179) with respect to θ . p(θ |θ0 , m)′ = p(θ |θ0 , m)′′ = { } 1 [− msin(θ − θ0 )] exp mcos(θ − θ0 ) 2π I 0 ( m ) [ ] { } 1 − mcos(θ − θ0 ) + (− msin(θ − θ0 ))2 exp mcos(θ − θ0 ) 2π I 0 ( m ) If we let p(θ |θ0 , m)′ equals to 0, we will obtain its root: θ = θ0 + k π (k ∈ Z ) When k ≡ 0 ( mod 2), i.e. θ ≡ θ0 ( mod 2π), we have: p(θ |θ0 , m)′′ = − m exp( m) <0 2π I 0 ( m ) Therefore, when θ = θ0 , (2.179) obtains its maximum. And when k ≡ 1 ( mod 2), i.e. θ ≡ θ0 + π ( mod 2π), we have: p(θ |θ0 , m)′′ = m exp(− m) >0 2π I 0 ( m ) Therefore, when θ = θ0 + π ( mod 2π), (2.179) obtains its minimum. Problem 2.55 Solution According to (2.185), we have : A ( m ML ) = N 1 ∑ cos(θn − θ0ML ) N n=1 By using (2.178), we can write : N 1 ∑ cos(θn − θ0ML ) N n=1 ) N ( 1 ∑ = cosθn cosθ0ML + sinθn sinθ0ML N n=1 ( ) ( ) N N ∑ 1 ∑ 1 cosθ N cosθ0ML + sinθ N sinθ0ML = N n=1 N n=1 A ( m ML ) = By using (2.168), we can further derive: ) ( ) ( N N 1 ∑ 1 ∑ ML cosθ N cosθ0 + sinθ N sinθ0ML A ( m ML ) = N n=1 N n=1 = r̄cosθ̄ · cosθ0ML + r̄sinθ̄ · sinθ0ML = r̄cos(θ̄ − θ0ML ) 68 And then by using (2.169) and (2.184), it is obvious that θ̄ = θ0ML , and hence A ( m ML ) = r̄ . Problem 2.56 Solution Recall that the distributions belonging to the exponential family have the form: p( x|η) = h( x) g(η) exp(ηT u( x)) And according to (2.13), the beta distribution can be written as: Beta( x|a, b) = = = Γ(a + b) a−1 x (1 − x)b−1 Γ(a)Γ( b) Γ(a + b) exp [(a − 1) lnx + ( b − 1) ln(1 − x)] Γ(a)Γ( b) Γ(a + b) exp [alnx + bln(1 − x)] Γ(a)Γ( b) x(1 − x) Comparing it with the standard form of exponential family, we can obtain: η = [a, b]T u( x) = [ lnx, ln(1 − x)]T /[ ] g(η) = Γ(η 1 + η 2 ) Γ(η 1 )Γ(η 2 ) / h( x) = 1 ( x(1 − x)) Where η 1 means the first element of η, i.e. η 1 = a − 1, and η 2 means the second element of η, i.e. η 2 = b − 1. 
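Before moving on to the gamma and von Mises distributions below, the exponential-family identification for the beta distribution can be checked numerically. The sketch assumes NumPy and SciPy and uses the natural parameters exactly as defined above, i.e. η = (a, b)ᵀ with u(x) = (ln x, ln(1 − x))ᵀ, h(x) = 1/(x(1 − x)) and g(η) = Γ(η₁ + η₂)/(Γ(η₁)Γ(η₂)); note that with this choice of h(x) the natural parameters are η₁ = a and η₂ = b. The values a = 2.5 and b = 4.0 are arbitrary.

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import beta

a, b = 2.5, 4.0
x = np.linspace(0.01, 0.99, 9)

eta = np.array([a, b])                       # natural parameters (eta_1 = a, eta_2 = b)
u = np.vstack([np.log(x), np.log(1 - x)])    # u(x) = [ln x, ln(1 - x)]^T
h = 1.0 / (x * (1 - x))                      # h(x)
g = gamma(eta[0] + eta[1]) / (gamma(eta[0]) * gamma(eta[1]))   # g(eta)

# h(x) g(eta) exp(eta^T u(x)) should reproduce Beta(x | a, b)
pdf_expfam = h * g * np.exp(eta @ u)
print(np.max(np.abs(pdf_expfam - beta.pdf(x, a, b))))   # approximately 0
```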
According to (2.146), Gamma distribution can be written as: Gam( x|a, b) = 1 a a−1 b x exp(− bx) Γ(a) Comparing it with the standard form of exponential family, we can obtain: T η = [a, b] u( x) = [0, − x] / η1 g ( η ) = η Γ(η 1 ) 2 h( x) = xη1 −1 According to (2.179), the von Mises distribution can be written as: p ( x |θ 0 , m ) = = 1 exp( mcos( x − θ0 )) 2π I 0 ( m ) 1 exp [ m( cosxcosθ0 + sinxsinθ0 )] 2π I 0 ( m ) 69 Comparing it with the standard form of exponential family, we can obtain: η = [ mcosθ0 , msinθ0 ]T u( x) = [ cosx, sinx] √ / g ( η ) = 1 2 π I ( η21 + η22 ) 0 h( x) = 1 Note : a given distribution can be written into the exponential family in several ways with different natural parameters. Problem 2.57 Solution Recall that the distributions belonging to the exponential family have the form: p( x|η) = h( x) g(η) exp(ηT u( x)) And the multivariate Gaussian Distribution has the form: } { 1 1 1 T −1 Σ ( x − µ ) N ( x|µ, Σ) = exp − ( x − µ ) 2 (2π)D /2 |Σ|1 /2 We expand the exponential term with respect to µ. { } 1 1 T −1 1 T −1 −1 exp − N ( x|µ, Σ) = ( x Σ x − 2 µ Σ x + µ Σ µ ) 2 (2π)D /2 |Σ|1/2 { } { } 1 1 1 T −1 1 T −1 −1 = exp − x Σ x + µ Σ x exp − µ Σ µ ) 2 2 (2π)D /2 |Σ|1/2 Comparing it with the standard form of exponential family, we can obtain: η = [Σ−1 µ, − 12 vec(Σ−1 ) ]T u( x) = [ x, vec( xxT ) ] g(η) = exp( 14 η 1 T η 2 −1 η 1 ) + | − 2η 2 |1/2 h( x) = (2π)−D /2 Where we have used η1 to denote the first element of η, and η2 to denote the second element of η. And we also take advantage of the vectorizing operator, i.e.vec(·). The vectorization of a matrix is a linear transformation which converts the matrix into a column vector. This can be viewed in an example : [ ] a b A= => vec( A ) = [a, c, b, d ]T c d Note: By introducing vectorizing operator, we actually have vec(Σ−1 ) · vec( xxT ) = xT Σ−1 x Problem 2.58 Solution (Wait for updating) 70 Based on (2.226), we rewrite the expression for ∇ g(η). ∇ g(η) = − g(η)E[ u( x)] And then we calculate the derivative of both sides of the equation above with respect to η. [ ] ∇∇ g(η) = − ∇ g(η)E[ u( x)T ] + g(η)∇E[ u( x)T ] If we multiply both sides by − g(1η) , we can obtain : −∇∇ lng(η) = ∇ lng(η)E[ u( x)T ] + ∇E[ u( x)T ] According to (2.225), we calculate ∇E[ u( x)T ]. ∫ { } ∇E[ u( x)T ] = ∇ g(η) h( x) exp ηT u( x) u( x)T dx + ∫ { } g(η) h( x) exp ηT u( x) u( x) u( x)T dx => ∇E[ u( x)T ] = ∇ lng(η)E[ u( x)T ] + E[ u( x) u( x)T ] Therefore, we obtain : −∇∇ lng(η) = 2∇ lng(η)E[ u( x)T ] + E[ u( x) u( x)T ] = −2E[ u( x)]E[ u( x)T ] + E[ u( x) u( x)T ] Problem 2.59 Solution It is straightforward. ∫ ∫ 1 x f ( ) dx σ σ ∫ 1 = f ( u)σ du σ ∫ = f ( u) du = 1 p( x|σ) dx = Where we have denoted u = x/σ. Problem 2.60 Solution Firstly, we write down the log likelihood function. N ∑ n=1 lnp( xn ) = M ∑ n i ln( h i ) i =1 Some details should be explained here. If xn falls into region ∆ i , then p( xn ) will equal to h i , and since we have already been given that among all the N observations, there are n i samples fall into region ∆ i , we can easily write down the likelihood function just as the equation above, and note we use 71 M to denote the number of different regions. Therefore, an implicit equation should hold: M ∑ ni = N i =1 We now need to take account of the constraint that p( x) must integrate ∑ to unity, which can be written as M j =1 h j ∆ j = 1. 
We introduce a Lagrange multiplier to the expression, and then we need to minimize: M ∑ i =1 M ∑ n i ln( h i ) + λ( h j ∆ j − 1) j We calculate its derivative with respect to h i and let it equal to 0. ni + λ∆ i = 0 hi Multiplying both sides by h i , performing summation over i and then using the constraint, we can obtain: N +λ=0 In other words, λ = − N . Then we substitute the result into the likelihood function, which gives: ni 1 hi = N ∆i Problem 2.61 Solution It is straightforward. In K nearest neighbours (KNN), when we want to estimate probability density at a point x i , we will consider a small sphere centered on x i and then allow the radius to grow until it contains K data points, and then p( x i ) will equal to K /( NVi ), where N is total observations and Vi is the volume of the sphere centered on x i . We can assume that Vi is small enough that p( x i ) is roughly constant in it. In this way, We can write down the integral: ∫ p( x) dx ≈ N ∑ i =1 p( x i ) · Vi = N K ∑ · Vi = K ̸= 1 i =1 NVi We also see that if we use "1NN" (K = 1), the probability density will be well normalized. Note that if and only if the volume of all the spheres are small enough and N is large enough, the equation above will hold. Fortunately, these two conditions can be satisfied in KNN. 72 0.3 Probability Distribution Problem 3.1 Solution Based on (3.6), we can write : 2σ(2a) − 1 = 2 1 − exp(−2a) exp(a) − exp(−a) −1= = 1 + exp(−2a) 1 + exp(−2a) exp(a) + exp(−a) Which is exactly tanh(a). Then we will find the relation between µ i , w i in (3.101) and (3.102). Let’s start from (3.101). y( x, w) = w0 + = w0 + = w0 + M ∑ j =1 M ∑ w j σ( wj x −µj s tanh( j =1 ) x−µ j 2s ) + 1 2 M w M ∑ x −µj 1∑ j wj + tanh( ) 2 j=1 2s j =1 2 Hence the relation is given by : µ0 = w0 + M 1∑ wj 2 j=1 and µ j = wj 2 Note: there is a typo in (3.102), the denominator should be 2 s instead of s, or alternatively you can view it as a new s′ , which equals to 2 s. Problem 3.2 Solution We first need to show that (ΦT Φ)−1 is invertible. Suppose, for the sake of contradiction, c is a nonzero vector in the kernel(Null space) of ΦT Φ. Then ΦT Φ c equals to 0 and so we have: 0 = c T ΦT Φ c = (Φ c)T Φ c = ||Φ c||2 The equation above shows that Φ c = 0. However, Φ c = c 1 ϕ1 + c 2 ϕ2 + ... + c M ϕ M and {ϕ1 , ϕ2 , , ..., ϕ M } is a basis for Φ, there is no linear relation between the ϕ i and therefore we cannot have c 1 ϕ1 + c 2 ϕ2 + ... + c M ϕ M = 0. This is the contradiction. Hence ΦT Φ is invertible. Then let’s first prove two specific cases. Case 1: w1 is in Φ. In this case, we have Φ c = w1 for some c. So we have: Φ(ΦT Φ)−1 ΦT w1 = Φ(ΦT Φ)−1 ΦT Φ c = Φ c = w1 Case 2:w2 is in Φ⊥ , where Φ⊥ is used to denote the orthogonal complement of Φ and then we have ΦT w2 = 0, which leads to: Φ(ΦT Φ)−1 ΦT w2 = 0 73 Recall that any vector x ∈ R M can be divided into the summation of two vectors w1 and w2 , were w1 ∈ Φ and w2 ∈ Φ⊥ separately. And so we have: Φ(ΦT Φ)−1 ΦT w = Φ(ΦT Φ)−1 ΦT (w1 + w2 ) = w1 Which is exactly what orthogonal projection is supposed to do. Problem 3.3 Solution Let’s calculate the derivative of (3.104) with respect to w. ∇ E D ( w) = N ∑ n=1 { } r n t n − wT Φ( xn ) Φ( xn )T We set the derivative equal to 0. 0= If we denote p N ∑ n=1 T ( r n t n Φ( xn ) − w T N ∑ n=1 ) r n Φ( xn )Φ( xn ) T p r n ϕ( xn ) = ϕ′ ( xn ) and r n t n = t′n , we can obtain: ( ) N N ∑ ∑ ′ ′ T T ′ ′ T 0= t n Φ ( xn ) − w Φ ( xn )Φ ( xn ) n=1 n=1 Taking advantage of (3.11) – (3.17), we can derive a similar result, i.e. w ML = (ΦT Φ)−1 ΦT t. 
But here, we define t as: [p ]T p p t= r 1 t 1 , r 2 t 2 , ... , r N t N p We also define Φ as a N × M matrix, with element Φ( i, j ) = r i ϕ j ( x i ). Problem 3.4 Solution Firstly, we rearrange E D (w). { }2 N D ∑ [ ] 1 ∑ E D ( w) = w0 + wi (xi + ϵi ) − t n 2 n=1 i =1 { }2 N D D ∑ ∑ ( ) 1 ∑ = w0 + wi xi − t n + wi ϵi 2 n=1 i =1 i =1 }2 { D N ∑ 1 ∑ wi ϵi = y( x n , w ) − t n + 2 n=1 i =1 } { D D N )( ) (∑ )2 ( )2 ( ∑ 1 ∑ w i ϵ i y( xn , w) − t n wi ϵi + 2 = y( x n , w ) − t n + 2 n=1 i =1 i =1 Where we have used y( xn , w) to denote the output of the linear model when input variable is xn , without noise added. For the second term in the equation above, we can obtain : Eϵ [( D ∑ i =1 w i ϵ i )2 ] = Eϵ [ D ∑ D ∑ i =1 j =1 wi w j ϵi ϵ j ] = D ∑ D ∑ i =1 j =1 w i w j Eϵ [ ϵ i ϵ j ] = σ 2 D ∑ D ∑ i =1 j =1 wi w j δi j 74 Which gives Eϵ [( D ∑ i =1 w i ϵ i )2 ] = σ2 D ∑ i =1 w2i For the third term, we can obtain: D D ∑ (∑ )( ) ( ) Eϵ [2 w i ϵ i y( xn , w) − t n ] = 2 y( xn , w) − t n Eϵ [ w i ϵ i ] i =1 i =1 ( = 2 y( x n , w ) − t n = 0 D )∑ i =1 Eϵ [ w i ϵ i ] Therefore, if we calculate the expectation of E D (w) with respect to ϵ, we can obtain: N ( D )2 σ2 ∑ 1 ∑ y( xn , w) − t n + w2 Eϵ [E D (w)] = 2 n=1 2 i=1 i Problem 3.5 Solution We can firstly rewrite the constraint (3.30) as : ( ) M 1 ∑ |w j | q − η ≤ 0 2 j=1 Where we deliberately introduce scaling factor 1/2 for convenience.Then it is straightforward to obtain the Lagrange function. ( ) }2 λ ∑ N { M 1 ∑ L(w, λ) = |w j | q − η t n − w T ϕ( x n ) + 2 n=1 2 j=1 It is obvious that L(w, λ) and (3.29) has the same dependence on w. Meanwhile, if we denote the optimal w that can minimize L(w, λ) as w⋆ (λ), we can see that M ∑ η= |w⋆j | q j =1 Problem 3.6 Solution Firstly, we write down the log likelihood function. lnp(T | X ,W, β) = − N [ [ ] ]T N 1 ∑ ln|Σ| − t n − W T ϕ( xn ) Σ−1 t n − W T ϕ( xn ) 2 2 n=1 Where we have already omitted the constant term. We set the derivative of the equation above with respect to W equals to zero. 0=− N ∑ n=1 ] Σ−1 [ t n − W T ϕ( xn ) ϕ( xn )T 75 Therefore, we can obtain similar result for W as (3.15). For Σ, comparing with (2.118) – (2.124), we can easily write down a similar result : Σ= N ] ]T 1 ∑ T [t n − W T ML ϕ( x n ) [ t n − W ML ϕ( x n ) N n=1 We can see that the solutions for W and Σ are also decoupled. Problem 3.7 Solution Let’s begin by writing down the prior distribution p(w) and likelihood function p( t| X , w, β). p ( w) = N ( w| m0 , S 0 ) , p( t| X , w, β) = N ∏ n=1 N ( t n |wT ϕ( xn ), β−1 ) Since the posterior PDF equals to the product of the prior PDF and likelihood function, up to a normalized constant. We mainly focus on the exponential term of the product. exponential term = = = N { β ∑ }2 1 1 − ( w − m0 ) T S − 0 ( w − m0 ) 2 } 1 N { β ∑ 1 − t2n − 2 t n wT ϕ( xn ) + wT ϕ( xn )ϕ( xn )T w − (w − m0 )T S − 0 ( w − m0 ) 2 n=1 2 [ ] N 1 T ∑ T −1 − w βϕ( xn )ϕ( xn ) + S 0 w 2 n=1 [ ] N ∑ 1 1 2β t n ϕ ( x n ) T w − −2 m0T S − 0 − 2 n=1 − 2 n=1 t n − w T ϕ( x n ) + const Hence, by comparing the quadratic term with standard Gaussian Distri1 −1 T bution, we can obtain: S − N = S 0 + βΦ Φ. And then comparing the linear term, we can obtain : 1 −2 m N T S N −1 = −2 m0T S − 0 − N ∑ n=1 2β t n ϕ( xn )T If we multiply −0.5 on both sides, and then transpose both sides, we can easily see that m N = S N (S 0 −1 m 0 + βΦT t) Problem 3.8 Solution Firstly, we write down the prior : p ( w) = N ( m N , S N ) 76 Where m N , S N are given by (3.50) and (3.51). 
And if now we observe another sample ( X N +1 , t N +1 ), we can write down the likelihood function : p( t N +1 | x N +1 , w) = N ( t N +1 | y( x N +1 , w), β−1 ) Since the posterior equals to the production of likelihood function and the prior, up to a constant, we focus on the exponential term. exponential term 1 T 2 ( w − m N )T S − N (w − m N ) + β( t N +1 − w ϕ( x N +1 )) [ ] wT S N −1 + β ϕ( x N +1 ) ϕ( x N +1 )T w [ 1 ] −2wT S − N m N + β ϕ( x N +1 ) t N +1 = = +const Therefore, after observing ( X N +1 , t N +1 ), we have p(w) = N ( m N +1 , S N +1 ), where we have defined: 1 −1 T S− N +1 = S N + β ϕ( x N +1 ) ϕ( x N +1 ) And ( 1 ) m N +1 = S N +1 S − N m N + β ϕ( x N +1 ) t N +1 Problem 3.9 Solution We know that the prior p(w) can be written as: p ( w) = N ( m N , S N ) And the likelihood function p( t N +1 | x N +1 , w) can be written as: p( t N +1 | x N +1 , w) = N ( t N +1 | y( x N +1 , w), β−1 ) According to the fact that y( x N +1 , w) = wT ϕ( x N +1 ) = ϕ( x N +1 )T w, the likelihood can be further written as: p( t N +1 | x N +1 , w) = N ( t N +1 |(ϕ( x N +1 )T w, β−1 ) Then we take advantage of (2.113), (2.114) and (2.116), which gives: } { p(w| x N +1 , t N +1 ) = N (Σ ϕ( x N +1 )β t N +1 + S N −1 m N , Σ) Where Σ = (S N −1 + ϕ( x N +1 )βϕ( x N +1 )T )−1 , and we can see that the result is exactly the same as the one we obtained in the previous problem. Problem 3.10 Solution We have already known: p( t|w, β) = N ( t| y( x, w), β−1 ) 77 And p(w|t, α, β) = N (w| m N , S N ) Where m N , S N are given by (3.53) and (3.54). As what we do in previous problem, we can rewrite p( t|w, β) as: p( t|w, β) = N ( t|ϕ( x)T w, β−1 ) And then we take advantage of (2.113), (2.114) and (2.115), we can obtain: p( t|t, α, β) = N (ϕ( x)T m N , β−1 + ϕ( x)T S N ϕ( x)) Which is exactly the same as (3.58), if we notice that ϕ( x ) T m N = m N T ϕ( x ) Problem 3.11 Solution We need to use the result obtained in Prob.3.8. In Prob.3.8, we have de1 rived a formula for S − : N +1 1 −1 T S− N +1 = S N + β ϕ( x N +1 ) ϕ( x N +1 ) And then using (3.110), we can obtain : S N +1 ]−1 S N −1 + β ϕ( x N +1 ) ϕ( x N +1 )T √ √ ]−1 [ = S N −1 + βϕ( x N +1 ) βϕ( x N +1 )T √ √ S N ( βϕ( x N +1 ))( βϕ( x N +1 )T )S N = SN − √ √ 1 + ( βϕ( x N +1 )T )S N ( βϕ( x N +1 )) = = [ SN − βS N ϕ( x N +1 )ϕ( x N +1 )T S N 1 + βϕ( x N +1 )T S N ϕ( x N +1 ) Now we calculate σ2N ( x) − σ2N +1 ( x) according to (3.59). σ2N ( x) − σ2N +1 ( x) = ϕ( x)T (S N − S N +1 )ϕ( x) = ϕ( x ) T = = βS N ϕ( x N +1 )ϕ( x N +1 )T S N 1 + βϕ( x N +1 )T S N ϕ( x N +1 ) ϕ( x ) ϕ( x)T S N ϕ( x N +1 )ϕ( x N +1 )T S N ϕ( x) 1/β + ϕ( x N +1 )T S N ϕ( x N +1 ) [ ]2 ϕ( x)T S N ϕ( x N +1 ) (∗) 1/β + ϕ( x N +1 )T S N ϕ( x N +1 ) And since S N is positive definite, (∗) is larger than 0. Therefore, we have proved that σ2N ( x) − σ2N +1 ( x) ≥ 0 Problem 3.12 Solution 78 Let’s begin by writing down the prior PDF p(w, β): p(w, β) N (w| m 0 , β−1 S 0 ) Gam(β|a 0 , b 0 ) (∗) β 2 1 a ∝ ( ) exp(− (w − m 0 )T βS 0−1 (w − m 0 )) b 0 0 βa0 −1 exp(− b 0 β) |S 0 | 2 = And then we write down the likelihood function p(t|X, w, β) : p(t|X, w, β) = ∝ N ∏ n=1 N ( t n |wT ϕ( xn ), β−1 ) N ∏ ] [ β β1/2 exp − ( t n − wT ϕ( xn ))2 2 n=1 (∗∗) According to Bayesian Inference, we have p(w, β|t) ∝ p(t|X, w, β)× p(w, β). We first focus on the quadratic term with regard to w in the exponent. N ∑ β β quadratic term = − wT S 0 −1 w + − wT ϕ( xn )ϕ( xn )T w 2 n=1 2 N ∑ [ ] β = − wT S 0 −1 + ϕ ( x n )ϕ ( x n ) T w 2 n=1 Where the first term is generated by (∗), and the second by (∗∗). 
By now, we know that: N ∑ S N −1 = S 0 −1 + ϕ( xn )ϕ( xn )T n=1 We then focus on the linear term with regard to w in the exponent. linear term = β m 0 T S 0 −1 w + N ∑ n=1 β t n ϕ( x n ) T w N ∑ ] [ t n ϕ( x n ) T w = β m 0 T S 0 −1 + n=1 Again, the first term is generated by (∗), and the second by (∗∗). We can also obtain: N ∑ m N T S N −1 = m 0 T S 0 −1 + t n ϕ( x n ) T n=1 Which gives: N ∑ ] [ t n ϕ( x n ) m N = S N S 0 −1 m 0 + n=1 Then we focus on the constant term with regard to w in the exponent. N β ∑ β constant term = (− m 0 T S 0 −1 m 0 − b 0 β) − t2 2 2 n=1 n N [1 ] 1 ∑ t2n = −β m 0 T S 0 −1 m 0 + b 0 + 2 2 n=1 79 Therefore, we can obtain: N 1 1 1 ∑ m N T S N −1 m N + b N = m 0 T S 0 −1 m 0 + b 0 + t2 2 2 2 n=1 n Which gives : bN = N 1 1 ∑ 1 m 0 T S 0 −1 m 0 + b 0 + t2 − m N T S N −1 m N 2 2 n=1 n 2 Finally, we focus on the exponential term whose base is β. exponent term = (2 + a 0 − 1) + N 2 Which gives: 2 + a N − 1 = (2 + a 0 − 1) + N 2 Hence, N 2 Problem 3.13 Solution(Waiting for update) a N = a0 + Similar to (3.57), we write down the expression of the predictive distribution p( t|X, t): ∫ ∫ p( t|X, t) = p( t|w, β) p(w, β|X, t) dw d β (∗) We know that: p( t|w, β) = N ( t| y( x, w), β−1 ) = N ( t|ϕ( x)T w, β−1 ) And that: p(w, β|X, t) = N (w| m N , β−1 S N ) Gam(β|a N , b N ) We go back to (∗), and we first deal with the integral with regard to w: ∫ ∫ [ ] p( t|X, t) = N ( t|ϕ( x)T w, β−1 ) N (w| m N , β−1 S N ) dw Gam(β|a N , b N ) d β ∫ = N ( t|ϕ( x)T m N , β−1 + ϕ( x)T β−1 S N ϕ( x)) Gam(β|a N , b N ) d β ∫ [ ] = N t|ϕ( x)T m N , β−1 (1 + ϕ( x)T S N ϕ( x)) Gam(β|a N , b N ) d β Where we have used (2.113), (2.114) and (2.115). Then, we compare the expression above with (2.160), we can see that p( t|X, t) = St( t|µ, λ, v), where we have defined: µ = ϕ( x ) T m N , [ ]−1 λ = 1 + ϕ ( x ) T S N ϕ( x ) , v = 2a N 80 Problem 3.14 Solution(Wait for updating) Firstly, according to (3.16), if we use the new orthonormal basis set specified in the problem to construct Φ, we can obtain an important property: ΦT Φ = I. Hence, if α = 0, together with (3.54), we know that SN = 1/β. Finally, according to (3.62), we can obtain: k( x, x′ ) = βψ( x)T SN ψ( x′ ) = ψ( x)T ψ( x′ ) Problem 3.15 Solution It is quite obvious if we substitute (3.92) and (3.95) into (3.82), which gives, β α N −γ γ N E ( m N ) = ||t − Φ m N ||2 + m N T m N = + = 2 2 2 2 2 Problem 3.16 Solution(Waiting for update) We know that p(t|w, β) = N ∏ n=1 And N (ϕ( xn )T w, β−1 ) ∝ N (Φw, β−1 I) p(w|α) = N (0, α−1 I) Comparing them with (2.113), (2.114) and (2.115), we can obtain: p(t|α, β) = N (0, β−1 I + α−1 ΦΦT ) Problem 3.17 Solution We know that: p(t|w, β) = = N ∏ n=1 N (ϕ( xn )T w, β−1 ) N ∏ 1 n=1 (2πβ−1 )1/2 { exp − } 1 ( t n − ϕ( xn )T w)2 − 1 2β N {∑ } β − ( t n − ϕ( xn )T w)2 2π n=1 2 { β } β = ( ) N /2 exp − ||t − Φw||2 2π 2 = ( β ) N /2 exp And that: p(w|α) = N (0, α−1 I) = α M /2 (2π) M /2 { α } exp − ||w||2 2 81 If we substitute the expressions above into (3.77), we can obtain (3.78) just as required. Problem 3.18 Solution We expand (3.79) as follows: E ( w) = = = β α wT w 2 2 β T α (t t − 2tT Φw + wT ΦT Φw) + wT w 2 2 ] 1[ T w (βΦT Φ + αI)w − 2βtT Φw + βtT t 2 ||t − Φw||2 + Observing the equation above, we see that E (w) contains the following term : 1 (w − m N )T A(w − m N ) (∗) 2 Now, we need to solve A and m N . 
We expand (∗) and obtain: (∗) = 1 T (w Aw − 2 m N T Aw + m N T A m N ) 2 We firstly compare the quadratic term, which gives: A = βΦT Φ + αI And then we compare the linear term, which gives: m N T A = βtT Φ Noticing that A = AT , which implies A−1 is also symmetric, we first transpose and then multiply A−1 on both sides, which gives: m N = βA−1 ΦT t Now we rewrite E (w): E ( w) = = = = = = = ] 1[ T w (βΦT Φ + αI)w − 2βtT Φw + βtT t 2 ] 1[ (w − m N )T A(w − m N ) + βtT t − m N T A m N 2 1 1 (w − m N )T A(w − m N ) + (βtT t − m N T A m N ) 2 2 1 1 T (w − m N ) A(w − m N ) + (βtT t − 2 m N T A m N + m N T A m N ) 2 2 1 1 (w − m N )T A(w − m N ) + (βtT t − 2 m N T A m N + m N T (βΦT Φ + αI) m N ) 2 2 ] α 1 1[ T (w − m N ) A(w − m N ) + βtT t − 2βtT Φ m N + m N T (βΦT Φ) m N + m N T m N 2 2 2 1 β α (w − m N )T A(w − m N ) + ||t − Φ m N ||2 + m N T m N 2 2 2 82 Just as required. Problem 3.19 Solution Based on the standard form of a multivariate normal distribution, we know that ∫ } { 1 1 1 exp − (w − m N )T A(w − m N ) dw = 1 M /2 1/2 2 (2π) | A| Hence, ∫ { 1 } exp − (w − m N )T A(w − m N ) dw = (2π) M /2 |A|1/2 2 And since E ( m N ) doesn’t depend on w, (3.85) is quite obvious. Then we substitute (3.85) into (3.78), which will immediately gives (3.86). Problem 3.20 Solution You can just follow the steps from (3.87) to (3.92), which is already very clear. Problem 3.21 Solution Let’s first prove (3.117). According to (C.47) and (C.48), we know that if A is a M × M real symmetric matrix, with eigenvalues λ i , i = 1, 2, ..., M , |A| and Tr(A) can be written as: | A| = M ∏ i =1 λi , Tr(A) = M ∑ i =1 λi Back to this problem, according to section 3.5.2, we know that A has eigenvalues α + λ i , i = 1, 2, ..., M . Hence the left side of (3.117) equals to: left side = M M d M ∑ [∏ ] ∑ d 1 (α + λ i ) = ln ln(α + λ i ) = dα i =1 i =1 d α i =1 α + λ i And according to (3.81), we can obtain: A−1 d A = A−1 I = A−1 dα For the symmetric matrix A, its inverse A−1 has eigenvalues 1 / (α+λ i ) , i = 1, 2, ..., M . Therefore, M ∑ 1 d A) = Tr(A−1 dα α + λi i =1 Hence there are the same, and (3.92) is quite obvious. Problem 3.22 Solution 83 Let’s derive (3.86) with regard to β. The first term dependent on β in (3.86) is : d N N ( lnβ) = dβ 2 2β The second term is : d E (m N ) = dβ 1 β d d α ||t − Φ m N ||2 + ||t − Φ m N ||2 + mN T mN 2 2 dβ dβ 2 The last two terms in the equation above can be further written as: β d 2 dβ ||t − Φ m N ||2 + d α mN T mN dβ 2 = = = = = = {β } dm N d d α ||t − Φ m N ||2 + mN T mN · 2 dm N dm N 2 dβ {β } α dm N [−2ΦT (t − Φ m N )] + 2 m N · 2 2 dβ { } dm N − βΦT (t − Φ m N ) + α m N · dβ { } dm N − βΦT t + (αI + βΦT Φ) m N · dβ { } dm N − βΦT t + A m N · dβ 0 Where we have taken advantage of (3.83) and (3.84). Hence N d 1 1 ∑ E ( m N ) = ||t − Φ m N ||2 = ( t n − m N T ϕ( xn ))2 dβ 2 2 n=1 The last term dependent on β in (3.86) is: d 1 γ ( ln|A|) = dβ 2 2β Therefore, if we combine all those expressions together, we will obtain (3.94). And then if we rearrange it, we will obtain (3.95). Problem 3.23 Solution First, according to (3.10), we know that p(t|X, w, β) can be further written as p(t|X, w, β) = N (t|Φw, β−1 I), and given that p(w|β) = N ( m 0 , β−1 S0 ) and 84 p(β) = Gam(β|a 0 , b 0 ). Therefore, we just follow the hint in the problem. 
∫ ∫ p(t) = p(t|X, w, β) p(w|β) dw p(β) d β ∫ ∫ { β } β = ( ) N /2 exp − (t − Φw)T (t − Φw) · 2π 2 { } β β M /2 ( ) |S0 |−1/2 exp − (w − m 0 )T S0 −1 (w − m 0 ) dw 2π 2 −1 a 0 a 0 −1 Γ(a 0 ) b 0 β exp(− b 0 β) d β ∫ ∫ a0 b0 { β } = exp − (t − Φw)T (t − Φw) ( M + N )/2 1/2 2 (2π) |S0 | { β } exp − (w − m 0 )T S0 −1 (w − m 0 ) dw 2 βa0 −1+ N /2+ M /2 exp(− b 0 β) d β ∫ ∫ a b0 0 } { β T −1 S ( w − m ) dw = exp − ( w − m ) N N N 2 (2π)( M + N )/2 |S0 |1/2 { β } exp − (tT t + m 0 T S0 −1 m 0 − m N T SN −1 m N ) 2 a N −1+ M /2 exp(− b 0 β) d β β Where we have defined m N = SN (S0 −1 m 0 + ΦT t) S N −1 = S0 −1 + ΦT Φ a N = a0 + N 2 N ∑ 1 b N = b 0 + ( m 0 T S0 −1 m 0 − m N T SN −1 m N + t2n ) 2 n=1 Which are exactly the same as those in Prob.3.12, and then we evaluate the integral, taking advantage of the normalized property of multivariate Gaussian Distribution and Gamma Distribution. ∫ 2π M /2 1/2 ) | S | βa N −1+ M /2 exp(− b N β) d β N β (2π)( M + N )/2 |S0 | ∫ a b0 0 M /2 1/2 (2π) |SN | βa N −1 exp(− b N β) d β (2π)( M + N )/2 |S0 |1/2 a0 |SN |1/2 b 0 Γ(a N ) 1 a p(t) = = = b0 0 ( 1/2 a (2π) N /2 |S0 |1/2 b NN Γ( b N ) Just as required. Problem 3.24 Solution 85 Let’s just follow the hint and we begin by writing down expression for the likelihood, prior and posterior PDF. We know that p(t|w, β) = N (t|Φw, β−1 I). What’s more, the form of the prior and posterior are quite similar: p(w, β) = N (w|m0 , β−1 S0 ) Gam(β|a 0 , b 0 ) And p(w, β|t) = N (w|mN , β−1 SN ) Gam(β|a N , b N ) Where the relationships among those parameters are shown in Prob.3.12, Prob.3.23. Now according to (3.119), we can write: p(t) = N (t|Φw, β−1 I) N (w|m0 , β−1 S0 ) Gam(β|a 0 , b 0 ) N (w|mN , β−1 SN ) Gam(β|a N , b N ) 0 a −1 N (w|m0 , β−1 S0 ) b 0 β 0 exp(− b 0 β) / Γ(a 0 ) N (w|mN , β−1 SN ) b aNN βa N −1 exp(− b N β) / Γ(a N ) a = N (t|Φw, β−1 I) a −1 = N (t|Φw, β 0 { } N (w|m0 , β−1 S0 ) b 0 Γ(a N ) a0 −a N I) exp − ( b − b ) β β 0 N a N (w|mN , β−1 SN ) b NN Γ(a 0 ) a −1 = N (t|Φw, β { } b 0 0 Γ(a N ) − N /2 N (w|m0 , β−1 S0 ) I) exp − ( b − b ) β β 0 N a N (w|mN , β−1 SN ) b NN Γ(a 0 ) Where we have used a N = a 0 + N 2 . Now we deal with the terms expressed in the form of Gaussian Distribution: N (w|m0 , β−1 S0 ) N (w|mN , β−1 SN ) { β } β = ( ) N /2 exp − (t − Φw)T (t − Φw) · 2π 2 { } −1 1/2 exp − β (w − m )T S −1 (w − m ) | β SN | 0 0 0 2 { } |β−1 S0 |1/2 exp − β (w − mN )T SN −1 (w − mN ) 2 Gaussian terms = N (t|Φw, β−1 I) { β } T exp − (t − Φ w ) (t − Φ w ) · 2π 2 |S0 |1/2 } { β exp − 2 (w − m0 )T S0 −1 (w − m0 ) } { β exp − 2 (w − mN )T SN −1 (w − mN ) = ( β ) N /2 |SN |1/2 We look back to the previous problem and we notice that at the last step in the deduction of p(t), we complete the square according to w. And if we carefully compare the left and right side at the last step, we can obtain : = { β } { β } exp − (t − Φw)T (t − Φw) exp − (w − m 0 )T S0 −1 (w − m 0 ) 2 2 } { } { β T −1 exp − (w − m N ) SN (w − m N ) exp − ( b N − b 0 )β 2 86 Hence, we go back to deal with the Gaussian terms: Gaussian terms = ( β 2π ) N /2 |SN |1/2 |S0 |1/2 { } exp − ( b N − b 0 )β If we substitute the expressions above into p(t), we will obtain (3.118) immediately. 0.4 Linear Models Classification Problem 4.1 Solution If the convex hull of {xn } and {yn } intersects, we know that there will be a ∑ ∑ point z which can be written as z = n αn xn and also z = n βn yn . Hence we can obtain: b T z + w0 w b T( = w = ( ∑ ∑ α n xn ) + w 0 n ∑ b T xn ) + ( α n ) w 0 αn w n = ∑ n T b xn + w 0 ) αn (w (∗) n ∑ Where we have used n αn = 1. 
And if {xn } and {yn } are linearly separab T xn + w0 > 0 and w b T yn + w0 < 0, for ∀xn , yn . Together with ble, we have w T b z + w0 > 0. And if we calculate w b T z + w0 αn ≥ 0 and (∗), we know that w from the perspective of {yn } following the same procedure, we can obtain b T z + w0 < 0. Hence contradictory occurs. In other words, they are not linw early separable if their convex hulls intersect. Now let’s assume they are linearly separable and try to prove their convex b T xn + w0 > 0 and hulls don’t intersect. This is obvious. We can obtain w b T yn + w0 < 0, for ∀xn , yn , if the two sets are linearly separable. And if there w is a point z, which belongs to both convex hulls of {xn } and {yn }, following the b T z + w0 > 0 from the perspective same procedure as above, we can see that w b T z + w0 < 0 from the perspective of {yn }, which directly leads to of {xn } and w a contradictory. Therefore, the convex hulls don’t intersect. Problem 4.2 Solution e on w0 explicitly: Let’s make the dependency of E D (W) e = E D (W) } 1 { Tr (XW + 1w0 T − T)T (XW + 1w0 T − T) 2 e with respect to w0 : Then we calculate the derivative of E D (W) e ∂E D (W) ∂w0 = 2 N w0 + 2(XW − T)T 1 87 Where we have used the property: ∂ ∂X [ ] Tr (AXB + C)(AXB + C)T = 2AT (AXB + C)BT We set the derivative equals to 0, which gives: w0 = − 1 (XW − T)T 1 = t̄ − WT x̄ N Where we have denoted: t̄ = 1 T T 1, N and x̄ = 1 T X 1 N e we can obtain: If we substitute the equations above into E D (W), { } e = 1 Tr (XW + T̄ − X̄W − T)T (XW + T̄ − X̄W − T) E D (W) 2 Where we further denote T̄ = 1t̄T , and X̄ = 1x̄T e with regard to W to 0, which gives: Then we set the derivative of E D (W) b †T b W=X Where we have defined: b = X − X̄ , X b = T − T̄ and T Now consider the prediction for a new given x, we have: y(x) = WT x + w0 = WT x + t̄ − WT x̄ = t̄ + WT (x − x̄) If we know that aT tn + b = 0 holds for some a and b, we can obtain: aT t̄ = N 1 ∑ 1 T T a T 1= a T tn = − b N N n=1 Therefore, [ ] aT y(x) = aT t̄ + WT (x − x̄) = aT t̄ + aT WT (x − x̄) b T (X b † )T (x − x̄) = − b + aT T = −b 88 Where we have used: bT aT T = aT (T − T̄)T = aT (T − = aT TT − 1 T T 11 T) N 1 T T T a T 11 = − b1T + b1T N = 0T Problem 4.3 Solution Suppose there are Q constraints in total. We can write aq T tn + b q = 0 , q = 1, 2, ...,Q for all the target vector tn , n = 1, 2..., N . Or alternatively, we can group them together: A T tn + b = 0 Where A is a Q × Q matrix, and the qth column of A is aq , and meanwhile b is a Q × 1 column vector, and the qth element is bq . for every pair of {aq , b q } we can follow the same procedure in the previous problem to show that aq y(x) + b q = 0. In other words, the proofs will not affect each other. Therefore, it is obvious : AT y(x) + b = 0 Problem 4.4 Solution We use Lagrange multiplier to enforce the constraint wT w = 1. We now need to maximize : L(λ, w) = wT (m2 − m1 ) + λ(wT w − 1) We calculate the derivatives: ∂L(λ, w) ∂λ And ∂L(λ, w) ∂w = wT w − 1 = m2 − m1 + 2λw We set the derivatives above equals to 0, which gives: w=− 1 (m2 − m1 ) ∝ (m2 − m1 ) 2λ Problem 4.5 Solution We expand (4.25) using (4.22), (4.23) and (4.24). 
J (w) = = ( m 2 − m 1 )2 s21 + s22 ∑ ||wT (m2 − m1 )||2 ∑ T 2 T 2 n∈C 1 (w xn − m 1 ) + n∈C 2 (w xn − m 2 ) 89 The numerator can be further written as: [ ][ ]T numerator = wT (m2 − m1 ) wT (m2 − m1 ) = wT SB w Where we have defined: SB = (m2 − m1 )(m2 − m1 )T And ti is the same for the denominator: ∑ ∑ denominator = [wT (xn − m1 )]2 + [wT (xn − m2 )]2 n∈C 1 n∈C 2 T T = w Sw1 w + w Sw2 w = w T Sw w Where we have defined: ∑ ∑ Sw = (xn − m1 )(xn − m1 )T + (xn − m2 )(xn − m2 )T n∈C 1 n∈C 2 Just as required. Problem 4.6 Solution Let’s follow the hint, beginning by expanding (4.33). (4.33) = = = = = N ∑ n=1 N ∑ n=1 N ∑ n=1 N ∑ n=1 N ∑ n=1 = [ w T xn xn + w 0 N ∑ n=1 xn xn T w − w T m xn − N ∑ n=1 N ∑ n=1 xn − ( t n xn ∑ n∈C 1 xn xn T w − wT m · ( N m) − ( xn xn T w − N wT mm − N ( t n xn + ∑ ∑ N ∑ −N xn + xn ) n ∈ C 1 N1 n ∈ C 2 N2 ∑ n∈C 1 ∑ 1 1 xn − xn ) N1 n ∈ C 2 N2 xn xn T w − N mmT w − N (m1 − m2 ) N ∑ n=1 (xn xn T ) − N mmT ]w − N (m1 − m2 ) If we let the derivative equal to 0, we will see that: [ N ∑ (xn xn T ) − N mmT ]w = N (m1 − m2 ) n=1 Therefore, now we need to prove: N ∑ n=1 t n xn ) n∈C 2 (xn xn T ) − N mmT = Sw + N1 N2 SB N 90 Let’s expand the left side of the equation above: left = = = = = = = = N ∑ n=1 N ∑ n=1 N ∑ n=1 N ∑ n=1 N ∑ n=1 N ∑ n=1 N ∑ n=1 = xn xn T − N ( xn xn T − N1 N2 m1 + m2 )2 N N N12 N22 N N2 N12 N ||m1 ||2 + 2 ||m1 ||2 − xn xn T + ( N 1 + N22 N ||m2 ||2 + 2 ||m2 ||2 − 2 N1 N2 m1 m2 T ) N2 N1 N2 m1 m2 T N N1 N2 N1 N2 N1 N2 − 2 N1 )||m1 ||2 + ( N2 + − 2 N2 )||m2 ||2 − 2 m1 m2 T N N N xn xn T + ( N1 − 2 N1 )||m1 ||2 + ( N2 − 2 N2 )||m2 ||2 + N1 N2 ||m1 − m2 ||2 N xn xn T + N1 ||m1 ||2 − 2m1 · ( N1 m1 T ) + N2 ||m2 ||2 − 2m2 · ( N2 m2 T ) + xn xn T + N1 ||m1 ||2 − 2m1 ∑ n∈C 1 + = xn xn T − N ( T n∈C 2 n∈C 1 ∑ n∈C 1 n∈C 1 xn xn + N1 ||m1 || − 2m1 ∑ ∑ 2 ∑ ∑ n∈C 1 xn xn T + N2 ||m2 ||2 − 2m2 ||xn − m1 ||2 + = Sw + n∈C 2 ∑ n∈C 2 xnT + N1 N2 SB N xnT ∑ n∈C 2 (xn xn T + ||m1 ||2 − 2m1 xnT ) + ∑ xnT + N2 ||m2 ||2 − 2m2 xnT + ∑ n∈C 2 ||xn − m2 ||2 + N1 N2 SB N (xn xn T + ||m2 ||2 − 2m2 xn T ) + N1 N2 SB N N1 N2 SB N N1 N2 SB N Just as required. Problem 4.7 Solution This problem is quite simple. We can solve it by definition. We know that logistic sigmoid function has the form: σ( a ) = 1 1 + exp(−a) Therefore, we can obtain: σ ( a ) + σ (− a ) = = = N1 N2 SB N 1 1 + 1 + exp(−a) 1 + exp(a) 2 + exp(a) + exp(−a) [1 + exp(−a)][1 + exp(a)] 2 + exp(a) + exp(−a) =1 2 + exp(a) + exp(−a) 91 Next we exchange the dependent and independent variables to obtain its inverse. 1 a= 1 + exp(− y) We first rearrange the equation above, which gives: exp(− y) = 1−a a Then we calculate the logarithm for both sides, which gives: y = ln( a ) 1−a Just as required. Problem 4.8 Solution According to (4.58) and (4.64), we can write: a = ln p(x|C 1 ) p(C 1 ) p(x|C 2 ) p(C 2 ) p (C 1 ) p (C 2 ) 1 1 p (C 1 ) = − (x − µ1 )T Σ−1 (x − µ1 ) + (x − µ2 )T Σ−1 (x − µ2 ) + ln 2 2 p (C 2 ) 1 1 p ( C ) 1 = Σ−1 (µ1 − µ2 )x − µ1 T Σ−1 µ1 + µ2 T Σ−1 µ2 + ln 2 2 p (C 2 ) = ln p(x|C 1 ) − ln p(x|C 2 ) + ln = w T x + w0 Where in the last second step, we rearrange the term according to x, i.e., its quadratic, linear, constant term. We have also defined : w = Σ−1 (µ1 − µ2 ) And 1 1 p (C 1 ) w0 = − µ1 T Σ−1 µ1 + µ2 T Σ−1 µ2 + ln 2 2 p (C 2 ) Finally, since p(C 1 |x) = σ(a) as stated in (4.57), we have p(C 1 |x) = σ(wT x+ w0 ) just as required. Problem 4.9 Solution We begin by writing down the likelihood function. 
p({ϕn , t n }|π1 , π2 , ..., πK ) = = N ∏ K ∏ [ p(ϕn |C k ) p(C k )] t nk n=1 k=1 N ∏ K ∏ [πk p(ϕn |C k )] t nk n=1 k=1 92 Hence we can obtain the expression for the logarithm likelihood: ln p = N ∑ K ∑ n=1 k=1 N ∑ K ∑ [ ] t nk ln πk + ln p(ϕn |C k ) ∝ t nk ln πk n=1 k=1 Since there is a constraint on πk , so we need to add a Lagrange Multiplier to the expression, which becomes: L= N ∑ K ∑ n=1 k=1 t nk ln πk + λ( K ∑ k=1 πk − 1) We calculate the derivative of the expression above with regard to πk : ∂L = ∂πk N t ∑ nk +λ π n=1 k And if we set the derivative equal to 0, we can obtain: πk = − ( N ∑ n=1 t nk ) / λ = − Nk λ (∗) And if we preform summation on both sides with regard to k, we can see that: K ∑ N Nk ) / λ = − 1 = −( λ k=1 Which gives λ = − N , and substitute it into (∗), we can obtain πk = Nk / N . Problem 4.10 Solution This time, we focus on the term which dependent on µk and Σ in the logarithm likelihood. ln p = N ∑ K ∑ n=1 k=1 N ∑ K ∑ [ ] t nk ln πk + ln p(ϕn |C k ) ∝ t nk ln p(ϕn |C k ) n=1 k=1 Provided p(ϕ|C k ) = N (ϕ|µk , Σ), we can further derive: ln p ∝ K N ∑ ∑ [ 1 ] 1 t nk − ln |Σ| − (ϕn − µk )Σ−1 (ϕn − µk )T 2 2 n=1 k=1 We first calculate the derivative of the expression above with regard to µk : ∂ ln p ∂µk = N ∑ n=1 t nk Σ−1 (ϕn − µk ) We set the derivative equals to 0, which gives: N ∑ n=1 t nk Σ−1 ϕn = N ∑ n=1 t nk Σ−1 µk = Nk Σ−1 µk 93 Therefore, if we multiply both sides by Σ / Nk , we will obtain (4.161). Now let’s calculate the derivative of ln p with regard to Σ, which gives: ∂ ln p ∂Σ = = = N ∑ K ∑ N ∑ K 1 1 ∂ ∑ t nk (− Σ−1 ) − t nk (ϕn − µk )Σ−1 (ϕn − µk )T 2 2 ∂ Σ n=1 k=1 n=1 k=1 N ∑ K ∑ n=1 k=1 − K ∑ N t nk −1 1 ∂ ∑ Σ − t nk (ϕn − µk )Σ−1 (ϕn − µk )T 2 2 ∂Σ k=1 n=1 N ∑ K 1 1 ∂ ∑ − Σ−1 − Nk Tr(Σ−1 Sk ) 2 ∂Σ k=1 n=1 2 = − K N −1 1 ∑ Σ + Nk Σ−1 Sk Σ−1 2 2 k=1 Where we have denoted Sk = N 1 ∑ t nk (ϕn − µk )(ϕn − µk )T Nk n=1 Now we set the derivative equals to 0, and rearrange the equation, which gives: K N ∑ k Σ= Sk N k=1 Problem 4.11 Solution Based on definition, we can write down p(ϕ|C k ) = M ∏ L ∏ m=1 l =1 ϕ ml µkml Note that here only one of the value among ϕm1 , ϕm2 , ... ϕmL is 1, and the others are all 0 because we have used a 1 − of − L binary coding scheme, and also we have taken advantage of the assumption that the M components of ϕ are independent conditioned on the class C k . We substitute the expression above into (4.63), which gives: ak = L M ∑ ∑ m=1 l =1 ϕml µkml + ln p(C k ) Hence it is obvious that a k is a linear function of the components of ϕ. Problem 4.12 Solution Based on definition, i.e., (4.59), we know that logistic sigmoid has the form: 1 σ( a ) = 1 + exp(−a) 94 Now, we calculate its derivative with regard to a. d σ( a ) exp(a) exp(a) 1 = = · = [ 1 − σ( a ) ] · σ( a ) 2 da 1 + exp(−a) 1 + exp(−a) [ 1 + exp(−a) ] Just as required. Problem 4.13 Solution Let’s follow the hint. ∇E (w) = −∇ = − N ∑ n=1 N ∑ n=1 { t n ln yn + (1 − t n ) ln(1 − yn ) } ∇{ t n ln yn + (1 − t n ) ln(1 − yn ) } = − N d { t ln y + (1 − t ) ln(1 − y ) } d y da ∑ n n n n n n d y da d w n n n=1 = − N t ∑ 1 − tn n ( − ) · yn (1 − yn ) · ϕn y 1 − yn n=1 n = − = − = N ∑ n=1 N ∑ t n − yn · yn (1 − yn ) · ϕn yn (1 − yn ) ( t n − yn )ϕn n=1 N ∑ n=1 ( yn − t n )ϕn Where we have used yn = σ(a n ), a n = wT ϕn , the chain rules and (4.88). Problem 4.14 Solution According to definition, we know that if a dataset is linearly separable, we can find w, for some points xn , we have wT ϕ(xn ) > 0, and the others wT ϕ(xm ) < 0. Then the boundary is given by wT ϕ(x) = 0. 
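(Before continuing, the gradient derived in Prob.4.13 can be checked numerically against central differences. The sketch below uses a made-up design matrix Φ, random targets and a random weight vector, purely for illustration.)

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cross_entropy(w, Phi, t):
    # E(w) = -sum_n [ t_n ln y_n + (1 - t_n) ln(1 - y_n) ],  with y_n = sigmoid(w^T phi_n)
    y = sigmoid(Phi @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

rng = np.random.default_rng(1)
Phi = rng.normal(size=(20, 3))                 # rows are phi_n^T
t = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=3)

y = sigmoid(Phi @ w)
grad_analytic = Phi.T @ (y - t)                # sum_n (y_n - t_n) phi_n, the result of Prob.4.13

# central-difference check of the analytic gradient
eps = 1e-6
grad_numeric = np.array([
    (cross_entropy(w + eps * e, Phi, t) - cross_entropy(w - eps * e, Phi, t)) / (2 * eps)
    for e in np.eye(3)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))   # True
```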
Note that for any point x0 in the dataset, the value of wT ϕ(x0 ) should either be positive or negative, but it can not equal to 0. Therefore, the maximum likelihood solution for logistic regression is trivial. We suppose for those points xn belonging to class C 1 , we have wT ϕ(xn ) > 0 and wT ϕ(xm ) < 0 for those belonging to class C 2 . According to (4.87), if |w| → ∞, we have p(C 1 |ϕ(xn )) = σ(wT ϕ(xn )) → 1 Where we have used wT ϕ(xn ) → +∞. And since wT ϕ(xm ) → −∞, we can also obtain: p(C 2 |ϕ(xm )) = 1 − p(C 1 |ϕ(xm )) = 1 − σ(wT ϕ(xm )) → 1 95 In other words, for the likelihood function, i.e.,(4.89), if we have |w| → ∞, and also we label all the points lying on one side of the boundary as class C 1 , and those on the other side as class C 2 , the every term in (4.89) can achieve its maximum value, i.e., 1, finally leading to the maximum of the likelihood. Hence, for a linearly separable dataset, the learning process may prefer to make |w| → ∞ and use the linear boundary to label the datasets, which can cause severe over-fitting problem. Problem 4.15 Solution(Waiting for update) Since yn is the output of the logistic sigmoid function, we know that 0 < yn < 1 and hence yn (1 − yn ) > 0. Then we use (4.97), for an arbitrary non-zero real vector a ̸= 0, we have: aT Ha = aT = = N [∑ N ∑ n=1 N ∑ n=1 n=1 ] yn (1 − yn )ϕn ϕT n a T T yn (1 − yn ) (ϕT n a) (ϕ n a) yn (1 − yn ) b2n Where we have denoted b n = ϕT n a. What’s more, there should be at least one of { b 1 , b 2 , ..., b N } not equal to zero and then we can see that the expression above is larger than 0 and hence H is positive definite. Otherwise, if all the b n = 0, a = [a 1 , a 2 , ..., a M ]T will locate in the null space of matrix Φ N × M . However, with regard to the rank-nullity theorem, we know that Rank(Φ) + Nullity(Φ) = M, and we have already assumed that those M features are independent, i.e., Rank(Φ) = M , which means there is only 0 in its null space. Therefore contradictory occurs. Problem 4.16 Solution We still denote yn = p( t = 1|ϕn ), and then we can write down the log likelihood by replacing t n with πn in (4.89) and (4.90). ln p(t|w) = N ∑ n=1 { πn ln yn + (1 − πn ) ln(1 − yn ) } Problem 4.17 Solution We should discuss in two situations separately, namely j = k and j ̸= k. When j ̸= k, we have: ∂ yk ∂a j = − exp(a k ) · exp(a j ) = − yk · y j ∑ [ j exp(a j ) ]2 And when j = k, we have: ∑ exp(a k ) j exp(a j ) − exp(a k ) exp(a k ) ∂ yk = = yk − yk2 = yk (1 − yk ) ∑ ∂a k [ j exp(a j ) ]2 96 Therefore, we can obtain: ∂ yk ∂a j = yk ( I k j − y j ) Where I k j is the elements of the indentity matrix. Problem 4.18 Solution We derive every term t nk ln ynk with regard to a j . ∂ t nk ln ynk ∂ t nk ln ynk ∂ ynk ∂a j = ∂wj ∂ ynk ∂a j ∂wj 1 · ynk ( I k j − yn j ) · ϕn t nk ynk t nk ( I k j − yn j ) ϕn = = Where we have used (4.105) and (4.106). Next we perform summation over n and k. ∇wj E = − = = = = K N ∑ ∑ n=1 k=1 K N ∑ ∑ n=1 k=1 t nk ( I k j − yn j ) ϕn t nk yn j ϕn − N ∑ K ∑ n=1 k=1 t nk I k j ϕn N [ ∑ K N ∑ ] ∑ t nk ) yn j ϕn − ( t n j ϕn n=1 N ∑ n=1 N ∑ n=1 k=1 yn j ϕn − N ∑ n=1 t n j ϕn ( yn j − t n j ) ϕn n=1 Where we have used the fact that for arbitrary n, we have Problem 4.19 Solution We write down the log likelihood. ln p(t|w) = N { ∑ n=1 t n ln yn + (1 − t n ) ln(1 − yn ) Therefore, we can obtain: ∇w ln p = = = ∂ ln p ∂ yn ∂a n · · ∂ yn ∂a n ∂w N t ∑ 1 − tn ′ n ( − )Φ (a n )ϕn 1 − yn n=1 yn N ∑ n=1 yn − t n Φ′ (a n )ϕn yn (1 − yn ) } ∑K k=1 t nk = 1. 
97 Where we have used y = p( t = 1|a) = Φ(a) and a n = wT ϕn . According to (4.114), we can obtain: ¯ 1 1 Φ′ (a) = N (θ |0, 1)¯θ=a = p exp(− a2 ) 2 2π Hence, we can obtain: ∇w ln p = N ∑ n=1 a2 yn − t n exp(− 2n ) ϕn p yn (1 − yn ) 2π To calculate the Hessian Matrix, we need to first evaluate several derivatives. yn − t n } = ∂w yn (1 − yn ) ∂ { = yn − t n ∂ yn ∂a n }· · ∂ yn yn (1 − yn ) ∂a n ∂w yn (1 − yn ) − ( yn − t n )(1 − 2 yn ) ′ Φ (a n )ϕn [ yn (1 − yn ) ]2 ∂ { a2 = yn2 + t n − 2 yn t n exp(− 2n ) ϕn p yn2 (1 − yn )2 2π And a2 exp(− n ) { p 2 } = ∂w 2π ∂ a2 exp(− n ) ∂a n { p 2 } ∂a n ∂w 2π 2 a an = − p exp(− n )ϕn 2 2π ∂ Therefore, using the chain rule, we can obtain: a2 yn − t n exp(− 2n ) { } = p ∂w yn (1 − yn ) 2π ∂ a2 a2 yn − t n exp(− 2n ) yn − t n ∂ exp(− 2n ) { } p + { p } ∂w yn (1 − yn ) yn (1 − yn ) ∂w 2π 2π ∂ a2 a2 [ yn2 + t n − 2 yn t n exp(− 2n ) ] exp(− 2n ) − a n ( yn − t n ) p ϕn = p yn (1 − yn ) 2π 2π yn (1 − yn ) Finally if we perform summation over n, we can obtain the Hessian Matrix: H = ∇∇w ln p a2 = N ∂ ∑ yn − t n exp(− 2n ) { } · ϕn p 2π n=1 ∂w yn (1 − yn ) = N [ y2 + t − 2 y t exp(− n ) ∑ ] exp(− 2n ) n n n n 2 − a n ( yn − t n ) p ϕn ϕn T p (1 y ) y − 2 π 2 π y (1 − y ) n n n=1 n n a2 a2 98 Problem 4.20 Solution(waiting for update) We know that the Hessian Matrix is of size MK × MK , and the ( j, k) th block with size M × M is given by (4.110), where j, k = 1, 2, ..., K . Therefore, we can obtain: K ∑ K ∑ uT Hu = uT (∗) j Hj,k uk j =1 k=1 Where we use uk to denote the k th block vector of u with size M × 1, and Hj,k to denote the ( j, k) th block matrix of H with size M × M . Then based on (4.110), we further expand (4.110): (∗) = = = = K ∑ K ∑ j =1 k=1 uT j {− K ∑ K ∑ N ∑ j =1 k=1 n=1 K ∑ K ∑ N ∑ j =1 k=1 n=1 K ∑ N ∑ k=1 n=1 N ∑ n=1 ynk ( I k j − yn j ) ϕn ϕn T }uk T uT j {− ynk ( I k j − yn j ) ϕ n ϕ n }uk T uT j {− ynk I k j ϕ n ϕ n }uk + T uT k {− ynk ϕ n ϕ n }uk + K ∑ K ∑ N ∑ j =1 k=1 n=1 K ∑ K ∑ N ∑ j =1 k=1 n=1 T uT j { ynk yn j ϕ n ϕ n }uk T yn j uT j { ϕ n ϕ n } ynk uk Problem 4.21 Solution It is quite obvious. ∫ Φ( a ) = = = = = = = a N (θ |0, 1) d θ ∫ a 1 + N (θ |0, 1) d θ 2 ∫0 a 1 + N (θ |0, 1) d θ 2 0 ∫ a 1 1 exp(−θ 2 /2) d θ +p 2 2π 0 p ∫ a 2 1 π 1 +p p exp(−θ 2 /2) d θ 2 π 2π 2 0 ∫ a 2 1 1 (1 + p p exp(−θ 2 /2) d θ ) 2 π 2 0 { } 1 1 1 + p er f (a) 2 2 −∞ Where we have used ∫ 0 −∞ N (θ |0, 1) d θ = 1 2 99 Problem 4.22 Solution If we denote f (θ ) = p(D |θ ) p(θ ), we can write: ∫ ∫ p(D ) = p ( D | θ ) p (θ ) d θ = f (θ ) d θ (2π) M /2 = f (θ M AP ) = p(D |θ M AP ) p(θ M AP ) |A|1/2 (2π) M /2 |A|1/2 Where θ M AP is the value of θ at the mode of f (θ ), A is the Hessian Matrix of − ln f (θ ) and we have also used (4.135). Therefore, ln p(D ) = ln p(D |θ M AP ) + ln p(θ M AP ) + M 1 ln 2π − ln |A| 2 2 Just as required. Problem 4.23 Solution According to (4.137), we can write: 1 M ln 2π − ln |A| 2 2 M 1 1 = ln p(D |θ M AP ) − ln 2π − ln |V0 | − (θ M AP − m)T V0 −1 (θ M AP − m) 2 2 2 M 1 + ln 2π − ln |A| 2 2 1 1 1 = ln p(D |θ M AP ) − ln |V0 | − (θ M AP − m)T V0 −1 (θ M AP − m) − ln |A| 2 2 2 ln p(D ) = ln p(D |θ M AP ) + ln p(θ M AP ) + Where we have used the definition of the multivariate Gaussian Distribution. Then, from (4.138), we can write: A = −∇∇ ln p(D |θ M AP ) p(θ M AP ) = −∇∇ ln p(D |θ M AP ) − ∇∇ ln p(θ M AP ) } { 1 = H − ∇∇ − (θ M AP − m)T V0 −1 (θ M AP − m) 2 { −1 } = H + ∇ V0 (θ M AP − m) = H + V0 −1 Where we have denoted H = −∇∇ ln p(D |θ M AP ). 
Therefore, the equation 100 above becomes: 1 (θ M AP − m)T V0 −1 (θ M AP − m) − 2 1 = ln p(D |θ M AP ) − (θ M AP − m)T V0 −1 (θ M AP − m) − 2 1 ≈ ln p(D |θ M AP ) − (θ M AP − m)T V0 −1 (θ M AP − m) − 2 1 ≈ ln p(D |θ M AP ) − (θ M AP − m)T V0 −1 (θ M AP − m) − 2 ln p(D ) = ln p(D |θ M AP ) − } 1 { 1 ln |V0 | · |H + V− 0 | 2 } 1 { ln |V0 H + I| 2 1 1 ln |V0 | − ln |H| 2 2 1 ln |H| + const 2 Where we have used the property of determinant: |A|·|B| = |AB|, and the fact that the prior is board, i.e. I can be neglected with regard to V0 H. What’s more, since the prior is pre-given, we can view V0 as constant. And if the data is large, we can write: N ∑ b H= Hn = N H b = 1/ N Where H n=1 ∑N n=1 Hn , and then 1 (θ M AP − m)T V0 −1 (θ M AP − m) − 2 1 ≈ ln p(D |θ M AP ) − (θ M AP − m)T V0 −1 (θ M AP − m) − 2 1 ≈ ln p(D |θ M AP ) − (θ M AP − m)T V0 −1 (θ M AP − m) − 2 M ln N ≈ ln p(D |θ M AP ) − 2 ln p(D ) ≈ ln p(D |θ M AP ) − 1 ln |H| + const 2 1 b | + const ln | N H 2 M 1 b | + const ln N − ln |H 2 2 This is because when N >> 1, other terms can be neglected. Problem 4.24 Solution(Waiting for updating) Problem 4.25 Solution We first need to obtain the expression for the first derivative of probit function Φ(λa) with regard to a. According to (4.114), we can write down: d Φ (λ a ) = da = d Φ(λa) d λa · d (λa) da { 1 } λ p exp − (λa)2 2 2π Which further gives: ¯ d λ ¯ Φ(λa)¯ = p a=0 da 2π And for logistic sigmoid function, according to (4.88), we have dσ 1 = σ (1 − σ) = 0.5 × 0.5 = da 4 101 Where we have used σ(0) = 0.5. Let their derivatives at origin equals, we have: λ 1 = p 4 2π p / / 2 i.e., λ = 2π 4. And hence λ = π 8 is obvious. Problem 4.26 Solution We will prove (4.152) in a more simple and intuitive way. But firstly, we need to prove a trivial yet useful statement: Suppose we have a random variable satisfied normal distribution denoted as X ∼ N ( X |µ, σ2 ), the probability x−µ of X ≤ x is P ( X ≤ x) = Φ( σ ), and here x is a given real number. We can see this by writing down the integral: ∫ x [ ] 1 1 P ( X ≤ x) = exp − 2 ( X − µ)2 d X p 2σ −∞ 2πσ2 ∫ x−µ σ 1 1 exp(− γ2 ) σ d γ = p 2 2 −∞ 2πσ ∫ x−µ σ 1 1 = p exp(− γ2 ) d γ 2 −∞ 2π x−µ ) = Φ( σ Where we have changed the variable X = µ + σγ. Now consider two random variables X ∼ N (0, λ−2 ) and Y ∼ N (µ, σ2 ). We first calculate the conditional probability P ( X ≤ Y | Y = a): a−0 ) = Φ(λa) λ−1 Together with Bayesian Formula, we can obtain: ∫ +∞ P(X ≤ Y ) = P ( X ≤ Y | Y = a) pd f (Y = a) dY −∞ ∫ +∞ Φ(λa) N (a|µ, σ2 ) da = P ( X ≤ Y | Y = a ) = P ( X ≤ a ) = Φ( −∞ Where pd f (·) denotes the probability density function and we have also used pd f (Y ) = N (µ, σ2 ). What’s more, we know that X − Y should also satisfy normal distribution, with: E [ X − Y ] = E [ X ] − E [Y ] = 0 − µ = − µ And var [ X − Y ] = var [ X ] + var [Y ] = λ−2 + σ2 Therefore, X − Y ∼ N (−µ, λ−2 + σ2 ) and it follows that: 0 − (−µ) µ P ( X − Y ≤ 0) = Φ( p ) = Φ( p ) λ−2 + σ2 λ−2 + σ2 Since P ( X ≤ Y ) = P ( X − Y ≤ 0), we obtain what have been required. 
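The identity proved in Prob.4.26 is also easy to confirm numerically. The following is a small sketch; the particular values of λ, µ and σ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# check:  integral Phi(lambda * a) N(a | mu, sigma^2) da  =  Phi( mu / sqrt(lambda^-2 + sigma^2) )
lam, mu, sigma = 0.7, 0.3, 1.5

lhs, _ = quad(lambda a: norm.cdf(lam * a) * norm.pdf(a, loc=mu, scale=sigma),
              -np.inf, np.inf)
rhs = norm.cdf(mu / np.sqrt(lam**-2 + sigma**2))

print(lhs, rhs)               # the two values agree to numerical precision
print(np.isclose(lhs, rhs))   # True
```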
102 0.5 Neural Networks Problem 5.1 Solution Based on definition of tanh(·), we can obtain: e a − e −a e a + e −a 2 ea = −1 + a e + e −a 1 = −1 + 2 1 + e−2a = 2σ(2a) − 1 tanh(a) = s) s) s) s) , w(2 for a network whose If we have parameters w(1 , w(1 and w(2 ji j0 k0 kj t) t) hidden units use logistic sigmoid function as activation and w(1 , w(1 and ji j0 t) t) w(2 , w(2 for another one using tanh(·), for the network using tanh(·) as kj k0 activation, we can write down the following expression by using (5.4): a(kt) = = = M ∑ j =1 M ∑ j =1 M ∑ j =1 t) t) w(2 tanh(a(jt) ) + w(2 kj k0 t) t) w(2 [ 2σ(2a(jt) ) − 1 ] + w(2 kj k0 M [ ∑ t) t) ] t) w(2 + w(2 2 w(2 σ(2a(jt) ) + − kj k0 kj j =1 What’s more, we also have : a(ks) = M ∑ j =1 s) s) w(2 σ(a(js) ) + w(2 kj k0 To make the two networks equivalent, i.e., a(ks) = a(kt) , we should make sure: a(s) = 2a(jt) j s) t) w(2 = 2w(2 kj kj ∑ w(2s) = − M w(2 t) + w(2 t) k0 j =1 kj k0 Note that the first condition can be achieved by simply enforcing: s) t) w(1 = 2w(1 , ji ji and s) t) w(1 = 2w(1 j0 j0 Therefore, these two networks are equivalent under a linear transformation. Problem 5.2 Solution 103 It is obvious. We write down the likelihood. p(T|X, w) = N ∏ n=1 N (tn |y(xn , w), β−1 I) Taking the negative logarithm, we can obtain: E (w, β) = − ln p(T|X, w) = N [ β ∑ 2 n=1 ] NK ln β+const ( y(xn , w)−tn )T ( y(xn , w)−tn ) − 2 Here we have used const to denote the term independent of both w and β. Note that here we have used the definition of the multivariate Gaussian Distribution. What’s more, we see that the covariance matrix β−1 I and the weight parameter w have decoupled, which is distinct from the next problem. We can first solve wML by minimizing the first term on the right of the equation above or equivalently (5.11), i.e., imaging β is fixed. Then according to the derivative of E (w, β) with regard to β, we can obtain (5.17) and hence β ML . Problem 5.3 Solution Following the process in the previous question, we first write down the negative logarithm of the likelihood function. E (w, Σ) = N { } N 1 ∑ [ y(xn , w) − tn ]T Σ−1 [ y(xn , w) − tn ] + ln |Σ| + const (∗) 2 n=1 2 Note here we have assumed Σ is unknown and const denotes the term independent of both w and Σ. In the first situation, if Σ is fixed and known, the equation above will reduce to: E (w) = N { } 1 ∑ [ y(xn , w) − tn ]T Σ−1 [ y(xn , w) − tn ] + const 2 n=1 We can simply solve wML by minimizing it. If Σ is unknown, since Σ is in the first term on the right of (∗), solving wML will involve Σ. Note that in the previous problem, the main reason that they can decouple is due to the independent assumption, i.e., Σ reduces to β−1 I, so that we can bring β to the front and view it as a fixed multiplying factor when solving wML . Problem 5.4 Solution Based on (5.20), the current conditional distribution of targets, considering mislabel, given input x and weight w is: p( t = 1|x, w) = (1 − ϵ) · p( t r = 1|x, w) + ϵ · p( t r = 0|x, w) 104 Note that here we use t to denote the observed target label, t r to denote its real label, and that our network is aimed to predict the real label t r not t, i.e., p( t r = 1|x, w) = y(x, w), hence we see that: [ ] p( t = 1|x, w) = (1 − ϵ) · y(x, w) + ϵ · 1 − y(x, w) (∗) Also, it is the same for p( t = 0|x, w): [ ] p( t = 0|x, w) = (1 − ϵ) · 1 − y(x, w) + ϵ · y(x, w) (∗∗) Combing (∗) and (∗∗), we can obtain: p( t|x, w) = (1 − ϵ) · y t (1 − y)1− t + ϵ · (1 − y) t y1− t Where y is short for y(x, w). 
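(A minimal numerical sketch of this mixed likelihood, with made-up network outputs y and observed labels t; note that setting ϵ = 0 recovers the ordinary Bernoulli likelihood.)

```python
import numpy as np

def noisy_label_likelihood(y, t, eps):
    # p(t | x, w) = (1 - eps) * y^t (1 - y)^(1 - t) + eps * (1 - y)^t y^(1 - t)
    return (1 - eps) * y**t * (1 - y)**(1 - t) + eps * (1 - y)**t * y**(1 - t)

y = np.array([0.9, 0.2, 0.6])   # illustrative outputs y(x_n, w) for three data points
t = np.array([1.0, 0.0, 1.0])   # observed (possibly mislabelled) targets

print(noisy_label_likelihood(y, t, eps=0.0))   # [0.9 0.8 0.6] -> the usual Bernoulli likelihood
print(noisy_label_likelihood(y, t, eps=0.1))   # each probability shrunk towards the flipped label
```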
Therefore, taking the negative logarithm, we can obtain the error function: E (w) = − N ∑ n=1 { 1− t } t ln (1 − ϵ) · ynn (1 − yn )1− t n + ϵ · (1 − yn ) t n yn n When ϵ = 0, it is obvious that the equation above will reduce to (5.21). Problem 5.5 Solution It is obvious by using (5.22). E (w) = − ln = − ln = − = − = − N ∏ n=1 p(t|xn , w) K N ∏ ∏ n=1 k=1 K N ∑ ∑ n=1 k=1 N ∑ K ∑ n=1 k=1 { [ ]1− t nk } ln yk (xn , w) t nk 1 − yk (xn , w) [ t nk ] ln ynk ( 1 − ynk )1− t nk N ∑ K { ∑ n=1 k=1 [ ]1− t nk yk (xn , w) t nk 1 − yk (xn , w) } t nk ln ynk + (1 − t nk ) ln( 1 − ynk ) Where we have denoted ynk = yk (xn , w) Problem 5.6 Solution We know that yk = σ(a k ), where σ(·) represents the logistic sigmoid function. Moreover, dσ = σ(1 − σ) da 105 dE (w) da k ] ] 1[ 1 [ yk (1 − yk ) + (1 − t k ) yk (1 − yk ) yk 1 − yk [ ] [ 1 − tk tk ] − = yk (1 − yk ) 1 − yk yk = (1 − t k ) yk − t k (1 − yk ) = −t k = yk − t k Just as required. Problem 5.7 Solution It is similar to the previous problem. First we denote ykn = yk (xn , w). If we use softmax function as activation for the output unit, according to (4.106), we have: d ykn = ykn ( I k j − y jn ) da j Therefore, dE (w) da j = N ∑ K } d { ∑ − t kn ln yk (xn , w) da k n=1 k=1 = − = − = − = − = − = N ∑ K ∑ } d { t kn ln ykn n=1 k=1 da j N ∑ K ∑ n=1 k=1 N ∑ K ∑ n=1 k=1 N ∑ K ∑ n=1 k=1 N ∑ n=1 N ∑ n=1 t kn ] 1 [ ykn ( I k j − y jn ) ykn ( t kn I k j − t kn y jn ) t kn I k j + t jn + N ∑ n=1 N ∑ K ∑ n=1 k=1 t kn y jn y jn ( y jn − t jn ) Where we have used the fact that only when k = j , I k j = 1 ̸= 0 and that k=1 t kn = 1. ∑K Problem 5.8 Solution It is obvious based on definition of ’tanh’, i.e., (5.59). ( e a + e−a )( e a + e−a ) − ( e a − e−a )( e a − e−a ) ( e a + e − a )2 ( e a − e−a )2 = 1− a ( e + e−a )2 = 1 − tanh(a)2 d tanh(a) = da 106 Problem 5.9 Solution We know that the logistic sigmoid function σ(a) ∈ [0, 1], therefore if we perform a linear transformation h(a) = 2σ(a) − 1, we can find a mapping function h(a) from (−∞, +∞) to [−1, 1]. In this case, the conditional distribution of targets given inputs can be similarly written as: p( t|x, w) = [ 1 + y(x, w) ](1+ t)/2 [ 1 − y(x, w) ](1− t)/2 2 2 [ ] Where 1 + y(x, w) /2 represents the conditional probability p(C 1 | x). Since now y(x, w) ∈ [−1, 1], we also need to perform the linear transformation to make it satisfy the constraint for probability.Then we can further obtain: E (w) = − = − N {1+ t ∑ n n=1 2 ln 1 + yn 1 − t n 1 − yn } + ln 2 2 2 N ∑ { } 1 (1 + t n ) ln(1 + yn ) + (1 − t n ) ln(1 − yn ) + N ln 2 2 n=1 Problem 5.10 Solution It is obvious. Suppose H is positive definite, i.e., (5.37) holds. We set v equals to the eigenvector of H, i.e., v = ui which gives: vT Hv = vT (Hv) = ui T λ i ui = λ i ||ui ||2 Therefore, every λ i should be positive. On the other hand, If all the eigenvalues λ i are positive, from (5.38) and (5.39), we see that H is positive definite. Problem 5.11 Solution It is obvious. We follow (5.35) and then write the error function in the form of (5.36). To obtain the contour, we enforce E (w) to equal to a constant C. 1∑ E (w) = E (w∗ ) + λ i α2i = C 2 i We rearrange the equation above, and then obtain: ∑ λ i α2i = B i Where B = 2C − 2E (w∗ ) is a constant. Therefore, the contours of constant error are ellipses whose axes are aligned with the eigenvector ui of the Hessian Matrix H. 
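(The eigen-coordinate form used here is easy to verify numerically. The sketch below builds a random positive-definite "Hessian", which also illustrates the eigenvalue criterion of Prob.5.10; all numbers are illustrative.)

```python
import numpy as np

rng = np.random.default_rng(2)

# a random symmetric positive-definite Hessian H and the quadratic error surface
# E(w) = E(w*) + 0.5 (w - w*)^T H (w - w*)
A = rng.normal(size=(3, 3))
H = A @ A.T + 3 * np.eye(3)
w_star = rng.normal(size=3)
E_star = 1.0

lam, U = np.linalg.eigh(H)           # eigenvalues lambda_i and eigenvectors u_i (columns of U)
print((lam > 0).all())               # True: H is positive definite (Prob.5.10)

w = rng.normal(size=3)
alpha = U.T @ (w - w_star)           # coordinates alpha_i = u_i^T (w - w*)

E_quad  = E_star + 0.5 * (w - w_star) @ H @ (w - w_star)
E_eigen = E_star + 0.5 * np.sum(lam * alpha**2)      # 0.5 * sum_i lambda_i alpha_i^2
print(np.isclose(E_quad, E_eigen))   # True: the form used in Prob.5.11
```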
The length for the j th axis is given by setting all α i = 0, s.t.i ̸= j : √ B αj = λj 107 In other words, the length is inversely proportional to the square root of the corresponding eigenvalue λ j . Problem 5.12 Solution If H is positive definite, we know the second term on the right side of (5.32) will be positive for arbitrary w. Therefore, E (w∗ ) is a local minimum. On the other hand, if w∗ is a local minimum, we have 1 E (w∗ ) − E (w) = − (w − w∗ )T H(w − w∗ ) < 0 2 In other words, for arbitrary w, (w − w∗ )T H(w − w∗ ) > 0, according to the previous problem, we know that this means H is positive definite. Problem 5.13 Solution It is obvious. Suppose that there are W adaptive parameters in the network. Therefore, b has W independent parameters. Since H is symmetric, there should be W (W + 1)/2 independent parameters in it. Therefore, there are W + W (W + 1)/2 = W (W + 3)/2 parameters in total. Problem 5.14 Solution It is obvious. Since we have E n (w ji + ϵ) = E n (w ji ) + ϵE ′n (w ji ) + And E n (w ji − ϵ) = E n (w ji ) − ϵE ′n (w ji ) + ϵ2 2 E ′′n (w ji ) + O (ϵ3 ) ϵ2 E ′′ (w ji ) + O (ϵ3 ) 2 n We combine those two equations, which gives, E n (w ji + ϵ) − E n (w ji − ϵ) = 2ϵE ′n (w ji ) + O (ϵ3 ) Rearrange the equation above, we obtain what has been required. Problem 5.15 Solution It is obvious. The back propagation formalism starts from performing summation near the input, as shown in (5.73). By symmetry, the forward propagation formalism should start near the output. Jki = ∂ yk ∂xi = ∂ h( a k ) = h′ ( a k ) ∂xi ∂a k ∂xi (∗) Where h(·) is the activation function at the output node a k . Considering all the units j , which have links to unit k: ∂a k ∂xi = ∑ ∂a k ∂a j j ∂a j ∂ x i = ∑ j w k j h′ ( a j ) ∂a j ∂xi (∗∗) 108 Where we have used: ak = ∑ z j = h( a j ) wk j z j , j It is similar for ∂a j /∂ x i . In this way we have obtained a recursive formula starting from the input node: { wl i , if there is a link from input unit i to l ∂a l = ∂xi 0, if there isn’t a link from input unit i to l Using recursive formula (∗∗) and then (∗), we can obtain the Jacobian Matrix. Problem 5.16 Solution It is obvious. We begin by writing down the error function. E= N N ∑ M 1 ∑ 1 ∑ ||yn − tn ||2 = ( yn,m − t n,m )2 2 n=1 2 n=1 m=1 Where the subscript m denotes the mthe element of the vector. Then we can write down the Hessian Matrix as before. H = ∇∇E = N ∑ M ∑ n=1 m=1 ∇yn,m ∇yn,m + N ∑ M ∑ n=1 m=1 ( yn,m − t n,m )∇∇yn,m Similarly, we now know that the Hessian Matrix can be approximated as: H≃ N ∑ M ∑ n=1 m=1 bn,m bT n,m Where we have defined: bn,m = ∇ yn,m Problem 5.17 Solution It is obvious. ∂2 E ∂wr ∂ws = = ∂ 1 ∫ ∫ 2( y − t) ∂wr 2 ∫ ∫ [ ( y − t) Since we know that ∫ ∫ ∂ y2 p(x, t) d x dt ( y − t) ∂wr ∂ws ∂y p(x, t) d x dt ∂ws ∂ y2 ∂wr ∂ws + ∂y ∂y ] ∂ws ∂wr ∫ ∫ = ( y − t) ∫ = = 0 ∂ y2 ∂wr ∂ws ∂ y2 p(x, t) d x dt p( t|x) p(x) d x dt ∂wr ∂ws ∫ { } ( y − t) p( t|x) dt p(x) d x 109 Note that in the last step, we have used y = tute it into the second derivative, which gives, ∫ ∫ ∂2 E = ∂wr ∂ws ∫ = ∫ tp( t|x) dt. Then we substi- ∂y ∂y p(x, t) d x dt ∂ws ∂wr ∂y ∂y p(x) d x ∂ws ∂wr Problem 5.18 Solution skip By analogy with section 5.3.2, we denote wki as those parameters corresponding to skip-layer connections, i.e., it connects the input unit i with the output unit k. Note that the discussion in section 5.3.2 is still correct and now we only need to obtain the derivative of the error function with respect skip to the additional parameters wki . 
∂E n = skip ∂wki ∂E n ∂a k ∂a k ∂wskip ki = δk x i Where we have used a k = yk due to linear activation at the output unit and: M ∑ ∑ skip yk = w(2) z + wki x i j kj j =0 i Where the first term on the right side corresponds to those information conveying from the hidden unit to the output and the second term corresponds to the information conveying directly from the input to output. Problem 5.19 Solution The error function is given by (5.21). Therefore, we can obtain: ∇E (w) = N ∂E ∑ ∇a n n=1 ∂a n = − = − = − = − = N ∑ ∂ [ n=1 ∂a n ] t n ln yn + (1 − t n ) ln(1 − yn ) ∇a n N { ∂( t ln y ) ∂ y ∑ n n n ∂ yn n=1 N [t ∑ n ∂a n + ∂(1 − t n ) ln(1 − yn ) ∂ yn } ∇a n ∂ yn ∂a n · yn (1 − yn ) + (1 − t n ) ] −1 · yn (1 − yn ) ∇a n 1 − yn n=1 yn N [ ∑ ] t n (1 − yn ) − (1 − t n ) yn ∇a n n=1 N ∑ n=1 ( yn − t n )∇a n 110 Where we have used the conclusion of problem 5.6. Now we calculate the second derivative. ∇∇E (w) = N { ∑ n=1 yn (1 − yn )∇a n ∇a n + ( yn − t n )∇∇a n } Similarly, we can drop the last term, which gives exactly what has been asked. Problem 5.20 Solution(waiting for update) We begin by writing down the error function. E (w) = − N ∑ K ∑ n=1 k=1 t nk ln ynk Here we assume that the output of the network has K units in total and there are W weights parameters in the network. WE first calculate the first derivative: ∇E = N dE ∑ · ∇an n=1 d a n = − = N [ d K ∑ ∑ ] ( t nk ln ynk ) · ∇an n=1 d a n k=1 N ∑ n=1 cn · ∇an Note that here cn = − dE / d an is a vector with size K × 1, ∇an is a matrix with size K × W . Moreover, the operator · means inner product, which gives ∇E as a vector with size 1 × W . According to (4.106), we can obtain the j th element of cn : c n, j = − ∂ ∂a j ( K ∑ t nk ln ynk ) k=1 = − K ∑ ∂ ( t nk ln ynk ) k=1 ∂a j = − K t ∑ nk ynk ( I k j − yn j ) y k=1 nk = − K ∑ k=1 t nk I k j + = − t n j + yn j ( = yn j − t n j K ∑ t nk yn j k=1 K ∑ k=1 t nk ) 111 Now we calculate the second derivative: ∇∇E N dc ∑ n ( ∇an ) · ∇an + cn ∇∇an d a n n=1 = Here d cn / d an is a matrix with size K × K . Therefore, the second term can be neglected as before, which gives: H= N dc ∑ n ∇an ) · ∇an ( d a n n=1 Problem 5.21 Solution We first write down the expression of Hessian Matrix in the case of K outputs. N ∑ K ∑ H N,K = bn,k bT n,k n=1 k=1 Where bn,k = ∇w an,k . Therefore, we have: H N +1,K = H N,K + K ∑ k=1 T b N +1,k bT N +1,k = H N,K + B N +1 B N +1 Where B N +1 = [b N +1,1 , b N +1,2 , ..., b N +1,K ] is a matrix with size W × K , and here W is the total number of the parameters in the network. By analogy with (5.88)-(5.89), we can obtain: 1 −1 H− N +1,K = H N,K − 1 H− B BT H−1 N,K N +1 N +1 N,K 1 + BT H−1 B N +1 N,K N +1 (∗) Furthermore, similarly, we have: H N +1,K +1 = H N +1,K + N∑ +1 n=1 T bn,K +1 bT n,K +1 = H N +1,K + BK +1 BK +1 Where BK +1 = [b1,K +1 , b2,K +1 , ..., b N +1,K +1 ] is a matrix with size W ×( N + 1). Also, we can obtain: 1 H− N +1,K +1 = 1 H− N +1,K − 1 H− B BT H−1 N +1,K K +1 K +1 N +1,K 1 + BT H−1 B K +1 N +1,K K +1 1 Where H− is defined by (∗). If we substitute (∗) into the expression N +1,K 1 1 above, we can obtain the relationship between H− and H− . N +1,K +1 N,K Problem 5.22 Solution 112 We begin by handling the first case. 
∂2 E n ∂ = ∂w(2) ∂w(2) kj k′ j ′ ( (2) ∂E n ∂wk j ∂w(2) k′ j ′ ∂ = ∂w(2) kj ( ∂ = ) ∂ E n ∂ a k′ ) ∂a k′ ∂w(2)′ ′ k j ∑ ∂ E n ∂ j ′ w k′ j ′ z j ′ ( ∂ a k′ ∂w(2) kj ∂ = ∂w(2) kj ∂ = ∂w(2) kj ∂ = ∂a k ∂ = ( ( ( ( ∂E n ∂ a k′ ∂E n ∂a k′ ) z j′ ) ) z j′ + ∂E n ∂ z j′ ∂a k′ ∂w(2) kj ∂E n ∂a k ) z j′ + 0 ∂a k′ ∂w(2) kj ∂E n ∂ a k ∂ a k′ z j z j′ M kk′ = ∂w(2) k′ j ′ ) z j z j′ Then we focus on the second case, and if here j ̸= j ′ ∂2 E n ∂w(1) ∂w(1) ji j′ i′ = = = = = ∂ ∂w(1) ji ∂ ∂E n ( ( ∂w(1) j′ i′ ) ∑ ∂ E n ∂ a k′ ) ∂a k′ ∂w(1) ′ ′ ′ ∂w(1) ji k ∑ ∂ (1) k′ ∂w ji ∑ k′ ∑ k′ ji ( ∂E n ∂ a k′ h′ ( a j ′ ) x i ′ h′ ( a j ′ ) x i ′ w(2) h′ ( a j ′ ) x i ′ ) k′ j ′ ∂ ∂w(1) ji ( ∂E n ∂ a k′ w(2) ) k′ j ′ ∑ ∂ ∂E n (2) ∂a k ( wk′ j′ ) (1) ∂w k ∂ a k ∂ a k′ ji ∑ ∑ ∂ ∂E n (2) = h′ ( a j ′ ) x i ′ ( wk′ j′ ) · (w(2) h′ ( a j ) x i ) kj ′ ∂ a ∂ a ′ k k k k ∑ ′ ∑ = h (a j′ ) x i′ M kk′ w(2) · w(2) h′ ( a j ) x i k′ j ′ kj k′ = ′ k ′ x i′ x i h (a j′ ) h (a j ) ∑∑ k′ k · w(2) M kk′ w(2) k′ j ′ kj 113 When j = j ′ , similarly we have: ∂2 E n ∂w(1) ∂w(1) ji ji ′ = = = = = ∑ k′ x i′ ∂ ∂E n ∂w ji ∂ a k′ ( (1) ∑ k′ ∂ ( (1) w(2) h′ ( a j ) x i ′ ) k′ j ∂E n ∂w ji ∂a k′ x i ′ x i h′ ( a j ) h′ ( a j ) w(2) ) h′ ( a j ) + x i ′ k′ j ∑ ∂E n (2) ∂ h′ (a j ) ( w k′ j ) ∂w(1) k′ ∂ a k′ ji ∑∑ k′ k ∑ ∂E n (2) ∂ h′ (a j ) (2) ′ ′ · w M + x ( w(2) w k′ j ) i kk kj k′ j ∂w(1) k′ ∂ a k′ ji ∑∑ ∑ ∂E n (2) ′′ (2) ′ + x i′ x i ′ x i h′ ( a j ) h′ ( a j ) w(2) · w M ( w k′ j ) h ( a j ) x i ′ kk k j kj k′ k k′ ∂ a k′ ∑ ∑ (2) ∑ x i ′ x i h′ ( a j ) h′ ( a j ) wk′ j · w(2) M kk′ + h′′ (a j ) x i x i′ δk′ w(2) kj k′ j k′ k k′ It seems that what we have obtained is slightly different from (5.94) when j = j ′ . However this is not the case, since the summation over k′ in the second term of our formulation and the summation over k in the first term of (5.94) is actually the same (i.e., they both represent the summation over all the output units). Combining the situation when j = j ′ and j ̸= j ′ , we can obtain (5.94) just as required. Finally, we deal with the third case. Similarly we first focus on j ̸= j ′ : ∂2 E n ∂w(2) ∂w(1) ji k j′ = = = = ∂ ∂w(1) ji ( ∂ ( (1) ∂w ji ∂ ∂E n ∂w(2) k j′ ) ∂E n ∂a k ) ∂a k ∂w(2)′ kj ∑ ∂E n ∂ j′ wk j′ z j′ ( ∂a k ∂w(1) ji ∂ ∂w(1) ji ( ∂w(2) k j′ ∂E n ∂a k z j′ ) = ∑ ∂ ∂ E n ∂ a k′ ) (1) ( k′ ∂a k′ ∂a k ∂w ji ∑ z j′ M kk′ w(2) h′ ( a j ) x i k′ j = x i h (a j ) z j′ = ) z j′ k′ ′ ∑ k′ M kk′ w(2) k′ j Note that in (5.95), there are two typos: (i) H kk′ should be M kk′ . (ii) j should 114 exchange position with j ′ in the right side of (5.95). When j = j ′ , we have: ∂2 E n ∂w(1) ∂w(2) ji kj = = = = = = = ∂ ( (1) ∂E n ∂w ji ∂w(2) kj ∂ ∂w(1) ji ( ∂ ) ∂E n ∂a k ) ∂a k ∂w(2) kj ∑ ∂E n ∂ j wk j z j ( ∂a k ∂w(1) ji ∂ ∂w(1) ji ∂ ∂w(1) ji ( ( ∂w(2) kj ∂E n z j) ∂a k ∂E n )z j + ∂a k x i h′ ( a j ) z j x i h′ ( a j ) z j ) ∑ k′ ∑ k′ ∂E n ∂ z j ∂a k w(1) ji M kk′ w(2) + k′ j ∂E n ∂ z j ∂a k w(1) ji M kk′ w(2) + δ k h′ ( a j ) x i k′ j Combing these two situations, we obtain (5.95) just as required. Problem 5.23 Solution It is similar to the previous problem. 
∂2 E n ∂ w k′ i ′ ∂ w k j = = = = ∂ ( ∂E n ) ∂ w k′ i ′ ∂ w k j ∂ ∂E n z j) ( ′ ′ ∂wk i ∂a k ∂ w k′ i ′ ∂ ∂ E n zj ( ) ∂ a k′ ∂ a k′ ∂ a k z j x i′ M kk′ 115 And ∂2 E n ∑ ∂E n ∂a k ( ) ∂wk′ i′ k ∂a k ∂w ji ∑ ∂E n ∂ ( w k j h′ ( a j ) x i ) ∂ w k′ i ′ k ∂ a k ∑ ′ ∂ ∂E n h (a j ) x i wk j ( ) ∂ w k′ i ′ ∂ a k k ∑ ′ ∂ ∂ E n a k′ ( ) h (a j ) x i wk j ∂ a k′ ∂ a k w k′ i ′ k ∑ ′ h (a j ) x i wk j M kk′ x i′ ∂ = ∂wk′ i′ ∂w ji = = = = k x i x i ′ h′ ( a j ) = ∑ wk j M kk′ k Finally, we have ∂2 E n = ∂wk′ i′ wki = = = ∂ ∂E n ( ) ∂wk′ i′ ∂wki ∂ ∂E n ( xi ) ∂ w k′ i ′ ∂ a k ∂ ∂ E n ∂ a k′ ( ) xi ∂ a k′ ∂ a k w k′ i ′ x i x i′ M kk′ Problem 5.24 Solution It is obvious. According to (5.113), we have: ∑ ae j = w e ji e xi + w e j0 i = ∑1 i = ∑ a w ji · (ax i + b) + w j0 − b∑ w ji a i w ji x i + w j0 = a j i Where we have used (5.115), (5.116) and (5.117). Currently, we have proved that under the transformation the hidden unit a j is unchanged. If the activation function at the hidden unit is also unchanged, we have e z j = z j. Now we deal with the output unit e yk : ∑ e yk = w ek j e zj + w e k0 j = ∑ cwk j · z j + cwk0 + d j ∑[ ] w k j · z j + w k0 + d = c = c yk + d j 116 Where we have used (5.114), (5.119) and (5.120). To be more specific, here we have proved that the linear transformation between e yk and yk can be achieved by making transformation (5.119) and (5.120). Problem 5.25 Solution Since we know the gradient of the error function with respect to w is: ∇E = H(w − w∗ ) Together with (5.196), we can obtain: w(τ) = w(τ−1) − ρ ∇E = w(τ−1) − ρ H(w(τ−1) − w∗ ) Multiplying both sides by uTj , using w j = wT u j , we can obtain: w(jτ) [ ] = uTj w(τ−1) − ρ H(w(τ−1) − w∗ ) = w(jτ−1) − ρ uTj H(w(τ−1) − w∗ ) = w(jτ−1) − ρη j uTj (w(τ−1) − w∗ ) = w(jτ−1) − ρη j (w(jτ−1) − w∗j ) = (1 − ρη j )w(jτ−1) + ρη j w∗j Where we have used (5.198). Then we use mathematical deduction to prove (5.197), beginning by calculating w(1) : j w(1) j + ρη j w∗j = (1 − ρη j )w(0) j = ρη j w∗j [ ] = 1 − (1 − ρη j ) w∗j Suppose (5.197) holds for τ, we now prove that it also holds for τ + 1. w(jτ+1) = (1 − ρη j w(jτ) + ρη j w∗j [ ] = (1 − ρη j ) 1 − (1 − ρη j )τ w∗j + ρη j w∗j { [ ] } = (1 − ρη j ) 1 − (1 − ρη j )τ + ρη j w∗j [ ] = 1 − (1 − ρη j )τ+1 w∗j Hence (5.197) holds for τ = 1, 2, .... Provided |1 − ρη j | < 1, we have (1 − ρη j )τ → 0 as τ → ∞ ans thus w(τ) = w∗ . If τ is finite and η j >> (ρτ)−1 , the above argument still holds since τ is still relatively large. Conversely, when η j << (ρτ)−1 , we expand the expression above: [ ] |w(jτ) | = | 1 − (1 − ρη j )τ w∗j | ≈ |τρη j w∗j | << |w∗j | 117 We can see that (ρτ)−1 works as the regularization parameter α in section 3.5.3. Problem 5.26 Solution Based on definition or by analogy with (5.128), we have: Ωn 1 ∑ ∂ ynk ¯¯ )2 ( 2 k ∂ξ ξ=0 1 ∑ ∑ ∂ ynk ∂ x i ¯¯ )2 ( 2 k i ∂ x i ∂ξ ξ=0 1∑ ∑ ∂ ynk )2 ( τi 2 k i ∂xi = = = Where we have denoted τi = ∂ x i ¯¯ ∂ξ ξ=0 And this is exactly the form given in (5.201) and (5.202) if the nth observation ynk is denoted as yk in short. Firstly, we define α j and β j as (5.205) shows, where z j and a j are given by (5.203). 
Then we will prove (5.204) holds: αj = ∑ τi i = ∑ τi ∂z j ∂xi = ∑ τi ∂ h( a j ) ∂xi i ∂ h( a j ) ∂ a j ∂a j ∂ x i ∑ ∂ h′ ( a j ) τ i a j = h ′ ( a j )β j ∂ x i i i = Moreover, βj = ∑ τi i = ∑ i = ∑ i′ τi ∂a j = ∑ τi ∂xi i ∑ ∂w ji′ z i′ i′ w ji′ ∂xi ∑ i τi ∂ z i′ ∂xi ∂ ∑ i ′ w ji ′ z i ′ ∂xi ∑ ∑ ∂ z i′ = τ i w ji′ ∂xi i i′ ∑ = w ji′ α i′ i′ So far we have proved that (5.204) holds and now we aim to find a forward propagation formula to calculate Ωn . We firstly begin by evaluating {β j } at the input units, and then use the first equation in (5.204) to obtain {α j } at the input units, and then the second equation to evaluate {β j } at the first hidden layer, and again the first equation to evaluate {α j } at the first hidden layer. We repeatedly evaluate {β j } and {α j } in this way until reaching the output 118 layer. Then we deal with (5.206): ∂Ωn ∂wrs = } 1 ∑ ∂(G yk )2 ∂ {1 ∑ (G yk )2 = ∂wrs 2 k 2 k ∂wrs ∂G yk 1 ∑ ∂(G yk )2 ∂(G yk ) ∑ G yk = 2 k ∂(G yk ) ∂wrs ∂wrs k ∑ [ ∂ yk ∂a r ] [ ∂ yk ] ∑ = = αk G G yk G ∂wrs ∂a r ∂wrs k k ∑ [ ] ∑ { } = αk G δkr z s = αk G [δkr ] z s + G [ z s ]δkr = k = ∑ k { } αk ϕkr z s + αs δkr k Provided with the idea in section 5.3, the backward propagation formula is easy to derive. We can simply replace E n with yk to obtain a backward equation, so we omit it here. Problem 5.27 Solution Following the procedure in section 5.5.5, we can obtain: ∫ 1 (τT ∇ y(x))2 p(x) d x Ω= 2 / Since we have τ = ∂s(x, ξ) ∂ξ and s = x + ξ, so we have τ = I. Therefore, substituting τ into the equation above, we can obtain: ∫ 1 Ω= (∇ y(x))2 p(x) d x 2 Just as required. Problem 5.28 Solution The modifications only affect derivatives with respect to the weights in the convolutional layer. The units within a feature map (indexed m) have different inputs, but all share a common weight vector, w(m) . Therefore, we can write: ( m) ∑ ∂E n ∂a j ∑ ( m) ( m) ∂E n = = δ j z ji ( m) ( m) ∂w(im) j ∂a j ∂w i j Here a(jm) denotes the activation of the j th unit in th mth feature map, whereas w(im) denotes the i th element of the corresponding feature vector ) and finally z(im denotes the i th input for the j th unit in the mth feature map. j Note that δ(jm) can be computed recursively from the units in the following layer. Problem 5.29 Solution 119 It is obvious. Firstly, we know that: ∂ { } wi − µ j π j N (w i |µ j , σ2j ) = −π j N (w i |µ j , σ2j ) ∂w i σ2j We now derive the error function with respect to w i : e ∂E ∂w i = = = = = = = = ∂E ∂w i ∂E ∂w i ∂E ∂w i ∂E ∂w i ∂E ∂w i ∂E ∂w i ∂E ∂w i ∂E ∂w i + ∂λΩ(w) −λ −λ ∂w i ∂ ∂w i ∂ { ∑ { ∂w i i ( ln ( ln j =1 M ∑ j =1 j =1 π j N + λ ∑M +λ +λ )} π j N (w i |µ j , σ2j ) π jN )} (w i |µ j , σ2j ) ∂ 1 − λ ∑M +λ M ∑ (w i |µ j , σ2j ) ∂w i { M ∑ 1 { M ∑ j =1 πj 2 j =1 π j N (w i |µ j , σ j ) j =1 ∑M w i −µ j 2 j =1 π j σ2 N (w i |µ j , σ j ) ∑ M ∑ j =1 M ∑ j =1 ∑ } π jN (w i |µ j , σ2j ) wi − µ j σ2j } N (w i |µ j , σ2j ) j k πk N (w i |µk , σ2k ) π j N (w i |µ j , σ2j ) k πk N γ j (w i ) (w i |µk , σ2k ) wi − µ j σ2j wi − µ j σ2j Where we have used (5.138) and defined (5.140). Problem 5.30 Solution Is is similar to the previous problem. 
Since we know that: } wi − µ j ∂ { π j N (w i |µ j , σ2j ) = π j N (w i |µ j , σ2j ) ∂µ j σ2j 120 We can derive: e ∂E ∂µ j = ∂λΩ(w) ∂µ j = −λ { ∑ ∂ ∂µ j = −λ i ∑M { ln ( M ∑ j =1 )} π jN M ∑ j =1 j =1 π j N 1 (w i |µ j , σ2j ) )} π j N (w i |µ j , σ2j ) ∂ (w i |µ j , σ2j ) ∂µ j { M ∑ j =1 } π jN (w i |µ j , σ2j ) wi − µ j πj N (w i |µ j , σ2j ) 2 2 σ π N ( w | µ , σ ) j i j i j j =1 j 2 π j N ( w i |µ j , σ j ) ∑ ∑ µ j − wi µ j − wi λ ∑K = λ γ j (w i ) 2 2 σj σ2j i i k=1 π k N (w i |µ k , σ k ) = −λ = ∑ ln i ∑ ∂ = −λ i ∂µ j ∑ ( ∑M 1 Note that there is a typo in (5.142). The numerator should be µ j − w i instead of µ i − w j . This can be easily seen through the fact that the mean and variance of the Gaussian Distribution should have the same subindex and since σ j is in the denominator, µ j should occur in the numerator instead of µi . Problem 5.31 Solution It is similar to the previous problem. Since we know that: ( ) } ∂ { 1 (w i − µ j )2 2 π j N ( w i |µ j , σ j ) = − + π j N (w i |µ j , σ2j ) ∂σ j σj σ3j 121 We can derive: e ∂E ∂σ j = ∂λΩ(w) ∂σ j = −λ { ∑ ∂ ∂σ j = −λ i = = { ∑M ln ( M ∑ j =1 M ∑ j =1 j =1 π j N 1 )} π jN (w i |µ j , σ2j ) )} π j N (w i |µ j , σ2j ) ∂ (w i |µ j , σ2j ) ∂σ j { M ∑ j =1 } π jN (w i |µ j , σ2j ) } ∂ { π j N (w i |µ j , σ2j ) 2 ∂σ j i j =1 π j N (w i |µ j , σ j ) ( ) ∑ 1 (w i − µ j )2 1 λ ∑M π j N (w i |µ j , σ2j ) − 3 2) σ σ π N ( w | µ , σ j i j i j j =1 j j ( ) 2 π j N ( w i |µ j , σ j ) ∑ 1 ( w i − µ j )2 λ ∑M − 2) σ σ3j π N ( w | µ , σ j i i k k k=1 k ) ( ∑ 1 (w i − µ j )2 − λ γ j (w i ) σj σ3j i = −λ = ln i ∑ ∂ = −λ i ∂σ j ∑ ( ∑ ∑M 1 Just as required. Problem 5.32 Solution It is trivial. We begin by verifying (5.208) when j ̸= k. } { ∂πk ∂ exp(η k ) = ∑ ∂η j ∂η j k exp(η k ) − exp(η k ) exp(η j ) = [∑ ]2 k exp(η k ) = −π j πk And if now we have j = k: { } ∂πk exp(η k ) ∂ = ∑ ∂η k ∂η k k exp(η k ) [∑ ] exp(η k ) k exp(η k ) − exp(η k ) exp(η k ) = [∑ ]2 k exp(η k ) = πk − πk πk If we combine these two cases, we can easily see that (5.208) holds. Now 122 we prove (5.147). e ∂E ∂η j = λ ∂Ω(w) = −λ ∂η j ∂ { ∑ ∂η j = −λ i = −λ ∑ i = −λ ∑ i = = { { j =1 ln M ∑ j =1 }} π j N (w i |µ j , σ2j ) }} π jN (w i |µ j , σ2j ) ∂ 1 ∑M j =1 π j N (w i |µ j , σ2j ) ∂η j { M ∑ k=1 } πk N (w i |µk , σ2k ) M ∂ { ∑ } πk N (w i |µk , σ2k ) 2 ∂η j j =1 π j N (w i |µ j , σ j ) k=1 1 ∑M M ∑ } ∂πk ∂ { πk N (w i |µk , σ2k ) 2 ∂πk ∂η j j =1 π j N (w i |µ j , σ j ) k=1 1 ∑M M ∑ 1 N (w i |µk , σ2k )(δ jk π j − π j πk ) 2) π N ( w | µ , σ i j i j =1 j j k=1 { } M ∑ ∑ 1 2 2 πk N (w i |µk , σk )) π j N ( w i |µ j , σ j ) − π j −λ ∑ M 2 i k=1 j =1 π j N (w i |µ j , σ j ) { } ∑ π j N (w i |µ j , σ2j ) ∑ π j kM=1 πk N (w i |µk , σ2k )) −λ − ∑M ∑M 2 2 i j =1 π j N (w i |µ j , σ j ) j =1 π j N (w i |µ j , σ j ) ∑{ ∑ } { } −λ γ j (w i ) − π j = λ π j − γ j (w i ) = −λ = ∑ M ∑ ln i ∑ ∂ = −λ i ∂η j ∑ { ∑M i i Just as required. Problem 5.33 Solution It is trivial. We set the attachment point of the lower arm with the ground as the origin of the coordinate. We first aim to find the vertical distance from the origin to the target point, and this is also the value of x2 . x2 = L 1 sin(π − θ1 ) + L 2 sin(θ2 − (π − θ1 )) = L 1 sin θ1 − L 2 sin(θ1 + θ2 ) Similarly, we calculate the horizontal distance from the origin to the target point. x1 = −L 1 cos(π − θ1 ) + L 2 cos(θ2 − (π − θ1 )) = L 1 cos θ1 − L 2 cos(θ1 + θ2 ) From these two equations, we can clearly see the ’forward kinematics’ of the robot arm. 
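(A short sketch of this forward mapping; the link lengths and joint angles are arbitrary illustration values. It also checks that the squared distance to the target depends on θ2 only through cos θ2, which is exactly why the inverse problem is multi-valued.)

```python
import numpy as np

def forward_kinematics(theta1, theta2, L1=1.0, L2=0.7):
    # 'forward kinematics' of the two-link arm with the angle conventions used in Prob.5.33
    x1 = L1 * np.cos(theta1) - L2 * np.cos(theta1 + theta2)
    x2 = L1 * np.sin(theta1) - L2 * np.sin(theta1 + theta2)
    return np.array([x1, x2])

theta1, theta2 = np.deg2rad(110.0), np.deg2rad(50.0)
p = forward_kinematics(theta1, theta2)
print(p)                                   # Cartesian target position (x1, x2)

# |p|^2 = L1^2 + L2^2 - 2 L1 L2 cos(theta2): the radius is unchanged under theta2 -> -theta2
L1, L2 = 1.0, 0.7
print(np.isclose(p @ p, L1**2 + L2**2 - 2 * L1 * L2 * np.cos(theta2)))   # True
```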
123 Problem 5.34 Solution By analogy with (5.208), we can write: ∂πk (x) ∂aπj = δ jk π j (x) − π j (x)πk (x) Using (5.153), we can see that: { } K ∑ 2 E n = − ln πk N (tn |µk , σk ) k=1 Therefore, we can derive: { } K ∑ ∂E n ∂ 2 = − π ln πk N (tn |µk , σk ) ∂aπj ∂a j k=1 { } K ∑ 1 ∂ 2 πk N (tn |µk , σk ) = − ∑K π πk N (tn |µk , σ2 ) ∂a j k=1 k=1 k K ∂π ∑ k 2 π N (t n |µ k , σ k ) 2) ∂ a π N (t | µ , σ n k j k=1 k k k=1 = − ∑K = − ∑K 1 1 k=1 π k N K [ ∑ ] δ jk π j (xn ) − π j (xn )πk (xn ) N (tn |µk , σ2k ) (tn |µk , σ2k ) k=1 { } K ∑ 1 = − ∑K π j (xn )N (tn |µ j , σ2j ) − π j (xn ) πk (xn )N (tn |µk , σ2k ) 2) π N (t | µ , σ n k k = 1 k k=1 k { } K ∑ 1 2 2 = ∑K −π j (xn )N (tn |µ j , σ j ) + π j (xn ) πk (xn )N (tn |µk , σk ) 2) π N (t | µ , σ n k k = 1 k k=1 k And if we denoted (5.154), we will have: ∂E n ∂aπj = −γ j + π j Note that our result is slightly different from (5.155) by the subindex. But there are actually the same if we substitute index j by index k in the final expression. Problem 5.35 Solution We deal with the derivative of error function with respect to µk instead, which will give a vector as result. Furthermore, the l th element of this vector will be what we have been required. Since we know that: } tn − µk ∂ { πk N (tn |µk , σ2k ) = πk N (tn |µk , σ2k ) 2 ∂µk σk 124 One thing worthy noticing is that here we focus on the isotropic case as stated in page 273 of the textbook. To be more precise, N (tn |µk , σ2k ) should be N (tn |µk , σ2k I). Provided with the equation above, we can further obtain: ∂E n ∂µk = { ∂ − ln ∂µk K ∑ k=1 } πk N (tn |µk , σ2k ) ∂ 1 = − ∑K { K ∑ } (tn |µk , σ2k ) πk N (tn |µk , σ2k ) ∂µk k=1 tn − µk 1 · πk N (tn |µk , σ2k ) = − ∑K 2 2 σ π N (t | µ , σ ) n k k k=1 k k k=1 π k N = −γk tn − µk σ2k Hence noticing (5.152), the l th element of the result above is what we are required. ∂E n ∂E n µkl − tl = γk µ = ∂µkl σ2k ∂a kl Problem 5.36 Solution Similarly, we know that: ∂ { ∂σk πk N (tn |µk , σ2k ) } { = } D ||tn − µk ||2 − + πk N (tn |µk , σ2k ) 3 σk σk Therefore, we can obtain: } { K ∑ ∂E n ∂ 2 πk N (tn |µk , σk ) − ln = ∂σk ∂σk k=1 { } K ∑ 1 ∂ = − ∑K πk N (tn |µk , σ2k ) 2 ) ∂σ π N (t | µ , σ k n k k = 1 k k=1 k } { 1 D ||tn − µk ||2 + πk N (tn |µk , σ2k ) = − ∑K · − 3 2 σ σk k k=1 π k N (t n |µ k , σ k ) { } 2 D ||tn − µk || = −γk − + σk σ3k Note that there is a typo in (5.157) and the underlying reason is that: = (σ2k )D |σ2k ID ×D | Problem 5.37 Solution First we know two properties for the Gaussian distribution N (t|µ, σ2 I): ∫ E[t] = tN (t|µ, σ2 I) d t = µ 125 ∫ And E[||t||2 ] = ||t||2 N (t|µ, σ2 I) d t = Lσ2 + ||µ||2 Where we have used E[tT At] = Tr[Aσ2 I] + µT Aµ by setting A = I. This property can be found in Matrixcookbook eq(378). Here L is the dimension of t. Noticing (5.148), we can write: ∫ E[t|x] = t p(t|x) d t ∫ = = = t K ∑ k=1 K ∑ πk N (t|µk , σ2k ) d t ∫ tN (t|µk , σ2k ) d t πk k=1 K ∑ k=1 πk µk Then we prove (5.160). 
( ) s2 (x) = E[||t − E[t|x]||2 |x] = E[ t2 − 2tE[t|x] + E[t|x]2 |x] = E[t2 |x] − E[2tE[t|x]|x] + E[t|x]2 = E[t2 |x] − E[t|x]2 ∫ K K ∑ ∑ = ||t||2 πk N (µk , σ2k ) d t − || πl µl ||2 k=1 = = K ∑ k=1 = L = L = = = = ||t||2 N (µk , σ2k ) d t − || πk k=1 K ∑ πk (Lσ2k + ||µk ||2 ) − || K ∑ k=1 K ∑ k=1 L K ∑ k=1 L K ∑ k=1 L K ∑ k=1 K ∑ k=1 l =1 ∫ πk σ2k + πk σ2k + πk σ2k + πk σ2k + πk σ2k + ( K ∑ k=1 K ∑ k=1 K ∑ k=1 K ∑ k=1 K ∑ k=1 K ∑ l =1 K ∑ l =1 πl µl ||2 πk ||µk ||2 − || K ∑ l =1 πl µl ||2 πk ||µk ||2 − 2 × || 2 πk ||µk || − 2( πk ||µk ||2 − 2( πk ||µk − πk Lσ2k + ||µk − K ∑ l =1 πl µl ||2 K ∑ l =1 K ∑ l =1 K ∑ l =1 K ∑ l =1 πl µl ||2 + 1 × || πl µl )( πl µl )( K ∑ k=1 K ∑ k=1 K ∑ l =1 ( πk µk ) + πk µk ) + πl µl ||2 K ∑ k=1 K ∑ k=1 ) πk || πk || K ∑ l =1 K ∑ l =1 πl µl ||2 πl µl ||2 πl µl ||2 ) πl µl ||2 Note that there is a typo in (5.160), i.e., the coefficient L in front of σ2k is missing. 126 Problem 5.38 Solution From (5.167) and (5.171), we can write down the expression for the predictive distribution: ∫ p( t|x, D, α, β) = p(w|D, α, β) p( t|x, w, β) d w ∫ ≈ q(w|D ) p( t|x, w, β) d w ∫ = N (w|wMAP , A−1 )N ( t|gT w − gT wMAP + y(x, wMAP ), β−1 ) d w Note here p( t|x, w, β) is given by (5.171) and q(w|D ) is the approximation to the posterior p(w|D, α, β), which is given by (5.167). Then by analogy with (2.115), we first deal with the mean of the predictive distribution: mean = gT w − gT wMAP + y(x, wMAP )|w = wMAP = y(x, wMAP ) Then we deal with the covariance matrix: Covariance matrix = β−1 + gT A−1 g Just as required. Problem 5.39 Solution Using Laplace Approximation, we can obtain: p(D |w, β) p(w|α) = { } p(D |wMAP , β) p(wMAP |α) exp −(w − wMAP )T A(w − wMAP ) Then using (5.174), (5.162) and (5.163), we can obtain: ∫ p(D |α, β) = p(D |w, β) p(w, α) d w ∫ { } = p(D |wMAP , β) p(wMAP |α) exp −(w − wMAP )T A(w − wMAP ) d w = = p(D |wMAP , β) p(wMAP |α) N ∏ n=1 (2π)W /2 |A|1/2 N ( t n | y(xn , wMAP ), β−1 )N (wMAP |0, α−1 I) (2π)W /2 |A|1/2 If we take logarithm of both sides, we will obtain (5.175) just as required. Problem 5.40 Solution For a k-class classification problem, we need to use softmax activation function and also the error function is now given by (5.24). Therefore, the 127 Hessian matrix should be derived from (5.24) and the cross entropy in (5.184) will also be replaced by (5.24). Problem 5.41 Solution By analogy to Prob.5.39, we can write: p(D |α) = p(D |wMAP ) p(wMAP |α) (2π)W /2 |A|1/2 Since we know that the prior p(w|α) follows a Gaussian distribution, i.e., (5.162), as stated in the text. Therefore we can obtain: 1 ln |A| + const 2 W 1 α ln α − ln |A| + const = ln p(D |wMAP ) − wT w + 2 2 2 W 1 = −E (wMAP ) + ln α − ln |A| + const 2 2 ln p(D |α) = ln p(D |wMAP ) + ln p(wMAP |α) − Just as required. 0.6 Kernel Methods Problem 6.1 Solution Recall that in section.6.1, a n can be written as (6.4). We can derive: an 1 = − {wT ϕ(xn ) − t n } λ 1 = − {w1 ϕ1 (xn ) + w2 ϕ2 (xn ) + ... + w M ϕ M (xn ) − t n } λ w1 w2 wM tn ϕ1 (xn ) − ϕ2 (xn ) − ... − ϕ M (xn ) + = − λ λ λ λ w1 w2 wM = (cn − )ϕ1 (xn ) + ( c n − )ϕ2 (xn ) + ... + ( c n − )ϕ M (xn ) λ λ λ Here we have defined: cn = t n /λ ϕ1 (xn ) + ϕ2 (xn ) + ... + ϕ M (xn ) From what we have derived above, we can see that a n is a linear combination of ϕ(xn ). What’s more, we first substitute K = ΦΦT into (6.7), and then we will obtain (6.5). Next we substitute (6.3) into (6.5) we will obtain (6.2) just as required. 
Problem 6.2 Solution 128 If we set w(0) = 0 in (4.55), we can obtain: w(τ+1) = N ∑ n=1 η c n t n ϕn where N is the total number of samples and c n is the times that t n ϕn has been added from step 0 to step τ + 1. Therefore, it is obvious that we have: w= N ∑ n=1 αn t n ϕn We further substitute the expression above into (4.55), which gives: N ∑ n=1 α(nτ+1) t n ϕn = N ∑ n=1 α(nτ) t n ϕn + η t n ϕn In other words, the update process is to add learning rate η to the coefficient αn corresponding to the misclassified pattern xn , i.e., α(nτ+1) = α(nτ) + η Now we similarly substitute it into (4.52): = f ( wT ϕ(x) ) N ∑ α n t n ϕT f( n ϕ(x) ) = f( y(x) = n=1 N ∑ n=1 αn t n k(xn , x) ) Problem 6.3 Solution We begin by expanding the Euclidean metric. ||x − xn ||2 = (x − xn )T (x − xn ) = (xT − xT n )(x − x n ) T = xT x − 2xT n x + xn xn Similar to (6.24)-(6.26), we use a nonlinear kernel k(xn , x) to replace xT n x, which gives a general nonlinear nearest-neighbor classifier with cost function defined as: k(x, x) + k(xn , xn ) − 2 k(xn , x) Problem 6.4 Solution To construct such a matrix, let us suppose the two eigenvalues are 1 and 2, and the matrix has form: [ ] a b c d 129 Therefore, based on the definition of eigenvalue, we have two equations: { (a − 2)( d − 2) = bc (1) (a − 1)( d − 1) = bc (2) (2)-(1), yielding: a+d =3 Therefore, we set a = 4 and d = −1. Then we substitute them into (1), and thus we see: bc = −6 Finally, we choose b = 3 and c = −2. The constructed matrix is: [ ] 4 3 −2 −1 Problem 6.5 Solution Since k 1 (x, x′ ) is a valid kernel, it can be written as: k 1 (x, x′ ) = ϕ(x)T ϕ(x′ ) We can obtain: k(x, x′ ) = ck 1 (x, x′ ) = [p ]T [p ] cϕ(x) cϕ(x′ ) Therefore, (6.13) is a valid kernel. It is similar for (6.14): [ ]T [ ] k(x, x′ ) = f (x) k 1 (x, x′ ) f (x′ ) = f (x)ϕ(x) f (x′ )ϕ(x′ ) Just as required. Problem 6.6 Solution We suppose q( x) can be written as: q( x) = a n x n + a n−1 x n−1 + ... + a 1 x + a 0 We now obtain: k(x, x′ ) = a n k 1 (x, x′ ) + a n−1 k 1 (x, x′ ) n n−1 + ... + a 1 k 1 (x, x′ ) + a 0 By repeatedly using (6.13), (6.17) and (6.18), we can easily verify k(x, x′ ) is a valid kernel. For (6.16), we can use Taylor expansion, and since the coefficients of Taylor expansion are all positive, we can similarly prove its validity. Problem 6.7 Solution To prove (6.17), we will use the property stated below (6.12). Since we know k 1 (x, x′ ) and k 2 (x, x′ ) are valid kernels, their Gram matrix K1 and K2 130 are both positive semidefinite. Given the relation (6.12), it can be easily shown K = K1 + K2 is also positive semidefinite and thus k(x, x′ ) is also a valid kernel. To prove (6.18), we assume the map function for kernel k 1 (x, x′ ) is ϕ(1) (x), and similarly ϕ(2) (x) for k 2 (x, x′ ). Moreover, we further assume the dimension of ϕ(1) (x) is M , and ϕ(2) (x) is N . We expand k(x, x′ ) based on (6.18): k(x, x′ ) = k 1 (x, x′ ) k 2 (x, x′ ) = ϕ(1) (x)T ϕ(1) (x′ )ϕ(2) (x)T ϕ(2) (x′ ) M N ∑ ∑ (1) ′ = ϕ(1) (x) ϕ (x ) ϕ(2) (x)ϕ(2) (x′ ) i i j j i =1 = = j =1 M ∑ N [ ∑ i =1 j =1 MN ∑ k=1 ϕ(1) (x)ϕ(2) (x) i j ][ ] ′ (2) ′ ϕ(1) (x ) ϕ (x ) i j ϕk (x)ϕk (x′ ) = ϕ(x)T ϕ(x′ ) where ϕ(1) (x) is the i th element of ϕ(1) (x), and ϕ(2) (x) is the j th element i j of ϕ(2) (x). To be more specific, we have proved that k(x, x′ ) can be written as ϕ(x)T ϕ(x′ ). Here ϕ(x) is a MN ×1 column vector, and the kth ( k = 1, 2, ..., MN ) element is given by ϕ(1) (x) × ϕ(2) (x). 
What’s more, we can also express i, j in i j terms of k: i = ( k − 1) ⊘ N + 1 and j = ( k − 1) ⊙ N + 1 where ⊘ and ⊙ means integer division and remainder, respectively. Problem 6.8 Solution For (6.19) we suppose k 3 (x, x′ ) = g(x)T g(x′ ), and thus we have: k(x, x′ ) = k 3 (ϕ(x), ϕ(x′ )) = g(ϕ(x))T g(ϕ(x′ )) = f (x)T f (x′ ) where we have denoted g(ϕ(x)) = f (x) and now it is obvious that (6.19) holds. To prove (6.20), we suppose x is a N × 1 column vector and A is a N × N symmetric positive semidefinite matrix. We know that A can be decomposed to QBQT . Here Q is a N × N orthogonal matrix, and B is a N × N diagonal matrix whose elements are no less than 0. Now we can derive: k(x, x′ ) = xT Ax′ = xT QBQT x′ = (QT x)T B(QT x′ ) = yT By′ N N √ √ ∑ ∑ ′ ′ B ii yi yi = ( B ii yi )( B ii yi ) = ϕ(x)T ϕ(x′ ) = i =1 i =1 To be more specific, we have proved that k(x, x′ ) = ϕ(x)T ϕ(x′ ), and here ϕ vector, whose i th ( i = 1, 2, ..., N ) element is given by √(x) is a N ×√1 column B ii yi , i.e., B ii (QT x) i . 131 Problem 6.9 Solution To prove (6.21), let’s first expand the expression: ′ k(x, x′ ) = ′ k a (xa , xa ) + k b (xb , xb ) M ∑ = i =1 M∑ +N = ′ ϕ(ia) (xa )ϕ(ia) (xa ) + k=1 N ∑ ′ j =1 ϕ(ib) (xb )ϕ(ib) (xb ) ′ ϕk (x)ϕk (x ) = ϕ(x)T ϕ(x′ ) where we have assumed the dimension of xa is M and the dimension of xb is N . The mapping function ϕ(x) is a ( M + N ) × 1 column vector, whose kth ( k = 1, 2, ..., M + N ) element ϕk (x) is: { 1≤k≤M ϕ(ka) (x) ϕk (x) = ϕ(kb−) M (xa ) M + 1 ≤ k ≤ M + N (6.22) is quite similar to (6.18). We follow the same procedure: k(x, x′ ) = = = = ′ ′ k a (xa , xa ) k b (xb , xb ) M ∑ i =1 ′ ϕ(ia) (xa )ϕ(ia) (xa ) M ∑ N [ ∑ i =1 j =1 MN ∑ k=1 N ∑ j =1 ′ ϕ(jb) (xb )ϕ(jb) (xb ) ϕ(ia) (xa )ϕ(jb) (xb ) ][ ′ ′ ϕ(ia) (xa )ϕ(jb) (xb ) ] ϕk (x)ϕk (x′ ) = ϕ(x)T ϕ(x′ ) By analogy to (6.18), the mapping function ϕ(x) is a MN ×1 column vector, whose kth ( k = 1, 2, ..., MN ) element ϕk (x) is: ϕk (x) = ϕ(ia) (xa ) × ϕ(jb) (xb ) To be more specific, xa is the sub-vector of x made up of the first M element of x, and xb is the sub-vector of x made up of the last N element of x. What’s more, we can also express i, j in terms of k: i = ( k − 1) ⊘ N + 1 and j = ( k − 1) ⊙ N + 1 where ⊘ and ⊙ means integer division and remainder, respectively. Problem 6.10 Solution According to (6.9), we have: T −1 y(x) = k(x) (K + λI N ) T t = k(x) a = N ∑ n=1 [ f (xn ) · f (x) · a n = N ∑ n=1 ] f (xn ) · ·a n f (x) 132 We see that if we choose k(x, x′ ) = f (x) f (x′ ) we will always find a solution y(x) proportional to f (x). Problem 6.11 Solution We follow the hint. k(x, x′ ) = exp(−xT x/2σ2 ) · exp(xT x′ /σ2 ) · exp(−(x′ )T x′ /2σ2 ) x T x′ 2 T ′ ) ( x x 2 = exp(−xT x/2σ2 ) · 1 + 2 + σ + · · · · exp(−(x′ )T x′ /2σ2 ) 2! σ = ϕ(x)T ϕ(x′ ) where ϕ(x) is a column vector with infinite dimension. To be more specific, (6.12) gives a simple example on how to decompose (xT x′ )2 . In our case, we can also decompose (xT x′ )k , k = 1, 2, ..., ∞ in the similar way. However, since k → ∞, i.e., the decomposition will consist monomials with infinite degree. Thus, there will be infinite terms in the decomposition and the feature mapping function ϕ(x) will have infinite dimension. Problem 6.12 Solution First, let’s explain the problem a little bit. 
According to (6.27), what we need to prove is

$$k(A_1,A_2) = 2^{|A_1\cap A_2|} = \boldsymbol{\phi}(A_1)^{\mathrm T}\boldsymbol{\phi}(A_2).$$

The biggest difference from the previous problems is that $\boldsymbol{\phi}(A)$ is a $2^{|D|}\times 1$ column vector which, instead of being indexed by $1,2,\dots,2^{|D|}$, is indexed by $\{U\,|\,U\subseteq D\}$ (the set of all subsets of $D$, of which there are $2^{|D|}$). Therefore, according to (6.95),

$$\boldsymbol{\phi}(A_1)^{\mathrm T}\boldsymbol{\phi}(A_2) = \sum_{U\subseteq D}\phi_U(A_1)\,\phi_U(A_2).$$

The summation iterates over all possible subsets of $D$. A term equals 1 if and only if the current subset $U$ is a subset of both $A_1$ and $A_2$ simultaneously, so the sum counts the subsets of $D$ contained in $A_1\cap A_2$. Since $A_1$ and $A_2$ are both subsets of $D$, this count is exactly $2^{|A_1\cap A_2|}$, i.e.,

$$\boldsymbol{\phi}(A_1)^{\mathrm T}\boldsymbol{\phi}(A_2) = 2^{|A_1\cap A_2|},$$

just as required.

Problem 6.13 Solution

(Wait for update.)

Problem 6.14 Solution

Since the covariance matrix $\mathbf{S}$ is fixed, according to (6.32) we obtain

$$\mathbf{g}(\boldsymbol{\mu},\mathbf{x}) = \nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}|\boldsymbol{\mu}) = \frac{\partial}{\partial\boldsymbol{\mu}}\Big(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}\mathbf{S}^{-1}(\mathbf{x}-\boldsymbol{\mu})\Big) = \mathbf{S}^{-1}(\mathbf{x}-\boldsymbol{\mu}).$$

Therefore, according to (6.34),

$$\mathbf{F} = \mathbb{E}_{\mathbf{x}}\big[\mathbf{g}(\boldsymbol{\mu},\mathbf{x})\,\mathbf{g}(\boldsymbol{\mu},\mathbf{x})^{\mathrm T}\big] = \mathbf{S}^{-1}\,\mathbb{E}_{\mathbf{x}}\big[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}\big]\,\mathbf{S}^{-1}.$$

Since $\mathbf{x}\sim\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\mathbf{S})$, we have $\mathbb{E}_{\mathbf{x}}[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}] = \mathbf{S}$, so $\mathbf{F} = \mathbf{S}^{-1}$, and then according to (6.33)

$$k(\mathbf{x},\mathbf{x}') = \mathbf{g}(\boldsymbol{\mu},\mathbf{x})^{\mathrm T}\mathbf{F}^{-1}\mathbf{g}(\boldsymbol{\mu},\mathbf{x}') = (\mathbf{x}-\boldsymbol{\mu})^{\mathrm T}\mathbf{S}^{-1}(\mathbf{x}'-\boldsymbol{\mu}).$$

Problem 6.15 Solution

We restate the problem: we are required to prove that the Gram matrix

$$\mathbf{K} = \begin{pmatrix} k_{11} & k_{12}\\ k_{21} & k_{22}\end{pmatrix},$$

where $k_{ij}$ ($i,j = 1,2$) is short for $k(x_i,x_j)$, is positive semidefinite. A $2\times 2$ positive semidefinite matrix has a non-negative determinant, i.e., $k_{12}k_{21}\le k_{11}k_{22}$. Using the symmetry of the kernel, $k_{12} = k_{21}$, we obtain $k(x_1,x_2)^2 \le k(x_1,x_1)\,k(x_2,x_2)$, which is what was required.

Problem 6.16 Solution

Based on the total derivative of $f$,

$$f\big((\mathbf{w}+\Delta\mathbf{w})^{\mathrm T}\boldsymbol{\phi}_1,\dots,(\mathbf{w}+\Delta\mathbf{w})^{\mathrm T}\boldsymbol{\phi}_N\big) \approx f\big(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_1,\dots,\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_N\big) + \Big[\sum_{n=1}^{N}\frac{\partial f}{\partial(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n)}\,\boldsymbol{\phi}_n^{\mathrm T}\Big]\Delta\mathbf{w},$$

where $\boldsymbol{\phi}_n$ is short for $\boldsymbol{\phi}(\mathbf{x}_n)$. Based on the equation above,

$$\nabla_{\mathbf{w}} f = \sum_{n=1}^{N}\frac{\partial f}{\partial(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n)}\,\boldsymbol{\phi}_n^{\mathrm T}.$$

Now we focus on the derivative of $g$ with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}} g = \frac{\partial g}{\partial(\mathbf{w}^{\mathrm T}\mathbf{w})}\cdot 2\mathbf{w}^{\mathrm T}.$$

To find the optimal $\mathbf{w}$, we set the derivative of $J$ with respect to $\mathbf{w}$ to zero:

$$\nabla_{\mathbf{w}} J = \nabla_{\mathbf{w}} f + \nabla_{\mathbf{w}} g = \sum_{n=1}^{N}\frac{\partial f}{\partial(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n)}\,\boldsymbol{\phi}_n^{\mathrm T} + 2a\,\mathbf{w}^{\mathrm T} = \mathbf{0},$$

where we have defined $a = \partial g/\partial(\mathbf{w}^{\mathrm T}\mathbf{w})$; since $g$ is a monotonically increasing function, $a > 0$. Rearranging,

$$\mathbf{w} = -\frac{1}{2a}\sum_{n=1}^{N}\frac{\partial f}{\partial(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n)}\,\boldsymbol{\phi}_n,$$

so the solution takes the form of a linear combination of the $\boldsymbol{\phi}(\mathbf{x}_n)$, as required.

Problem 6.17 Solution

We consider a variation in the function $y(\mathbf{x})$ of the form $y(\mathbf{x})\to y(\mathbf{x}) + \epsilon\eta(\mathbf{x})$. Substituting it into (6.39) yields

$$E[y+\epsilon\eta] = \frac{1}{2}\sum_{n=1}^{N}\int\big\{y(\mathbf{x}_n+\boldsymbol{\xi}) + \epsilon\eta(\mathbf{x}_n+\boldsymbol{\xi}) - t_n\big\}^2 v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = E[y] + \epsilon\sum_{n=1}^{N}\int\big\{y(\mathbf{x}_n+\boldsymbol{\xi}) - t_n\big\}\,\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} + O(\epsilon^2).$$

Several clarifications must be made here. We perturb the function $y$ by a small amount $\epsilon\eta$ and expand the corresponding error in the small parameter $\epsilon$. The coefficient of $\epsilon$ is the first derivative of $E[y+\epsilon\eta]$ with respect to $\epsilon$ at $\epsilon = 0$.
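The counting argument in Problem 6.12 is easy to check mechanically. Below is a small sketch, not from the book, that computes the inner product of the subset-indexed feature vectors directly and compares it with $2^{|A_1\cap A_2|}$; the set $D$ and the subsets $A_1,A_2$ are arbitrary illustrative choices.

```python
from itertools import chain, combinations

def powerset(D):
    """All subsets of the finite set D."""
    s = list(D)
    return [frozenset(c) for c in chain.from_iterable(
        combinations(s, r) for r in range(len(s) + 1))]

def phi(A, D):
    """Feature vector of A indexed by the subsets U of D: phi_U(A) = 1 iff U is a subset of A."""
    return [1 if U <= frozenset(A) else 0 for U in powerset(D)]

D = {1, 2, 3, 4}
A1, A2 = {1, 2, 3}, {2, 3, 4}

inner = sum(a * b for a, b in zip(phi(A1, D), phi(A2, D)))
direct = 2 ** len(set(A1) & set(A2))
print(inner, direct)   # both give 4 = 2^{|A1 ∩ A2|}
```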
Since $y$ is the optimal function, i.e., the one that makes $E$ smallest, the first derivative of $E[y+\epsilon\eta]$ with respect to $\epsilon$ must vanish at $\epsilon = 0$, which gives

$$\sum_{n=1}^{N}\int\big\{y(\mathbf{x}_n+\boldsymbol{\xi}) - t_n\big\}\,\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = 0.$$

We now need a function $y$ that satisfies this equation for every choice of $\eta$. Choosing $\eta(\mathbf{x}) = \delta(\mathbf{x}-\mathbf{z})$ allows us to evaluate the integral:

$$\sum_{n=1}^{N}\int\big\{y(\mathbf{x}_n+\boldsymbol{\xi}) - t_n\big\}\,\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = \sum_{n=1}^{N}\big\{y(\mathbf{z}) - t_n\big\}\,v(\mathbf{z}-\mathbf{x}_n).$$

Setting this to zero and rearranging finally gives (6.40), just as required.

Problem 6.18 Solution

According to the main text below Eq (6.48), $f(x,t)$, i.e., $f(\mathbf{z})$ with $\mathbf{z} = (x,t)$, follows a zero-mean isotropic Gaussian, $f(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0},\sigma^2\mathbf{I})$. Then $f(x-x_m,\,t-t_m)$, i.e., $f(\mathbf{z}-\mathbf{z}_m)$ with $\mathbf{z}_m = (x_m,t_m)$, also satisfies a Gaussian distribution, $f(\mathbf{z}-\mathbf{z}_m) = \mathcal{N}(\mathbf{z}|\mathbf{z}_m,\sigma^2\mathbf{I})$. The integral $\int f(\mathbf{z}-\mathbf{z}_m)\,dt$ corresponds to the marginal distribution over the remaining variable $x$, so

$$\int f(\mathbf{z}-\mathbf{z}_m)\,dt = \mathcal{N}(x|x_m,\sigma^2).$$

Substituting all the expressions into Eq (6.48) gives

$$p(t|x) = \frac{p(t,x)}{\int p(t,x)\,dt} = \frac{\sum_n \mathcal{N}(\mathbf{z}|\mathbf{z}_n,\sigma^2\mathbf{I})}{\sum_m \mathcal{N}(x|x_m,\sigma^2)} = \frac{\sum_n \frac{1}{2\pi\sigma^2}\exp\big(-\frac{(x-x_n)^2}{2\sigma^2}\big)\exp\big(-\frac{(t-t_n)^2}{2\sigma^2}\big)}{\sum_m \frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big(-\frac{(x-x_m)^2}{2\sigma^2}\big)} = \sum_n \pi_n\,\mathcal{N}(t|t_n,\sigma^2),$$

where we have defined

$$\pi_n = \frac{\exp\big(-\frac{1}{2\sigma^2}(x-x_n)^2\big)}{\sum_m \exp\big(-\frac{1}{2\sigma^2}(x-x_m)^2\big)}, \qquad \sum_n \pi_n = 1.$$

Therefore the conditional distribution $p(t|x)$ is a Gaussian mixture. Similarly, Eq (6.46) takes the specific form

$$k(x,x_n) = \frac{\int f(x-x_n,t)\,dt}{\sum_m \int f(x-x_m,t)\,dt} = \frac{\mathcal{N}(x|x_n,\sigma^2)}{\sum_m \mathcal{N}(x|x_m,\sigma^2)} = \pi_n.$$

In other words, the conditional distribution can be written more precisely as

$$p(t|x) = \sum_n k(x,x_n)\,\mathcal{N}(t|t_n,\sigma^2).$$

Its mean is

$$\mathbb{E}[t|x] = \sum_n k(x,x_n)\,t_n,$$

and, since the second moment of $\mathcal{N}(t|t_n,\sigma^2)$ is $t_n^2+\sigma^2$, its variance is

$$\operatorname{var}[t|x] = \mathbb{E}[t^2|x] - \mathbb{E}[t|x]^2 = \sum_n k(x,x_n)\,(t_n^2 + \sigma^2) - \Big(\sum_n k(x,x_n)\,t_n\Big)^2.$$

Problem 6.19 Solution

Similar to Prob. 6.17, it is straightforward to show that

$$y(\mathbf{x}) = \sum_n t_n k(\mathbf{x},\mathbf{x}_n), \qquad \text{where}\quad k(\mathbf{x},\mathbf{x}_n) = \frac{g(\mathbf{x}_n-\mathbf{x})}{\sum_m g(\mathbf{x}_m-\mathbf{x})}.$$

Problem 6.20 Solution

Since $\mathbf{t}_{N+1} = (t_1,t_2,\dots,t_N,t_{N+1})^{\mathrm T}$ follows the Gaussian distribution $\mathcal{N}(\mathbf{t}_{N+1}|\mathbf{0},\mathbf{C}_{N+1})$ given in Eq (6.64), if we rearrange its order by moving the last element $t_{N+1}$ to the first position, the rearranged vector $\bar{\mathbf{t}}_{N+1} = (t_{N+1},t_1,t_2,\dots,t_N)^{\mathrm T}$ also satisfies a Gaussian distribution, $\bar{\mathbf{t}}_{N+1}\sim\mathcal{N}(\bar{\mathbf{t}}_{N+1}|\mathbf{0},\bar{\mathbf{C}}_{N+1})$, where

$$\bar{\mathbf{C}}_{N+1} = \begin{pmatrix} c & \mathbf{k}^{\mathrm T}\\ \mathbf{k} & \mathbf{C}_N\end{pmatrix},$$

and $\mathbf{k}$ and $c$ are given in the main text below Eq (6.65). The conditional distribution $p(t_{N+1}|\mathbf{t}_N)$ is therefore also Gaussian. By analogy to Eq (2.94)–(2.98), we simply treat $t_{N+1}$ as $\mathbf{x}_a$, $\mathbf{t}_N$ as $\mathbf{x}_b$, $c$ as $\Sigma_{aa}$, $\mathbf{k}$ as $\Sigma_{ba}$, $\mathbf{k}^{\mathrm T}$ as $\Sigma_{ab}$ and $\mathbf{C}_N$ as $\Sigma_{bb}$.
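The Gaussian-mixture form of $p(t|x)$ derived in Problem 6.18 is the Nadaraya-Watson model, and its conditional mean and variance are simple to compute. Below is a minimal sketch, not from the book, with an illustrative toy data set; the function name and bandwidth are my own choices.

```python
import numpy as np

def nadaraya_watson(x_query, x_train, t_train, sigma=1.0):
    """Nadaraya-Watson estimate: E[t|x] = sum_n k(x, x_n) t_n with Gaussian components."""
    # Unnormalized Gaussian weights exp(-(x - x_n)^2 / (2 sigma^2))
    w = np.exp(-0.5 * ((x_query - x_train) / sigma) ** 2)
    k = w / w.sum()                                             # k(x, x_n), sums to one
    mean = np.sum(k * t_train)                                  # conditional mean
    var = np.sum(k * (t_train ** 2 + sigma ** 2)) - mean ** 2   # conditional variance
    return mean, var

# toy data: noisy sine
rng = np.random.default_rng(1)
x_train = np.linspace(0.0, 2 * np.pi, 30)
t_train = np.sin(x_train) + 0.1 * rng.normal(size=x_train.shape)
print(nadaraya_watson(np.pi / 2, x_train, t_train, sigma=0.3))
```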
Substituting them into Eq (2.79) and Eq (2.80) yields

$$\Lambda_{aa} = \big(c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k}\big)^{-1}, \qquad \Lambda_{ab} = -\big(c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k}\big)^{-1}\mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}.$$

Substituting these into Eq (2.96) and (2.97) yields $p(t_{N+1}|\mathbf{t}_N) = \mathcal{N}(\mu_{a|b},\Lambda_{aa}^{-1})$. For the mean $\mu_{a|b}$ we have

$$\mu_{a|b} = 0 - \big(c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k}\big)\cdot\Big[-\big(c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k}\big)^{-1}\mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\Big]\cdot(\mathbf{t}_N - \mathbf{0}) = \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{t}_N = m(\mathbf{x}_{N+1}).$$

Similarly, for the variance $\Lambda_{aa}^{-1}$ (note that since $t_{N+1}$ is a scalar, the mean and the covariance matrix degenerate to the one-dimensional case), we have

$$\Lambda_{aa}^{-1} = c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k} = \sigma^2(\mathbf{x}_{N+1}).$$

Problem 6.21 Solution

We follow the hint, beginning by verifying the mean. Using Eq (6.54), we write Eq (6.62) in matrix form:

$$\mathbf{C}_N = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm T} + \beta^{-1}\mathbf{I}_N,$$

where $\boldsymbol{\Phi}$ is the design matrix defined below Eq (6.51) and $\mathbf{I}_N$ is the identity matrix. Before using Eq (6.66), we need $\mathbf{k}$:

$$\mathbf{k} = \big[k(\mathbf{x}_1,\mathbf{x}_{N+1}),\,k(\mathbf{x}_2,\mathbf{x}_{N+1}),\dots,k(\mathbf{x}_N,\mathbf{x}_{N+1})\big]^{\mathrm T} = \frac{1}{\alpha}\big[\boldsymbol{\phi}(\mathbf{x}_1)^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_{N+1}),\dots,\boldsymbol{\phi}(\mathbf{x}_N)^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_{N+1})\big]^{\mathrm T} = \frac{1}{\alpha}\boldsymbol{\Phi}\,\boldsymbol{\phi}(\mathbf{x}_{N+1}).$$

Now we substitute all the expressions into Eq (6.66), yielding

$$m(\mathbf{x}_{N+1}) = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}_{N+1})^{\mathrm T}\boldsymbol{\Phi}^{\mathrm T}\big[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm T} + \beta^{-1}\mathbf{I}_N\big]^{-1}\mathbf{t}.$$

Next, using the matrix identity (C.6) and Eq (3.54),

$$\boldsymbol{\Phi}^{\mathrm T}\big[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^{\mathrm T} + \beta^{-1}\mathbf{I}_N\big]^{-1} = \alpha\beta\big[\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi} + \alpha\mathbf{I}_M\big]^{-1}\boldsymbol{\Phi}^{\mathrm T} = \alpha\beta\,\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm T}.$$

Substituting this into $m(\mathbf{x}_{N+1})$, we obtain

$$m(\mathbf{x}_{N+1}) = \beta\,\boldsymbol{\phi}(\mathbf{x}_{N+1})^{\mathrm T}\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm T}\mathbf{t} = \big\langle\boldsymbol{\phi}(\mathbf{x}_{N+1}),\,\beta\mathbf{S}_N\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}\big\rangle,$$

where $\langle\cdot,\cdot\rangle$ denotes the inner product. Comparing this result with Eq (3.58), (3.54) and (3.53), we conclude that the means are equal. The variance is similar: we substitute $c$, $\mathbf{k}$ and $\mathbf{C}_N$ into Eq (6.67), simplify using the matrix identity (C.7), and observe that the result equals Eq (3.59).

Problem 6.22 Solution

Based on Eq (6.64) and (6.65), we first write down the joint distribution of $\mathbf{t}_{N+L} = [t_1(\mathbf{x}),t_2(\mathbf{x}),\dots,t_{N+L}(\mathbf{x})]^{\mathrm T}$:

$$p(\mathbf{t}_{N+L}) = \mathcal{N}(\mathbf{t}_{N+L}|\mathbf{0},\mathbf{C}_{N+L}), \qquad \mathbf{C}_{N+L} = \begin{pmatrix}\mathbf{C}_{1,N} & \mathbf{K}\\ \mathbf{K}^{\mathrm T} & \mathbf{C}_{N+1,N+L}\end{pmatrix}.$$

This expression already implicitly partitions $\mathbf{t}_{N+L}$ into two parts. As in Prob. 6.20, for later convenience we rearrange the order, $\bar{\mathbf{t}}_{N+L} = [t_{N+1},\dots,t_{N+L},t_1,\dots,t_N]^{\mathrm T}$, which also follows a Gaussian distribution,

$$p(\bar{\mathbf{t}}_{N+L}) = \mathcal{N}(\bar{\mathbf{t}}_{N+L}|\mathbf{0},\bar{\mathbf{C}}_{N+L}), \qquad \bar{\mathbf{C}}_{N+L} = \begin{pmatrix}\mathbf{C}_{N+1,N+L} & \mathbf{K}^{\mathrm T}\\ \mathbf{K} & \mathbf{C}_{1,N}\end{pmatrix}.$$

Now we use Eq (2.94)–(2.98) and Eq (2.79)–(2.80) to derive the conditional distribution, beginning with

$$\Lambda_{aa} = \big(\mathbf{C}_{N+1,N+L} - \mathbf{K}^{\mathrm T}\mathbf{C}_{1,N}^{-1}\mathbf{K}\big)^{-1}, \qquad \Lambda_{ab} = -\big(\mathbf{C}_{N+1,N+L} - \mathbf{K}^{\mathrm T}\mathbf{C}_{1,N}^{-1}\mathbf{K}\big)^{-1}\mathbf{K}^{\mathrm T}\mathbf{C}_{1,N}^{-1},$$

which gives

$$p(t_{N+1},\dots,t_{N+L}|\mathbf{t}_N) = \mathcal{N}(\boldsymbol{\mu}_{a|b},\Lambda_{aa}^{-1}), \qquad \boldsymbol{\mu}_{a|b} = \mathbf{0} + \mathbf{K}^{\mathrm T}\mathbf{C}_{1,N}^{-1}\mathbf{t}_N.$$

If we now want the conditional distribution $p(t_j|\mathbf{t}_N)$ for some $N+1\le j\le N+L$, we only need the corresponding entry of the mean (the $(j-N)$-th entry) and the $(j-N)$-th diagonal entry of the covariance of $p(t_{N+1},\dots,t_{N+L}|\mathbf{t}_N)$. In that case the result degenerates to Eq (6.66) and (6.67), just as required.

Problem 6.24 Solution

By definition, we only need to prove that $\mathbf{x}^{\mathrm T}\mathbf{W}\mathbf{x}$ is positive for an arbitrary vector $\mathbf{x}\ne\mathbf{0}$. Suppose $\mathbf{W}$ is an $M\times M$ matrix. Expanding the product,

$$\mathbf{x}^{\mathrm T}\mathbf{W}\mathbf{x} = \sum_{i=1}^{M}\sum_{j=1}^{M}W_{ij}\,x_i x_j = \sum_{i=1}^{M}W_{ii}\,x_i^2,$$

where we used the fact that $\mathbf{W}$ is diagonal. Since $W_{ii} > 0$, we obtain $\mathbf{x}^{\mathrm T}\mathbf{W}\mathbf{x} > 0$ just as required.
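The predictive mean and variance of Eq (6.66)–(6.67) derived in Problems 6.20–6.21 are short to implement. The sketch below is an illustration of mine (not the book's code), assuming a user-supplied kernel and noise precision $\beta$; a linear solve is used in place of an explicit matrix inverse.

```python
import numpy as np

def gp_predict(X, t, x_star, kernel, beta):
    """Gaussian-process regression predictive mean and variance, Eq (6.66)-(6.67):
       C_N = K(X, X) + beta^{-1} I,  mean = k^T C_N^{-1} t,  var = c - k^T C_N^{-1} k."""
    N = X.shape[0]
    K = np.array([[kernel(a, b) for b in X] for a in X])
    C_N = K + np.eye(N) / beta
    k = np.array([kernel(xn, x_star) for xn in X])
    c = kernel(x_star, x_star) + 1.0 / beta
    v = np.linalg.solve(C_N, k)            # C_N^{-1} k without forming the inverse
    return k @ np.linalg.solve(C_N, t), c - k @ v

# toy usage with a Gaussian kernel
kernel = lambda a, b: np.exp(-0.5 * np.sum((a - b) ** 2))
X = np.linspace(0, 5, 10).reshape(-1, 1)
t = np.sin(X).ravel()
print(gp_predict(X, t, np.array([2.5]), kernel, beta=25.0))
```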
Suppose we have two positive definite matrices, denoted $\mathbf{A}_1$ and $\mathbf{A}_2$, i.e., for an arbitrary vector $\mathbf{x}\ne\mathbf{0}$ we have $\mathbf{x}^{\mathrm T}\mathbf{A}_1\mathbf{x} > 0$ and $\mathbf{x}^{\mathrm T}\mathbf{A}_2\mathbf{x} > 0$. Therefore

$$\mathbf{x}^{\mathrm T}(\mathbf{A}_1+\mathbf{A}_2)\mathbf{x} = \mathbf{x}^{\mathrm T}\mathbf{A}_1\mathbf{x} + \mathbf{x}^{\mathrm T}\mathbf{A}_2\mathbf{x} > 0,$$

just as required.

Problem 6.25 Solution

Based on the Newton–Raphson formula, Eq (6.81) and Eq (6.82), we have

$$\begin{aligned}
\mathbf{a}_N^{\text{new}} &= \mathbf{a}_N - (-\mathbf{W}_N - \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N)\\
&= \mathbf{a}_N + (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N)\\
&= (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}\big[(\mathbf{W}_N + \mathbf{C}_N^{-1})\mathbf{a}_N + \mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N\big]\\
&= (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N)\\
&= \mathbf{C}_N(\mathbf{I} + \mathbf{W}_N\mathbf{C}_N)^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N),
\end{aligned}$$

where in the last step we used $(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1} = \big[(\mathbf{I} + \mathbf{W}_N\mathbf{C}_N)\mathbf{C}_N^{-1}\big]^{-1} = \mathbf{C}_N(\mathbf{I} + \mathbf{W}_N\mathbf{C}_N)^{-1}$, which gives exactly Eq (6.86).

Problem 6.26 Solution

Using Eq (6.77), (6.78) and (6.86), we can obtain

$$p(a_{N+1}|\mathbf{t}_N) = \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\,d\mathbf{a}_N = \int \mathcal{N}\big(a_{N+1}\,\big|\,\mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{a}_N,\; c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k}\big)\,\mathcal{N}(\mathbf{a}_N|\mathbf{a}_N^\star,\mathbf{H}^{-1})\,d\mathbf{a}_N.$$

By analogy to Eq (2.115), i.e., $p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$, we obtain

$$p(a_{N+1}|\mathbf{t}_N) = \mathcal{N}\big(\mathbf{A}\boldsymbol{\mu} + \mathbf{b},\; \mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm T}\big), \qquad (*)$$

where we have defined

$$\mathbf{A} = \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1},\qquad \mathbf{b} = 0,\qquad \mathbf{L}^{-1} = c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k},\qquad \boldsymbol{\mu} = \mathbf{a}_N^\star,\qquad \boldsymbol{\Lambda} = \mathbf{H}.$$

Therefore the mean is given by

$$\mathbf{A}\boldsymbol{\mu} + \mathbf{b} = \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{a}_N^\star = \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{C}_N(\mathbf{t}_N - \boldsymbol{\sigma}_N) = \mathbf{k}^{\mathrm T}(\mathbf{t}_N - \boldsymbol{\sigma}_N),$$

where we used Eq (6.84). The covariance matrix is given by

$$\mathbf{L}^{-1} + \mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^{\mathrm T} = c - \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{k} + \mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\mathbf{H}^{-1}\big(\mathbf{k}^{\mathrm T}\mathbf{C}_N^{-1}\big)^{\mathrm T} = c - \mathbf{k}^{\mathrm T}\big(\mathbf{C}_N^{-1} - \mathbf{C}_N^{-1}\mathbf{H}^{-1}\mathbf{C}_N^{-1}\big)\mathbf{k} = c - \mathbf{k}^{\mathrm T}\big(\mathbf{C}_N^{-1} - \mathbf{C}_N^{-1}(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}\mathbf{C}_N^{-1}\big)\mathbf{k},$$

where we used Eq (6.85) and the symmetry of $\mathbf{C}_N$. Using the matrix identity (C.7) to reduce the expression further finally gives Eq (6.88).

Problem 6.27 Solution

(Wait for update.) This problem is quite involved; moreover, Eq (6.91) itself appears questionable.

0.7 Sparse Kernel Machines

Problem 7.1 Solution

By analogy to Eq (2.249), we can obtain

$$p(\mathbf{x}|t) = \begin{cases}\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}\dfrac{1}{Z_k}\,k(\mathbf{x},\mathbf{x}_n) & t = +1\\[2ex] \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}\dfrac{1}{Z_k}\,k(\mathbf{x},\mathbf{x}_n) & t = -1,\end{cases}$$

where $N_{+1}$ is the number of samples with label $t=+1$ (and likewise for $N_{-1}$), and $Z_k$ is a normalization constant representing the volume of the hypercube. Since the classes have equal priors, $p(t=+1) = p(t=-1) = 0.5$, Bayes' theorem $p(t|\mathbf{x})\propto p(\mathbf{x}|t)\,p(t)$ yields

$$p(t|\mathbf{x}) = \begin{cases}\dfrac{1}{Z}\cdot\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x},\mathbf{x}_n) & t = +1\\[2ex] \dfrac{1}{Z}\cdot\dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x},\mathbf{x}_n) & t = -1,\end{cases}$$

where $1/Z$ normalizes the posterior. To classify a new sample $\mathbf{x}^\star$, we pick the value $t^\star$ that maximizes $p(t|\mathbf{x})$:

$$t^\star = \begin{cases}+1 & \text{if}\ \ \dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x},\mathbf{x}_n) \ge \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x},\mathbf{x}_n)\\[2ex] -1 & \text{otherwise.}\end{cases}\qquad(*)$$

If we now choose the kernel $k(\mathbf{x},\mathbf{x}') = \mathbf{x}^{\mathrm T}\mathbf{x}'$, we have

$$\frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}} k(\mathbf{x},\mathbf{x}_n) = \mathbf{x}^{\mathrm T}\Big(\frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}\mathbf{x}_n\Big) = \mathbf{x}^{\mathrm T}\tilde{\mathbf{x}}_{+1},$$

where $\tilde{\mathbf{x}}_{+1}$ is the mean of the samples with $t=+1$ (and similarly for $\tilde{\mathbf{x}}_{-1}$). The classification criterion $(*)$ then becomes

$$t^\star = \begin{cases}+1 & \text{if}\ \ \mathbf{x}^{\mathrm T}\tilde{\mathbf{x}}_{+1} \ge \mathbf{x}^{\mathrm T}\tilde{\mathbf{x}}_{-1}\\ -1 & \text{otherwise,}\end{cases}$$

i.e., the new point is assigned according to its inner product with the two class means. When we choose $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}')$, the criterion similarly compares $\boldsymbol{\phi}(\mathbf{x})^{\mathrm T}\tilde{\boldsymbol{\phi}}_{+1}$ with $\boldsymbol{\phi}(\mathbf{x})^{\mathrm T}\tilde{\boldsymbol{\phi}}_{-1}$, where

$$\tilde{\boldsymbol{\phi}}_{+1} = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}\boldsymbol{\phi}(\mathbf{x}_n).$$

Problem 7.2 Solution

Suppose we have found $\mathbf{w}_0$ and $b_0$ that satisfy Eq (7.5) for all points and simultaneously minimize Eq (7.3). The hyperplane determined by $\mathbf{w}_0$ and $b_0$ is the maximum-margin classifier.
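The classifier of Problem 7.1 can be written in a few lines. The following is a small NumPy sketch of mine (not the book's), comparing class-averaged kernel sums; with a linear kernel it reduces to the inner-product comparison with the class means derived above.

```python
import numpy as np

def parzen_classify(x, X, t, kernel):
    """Kernel-density classifier from Problem 7.1: compare class-averaged kernel sums."""
    pos = np.mean([kernel(xn, x) for xn in X[t == +1]])
    neg = np.mean([kernel(xn, x) for xn in X[t == -1]])
    return +1 if pos >= neg else -1

# with a linear kernel this reduces to comparing x^T x_mean(+1) with x^T x_mean(-1)
kernel = lambda a, b: a @ b
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(+1.0, 1.0, size=(20, 2)),
               rng.normal(-1.0, 1.0, size=(20, 2))])
t = np.array([+1] * 20 + [-1] * 20)
print(parzen_classify(np.array([0.8, 0.5]), X, t, kernel))
```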
Now suppose the constraint in Eq (7.5) becomes

$$t_n\big(\mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_n) + b\big) \ge \gamma.$$

If we perform the change of variables $\mathbf{w}_0\to\gamma\mathbf{w}_0$ and $b_0\to\gamma b_0$, the new constraints are still satisfied and Eq (7.3) is still minimized (the objective is only rescaled by the constant factor $\gamma^2$). In other words, when the right-hand side of the constraint changes from $1$ to $\gamma$, the maximum-margin hyperplane is determined by $\gamma\mathbf{w}_0$ and $\gamma b_0$, which describes the same decision boundary; the minimum distance from the points to the boundary is therefore unchanged.

Problem 7.3 Solution

Suppose $\mathbf{x}_1$ belongs to class one, with target $t_1 = 1$, and $\mathbf{x}_2$ belongs to class two, with target $t_2 = -1$. Since we have only two points, both must satisfy $t_n\,y(\mathbf{x}_n) = 1$, as shown in Fig. 7.1. Therefore we have an equality-constrained optimization problem:

$$\text{minimize}\ \ \frac{1}{2}\|\mathbf{w}\|^2 \qquad \text{s.t.}\ \ \begin{cases}\mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_1) + b = 1\\ \mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_2) + b = -1.\end{cases}$$

This is a convex optimization problem, for which a global optimum is known to exist. (A small numerical check is sketched below.)

Problem 7.4 Solution

Since $\rho = 1/\|\mathbf{w}\|$, we have $1/\rho^2 = \|\mathbf{w}\|^2$. In other words, we only need to prove that

$$\|\mathbf{w}\|^2 = \sum_{n=1}^{N} a_n.$$

At the optimum, the second term on the right-hand side of Eq (7.7) vanishes. Based on Eq (7.8) and Eq (7.10), the dual is

$$\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\|\mathbf{w}\|^2.$$

Therefore

$$\frac{1}{2}\|\mathbf{w}\|^2 = L = \tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}\|\mathbf{w}\|^2,$$

and rearranging gives the required result.

Problem 7.5 Solution

This has already been shown in the previous problem.

Problem 7.6 Solution

If the target variable takes values in $\{-1,1\}$ and we know that $p(t=1|y) = \sigma(y)$, then

$$p(t=-1|y) = 1 - p(t=1|y) = 1 - \sigma(y) = \sigma(-y).$$

Combining the two cases, $p(t|y) = \sigma(yt)$. Consequently, the negative log-likelihood is

$$-\ln p(\mathcal{D}) = -\ln\prod_{n=1}^{N}\sigma(y_n t_n) = -\sum_{n=1}^{N}\ln\sigma(y_n t_n) = \sum_{n=1}^{N}E_{\mathrm{LR}}(y_n t_n),$$

where $\mathcal{D} = \{(\mathbf{x}_n,t_n);\ n=1,2,\dots,N\}$ and $E_{\mathrm{LR}}(yt)$ is given by Eq (7.48). With the addition of a quadratic regularizer, we obtain exactly Eq (7.47).

Problem 7.7 Solution

The derivatives are easy to obtain. Our main task is to derive Eq (7.61) using Eq (7.57)–(7.60):

$$\begin{aligned}
L &= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(\mu_n\xi_n + \hat{\mu}_n\hat{\xi}_n) - \sum_{n=1}^{N}a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^{N}\hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n)\\
&= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \mu_n)\xi_n - \sum_{n=1}^{N}(\hat{a}_n + \hat{\mu}_n)\hat{\xi}_n - \sum_{n=1}^{N}a_n(\epsilon + y_n - t_n) - \sum_{n=1}^{N}\hat{a}_n(\epsilon - y_n + t_n).
\end{aligned}$$

Using Eq (7.59) and (7.60), $a_n + \mu_n = C$ and $\hat{a}_n + \hat{\mu}_n = C$, the terms in $\xi_n$ and $\hat{\xi}_n$ cancel against $C\sum_n(\xi_n+\hat{\xi}_n)$, leaving

$$L = \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n - \hat{a}_n)(y_n - t_n).$$

Substituting $y_n = \mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_n) + b$, using Eq (7.58), $\sum_n(a_n-\hat{a}_n)=0$ (which removes the terms in $b$), and then Eq (7.57), $\mathbf{w} = \sum_n(a_n-\hat{a}_n)\boldsymbol{\phi}(\mathbf{x}_n)$, we obtain

$$\begin{aligned}
L &= \frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)\,\mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_n) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)\,t_n\\
&= \frac{1}{2}\|\mathbf{w}\|^2 - \|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)\,t_n\\
&= -\frac{1}{2}\|\mathbf{w}\|^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)\,t_n,
\end{aligned}$$

just as required.

Problem 7.8 Solution

This follows directly from the KKT conditions, described in Eq (7.67) and (7.68).

Problem 7.9 Solution

The prior is given by Eq (7.80):

$$p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M}\mathcal{N}(w_i|0,\alpha_i^{-1}) = \mathcal{N}(\mathbf{w}|\mathbf{0},\mathbf{A}^{-1}),$$

where we have defined $\mathbf{A} = \operatorname{diag}(\alpha_i)$. The likelihood is given by Eq (7.79).
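As mentioned in Problem 7.3 above, the two-point case admits a closed-form check. Under the assumption of an identity feature map $\boldsymbol{\phi}(\mathbf{x})=\mathbf{x}$ (my own illustrative choice), the minimum-norm $\mathbf{w}$ satisfying $\mathbf{w}^{\mathrm T}(\mathbf{x}_1-\mathbf{x}_2)=2$ is $\mathbf{w} = 2(\mathbf{x}_1-\mathbf{x}_2)/\|\mathbf{x}_1-\mathbf{x}_2\|^2$, and $b$ then follows from either equality constraint. This is a sketch, not a general SVM solver.

```python
import numpy as np

# Problem 7.3: two points, one per class (t1 = +1, t2 = -1), illustrative values.
x1, x2 = np.array([2.0, 1.0]), np.array([-1.0, -0.5])

d = x1 - x2
w = 2.0 * d / (d @ d)              # minimum-norm w with w^T (x1 - x2) = 2
b = 1.0 - w @ x1                   # then w^T x1 + b = 1 and w^T x2 + b = -1

print(w @ x1 + b, w @ x2 + b)      # 1.0, -1.0: both points lie exactly on the margin
print(1.0 / np.linalg.norm(w))     # the margin rho = 1 / ||w|| = ||x1 - x2|| / 2
```

The printed margin equals half the distance between the two points, which is the geometrically obvious maximum-margin solution (the perpendicular bisector of the segment joining them).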
$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N} p(t_n|\mathbf{x}_n,\mathbf{w},\beta^{-1}) = \prod_{n=1}^{N}\mathcal{N}\big(t_n\big|\mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}\big) = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I}),$$

where we have defined $\boldsymbol{\Phi} = [\boldsymbol{\phi}(\mathbf{x}_1),\boldsymbol{\phi}(\mathbf{x}_2),\dots,\boldsymbol{\phi}(\mathbf{x}_N)]^{\mathrm T}$. Our definitions of $\boldsymbol{\Phi}$ and $\mathbf{A}$ are consistent with the main text. Therefore, according to Eq (2.113)–(2.117),

$$p(\mathbf{w}|\mathbf{t},\mathbf{X},\boldsymbol{\alpha},\beta) = \mathcal{N}(\mathbf{m},\boldsymbol{\Sigma}), \qquad \boldsymbol{\Sigma} = \big(\mathbf{A} + \beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big)^{-1}, \qquad \mathbf{m} = \beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t},$$

just as required.

Problem 7.10 & 7.11 Solution

This is quite similar to the previous problem. The prior is

$$p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M}\mathcal{N}(w_i|0,\alpha_i^{-1}) = \mathcal{N}(\mathbf{w}|\mathbf{0},\mathbf{A}^{-1}),$$

and the likelihood is

$$p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n\big|\mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}\big) = \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I}).$$

Since $p(\mathbf{t}|\mathbf{X},\boldsymbol{\alpha},\beta) = \int p(\mathbf{t}|\mathbf{X},\mathbf{w},\beta)\,p(\mathbf{w}|\boldsymbol{\alpha})\,d\mathbf{w}$, as required by Prob. 7.10 we first evaluate it by completing the square:

$$p(\mathbf{t}|\mathbf{X},\boldsymbol{\alpha},\beta) = \int\mathcal{N}(\mathbf{w}|\mathbf{0},\mathbf{A}^{-1})\,\mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w},\beta^{-1}\mathbf{I})\,d\mathbf{w} = \Big(\frac{\beta}{2\pi}\Big)^{N/2}\frac{1}{(2\pi)^{M/2}}\prod_{i=1}^{M}\alpha_i^{1/2}\int\exp\{-E(\mathbf{w})\}\,d\mathbf{w},$$

where we have defined

$$E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^{\mathrm T}\mathbf{A}\mathbf{w} + \frac{\beta}{2}\|\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}\|^2.$$

Expanding $E(\mathbf{w})$ with respect to $\mathbf{w}$ and using Eq (7.82) and Eq (7.83),

$$E(\mathbf{w}) = \frac{1}{2}\big\{\mathbf{w}^{\mathrm T}(\mathbf{A}+\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi})\mathbf{w} - 2\beta\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\mathbf{w} + \beta\mathbf{t}^{\mathrm T}\mathbf{t}\big\} = \frac{1}{2}\big\{\mathbf{w}^{\mathrm T}\boldsymbol{\Sigma}^{-1}\mathbf{w} - 2\mathbf{m}^{\mathrm T}\boldsymbol{\Sigma}^{-1}\mathbf{w} + \beta\mathbf{t}^{\mathrm T}\mathbf{t}\big\} = \frac{1}{2}(\mathbf{w}-\mathbf{m})^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\mathbf{w}-\mathbf{m}) + \frac{1}{2}\big(\beta\mathbf{t}^{\mathrm T}\mathbf{t} - \mathbf{m}^{\mathrm T}\boldsymbol{\Sigma}^{-1}\mathbf{m}\big).$$

Carrying out the Gaussian integral over $\mathbf{w}$ gives

$$p(\mathbf{t}|\mathbf{X},\boldsymbol{\alpha},\beta) = \Big(\frac{\beta}{2\pi}\Big)^{N/2}|\boldsymbol{\Sigma}|^{1/2}\prod_{i=1}^{M}\alpha_i^{1/2}\exp\{-E(\mathbf{t})\}, \qquad E(\mathbf{t}) = \frac{1}{2}\big(\beta\mathbf{t}^{\mathrm T}\mathbf{t} - \mathbf{m}^{\mathrm T}\boldsymbol{\Sigma}^{-1}\mathbf{m}\big).$$

We further expand $E(\mathbf{t})$:

$$\begin{aligned}
E(\mathbf{t}) &= \frac{1}{2}\big(\beta\mathbf{t}^{\mathrm T}\mathbf{t} - (\beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})^{\mathrm T}\boldsymbol{\Sigma}^{-1}(\beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})\big) = \frac{1}{2}\big(\beta\mathbf{t}^{\mathrm T}\mathbf{t} - \beta^2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}\big)\\
&= \frac{1}{2}\mathbf{t}^{\mathrm T}\big(\beta\mathbf{I} - \beta\boldsymbol{\Phi}(\mathbf{A}+\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm T}\beta\big)\mathbf{t} = \frac{1}{2}\mathbf{t}^{\mathrm T}\big(\beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm T}\big)^{-1}\mathbf{t} = \frac{1}{2}\mathbf{t}^{\mathrm T}\mathbf{C}^{-1}\mathbf{t},
\end{aligned}$$

where in the last step we used the matrix identity (C.7). Since the marginal is Gaussian and its exponent is $E(\mathbf{t})$, Eq (7.85) follows once the normalization constants are accounted for. What's more, as required by Prob. 7.11, the integral can also be evaluated directly using Eq (2.113)–(2.117).

Problem 7.12 Solution

According to the previous problem, we can explicitly write the log marginal likelihood in an alternative form:

$$\ln p(\mathbf{t}|\mathbf{X},\boldsymbol{\alpha},\beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi + \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\sum_{i=1}^{M}\ln\alpha_i - E(\mathbf{t}).$$

We first derive

$$\frac{dE(\mathbf{t})}{d\alpha_i} = -\frac{1}{2}\frac{d}{d\alpha_i}\big(\mathbf{m}^{\mathrm T}\boldsymbol{\Sigma}^{-1}\mathbf{m}\big) = -\frac{1}{2}\frac{d}{d\alpha_i}\big(\beta^2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}\big) = -\frac{1}{2}\operatorname{Tr}\Big[\frac{d(\beta^2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})}{d\boldsymbol{\Sigma}^{-1}}\cdot\frac{d\boldsymbol{\Sigma}^{-1}}{d\alpha_i}\Big] = \frac{1}{2}\beta^2\operatorname{Tr}\big[\boldsymbol{\Sigma}(\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})(\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})^{\mathrm T}\boldsymbol{\Sigma}\cdot\mathbf{I}_i\big] = \frac{1}{2}m_i^2.$$

In the last step we have utilized the identity

$$\frac{d}{d\mathbf{X}}\operatorname{Tr}\big(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}\big) = -\mathbf{X}^{-\mathrm T}\mathbf{A}^{\mathrm T}\mathbf{B}^{\mathrm T}\mathbf{X}^{-\mathrm T}.$$

Moreover, $\mathbf{I}_i$ is the matrix with all elements equal to zero except the $i$-th diagonal element, which equals 1.
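The quantities used throughout Problems 7.9–7.12 (the posterior moments $\mathbf{m},\boldsymbol{\Sigma}$ and the evidence) translate directly into code. Below is a small sketch of mine, assuming a design matrix `Phi`, targets `t`, and hyperparameters `alpha`, `beta`; it is an illustration of the formulas, not a production implementation.

```python
import numpy as np

def rvm_posterior(Phi, t, alpha, beta):
    """Posterior mean and covariance of the RVM weights, Eq (7.82)-(7.83):
       Sigma = (A + beta Phi^T Phi)^{-1},  m = beta Sigma Phi^T t."""
    A = np.diag(alpha)
    Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    return m, Sigma

def log_marginal(Phi, t, alpha, beta):
    """Log evidence in the form of Eq (7.85), ln N(t | 0, C) with
       C = beta^{-1} I + Phi A^{-1} Phi^T."""
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))
```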
Then, utilizing the matrix identity Eq (C.22),

$$\frac{d\ln|\boldsymbol{\Sigma}|}{d\alpha_i} = -\frac{d\ln|\boldsymbol{\Sigma}^{-1}|}{d\alpha_i} = -\operatorname{Tr}\Big[\boldsymbol{\Sigma}\,\frac{d}{d\alpha_i}\big(\mathbf{A}+\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big)\Big] = -\Sigma_{ii}.$$

Therefore we can obtain

$$\frac{d\ln p}{d\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}m_i^2 - \frac{1}{2}\Sigma_{ii}.$$

Setting it to zero gives

$$\alpha_i = \frac{1-\alpha_i\Sigma_{ii}}{m_i^2} = \frac{\gamma_i}{m_i^2}.$$

Next we calculate the derivatives of $\ln p$ with respect to $\beta$, beginning with

$$\frac{d\ln|\boldsymbol{\Sigma}|}{d\beta} = -\frac{d\ln|\boldsymbol{\Sigma}^{-1}|}{d\beta} = -\operatorname{Tr}\Big[\boldsymbol{\Sigma}\,\frac{d}{d\beta}\big(\mathbf{A}+\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big)\Big] = -\operatorname{Tr}\big[\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big].$$

Then we continue:

$$\begin{aligned}
\frac{dE(\mathbf{t})}{d\beta} &= \frac{1}{2}\mathbf{t}^{\mathrm T}\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}\big(\beta^2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}\big)\\
&= \frac{1}{2}\Big\{\mathbf{t}^{\mathrm T}\mathbf{t} - 2\beta\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t} - \beta^2\frac{d}{d\beta}\big(\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}\big)\Big\}\\
&= \frac{1}{2}\Big\{\mathbf{t}^{\mathrm T}\mathbf{t} - 2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\mathbf{m} - \beta^2\operatorname{Tr}\Big[\frac{d(\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})}{d\boldsymbol{\Sigma}^{-1}}\cdot\frac{d\boldsymbol{\Sigma}^{-1}}{d\beta}\Big]\Big\}\\
&= \frac{1}{2}\Big\{\mathbf{t}^{\mathrm T}\mathbf{t} - 2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\mathbf{m} + \beta^2\operatorname{Tr}\big[\boldsymbol{\Sigma}(\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})(\boldsymbol{\Phi}^{\mathrm T}\mathbf{t})^{\mathrm T}\boldsymbol{\Sigma}\cdot\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big]\Big\}\\
&= \frac{1}{2}\Big\{\mathbf{t}^{\mathrm T}\mathbf{t} - 2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\mathbf{m} + \operatorname{Tr}\big[\mathbf{m}\mathbf{m}^{\mathrm T}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big]\Big\} = \frac{1}{2}\Big\{\mathbf{t}^{\mathrm T}\mathbf{t} - 2\mathbf{t}^{\mathrm T}\boldsymbol{\Phi}\mathbf{m} + \operatorname{Tr}\big[\boldsymbol{\Phi}\mathbf{m}\mathbf{m}^{\mathrm T}\boldsymbol{\Phi}^{\mathrm T}\big]\Big\} = \frac{1}{2}\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}\|^2.
\end{aligned}$$

Therefore we have obtained

$$\frac{d\ln p}{d\beta} = \frac{1}{2}\Big(\frac{N}{\beta} - \|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}\|^2 - \operatorname{Tr}\big[\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}\big]\Big).$$

Using Eq (7.83),

$$\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi} = \boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi} + \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} = \boldsymbol{\Sigma}\big(\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi}+\mathbf{A}\big)\beta^{-1} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} = \big(\mathbf{I}-\boldsymbol{\Sigma}\mathbf{A}\big)\beta^{-1}.$$

Setting the derivative equal to zero, we obtain

$$\beta^{-1} = \frac{\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}\|^2}{N - \operatorname{Tr}(\mathbf{I}-\boldsymbol{\Sigma}\mathbf{A})} = \frac{\|\mathbf{t}-\boldsymbol{\Phi}\mathbf{m}\|^2}{N - \sum_i\gamma_i},$$

just as required. (A sketch of the resulting re-estimation loop is given below.)

Problem 7.13 Solution

This problem is somewhat confusing. In my view, the posterior should be denoted $p(\mathbf{w}|\mathbf{t},\mathbf{X},\{a_i,b_i\},a_\beta,b_\beta)$, where $a_\beta,b_\beta$ control the Gamma distribution over $\beta$ and $a_i,b_i$ control the Gamma distribution over $\alpha_i$. What we should do is maximize the marginal likelihood $p(\mathbf{t}|\mathbf{X},\{a_i,b_i\},a_\beta,b_\beta)$ with respect to $\{a_i,b_i\},a_\beta,b_\beta$. We no longer have point estimates of the hyperparameters $\beta$ and $\alpha_i$; instead we have distributions over them, controlled by the hyper-priors $\{a_i,b_i\},a_\beta,b_\beta$.

Problem 7.14 Solution

We begin by writing down $p(t|\mathbf{x},\mathbf{w},\beta^*)$. Using Eq (7.76) and Eq (7.77),

$$p(t|\mathbf{x},\mathbf{w},\beta^*) = \mathcal{N}\big(t\big|\mathbf{w}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}),(\beta^*)^{-1}\big).$$

Then, using Eq (7.81), (7.82) and (7.83),

$$p(\mathbf{w}|\mathbf{X},\mathbf{t},\boldsymbol{\alpha}^*,\beta^*) = \mathcal{N}(\mathbf{w}|\mathbf{m},\boldsymbol{\Sigma}),$$

where $\mathbf{m}$ and $\boldsymbol{\Sigma}$ are evaluated from Eq (7.82) and (7.83) with $\boldsymbol{\alpha}=\boldsymbol{\alpha}^*$ and $\beta=\beta^*$. Then we utilize Eq (7.90):

$$p(t|\mathbf{x},\mathbf{X},\mathbf{t},\boldsymbol{\alpha}^*,\beta^*) = \int\mathcal{N}\big(t\big|\boldsymbol{\phi}(\mathbf{x})^{\mathrm T}\mathbf{w},(\beta^*)^{-1}\big)\,\mathcal{N}(\mathbf{w}|\mathbf{m},\boldsymbol{\Sigma})\,d\mathbf{w}.$$

Using Eq (2.113)–(2.117),

$$p(t|\mathbf{x},\mathbf{X},\mathbf{t},\boldsymbol{\alpha}^*,\beta^*) = \mathcal{N}(\mu,\sigma^2), \qquad \mu = \mathbf{m}^{\mathrm T}\boldsymbol{\phi}(\mathbf{x}), \qquad \sigma^2 = (\beta^*)^{-1} + \boldsymbol{\phi}(\mathbf{x})^{\mathrm T}\boldsymbol{\Sigma}\boldsymbol{\phi}(\mathbf{x}),$$

just as required.

Problem 7.15 Solution

We just follow the hint:

$$\begin{aligned}
L(\boldsymbol{\alpha}) &= -\frac{1}{2}\big\{N\ln 2\pi + \ln|\mathbf{C}| + \mathbf{t}^{\mathrm T}\mathbf{C}^{-1}\mathbf{t}\big\}\\
&= -\frac{1}{2}\Big\{N\ln 2\pi + \ln|\mathbf{C}_{-i}| + \ln\big|1+\alpha_i^{-1}\boldsymbol{\varphi}_i^{\mathrm T}\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i\big| + \mathbf{t}^{\mathrm T}\Big(\mathbf{C}_{-i}^{-1} - \frac{\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^{\mathrm T}\mathbf{C}_{-i}^{-1}}{\alpha_i + \boldsymbol{\varphi}_i^{\mathrm T}\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i}\Big)\mathbf{t}\Big\}\\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln\big|1+\alpha_i^{-1}s_i\big| + \frac{1}{2}\,\frac{q_i^2}{\alpha_i+s_i}\\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln\frac{\alpha_i+s_i}{\alpha_i} + \frac{1}{2}\,\frac{q_i^2}{\alpha_i+s_i}\\
&= L(\boldsymbol{\alpha}_{-i}) + \frac{1}{2}\Big[\ln\alpha_i - \ln(\alpha_i+s_i) + \frac{q_i^2}{\alpha_i+s_i}\Big] = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i),
\end{aligned}$$

where $\lambda(\alpha_i)$, $s_i$ and $q_i$ are defined as in Eq (7.97)–(7.99).
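As mentioned above, the two re-estimation equations from Problem 7.12 combine with the posterior of Problem 7.9 into a simple fixed-point loop. The sketch below is my own illustration under these formulas (small jitter constants are added for numerical safety and are not part of the derivation); it is not a full RVM implementation with basis pruning.

```python
import numpy as np

def rvm_reestimate(Phi, t, n_iter=100, tol=1e-6):
    """Evidence re-estimation for the RVM hyperparameters, iterating
       alpha_i = gamma_i / m_i^2 and 1/beta = ||t - Phi m||^2 / (N - sum_i gamma_i)."""
    N, M = Phi.shape
    alpha, beta = np.ones(M), 1.0
    for _ in range(n_iter):
        Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)  # Eq (7.83)
        m = beta * Sigma @ Phi.T @ t                                # Eq (7.82)
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha_new = gamma / (m ** 2 + 1e-12)
        beta_new = (N - gamma.sum()) / (np.sum((t - Phi @ m) ** 2) + 1e-12)
        converged = np.max(np.abs(alpha_new - alpha)) < tol
        alpha, beta = alpha_new, beta_new
        if converged:
            break
    return alpha, beta, m, Sigma
```

In practice, many $\alpha_i$ grow without bound during these iterations, which is the mechanism that drives the sparsity of the relevance vector machine.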
Problem 7.16 Solution

We first calculate the first derivative of Eq (7.97) with respect to $\alpha_i$:

$$\frac{\partial\lambda}{\partial\alpha_i} = \frac{1}{2}\Big[\frac{1}{\alpha_i} - \frac{1}{\alpha_i+s_i} - \frac{q_i^2}{(\alpha_i+s_i)^2}\Big],$$

and then the second derivative:

$$\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\Big[-\frac{1}{\alpha_i^2} + \frac{1}{(\alpha_i+s_i)^2} + \frac{2q_i^2}{(\alpha_i+s_i)^3}\Big].$$

Next we show that when $\alpha_i$ is given by Eq (7.101), i.e., at the stationary point of the first derivative, the second derivative is negative. First,

$$\alpha_i + s_i = \frac{s_i^2}{q_i^2-s_i} + s_i = \frac{s_i q_i^2}{q_i^2-s_i}.$$

Substituting $\alpha_i$ and $\alpha_i+s_i$ into the second derivative,

$$\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\Big[-\frac{(q_i^2-s_i)^2}{s_i^4} + \frac{(q_i^2-s_i)^2}{s_i^2 q_i^4} + \frac{2q_i^2(q_i^2-s_i)^3}{s_i^3 q_i^6}\Big] = \frac{1}{2}\,\frac{(q_i^2-s_i)^2}{q_i^4 s_i^4}\big[-q_i^4 + s_i^2 + 2s_i(q_i^2-s_i)\big] = -\frac{1}{2}\,\frac{(q_i^2-s_i)^4}{q_i^4 s_i^4} < 0,$$

just as required.

Problem 7.17 Solution

We just follow the hint. According to Eq (7.102), Eq (7.86) and the matrix identity (C.7),

$$\begin{aligned}
Q_i &= \boldsymbol{\varphi}_i^{\mathrm T}\mathbf{C}^{-1}\mathbf{t} = \boldsymbol{\varphi}_i^{\mathrm T}\big(\beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^{\mathrm T}\big)^{-1}\mathbf{t}\\
&= \boldsymbol{\varphi}_i^{\mathrm T}\big(\beta\mathbf{I} - \beta\mathbf{I}\boldsymbol{\Phi}(\mathbf{A}+\boldsymbol{\Phi}^{\mathrm T}\beta\mathbf{I}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm T}\beta\mathbf{I}\big)\mathbf{t}\\
&= \boldsymbol{\varphi}_i^{\mathrm T}\big(\beta - \beta^2\boldsymbol{\Phi}(\mathbf{A}+\beta\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^{\mathrm T}\big)\mathbf{t}\\
&= \boldsymbol{\varphi}_i^{\mathrm T}\big(\beta - \beta^2\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\big)\mathbf{t} = \beta\boldsymbol{\varphi}_i^{\mathrm T}\mathbf{t} - \beta^2\boldsymbol{\varphi}_i^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\mathbf{t}.
\end{aligned}$$

Similarly,

$$S_i = \boldsymbol{\varphi}_i^{\mathrm T}\mathbf{C}^{-1}\boldsymbol{\varphi}_i = \boldsymbol{\varphi}_i^{\mathrm T}\big(\beta - \beta^2\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\big)\boldsymbol{\varphi}_i = \beta\boldsymbol{\varphi}_i^{\mathrm T}\boldsymbol{\varphi}_i - \beta^2\boldsymbol{\varphi}_i^{\mathrm T}\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^{\mathrm T}\boldsymbol{\varphi}_i,$$

just as required.

Problem 7.18 Solution

We begin by differentiating the first term in Eq (7.109) with respect to $\mathbf{w}$, which can be evaluated easily using Eq (4.90)–(4.91):

$$\frac{\partial}{\partial\mathbf{w}}\Big\{\sum_{n=1}^{N}t_n\ln y_n + (1-t_n)\ln(1-y_n)\Big\} = \sum_{n=1}^{N}(t_n-y_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^{\mathrm T}(\mathbf{t}-\mathbf{y}).$$

The derivative of the second term in Eq (7.109) with respect to $\mathbf{w}$ is straightforward, so the gradient of Eq (7.109) is

$$\frac{\partial\ln p}{\partial\mathbf{w}} = \boldsymbol{\Phi}^{\mathrm T}(\mathbf{t}-\mathbf{y}) - \mathbf{A}\mathbf{w}.$$

For the Hessian matrix, we first obtain

$$\frac{\partial}{\partial\mathbf{w}}\big\{\boldsymbol{\Phi}^{\mathrm T}(\mathbf{t}-\mathbf{y})\big\} = \frac{\partial}{\partial\mathbf{w}}\Big\{\sum_{n=1}^{N}(t_n-y_n)\boldsymbol{\phi}_n\Big\} = -\sum_{n=1}^{N}\frac{\partial\sigma(a_n)}{\partial a_n}\,\frac{\partial a_n}{\partial\mathbf{w}}\,\boldsymbol{\phi}_n^{\mathrm T},$$

where we have defined $a_n = \mathbf{w}^{\mathrm T}\boldsymbol{\phi}_n$. Using Eq (4.88), $\partial\sigma/\partial a = \sigma(1-\sigma)$, we derive

$$\frac{\partial}{\partial\mathbf{w}}\big\{\boldsymbol{\Phi}^{\mathrm T}(\mathbf{t}-\mathbf{y})\big\} = -\sum_{n=1}^{N}y_n(1-y_n)\,\boldsymbol{\phi}_n\boldsymbol{\phi}_n^{\mathrm T} = -\boldsymbol{\Phi}^{\mathrm T}\mathbf{B}\boldsymbol{\Phi},$$

where $\mathbf{B}$ is the diagonal $N\times N$ matrix with elements $b_n = y_n(1-y_n)$. Therefore the Hessian matrix is

$$\mathbf{H} = \frac{\partial}{\partial\mathbf{w}}\Big\{\frac{\partial\ln p}{\partial\mathbf{w}}\Big\} = -\big(\boldsymbol{\Phi}^{\mathrm T}\mathbf{B}\boldsymbol{\Phi} + \mathbf{A}\big),$$

just as required.

Problem 7.19 Solution

We begin from Eq (7.114):

$$p(\mathbf{t}|\boldsymbol{\alpha}) = p(\mathbf{t}|\mathbf{w}^*)\,p(\mathbf{w}^*|\boldsymbol{\alpha})\,(2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} = \Big[\prod_{n=1}^{N}p(t_n|x_n,\mathbf{w})\Big]\Big[\prod_{i=1}^{M}\mathcal{N}(w_i|0,\alpha_i^{-1})\Big](2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2}\Big|_{\mathbf{w}=\mathbf{w}^*} = \Big[\prod_{n=1}^{N}p(t_n|x_n,\mathbf{w})\Big]\mathcal{N}(\mathbf{w}|\mathbf{0},\mathbf{A}^{-1})\,(2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2}\Big|_{\mathbf{w}=\mathbf{w}^*}.$$

Taking the logarithm of both sides,

$$\ln p(\mathbf{t}|\boldsymbol{\alpha}) = \Big[\sum_{n=1}^{N}\ln p(t_n|x_n,\mathbf{w}) + \ln\mathcal{N}(\mathbf{w}|\mathbf{0},\mathbf{A}^{-1}) + \frac{M}{2}\ln 2\pi + \frac{1}{2}\ln|\boldsymbol{\Sigma}|\Big]\Big|_{\mathbf{w}=\mathbf{w}^*} = \Big[\sum_{n=1}^{N}\big\{t_n\ln y_n + (1-t_n)\ln(1-y_n)\big\} - \frac{1}{2}\mathbf{w}^{\mathrm T}\mathbf{A}\mathbf{w} + \frac{1}{2}\ln|\mathbf{A}| + \frac{1}{2}\ln|\boldsymbol{\Sigma}|\Big]\Big|_{\mathbf{w}=\mathbf{w}^*} + \text{const}.$$

Using the chain rule,

$$\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i}\Big|_{\mathbf{w}=\mathbf{w}^*} = \frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i}\bigg|_{\text{explicit}} + \frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\mathbf{w}}\,\frac{\partial\mathbf{w}}{\partial\alpha_i}\Big|_{\mathbf{w}=\mathbf{w}^*}.$$

Observing Eq (7.109) and (7.110), and that (7.110) equals 0 at $\mathbf{w}^*$, we conclude that the likelihood term together with the quadratic prior term $-\frac{1}{2}\mathbf{w}^{\mathrm T}\mathbf{A}\mathbf{w}$ has zero derivative with respect to $\mathbf{w}$ at $\mathbf{w}^*$, so their implicit dependence on $\alpha_i$ through $\mathbf{w}^*$ contributes nothing.
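The gradient and Hessian of Problem 7.18 are exactly the quantities needed to locate the mode $\mathbf{w}^*$ of the Laplace approximation. The sketch below is my own illustration of a Newton-Raphson (IRLS) loop built from these two formulas, assuming a design matrix `Phi`, binary targets `t` in $\{0,1\}$, and hyperparameters `alpha`; it is not the book's code.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def rvm_class_grad_hess(w, Phi, t, alpha):
    """Gradient and Hessian of the log posterior for RVM classification (Problem 7.18):
       grad = Phi^T (t - y) - A w,   H = -(Phi^T B Phi + A)."""
    A = np.diag(alpha)
    y = sigmoid(Phi @ w)
    grad = Phi.T @ (t - y) - A @ w
    B = np.diag(y * (1.0 - y))
    return grad, -(Phi.T @ B @ Phi + A)

def laplace_mode(Phi, t, alpha, n_iter=25):
    """Newton-Raphson iteration to find the mode w* used in Eq (7.110)-(7.113)."""
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iter):
        grad, H = rvm_class_grad_hess(w, Phi, t, alpha)
        w = w - np.linalg.solve(H, grad)     # Newton step: w <- w - H^{-1} grad
    return w
```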
Therefore, we only need the explicit dependence of the remaining terms on $\alpha_i$:

$$\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i}\Big|_{\mathbf{w}=\mathbf{w}^*} = \frac{\partial}{\partial\alpha_i}\Big[-\frac{1}{2}\mathbf{w}^{\mathrm T}\mathbf{A}\mathbf{w} + \frac{1}{2}\ln|\mathbf{A}| + \frac{1}{2}\ln|\boldsymbol{\Sigma}|\Big]\Big|_{\mathbf{w}=\mathbf{w}^*}.$$

It is rather easy to obtain

$$\frac{\partial}{\partial\alpha_i}\Big[-\frac{1}{2}\mathbf{w}^{\mathrm T}\mathbf{A}\mathbf{w}\Big] = -\frac{1}{2}(w_i^*)^2, \qquad \frac{\partial}{\partial\alpha_i}\Big[\frac{1}{2}\ln|\mathbf{A}|\Big] = \frac{\partial}{\partial\alpha_i}\Big[\frac{1}{2}\sum_i\ln\alpha_i\Big] = \frac{1}{2\alpha_i}.$$

Following the same procedure as in Prob. 7.12,

$$\frac{\partial}{\partial\alpha_i}\Big[\frac{1}{2}\ln|\boldsymbol{\Sigma}|\Big] = -\frac{1}{2}\Sigma_{ii}.$$

Therefore we obtain

$$\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}(w_i^*)^2 - \frac{1}{2}\Sigma_{ii},$$

and setting this to zero gives $\alpha_i^{\text{new}} = \gamma_i/(w_i^*)^2$ with $\gamma_i = 1-\alpha_i\Sigma_{ii}$, i.e., Eq (7.115). Note that, strictly speaking, $\boldsymbol{\Sigma}$ also depends on $\mathbf{w}^*$ through the matrix $\mathbf{B}$, so the gradient of $\frac{1}{2}\ln|\boldsymbol{\Sigma}|$ with respect to $\mathbf{w}$ does not vanish exactly at $\mathbf{w}^*$; this indirect dependence on $\alpha_i$ through $\mathbf{w}^*$ is neglected in Eq (7.115).

0.8 Graphical Models

Problem 8.1 Solution

We are required to prove that

$$\int_{\mathbf{x}}p(\mathbf{x})\,d\mathbf{x} = \int_{\mathbf{x}}\prod_{k=1}^{K}p(x_k|\mathrm{pa}_k)\,d\mathbf{x} = 1.$$

Here we adopt the same assumption as in the main text: no arrows lead from a higher-numbered node to a lower-numbered node. According to Eq (8.5),

$$\begin{aligned}
\int_{\mathbf{x}}p(\mathbf{x})\,d\mathbf{x} &= \int_{\mathbf{x}}\prod_{k=1}^{K}p(x_k|\mathrm{pa}_k)\,d\mathbf{x}\\
&= \int_{x_1,\dots,x_{K-1}}\Big[\int_{x_K}p(x_K|\mathrm{pa}_K)\prod_{k=1}^{K-1}p(x_k|\mathrm{pa}_k)\,dx_K\Big]\,dx_1\cdots dx_{K-1}\\
&= \int_{x_1,\dots,x_{K-1}}\prod_{k=1}^{K-1}p(x_k|\mathrm{pa}_k)\Big[\int_{x_K}p(x_K|\mathrm{pa}_K)\,dx_K\Big]\,dx_1\cdots dx_{K-1}\\
&= \int_{x_1,\dots,x_{K-1}}\prod_{k=1}^{K-1}p(x_k|\mathrm{pa}_k)\,dx_1\cdots dx_{K-1}.
\end{aligned}$$

From the second line to the third, we used the fact that $x_1,x_2,\dots,x_{K-1}$ do not depend on $x_K$, so the product over $k=1,\dots,K-1$ can be moved outside the integral with respect to $x_K$; from the third line to the fourth, we used the fact that the conditional probability is correctly normalized. This procedure is repeated $K$ times until all the variables have been integrated out, giving 1.

Problem 8.2 Solution

The statement is straightforward. Suppose there exists an ordered numbering of the nodes such that for each node there are no links going to a lower-numbered node, and suppose there is a directed cycle in the graph:

$$a_1 \to a_2 \to \dots \to a_N,$$

with the additional link $a_N\to a_1$ closing the cycle. By the assumption, every link goes to a strictly higher-numbered node, so $a_1 < a_2 < \dots < a_N$. The closing link $a_N\to a_1$ is therefore invalid since $a_N > a_1$, and no directed cycle can exist.

Problem 8.3 Solution

By definition, we can obtain

$$p(a,b) = p(a,b,c=0) + p(a,b,c=1) = \begin{cases}0.336 & a=0,\ b=0\\ 0.264 & a=0,\ b=1\\ 0.256 & a=1,\ b=0\\ 0.144 & a=1,\ b=1.\end{cases}$$

Similarly, we can obtain

$$p(a) = p(a,b=0) + p(a,b=1) = \begin{cases}0.6 & a=0\\ 0.4 & a=1,\end{cases} \qquad p(b) = p(a=0,b) + p(a=1,b) = \begin{cases}0.592 & b=0\\ 0.408 & b=1.\end{cases}$$

Therefore we conclude that $p(a,b)\ne p(a)\,p(b)$. For instance, we have $p(a=1,b=1) = 0.144$, $p(a=1) = 0.4$ and $p(b=1) = 0.408$.
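The marginal dependence and conditional independence in Problem 8.3 can be verified mechanically. The sketch below is my own check; the joint values $p(a,b,c)$ are assumed to be those reconstructed from the conditional tables in this solution (they are consistent with the marginals above, and are presumed to match Table 8.2).

```python
import numpy as np

# Joint distribution p(a, b, c), indexed as p[a, b, c] (assumed values consistent
# with the marginals and conditionals computed in this solution).
p = np.zeros((2, 2, 2))
p[0, 0, 0], p[0, 0, 1] = 0.192, 0.144
p[0, 1, 0], p[0, 1, 1] = 0.048, 0.216
p[1, 0, 0], p[1, 0, 1] = 0.192, 0.064
p[1, 1, 0], p[1, 1, 1] = 0.048, 0.096

p_ab = p.sum(axis=2)                                  # p(a, b)
p_a, p_b, p_c = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))
print(np.allclose(p_ab, np.outer(p_a, p_b)))          # False: a and b are dependent

p_ab_given_c = p / p_c                                # p(a, b | c)
p_a_given_c = p.sum(axis=1) / p_c                     # p(a | c)
p_b_given_c = p.sum(axis=0) / p_c                     # p(b | c)
factored = np.einsum('ik,jk->ijk', p_a_given_c, p_b_given_c)
print(np.allclose(p_ab_given_c, factored))            # True: a and b independent given c
```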
Indeed,

$$0.144 = p(a=1,b=1) \ne p(a=1)\,p(b=1) = 0.4\times 0.408 = 0.1632.$$

To examine the conditional independence, we first calculate $p(c)$:

$$p(c) = \sum_{a,b\in\{0,1\}}p(a,b,c) = \begin{cases}0.480 & c=0\\ 0.520 & c=1.\end{cases}$$

According to Bayes' theorem,

$$p(a,b|c) = \frac{p(a,b,c)}{p(c)} = \begin{cases}0.400 & a=0,\ b=0,\ c=0\\ 0.277 & a=0,\ b=0,\ c=1\\ 0.100 & a=0,\ b=1,\ c=0\\ 0.415 & a=0,\ b=1,\ c=1\\ 0.400 & a=1,\ b=0,\ c=0\\ 0.123 & a=1,\ b=0,\ c=1\\ 0.100 & a=1,\ b=1,\ c=0\\ 0.185 & a=1,\ b=1,\ c=1.\end{cases}$$

Similarly, using $p(a,c) = p(a,b=0,c)+p(a,b=1,c)$,

$$p(a|c) = \frac{p(a,c)}{p(c)} = \begin{cases}0.240/0.480 = 0.500 & a=0,\ c=0\\ 0.360/0.520 = 0.692 & a=0,\ c=1\\ 0.240/0.480 = 0.500 & a=1,\ c=0\\ 0.160/0.520 = 0.308 & a=1,\ c=1,\end{cases}$$

and

$$p(b|c) = \frac{p(b,c)}{p(c)} = \begin{cases}0.384/0.480 = 0.800 & b=0,\ c=0\\ 0.208/0.520 = 0.400 & b=0,\ c=1\\ 0.096/0.480 = 0.200 & b=1,\ c=0\\ 0.312/0.520 = 0.600 & b=1,\ c=1.\end{cases}$$

Now we can easily verify the statement $p(a,b|c) = p(a|c)\,p(b|c)$. For instance,

$$0.1 = p(a=1,b=1|c=0) = p(a=1|c=0)\,p(b=1|c=0) = 0.5\times 0.2 = 0.1.$$

Problem 8.4 Solution

This problem follows the previous one. We have already calculated $p(a)$ and $p(b|c)$, which we rewrite here:

$$p(a) = \begin{cases}0.6 & a=0\\ 0.4 & a=1,\end{cases} \qquad p(b|c) = \begin{cases}0.800 & b=0,\ c=0\\ 0.400 & b=0,\ c=1\\ 0.200 & b=1,\ c=0\\ 0.600 & b=1,\ c=1.\end{cases}$$

We can also obtain $p(c|a)$:

$$p(c|a) = \frac{p(a,c)}{p(a)} = \begin{cases}0.24/0.6 = 0.4 & a=0,\ c=0\\ 0.36/0.6 = 0.6 & a=0,\ c=1\\ 0.24/0.4 = 0.6 & a=1,\ c=0\\ 0.16/0.4 = 0.4 & a=1,\ c=1.\end{cases}$$

Given Table 8.2, it is now easy to verify the statement $p(a,b,c) = p(a)\,p(c|a)\,p(b|c)$. The corresponding directed graph is

$$a \to c \to b.$$

Problem 8.5 Solution

The graph looks much like Figure 8.6; the difference is that we introduce a separate hyperparameter $\alpha_i$ for each weight $w_i$, $i = 1,2,\dots,M$.

Figure 1: probabilistic graphical model corresponding to the RVM described in (7.79) and (7.80).

Problem 8.6 Solution

(Wait for update.)

Problem 8.7 Solution

We just follow the hint, beginning with the mean $\boldsymbol{\mu}$. We have $\mathbb{E}[x_1] = b_1$. According to Eq (8.15),

$$\mathbb{E}[x_2] = \sum_{j\in\mathrm{pa}_2}w_{2j}\,\mathbb{E}[x_j] + b_2 = w_{21}b_1 + b_2,$$

and then

$$\mathbb{E}[x_3] = w_{32}\,\mathbb{E}[x_2] + b_3 = w_{32}(w_{21}b_1 + b_2) + b_3 = w_{32}w_{21}b_1 + w_{32}b_2 + b_3.$$

Therefore we obtain Eq (8.17) just as required. Next we deal with the covariance matrix, using Eq (8.16). We have $\mathrm{cov}[x_1,x_1] = v_1$. Then

$$\mathrm{cov}[x_1,x_2] = \sum_{k\in\mathrm{pa}_2}w_{2k}\,\mathrm{cov}[x_1,x_k] + I_{12}v_2 = w_{21}\,\mathrm{cov}[x_1,x_1] = w_{21}v_1,$$

and also $\mathrm{cov}[x_2,x_1] = \mathrm{cov}[x_1,x_2] = w_{21}v_1$. Hence

$$\mathrm{cov}[x_2,x_2] = \sum_{k\in\mathrm{pa}_2}w_{2k}\,\mathrm{cov}[x_2,x_k] + I_{22}v_2 = w_{21}^2 v_1 + v_2.$$

Next,

$$\mathrm{cov}[x_1,x_3] = \sum_{k\in\mathrm{pa}_3}w_{3k}\,\mathrm{cov}[x_1,x_k] + I_{13}v_3 = w_{32}\,\mathrm{cov}[x_1,x_2] = w_{32}w_{21}v_1,$$

$$\mathrm{cov}[x_2,x_3] = \sum_{k\in\mathrm{pa}_3}w_{3k}\,\mathrm{cov}[x_2,x_k] + I_{23}v_3 = w_{32}\big(v_2 + w_{21}^2 v_1\big),$$

and finally

$$\mathrm{cov}[x_3,x_3] = \sum_{k\in\mathrm{pa}_3}w_{3k}\,\mathrm{cov}[x_3,x_k] + I_{33}v_3 = w_{32}\big[w_{32}(v_2 + w_{21}^2 v_1)\big] + v_3 = w_{32}^2\big(v_2 + w_{21}^2 v_1\big) + v_3,$$

where we used the fact that $\mathrm{cov}[x_3,x_2] = \mathrm{cov}[x_2,x_3]$. By now we have obtained Eq (8.18) just as required. (A numerical check of this recursion is sketched below.)

Problem 8.8 Solution
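As mentioned at the end of Problem 8.7, the mean and covariance recursion for the linear-Gaussian chain $x_1\to x_2\to x_3$ can be checked numerically. The parameter values below are my own illustrative choices; the sketch compares Eq (8.17)–(8.18) against an ancestral-sampling estimate.

```python
import numpy as np

# Linear-Gaussian chain x1 -> x2 -> x3 from Problem 8.7, illustrative parameters.
b1, b2, b3 = 1.0, 0.5, -0.3
w21, w32 = 0.8, 1.2
v1, v2, v3 = 0.5, 0.4, 0.3

# Recursive mean and covariance, Eq (8.17)-(8.18)
mu = np.array([b1, w21 * b1 + b2, w32 * (w21 * b1 + b2) + b3])
Sigma = np.array([
    [v1,             w21 * v1,                   w32 * w21 * v1],
    [w21 * v1,       w21**2 * v1 + v2,           w32 * (v2 + w21**2 * v1)],
    [w32 * w21 * v1, w32 * (v2 + w21**2 * v1),   w32**2 * (v2 + w21**2 * v1) + v3],
])

# Monte-Carlo check by ancestral sampling
rng = np.random.default_rng(0)
x1 = b1 + np.sqrt(v1) * rng.standard_normal(200000)
x2 = w21 * x1 + b2 + np.sqrt(v2) * rng.standard_normal(200000)
x3 = w32 * x2 + b3 + np.sqrt(v3) * rng.standard_normal(200000)
samples = np.vstack([x1, x2, x3])
print(np.round(mu, 3), np.round(samples.mean(axis=1), 3))
print(np.round(Sigma, 3))
print(np.round(np.cov(samples), 3))
```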