Solution Manual For PRML

SOLUTION MANUAL FOR
PATTERN RECOGNITION AND MACHINE LEARNING
EDITED BY

ZHENGQI GAO
Information Science and Technology School
Fudan University
Nov. 2017

0.1 Introduction
Problem 1.1 Solution
We set the derivative of the error function $E$ with respect to the vector
$\mathbf{w}$ to 0 (i.e. $\partial E / \partial \mathbf{w} = 0$); the solution
$\mathbf{w} = \{w_i\}$ of this equation minimizes $E$. To solve the problem, we
calculate the derivative of $E$ with respect to every $w_i$ and set each of
them to 0. Based on (1.1) and (1.2) we obtain:

\[
\begin{aligned}
& \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^i = 0 \\
\Rightarrow \quad & \sum_{n=1}^{N} y(x_n, \mathbf{w})\, x_n^i = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow \quad & \sum_{n=1}^{N} \Big( \sum_{j=0}^{M} w_j x_n^j \Big) x_n^i = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow \quad & \sum_{n=1}^{N} \sum_{j=0}^{M} w_j x_n^{(j+i)} = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow \quad & \sum_{j=0}^{M} \Big( \sum_{n=1}^{N} x_n^{(j+i)} \Big) w_j = \sum_{n=1}^{N} x_n^i t_n
\end{aligned}
\]

If we denote $A_{ij} = \sum_{n=1}^{N} x_n^{i+j}$ and $T_i = \sum_{n=1}^{N} x_n^i t_n$,
the equation above can be written exactly as (1.122). Therefore the problem is
solved.
Problem 1.2 Solution
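This problem is similar to Prob.1.1; the only difference is the last term on
the right side of (1.4), the penalty term. So we proceed as in Prob.1.1:

\[
\begin{aligned}
& \frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^i + \lambda w_i = 0 \\
\Rightarrow \quad & \sum_{j=0}^{M} \sum_{n=1}^{N} x_n^{(j+i)} w_j + \lambda w_i = \sum_{n=1}^{N} x_n^i t_n \\
\Rightarrow \quad & \sum_{j=0}^{M} \Big\{ \sum_{n=1}^{N} x_n^{(j+i)} + \delta_{ji} \lambda \Big\} w_j = \sum_{n=1}^{N} x_n^i t_n
\end{aligned}
\]

where

\[
\delta_{ji} =
\begin{cases}
0, & j \neq i \\
1, & j = i
\end{cases}
\]
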

Problem 1.3 Solution
This problem can be solved with the sum rule and Bayes' theorem. The
probability of selecting an apple, $P(a)$, is:

\[
P(a) = P(a|r)P(r) + P(a|b)P(b) + P(a|g)P(g)
     = \frac{3}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.34
\]

Based on Bayes' theorem, the probability that a selected orange came from the
green box, $P(g|o)$, is:

\[
P(g|o) = \frac{P(o|g)P(g)}{P(o)}
\]

We calculate the probability of selecting an orange, $P(o)$, first:

\[
P(o) = P(o|r)P(r) + P(o|b)P(b) + P(o|g)P(g)
     = \frac{4}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.36
\]

Therefore we get:

\[
P(g|o) = \frac{P(o|g)P(g)}{P(o)} = \frac{\frac{3}{10} \times 0.6}{0.36} = 0.5
\]
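
The arithmetic above can be checked with exact fractions. This is only a small sketch: the dictionary names (`prior`, `p_apple`, `p_orange`) are illustrative and not part of the original solution; the numbers are the ones used in the derivation.

```python
from fractions import Fraction as F

# Box priors and per-box fruit fractions used above (r: red, b: blue, g: green).
prior = {"r": F(2, 10), "b": F(2, 10), "g": F(6, 10)}
p_apple = {"r": F(3, 10), "b": F(1, 2), "g": F(3, 10)}
p_orange = {"r": F(4, 10), "b": F(1, 2), "g": F(3, 10)}

# Sum rule: marginal probability of each fruit.
P_a = sum(p_apple[k] * prior[k] for k in prior)
P_o = sum(p_orange[k] * prior[k] for k in prior)

# Bayes' theorem: posterior of the green box given an orange.
P_g_given_o = p_orange["g"] * prior["g"] / P_o

print(P_a, P_o, P_g_given_o)  # 17/50 9/25 1/2
```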

Problem 1.4 Solution
This problem needs some calculus, especially the chain rule. We calculate the
derivative of $p_y(y)$ with respect to $y$, according to (1.27):

\[
\frac{d p_y(y)}{dy} = \frac{d \big( p_x(g(y)) |g'(y)| \big)}{dy}
= \frac{d p_x(g(y))}{dy} |g'(y)| + p_x(g(y)) \frac{d |g'(y)|}{dy}
\qquad (*)
\]

The first term in the above equation can be further simplified:

\[
\frac{d p_x(g(y))}{dy} |g'(y)| = \frac{d p_x(g(y))}{d g(y)} \frac{d g(y)}{dy} |g'(y)|
\qquad (**)
\]

If $\hat{x}$ is the maximum of the density over $x$, we obtain:

\[
\frac{d p_x(x)}{dx} \Big|_{\hat{x}} = 0
\]

Therefore, when $y = \hat{y}$ such that $\hat{x} = g(\hat{y})$, the first factor
on the right side of $(**)$ is 0, which makes the first term in $(*)$ equal to
0. However, because of the second term in $(*)$, the derivative may not equal
0. When a linear transformation is applied (e.g. $x = ay + b$), the second
term in $(*)$ vanishes. A simple example is:

\[
p_x(x) = 2x, \quad x \in [0, 1] \quad \Rightarrow \quad \hat{x} = 1
\]

Given that:

\[
x = \sin(y)
\]

we have $p_y(y) = 2 \sin(y) |\cos(y)|$, $y \in [0, \frac{\pi}{2}]$, which
simplifies to:

\[
p_y(y) = \sin(2y), \quad y \in [0, \tfrac{\pi}{2}] \quad \Rightarrow \quad \hat{y} = \frac{\pi}{4}
\]

However, it is clear that:

\[
\hat{x} \neq \sin(\hat{y})
\]
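
The example can be confirmed numerically. The sketch below is an illustration only (the grid size `n` and the helper name `p_y` are assumptions): it locates the mode of the transformed density by grid search and compares it with the mode of $p_x$.

```python
import math

def p_y(y):
    # Transformed density p_y(y) = p_x(sin y) |cos y| with p_x(x) = 2x on [0, 1].
    return 2.0 * math.sin(y) * abs(math.cos(y))

# Grid search for the mode of p_y on [0, pi/2].
n = 100_000
grid = [i * (math.pi / 2) / n for i in range(n + 1)]
y_hat = max(grid, key=p_y)

x_hat = 1.0  # mode of p_x(x) = 2x on [0, 1]

print(y_hat, math.pi / 4)       # the two agree to grid precision
print(math.sin(y_hat), x_hat)   # the mapped mode differs from x_hat
```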
Problem 1.5 Solution
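This problem takes advantage of the linearity of expectation:

\[
\begin{aligned}
\mathrm{var}[f] &= \mathbb{E}\big[ (f(x) - \mathbb{E}[f(x)])^2 \big] \\
&= \mathbb{E}\big[ f(x)^2 - 2 f(x) \mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2 \big] \\
&= \mathbb{E}[f(x)^2] - 2 \mathbb{E}[f(x)]^2 + \mathbb{E}[f(x)]^2 \\
\Rightarrow \quad \mathrm{var}[f] &= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\end{aligned}
\]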

Problem 1.6 Solution
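Based on (1.41), we only need to prove that when $x$ and $y$ are independent,
$\mathbb{E}_{x,y}[xy] = \mathbb{E}[x]\mathbb{E}[y]$. Because $x$ and $y$ are
independent, we have:

\[
p(x, y) = p_x(x)\, p_y(y)
\]

Therefore:

\[
\begin{aligned}
\iint x y\, p(x, y)\, dx\, dy &= \iint x y\, p_x(x)\, p_y(y)\, dx\, dy \\
&= \Big( \int x\, p_x(x)\, dx \Big) \Big( \int y\, p_y(y)\, dy \Big) \\
\Rightarrow \quad \mathbb{E}_{x,y}[xy] &= \mathbb{E}[x]\, \mathbb{E}[y]
\end{aligned}
\]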

Problem 1.7 Solution
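This problem takes advantage of integration by substitution (a change to
polar coordinates).

\[
\begin{aligned}
I^2 &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} y^2 \Big)\, dx\, dy \\
&= \int_{0}^{2\pi} \int_{0}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} r^2 \Big)\, r\, dr\, d\theta
\end{aligned}
\]

Here we use:

\[
x = r \cos\theta, \qquad y = r \sin\theta
\]

Based on the fact that:

\[
\int_{0}^{+\infty} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)\, r\, dr
= -\sigma^2 \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \Big|_{0}^{+\infty}
= -\sigma^2 (0 - 1) = \sigma^2
\]

$I$ can be solved:

\[
I^2 = \int_{0}^{2\pi} \sigma^2\, d\theta = 2\pi\sigma^2
\quad \Rightarrow \quad I = \sqrt{2\pi\sigma^2}
\]

Next, we show that the Gaussian distribution $\mathcal{N}(x|\mu, \sigma^2)$ is
normalized, i.e. $\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = 1$:

\[
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Big\}\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \qquad (y = x - \mu) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \\
&= 1
\end{aligned}
\]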

Problem 1.8 Solution
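The first question needs the result of Prob.1.7:

\[
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, x\, dx
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x - \mu)^2 \Big\}\, x\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\} (y + \mu)\, dy \qquad (y = x - \mu) \\
&= \mu \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy
 + \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, y\, dy \\
&= \mu + 0 = \mu
\end{aligned}
\]

The second integral vanishes because its integrand is an odd function of $y$.
For the second question, a hint is already given in the problem description.
Given that:

\[
\frac{d(fg)}{dx} = f \frac{dg}{dx} + g \frac{df}{dx}
\]

we differentiate both sides of (1.127) with respect to $\sigma^2$ and obtain:

\[
\int_{-\infty}^{+\infty} \Big( -\frac{1}{2\sigma^2} + \frac{(x - \mu)^2}{2\sigma^4} \Big) \mathcal{N}(x|\mu, \sigma^2)\, dx = 0
\]

Provided that $\sigma \neq 0$, we get:

\[
\int_{-\infty}^{+\infty} (x - \mu)^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx
= \int_{-\infty}^{+\infty} \sigma^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \sigma^2
\]

So the equation above has actually proven (1.51), according to the definition:

\[
\mathrm{var}[x] = \int_{-\infty}^{+\infty} (x - \mathbb{E}[x])^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx
\]

where $\mathbb{E}[x] = \mu$ has already been proved. Therefore:

\[
\mathrm{var}[x] = \sigma^2
\]

Finally,

\[
\mathbb{E}[x^2] = \mathrm{var}[x] + \mathbb{E}[x]^2 = \sigma^2 + \mu^2
\]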
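
As a sanity check on the normalization and the first two moments, the integrals can be approximated numerically. This is a minimal sketch assuming a plain trapezoidal rule; the parameter values and the helper names `gauss` and `integrate` are hypothetical, chosen only for illustration.

```python
import math

def gauss(x, mu, sigma2):
    # Univariate Gaussian density N(x | mu, sigma^2).
    return math.exp(-(x - mu) ** 2 / (2 * sigma2)) / math.sqrt(2 * math.pi * sigma2)

def integrate(f, a, b, n=20_000):
    # Composite trapezoidal rule on [a, b].
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b)) + sum(f(a + i * h) for i in range(1, n))
    return total * h

mu, sigma2 = 1.5, 4.0  # hypothetical parameter values
half_width = 12 * math.sqrt(sigma2)
lo, hi = mu - half_width, mu + half_width  # tails beyond this range are negligible

norm = integrate(lambda x: gauss(x, mu, sigma2), lo, hi)
mean = integrate(lambda x: x * gauss(x, mu, sigma2), lo, hi)
second = integrate(lambda x: x * x * gauss(x, mu, sigma2), lo, hi)

print(norm, mean, second)  # close to 1, mu, and sigma^2 + mu^2
```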

Problem 1.9 Solution
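Here we only focus on (1.52), because (1.52) is the general form of (1.46).
By definition, the maximum of a distribution is known as its mode.
Differentiating (1.52), we obtain:

\[
\begin{aligned}
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \mathbf{x}}
&= -\frac{1}{2} \big[ \boldsymbol{\Sigma}^{-1} + (\boldsymbol{\Sigma}^{-1})^T \big] (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma}) \\
&= -\boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})
\end{aligned}
\]

where we take advantage of $(\boldsymbol{\Sigma}^{-1})^T = \boldsymbol{\Sigma}^{-1}$ and:

\[
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\, \mathbf{x}
\]

Therefore, only when $\mathbf{x} = \boldsymbol{\mu}$ do we have:

\[
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \boldsymbol{\Sigma})}{\partial \mathbf{x}} = 0
\]

Note: You may also need to calculate the Hessian matrix to prove that this
point is a maximum. However, here we find that the first derivative has only
one root. Based on the description in the problem, this point should be the
maximum point.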
Problem 1.10 Solution
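We will solve this problem based on the definitions of expectation, variance
and independence.

\[
\begin{aligned}
\mathbb{E}[x + z] &= \iint (x + z)\, p(x, z)\, dx\, dz \\
&= \iint (x + z)\, p(x)\, p(z)\, dx\, dz \\
&= \iint x\, p(x)\, p(z)\, dx\, dz + \iint z\, p(x)\, p(z)\, dx\, dz \\
&= \int \Big( \int p(z)\, dz \Big) x\, p(x)\, dx + \int \Big( \int p(x)\, dx \Big) z\, p(z)\, dz \\
&= \int x\, p(x)\, dx + \int z\, p(z)\, dz \\
&= \mathbb{E}[x] + \mathbb{E}[z]
\end{aligned}
\]

\[
\begin{aligned}
\mathrm{var}[x + z] &= \iint (x + z - \mathbb{E}[x + z])^2\, p(x, z)\, dx\, dz \\
&= \iint \{ (x + z)^2 - 2 (x + z)\, \mathbb{E}[x + z] + \mathbb{E}^2[x + z] \}\, p(x, z)\, dx\, dz \\
&= \iint (x + z)^2\, p(x, z)\, dx\, dz - 2\, \mathbb{E}[x + z] \iint (x + z)\, p(x, z)\, dx\, dz + \mathbb{E}^2[x + z] \\
&= \iint (x + z)^2\, p(x, z)\, dx\, dz - \mathbb{E}^2[x + z] \\
&= \iint (x^2 + 2xz + z^2)\, p(x)\, p(z)\, dx\, dz - \mathbb{E}^2[x + z] \\
&= \int \Big( \int p(z)\, dz \Big) x^2 p(x)\, dx + \iint 2xz\, p(x)\, p(z)\, dx\, dz
 + \int \Big( \int p(x)\, dx \Big) z^2 p(z)\, dz - \mathbb{E}^2[x + z] \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] - \mathbb{E}^2[x + z] + \iint 2xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] - (\mathbb{E}[x] + \mathbb{E}[z])^2 + \iint 2xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathbb{E}[x^2] - \mathbb{E}^2[x] + \mathbb{E}[z^2] - \mathbb{E}^2[z] - 2\, \mathbb{E}[x]\, \mathbb{E}[z] + 2 \iint xz\, p(x)\, p(z)\, dx\, dz \\
&= \mathrm{var}[x] + \mathrm{var}[z] - 2\, \mathbb{E}[x]\, \mathbb{E}[z] + 2 \Big( \int x\, p(x)\, dx \Big) \Big( \int z\, p(z)\, dz \Big) \\
&= \mathrm{var}[x] + \mathrm{var}[z]
\end{aligned}
\]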

Problem 1.11 Solution
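Based on the prior knowledge that $\mu_{ML}$ and $\sigma^2_{ML}$ decouple, we
first calculate $\mu_{ML}$:

\[
\frac{d \big( \ln p(\mathbf{x}|\mu, \sigma^2) \big)}{d\mu} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu)
\]

We let:

\[
\frac{d \big( \ln p(\mathbf{x}|\mu, \sigma^2) \big)}{d\mu} = 0
\]

Therefore:

\[
\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n
\]

And because:

\[
\frac{d \big( \ln p(\mathbf{x}|\mu, \sigma^2) \big)}{d\sigma^2}
= \frac{1}{2\sigma^4} \Big( \sum_{n=1}^{N} (x_n - \mu)^2 - N \sigma^2 \Big)
\]

we let:

\[
\frac{d \big( \ln p(\mathbf{x}|\mu, \sigma^2) \big)}{d\sigma^2} = 0
\]

Therefore:

\[
\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
\]
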

Problem 1.12 Solution
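The result for $\mathbb{E}[\mu_{ML}]$ is straightforward, given that the $x_n$
are i.i.d. and follow the Gaussian distribution $\mathcal{N}(\mu, \sigma^2)$.

\[
\mathbb{E}[\mu_{ML}] = \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} x_n \Big]
= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] = \mathbb{E}[x_n] = \mu
\]

For $\mathbb{E}[\sigma^2_{ML}]$, we need to take advantage of (1.56) and what
has been given in the problem:

\[
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n^2 - 2 x_n \mu_{ML} + \mu_{ML}^2) \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big] - \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu_{ML} \Big] + \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} \mu_{ML}^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big( \frac{1}{N} \sum_{m=1}^{N} x_m \Big) \Big] + \mathbb{E}[\mu_{ML}^2] \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big( \sum_{m=1}^{N} x_m \Big) \Big] + \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] + \frac{1}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2} \big[ N (N \mu^2 + \sigma^2) \big]
\end{aligned}
\]

Therefore we have:

\[
\mathbb{E}[\sigma^2_{ML}] = \Big( \frac{N - 1}{N} \Big) \sigma^2
\]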
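
The bias factor $(N-1)/N$ can also be observed empirically. The following is a rough Monte Carlo sketch, not part of the original solution; the seed, the values of `mu`, `sigma`, `N` and the number of trials are arbitrary assumptions.

```python
import random

def sigma2_ml(xs):
    # Maximum likelihood variance: mean squared deviation from the sample mean.
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

random.seed(0)
mu, sigma, N, trials = 1.0, 2.0, 5, 100_000  # hypothetical settings

# Average the estimator over many independent data sets of size N.
avg = sum(
    sigma2_ml([random.gauss(mu, sigma) for _ in range(N)])
    for _ in range(trials)
) / trials

expected = (N - 1) / N * sigma ** 2
print(avg, expected)  # the average is close to (N - 1) / N * sigma^2, not sigma^2
```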

Problem 1.13 Solution
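This problem can be solved with the same method used in Prob.1.12 (here $\mu$
replaces $\mu_{ML}$):

\[
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2 \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n - \mu)^2 \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n^2 - 2 x_n \mu + \mu^2) \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big] - \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu \Big] + \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} \mu^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2\mu}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] + \mu^2 \\
&= \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 \\
&= \sigma^2
\end{aligned}
\]

Note: The biggest difference between Prob.1.12 and Prob.1.13 is whether the
mean of the Gaussian distribution is known beforehand (Prob.1.13) or not
(Prob.1.12). In other words, the difference can be shown by the following
equations:

\[
\mathbb{E}[\mu^2] = \mu^2
\]

($\mu$ is fixed, i.e. its expectation is itself, which is also true for $\mu^2$)

\[
\mathbb{E}[\mu^2_{ML}] = \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big)^2 \Big]
= \frac{1}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big]
= \frac{1}{N^2} N (N \mu^2 + \sigma^2) = \mu^2 + \frac{\sigma^2}{N}
\]
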
Problem 1.14 Solution
This problem is quite similar to the fact that any function $f(x)$ can be
written as the sum of an odd function and an even function. If we let:

\[
w^S_{ij} = \frac{w_{ij} + w_{ji}}{2} \quad \text{and} \quad w^A_{ij} = \frac{w_{ij} - w_{ji}}{2}
\]

it is obvious that they satisfy the constraints described in the problem,
which are:

\[
w_{ij} = w^S_{ij} + w^A_{ij}, \qquad w^S_{ij} = w^S_{ji}, \qquad w^A_{ij} = -w^A_{ji}
\]

To prove (1.132), we only need to simplify it:

\[
\begin{aligned}
\sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j
&= \sum_{i=1}^{D} \sum_{j=1}^{D} (w^S_{ij} + w^A_{ij})\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^S_{ij} x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j
\end{aligned}
\]

Therefore, we only need to prove that the second term equals 0. Here we use a
simple trick: we prove that twice the second term equals 0 instead.

\[
\begin{aligned}
2 \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j
&= \sum_{i=1}^{D} \sum_{j=1}^{D} (w^A_{ij} + w^A_{ij})\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} (w^A_{ij} - w^A_{ji})\, x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j - \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ji} x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w^A_{ij} x_i x_j - \sum_{j=1}^{D} \sum_{i=1}^{D} w^A_{ji} x_j x_i \\
&= 0
\end{aligned}
\]

Therefore, we can choose the coefficient matrix to be symmetric, as described
in the problem. By symmetry, the whole matrix is determined if and only if
$w_{ij}$ is given for all $i = 1, 2, \ldots, D$ and $i \le j$. Hence, the number
of independent parameters is given by:

\[
D + (D - 1) + \ldots + 1 = \frac{D (D + 1)}{2}
\]

Note: You can view this intuitively: if the upper triangular part of a
symmetric matrix is given, the whole matrix is determined.
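
A quick numerical illustration of the decomposition is sketched below, assuming a random coefficient matrix. The dimension `D`, the seed and the helper name `quad` are hypothetical and only serve the example.

```python
import random

random.seed(1)
D = 4  # hypothetical dimension
W = [[random.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
x = [random.uniform(-1, 1) for _ in range(D)]

# Symmetric and anti-symmetric parts of W.
WS = [[(W[i][j] + W[j][i]) / 2 for j in range(D)] for i in range(D)]
WA = [[(W[i][j] - W[j][i]) / 2 for j in range(D)] for i in range(D)]

def quad(M, v):
    # Quadratic form sum_ij M_ij v_i v_j.
    return sum(M[i][j] * v[i] * v[j] for i in range(len(v)) for j in range(len(v)))

full, sym, anti = quad(W, x), quad(WS, x), quad(WA, x)

# Count the entries with i <= j, i.e. the independent parameters of WS.
independent = sum(1 for i in range(D) for j in range(D) if i <= j)

print(full, sym, anti)                  # full equals sym up to rounding; anti is ~0
print(independent, D * (D + 1) // 2)    # 10 10
```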
Problem 1.15 Solution
This problem is a more general form of Prob.1.14, so the same method can be
used here: we will find a way to use $w_{i_1 i_2 \ldots i_M}$ to represent
$\widetilde{w}_{i_1 i_2 \ldots i_M}$. We begin by introducing a mapping function:

\[
F(x_{i_1} x_{i_2} \ldots x_{i_M}) = x_{j_1} x_{j_2} \ldots x_{j_M}
\]

\[
\text{s.t.} \quad \bigcup_{k=1}^{M} x_{i_k} = \bigcup_{k=1}^{M} x_{j_k}
\quad \text{and} \quad j_1 \ge j_2 \ge j_3 \ge \ldots \ge j_M
\]

It is cumbersome to write $F$ in closed mathematical form, but the function
does a simple job: it rearranges the factors in decreasing order of their
subindices. Several examples are given below for $D = 5$, $M = 4$:

\[
\begin{aligned}
F(x_5 x_2 x_3 x_2) &= x_5 x_3 x_2 x_2 \\
F(x_1 x_3 x_3 x_2) &= x_3 x_3 x_2 x_1 \\
F(x_1 x_4 x_2 x_3) &= x_4 x_3 x_2 x_1 \\
F(x_1 x_1 x_5 x_2) &= x_5 x_2 x_1 x_1
\end{aligned}
\]

After introducing $F$, the solution becomes very simple, based on the fact
that $F$ does not change the value of a term, but only rearranges it.
\[
\sum_{i_1=1}^{D} \sum_{i_2=1}^{D} \ldots \sum_{i_M=1}^{D} w_{i_1 i_2 \ldots i_M}\, x_{i_1} x_{i_2} \ldots x_{i_M}
= \sum_{j_1=1}^{D} \sum_{j_2=1}^{j_1} \ldots \sum_{j_M=1}^{j_{M-1}} \widetilde{w}_{j_1 j_2 \ldots j_M}\, x_{j_1} x_{j_2} \ldots x_{j_M}
\]

where

\[
\widetilde{w}_{j_1 j_2 \ldots j_M} = \sum_{w \in \Omega} w
\]

\[
\Omega = \{ w_{i_1 i_2 \ldots i_M} \mid F(x_{i_1} x_{i_2} \ldots x_{i_M}) = x_{j_1} x_{j_2} \ldots x_{j_M},\ \forall\, x_{i_1} x_{i_2} \ldots x_{i_M} \}
\]

So far, we have proven (1.134). Mathematical induction will be used to prove
(1.135), and we begin with the case $D = 1$, i.e. $n(1, M) = n(1, M - 1)$.
When $D = 1$, (1.134) degenerates into $\widetilde{w} x_1^M$, i.e. it has only
one term, whose coefficient is governed by $\widetilde{w}$ regardless of the
value of $M$. Therefore, we have proven that $n(D, M) = 1$ when $D = 1$.
Suppose (1.135) holds for $D$; let us prove that it also holds for $D + 1$,
and then (1.135) will be proved by mathematical induction.
Let us begin with (1.134):

\[
\sum_{i_1=1}^{D+1} \sum_{i_2=1}^{i_1} \ldots \sum_{i_M=1}^{i_{M-1}} \widetilde{w}_{i_1 i_2 \ldots i_M}\, x_{i_1} x_{i_2} \ldots x_{i_M}
\qquad (*)
\]

We divide $(*)$ into two parts based on the first summation: the first part
consists of $i_1 = 1, 2, \ldots, D$ and the second part of $i_1 = D + 1$. After
the division, the first part corresponds to $n(D, M)$, and the second part
corresponds to $n(D + 1, M - 1)$. Therefore we obtain:

\[
n(D + 1, M) = n(D, M) + n(D + 1, M - 1)
\]

And given the fact that (1.135) holds for $D$:

\[
n(D, M) = \sum_{i=1}^{D} n(i, M - 1)
\qquad (**)
\]

we substitute $(**)$ into the recurrence above:

\[
n(D + 1, M) = \sum_{i=1}^{D} n(i, M - 1) + n(D + 1, M - 1) = \sum_{i=1}^{D+1} n(i, M - 1)
\]


We will prove (1.136) in a different but simple way. We rewrite (1.136) in
the notation of permutations and combinations:

\[
\sum_{i=1}^{D} C_{i+M-2}^{M-1} = C_{D+M-1}^{M}
\]

Firstly, we expand the summation:

\[
C_{M-1}^{M-1} + C_{M}^{M-1} + \ldots + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}
\]

Secondly, we rewrite the first term on the left side as $C_{M}^{M}$, because
$C_{M-1}^{M-1} = C_{M}^{M} = 1$. In other words, we only need to prove:

\[
C_{M}^{M} + C_{M}^{M-1} + \ldots + C_{D+M-2}^{M-1} = C_{D+M-1}^{M}
\]

Thirdly, we take advantage of the property
$C_{N}^{r} = C_{N-1}^{r} + C_{N-1}^{r-1}$. We can recursively combine the first
and second terms on the left side, and the result ultimately equals the right
side.
(1.137) gives the mathematical form of $n(D, M)$, and we need all the
conclusions above to prove it.
Let us build some intuition by illustrating $M = 0, 1, 2$. When $M = 0$,
(1.134) consists of only a constant term, which means $n(D, 0) = 1$. When
$M = 1$, it is obvious that $n(D, 1) = D$, because in this case (1.134) has
only $D$ terms if we expand it. When $M = 2$, it degenerates to Prob.1.14, so
$n(D, 2) = \frac{D(D+1)}{2}$ is also obvious. Suppose (1.137) holds for
$M - 1$; let us prove that it also holds for $M$.

\[
\begin{aligned}
n(D, M) &= \sum_{i=1}^{D} n(i, M - 1) \qquad \text{(based on (1.135))} \\
&= \sum_{i=1}^{D} C_{i+M-2}^{M-1} \qquad \text{((1.137) holds for } M - 1 \text{)} \\
&= C_{M-1}^{M-1} + C_{M}^{M-1} + C_{M+1}^{M-1} + \ldots + C_{D+M-2}^{M-1} \\
&= ( C_{M}^{M} + C_{M}^{M-1} ) + C_{M+1}^{M-1} + \ldots + C_{D+M-2}^{M-1} \\
&= ( C_{M+1}^{M} + C_{M+1}^{M-1} ) + \ldots + C_{D+M-2}^{M-1} \\
&= C_{M+2}^{M} + \ldots + C_{D+M-2}^{M-1} \\
&= \ldots \\
&= C_{D+M-1}^{M}
\end{aligned}
\]

With this, everything has been proven.
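
The counting argument can be cross-checked by brute force. The sketch below assumes that an independent term of order $M$ corresponds to a multiset of $M$ indices drawn from $D$ values (equivalently, an index tuple in non-increasing order); the function names `n_enum` and `n_closed` are illustrative.

```python
from itertools import combinations_with_replacement
from math import comb

def n_enum(D, M):
    # Count multisets of M indices out of D, i.e. tuples with j1 >= ... >= jM.
    return sum(1 for _ in combinations_with_replacement(range(D), M))

def n_closed(D, M):
    # Closed form (1.137): C(D + M - 1, M).
    return comb(D + M - 1, M)

checks = []
for D in range(1, 7):
    for M in range(0, 5):
        # Enumeration agrees with the closed form.
        checks.append(n_enum(D, M) == n_closed(D, M))
        if M >= 1:
            # Recursion (1.135): n(D, M) = sum_{i=1}^{D} n(i, M - 1).
            checks.append(
                n_closed(D, M) == sum(n_closed(i, M - 1) for i in range(1, D + 1))
            )

print(all(checks))  # True
```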

Problem 1.16 Solution
This problem can be solved in the same way as Prob.1.15. Firstly, we write
the expression consisting of all the independent terms up to $M$th order,
corresponding to $N(D, M)$. By adding a summation over the order $m$ on the
left side of (1.134), we obtain:

\[
\sum_{m=0}^{M} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} \ldots \sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1 i_2 \ldots i_m}\, x_{i_1} x_{i_2} \ldots x_{i_m}
\qquad (*)
\]

(1.138) is quite obvious if we view $m$ as a looping variable, iterating
through all the possible orders less than or equal to $M$; for every possible
order $m$, the number of independent parameters is given by $n(D, m)$.
Let us prove (1.138) formally by mathematical induction. When $M = 1$, $(*)$
degenerates to two terms: $m = 0$, corresponding to $n(D, 0)$, and $m = 1$,
corresponding to $n(D, 1)$. Therefore $N(D, 1) = n(D, 0) + n(D, 1)$. Suppose
(1.138) holds for $M$; we will see that it also holds for $M + 1$. Let us
begin by writing all the independent terms based on $(*)$:

\[
\sum_{m=0}^{M+1} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} \ldots \sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1 i_2 \ldots i_m}\, x_{i_1} x_{i_2} \ldots x_{i_m}
\qquad (**)
\]

Using the same technique as in Prob.1.15, we divide $(**)$ into two parts
based on the summation over $m$: the first part consists of
$m = 0, 1, \ldots, M$ and the second part of $m = M + 1$. Hence, the first part
corresponds to $N(D, M)$ and the second part corresponds to $n(D, M + 1)$. So
we obtain:

\[
N(D, M + 1) = N(D, M) + n(D, M + 1)
\]

Then we substitute (1.138) into the equation above:

\[
\begin{aligned}
N(D, M + 1) &= \sum_{m=0}^{M} n(D, m) + n(D, M + 1) \\
&= \sum_{m=0}^{M+1} n(D, m)
\end{aligned}
\]

To prove (1.139), we will use the same technique as in Prob.1.15 instead of
mathematical induction. We begin with the already proved (1.138):

\[
N(D, M) = \sum_{m=0}^{M} n(D, m)
\]

We then take advantage of (1.137):

\[
\begin{aligned}
N(D, M) &= \sum_{m=0}^{M} C_{D+m-1}^{m} \\
&= C_{D-1}^{0} + C_{D}^{1} + C_{D+1}^{2} + \ldots + C_{D+M-1}^{M} \\
&= ( C_{D}^{0} + C_{D}^{1} ) + C_{D+1}^{2} + \ldots + C_{D+M-1}^{M} \\
&= ( C_{D+1}^{1} + C_{D+1}^{2} ) + \ldots + C_{D+M-1}^{M} \\
&= \ldots \\
&= C_{D+M}^{M}
\end{aligned}
\]

As asked by the problem, we now examine how fast $N(D, M)$ grows. Note that
$D$ and $M$ enter $N(D, M)$ symmetrically, which means that we only need to
prove that it grows like $D^M$ when $D \gg M$; the case $M \gg D$ then
follows by symmetry.

\[
\begin{aligned}
N(D, M) &= \frac{(D + M)!}{D!\, M!} \approx \frac{(D + M)^{D+M}}{D^D M^M} \\
&= \frac{1}{M^M} \Big( \frac{D + M}{D} \Big)^{D} (D + M)^M \\
&= \frac{1}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{M} (D + M)^M \\
&\approx \Big( \frac{e}{M} \Big)^{M} (D + M)^M \\
&= \frac{e^M}{M^M} \Big( 1 + \frac{M}{D} \Big)^{M} D^M \\
&= \frac{e^M}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{\frac{M^2}{D}} D^M \\
&\approx \frac{e^{M + \frac{M^2}{D}}}{M^M} D^M \approx \frac{e^M}{M^M} D^M
\end{aligned}
\]

Here we use Stirling's approximation, $\lim_{n \to +\infty} (1 + \frac{1}{n})^n = e$
and $e^{\frac{M^2}{D}} \approx e^0 = 1$. According to the description in the
problem, when $D \gg M$ we can view $\frac{e^M}{M^M}$ as a constant, so
$N(D, M)$ grows like $D^M$ in this case. By symmetry, $N(D, M)$ grows like
$M^D$ when $M \gg D$.
Finally, we are asked to calculate $N(10, 3)$ and $N(100, 3)$:

\[
N(10, 3) = C_{13}^{3} = 286
\]

\[
N(100, 3) = C_{103}^{3} = 176851
\]
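
The two numerical values, and the agreement between the sum (1.138) and the closed form (1.139), can be reproduced in a few lines. This is an illustrative sketch; the function names are not from the original.

```python
from math import comb

def n(D, m):
    # Independent parameters of order m, closed form (1.137).
    return comb(D + m - 1, m)

def N_sum(D, M):
    # (1.138): sum over all orders up to M.
    return sum(n(D, m) for m in range(M + 1))

def N_closed(D, M):
    # (1.139): C(D + M, M).
    return comb(D + M, M)

print(N_sum(10, 3), N_closed(10, 3))     # 286 286
print(N_sum(100, 3), N_closed(100, 3))   # 176851 176851
```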

Problem 1.17 Solution

\[
\begin{aligned}
\Gamma(x + 1) &= \int_{0}^{+\infty} u^x e^{-u}\, du \\
&= \int_{0}^{+\infty} -u^x\, d e^{-u} \\
&= -u^x e^{-u} \Big|_{0}^{+\infty} - \int_{0}^{+\infty} e^{-u}\, d(-u^x) \\
&= -u^x e^{-u} \Big|_{0}^{+\infty} + x \int_{0}^{+\infty} e^{-u} u^{x-1}\, du \\
&= -u^x e^{-u} \Big|_{0}^{+\infty} + x\, \Gamma(x)
\end{aligned}
\]

Here we have used integration by parts. According to the equation above, we
only need to prove that the first term equals 0. By L'Hospital's rule:

\[
\lim_{u \to +\infty} -\frac{u^x}{e^u} = \lim_{u \to +\infty} -\frac{x!}{e^u} = 0
\]

Also, when $u = 0$, $-u^x e^{-u} = 0$, so we have proved
$\Gamma(x + 1) = x \Gamma(x)$. Based on the definition of $\Gamma(x)$, we can
write:

\[
\Gamma(1) = \int_{0}^{+\infty} e^{-u}\, du = -e^{-u} \Big|_{0}^{+\infty} = -(0 - 1) = 1
\]

Therefore, when $x$ is an integer:

\[
\Gamma(x + 1) = x\, \Gamma(x) = x (x - 1)\, \Gamma(x - 1) = \ldots = x!\, \Gamma(1) = x!
\]
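
Both properties can be spot-checked with the gamma function from Python's standard library. The test points below are arbitrary choices made for this sketch.

```python
import math

# Recursion Gamma(x + 1) = x * Gamma(x) at a few non-integer points.
recursion_ok = all(
    math.isclose(math.gamma(x + 1), x * math.gamma(x), rel_tol=1e-12)
    for x in [0.5, 1.7, 3.2, 6.9]
)

# Factorial property Gamma(k + 1) = k! for small integers k.
factorial_ok = all(
    math.isclose(math.gamma(k + 1), math.factorial(k), rel_tol=1e-12)
    for k in range(0, 15)
)

print(recursion_ok, factorial_ok)  # True True
```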
Problem 1.18 Solution
Based on (1.124) and (1.126), and by substituting $x = \sqrt{2} \sigma y$, it
is straightforward to obtain:

\[
\int_{-\infty}^{+\infty} e^{-x_i^2}\, dx_i = \sqrt{\pi}
\]

Therefore, the left side of (1.142) equals $\pi^{\frac{D}{2}}$. For the right
side of (1.142):

\[
\begin{aligned}
S_D \int_{0}^{+\infty} e^{-r^2} r^{D-1}\, dr
&= S_D \int_{0}^{+\infty} e^{-u} u^{\frac{D-1}{2}}\, d\sqrt{u} \qquad (u = r^2) \\
&= \frac{S_D}{2} \int_{0}^{+\infty} e^{-u} u^{\frac{D}{2} - 1}\, du \\
&= \frac{S_D}{2} \Gamma\Big( \frac{D}{2} \Big)
\end{aligned}
\]

Hence, we obtain:

\[
\pi^{\frac{D}{2}} = \frac{S_D}{2} \Gamma\Big( \frac{D}{2} \Big)
\quad \Rightarrow \quad
S_D = \frac{2 \pi^{\frac{D}{2}}}{\Gamma(\frac{D}{2})}
\]

$S_D$ gives the surface area of the sphere with radius 1 in dimension $D$. We
can extend the conclusion: the surface area of the sphere with radius $r$ in
dimension $D$ equals $S_D \cdot r^{D-1}$, and when $r = 1$ it reduces to $S_D$.
This conclusion is natural once you note that the surface area of a sphere in
dimension $D$ is proportional to the $(D-1)$th power of its radius, i.e.
$r^{D-1}$. Considering the relationship between the volume $V$ and the surface
area $S$ of a sphere with arbitrary radius in dimension $D$,
$\frac{dV}{dr} = S$, we obtain:

\[
V = \int S\, dr = \int S_D r^{D-1}\, dr = \frac{S_D}{D} r^D
\]

The equation above gives the volume of a sphere with radius $r$ in dimension
$D$, so we let $r = 1$:

\[
V_D = \frac{S_D}{D}
\]

For $D = 2$ and $D = 3$:

\[
V_2 = \frac{S_2}{2} = \frac{1}{2} \cdot \frac{2\pi}{\Gamma(1)} = \pi
\]

\[
V_3 = \frac{S_3}{3} = \frac{1}{3} \cdot \frac{2 \pi^{\frac{3}{2}}}{\Gamma(\frac{3}{2})}
= \frac{1}{3} \cdot \frac{2 \pi^{\frac{3}{2}}}{\frac{\sqrt{\pi}}{2}} = \frac{4}{3} \pi
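
The formulas for $S_D$ and $V_D$ are easy to evaluate directly. The sketch below (function names `S` and `V` are illustrative) reproduces the familiar values for $D = 2$ and $D = 3$.

```python
import math

def S(D):
    # Surface area of the unit sphere in D dimensions: 2 pi^(D/2) / Gamma(D/2).
    return 2 * math.pi ** (D / 2) / math.gamma(D / 2)

def V(D):
    # Volume of the unit sphere in D dimensions: S_D / D.
    return S(D) / D

print(S(2), 2 * math.pi)        # circumference of the unit circle
print(V(2), math.pi)            # area of the unit disc
print(S(3), 4 * math.pi)        # surface area of the unit ball
print(V(3), 4 * math.pi / 3)    # volume of the unit ball
```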

Problem 1.19 Solution
We have already given a hint in the solution of Prob.1.18, and here we make
it more explicit: the volume of a sphere with radius $r$ is $V_D \cdot r^D$.
This is quite similar to the conclusion we obtained in Prob.1.18 about the
surface area, except that the volume is proportional to the $D$th power of the
radius, i.e. $r^D$ rather than $r^{D-1}$.

\[
\frac{\text{volume of sphere}}{\text{volume of cube}}
= \frac{V_D a^D}{(2a)^D} = \frac{S_D}{2^D D}
= \frac{\pi^{\frac{D}{2}}}{2^{D-1} D\, \Gamma(\frac{D}{2})}
\qquad (*)
\]

Here we have used the result (1.143). To show that $(*)$ converges to 0 when
$D \to +\infty$, we rewrite it:

\[
(*) = \frac{2}{D} \cdot \Big( \frac{\pi}{4} \Big)^{\frac{D}{2}} \cdot \frac{1}{\Gamma(\frac{D}{2})}
\]

Now it is clear that all three factors converge to 0 when $D \to +\infty$.
Therefore their product also converges to 0. The last question is quite
simple:

\[
\frac{\text{center to one corner}}{\text{center to one side}}
= \frac{\sqrt{a^2 \cdot D}}{a} = \sqrt{D}
\quad \text{and} \quad
\lim_{D \to +\infty} \sqrt{D} = +\infty
\]
Problem 1.20 Solution

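The probability density in a thin shell with radius $r$ and thickness
$\epsilon$ can be viewed as a constant. Considering that a sphere in dimension
$D$ with radius $r$ has surface area $S_D r^{D-1}$, as already shown in
Prob.1.18:

\[
\int_{\text{shell}} p(\mathbf{x})\, d\mathbf{x}
= p(\mathbf{x}) \int_{\text{shell}} d\mathbf{x}
= \frac{\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{\frac{D}{2}}} \cdot V(\text{shell})
= \frac{\exp(-\frac{r^2}{2\sigma^2})}{(2\pi\sigma^2)^{\frac{D}{2}}}\, S_D r^{D-1} \epsilon
\]

Thus we denote:

\[
p(r) = \frac{S_D r^{D-1}}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)
\]

We calculate the derivative of (1.148) with respect to $r$:

\[
\frac{d p(r)}{dr} = \frac{S_D}{(2\pi\sigma^2)^{\frac{D}{2}}}\, r^{D-2} \Big( D - 1 - \frac{r^2}{\sigma^2} \Big) \exp\Big( -\frac{r^2}{2\sigma^2} \Big)
\qquad (*)
\]

Setting the derivative equal to 0, we obtain its unique root (stationary
point) $\hat{r} = \sqrt{D - 1}\, \sigma$, because $r \in [0, +\infty)$. When
$r < \hat{r}$, the derivative is greater than 0 and $p(r)$ increases with $r$;
when $r > \hat{r}$, the derivative is less than 0 and $p(r)$ decreases with
$r$. Therefore $\hat{r}$ is the only maximum point. It is obvious that
$\hat{r} \approx \sqrt{D}\, \sigma$ when $D \gg 1$.

\[
\begin{aligned}
\frac{p(\hat{r} + \epsilon)}{p(\hat{r})}
&= \frac{(\hat{r} + \epsilon)^{D-1} \exp\big( -\frac{(\hat{r} + \epsilon)^2}{2\sigma^2} \big)}{\hat{r}^{D-1} \exp\big( -\frac{\hat{r}^2}{2\sigma^2} \big)} \\
&= \Big( 1 + \frac{\epsilon}{\hat{r}} \Big)^{D-1} \exp\Big( -\frac{2 \epsilon \hat{r} + \epsilon^2}{2\sigma^2} \Big) \\
&= \exp\Big( -\frac{2 \epsilon \hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big) \Big)
\end{aligned}
\]

We expand the exponent using Taylor's theorem, together with
$\hat{r}^2 = (D - 1) \sigma^2$:

\[
\begin{aligned}
-\frac{2 \epsilon \hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big)
&\approx -\frac{2 \epsilon \hat{r} + \epsilon^2}{2\sigma^2} + (D - 1) \Big( \frac{\epsilon}{\hat{r}} - \frac{\epsilon^2}{2 \hat{r}^2} \Big) \\
&= -\frac{2 \epsilon \hat{r} + \epsilon^2}{2\sigma^2} + \frac{2 \hat{r} \epsilon - \epsilon^2}{2\sigma^2} \\
&= -\frac{\epsilon^2}{\sigma^2}
\end{aligned}
\]

Therefore, $p(\hat{r} + \epsilon) = p(\hat{r}) \exp(-\frac{\epsilon^2}{\sigma^2})$.
Note: Here I draw a different conclusion compared with (1.149), but I do not
think there is any mistake in my deduction.
Finally, we see from (1.147):

\[
p(\mathbf{x}) \Big|_{\mathbf{x} = 0} = \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}}
\]

\[
p(\mathbf{x}) \Big|_{\|\mathbf{x}\|^2 = \hat{r}^2}
= \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{\hat{r}^2}{2\sigma^2} \Big)
\approx \frac{1}{(2\pi\sigma^2)^{\frac{D}{2}}} \exp\Big( -\frac{D}{2} \Big)
\]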
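
The location of the maximum and the exponent $-\epsilon^2/\sigma^2$ can be checked numerically in log space, which avoids overflow for large $D$. This is a sketch under assumed values (`D = 1000`, `sigma = 1`, `eps = 0.1`); the helper `log_p` drops the constant factor that does not depend on $r$.

```python
import math

def log_p(r, D, sigma):
    # ln p(r) up to the r-independent constant ln(S_D / (2 pi sigma^2)^(D/2)).
    return (D - 1) * math.log(r) - r * r / (2 * sigma * sigma)

D, sigma, eps = 1000, 1.0, 0.1  # hypothetical values with D >> 1
r_hat = math.sqrt(D - 1) * sigma

# r_hat should beat nearby radii on both sides.
is_max = (log_p(r_hat, D, sigma) > log_p(0.99 * r_hat, D, sigma)
          and log_p(r_hat, D, sigma) > log_p(1.01 * r_hat, D, sigma))

# ln( p(r_hat + eps) / p(r_hat) ) compared with -eps^2 / sigma^2.
log_ratio = log_p(r_hat + eps, D, sigma) - log_p(r_hat, D, sigma)

print(is_max)
print(log_ratio, -eps ** 2 / sigma ** 2)  # the two values nearly coincide
```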

Problem 1.21 Solution
The first question is rather simple:

\[
(ab)^{\frac{1}{2}} - a = a^{\frac{1}{2}} ( b^{\frac{1}{2}} - a^{\frac{1}{2}} ) \ge 0
\]

where we have taken advantage of $b \ge a \ge 0$. Based on (1.78):

\[
\begin{aligned}
p(\text{mistake}) &= p(\mathbf{x} \in R_1, C_2) + p(\mathbf{x} \in R_2, C_1) \\
&= \int_{R_1} p(\mathbf{x}, C_2)\, d\mathbf{x} + \int_{R_2} p(\mathbf{x}, C_1)\, d\mathbf{x}
\end{aligned}
\]

Recall the decision rule that minimizes misclassification: if
$p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$ for a given value of $\mathbf{x}$, we
assign that $\mathbf{x}$ to class $C_1$. Hence, in decision region $R_1$ we
have $p(\mathbf{x}, C_1) > p(\mathbf{x}, C_2)$. Therefore, using what we have
proved, we obtain:

\[
\int_{R_1} p(\mathbf{x}, C_2)\, d\mathbf{x} \le \int_{R_1} \{ p(\mathbf{x}, C_1)\, p(\mathbf{x}, C_2) \}^{\frac{1}{2}}\, d\mathbf{x}
\]

The same holds for decision region $R_2$. Therefore we obtain:

\[
p(\text{mistake}) \le \int \{ p(\mathbf{x}, C_1)\, p(\mathbf{x}, C_2) \}^{\frac{1}{2}}\, d\mathbf{x}
\]
Problem 1.22 Solution
We need to understand (1.81) thoroughly. When $L_{kj} = 1 - I_{kj}$:

\[
\sum_{k} L_{kj}\, p(C_k|\mathbf{x}) = \sum_{k} p(C_k|\mathbf{x}) - p(C_j|\mathbf{x})
\]

Given a specific $\mathbf{x}$, the first term on the right side is a constant,
which equals 1, no matter which class $C_j$ we assign $\mathbf{x}$ to.
Therefore, if we want to minimize the loss, we maximize $p(C_j|\mathbf{x})$.
Hence, we assign $\mathbf{x}$ to the class $C_j$ that gives the largest
posterior probability $p(C_j|\mathbf{x})$.
The interpretation of the loss matrix is quite simple. If we label correctly,
there is no loss. Otherwise, we incur a loss, of the same degree whichever
class we label it as. The loss matrix is given below to give an intuitive
view:

\[
\begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & 1 & \cdots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 1 & 1 & \cdots & 0
\end{pmatrix}
\]

Problem 1.23 Solution

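\[
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{R_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}
= \sum_{k} \sum_{j} \int_{R_j} L_{kj}\, p(C_k)\, p(\mathbf{x}|C_k)\, d\mathbf{x}
\]

If we denote a new loss matrix by $L^{\star}_{kj} = L_{kj}\, p(C_k)$, we obtain a
new equation:

\[
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{R_j} L^{\star}_{kj}\, p(\mathbf{x}|C_k)\, d\mathbf{x}
\]
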

Problem 1.24 Solution
The description of this problem is a little confusing. What it really means
is that $\lambda$ is the parameter governing the loss, just as $\theta$ governs
the posterior probability $p(C_k|\mathbf{x})$ when we introduce the reject
option. Therefore the reject option can be rewritten in terms of $\lambda$ and
the loss:

\[
\text{choice} =
\begin{cases}
\text{class } C_j, & \min_{l} \sum_{k} L_{kl}\, p(C_k|\mathbf{x}) < \lambda \\
\text{reject}, & \text{otherwise}
\end{cases}
\]

where $C_j$ is the class that attains the minimum. If $L_{kj} = 1 - I_{kj}$,
according to what we have proved in Prob.1.22:

\[
\sum_{k} L_{kj}\, p(C_k|\mathbf{x}) = \sum_{k} p(C_k|\mathbf{x}) - p(C_j|\mathbf{x}) = 1 - p(C_j|\mathbf{x})
\]

Therefore, the criterion for not rejecting, expressed in terms of $\lambda$
above, is equivalent to the largest posterior probability being larger than
$1 - \lambda$:

\[
\min_{l} \sum_{k} L_{kl}\, p(C_k|\mathbf{x}) < \lambda
\quad \Longleftrightarrow \quad
\max_{l}\, p(C_l|\mathbf{x}) > 1 - \lambda
\]

From the view of $\theta$ and the posterior probability, the condition under
which we label a class for $\mathbf{x}$ (i.e. we do not reject) is given by the
constraint:

\[
\max_{l}\, p(C_l|\mathbf{x}) > \theta
\]

Hence, from the two different views, we can see that $\lambda$ and $\theta$
are related by:

\[
\lambda + \theta = 1
\]
Problem 1.25 Solution
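We can prove this informally by dealing with one dimension at a time, using
the same process as in (1.87) - (1.89), until all dimensions have been
handled. This works because the total loss $E$ can be divided into the sum of
the losses on every dimension, which are moreover independent. Here, we will
use a more formal way to prove this. In this case, the expected loss can be
written as:

\[
\mathbb{E}[L] = \iint \| \mathbf{y}(\mathbf{x}) - \mathbf{t} \|^2\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}\, d\mathbf{x}
\]

Therefore, following the same process as in (1.87) - (1.89):

\[
\begin{aligned}
& \frac{\partial \mathbb{E}[L]}{\partial \mathbf{y}(\mathbf{x})} = 2 \int \{ \mathbf{y}(\mathbf{x}) - \mathbf{t} \}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t} = 0 \\
\Rightarrow \quad & \mathbf{y}(\mathbf{x}) = \frac{\int \mathbf{t}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}}{p(\mathbf{x})} = \mathbb{E}_{\mathbf{t}}[\mathbf{t}|\mathbf{x}]
\end{aligned}
\]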
Problem 1.26 Solution
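The process is identical to the deduction we conduct for (1.90), so we will
not repeat it here. What we should emphasize is that
$\mathbb{E}[\mathbf{t}|\mathbf{x}]$ is a function of $\mathbf{x}$, not
$\mathbf{t}$. Thus the integral over $\mathbf{t}$ can be carried out first
with $\mathbb{E}[\mathbf{t}|\mathbf{x}]$ held fixed, which makes the cross
term vanish, and that is how we obtain (1.90).
Note: There is a mistake in (1.90), i.e. the second term on the right side
is wrong. You can view (3.37) on P148 for reference. It should be:

\[
\mathbb{E}[L] = \int \| \mathbf{y}(\mathbf{x}) - \mathbb{E}[\mathbf{t}|\mathbf{x}] \|^2\, p(\mathbf{x})\, d\mathbf{x}
+ \iint \| \mathbb{E}[\mathbf{t}|\mathbf{x}] - \mathbf{t} \|^2\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{x}\, d\mathbf{t}
\]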
Problem 1.27 Solution
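We deal with this problem using the calculus of variations.

\[
\begin{aligned}
& \frac{\partial \mathbb{E}[L_q]}{\partial y(\mathbf{x})}
= q \int | y(\mathbf{x}) - t |^{q-1}\, \mathrm{sign}( y(\mathbf{x}) - t )\, p(\mathbf{x}, t)\, dt = 0 \\
\Rightarrow \quad & \int_{-\infty}^{y(\mathbf{x})} | y(\mathbf{x}) - t |^{q-1}\, p(\mathbf{x}, t)\, dt
= \int_{y(\mathbf{x})}^{+\infty} | y(\mathbf{x}) - t |^{q-1}\, p(\mathbf{x}, t)\, dt \\
\Rightarrow \quad & \int_{-\infty}^{y(\mathbf{x})} | y(\mathbf{x}) - t |^{q-1}\, p(t|\mathbf{x})\, dt
= \int_{y(\mathbf{x})}^{+\infty} | y(\mathbf{x}) - t |^{q-1}\, p(t|\mathbf{x})\, dt
\end{aligned}
\]

Here we take advantage of $p(\mathbf{x}, t) = p(t|\mathbf{x})\, p(\mathbf{x})$
and the property of the sign function. Hence, when $q = 1$, the equation above
reduces to:

\[
\int_{-\infty}^{y(\mathbf{x})} p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{+\infty} p(t|\mathbf{x})\, dt
\]

In other words, when $q = 1$, the optimal $y(\mathbf{x})$ is given by the
conditional median. The case $q \to 0$ is non-trivial. We need to rewrite
(1.91):

\[
\begin{aligned}
\mathbb{E}[L_q] &= \int \Big\{ \int | y(\mathbf{x}) - t |^q\, p(t|\mathbf{x})\, p(\mathbf{x})\, dt \Big\}\, d\mathbf{x} \\
&= \int p(\mathbf{x}) \Big\{ \int | y(\mathbf{x}) - t |^q\, p(t|\mathbf{x})\, dt \Big\}\, d\mathbf{x}
\qquad (*)
\end{aligned}
\]

If we want to minimize $\mathbb{E}[L_q]$, we only need to minimize the inner
integral of $(*)$ for every $\mathbf{x}$:

\[
\int | y(\mathbf{x}) - t |^q\, p(t|\mathbf{x})\, dt
\qquad (**)
\]

When $q \to 0$, $| y(\mathbf{x}) - t |^q$ is close to 1 everywhere except in a
small neighborhood around $t = y(\mathbf{x})$ (this can be seen from Fig.1.29).
Therefore:

\[
(**) \approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} ( 1 - | y(\mathbf{x}) - t |^q )\, p(t|\mathbf{x})\, dt
\approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} p(t|\mathbf{x})\, dt
\]

where $\epsilon$ denotes the small neighborhood and $U$ denotes the whole
space in which $t$ lies. Note that the first term does not depend on
$y(\mathbf{x})$, but the second term does (because the choice of
$y(\mathbf{x})$ determines the location of $\epsilon$). Hence we put
$\epsilon$ at the location where $p(t|\mathbf{x})$ achieves its largest value,
i.e. the mode, because this gives the largest reduction. Therefore, it is
natural to choose $y(\mathbf{x})$ equal to the $t$ that maximizes
$p(t|\mathbf{x})$ for every $\mathbf{x}$.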
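
The $q = 1$ case can be illustrated on a small discrete example: for equally likely target values, the prediction that minimizes the expected absolute loss is their median. The sample `ts` and the search grid below are hypothetical choices made for this sketch.

```python
import statistics

ts = [0.1, 0.4, 0.5, 2.0, 7.5]  # hypothetical, equally likely target values

def expected_abs_loss(y):
    # Expected L_1 loss of predicting y under the empirical distribution of ts.
    return sum(abs(y - t) for t in ts) / len(ts)

# Brute-force search over a fine grid of candidate predictions.
grid = [i / 1000 for i in range(0, 8001)]
y_best = min(grid, key=expected_abs_loss)

print(y_best, statistics.median(ts))  # 0.5 0.5
```

Note that the mean of `ts` is 2.1, so the minimizer of the absolute loss is clearly different from the minimizer of the squared loss in this example.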
Problem 1.28 Solution
This problem is essentially about the definition of information content, i.e.
$h(x)$. We will restate the problem more precisely. In information theory,
$h(\cdot)$ is also called information content and denoted $I(\cdot)$. Here we
still use $h(\cdot)$ for consistency. The whole problem is about the
properties of $h(x)$. Based on our knowledge that $h(\cdot)$ is a monotonic
function of the probability $p(x)$, we can write:

\[
h(x) = f( p(x) )
\]

The equation above means that the information we obtain for a specific value
of a random variable $x$ depends on its probability of occurrence $p(x)$, and
the relationship is given by a mapping function $f(\cdot)$. Suppose $C$ is the
intersection of two independent events $A$ and $B$; then the information of
event $C$ occurring is the compound message of both independent events $A$ and
$B$ occurring:

\[
h(C) = h(A \cap B) = h(A) + h(B)
\qquad (*)
\]

Because $A$ and $B$ are independent:

\[
P(C) = P(A) \cdot P(B)
\]

We apply the function $f(\cdot)$ to both sides:

\[
f(P(C)) = f(P(A) \cdot P(B))
\qquad (**)
\]

Moreover, the left sides of $(*)$ and $(**)$ are equal by definition, so we
obtain:

\[
\begin{aligned}
& h(A) + h(B) = f(P(A) \cdot P(B)) \\
\Rightarrow \quad & f(P(A)) + f(P(B)) = f(P(A) \cdot P(B))
\end{aligned}
\]

We obtain an important property of the function $f(\cdot)$:
$f(x \cdot y) = f(x) + f(y)$.
Note: What Prob.1.28 really wants us to prove is the form and property of the
function $f$ in our formulation, because there is one sentence in the
description of the problem: "In this exercise, we derive the relation between
h and p in the form of a function h(p)" (i.e. $f(\cdot)$ in our formulation is
equivalent to $h(p)$ in the description).
At present, what we know is the property of the function $f(\cdot)$:

\[
f(xy) = f(x) + f(y)
\qquad (*)
\]

Firstly, we choose $x = y$, and then it is obvious that $f(x^2) = 2 f(x)$.
Secondly, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, is clearly true for $n = 1$
and $n = 2$. Suppose it is also true for $n$; we prove that it is true for
$n + 1$:

\[
f(x^{n+1}) = f(x^n) + f(x) = n f(x) + f(x) = (n + 1) f(x)
\]

Therefore, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, has been proved. For a
positive integer $m$, we rewrite $x^n$ as $(x^{\frac{n}{m}})^m$ and take
advantage of what we have proved to obtain:

\[
f(x^n) = f\big( (x^{\frac{n}{m}})^m \big) = m f(x^{\frac{n}{m}})
\]

Because $f(x^n)$ also equals $n f(x)$, we have
$n f(x) = m f(x^{\frac{n}{m}})$. We simplify the equation and obtain:

\[
f(x^{\frac{n}{m}}) = \frac{n}{m} f(x)
\]

For an arbitrary positive $x \in \mathbb{R}^+$, we can find two sequences of
positive rational numbers $\{y_n\}$ and $\{z_n\}$ which satisfy:

\[
y_1 < y_2 < \ldots < y_N < x \quad \text{and} \quad \lim_{N \to +\infty} y_N = x
\]

\[
z_1 > z_2 > \ldots > z_N > x \quad \text{and} \quad \lim_{N \to +\infty} z_N = x
\]

We take advantage of the fact that the function $f(\cdot)$ is monotonic:

\[
y_N f(p) = f(p^{y_N}) \le f(p^x) \le f(p^{z_N}) = z_N f(p)
\]

When $N \to +\infty$, we obtain $f(p^x) = x f(p)$, $x \in \mathbb{R}^+$. Letting
$p = e$, this can be rewritten as $f(e^x) = x f(e)$. Finally, we denote
$y = e^x$:

\[
f(y) = \ln(y)\, f(e)
\]

where $f(e)$ is a constant once the function $f(\cdot)$ is decided. Therefore
$f(x) \propto \ln(x)$.
Problem 1.29 Solution

This problem is a little bit tricky. The entropy of an $M$-state discrete
random variable $x$ can be written as:

\[
H[x] = -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i)
\]

where $\lambda_i$ is the probability that $x$ takes state $i$. Here we choose
the concave function $f(\cdot) = \ln(\cdot)$ and rewrite Jensen's inequality
(1.115) for a concave function:

\[
\ln\Big( \sum_{i=1}^{M} \lambda_i x_i \Big) \ge \sum_{i=1}^{M} \lambda_i \ln(x_i)
\]

We choose $x_i = \frac{1}{\lambda_i}$ and simplify the equation above to obtain:

\[
\ln M \ge -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i) = H[x]
\]
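
The bound is easy to observe numerically: the uniform distribution attains $\ln M$ and any other distribution falls below it. The distributions in this sketch are arbitrary examples, and the helper name `entropy` is illustrative.

```python
import math

def entropy(probs):
    # H[x] = -sum_i lambda_i ln(lambda_i); states with zero probability contribute 0.
    return -sum(p * math.log(p) for p in probs if p > 0)

M = 4
uniform = [1 / M] * M
skewed = [0.7, 0.1, 0.1, 0.1]       # hypothetical non-uniform distribution
degenerate = [1.0, 0.0, 0.0, 0.0]   # all mass on one state

print(entropy(uniform), math.log(M))  # equal: the bound is attained
print(entropy(skewed))                # strictly below ln M
print(entropy(degenerate))            # zero entropy
```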

Problem 1.30 Solution
Based on definition :

ln{

p ( x)
} =
q ( x)
=

1
s
1
ln( ) − [ 2 ( x − µ)2 − 2 ( x − m)2 ]
σ
2σ
2s
µ
1
m
µ2
m2
s
1
ln( ) − [( 2 − 2 ) x2 − ( 2 − 2 ) x + ( 2 − 2 )]
σ
2σ
2s
σ
s
2σ
2s

We will take advantage of the following equations to solve this problem.
∫
E[ x2 ] = x2 N ( x|µ, σ2 ) dx = µ2 + σ2
∫
E[ x ] =

x N ( x|µ, σ2 ) dx = µ

∫

N ( x|µ, σ2 ) dx = 1

Given the equations above, it is easy to see :
∫
q ( x)
K L( p|| q) = − p( x) ln{
} dx
p ( x)
∫
p ( x)
} dx
=
N ( x|µ, σ) ln{
q ( x)
=
=

s
1
1
µ
m
µ2
m2
ln( ) − ( 2 − 2 )(µ2 + σ2 ) + ( 2 − 2 )µ − ( 2 − 2 )
σ
2σ
2s
σ
s
2σ
2s
s
σ2 + (µ − m)2 1
ln( ) +
−
σ
2
2 s2

23
We will discuss this result in more detail. Firstly, if the KL distance is defined with base-2 logarithms, as is common in Information Theory, the whole result is divided by $\ln 2$; in particular the first term becomes $\log_2(\frac{s}{\sigma})$ instead of $\ln(\frac{s}{\sigma})$. Secondly, if we denote $x = \frac{s}{\sigma}$, the KL distance can be rewritten as:
\[ KL(p\|q) = \ln(x) + \frac{1}{2x^2} - \frac{1}{2} + a, \qquad \text{where } a = \frac{(\mu - m)^2}{2s^2} \]
Treating $a$ as fixed, we calculate the derivative of the remaining terms with respect to $x$ and set it equal to 0:
\[ \frac{d}{dx}\Big(\ln(x) + \frac{1}{2x^2} - \frac{1}{2}\Big) = \frac{1}{x} - x^{-3} = 0 \quad\Rightarrow\quad x = 1 \quad (\because\ s, \sigma > 0) \]
When $x < 1$ the derivative is less than 0, and when $x > 1$ it is greater than 0, which makes $x = 1$ the global minimum of these terms, where they equal 0. When $x = 1$, $KL(p\|q) = a$. What's more, $a \ge 0$, and $a$ achieves its minimum 0 when $\mu = m$. In this way, we have shown that the KL distance between two Gaussian distributions is not less than 0, and that it equals 0 only when the two Gaussian distributions are identical, i.e. they have the same mean and variance.
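As a sanity check on the closed form, the sketch below compares it with a direct numerical integration of p(x) ln(p(x)/q(x)). It is only an illustration: the function names, the integration range and the example parameters are assumptions, and the midpoint rule is a convenient stand-in for any quadrature method.

```python
import math

def kl_gauss(mu, sigma, m, s):
    """Closed form of KL(p||q) for p = N(mu, sigma^2) and q = N(m, s^2)."""
    return math.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5

def kl_numeric(mu, sigma, m, s, lo=-40.0, hi=40.0, n=200_000):
    """Midpoint-rule approximation of the integral of p(x) ln(p(x)/q(x))."""
    def log_pdf(x, mean, std):
        return -0.5 * math.log(2 * math.pi * std**2) - (x - mean)**2 / (2 * std**2)
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        lp = log_pdf(x, mu, sigma)
        total += math.exp(lp) * (lp - log_pdf(x, m, s)) * h
    return total

closed = kl_gauss(1.0, 2.0, -0.5, 3.0)
numeric = kl_numeric(1.0, 2.0, -0.5, 3.0)
identical = kl_gauss(1.0, 2.0, 1.0, 2.0)   # same mean and variance
print(closed, numeric, identical)
```

The last value illustrates the equality case: the distance vanishes when both Gaussians coincide.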
Problem 1.31 Solution
We evaluate $H[x] + H[y] - H[x,y]$ by definition. Firstly, let's calculate $H[x,y]$:
\[
\begin{aligned}
H[x,y] &= -\iint p(x,y) \ln p(x,y)\, dx\, dy \\
&= -\iint p(x,y) \ln p(x)\, dx\, dy - \iint p(x,y) \ln p(y|x)\, dx\, dy \\
&= -\int p(x) \ln p(x)\, dx - \iint p(x,y) \ln p(y|x)\, dx\, dy \\
&= H[x] + H[y|x]
\end{aligned}
\]
where we take advantage of $p(x,y) = p(x)p(y|x)$, $\int p(x,y)\, dy = p(x)$ and (1.111). Therefore, we have actually solved Prob.1.37 here. We will continue our proof for this problem, based on what we have proved:
\[
\begin{aligned}
H[x] + H[y] - H[x,y] &= H[y] - H[y|x] \\
&= -\int p(y) \ln p(y)\, dy + \iint p(x,y) \ln p(y|x)\, dx\, dy \\
&= -\iint p(x,y) \ln p(y)\, dx\, dy + \iint p(x,y) \ln p(y|x)\, dx\, dy \\
&= -\iint p(x,y) \ln\Big(\frac{p(x)p(y)}{p(x,y)}\Big)\, dx\, dy \\
&= KL\big(p(x,y)\,\|\,p(x)p(y)\big) = I(x,y) \ge 0
\end{aligned}
\]
where we take advantage of the following properties:
\[ p(y) = \int p(x,y)\, dx, \qquad \frac{p(y)}{p(y|x)} = \frac{p(x)p(y)}{p(x,y)} \]
Moreover, it is straightforward that the equality holds if and only if x and y are statistically independent, due to the property of the KL distance. You can also see this directly: if x and y are independent, then $p(x,y) = p(x)p(y)$ and
\[
\begin{aligned}
H[x,y] &= -\iint p(x,y) \ln p(x,y)\, dx\, dy \\
&= -\iint p(x,y) \ln p(x)\, dx\, dy - \iint p(x,y) \ln p(y)\, dx\, dy \\
&= -\int p(x) \ln p(x)\, dx - \int p(y) \ln p(y)\, dy \\
&= H[x] + H[y]
\end{aligned}
\]

Problem 1.32 Solution
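It is straightforward based on the definition. Note that when we change variables in an integral, we have to introduce an additional factor, the absolute value of the Jacobian determinant.
\[
\begin{aligned}
H[y] &= -\int p(y) \ln p(y)\, dy \\
&= -\int \frac{p(x)}{|A|} \ln\frac{p(x)}{|A|}\, \Big|\frac{\partial y}{\partial x}\Big|\, dx \\
&= -\int p(x) \ln\frac{p(x)}{|A|}\, dx \\
&= -\int p(x) \ln p(x)\, dx - \int p(x) \ln\frac{1}{|A|}\, dx \\
&= H[x] + \ln|A|
\end{aligned}
\]
where we have taken advantage of the following equations:
\[ \frac{\partial y}{\partial x} = A, \qquad p(x) = p(y)\,\Big|\frac{\partial y}{\partial x}\Big| = p(y)\,|A|, \qquad \int p(x)\, dx = 1 \]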
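For a Gaussian x the result can be verified in closed form, because y = Ax is again Gaussian with covariance A Σ Aᵀ. The sketch below assumes NumPy is available and uses the Gaussian entropy formula derived in Prob.2.15; the matrices `cov_x` and `A` are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_entropy(cov):
    """Differential entropy (nats) of a Gaussian with covariance `cov`."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * logdet + 0.5 * d * (1.0 + np.log(2.0 * np.pi))

cov_x = np.array([[2.0, 0.3],
                  [0.3, 1.0]])
A = np.array([[1.0, 2.0],
              [0.5, -3.0]])          # non-singular, det(A) = -4
cov_y = A @ cov_x @ A.T              # covariance of y = A x

gap = gaussian_entropy(cov_y) - gaussian_entropy(cov_x)
log_abs_det_A = np.log(abs(np.linalg.det(A)))
print(gap, log_abs_det_A)
```

Both printed numbers should agree, which is the statement H[y] = H[x] + ln|A| restricted to the Gaussian case.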
Problem 1.33 Solution
Based on the definition of entropy, we write:
\[ H[y|x] = -\sum_{x_i}\sum_{y_j} p(x_i, y_j) \ln p(y_j|x_i) \]
By the properties of probability, $0 \le p(y_j|x_i) \le 1$ and $0 \le p(x_i, y_j) \le 1$. Therefore $-p(x_i, y_j) \ln p(y_j|x_i) \ge 0$ when $0 < p(y_j|x_i) \le 1$. When $p(y_j|x_i) = 0$, using the fact that $\lim_{p\to 0} p \ln p = 0$, we see that $-p(x_i, y_j) \ln p(y_j|x_i) = -p(x_i)\, p(y_j|x_i) \ln p(y_j|x_i) = 0$ (here we view $p(x_i)$ as a constant). Hence we have proved that an arbitrary term in the sum above cannot be less than 0. In other words, $H[y|x]$ equals 0 if and only if every term of $H[y|x]$ equals 0.
Therefore, for each possible value of the random variable x, denoted as $x_i$:
\[ -\sum_{y_j} p(x_i, y_j) \ln p(y_j|x_i) = 0 \qquad (*) \]
Suppose that, given $x = x_i$, there is more than one possible value $y_j$ of the random variable y such that $p(y_j|x_i) \ne 0$ (because $x_i$ and $y_j$ are both "possible", $p(x_i, y_j)$ is also nonzero). Constrained by $0 \le p(y_j|x_i) \le 1$ and $\sum_j p(y_j|x_i) = 1$, at least two values of y must then satisfy $0 < p(y_j|x_i) < 1$, which makes the left side of $(*)$ strictly positive, a contradiction.
Therefore, for each possible value of x, there is only one y such that $p(y|x) \ne 0$. In other words, y is determined by x. Note: this result is quite intuitive. If y is a function of x, we can obtain the value of y as soon as we observe x. Therefore we obtain no additional information from observing $y_j$ once x has been observed.
Problem 1.34 Solution
This problem is complicated. We will explain it in detail. According to Appendix D, we have the relation (D.3):
\[ F[y(x) + \epsilon\eta(x)] = F[y(x)] + \epsilon\int \frac{\partial F}{\partial y}\,\eta(x)\, dx + O(\epsilon^2) \qquad (**) \]
where $y(x)$ can be viewed as an operator that gives an output value $y$ for any input $x$, and equivalently, $F[y(x)]$ can be viewed as a functional operator that gives an output value $F[y(x)]$ for any input function $y(x)$. Then we consider a functional operator:
\[ I[p(x)] = \int p(x) f(x)\, dx \]
Under a small variation $p(x) \to p(x) + \epsilon\eta(x)$, we obtain:
\[ I[p(x) + \epsilon\eta(x)] = \int p(x) f(x)\, dx + \epsilon\int \eta(x) f(x)\, dx \]
Comparing the equation above with $(**)$, we can draw a conclusion:
\[ \frac{\partial I}{\partial p(x)} = f(x) \]
Similarly, let's consider another functional operator:
\[ J[p(x)] = \int p(x) \ln p(x)\, dx \]
Then under a small variation $p(x) \to p(x) + \epsilon\eta(x)$:
\[
\begin{aligned}
J[p(x) + \epsilon\eta(x)] &= \int \big(p(x) + \epsilon\eta(x)\big) \ln\big(p(x) + \epsilon\eta(x)\big)\, dx \\
&= \int p(x) \ln\big(p(x) + \epsilon\eta(x)\big)\, dx + \int \epsilon\eta(x) \ln\big(p(x) + \epsilon\eta(x)\big)\, dx
\end{aligned}
\]
Since $\epsilon\eta(x)$ is much smaller than $p(x)$, we expand the logarithm in a Taylor series about $p(x)$:
\[ \ln\big(p(x) + \epsilon\eta(x)\big) = \ln p(x) + \frac{\epsilon\eta(x)}{p(x)} + O\big((\epsilon\eta(x))^2\big) \]
We substitute the equation above into $J[p(x) + \epsilon\eta(x)]$:
\[ J[p(x) + \epsilon\eta(x)] = \int p(x) \ln p(x)\, dx + \epsilon\int \eta(x)\big(\ln p(x) + 1\big)\, dx + O(\epsilon^2) \]
Therefore, we also obtain:
\[ \frac{\partial J}{\partial p(x)} = \ln p(x) + 1 \]
Now we can go back to (1.108). Based on $\frac{\partial J}{\partial p(x)}$ and $\frac{\partial I}{\partial p(x)}$, we can calculate the derivative of the expression just before (1.108) and set it equal to 0:
\[ -\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0 \]
Rearranging it, we obtain (1.108). From (1.108) we can see that $p(x)$ should take the form of a Gaussian distribution. So we rewrite it in Gaussian form and compare it to a Gaussian distribution with mean $\mu$ and variance $\sigma^2$:
\[ \exp(-1 + \lambda_1) = \frac{1}{(2\pi\sigma^2)^{1/2}}, \qquad \exp\big(\lambda_2 x + \lambda_3 (x - \mu)^2\big) = \exp\Big\{-\frac{(x - \mu)^2}{2\sigma^2}\Big\} \]
Finally, we obtain:
\[ \lambda_1 = 1 - \frac{1}{2}\ln(2\pi\sigma^2), \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2} \]
Problem 1.35 Solution
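If $p(x) = \mathcal{N}(x|\mu, \sigma^2)$, we write its entropy:
\[
\begin{aligned}
H[x] &= -\int p(x) \ln p(x)\, dx \\
&= -\int p(x) \ln\Big\{\frac{1}{(2\pi\sigma^2)^{1/2}}\Big\}\, dx - \int p(x)\Big\{-\frac{(x - \mu)^2}{2\sigma^2}\Big\}\, dx \\
&= -\ln\Big\{\frac{1}{(2\pi\sigma^2)^{1/2}}\Big\} + \frac{\sigma^2}{2\sigma^2} \\
&= \frac{1}{2}\big\{1 + \ln(2\pi\sigma^2)\big\}
\end{aligned}
\]
where we have taken advantage of the following properties of a Gaussian distribution:
\[ \int p(x)\, dx = 1 \quad\text{and}\quad \int (x - \mu)^2 p(x)\, dx = \sigma^2 \]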
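The closed form can be compared against a direct numerical evaluation of the entropy integral. This is a rough sketch under assumed parameters (mean 0.3, standard deviation 1.5) with hypothetical helper names; the midpoint rule over twelve standard deviations is an arbitrary but adequate choice.

```python
import math

def gaussian_entropy_closed(sigma):
    """Entropy (1/2)(1 + ln(2 pi sigma^2)) of a univariate Gaussian."""
    return 0.5 * (1.0 + math.log(2.0 * math.pi * sigma**2))

def gaussian_entropy_numeric(mu, sigma, n=100_000, width=12.0):
    """Midpoint rule for -integral of p ln p over mu +/- width * sigma."""
    lo, hi = mu - width * sigma, mu + width * sigma
    h = (hi - lo) / n
    total = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        log_p = -0.5 * math.log(2.0 * math.pi * sigma**2) - (x - mu)**2 / (2.0 * sigma**2)
        total -= math.exp(log_p) * log_p * h
    return total

closed_form = gaussian_entropy_closed(1.5)
numeric = gaussian_entropy_numeric(0.3, 1.5)
print(closed_form, numeric)
```

Note that the mean does not enter the closed form, which the numerical value confirms.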
Problem 1.36 Solution
Here we should make it clear that if the second derivative is strictly positive, the function must be strictly convex. However, the converse may not be true. For example, $f(x) = x^4$, $x \in \mathbb{R}$, is strictly convex by definition, but its second derivative at $x = 0$ is 0 (see the keyword convex function on Wikipedia or page 71 of the book Convex Optimization by Boyd and Vandenberghe for more details). Hence, more precisely, we will prove that a function is convex if and only if its second derivative is non-negative. We first consider the Taylor expansions:
\[ f(x + \epsilon) = f(x) + \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 + \frac{f'''(x)}{3!}\epsilon^3 + \dots \]
\[ f(x - \epsilon) = f(x) - \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 - \frac{f'''(x)}{3!}\epsilon^3 + \dots \]
Adding the two expansions, we obtain an expression for $f''(x)$:
\[ f''(x) = \lim_{\epsilon\to 0} \frac{f(x + \epsilon) + f(x - \epsilon) - 2f(x)}{\epsilon^2} \]
where the terms of order $O(\epsilon^4)$ are neglected. If $f(x)$ is convex, we obtain:
\[ f(x) = f\Big(\frac{1}{2}(x + \epsilon) + \frac{1}{2}(x - \epsilon)\Big) \le \frac{1}{2} f(x + \epsilon) + \frac{1}{2} f(x - \epsilon) \]
Hence $f''(x) \ge 0$. The converse is a little more involved. We use the Lagrange form of Taylor's theorem to rewrite the expansion above:
\[ f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x^\star)}{2}(x - x_0)^2 \]
where $x^\star$ lies between $x$ and $x_0$. By hypothesis $f''(x) \ge 0$, so the last term is non-negative for all $x$. We let $x_0 = \lambda x_1 + (1 - \lambda)x_2$ and $x = x_1$:
\[ f(x_1) \ge f(x_0) + (1 - \lambda)(x_1 - x_2) f'(x_0) \qquad (*) \]
And then we let $x = x_2$:
\[ f(x_2) \ge f(x_0) + \lambda(x_2 - x_1) f'(x_0) \qquad (**) \]
We multiply $(*)$ by $\lambda$ and $(**)$ by $1 - \lambda$ and add them together; the terms containing $f'(x_0)$ cancel, and we see:
\[ \lambda f(x_1) + (1 - \lambda) f(x_2) \ge f\big(\lambda x_1 + (1 - \lambda)x_2\big) \]
Problem 1.37 Solution
See Prob.1.31.
Problem 1.38 Solution
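When $M = 2$, (1.115) reduces to (1.114). We suppose that (1.115) holds for $M$, and we will prove that it also holds for $M + 1$.
\[
\begin{aligned}
f\Big(\sum_{m=1}^{M+1} \lambda_m x_m\Big) &= f\Big(\lambda_{M+1} x_{M+1} + (1 - \lambda_{M+1}) \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} x_m\Big) \\
&\le \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1})\, f\Big(\sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} x_m\Big) \\
&\le \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1}) \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} f(x_m) \\
&= \sum_{m=1}^{M+1} \lambda_m f(x_m)
\end{aligned}
\]
The first inequality is (1.114), and the second is the induction hypothesis, which applies because the weights $\frac{\lambda_m}{1 - \lambda_{M+1}}$, $m = 1, \dots, M$, are non-negative and sum to 1. Hence Jensen's inequality, i.e. (1.115), has been proved.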
Problem 1.39 Solution
It is quite straightforward based on the definitions.
\[ H[x] = -\sum_i p(x_i) \ln p(x_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365 \]
\[ H[y] = -\sum_i p(y_i) \ln p(y_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365 \]
\[ H[x,y] = -\sum_{i,j} p(x_i, y_j) \ln p(x_i, y_j) = -3\cdot\frac{1}{3}\ln\frac{1}{3} - 0 = 1.0986 \]
\[ H[x|y] = -\sum_{i,j} p(x_i, y_j) \ln p(x_i|y_j) = -\frac{1}{3}\ln 1 - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} = 0.4621 \]
\[ H[y|x] = -\sum_{i,j} p(x_i, y_j) \ln p(y_j|x_i) = -\frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln 1 = 0.4621 \]
\[
\begin{aligned}
I[x,y] &= -\sum_{i,j} p(x_i, y_j) \ln\frac{p(x_i)p(y_j)}{p(x_i, y_j)} \\
&= -\frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{1}{3}}{1/3} - \frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{2}{3}}{1/3} - \frac{1}{3}\ln\frac{\frac{1}{3}\cdot\frac{2}{3}}{1/3} = 0.1744
\end{aligned}
\]
Their relations are given below, diagrams omitted.
\[ I[x,y] = H[x] - H[x|y] = H[y] - H[y|x] \]
\[ H[x,y] = H[y|x] + H[x] = H[x|y] + H[y] \]
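These numbers are easy to reproduce. The sketch below assumes the joint table implied by the calculation above (three cells with probability 1/3 and one empty cell); the placement of the empty cell at x = 1, y = 0 is an assumption consistent with the marginals used, and all variable names are illustrative.

```python
import math

# Assumed joint table p(x, y): rows are x = 0, 1 and columns are y = 0, 1.
joint = [[1 / 3, 1 / 3],
         [0.0,   1 / 3]]

def H(probs):
    """Entropy in nats, skipping zero-probability cells."""
    return -sum(p * math.log(p) for p in probs if p > 0)

px = [sum(row) for row in joint]
py = [sum(joint[i][j] for i in range(2)) for j in range(2)]
cells = [(i, j) for i in range(2) for j in range(2) if joint[i][j] > 0]

H_x = H(px)
H_y = H(py)
H_xy = H([joint[i][j] for i, j in cells])
H_x_given_y = -sum(joint[i][j] * math.log(joint[i][j] / py[j]) for i, j in cells)
H_y_given_x = -sum(joint[i][j] * math.log(joint[i][j] / px[i]) for i, j in cells)
I_xy = -sum(joint[i][j] * math.log(px[i] * py[j] / joint[i][j]) for i, j in cells)

print(round(H_x, 4), round(H_xy, 4), round(H_x_given_y, 4), round(I_xy, 4))
```

The conditional entropies and the mutual information are computed from their definitions, so the relations listed above serve as an independent consistency check.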
Problem 1.40 Solution

$f(x) = \ln x$ is a strictly concave function, therefore we take advantage of Jensen's inequality to obtain:
\[ f\Big(\sum_{m=1}^{M} \lambda_m x_m\Big) \ge \sum_{m=1}^{M} \lambda_m f(x_m) \]
We let $\lambda_m = \frac{1}{M}$, $m = 1, 2, \dots, M$. Hence we obtain:
\[ \ln\Big(\frac{x_1 + x_2 + \dots + x_M}{M}\Big) \ge \frac{1}{M}\big[\ln(x_1) + \ln(x_2) + \dots + \ln(x_M)\big] = \frac{1}{M}\ln(x_1 x_2 \dots x_M) \]
We take advantage of the fact that $f(x) = \ln x$ is strictly increasing and then obtain:
\[ \frac{x_1 + x_2 + \dots + x_M}{M} \ge \sqrt[M]{x_1 x_2 \dots x_M} \]
Problem 1.41 Solution
Based on the definition of $I[x,y]$, i.e. (1.120), we obtain:
\[
\begin{aligned}
I[x,y] &= -\iint p(x,y) \ln\frac{p(x)p(y)}{p(x,y)}\, dx\, dy \\
&= -\iint p(x,y) \ln\frac{p(x)}{p(x|y)}\, dx\, dy \\
&= -\iint p(x,y) \ln p(x)\, dx\, dy + \iint p(x,y) \ln p(x|y)\, dx\, dy \\
&= -\int p(x) \ln p(x)\, dx + \iint p(x,y) \ln p(x|y)\, dx\, dy \\
&= H[x] - H[x|y]
\end{aligned}
\]
where we have taken advantage of the facts $p(x,y) = p(y)p(x|y)$ and $\int p(x,y)\, dy = p(x)$. The same process can be used to prove $I[x,y] = H[y] - H[y|x]$, if we substitute $p(x,y)$ with $p(x)p(y|x)$ in the second step.


0.2 Probability Distribution
Problem 2.1 Solution
Based on the definitions, we obtain:
\[ \sum_{x_i = 0,1} p(x_i) = \mu + (1 - \mu) = 1 \]
\[ E[x] = \sum_{x_i = 0,1} x_i\, p(x_i) = 0\cdot(1 - \mu) + 1\cdot\mu = \mu \]
\[
\begin{aligned}
var[x] &= \sum_{x_i = 0,1} (x_i - E[x])^2 p(x_i) \\
&= (0 - \mu)^2 (1 - \mu) + (1 - \mu)^2\cdot\mu \\
&= \mu(1 - \mu)
\end{aligned}
\]
\[ H[x] = -\sum_{x_i = 0,1} p(x_i) \ln p(x_i) = -\mu\ln\mu - (1 - \mu)\ln(1 - \mu) \]

Problem 2.2 Solution
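The procedure of Prob.2.1 can also be used here, with $p(x = -1) = \frac{1 - \mu}{2}$ and $p(x = 1) = \frac{1 + \mu}{2}$.
\[ \sum_{x_i = -1,1} p(x_i) = \frac{1 - \mu}{2} + \frac{1 + \mu}{2} = 1 \]
\[ E[x] = \sum_{x_i = -1,1} x_i\, p(x_i) = -1\cdot\frac{1 - \mu}{2} + 1\cdot\frac{1 + \mu}{2} = \mu \]
\[
\begin{aligned}
var[x] &= \sum_{x_i = -1,1} (x_i - E[x])^2 p(x_i) \\
&= (-1 - \mu)^2\cdot\frac{1 - \mu}{2} + (1 - \mu)^2\cdot\frac{1 + \mu}{2} \\
&= 1 - \mu^2
\end{aligned}
\]
\[ H[x] = -\sum_{x_i = -1,1} p(x_i) \ln p(x_i) = -\frac{1 - \mu}{2}\ln\frac{1 - \mu}{2} - \frac{1 + \mu}{2}\ln\frac{1 + \mu}{2} \]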

Problem 2.3 Solution
(2.262) is an important property of combinations, which we have used before, such as in Prob.1.15. We will use the 'old fashioned' notation $C_N^m$ to represent choosing $m$ objects from a total of $N$. With the prior knowledge:
\[ C_N^m = \frac{N!}{m!\,(N - m)!} \]
we evaluate the left side of (2.262):
\[
\begin{aligned}
C_N^m + C_N^{m-1} &= \frac{N!}{m!\,(N - m)!} + \frac{N!}{(m - 1)!\,(N - (m - 1))!} \\
&= \frac{N!}{(m - 1)!\,(N - m)!}\Big(\frac{1}{m} + \frac{1}{N - m + 1}\Big) \\
&= \frac{(N + 1)!}{m!\,(N + 1 - m)!} = C_{N+1}^m
\end{aligned}
\]
To prove (2.263), we will prove a more general form:
\[ (x + y)^N = \sum_{m=0}^{N} C_N^m x^m y^{N-m} \qquad (*) \]
If we let $y = 1$, $(*)$ reduces to (2.263). We will prove it by induction. First, it is obvious that $(*)$ holds when $N = 1$. We assume that it holds for $N$, and we will prove that it also holds for $N + 1$.
\[
\begin{aligned}
(x + y)^{N+1} &= (x + y)\sum_{m=0}^{N} C_N^m x^m y^{N-m} \\
&= x\sum_{m=0}^{N} C_N^m x^m y^{N-m} + y\sum_{m=0}^{N} C_N^m x^m y^{N-m} \\
&= \sum_{m=0}^{N} C_N^m x^{m+1} y^{N-m} + \sum_{m=0}^{N} C_N^m x^m y^{N+1-m} \\
&= \sum_{m=1}^{N+1} C_N^{m-1} x^m y^{N+1-m} + \sum_{m=0}^{N} C_N^m x^m y^{N+1-m} \\
&= \sum_{m=1}^{N} \big(C_N^{m-1} + C_N^m\big) x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\
&= \sum_{m=1}^{N} C_{N+1}^m x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\
&= \sum_{m=0}^{N+1} C_{N+1}^m x^m y^{N+1-m}
\end{aligned}
\]
So far, we have proved $(*)$. Therefore, if we let $y = 1$ in $(*)$, (2.263) has been proved. If we let $x = \mu$ and $y = 1 - \mu$, (2.264) has been proved.
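Both identities can be spot-checked with exact integer arithmetic. The snippet below is an illustrative sketch using `math.comb` from the Python standard library (Python 3.8 or later); the helper names and the tested ranges are arbitrary assumptions.

```python
from math import comb

def pascal_holds(N, m):
    """Check C(N, m) + C(N, m-1) == C(N+1, m), i.e. (2.262)."""
    return comb(N, m) + comb(N, m - 1) == comb(N + 1, m)

def binomial_expansion(x, y, N):
    """Right-hand side of (*): sum over m of C(N, m) x^m y^(N-m)."""
    return sum(comb(N, m) * x**m * y**(N - m) for m in range(N + 1))

pascal_ok = all(pascal_holds(N, m) for N in range(1, 30) for m in range(1, N + 1))
expansion = binomial_expansion(3, 5, 7)      # should equal (3 + 5)^7
norm = binomial_expansion(0.3, 0.7, 12)      # x = mu, y = 1 - mu gives 1
print(pascal_ok, expansion, norm)
```

The last line is the normalization of the binomial distribution, which is what (2.264) states.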
Problem 2.4 Solution
A solution has already been given in the problem, but we will solve it in a more intuitive way, beginning with the definition:
\[
\begin{aligned}
E[m] &= \sum_{m=0}^{N} m\, C_N^m \mu^m (1 - \mu)^{N-m} \\
&= \sum_{m=1}^{N} m\, C_N^m \mu^m (1 - \mu)^{N-m} \\
&= \sum_{m=1}^{N} \frac{N!}{(m - 1)!\,(N - m)!} \mu^m (1 - \mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} \frac{(N - 1)!}{(m - 1)!\,(N - m)!} \mu^{m-1} (1 - \mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} C_{N-1}^{m-1} \mu^{m-1} (1 - \mu)^{N-m} \\
&= N\mu \sum_{k=0}^{N-1} C_{N-1}^{k} \mu^{k} (1 - \mu)^{N-1-k} \\
&= N\mu\,[\mu + (1 - \mu)]^{N-1} = N\mu
\end{aligned}
\]
Some details should be explained here. The term with $m = 0$ does not affect the expectation, so we let the summation begin from $m = 1$ (this is what we have done from the first step to the second step). Moreover, in the second last step, we rewrite the index of the summation by letting $k = m - 1$. And in the last step, we have taken advantage of (2.264). The variance is straightforward once the expectation has been calculated.

\[
\begin{aligned}
var[m] &= E[m^2] - E[m]^2 \\
&= \sum_{m=0}^{N} m^2 C_N^m \mu^m (1 - \mu)^{N-m} - E[m]\cdot E[m] \\
&= \sum_{m=0}^{N} m^2 C_N^m \mu^m (1 - \mu)^{N-m} - (N\mu)\cdot\sum_{m=0}^{N} m\, C_N^m \mu^m (1 - \mu)^{N-m} \\
&= \sum_{m=1}^{N} m^2 C_N^m \mu^m (1 - \mu)^{N-m} - N\mu\cdot\sum_{m=1}^{N} m\, C_N^m \mu^m (1 - \mu)^{N-m} \\
&= \sum_{m=1}^{N} m\,\frac{N!}{(m - 1)!\,(N - m)!} \mu^m (1 - \mu)^{N-m} - (N\mu)\cdot\sum_{m=1}^{N} m\, C_N^m \mu^m (1 - \mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} m\,\frac{(N - 1)!}{(m - 1)!\,(N - m)!} \mu^{m-1} (1 - \mu)^{N-m} - N\mu\cdot\sum_{m=1}^{N} m\, C_N^m \mu^m (1 - \mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1 - \mu)^{N-m} \big(C_{N-1}^{m-1} - \mu\, C_N^m\big)
\end{aligned}
\]
Here we will use a little trick, $-\mu = -1 + (1 - \mu)$, and then take advantage of the property $C_N^m = C_{N-1}^m + C_{N-1}^{m-1}$.
\[
\begin{aligned}
var[m] &= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1 - \mu)^{N-m} \big[C_{N-1}^{m-1} - C_N^m + (1 - \mu) C_N^m\big] \\
&= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1 - \mu)^{N-m} \big[(1 - \mu) C_N^m + C_{N-1}^{m-1} - C_N^m\big] \\
&= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1 - \mu)^{N-m} \big[(1 - \mu) C_N^m - C_{N-1}^m\big] \\
&= N\mu \Big\{\sum_{m=1}^{N} m\, \mu^{m-1} (1 - \mu)^{N-m+1} C_N^m - \sum_{m=1}^{N} m\, \mu^{m-1} (1 - \mu)^{N-m} C_{N-1}^m\Big\} \\
&= N\mu \Big\{N(1 - \mu)[\mu + (1 - \mu)]^{N-1} - (N - 1)(1 - \mu)[\mu + (1 - \mu)]^{N-2}\Big\} \\
&= N\mu \big\{N(1 - \mu) - (N - 1)(1 - \mu)\big\} = N\mu(1 - \mu)
\end{aligned}
\]
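The two results, E[m] = Nµ and var[m] = Nµ(1 − µ), can be confirmed by summing over the binomial probabilities directly. This is a small sketch with an assumed example (N = 20, µ = 0.3); the function name is hypothetical.

```python
from math import comb

def binomial_moments(N, mu):
    """Mean and variance of the binomial distribution by direct summation."""
    pmf = [comb(N, m) * mu**m * (1 - mu)**(N - m) for m in range(N + 1)]
    mean = sum(m * p for m, p in enumerate(pmf))
    second_moment = sum(m * m * p for m, p in enumerate(pmf))
    return mean, second_moment - mean**2

mean, var = binomial_moments(20, 0.3)
print(mean, var)   # compare with N*mu = 6 and N*mu*(1-mu) = 4.2
```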

Problem 2.5 Solution
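Hints have already been given in the description. Let's make a little improvement by introducing $t = y + x$ and $x = t\mu$ at the same time, i.e. we make the following change of variables:
\[
\begin{cases} t = x + y \\ \mu = \dfrac{x}{x + y} \end{cases}
\qquad\text{and}\qquad
\begin{cases} x = t\mu \\ y = t(1 - \mu) \end{cases}
\]
Note that $t \in [0, +\infty)$ and $\mu \in (0, 1)$, and that when we change variables in an integral, we have to introduce an additional factor, the Jacobian determinant:
\[
\frac{\partial(x, y)}{\partial(\mu, t)} =
\begin{vmatrix} \dfrac{\partial x}{\partial \mu} & \dfrac{\partial x}{\partial t} \\[2mm] \dfrac{\partial y}{\partial \mu} & \dfrac{\partial y}{\partial t} \end{vmatrix}
= \begin{vmatrix} t & \mu \\ -t & 1 - \mu \end{vmatrix} = t
\]
Now we can calculate the integral.
\[
\begin{aligned}
\Gamma(a)\Gamma(b) &= \int_0^{+\infty} \exp(-x)\, x^{a-1}\, dx \int_0^{+\infty} \exp(-y)\, y^{b-1}\, dy \\
&= \int_0^{+\infty}\int_0^{+\infty} \exp(-x)\, x^{a-1} \exp(-y)\, y^{b-1}\, dy\, dx \\
&= \int_0^{+\infty}\int_0^{+\infty} \exp(-x - y)\, x^{a-1} y^{b-1}\, dy\, dx \\
&= \int_0^{1}\int_0^{+\infty} \exp(-t)\, (t\mu)^{a-1} (t(1 - \mu))^{b-1}\, t\, dt\, d\mu \\
&= \int_0^{+\infty} \exp(-t)\, t^{a+b-1}\, dt \cdot \int_0^{1} \mu^{a-1} (1 - \mu)^{b-1}\, d\mu \\
&= \Gamma(a + b) \cdot \int_0^{1} \mu^{a-1} (1 - \mu)^{b-1}\, d\mu
\end{aligned}
\]
Therefore, we have obtained:
\[ \int_0^{1} \mu^{a-1} (1 - \mu)^{b-1}\, d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a + b)} \]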
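A quick numerical comparison of both sides is sketched below. The parameter values a = 2.5 and b = 4 and the helper names are assumptions chosen for illustration, and the midpoint rule is just one convenient way to approximate the integral.

```python
import math

def beta_integral(a, b, n=200_000):
    """Midpoint rule for the integral of mu^(a-1) (1-mu)^(b-1) over (0, 1)."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        mu = (i + 0.5) * h
        total += mu ** (a - 1) * (1 - mu) ** (b - 1)
    return total * h

def beta_from_gamma(a, b):
    """Right-hand side: Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

numeric = beta_integral(2.5, 4.0)
closed = beta_from_gamma(2.5, 4.0)
print(numeric, closed)
```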

Problem 2.6 Solution
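We will solve this problem based on the definitions.
\[
\begin{aligned}
E[\mu] &= \int_0^1 \mu\, Beta(\mu|a, b)\, d\mu \\
&= \int_0^1 \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a} (1 - \mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a + b)\Gamma(a + 1)}{\Gamma(a + 1 + b)\Gamma(a)} \int_0^1 \frac{\Gamma(a + 1 + b)}{\Gamma(a + 1)\Gamma(b)} \mu^{a} (1 - \mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a + b)\Gamma(a + 1)}{\Gamma(a + 1 + b)\Gamma(a)} \int_0^1 Beta(\mu|a + 1, b)\, d\mu \\
&= \frac{\Gamma(a + b)}{\Gamma(a + 1 + b)} \cdot \frac{\Gamma(a + 1)}{\Gamma(a)} \\
&= \frac{a}{a + b}
\end{aligned}
\]
where we have taken advantage of the property $\Gamma(z + 1) = z\Gamma(z)$. The variance is quite similar. We first evaluate $E[\mu^2]$.
\[
\begin{aligned}
E[\mu^2] &= \int_0^1 \mu^2\, Beta(\mu|a, b)\, d\mu \\
&= \int_0^1 \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \mu^{a+1} (1 - \mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a + b)\Gamma(a + 2)}{\Gamma(a + 2 + b)\Gamma(a)} \int_0^1 \frac{\Gamma(a + 2 + b)}{\Gamma(a + 2)\Gamma(b)} \mu^{a+1} (1 - \mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a + b)\Gamma(a + 2)}{\Gamma(a + 2 + b)\Gamma(a)} \int_0^1 Beta(\mu|a + 2, b)\, d\mu \\
&= \frac{\Gamma(a + b)}{\Gamma(a + 2 + b)} \cdot \frac{\Gamma(a + 2)}{\Gamma(a)} \\
&= \frac{a(a + 1)}{(a + b)(a + b + 1)}
\end{aligned}
\]
Then we use the formula $var[\mu] = E[\mu^2] - E[\mu]^2$.
\[
\begin{aligned}
var[\mu] &= \frac{a(a + 1)}{(a + b)(a + b + 1)} - \Big(\frac{a}{a + b}\Big)^2 \\
&= \frac{ab}{(a + b)^2 (a + b + 1)}
\end{aligned}
\]
Problem 2.7 Solution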
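The maximum likelihood estimate for $\mu$, i.e. (2.8), can be written as:
\[ \mu_{ML} = \frac{m}{m + l} \]
where $m$ represents how many times we observe 'head' and $l$ represents how many times we observe 'tail'. The prior mean of $\mu$ is given by (2.15), and the posterior mean of $\mu$ is given by (2.20). Therefore, we will prove that $(m + a)/(m + a + l + b)$ lies between $m/(m + l)$ and $a/(a + b)$. Given the fact that:
\[ \lambda\,\frac{a}{a + b} + (1 - \lambda)\,\frac{m}{m + l} = \frac{m + a}{m + a + l + b}, \qquad \text{where } \lambda = \frac{a + b}{m + l + a + b} \]
and that $0 < \lambda < 1$, the posterior mean is a convex combination of the prior mean and the maximum likelihood estimate, so it lies between them, and we have solved the problem. Note: you can also solve it in a simpler way by proving that:
\[ \Big(\frac{m + a}{m + a + l + b} - \frac{a}{a + b}\Big)\cdot\Big(\frac{m + a}{m + a + l + b} - \frac{m}{m + l}\Big) \le 0 \]
The expression above can be proved by reducing the fractions to a common denominator.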
Problem 2.8 Solution
We solve it based on the definitions.
\[
\begin{aligned}
E_y[E_x[x|y]] &= \int E_x[x|y]\, p(y)\, dy \\
&= \int \Big(\int x\, p(x|y)\, dx\Big) p(y)\, dy \\
&= \iint x\, p(x|y)\, p(y)\, dx\, dy \\
&= \iint x\, p(x, y)\, dx\, dy \\
&= \int x\, p(x)\, dx = E[x]
\end{aligned}
\]
(2.271) is complicated and we will calculate every term separately.
\[
\begin{aligned}
E_y[var_x[x|y]] &= \int var_x[x|y]\, p(y)\, dy \\
&= \int \Big(\int (x - E_x[x|y])^2 p(x|y)\, dx\Big) p(y)\, dy \\
&= \iint (x - E_x[x|y])^2 p(x, y)\, dx\, dy \\
&= \iint \big(x^2 - 2x E_x[x|y] + E_x[x|y]^2\big) p(x, y)\, dx\, dy \\
&= \int x^2 p(x)\, dx - \iint 2x E_x[x|y]\, p(x, y)\, dx\, dy + \int E_x[x|y]^2 p(y)\, dy
\end{aligned}
\]
We further simplify the second term in the equation above:
\[
\begin{aligned}
\iint 2x E_x[x|y]\, p(x, y)\, dx\, dy &= 2\int E_x[x|y] \Big(\int x\, p(x, y)\, dx\Big)\, dy \\
&= 2\int E_x[x|y]\, p(y) \Big(\int x\, p(x|y)\, dx\Big)\, dy \\
&= 2\int E_x[x|y]^2 p(y)\, dy
\end{aligned}
\]
Therefore, we obtain a simple expression for the first term on the right side of (2.271):
\[ E_y[var_x[x|y]] = \int x^2 p(x)\, dx - \int E_x[x|y]^2 p(y)\, dy \qquad (*) \]
Then we proceed to the second term.
\[
\begin{aligned}
var_y[E_x[x|y]] &= \int \big(E_x[x|y] - E_y[E_x[x|y]]\big)^2 p(y)\, dy \\
&= \int \big(E_x[x|y] - E[x]\big)^2 p(y)\, dy \\
&= \int E_x[x|y]^2 p(y)\, dy - 2\int E[x] E_x[x|y]\, p(y)\, dy + \int E[x]^2 p(y)\, dy \\
&= \int E_x[x|y]^2 p(y)\, dy - 2E[x]\int E_x[x|y]\, p(y)\, dy + E[x]^2
\end{aligned}
\]
Following the same procedure, we deal with the second term of the equation above.
\[ 2E[x]\cdot\int E_x[x|y]\, p(y)\, dy = 2E[x]\cdot E_y[E_x[x|y]] = 2E[x]^2 \]
Therefore, we obtain a simple expression for the second term on the right side of (2.271):
\[ var_y[E_x[x|y]] = \int E_x[x|y]^2 p(y)\, dy - E[x]^2 \qquad (**) \]
Finally, we add $(*)$ and $(**)$, and we obtain:
\[ E_y[var_x[x|y]] + var_y[E_x[x|y]] = E[x^2] - E[x]^2 = var[x] \]
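Both identities can be verified on a small discrete example, where the integrals become sums. The joint table below is a hypothetical illustration (any non-negative table summing to 1 would do), and the variable names are assumptions.

```python
# Hypothetical joint table p(x, y): rows index the x values, columns index y.
xs = [0.0, 1.0, 3.0]
joint = [[0.10, 0.20],
         [0.25, 0.05],
         [0.15, 0.25]]

px = [sum(row) for row in joint]
py = [sum(joint[i][j] for i in range(3)) for j in range(2)]

def cond_mean(j):
    """E_x[x | y = y_j]."""
    return sum(xs[i] * joint[i][j] for i in range(3)) / py[j]

def cond_var(j):
    """var_x[x | y = y_j]."""
    m = cond_mean(j)
    return sum((xs[i] - m) ** 2 * joint[i][j] for i in range(3)) / py[j]

mean_x = sum(x * p for x, p in zip(xs, px))
var_x = sum((x - mean_x) ** 2 * p for x, p in zip(xs, px))

tower = sum(cond_mean(j) * py[j] for j in range(2))                          # E_y[E_x[x|y]]
expected_cond_var = sum(cond_var(j) * py[j] for j in range(2))               # E_y[var_x[x|y]]
var_cond_mean = sum((cond_mean(j) - mean_x) ** 2 * py[j] for j in range(2))  # var_y[E_x[x|y]]

print(tower, mean_x)
print(expected_cond_var + var_cond_mean, var_x)
```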

Problem 2.9 Solution
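This problem is complex, but hints have already been given in the description. Let's begin by integrating (2.272) over $\mu_{M-1}$. (Note: by integrating out $\mu_{M-1}$, we actually obtain a Dirichlet distribution with $M - 1$ variables.)
\[
\begin{aligned}
p_{M-1}(\mu_1, \mu_2, \dots, \mu_{M-2}) &= \int_0^{1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}} C_M \prod_{k=1}^{M-1} \mu_k^{\alpha_k - 1} \Big(1 - \sum_{j=1}^{M-1} \mu_j\Big)^{\alpha_M - 1} d\mu_{M-1} \\
&= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \int_0^{1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}} \mu_{M-1}^{\alpha_{M-1} - 1} \Big(1 - \sum_{j=1}^{M-1} \mu_j\Big)^{\alpha_M - 1} d\mu_{M-1}
\end{aligned}
\]
We change variable by:
\[ t = \frac{\mu_{M-1}}{1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}} \]
The reason we do so is that $\mu_{M-1} \in [0,\ 1 - \mu_1 - \mu_2 - \dots - \mu_{M-2}]$, so after this change of variable we have $t \in [0, 1]$. Writing $S = 1 - \sum_{j=1}^{M-2} \mu_j$, we have $\mu_{M-1} = tS$, $1 - \sum_{j=1}^{M-1} \mu_j = S(1 - t)$ and $d\mu_{M-1} = S\, dt$. Then we can further simplify the expression.
\[
\begin{aligned}
p_{M-1} &= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \Big(1 - \sum_{j=1}^{M-2} \mu_j\Big)^{\alpha_{M-1} + \alpha_M - 1} \int_0^1 \frac{\mu_{M-1}^{\alpha_{M-1} - 1} \big(1 - \sum_{j=1}^{M-1} \mu_j\big)^{\alpha_M - 1}}{(1 - \mu_1 - \mu_2 - \dots - \mu_{M-2})^{\alpha_{M-1} + \alpha_M - 2}}\, dt \\
&= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \Big(1 - \sum_{j=1}^{M-2} \mu_j\Big)^{\alpha_{M-1} + \alpha_M - 1} \int_0^1 t^{\alpha_{M-1} - 1} (1 - t)^{\alpha_M - 1}\, dt \\
&= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1} \Big(1 - \sum_{j=1}^{M-2} \mu_j\Big)^{\alpha_{M-1} + \alpha_M - 1} \frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)}
\end{aligned}
\]
Comparing the expression above with a normalized Dirichlet distribution with $M - 1$ variables, and supposing that (2.272) holds for $M - 1$, we obtain:
\[ C_M\, \frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)} = \frac{\Gamma(\alpha_1 + \alpha_2 + \dots + \alpha_M)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_{M-2})\Gamma(\alpha_{M-1} + \alpha_M)} \]
Therefore, we obtain
\[ C_M = \frac{\Gamma(\alpha_1 + \alpha_2 + \dots + \alpha_M)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_{M-1})\Gamma(\alpha_M)} \]
as required.
Problem 2.10 Solution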
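Based on the definition of expectation and (2.38), we can write:
\[
\begin{aligned}
E[\mu_j] &= \int \mu_j\, Dir(\mu|\alpha)\, d\mu \\
&= \int \mu_j\, \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_K)} \int \mu_j \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_K)} \cdot \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_{j-1})\Gamma(\alpha_j + 1)\Gamma(\alpha_{j+1})\dots\Gamma(\alpha_K)}{\Gamma(\alpha_0 + 1)} \\
&= \frac{\Gamma(\alpha_0)\Gamma(\alpha_j + 1)}{\Gamma(\alpha_j)\Gamma(\alpha_0 + 1)} = \frac{\alpha_j}{\alpha_0}
\end{aligned}
\]
It is quite the same for the variance. Let's begin by calculating $E[\mu_j^2]$.
\[
\begin{aligned}
E[\mu_j^2] &= \int \mu_j^2\, Dir(\mu|\alpha)\, d\mu \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_K)} \int \mu_j^2 \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\mu \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_K)} \cdot \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_{j-1})\Gamma(\alpha_j + 2)\Gamma(\alpha_{j+1})\dots\Gamma(\alpha_K)}{\Gamma(\alpha_0 + 2)} \\
&= \frac{\Gamma(\alpha_0)\Gamma(\alpha_j + 2)}{\Gamma(\alpha_j)\Gamma(\alpha_0 + 2)} = \frac{\alpha_j(\alpha_j + 1)}{\alpha_0(\alpha_0 + 1)}
\end{aligned}
\]
Hence, we obtain:
\[ var[\mu_j] = E[\mu_j^2] - E[\mu_j]^2 = \frac{\alpha_j(\alpha_j + 1)}{\alpha_0(\alpha_0 + 1)} - \Big(\frac{\alpha_j}{\alpha_0}\Big)^2 = \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)} \]
It is the same for the covariance.
\[
\begin{aligned}
cov[\mu_j \mu_l] &= \int (\mu_j - E[\mu_j])(\mu_l - E[\mu_l])\, Dir(\mu|\alpha)\, d\mu \\
&= \int \big(\mu_j\mu_l - E[\mu_j]\mu_l - E[\mu_l]\mu_j + E[\mu_j]E[\mu_l]\big)\, Dir(\mu|\alpha)\, d\mu \\
&= \frac{\Gamma(\alpha_0)\Gamma(\alpha_j + 1)\Gamma(\alpha_l + 1)}{\Gamma(\alpha_j)\Gamma(\alpha_l)\Gamma(\alpha_0 + 2)} - 2E[\mu_j]E[\mu_l] + E[\mu_j]E[\mu_l] \\
&= \frac{\alpha_j\alpha_l}{\alpha_0(\alpha_0 + 1)} - E[\mu_j]E[\mu_l] \\
&= \frac{\alpha_j\alpha_l}{\alpha_0(\alpha_0 + 1)} - \frac{\alpha_j\alpha_l}{\alpha_0^2} \\
&= -\frac{\alpha_j\alpha_l}{\alpha_0^2(\alpha_0 + 1)} \qquad (j \ne l)
\end{aligned}
\]
Note: when $j = l$, $cov[\mu_j \mu_l]$ actually reduces to $var[\mu_j]$. However, we cannot simply replace $l$ with $j$ in the expression for $cov[\mu_j \mu_l]$ to get the right result, because $\int \mu_j\mu_l\, Dir(\mu|\alpha)\, d\mu$ reduces to $\int \mu_j^2\, Dir(\mu|\alpha)\, d\mu$ in this case.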
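The three closed forms follow from one general moment formula, obtained exactly as in the derivation above: raising some of the exponents turns the integral into the normalization constant of another Dirichlet. The sketch below evaluates that ratio of gamma functions with `math.lgamma`; the helper name and the example parameters are assumptions.

```python
import math

def dirichlet_moment(alpha, powers):
    """E[prod_k mu_k^(n_k)] as a ratio of Dirichlet normalization constants."""
    a0 = sum(alpha)
    n = sum(powers)
    log_val = math.lgamma(a0) - math.lgamma(a0 + n)
    for a, p in zip(alpha, powers):
        log_val += math.lgamma(a + p) - math.lgamma(a)
    return math.exp(log_val)

alpha = [2.0, 3.5, 1.5]          # illustrative parameters, alpha_0 = 7
a0 = sum(alpha)
aj, al = alpha[0], alpha[2]

mean_j = dirichlet_moment(alpha, [1, 0, 0])
mean_l = dirichlet_moment(alpha, [0, 0, 1])
var_j = dirichlet_moment(alpha, [2, 0, 0]) - mean_j**2
cov_jl = dirichlet_moment(alpha, [1, 0, 1]) - mean_j * mean_l

print(mean_j, aj / a0)
print(var_j, aj * (a0 - aj) / (a0**2 * (a0 + 1)))
print(cov_jl, -aj * al / (a0**2 * (a0 + 1)))
```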
Problem 2.11 Solution
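Based on the definition of expectation and (2.38), we first denote:
\[ \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\dots\Gamma(\alpha_K)} = K(\alpha) \]
Then we can write:
\[
\begin{aligned}
\frac{\partial Dir(\mu|\alpha)}{\partial \alpha_j} &= \frac{\partial}{\partial \alpha_j}\Big(K(\alpha) \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}\Big) \\
&= \frac{\partial K(\alpha)}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} + K(\alpha)\, \frac{\partial \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}}{\partial \alpha_j} \\
&= \frac{\partial K(\alpha)}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} + \ln\mu_j \cdot Dir(\mu|\alpha)
\end{aligned}
\]
Then let us integrate both sides:
\[ \int \frac{\partial Dir(\mu|\alpha)}{\partial \alpha_j}\, d\mu = \int \frac{\partial K(\alpha)}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}\, d\mu + \int \ln\mu_j \cdot Dir(\mu|\alpha)\, d\mu \]
The left side can be further simplified as:
\[ \text{left side} = \frac{\partial \int Dir(\mu|\alpha)\, d\mu}{\partial \alpha_j} = \frac{\partial 1}{\partial \alpha_j} = 0 \]
The right side can be further simplified as:
\[
\begin{aligned}
\text{right side} &= \frac{\partial K(\alpha)}{\partial \alpha_j} \int \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}\, d\mu + E[\ln\mu_j] \\
&= \frac{\partial K(\alpha)}{\partial \alpha_j}\, \frac{1}{K(\alpha)} + E[\ln\mu_j] \\
&= \frac{\partial \ln K(\alpha)}{\partial \alpha_j} + E[\ln\mu_j]
\end{aligned}
\]
Therefore, we obtain:
\[
\begin{aligned}
E[\ln\mu_j] &= -\frac{\partial \ln K(\alpha)}{\partial \alpha_j} \\
&= -\frac{\partial \big\{\ln\Gamma(\alpha_0) - \sum_{i=1}^{K} \ln\Gamma(\alpha_i)\big\}}{\partial \alpha_j} \\
&= \frac{\partial \ln\Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln\Gamma(\alpha_0)}{\partial \alpha_j} \\
&= \frac{\partial \ln\Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln\Gamma(\alpha_0)}{\partial \alpha_0}\, \frac{\partial \alpha_0}{\partial \alpha_j} \\
&= \frac{\partial \ln\Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln\Gamma(\alpha_0)}{\partial \alpha_0} \\
&= \psi(\alpha_j) - \psi(\alpha_0)
\end{aligned}
\]
where we have used $\partial\alpha_0 / \partial\alpha_j = 1$, since $\alpha_0 = \sum_{k} \alpha_k$. Therefore, the problem has been solved.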
Problem 2.12 Solution
Since we have:
\[ \int_a^b \frac{1}{b - a}\, dx = 1 \]
it is straightforward that the distribution is normalized. Then we calculate its mean:
\[ E[x] = \int_a^b x\, \frac{1}{b - a}\, dx = \frac{x^2}{2(b - a)}\Big|_a^b = \frac{a + b}{2} \]
Then we calculate its variance.
\[ var[x] = E[x^2] - E[x]^2 = \int_a^b \frac{x^2}{b - a}\, dx - \Big(\frac{a + b}{2}\Big)^2 = \frac{x^3}{3(b - a)}\Big|_a^b - \Big(\frac{a + b}{2}\Big)^2 \]
Hence we obtain:
\[ var[x] = \frac{(b - a)^2}{12} \]

Problem 2.13 Solution
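This problem is an extension of Prob.1.30. We can follow the same procedure to solve it. Let's begin by calculating $\ln\frac{p(x)}{q(x)}$, where $p(x) = \mathcal{N}(x|\mu, \Sigma)$ and $q(x) = \mathcal{N}(x|m, L)$:
\[ \ln\Big(\frac{p(x)}{q(x)}\Big) = \frac{1}{2}\ln\Big(\frac{|L|}{|\Sigma|}\Big) + \frac{1}{2}(x - m)^T L^{-1} (x - m) - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \]
If $x \sim p(x) = \mathcal{N}(x|\mu, \Sigma)$, we then take advantage of the following properties.
\[ \int p(x)\, dx = 1 \]
\[ E[x] = \int x\, p(x)\, dx = \mu \]
\[ E[(x - a)^T A (x - a)] = tr(A\Sigma) + (\mu - a)^T A (\mu - a) \]
We obtain:
\[
\begin{aligned}
KL &= \int \Big\{\frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) + \frac{1}{2}(x - m)^T L^{-1} (x - m)\Big\}\, p(x)\, dx \\
&= \frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2} E[(x - \mu)^T \Sigma^{-1} (x - \mu)] + \frac{1}{2} E[(x - m)^T L^{-1} (x - m)] \\
&= \frac{1}{2}\ln\frac{|L|}{|\Sigma|} - \frac{1}{2} tr\{I_D\} + \frac{1}{2}(\mu - m)^T L^{-1} (\mu - m) + \frac{1}{2} tr\{L^{-1}\Sigma\} \\
&= \frac{1}{2}\Big[\ln\frac{|L|}{|\Sigma|} - D + tr\{L^{-1}\Sigma\} + (m - \mu)^T L^{-1} (m - \mu)\Big]
\end{aligned}
\]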
|Σ|
Problem 2.14 Solution
The hint given in the problem is straightforward, but the calculation is a little difficult, so here we use a simpler method that takes advantage of the property of the Kullback-Leibler distance. Let $g(x)$ be a Gaussian PDF with mean $\mu$ and covariance $\Sigma$, and $f(x)$ an arbitrary PDF with the same mean and covariance.
\[ 0 \le KL(f\|g) = -\int f(x) \ln\Big\{\frac{g(x)}{f(x)}\Big\}\, dx = -H(f) - \int f(x) \ln g(x)\, dx \qquad (*) \]
Let's calculate the second term of the equation above.
\[
\begin{aligned}
\int f(x) \ln g(x)\, dx &= \int f(x) \ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big]\Big\}\, dx \\
&= \int f(x) \ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\Big\}\, dx + \int f(x) \Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big]\, dx \\
&= \ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\Big\} - \frac{1}{2} E\big[(x - \mu)^T \Sigma^{-1} (x - \mu)\big] \\
&= \ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\Big\} - \frac{1}{2} tr\{I_D\} \\
&= -\Big\{\frac{1}{2}\ln|\Sigma| + \frac{D}{2}(1 + \ln(2\pi))\Big\} \\
&= -H(g)
\end{aligned}
\]
We take advantage of two properties of the PDF $f(x)$, with mean $\mu$ and covariance $\Sigma$, as listed below. What's more, we also use the result of Prob.2.15, which we will prove later.
\[ \int f(x)\, dx = 1 \]
\[ E[(x - a)^T A (x - a)] = tr(A\Sigma) + (\mu - a)^T A (\mu - a) \]
Now we can further simplify $(*)$ to obtain:
\[ H(g) \ge H(f) \]
In other words, we have proved that the entropy of an arbitrary PDF $f(x)$ with the same mean and covariance as a Gaussian PDF $g(x)$ cannot be greater than the entropy of that Gaussian PDF.
Problem 2.15 Solution
We have already used the result of this problem to solve Prob.2.14, and now we will prove it. Suppose $x \sim p(x) = \mathcal{N}(x|\mu, \Sigma)$:
\[
\begin{aligned}
H[x] &= -\int p(x) \ln p(x)\, dx \\
&= -\int p(x) \ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}} \exp\Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big]\Big\}\, dx \\
&= -\int p(x) \ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\Big\}\, dx - \int p(x) \Big[-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\Big]\, dx \\
&= -\ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\Big\} + \frac{1}{2} E\big[(x - \mu)^T \Sigma^{-1} (x - \mu)\big] \\
&= -\ln\Big\{\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}\Big\} + \frac{1}{2} tr\{I_D\} \\
&= \frac{1}{2}\ln|\Sigma| + \frac{D}{2}(1 + \ln(2\pi))
\end{aligned}
\]
where we have taken advantage of:
\[ \int p(x)\, dx = 1 \]
\[ E[(x - a)^T A (x - a)] = tr(A\Sigma) + (\mu - a)^T A (\mu - a) \]
Note: we have actually already solved this problem in Prob.2.14. You can see this by replacing the integrand $f(x) \ln g(x)$ with $g(x) \ln g(x)$; the same procedure as in Prob.2.14 still holds for calculating $\int g(x) \ln g(x)\, dx$.
Problem 2.16 Solution
Let us consider a more general conclusion about the Probability Density Function (PDF) of the sum of two independent random variables. We denote two random variables $X$ and $Y$. Their sum $Z = X + Y$ is still a random variable. We also denote $f(\cdot)$ as a PDF, and $F(\cdot)$ as a Cumulative Distribution Function (CDF). We can obtain:
\[ F_Z(z) = P(Z \le z) = \iint_{x + y \le z} f_{X,Y}(x, y)\, dx\, dy \]
where $z$ represents an arbitrary real number. We rewrite the double integral as an iterated integral:
\[ F_Z(z) = \int_{-\infty}^{+\infty} \Big[\int_{-\infty}^{z - y} f_{X,Y}(x, y)\, dx\Big]\, dy \]
We fix $z$ and $y$, and then make the change of variable $x = u - y$ in the inner integral.
\[ F_Z(z) = \int_{-\infty}^{+\infty} \Big[\int_{-\infty}^{z - y} f_{X,Y}(x, y)\, dx\Big]\, dy = \int_{-\infty}^{+\infty} \Big[\int_{-\infty}^{z} f_{X,Y}(u - y, y)\, du\Big]\, dy \]
Note that $f_{X,Y}(\cdot)$ is the joint PDF of $X$ and $Y$. Exchanging the order of integration, we obtain:
\[ F_Z(z) = \int_{-\infty}^{z} \Big[\int_{-\infty}^{+\infty} f_{X,Y}(u - y, y)\, dy\Big]\, du \]
Compare the equation above with the definition of the CDF:
\[ F_Z(z) = \int_{-\infty}^{z} f_Z(u)\, du \]
We can obtain:
\[ f_Z(u) = \int_{-\infty}^{+\infty} f_{X,Y}(u - y, y)\, dy \]
And if $X$ and $Y$ are independent, which means $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, we can simplify $f_Z$:
\[ f_Z(u) = \int_{-\infty}^{+\infty} f_X(u - y) f_Y(y)\, dy, \qquad \text{i.e. } f_Z = f_X * f_Y \]
So far we have proved that the PDF of the sum of two independent random variables is the convolution of their PDFs. Hence it is straightforward to see that in this problem, where the random variable $x$ is the sum of the random variables $x_1$ and $x_2$, the PDF of $x$ is the convolution of the PDFs of $x_1$ and $x_2$. To find the entropy of $x$, we will use a simple method, taking advantage of (2.113)-(2.117). With the knowledge:
\[ p(x_2) = \mathcal{N}(x_2|\mu_2, \tau_2^{-1}) \]
\[ p(x|x_2) = \mathcal{N}(x|\mu_1 + x_2, \tau_1^{-1}) \]
we make the analogies: $x_2$ in this problem corresponds to $x$ in (2.113), and $x$ in this problem corresponds to $y$ in (2.114). Hence by using (2.115), we find that $p(x)$ is still a normal distribution, and since the entropy of a Gaussian is fully determined by its variance, there is no need to calculate the mean. Still by using (2.115), the variance of $x$ is $\tau_1^{-1} + \tau_2^{-1}$, which finally gives its entropy:
\[ H[x] = \frac{1}{2}\Big[1 + \ln 2\pi(\tau_1^{-1} + \tau_2^{-1})\Big] \]
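The convolution result can be illustrated on a grid: convolving two sampled Gaussian densities should give a density whose variance is the sum of the two variances and whose entropy matches the formula above. The sketch assumes NumPy is available; the means, variances, grid range and spacing are arbitrary illustrative choices.

```python
import numpy as np

def gauss_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

h = 0.01
grid = np.arange(-20.0, 20.0 + h / 2, h)
f1 = gauss_pdf(grid, 1.0, 0.5)     # x1 with mean 1 and variance 1/tau1 = 0.5
f2 = gauss_pdf(grid, -2.0, 1.5)    # x2 with mean -2 and variance 1/tau2 = 1.5

# Discrete approximation of f_Z(u) = integral of f_X(u - y) f_Y(y) dy.
f_sum = np.convolve(f1, f2) * h
z = 2.0 * grid[0] + h * np.arange(f_sum.size)   # grid on which f_sum lives

mass = f_sum.sum() * h
mean = (z * f_sum).sum() * h
var = ((z - mean) ** 2 * f_sum).sum() * h

mask = f_sum > 0
entropy_numeric = -(f_sum[mask] * np.log(f_sum[mask])).sum() * h
entropy_closed = 0.5 * (1.0 + np.log(2.0 * np.pi * (0.5 + 1.5)))
print(mass, mean, var, entropy_numeric, entropy_closed)
```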

Problem 2.17 Solution
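This is an extension of Prob.1.14. The same procedure can be used here. We suppose an arbitrary precision matrix $\Lambda$ can be written as $\Lambda^S + \Lambda^A$, where they satisfy:
\[ \Lambda^S_{ij} = \frac{\Lambda_{ij} + \Lambda_{ji}}{2}, \qquad \Lambda^A_{ij} = \frac{\Lambda_{ij} - \Lambda_{ji}}{2} \]
Hence it is straightforward that $\Lambda^S_{ij} = \Lambda^S_{ji}$ and $\Lambda^A_{ij} = -\Lambda^A_{ji}$. If we expand the quadratic form in the exponent, we obtain:
\[ (x - \mu)^T \Lambda (x - \mu) = \sum_{i=1}^{D}\sum_{j=1}^{D} (x_i - \mu_i)\Lambda_{ij}(x_j - \mu_j) \qquad (*) \]
It is straightforward then:
\[
\begin{aligned}
(*) &= \sum_{i=1}^{D}\sum_{j=1}^{D} (x_i - \mu_i)\Lambda^S_{ij}(x_j - \mu_j) + \sum_{i=1}^{D}\sum_{j=1}^{D} (x_i - \mu_i)\Lambda^A_{ij}(x_j - \mu_j) \\
&= \sum_{i=1}^{D}\sum_{j=1}^{D} (x_i - \mu_i)\Lambda^S_{ij}(x_j - \mu_j)
\end{aligned}
\]
The second sum vanishes because exchanging the indices $i$ and $j$ leaves the sum unchanged but flips the sign of $\Lambda^A_{ij}$, so the sum equals its own negative. Therefore, we can assume that the precision matrix is symmetric, and so is the covariance matrix.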
Problem 2.18 Solution
We will just follow the hint given in the problem. Firstly, we take the complex conjugate of both sides of (2.45):
\[ \Sigma u_i = \lambda_i u_i \quad\Rightarrow\quad \Sigma \bar{u}_i = \bar{\lambda}_i \bar{u}_i \]
where we have taken advantage of the fact that $\Sigma$ is a real matrix, i.e. $\bar{\Sigma} = \Sigma$. Then, using that $\Sigma$ is symmetric, i.e. $\Sigma^T = \Sigma$:
\[ \bar{u}_i^T \Sigma u_i = \bar{u}_i^T (\Sigma u_i) = \bar{u}_i^T (\lambda_i u_i) = \lambda_i \bar{u}_i^T u_i \]
\[ \bar{u}_i^T \Sigma u_i = (\Sigma \bar{u}_i)^T u_i = (\bar{\lambda}_i \bar{u}_i)^T u_i = \bar{\lambda}_i \bar{u}_i^T u_i \]
Since $u_i \ne 0$, we have $\bar{u}_i^T u_i \ne 0$. Thus $\bar{\lambda}_i = \lambda_i$, which means $\lambda_i$ is real. Next we will prove that two eigenvectors corresponding to different eigenvalues are orthogonal.
\[ \lambda_i \langle u_i, u_j \rangle = \langle \lambda_i u_i, u_j \rangle = \langle \Sigma u_i, u_j \rangle = \langle u_i, \Sigma^T u_j \rangle = \lambda_j \langle u_i, u_j \rangle \]
where we have taken advantage of $\Sigma^T = \Sigma$ and the fact that for an arbitrary real matrix $A$ and vectors $x$, $y$, we have:
\[ \langle Ax, y \rangle = \langle x, A^T y \rangle \]
Provided $\lambda_i \ne \lambda_j$, we have $\langle u_i, u_j \rangle = 0$, i.e. $u_i$ and $u_j$ are orthogonal. And then if we normalize every eigenvector so that its Euclidean norm equals 1, (2.46) is straightforward. By normalization, I mean multiplying the eigenvector by a real number $a$ so that its Euclidean norm (length) equals 1. The corresponding eigenvalue is unchanged, since $\Sigma(a u_i) = \lambda_i (a u_i)$.
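These properties are easy to observe numerically. The sketch below assumes NumPy is available and builds an arbitrary real symmetric matrix for illustration; `np.linalg.eigh` is NumPy's eigen-solver for symmetric (Hermitian) matrices and returns the eigenvectors as the columns of its second output.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4))
Sigma = B @ B.T + 4.0 * np.eye(4)      # real, symmetric (and positive definite)

lam, U = np.linalg.eigh(Sigma)         # eigenvalues and orthonormal eigenvectors

real_eigenvalues = np.isrealobj(lam)
orthonormal = np.allclose(U.T @ U, np.eye(4))                 # relation (2.46)
reconstructed = np.allclose(U @ np.diag(lam) @ U.T, Sigma)    # Sigma = U Lambda U^T

# Rescaling an eigenvector leaves its eigenvalue unchanged.
rescaled = 3.0 * U[:, 0]
same_eigenvalue = np.allclose(Sigma @ rescaled, lam[0] * rescaled)

print(real_eigenvalues, orthonormal, reconstructed, same_eigenvalue)
```

The reconstruction check corresponds to the expansion of the covariance matrix in terms of its eigenvectors.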
Problem 2.19 Solution
For every $N \times N$ real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen such that they are orthogonal to each other. Thus a real symmetric matrix $\Sigma$ can be decomposed as $\Sigma = U\Lambda U^T$, where $U$ is an orthogonal matrix whose columns are the eigenvectors $u_k$, and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $\Sigma$. Hence for an arbitrary vector $x$, we have:
\[
\Sigma x = U\Lambda U^T x = U\Lambda \begin{pmatrix} u_1^T x \\ \vdots \\ u_D^T x \end{pmatrix}
= U \begin{pmatrix} \lambda_1 u_1^T x \\ \vdots \\ \lambda_D u_D^T x \end{pmatrix}
= \Big(\sum_{k=1}^{D} \lambda_k u_k u_k^T\Big) x
\]
And since $\Sigma^{-1} = U\Lambda^{-1}U^T$, the same procedure can be used to prove (2.49).
Problem 2.20 Solution
Since $u_1, u_2, \dots, u_D$ constitute a basis for $\mathbb{R}^D$, we can expand $a$ in this basis:
\[ a = a_1 u_1 + a_2 u_2 + \dots + a_D u_D \]
We substitute the expression above into $a^T \Sigma a$. Taking advantage of the property that $u_i^T u_j = 1$ if $i = j$ and 0 otherwise, we obtain:
\[
\begin{aligned}
a^T \Sigma a &= (a_1 u_1 + a_2 u_2 + \dots + a_D u_D)^T\, \Sigma\, (a_1 u_1 + a_2 u_2 + \dots + a_D u_D) \\
&= (a_1 u_1^T + a_2 u_2^T + \dots + a_D u_D^T)\, \Sigma\, (a_1 u_1 + a_2 u_2 + \dots + a_D u_D) \\
&= (a_1 u_1^T + a_2 u_2^T + \dots + a_D u_D^T)(a_1\lambda_1 u_1 + a_2\lambda_2 u_2 + \dots + a_D\lambda_D u_D) \\
&= \lambda_1 a_1^2 + \lambda_2 a_2^2 + \dots + \lambda_D a_D^2
\end{aligned}
\]
Since $a$ is real, the expression above is strictly positive for any nonzero $a$ if all eigenvalues are strictly positive. It is also clear that if an eigenvalue $\lambda_i$ is zero or negative, there exists a vector $a$ (e.g. $a = u_i$) for which this expression is no greater than 0. Thus, for a real symmetric matrix, having eigenvalues that are all strictly positive is a necessary and sufficient condition for the matrix to be positive definite.
Problem 2.21 Solution
It is straightforward. For a symmetric matrix $\Lambda$ of size $D \times D$, once the lower triangular part (including the diagonal) is decided, the whole matrix is decided due to symmetry. Hence the number of independent parameters is $D + (D - 1) + \dots + 1$, which equals $D(D + 1)/2$.
Problem 2.22 Solution
Suppose $A$ is a symmetric matrix, and we need to prove that $A^{-1}$ is also symmetric, i.e. $A^{-1} = (A^{-1})^T$. Since the identity matrix $I$ is also symmetric, we have:
\[ A A^{-1} = (A A^{-1})^T \]
And since $(AB)^T = B^T A^T$ holds for arbitrary matrices $A$ and $B$, we obtain:
\[ A A^{-1} = (A^{-1})^T A^T \]
Since $A = A^T$, we substitute on the right side:
\[ A A^{-1} = (A^{-1})^T A \]
And noting that $A A^{-1} = A^{-1} A = I$, we rearrange the order on the left side:
\[ A^{-1} A = (A^{-1})^T A \]
Finally, by multiplying both sides by $A^{-1}$ on the right, we obtain:
\[ A^{-1} A A^{-1} = (A^{-1})^T A A^{-1} \]
Using $A A^{-1} = I$, we get what we are asked:
\[ A^{-1} = (A^{-1})^T \]

Problem 2.23 Solution
Let's reformulate the problem. We need to prove that the hyperellipsoid
defined by $(x - \mu)^T \Sigma^{-1} (x - \mu) = r^2$, where $r^2$ is a constant, has
volume $V_D |\Sigma|^{1/2} r^D$. The hyperellipsoid is centered at $\mu$, and a
translation does not change its volume, so we only need to prove
that the hyperellipsoid $x^T \Sigma^{-1} x = r^2$, centered at 0, has volume $V_D |\Sigma|^{1/2} r^D$.
The problem can be split into two parts. Firstly, consider $V_D$,
the volume of the unit sphere in $D$ dimensions. Its expression has already
been given in the solution of Prob.1.18, i.e., (1.144):

$$ V_D = \frac{S_D}{D} = \frac{2\pi^{D/2}}{D \, \Gamma(D/2)} = \frac{\pi^{D/2}}{\Gamma(D/2 + 1)} $$

There we also showed that a $D$-dimensional sphere with
radius $r$, i.e., $y^T y = r^2$, has volume $V(r) = V_D r^D$. We now go one step further and
apply the linear transformation $x = \Sigma^{1/2} y$, so that $y = \Sigma^{-1/2} x$ and $y^T y = x^T \Sigma^{-1} x = r^2$.
The image of the sphere is therefore exactly the hyperellipsoid centered
at 0, and its volume is given by multiplying $V(r)$ by the determinant
of the transformation matrix, $|\Sigma^{1/2}| = |\Sigma|^{1/2}$, which gives $|\Sigma|^{1/2} V_D r^D$, just as required.
Problem 2.24 Solution
We follow the hint and first calculate:

$$
\begin{bmatrix} A & B \\ C & D \end{bmatrix}
\times
\begin{bmatrix} M & -M B D^{-1} \\ -D^{-1} C M & D^{-1} + D^{-1} C M B D^{-1} \end{bmatrix}
$$

The result can also be partitioned into four blocks. The top-left block
equals:

$$ AM - B D^{-1} C M = (A - B D^{-1} C)(A - B D^{-1} C)^{-1} = I $$

where we have taken advantage of (2.77). The top-right block equals:

$$ -A M B D^{-1} + B D^{-1} + B D^{-1} C M B D^{-1} = (I - AM + B D^{-1} C M) B D^{-1} = 0 $$

where we have used the result of the top-left block. The bottom-left block
equals:

$$ CM - D D^{-1} C M = 0 $$

And the bottom-right block equals:

$$ -C M B D^{-1} + D D^{-1} + D D^{-1} C M B D^{-1} = I $$

We have proved what we are asked. Note: to be more precise,
one should also multiply in the other order, with the block matrix on the right side of (2.76) placed on the left, and
show that the product is again the identity matrix. The procedure above
can be used there as well, so we omit the proof. Moreover, if two
square matrices $X$ and $Y$ satisfy $XY = I$, it can be shown that $YX = I$ also
holds.
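The block-inverse formula can be verified numerically. This is a sketch assuming NumPy; the block values are hypothetical and were chosen so that $D$ and $A - BD^{-1}C$ are invertible.

```python
import numpy as np

# Hypothetical blocks: A is 2x2, B is 2x3, C is 3x2, D is 3x3.
A = np.array([[4.0, 1.0], [2.0, 3.0]])
B = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
C = np.array([[1.0, 2.0], [0.0, 1.0], [1.0, 0.0]])
D = np.array([[5.0, 1.0, 0.0], [1.0, 4.0, 1.0], [0.0, 1.0, 3.0]])

D_inv = np.linalg.inv(D)
M = np.linalg.inv(A - B @ D_inv @ C)          # definition (2.77)

full = np.block([[A, B], [C, D]])
candidate = np.block([
    [M,              -M @ B @ D_inv],
    [-D_inv @ C @ M, D_inv + D_inv @ C @ M @ B @ D_inv],
])

# Both products should be the 5x5 identity, up to rounding.
print(np.allclose(full @ candidate, np.eye(5)))
print(np.allclose(candidate @ full, np.eye(5)))
```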
Problem 2.25 Solution
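We will take advantage of (2.94)-(2.98). Let's begin by
grouping $x_a$ and $x_b$ together, and rewrite what has been given as:

$$
x = \begin{pmatrix} x_{a,b} \\ x_c \end{pmatrix}, \qquad
\mu = \begin{pmatrix} \mu_{a,b} \\ \mu_c \end{pmatrix}, \qquad
\Sigma = \begin{bmatrix} \Sigma_{(a,b)(a,b)} & \Sigma_{(a,b)c} \\ \Sigma_{c(a,b)} & \Sigma_{cc} \end{bmatrix}
$$

Then, taking advantage of (2.98), we obtain:

$$ p(x_{a,b}) = \mathcal{N}(x_{a,b} \,|\, \mu_{a,b}, \Sigma_{(a,b)(a,b)}) $$

where we have defined:

$$
\mu_{a,b} = \begin{pmatrix} \mu_a \\ \mu_b \end{pmatrix}, \qquad
\Sigma_{(a,b)(a,b)} = \begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}
$$

Now that we have the joint distribution of $x_a$ and $x_b$, we
use (2.96) and (2.97) to obtain the conditional distribution, which gives:

$$ p(x_a | x_b) = \mathcal{N}(x_a \,|\, \mu_{a|b}, \Lambda_{aa}^{-1}) $$

where we have defined

$$ \mu_{a|b} = \mu_a - \Lambda_{aa}^{-1} \Lambda_{ab} (x_b - \mu_b) $$

The expressions for $\Lambda_{aa}^{-1}$ and $\Lambda_{ab}$ can be obtained from (2.76) and (2.77)
once we notice that the following relation exists:

$$
\begin{bmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{bmatrix}
=
\begin{bmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{bmatrix}^{-1}
$$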

Problem 2.26 Solution
This problem is quite straightforward if we just follow the hint.

$$
\begin{aligned}
& (A + BCD)\big( A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \big) \\
&= A A^{-1} - A A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} + BCD A^{-1} - BCD A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= I - B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} + BCD A^{-1} + B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} - BCD A^{-1} \\
&= I
\end{aligned}
$$

where we have taken advantage of

$$
\begin{aligned}
& -BCD A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= -BC (-C^{-1} + C^{-1} + D A^{-1} B)(C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= (-BC)(-C^{-1})(C^{-1} + D A^{-1} B)^{-1} D A^{-1} + (-BC)(C^{-1} + D A^{-1} B)(C^{-1} + D A^{-1} B)^{-1} D A^{-1} \\
&= B (C^{-1} + D A^{-1} B)^{-1} D A^{-1} - BCD A^{-1}
\end{aligned}
$$

Here we also calculate the inverse matrix directly, which gives
another solution. Let's begin by introducing two useful formulas. The first formula is:

$$ (I + P)^{-1} = (I + P)^{-1} (I + P - P) = I - (I + P)^{-1} P $$

And since

$$ P + PQP = P (I + QP) = (I + PQ) P $$

the second formula is:

$$ (I + PQ)^{-1} P = P (I + QP)^{-1} $$

Now let's directly calculate $(A + BCD)^{-1}$:

$$
\begin{aligned}
(A + BCD)^{-1} &= [A (I + A^{-1} BCD)]^{-1} \\
&= (I + A^{-1} BCD)^{-1} A^{-1} \\
&= \big[ I - (I + A^{-1} BCD)^{-1} A^{-1} BCD \big] A^{-1} \\
&= A^{-1} - (I + A^{-1} BCD)^{-1} A^{-1} BCD A^{-1}
\end{aligned}
$$

where we have assumed that $A$ is invertible and used the first formula. Then we also assume that $C$ is invertible and apply the second formula twice (first with $P = A^{-1}$, $Q = BCD$, then with $P = B$, $Q = CDA^{-1}$):

$$
\begin{aligned}
(A + BCD)^{-1} &= A^{-1} - (I + A^{-1} BCD)^{-1} A^{-1} BCD A^{-1} \\
&= A^{-1} - A^{-1} (I + BCD A^{-1})^{-1} BCD A^{-1} \\
&= A^{-1} - A^{-1} B (I + CD A^{-1} B)^{-1} CD A^{-1} \\
&= A^{-1} - A^{-1} B \big[ C (C^{-1} + D A^{-1} B) \big]^{-1} CD A^{-1} \\
&= A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} C^{-1} C D A^{-1} \\
&= A^{-1} - A^{-1} B (C^{-1} + D A^{-1} B)^{-1} D A^{-1}
\end{aligned}
$$

Just as required.
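As a sanity check, the identity can be evaluated on concrete matrices. The sketch below assumes NumPy; the matrices are hypothetical and chosen so that $A$, $C$ and $C^{-1} + DA^{-1}B$ are all invertible.

```python
import numpy as np

# Hypothetical matrices: A is 3x3, B is 3x2, C is 2x2, D is 2x3.
A = np.array([[2.0, 0.0, 1.0], [0.0, 3.0, 0.0], [1.0, 0.0, 4.0]])
B = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
C = np.array([[2.0, 0.0], [0.0, 1.0]])
D = np.array([[1.0, 0.0, 1.0], [0.0, 2.0, 0.0]])

inv = np.linalg.inv
A_inv = inv(A)

# Left side: invert the 3x3 matrix A + BCD directly.
direct = inv(A + B @ C @ D)
# Right side: only a 2x2 matrix has to be inverted besides A and C.
woodbury = A_inv - A_inv @ B @ inv(inv(C) + D @ A_inv @ B) @ D @ A_inv

print(np.allclose(direct, woodbury))
```

The right-hand side is attractive when $A^{-1}$ is already known and $C$ is much smaller than $A$.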
Problem 2.27 Solution
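The same procedure used in Prob.1.10 can be used here.

$$
\begin{aligned}
\mathbb{E}[x + z] &= \iint (x + z) \, p(x, z) \, dx \, dz \\
&= \iint (x + z) \, p(x) p(z) \, dx \, dz \\
&= \iint x \, p(x) p(z) \, dx \, dz + \iint z \, p(x) p(z) \, dx \, dz \\
&= \int \Big( \int p(z) \, dz \Big) x \, p(x) \, dx + \int \Big( \int p(x) \, dx \Big) z \, p(z) \, dz \\
&= \int x \, p(x) \, dx + \int z \, p(z) \, dz \\
&= \mathbb{E}[x] + \mathbb{E}[z]
\end{aligned}
$$

For the covariance matrix, we use the matrix integral:

$$ \mathrm{cov}[x + z] = \iint (x + z - \mathbb{E}[x + z])(x + z - \mathbb{E}[x + z])^T \, p(x, z) \, dx \, dz $$

The same procedure can be used here: after expanding the product, the cross terms vanish because $x$ and $z$ are independent, leaving $\mathrm{cov}[x] + \mathrm{cov}[z]$. We omit the details for simplicity.
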
Problem 2.28 Solution
It is quite straightforward when we compare the problem with (2.94)-(2.98). We treat $x$ in (2.94) as $z$ in this problem, $x_a$ in (2.94) as $x$ in this
problem, and $x_b$ in (2.94) as $y$ in this problem. In other words, we rewrite the
problem in the form of (2.94)-(2.98), which gives:

$$
z = \begin{pmatrix} x \\ y \end{pmatrix}, \qquad
\mathbb{E}(z) = \begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}, \qquad
\mathrm{cov}(z) = \begin{bmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{bmatrix}
$$

By using (2.98), we obtain:

$$ p(x) = \mathcal{N}(x \,|\, \mu, \Lambda^{-1}) $$

And by using (2.96) and (2.97), we obtain:

$$ p(y|x) = \mathcal{N}(y \,|\, \mu_{y|x}, \Lambda_{yy}^{-1}) $$

where $\Lambda_{yy}$ is the bottom-right block of the precision matrix (2.104), which gives
$\Lambda_{yy} = L$ and hence the conditional covariance $\Lambda_{yy}^{-1} = L^{-1}$. It can also be calculated using (2.105) combined with (2.78)
and (2.79). Finally the conditional mean is given by (2.97):

$$ \mu_{y|x} = A\mu + b - L^{-1} (-LA)(x - \mu) = Ax + b $$


Problem 2.29 Solution
It is straightforward. Firstly, we calculate the top-left block:
$$ \text{top left} = \big[ (\Lambda + A^T L A) - (-A^T L)(L^{-1})(-LA) \big]^{-1} = \Lambda^{-1} $$
And then the top-right block:
$$ \text{top right} = -\Lambda^{-1} (-A^T L) L^{-1} = \Lambda^{-1} A^T $$
And then the bottom-left block:
$$ \text{bottom left} = -L^{-1} (-LA) \Lambda^{-1} = A \Lambda^{-1} $$
Finally the bottom-right block:
$$ \text{bottom right} = L^{-1} + L^{-1} (-LA) \Lambda^{-1} (-A^T L) L^{-1} = L^{-1} + A \Lambda^{-1} A^T $$
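The four blocks can be checked against a direct numerical inverse. This is a sketch assuming NumPy; `Lam`, `L` and `A` are hypothetical values, with `Lam` and `L` chosen positive definite.

```python
import numpy as np

Lam = np.array([[2.0, 0.5], [0.5, 1.0]])              # precision of p(x)
L = np.array([[3.0, 0.0, 1.0],
              [0.0, 2.0, 0.0],
              [1.0, 0.0, 2.0]])                       # precision of p(y|x)
A = np.array([[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]])   # linear map in the mean of p(y|x)

# Precision matrix R of the joint distribution, as in (2.104).
R = np.block([[Lam + A.T @ L @ A, -A.T @ L],
              [-L @ A,            L]])

Lam_inv = np.linalg.inv(Lam)
# Covariance assembled from the four blocks derived above.
cov = np.block([[Lam_inv,     Lam_inv @ A.T],
                [A @ Lam_inv, np.linalg.inv(L) + A @ Lam_inv @ A.T]])

print(np.allclose(R @ cov, np.eye(5)))
```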
Problem 2.30 Solution
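It is straightforward by multiplying (2.105) and (2.107), which gives:

$$
\begin{pmatrix} \Lambda^{-1} & \Lambda^{-1} A^T \\ A \Lambda^{-1} & L^{-1} + A \Lambda^{-1} A^T \end{pmatrix}
\begin{pmatrix} \Lambda \mu - A^T L b \\ L b \end{pmatrix}
=
\begin{pmatrix} \mu \\ A\mu + b \end{pmatrix}
$$

Just as required in the problem.
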
Problem 2.31 Solution
According to the problem, we can write two expressions :

p ( x ) = N ( x |µ x , Σ x ) ,

p( y| x) = N ( y|µ z + x, Σ z )

By comparing the expression above and (2.113)-(2.117), we can write the
expression of p( y) :
p ( y) = N ( y|µ x + µ z , Σ x + Σ z )
Problem 2.32 Solution
Let's make this problem clearer. The derivation in the main text, i.e.,
(2.101)-(2.110), first introduces a new random variable $z$ corresponding to the
joint distribution; then, by completing the square with respect to $z$, i.e., (2.103),
it obtains the precision matrix $R$ by comparing (2.103) with the PDF of a multivariate Gaussian distribution; it then inverts the precision
matrix to obtain the covariance matrix, and finally uses the linear term, i.e.,
(2.106), to calculate the mean.
In this problem, we are asked to solve the problem from another perspective: we write down the joint distribution $p(x, y)$ and then integrate over $x$ to obtain the marginal distribution $p(y)$. Let's begin by writing the
quadratic form in the exponent of $p(x, y)$:

$$ -\frac{1}{2} (x - \mu)^T \Lambda (x - \mu) - \frac{1}{2} (y - Ax - b)^T L (y - Ax - b) $$

We extract the terms involving $x$:

$$
\begin{aligned}
&= -\frac{1}{2} x^T (\Lambda + A^T L A) x + x^T \big[ \Lambda \mu + A^T L (y - b) \big] + \text{const} \\
&= -\frac{1}{2} (x - m)^T (\Lambda + A^T L A)(x - m) + \frac{1}{2} m^T (\Lambda + A^T L A) m + \text{const}
\end{aligned}
$$

where we have defined:

$$ m = (\Lambda + A^T L A)^{-1} \big[ \Lambda \mu + A^T L (y - b) \big] $$

Now if we integrate over $x$, the first term integrates to a constant. Extracting the terms that involve $y$ from the remaining
parts, we obtain:

$$
\begin{aligned}
&= -\frac{1}{2} y^T \big[ L - LA (\Lambda + A^T L A)^{-1} A^T L \big] y \\
&\quad + y^T \Big\{ \big[ L - LA (\Lambda + A^T L A)^{-1} A^T L \big] b + LA (\Lambda + A^T L A)^{-1} \Lambda \mu \Big\}
\end{aligned}
$$

We first read off the precision matrix from the quadratic term, and
then, taking advantage of (2.289), we obtain (2.110). Finally, using the
linear term combined with the already known covariance matrix, we obtain (2.109).
Problem 2.33 Solution
According to Bayes' formula, we can write $p(x|y) = p(x, y) / p(y)$, where
the joint distribution $p(x, y)$ is already known from (2.105) and (2.108), and
the marginal distribution $p(y)$ from Prob.2.32. We can then follow the same procedure as in Prob.2.32, i.e. first obtain the covariance matrix from the quadratic
term and then obtain the mean from the linear term. The details are omitted
here.
Problem 2.34 Solution
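Let's follow the hint by first calculating the derivative of (2.118) with
respect to $\Sigma$ and setting it equal to 0:

$$ -\frac{N}{2} \frac{\partial}{\partial \Sigma} \ln |\Sigma| - \frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) = 0 $$

By using (C.28), the first term reduces to:

$$ -\frac{N}{2} \frac{\partial}{\partial \Sigma} \ln |\Sigma| = -\frac{N}{2} (\Sigma^{-1})^T = -\frac{N}{2} \Sigma^{-1} $$

Anticipating the result that the optimal covariance matrix is the sample
covariance, we denote the sample covariance matrix $S$ as:

$$ S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T $$

We rewrite the second term:

$$
\begin{aligned}
\text{second term} &= -\frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{n=1}^{N} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \\
&= -\frac{N}{2} \frac{\partial}{\partial \Sigma} \mathrm{Tr}[\Sigma^{-1} S] \\
&= \frac{N}{2} \Sigma^{-1} S \Sigma^{-1}
\end{aligned}
$$

where we have taken advantage of the following property, combined with
the fact that $S$ and $\Sigma$ are symmetric. (Note: this property can be found in The
Matrix Cookbook.)

$$ \frac{\partial}{\partial X} \mathrm{Tr}(A X^{-1} B) = -(X^{-1} B A X^{-1})^T = -(X^{-1})^T A^T B^T (X^{-1})^T $$

Thus we obtain:

$$ -\frac{N}{2} \Sigma^{-1} + \frac{N}{2} \Sigma^{-1} S \Sigma^{-1} = 0 $$

Multiplying by $\Sigma$ on both sides, we obtain $\Sigma = S$, just as required.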
Problem 2.35 Solution
The proof of (2.62) is quite clear in the main text, i.e., from page 82 to
page 83, and hence we won't repeat it here. Let's prove (2.124). We begin
by proving (2.123):

$$ \mathbb{E}[\mu_{ML}] = \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] = \frac{1}{N} \cdot N\mu = \mu $$

where we have taken advantage of the fact that the $x_n$ are independently and
identically distributed (i.i.d.).
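Then we use the expression in (2.122):

$$
\begin{aligned}
\mathbb{E}[\Sigma_{ML}] &= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\big[ (x_n - \mu_{ML})(x_n - \mu_{ML})^T \big] \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\big[ x_n x_n^T - 2 \mu_{ML} x_n^T + \mu_{ML} \mu_{ML}^T \big] \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[x_n x_n^T] - 2 \, \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[\mu_{ML} x_n^T] + \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[\mu_{ML} \mu_{ML}^T]
\end{aligned}
$$

By using (2.291), the first term equals:

$$ \text{first term} = \frac{1}{N} \cdot N (\mu\mu^T + \Sigma) = \mu\mu^T + \Sigma $$

The second term equals:

$$
\begin{aligned}
\text{second term} &= -2 \, \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[\mu_{ML} x_n^T] \\
&= -2 \, \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{m=1}^{N} x_m \Big) x_n^T \Big] \\
&= -2 \, \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} \mathbb{E}[x_m x_n^T] \\
&= -2 \, \frac{1}{N^2} \sum_{n=1}^{N} \sum_{m=1}^{N} (\mu\mu^T + I_{nm} \Sigma) \\
&= -2 \, \frac{1}{N^2} (N^2 \mu\mu^T + N \Sigma) \\
&= -2 \Big( \mu\mu^T + \frac{1}{N} \Sigma \Big)
\end{aligned}
$$

Similarly, the third term equals:

$$
\begin{aligned}
\text{third term} &= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}[\mu_{ML} \mu_{ML}^T] \\
&= \frac{1}{N} \sum_{n=1}^{N} \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{j=1}^{N} x_j \Big) \Big( \frac{1}{N} \sum_{i=1}^{N} x_i \Big)^T \Big] \\
&= \frac{1}{N^3} \sum_{n=1}^{N} \mathbb{E}\Big[ \Big( \sum_{j=1}^{N} x_j \Big) \Big( \sum_{i=1}^{N} x_i \Big)^T \Big] \\
&= \frac{1}{N^3} \sum_{n=1}^{N} (N^2 \mu\mu^T + N \Sigma) \\
&= \mu\mu^T + \frac{1}{N} \Sigma
\end{aligned}
$$

Finally, we combine the three terms, which gives:

$$ \mathbb{E}[\Sigma_{ML}] = \frac{N - 1}{N} \Sigma $$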

Note: the same procedure from (2.59) to (2.62) can be carried out to prove
(2.291); the only difference is that we need to introduce indices m and n
to label the samples. (2.291) is quite intuitive if we see it in this
way: if m = n, which means xn and xm are actually the same sample, (2.291)
reduces to (2.62) (i.e. the correlation between different dimensions exists), and if m ≠ n, which means xn and xm are different i.i.d. samples,
then no correlation exists between them, so E[ xn xm T ] = µµT in this case.
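The factor (N − 1)/N can be confirmed exactly on a toy example. The sketch below is restricted to the one-dimensional case and uses a hypothetical discrete distribution (uniform over a small support), so that the expectation over all N-tuples of i.i.d. samples can be enumerated with exact rational arithmetic; none of these choices come from the original solution.

```python
from fractions import Fraction
from itertools import product

def expected_ml_variance(support, N):
    """Exact E[sigma^2_ML] for N i.i.d. draws, uniform over `support`."""
    total = Fraction(0)
    count = 0
    for sample in product(support, repeat=N):
        mean = Fraction(sum(sample), N)                      # mu_ML of this sample
        total += sum((Fraction(x) - mean) ** 2 for x in sample) / N
        count += 1
    return total / count

def true_variance(support):
    """Variance of a single draw from the uniform distribution on `support`."""
    mu = Fraction(sum(support), len(support))
    return sum((Fraction(x) - mu) ** 2 for x in support) / len(support)

support = [0, 1, 3]   # hypothetical toy distribution
N = 4
lhs = expected_ml_variance(support, N)
rhs = Fraction(N - 1, N) * true_variance(support)
print(lhs == rhs)
```

Because the derivation only uses the first two moments and independence, the check does not require the samples to be Gaussian.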
Problem 2.36 Solution
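Let's follow the hint. First, however, we derive the sequential expression directly from the definition, which will make it easier to find the coefficient $a_{N-1}$ later. Suppose we have $N$ observations in total; then
we can write:

$$
\begin{aligned}
\sigma^{2(N)}_{ML} &= \frac{1}{N} \sum_{n=1}^{N} \big( x_n - \mu^{(N)}_{ML} \big)^2 \\
&= \frac{1}{N} \Big[ \sum_{n=1}^{N-1} \big( x_n - \mu^{(N)}_{ML} \big)^2 + \big( x_N - \mu^{(N)}_{ML} \big)^2 \Big] \\
&= \frac{N-1}{N} \, \frac{1}{N-1} \sum_{n=1}^{N-1} \big( x_n - \mu^{(N)}_{ML} \big)^2 + \frac{1}{N} \big( x_N - \mu^{(N)}_{ML} \big)^2 \\
&= \frac{N-1}{N} \, \sigma^{2(N-1)}_{ML} + \frac{1}{N} \big( x_N - \mu^{(N)}_{ML} \big)^2 \\
&= \sigma^{2(N-1)}_{ML} + \frac{1}{N} \Big[ \big( x_N - \mu^{(N)}_{ML} \big)^2 - \sigma^{2(N-1)}_{ML} \Big]
\end{aligned}
$$

Note that the fourth line identifies $\frac{1}{N-1} \sum_{n=1}^{N-1} (x_n - \mu^{(N)}_{ML})^2$ with $\sigma^{2(N-1)}_{ML}$. This is exact when the mean is known and fixed; otherwise it ignores the small shift between $\mu^{(N-1)}_{ML}$ and $\mu^{(N)}_{ML}$.
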
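And then let us write the condition satisfied by $\sigma^2_{ML}$:

$$ \left. \left\{ \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \sigma^2} \ln p(x_n | \mu, \sigma) \right\} \right|_{\sigma^2_{ML}} = 0 $$

By exchanging the summation and the derivative, and letting $N \to +\infty$,
we obtain:

$$ \lim_{N \to +\infty} \frac{1}{N} \sum_{n=1}^{N} \frac{\partial}{\partial \sigma^2} \ln p(x_n | \mu, \sigma) = \mathbb{E}_x \Big[ \frac{\partial}{\partial \sigma^2} \ln p(x | \mu, \sigma) \Big] $$

Comparing it with (2.127), we obtain the sequential formula to estimate $\sigma^2_{ML}$:

$$
\begin{aligned}
\sigma^{2(N)}_{ML} &= \sigma^{2(N-1)}_{ML} + a_{N-1} \frac{\partial}{\partial \sigma^{2(N-1)}_{ML}} \ln p\big( x_N \,|\, \mu^{(N)}_{ML}, \sigma^{(N-1)}_{ML} \big) \qquad (*) \\
&= \sigma^{2(N-1)}_{ML} + a_{N-1} \Big[ -\frac{1}{2 \sigma^{2(N-1)}_{ML}} + \frac{\big( x_N - \mu^{(N)}_{ML} \big)^2}{2 \sigma^{4(N-1)}_{ML}} \Big]
\end{aligned}
$$

where we use $\sigma^{2(N)}_{ML}$ to represent the $N$th estimate of $\sigma^2_{ML}$, i.e., the estimate of $\sigma^2_{ML}$ after the $N$th observation. Moreover, if we choose:

$$ a_{N-1} = \frac{2 \sigma^{4(N-1)}_{ML}}{N} $$

then we obtain:

$$ \sigma^{2(N)}_{ML} = \sigma^{2(N-1)}_{ML} + \frac{1}{N} \Big[ -\sigma^{2(N-1)}_{ML} + \big( x_N - \mu^{(N)}_{ML} \big)^2 \Big] $$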

We can see that the two results are the same. One important point should be
noticed: in maximum likelihood, when estimating the variance $\sigma^{2(N)}_{ML}$, we
first estimate the mean $\mu^{(N)}_{ML}$, and then we calculate the variance $\sigma^{2(N)}_{ML}$.
In other words, they are decoupled. The same holds in the sequential method.
For instance, if we want to estimate both mean and variance sequentially,
after observing the $N$th sample (i.e., $x_N$), we first use $\mu^{(N-1)}_{ML}$ together
with (2.126) to estimate $\mu^{(N)}_{ML}$, and then use the conclusion of this problem
to obtain $\sigma^{(N)}_{ML}$. That is why in $(*)$ we write $\ln p(x_N | \mu^{(N)}_{ML}, \sigma^{(N-1)}_{ML})$ instead of
$\ln p(x_N | \mu^{(N-1)}_{ML}, \sigma^{(N-1)}_{ML})$.

Problem 2.37 Solution (Wait for revising)
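We follow the same procedure as in Prob.2.36 to solve this problem. Firstly,
we obtain the sequential formula based on the definition.

$$
\begin{aligned}
\Sigma^{(N)}_{ML} &= \frac{1}{N} \sum_{n=1}^{N} \big( x_n - \mu^{(N)}_{ML} \big) \big( x_n - \mu^{(N)}_{ML} \big)^T \\
&= \frac{1}{N} \Big[ \sum_{n=1}^{N-1} \big( x_n - \mu^{(N)}_{ML} \big) \big( x_n - \mu^{(N)}_{ML} \big)^T + \big( x_N - \mu^{(N)}_{ML} \big) \big( x_N - \mu^{(N)}_{ML} \big)^T \Big] \\
&= \frac{N-1}{N} \, \Sigma^{(N-1)}_{ML} + \frac{1}{N} \big( x_N - \mu^{(N)}_{ML} \big) \big( x_N - \mu^{(N)}_{ML} \big)^T \\
&= \Sigma^{(N-1)}_{ML} + \frac{1}{N} \Big[ \big( x_N - \mu^{(N)}_{ML} \big) \big( x_N - \mu^{(N)}_{ML} \big)^T - \Sigma^{(N-1)}_{ML} \Big]
\end{aligned}
$$
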

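If we use the Robbins-Monro sequential estimation formula, i.e., (2.135), we
obtain:

$$
\begin{aligned}
\Sigma^{(N)}_{ML} &= \Sigma^{(N-1)}_{ML} + a_{N-1} \frac{\partial}{\partial \Sigma^{(N-1)}_{ML}} \ln p\big( x_N \,|\, \mu^{(N)}_{ML}, \Sigma^{(N-1)}_{ML} \big) \\
&= \Sigma^{(N-1)}_{ML} + a_{N-1} \Big[ -\frac{1}{2} \big[ \Sigma^{(N-1)}_{ML} \big]^{-1} + \frac{1}{2} \big[ \Sigma^{(N-1)}_{ML} \big]^{-1} \big( x_N - \mu^{(N)}_{ML} \big) \big( x_N - \mu^{(N)}_{ML} \big)^T \big[ \Sigma^{(N-1)}_{ML} \big]^{-1} \Big]
\end{aligned}
$$

where we have taken advantage of the procedure carried out in Prob.2.34
to calculate the derivative. If we choose:

$$ a_{N-1} = \frac{2}{N} \, \Sigma^{2(N-1)}_{ML} $$

with the coefficient understood to multiply the bracket by $\Sigma^{(N-1)}_{ML}$ on both the left and the right (matrices do not commute in general), the equation above becomes identical to our previous
conclusion based on the definition.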
Problem 2.38 Solution
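It is straightforward. Based on (2.137), (2.138) and (2.139), we focus on
the exponent of the posterior distribution $p(\mu | X)$, which gives:

$$ -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{1}{2\sigma_0^2} (\mu - \mu_0)^2 = -\frac{1}{2\sigma_N^2} (\mu - \mu_N)^2 + \text{const} $$

We collect the terms on the left side according to their order in $\mu$:

$$ \text{quadratic term} = -\Big( \frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2} \Big) \mu^2 $$

$$ \text{linear term} = \Big( \frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \Big) \mu $$

We do the same on the right side, and hence we obtain:

$$ -\Big( \frac{N}{2\sigma^2} + \frac{1}{2\sigma_0^2} \Big) \mu^2 = -\frac{1}{2\sigma_N^2} \mu^2, \qquad \Big( \frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \Big) \mu = \frac{\mu_N}{\sigma_N^2} \mu $$

Then we obtain:

$$ \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} $$

And using the fact that $\sum_{n=1}^{N} x_n = N \cdot \mu_{ML}$, we can write:

$$
\begin{aligned}
\mu_N &= \sigma_N^2 \cdot \Big( \frac{\sum_{n=1}^{N} x_n}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \Big) \\
&= \Big( \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} \Big)^{-1} \cdot \Big( \frac{N \mu_{ML}}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \Big) \\
&= \frac{\sigma_0^2 \sigma^2}{\sigma^2 + N \sigma_0^2} \cdot \frac{N \mu_{ML} \sigma_0^2 + \mu_0 \sigma^2}{\sigma^2 \sigma_0^2} \\
&= \frac{\sigma^2}{N \sigma_0^2 + \sigma^2} \, \mu_0 + \frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2} \, \mu_{ML}
\end{aligned}
$$
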

Problem 2.39 Solution
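Let's follow the hint.

$$ \frac{1}{\sigma_N^2} = \frac{1}{\sigma_0^2} + \frac{N}{\sigma^2} = \frac{1}{\sigma_0^2} + \frac{N-1}{\sigma^2} + \frac{1}{\sigma^2} = \frac{1}{\sigma_{N-1}^2} + \frac{1}{\sigma^2} $$

However, it is complicated to derive a sequential formula for $\mu_N$ directly.
Based on (2.142), we see that the denominator in (2.141) can be eliminated if
we multiply both sides of (2.141) by $1/\sigma_N^2$. Therefore we derive a sequential formula for $\mu_N / \sigma_N^2$ instead.

$$
\begin{aligned}
\frac{\mu_N}{\sigma_N^2} &= \frac{\sigma^2 + N \sigma_0^2}{\sigma_0^2 \sigma^2} \Big( \frac{\sigma^2}{N \sigma_0^2 + \sigma^2} \, \mu_0 + \frac{N \sigma_0^2}{N \sigma_0^2 + \sigma^2} \, \mu^{(N)}_{ML} \Big) \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{N \mu^{(N)}_{ML}}{\sigma^2} \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N} x_n}{\sigma^2} \\
&= \frac{\mu_0}{\sigma_0^2} + \frac{\sum_{n=1}^{N-1} x_n}{\sigma^2} + \frac{x_N}{\sigma^2} \\
&= \frac{\mu_{N-1}}{\sigma_{N-1}^2} + \frac{x_N}{\sigma^2}
\end{aligned}
$$
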
Another possible solution is also suggested in the problem. We solve it by
completing the square.

$$ -\frac{1}{2\sigma^2} (x_N - \mu)^2 - \frac{1}{2\sigma_{N-1}^2} (\mu - \mu_{N-1})^2 = -\frac{1}{2\sigma_N^2} (\mu - \mu_N)^2 + \text{const} $$

By comparing the quadratic and linear terms in $\mu$, we obtain:

$$ \frac{1}{\sigma_N^2} = \frac{1}{\sigma^2} + \frac{1}{\sigma_{N-1}^2} $$

And:

$$ \frac{\mu_N}{\sigma_N^2} = \frac{x_N}{\sigma^2} + \frac{\mu_{N-1}}{\sigma_{N-1}^2} $$

This is the same as the previous result. Note: after obtaining the $N$th observation, we first use the sequential formula to calculate $\sigma_N^2$, and then $\mu_N$.
This is because the sequential formula for $\mu_N$ depends on $\sigma_N^2$.
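The equivalence of the batch formulas and the sequential updates can be checked with exact rational arithmetic. The following is a sketch with hypothetical data and prior values; the function names are illustrative only.

```python
from fractions import Fraction

def batch_posterior(xs, mu0, var0, var):
    """Posterior mean and variance from the batch formulas (2.141) and (2.142)."""
    N = len(xs)
    mu_ml = sum(xs) / N
    var_n = 1 / (1 / var0 + N / var)
    mu_n = var / (N * var0 + var) * mu0 + N * var0 / (N * var0 + var) * mu_ml
    return mu_n, var_n

def sequential_posterior(xs, mu0, var0, var):
    """Process one observation at a time: update the variance first, then the mean."""
    mu, v = mu0, var0
    for x in xs:
        v_new = 1 / (1 / v + 1 / var)          # 1/var_N = 1/var_{N-1} + 1/var
        mu = v_new * (mu / v + x / var)        # mu_N/var_N = mu_{N-1}/var_{N-1} + x_N/var
        v = v_new
    return mu, v

xs = [Fraction(3, 2), Fraction(-1, 4), Fraction(2)]        # hypothetical observations
mu0, var0, var = Fraction(0), Fraction(4), Fraction(1, 2)  # hypothetical prior and noise variance
print(batch_posterior(xs, mu0, var0, var) == sequential_posterior(xs, mu0, var0, var))
```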
Problem 2.40 Solution
Based on Bayes' theorem, we can write:

$$ p(\mu | X) \propto p(X | \mu) \, p(\mu) $$

We focus on the exponent on the right side and rearrange it
in powers of $\mu$.

$$
\begin{aligned}
\text{exponent} &= \sum_{n=1}^{N} \Big[ -\frac{1}{2} (x_n - \mu)^T \Sigma^{-1} (x_n - \mu) \Big] - \frac{1}{2} (\mu - \mu_0)^T \Sigma_0^{-1} (\mu - \mu_0) \\
&= -\frac{1}{2} \mu^T (\Sigma_0^{-1} + N \Sigma^{-1}) \mu + \mu^T \Big( \Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n \Big) + \text{const}
\end{aligned}
$$

where 'const' represents all the terms independent of $\mu$. From the quadratic term, we obtain the posterior covariance matrix:

$$ \Sigma_N^{-1} = \Sigma_0^{-1} + N \Sigma^{-1} $$

Then using the linear term, we obtain:

$$ \Sigma_N^{-1} \mu_N = \Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n $$

Finally we obtain the posterior mean:

$$ \mu_N = (\Sigma_0^{-1} + N \Sigma^{-1})^{-1} \Big( \Sigma_0^{-1} \mu_0 + \Sigma^{-1} \sum_{n=1}^{N} x_n \Big) $$

which can also be written as:

$$ \mu_N = (\Sigma_0^{-1} + N \Sigma^{-1})^{-1} (\Sigma_0^{-1} \mu_0 + \Sigma^{-1} N \mu_{ML}) $$

Problem 2.41 Solution
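Let's compute the integral of (2.146) over $\lambda$.

$$
\begin{aligned}
\int_0^{+\infty} \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda) \, d\lambda &= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \lambda^{a-1} \exp(-b\lambda) \, d\lambda \\
&= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \Big( \frac{u}{b} \Big)^{a-1} \exp(-u) \, \frac{1}{b} \, du \\
&= \frac{1}{\Gamma(a)} \int_0^{+\infty} u^{a-1} \exp(-u) \, du \\
&= \frac{1}{\Gamma(a)} \cdot \Gamma(a) = 1
\end{aligned}
$$

where we first perform the change of variable $b\lambda = u$, and then take advantage of the definition of the gamma function:

$$ \Gamma(x) = \int_0^{+\infty} u^{x-1} e^{-u} \, du $$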

Problem 2.42 Solution
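We first calculate its mean.

$$
\begin{aligned}
\int_0^{+\infty} \lambda \, \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda) \, d\lambda &= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \lambda^{a} \exp(-b\lambda) \, d\lambda \\
&= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \Big( \frac{u}{b} \Big)^{a} \exp(-u) \, \frac{1}{b} \, du \\
&= \frac{1}{\Gamma(a) \cdot b} \int_0^{+\infty} u^{a} \exp(-u) \, du \\
&= \frac{1}{\Gamma(a) \cdot b} \cdot \Gamma(a + 1) = \frac{a}{b}
\end{aligned}
$$

where we have taken advantage of the property $\Gamma(a + 1) = a \Gamma(a)$. Then
we calculate $\mathbb{E}[\lambda^2]$.

$$
\begin{aligned}
\int_0^{+\infty} \lambda^2 \, \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda) \, d\lambda &= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \lambda^{a+1} \exp(-b\lambda) \, d\lambda \\
&= \frac{b^a}{\Gamma(a)} \int_0^{+\infty} \Big( \frac{u}{b} \Big)^{a+1} \exp(-u) \, \frac{1}{b} \, du \\
&= \frac{1}{\Gamma(a) \cdot b^2} \int_0^{+\infty} u^{a+1} \exp(-u) \, du \\
&= \frac{1}{\Gamma(a) \cdot b^2} \cdot \Gamma(a + 2) = \frac{a(a+1)}{b^2}
\end{aligned}
$$

Therefore, according to $\mathrm{var}[\lambda] = \mathbb{E}[\lambda^2] - \mathbb{E}[\lambda]^2$, we obtain:

$$ \mathrm{var}[\lambda] = \mathbb{E}[\lambda^2] - \mathbb{E}[\lambda]^2 = \frac{a(a+1)}{b^2} - \Big( \frac{a}{b} \Big)^2 = \frac{a}{b^2} $$

For the mode of a gamma distribution, we need to find where the maximum of the PDF occurs, and hence we calculate the derivative of the
gamma distribution with respect to $\lambda$.

$$ \frac{d}{d\lambda} \Big[ \frac{1}{\Gamma(a)} b^a \lambda^{a-1} \exp(-b\lambda) \Big] = \big[ (a - 1) - b\lambda \big] \frac{1}{\Gamma(a)} b^a \lambda^{a-2} \exp(-b\lambda) $$

The derivative vanishes at $\lambda = (a - 1)/b$, is positive below this point and negative above it, so for $a \geq 1$ the density $\mathrm{Gam}(\lambda | a, b)$ has its maximum there. In other
words, the gamma distribution $\mathrm{Gam}(\lambda | a, b)$ has mode $(a - 1)/b$.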
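These three results can be checked numerically with the standard library only. The sketch below uses a simple trapezoidal rule and hypothetical parameter values with a > 1; the truncation point `upper` is an assumption that the tail beyond it is negligible for these values.

```python
import math

def gamma_pdf(lam, a, b):
    """Gam(lam | a, b) as in (2.146)."""
    return b**a * lam**(a - 1) * math.exp(-b * lam) / math.gamma(a)

def integrate(f, lo, hi, steps=100_000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.5 * (f(lo) + f(hi))
    for k in range(1, steps):
        total += f(lo + k * h)
    return total * h

a, b = 3.0, 2.0      # hypothetical shape and rate, with a > 1
upper = 40.0         # the tail beyond this point is negligible for these values

norm = integrate(lambda t: gamma_pdf(t, a, b), 0.0, upper)
mean = integrate(lambda t: t * gamma_pdf(t, a, b), 0.0, upper)
second = integrate(lambda t: t * t * gamma_pdf(t, a, b), 0.0, upper)
var = second - mean**2

# Locate the mode on a grid with spacing 0.001.
grid = [k * 0.001 for k in range(1, 10_000)]
mode = max(grid, key=lambda t: gamma_pdf(t, a, b))

print(norm, mean, var, mode)   # compare with 1, a/b, a/b^2, (a-1)/b
```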
Problem 2.43 Solution
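Let's first calculate the following integral, using the change of variable $u = x^q / (2\sigma^2)$.

$$
\begin{aligned}
\int_{-\infty}^{+\infty} \exp\Big( -\frac{|x|^q}{2\sigma^2} \Big) dx &= 2 \int_0^{+\infty} \exp\Big( -\frac{x^q}{2\sigma^2} \Big) dx \\
&= 2 \int_0^{+\infty} \exp(-u) \, \frac{(2\sigma^2)^{1/q}}{q} \, u^{\frac{1}{q} - 1} \, du \\
&= 2 \, \frac{(2\sigma^2)^{1/q}}{q} \int_0^{+\infty} \exp(-u) \, u^{\frac{1}{q} - 1} \, du \\
&= 2 \, \frac{(2\sigma^2)^{1/q}}{q} \, \Gamma\Big( \frac{1}{q} \Big)
\end{aligned}
$$

It follows that (2.293) is normalized. Next, we consider
the log likelihood function. Since $\epsilon = t - y(x, w)$ and $\epsilon \sim p(\epsilon | \sigma^2, q)$, we can
write:

$$
\begin{aligned}
\ln p(\mathbf{t} | X, w, \sigma^2) &= \sum_{n=1}^{N} \ln p\big( y(x_n, w) - t_n \,|\, \sigma^2, q \big) \\
&= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} |y(x_n, w) - t_n|^q + N \cdot \ln \Big[ \frac{q}{2 (2\sigma^2)^{1/q} \, \Gamma(1/q)} \Big] \\
&= -\frac{1}{2\sigma^2} \sum_{n=1}^{N} |y(x_n, w) - t_n|^q - \frac{N}{q} \ln(2\sigma^2) + \text{const}
\end{aligned}
$$
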

Problem 2.44 Solution
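Here we use a simple method to solve this problem by taking advantage
of (2.152) and (2.153). By writing the prior distribution in the form of (2.153),
i.e., $p(\mu, \lambda | \beta, c, d)$, we can easily obtain the posterior distribution.

$$
\begin{aligned}
p(\mu, \lambda | X) &\propto p(X | \mu, \lambda) \cdot p(\mu, \lambda) \\
&\propto \Big[ \lambda^{1/2} \exp\Big( -\frac{\lambda \mu^2}{2} \Big) \Big]^{N + \beta} \exp\Big[ \Big( c + \sum_{n=1}^{N} x_n \Big) \lambda \mu - \Big( d + \sum_{n=1}^{N} \frac{x_n^2}{2} \Big) \lambda \Big]
\end{aligned}
$$

Therefore, we can see that the posterior distribution has parameters $\beta' = \beta + N$, $c' = c + \sum_{n=1}^{N} x_n$, $d' = d + \sum_{n=1}^{N} \frac{x_n^2}{2}$. The prior distribution is
actually the product of a Gaussian distribution and a Gamma distribution:

$$ p(\mu, \lambda | \mu_0, \beta, a, b) = \mathcal{N}\big[ \mu \,|\, \mu_0, (\beta\lambda)^{-1} \big] \, \mathrm{Gam}(\lambda | a, b) $$

where $\mu_0 = c/\beta$, $a = (1 + \beta)/2$, $b = d - c^2/(2\beta)$. (The value of $a$ follows from matching the power of $\lambda$: the prior contains $\lambda^{\beta/2}$, while the Gaussian factor contributes $\lambda^{1/2}$ and the Gamma factor $\lambda^{a-1}$.) Hence the posterior distribution can also be written as the product of a Gaussian distribution and a
Gamma distribution.

$$ p(\mu, \lambda | X) = \mathcal{N}\big[ \mu \,|\, \mu_0', (\beta'\lambda)^{-1} \big] \, \mathrm{Gam}(\lambda | a', b') $$

where we have defined:

$$ \mu_0' = c'/\beta' = \Big( c + \sum_{n=1}^{N} x_n \Big) \Big/ (N + \beta) $$

$$ a' = (1 + \beta')/2 = (1 + N + \beta)/2 $$

$$ b' = d' - \frac{c'^2}{2\beta'} = d + \sum_{n=1}^{N} \frac{x_n^2}{2} - \Big( c + \sum_{n=1}^{N} x_n \Big)^2 \Big/ \big( 2(\beta + N) \big) $$
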

Problem 2.45 Solution
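Let's begin by writing down the dependency of the prior distribution $\mathcal{W}(\Lambda | W, v)$
and the likelihood function $p(X | \mu, \Lambda)$ on $\Lambda$.

$$ p(X | \mu, \Lambda) \propto |\Lambda|^{N/2} \exp\Big[ \sum_{n=1}^{N} -\frac{1}{2} (x_n - \mu)^T \Lambda (x_n - \mu) \Big] $$

And if we denote

$$ S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)(x_n - \mu)^T $$

then we can rewrite the equation above as:

$$ p(X | \mu, \Lambda) \propto |\Lambda|^{N/2} \exp\Big[ -\frac{N}{2} \mathrm{Tr}(S\Lambda) \Big] $$

This is just what we did in Prob.2.34; the factor $N$ appears because $S$ is defined with the prefactor $1/N$. Comparing this problem with
Prob.2.34, one important thing should be noticed: since $S$ and $\Lambda$ are both
symmetric, we have $\mathrm{Tr}(S\Lambda) = \mathrm{Tr}\big( (S\Lambda)^T \big) = \mathrm{Tr}(\Lambda^T S^T) = \mathrm{Tr}(\Lambda S)$. We
can also write down the prior distribution as:

$$ \mathcal{W}(\Lambda | W, v) \propto |\Lambda|^{(v - D - 1)/2} \exp\Big[ -\frac{1}{2} \mathrm{Tr}(W^{-1} \Lambda) \Big] $$

Therefore, the posterior distribution can be obtained:

$$
\begin{aligned}
p(\Lambda | X, W, v) &\propto p(X | \mu, \Lambda) \cdot \mathcal{W}(\Lambda | W, v) \\
&\propto |\Lambda|^{(N + v - D - 1)/2} \exp\Big\{ -\frac{1}{2} \mathrm{Tr}\big[ (W^{-1} + N S) \Lambda \big] \Big\}
\end{aligned}
$$

Therefore, $p(\Lambda | X, W, v)$ is also a Wishart distribution, with parameters:

$$ v_N = N + v $$
$$ W_N = (W^{-1} + N S)^{-1} $$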
Problem 2.46 Solution
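It is quite straightforward.

$$
\begin{aligned}
p(x | \mu, a, b) &= \int_0^{\infty} \mathcal{N}(x | \mu, \tau^{-1}) \, \mathrm{Gam}(\tau | a, b) \, d\tau \\
&= \int_0^{\infty} \frac{b^a \exp(-b\tau) \tau^{a-1}}{\Gamma(a)} \Big( \frac{\tau}{2\pi} \Big)^{1/2} \exp\Big\{ -\frac{\tau}{2} (x - \mu)^2 \Big\} \, d\tau \\
&= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{2\pi} \Big)^{1/2} \int_0^{\infty} \tau^{a - 1/2} \exp\Big\{ -b\tau - \frac{\tau}{2} (x - \mu)^2 \Big\} \, d\tau
\end{aligned}
$$

And if we make the change of variable $z = \tau [b + (x - \mu)^2 / 2]$, the integral above
can be written as:

$$
\begin{aligned}
p(x | \mu, a, b) &= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{2\pi} \Big)^{1/2} \int_0^{\infty} \tau^{a - 1/2} \exp\Big\{ -b\tau - \frac{\tau}{2} (x - \mu)^2 \Big\} \, d\tau \\
&= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{2\pi} \Big)^{1/2} \int_0^{\infty} \Big[ \frac{z}{b + (x - \mu)^2 / 2} \Big]^{a - 1/2} \exp\{-z\} \, \frac{1}{b + (x - \mu)^2 / 2} \, dz \\
&= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{2\pi} \Big)^{1/2} \Big[ \frac{1}{b + (x - \mu)^2 / 2} \Big]^{a + 1/2} \int_0^{\infty} z^{a - 1/2} \exp\{-z\} \, dz \\
&= \frac{b^a}{\Gamma(a)} \Big( \frac{1}{2\pi} \Big)^{1/2} \Big[ b + \frac{(x - \mu)^2}{2} \Big]^{-a - 1/2} \Gamma(a + 1/2)
\end{aligned}
$$

And if we substitute $a = v/2$ and $b = v/(2\lambda)$, we obtain (2.159).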
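The result can be checked by comparing a numerical evaluation of the mixture integral with the closed form (2.159). This is a sketch using only the standard library; the parameter values are hypothetical, the integral is truncated at `upper`, and the treatment of the integrand at zero assumes a > 1/2.

```python
import math

def student_t_pdf(x, mu, lam, nu):
    """Closed form (2.159) of Student's t-distribution."""
    c = math.gamma(nu / 2 + 0.5) / math.gamma(nu / 2) * math.sqrt(lam / (math.pi * nu))
    return c * (1 + lam * (x - mu) ** 2 / nu) ** (-nu / 2 - 0.5)

def mixture_pdf(x, mu, a, b, upper=80.0, steps=200_000):
    """Trapezoidal approximation of the integral over tau of N(x|mu, 1/tau) * Gam(tau|a, b)."""
    def integrand(tau):
        if tau == 0.0:
            return 0.0   # limit of the integrand at zero, valid for a > 1/2
        normal = math.sqrt(tau / (2 * math.pi)) * math.exp(-0.5 * tau * (x - mu) ** 2)
        gam = b**a * tau ** (a - 1) * math.exp(-b * tau) / math.gamma(a)
        return normal * gam
    h = upper / steps
    total = 0.5 * (integrand(0.0) + integrand(upper))
    for k in range(1, steps):
        total += integrand(k * h)
    return total * h

mu, lam, nu = 0.0, 2.0, 3.0        # hypothetical parameter values
a, b = nu / 2, nu / (2 * lam)      # substitution used in the text
x = 0.7
print(abs(mixture_pdf(x, mu, a, b) - student_t_pdf(x, mu, lam, nu)) < 1e-7)
```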
Problem 2.47 Solution
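We focus on the dependency of (2.159) on $x$.

$$
\begin{aligned}
\mathrm{St}(x | \mu, \lambda, v) &\propto \Big[ 1 + \frac{\lambda (x - \mu)^2}{v} \Big]^{-v/2 - 1/2} \\
&= \exp\Big[ \frac{-v - 1}{2} \ln\Big( 1 + \frac{\lambda (x - \mu)^2}{v} \Big) \Big] \\
&= \exp\Big[ \frac{-v - 1}{2} \Big( \frac{\lambda (x - \mu)^2}{v} + O(v^{-2}) \Big) \Big] \\
&\approx \exp\Big[ -\frac{\lambda (x - \mu)^2}{2} \Big] \qquad (v \to \infty)
\end{aligned}
$$

where we have used the Taylor expansion $\ln(1 + \epsilon) = \epsilon + O(\epsilon^2)$. We see that
this, up to an overall constant, is a Gaussian distribution with mean $\mu$ and
precision $\lambda$.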

Problem 2.48 Solution
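The same steps as in Prob.2.46 can be used here.

$$
\begin{aligned}
\mathrm{St}(x | \mu, \Lambda, v) &= \int_0^{+\infty} \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \\
&= \int_0^{+\infty} \frac{1}{(2\pi)^{D/2}} |\eta\Lambda|^{1/2} \exp\Big\{ -\frac{1}{2} (x - \mu)^T (\eta\Lambda)(x - \mu) - \frac{v\eta}{2} \Big\} \frac{1}{\Gamma(v/2)} \Big( \frac{v}{2} \Big)^{v/2} \eta^{v/2 - 1} \, d\eta \\
&= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2} \Gamma(v/2)} \int_0^{+\infty} \exp\Big\{ -\frac{1}{2} (x - \mu)^T (\eta\Lambda)(x - \mu) - \frac{v\eta}{2} \Big\} \eta^{D/2 + v/2 - 1} \, d\eta
\end{aligned}
$$

where we have taken advantage of the property $|\eta\Lambda| = \eta^D |\Lambda|$. If we
denote:

$$ \Delta^2 = (x - \mu)^T \Lambda (x - \mu) \quad \text{and} \quad z = \frac{\eta}{2} (\Delta^2 + v) $$

the expression above reduces to:

$$
\begin{aligned}
\mathrm{St}(x | \mu, \Lambda, v) &= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2} \Gamma(v/2)} \int_0^{+\infty} \exp(-z) \Big( \frac{2z}{\Delta^2 + v} \Big)^{D/2 + v/2 - 1} \cdot \frac{2}{\Delta^2 + v} \, dz \\
&= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2} \Gamma(v/2)} \Big( \frac{2}{\Delta^2 + v} \Big)^{D/2 + v/2} \int_0^{+\infty} \exp(-z) \cdot z^{D/2 + v/2 - 1} \, dz \\
&= \frac{(v/2)^{v/2} |\Lambda|^{1/2}}{(2\pi)^{D/2} \Gamma(v/2)} \Big( \frac{2}{\Delta^2 + v} \Big)^{D/2 + v/2} \Gamma(D/2 + v/2)
\end{aligned}
$$

And if we rearrange the expression above, we obtain (2.162), just as
required.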
Problem 2.49 Solution
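Firstly, we notice that $\Delta^2$ equals 0 if and only if $x = \mu$, and this is where $\mathrm{St}(x | \mu, \Lambda, v)$
achieves its maximum. In other words, the mode of $\mathrm{St}(x | \mu, \Lambda, v)$ is $\mu$. Then
we consider its mean $\mathbb{E}[x]$.

$$
\begin{aligned}
\mathbb{E}[x] &= \int_{x \in \mathbb{R}^D} \mathrm{St}(x | \mu, \Lambda, v) \cdot x \, dx \\
&= \int_{x \in \mathbb{R}^D} \Big[ \int_0^{+\infty} \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \Big] x \, dx \\
&= \int_{x \in \mathbb{R}^D} \int_0^{+\infty} x \, \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \, dx \\
&= \int_0^{+\infty} \Big[ \int_{x \in \mathbb{R}^D} x \, \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \, dx \Big] \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \\
&= \int_0^{+\infty} \mu \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \\
&= \mu \int_0^{+\infty} \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta = \mu
\end{aligned}
$$

where we have used the following property:

$$ \int_{x \in \mathbb{R}^D} x \, \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \, dx = \mathbb{E}[x] = \mu $$
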
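Then we calculate $\mathbb{E}[xx^T]$. The steps above can also be used here.

$$
\begin{aligned}
\mathbb{E}[xx^T] &= \int_{x \in \mathbb{R}^D} \mathrm{St}(x | \mu, \Lambda, v) \cdot xx^T \, dx \\
&= \int_{x \in \mathbb{R}^D} \Big[ \int_0^{+\infty} \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \Big] xx^T \, dx \\
&= \int_0^{+\infty} \Big[ \int_{x \in \mathbb{R}^D} xx^T \, \mathcal{N}\big( x \,|\, \mu, (\eta\Lambda)^{-1} \big) \, dx \Big] \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \\
&= \int_0^{+\infty} \big[ \mu\mu^T + (\eta\Lambda)^{-1} \big] \, \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \\
&= \mu\mu^T + \int_0^{+\infty} (\eta\Lambda)^{-1} \cdot \mathrm{Gam}\Big( \eta \,\Big|\, \frac{v}{2}, \frac{v}{2} \Big) \, d\eta \\
&= \mu\mu^T + \int_0^{+\infty} (\eta\Lambda)^{-1} \cdot \frac{1}{\Gamma(v/2)} \Big( \frac{v}{2} \Big)^{v/2} \eta^{v/2 - 1} \exp\Big( -\frac{v}{2} \eta \Big) \, d\eta \\
&= \mu\mu^T + \Lambda^{-1} \frac{1}{\Gamma(v/2)} \Big( \frac{v}{2} \Big)^{v/2} \int_0^{+\infty} \eta^{v/2 - 2} \exp\Big( -\frac{v}{2} \eta \Big) \, d\eta
\end{aligned}
$$

Here the inner integral over $x$ is the second moment of the Gaussian $\mathcal{N}(x | \mu, (\eta\Lambda)^{-1})$, which equals $\mu\mu^T + (\eta\Lambda)^{-1}$. If we denote $z = \frac{v\eta}{2}$, the equation above reduces to:

$$
\begin{aligned}
\mathbb{E}[xx^T] &= \mu\mu^T + \Lambda^{-1} \frac{1}{\Gamma(v/2)} \Big( \frac{v}{2} \Big)^{v/2} \int_0^{+\infty} \Big( \frac{2z}{v} \Big)^{v/2 - 2} \exp(-z) \, \frac{2}{v} \, dz \\
&= \mu\mu^T + \Lambda^{-1} \frac{1}{\Gamma(v/2)} \cdot \frac{v}{2} \int_0^{+\infty} z^{v/2 - 2} \exp(-z) \, dz \\
&= \mu\mu^T + \Lambda^{-1} \frac{\Gamma(v/2 - 1)}{\Gamma(v/2)} \cdot \frac{v}{2} \\
&= \mu\mu^T + \Lambda^{-1} \frac{1}{v/2 - 1} \cdot \frac{v}{2} \\
&= \mu\mu^T + \frac{v}{v - 2} \Lambda^{-1}
\end{aligned}
$$

where we have taken advantage of the property $\Gamma(x + 1) = x\Gamma(x)$ and assumed $v > 2$ so that the integral converges. Since
$\mathrm{cov}[x] = \mathbb{E}\big[ (x - \mathbb{E}[x])(x - \mathbb{E}[x])^T \big] = \mathbb{E}[xx^T] - \mathbb{E}[x]\mathbb{E}[x]^T$, together with $\mathbb{E}[x] = \mu$, we
obtain:

$$ \mathrm{cov}[x] = \frac{v}{v - 2} \Lambda^{-1} $$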
Problem 2.50 Solution

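The same steps as in Prob.2.47 can be used here.

$$
\begin{aligned}
\mathrm{St}(x | \mu, \Lambda, v) &\propto \Big[ 1 + \frac{\Delta^2}{v} \Big]^{-D/2 - v/2} \\
&= \exp\Big[ (-D/2 - v/2) \cdot \ln\Big( 1 + \frac{\Delta^2}{v} \Big) \Big] \\
&= \exp\Big[ -\frac{D + v}{2} \cdot \Big( \frac{\Delta^2}{v} + O(v^{-2}) \Big) \Big] \\
&\approx \exp\Big( -\frac{\Delta^2}{2} \Big) \qquad (v \to \infty)
\end{aligned}
$$

where we have used the Taylor expansion $\ln(1 + \epsilon) = \epsilon + O(\epsilon^2)$. And since
$\Delta^2 = (x - \mu)^T \Lambda (x - \mu)$, we see that this, up to an overall constant, is a Gaussian
distribution with mean $\mu$ and precision $\Lambda$.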
Problem 2.51 Solution
We first prove (2.177). Since we have $\exp(iA) \cdot \exp(-iA) = 1$ and $\exp(iA) = \cos A + i \sin A$, we obtain:
$$ (\cos A + i \sin A) \cdot (\cos A - i \sin A) = 1 $$
which gives $\cos^2 A + \sin^2 A = 1$. Then we prove (2.178) using the hint.

$$
\begin{aligned}
\cos(A - B) &= \Re[\exp(i(A - B))] \\
&= \Re[\exp(iA) / \exp(iB)] \\
&= \Re\Big[ \frac{\cos A + i \sin A}{\cos B + i \sin B} \Big] \\
&= \Re\Big[ \frac{(\cos A + i \sin A)(\cos B - i \sin B)}{(\cos B + i \sin B)(\cos B - i \sin B)} \Big] \\
&= \Re[(\cos A + i \sin A)(\cos B - i \sin B)] \\
&= \cos A \cos B + \sin A \sin B
\end{aligned}
$$

The proof of (2.183) is quite similar.

$$
\begin{aligned}
\sin(A - B) &= \Im[\exp(i(A - B))] \\
&= \Im[(\cos A + i \sin A)(\cos B - i \sin B)] \\
&= \sin A \cos B - \cos A \sin B
\end{aligned}
$$

Problem 2.52 Solution
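Let's follow the hint. We first derive an approximation for $\exp[m \cos(\theta - \theta_0)]$.

$$
\begin{aligned}
\exp\{ m \cos(\theta - \theta_0) \} &= \exp\Big\{ m \Big[ 1 - \frac{(\theta - \theta_0)^2}{2} + O((\theta - \theta_0)^4) \Big] \Big\} \\
&= \exp\Big\{ m - m \frac{(\theta - \theta_0)^2}{2} - m \, O((\theta - \theta_0)^4) \Big\} \\
&= \exp(m) \cdot \exp\Big\{ -m \frac{(\theta - \theta_0)^2}{2} \Big\} \cdot \exp\big\{ -m \, O((\theta - \theta_0)^4) \big\}
\end{aligned}
$$

The same holds for $\exp(m \cos\theta)$:

$$ \exp\{ m \cos\theta \} = \exp(m) \cdot \exp\Big( -m \frac{\theta^2}{2} \Big) \cdot \exp\big\{ -m \, O(\theta^4) \big\} $$

Now we rearrange (2.179):

$$
\begin{aligned}
p(\theta | \theta_0, m) &= \frac{1}{2\pi I_0(m)} \exp\{ m \cos(\theta - \theta_0) \} \\
&= \frac{1}{\int_0^{2\pi} \exp\{ m \cos\theta \} \, d\theta} \exp\{ m \cos(\theta - \theta_0) \} \\
&= \frac{\exp(m) \cdot \exp\big\{ -m \frac{(\theta - \theta_0)^2}{2} \big\} \cdot \exp\big\{ -m \, O((\theta - \theta_0)^4) \big\}}{\int_0^{2\pi} \exp(m) \cdot \exp\big( -m \frac{\theta^2}{2} \big) \cdot \exp\big\{ -m \, O(\theta^4) \big\} \, d\theta} \\
&= \frac{1}{\int_0^{2\pi} \exp\big( -m \frac{\theta^2}{2} \big) \, d\theta} \exp\Big\{ -m \frac{(\theta - \theta_0)^2}{2} \Big\}
\end{aligned}
$$

where we have taken advantage of the following fact:

$$ \exp\big\{ -m \, O((\theta - \theta_0)^4) \big\} \approx \exp\big\{ -m \, O(\theta^4) \big\} \qquad (\text{when } m \to \infty) $$

Therefore, when $m \to \infty$, (2.179) reduces to a
Gaussian distribution with mean $\theta_0$ and precision $m$.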
Problem 2.53 Solution
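Let's rearrange (2.182) according to (2.183).

$$
\begin{aligned}
\sum_{n=1}^{N} \sin(\theta_n - \theta_0) &= \sum_{n=1}^{N} (\sin\theta_n \cos\theta_0 - \cos\theta_n \sin\theta_0) \\
&= \cos\theta_0 \sum_{n=1}^{N} \sin\theta_n - \sin\theta_0 \sum_{n=1}^{N} \cos\theta_n
\end{aligned}
$$

where we have used (2.183). Together with (2.182), we obtain:

$$ \cos\theta_0 \sum_{n=1}^{N} \sin\theta_n - \sin\theta_0 \sum_{n=1}^{N} \cos\theta_n = 0 $$

which gives:

$$ \theta_0^{ML} = \tan^{-1} \Big\{ \frac{\sum_n \sin\theta_n}{\sum_n \cos\theta_n} \Big\} $$
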
Problem 2.54 Solution
We calculate the first and second derivatives of (2.179) with respect to $\theta$.

$$ p(\theta | \theta_0, m)' = \frac{1}{2\pi I_0(m)} \big[ -m \sin(\theta - \theta_0) \big] \exp\{ m \cos(\theta - \theta_0) \} $$

$$ p(\theta | \theta_0, m)'' = \frac{1}{2\pi I_0(m)} \big[ -m \cos(\theta - \theta_0) + (-m \sin(\theta - \theta_0))^2 \big] \exp\{ m \cos(\theta - \theta_0) \} $$

If we set $p(\theta | \theta_0, m)'$ equal to 0, we obtain its roots:

$$ \theta = \theta_0 + k\pi \qquad (k \in \mathbb{Z}) $$

When $k \equiv 0 \pmod 2$, i.e. $\theta \equiv \theta_0 \pmod{2\pi}$, we have:

$$ p(\theta | \theta_0, m)'' = \frac{-m \exp(m)}{2\pi I_0(m)} < 0 $$

Therefore, when $\theta = \theta_0$, (2.179) attains its maximum. And when $k \equiv 1 \pmod 2$, i.e. $\theta \equiv \theta_0 + \pi \pmod{2\pi}$, we have:

$$ p(\theta | \theta_0, m)'' = \frac{m \exp(-m)}{2\pi I_0(m)} > 0 $$

Therefore, when $\theta = \theta_0 + \pi \pmod{2\pi}$, (2.179) attains its minimum.
Problem 2.55 Solution
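According to (2.185), we have:

$$ A(m_{ML}) = \frac{1}{N} \sum_{n=1}^{N} \cos(\theta_n - \theta_0^{ML}) $$

By using (2.178), we can write:

$$
\begin{aligned}
A(m_{ML}) &= \frac{1}{N} \sum_{n=1}^{N} \cos(\theta_n - \theta_0^{ML}) \\
&= \frac{1}{N} \sum_{n=1}^{N} \big( \cos\theta_n \cos\theta_0^{ML} + \sin\theta_n \sin\theta_0^{ML} \big) \\
&= \Big( \frac{1}{N} \sum_{n=1}^{N} \cos\theta_n \Big) \cos\theta_0^{ML} + \Big( \frac{1}{N} \sum_{n=1}^{N} \sin\theta_n \Big) \sin\theta_0^{ML}
\end{aligned}
$$

By using (2.168), we can further derive:

$$
\begin{aligned}
A(m_{ML}) &= \Big( \frac{1}{N} \sum_{n=1}^{N} \cos\theta_n \Big) \cos\theta_0^{ML} + \Big( \frac{1}{N} \sum_{n=1}^{N} \sin\theta_n \Big) \sin\theta_0^{ML} \\
&= \bar{r} \cos\bar{\theta} \cdot \cos\theta_0^{ML} + \bar{r} \sin\bar{\theta} \cdot \sin\theta_0^{ML} \\
&= \bar{r} \cos(\bar{\theta} - \theta_0^{ML})
\end{aligned}
$$

And then by using (2.169) and (2.184), we see that $\bar{\theta} = \theta_0^{ML}$, and
hence $A(m_{ML}) = \bar{r}$.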
Problem 2.56 Solution
Recall that the distributions belonging to the exponential family have the
form:
$$ p(x | \eta) = h(x) \, g(\eta) \exp(\eta^T u(x)) $$
According to (2.13), the beta distribution can be written as:

$$
\begin{aligned}
\mathrm{Beta}(x | a, b) &= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} x^{a-1} (1 - x)^{b-1} \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \exp\big[ (a - 1) \ln x + (b - 1) \ln(1 - x) \big] \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \, \frac{\exp\big[ a \ln x + b \ln(1 - x) \big]}{x (1 - x)}
\end{aligned}
$$

Comparing it with the standard form of the exponential family, we obtain:

$$
\begin{cases}
\eta = [a, b]^T \\
u(x) = [\ln x, \ln(1 - x)]^T \\
g(\eta) = \Gamma(\eta_1 + \eta_2) \big/ \big[ \Gamma(\eta_1) \Gamma(\eta_2) \big] \\
h(x) = 1 \big/ \big( x (1 - x) \big)
\end{cases}
$$

where $\eta_1$ denotes the first element of $\eta$, i.e. $\eta_1 = a$, and $\eta_2$ denotes the
second element of $\eta$, i.e. $\eta_2 = b$. According to (2.146), the Gamma distribution
can be written as:

$$ \mathrm{Gam}(x | a, b) = \frac{1}{\Gamma(a)} b^a x^{a-1} \exp(-bx) = \frac{1}{x} \, \frac{b^a}{\Gamma(a)} \exp(a \ln x - bx) $$

Comparing it with the standard form of the exponential family, and keeping $h(x)$ free of the parameters, we obtain:

$$
\begin{cases}
\eta = [a, b]^T \\
u(x) = [\ln x, -x]^T \\
g(\eta) = \eta_2^{\eta_1} \big/ \Gamma(\eta_1) \\
h(x) = 1/x
\end{cases}
$$
According to (2.179), the von Mises distribution can be written as:

$$
\begin{aligned}
p(x | \theta_0, m) &= \frac{1}{2\pi I_0(m)} \exp(m \cos(x - \theta_0)) \\
&= \frac{1}{2\pi I_0(m)} \exp\big[ m (\cos x \cos\theta_0 + \sin x \sin\theta_0) \big]
\end{aligned}
$$

Comparing it with the standard form of the exponential family, we obtain:

$$
\begin{cases}
\eta = [m \cos\theta_0, m \sin\theta_0]^T \\
u(x) = [\cos x, \sin x]^T \\
g(\eta) = 1 \big/ \big( 2\pi I_0(\sqrt{\eta_1^2 + \eta_2^2}) \big) \\
h(x) = 1
\end{cases}
$$

Note: a given distribution can be written in exponential family form in
several ways, with different natural parameters.
Problem 2.57 Solution
Recall that the distributions belonging to the exponential family have the
form:
p( x|η) = h( x) g(η) exp(ηT u( x))
And the multivariate Gaussian Distribution has the form:
}
{
1
1
1
T −1
Σ
(
x
−
µ
)
N ( x|µ, Σ) =
exp
−
(
x
−
µ
)
2
(2π)D /2 |Σ|1 /2
We expand the exponential term with respect to µ.
{
}
1
1 T −1
1
T −1
−1
exp
−
N ( x|µ, Σ) =
(
x
Σ
x
−
2
µ
Σ
x
+
µ
Σ
µ
)
2
(2π)D /2 |Σ|1/2
{
}
{
}
1
1
1 T −1
1
T −1
−1
=
exp
−
x
Σ
x
+
µ
Σ
x
exp
−
µ
Σ
µ
)
2
2
(2π)D /2 |Σ|1/2
Comparing it with the standard form of exponential family, we can obtain:

η = [Σ−1 µ, − 12 vec(Σ−1 ) ]T




 u( x) = [ x, vec( xxT ) ]

g(η) = exp( 14 η 1 T η 2 −1 η 1 ) + | − 2η 2 |1/2




h( x) = (2π)−D /2

Where we have used η1 to denote the first block of η, and η2 to denote
the second block of η; inside g(η), η2 is understood in its matrix form, i.e. η2 = −(1/2)Σ^{-1}. We also take advantage of the vectorizing operator, i.e. vec(·). The vectorization of a matrix is a linear transformation which
converts the matrix into a column vector by stacking its columns. This can be seen in an example:
\[
A = \begin{bmatrix} a & b \\ c & d \end{bmatrix}
\quad \Rightarrow \quad
\mathrm{vec}(A) = [a, c, b, d]^T
\]
Note: By introducing vectorizing operator, we actually have vec(Σ−1 ) ·
vec( xxT ) = xT Σ−1 x
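
The identity in the note above can be checked numerically. The following is a minimal sketch in plain Python; the helper names `vec`, `outer` and `quad_form`, and the matrix `P` standing in for Σ^{-1}, are illustrative choices and not part of the original solution.

```python
def vec(matrix):
    """Stack the columns of a matrix (given as a list of rows) into a flat list."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]

def outer(x, y):
    """Outer product x y^T as a list of rows."""
    return [[xi * yj for yj in y] for xi in x]

def quad_form(x, P):
    """Quadratic form x^T P x."""
    n = len(x)
    return sum(x[i] * P[i][j] * x[j] for i in range(n) for j in range(n))

# P is a hypothetical symmetric matrix playing the role of Sigma^{-1}.
P = [[2.0, 0.5], [0.5, 1.0]]
x = [1.5, -2.0]

lhs = sum(a * b for a, b in zip(vec(P), vec(outer(x, x))))  # vec(P) . vec(x x^T)
rhs = quad_form(x, P)                                       # x^T P x
print(vec([[1, 2], [3, 4]]), lhs, rhs)
```

Both sides agree because the dot product of two vectorized matrices is the sum of the element-wise products, which is exactly the double sum in the quadratic form.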
Problem 2.58 Solution (Wait for updating)

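Based on (2.226), we rewrite the expression for ∇g(η):
\[
\nabla g(\eta) = -g(\eta)\, \mathbb{E}[u(x)]
\]
Then we calculate the derivative of both sides of the equation above
with respect to η:
\[
\nabla\nabla g(\eta) = -\Big[ \nabla g(\eta)\, \mathbb{E}[u(x)^T] + g(\eta)\, \nabla \mathbb{E}[u(x)^T] \Big]
\]
If we multiply both sides by −1/g(η), we can obtain:
\[
-\frac{1}{g(\eta)} \nabla\nabla g(\eta) = \nabla \ln g(\eta)\, \mathbb{E}[u(x)^T] + \nabla \mathbb{E}[u(x)^T]
\]
According to (2.225), we calculate ∇E[u(x)^T]:
\[
\begin{aligned}
\nabla \mathbb{E}[u(x)^T] &= \nabla g(\eta) \int h(x) \exp\{\eta^T u(x)\}\, u(x)^T dx + g(\eta) \int h(x) \exp\{\eta^T u(x)\}\, u(x) u(x)^T dx \\
&= \nabla \ln g(\eta)\, \mathbb{E}[u(x)^T] + \mathbb{E}[u(x) u(x)^T]
\end{aligned}
\]
Therefore, we obtain:
\[
-\frac{1}{g(\eta)} \nabla\nabla g(\eta) = 2 \nabla \ln g(\eta)\, \mathbb{E}[u(x)^T] + \mathbb{E}[u(x) u(x)^T] = -2\, \mathbb{E}[u(x)]\mathbb{E}[u(x)^T] + \mathbb{E}[u(x) u(x)^T]
\]
Finally, since ∇∇ ln g(η) = (1/g(η)) ∇∇g(η) − ∇ ln g(η) ∇ ln g(η)^T and ∇ ln g(η) = −E[u(x)], we have:
\[
-\nabla\nabla \ln g(\eta) = \mathbb{E}[u(x) u(x)^T] - \mathbb{E}[u(x)]\mathbb{E}[u(x)^T] = \mathrm{cov}[u(x)]
\]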

Problem 2.59 Solution
It is straightforward.
\[
\int p(x|\sigma)\, dx = \int \frac{1}{\sigma} f\Big(\frac{x}{\sigma}\Big) dx = \int \frac{1}{\sigma} f(u)\, \sigma\, du = \int f(u)\, du = 1
\]

Where we have denoted u = x/σ.
Problem 2.60 Solution
Firstly, we write down the log likelihood function.
\[
\sum_{n=1}^{N} \ln p(x_n) = \sum_{i=1}^{M} n_i \ln(h_i)
\]

Some details should be explained here. If xn falls into region ∆i, then
p(xn) equals h_i, and since we are given that, among
all the N observations, n_i samples fall into region ∆i, we can
write down the log likelihood function as in the equation above. Note that we use
M to denote the number of different regions. Therefore, an implicit equation
should hold:
\[
\sum_{i=1}^{M} n_i = N
\]

We now need to take account of the constraint that p(x) must integrate
to unity, which can be written as \sum_{j=1}^{M} h_j ∆_j = 1. We introduce a Lagrange
multiplier into the expression, and then we need to maximize:
\[
\sum_{i=1}^{M} n_i \ln(h_i) + \lambda \Big( \sum_{j=1}^{M} h_j \Delta_j - 1 \Big)
\]
We calculate its derivative with respect to h_i and set it equal to 0:
\[
\frac{n_i}{h_i} + \lambda \Delta_i = 0
\]
Multiplying both sides by h_i, summing over i and then using the constraint, we can obtain:
\[
N + \lambda = 0
\]
In other words, λ = −N. Then we substitute the result back into the stationarity condition above, which gives:
\[
h_i = \frac{n_i}{N} \frac{1}{\Delta_i}
\]
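
As a small illustration of this result, the sketch below computes the bin heights n_i / (N ∆_i) for a toy data set and confirms that the resulting piecewise-constant density integrates to one. The function name `histogram_mle`, the data and the bin edges are made up for the example.

```python
def histogram_mle(data, edges):
    """Maximum likelihood bin heights h_i = n_i / (N * Delta_i)."""
    n = len(data)
    heights = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        count = sum(1 for x in data if lo <= x < hi)  # n_i
        heights.append(count / (n * (hi - lo)))       # n_i / (N * Delta_i)
    return heights

# Toy data and unequal bin widths (hypothetical values).
data = [0.1, 0.2, 0.25, 0.6, 0.7, 1.5, 1.9, 2.5]
edges = [0.0, 0.5, 1.0, 3.0]

heights = histogram_mle(data, edges)
total = sum(h * (hi - lo) for h, lo, hi in zip(heights, edges[:-1], edges[1:]))
print(heights, total)
```

Note that the bins need not have equal width: the factor 1/∆_i compensates for the width of each region.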
Problem 2.61 Solution
It is straightforward. In K nearest neighbours (KNN), when we want to
estimate the probability density at a point x_i, we consider a small sphere
centered on x_i and allow its radius to grow until it contains K data
points; p(x_i) is then estimated as K/(N V_i), where N is the total number of observations
and V_i is the volume of the sphere centered on x_i. If we assume that V_i is
small enough that p(x) is roughly constant inside it, we can approximate
the integral as:
\[
\int p(x)\, dx \approx \sum_{i=1}^{N} p(x_i) \cdot V_i = \sum_{i=1}^{N} \frac{K}{N V_i} \cdot V_i = K \neq 1
\]
This heuristic already suggests that the estimate is not normalized for K > 1. The case K = 1 is not normalized either: with a single data point at the origin in one dimension (N = K = 1), the sphere centered on x that reaches the data point has volume V = 2|x|, so p(x) = 1/(2|x|), whose integral over the real line diverges. Hence the KNN density model is improper.

0.3 Linear Models for Regression
Problem 3.1 Solution
Based on (3.6), we can write :
\[
2\sigma(2a) - 1 = \frac{2}{1 + \exp(-2a)} - 1 = \frac{1 - \exp(-2a)}{1 + \exp(-2a)} = \frac{\exp(a) - \exp(-a)}{\exp(a) + \exp(-a)}
\]

Which is exactly tanh(a). Then we will find the relation between the parameters u_i
of (3.102) and w_i of (3.101). Let’s start from (3.101):
\[
\begin{aligned}
y(x, w) &= w_0 + \sum_{j=1}^{M} w_j \, \sigma\Big( \frac{x - \mu_j}{s} \Big) \\
&= w_0 + \sum_{j=1}^{M} w_j \, \frac{\tanh\big( \frac{x - \mu_j}{2s} \big) + 1}{2} \\
&= w_0 + \frac{1}{2} \sum_{j=1}^{M} w_j + \sum_{j=1}^{M} \frac{w_j}{2} \tanh\Big( \frac{x - \mu_j}{2s} \Big)
\end{aligned}
\]
Hence the relation is given by:
\[
u_0 = w_0 + \frac{1}{2} \sum_{j=1}^{M} w_j \qquad \text{and} \qquad u_j = \frac{w_j}{2}
\]
Note: there is a typo in (3.102), the denominator should be 2s instead of
s, or alternatively you can view it as a new s′, which equals 2s.
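
The reparameterization can be sanity-checked numerically. The sketch below is an illustration only: the function names, the weights, the centres `mu` and the scale `s` are arbitrary choices, and `u0`, `u` are the tanh-model parameters built as a shifted bias and halved weights.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def y_sigmoid(x, w0, w, mu, s):
    """Model with logistic sigmoid basis functions, scale s."""
    return w0 + sum(wj * sigmoid((x - mj) / s) for wj, mj in zip(w, mu))

def y_tanh(x, u0, u, mu, s):
    """Equivalent model with tanh basis functions, scale 2s."""
    return u0 + sum(uj * math.tanh((x - mj) / (2.0 * s)) for uj, mj in zip(u, mu))

# Arbitrary illustrative parameters.
w0, w, mu, s = 0.3, [1.2, -0.7, 2.0], [-1.0, 0.0, 1.5], 0.8
u0 = w0 + 0.5 * sum(w)        # bias absorbs half of every sigmoid weight
u = [wj / 2.0 for wj in w]    # tanh weights are half the sigmoid weights

max_gap = max(abs(y_sigmoid(x, w0, w, mu, s) - y_tanh(x, u0, u, mu, s))
              for x in [-3.0, -1.0, 0.0, 0.5, 2.0, 4.0])
print(max_gap)
```

The two models agree up to floating-point rounding at every test point, which reflects the identity 2σ(2a) − 1 = tanh(a).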
Problem 3.2 Solution
We first need to show that Φ^TΦ is invertible. Suppose, for the sake of
contradiction, that c is a nonzero vector in the kernel (null space) of Φ^TΦ. Then
Φ^TΦc equals 0 and so we have:
\[
0 = c^T \Phi^T \Phi c = (\Phi c)^T \Phi c = \|\Phi c\|^2
\]
The equation above shows that Φc = 0. However, Φc = c_1ϕ_1 + c_2ϕ_2 +
... + c_Mϕ_M, and the columns {ϕ_1, ϕ_2, ..., ϕ_M} of Φ are assumed to be linearly independent, i.e. they form a basis of the column space of Φ. There is no linear relation
between the ϕ_i and therefore we cannot have c_1ϕ_1 + c_2ϕ_2 + ... + c_Mϕ_M = 0 with c ≠ 0.
This is the contradiction. Hence Φ^TΦ is invertible. Then let’s first prove two
specific cases.
Case 1: w_1 is in the column space of Φ. In this case, we have Φc = w_1 for some c. So we
have:
\[
\Phi(\Phi^T\Phi)^{-1}\Phi^T w_1 = \Phi(\Phi^T\Phi)^{-1}\Phi^T\Phi c = \Phi c = w_1
\]
Case 2: w_2 is in Φ⊥, where Φ⊥ denotes the orthogonal complement of the column space of Φ. Then we have Φ^T w_2 = 0, which leads to:
\[
\Phi(\Phi^T\Phi)^{-1}\Phi^T w_2 = 0
\]
Recall that any vector w ∈ R^N can be written as the sum of two
vectors w_1 and w_2, where w_1 is in the column space of Φ and w_2 ∈ Φ⊥. And so we have:
\[
\Phi(\Phi^T\Phi)^{-1}\Phi^T w = \Phi(\Phi^T\Phi)^{-1}\Phi^T (w_1 + w_2) = w_1
\]
Which is exactly what orthogonal projection is supposed to do.
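
A small numerical illustration of the projection property follows. It is a sketch in plain Python with a hypothetical 3 × 2 design matrix; the helpers `matmul`, `transpose` and `inv2` are written only for this example (`inv2` handles 2 × 2 matrices only).

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical design matrix with linearly independent columns (N = 3, M = 2).
Phi = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
PhiT = transpose(Phi)
P = matmul(matmul(Phi, inv2(matmul(PhiT, Phi))), PhiT)  # Phi (Phi^T Phi)^{-1} Phi^T

v = [[1.0], [0.0], [2.0]]                # arbitrary vector in R^3
proj = matmul(P, v)                      # its projection onto the column space
residual = [[vi[0] - pi[0]] for vi, pi in zip(v, proj)]
orth = matmul(PhiT, residual)            # should vanish: residual is orthogonal to the columns
PP = matmul(P, P)                        # should equal P: projecting twice changes nothing
print(proj, orth)
```

The residual being orthogonal to every column of Φ, together with P·P = P, is what characterizes an orthogonal projection onto the column space.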
Problem 3.3 Solution
Let’s calculate the derivative of (3.104) with respect to w.
\[
\nabla E_D(w) = -\sum_{n=1}^{N} r_n \big\{ t_n - w^T \phi(x_n) \big\} \phi(x_n)^T
\]
We set the derivative equal to 0:
\[
0 = \sum_{n=1}^{N} r_n t_n \phi(x_n)^T - w^T \Big( \sum_{n=1}^{N} r_n \phi(x_n) \phi(x_n)^T \Big)
\]
If we denote \sqrt{r_n}\,\phi(x_n) = \phi'(x_n) and \sqrt{r_n}\, t_n = t'_n, we can obtain:
\[
0 = \sum_{n=1}^{N} t'_n \phi'(x_n)^T - w^T \Big( \sum_{n=1}^{N} \phi'(x_n) \phi'(x_n)^T \Big)
\]
Taking advantage of (3.11) – (3.17), we can derive a similar result, i.e.
w_ML = (Φ^TΦ)^{-1}Φ^T t. But here, we define t as:
\[
t = \big[ \sqrt{r_1}\, t_1, \; \sqrt{r_2}\, t_2, \; \ldots, \; \sqrt{r_N}\, t_N \big]^T
\]
We also define Φ as an N × M matrix, with element Φ(i, j) = \sqrt{r_i}\, ϕ_j(x_i).
Problem 3.4 Solution
Firstly, we rearrange E D (w).
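\[
\begin{aligned}
E_D(w) &= \frac{1}{2} \sum_{n=1}^{N} \Big\{ \Big[ w_0 + \sum_{i=1}^{D} w_i (x_i + \epsilon_i) \Big] - t_n \Big\}^2 \\
&= \frac{1}{2} \sum_{n=1}^{N} \Big\{ \Big( w_0 + \sum_{i=1}^{D} w_i x_i \Big) - t_n + \sum_{i=1}^{D} w_i \epsilon_i \Big\}^2 \\
&= \frac{1}{2} \sum_{n=1}^{N} \Big\{ y(x_n, w) - t_n + \sum_{i=1}^{D} w_i \epsilon_i \Big\}^2 \\
&= \frac{1}{2} \sum_{n=1}^{N} \Big\{ \big( y(x_n, w) - t_n \big)^2 + \Big( \sum_{i=1}^{D} w_i \epsilon_i \Big)^2 + 2 \Big( \sum_{i=1}^{D} w_i \epsilon_i \Big) \big( y(x_n, w) - t_n \big) \Big\}
\end{aligned}
\]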
Where we have used y( xn , w) to denote the output of the linear model
when input variable is xn , without noise added. For the second term in the
equation above, we can obtain :
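\[
\mathbb{E}_\epsilon\Big[ \Big( \sum_{i=1}^{D} w_i \epsilon_i \Big)^2 \Big]
= \mathbb{E}_\epsilon\Big[ \sum_{i=1}^{D} \sum_{j=1}^{D} w_i w_j \epsilon_i \epsilon_j \Big]
= \sum_{i=1}^{D} \sum_{j=1}^{D} w_i w_j \mathbb{E}_\epsilon[\epsilon_i \epsilon_j]
= \sigma^2 \sum_{i=1}^{D} \sum_{j=1}^{D} w_i w_j \delta_{ij}
\]
Which gives
\[
\mathbb{E}_\epsilon\Big[ \Big( \sum_{i=1}^{D} w_i \epsilon_i \Big)^2 \Big] = \sigma^2 \sum_{i=1}^{D} w_i^2
\]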

For the third term, we can obtain:
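\[
\begin{aligned}
\mathbb{E}_\epsilon\Big[ 2 \Big( \sum_{i=1}^{D} w_i \epsilon_i \Big) \big( y(x_n, w) - t_n \big) \Big]
&= 2 \big( y(x_n, w) - t_n \big)\, \mathbb{E}_\epsilon\Big[ \sum_{i=1}^{D} w_i \epsilon_i \Big] \\
&= 2 \big( y(x_n, w) - t_n \big) \sum_{i=1}^{D} \mathbb{E}_\epsilon[ w_i \epsilon_i ] \\
&= 0
\end{aligned}
\]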

Therefore, if we calculate the expectation of E D (w) with respect to ϵ, we
can obtain:
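\[
\mathbb{E}_\epsilon[E_D(w)] = \frac{1}{2} \sum_{n=1}^{N} \big( y(x_n, w) - t_n \big)^2 + \frac{N \sigma^2}{2} \sum_{i=1}^{D} w_i^2
\]
The factor N appears because the noise term contributes σ² Σ_i w_i² once for every data point. The bias w_0 does not appear in the penalty.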
Problem 3.5 Solution
We can firstly rewrite the constraint (3.30) as :
\[
\frac{1}{2} \Big( \sum_{j=1}^{M} |w_j|^q - \eta \Big) \leq 0
\]
Where we deliberately introduce the scaling factor 1/2 for convenience. Then
it is straightforward to obtain the Lagrange function:
\[
L(w, \lambda) = \frac{1}{2} \sum_{n=1}^{N} \big\{ t_n - w^T \phi(x_n) \big\}^2 + \frac{\lambda}{2} \Big( \sum_{j=1}^{M} |w_j|^q - \eta \Big)
\]
It is obvious that L(w, λ) and (3.29) have the same dependence on w. Meanwhile, if we denote the optimal w that minimizes L(w, λ) as w⋆(λ), the constraint is active at the optimum for λ > 0, and we can
see that
\[
\eta = \sum_{j=1}^{M} |w^{\star}_j(\lambda)|^q
\]

Problem 3.6 Solution
Firstly, we write down the log likelihood function.

\[
\ln p(T | X, W, \Sigma) = -\frac{N}{2} \ln|\Sigma| - \frac{1}{2} \sum_{n=1}^{N} \big[ t_n - W^T \phi(x_n) \big]^T \Sigma^{-1} \big[ t_n - W^T \phi(x_n) \big]
\]
Where we have already omitted the constant term. We set the derivative
of the equation above with respect to W equal to zero:
\[
0 = -\sum_{n=1}^{N} \Sigma^{-1} \big[ t_n - W^T \phi(x_n) \big] \phi(x_n)^T
\]

Therefore, we can obtain a similar result for W as (3.15). For Σ, comparing
with (2.118) – (2.124), we can easily write down a similar result:
\[
\Sigma = \frac{1}{N} \sum_{n=1}^{N} \big[ t_n - W_{ML}^T \phi(x_n) \big] \big[ t_n - W_{ML}^T \phi(x_n) \big]^T
\]

We can see that the solutions for W and Σ are also decoupled.
Problem 3.7 Solution
Let’s begin by writing down the prior distribution p(w) and likelihood
function p( t| X , w, β).

\[
p(w) = \mathcal{N}(w | m_0, S_0), \qquad
p(\mathbf{t} | X, w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1})
\]

Since the posterior PDF equals the product of the prior PDF and the likelihood function, up to a normalizing constant, we mainly focus on the exponent of the product:
\[
\begin{aligned}
\text{exponent} &= -\frac{\beta}{2} \sum_{n=1}^{N} \big\{ t_n - w^T \phi(x_n) \big\}^2 - \frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) \\
&= -\frac{\beta}{2} \sum_{n=1}^{N} \big\{ t_n^2 - 2 t_n w^T \phi(x_n) + w^T \phi(x_n) \phi(x_n)^T w \big\} - \frac{1}{2} (w - m_0)^T S_0^{-1} (w - m_0) \\
&= -\frac{1}{2} w^T \Big[ \sum_{n=1}^{N} \beta \phi(x_n) \phi(x_n)^T + S_0^{-1} \Big] w
 - \frac{1}{2} \Big[ -2 m_0^T S_0^{-1} - \sum_{n=1}^{N} 2 \beta t_n \phi(x_n)^T \Big] w + \text{const}
\end{aligned}
\]
Hence, by comparing the quadratic term with the standard Gaussian distribution, we can obtain: S_N^{-1} = S_0^{-1} + βΦ^TΦ. And then comparing the linear
term, we can obtain:
\[
-2 m_N^T S_N^{-1} = -2 m_0^T S_0^{-1} - \sum_{n=1}^{N} 2 \beta t_n \phi(x_n)^T
\]
If we multiply both sides by −0.5, and then transpose both sides, we can
easily see that m_N = S_N (S_0^{-1} m_0 + βΦ^T t).
Problem 3.8 Solution
Firstly, we write down the prior :

\[
p(w) = \mathcal{N}(w | m_N, S_N)
\]
Where m_N, S_N are given by (3.50) and (3.51). And if now we observe
another sample (x_{N+1}, t_{N+1}), we can write down the likelihood function:
\[
p(t_{N+1} | x_{N+1}, w) = \mathcal{N}(t_{N+1} | y(x_{N+1}, w), \beta^{-1})
\]
Since the posterior equals the product of the likelihood function and the
prior, up to a constant, we focus on the exponent, which is −1/2 times the following expression:
\[
\begin{aligned}
& (w - m_N)^T S_N^{-1} (w - m_N) + \beta \big( t_{N+1} - w^T \phi(x_{N+1}) \big)^2 \\
&= w^T \big[ S_N^{-1} + \beta \phi(x_{N+1}) \phi(x_{N+1})^T \big] w
 - 2 w^T \big[ S_N^{-1} m_N + \beta \phi(x_{N+1}) t_{N+1} \big] + \text{const}
\end{aligned}
\]
Therefore, after observing (x_{N+1}, t_{N+1}), we have p(w) = N(w | m_{N+1}, S_{N+1}),
where we have defined:
\[
S_{N+1}^{-1} = S_N^{-1} + \beta \phi(x_{N+1}) \phi(x_{N+1})^T
\]
And
\[
m_{N+1} = S_{N+1} \big( S_N^{-1} m_N + \beta \phi(x_{N+1}) t_{N+1} \big)
\]

Problem 3.9 Solution
We know that the prior p(w) can be written as:

p ( w) = N ( m N , S N )
And the likelihood function p( t N +1 | x N +1 , w) can be written as:

p( t N +1 | x N +1 , w) = N ( t N +1 | y( x N +1 , w), β−1 )
According to the fact that y( x N +1 , w) = wT ϕ( x N +1 ) = ϕ( x N +1 )T w, the
likelihood can be further written as:

\[
p(t_{N+1} | x_{N+1}, w) = \mathcal{N}(t_{N+1} | \phi(x_{N+1})^T w, \beta^{-1})
\]
Then we take advantage of (2.113), (2.114) and (2.116), which gives:
\[
p(w | x_{N+1}, t_{N+1}) = \mathcal{N}\big( w \,\big|\, \Sigma \big\{ \phi(x_{N+1}) \beta t_{N+1} + S_N^{-1} m_N \big\}, \; \Sigma \big)
\]

Where Σ = (S N −1 + ϕ( x N +1 )βϕ( x N +1 )T )−1 , and we can see that the result
is exactly the same as the one we obtained in the previous problem.
Problem 3.10 Solution
We already know that:
\[
p(t | w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1})
\]
And
\[
p(w | \mathbf{t}, \alpha, \beta) = \mathcal{N}(w | m_N, S_N)
\]
Where m_N, S_N are given by (3.53) and (3.54). As in the previous
problem, we can rewrite p(t|w, β) as:
\[
p(t | w, \beta) = \mathcal{N}(t | \phi(x)^T w, \beta^{-1})
\]
And then, taking advantage of (2.113), (2.114) and (2.115), we can obtain:
\[
p(t | \mathbf{t}, \alpha, \beta) = \mathcal{N}\big( t \,\big|\, \phi(x)^T m_N, \; \beta^{-1} + \phi(x)^T S_N \phi(x) \big)
\]
Which is exactly the same as (3.58), if we notice that
\[
\phi(x)^T m_N = m_N^T \phi(x)
\]

Problem 3.11 Solution
We need to use the result obtained in Prob.3.8. In Prob.3.8, we have derived a formula for S_{N+1}^{-1}:
\[
S_{N+1}^{-1} = S_N^{-1} + \beta \phi(x_{N+1}) \phi(x_{N+1})^T
\]

And then using (3.110), we can obtain :

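\[
\begin{aligned}
S_{N+1} &= \big[ S_N^{-1} + \beta \phi(x_{N+1}) \phi(x_{N+1})^T \big]^{-1} \\
&= \big[ S_N^{-1} + \sqrt{\beta} \phi(x_{N+1}) \sqrt{\beta} \phi(x_{N+1})^T \big]^{-1} \\
&= S_N - \frac{ S_N \big( \sqrt{\beta} \phi(x_{N+1}) \big) \big( \sqrt{\beta} \phi(x_{N+1})^T \big) S_N }{ 1 + \big( \sqrt{\beta} \phi(x_{N+1})^T \big) S_N \big( \sqrt{\beta} \phi(x_{N+1}) \big) } \\
&= S_N - \frac{ \beta S_N \phi(x_{N+1}) \phi(x_{N+1})^T S_N }{ 1 + \beta \phi(x_{N+1})^T S_N \phi(x_{N+1}) }
\end{aligned}
\]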

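Now we calculate σ_N^2(x) − σ_{N+1}^2(x) according to (3.59):
\[
\begin{aligned}
\sigma_N^2(x) - \sigma_{N+1}^2(x) &= \phi(x)^T (S_N - S_{N+1}) \phi(x) \\
&= \phi(x)^T \frac{ \beta S_N \phi(x_{N+1}) \phi(x_{N+1})^T S_N }{ 1 + \beta \phi(x_{N+1})^T S_N \phi(x_{N+1}) } \phi(x) \\
&= \frac{ \phi(x)^T S_N \phi(x_{N+1}) \phi(x_{N+1})^T S_N \phi(x) }{ 1/\beta + \phi(x_{N+1})^T S_N \phi(x_{N+1}) } \\
&= \frac{ \big[ \phi(x)^T S_N \phi(x_{N+1}) \big]^2 }{ 1/\beta + \phi(x_{N+1})^T S_N \phi(x_{N+1}) } \qquad (*)
\end{aligned}
\]
The numerator of (∗) is a square, and since S_N is positive definite and β > 0, the denominator is positive. Hence (∗) is non-negative, and we have
proved that σ_N^2(x) − σ_{N+1}^2(x) ≥ 0.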
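
The rank-one update and the resulting drop in predictive variance can be checked on a small example. This is a sketch under assumed values: the 2 × 2 matrix `S_N`, the precision `beta` and the feature vectors are arbitrary, and the helper names are illustrative.

```python
def matvec(S, v):
    return [sum(sij * vj for sij, vj in zip(row, v)) for row in S]

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def inv2(M):
    """Inverse of a 2 x 2 matrix."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def rank_one_update(S, phi, beta):
    """S_{N+1} = S_N - beta S phi phi^T S / (1 + beta phi^T S phi), for symmetric S."""
    s_phi = matvec(S, phi)
    denom = 1.0 + beta * dot(phi, s_phi)
    n = len(S)
    return [[S[i][j] - beta * s_phi[i] * s_phi[j] / denom for j in range(n)] for i in range(n)]

# Hypothetical posterior covariance (symmetric, positive definite), noise precision, new feature.
S_N = [[0.5, 0.1], [0.1, 0.3]]
beta = 2.0
phi_new = [1.0, -1.0]
S_next = rank_one_update(S_N, phi_new, beta)

# Direct route: invert S_N^{-1} + beta * phi phi^T.
prec = inv2(S_N)
direct = inv2([[prec[i][j] + beta * phi_new[i] * phi_new[j] for j in range(2)] for i in range(2)])
update_gap = max(abs(S_next[i][j] - direct[i][j]) for i in range(2) for j in range(2))

# Variance reduction at an arbitrary query feature vector.
phi = [0.7, 0.2]
drop = dot(phi, matvec(S_N, phi)) - dot(phi, matvec(S_next, phi))
closed_form = dot(phi, matvec(S_N, phi_new)) ** 2 / (1.0 / beta + dot(phi_new, matvec(S_N, phi_new)))
print(update_gap, drop, closed_form)
```

The update agrees with the direct inverse, and the variance reduction equals the squared-over-positive expression, so it cannot be negative.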
Problem 3.12 Solution

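Let’s begin by writing down the prior PDF p(w, β):
\[
\begin{aligned}
p(w, \beta) &= \mathcal{N}(w | m_0, \beta^{-1} S_0)\, \mathrm{Gam}(\beta | a_0, b_0) \qquad (*) \\
&\propto \frac{\beta^{M/2}}{|S_0|^{1/2}} \exp\Big( -\frac{1}{2} (w - m_0)^T \beta S_0^{-1} (w - m_0) \Big)\, b_0^{a_0} \beta^{a_0 - 1} \exp(-b_0 \beta)
\end{aligned}
\]
And then we write down the likelihood function p(t|X, w, β):
\[
\begin{aligned}
p(\mathbf{t} | X, w, \beta) &= \prod_{n=1}^{N} \mathcal{N}(t_n | w^T \phi(x_n), \beta^{-1}) \\
&\propto \prod_{n=1}^{N} \beta^{1/2} \exp\Big[ -\frac{\beta}{2} \big( t_n - w^T \phi(x_n) \big)^2 \Big] \qquad (**)
\end{aligned}
\]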

According to Bayesian Inference, we have p(w, β|t) ∝ p(t|X, w, β)× p(w, β).
We first focus on the quadratic term with regard to w in the exponent.
\[
\begin{aligned}
\text{quadratic term} &= -\frac{\beta}{2} w^T S_0^{-1} w + \sum_{n=1}^{N} -\frac{\beta}{2} w^T \phi(x_n) \phi(x_n)^T w \\
&= -\frac{\beta}{2} w^T \Big[ S_0^{-1} + \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T \Big] w
\end{aligned}
\]
Where the first term is generated by (∗), and the second by (∗∗). By now,
we know that:
\[
S_N^{-1} = S_0^{-1} + \sum_{n=1}^{N} \phi(x_n) \phi(x_n)^T
\]
We then focus on the linear term with regard to w in the exponent.
\[
\begin{aligned}
\text{linear term} &= \beta m_0^T S_0^{-1} w + \sum_{n=1}^{N} \beta t_n \phi(x_n)^T w \\
&= \beta \Big[ m_0^T S_0^{-1} + \sum_{n=1}^{N} t_n \phi(x_n)^T \Big] w
\end{aligned}
\]
Again, the first term is generated by (∗), and the second by (∗∗). We can
also obtain:
\[
m_N^T S_N^{-1} = m_0^T S_0^{-1} + \sum_{n=1}^{N} t_n \phi(x_n)^T
\]
Which gives:
\[
m_N = S_N \Big[ S_0^{-1} m_0 + \sum_{n=1}^{N} t_n \phi(x_n) \Big]
\]

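Then we focus on the constant term with regard to w in the exponent.
\[
\begin{aligned}
\text{constant term} &= \Big( -\frac{\beta}{2} m_0^T S_0^{-1} m_0 - b_0 \beta \Big) - \frac{\beta}{2} \sum_{n=1}^{N} t_n^2 \\
&= -\beta \Big[ \frac{1}{2} m_0^T S_0^{-1} m_0 + b_0 + \frac{1}{2} \sum_{n=1}^{N} t_n^2 \Big]
\end{aligned}
\]
Therefore, we can obtain:
\[
\frac{1}{2} m_N^T S_N^{-1} m_N + b_N = \frac{1}{2} m_0^T S_0^{-1} m_0 + b_0 + \frac{1}{2} \sum_{n=1}^{N} t_n^2
\]
Which gives:
\[
b_N = \frac{1}{2} m_0^T S_0^{-1} m_0 + b_0 + \frac{1}{2} \sum_{n=1}^{N} t_n^2 - \frac{1}{2} m_N^T S_N^{-1} m_N
\]
Finally, we focus on the power of β outside the exponential.
\[
\text{power of } \beta = \Big( \frac{M}{2} + a_0 - 1 \Big) + \frac{N}{2}
\]
Which gives:
\[
\frac{M}{2} + a_N - 1 = \Big( \frac{M}{2} + a_0 - 1 \Big) + \frac{N}{2}
\]
Hence,
\[
a_N = a_0 + \frac{N}{2}
\]
Problem 3.13 Solution(Waiting for update)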

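Similar to (3.57), we write down the expression of the predictive distribution p(t|X, t):
\[
p(t | X, \mathbf{t}) = \int\!\!\int p(t | w, \beta)\, p(w, \beta | X, \mathbf{t})\, dw\, d\beta \qquad (*)
\]
We know that:
\[
p(t | w, \beta) = \mathcal{N}(t | y(x, w), \beta^{-1}) = \mathcal{N}(t | \phi(x)^T w, \beta^{-1})
\]
And that:
\[
p(w, \beta | X, \mathbf{t}) = \mathcal{N}(w | m_N, \beta^{-1} S_N)\, \mathrm{Gam}(\beta | a_N, b_N)
\]
We go back to (∗), and we first deal with the integral with regard to w:
\[
\begin{aligned}
p(t | X, \mathbf{t}) &= \int \Big[ \int \mathcal{N}(t | \phi(x)^T w, \beta^{-1})\, \mathcal{N}(w | m_N, \beta^{-1} S_N)\, dw \Big] \mathrm{Gam}(\beta | a_N, b_N)\, d\beta \\
&= \int \mathcal{N}\big( t \,\big|\, \phi(x)^T m_N, \; \beta^{-1} + \phi(x)^T \beta^{-1} S_N \phi(x) \big)\, \mathrm{Gam}(\beta | a_N, b_N)\, d\beta \\
&= \int \mathcal{N}\big( t \,\big|\, \phi(x)^T m_N, \; \beta^{-1} (1 + \phi(x)^T S_N \phi(x)) \big)\, \mathrm{Gam}(\beta | a_N, b_N)\, d\beta
\end{aligned}
\]
Where we have used (2.113), (2.114) and (2.115). Then, comparing the
expression above with (2.158) – (2.160), we can see that p(t|X, t) = St(t|µ, λ, ν), where
we have defined:
\[
\mu = \phi(x)^T m_N, \qquad
\lambda = \frac{a_N}{b_N} \big[ 1 + \phi(x)^T S_N \phi(x) \big]^{-1}, \qquad
\nu = 2 a_N
\]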

Problem 3.14 Solution(Wait for updating)
Firstly, according to (3.16), if we use the new orthonormal basis set specified in the problem to construct Φ, we can obtain an important property:
Φ^TΦ = I. Hence, if α = 0, together with (3.54), we know that S_N = β^{-1} I.
Finally, according to (3.62), we can obtain:
\[
k(x, x') = \beta \psi(x)^T S_N \psi(x') = \psi(x)^T \psi(x')
\]
Problem 3.15 Solution
It is quite obvious if we substitute (3.92) and (3.95) into (3.82), which
gives,
\[
E(m_N) = \frac{\beta}{2} \|\mathbf{t} - \Phi m_N\|^2 + \frac{\alpha}{2} m_N^T m_N = \frac{N - \gamma}{2} + \frac{\gamma}{2} = \frac{N}{2}
\]
Problem 3.16 Solution(Waiting for update)
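We know that
\[
p(\mathbf{t} | w, \beta) = \prod_{n=1}^{N} \mathcal{N}(t_n | \phi(x_n)^T w, \beta^{-1}) = \mathcal{N}(\mathbf{t} | \Phi w, \beta^{-1} I)
\]
And
\[
p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I)
\]
Comparing them with (2.113), (2.114) and (2.115), we can obtain:
\[
p(\mathbf{t} | \alpha, \beta) = \mathcal{N}(\mathbf{t} | 0, \beta^{-1} I + \alpha^{-1} \Phi \Phi^T)
\]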
Problem 3.17 Solution
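We know that:
\[
\begin{aligned}
p(\mathbf{t} | w, \beta) &= \prod_{n=1}^{N} \mathcal{N}(t_n | \phi(x_n)^T w, \beta^{-1}) \\
&= \prod_{n=1}^{N} \frac{1}{(2\pi\beta^{-1})^{1/2}} \exp\Big\{ -\frac{1}{2\beta^{-1}} \big( t_n - \phi(x_n)^T w \big)^2 \Big\} \\
&= \Big( \frac{\beta}{2\pi} \Big)^{N/2} \exp\Big\{ \sum_{n=1}^{N} -\frac{\beta}{2} \big( t_n - \phi(x_n)^T w \big)^2 \Big\} \\
&= \Big( \frac{\beta}{2\pi} \Big)^{N/2} \exp\Big\{ -\frac{\beta}{2} \|\mathbf{t} - \Phi w\|^2 \Big\}
\end{aligned}
\]
And that:
\[
p(w | \alpha) = \mathcal{N}(w | 0, \alpha^{-1} I) = \frac{\alpha^{M/2}}{(2\pi)^{M/2}} \exp\Big\{ -\frac{\alpha}{2} \|w\|^2 \Big\}
\]
If we substitute the expressions above into (3.77), we can obtain (3.78)
just as required.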
Problem 3.18 Solution
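We expand (3.79) as follows:
\[
\begin{aligned}
E(w) &= \frac{\beta}{2} \|\mathbf{t} - \Phi w\|^2 + \frac{\alpha}{2} w^T w \\
&= \frac{\beta}{2} \big( \mathbf{t}^T \mathbf{t} - 2 \mathbf{t}^T \Phi w + w^T \Phi^T \Phi w \big) + \frac{\alpha}{2} w^T w \\
&= \frac{1}{2} \big[ w^T (\beta \Phi^T \Phi + \alpha I) w - 2 \beta \mathbf{t}^T \Phi w + \beta \mathbf{t}^T \mathbf{t} \big]
\end{aligned}
\]
Observing the equation above, we see that E(w) contains the following
term:
\[
\frac{1}{2} (w - m_N)^T A (w - m_N) \qquad (*)
\]
Now, we need to solve for A and m_N. We expand (∗) and obtain:
\[
(*) = \frac{1}{2} \big( w^T A w - 2 m_N^T A w + m_N^T A m_N \big)
\]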

We firstly compare the quadratic term, which gives:
A = βΦT Φ + αI
And then we compare the linear term, which gives:

m N T A = βtT Φ
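Noticing that A = A^T, which implies A^{-1} is also symmetric, we first transpose and then multiply both sides by A^{-1}, which gives:
\[
m_N = \beta A^{-1} \Phi^T \mathbf{t}
\]
Now we rewrite E(w):
\[
\begin{aligned}
E(w) &= \frac{1}{2} \big[ w^T (\beta \Phi^T \Phi + \alpha I) w - 2 \beta \mathbf{t}^T \Phi w + \beta \mathbf{t}^T \mathbf{t} \big] \\
&= \frac{1}{2} \big[ (w - m_N)^T A (w - m_N) + \beta \mathbf{t}^T \mathbf{t} - m_N^T A m_N \big] \\
&= \frac{1}{2} (w - m_N)^T A (w - m_N) + \frac{1}{2} \big( \beta \mathbf{t}^T \mathbf{t} - m_N^T A m_N \big) \\
&= \frac{1}{2} (w - m_N)^T A (w - m_N) + \frac{1}{2} \big( \beta \mathbf{t}^T \mathbf{t} - 2 m_N^T A m_N + m_N^T A m_N \big) \\
&= \frac{1}{2} (w - m_N)^T A (w - m_N) + \frac{1}{2} \big( \beta \mathbf{t}^T \mathbf{t} - 2 m_N^T A m_N + m_N^T (\beta \Phi^T \Phi + \alpha I) m_N \big) \\
&= \frac{1}{2} (w - m_N)^T A (w - m_N) + \frac{1}{2} \big[ \beta \mathbf{t}^T \mathbf{t} - 2 \beta \mathbf{t}^T \Phi m_N + m_N^T (\beta \Phi^T \Phi) m_N \big] + \frac{\alpha}{2} m_N^T m_N \\
&= \frac{1}{2} (w - m_N)^T A (w - m_N) + \frac{\beta}{2} \|\mathbf{t} - \Phi m_N\|^2 + \frac{\alpha}{2} m_N^T m_N
\end{aligned}
\]
Just as required.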
Problem 3.19 Solution
Based on the standard form of a multivariate normal distribution with mean m_N and covariance A^{-1}, we
know that
\[
\int \frac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\Big\{ -\frac{1}{2} (w - m_N)^T A (w - m_N) \Big\} dw = 1
\]
Hence,
\[
\int \exp\Big\{ -\frac{1}{2} (w - m_N)^T A (w - m_N) \Big\} dw = (2\pi)^{M/2} |A|^{-1/2}
\]
And since E(m_N) doesn’t depend on w, (3.85) is quite obvious. Then we
substitute (3.85) into (3.78), which will immediately give (3.86).
Problem 3.20 Solution
You can just follow the steps from (3.87) to (3.92), which is already very
clear.
Problem 3.21 Solution
Let’s first prove (3.117). According to (C.47) and (C.48), we know that if A
is an M × M real symmetric matrix, with eigenvalues λ_i, i = 1, 2, ..., M, |A| and
Tr(A) can be written as:
\[
|A| = \prod_{i=1}^{M} \lambda_i, \qquad \mathrm{Tr}(A) = \sum_{i=1}^{M} \lambda_i
\]
Back to this problem, according to section 3.5.2, we know that A has
eigenvalues α + λ_i, i = 1, 2, ..., M. Hence the left side of (3.117) equals:
\[
\text{left side} = \frac{d}{d\alpha} \ln \Big[ \prod_{i=1}^{M} (\alpha + \lambda_i) \Big] = \sum_{i=1}^{M} \frac{d}{d\alpha} \ln(\alpha + \lambda_i) = \sum_{i=1}^{M} \frac{1}{\alpha + \lambda_i}
\]
And according to (3.81), we can obtain:
\[
A^{-1} \frac{d}{d\alpha} A = A^{-1} I = A^{-1}
\]
For the symmetric matrix A, its inverse A^{-1} has eigenvalues 1/(α + λ_i), i =
1, 2, ..., M. Therefore,
\[
\mathrm{Tr}\Big( A^{-1} \frac{d}{d\alpha} A \Big) = \sum_{i=1}^{M} \frac{1}{\alpha + \lambda_i}
\]
Hence they are the same, and (3.92) is quite obvious.
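
A finite-difference check of d/dα ln|A| = Tr(A^{-1}) is sketched below for a 2 × 2 case. The matrix `B` stands in for βΦ^TΦ and is an arbitrary symmetric positive definite choice; the helper names are illustrative.

```python
import math

def det2(M):
    return M[0][0] * M[1][1] - M[0][1] * M[1][0]

def make_A(alpha, B):
    """A = alpha * I + B, where B plays the role of beta * Phi^T Phi."""
    return [[B[0][0] + alpha, B[0][1]], [B[1][0], B[1][1] + alpha]]

def trace_inv2(M):
    """Trace of the inverse of a 2 x 2 matrix: (a + d) / det."""
    return (M[0][0] + M[1][1]) / det2(M)

B = [[4.0, 1.0], [1.0, 3.0]]   # hypothetical stand-in for beta * Phi^T Phi
alpha, h = 0.7, 1e-6

# Central difference of ln|A| with respect to alpha.
numeric = (math.log(det2(make_A(alpha + h, B))) - math.log(det2(make_A(alpha - h, B)))) / (2 * h)
analytic = trace_inv2(make_A(alpha, B))
print(numeric, analytic)
```

The numerical derivative and the trace agree to within the accuracy of the central difference.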
Problem 3.22 Solution

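Let’s differentiate (3.86) with respect to β. The first term dependent on β in
(3.86) gives:
\[
\frac{d}{d\beta} \Big( \frac{N}{2} \ln\beta \Big) = \frac{N}{2\beta}
\]
The second term gives:
\[
\frac{d}{d\beta} E(m_N) = \frac{1}{2} \|\mathbf{t} - \Phi m_N\|^2 + \frac{\beta}{2} \frac{d}{d\beta} \|\mathbf{t} - \Phi m_N\|^2 + \frac{d}{d\beta} \frac{\alpha}{2} m_N^T m_N
\]
The last two terms in the equation above can be further written as:
\[
\begin{aligned}
\frac{\beta}{2} \frac{d}{d\beta} \|\mathbf{t} - \Phi m_N\|^2 + \frac{d}{d\beta} \frac{\alpha}{2} m_N^T m_N
&= \Big\{ \frac{\beta}{2} \frac{d}{d m_N} \|\mathbf{t} - \Phi m_N\|^2 + \frac{d}{d m_N} \frac{\alpha}{2} m_N^T m_N \Big\} \cdot \frac{d m_N}{d\beta} \\
&= \Big\{ \frac{\beta}{2} \big[ -2 \Phi^T (\mathbf{t} - \Phi m_N) \big] + \frac{\alpha}{2}\, 2 m_N \Big\} \cdot \frac{d m_N}{d\beta} \\
&= \big\{ -\beta \Phi^T (\mathbf{t} - \Phi m_N) + \alpha m_N \big\} \cdot \frac{d m_N}{d\beta} \\
&= \big\{ -\beta \Phi^T \mathbf{t} + (\alpha I + \beta \Phi^T \Phi) m_N \big\} \cdot \frac{d m_N}{d\beta} \\
&= \big\{ -\beta \Phi^T \mathbf{t} + A m_N \big\} \cdot \frac{d m_N}{d\beta} \\
&= 0
\end{aligned}
\]
Where we have taken advantage of (3.83) and (3.84). Hence
\[
\frac{d}{d\beta} E(m_N) = \frac{1}{2} \|\mathbf{t} - \Phi m_N\|^2 = \frac{1}{2} \sum_{n=1}^{N} \big( t_n - m_N^T \phi(x_n) \big)^2
\]
The last term dependent on β in (3.86) gives:
\[
\frac{d}{d\beta} \Big( \frac{1}{2} \ln|A| \Big) = \frac{\gamma}{2\beta}
\]
Therefore, if we combine all those expressions together, we will obtain
(3.94). And then if we rearrange it, we will obtain (3.95).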
Problem 3.23 Solution
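First, according to (3.10), we know that p(t|X, w, β) can be further written
as p(t|X, w, β) = N(t|Φw, β^{-1}I), and we are given that p(w|β) = N(w|m_0, β^{-1}S_0) and
p(β) = Gam(β|a_0, b_0). Therefore, we just follow the hint in the problem.
\[
\begin{aligned}
p(\mathbf{t}) &= \int\!\!\int p(\mathbf{t} | X, w, \beta)\, p(w | \beta)\, dw\; p(\beta)\, d\beta \\
&= \int\!\!\int \Big( \frac{\beta}{2\pi} \Big)^{N/2} \exp\Big\{ -\frac{\beta}{2} (\mathbf{t} - \Phi w)^T (\mathbf{t} - \Phi w) \Big\} \cdot
 \Big( \frac{\beta}{2\pi} \Big)^{M/2} |S_0|^{-1/2} \exp\Big\{ -\frac{\beta}{2} (w - m_0)^T S_0^{-1} (w - m_0) \Big\} dw \\
&\qquad \cdot \Gamma(a_0)^{-1} b_0^{a_0} \beta^{a_0 - 1} \exp(-b_0 \beta)\, d\beta \\
&= \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2} |S_0|^{1/2} \Gamma(a_0)} \int\!\!\int \exp\Big\{ -\frac{\beta}{2} (\mathbf{t} - \Phi w)^T (\mathbf{t} - \Phi w) \Big\}
 \exp\Big\{ -\frac{\beta}{2} (w - m_0)^T S_0^{-1} (w - m_0) \Big\} dw \\
&\qquad \cdot \beta^{a_0 - 1 + N/2 + M/2} \exp(-b_0 \beta)\, d\beta \\
&= \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2} |S_0|^{1/2} \Gamma(a_0)} \int\!\!\int \exp\Big\{ -\frac{\beta}{2} (w - m_N)^T S_N^{-1} (w - m_N) \Big\} dw \\
&\qquad \cdot \exp\Big\{ -\frac{\beta}{2} \big( \mathbf{t}^T \mathbf{t} + m_0^T S_0^{-1} m_0 - m_N^T S_N^{-1} m_N \big) \Big\}\, \beta^{a_N - 1 + M/2} \exp(-b_0 \beta)\, d\beta
\end{aligned}
\]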
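Where we have defined
\[
\begin{aligned}
m_N &= S_N (S_0^{-1} m_0 + \Phi^T \mathbf{t}) \\
S_N^{-1} &= S_0^{-1} + \Phi^T \Phi \\
a_N &= a_0 + \frac{N}{2} \\
b_N &= b_0 + \frac{1}{2} \Big( m_0^T S_0^{-1} m_0 - m_N^T S_N^{-1} m_N + \sum_{n=1}^{N} t_n^2 \Big)
\end{aligned}
\]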

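Which are exactly the same as those in Prob.3.12, and then we evaluate
the integral, taking advantage of the normalization of the multivariate
Gaussian distribution and the Gamma distribution.
\[
\begin{aligned}
p(\mathbf{t}) &= \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2} |S_0|^{1/2} \Gamma(a_0)} \int \Big( \frac{2\pi}{\beta} \Big)^{M/2} |S_N|^{1/2}\, \beta^{a_N - 1 + M/2} \exp(-b_N \beta)\, d\beta \\
&= \frac{b_0^{a_0}}{(2\pi)^{(M+N)/2} |S_0|^{1/2} \Gamma(a_0)} (2\pi)^{M/2} |S_N|^{1/2} \int \beta^{a_N - 1} \exp(-b_N \beta)\, d\beta \\
&= \frac{1}{(2\pi)^{N/2}} \frac{|S_N|^{1/2}}{|S_0|^{1/2}} \frac{b_0^{a_0}}{b_N^{a_N}} \frac{\Gamma(a_N)}{\Gamma(a_0)}
\end{aligned}
\]
Just as required.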
Problem 3.24 Solution

Let’s just follow the hint and we begin by writing down expression for the
likelihood, prior and posterior PDF. We know that p(t|w, β) = N (t|Φw, β−1 I).
What’s more, the form of the prior and posterior are quite similar:

p(w, β) = N (w|m0 , β−1 S0 ) Gam(β|a 0 , b 0 )
And

p(w, β|t) = N (w|mN , β−1 SN ) Gam(β|a N , b N )

Where the relationships among those parameters are shown in Prob.3.12,
Prob.3.23. Now according to (3.119), we can write:

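\[
\begin{aligned}
p(\mathbf{t}) &= \mathcal{N}(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{ \mathcal{N}(w | m_0, \beta^{-1} S_0)\, \mathrm{Gam}(\beta | a_0, b_0) }{ \mathcal{N}(w | m_N, \beta^{-1} S_N)\, \mathrm{Gam}(\beta | a_N, b_N) } \\
&= \mathcal{N}(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{ \mathcal{N}(w | m_0, \beta^{-1} S_0)\, b_0^{a_0} \beta^{a_0 - 1} \exp(-b_0 \beta) / \Gamma(a_0) }{ \mathcal{N}(w | m_N, \beta^{-1} S_N)\, b_N^{a_N} \beta^{a_N - 1} \exp(-b_N \beta) / \Gamma(a_N) } \\
&= \mathcal{N}(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{ \mathcal{N}(w | m_0, \beta^{-1} S_0) }{ \mathcal{N}(w | m_N, \beta^{-1} S_N) }\, \frac{ b_0^{a_0} \Gamma(a_N) }{ b_N^{a_N} \Gamma(a_0) }\, \beta^{a_0 - a_N} \exp\big\{ -(b_0 - b_N) \beta \big\} \\
&= \mathcal{N}(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{ \mathcal{N}(w | m_0, \beta^{-1} S_0) }{ \mathcal{N}(w | m_N, \beta^{-1} S_N) }\, \exp\big\{ -(b_0 - b_N) \beta \big\}\, \frac{ b_0^{a_0} \Gamma(a_N) }{ b_N^{a_N} \Gamma(a_0) }\, \beta^{-N/2}
\end{aligned}
\]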

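Where we have used a_N = a_0 + N/2. Now we deal with the terms expressed
in the form of Gaussian distributions:
\[
\begin{aligned}
\text{Gaussian terms} &= \mathcal{N}(\mathbf{t} | \Phi w, \beta^{-1} I)\, \frac{ \mathcal{N}(w | m_0, \beta^{-1} S_0) }{ \mathcal{N}(w | m_N, \beta^{-1} S_N) } \\
&= \Big( \frac{\beta}{2\pi} \Big)^{N/2} \exp\Big\{ -\frac{\beta}{2} (\mathbf{t} - \Phi w)^T (\mathbf{t} - \Phi w) \Big\} \cdot
 \frac{ |\beta^{-1} S_N|^{1/2} \exp\big\{ -\frac{\beta}{2} (w - m_0)^T S_0^{-1} (w - m_0) \big\} }{ |\beta^{-1} S_0|^{1/2} \exp\big\{ -\frac{\beta}{2} (w - m_N)^T S_N^{-1} (w - m_N) \big\} } \\
&= \Big( \frac{\beta}{2\pi} \Big)^{N/2} \frac{|S_N|^{1/2}}{|S_0|^{1/2}} \exp\Big\{ -\frac{\beta}{2} (\mathbf{t} - \Phi w)^T (\mathbf{t} - \Phi w) \Big\} \cdot
 \frac{ \exp\big\{ -\frac{\beta}{2} (w - m_0)^T S_0^{-1} (w - m_0) \big\} }{ \exp\big\{ -\frac{\beta}{2} (w - m_N)^T S_N^{-1} (w - m_N) \big\} }
\end{aligned}
\]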

We look back to the previous problem and we notice that at the last step
in the deduction of p(t), we complete the square according to w. And if we
carefully compare the left and right side at the last step, we can obtain :

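\[
\begin{aligned}
& \exp\Big\{ -\frac{\beta}{2} (\mathbf{t} - \Phi w)^T (\mathbf{t} - \Phi w) \Big\} \exp\Big\{ -\frac{\beta}{2} (w - m_0)^T S_0^{-1} (w - m_0) \Big\} \\
&= \exp\Big\{ -\frac{\beta}{2} (w - m_N)^T S_N^{-1} (w - m_N) \Big\} \exp\big\{ -(b_N - b_0) \beta \big\}
\end{aligned}
\]
Hence, we go back to deal with the Gaussian terms:
\[
\text{Gaussian terms} = \Big( \frac{\beta}{2\pi} \Big)^{N/2} \frac{|S_N|^{1/2}}{|S_0|^{1/2}} \exp\big\{ -(b_N - b_0) \beta \big\}
\]
If we substitute the expressions above into p(t), we will obtain (3.118)
immediately.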

0.4 Linear Models for Classification

Problem 4.1 Solution
If the convex hulls of {x_n} and {y_n} intersect, we know that there will be a
point z which can be written as z = Σ_n α_n x_n and also as z = Σ_n β_n y_n, with α_n, β_n ≥ 0 and Σ_n α_n = Σ_n β_n = 1. Hence we
can obtain:
\[
\begin{aligned}
\hat{w}^T z + w_0 &= \hat{w}^T \Big( \sum_n \alpha_n x_n \Big) + w_0 \\
&= \Big( \sum_n \alpha_n \hat{w}^T x_n \Big) + \Big( \sum_n \alpha_n \Big) w_0 \\
&= \sum_n \alpha_n \big( \hat{w}^T x_n + w_0 \big) \qquad (*)
\end{aligned}
\]
Where we have used Σ_n α_n = 1. And if {x_n} and {y_n} are linearly separable, we have ŵ^T x_n + w_0 > 0 and ŵ^T y_n + w_0 < 0 for all x_n, y_n. Together with
α_n ≥ 0 and (∗), we know that ŵ^T z + w_0 > 0. And if we calculate ŵ^T z + w_0
from the perspective of {y_n} following the same procedure, we obtain
ŵ^T z + w_0 < 0. Hence a contradiction occurs. In other words, the two sets are not linearly separable if their convex hulls intersect.
Now let’s assume they are linearly separable and prove that their convex
hulls don’t intersect. We have ŵ^T x_n + w_0 > 0 and
ŵ^T y_n + w_0 < 0 for all x_n, y_n, since the two sets are linearly separable. If there were
a point z belonging to both convex hulls of {x_n} and {y_n}, then following the
same procedure as above, we would have ŵ^T z + w_0 > 0 from the perspective
of {x_n} and ŵ^T z + w_0 < 0 from the perspective of {y_n}, which directly leads to
a contradiction. Therefore, the convex hulls don’t intersect.

Problem 4.2 Solution
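Let’s make the dependency of E_D(W̃) on w_0 explicit:
\[
E_D(\widetilde{W}) = \frac{1}{2} \mathrm{Tr}\big\{ (XW + \mathbf{1} w_0^T - T)^T (XW + \mathbf{1} w_0^T - T) \big\}
\]
Then we calculate the derivative of E_D(W̃) with respect to w_0:
\[
\frac{\partial E_D(\widetilde{W})}{\partial w_0} = N w_0 + (XW - T)^T \mathbf{1}
\]
Where we have used the property:
\[
\frac{\partial}{\partial X} \mathrm{Tr}\big[ (AXB + C)(AXB + C)^T \big] = 2 A^T (AXB + C) B^T
\]
We set the derivative equal to 0, which gives:
\[
w_0 = -\frac{1}{N} (XW - T)^T \mathbf{1} = \bar{t} - W^T \bar{x}
\]
Where we have denoted:
\[
\bar{t} = \frac{1}{N} T^T \mathbf{1}, \qquad \text{and} \qquad \bar{x} = \frac{1}{N} X^T \mathbf{1}
\]
If we substitute the equations above into E_D(W̃), we can obtain:
\[
E_D(\widetilde{W}) = \frac{1}{2} \mathrm{Tr}\big\{ (XW + \bar{T} - \bar{X} W - T)^T (XW + \bar{T} - \bar{X} W - T) \big\}
\]
Where we further denote
\[
\bar{T} = \mathbf{1} \bar{t}^T, \qquad \text{and} \qquad \bar{X} = \mathbf{1} \bar{x}^T
\]
Then we set the derivative of E_D(W̃) with regard to W to 0, which gives:
\[
W = \widehat{X}^{\dagger} \widehat{T}
\]
Where we have defined:
\[
\widehat{X} = X - \bar{X}, \qquad \text{and} \qquad \widehat{T} = T - \bar{T}
\]
Now consider the prediction for a new given x, we have:
\[
\begin{aligned}
y(x) &= W^T x + w_0 \\
&= W^T x + \bar{t} - W^T \bar{x} \\
&= \bar{t} + W^T (x - \bar{x})
\end{aligned}
\]
If we know that a^T t_n + b = 0 holds for some a and b, we can obtain:
\[
a^T \bar{t} = \frac{1}{N} a^T T^T \mathbf{1} = \frac{1}{N} \sum_{n=1}^{N} a^T t_n = -b
\]
Therefore,
\[
\begin{aligned}
a^T y(x) &= a^T \big[ \bar{t} + W^T (x - \bar{x}) \big] \\
&= a^T \bar{t} + a^T W^T (x - \bar{x}) \\
&= -b + a^T \widehat{T}^T (\widehat{X}^{\dagger})^T (x - \bar{x}) \\
&= -b
\end{aligned}
\]
Where we have used:
\[
\begin{aligned}
a^T \widehat{T}^T &= a^T (T - \bar{T})^T = a^T \Big( T - \frac{1}{N} \mathbf{1} \mathbf{1}^T T \Big)^T \\
&= a^T T^T - \frac{1}{N} a^T T^T \mathbf{1} \mathbf{1}^T = -b \mathbf{1}^T + b \mathbf{1}^T \\
&= \mathbf{0}^T
\end{aligned}
\]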
Problem 4.3 Solution
Suppose there are Q constraints in total. We can write a_q^T t_n + b_q = 0, q =
1, 2, ..., Q for all the target vectors t_n, n = 1, 2, ..., N. Or alternatively, we can
group them together:
\[
A^T t_n + b = 0
\]
Where A is a K × Q matrix (K being the dimension of the target vectors) whose qth column is a_q, and b is a Q × 1 column vector whose qth element is b_q. For every pair
{a_q, b_q} we can follow the same procedure as in the previous problem to show
that a_q^T y(x) + b_q = 0. In other words, the proofs do not affect each other.
Therefore, it is obvious that:
\[
A^T y(x) + b = 0
\]
Problem 4.4 Solution
We use a Lagrange multiplier to enforce the constraint w^T w = 1. We now
need to maximize:
\[
L(\lambda, w) = w^T (m_2 - m_1) + \lambda (w^T w - 1)
\]
We calculate the derivatives:
\[
\frac{\partial L(\lambda, w)}{\partial \lambda} = w^T w - 1
\]
And
\[
\frac{\partial L(\lambda, w)}{\partial w} = m_2 - m_1 + 2 \lambda w
\]
We set the derivatives above equal to 0, which gives:
\[
w = -\frac{1}{2\lambda} (m_2 - m_1) \propto (m_2 - m_1)
\]

Problem 4.5 Solution
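We expand (4.25) using (4.22), (4.23) and (4.24).
\[
\begin{aligned}
J(w) &= \frac{(m_2 - m_1)^2}{s_1^2 + s_2^2} \\
&= \frac{ \| w^T (\mathbf{m}_2 - \mathbf{m}_1) \|^2 }{ \sum_{n \in C_1} (w^T x_n - m_1)^2 + \sum_{n \in C_2} (w^T x_n - m_2)^2 }
\end{aligned}
\]
The numerator can be further written as:
\[
\text{numerator} = \big[ w^T (\mathbf{m}_2 - \mathbf{m}_1) \big] \big[ w^T (\mathbf{m}_2 - \mathbf{m}_1) \big]^T = w^T S_B w
\]
Where we have defined:
\[
S_B = (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^T
\]
And it is the same for the denominator:
\[
\begin{aligned}
\text{denominator} &= \sum_{n \in C_1} \big[ w^T (x_n - \mathbf{m}_1) \big]^2 + \sum_{n \in C_2} \big[ w^T (x_n - \mathbf{m}_2) \big]^2 \\
&= w^T S_{w1} w + w^T S_{w2} w \\
&= w^T S_w w
\end{aligned}
\]
Where we have defined:
\[
S_w = \sum_{n \in C_1} (x_n - \mathbf{m}_1)(x_n - \mathbf{m}_1)^T + \sum_{n \in C_2} (x_n - \mathbf{m}_2)(x_n - \mathbf{m}_2)^T
\]
Just as required.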
Problem 4.6 Solution
Let’s follow the hint, beginning by expanding the left side of (4.33), using w_0 = −w^T m and the target coding t_n = N/N_1 for class C_1 and t_n = −N/N_2 for class C_2:
\[
\begin{aligned}
\sum_{n=1}^{N} (w^T x_n + w_0 - t_n) x_n
&= \sum_{n=1}^{N} (w^T x_n - w^T m - t_n) x_n \\
&= \sum_{n=1}^{N} x_n x_n^T w - (w^T m) \sum_{n=1}^{N} x_n - \Big( \sum_{n \in C_1} t_n x_n + \sum_{n \in C_2} t_n x_n \Big) \\
&= \sum_{n=1}^{N} x_n x_n^T w - (w^T m)(N m) - \Big( \sum_{n \in C_1} \frac{N}{N_1} x_n + \sum_{n \in C_2} \frac{-N}{N_2} x_n \Big) \\
&= \sum_{n=1}^{N} x_n x_n^T w - N m m^T w - N \Big( \sum_{n \in C_1} \frac{1}{N_1} x_n - \sum_{n \in C_2} \frac{1}{N_2} x_n \Big) \\
&= \sum_{n=1}^{N} x_n x_n^T w - N m m^T w - N (m_1 - m_2) \\
&= \Big[ \sum_{n=1}^{N} x_n x_n^T - N m m^T \Big] w - N (m_1 - m_2)
\end{aligned}
\]
If we set this expression equal to 0, we will see that:
\[
\Big[ \sum_{n=1}^{N} x_n x_n^T - N m m^T \Big] w = N (m_1 - m_2)
\]
Therefore, now we need to prove:
\[
\sum_{n=1}^{N} x_n x_n^T - N m m^T = S_w + \frac{N_1 N_2}{N} S_B
\]
Let’s expand the left side of the equation above, using m = (N_1 m_1 + N_2 m_2)/N, N_1 + N_2 = N, and the sums Σ_{n∈C_1} x_n = N_1 m_1 and Σ_{n∈C_2} x_n = N_2 m_2:
\[
\begin{aligned}
\text{left} &= \sum_{n=1}^{N} x_n x_n^T - N \Big( \frac{N_1}{N} m_1 + \frac{N_2}{N} m_2 \Big) \Big( \frac{N_1}{N} m_1 + \frac{N_2}{N} m_2 \Big)^T \\
&= \sum_{n=1}^{N} x_n x_n^T - \frac{N_1^2}{N} m_1 m_1^T - \frac{N_2^2}{N} m_2 m_2^T - \frac{N_1 N_2}{N} \big( m_1 m_2^T + m_2 m_1^T \big) \\
&= \sum_{n=1}^{N} x_n x_n^T - N_1 m_1 m_1^T - N_2 m_2 m_2^T + \frac{N_1 N_2}{N} \big( m_1 m_1^T + m_2 m_2^T - m_1 m_2^T - m_2 m_1^T \big) \\
&= \sum_{n=1}^{N} x_n x_n^T + N_1 m_1 m_1^T - 2 N_1 m_1 m_1^T + N_2 m_2 m_2^T - 2 N_2 m_2 m_2^T + \frac{N_1 N_2}{N} S_B \\
&= \sum_{n \in C_1} \big( x_n x_n^T + m_1 m_1^T - m_1 x_n^T - x_n m_1^T \big) + \sum_{n \in C_2} \big( x_n x_n^T + m_2 m_2^T - m_2 x_n^T - x_n m_2^T \big) + \frac{N_1 N_2}{N} S_B \\
&= \sum_{n \in C_1} (x_n - m_1)(x_n - m_1)^T + \sum_{n \in C_2} (x_n - m_2)(x_n - m_2)^T + \frac{N_1 N_2}{N} S_B \\
&= S_w + \frac{N_1 N_2}{N} S_B
\end{aligned}
\]
In the third line we used N_1^2/N = N_1 − N_1 N_2/N and N_2^2/N = N_2 − N_1 N_2/N.
Just as required.
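
The scatter-matrix identity proved above can be confirmed on a toy two-class data set. The sketch below uses plain Python lists; the data points and the helper names (`mean`, `outer`, `add`, `scale`, `scatter`) are illustrative assumptions.

```python
def mean(points):
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def outer(u, v):
    return [[ui * vj for vj in v] for ui in u]

def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def scale(c, A):
    return [[c * a for a in row] for row in A]

def scatter(points, center):
    """Sum of (x - center)(x - center)^T over the given points."""
    total = [[0.0] * len(center) for _ in center]
    for p in points:
        d = [pi - ci for pi, ci in zip(p, center)]
        total = add(total, outer(d, d))
    return total

# Hypothetical two-class data in two dimensions.
C1 = [[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]]
C2 = [[-1.0, 0.0], [0.0, -2.0]]
N1, N2 = len(C1), len(C2)
N = N1 + N2
m1, m2, m = mean(C1), mean(C2), mean(C1 + C2)

zero = [0.0] * len(m)
left = add(scatter(C1 + C2, zero), scale(-N, outer(m, m)))   # sum x x^T - N m m^T
diff = [a - b for a, b in zip(m2, m1)]
S_W = add(scatter(C1, m1), scatter(C2, m2))                  # within-class scatter
S_B = outer(diff, diff)                                      # between-class scatter
right = add(S_W, scale(N1 * N2 / N, S_B))
max_gap = max(abs(l - r) for rl, rr in zip(left, right) for l, r in zip(rl, rr))
print(max_gap)
```

The total scatter around the overall mean splits into the within-class scatter plus a weighted between-class term, up to rounding error.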
Problem 4.7 Solution
This problem is quite simple. We can solve it by definition. We know that
the logistic sigmoid function has the form:
\[
\sigma(a) = \frac{1}{1 + \exp(-a)}
\]
Therefore, we can obtain:
\[
\begin{aligned}
\sigma(a) + \sigma(-a) &= \frac{1}{1 + \exp(-a)} + \frac{1}{1 + \exp(a)} \\
&= \frac{2 + \exp(a) + \exp(-a)}{[1 + \exp(-a)][1 + \exp(a)]} \\
&= \frac{2 + \exp(a) + \exp(-a)}{2 + \exp(a) + \exp(-a)} = 1
\end{aligned}
\]
Next we exchange the dependent and independent variables to obtain its
inverse.
\[
a = \frac{1}{1 + \exp(-y)}
\]
We first rearrange the equation above, which gives:
\[
\exp(-y) = \frac{1 - a}{a}
\]
Then we calculate the logarithm of both sides, which gives:
\[
y = \ln\Big( \frac{a}{1 - a} \Big)
\]
Just as required.
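
Both properties are easy to confirm numerically. The following short sketch defines `sigmoid` and `logit` (names chosen for this example) and checks the symmetry relation and the inverse at a few arbitrary points.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    """Inverse of the logistic sigmoid: ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

# For each test point: (a, sigma(a) + sigma(-a), logit(sigma(a))).
checks = [(a, sigmoid(a) + sigmoid(-a), logit(sigmoid(a)))
          for a in [-3.0, -0.5, 0.0, 1.2, 4.0]]
for a, total, recovered in checks:
    print(a, total, recovered)
```

In each row the second entry equals one and the third recovers the input, up to floating-point rounding.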
Problem 4.8 Solution
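According to (4.58) and (4.64), we can write:
\[
\begin{aligned}
a &= \ln \frac{p(x|C_1)\, p(C_1)}{p(x|C_2)\, p(C_2)} \\
&= \ln p(x|C_1) - \ln p(x|C_2) + \ln \frac{p(C_1)}{p(C_2)} \\
&= -\frac{1}{2} (x - \mu_1)^T \Sigma^{-1} (x - \mu_1) + \frac{1}{2} (x - \mu_2)^T \Sigma^{-1} (x - \mu_2) + \ln \frac{p(C_1)}{p(C_2)} \\
&= \big[ \Sigma^{-1} (\mu_1 - \mu_2) \big]^T x - \frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)} \\
&= w^T x + w_0
\end{aligned}
\]
Where in the second-to-last step, we rearrange the terms according to x, i.e.,
its quadratic, linear and constant terms; the quadratic terms cancel because both classes share the same Σ. We have also defined:
\[
w = \Sigma^{-1} (\mu_1 - \mu_2)
\]
And
\[
w_0 = -\frac{1}{2} \mu_1^T \Sigma^{-1} \mu_1 + \frac{1}{2} \mu_2^T \Sigma^{-1} \mu_2 + \ln \frac{p(C_1)}{p(C_2)}
\]
Finally, since p(C_1|x) = σ(a) as stated in (4.57), we have p(C_1|x) = σ(w^T x + w_0) just as required.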
Problem 4.9 Solution
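We begin by writing down the likelihood function.
\[
\begin{aligned}
p(\{\phi_n, t_n\} | \pi_1, \pi_2, \ldots, \pi_K) &= \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ p(\phi_n | C_k)\, p(C_k) \big]^{t_{nk}} \\
&= \prod_{n=1}^{N} \prod_{k=1}^{K} \big[ \pi_k\, p(\phi_n | C_k) \big]^{t_{nk}}
\end{aligned}
\]
Hence we can obtain the expression for the log likelihood:
\[
\ln p = \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \big[ \ln \pi_k + \ln p(\phi_n | C_k) \big] = \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln \pi_k + \text{const}
\]
Where const collects the terms independent of π_k. Since there is a constraint on π_k, we need to add a Lagrange multiplier
to the expression, which becomes:
\[
L = \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln \pi_k + \lambda \Big( \sum_{k=1}^{K} \pi_k - 1 \Big)
\]
We calculate the derivative of the expression above with regard to π_k:
\[
\frac{\partial L}{\partial \pi_k} = \sum_{n=1}^{N} \frac{t_{nk}}{\pi_k} + \lambda
\]
And if we set the derivative equal to 0, we can obtain:
\[
\pi_k = -\Big( \sum_{n=1}^{N} t_{nk} \Big) / \lambda = -\frac{N_k}{\lambda} \qquad (*)
\]
And if we perform summation on both sides with regard to k, we can see
that:
\[
1 = -\Big( \sum_{k=1}^{K} N_k \Big) / \lambda = -\frac{N}{\lambda}
\]
Which gives λ = −N, and substituting it into (∗), we can obtain π_k = N_k / N.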
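
As an illustration, the sketch below computes the class priors as relative frequencies and compares the prior part of the log likelihood against a uniform alternative. The label list and the function names are invented for this example.

```python
import math
from collections import Counter

def ml_class_priors(labels):
    """Maximum likelihood priors: pi_k = N_k / N."""
    counts = Counter(labels)
    n = len(labels)
    return {k: counts[k] / n for k in counts}

def prior_log_likelihood(labels, priors):
    """Sum over data points of ln(pi_k) for the observed class of each point."""
    return sum(math.log(priors[label]) for label in labels)

labels = ["a", "b", "a", "c", "a", "b"]   # hypothetical class labels
priors = ml_class_priors(labels)
uniform = {k: 1.0 / len(priors) for k in priors}

best = prior_log_likelihood(labels, priors)
other = prior_log_likelihood(labels, uniform)
print(priors, best, other)
```

The relative-frequency priors sum to one and give a log likelihood at least as large as the uniform choice, as expected of the maximizer.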
Problem 4.10 Solution
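This time, we focus on the terms which depend on µ_k and Σ in the
log likelihood.
\[
\ln p = \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \big[ \ln \pi_k + \ln p(\phi_n | C_k) \big] = \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln p(\phi_n | C_k) + \text{const}
\]
Provided p(ϕ|C_k) = N(ϕ|µ_k, Σ), we can further derive, up to terms independent of µ_k and Σ:
\[
\ln p = \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \Big[ -\frac{1}{2} \ln|\Sigma| - \frac{1}{2} (\phi_n - \mu_k)^T \Sigma^{-1} (\phi_n - \mu_k) \Big] + \text{const}
\]
We first calculate the derivative of the expression above with regard to
µ_k:
\[
\frac{\partial \ln p}{\partial \mu_k} = \sum_{n=1}^{N} t_{nk} \Sigma^{-1} (\phi_n - \mu_k)
\]
We set the derivative equal to 0, which gives:
\[
\sum_{n=1}^{N} t_{nk} \Sigma^{-1} \phi_n = \sum_{n=1}^{N} t_{nk} \Sigma^{-1} \mu_k = N_k \Sigma^{-1} \mu_k
\]
Therefore, if we multiply both sides by Σ / N_k, we will obtain (4.161). Now
let’s calculate the derivative of ln p with regard to Σ, which gives:
\[
\begin{aligned}
\frac{\partial \ln p}{\partial \Sigma} &= \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \Big( -\frac{1}{2} \Sigma^{-1} \Big) - \frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} (\phi_n - \mu_k)^T \Sigma^{-1} (\phi_n - \mu_k) \\
&= \sum_{n=1}^{N} \sum_{k=1}^{K} -\frac{t_{nk}}{2} \Sigma^{-1} - \frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{k=1}^{K} \sum_{n=1}^{N} t_{nk} (\phi_n - \mu_k)^T \Sigma^{-1} (\phi_n - \mu_k) \\
&= \sum_{n=1}^{N} -\frac{1}{2} \Sigma^{-1} - \frac{1}{2} \frac{\partial}{\partial \Sigma} \sum_{k=1}^{K} N_k \mathrm{Tr}(\Sigma^{-1} S_k) \\
&= -\frac{N}{2} \Sigma^{-1} + \frac{1}{2} \sum_{k=1}^{K} N_k \Sigma^{-1} S_k \Sigma^{-1}
\end{aligned}
\]
Where we have denoted
\[
S_k = \frac{1}{N_k} \sum_{n=1}^{N} t_{nk} (\phi_n - \mu_k)(\phi_n - \mu_k)^T
\]
Now we set the derivative equal to 0, and rearrange the equation, which
gives:
\[
\Sigma = \sum_{k=1}^{K} \frac{N_k}{N} S_k
\]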
Problem 4.11 Solution
Based on definition, we can write down

p(ϕ|C k ) =

M ∏
L
∏
m=1 l =1

ϕ

ml
µkml

Note that here only one of the value among ϕm1 , ϕm2 , ... ϕmL is 1, and the
others are all 0 because we have used a 1 − of − L binary coding scheme, and
also we have taken advantage of the assumption that the M components of
ϕ are independent conditioned on the class C k . We substitute the expression
above into (4.63), which gives:

$$a_k = \ln p(\phi|C_k)\,p(C_k) = \sum_{m=1}^{M}\sum_{l=1}^{L}\phi_{ml}\ln\mu_{kml} + \ln p(C_k)$$

Hence it is obvious that a k is a linear function of the components of ϕ.
Problem 4.12 Solution
Based on definition, i.e., (4.59), we know that logistic sigmoid has the
form:
$$\sigma(a) = \frac{1}{1+\exp(-a)}$$
Now, we calculate its derivative with regard to a.
$$\frac{d\sigma(a)}{da} = \frac{\exp(-a)}{[\,1+\exp(-a)\,]^2} = \frac{\exp(-a)}{1+\exp(-a)}\cdot\frac{1}{1+\exp(-a)} = [\,1-\sigma(a)\,]\cdot\sigma(a)$$
Just as required.
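As a quick sanity check, the identity dσ/da = σ(1 − σ) can be compared with a central finite difference. This is only an illustrative sketch; the helper names and the test points are assumptions, not part of the original solution.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def sigmoid_grad(a):
    # Closed form derived above: sigma(a) * (1 - sigma(a)).
    s = sigmoid(a)
    return s * (1.0 - s)

def central_diff(f, a, eps=1e-6):
    # Symmetric finite-difference approximation of f'(a).
    return (f(a + eps) - f(a - eps)) / (2.0 * eps)

points = [-3.0, -0.5, 0.0, 1.2, 4.0]   # arbitrary test points
sigmoid_gap = max(abs(central_diff(sigmoid, a) - sigmoid_grad(a)) for a in points)
print(sigmoid_gap)
```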
Problem 4.13 Solution
Let’s follow the hint.
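$$\begin{aligned}
\nabla E(\mathbf{w}) &= -\nabla\sum_{n=1}^{N}\{\,t_n\ln y_n + (1-t_n)\ln(1-y_n)\,\}\\
&= -\sum_{n=1}^{N}\nabla\{\,t_n\ln y_n + (1-t_n)\ln(1-y_n)\,\}\\
&= -\sum_{n=1}^{N}\frac{d\{\,t_n\ln y_n + (1-t_n)\ln(1-y_n)\,\}}{dy_n}\,\frac{dy_n}{da_n}\,\frac{da_n}{d\mathbf{w}}\\
&= -\sum_{n=1}^{N}\Big(\frac{t_n}{y_n}-\frac{1-t_n}{1-y_n}\Big)\cdot y_n(1-y_n)\cdot\phi_n\\
&= -\sum_{n=1}^{N}\frac{t_n-y_n}{y_n(1-y_n)}\cdot y_n(1-y_n)\cdot\phi_n\\
&= -\sum_{n=1}^{N}(t_n-y_n)\,\phi_n\\
&= \sum_{n=1}^{N}(y_n-t_n)\,\phi_n
\end{aligned}$$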

Where we have used yn = σ(a n ), a n = wT ϕn , the chain rules and (4.88).
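The gradient formula can be checked numerically against finite differences of the cross-entropy error. The sketch below uses a small hypothetical design matrix `Phi`, targets `t` and weight vector `w`; all of these values are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def cross_entropy(w, Phi, t):
    """E(w) = -sum_n { t_n ln y_n + (1 - t_n) ln(1 - y_n) }."""
    total = 0.0
    for phi, tn in zip(Phi, t):
        y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        total -= tn * math.log(y) + (1 - tn) * math.log(1 - y)
    return total

def analytic_grad(w, Phi, t):
    """Gradient sum_n (y_n - t_n) phi_n, as derived above."""
    g = [0.0] * len(w)
    for phi, tn in zip(Phi, t):
        y = sigmoid(sum(wi * pi for wi, pi in zip(w, phi)))
        for i, pi in enumerate(phi):
            g[i] += (y - tn) * pi
    return g

def numeric_grad(w, Phi, t, eps=1e-6):
    """Central finite differences, one coordinate at a time."""
    g = []
    for i in range(len(w)):
        wp, wm = list(w), list(w)
        wp[i] += eps
        wm[i] -= eps
        g.append((cross_entropy(wp, Phi, t) - cross_entropy(wm, Phi, t)) / (2 * eps))
    return g

# Hypothetical data: a bias feature plus one input feature.
Phi = [[1.0, 0.5], [1.0, -1.2], [1.0, 2.0], [1.0, 0.1]]
t = [1, 0, 1, 0]
w = [0.3, -0.7]
grad_gap = max(abs(a - b) for a, b in zip(analytic_grad(w, Phi, t), numeric_grad(w, Phi, t)))
print(grad_gap)
```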
Problem 4.14 Solution
By definition, if a dataset is linearly separable, we can find w such that wT ϕ(xn ) > 0 for the points xn of one class and wT ϕ(xm ) < 0 for the points xm of the other class. The decision boundary is then given by wT ϕ(x) = 0. Note that for any point x0 in the dataset, the value of wT ϕ(x0 ) is either positive or negative; it cannot be equal to 0.
Therefore, the maximum likelihood solution for logistic regression is trivial. We suppose for those points xn belonging to class C 1 , we have wT ϕ(xn ) >
0 and wT ϕ(xm ) < 0 for those belonging to class C 2 . According to (4.87), if
|w| → ∞, we have
p(C 1 |ϕ(xn )) = σ(wT ϕ(xn )) → 1
Where we have used wT ϕ(xn ) → +∞. And since wT ϕ(xm ) → −∞, we can
also obtain:

p(C 2 |ϕ(xm )) = 1 − p(C 1 |ϕ(xm )) = 1 − σ(wT ϕ(xm )) → 1

In other words, for the likelihood function, i.e., (4.89), if |w| → ∞ and we label all the points lying on one side of the boundary as class C 1 and those on the other side as class C 2 , then every term in (4.89) approaches its maximum value, i.e., 1, so the likelihood approaches its maximum.
Hence, for a linearly separable dataset, the learning process will drive |w| → ∞ along a separating direction and use the corresponding linear boundary to label the dataset, which can cause a severe over-fitting problem.
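The effect can be seen on a tiny example: for separable one-dimensional data, scaling up a separating weight always increases the log likelihood, so no finite maximizer exists. The data, the zero bias and the scale values below are assumptions chosen purely for illustration.

```python
import math

def log_sigmoid(a):
    """Numerically stable ln sigma(a)."""
    if a >= 0:
        return -math.log1p(math.exp(-a))
    return a - math.log1p(math.exp(a))

def log_likelihood(w, b, xs, ts):
    ll = 0.0
    for x, t in zip(xs, ts):
        a = w * x + b
        # ln(1 - sigma(a)) equals ln sigma(-a).
        ll += log_sigmoid(a) if t == 1 else log_sigmoid(-a)
    return ll

# Hypothetical separable data: negatives on the left, positives on the right.
xs = [-2.0, -1.0, 1.0, 3.0]
ts = [0, 0, 1, 1]
scales = [1.0, 2.0, 5.0, 10.0, 20.0]
lls = [log_likelihood(s, 0.0, xs, ts) for s in scales]
print(lls)
```

The log likelihood stays negative but climbs towards 0 as the scale grows, which is the behaviour described above.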
Problem 4.15 Solution(Waiting for update)
Since yn is the output of the logistic sigmoid function, we know that 0 <
yn < 1 and hence yn (1 − yn ) > 0. Then we use (4.97), for an arbitrary non-zero
real vector a ̸= 0, we have:
$$\begin{aligned}
\mathbf{a}^T\mathbf{H}\mathbf{a} &= \mathbf{a}^T\Big[\sum_{n=1}^{N} y_n(1-y_n)\,\phi_n\phi_n^T\Big]\mathbf{a}\\
&= \sum_{n=1}^{N} y_n(1-y_n)\,(\phi_n^T\mathbf{a})^T(\phi_n^T\mathbf{a})\\
&= \sum_{n=1}^{N} y_n(1-y_n)\,b_n^2
\end{aligned}$$
Where we have denoted $b_n = \phi_n^T\mathbf{a}$. What’s more, at least one of { b 1 , b 2 , ..., b N } is not equal to zero, so the expression above is larger than 0 and hence H is positive definite.
Otherwise, if all the b n = 0, a = [a 1 , a 2 , ..., a M ]T would lie in the null space of the N × M matrix Φ. However, by the rank-nullity theorem, we know that Rank(Φ) + Nullity(Φ) = M, and we have already assumed that the M features are linearly independent, i.e., Rank(Φ) = M , which means that the null space contains only 0. This contradicts a ̸= 0.
Problem 4.16 Solution
We still denote yn = p( t = 1|ϕn ), and then we can write down the log
likelihood by replacing t n with πn in (4.89) and (4.90).
$$\ln p(\mathbf{t}|\mathbf{w}) = \sum_{n=1}^{N}\{\,\pi_n\ln y_n + (1-\pi_n)\ln(1-y_n)\,\}$$

Problem 4.17 Solution
We should discuss in two situations separately, namely j = k and j ̸= k.
When j ̸= k, we have:
$$\frac{\partial y_k}{\partial a_j} = \frac{-\exp(a_k)\cdot\exp(a_j)}{[\,\sum_i \exp(a_i)\,]^2} = -y_k\cdot y_j$$
And when j = k, we have:
$$\frac{\partial y_k}{\partial a_k} = \frac{\exp(a_k)\sum_i \exp(a_i) - \exp(a_k)\exp(a_k)}{[\,\sum_i \exp(a_i)\,]^2} = y_k - y_k^2 = y_k(1-y_k)$$
Therefore, we can obtain:
$$\frac{\partial y_k}{\partial a_j} = y_k(I_{kj}-y_j)$$
Where I k j are the elements of the identity matrix.
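The compact form ∂yk /∂a j = yk (I k j − y j ) can be verified against a finite-difference Jacobian of the softmax. This is a sketch with an arbitrary, made-up activation vector.

```python
import math

def softmax(a):
    m = max(a)                              # subtract the max for numerical stability
    exps = [math.exp(ai - m) for ai in a]
    z = sum(exps)
    return [e / z for e in exps]

def softmax_jacobian(a):
    """J[k][j] = y_k * (I_kj - y_j), the closed form derived above."""
    y = softmax(a)
    K = len(a)
    return [[y[k] * ((1.0 if k == j else 0.0) - y[j]) for j in range(K)] for k in range(K)]

def numeric_jacobian(a, eps=1e-6):
    K = len(a)
    J = [[0.0] * K for _ in range(K)]
    for j in range(K):
        ap, am = list(a), list(a)
        ap[j] += eps
        am[j] -= eps
        yp, ym = softmax(ap), softmax(am)
        for k in range(K):
            J[k][j] = (yp[k] - ym[k]) / (2 * eps)
    return J

a = [0.2, -1.0, 1.5]                        # hypothetical activations
J_exact = softmax_jacobian(a)
J_num = numeric_jacobian(a)
jac_gap = max(abs(J_exact[k][j] - J_num[k][j]) for k in range(3) for j in range(3))
print(jac_gap)
```

Each column of the Jacobian sums to zero, reflecting the fact that the softmax outputs always sum to one.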
Problem 4.18 Solution
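We differentiate every term $t_{nk}\ln y_{nk}$ with regard to $\mathbf{w}_j$.
$$\begin{aligned}
\frac{\partial\, t_{nk}\ln y_{nk}}{\partial \mathbf{w}_j} &= \frac{\partial\, t_{nk}\ln y_{nk}}{\partial y_{nk}}\,\frac{\partial y_{nk}}{\partial a_j}\,\frac{\partial a_j}{\partial \mathbf{w}_j}\\
&= t_{nk}\frac{1}{y_{nk}}\cdot y_{nk}(I_{kj}-y_{nj})\cdot\phi_n\\
&= t_{nk}(I_{kj}-y_{nj})\,\phi_n
\end{aligned}$$
Where we have used (4.105) and (4.106). Next we perform summation over n and k.
$$\begin{aligned}
\nabla_{\mathbf{w}_j}E &= -\sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}(I_{kj}-y_{nj})\,\phi_n\\
&= \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,y_{nj}\,\phi_n - \sum_{n=1}^{N}\sum_{k=1}^{K} t_{nk}\,I_{kj}\,\phi_n\\
&= \sum_{n=1}^{N}\Big[\Big(\sum_{k=1}^{K} t_{nk}\Big)y_{nj}\,\phi_n\Big] - \sum_{n=1}^{N} t_{nj}\,\phi_n\\
&= \sum_{n=1}^{N} y_{nj}\,\phi_n - \sum_{n=1}^{N} t_{nj}\,\phi_n\\
&= \sum_{n=1}^{N}(y_{nj}-t_{nj})\,\phi_n
\end{aligned}$$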

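Where we have used the fact that for arbitrary n, we have $\sum_{k=1}^{K} t_{nk} = 1$.
Problem 4.19 Solution
We write down the log likelihood.
$$\ln p(\mathbf{t}|\mathbf{w}) = \sum_{n=1}^{N}\{\,t_n\ln y_n + (1-t_n)\ln(1-y_n)\,\}$$
It is convenient to work with the negative log likelihood; the gradient and Hessian of ln p itself follow by flipping the sign. We can obtain:
$$\begin{aligned}
-\nabla_{\mathbf{w}}\ln p &= -\sum_{n=1}^{N}\frac{\partial\ln p}{\partial y_n}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial\mathbf{w}}\\
&= -\sum_{n=1}^{N}\Big(\frac{t_n}{y_n}-\frac{1-t_n}{1-y_n}\Big)\Phi'(a_n)\,\phi_n\\
&= \sum_{n=1}^{N}\frac{y_n-t_n}{y_n(1-y_n)}\,\Phi'(a_n)\,\phi_n
\end{aligned}$$
Where we have used y = p( t = 1|a) = Φ(a) and a n = wT ϕn . According to (4.114), we can obtain:
$$\Phi'(a) = \mathcal{N}(\theta|0,1)\big|_{\theta=a} = \frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}a^2\Big)$$
Hence, we can obtain:
$$-\nabla_{\mathbf{w}}\ln p = \sum_{n=1}^{N}\frac{y_n-t_n}{y_n(1-y_n)}\,\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\,\phi_n$$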

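To calculate the Hessian Matrix, we need to first evaluate several derivatives.
$$\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\Big\{\frac{y_n-t_n}{y_n(1-y_n)}\Big\} &= \frac{\partial}{\partial y_n}\Big\{\frac{y_n-t_n}{y_n(1-y_n)}\Big\}\cdot\frac{\partial y_n}{\partial a_n}\cdot\frac{\partial a_n}{\partial\mathbf{w}}\\
&= \frac{y_n(1-y_n)-(y_n-t_n)(1-2y_n)}{[\,y_n(1-y_n)\,]^2}\,\Phi'(a_n)\,\phi_n\\
&= \frac{y_n^2+t_n-2y_nt_n}{y_n^2(1-y_n)^2}\,\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\,\phi_n
\end{aligned}$$
And
$$\frac{\partial}{\partial\mathbf{w}}\Big\{\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\Big\} = \frac{\partial}{\partial a_n}\Big\{\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\Big\}\frac{\partial a_n}{\partial\mathbf{w}} = -\frac{a_n}{\sqrt{2\pi}}\exp\Big(-\frac{a_n^2}{2}\Big)\phi_n$$
Therefore, using the product rule, we can obtain:
$$\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\Big\{\frac{y_n-t_n}{y_n(1-y_n)}\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\Big\}
&= \frac{\partial}{\partial\mathbf{w}}\Big\{\frac{y_n-t_n}{y_n(1-y_n)}\Big\}\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}} + \frac{y_n-t_n}{y_n(1-y_n)}\frac{\partial}{\partial\mathbf{w}}\Big\{\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\Big\}\\
&= \Big[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}} - a_n(y_n-t_n)\Big]\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}\,y_n(1-y_n)}\,\phi_n
\end{aligned}$$
Finally if we perform summation over n, we can obtain the Hessian Matrix of the negative log likelihood:
$$\begin{aligned}
\mathbf{H} = -\nabla\nabla_{\mathbf{w}}\ln p &= \sum_{n=1}^{N}\frac{\partial}{\partial\mathbf{w}}\Big\{\frac{y_n-t_n}{y_n(1-y_n)}\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}}\Big\}\cdot\phi_n^T\\
&= \sum_{n=1}^{N}\Big[\frac{y_n^2+t_n-2y_nt_n}{y_n(1-y_n)}\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}} - a_n(y_n-t_n)\Big]\frac{\exp(-\frac{a_n^2}{2})}{\sqrt{2\pi}\,y_n(1-y_n)}\,\phi_n\phi_n^T
\end{aligned}$$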
Problem 4.20 Solution(waiting for update)
We know that the Hessian Matrix is of size MK × MK , and the ( j, k) th
block with size M × M is given by (4.110), where j, k = 1, 2, ..., K . Therefore,
we can obtain:
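$$\mathbf{u}^T\mathbf{H}\mathbf{u} = \sum_{j=1}^{K}\sum_{k=1}^{K}\mathbf{u}_j^T\,\mathbf{H}_{j,k}\,\mathbf{u}_k \qquad (*)$$
Where we use uk to denote the k th block vector of u with size M × 1, and Hj,k to denote the ( j, k) th block matrix of H with size M × M . Then based on (4.110), i.e., $\mathbf{H}_{j,k} = \sum_n y_{nk}(I_{kj}-y_{nj})\phi_n\phi_n^T$, we further expand (∗):
$$\begin{aligned}
(*) &= \sum_{j=1}^{K}\sum_{k=1}^{K}\mathbf{u}_j^T\Big\{\sum_{n=1}^{N} y_{nk}(I_{kj}-y_{nj})\,\phi_n\phi_n^T\Big\}\mathbf{u}_k\\
&= \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_j^T\{\,y_{nk}(I_{kj}-y_{nj})\,\phi_n\phi_n^T\}\mathbf{u}_k\\
&= \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_j^T\{\,y_{nk}I_{kj}\,\phi_n\phi_n^T\}\mathbf{u}_k - \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_j^T\{\,y_{nk}y_{nj}\,\phi_n\phi_n^T\}\mathbf{u}_k\\
&= \sum_{k=1}^{K}\sum_{n=1}^{N}\mathbf{u}_k^T\{\,y_{nk}\,\phi_n\phi_n^T\}\mathbf{u}_k - \sum_{j=1}^{K}\sum_{k=1}^{K}\sum_{n=1}^{N} y_{nj}\,\mathbf{u}_j^T\{\phi_n\phi_n^T\}\,y_{nk}\,\mathbf{u}_k\\
&= \sum_{n=1}^{N}\Big[\sum_{k=1}^{K} y_{nk}(\mathbf{u}_k^T\phi_n)^2 - \Big(\sum_{k=1}^{K} y_{nk}\,\mathbf{u}_k^T\phi_n\Big)^2\Big]
\end{aligned}$$
For each n, the y nk are non-negative and sum to 1, so by Jensen’s inequality applied to the convex function x², every bracket is non-negative. Hence uT Hu ≥ 0 for arbitrary u, i.e., H is positive semidefinite.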

Problem 4.21 Solution
It is quite obvious.
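$$\begin{aligned}
\Phi(a) &= \int_{-\infty}^{a}\mathcal{N}(\theta|0,1)\,d\theta\\
&= \frac{1}{2} + \int_{0}^{a}\mathcal{N}(\theta|0,1)\,d\theta\\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta\\
&= \frac{1}{2} + \frac{1}{\sqrt{2\pi}}\cdot\frac{\sqrt{\pi}}{2}\cdot\frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta\\
&= \frac{1}{2}\Big(1 + \frac{1}{\sqrt{2}}\cdot\frac{2}{\sqrt{\pi}}\int_{0}^{a}\exp(-\theta^2/2)\,d\theta\Big)\\
&= \frac{1}{2}\Big\{1 + \frac{1}{\sqrt{2}}\,\mathrm{erf}(a)\Big\}
\end{aligned}$$
Where we have used
$$\int_{-\infty}^{0}\mathcal{N}(\theta|0,1)\,d\theta = \frac{1}{2}$$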
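Note that the erf used in this derivation carries exp(−θ²/2) inside the integral, which differs from the conventional error function implemented by most libraries (which uses exp(−θ²)). The sketch below evaluates the convention above by Simpson quadrature and compares the resulting Φ(a) with the standard normal CDF written in terms of Python's `math.erf`; the helper names and the quadrature resolution are assumptions.

```python
import math

def simpson(f, lo, hi, n=2000):
    """Composite Simpson rule with n (even) sub-intervals; also works when hi < lo."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

def erf_this_convention(a):
    """(2 / sqrt(pi)) * integral_0^a exp(-theta^2 / 2) d theta, as used above."""
    return 2.0 / math.sqrt(math.pi) * simpson(lambda th: math.exp(-th * th / 2.0), 0.0, a)

def probit_from_derivation(a):
    return 0.5 * (1.0 + erf_this_convention(a) / math.sqrt(2.0))

def probit_reference(a):
    """Standard normal CDF via the conventional error function math.erf."""
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

points = [-3.0, -1.0, 0.0, 0.5, 2.0]
probit_gap = max(abs(probit_from_derivation(a) - probit_reference(a)) for a in points)
print(probit_gap)
```

In terms of the conventional error function, the same result reads Φ(a) = ½{1 + erf(a/√2)}.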
Problem 4.22 Solution
If we denote f (θ ) = p(D |θ ) p(θ ), we can write:
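$$\begin{aligned}
p(D) &= \int p(D|\theta)\,p(\theta)\,d\theta = \int f(\theta)\,d\theta\\
&\approx f(\theta_{MAP})\,\frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}}\\
&= p(D|\theta_{MAP})\,p(\theta_{MAP})\,\frac{(2\pi)^{M/2}}{|\mathbf{A}|^{1/2}}
\end{aligned}$$
Where θ M AP is the value of θ at the mode of f (θ ), A is the Hessian Matrix of − ln f (θ ) evaluated at θ M AP , and we have also used the Laplace approximation (4.135). Therefore,
$$\ln p(D) \approx \ln p(D|\theta_{MAP}) + \ln p(\theta_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}|$$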

Just as required.
Problem 4.23 Solution
According to (4.137), we can write:
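$$\begin{aligned}
\ln p(D) &= \ln p(D|\theta_{MAP}) + \ln p(\theta_{MAP}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}|\\
&= \ln p(D|\theta_{MAP}) - \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{V}_0| - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) + \frac{M}{2}\ln 2\pi - \frac{1}{2}\ln|\mathbf{A}|\\
&= \ln p(D|\theta_{MAP}) - \frac{1}{2}\ln|\mathbf{V}_0| - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{A}|
\end{aligned}$$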

Where we have used the definition of the multivariate Gaussian Distribution. Then, from (4.138), we can write:
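$$\begin{aligned}
\mathbf{A} &= -\nabla\nabla\ln p(D|\theta_{MAP})\,p(\theta_{MAP})\\
&= -\nabla\nabla\ln p(D|\theta_{MAP}) - \nabla\nabla\ln p(\theta_{MAP})\\
&= \mathbf{H} - \nabla\nabla\Big\{-\frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m})\Big\}\\
&= \mathbf{H} + \nabla\big\{\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m})\big\}\\
&= \mathbf{H} + \mathbf{V}_0^{-1}
\end{aligned}$$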

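Where we have denoted H = −∇∇ ln p(D |θ M AP ). Therefore, the equation above becomes:
$$\begin{aligned}
\ln p(D) &= \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln\big\{|\mathbf{V}_0|\cdot|\mathbf{H}+\mathbf{V}_0^{-1}|\big\}\\
&= \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln\big\{|\mathbf{V}_0\mathbf{H}+\mathbf{I}|\big\}\\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{V}_0| - \frac{1}{2}\ln|\mathbf{H}|\\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{H}| + \text{const}
\end{aligned}$$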

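Where we have used the property of determinants, |A|·|B| = |AB|, and the fact that the prior is broad, i.e., I can be neglected with regard to V0 H. What’s more, since the prior is given in advance, we can view V0 as constant. And if the dataset is large, we can write:
$$\mathbf{H} = \sum_{n=1}^{N}\mathbf{H}_n = N\,\widehat{\mathbf{H}}$$
Where $\widehat{\mathbf{H}} = \frac{1}{N}\sum_{n=1}^{N}\mathbf{H}_n$, and then
$$\begin{aligned}
\ln p(D) &\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|\mathbf{H}| + \text{const}\\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{1}{2}\ln|N\widehat{\mathbf{H}}| + \text{const}\\
&\approx \ln p(D|\theta_{MAP}) - \frac{1}{2}(\theta_{MAP}-\mathbf{m})^T\mathbf{V}_0^{-1}(\theta_{MAP}-\mathbf{m}) - \frac{M}{2}\ln N - \frac{1}{2}\ln|\widehat{\mathbf{H}}| + \text{const}\\
&\approx \ln p(D|\theta_{MAP}) - \frac{M}{2}\ln N
\end{aligned}$$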

This is because when N >> 1, other terms can be neglected.
Problem 4.24 Solution(Waiting for updating)
Problem 4.25 Solution
We first need to obtain the expression for the first derivative of probit
function Φ(λa) with regard to a. According to (4.114), we can write down:

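$$\frac{d}{da}\Phi(\lambda a) = \frac{d\Phi(\lambda a)}{d(\lambda a)}\cdot\frac{d(\lambda a)}{da} = \frac{\lambda}{\sqrt{2\pi}}\exp\Big\{-\frac{1}{2}(\lambda a)^2\Big\}$$
Which further gives:
$$\frac{d}{da}\Phi(\lambda a)\Big|_{a=0} = \frac{\lambda}{\sqrt{2\pi}}$$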

And for the logistic sigmoid function, according to (4.88), we have
$$\frac{d\sigma}{da}\Big|_{a=0} = \sigma(0)\,(1-\sigma(0)) = 0.5\times 0.5 = \frac{1}{4}$$
Where we have used σ(0) = 0.5. Setting the two derivatives at the origin equal, we have:
$$\frac{\lambda}{\sqrt{2\pi}} = \frac{1}{4}$$
i.e., $\lambda = \sqrt{2\pi}/4$, and hence $\lambda^2 = \pi/8$ follows.
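To see how well the rescaled probit matches the logistic sigmoid, one can compare the two curves on a grid. The sketch below uses the conventional `math.erf` to evaluate Φ; the grid range and spacing are arbitrary choices made for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def probit(a):
    # Standard normal CDF expressed through the conventional error function.
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

lam = math.sqrt(math.pi / 8.0)

# Slopes at the origin: sigma'(0) = 1/4 and d/da Phi(lam * a) at a = 0 is lam / sqrt(2 pi).
slope_sigmoid = sigmoid(0.0) * (1.0 - sigmoid(0.0))
slope_probit = lam / math.sqrt(2.0 * math.pi)

# Largest absolute gap between the two curves on a grid over [-8, 8].
grid = [i / 100.0 for i in range(-800, 801)]
max_gap = max(abs(sigmoid(a) - probit(lam * a)) for a in grid)
print(slope_sigmoid, slope_probit, max_gap)
```

With this choice of λ the slopes agree at the origin and the two functions stay close over the whole grid.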
Problem 4.26 Solution
We will prove (4.152) in a simpler and more intuitive way. But first, we need to prove a trivial yet useful statement: suppose a random variable X follows a normal distribution, X ∼ N ( X |µ, σ2 ); then for a given real number x, the probability of X ≤ x is $P(X\le x) = \Phi(\frac{x-\mu}{\sigma})$. We can see this by writing down the integral:
$$\begin{aligned}
P(X\le x) &= \int_{-\infty}^{x}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big[-\frac{1}{2\sigma^2}(X-\mu)^2\Big]dX\\
&= \int_{-\infty}^{\frac{x-\mu}{\sigma}}\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big(-\frac{1}{2}\gamma^2\Big)\,\sigma\,d\gamma\\
&= \int_{-\infty}^{\frac{x-\mu}{\sigma}}\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{1}{2}\gamma^2\Big)\,d\gamma\\
&= \Phi\Big(\frac{x-\mu}{\sigma}\Big)
\end{aligned}$$
Where we have changed the variable X = µ + σγ. Now consider two independent random variables X ∼ N (0, λ−2 ) and Y ∼ N (µ, σ2 ). We first calculate the conditional probability P ( X ≤ Y | Y = a):
$$P(X\le Y\,|\,Y=a) = P(X\le a) = \Phi\Big(\frac{a-0}{\lambda^{-1}}\Big) = \Phi(\lambda a)$$
Together with the law of total probability, we can obtain:
$$\begin{aligned}
P(X\le Y) &= \int_{-\infty}^{+\infty}P(X\le Y\,|\,Y=a)\,\mathrm{pdf}(Y=a)\,da\\
&= \int_{-\infty}^{+\infty}\Phi(\lambda a)\,\mathcal{N}(a|\mu,\sigma^2)\,da
\end{aligned}$$
Where pd f (·) denotes the probability density function and we have also used pd f (Y ) = N (µ, σ2 ). What’s more, since X and Y are independent, X − Y also follows a normal distribution, with:
$$E[X-Y] = E[X]-E[Y] = 0-\mu = -\mu$$
And
$$\mathrm{var}[X-Y] = \mathrm{var}[X]+\mathrm{var}[Y] = \lambda^{-2}+\sigma^2$$
Therefore, X − Y ∼ N (−µ, λ−2 + σ2 ) and it follows that:
$$P(X-Y\le 0) = \Phi\Big(\frac{0-(-\mu)}{\sqrt{\lambda^{-2}+\sigma^2}}\Big) = \Phi\Big(\frac{\mu}{\sqrt{\lambda^{-2}+\sigma^2}}\Big)$$
Since P ( X ≤ Y ) = P ( X − Y ≤ 0), we obtain what has been required.
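The identity can also be checked by numerical quadrature. The sketch below integrates Φ(λa) N(a|µ, σ²) with a composite Simpson rule and compares it with the closed form; the parameter values, the truncation at ±10σ and the helper names are assumptions made for this illustration.

```python
import math

def probit(a):
    return 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))

def normal_pdf(a, mu, sigma):
    z = (a - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def simpson(f, lo, hi, n):
    """Composite Simpson rule with n (even) sub-intervals."""
    h = (hi - lo) / n
    total = f(lo) + f(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(lo + i * h)
    return total * h / 3.0

lam, mu, sigma = 0.7, 0.5, 1.3   # hypothetical parameter values

# Left-hand side: the Gaussian-probit convolution, truncated to mu +/- 10 sigma.
lhs = simpson(lambda a: probit(lam * a) * normal_pdf(a, mu, sigma),
              mu - 10.0 * sigma, mu + 10.0 * sigma, 8000)

# Right-hand side: Phi(mu / sqrt(lambda^-2 + sigma^2)).
rhs = probit(mu / math.sqrt(lam ** -2 + sigma ** 2))
print(lhs, rhs)
```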


0.5 Neural Networks
Problem 5.1 Solution
Based on definition of tanh(·), we can obtain:

$$\begin{aligned}
\tanh(a) &= \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}}\\
&= -1 + \frac{2e^{a}}{e^{a}+e^{-a}}\\
&= -1 + 2\,\frac{1}{1+e^{-2a}}\\
&= 2\sigma(2a) - 1
\end{aligned}$$
Suppose we have parameters $w_{ji}^{(1s)}$, $w_{j0}^{(1s)}$, $w_{kj}^{(2s)}$ and $w_{k0}^{(2s)}$ for a network whose hidden units use the logistic sigmoid function as activation, and $w_{ji}^{(1t)}$, $w_{j0}^{(1t)}$, $w_{kj}^{(2t)}$ and $w_{k0}^{(2t)}$ for another one using tanh(·). For the network using tanh(·) as activation, we can write down the following expression by using (5.4):
$$\begin{aligned}
a_k^{(t)} &= \sum_{j=1}^{M} w_{kj}^{(2t)}\tanh(a_j^{(t)}) + w_{k0}^{(2t)}\\
&= \sum_{j=1}^{M} w_{kj}^{(2t)}\big[\,2\sigma(2a_j^{(t)}) - 1\,\big] + w_{k0}^{(2t)}\\
&= \sum_{j=1}^{M} 2w_{kj}^{(2t)}\sigma(2a_j^{(t)}) + \Big[-\sum_{j=1}^{M} w_{kj}^{(2t)} + w_{k0}^{(2t)}\Big]
\end{aligned}$$

What’s more, we also have:
$$a_k^{(s)} = \sum_{j=1}^{M} w_{kj}^{(2s)}\sigma(a_j^{(s)}) + w_{k0}^{(2s)}$$
To make the two networks equivalent, i.e., $a_k^{(s)} = a_k^{(t)}$, we should make sure:
$$\begin{cases}
a_j^{(s)} = 2a_j^{(t)}\\
w_{kj}^{(2s)} = 2w_{kj}^{(2t)}\\
w_{k0}^{(2s)} = -\sum_{j=1}^{M} w_{kj}^{(2t)} + w_{k0}^{(2t)}
\end{cases}$$
Note that the first condition can be achieved by simply enforcing:
$$w_{ji}^{(1s)} = 2w_{ji}^{(1t)},\quad\text{and}\quad w_{j0}^{(1s)} = 2w_{j0}^{(1t)}$$

Therefore, these two networks are equivalent under a linear transformation.
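The parameter mapping above can be exercised on a small network. The following sketch builds a two-layer tanh network with made-up weights, converts its parameters by the rules just derived, and compares the outputs with the equivalent sigmoid network; every weight value and function name here is hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, b1, W2, b2, act):
    """Two-layer network with linear outputs and hidden activation `act`."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    return [sum(w * z for w, z in zip(row, hidden)) + b for row, b in zip(W2, b2)]

def tanh_to_sigmoid_params(W1, b1, W2, b2):
    """Map tanh-network parameters to an equivalent sigmoid network."""
    W1s = [[2.0 * w for w in row] for row in W1]       # w_ji^(1s) = 2 w_ji^(1t)
    b1s = [2.0 * b for b in b1]                        # w_j0^(1s) = 2 w_j0^(1t)
    W2s = [[2.0 * w for w in row] for row in W2]       # w_kj^(2s) = 2 w_kj^(2t)
    b2s = [b - sum(row) for row, b in zip(W2, b2)]     # w_k0^(2s) = w_k0^(2t) - sum_j w_kj^(2t)
    return W1s, b1s, W2s, b2s

# Hypothetical 2-3-2 tanh network.
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
b1 = [0.1, -0.2, 0.05]
W2 = [[1.0, -2.0, 0.5], [0.3, 0.4, -1.1]]
b2 = [0.2, -0.6]

x = [0.9, -1.4]
out_tanh = forward(x, W1, b1, W2, b2, math.tanh)
out_sigmoid = forward(x, *tanh_to_sigmoid_params(W1, b1, W2, b2), sigmoid)
net_gap = max(abs(a - b) for a, b in zip(out_tanh, out_sigmoid))
print(net_gap)
```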
Problem 5.2 Solution

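It is obvious. We write down the likelihood.

$$p(\mathbf{T}|\mathbf{X},\mathbf{w}) = \prod_{n=1}^{N}\mathcal{N}(\mathbf{t}_n\,|\,\mathbf{y}(\mathbf{x}_n,\mathbf{w}),\,\beta^{-1}\mathbf{I})$$

Taking the negative logarithm, we can obtain:

$$E(\mathbf{w},\beta) = -\ln p(\mathbf{T}|\mathbf{X},\mathbf{w}) = \frac{\beta}{2}\sum_{n=1}^{N}\big[(\mathbf{y}(\mathbf{x}_n,\mathbf{w})-\mathbf{t}_n)^T(\mathbf{y}(\mathbf{x}_n,\mathbf{w})-\mathbf{t}_n)\big] - \frac{NK}{2}\ln\beta + \text{const}$$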

Here we have used const to denote the term independent of both w and β. Note that here we have used the definition of the multivariate Gaussian Distribution. What’s more, we see that the covariance matrix β−1 I and the weight parameter w have decoupled, which is distinct from the next problem. We can first solve for wML by minimizing the first term on the right of the equation above, or equivalently (5.11), treating β as fixed. Then, setting the derivative of E (w, β) with regard to β equal to zero, we can obtain (5.17) and hence β ML .
Problem 5.3 Solution
Following the process in the previous question, we first write down the
negative logarithm of the likelihood function.

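$$E(\mathbf{w},\Sigma) = \frac{1}{2}\sum_{n=1}^{N}\big\{[\mathbf{y}(\mathbf{x}_n,\mathbf{w})-\mathbf{t}_n]^T\Sigma^{-1}[\mathbf{y}(\mathbf{x}_n,\mathbf{w})-\mathbf{t}_n]\big\} + \frac{N}{2}\ln|\Sigma| + \text{const} \qquad (*)$$

Note here we have assumed Σ is unknown and const denotes the term independent of both w and Σ. In the first situation, if Σ is fixed and known, the equation above will reduce to:

$$E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\big\{[\mathbf{y}(\mathbf{x}_n,\mathbf{w})-\mathbf{t}_n]^T\Sigma^{-1}[\mathbf{y}(\mathbf{x}_n,\mathbf{w})-\mathbf{t}_n]\big\} + \text{const}$$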

We can simply solve wML by minimizing it. If Σ is unknown, since Σ is
in the first term on the right of (∗), solving wML will involve Σ. Note that in
the previous problem, the main reason that they can decouple is due to the
independent assumption, i.e., Σ reduces to β−1 I, so that we can bring β to the
front and view it as a fixed multiplying factor when solving wML .
Problem 5.4 Solution
Based on (5.20), the current conditional distribution of targets, considering mislabel, given input x and weight w is:

p( t = 1|x, w) = (1 − ϵ) · p( t r = 1|x, w) + ϵ · p( t r = 0|x, w)

Note that here we use t to denote the observed target label and t r to denote its real label. Our network aims to predict the real label t r rather than t, i.e., p( t r = 1|x, w) = y(x, w), hence we see that:
$$p(t=1|\mathbf{x},\mathbf{w}) = (1-\epsilon)\cdot y(\mathbf{x},\mathbf{w}) + \epsilon\cdot\big[\,1-y(\mathbf{x},\mathbf{w})\,\big] \qquad (*)$$
Also, it is the same for p( t = 0|x, w):
$$p(t=0|\mathbf{x},\mathbf{w}) = (1-\epsilon)\cdot\big[\,1-y(\mathbf{x},\mathbf{w})\,\big] + \epsilon\cdot y(\mathbf{x},\mathbf{w}) \qquad (**)$$
Combining (∗) and (∗∗), we can obtain:
$$p(t|\mathbf{x},\mathbf{w}) = (1-\epsilon)\cdot y^{t}(1-y)^{1-t} + \epsilon\cdot(1-y)^{t}y^{1-t}$$
Where y is short for y(x, w). Therefore, taking the negative logarithm, we can obtain the error function:
$$E(\mathbf{w}) = -\sum_{n=1}^{N}\ln\big\{(1-\epsilon)\cdot y_n^{t_n}(1-y_n)^{1-t_n} + \epsilon\cdot(1-y_n)^{t_n}y_n^{1-t_n}\big\}$$

When ϵ = 0, it is obvious that the equation above will reduce to (5.21).
Problem 5.5 Solution
It is obvious by using (5.22).

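$$\begin{aligned}
E(\mathbf{w}) &= -\ln\prod_{n=1}^{N}p(\mathbf{t}_n|\mathbf{x}_n,\mathbf{w})\\
&= -\ln\prod_{n=1}^{N}\prod_{k=1}^{K}y_k(\mathbf{x}_n,\mathbf{w})^{t_{nk}}\big[\,1-y_k(\mathbf{x}_n,\mathbf{w})\,\big]^{1-t_{nk}}\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\big\{y_k(\mathbf{x}_n,\mathbf{w})^{t_{nk}}\big[\,1-y_k(\mathbf{x}_n,\mathbf{w})\,\big]^{1-t_{nk}}\big\}\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\big[\,y_{nk}^{t_{nk}}(1-y_{nk})^{1-t_{nk}}\,\big]\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\big\{t_{nk}\ln y_{nk} + (1-t_{nk})\ln(1-y_{nk})\big\}
\end{aligned}$$

Where we have denoted

$$y_{nk} = y_k(\mathbf{x}_n,\mathbf{w})$$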
Problem 5.6 Solution
We know that yk = σ(a k ), where σ(·) represents the logistic sigmoid function. Moreover,
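$$\frac{d\sigma}{da} = \sigma(1-\sigma)$$
Therefore,
$$\begin{aligned}
\frac{dE(\mathbf{w})}{da_k} &= -t_k\frac{1}{y_k}\big[\,y_k(1-y_k)\,\big] + (1-t_k)\frac{1}{1-y_k}\big[\,y_k(1-y_k)\,\big]\\
&= \big[\,y_k(1-y_k)\,\big]\Big[\frac{1-t_k}{1-y_k}-\frac{t_k}{y_k}\Big]\\
&= (1-t_k)\,y_k - t_k(1-y_k)\\
&= y_k - t_k
\end{aligned}$$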

Just as required.
Problem 5.7 Solution
It is similar to the previous problem. First we denote ykn = yk (xn , w). If
we use softmax function as activation for the output unit, according to (4.106),
we have:
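$$\frac{dy_{kn}}{da_j} = y_{kn}(I_{kj}-y_{jn})$$
Therefore,
$$\begin{aligned}
\frac{dE(\mathbf{w})}{da_j} &= \frac{d}{da_j}\Big\{-\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}\ln y_k(\mathbf{x}_n,\mathbf{w})\Big\}\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{d}{da_j}\big\{t_{kn}\ln y_{kn}\big\}\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}\frac{1}{y_{kn}}\big[\,y_{kn}(I_{kj}-y_{jn})\,\big]\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}(t_{kn}I_{kj}-t_{kn}y_{jn})\\
&= -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}I_{kj} + \sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}y_{jn}\\
&= -\sum_{n=1}^{N}t_{jn} + \sum_{n=1}^{N}y_{jn}\\
&= \sum_{n=1}^{N}(y_{jn}-t_{jn})
\end{aligned}$$
Where we have used the fact that I k j = 1 only when k = j (and is 0 otherwise), and that $\sum_{k=1}^{K}t_{kn} = 1$.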

Problem 5.8 Solution
It is obvious based on definition of ’tanh’, i.e., (5.59).
$$\begin{aligned}
\frac{d}{da}\tanh(a) &= \frac{(e^{a}+e^{-a})(e^{a}+e^{-a}) - (e^{a}-e^{-a})(e^{a}-e^{-a})}{(e^{a}+e^{-a})^2}\\
&= 1 - \frac{(e^{a}-e^{-a})^2}{(e^{a}+e^{-a})^2}\\
&= 1 - \tanh(a)^2
\end{aligned}$$
Problem 5.9 Solution
We know that the logistic sigmoid function σ(a) ∈ [0, 1], therefore if we
perform a linear transformation h(a) = 2σ(a) − 1, we can find a mapping function h(a) from (−∞, +∞) to [−1, 1]. In this case, the conditional distribution
of targets given inputs can be similarly written as:

$$p(t|\mathbf{x},\mathbf{w}) = \Big[\frac{1+y(\mathbf{x},\mathbf{w})}{2}\Big]^{(1+t)/2}\Big[\frac{1-y(\mathbf{x},\mathbf{w})}{2}\Big]^{(1-t)/2}$$

Where [ 1 + y(x, w) ]/2 represents the conditional probability p(C 1 | x). Since now y(x, w) ∈ [−1, 1], we also need to perform the linear transformation to make it satisfy the constraint for probability. Then we can further obtain:

$$\begin{aligned}
E(\mathbf{w}) &= -\sum_{n=1}^{N}\Big\{\frac{1+t_n}{2}\ln\frac{1+y_n}{2} + \frac{1-t_n}{2}\ln\frac{1-y_n}{2}\Big\}\\
&= -\frac{1}{2}\sum_{n=1}^{N}\big\{(1+t_n)\ln(1+y_n) + (1-t_n)\ln(1-y_n)\big\} + N\ln 2
\end{aligned}$$

Problem 5.10 Solution
It is obvious. Suppose H is positive definite, i.e., (5.37) holds. We set v
equals to the eigenvector of H, i.e., v = ui which gives:
vT Hv = vT (Hv) = ui T λ i ui = λ i ||ui ||2
Therefore, every λ i should be positive. On the other hand, If all the eigenvalues λ i are positive, from (5.38) and (5.39), we see that H is positive definite.
Problem 5.11 Solution
It is obvious. We follow (5.35) and then write the error function in the
form of (5.36). To obtain the contour, we enforce E (w) to equal to a constant
C.
$$E(\mathbf{w}) = E(\mathbf{w}^*) + \frac{1}{2}\sum_i\lambda_i\alpha_i^2 = C$$
We rearrange the equation above, and then obtain:
$$\sum_i\lambda_i\alpha_i^2 = B$$

Where B = 2C − 2E (w∗ ) is a constant. Therefore, the contours of constant error are ellipses whose axes are aligned with the eigenvectors ui of the Hessian Matrix H. The length of the j th axis is given by setting all α i = 0 for i ̸= j :
$$\alpha_j = \sqrt{\frac{B}{\lambda_j}}$$
In other words, the length is inversely proportional to the square root of the corresponding eigenvalue λ j .
Problem 5.12 Solution
If H is positive definite, we know the second term on the right side of
(5.32) will be positive for arbitrary w. Therefore, E (w∗ ) is a local minimum.
On the other hand, if w∗ is a local minimum, we have
1
E (w∗ ) − E (w) = − (w − w∗ )T H(w − w∗ ) < 0
2
In other words, for arbitrary w, (w − w∗ )T H(w − w∗ ) > 0, according to the
previous problem, we know that this means H is positive definite.
Problem 5.13 Solution
It is obvious. Suppose that there are W adaptive parameters in the network. Therefore, b has W independent parameters. Since H is symmetric,
there should be W (W + 1)/2 independent parameters in it. Therefore, there
are W + W (W + 1)/2 = W (W + 3)/2 parameters in total.
Problem 5.14 Solution
It is obvious. Since we have

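$$E_n(w_{ji}+\epsilon) = E_n(w_{ji}) + \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)$$
And
$$E_n(w_{ji}-\epsilon) = E_n(w_{ji}) - \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)$$
Subtracting the second equation from the first gives
$$E_n(w_{ji}+\epsilon) - E_n(w_{ji}-\epsilon) = 2\epsilon E_n'(w_{ji}) + O(\epsilon^3)$$
Rearranging the equation above, i.e., dividing by 2ϵ, we obtain what has been required, with a residual of order O(ϵ²).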
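The difference in accuracy between one-sided and central differences is easy to observe numerically: shrinking ϵ by a factor of 10 should reduce the central-difference error by roughly 100 and the forward-difference error by roughly 10. The sketch below uses exp as a stand-in for the error function E n , which is an assumption made purely for illustration.

```python
import math

def forward_diff(f, x, eps):
    # One-sided difference: error of order eps.
    return (f(x + eps) - f(x)) / eps

def central_diff(f, x, eps):
    # Symmetric difference: error of order eps squared.
    return (f(x + eps) - f(x - eps)) / (2.0 * eps)

x0 = 1.0
exact = math.exp(x0)   # derivative of exp is exp

err_central_big = abs(central_diff(math.exp, x0, 1e-2) - exact)
err_central_small = abs(central_diff(math.exp, x0, 1e-3) - exact)
err_forward_big = abs(forward_diff(math.exp, x0, 1e-2) - exact)
err_forward_small = abs(forward_diff(math.exp, x0, 1e-3) - exact)

central_ratio = err_central_big / err_central_small
forward_ratio = err_forward_big / err_forward_small
print(central_ratio, forward_ratio)
```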
Problem 5.15 Solution
It is obvious. The back propagation formalism starts from performing
summation near the input, as shown in (5.73). By symmetry, the forward
propagation formalism should start near the output.

$$J_{ki} = \frac{\partial y_k}{\partial x_i} = \frac{\partial h(a_k)}{\partial x_i} = h'(a_k)\frac{\partial a_k}{\partial x_i} \qquad (*)$$

Where h(·) is the activation function at the output node a k . Considering all the units j which have links to unit k:
$$\frac{\partial a_k}{\partial x_i} = \sum_j\frac{\partial a_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} = \sum_j w_{kj}\,h'(a_j)\frac{\partial a_j}{\partial x_i} \qquad (**)$$

Where we have used:

$$a_k = \sum_j w_{kj}z_j,\qquad z_j = h(a_j)$$

It is similar for ∂a j /∂ x i . In this way we have obtained a recursive formula starting from the input node:
$$\frac{\partial a_l}{\partial x_i} = \begin{cases} w_{li}, & \text{if there is a link from input unit } i \text{ to } l\\ 0, & \text{if there isn't a link from input unit } i \text{ to } l\end{cases}$$
Using recursive formula (∗∗) and then (∗), we can obtain the Jacobian Matrix.
Problem 5.16 Solution
It is obvious. We begin by writing down the error function.

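$$E = \frac{1}{2}\sum_{n=1}^{N}\|\mathbf{y}_n-\mathbf{t}_n\|^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})^2$$

Where the subscript m denotes the mth element of the vector. Then we can write down the Hessian Matrix as before.
$$\mathbf{H} = \nabla\nabla E = \sum_{n=1}^{N}\sum_{m=1}^{M}\nabla y_{n,m}(\nabla y_{n,m})^T + \sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})\nabla\nabla y_{n,m}$$

Similarly, we now know that the Hessian Matrix can be approximated as:
$$\mathbf{H} \simeq \sum_{n=1}^{N}\sum_{m=1}^{M}\mathbf{b}_{n,m}\mathbf{b}_{n,m}^T$$

Where we have defined:
$$\mathbf{b}_{n,m} = \nabla y_{n,m}$$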
Problem 5.17 Solution
It is obvious.
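$$\begin{aligned}
\frac{\partial^2E}{\partial w_r\partial w_s} &= \frac{\partial}{\partial w_r}\,\frac{1}{2}\iint 2(y-t)\frac{\partial y}{\partial w_s}\,p(\mathbf{x},t)\,d\mathbf{x}\,dt\\
&= \iint\Big[(y-t)\frac{\partial^2y}{\partial w_r\partial w_s} + \frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\Big]p(\mathbf{x},t)\,d\mathbf{x}\,dt
\end{aligned}$$
Since we know that
$$\begin{aligned}
\iint(y-t)\frac{\partial^2y}{\partial w_r\partial w_s}\,p(\mathbf{x},t)\,d\mathbf{x}\,dt &= \iint(y-t)\frac{\partial^2y}{\partial w_r\partial w_s}\,p(t|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}\,dt\\
&= \int\frac{\partial^2y}{\partial w_r\partial w_s}\Big\{\int(y-t)\,p(t|\mathbf{x})\,dt\Big\}p(\mathbf{x})\,d\mathbf{x}\\
&= 0
\end{aligned}$$
Note that in the last step, we have used $y = \int t\,p(t|\mathbf{x})\,dt$. Then we substitute it into the second derivative, which gives,
$$\begin{aligned}
\frac{\partial^2E}{\partial w_r\partial w_s} &= \iint\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(\mathbf{x},t)\,d\mathbf{x}\,dt\\
&= \int\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(\mathbf{x})\,d\mathbf{x}
\end{aligned}$$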

Problem 5.18 Solution
By analogy with section 5.3.2, we denote $w_{ki}^{\text{skip}}$ as those parameters corresponding to skip-layer connections, i.e., those connecting the input unit i with the output unit k. Note that the discussion in section 5.3.2 is still correct and now we only need to obtain the derivative of the error function with respect to the additional parameters $w_{ki}^{\text{skip}}$.
$$\frac{\partial E_n}{\partial w_{ki}^{\text{skip}}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ki}^{\text{skip}}} = \delta_k x_i$$

Where we have used a k = yk due to linear activation at the output unit and:
$$y_k = \sum_{j=0}^{M}w_{kj}^{(2)}z_j + \sum_i w_{ki}^{\text{skip}}x_i$$

Where the first term on the right side corresponds to the information conveyed from the hidden units to the output, and the second term corresponds to the information conveyed directly from the input to the output.
Problem 5.19 Solution
The error function is given by (5.21). Therefore, we can obtain:
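$$\begin{aligned}
\nabla E(\mathbf{w}) &= \sum_{n=1}^{N}\frac{\partial E}{\partial a_n}\nabla a_n\\
&= -\sum_{n=1}^{N}\frac{\partial}{\partial a_n}\big[\,t_n\ln y_n + (1-t_n)\ln(1-y_n)\,\big]\nabla a_n\\
&= -\sum_{n=1}^{N}\Big\{\frac{\partial(t_n\ln y_n)}{\partial y_n}\frac{\partial y_n}{\partial a_n} + \frac{\partial(1-t_n)\ln(1-y_n)}{\partial y_n}\frac{\partial y_n}{\partial a_n}\Big\}\nabla a_n\\
&= -\sum_{n=1}^{N}\Big[\frac{t_n}{y_n}\cdot y_n(1-y_n) + (1-t_n)\frac{-1}{1-y_n}\cdot y_n(1-y_n)\Big]\nabla a_n\\
&= -\sum_{n=1}^{N}\big[\,t_n(1-y_n) - (1-t_n)y_n\,\big]\nabla a_n\\
&= \sum_{n=1}^{N}(y_n-t_n)\nabla a_n
\end{aligned}$$

Where we have used the conclusion of problem 5.6. Now we calculate the second derivative.
$$\nabla\nabla E(\mathbf{w}) = \sum_{n=1}^{N}\big\{y_n(1-y_n)\nabla a_n(\nabla a_n)^T + (y_n-t_n)\nabla\nabla a_n\big\}$$

Similarly, we can drop the last term, which gives exactly what has been asked.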
Problem 5.20 Solution(waiting for update)
We begin by writing down the error function.

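$$E(\mathbf{w}) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{nk}$$

Here we assume that the output of the network has K units in total and there are W weight parameters in the network. We first calculate the first derivative:
$$\begin{aligned}
\nabla E &= \sum_{n=1}^{N}\frac{dE}{d\mathbf{a}_n}\cdot\nabla\mathbf{a}_n\\
&= -\sum_{n=1}^{N}\Big[\frac{d}{d\mathbf{a}_n}\Big(\sum_{k=1}^{K}t_{nk}\ln y_{nk}\Big)\Big]\cdot\nabla\mathbf{a}_n\\
&= \sum_{n=1}^{N}\mathbf{c}_n\cdot\nabla\mathbf{a}_n
\end{aligned}$$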

Note that here cn = dE / d an is a vector with size K × 1, ∇an is a matrix with size K × W . Moreover, the operator · means inner product, which gives ∇E as a vector with size 1 × W . According to (4.106), we can obtain the j th element of cn :
$$\begin{aligned}
c_{n,j} &= -\frac{\partial}{\partial a_j}\Big(\sum_{k=1}^{K}t_{nk}\ln y_{nk}\Big)\\
&= -\sum_{k=1}^{K}\frac{\partial}{\partial a_j}(t_{nk}\ln y_{nk})\\
&= -\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}\,y_{nk}(I_{kj}-y_{nj})\\
&= -\sum_{k=1}^{K}t_{nk}I_{kj} + \sum_{k=1}^{K}t_{nk}y_{nj}\\
&= -t_{nj} + y_{nj}\Big(\sum_{k=1}^{K}t_{nk}\Big)\\
&= y_{nj}-t_{nj}
\end{aligned}$$

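Now we calculate the second derivative:
$$\nabla\nabla E = \sum_{n=1}^{N}\Big[\Big(\frac{d\mathbf{c}_n}{d\mathbf{a}_n}\nabla\mathbf{a}_n\Big)\cdot\nabla\mathbf{a}_n + \mathbf{c}_n\nabla\nabla\mathbf{a}_n\Big]$$

Here d cn / d an is a matrix with size K × K . The second term can be neglected as before, which gives:
$$\mathbf{H} = \sum_{n=1}^{N}\Big(\frac{d\mathbf{c}_n}{d\mathbf{a}_n}\nabla\mathbf{a}_n\Big)\cdot\nabla\mathbf{a}_n$$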

Problem 5.21 Solution
We first write down the expression of Hessian Matrix in the case of K
outputs.
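$$\mathbf{H}_{N,K} = \sum_{n=1}^{N}\sum_{k=1}^{K}\mathbf{b}_{n,k}\mathbf{b}_{n,k}^T$$

Where bn,k = ∇w an,k . Therefore, we have:
$$\mathbf{H}_{N+1,K} = \mathbf{H}_{N,K} + \sum_{k=1}^{K}\mathbf{b}_{N+1,k}\mathbf{b}_{N+1,k}^T = \mathbf{H}_{N,K} + \mathbf{B}_{N+1}\mathbf{B}_{N+1}^T$$

Where B N +1 = [b N +1,1 , b N +1,2 , ..., b N +1,K ] is a matrix with size W × K , and here W is the total number of the parameters in the network. By analogy with (5.88)-(5.89), but using the Woodbury identity because B N +1 is a matrix rather than a single vector, we can obtain:
$$\mathbf{H}_{N+1,K}^{-1} = \mathbf{H}_{N,K}^{-1} - \mathbf{H}_{N,K}^{-1}\mathbf{B}_{N+1}\big(\mathbf{I}_K + \mathbf{B}_{N+1}^T\mathbf{H}_{N,K}^{-1}\mathbf{B}_{N+1}\big)^{-1}\mathbf{B}_{N+1}^T\mathbf{H}_{N,K}^{-1} \qquad (*)$$

Furthermore, similarly, we have:
$$\mathbf{H}_{N+1,K+1} = \mathbf{H}_{N+1,K} + \sum_{n=1}^{N+1}\mathbf{b}_{n,K+1}\mathbf{b}_{n,K+1}^T = \mathbf{H}_{N+1,K} + \mathbf{B}_{K+1}\mathbf{B}_{K+1}^T$$

Where BK +1 = [b1,K +1 , b2,K +1 , ..., b N +1,K +1 ] is a matrix with size W × ( N + 1). Also, we can obtain:
$$\mathbf{H}_{N+1,K+1}^{-1} = \mathbf{H}_{N+1,K}^{-1} - \mathbf{H}_{N+1,K}^{-1}\mathbf{B}_{K+1}\big(\mathbf{I}_{N+1} + \mathbf{B}_{K+1}^T\mathbf{H}_{N+1,K}^{-1}\mathbf{B}_{K+1}\big)^{-1}\mathbf{B}_{K+1}^T\mathbf{H}_{N+1,K}^{-1}$$

Where $\mathbf{H}_{N+1,K}^{-1}$ is defined by (∗). If we substitute (∗) into the expression above, we can obtain the relationship between $\mathbf{H}_{N+1,K+1}^{-1}$ and $\mathbf{H}_{N,K}^{-1}$.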

Problem 5.22 Solution

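We begin by handling the first case.
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{kj}^{(2)}\partial w_{k'j'}^{(2)}} &= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial w_{k'j'}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{k'j'}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial\sum_{j'}w_{k'j'}z_{j'}}{\partial w_{k'j'}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}z_{j'}\Big)\\
&= \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)z_{j'} + \frac{\partial E_n}{\partial a_{k'}}\frac{\partial z_{j'}}{\partial w_{kj}^{(2)}}\\
&= \frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)\frac{\partial a_k}{\partial w_{kj}^{(2)}}z_{j'} + 0\\
&= \frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)z_jz_{j'}\\
&= z_jz_{j'}M_{kk'}
\end{aligned}$$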

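Then we focus on the second case, first assuming j ̸= j ′ :
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{j'i'}^{(1)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial w_{j'i'}^{(1)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{j'i'}^{(1)}}\Big)\\
&= \sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}h'(a_{j'})x_{i'}\Big)\\
&= \sum_{k'}h'(a_{j'})x_{i'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\Big)\\
&= \sum_{k'}h'(a_{j'})x_{i'}\sum_{k}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\Big)\frac{\partial a_k}{\partial w_{ji}^{(1)}}\\
&= \sum_{k'}h'(a_{j'})x_{i'}\sum_{k}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\Big)\cdot\big(w_{kj}^{(2)}h'(a_j)x_i\big)\\
&= \sum_{k'}h'(a_{j'})x_{i'}\sum_{k}M_{kk'}w_{k'j'}^{(2)}\cdot w_{kj}^{(2)}h'(a_j)x_i\\
&= x_{i'}x_ih'(a_{j'})h'(a_j)\sum_{k'}\sum_{k}w_{k'j'}^{(2)}\cdot w_{kj}^{(2)}M_{kk'}
\end{aligned}$$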

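When j = j ′ , similarly we have:
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{ji'}^{(1)}} &= \sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}h'(a_j)x_{i'}\Big)\\
&= x_{i'}\sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\Big)h'(a_j) + x_{i'}\sum_{k'}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\Big)\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}}\\
&= x_{i'}x_ih'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}\cdot w_{kj}^{(2)}M_{kk'} + x_{i'}\sum_{k'}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\Big)\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}}\\
&= x_{i'}x_ih'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}\cdot w_{kj}^{(2)}M_{kk'} + x_{i'}\sum_{k'}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\Big)h''(a_j)x_i\\
&= x_{i'}x_ih'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}\cdot w_{kj}^{(2)}M_{kk'} + h''(a_j)x_ix_{i'}\sum_{k'}\delta_{k'}w_{k'j}^{(2)}
\end{aligned}$$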

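It seems that what we have obtained is slightly different from (5.94) when j = j ′ . However this is not the case, since the summation over k′ in the second term of our formulation and the summation over k in the first term of (5.94) are actually the same (i.e., they both represent the summation over all the output units). Combining the situation when j = j ′ and j ̸= j ′ , we can obtain (5.94) just as required. Finally, we deal with the third case. Similarly we first focus on j ̸= j ′ :
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{kj'}^{(2)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial w_{kj'}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj'}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial\sum_{j'}w_{kj'}z_{j'}}{\partial w_{kj'}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}z_{j'}\Big)\\
&= \sum_{k'}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{ji}^{(1)}}z_{j'}\\
&= \sum_{k'}z_{j'}M_{kk'}w_{k'j}^{(2)}h'(a_j)x_i\\
&= x_ih'(a_j)z_{j'}\sum_{k'}M_{kk'}w_{k'j}^{(2)}
\end{aligned}$$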

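Note that in (5.95), there are two typos: (i) H kk′ should be M kk′ . (ii) j should exchange position with j ′ in the right side of (5.95). When j = j ′ , we have:
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{ji}^{(1)}\partial w_{kj}^{(2)}} &= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial w_{kj}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{kj}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\frac{\partial\sum_{j}w_{kj}z_j}{\partial w_{kj}^{(2)}}\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}z_j\Big)\\
&= \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\Big)z_j + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}}\\
&= x_ih'(a_j)z_j\sum_{k'}M_{kk'}w_{k'j}^{(2)} + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}}\\
&= x_ih'(a_j)z_j\sum_{k'}M_{kk'}w_{k'j}^{(2)} + \delta_kh'(a_j)x_i
\end{aligned}$$

Combining these two situations, we obtain (5.95) just as required.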
Problem 5.23 Solution
It is similar to the previous problem.
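$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{k'i'}\partial w_{kj}} &= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial w_{kj}}\Big)\\
&= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial a_k}z_j\Big)\\
&= z_j\frac{\partial a_{k'}}{\partial w_{k'i'}}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\\
&= z_jx_{i'}M_{kk'}
\end{aligned}$$
And
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{k'i'}\partial w_{ji}} &= \frac{\partial}{\partial w_{k'i'}}\Big(\sum_k\frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ji}}\Big)\\
&= \frac{\partial}{\partial w_{k'i'}}\Big(\sum_k\frac{\partial E_n}{\partial a_k}w_{kj}h'(a_j)x_i\Big)\\
&= \sum_kh'(a_j)x_iw_{kj}\frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\\
&= \sum_kh'(a_j)x_iw_{kj}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{k'i'}}\\
&= \sum_kh'(a_j)x_iw_{kj}M_{kk'}x_{i'}\\
&= x_ix_{i'}h'(a_j)\sum_kw_{kj}M_{kk'}
\end{aligned}$$

Finally, we have
$$\begin{aligned}
\frac{\partial^2E_n}{\partial w_{k'i'}\partial w_{ki}} &= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial w_{ki}}\Big)\\
&= \frac{\partial}{\partial w_{k'i'}}\Big(\frac{\partial E_n}{\partial a_k}x_i\Big)\\
&= x_i\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{k'i'}}\\
&= x_ix_{i'}M_{kk'}
\end{aligned}$$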

Problem 5.24 Solution
It is obvious. According to (5.113), we have:
$$\begin{aligned}
\tilde{a}_j &= \sum_i\tilde{w}_{ji}\tilde{x}_i + \tilde{w}_{j0}\\
&= \sum_i\frac{1}{a}w_{ji}\cdot(ax_i+b) + w_{j0} - \frac{b}{a}\sum_iw_{ji}\\
&= \sum_iw_{ji}x_i + w_{j0} = a_j
\end{aligned}$$

Where we have used (5.115), (5.116) and (5.117). Currently, we have proved that under the transformation the hidden unit a j is unchanged. If the activation function at the hidden unit is also unchanged, we have $\tilde{z}_j = z_j$. Now we deal with the output unit $\tilde{y}_k$:
$$\begin{aligned}
\tilde{y}_k &= \sum_j\tilde{w}_{kj}\tilde{z}_j + \tilde{w}_{k0}\\
&= \sum_jcw_{kj}\cdot z_j + cw_{k0} + d\\
&= c\Big[\sum_jw_{kj}\cdot z_j + w_{k0}\Big] + d\\
&= cy_k + d
\end{aligned}$$

Where we have used (5.114), (5.119) and (5.120). To be more specific, here we have proved that the linear transformation between $\tilde{y}_k$ and yk can be achieved by making transformation (5.119) and (5.120).
Problem 5.25 Solution
Since we know the gradient of the error function with respect to w is:
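$$\nabla E = \mathbf{H}(\mathbf{w}-\mathbf{w}^*)$$

Together with (5.196), we can obtain:
$$\begin{aligned}
\mathbf{w}^{(\tau)} &= \mathbf{w}^{(\tau-1)} - \rho\nabla E\\
&= \mathbf{w}^{(\tau-1)} - \rho\,\mathbf{H}(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*)
\end{aligned}$$

Multiplying both sides by $\mathbf{u}_j^T$, using $w_j = \mathbf{w}^T\mathbf{u}_j$, we can obtain:
$$\begin{aligned}
w_j^{(\tau)} &= \mathbf{u}_j^T\big[\mathbf{w}^{(\tau-1)} - \rho\,\mathbf{H}(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*)\big]\\
&= w_j^{(\tau-1)} - \rho\,\mathbf{u}_j^T\mathbf{H}(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*)\\
&= w_j^{(\tau-1)} - \rho\eta_j\,\mathbf{u}_j^T(\mathbf{w}^{(\tau-1)}-\mathbf{w}^*)\\
&= w_j^{(\tau-1)} - \rho\eta_j(w_j^{(\tau-1)}-w_j^*)\\
&= (1-\rho\eta_j)\,w_j^{(\tau-1)} + \rho\eta_jw_j^*
\end{aligned}$$

Where we have used (5.198). Then we use mathematical induction to prove (5.197), beginning by calculating $w_j^{(1)}$ with the initial value $\mathbf{w}^{(0)} = \mathbf{0}$:
$$\begin{aligned}
w_j^{(1)} &= (1-\rho\eta_j)\,w_j^{(0)} + \rho\eta_jw_j^*\\
&= \rho\eta_jw_j^*\\
&= \big[\,1-(1-\rho\eta_j)\,\big]w_j^*
\end{aligned}$$

Suppose (5.197) holds for τ, we now prove that it also holds for τ + 1.
$$\begin{aligned}
w_j^{(\tau+1)} &= (1-\rho\eta_j)\,w_j^{(\tau)} + \rho\eta_jw_j^*\\
&= (1-\rho\eta_j)\big[\,1-(1-\rho\eta_j)^{\tau}\,\big]w_j^* + \rho\eta_jw_j^*\\
&= \big\{(1-\rho\eta_j)\big[\,1-(1-\rho\eta_j)^{\tau}\,\big] + \rho\eta_j\big\}w_j^*\\
&= \big[\,1-(1-\rho\eta_j)^{\tau+1}\,\big]w_j^*
\end{aligned}$$

Hence (5.197) holds for τ = 1, 2, .... Provided |1 − ρη j | < 1, we have (1 − ρη j )τ → 0 as τ → ∞ and thus w(τ) → w∗ . If τ is finite and η j >> (ρτ)−1 , the above argument still holds since τ is still relatively large. Conversely, when η j << (ρτ)−1 , we expand the expression above:
$$|w_j^{(\tau)}| = \big|\big[\,1-(1-\rho\eta_j)^{\tau}\,\big]w_j^*\big| \approx |\tau\rho\eta_jw_j^*| \ll |w_j^*|$$
We can see that (ρτ)−1 works as the regularization parameter α in section 3.5.3.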
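The closed form (5.197) and the two regimes discussed above can be reproduced with a few lines of code. The sketch works directly in the eigenbasis of H, so each component evolves independently; the eigenvalues, optimum, learning rate and number of steps below are made-up values for illustration.

```python
def gradient_descent_component(eta, w_star, rho, steps):
    """Iterate w <- w - rho * eta * (w - w_star), starting from w = 0."""
    w = 0.0
    for _ in range(steps):
        w = w - rho * eta * (w - w_star)
    return w

def closed_form(eta, w_star, rho, steps):
    """Closed form: w^(tau) = [1 - (1 - rho * eta)^tau] * w_star."""
    return (1.0 - (1.0 - rho * eta) ** steps) * w_star

rho, steps = 0.1, 50            # so (rho * tau)^-1 = 0.2
etas = [5.0, 0.5, 0.001]        # hypothetical eigenvalues of H
w_stars = [1.0, -2.0, 3.0]      # hypothetical components of the optimum

iterated = [gradient_descent_component(e, ws, rho, steps) for e, ws in zip(etas, w_stars)]
predicted = [closed_form(e, ws, rho, steps) for e, ws in zip(etas, w_stars)]
print(iterated)
print(predicted)
```

The component with η j much larger than (ρτ)−1 has essentially reached w∗ j , while the one with η j much smaller than (ρτ)−1 is still close to zero, which mirrors the effect of weight decay.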
Problem 5.26 Solution
Based on definition or by analogy with (5.128), we have:

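$$\begin{aligned}
\Omega_n &= \frac{1}{2}\sum_k\Big(\frac{\partial y_{nk}}{\partial\xi}\Big|_{\xi=0}\Big)^2\\
&= \frac{1}{2}\sum_k\Big(\sum_i\frac{\partial y_{nk}}{\partial x_i}\frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}\Big)^2\\
&= \frac{1}{2}\sum_k\Big(\sum_i\tau_i\frac{\partial}{\partial x_i}y_{nk}\Big)^2
\end{aligned}$$

Where we have denoted
$$\tau_i = \frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}$$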

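And this is exactly the form given in (5.201) and (5.202) if the nth observation ynk is denoted as yk in short. Firstly, we define α j and β j as (5.205) shows, where z j and a j are given by (5.203). Then we will prove (5.204) holds:
$$\begin{aligned}
\alpha_j &= \sum_i\tau_i\frac{\partial z_j}{\partial x_i} = \sum_i\tau_i\frac{\partial h(a_j)}{\partial x_i}\\
&= \sum_i\tau_i\frac{\partial h(a_j)}{\partial a_j}\frac{\partial a_j}{\partial x_i}\\
&= h'(a_j)\sum_i\tau_i\frac{\partial}{\partial x_i}a_j = h'(a_j)\,\beta_j
\end{aligned}$$

Moreover,
$$\begin{aligned}
\beta_j &= \sum_i\tau_i\frac{\partial a_j}{\partial x_i} = \sum_i\tau_i\frac{\partial\sum_{i'}w_{ji'}z_{i'}}{\partial x_i}\\
&= \sum_i\tau_i\sum_{i'}\frac{\partial w_{ji'}z_{i'}}{\partial x_i} = \sum_i\sum_{i'}\tau_iw_{ji'}\frac{\partial z_{i'}}{\partial x_i}\\
&= \sum_{i'}w_{ji'}\sum_i\tau_i\frac{\partial z_{i'}}{\partial x_i} = \sum_{i'}w_{ji'}\alpha_{i'}
\end{aligned}$$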

So far we have proved that (5.204) holds and now we aim to find a forward
propagation formula to calculate Ωn . We firstly begin by evaluating {β j } at
the input units, and then use the first equation in (5.204) to obtain {α j } at the
input units, and then the second equation to evaluate {β j } at the first hidden
layer, and again the first equation to evaluate {α j } at the first hidden layer.
We repeatedly evaluate {β j } and {α j } in this way until reaching the output
layer. Then we deal with (5.206):
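$$\begin{aligned}
\frac{\partial\Omega_n}{\partial w_{rs}} &= \frac{\partial}{\partial w_{rs}}\left\{\frac{1}{2}\sum_k(\mathcal{G}y_k)^2\right\} = \frac{1}{2}\sum_k\frac{\partial(\mathcal{G}y_k)^2}{\partial w_{rs}} \\
&= \frac{1}{2}\sum_k\frac{\partial(\mathcal{G}y_k)^2}{\partial(\mathcal{G}y_k)}\frac{\partial(\mathcal{G}y_k)}{\partial w_{rs}} = \sum_k\mathcal{G}y_k\,\frac{\partial\,\mathcal{G}y_k}{\partial w_{rs}} \\
&= \sum_k\mathcal{G}y_k\,\mathcal{G}\left[\frac{\partial y_k}{\partial w_{rs}}\right] = \sum_k\alpha_k\,\mathcal{G}\left[\frac{\partial y_k}{\partial a_r}\frac{\partial a_r}{\partial w_{rs}}\right] \\
&= \sum_k\alpha_k\,\mathcal{G}\left[\delta_{kr}z_s\right] = \sum_k\alpha_k\left\{\mathcal{G}[\delta_{kr}]\,z_s + \mathcal{G}[z_s]\,\delta_{kr}\right\} \\
&= \sum_k\alpha_k\left\{\phi_{kr}z_s + \alpha_s\delta_{kr}\right\}
\end{aligned}$$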

Provided with the idea in section 5.3, the backward propagation formula
is easy to derive. We can simply replace E n with yk to obtain a backward
equation, so we omit it here.
Problem 5.27 Solution
Following the procedure in section 5.5.5, we can obtain:
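$$\Omega = \frac{1}{2}\int\left(\boldsymbol{\tau}^T\nabla y(\mathbf{x})\right)^2 p(\mathbf{x})\,d\mathbf{x}$$

Since $\boldsymbol{\tau} = \partial\mathbf{s}(\mathbf{x},\boldsymbol{\xi})/\partial\boldsymbol{\xi}$ and $\mathbf{s} = \mathbf{x} + \boldsymbol{\xi}$, we have $\boldsymbol{\tau} = \mathbf{I}$. Therefore, substituting $\boldsymbol{\tau}$ into the equation above, we can obtain:

$$\Omega = \frac{1}{2}\int\|\nabla y(\mathbf{x})\|^2 p(\mathbf{x})\,d\mathbf{x}$$

Just as required.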
Problem 5.28 Solution
The modifications only affect derivatives with respect to the weights in
the convolutional layer. The units within a feature map (indexed m) have
different inputs, but all share a common weight vector, w(m) . Therefore, we
can write:
$$\frac{\partial E_n}{\partial w_i^{(m)}} = \sum_j\frac{\partial E_n}{\partial a_j^{(m)}}\frac{\partial a_j^{(m)}}{\partial w_i^{(m)}} = \sum_j\delta_j^{(m)}z_{ji}^{(m)}$$

Here $a_j^{(m)}$ denotes the activation of the $j$th unit in the $m$th feature map, whereas $w_i^{(m)}$ denotes the $i$th element of the corresponding feature vector, and finally $z_{ji}^{(m)}$ denotes the $i$th input for the $j$th unit in the $m$th feature map. Note that $\delta_j^{(m)}$ can be computed recursively from the units in the following layer.
Problem 5.29 Solution

It is obvious. Firstly, we know that:
$$\frac{\partial}{\partial w_i}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} = -\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$

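We now differentiate the error function with respect to $w_i$:

$$\begin{aligned}
\frac{\partial\widetilde{E}}{\partial w_i} &= \frac{\partial E}{\partial w_i} + \frac{\partial\lambda\Omega(\mathbf{w})}{\partial w_i} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\left\{\sum_{i'}\ln\left(\sum_{j=1}^M\pi_j\mathcal{N}(w_{i'}|\mu_j,\sigma_j^2)\right)\right\} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\left\{\ln\left(\sum_{j=1}^M\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right)\right\} \\
&= \frac{\partial E}{\partial w_i} - \lambda\frac{1}{\sum_{j=1}^M\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}\frac{\partial}{\partial w_i}\left\{\sum_{j=1}^M\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} \\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{1}{\sum_{j=1}^M\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}\left\{\sum_{j=1}^M\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} \\
&= \frac{\partial E}{\partial w_i} + \lambda\frac{\sum_{j=1}^M\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_k\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)} \\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^M\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_k\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{w_i-\mu_j}{\sigma_j^2} \\
&= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^M\gamma_j(w_i)\frac{w_i-\mu_j}{\sigma_j^2}
\end{aligned}$$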

Where we have used (5.138) and defined (5.140).
Problem 5.30 Solution
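It is similar to the previous problem. Since we know that:

$$\frac{\partial}{\partial\mu_j}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} = \pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$

We can derive (using $k$ as the summation index over mixture components to avoid a clash with the fixed index $j$):

$$\begin{aligned}
\frac{\partial\widetilde{E}}{\partial\mu_j} &= \frac{\partial\lambda\Omega(\mathbf{w})}{\partial\mu_j} \\
&= -\lambda\frac{\partial}{\partial\mu_j}\left\{\sum_i\ln\left(\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right\} \\
&= -\lambda\sum_i\frac{\partial}{\partial\mu_j}\left\{\ln\left(\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\partial}{\partial\mu_j}\left\{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\,\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2) \\
&= \lambda\sum_i\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\mu_j-w_i}{\sigma_j^2} = \lambda\sum_i\gamma_j(w_i)\frac{\mu_j-w_i}{\sigma_j^2}
\end{aligned}$$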

Note that there is a typo in (5.142). The numerator should be µ j − w i
instead of µ i − w j . This can be easily seen through the fact that the mean and
variance of the Gaussian Distribution should have the same subindex and
since σ j is in the denominator, µ j should occur in the numerator instead of
µi .
Problem 5.31 Solution
It is similar to the previous problem. Since we know that:
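$$\frac{\partial}{\partial\sigma_j}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} = \left(-\frac{1}{\sigma_j} + \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right)\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$

We can derive (using $k$ as the summation index over mixture components):

$$\begin{aligned}
\frac{\partial\widetilde{E}}{\partial\sigma_j} &= \frac{\partial\lambda\Omega(\mathbf{w})}{\partial\sigma_j} \\
&= -\lambda\frac{\partial}{\partial\sigma_j}\left\{\sum_i\ln\left(\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right\} \\
&= -\lambda\sum_i\frac{\partial}{\partial\sigma_j}\left\{\ln\left(\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right)\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\partial}{\partial\sigma_j}\left\{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\partial}{\partial\sigma_j}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\right\} \\
&= \lambda\sum_i\frac{1}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\left(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right)\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2) \\
&= \lambda\sum_i\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\left(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right) \\
&= \lambda\sum_i\gamma_j(w_i)\left(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\right)
\end{aligned}$$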

Just as required.
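The three gradients of the mixture regularizer obtained in Prob.5.29 to Prob.5.31 can be compared against central finite differences. The sketch below works with $\Omega(\mathbf{w})$ alone (that is, $\lambda = 1$ and no data term $E$); the helper names and the numerical values of the weights and mixture parameters are assumptions made only for this check.

```python
import numpy as np


def gauss(w, mu, sigma):
    """Univariate Gaussian density N(w | mu, sigma^2)."""
    return np.exp(-0.5 * ((w - mu) / sigma) ** 2) / (np.sqrt(2.0 * np.pi) * sigma)


def omega(w, pi, mu, sigma):
    """Omega(w) = -sum_i ln sum_j pi_j N(w_i | mu_j, sigma_j^2)."""
    dens = pi * gauss(w[:, None], mu[None, :], sigma[None, :])
    return -np.sum(np.log(dens.sum(axis=1)))


def analytic_grads(w, pi, mu, sigma):
    """Gradients of Omega w.r.t. w_i, mu_j and sigma_j via the responsibilities gamma_j(w_i)."""
    dens = pi * gauss(w[:, None], mu[None, :], sigma[None, :])
    gamma = dens / dens.sum(axis=1, keepdims=True)   # shape (num_weights, M)
    diff = w[:, None] - mu[None, :]                  # w_i - mu_j
    d_w = np.sum(gamma * diff / sigma ** 2, axis=1)
    d_mu = np.sum(gamma * (-diff) / sigma ** 2, axis=0)
    d_sigma = np.sum(gamma * (1.0 / sigma - diff ** 2 / sigma ** 3), axis=0)
    return d_w, d_mu, d_sigma


def numeric_grad(f, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        g[i] = (f(x + step) - f(x - step)) / (2.0 * eps)
    return g


# Illustrative values (assumed)
w = np.array([-1.3, -0.2, 0.1, 0.8, 2.0])
pi = np.array([0.3, 0.7])
mu = np.array([-1.0, 0.5])
sigma = np.array([0.8, 1.5])

d_w, d_mu, d_sigma = analytic_grads(w, pi, mu, sigma)
print(np.allclose(d_w, numeric_grad(lambda v: omega(v, pi, mu, sigma), w), atol=1e-5))
print(np.allclose(d_mu, numeric_grad(lambda v: omega(w, pi, v, sigma), mu), atol=1e-5))
print(np.allclose(d_sigma, numeric_grad(lambda v: omega(w, pi, mu, v), sigma), atol=1e-5))
```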
Problem 5.32 Solution
It is trivial. We begin by verifying (5.208) when j ̸= k.
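$$\begin{aligned}
\frac{\partial\pi_k}{\partial\eta_j} &= \frac{\partial}{\partial\eta_j}\left\{\frac{\exp(\eta_k)}{\sum_l\exp(\eta_l)}\right\} \\
&= \frac{-\exp(\eta_k)\exp(\eta_j)}{\left[\sum_l\exp(\eta_l)\right]^2} \\
&= -\pi_j\pi_k
\end{aligned}$$

And if now we have $j = k$:

$$\begin{aligned}
\frac{\partial\pi_k}{\partial\eta_k} &= \frac{\partial}{\partial\eta_k}\left\{\frac{\exp(\eta_k)}{\sum_l\exp(\eta_l)}\right\} \\
&= \frac{\exp(\eta_k)\left[\sum_l\exp(\eta_l)\right] - \exp(\eta_k)\exp(\eta_k)}{\left[\sum_l\exp(\eta_l)\right]^2} \\
&= \pi_k - \pi_k\pi_k
\end{aligned}$$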
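If we combine these two cases, we can easily see that (5.208) holds. Now we prove (5.147), writing $k$ and $l$ for the summation indices over mixture components:

$$\begin{aligned}
\frac{\partial\widetilde{E}}{\partial\eta_j} &= \lambda\frac{\partial\Omega(\mathbf{w})}{\partial\eta_j} \\
&= -\lambda\frac{\partial}{\partial\eta_j}\left\{\sum_i\ln\left\{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\}\right\} \\
&= -\lambda\sum_i\frac{\partial}{\partial\eta_j}\left\{\ln\left\{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\}\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)}\frac{\partial}{\partial\eta_j}\left\{\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)}\sum_{k=1}^M\frac{\partial}{\partial\eta_j}\left\{\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i\frac{1}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)}\sum_{k=1}^M\frac{\partial}{\partial\pi_k}\left\{\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\}\frac{\partial\pi_k}{\partial\eta_j} \\
&= -\lambda\sum_i\frac{1}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)}\sum_{k=1}^M\mathcal{N}(w_i|\mu_k,\sigma_k^2)\,(\delta_{jk}\pi_j - \pi_j\pi_k) \\
&= -\lambda\sum_i\frac{1}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)}\left\{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2) - \pi_j\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)\right\} \\
&= -\lambda\sum_i\left\{\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)} - \frac{\pi_j\sum_{k=1}^M\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}{\sum_{l=1}^M\pi_l\mathcal{N}(w_i|\mu_l,\sigma_l^2)}\right\} \\
&= -\lambda\sum_i\left\{\gamma_j(w_i) - \pi_j\right\} = \lambda\sum_i\left\{\pi_j - \gamma_j(w_i)\right\}
\end{aligned}$$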

Just as required.
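The softmax derivative (5.208) used in this proof, $\partial\pi_k/\partial\eta_j = \delta_{jk}\pi_j - \pi_j\pi_k$, is easy to confirm numerically. The following sketch builds the full Jacobian and compares it with finite differences; the function names and the test vector `eta` are assumptions for illustration.

```python
import numpy as np


def softmax(eta):
    """pi_k = exp(eta_k) / sum_l exp(eta_l), shifted by max(eta) for numerical stability."""
    e = np.exp(eta - eta.max())
    return e / e.sum()


def softmax_jacobian(eta):
    """J[k, j] = d pi_k / d eta_j = delta_jk * pi_j - pi_j * pi_k."""
    pi = softmax(eta)
    return np.diag(pi) - np.outer(pi, pi)


def numeric_jacobian(eta, eps=1e-6):
    """Central finite-difference approximation of the same Jacobian."""
    J = np.zeros((eta.size, eta.size))
    for j in range(eta.size):
        step = np.zeros_like(eta)
        step[j] = eps
        J[:, j] = (softmax(eta + step) - softmax(eta - step)) / (2.0 * eps)
    return J


eta = np.array([0.2, -1.0, 0.5, 1.7])   # arbitrary test point (assumed)
print(np.allclose(softmax_jacobian(eta), numeric_jacobian(eta), atol=1e-7))
```

Each column of the Jacobian sums to zero, reflecting the constraint that the mixing coefficients always sum to one.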
Problem 5.33 Solution
It is trivial. We set the attachment point of the lower arm with the ground
as the origin of the coordinate. We first aim to find the vertical distance from
the origin to the target point, and this is also the value of x2 .

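$$\begin{aligned}
x_2 &= L_1\sin(\pi-\theta_1) + L_2\sin(\theta_2-(\pi-\theta_1)) \\
&= L_1\sin\theta_1 - L_2\sin(\theta_1+\theta_2)
\end{aligned}$$

Similarly, we calculate the horizontal distance from the origin to the target point.

$$\begin{aligned}
x_1 &= -L_1\cos(\pi-\theta_1) + L_2\cos(\theta_2-(\pi-\theta_1)) \\
&= L_1\cos\theta_1 - L_2\cos(\theta_1+\theta_2)
\end{aligned}$$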

From these two equations, we can clearly see the ’forward kinematics’ of
the robot arm.
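The two equations above translate directly into a small function. This is a sketch only: the function names and the default link lengths are assumptions, and the second function simply re-evaluates the unsimplified expressions to confirm the trigonometric simplification.

```python
import math


def forward_kinematics(theta1, theta2, L1=1.0, L2=1.0):
    """End-effector position (x1, x2) from the simplified expressions."""
    x1 = L1 * math.cos(theta1) - L2 * math.cos(theta1 + theta2)
    x2 = L1 * math.sin(theta1) - L2 * math.sin(theta1 + theta2)
    return x1, x2


def forward_kinematics_unsimplified(theta1, theta2, L1=1.0, L2=1.0):
    """Same position from the geometric expressions before simplification."""
    x1 = -L1 * math.cos(math.pi - theta1) + L2 * math.cos(theta2 - (math.pi - theta1))
    x2 = L1 * math.sin(math.pi - theta1) + L2 * math.sin(theta2 - (math.pi - theta1))
    return x1, x2


# Lower arm vertical, upper arm in line with it: the tip sits at height L1 + L2.
print(forward_kinematics(math.pi / 2, math.pi))
```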

Problem 5.34 Solution
By analogy with (5.208), we can write:
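$$\frac{\partial\pi_k(\mathbf{x})}{\partial a_j^{\pi}} = \delta_{jk}\pi_j(\mathbf{x}) - \pi_j(\mathbf{x})\pi_k(\mathbf{x})$$

Using (5.153), we can see that:

$$E_n = -\ln\left\{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\}$$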

Therefore, we can derive:
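$$\begin{aligned}
\frac{\partial E_n}{\partial a_j^{\pi}} &= -\frac{\partial}{\partial a_j^{\pi}}\ln\left\{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\frac{\partial}{\partial a_j^{\pi}}\left\{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= -\frac{1}{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\sum_{k=1}^K\frac{\partial\pi_k}{\partial a_j^{\pi}}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\frac{1}{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\sum_{k=1}^K\left[\delta_{jk}\pi_j(\mathbf{x}_n) - \pi_j(\mathbf{x}_n)\pi_k(\mathbf{x}_n)\right]\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\frac{1}{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\left\{\pi_j(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_j,\sigma_j^2) - \pi_j(\mathbf{x}_n)\sum_{k=1}^K\pi_k(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} \\
&= \frac{1}{\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)}\left\{-\pi_j(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_j,\sigma_j^2) + \pi_j(\mathbf{x}_n)\sum_{k=1}^K\pi_k(\mathbf{x}_n)\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\}
\end{aligned}$$

And if we use the definition (5.154), we will have:

$$\frac{\partial E_n}{\partial a_j^{\pi}} = -\gamma_j + \pi_j$$

Note that our result differs slightly from (5.155) in the subindex, but the two are actually the same if we substitute index $j$ by index $k$ in the final expression.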
Problem 5.35 Solution
We deal with the derivative of error function with respect to µk instead,
which will give a vector as result. Furthermore, the l th element of this vector
will be what we have been required. Since we know that:
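$$\frac{\partial}{\partial\boldsymbol{\mu}_k}\left\{\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} = \frac{\mathbf{t}_n-\boldsymbol{\mu}_k}{\sigma_k^2}\,\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)$$

One thing worth noticing is that here we focus on the isotropic case, as stated on page 273 of the textbook. To be more precise, $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)$ should be $\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2\mathbf{I})$. Provided with the equation above, we can further obtain (with $k'$ as the summation index):

$$\begin{aligned}
\frac{\partial E_n}{\partial\boldsymbol{\mu}_k} &= \frac{\partial}{\partial\boldsymbol{\mu}_k}\left\{-\ln\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)\right\} \\
&= -\frac{1}{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)}\frac{\partial}{\partial\boldsymbol{\mu}_k}\left\{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)\right\} \\
&= -\frac{1}{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)}\cdot\frac{\mathbf{t}_n-\boldsymbol{\mu}_k}{\sigma_k^2}\,\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\gamma_k\frac{\mathbf{t}_n-\boldsymbol{\mu}_k}{\sigma_k^2}
\end{aligned}$$

Hence, noticing (5.152), the $l$th element of the result above is what is required:

$$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \frac{\partial E_n}{\partial\mu_{kl}} = \gamma_k\frac{\mu_{kl}-t_l}{\sigma_k^2}$$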

Problem 5.36 Solution
Similarly, we know that:
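$$\frac{\partial}{\partial\sigma_k}\left\{\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)\right\} = \left\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^3}\right\}\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2)$$

Therefore, we can obtain (with $k'$ as the summation index):

$$\begin{aligned}
\frac{\partial E_n}{\partial\sigma_k} &= \frac{\partial}{\partial\sigma_k}\left\{-\ln\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)\right\} \\
&= -\frac{1}{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)}\frac{\partial}{\partial\sigma_k}\left\{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)\right\} \\
&= -\frac{1}{\sum_{k'=1}^K\pi_{k'}\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_{k'},\sigma_{k'}^2)}\cdot\left\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^3}\right\}\pi_k\mathcal{N}(\mathbf{t}_n|\boldsymbol{\mu}_k,\sigma_k^2) \\
&= -\gamma_k\left\{-\frac{D}{\sigma_k} + \frac{\|\mathbf{t}_n-\boldsymbol{\mu}_k\|^2}{\sigma_k^3}\right\}
\end{aligned}$$

Note that there is a typo in (5.157), and the underlying reason is that:

$$|\sigma_k^2\mathbf{I}_{D\times D}| = (\sigma_k^2)^D$$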

Problem 5.37 Solution
First we know two properties for the Gaussian distribution N (t|µ, σ2 I):
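$$\mathbb{E}[\mathbf{t}] = \int\mathbf{t}\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu},\sigma^2\mathbf{I})\,d\mathbf{t} = \boldsymbol{\mu}$$

And

$$\mathbb{E}[\|\mathbf{t}\|^2] = \int\|\mathbf{t}\|^2\mathcal{N}(\mathbf{t}|\boldsymbol{\mu},\sigma^2\mathbf{I})\,d\mathbf{t} = L\sigma^2 + \|\boldsymbol{\mu}\|^2$$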

Where we have used E[tT At] = Tr[Aσ2 I] + µT Aµ by setting A = I. This
property can be found in Matrixcookbook eq(378). Here L is the dimension of
t. Noticing (5.148), we can write:
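$$\begin{aligned}
\mathbb{E}[\mathbf{t}|\mathbf{x}] &= \int\mathbf{t}\,p(\mathbf{t}|\mathbf{x})\,d\mathbf{t} \\
&= \int\mathbf{t}\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} \\
&= \sum_{k=1}^K\pi_k\int\mathbf{t}\,\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} \\
&= \sum_{k=1}^K\pi_k\boldsymbol{\mu}_k
\end{aligned}$$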

Then we prove (5.160).
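$$\begin{aligned}
s^2(\mathbf{x}) &= \mathbb{E}\left[\|\mathbf{t}-\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2\,\middle|\,\mathbf{x}\right] = \mathbb{E}\left[\left(\|\mathbf{t}\|^2 - 2\mathbf{t}^T\mathbb{E}[\mathbf{t}|\mathbf{x}] + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2\right)\middle|\,\mathbf{x}\right] \\
&= \mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - \mathbb{E}\left[2\mathbf{t}^T\mathbb{E}[\mathbf{t}|\mathbf{x}]\,\middle|\,\mathbf{x}\right] + \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 = \mathbb{E}[\|\mathbf{t}\|^2|\mathbf{x}] - \|\mathbb{E}[\mathbf{t}|\mathbf{x}]\|^2 \\
&= \int\|\mathbf{t}\|^2\sum_{k=1}^K\pi_k\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} - \Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K\pi_k\int\|\mathbf{t}\|^2\mathcal{N}(\mathbf{t}|\boldsymbol{\mu}_k,\sigma_k^2)\,d\mathbf{t} - \Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K\pi_k\left(L\sigma_k^2 + \|\boldsymbol{\mu}_k\|^2\right) - \Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K\pi_k\sigma_k^2 + \sum_{k=1}^K\pi_k\|\boldsymbol{\mu}_k\|^2 - \Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K\pi_k\sigma_k^2 + \sum_{k=1}^K\pi_k\|\boldsymbol{\mu}_k\|^2 - 2\times\Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 + 1\times\Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K\pi_k\sigma_k^2 + \sum_{k=1}^K\pi_k\|\boldsymbol{\mu}_k\|^2 - 2\Big(\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big)^T\Big(\sum_{k=1}^K\pi_k\boldsymbol{\mu}_k\Big) + \Big(\sum_{k=1}^K\pi_k\Big)\Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K\pi_k\sigma_k^2 + \sum_{k=1}^K\pi_k\|\boldsymbol{\mu}_k\|^2 - 2\Big(\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big)^T\Big(\sum_{k=1}^K\pi_k\boldsymbol{\mu}_k\Big) + \sum_{k=1}^K\pi_k\Big\|\sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= L\sum_{k=1}^K\pi_k\sigma_k^2 + \sum_{k=1}^K\pi_k\Big\|\boldsymbol{\mu}_k - \sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2 \\
&= \sum_{k=1}^K\pi_k\left(L\sigma_k^2 + \Big\|\boldsymbol{\mu}_k - \sum_{l=1}^K\pi_l\boldsymbol{\mu}_l\Big\|^2\right)
\end{aligned}$$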

Note that there is a typo in (5.160), i.e., the coefficient L in front of σ2k is
missing.
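The mean and variance formulas above, including the factor $L$ in front of $\sigma_k^2$, can be sanity-checked by sampling from an isotropic Gaussian mixture. This is a Monte Carlo sketch; the mixture parameters, the sample size and the function names are assumptions chosen for illustration.

```python
import numpy as np


def mixture_moments(pi, mu, sigma):
    """Mean sum_k pi_k mu_k and variance sum_k pi_k (L sigma_k^2 + ||mu_k - mean||^2)."""
    L = mu.shape[1]
    mean = pi @ mu
    var = np.sum(pi * (L * sigma ** 2 + np.sum((mu - mean) ** 2, axis=1)))
    return mean, var


def sample_mixture(pi, mu, sigma, n, rng):
    """Draw n samples from sum_k pi_k N(t | mu_k, sigma_k^2 I)."""
    comp = rng.choice(len(pi), size=n, p=pi)
    noise = rng.standard_normal((n, mu.shape[1]))
    return mu[comp] + sigma[comp, None] * noise


# Illustrative mixture with K = 3 components in L = 2 dimensions (assumed values)
pi = np.array([0.2, 0.5, 0.3])
mu = np.array([[0.0, 0.0], [2.0, 1.0], [-1.0, 3.0]])
sigma = np.array([0.5, 1.0, 0.3])

mean, var = mixture_moments(pi, mu, sigma)
samples = sample_mixture(pi, mu, sigma, 200_000, np.random.default_rng(0))
emp_mean = samples.mean(axis=0)
emp_var = np.mean(np.sum((samples - mean) ** 2, axis=1))
print(mean, emp_mean)
print(var, emp_var)
```

Dropping the factor `L` in `mixture_moments` makes the analytic variance disagree with the sampled one whenever `L > 1`, which is consistent with the typo noted above.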

Problem 5.38 Solution
From (5.167) and (5.171), we can write down the expression for the predictive distribution:
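$$\begin{aligned}
p(t|\mathbf{x},D,\alpha,\beta) &= \int p(\mathbf{w}|D,\alpha,\beta)\,p(t|\mathbf{x},\mathbf{w},\beta)\,d\mathbf{w} \\
&\approx \int q(\mathbf{w}|D)\,p(t|\mathbf{x},\mathbf{w},\beta)\,d\mathbf{w} \\
&= \int\mathcal{N}(\mathbf{w}|\mathbf{w}_{MAP},\mathbf{A}^{-1})\,\mathcal{N}(t|\mathbf{g}^T\mathbf{w} - \mathbf{g}^T\mathbf{w}_{MAP} + y(\mathbf{x},\mathbf{w}_{MAP}),\beta^{-1})\,d\mathbf{w}
\end{aligned}$$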
Note here p( t|x, w, β) is given by (5.171) and q(w|D ) is the approximation
to the posterior p(w|D, α, β), which is given by (5.167). Then by analogy with
(2.115), we first deal with the mean of the predictive distribution:
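$$\text{mean} = \left.\mathbf{g}^T\mathbf{w} - \mathbf{g}^T\mathbf{w}_{MAP} + y(\mathbf{x},\mathbf{w}_{MAP})\right|_{\mathbf{w}=\mathbf{w}_{MAP}} = y(\mathbf{x},\mathbf{w}_{MAP})$$

Then we deal with the covariance matrix:

$$\text{Covariance matrix} = \beta^{-1} + \mathbf{g}^T\mathbf{A}^{-1}\mathbf{g}$$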
Just as required.
Problem 5.39 Solution
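Using the Laplace approximation, we can obtain:

$$p(D|\mathbf{w},\beta)\,p(\mathbf{w}|\alpha) \approx p(D|\mathbf{w}_{MAP},\beta)\,p(\mathbf{w}_{MAP}|\alpha)\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{w}_{MAP})^T\mathbf{A}(\mathbf{w}-\mathbf{w}_{MAP})\right\}$$

Then using (5.174), (5.162) and (5.163), we can obtain:

$$\begin{aligned}
p(D|\alpha,\beta) &= \int p(D|\mathbf{w},\beta)\,p(\mathbf{w}|\alpha)\,d\mathbf{w} \\
&\approx \int p(D|\mathbf{w}_{MAP},\beta)\,p(\mathbf{w}_{MAP}|\alpha)\exp\left\{-\frac{1}{2}(\mathbf{w}-\mathbf{w}_{MAP})^T\mathbf{A}(\mathbf{w}-\mathbf{w}_{MAP})\right\}d\mathbf{w} \\
&= p(D|\mathbf{w}_{MAP},\beta)\,p(\mathbf{w}_{MAP}|\alpha)\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}} \\
&= \prod_{n=1}^N\mathcal{N}(t_n|y(\mathbf{x}_n,\mathbf{w}_{MAP}),\beta^{-1})\,\mathcal{N}(\mathbf{w}_{MAP}|\mathbf{0},\alpha^{-1}\mathbf{I})\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}}
\end{aligned}$$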

If we take logarithm of both sides, we will obtain (5.175) just as required.
Problem 5.40 Solution
For a k-class classification problem, we need to use softmax activation
function and also the error function is now given by (5.24). Therefore, the

Hessian matrix should be derived from (5.24) and the cross entropy in (5.184)
will also be replaced by (5.24).
Problem 5.41 Solution
By analogy to Prob.5.39, we can write:

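$$p(D|\alpha) = p(D|\mathbf{w}_{MAP})\,p(\mathbf{w}_{MAP}|\alpha)\frac{(2\pi)^{W/2}}{|\mathbf{A}|^{1/2}}$$

Since the prior $p(\mathbf{w}|\alpha)$ is Gaussian, i.e., (5.162), as stated in the text, we can obtain:

$$\begin{aligned}
\ln p(D|\alpha) &= \ln p(D|\mathbf{w}_{MAP}) + \ln p(\mathbf{w}_{MAP}|\alpha) - \frac{1}{2}\ln|\mathbf{A}| + \text{const} \\
&= \ln p(D|\mathbf{w}_{MAP}) - \frac{\alpha}{2}\mathbf{w}_{MAP}^T\mathbf{w}_{MAP} + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|\mathbf{A}| + \text{const} \\
&= -E(\mathbf{w}_{MAP}) + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|\mathbf{A}| + \text{const}
\end{aligned}$$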

Just as required.

0.6 Kernel Methods

Problem 6.1 Solution
Recall that in section.6.1, a n can be written as (6.4). We can derive:

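$$\begin{aligned}
a_n &= -\frac{1}{\lambda}\{\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n) - t_n\} \\
&= -\frac{1}{\lambda}\{w_1\phi_1(\mathbf{x}_n) + w_2\phi_2(\mathbf{x}_n) + \ldots + w_M\phi_M(\mathbf{x}_n) - t_n\} \\
&= -\frac{w_1}{\lambda}\phi_1(\mathbf{x}_n) - \frac{w_2}{\lambda}\phi_2(\mathbf{x}_n) - \ldots - \frac{w_M}{\lambda}\phi_M(\mathbf{x}_n) + \frac{t_n}{\lambda} \\
&= \Big(c_n - \frac{w_1}{\lambda}\Big)\phi_1(\mathbf{x}_n) + \Big(c_n - \frac{w_2}{\lambda}\Big)\phi_2(\mathbf{x}_n) + \ldots + \Big(c_n - \frac{w_M}{\lambda}\Big)\phi_M(\mathbf{x}_n)
\end{aligned}$$

Here we have defined:

$$c_n = \frac{t_n/\lambda}{\phi_1(\mathbf{x}_n) + \phi_2(\mathbf{x}_n) + \ldots + \phi_M(\mathbf{x}_n)}$$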

From what we have derived above, we can see that a n is a linear combination of ϕ(xn ). What’s more, we first substitute K = ΦΦT into (6.7), and
then we will obtain (6.5). Next we substitute (6.3) into (6.5) we will obtain
(6.2) just as required.
Problem 6.2 Solution

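If we set $\mathbf{w}^{(0)} = \mathbf{0}$ in (4.55), we can obtain:

$$\mathbf{w}^{(\tau+1)} = \sum_{n=1}^N\eta\,c_n t_n\boldsymbol{\phi}_n$$

where $N$ is the total number of samples and $c_n$ is the number of times that $t_n\boldsymbol{\phi}_n$ has been added from step 0 to step $\tau + 1$. Therefore, defining $\alpha_n = \eta c_n$, we have:

$$\mathbf{w} = \sum_{n=1}^N\alpha_n t_n\boldsymbol{\phi}_n$$

We further substitute the expression above into (4.55), which gives, for a misclassified pattern $\mathbf{x}_m$:

$$\sum_{n=1}^N\alpha_n^{(\tau+1)}t_n\boldsymbol{\phi}_n = \sum_{n=1}^N\alpha_n^{(\tau)}t_n\boldsymbol{\phi}_n + \eta\,t_m\boldsymbol{\phi}_m$$

In other words, the update process is to add the learning rate $\eta$ to the coefficient $\alpha_m$ corresponding to the misclassified pattern $\mathbf{x}_m$, i.e.,

$$\alpha_m^{(\tau+1)} = \alpha_m^{(\tau)} + \eta$$

Now we similarly substitute it into (4.52):

$$\begin{aligned}
y(\mathbf{x}) &= f\big(\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x})\big) \\
&= f\Big(\sum_{n=1}^N\alpha_n t_n\boldsymbol{\phi}_n^T\boldsymbol{\phi}(\mathbf{x})\Big) \\
&= f\Big(\sum_{n=1}^N\alpha_n t_n k(\mathbf{x}_n,\mathbf{x})\Big)
\end{aligned}$$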
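The dual update and prediction rule derived above can be sketched as a small kernel perceptron. This is an illustrative implementation under assumed choices: the function names, the Gaussian kernel and the XOR-style toy data are not from the text, and the loop simply cycles through the patterns until none is misclassified.

```python
import numpy as np


def rbf(a, b):
    """Gaussian kernel k(a, b) = exp(-||a - b||^2); an assumed choice for this sketch."""
    return float(np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2)))


def kernel_perceptron(X, t, kernel, eta=1.0, epochs=100):
    """Learn dual coefficients alpha_n; only kernel evaluations are needed."""
    N = len(t)
    K = np.array([[kernel(a, b) for b in X] for a in X])
    alpha = np.zeros(N)
    for _ in range(epochs):
        mistakes = 0
        for n in range(N):
            # Activation at x_n: sum_m alpha_m t_m k(x_m, x_n)
            if t[n] * np.sum(alpha * t * K[:, n]) <= 0:
                alpha[n] += eta          # alpha_n <- alpha_n + eta for a misclassified pattern
                mistakes += 1
        if mistakes == 0:
            break
    return alpha


def predict(x, X, t, alpha, kernel):
    """y(x) = sign(sum_n alpha_n t_n k(x_n, x))."""
    s = sum(a * tn * kernel(xn, x) for a, tn, xn in zip(alpha, t, X))
    return 1 if s >= 0 else -1


# XOR-like toy data: not linearly separable in input space (assumed example)
X = np.array([[0.0, 0.0], [1.0, 1.0], [0.0, 1.0], [1.0, 0.0]])
t = np.array([1, 1, -1, -1])
alpha = kernel_perceptron(X, t, rbf)
print(alpha, [predict(x, X, t, alpha, rbf) for x in X])
```

The weight vector $\mathbf{w}$ never appears explicitly, which is what allows feature spaces of very high or infinite dimension.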

Problem 6.3 Solution
We begin by expanding the Euclidean metric.
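$$\begin{aligned}
\|\mathbf{x}-\mathbf{x}_n\|^2 &= (\mathbf{x}-\mathbf{x}_n)^T(\mathbf{x}-\mathbf{x}_n) \\
&= (\mathbf{x}^T-\mathbf{x}_n^T)(\mathbf{x}-\mathbf{x}_n) \\
&= \mathbf{x}^T\mathbf{x} - 2\mathbf{x}_n^T\mathbf{x} + \mathbf{x}_n^T\mathbf{x}_n
\end{aligned}$$

Similar to (6.24)-(6.26), we use a nonlinear kernel $k(\mathbf{x}_n,\mathbf{x})$ to replace $\mathbf{x}_n^T\mathbf{x}$, which gives a general nonlinear nearest-neighbor classifier with cost function defined as:

$$k(\mathbf{x},\mathbf{x}) + k(\mathbf{x}_n,\mathbf{x}_n) - 2k(\mathbf{x}_n,\mathbf{x})$$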
Problem 6.4 Solution
To construct such a matrix, let us suppose the two eigenvalues are 1 and
2, and the matrix has form:
$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$$

Therefore, based on the definition of eigenvalue, we have two equations:

$$\begin{cases} (a-2)(d-2) = bc & (1) \\ (a-1)(d-1) = bc & (2) \end{cases}$$

(2)-(1), yielding:

$$a + d = 3$$

Therefore, we set $a = 4$ and $d = -1$. Then we substitute them into (1), and thus we see:

$$bc = -6$$

Finally, we choose $b = 3$ and $c = -2$. The constructed matrix is:

$$\begin{bmatrix} 4 & 3 \\ -2 & -1 \end{bmatrix}$$
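The eigenvalues of the constructed matrix can be confirmed numerically. A minimal check, assuming NumPy is available:

```python
import numpy as np

# The constructed matrix: it has negative entries but positive eigenvalues.
A = np.array([[4.0, 3.0],
              [-2.0, -1.0]])

eigvals = np.linalg.eigvals(A)
print(np.sort(eigvals.real))   # the two eigenvalues chosen in the construction
print(np.trace(A), np.linalg.det(A))
```

The trace equals the sum of the eigenvalues and the determinant equals their product, which matches the two constraints used in the construction.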
Problem 6.5 Solution
Since k 1 (x, x′ ) is a valid kernel, it can be written as:

k 1 (x, x′ ) = ϕ(x)T ϕ(x′ )
We can obtain:

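$$k(\mathbf{x},\mathbf{x}') = c\,k_1(\mathbf{x},\mathbf{x}') = \left[\sqrt{c}\,\boldsymbol{\phi}(\mathbf{x})\right]^T\left[\sqrt{c}\,\boldsymbol{\phi}(\mathbf{x}')\right]$$

Therefore, (6.13) is a valid kernel. It is similar for (6.14):

$$k(\mathbf{x},\mathbf{x}') = f(\mathbf{x})\,k_1(\mathbf{x},\mathbf{x}')\,f(\mathbf{x}') = \left[f(\mathbf{x})\boldsymbol{\phi}(\mathbf{x})\right]^T\left[f(\mathbf{x}')\boldsymbol{\phi}(\mathbf{x}')\right]$$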
Just as required.
Problem 6.6 Solution
We suppose q( x) can be written as:

q( x) = a n x n + a n−1 x n−1 + ... + a 1 x + a 0
We now obtain:

$$k(\mathbf{x},\mathbf{x}') = a_n k_1(\mathbf{x},\mathbf{x}')^n + a_{n-1}k_1(\mathbf{x},\mathbf{x}')^{n-1} + \ldots + a_1 k_1(\mathbf{x},\mathbf{x}') + a_0$$

By repeatedly using (6.13), (6.17) and (6.18), we can easily verify k(x, x′ )
is a valid kernel. For (6.16), we can use Taylor expansion, and since the
coefficients of Taylor expansion are all positive, we can similarly prove its
validity.
Problem 6.7 Solution
To prove (6.17), we will use the property stated below (6.12). Since we know $k_1(\mathbf{x},\mathbf{x}')$ and $k_2(\mathbf{x},\mathbf{x}')$ are valid kernels, their Gram matrices $\mathbf{K}_1$ and $\mathbf{K}_2$ are both positive semidefinite. Given the relation (6.12), it can be easily shown that $\mathbf{K} = \mathbf{K}_1 + \mathbf{K}_2$ is also positive semidefinite, and thus $k(\mathbf{x},\mathbf{x}')$ is also a valid kernel.
To prove (6.18), we assume the map function for kernel k 1 (x, x′ ) is ϕ(1) (x),
and similarly ϕ(2) (x) for k 2 (x, x′ ). Moreover, we further assume the dimension
of ϕ(1) (x) is M , and ϕ(2) (x) is N . We expand k(x, x′ ) based on (6.18):

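$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= k_1(\mathbf{x},\mathbf{x}')\,k_2(\mathbf{x},\mathbf{x}') \\
&= \boldsymbol{\phi}^{(1)}(\mathbf{x})^T\boldsymbol{\phi}^{(1)}(\mathbf{x}')\,\boldsymbol{\phi}^{(2)}(\mathbf{x})^T\boldsymbol{\phi}^{(2)}(\mathbf{x}') \\
&= \sum_{i=1}^M\phi_i^{(1)}(\mathbf{x})\phi_i^{(1)}(\mathbf{x}')\sum_{j=1}^N\phi_j^{(2)}(\mathbf{x})\phi_j^{(2)}(\mathbf{x}') \\
&= \sum_{i=1}^M\sum_{j=1}^N\left[\phi_i^{(1)}(\mathbf{x})\phi_j^{(2)}(\mathbf{x})\right]\left[\phi_i^{(1)}(\mathbf{x}')\phi_j^{(2)}(\mathbf{x}')\right] \\
&= \sum_{k=1}^{MN}\phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

where $\phi_i^{(1)}(\mathbf{x})$ is the $i$th element of $\boldsymbol{\phi}^{(1)}(\mathbf{x})$, and $\phi_j^{(2)}(\mathbf{x})$ is the $j$th element of $\boldsymbol{\phi}^{(2)}(\mathbf{x})$. To be more specific, we have proved that $k(\mathbf{x},\mathbf{x}')$ can be written as $\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$. Here $\boldsymbol{\phi}(\mathbf{x})$ is an $MN\times 1$ column vector, and the $k$th ($k = 1, 2, \ldots, MN$) element is given by $\phi_i^{(1)}(\mathbf{x})\times\phi_j^{(2)}(\mathbf{x})$. What's more, we can also express $i, j$ in terms of $k$:

$$i = (k-1)\oslash N + 1 \quad\text{and}\quad j = (k-1)\odot N + 1$$

where $\oslash$ and $\odot$ mean integer division and remainder, respectively.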
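The construction of the product feature map, including the index mapping from $k$ to $(i, j)$, corresponds to a Kronecker product of the two feature vectors. The sketch below checks this for two small feature maps; `phi1` and `phi2` are hypothetical maps invented for the example, with $M = 3$ and $N = 2$.

```python
import numpy as np


def phi1(x):
    """Hypothetical feature map for k1, dimension M = 3."""
    return np.array([x[0], x[1], x[0] * x[1]])


def phi2(x):
    """Hypothetical feature map for k2, dimension N = 2."""
    return np.array([1.0, x[0] ** 2 + x[1] ** 2])


def phi_product(x):
    """Feature map of k = k1 * k2.

    np.kron orders the entries so that the k-th element (1-based) is
    phi1_i * phi2_j with i = (k-1) // N + 1 and j = (k-1) % N + 1.
    """
    return np.kron(phi1(x), phi2(x))


x = np.array([0.3, -1.2])
y = np.array([2.0, 0.7])

k1 = phi1(x) @ phi1(y)
k2 = phi2(x) @ phi2(y)
print(k1 * k2, phi_product(x) @ phi_product(y))
```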
Problem 6.8 Solution
For (6.19) we suppose k 3 (x, x′ ) = g(x)T g(x′ ), and thus we have:

k(x, x′ ) = k 3 (ϕ(x), ϕ(x′ )) = g(ϕ(x))T g(ϕ(x′ )) = f (x)T f (x′ )
where we have denoted g(ϕ(x)) = f (x) and now it is obvious that (6.19)
holds. To prove (6.20), we suppose x is a N × 1 column vector and A is a N × N
symmetric positive semidefinite matrix. We know that A can be decomposed
to QBQT . Here Q is a N × N orthogonal matrix, and B is a N × N diagonal
matrix whose elements are no less than 0. Now we can derive:

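$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= \mathbf{x}^T\mathbf{A}\mathbf{x}' = \mathbf{x}^T\mathbf{Q}\mathbf{B}\mathbf{Q}^T\mathbf{x}' = (\mathbf{Q}^T\mathbf{x})^T\mathbf{B}(\mathbf{Q}^T\mathbf{x}') = \mathbf{y}^T\mathbf{B}\mathbf{y}' \\
&= \sum_{i=1}^N B_{ii}\,y_i y_i' = \sum_{i=1}^N\left(\sqrt{B_{ii}}\,y_i\right)\left(\sqrt{B_{ii}}\,y_i'\right) = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

To be more specific, we have proved that $k(\mathbf{x},\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$, and here $\boldsymbol{\phi}(\mathbf{x})$ is an $N\times 1$ column vector, whose $i$th ($i = 1, 2, \ldots, N$) element is given by $\sqrt{B_{ii}}\,y_i$, i.e., $\sqrt{B_{ii}}\,(\mathbf{Q}^T\mathbf{x})_i$.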

Problem 6.9 Solution
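To prove (6.21), let's first expand the expression:

$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= k_a(\mathbf{x}_a,\mathbf{x}_a') + k_b(\mathbf{x}_b,\mathbf{x}_b') \\
&= \sum_{i=1}^M\phi_i^{(a)}(\mathbf{x}_a)\phi_i^{(a)}(\mathbf{x}_a') + \sum_{j=1}^N\phi_j^{(b)}(\mathbf{x}_b)\phi_j^{(b)}(\mathbf{x}_b') \\
&= \sum_{k=1}^{M+N}\phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

where we have assumed the dimension of the feature vector $\boldsymbol{\phi}^{(a)}(\mathbf{x}_a)$ is $M$ and the dimension of $\boldsymbol{\phi}^{(b)}(\mathbf{x}_b)$ is $N$. The mapping function $\boldsymbol{\phi}(\mathbf{x})$ is an $(M+N)\times 1$ column vector, whose $k$th ($k = 1, 2, \ldots, M+N$) element $\phi_k(\mathbf{x})$ is:

$$\phi_k(\mathbf{x}) = \begin{cases} \phi_k^{(a)}(\mathbf{x}_a) & 1\le k\le M \\ \phi_{k-M}^{(b)}(\mathbf{x}_b) & M+1\le k\le M+N \end{cases}$$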
(6.22) is quite similar to (6.18). We follow the same procedure:

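$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= k_a(\mathbf{x}_a,\mathbf{x}_a')\,k_b(\mathbf{x}_b,\mathbf{x}_b') \\
&= \sum_{i=1}^M\phi_i^{(a)}(\mathbf{x}_a)\phi_i^{(a)}(\mathbf{x}_a')\sum_{j=1}^N\phi_j^{(b)}(\mathbf{x}_b)\phi_j^{(b)}(\mathbf{x}_b') \\
&= \sum_{i=1}^M\sum_{j=1}^N\left[\phi_i^{(a)}(\mathbf{x}_a)\phi_j^{(b)}(\mathbf{x}_b)\right]\left[\phi_i^{(a)}(\mathbf{x}_a')\phi_j^{(b)}(\mathbf{x}_b')\right] \\
&= \sum_{k=1}^{MN}\phi_k(\mathbf{x})\phi_k(\mathbf{x}') = \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

By analogy to (6.18), the mapping function $\boldsymbol{\phi}(\mathbf{x})$ is an $MN\times 1$ column vector, whose $k$th ($k = 1, 2, \ldots, MN$) element $\phi_k(\mathbf{x})$ is:

$$\phi_k(\mathbf{x}) = \phi_i^{(a)}(\mathbf{x}_a)\times\phi_j^{(b)}(\mathbf{x}_b)$$

To be more specific, $\mathbf{x}_a$ and $\mathbf{x}_b$ are the two sub-vectors of $\mathbf{x}$ on which $k_a$ and $k_b$ are defined, and $M$ and $N$ are the dimensions of their feature vectors. What's more, we can also express $i, j$ in terms of $k$:

$$i = (k-1)\oslash N + 1 \quad\text{and}\quad j = (k-1)\odot N + 1$$

where $\oslash$ and $\odot$ mean integer division and remainder, respectively.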
Problem 6.10 Solution
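According to (6.9), we have:

$$y(\mathbf{x}) = \mathbf{k}(\mathbf{x})^T(\mathbf{K}+\lambda\mathbf{I}_N)^{-1}\mathbf{t} = \mathbf{k}(\mathbf{x})^T\mathbf{a} = \sum_{n=1}^N f(\mathbf{x}_n)\cdot f(\mathbf{x})\cdot a_n = \left[\sum_{n=1}^N f(\mathbf{x}_n)\cdot a_n\right]f(\mathbf{x})$$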
We see that if we choose k(x, x′ ) = f (x) f (x′ ) we will always find a solution
y(x) proportional to f (x).
Problem 6.11 Solution
We follow the hint.

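$$\begin{aligned}
k(\mathbf{x},\mathbf{x}') &= \exp(-\mathbf{x}^T\mathbf{x}/2\sigma^2)\cdot\exp(\mathbf{x}^T\mathbf{x}'/\sigma^2)\cdot\exp(-(\mathbf{x}')^T\mathbf{x}'/2\sigma^2) \\
&= \exp(-\mathbf{x}^T\mathbf{x}/2\sigma^2)\cdot\left(1 + \frac{\mathbf{x}^T\mathbf{x}'}{\sigma^2} + \frac{\left(\frac{\mathbf{x}^T\mathbf{x}'}{\sigma^2}\right)^2}{2!} + \cdots\right)\cdot\exp(-(\mathbf{x}')^T\mathbf{x}'/2\sigma^2) \\
&= \boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')
\end{aligned}$$

where $\boldsymbol{\phi}(\mathbf{x})$ is a column vector with infinite dimension. To be more specific, (6.12) gives a simple example of how to decompose $(\mathbf{x}^T\mathbf{x}')^2$. In our case, we can decompose $(\mathbf{x}^T\mathbf{x}')^k$, $k = 1, 2, \ldots$, in a similar way. However, since $k\to\infty$, the decomposition contains monomials of unbounded degree. Thus, there are infinitely many terms in the decomposition and the feature mapping function $\boldsymbol{\phi}(\mathbf{x})$ has infinite dimension.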
Problem 6.12 Solution
First, let’s explain the problem a little bit. According to (6.27), what we
need to prove here is:

$$k(A_1,A_2) = 2^{|A_1\cap A_2|} = \boldsymbol{\phi}(A_1)^T\boldsymbol{\phi}(A_2)$$

The biggest difference from the previous problem is that $\boldsymbol{\phi}(A)$ is a $2^{|D|}\times 1$ column vector and, instead of being indexed by $1, 2, \ldots, 2^{|D|}$, here we index it by $\{U\,|\,U\subseteq D\}$ (note that $\{U\,|\,U\subseteq D\}$ is the set of all possible subsets of $D$, and thus there are $2^{|D|}$ elements in total). Therefore, according to (6.95), we can obtain:

$$\boldsymbol{\phi}(A_1)^T\boldsymbol{\phi}(A_2) = \sum_{U\subseteq D}\phi_U(A_1)\phi_U(A_2)$$

By using the summation, we actually iterate through all the possible subsets of $D$. The current term equals 1 if and only if the current subset $U$ is a subset of both $A_1$ and $A_2$ simultaneously. Therefore, we actually count how many subsets of $D$ are contained in the intersection of $A_1$ and $A_2$. Moreover, since $A_1$ and $A_2$ are both subsets of $D$, and a set with $|A_1\cap A_2|$ elements has $2^{|A_1\cap A_2|}$ subsets, what we have deduced above can be written as:

$$\boldsymbol{\phi}(A_1)^T\boldsymbol{\phi}(A_2) = 2^{|A_1\cap A_2|}$$

Just as required.
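For a small set $D$ the counting argument can be verified by brute-force enumeration of all subsets. The sketch below is illustrative only; the function names and the example sets are assumptions.

```python
from itertools import combinations


def subsets(s):
    """Yield every subset of s as a frozenset (2^|s| subsets in total)."""
    items = sorted(s)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)


def phi(A, D):
    """Feature vector indexed by subsets U of D: phi_U(A) = 1 if U is a subset of A, else 0."""
    return [1 if U <= A else 0 for U in subsets(D)]


def kernel(A1, A2, D):
    """Inner product phi(A1)^T phi(A2), computed by explicit enumeration."""
    return sum(a * b for a, b in zip(phi(A1, D), phi(A2, D)))


D = {1, 2, 3, 4}
A1 = {1, 2, 3}
A2 = {2, 3, 4}
print(kernel(A1, A2, D), 2 ** len(A1 & A2))
```

The enumeration costs $2^{|D|}$ operations, whereas the closed form only needs the size of the intersection.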
Problem 6.13 Solution (Wait for update)
Problem 6.14 Solution

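Since the covariance matrix $\mathbf{S}$ is fixed, according to (6.32) we can obtain:

$$\mathbf{g}(\boldsymbol{\mu},\mathbf{x}) = \nabla_{\boldsymbol{\mu}}\ln p(\mathbf{x}|\boldsymbol{\mu}) = \frac{\partial}{\partial\boldsymbol{\mu}}\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}-\boldsymbol{\mu})\right) = \mathbf{S}^{-1}(\mathbf{x}-\boldsymbol{\mu})$$

Therefore, according to (6.34), we can obtain:

$$\mathbf{F} = \mathbb{E}_{\mathbf{x}}\left[\mathbf{g}(\boldsymbol{\mu},\mathbf{x})\mathbf{g}(\boldsymbol{\mu},\mathbf{x})^T\right] = \mathbf{S}^{-1}\,\mathbb{E}_{\mathbf{x}}\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right]\mathbf{S}^{-1}$$

Since $\mathbf{x}\sim\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\mathbf{S})$, we have:

$$\mathbb{E}_{\mathbf{x}}\left[(\mathbf{x}-\boldsymbol{\mu})(\mathbf{x}-\boldsymbol{\mu})^T\right] = \mathbf{S}$$

So we obtain $\mathbf{F} = \mathbf{S}^{-1}$ and then according to (6.33), we have:

$$k(\mathbf{x},\mathbf{x}') = \mathbf{g}(\boldsymbol{\mu},\mathbf{x})^T\mathbf{F}^{-1}\mathbf{g}(\boldsymbol{\mu},\mathbf{x}') = (\mathbf{x}-\boldsymbol{\mu})^T\mathbf{S}^{-1}(\mathbf{x}'-\boldsymbol{\mu})$$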
Problem 6.15 Solution
We rewrite the problem. What we are required to prove is that the Gram matrix $\mathbf{K}$:

$$\mathbf{K} = \begin{bmatrix} k_{11} & k_{12} \\ k_{21} & k_{22} \end{bmatrix},$$

where $k_{ij}$ ($i, j = 1, 2$) is short for $k(x_i, x_j)$, should be positive semidefinite. A positive semidefinite matrix has a nonnegative determinant, i.e.,

$$k_{12}k_{21} \le k_{11}k_{22}.$$

Using the symmetric property of the kernel, i.e., $k_{12} = k_{21}$, we obtain $k(x_1,x_2)^2 \le k(x_1,x_1)\,k(x_2,x_2)$, which is what has been required.
Problem 6.16 Solution
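Based on the total derivative of function $f$, to first order in $\Delta\mathbf{w}$ we have:

$$f\left((\mathbf{w}+\Delta\mathbf{w})^T\boldsymbol{\phi}_1,\ldots,(\mathbf{w}+\Delta\mathbf{w})^T\boldsymbol{\phi}_N\right) - f\left(\mathbf{w}^T\boldsymbol{\phi}_1,\ldots,\mathbf{w}^T\boldsymbol{\phi}_N\right) \approx \sum_{n=1}^N\frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\Delta\mathbf{w}^T\boldsymbol{\phi}_n$$

Which can be further written as:

$$f\left((\mathbf{w}+\Delta\mathbf{w})^T\boldsymbol{\phi}_1,\ldots,(\mathbf{w}+\Delta\mathbf{w})^T\boldsymbol{\phi}_N\right) - f\left(\mathbf{w}^T\boldsymbol{\phi}_1,\ldots,\mathbf{w}^T\boldsymbol{\phi}_N\right) \approx \left[\sum_{n=1}^N\frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n^T\right]\Delta\mathbf{w}$$

Note that here $\boldsymbol{\phi}_n$ is short for $\boldsymbol{\phi}(\mathbf{x}_n)$. Based on the equation above, we can obtain:

$$\nabla_{\mathbf{w}}f = \sum_{n=1}^N\frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n^T$$

Now we focus on the derivative of function $g$ with respect to $\mathbf{w}$:

$$\nabla_{\mathbf{w}}g = \frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}\cdot 2\mathbf{w}^T$$

In order to find the optimal $\mathbf{w}$, we set the derivative of $J$ with respect to $\mathbf{w}$ equal to 0, yielding:

$$\nabla_{\mathbf{w}}J = \nabla_{\mathbf{w}}f + \nabla_{\mathbf{w}}g = \sum_{n=1}^N\frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n^T + \frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}\cdot 2\mathbf{w}^T = 0$$

Rearranging the equation above, we can obtain:

$$\mathbf{w} = -\frac{a}{2}\sum_{n=1}^N\frac{\partial f}{\partial(\mathbf{w}^T\boldsymbol{\phi}_n)}\cdot\boldsymbol{\phi}_n$$

Where we have defined $a = 1\div\frac{\partial g}{\partial(\mathbf{w}^T\mathbf{w})}$, and since $g$ is a monotonically increasing function, we have $a > 0$. Hence $\mathbf{w}$ is a linear combination of the $\boldsymbol{\phi}(\mathbf{x}_n)$, as required.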
Problem 6.17 Solution
We consider a variation in the function y(x) of the form:

y(x) → y(x) + ϵη(x)
Substituting it into (6.39) yields:

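$$\begin{aligned}
E[y+\epsilon\eta] &= \frac{1}{2}\sum_{n=1}^N\int\left\{y+\epsilon\eta-t_n\right\}^2 v(\boldsymbol{\xi})\,d\boldsymbol{\xi} \\
&= \frac{1}{2}\sum_{n=1}^N\int\left\{(y-t_n)^2 + 2\cdot(\epsilon\eta)\cdot(y-t_n) + (\epsilon\eta)^2\right\}v(\boldsymbol{\xi})\,d\boldsymbol{\xi} \\
&= E[y] + \epsilon\sum_{n=1}^N\int\{y-t_n\}\,\eta\,v\,d\boldsymbol{\xi} + O(\epsilon^2)
\end{aligned}$$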

Note that here y is short for y(xn + ξ), η is short for η(xn + ξ) and v is short
for v(ξ) respectively. Several clarifications must be made here. What we have
done is that we vary the function y by a little bit (i.e., ϵη) and then we expand
the corresponding error with respect to the small variation ϵ. The coefficient
before ϵ is actually the first derivative of the error E [ y + ϵη] with respect to
ϵ at ϵ = 0. Since we know that y is the optimal function that can make E
the smallest, the first derivative of the error E [ y + ϵη] should equal to zero at
ϵ = 0, which gives:
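$$\sum_{n=1}^N\int\{y(\mathbf{x}_n+\boldsymbol{\xi})-t_n\}\,\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = 0$$

Now we are required to find a function $y$ that can satisfy the equation above no matter what $\eta$ is. We choose:

$$\eta(\mathbf{x}) = \delta(\mathbf{x}-\mathbf{z})$$

This allows us to evaluate the integral:

$$\sum_{n=1}^N\int\{y(\mathbf{x}_n+\boldsymbol{\xi})-t_n\}\,\eta(\mathbf{x}_n+\boldsymbol{\xi})\,v(\boldsymbol{\xi})\,d\boldsymbol{\xi} = \sum_{n=1}^N\{y(\mathbf{z})-t_n\}\,v(\mathbf{z}-\mathbf{x}_n)$$

We set it to zero and rearrange it, which finally gives (6.40) just as required.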
Problem 6.18 Solution
According to the main text below Eq (6.48), we know that f ( x, t), i.e., f (z),
follows a zero-mean isotropic Gaussian:

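$$f(\mathbf{z}) = \mathcal{N}(\mathbf{z}|\mathbf{0},\sigma^2\mathbf{I})$$

Then $f(x-x_m, t-t_m)$, i.e., $f(\mathbf{z}-\mathbf{z}_m)$, should also satisfy a Gaussian distribution:

$$f(\mathbf{z}-\mathbf{z}_m) = \mathcal{N}(\mathbf{z}|\mathbf{z}_m,\sigma^2\mathbf{I})$$

Where we have defined:

$$\mathbf{z}_m = (x_m, t_m)$$

The integral $\int f(\mathbf{z}-\mathbf{z}_m)\,dt$ corresponds to the marginal distribution with respect to the remaining variable $x$ and, thus, we obtain:

$$\int f(\mathbf{z}-\mathbf{z}_m)\,dt = \mathcal{N}(x|x_m,\sigma^2)$$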

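We substitute all the expressions into Eq (6.48), which gives:

$$\begin{aligned}
p(t|x) &= \frac{p(t,x)}{\int p(t,x)\,dt} = \frac{\sum_n\mathcal{N}(\mathbf{z}|\mathbf{z}_n,\sigma^2\mathbf{I})}{\sum_m\mathcal{N}(x|x_m,\sigma^2)} \\
&= \frac{\sum_n\frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2}(\mathbf{z}-\mathbf{z}_n)^T(\sigma^2\mathbf{I})^{-1}(\mathbf{z}-\mathbf{z}_n)\right)}{\sum_m\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)} \\
&= \frac{\sum_n\frac{1}{2\pi\sigma^2}\exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right)\exp\left(-\frac{1}{2\sigma^2}(t-t_n)^2\right)}{\sum_m\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)} \\
&= \sum_n\frac{\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right)}{\sum_m\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)}\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left(-\frac{1}{2\sigma^2}(t-t_n)^2\right) \\
&= \sum_n\pi_n\cdot\mathcal{N}(t|t_n,\sigma^2)
\end{aligned}$$

Where we have defined:

$$\pi_n = \frac{\exp\left(-\frac{1}{2\sigma^2}(x-x_n)^2\right)}{\sum_m\exp\left(-\frac{1}{2\sigma^2}(x-x_m)^2\right)}$$

We also observe that:

$$\sum_n\pi_n = 1$$

Therefore, the conditional distribution $p(t|x)$ is given by a Gaussian mixture. Similarly, we attempt to find a specific form for Eq (6.46):

$$\begin{aligned}
k(x,x_n) &= \frac{\int f(x-x_n,t)\,dt}{\sum_m\int f(x-x_m,t)\,dt} \\
&= \frac{\mathcal{N}(x|x_n,\sigma^2)}{\sum_m\mathcal{N}(x|x_m,\sigma^2)} = \pi_n
\end{aligned}$$

In other words, the conditional distribution can be more precisely written as:

$$p(t|x) = \sum_n k(x,x_n)\cdot\mathcal{N}(t|t_n,\sigma^2)$$

Thus its mean is given by:

$$\mathbb{E}[t|x] = \sum_n k(x,x_n)\cdot t_n$$

Its variance is given by:

$$\begin{aligned}
\text{var}[t|x] &= \mathbb{E}[t^2|x] - \mathbb{E}[t|x]^2 \\
&= \sum_n k(x,x_n)\cdot(t_n^2+\sigma^2) - \left(\sum_n k(x,x_n)\cdot t_n\right)^2
\end{aligned}$$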
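The conditional mean and variance above are straightforward to evaluate. The following sketch computes the normalized Gaussian weights $k(x, x_n)$ and the two moments for a one-dimensional input; the function names and the toy data are assumptions for illustration.

```python
import numpy as np


def nw_weights(x, x_train, sigma):
    """k(x, x_n) = N(x | x_n, sigma^2) / sum_m N(x | x_m, sigma^2)."""
    log_w = -0.5 * ((x - x_train) / sigma) ** 2
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()


def nw_mean_var(x, x_train, t_train, sigma):
    """Conditional mean sum_n k t_n and variance sum_n k (t_n^2 + sigma^2) - mean^2."""
    k = nw_weights(x, x_train, sigma)
    mean = np.sum(k * t_train)
    var = np.sum(k * (t_train ** 2 + sigma ** 2)) - mean ** 2
    return mean, var


# Toy data (assumed)
x_train = np.array([0.0, 1.0, 2.0])
t_train = np.array([0.0, 1.0, 4.0])
sigma = 0.5

k = nw_weights(1.0, x_train, sigma)
m, v = nw_mean_var(1.0, x_train, t_train, sigma)
print(k, m, v)
```

The variance is never smaller than $\sigma^2$: it is $\sigma^2$ plus the weighted spread of the targets $t_n$ around the conditional mean.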

Problem 6.19 Solution
Similar to Prob.6.17, it is straightforward to show that:

$$y(\mathbf{x}) = \sum_n t_n k(\mathbf{x},\mathbf{x}_n)$$

Where we have defined:

$$k(\mathbf{x},\mathbf{x}_n) = \frac{g(\mathbf{x}_n-\mathbf{x})}{\sum_m g(\mathbf{x}_m-\mathbf{x})}$$
Problem 6.20 Solution
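Since we know that $\mathbf{t}_{N+1} = (t_1, t_2, \ldots, t_N, t_{N+1})^T$ follows a Gaussian distribution, i.e., $\mathbf{t}_{N+1}\sim\mathcal{N}(\mathbf{t}_{N+1}|\mathbf{0},\mathbf{C}_{N+1})$ given in Eq (6.64), if we rearrange its order by putting the last element (i.e., $t_{N+1}$) in the first position, denoted as $\bar{\mathbf{t}}_{N+1}$, it should also satisfy a Gaussian distribution:

$$\bar{\mathbf{t}}_{N+1} = (t_{N+1}, t_1, t_2, \ldots, t_N)^T\sim\mathcal{N}(\bar{\mathbf{t}}_{N+1}|\mathbf{0},\bar{\mathbf{C}}_{N+1})$$

Where we have defined:

$$\bar{\mathbf{C}}_{N+1} = \begin{pmatrix} c & \mathbf{k}^T \\ \mathbf{k} & \mathbf{C}_N \end{pmatrix}$$

Where $\mathbf{k}$ and $c$ have been given in the main text below Eq (6.65). The conditional distribution $p(t_{N+1}|\mathbf{t}_N)$ should also be a Gaussian. By analogy to Eq (2.94)-(2.98), we can simply treat $t_{N+1}$ as $\mathbf{x}_a$, $\mathbf{t}_N$ as $\mathbf{x}_b$, $c$ as $\boldsymbol{\Sigma}_{aa}$, $\mathbf{k}$ as $\boldsymbol{\Sigma}_{ba}$, $\mathbf{k}^T$ as $\boldsymbol{\Sigma}_{ab}$ and $\mathbf{C}_N$ as $\boldsymbol{\Sigma}_{bb}$. Substituting them into Eq (2.79) and Eq (2.80) yields:

$$\Lambda_{aa} = (c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}$$

And:

$$\boldsymbol{\Lambda}_{ab} = -(c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}$$

Then we substitute them into Eq (2.96) and (2.97), which yields:

$$p(t_{N+1}|\mathbf{t}_N) = \mathcal{N}(\mu_{a|b},\Lambda_{aa}^{-1})$$

For its mean $\mu_{a|b}$, we have:

$$\begin{aligned}
\mu_{a|b} &= 0 - \left(c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}\right)\cdot\left[-(c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})^{-1}\mathbf{k}^T\mathbf{C}_N^{-1}\right]\cdot(\mathbf{t}_N - \mathbf{0}) \\
&= \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{t}_N = m(\mathbf{x}_{N+1})
\end{aligned}$$

Similarly, for its variance $\Lambda_{aa}^{-1}$ (note that since $t_{N+1}$ is a scalar, the mean and the covariance matrix degenerate to the one-dimensional case), we have:

$$\Lambda_{aa}^{-1} = c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} = \sigma^2(\mathbf{x}_{N+1})$$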
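The agreement between the partitioned-Gaussian route and Eq (6.66)-(6.67) can be confirmed numerically. The sketch below builds the rearranged covariance, inverts it to read off $\Lambda_{aa}$ and $\Lambda_{ab}$, and compares the result with the direct formulas; the kernel, the data and the noise precision `beta` are assumed values used only for this check.

```python
import numpy as np


def rbf(a, b):
    """Squared-exponential kernel matrix between 1-D input arrays (assumed kernel)."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)


def gp_predict(C_N, k, c, t):
    """Eq (6.66) and (6.67): mean k^T C_N^{-1} t, variance c - k^T C_N^{-1} k."""
    sol = np.linalg.solve(C_N, k)        # C_N^{-1} k (C_N is symmetric)
    return sol @ t, c - k @ sol


def predict_by_conditioning(C_N, k, c, t):
    """Same quantities from the precision matrix of the rearranged joint Gaussian."""
    N = len(t)
    C_bar = np.empty((N + 1, N + 1))
    C_bar[0, 0] = c
    C_bar[0, 1:] = k
    C_bar[1:, 0] = k
    C_bar[1:, 1:] = C_N
    Lam = np.linalg.inv(C_bar)
    lam_aa, lam_ab = Lam[0, 0], Lam[0, 1:]
    mean = -(lam_ab @ t) / lam_aa        # (2.97) with zero prior means
    var = 1.0 / lam_aa                   # (2.96)
    return mean, var


x = np.array([-1.0, 0.0, 0.5, 2.0])
t = np.sin(x)
beta = 10.0
x_new = np.array([1.0])

C_N = rbf(x, x) + np.eye(len(x)) / beta
k = rbf(x, x_new)[:, 0]
c = 1.0 + 1.0 / beta                     # k(x_new, x_new) + beta^{-1}

print(gp_predict(C_N, k, c, t))
print(predict_by_conditioning(C_N, k, c, t))
```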

Problem 6.21 Solution
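We follow the hint, beginning by verifying the mean. We write Eq (6.62) in matrix form:

$$\mathbf{C}_N = \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N$$

Where we have used Eq (6.54). Here $\boldsymbol{\Phi}$ is the design matrix defined below Eq (6.51) and $\mathbf{I}_N$ is an identity matrix. Before we use Eq (6.66), we need to obtain $\mathbf{k}$:

$$\begin{aligned}
\mathbf{k} &= [k(\mathbf{x}_1,\mathbf{x}_{N+1}), k(\mathbf{x}_2,\mathbf{x}_{N+1}), \ldots, k(\mathbf{x}_N,\mathbf{x}_{N+1})]^T \\
&= \frac{1}{\alpha}[\boldsymbol{\phi}(\mathbf{x}_1)^T\boldsymbol{\phi}(\mathbf{x}_{N+1}), \boldsymbol{\phi}(\mathbf{x}_2)^T\boldsymbol{\phi}(\mathbf{x}_{N+1}), \ldots, \boldsymbol{\phi}(\mathbf{x}_N)^T\boldsymbol{\phi}(\mathbf{x}_{N+1})]^T \\
&= \frac{1}{\alpha}\boldsymbol{\Phi}\boldsymbol{\phi}(\mathbf{x}_{N+1})
\end{aligned}$$

Now we substitute all the expressions into Eq (6.66), yielding:

$$m(\mathbf{x}_{N+1}) = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\boldsymbol{\Phi}^T\left[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N\right]^{-1}\mathbf{t}$$

Next using matrix identity (C.6), we obtain:

$$\boldsymbol{\Phi}^T\left[\alpha^{-1}\boldsymbol{\Phi}\boldsymbol{\Phi}^T + \beta^{-1}\mathbf{I}_N\right]^{-1} = \alpha\beta\left[\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \alpha\mathbf{I}_M\right]^{-1}\boldsymbol{\Phi}^T = \alpha\beta\,\mathbf{S}_N\boldsymbol{\Phi}^T$$

Where we have used Eq (3.54). Substituting it into $m(\mathbf{x}_{N+1})$, we obtain:

$$m(\mathbf{x}_{N+1}) = \beta\boldsymbol{\phi}(\mathbf{x}_{N+1})^T\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t} = \left\langle\boldsymbol{\phi}(\mathbf{x}_{N+1}),\ \beta\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}\right\rangle$$

Where $\langle\cdot,\cdot\rangle$ represents the inner product. Comparing the result above with Eq (3.58), (3.54) and (3.53), we conclude that the means are equal. It is similar for the variance. We substitute $c$, $\mathbf{k}$ and $\mathbf{C}_N$ into Eq (6.67). Then we simplify the expression using matrix identity (C.7). Finally, we will observe that it is equal to Eq (3.59).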
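The equivalence of the two predictive distributions can also be checked numerically for a random design matrix. In the sketch below, `gp_view` evaluates Eq (6.66)-(6.67) with the kernel $k(\mathbf{x},\mathbf{x}') = \alpha^{-1}\boldsymbol{\phi}(\mathbf{x})^T\boldsymbol{\phi}(\mathbf{x}')$, and `blr_view` evaluates Eq (3.58)-(3.59) with the zero-mean isotropic prior; the function names and the random test data are assumptions.

```python
import numpy as np


def gp_view(Phi, t, phi_new, alpha, beta):
    """Gaussian-process prediction with kernel k(x, x') = phi(x)^T phi(x') / alpha."""
    N = Phi.shape[0]
    C_N = Phi @ Phi.T / alpha + np.eye(N) / beta
    k = Phi @ phi_new / alpha
    c = phi_new @ phi_new / alpha + 1.0 / beta
    mean = k @ np.linalg.solve(C_N, t)
    var = c - k @ np.linalg.solve(C_N, k)
    return mean, var


def blr_view(Phi, t, phi_new, alpha, beta):
    """Bayesian linear regression prediction with S_N^{-1} = alpha I + beta Phi^T Phi."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    mean = beta * phi_new @ S_N @ Phi.T @ t
    var = 1.0 / beta + phi_new @ S_N @ phi_new
    return mean, var


rng = np.random.default_rng(1)
Phi = rng.normal(size=(6, 3))       # N = 6 data points, M = 3 basis functions (assumed)
t = rng.normal(size=6)
phi_new = rng.normal(size=3)
alpha, beta = 0.7, 4.0

print(gp_view(Phi, t, phi_new, alpha, beta))
print(blr_view(Phi, t, phi_new, alpha, beta))
```

The GP route inverts an $N\times N$ matrix while the linear-regression route inverts an $M\times M$ matrix, so the cheaper one depends on whether there are more data points or more basis functions.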
Problem 6.22 Solution
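Based on Eq (6.64) and (6.65), we first write down the joint distribution for $\mathbf{t}_{N+L} = [t_1(\mathbf{x}), t_2(\mathbf{x}), \ldots, t_{N+L}(\mathbf{x})]^T$:

$$p(\mathbf{t}_{N+L}) = \mathcal{N}(\mathbf{t}_{N+L}|\mathbf{0},\mathbf{C}_{N+L})$$

Where $\mathbf{C}_{N+L}$ is similarly given by:

$$\mathbf{C}_{N+L} = \begin{pmatrix} \mathbf{C}_{1,N} & \mathbf{K} \\ \mathbf{K}^T & \mathbf{C}_{N+1,N+L} \end{pmatrix}$$

The expression above has already implicitly divided the vector $\mathbf{t}_{N+L}$ into two parts. Similar to Prob.6.20, for later simplicity we rearrange the order of $\mathbf{t}_{N+L}$, denoted as $\bar{\mathbf{t}}_{N+L} = [t_{N+1}, \ldots, t_{N+L}, t_1, \ldots, t_N]^T$. Moreover, $\bar{\mathbf{t}}_{N+L}$ should also follow a Gaussian distribution:

$$p(\bar{\mathbf{t}}_{N+L}) = \mathcal{N}(\bar{\mathbf{t}}_{N+L}|\mathbf{0},\bar{\mathbf{C}}_{N+L})$$

Where we have defined:

$$\bar{\mathbf{C}}_{N+L} = \begin{pmatrix} \mathbf{C}_{N+1,N+L} & \mathbf{K}^T \\ \mathbf{K} & \mathbf{C}_{1,N} \end{pmatrix}$$

Now we use Eq (2.94)-(2.98) and Eq (2.79)-(2.80) to derive the conditional distribution, beginning by calculating $\boldsymbol{\Lambda}_{aa}$:

$$\boldsymbol{\Lambda}_{aa} = (\mathbf{C}_{N+1,N+L} - \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{K})^{-1}$$

and $\boldsymbol{\Lambda}_{ab}$:

$$\boldsymbol{\Lambda}_{ab} = -(\mathbf{C}_{N+1,N+L} - \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{K})^{-1}\cdot\mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}$$

Now we can obtain:

$$p(t_{N+1}, \ldots, t_{N+L}|\mathbf{t}_N) = \mathcal{N}(\boldsymbol{\mu}_{a|b},\boldsymbol{\Lambda}_{aa}^{-1})$$

Where we have defined:

$$\boldsymbol{\mu}_{a|b} = \mathbf{0} + \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{t}_N = \mathbf{K}^T\cdot\mathbf{C}_{1,N}^{-1}\cdot\mathbf{t}_N$$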

If now we want to find the conditional distribution p( t j |t N ), where N + 1 ≤
j ≤ N + L, we only need to find the corresponding entry in the mean (i.e., the
( j − N )-th entry) and covariance matrix (i.e., the ( j − N )-th diagonal entry) of
p( t N +1 , ..., t N +L |t N ). In this case, it will degenerate to Eq (6.66) and (6.67)
just as required.
Problem 6.24 Solution
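By definition, we only need to prove that for an arbitrary vector $\mathbf{x}\ne\mathbf{0}$, $\mathbf{x}^T\mathbf{W}\mathbf{x}$ is positive. Here suppose that $\mathbf{W}$ is an $M\times M$ matrix. We expand the multiplication:

$$\mathbf{x}^T\mathbf{W}\mathbf{x} = \sum_{i=1}^M\sum_{j=1}^M W_{ij}\cdot x_i\cdot x_j = \sum_{i=1}^M W_{ii}\cdot x_i^2$$

where we have used the fact that $\mathbf{W}$ is a diagonal matrix. Since $W_{ii} > 0$, we obtain $\mathbf{x}^T\mathbf{W}\mathbf{x} > 0$ just as required. Suppose we have two positive definite matrices, denoted as $\mathbf{A}_1$ and $\mathbf{A}_2$, i.e., for an arbitrary vector $\mathbf{x}\ne\mathbf{0}$, we have $\mathbf{x}^T\mathbf{A}_1\mathbf{x} > 0$ and $\mathbf{x}^T\mathbf{A}_2\mathbf{x} > 0$. Therefore, we can obtain:

$$\mathbf{x}^T(\mathbf{A}_1+\mathbf{A}_2)\mathbf{x} = \mathbf{x}^T\mathbf{A}_1\mathbf{x} + \mathbf{x}^T\mathbf{A}_2\mathbf{x} > 0$$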
Just as required.
Problem 6.25 Solution
Based on Newton-Raphson formula, Eq(6.81) and Eq(6.82), we have:
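$$\begin{aligned}
\mathbf{a}_N^{new} &= \mathbf{a}_N - (-\mathbf{W}_N - \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N) \\
&= \mathbf{a}_N + (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N) \\
&= (\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}\left[(\mathbf{W}_N + \mathbf{C}_N^{-1})\mathbf{a}_N + \mathbf{t}_N - \boldsymbol{\sigma}_N - \mathbf{C}_N^{-1}\mathbf{a}_N\right] \\
&= \mathbf{C}_N\mathbf{C}_N^{-1}(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N) \\
&= \mathbf{C}_N(\mathbf{W}_N\mathbf{C}_N + \mathbf{I})^{-1}(\mathbf{t}_N - \boldsymbol{\sigma}_N + \mathbf{W}_N\mathbf{a}_N)
\end{aligned}$$

In the last step we have used $\mathbf{C}_N^{-1}(\mathbf{W}_N + \mathbf{C}_N^{-1})^{-1} = \left[(\mathbf{W}_N + \mathbf{C}_N^{-1})\mathbf{C}_N\right]^{-1} = (\mathbf{W}_N\mathbf{C}_N + \mathbf{I})^{-1}$.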

Just as required.
Problem 6.26 Solution
Using Eq(6.77), (6.78) and (6.86), we can obtain:
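$$\begin{aligned}
p(a_{N+1}|\mathbf{t}_N) &= \int p(a_{N+1}|\mathbf{a}_N)\,p(\mathbf{a}_N|\mathbf{t}_N)\,d\mathbf{a}_N \\
&= \int\mathcal{N}(a_{N+1}|\mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N,\ c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k})\cdot\mathcal{N}(\mathbf{a}_N|\mathbf{a}_N^{\star},\mathbf{H}^{-1})\,d\mathbf{a}_N
\end{aligned}$$

By analogy to Eq (2.115), i.e.,

$$p(\mathbf{y}) = \int p(\mathbf{y}|\mathbf{x})\,p(\mathbf{x})\,d\mathbf{x}$$

We can obtain:

$$p(a_{N+1}|\mathbf{t}_N) = \mathcal{N}(\mathbf{A}\boldsymbol{\mu}+\mathbf{b},\ \mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T) \qquad (*)$$

Where we have defined:

$$\mathbf{A} = \mathbf{k}^T\mathbf{C}_N^{-1},\quad \mathbf{b} = \mathbf{0},\quad \mathbf{L}^{-1} = c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k}$$

And

$$\boldsymbol{\mu} = \mathbf{a}_N^{\star},\quad \boldsymbol{\Lambda} = \mathbf{H}$$

Therefore, the mean is given by:

$$\mathbf{A}\boldsymbol{\mu}+\mathbf{b} = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{a}_N^{\star} = \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{C}_N(\mathbf{t}_N-\boldsymbol{\sigma}_N) = \mathbf{k}^T(\mathbf{t}_N-\boldsymbol{\sigma}_N)$$

Where we have used Eq (6.84). The covariance matrix is given by:

$$\begin{aligned}
\mathbf{L}^{-1}+\mathbf{A}\boldsymbol{\Lambda}^{-1}\mathbf{A}^T &= c - \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{k} + \mathbf{k}^T\mathbf{C}_N^{-1}\mathbf{H}^{-1}(\mathbf{k}^T\mathbf{C}_N^{-1})^T \\
&= c - \mathbf{k}^T\left(\mathbf{C}_N^{-1} - \mathbf{C}_N^{-1}\mathbf{H}^{-1}\mathbf{C}_N^{-1}\right)\mathbf{k} \\
&= c - \mathbf{k}^T\left(\mathbf{C}_N^{-1} - \mathbf{C}_N^{-1}(\mathbf{W}_N+\mathbf{C}_N^{-1})^{-1}\mathbf{C}_N^{-1}\right)\mathbf{k} \\
&= c - \mathbf{k}^T\left(\mathbf{C}_N^{-1} - (\mathbf{C}_N\mathbf{W}_N\mathbf{C}_N+\mathbf{C}_N)^{-1}\right)\mathbf{k}
\end{aligned}$$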

Where we have used Eq (6.85) and the fact that C N is symmetric. Then we
use matrix identity (C.7) to further reduce the expression, which will finally
give Eq (6.88).
Problem 6.27 Solution (Wait for update)
This problem is really complicated. What's more, I find that Eq (6.91) does not seem right.

0.7 Sparse Kernel Machines

Problem 7.1 Solution
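By analogy to Eq (2.249), we can obtain:
\[
p(\mathbf{x}|t) =
\begin{cases}
\dfrac{1}{N_{+1}} \displaystyle\sum_{n=1}^{N_{+1}} \dfrac{1}{Z_k}\, k(\mathbf{x}, \mathbf{x}_n), & t = +1 \\[2ex]
\dfrac{1}{N_{-1}} \displaystyle\sum_{n=1}^{N_{-1}} \dfrac{1}{Z_k}\, k(\mathbf{x}, \mathbf{x}_n), & t = -1
\end{cases}
\]
where $N_{+1}$ represents the number of samples with label $t = +1$, and similarly for $N_{-1}$; each sum runs over the samples of the corresponding class. $Z_k$ is a normalization constant representing the volume of the hypercube. Since we have equal priors for the classes, i.e.,
\[
p(t) =
\begin{cases}
0.5, & t = +1 \\
0.5, & t = -1
\end{cases}
\]
Based on Bayes’ Theorem, we have $p(t|\mathbf{x}) \propto p(\mathbf{x}|t) \cdot p(t)$, yielding:
\[
p(t|\mathbf{x}) =
\begin{cases}
\dfrac{1}{Z} \cdot \dfrac{1}{N_{+1}} \displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n), & t = +1 \\[2ex]
\dfrac{1}{Z} \cdot \dfrac{1}{N_{-1}} \displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x}, \mathbf{x}_n), & t = -1
\end{cases}
\]
Where $1/Z$ is a normalization constant that guarantees the posterior sums to 1 over $t$. To classify a new sample $\mathbf{x}^{\star}$, we try to find the value $t^{\star}$ that maximizes $p(t|\mathbf{x})$. Therefore, we can obtain:
\[
t^{\star} =
\begin{cases}
+1 & \text{if } \dfrac{1}{N_{+1}} \displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) \ge \dfrac{1}{N_{-1}} \displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x}, \mathbf{x}_n) \\[2ex]
-1 & \text{if } \dfrac{1}{N_{+1}} \displaystyle\sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) \le \dfrac{1}{N_{-1}} \displaystyle\sum_{n=1}^{N_{-1}} k(\mathbf{x}, \mathbf{x}_n)
\end{cases}
\qquad (*)
\]
If we now choose the kernel function as $k(\mathbf{x}, \mathbf{x}') = \mathbf{x}^T\mathbf{x}'$, we have:
\[
\frac{1}{N_{+1}} \sum_{n=1}^{N_{+1}} k(\mathbf{x}, \mathbf{x}_n) = \frac{1}{N_{+1}} \sum_{n=1}^{N_{+1}} \mathbf{x}^T\mathbf{x}_n = \mathbf{x}^T\tilde{\mathbf{x}}_{+1}
\]
Where we have denoted:
\[
\tilde{\mathbf{x}}_{+1} = \frac{1}{N_{+1}} \sum_{n=1}^{N_{+1}} \mathbf{x}_n
\]
and similarly for $\tilde{\mathbf{x}}_{-1}$. Therefore, the classification criterion $(*)$ can be written as:
\[
t^{\star} =
\begin{cases}
+1 & \text{if } \mathbf{x}^T\tilde{\mathbf{x}}_{+1} \ge \mathbf{x}^T\tilde{\mathbf{x}}_{-1} \\
-1 & \text{if } \mathbf{x}^T\tilde{\mathbf{x}}_{+1} \le \mathbf{x}^T\tilde{\mathbf{x}}_{-1}
\end{cases}
\]
When we choose the kernel function as $k(\mathbf{x}, \mathbf{x}') = \phi(\mathbf{x})^T\phi(\mathbf{x}')$, we can similarly obtain the classification criterion:
\[
t^{\star} =
\begin{cases}
+1 & \text{if } \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{+1}) \ge \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{-1}) \\
-1 & \text{if } \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{+1}) \le \phi(\mathbf{x})^T\tilde{\phi}(\mathbf{x}_{-1})
\end{cases}
\]
Where we have defined:
\[
\tilde{\phi}(\mathbf{x}_{+1}) = \frac{1}{N_{+1}} \sum_{n=1}^{N_{+1}} \phi(\mathbf{x}_n)
\]
and similarly for $\tilde{\phi}(\mathbf{x}_{-1})$.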
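The decision rule above can be sketched in a few lines. The function names and the toy data are hypothetical and chosen only for illustration; with the linear kernel, the kernel-average rule and the comparison of inner products with the class means should give the same label.

```python
import numpy as np

def classify_by_kernel_mean(x, X_pos, X_neg, kernel):
    """Return +1 if the mean kernel value to the positive class is at least
    the mean kernel value to the negative class, otherwise -1."""
    score_pos = np.mean([kernel(x, xn) for xn in X_pos])
    score_neg = np.mean([kernel(x, xn) for xn in X_neg])
    return 1 if score_pos >= score_neg else -1

def linear_kernel(x, z):
    return float(x @ z)

# Hypothetical toy data: three positive and two negative samples.
X_pos = np.array([[2.0, 1.0], [3.0, 2.0], [2.5, 0.5]])
X_neg = np.array([[-1.0, -2.0], [-2.0, -1.0]])
x_new = np.array([1.0, 0.5])

label_kernel = classify_by_kernel_mean(x_new, X_pos, X_neg, linear_kernel)

# Same decision from the class means, valid for the linear kernel only.
mean_pos, mean_neg = X_pos.mean(axis=0), X_neg.mean(axis=0)
label_means = 1 if x_new @ mean_pos >= x_new @ mean_neg else -1

print(label_kernel, label_means)
```

Replacing `linear_kernel` by any other kernel keeps the first rule valid, while the second shortcut then has to be computed in feature space.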

Problem 7.2 Solution
Suppose we have found $\mathbf{w}_0$ and $b_0$ such that all points satisfy Eq (7.5) and Eq (7.3) is simultaneously minimized. The hyperplane determined by $\mathbf{w}_0$ and $b_0$ is then the maximum-margin decision boundary. Now if the constraint in Eq (7.5) becomes:
\[
t_n(\mathbf{w}^T\phi(\mathbf{x}_n) + b) \ge \gamma
\]
with $\gamma > 0$, we can conclude that after the change of variables $\mathbf{w}_0 \to \gamma\mathbf{w}_0$ and $b_0 \to \gamma b_0$ the constraint is still satisfied and Eq (7.3) is still minimized. In other words, if the right side of the constraint changes from 1 to $\gamma$, the new optimal solution is given by $\gamma\mathbf{w}_0$ and $\gamma b_0$. This is the same hyperplane as before, because $\gamma\mathbf{w}_0^T\phi(\mathbf{x}) + \gamma b_0 = 0$ is equivalent to $\mathbf{w}_0^T\phi(\mathbf{x}) + b_0 = 0$. Hence the minimum distance from the points to the decision boundary is still the same.
Problem 7.3 Solution
Suppose $\mathbf{x}_1$ belongs to class one with target value $t_1 = 1$, and similarly $\mathbf{x}_2$ belongs to class two with target value $t_2 = -1$. Since we only have two points, they must both satisfy $t_i \cdot y(\mathbf{x}_i) = 1$, as shown in Fig. 7.1. Therefore, we have an equality constrained optimization problem:
\[
\text{minimize } \frac{1}{2}||\mathbf{w}||^2 \quad \text{s.t.} \quad
\begin{cases}
\mathbf{w}^T\phi(\mathbf{x}_1) + b = 1 \\
\mathbf{w}^T\phi(\mathbf{x}_2) + b = -1
\end{cases}
\]
This is a convex optimization problem (a quadratic objective with linear equality constraints), and it is known that a global optimum exists.
Problem 7.4 Solution
Since we know that
\[
\rho = \frac{1}{||\mathbf{w}||}
\]
Therefore, we have:
\[
\frac{1}{\rho^2} = ||\mathbf{w}||^2
\]
In other words, we only need to prove that
\[
||\mathbf{w}||^2 = \sum_{n=1}^{N} a_n
\]
When we find the optimal solution, the second term on the right hand side of Eq (7.7) vanishes. Based on Eq (7.8) and Eq (7.10), we also observe that its dual is given by:
\[
\tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}||\mathbf{w}||^2
\]
Therefore, we have:
\[
\frac{1}{2}||\mathbf{w}||^2 = L(\mathbf{a}) = \tilde{L}(\mathbf{a}) = \sum_{n=1}^{N} a_n - \frac{1}{2}||\mathbf{w}||^2
\]
Rearranging it, we obtain the required result.
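For the two-point problem of Prob.7.3 the solution is available in closed form, which gives a quick way to check the relation $||\mathbf{w}||^2 = \sum_n a_n$. The sketch below assumes two hypothetical feature vectors; the closed form $a_1 = a_2 = 2/||\phi_1 - \phi_2||^2$ follows from Eq (7.8), Eq (7.9) and the two equality constraints.

```python
import numpy as np

# Hypothetical feature vectors: phi1 has target +1, phi2 has target -1.
phi1 = np.array([1.0, 2.0])
phi2 = np.array([-1.0, 0.5])

d = phi1 - phi2
a = 2.0 / (d @ d)        # a_1 = a_2 = a, from sum_n a_n t_n = 0
w = a * d                # w = a_1 * phi1 - a_2 * phi2
b = 1.0 - w @ phi1       # from w^T phi1 + b = 1

w_norm_sq = w @ w
sum_a = 2.0 * a
print(w @ phi1 + b, w @ phi2 + b, w_norm_sq, sum_a)
```

Both constraints hold with equality, and `w_norm_sq` equals `sum_a`, i.e. $1/\rho^2$.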
Problem 7.5 Solution
We have already proved this problem in the previous one.
Problem 7.6 Solution
If the target variable can only take values in $\{-1, 1\}$, and we know that
\[
p(t = 1|y) = \sigma(y)
\]
We can obtain:
\[
p(t = -1|y) = 1 - p(t = 1|y) = 1 - \sigma(y) = \sigma(-y)
\]
Therefore, combining these two situations, we can derive:
\[
p(t|y) = \sigma(yt)
\]
Consequently, we can obtain the negative log likelihood:
\[
-\ln p(D) = -\ln \prod_{n=1}^{N} \sigma(y_n t_n) = -\sum_{n=1}^{N} \ln \sigma(y_n t_n) = \sum_{n=1}^{N} E_{LR}(y_n t_n)
\]
Here $D$ represents the dataset, i.e., $D = \{(\mathbf{x}_n, t_n);\ n = 1, 2, ..., N\}$, and $E_{LR}(yt)$ is given by Eq (7.48). With the addition of a quadratic regularization, we obtain exactly Eq (7.47).
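The two facts used here, $1 - \sigma(y) = \sigma(-y)$ and $-\ln\sigma(yt) = \ln(1 + \exp(-yt)) = E_{LR}(yt)$, can be confirmed numerically. The model outputs and targets below are hypothetical values chosen for illustration.

```python
import numpy as np

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def logistic_error(z):
    # E_LR(yt) = ln(1 + exp(-yt)), the form given in Eq (7.48).
    return np.log1p(np.exp(-z))

# Hypothetical model outputs y_n and targets t_n in {-1, +1}.
y = np.array([-2.0, -0.3, 0.0, 0.7, 3.0])
t = np.array([1.0, -1.0, 1.0, -1.0, 1.0])

symmetry_gap = np.max(np.abs((1.0 - sigmoid(y)) - sigmoid(-y)))
nll_from_sigmoid = -np.sum(np.log(sigmoid(y * t)))
nll_from_error = np.sum(logistic_error(y * t))
print(symmetry_gap, nll_from_sigmoid, nll_from_error)
```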
Problem 7.7 Solution
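The derivatives are easy to obtain. Our main task is to derive Eq (7.61) using Eq (7.57)-(7.60).
\[
\begin{aligned}
L &= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(\mu_n\xi_n + \hat{\mu}_n\hat{\xi}_n) \\
&\quad - \sum_{n=1}^{N} a_n(\epsilon + \xi_n + y_n - t_n) - \sum_{n=1}^{N} \hat{a}_n(\epsilon + \hat{\xi}_n - y_n + t_n) \\
&= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n + \mu_n)\xi_n - \sum_{n=1}^{N}(\hat{a}_n + \hat{\mu}_n)\hat{\xi}_n \\
&\quad - \sum_{n=1}^{N} a_n(\epsilon + y_n - t_n) - \sum_{n=1}^{N} \hat{a}_n(\epsilon - y_n + t_n) \\
&= C\sum_{n=1}^{N}(\xi_n + \hat{\xi}_n) + \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N} C\xi_n - \sum_{n=1}^{N} C\hat{\xi}_n \\
&\quad - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n - \hat{a}_n)(y_n - t_n) \\
&= \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n - \hat{a}_n)(y_n - t_n) \\
&= \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b - t_n) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon \\
&= \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)(\mathbf{w}^T\phi(\mathbf{x}_n) + b) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n - \hat{a}_n)\mathbf{w}^T\phi(\mathbf{x}_n) - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \\
&= \frac{1}{2}||\mathbf{w}||^2 - ||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n \\
&= -\frac{1}{2}||\mathbf{w}||^2 - \sum_{n=1}^{N}(a_n + \hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n - \hat{a}_n)t_n
\end{aligned}
\]
Here the third equality uses Eq (7.59) and (7.60), the term containing $b$ vanishes because of Eq (7.58), and the term $\sum_n (a_n - \hat{a}_n)\mathbf{w}^T\phi(\mathbf{x}_n)$ equals $||\mathbf{w}||^2$ because of Eq (7.57). Writing $||\mathbf{w}||^2$ in terms of the kernel via Eq (7.57) gives Eq (7.61), just as required.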
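The algebra can be cross-checked numerically: pick multipliers that satisfy the stationarity conditions (7.57)-(7.60), evaluate the Lagrangian (7.56) directly, and compare it with the dual expression. Everything below (sizes, seed, values of $C$, $\epsilon$, $b$, the slack variables) is hypothetical test data; the identity is purely algebraic, so no optimization is performed.

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 6, 3
C, eps, b = 2.0, 0.1, 0.4

Phi = rng.normal(size=(N, M))        # rows are phi(x_n)
t = rng.normal(size=N)
a = rng.uniform(0.0, C, size=N)
a_hat = a[::-1].copy()               # same entries reversed, so sum(a - a_hat) = 0 (7.58)
mu, mu_hat = C - a, C - a_hat        # (7.59) and (7.60)
xi = rng.uniform(0.0, 1.0, size=N)   # arbitrary slack values
xi_hat = rng.uniform(0.0, 1.0, size=N)

w = Phi.T @ (a - a_hat)              # (7.57)
y = Phi @ w + b

# Lagrangian (7.56) evaluated term by term.
L_primal = (C * np.sum(xi + xi_hat) + 0.5 * w @ w
            - np.sum(mu * xi + mu_hat * xi_hat)
            - np.sum(a * (eps + xi + y - t))
            - np.sum(a_hat * (eps + xi_hat - y + t)))

# Dual form (7.61) with the linear kernel K = Phi Phi^T.
K = Phi @ Phi.T
diff = a - a_hat
L_dual = -0.5 * diff @ K @ diff - eps * np.sum(a + a_hat) + np.sum(diff * t)
print(L_primal, L_dual)
```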
Problem 7.8 Solution
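This follows directly from the KKT conditions described in Eq (7.67) and (7.68): since $(C - a_n)\xi_n = 0$, any point with $\xi_n > 0$ must have $a_n = C$, and since $(C - \hat{a}_n)\hat{\xi}_n = 0$, any point with $\hat{\xi}_n > 0$ must have $\hat{a}_n = C$.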
Problem 7.9 Solution
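The prior is given by Eq (7.80).
\[
p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i|0, \alpha_i^{-1}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})
\]
Where we have defined:
\[
\mathbf{A} = \text{diag}(\alpha_i)
\]
The likelihood is given by Eq (7.79).
\[
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) &= \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) \\
&= \prod_{n=1}^{N} \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \beta^{-1}) \\
&= \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})
\end{aligned}
\]
Where we have defined:
\[
\boldsymbol{\Phi} = [\phi(\mathbf{x}_1), \phi(\mathbf{x}_2), ..., \phi(\mathbf{x}_N)]^T
\]
Our definitions of $\boldsymbol{\Phi}$ and $\mathbf{A}$ are consistent with the main text. Therefore, according to Eq (2.113)-Eq (2.117), we have:
\[
p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \boldsymbol{\alpha}, \beta) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \boldsymbol{\Sigma})
\]
Where we have defined:
\[
\boldsymbol{\Sigma} = (\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}
\]
And
\[
\mathbf{m} = \beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}
\]
Just as required.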
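One way to check the posterior mean is to verify that it is a stationary point of the log posterior, whose gradient with respect to $\mathbf{w}$ is $\beta\boldsymbol{\Phi}^T(\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}) - \mathbf{A}\mathbf{w}$. The sketch below uses hypothetical random data and hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 8, 3

# Hypothetical design matrix, targets and hyperparameters.
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = np.array([0.5, 2.0, 1.0])
beta = 4.0

A = np.diag(alpha)
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)   # Eq (7.83)
m = beta * Sigma @ Phi.T @ t                    # Eq (7.82)

# Gradient of ln p(t|w) + ln p(w|alpha) with respect to w, evaluated at m.
grad_at_m = beta * Phi.T @ (t - Phi @ m) - A @ m
print(np.max(np.abs(grad_at_m)))
```

Because the log posterior is quadratic in $\mathbf{w}$ with Hessian $-\boldsymbol{\Sigma}^{-1}$, a vanishing gradient at $\mathbf{m}$ identifies it as the mean.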
Problem 7.10&7.11 Solution
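It is quite similar to the previous problem. We begin by writing down the prior:
\[
p(\mathbf{w}|\boldsymbol{\alpha}) = \prod_{i=1}^{M} \mathcal{N}(w_i|0, \alpha_i^{-1}) = \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})
\]
Then we write down the likelihood:
\[
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta) &= \prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w}, \beta^{-1}) \\
&= \prod_{n=1}^{N} \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \beta^{-1}) \\
&= \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})
\end{aligned}
\]
Since we know that:
\[
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \int p(\mathbf{t}|\mathbf{X}, \mathbf{w}, \beta)\, p(\mathbf{w}|\boldsymbol{\alpha})\, d\mathbf{w}
\]
First, as required by Prob.7.10, we will solve it by completing the square. We begin by writing down the expression for $p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta)$:
\[
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) &= \int \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})\, \mathcal{N}(\mathbf{t}|\boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w}
\end{aligned}
\]
Where we have defined:
\[
E(\mathbf{w}) = \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} + \frac{\beta}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{w}||^2
\]
We expand $E(\mathbf{w})$ with respect to $\mathbf{w}$:
\[
\begin{aligned}
E(\mathbf{w}) &= \frac{1}{2}\left\{\mathbf{w}^T(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\mathbf{w} - 2\beta\mathbf{t}^T(\boldsymbol{\Phi}\mathbf{w}) + \beta\mathbf{t}^T\mathbf{t}\right\} \\
&= \frac{1}{2}\left\{\mathbf{w}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} - 2\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{w} + \beta\mathbf{t}^T\mathbf{t}\right\} \\
&= \frac{1}{2}\left\{(\mathbf{w} - \mathbf{m})^T\boldsymbol{\Sigma}^{-1}(\mathbf{w} - \mathbf{m}) + \beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m}\right\}
\end{aligned}
\]
Where we have used Eq (7.82) and Eq (7.83). Substituting $E(\mathbf{w})$ into the integral, we will obtain:
\[
\begin{aligned}
p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta)
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \int \exp\{-E(\mathbf{w})\}\, d\mathbf{w} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot \frac{1}{(2\pi)^{M/2}} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} \exp\left\{-\frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m})\right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot |\boldsymbol{\Sigma}|^{1/2} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \exp\left\{-\frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m})\right\} \\
&= \left(\frac{\beta}{2\pi}\right)^{N/2} \cdot |\boldsymbol{\Sigma}|^{1/2} \cdot \prod_{i=1}^{M} \alpha_i^{1/2} \cdot \exp\{-E(\mathbf{t})\}
\end{aligned}
\]
We further expand $E(\mathbf{t})$:
\[
\begin{aligned}
E(\mathbf{t}) &= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m}) \\
&= \frac{1}{2}\left(\beta\mathbf{t}^T\mathbf{t} - (\beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t})^T\boldsymbol{\Sigma}^{-1}(\beta\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t})\right) \\
&= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= \frac{1}{2}(\beta\mathbf{t}^T\mathbf{t} - \beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T(\beta\mathbf{I} - \beta^2\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T)\mathbf{t} \\
&= \frac{1}{2}\mathbf{t}^T\left[\beta\mathbf{I} - \beta\boldsymbol{\Phi}(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\beta\right]\mathbf{t} \\
&= \frac{1}{2}\mathbf{t}^T(\beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T)^{-1}\mathbf{t} = \frac{1}{2}\mathbf{t}^T\mathbf{C}^{-1}\mathbf{t}
\end{aligned}
\]
Note that in the last step we have used matrix identity Eq (C.7). Therefore, as we know that the pdf is Gaussian and the exponential term is given by $E(\mathbf{t})$, we can easily write down Eq (7.85) by collecting the normalization constants.
What’s more, as required by Prob.7.11, the evaluation of the integral can be easily performed using Eq (2.113)-Eq (2.117).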
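The result of completing the square can be compared numerically with the Gaussian form $\ln\mathcal{N}(\mathbf{t}|\mathbf{0}, \mathbf{C})$, where $\mathbf{C} = \beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T$. The data and hyperparameters in this sketch are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(7)
N, M = 9, 4

# Hypothetical design matrix, targets and hyperparameters.
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = np.array([0.3, 1.0, 2.5, 0.8])
beta = 3.0

A = np.diag(alpha)
Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)
m = beta * Sigma @ Phi.T @ t
C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T

# E(t) from completing the square, and the quadratic form 0.5 * t^T C^{-1} t.
E_t = 0.5 * (beta * t @ t - m @ np.linalg.solve(Sigma, m))
quad = 0.5 * t @ np.linalg.solve(C, t)

# Log marginal likelihood in the two equivalent forms.
log_marg_square = (0.5 * N * np.log(beta) - 0.5 * N * np.log(2 * np.pi)
                   + 0.5 * np.linalg.slogdet(Sigma)[1]
                   + 0.5 * np.sum(np.log(alpha)) - E_t)
log_marg_gauss = -0.5 * (N * np.log(2 * np.pi)
                         + np.linalg.slogdet(C)[1]) - quad
print(log_marg_square, log_marg_gauss)
```

Agreement of the two values also confirms the determinant relation $|\mathbf{C}|^{-1} = \beta^{N}|\mathbf{A}||\boldsymbol{\Sigma}|$ implied by the normalization constants.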
Problem 7.12 Solution
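According to the previous problem, we can explicitly write down the log marginal likelihood in an alternative form:
\[
\ln p(\mathbf{t}|\mathbf{X}, \boldsymbol{\alpha}, \beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi + \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\sum_{i=1}^{M}\ln\alpha_i - E(\mathbf{t})
\]
We first derive:
\[
\begin{aligned}
\frac{dE(\mathbf{t})}{d\alpha_i}
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m}) \\
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= -\frac{1}{2}\frac{d}{d\alpha_i}(\beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= -\frac{1}{2}\text{Tr}\left[\frac{d}{d\boldsymbol{\Sigma}^{-1}}(\beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \cdot \frac{d\boldsymbol{\Sigma}^{-1}}{d\alpha_i}\right] \\
&= \frac{1}{2}\beta^2\,\text{Tr}\left[\boldsymbol{\Sigma}(\boldsymbol{\Phi}^T\mathbf{t})(\boldsymbol{\Phi}^T\mathbf{t})^T\boldsymbol{\Sigma} \cdot \mathbf{I}_i\right] = \frac{1}{2}m_i^2
\end{aligned}
\]
In the last step, we have utilized the following equation:
\[
\frac{d}{d\mathbf{X}}\text{Tr}(\mathbf{A}\mathbf{X}^{-1}\mathbf{B}) = -\mathbf{X}^{-T}\mathbf{A}^T\mathbf{B}^T\mathbf{X}^{-T}
\]
Moreover, here $\mathbf{I}_i$ is a matrix with all elements equal to zero, except the $i$-th diagonal element, which equals 1. Then we utilize matrix identity Eq (C.22) to derive:
\[
\begin{aligned}
\frac{d\ln|\boldsymbol{\Sigma}|}{d\alpha_i}
&= -\frac{d\ln|\boldsymbol{\Sigma}^{-1}|}{d\alpha_i} \\
&= -\text{Tr}\left[\boldsymbol{\Sigma}\frac{d}{d\alpha_i}(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\right] \\
&= -\Sigma_{ii}
\end{aligned}
\]
Therefore, we can obtain:
\[
\frac{d\ln p}{d\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}m_i^2 - \frac{1}{2}\Sigma_{ii}
\]
Set it to zero and obtain:
\[
\alpha_i = \frac{1 - \alpha_i\Sigma_{ii}}{m_i^2} = \frac{\gamma_i}{m_i^2}
\]
Then we calculate the derivatives of $\ln p$ with respect to $\beta$, beginning by:
\[
\begin{aligned}
\frac{d\ln|\boldsymbol{\Sigma}|}{d\beta}
&= -\frac{d\ln|\boldsymbol{\Sigma}^{-1}|}{d\beta} \\
&= -\text{Tr}\left[\boldsymbol{\Sigma}\frac{d}{d\beta}(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})\right] \\
&= -\text{Tr}\left[\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\boldsymbol{\Phi}\right]
\end{aligned}
\]
Then we continue:
\[
\begin{aligned}
\frac{dE(\mathbf{t})}{d\beta}
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\mathbf{m}^T\boldsymbol{\Sigma}^{-1}\mathbf{m}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Sigma}^{-1}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \frac{1}{2}\frac{d}{d\beta}(\beta^2\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= \frac{1}{2}\mathbf{t}^T\mathbf{t} - \beta\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t} - \frac{1}{2}\beta^2\frac{d}{d\beta}(\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\beta\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t} - \beta^2\frac{d}{d\beta}(\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t})\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\boldsymbol{\Phi}\mathbf{m}) - \beta^2\frac{d}{d\beta}(\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t})\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\boldsymbol{\Phi}\mathbf{m}) - \beta^2\,\text{Tr}\left[\frac{d}{d\boldsymbol{\Sigma}^{-1}}(\mathbf{t}^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}) \cdot \frac{d\boldsymbol{\Sigma}^{-1}}{d\beta}\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\boldsymbol{\Phi}\mathbf{m}) + \beta^2\,\text{Tr}\left[\boldsymbol{\Sigma}(\boldsymbol{\Phi}^T\mathbf{t})(\boldsymbol{\Phi}^T\mathbf{t})^T\boldsymbol{\Sigma} \cdot \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\boldsymbol{\Phi}\mathbf{m}) + \text{Tr}\left[\mathbf{m}\mathbf{m}^T \cdot \boldsymbol{\Phi}^T\boldsymbol{\Phi}\right]\right\} \\
&= \frac{1}{2}\left\{\mathbf{t}^T\mathbf{t} - 2\mathbf{t}^T(\boldsymbol{\Phi}\mathbf{m}) + \text{Tr}\left[\boldsymbol{\Phi}\mathbf{m}\mathbf{m}^T\boldsymbol{\Phi}^T\right]\right\} \\
&= \frac{1}{2}||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}||^2
\end{aligned}
\]
Therefore, we have obtained:
\[
\frac{d\ln p}{d\beta} = \frac{1}{2}\left(\frac{N}{\beta} - ||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}||^2 - \text{Tr}[\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\boldsymbol{\Phi}]\right)
\]
Using Eq (7.83), we can obtain:
\[
\begin{aligned}
\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\boldsymbol{\Phi}
&= \boldsymbol{\Sigma}\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} \\
&= \boldsymbol{\Sigma}(\beta\boldsymbol{\Phi}^T\boldsymbol{\Phi} + \mathbf{A})\beta^{-1} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} \\
&= \mathbf{I}\beta^{-1} - \beta^{-1}\boldsymbol{\Sigma}\mathbf{A} \\
&= (\mathbf{I} - \boldsymbol{\Sigma}\mathbf{A})\beta^{-1}
\end{aligned}
\]
Setting the derivative equal to zero, we can obtain:
\[
\beta^{-1} = \frac{||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}||^2}{N - \text{Tr}(\mathbf{I} - \boldsymbol{\Sigma}\mathbf{A})} = \frac{||\mathbf{t} - \boldsymbol{\Phi}\mathbf{m}||^2}{N - \sum_i \gamma_i}
\]
Just as required.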
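The two gradient expressions can be validated against central finite differences of the log marginal likelihood. This is a sketch under stated assumptions: the helper names `log_marginal` and `analytic_gradients` are made up for this example, and the data are random.

```python
import numpy as np

def log_marginal(alpha, beta, Phi, t):
    """ln N(t | 0, C) with C = I / beta + Phi A^{-1} Phi^T."""
    N = len(t)
    C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

def analytic_gradients(alpha, beta, Phi, t):
    """Derivatives with respect to alpha_i and beta as derived above."""
    N = len(t)
    Sigma = np.linalg.inv(np.diag(alpha) + beta * Phi.T @ Phi)
    m = beta * Sigma @ Phi.T @ t
    d_alpha = 0.5 / alpha - 0.5 * m**2 - 0.5 * np.diag(Sigma)
    resid = t - Phi @ m
    d_beta = 0.5 * (N / beta - resid @ resid - np.trace(Sigma @ Phi.T @ Phi))
    return d_alpha, d_beta

rng = np.random.default_rng(3)
N, M = 10, 3
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = np.array([0.5, 1.0, 2.0])
beta = 1.5
h = 1e-6

d_alpha, d_beta = analytic_gradients(alpha, beta, Phi, t)
num_alpha = np.array([
    (log_marginal(alpha + h * e, beta, Phi, t)
     - log_marginal(alpha - h * e, beta, Phi, t)) / (2 * h)
    for e in np.eye(M)
])
num_beta = (log_marginal(alpha, beta + h, Phi, t)
            - log_marginal(alpha, beta - h, Phi, t)) / (2 * h)

alpha_gap = np.max(np.abs(d_alpha - num_alpha))
beta_gap = abs(d_beta - num_beta)
print(alpha_gap, beta_gap)
```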
Problem 7.13 Solution
This problem is quite confusing. In my view, the posterior should be denoted as $p(\mathbf{w}|\mathbf{t}, \mathbf{X}, \{a_i, b_i\}, a_\beta, b_\beta)$, where $a_\beta, b_\beta$ control the Gamma distribution of $\beta$, and $a_i, b_i$ control the Gamma distribution of $\alpha_i$. What we should do is maximize the marginal likelihood $p(\mathbf{t}|\mathbf{X}, \{a_i, b_i\}, a_\beta, b_\beta)$ with respect to $\{a_i, b_i\}, a_\beta, b_\beta$. Now we do not have a point estimate for the hyperparameters $\beta$ and $\alpha_i$. Instead, we have a distribution over them, controlled by the hyperpriors, i.e., $\{a_i, b_i\}, a_\beta, b_\beta$.
Problem 7.14 Solution
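We begin by writing down $p(t|\mathbf{x}, \mathbf{w}, \beta^*)$. Using Eq (7.76) and Eq (7.77), we can obtain:
\[
p(t|\mathbf{x}, \mathbf{w}, \beta^*) = \mathcal{N}(t|\mathbf{w}^T\phi(\mathbf{x}), (\beta^*)^{-1})
\]
Then we write down $p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*)$. Using Eq (7.81), (7.82) and (7.83), we can obtain:
\[
p(\mathbf{w}|\mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = \mathcal{N}(\mathbf{w}|\mathbf{m}, \boldsymbol{\Sigma})
\]
Where $\mathbf{m}$ and $\boldsymbol{\Sigma}$ are evaluated using Eq (7.82) and (7.83) given $\boldsymbol{\alpha} = \boldsymbol{\alpha}^*$ and $\beta = \beta^*$. Then we utilize Eq (7.90) and obtain:
\[
\begin{aligned}
p(t|\mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*)
&= \int \mathcal{N}(t|\mathbf{w}^T\phi(\mathbf{x}), (\beta^*)^{-1})\, \mathcal{N}(\mathbf{w}|\mathbf{m}, \boldsymbol{\Sigma})\, d\mathbf{w} \\
&= \int \mathcal{N}(t|\phi(\mathbf{x})^T\mathbf{w}, (\beta^*)^{-1})\, \mathcal{N}(\mathbf{w}|\mathbf{m}, \boldsymbol{\Sigma})\, d\mathbf{w}
\end{aligned}
\]
Using Eq (2.113)-(2.117), we can obtain:
\[
p(t|\mathbf{x}, \mathbf{X}, \mathbf{t}, \boldsymbol{\alpha}^*, \beta^*) = \mathcal{N}(t|\mu, \sigma^2)
\]
Where we have defined:
\[
\mu = \mathbf{m}^T\phi(\mathbf{x})
\]
And
\[
\sigma^2 = (\beta^*)^{-1} + \phi(\mathbf{x})^T\boldsymbol{\Sigma}\phi(\mathbf{x})
\]
Just as required.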
Problem 7.15 Solution
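We just follow the hint.
\[
\begin{aligned}
L(\boldsymbol{\alpha})
&= -\frac{1}{2}\left\{N\ln 2\pi + \ln|\mathbf{C}| + \mathbf{t}^T\mathbf{C}^{-1}\mathbf{t}\right\} \\
&= -\frac{1}{2}\left\{N\ln 2\pi + \ln|\mathbf{C}_{-i}| + \ln|1 + \alpha_i^{-1}\boldsymbol{\varphi}_i^T\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i| + \mathbf{t}^T\left(\mathbf{C}_{-i}^{-1} - \frac{\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^T\mathbf{C}_{-i}^{-1}}{\alpha_i + \boldsymbol{\varphi}_i^T\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i}\right)\mathbf{t}\right\} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln|1 + \alpha_i^{-1}\boldsymbol{\varphi}_i^T\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i| + \frac{1}{2}\mathbf{t}^T\frac{\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i\boldsymbol{\varphi}_i^T\mathbf{C}_{-i}^{-1}}{\alpha_i + \boldsymbol{\varphi}_i^T\mathbf{C}_{-i}^{-1}\boldsymbol{\varphi}_i}\mathbf{t} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln|1 + \alpha_i^{-1}s_i| + \frac{1}{2}\frac{q_i^2}{\alpha_i + s_i} \\
&= L(\boldsymbol{\alpha}_{-i}) - \frac{1}{2}\ln\frac{\alpha_i + s_i}{\alpha_i} + \frac{1}{2}\frac{q_i^2}{\alpha_i + s_i} \\
&= L(\boldsymbol{\alpha}_{-i}) + \frac{1}{2}\left[\ln\alpha_i - \ln(\alpha_i + s_i) + \frac{q_i^2}{\alpha_i + s_i}\right] = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i)
\end{aligned}
\]
Where we have defined $\lambda(\alpha_i)$, $s_i$ and $q_i$ as shown in Eq (7.97)-(7.99).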
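The decomposition $L(\boldsymbol{\alpha}) = L(\boldsymbol{\alpha}_{-i}) + \lambda(\alpha_i)$ can be verified numerically by removing the contribution of one basis function from $\mathbf{C}$. The sketch assumes hypothetical random data; `log_marg` is a helper name introduced only for this example.

```python
import numpy as np

def log_marg(C, t):
    """-0.5 * (N ln 2 pi + ln|C| + t^T C^{-1} t)."""
    N = len(t)
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (N * np.log(2 * np.pi) + logdet + t @ np.linalg.solve(C, t))

rng = np.random.default_rng(4)
N, M, i = 7, 3, 1
Phi = rng.normal(size=(N, M))
t = rng.normal(size=N)
alpha = np.array([0.7, 1.5, 3.0])
beta = 2.0

C = np.eye(N) / beta + Phi @ np.diag(1.0 / alpha) @ Phi.T
phi_i = Phi[:, i]                                   # i-th column of Phi
C_minus = C - np.outer(phi_i, phi_i) / alpha[i]     # C with basis i removed

s_i = phi_i @ np.linalg.solve(C_minus, phi_i)       # sparsity, Eq (7.98)
q_i = phi_i @ np.linalg.solve(C_minus, t)           # quality, Eq (7.99)
lam = 0.5 * (np.log(alpha[i]) - np.log(alpha[i] + s_i)
             + q_i**2 / (alpha[i] + s_i))           # Eq (7.97)

gap = abs(log_marg(C, t) - (log_marg(C_minus, t) + lam))
print(gap)
```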
Problem 7.16 Solution
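We first calculate the first derivative of Eq (7.97) with respect to $\alpha_i$:
\[
\frac{\partial\lambda}{\partial\alpha_i} = \frac{1}{2}\left[\frac{1}{\alpha_i} - \frac{1}{\alpha_i + s_i} - \frac{q_i^2}{(\alpha_i + s_i)^2}\right]
\]
Then we calculate the second derivative:
\[
\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\left[-\frac{1}{\alpha_i^2} + \frac{1}{(\alpha_i + s_i)^2} + \frac{2q_i^2}{(\alpha_i + s_i)^3}\right]
\]
Next we aim to prove that when $\alpha_i$ is given by Eq (7.101), i.e., when the first derivative equals 0, the second derivative (i.e., the expression above) is negative. First we can obtain:
\[
\alpha_i + s_i = \frac{s_i^2}{q_i^2 - s_i} + s_i = \frac{s_i q_i^2}{q_i^2 - s_i}
\]
Therefore, substituting $\alpha_i + s_i$ and $\alpha_i$ into the second derivative, we can obtain:
\[
\begin{aligned}
\frac{\partial^2\lambda}{\partial\alpha_i^2}
&= \frac{1}{2}\left[-\frac{(q_i^2 - s_i)^2}{s_i^4} + \frac{(q_i^2 - s_i)^2}{s_i^2 q_i^4} + \frac{2q_i^2(q_i^2 - s_i)^3}{s_i^3 q_i^6}\right] \\
&= \frac{1}{2}\left[-\frac{q_i^4(q_i^2 - s_i)^2}{q_i^4 s_i^4} + \frac{s_i^2(q_i^2 - s_i)^2}{s_i^4 q_i^4} + \frac{2s_i(q_i^2 - s_i)^3}{s_i^4 q_i^4}\right] \\
&= \frac{1}{2}\frac{(q_i^2 - s_i)^2}{q_i^4 s_i^4}\left[-q_i^4 + s_i^2 + 2s_i(q_i^2 - s_i)\right] \\
&= \frac{1}{2}\frac{(q_i^2 - s_i)^2}{q_i^4 s_i^4}\left[-(q_i^2 - s_i)^2\right] \\
&= -\frac{1}{2}\frac{(q_i^2 - s_i)^4}{q_i^4 s_i^4} < 0
\end{aligned}
\]
Just as required.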
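A finite-difference check of the stationary point and of the closed-form second derivative is straightforward. The values of $s_i$ and $q_i$ below are hypothetical, chosen so that $q_i^2 > s_i$ and a finite maximizer exists.

```python
import math

def lam(alpha, s, q):
    """lambda(alpha_i) of Eq (7.97)."""
    return 0.5 * (math.log(alpha) - math.log(alpha + s) + q**2 / (alpha + s))

s, q = 1.5, 2.0                      # hypothetical values with q^2 > s
alpha_star = s**2 / (q**2 - s)       # stationary point, Eq (7.101)
h = 1e-4

# Central finite differences for the first and second derivatives.
first = (lam(alpha_star + h, s, q) - lam(alpha_star - h, s, q)) / (2 * h)
second = (lam(alpha_star + h, s, q) - 2 * lam(alpha_star, s, q)
          + lam(alpha_star - h, s, q)) / h**2

# Closed form derived above.
closed = -0.5 * (q**2 - s)**4 / (q**4 * s**4)
print(first, second, closed)
```

The first derivative vanishes at `alpha_star` and the numerical curvature matches the negative closed-form value, so the stationary point is a maximum.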
Problem 7.17 Solution
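We just follow the hint. According to Eq (7.102), Eq (7.86) and matrix identity (C.7), we have:
\[
\begin{aligned}
Q_i &= \boldsymbol{\varphi}_i^T\mathbf{C}^{-1}\mathbf{t} \\
&= \boldsymbol{\varphi}_i^T(\beta^{-1}\mathbf{I} + \boldsymbol{\Phi}\mathbf{A}^{-1}\boldsymbol{\Phi}^T)^{-1}\mathbf{t} \\
&= \boldsymbol{\varphi}_i^T\left(\beta\mathbf{I} - \beta\mathbf{I}\boldsymbol{\Phi}(\mathbf{A} + \boldsymbol{\Phi}^T\beta\mathbf{I}\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\beta\mathbf{I}\right)\mathbf{t} \\
&= \boldsymbol{\varphi}_i^T\left(\beta\mathbf{I} - \beta^2\boldsymbol{\Phi}(\mathbf{A} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}\boldsymbol{\Phi}^T\right)\mathbf{t} \\
&= \boldsymbol{\varphi}_i^T(\beta\mathbf{I} - \beta^2\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T)\mathbf{t} \\
&= \beta\boldsymbol{\varphi}_i^T\mathbf{t} - \beta^2\boldsymbol{\varphi}_i^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\mathbf{t}
\end{aligned}
\]
Similarly, we can obtain:
\[
\begin{aligned}
S_i &= \boldsymbol{\varphi}_i^T\mathbf{C}^{-1}\boldsymbol{\varphi}_i \\
&= \boldsymbol{\varphi}_i^T(\beta\mathbf{I} - \beta^2\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T)\boldsymbol{\varphi}_i \\
&= \beta\boldsymbol{\varphi}_i^T\boldsymbol{\varphi}_i - \beta^2\boldsymbol{\varphi}_i^T\boldsymbol{\Phi}\boldsymbol{\Sigma}\boldsymbol{\Phi}^T\boldsymbol{\varphi}_i
\end{aligned}
\]
Just as required.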
Problem 7.18 Solution
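We begin by differentiating the first term in Eq (7.109) with respect to $\mathbf{w}$. This can be easily evaluated based on Eq (4.90)-(4.91).
\[
\frac{\partial}{\partial\mathbf{w}}\left\{\sum_{n=1}^{N} t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right\} = \sum_{n=1}^{N}(t_n - y_n)\boldsymbol{\phi}_n = \boldsymbol{\Phi}^T(\mathbf{t} - \mathbf{y})
\]
The derivative of the second term in Eq (7.109), $-\frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w}$, with respect to $\mathbf{w}$ is simply $-\mathbf{A}\mathbf{w}$. Therefore, the first derivative of Eq (7.109) with respect to $\mathbf{w}$ is:
\[
\frac{\partial\ln p}{\partial\mathbf{w}} = \boldsymbol{\Phi}^T(\mathbf{t} - \mathbf{y}) - \mathbf{A}\mathbf{w}
\]
For the Hessian matrix, we can first obtain:
\[
\begin{aligned}
\frac{\partial}{\partial\mathbf{w}}\left\{\boldsymbol{\Phi}^T(\mathbf{t} - \mathbf{y})\right\}
&= \sum_{n=1}^{N}\frac{\partial}{\partial\mathbf{w}}\left\{(t_n - y_n)\boldsymbol{\phi}_n\right\} \\
&= -\sum_{n=1}^{N}\frac{\partial}{\partial\mathbf{w}}\left\{y_n \cdot \boldsymbol{\phi}_n\right\} \\
&= -\sum_{n=1}^{N}\frac{\partial\sigma(\mathbf{w}^T\boldsymbol{\phi}_n)}{\partial\mathbf{w}} \cdot \boldsymbol{\phi}_n^T \\
&= -\sum_{n=1}^{N}\frac{\partial\sigma(a)}{\partial a} \cdot \frac{\partial a}{\partial\mathbf{w}} \cdot \boldsymbol{\phi}_n^T
\end{aligned}
\]
Where we have defined $a = \mathbf{w}^T\boldsymbol{\phi}_n$. Then we can utilize Eq (4.88) to derive:
\[
\frac{\partial}{\partial\mathbf{w}}\left\{\boldsymbol{\Phi}^T(\mathbf{t} - \mathbf{y})\right\} = -\sum_{n=1}^{N}\sigma(1 - \sigma) \cdot \boldsymbol{\phi}_n \cdot \boldsymbol{\phi}_n^T = -\boldsymbol{\Phi}^T\mathbf{B}\boldsymbol{\Phi}
\]
Where $\mathbf{B}$ is a diagonal $N \times N$ matrix with elements $b_n = y_n(1 - y_n)$. Therefore, we can obtain the Hessian matrix:
\[
\mathbf{H} = \frac{\partial}{\partial\mathbf{w}}\left\{\frac{\partial\ln p}{\partial\mathbf{w}}\right\} = -(\boldsymbol{\Phi}^T\mathbf{B}\boldsymbol{\Phi} + \mathbf{A})
\]
Just as required.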
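The gradient and the Hessian can be checked against central finite differences of the log posterior of Eq (7.109) (up to its additive constant). The function names and the random data in this sketch are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_posterior(w, Phi, t, A):
    """Eq (7.109) without the additive constant."""
    y = sigmoid(Phi @ w)
    return np.sum(t * np.log(y) + (1 - t) * np.log(1 - y)) - 0.5 * w @ A @ w

def gradient(w, Phi, t, A):
    return Phi.T @ (t - sigmoid(Phi @ w)) - A @ w

def hessian(w, Phi, A):
    y = sigmoid(Phi @ w)
    B = np.diag(y * (1 - y))
    return -(Phi.T @ B @ Phi + A)

rng = np.random.default_rng(5)
N, M = 12, 3
Phi = rng.normal(size=(N, M))
t = rng.integers(0, 2, size=N).astype(float)
A = np.diag([0.5, 1.0, 2.0])
w = rng.normal(size=M)
h = 1e-5

num_grad = np.array([
    (log_posterior(w + h * e, Phi, t, A)
     - log_posterior(w - h * e, Phi, t, A)) / (2 * h)
    for e in np.eye(M)
])
num_hess = np.array([
    (gradient(w + h * e, Phi, t, A) - gradient(w - h * e, Phi, t, A)) / (2 * h)
    for e in np.eye(M)
])

grad_gap = np.max(np.abs(num_grad - gradient(w, Phi, t, A)))
hess_gap = np.max(np.abs(num_hess - hessian(w, Phi, A)))
print(grad_gap, hess_gap)
```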
Problem 7.19 Solution
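We begin from Eq (7.114).
\[
\begin{aligned}
p(\mathbf{t}|\boldsymbol{\alpha}) &= p(\mathbf{t}|\mathbf{w}^*)\, p(\mathbf{w}^*|\boldsymbol{\alpha})\, (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2} \\
&= \left[\prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})\right]\left[\prod_{i=1}^{M}\mathcal{N}(w_i|0, \alpha_i^{-1})\right](2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2}\,\Bigg|_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\prod_{n=1}^{N} p(t_n|\mathbf{x}_n, \mathbf{w})\right] \cdot \mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1}) \cdot (2\pi)^{M/2}|\boldsymbol{\Sigma}|^{1/2}\,\Bigg|_{\mathbf{w} = \mathbf{w}^*}
\end{aligned}
\]
We further take the logarithm of both sides.
\[
\begin{aligned}
\ln p(\mathbf{t}|\boldsymbol{\alpha})
&= \left[\sum_{n=1}^{N}\ln p(t_n|\mathbf{x}_n, \mathbf{w}) + \ln\mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1}) + \frac{M}{2}\ln 2\pi + \frac{1}{2}\ln|\boldsymbol{\Sigma}|\right]_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\sum_{n=1}^{N}\left[t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right] - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} + \frac{1}{2}\ln|\mathbf{A}| + \frac{1}{2}\ln|\boldsymbol{\Sigma}| + \text{const}\right]_{\mathbf{w} = \mathbf{w}^*} \\
&= \left[\sum_{n=1}^{N}\left[t_n\ln y_n + (1 - t_n)\ln(1 - y_n)\right] - \frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w}\right]_{\mathbf{w} = \mathbf{w}^*} + \left[\frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\ln|\mathbf{A}| + \text{const}\right]_{\mathbf{w} = \mathbf{w}^*}
\end{aligned}
\]
Here the sign of $\frac{1}{2}\ln|\mathbf{A}|$ follows from $\mathbf{A} = \text{diag}(\alpha_i)$ being the precision of the prior. Using the chain rule, the dependence on $\alpha_i$ through $\mathbf{w}$ contributes:
\[
\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\mathbf{w}}\frac{\partial\mathbf{w}}{\partial\alpha_i}\,\Bigg|_{\mathbf{w} = \mathbf{w}^*}
\]
Observing Eq (7.109), (7.110) and that (7.110) will equal 0 at $\mathbf{w}^*$, we can conclude that the first bracket on the right hand side of $\ln p(\mathbf{t}|\boldsymbol{\alpha})$ will have zero derivative with respect to $\mathbf{w}$ at $\mathbf{w}^*$. Therefore, we only need to focus on the second bracket:
\[
\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i}\,\Bigg|_{\mathbf{w} = \mathbf{w}^*} = \frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\boldsymbol{\Sigma}| + \frac{1}{2}\ln|\mathbf{A}|\right]_{\mathbf{w} = \mathbf{w}^*}
\]
It is rather easy to obtain:
\[
\frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\mathbf{A}|\right] = \frac{1}{2}\frac{\partial}{\partial\alpha_i}\left[\sum_i \ln\alpha_i\right] = \frac{1}{2\alpha_i}
\]
Then, following the same procedure as in Prob.7.12, we can obtain:
\[
\frac{\partial}{\partial\alpha_i}\left[\frac{1}{2}\ln|\boldsymbol{\Sigma}|\right] = -\frac{1}{2}\Sigma_{ii}
\]
Therefore, we obtain:
\[
\frac{\partial\ln p(\mathbf{t}|\boldsymbol{\alpha})}{\partial\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}\Sigma_{ii}
\]
Note: here I draw a different conclusion from the main text. I have also verified my result in another way. You can write the prior as the product of $\mathcal{N}(w_i|0, \alpha_i^{-1})$ instead of $\mathcal{N}(\mathbf{w}|\mathbf{0}, \mathbf{A}^{-1})$. In this form, since we know that:
\[
\frac{\partial}{\partial\alpha_i}\sum_{i=1}^{M}\ln\mathcal{N}(w_i|0, \alpha_i^{-1}) = \frac{\partial}{\partial\alpha_i}\left(\frac{1}{2}\ln\alpha_i - \frac{\alpha_i}{2}w_i^2\right) = \frac{1}{2\alpha_i} - \frac{1}{2}(w_i^*)^2
\]
The above expression can be used to replace the derivative of $-\frac{1}{2}\mathbf{w}^T\mathbf{A}\mathbf{w} + \frac{1}{2}\ln|\mathbf{A}|$. Since the derivative of the likelihood with respect to $\alpha_i$ is not zero at $\mathbf{w}^*$, (7.115) seems not right anyway.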

0.8 Graphical Models
Problem 8.1 Solution
We are required to prove:
\[
\int_{\mathbf{x}} p(\mathbf{x})\, d\mathbf{x} = \int_{\mathbf{x}} \prod_{k=1}^{K} p(x_k|\text{pa}_k)\, d\mathbf{x} = 1
\]
Here we adopt the same assumption as in the main text: no arrows lead from a higher numbered node to a lower numbered node. According to Eq (8.5), we can write:
\[
\begin{aligned}
\int_{\mathbf{x}} p(\mathbf{x})\, d\mathbf{x}
&= \int_{\mathbf{x}} \prod_{k=1}^{K} p(x_k|\text{pa}_k)\, d\mathbf{x} \\
&= \int_{\mathbf{x}} p(x_K|\text{pa}_K)\prod_{k=1}^{K-1} p(x_k|\text{pa}_k)\, d\mathbf{x} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\int_{x_K}\left[p(x_K|\text{pa}_K)\prod_{k=1}^{K-1} p(x_k|\text{pa}_k)\right] dx_K\, dx_1 dx_2 ... dx_{K-1} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\left[\prod_{k=1}^{K-1} p(x_k|\text{pa}_k)\int_{x_K} p(x_K|\text{pa}_K)\, dx_K\right] dx_1 dx_2 ... dx_{K-1} \\
&= \int_{[x_1, x_2, ..., x_{K-1}]}\prod_{k=1}^{K-1} p(x_k|\text{pa}_k)\, dx_1 dx_2 ... dx_{K-1}
\end{aligned}
\]
Note that from the third line to the fourth line, we have used the fact that the conditional distributions of $x_1, x_2, ..., x_{K-1}$ do not depend on $x_K$, and thus the product from $k = 1$ to $K - 1$ can be moved outside the integral with respect to $x_K$. From the fourth line to the fifth line, we have used the fact that the conditional probability $p(x_K|\text{pa}_K)$ is correctly normalized. This procedure is repeated $K$ times until all the variables have been integrated out.
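For discrete variables the integrals become sums, and the normalization can be confirmed by brute force on a small example. The graph structure and the conditional probability tables below are hypothetical; the only assumption carried over from the text is that every parent has a lower index than its child.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)

def random_cpt(n_parents):
    """Random table of p(x = 1 | parents) for binary variables."""
    return np.asarray(rng.uniform(0.1, 0.9, size=(2,) * n_parents))

# Hypothetical DAG over four binary nodes: node -> list of parents.
parents = {0: [], 1: [0], 2: [0, 1], 3: [1, 2]}
cpts = {k: random_cpt(len(pa)) for k, pa in parents.items()}

def joint(x):
    """p(x) = prod_k p(x_k | pa_k), the factorization of Eq (8.5)."""
    p = 1.0
    for k, pa in parents.items():
        p_one = cpts[k][tuple(x[j] for j in pa)]
        p *= p_one if x[k] == 1 else 1.0 - p_one
    return p

total = sum(joint(x) for x in itertools.product([0, 1], repeat=len(parents)))
print(total)
```

Each conditional table is normalized on its own, so the joint distribution sums to one regardless of the random entries.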
Problem 8.2 Solution
This statement can be shown by contradiction. Suppose that there exists an ordered numbering of the nodes such that for each node there are no links going to a lower-numbered node, and that there is a directed cycle in the graph:
\[
a_1 \to a_2 \to ... \to a_N
\]
To make it a real cycle, we also require $a_N \to a_1$. According to the assumption, every link points to a higher-numbered node, so we have $a_1 < a_2 < ... < a_N$. Therefore, the last link $a_N \to a_1$ is invalid since $a_N > a_1$.
Problem 8.3 Solution
Based on definition, we can obtain:
\[
p(a, b) = p(a, b, c = 0) + p(a, b, c = 1) =
\begin{cases}
0.336, & \text{if } a = 0, b = 0 \\
0.264, & \text{if } a = 0, b = 1 \\
0.256, & \text{if } a = 1, b = 0 \\
0.144, & \text{if } a = 1, b = 1
\end{cases}
\]
Similarly, we can obtain:
\[
p(a) = p(a, b = 0) + p(a, b = 1) =
\begin{cases}
0.6, & \text{if } a = 0 \\
0.4, & \text{if } a = 1
\end{cases}
\]
And
\[
p(b) = p(a = 0, b) + p(a = 1, b) =
\begin{cases}
0.592, & \text{if } b = 0 \\
0.408, & \text{if } b = 1
\end{cases}
\]
Therefore, we conclude that $p(a, b) \ne p(a)p(b)$. For instance, we have $p(a = 1, b = 1) = 0.144$, $p(a = 1) = 0.4$ and $p(b = 1) = 0.408$. It is obvious that:
\[
0.144 = p(a = 1, b = 1) \ne p(a = 1)\,p(b = 1) = 0.4 \times 0.408 = 0.1632
\]
To prove the conditional independence, we first calculate $p(c)$:
\[
p(c) = \sum_{a, b = 0, 1} p(a, b, c) =
\begin{cases}
0.480, & \text{if } c = 0 \\
0.520, & \text{if } c = 1
\end{cases}
\]
According to Bayes’ Theorem, we have:
\[
p(a, b|c) = \frac{p(a, b, c)}{p(c)} =
\begin{cases}
0.400, & \text{if } a = 0, b = 0, c = 0 \\
0.277, & \text{if } a = 0, b = 0, c = 1 \\
0.100, & \text{if } a = 0, b = 1, c = 0 \\
0.415, & \text{if } a = 0, b = 1, c = 1 \\
0.400, & \text{if } a = 1, b = 0, c = 0 \\
0.123, & \text{if } a = 1, b = 0, c = 1 \\
0.100, & \text{if } a = 1, b = 1, c = 0 \\
0.185, & \text{if } a = 1, b = 1, c = 1
\end{cases}
\]
Similarly, we also have:
\[
p(a|c) = \frac{p(a, c)}{p(c)} =
\begin{cases}
0.240/0.480 = 0.500, & \text{if } a = 0, c = 0 \\
0.360/0.520 = 0.692, & \text{if } a = 0, c = 1 \\
0.240/0.480 = 0.500, & \text{if } a = 1, c = 0 \\
0.160/0.520 = 0.308, & \text{if } a = 1, c = 1
\end{cases}
\]
Where we have used $p(a, c) = p(a, b = 0, c) + p(a, b = 1, c)$. Similarly, we can obtain:
\[
p(b|c) = \frac{p(b, c)}{p(c)} =
\begin{cases}
0.384/0.480 = 0.800, & \text{if } b = 0, c = 0 \\
0.208/0.520 = 0.400, & \text{if } b = 0, c = 1 \\
0.096/0.480 = 0.200, & \text{if } b = 1, c = 0 \\
0.312/0.520 = 0.600, & \text{if } b = 1, c = 1
\end{cases}
\]
Now we can easily verify the statement $p(a, b|c) = p(a|c)\,p(b|c)$. For instance, we have:
\[
0.1 = p(a = 1, b = 1|c = 0) = p(a = 1|c = 0)\,p(b = 1|c = 0) = 0.5 \times 0.2 = 0.1
\]
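All eight cases can be checked at once with array operations. The joint table below is the one from Table 8.2 of PRML, written so that it reproduces the marginals quoted above; the array layout `p[a, b, c]` and the variable names are choices made for this sketch.

```python
import numpy as np

# Joint distribution p[a, b, c] over three binary variables (Table 8.2).
p = np.array([[[0.192, 0.144], [0.048, 0.216]],
              [[0.192, 0.064], [0.048, 0.096]]])

p_ab = p.sum(axis=2)
p_a = p.sum(axis=(1, 2))
p_b = p.sum(axis=(0, 2))
p_c = p.sum(axis=(0, 1))

# Marginal dependence: p(a, b) differs from p(a) p(b).
marginal_gap = np.max(np.abs(p_ab - np.outer(p_a, p_b)))

# Conditional independence: p(a, b | c) equals p(a | c) p(b | c).
p_ab_given_c = p / p_c                      # broadcast over the last axis c
p_a_given_c = p.sum(axis=1) / p_c           # indexed [a, c]
p_b_given_c = p.sum(axis=0) / p_c           # indexed [b, c]
product = p_a_given_c[:, None, :] * p_b_given_c[None, :, :]
conditional_gap = np.max(np.abs(p_ab_given_c - product))

print(marginal_gap, conditional_gap)
```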
Problem 8.4 Solution

This problem follows the previous one. We have already calculated $p(a)$ and $p(b|c)$; we rewrite them here.
\[
p(a) = p(a, b = 0) + p(a, b = 1) =
\begin{cases}
0.6, & \text{if } a = 0 \\
0.4, & \text{if } a = 1
\end{cases}
\]
And
\[
p(b|c) = \frac{p(b, c)}{p(c)} =
\begin{cases}
0.384/0.480 = 0.800, & \text{if } b = 0, c = 0 \\
0.208/0.520 = 0.400, & \text{if } b = 0, c = 1 \\
0.096/0.480 = 0.200, & \text{if } b = 1, c = 0 \\
0.312/0.520 = 0.600, & \text{if } b = 1, c = 1
\end{cases}
\]
We can also obtain $p(c|a)$:
\[
p(c|a) = \frac{p(a, c)}{p(a)} =
\begin{cases}
0.24/0.6 = 0.4, & \text{if } a = 0, c = 0 \\
0.36/0.6 = 0.6, & \text{if } a = 0, c = 1 \\
0.24/0.4 = 0.6, & \text{if } a = 1, c = 0 \\
0.16/0.4 = 0.4, & \text{if } a = 1, c = 1
\end{cases}
\]
Now we can easily verify the statement that $p(a, b, c) = p(a)\,p(c|a)\,p(b|c)$ given Table 8.2. The directed graph looks like:
\[
a \to c \to b
\]
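The factorization along the chain can be verified for all eight configurations at once. As before, the table is Table 8.2 of PRML stored as `p[a, b, c]`, and the variable names are introduced only for this sketch.

```python
import numpy as np

# Joint distribution p[a, b, c] over three binary variables (Table 8.2).
p = np.array([[[0.192, 0.144], [0.048, 0.216]],
              [[0.192, 0.064], [0.048, 0.096]]])

p_a = p.sum(axis=(1, 2))
p_c = p.sum(axis=(0, 1))
p_ac = p.sum(axis=1)                      # indexed [a, c]
p_bc = p.sum(axis=0)                      # indexed [b, c]

p_c_given_a = p_ac / p_a[:, None]         # p(c | a), indexed [a, c]
p_b_given_c = p_bc / p_c[None, :]         # p(b | c), indexed [b, c]

# p(a) p(c|a) p(b|c), arranged back into the [a, b, c] layout.
factorised = (p_a[:, None, None]
              * p_c_given_a[:, None, :]
              * p_b_given_c[None, :, :])

gap = np.max(np.abs(p - factorised))
print(gap)
```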
Problem 8.5 Solution
It looks quite like Figure 8.6. The difference is that we introduce α i for
each w i , where i = 1, 2, ..., M .

Figure 1: probabilistic graphical model corresponding to the RVM described
in (7.79) and (7.80).
Problem 8.6 Solution(Wait for update)
Problem 8.7 Solution
Let’s just follow the hint. We begin by calculating the mean $\boldsymbol{\mu}$.
\[
\mathbb{E}[x_1] = b_1
\]
According to Eq (8.15), we can obtain:
\[
\mathbb{E}[x_2] = \sum_{j \in \text{pa}_2} w_{2j}\mathbb{E}[x_j] + b_2 = w_{21}b_1 + b_2
\]
Then we can obtain:
\[
\begin{aligned}
\mathbb{E}[x_3] &= w_{32}\mathbb{E}[x_2] + b_3 \\
&= w_{32}(w_{21}b_1 + b_2) + b_3 \\
&= w_{32}w_{21}b_1 + w_{32}b_2 + b_3
\end{aligned}
\]
Therefore, we obtain Eq (8.17) just as required. Next, we deal with the covariance matrix, using Eq (8.16).
\[
\text{cov}[x_1, x_1] = v_1
\]
Then we can obtain:
\[
\text{cov}[x_1, x_2] = \sum_{k \in \text{pa}_2} w_{2k}\,\text{cov}[x_1, x_k] + I_{12}v_2 = w_{21}\,\text{cov}[x_1, x_1] = w_{21}v_1
\]
And also $\text{cov}[x_2, x_1] = \text{cov}[x_1, x_2] = w_{21}v_1$. Hence, we can obtain:
\[
\text{cov}[x_2, x_2] = \sum_{k \in \text{pa}_2} w_{2k}\,\text{cov}[x_2, x_k] + I_{22}v_2 = w_{21}^2v_1 + v_2
\]
Next, we can obtain:
\[
\text{cov}[x_1, x_3] = \sum_{k \in \text{pa}_3} w_{3k}\,\text{cov}[x_1, x_k] + I_{13}v_3 = w_{32}w_{21}v_1
\]
Then, we can obtain:
\[
\text{cov}[x_2, x_3] = \sum_{k \in \text{pa}_3} w_{3k}\,\text{cov}[x_2, x_k] + I_{23}v_3 = w_{32}(v_2 + w_{21}^2v_1)
\]
Finally, we can obtain:
\[
\begin{aligned}
\text{cov}[x_3, x_3] &= \sum_{k \in \text{pa}_3} w_{3k}\,\text{cov}[x_3, x_k] + I_{33}v_3 \\
&= w_{32}\left[w_{32}(v_2 + w_{21}^2v_1)\right] + v_3
\end{aligned}
\]
Where we have used the fact that $\text{cov}[x_3, x_2] = \text{cov}[x_2, x_3]$. By now, we have obtained Eq (8.18) just as required.
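The recursive results can be cross-checked with a different, matrix-based route: writing the model as $\mathbf{x} = \mathbf{W}\mathbf{x} + \mathbf{b} + \boldsymbol{\epsilon}$ with $\text{cov}[\boldsymbol{\epsilon}] = \text{diag}(v_i)$ gives mean $(\mathbf{I} - \mathbf{W})^{-1}\mathbf{b}$ and covariance $(\mathbf{I} - \mathbf{W})^{-1}\text{diag}(v_i)(\mathbf{I} - \mathbf{W})^{-T}$. This matrix form is not the method used in the solution above, and the numerical values of the weights, biases and variances are hypothetical.

```python
import numpy as np

# Hypothetical parameters of the chain x1 -> x2 -> x3.
w21, w32 = 0.8, -1.5
b1, b2, b3 = 1.0, 0.5, -2.0
v1, v2, v3 = 0.3, 1.2, 0.7

W = np.array([[0.0, 0.0, 0.0],
              [w21, 0.0, 0.0],
              [0.0, w32, 0.0]])
b = np.array([b1, b2, b3])
inv = np.linalg.inv(np.eye(3) - W)

# Matrix form of the linear-Gaussian model.
mean_matrix = inv @ b
cov_matrix = inv @ np.diag([v1, v2, v3]) @ inv.T

# Closed forms of Eq (8.17) and Eq (8.18).
mean_817 = np.array([b1, b2 + w21 * b1, b3 + w32 * b2 + w32 * w21 * b1])
c22 = v2 + w21**2 * v1
cov_818 = np.array([
    [v1,             w21 * v1,  w32 * w21 * v1],
    [w21 * v1,       c22,       w32 * c22],
    [w32 * w21 * v1, w32 * c22, v3 + w32**2 * c22],
])

mean_gap = np.max(np.abs(mean_matrix - mean_817))
cov_gap = np.max(np.abs(cov_matrix - cov_818))
print(mean_gap, cov_gap)
```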
Problem 8.8 Solution


