Solution Manual For PRML


SOLUTION MANUAL FOR
PATTERN RECOGNITION AND MACHINE
LEARNING
EDITED BY
ZHENGQI GAO
School of Information Science and Technology
Fudan University
November 2017
0.1 Introduction
Problem 1.1 Solution
We set the derivative of the error function $E$ with respect to the vector $\mathbf{w}$ equal to zero (i.e. $\partial E / \partial \mathbf{w} = 0$); the resulting $\mathbf{w} = \{w_i\}$ minimizes $E$. Equivalently, we compute the derivative of $E$ with respect to each $w_i$ and set it to zero. Based on (1.1) and (1.2) we obtain:
\[
\frac{\partial E}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^i = 0
\]
\[
\Rightarrow \sum_{n=1}^{N} y(x_n, \mathbf{w})\, x_n^i = \sum_{n=1}^{N} x_n^i t_n
\]
\[
\Rightarrow \sum_{n=1}^{N} \Big( \sum_{j=0}^{M} w_j x_n^j \Big) x_n^i = \sum_{n=1}^{N} x_n^i t_n
\]
\[
\Rightarrow \sum_{j=0}^{M} \sum_{n=1}^{N} x_n^{(j+i)} w_j = \sum_{n=1}^{N} x_n^i t_n
\]
If we denote $A_{ij} = \sum_{n=1}^{N} x_n^{i+j}$ and $T_i = \sum_{n=1}^{N} x_n^i t_n$, the equation above is exactly (1.222), and the problem is solved.
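As a purely illustrative check (not part of the original solution), the following NumPy sketch builds $A_{ij}$ and $T_i$ exactly as derived above and solves the resulting linear system; the toy data and the polynomial degree $M$ are arbitrary choices made up for the example.

```python
import numpy as np

# Hypothetical toy data; the polynomial degree M is an arbitrary choice here.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(x.size)
M = 3

# A_ij = sum_n x_n^(i+j),  T_i = sum_n x_n^i t_n  (as derived above)
powers = np.arange(M + 1)
X = x[:, None] ** powers        # N x (M+1) design matrix, X_nj = x_n^j
A = X.T @ X                     # A_ij = sum_n x_n^(i+j)
T = X.T @ t                     # T_i  = sum_n x_n^i t_n
w = np.linalg.solve(A, T)       # minimizer of the sum-of-squares error
print(w)
```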
Problem 1.2 Solution
This problem is similar to Prob. 1.1; the only difference is the penalty term, the last term on the right-hand side of (1.4). Proceeding as in Prob. 1.1:
\[
\frac{\partial \widetilde{E}}{\partial w_i} = \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}\, x_n^i + \lambda w_i = 0
\]
\[
\Rightarrow \sum_{j=0}^{M} \sum_{n=1}^{N} x_n^{(j+i)} w_j + \lambda w_i = \sum_{n=1}^{N} x_n^i t_n
\]
\[
\Rightarrow \sum_{j=0}^{M} \Big\{ \sum_{n=1}^{N} x_n^{(j+i)} + \delta_{ji}\, \lambda \Big\} w_j = \sum_{n=1}^{N} x_n^i t_n
\]
where
\[
\delta_{ji} =
\begin{cases}
0 & j \neq i \\
1 & j = i
\end{cases}
\]
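Continuing the hypothetical toy example above (again a sketch, not from the original text), the regularized solution only adds $\lambda$ to the diagonal of $A$, mirroring the $\delta_{ji}\lambda$ term:

```python
lam = 1e-3                               # hypothetical regularization strength
A_reg = A + lam * np.eye(M + 1)          # A_ij + lambda * delta_ij
w_reg = np.linalg.solve(A_reg, T)
print(w_reg)
```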
Problem 1.3 Solution
This problem can be solved by Bayes' theorem. The probability of selecting an apple, $P(a)$, is:
\[
P(a) = P(a|r)P(r) + P(a|b)P(b) + P(a|g)P(g) = \frac{3}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.34
\]
Based on Bayes' theorem, the probability that a selected orange came from the green box is
\[
P(g|o) = \frac{P(o|g)P(g)}{P(o)}
\]
We first calculate the probability of selecting an orange, $P(o)$:
\[
P(o) = P(o|r)P(r) + P(o|b)P(b) + P(o|g)P(g) = \frac{4}{10} \times 0.2 + \frac{1}{2} \times 0.2 + \frac{3}{10} \times 0.6 = 0.36
\]
Therefore we obtain:
\[
P(g|o) = \frac{P(o|g)P(g)}{P(o)} = \frac{\frac{3}{10} \times 0.6}{0.36} = 0.5
\]
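A quick numerical sanity check of the two marginals and the posterior computed above (the per-box fruit proportions are taken directly from the numbers used in the solution):

```python
# Box priors and per-box fruit proportions, as used in the solution above
P_box    = {'r': 0.2, 'b': 0.2, 'g': 0.6}
P_apple  = {'r': 3/10, 'b': 1/2, 'g': 3/10}
P_orange = {'r': 4/10, 'b': 1/2, 'g': 3/10}

P_a = sum(P_apple[k]  * P_box[k] for k in P_box)   # 0.34
P_o = sum(P_orange[k] * P_box[k] for k in P_box)   # 0.36
P_g_given_o = P_orange['g'] * P_box['g'] / P_o     # 0.5
print(P_a, P_o, P_g_given_o)
```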
Problem 1.4 Solution
This problem requires calculus, in particular the chain rule. According to (1.27), we calculate the derivative of $p_y(y)$ with respect to $y$:
\[
\frac{d p_y(y)}{dy} = \frac{d \big( p_x(g(y))\, |g'(y)| \big)}{dy} = \frac{d p_x(g(y))}{dy}\, |g'(y)| + p_x(g(y))\, \frac{d |g'(y)|}{dy} \qquad (*)
\]
The first term in the equation above can be further expanded:
\[
\frac{d p_x(g(y))}{dy}\, |g'(y)| = \frac{d p_x(g(y))}{d g(y)}\, \frac{d g(y)}{dy}\, |g'(y)| \qquad (**)
\]
If $\hat{x}$ is the maximum of the density over $x$, we have
\[
\left. \frac{d p_x(x)}{dx} \right|_{\hat{x}} = 0
\]
Therefore, when $y = \hat{y}$ such that $\hat{x} = g(\hat{y})$, the first factor on the right-hand side of $(**)$ is zero, so the first term in $(*)$ vanishes; however, because of the second term in $(*)$, the derivative need not be zero. When the transformation is linear (e.g. $x = ay + b$), the second term in $(*)$ vanishes as well. A simple example:
\[
p_x(x) = 2x, \quad x \in [0,1] \;\Rightarrow\; \hat{x} = 1
\]
and given that $x = \sin(y)$, we have $p_y(y) = 2\sin(y)|\cos(y)|$ for $y \in [0, \frac{\pi}{2}]$, which simplifies to
\[
p_y(y) = \sin(2y), \quad y \in \Big[0, \frac{\pi}{2}\Big] \;\Rightarrow\; \hat{y} = \frac{\pi}{4}
\]
However, it is obvious that $\hat{x} \neq \sin(\hat{y})$.
Problem 1.5 Solution
This problem uses basic properties of expectation:
\[
\begin{aligned}
\mathrm{var}[f] &= \mathbb{E}\big[ (f(x) - \mathbb{E}[f(x)])^2 \big] \\
&= \mathbb{E}\big[ f(x)^2 - 2 f(x)\mathbb{E}[f(x)] + \mathbb{E}[f(x)]^2 \big] \\
&= \mathbb{E}[f(x)^2] - 2\mathbb{E}[f(x)]^2 + \mathbb{E}[f(x)]^2 \\
&= \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\end{aligned}
\]
Problem 1.6 Solution
Based on (1.41), we only need to prove that when $x$ and $y$ are independent, $\mathbb{E}_{x,y}[xy] = \mathbb{E}[x]\mathbb{E}[y]$. Because $x$ and $y$ are independent, we have
\[
p(x, y) = p_x(x)\, p_y(y)
\]
Therefore:
\[
\iint xy\, p(x,y)\, dx\, dy = \iint xy\, p_x(x) p_y(y)\, dx\, dy
= \Big( \int x\, p_x(x)\, dx \Big) \Big( \int y\, p_y(y)\, dy \Big)
\]
and hence $\mathbb{E}_{x,y}[xy] = \mathbb{E}[x]\mathbb{E}[y]$.
Problem 1.7 Solution
This problem uses integration by substitution.
\[
\begin{aligned}
I^2 &= \int_{-\infty}^{+\infty} \int_{-\infty}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} x^2 - \frac{1}{2\sigma^2} y^2 \Big)\, dx\, dy \\
&= \int_{0}^{2\pi} \int_{0}^{+\infty} \exp\Big( -\frac{1}{2\sigma^2} r^2 \Big)\, r\, dr\, d\theta
\end{aligned}
\]
Here we used the substitution $x = r\cos\theta$, $y = r\sin\theta$. Based on the fact that
\[
\int_{0}^{+\infty} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)\, r\, dr
= \left. -\sigma^2 \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \right|_{0}^{+\infty} = -\sigma^2 (0 - 1) = \sigma^2
\]
$I$ can be obtained:
\[
I^2 = \int_{0}^{2\pi} \sigma^2\, d\theta = 2\pi\sigma^2 \;\Rightarrow\; I = \sqrt{2\pi}\,\sigma
\]
Next, we show that the Gaussian distribution $\mathcal{N}(x|\mu, \sigma^2)$ is normalized, i.e. $\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx = 1$:
\[
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, dx
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big\}\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \qquad (y = x - \mu) \\
&= \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^{+\infty} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy \\
&= 1
\end{aligned}
\]
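As an illustrative numerical check of the normalization just derived (with arbitrary $\mu$ and $\sigma$), one can integrate the Gaussian density with SciPy:

```python
import numpy as np
from scipy.integrate import quad

mu, sigma = 1.5, 0.7   # arbitrary values for the check
gauss = lambda x: np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)
area, err = quad(gauss, -np.inf, np.inf)
print(area)            # ~1.0, confirming the normalization derived above
```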
Problem 1.8 Solution
The first part uses the result of Prob. 1.7:
\[
\begin{aligned}
\int_{-\infty}^{+\infty} \mathcal{N}(x|\mu, \sigma^2)\, x\, dx
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} (x-\mu)^2 \Big\}\, x\, dx \\
&= \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, (y + \mu)\, dy \qquad (y = x - \mu) \\
&= \mu \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, dy
+ \int_{-\infty}^{+\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{ -\frac{1}{2\sigma^2} y^2 \Big\}\, y\, dy \\
&= \mu + 0 = \mu
\end{aligned}
\]
For the second part, a hint has already been given in the problem description. Given that
\[
\frac{d(fg)}{dx} = f \frac{dg}{dx} + g \frac{df}{dx}
\]
we differentiate both sides of (1.127) with respect to $\sigma^2$ and obtain:
\[
\int_{-\infty}^{+\infty} \Big( -\frac{1}{2\sigma^2} + \frac{(x-\mu)^2}{2\sigma^4} \Big) \mathcal{N}(x|\mu, \sigma^2)\, dx = 0
\]
Provided that $\sigma \neq 0$, we get:
\[
\int_{-\infty}^{+\infty} (x-\mu)^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx
= \int_{-\infty}^{+\infty} \sigma^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \sigma^2
\]
The equation above actually proves (1.51), according to the definition
\[
\mathrm{var}[x] = \int_{-\infty}^{+\infty} (x - \mathbb{E}[x])^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx
\]
where $\mathbb{E}[x] = \mu$ has already been proved. Therefore $\mathrm{var}[x] = \sigma^2$. Finally,
\[
\mathbb{E}[x^2] = \mathrm{var}[x] + \mathbb{E}[x]^2 = \sigma^2 + \mu^2
\]
Problem 1.9 Solution
Here we only focus on (1.52), because (1.52) is the general form of (1.42). Based on the definition (the maximum of a distribution is known as its mode) and (1.52), we obtain:
\[
\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)}{\partial \mathbf{x}}
= -\frac{1}{2} \big[ \Sigma^{-1} + (\Sigma^{-1})^T \big] (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)
= -\Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\, \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)
\]
where we have used
\[
\frac{\partial\, \mathbf{x}^T \mathbf{A} \mathbf{x}}{\partial \mathbf{x}} = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}
\quad \text{and} \quad (\Sigma^{-1})^T = \Sigma^{-1}
\]
Therefore, $\frac{\partial \mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)}{\partial \mathbf{x}} = 0$ only when $\mathbf{x} = \boldsymbol{\mu}$.
Note: strictly one should also examine the Hessian matrix to verify that this is a maximum. However, the first derivative has only one root, and based on the description in the problem this stationary point must be the maximum.
Problem 1.10 Solution
We solve this problem using the definitions of expectation and variance, together with independence.
\[
\begin{aligned}
\mathbb{E}[x+z] &= \iint (x+z)\, p(x,z)\, dx\, dz = \iint (x+z)\, p(x) p(z)\, dx\, dz \\
&= \iint x\, p(x)p(z)\, dx\, dz + \iint z\, p(x)p(z)\, dx\, dz \\
&= \Big( \int p(z)\, dz \Big) \int x\, p(x)\, dx + \Big( \int p(x)\, dx \Big) \int z\, p(z)\, dz \\
&= \int x\, p(x)\, dx + \int z\, p(z)\, dz = \mathbb{E}[x] + \mathbb{E}[z]
\end{aligned}
\]
\[
\begin{aligned}
\mathrm{var}[x+z] &= \iint \big( x+z-\mathbb{E}[x+z] \big)^2 p(x,z)\, dx\, dz \\
&= \iint \big\{ (x+z)^2 - 2(x+z)\mathbb{E}[x+z] + \mathbb{E}^2[x+z] \big\}\, p(x,z)\, dx\, dz \\
&= \iint (x+z)^2 p(x,z)\, dx\, dz - 2\,\mathbb{E}[x+z] \iint (x+z)\, p(x,z)\, dx\, dz + \mathbb{E}^2[x+z] \\
&= \iint (x+z)^2 p(x,z)\, dx\, dz - \mathbb{E}^2[x+z] \\
&= \iint (x^2 + 2xz + z^2)\, p(x)p(z)\, dx\, dz - \mathbb{E}^2[x+z] \\
&= \mathbb{E}[x^2] + \mathbb{E}[z^2] + 2\,\mathbb{E}[x]\mathbb{E}[z] - \big( \mathbb{E}[x] + \mathbb{E}[z] \big)^2 \\
&= \mathbb{E}[x^2] - \mathbb{E}^2[x] + \mathbb{E}[z^2] - \mathbb{E}^2[z] \\
&= \mathrm{var}[x] + \mathrm{var}[z]
\end{aligned}
\]
Problem 1.11 Solution
Based on the prior knowledge that $\mu_{ML}$ and $\sigma^2_{ML}$ decouple, we first calculate $\mu_{ML}$:
\[
\frac{d \ln p(\mathbf{x}|\mu, \sigma^2)}{d\mu} = \frac{1}{\sigma^2} \sum_{n=1}^{N} (x_n - \mu)
\]
Setting this derivative to zero gives
\[
\mu_{ML} = \frac{1}{N} \sum_{n=1}^{N} x_n
\]
Similarly, since
\[
\frac{d \ln p(\mathbf{x}|\mu, \sigma^2)}{d\sigma^2} = \frac{1}{2\sigma^4} \Big( \sum_{n=1}^{N} (x_n - \mu)^2 - N\sigma^2 \Big)
\]
setting it to zero gives
\[
\sigma^2_{ML} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2
\]
Problem 1.12 Solution
It is quite straightforward for $\mathbb{E}[\mu_{ML}]$, given that the $x_n$ are i.i.d. Gaussian, $x_n \sim \mathcal{N}(\mu, \sigma^2)$:
\[
\mathbb{E}[\mu_{ML}] = \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} x_n \Big]
= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] = \mathbb{E}[x_n] = \mu
\]
For $\mathbb{E}[\sigma^2_{ML}]$, we use (1.56) and the identities given in the problem:
\[
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{ML})^2 \Big]
= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n^2 - 2 x_n \mu_{ML} + \mu_{ML}^2) \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big]
- \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu_{ML} \Big]
+ \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} \mu_{ML}^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big( \frac{1}{N} \sum_{m=1}^{N} x_m \Big) \Big] + \mathbb{E}[\mu_{ML}^2] \\
&= \mu^2 + \sigma^2 - \frac{2}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big]
+ \frac{1}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{1}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big]
= \mu^2 + \sigma^2 - \frac{1}{N^2} \big[ N (N\mu^2 + \sigma^2) \big]
\end{aligned}
\]
Therefore we have:
\[
\mathbb{E}[\sigma^2_{ML}] = \Big( \frac{N-1}{N} \Big) \sigma^2
\]
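A small Monte Carlo sketch (with arbitrary parameter values) illustrating the bias factor $(N-1)/N$ derived above:

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, N, trials = 2.0, 3.0, 5, 200_000   # arbitrary settings for the check
x = rng.normal(mu, sigma, size=(trials, N))
sigma2_ml = ((x - x.mean(axis=1, keepdims=True)) ** 2).mean(axis=1)
print(sigma2_ml.mean(), (N - 1) / N * sigma ** 2)   # both ~7.2
```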
Problem 1.13 Solution
This problem can be solved by the same method used in Prob. 1.12:
\[
\begin{aligned}
\mathbb{E}[\sigma^2_{ML}] &= \mathbb{E}\Big[ \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu)^2 \Big]
\qquad \text{(here $\mu$ replaces $\mu_{ML}$)} \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} (x_n^2 - 2 x_n \mu + \mu^2) \Big] \\
&= \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n^2 \Big]
- \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} 2 x_n \mu \Big]
+ \frac{1}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} \mu^2 \Big] \\
&= \mu^2 + \sigma^2 - \frac{2\mu}{N} \mathbb{E}\Big[ \sum_{n=1}^{N} x_n \Big] + \mu^2 \\
&= \mu^2 + \sigma^2 - 2\mu^2 + \mu^2 = \sigma^2
\end{aligned}
\]
Note: the biggest difference between Prob. 1.12 and Prob. 1.13 is whether the mean of the Gaussian distribution is known beforehand (in Prob. 1.13) or not (in Prob. 1.12). In other words, the difference is captured by the following equations:
\[
\mathbb{E}[\mu^2] = \mu^2 \quad \text{($\mu$ is a fixed constant, so its expectation is itself, and likewise for $\mu^2$)}
\]
\[
\mathbb{E}[\mu_{ML}^2] = \mathbb{E}\Big[ \Big( \frac{1}{N} \sum_{n=1}^{N} x_n \Big)^2 \Big]
= \frac{1}{N^2} \mathbb{E}\Big[ \Big( \sum_{n=1}^{N} x_n \Big)^2 \Big]
= \frac{1}{N^2} N (N\mu^2 + \sigma^2) = \mu^2 + \frac{\sigma^2}{N}
\]
Problem 1.14 Solution
This problem is analogous to the fact that any function $f(x)$ can be written as the sum of an odd function and an even function. If we let
\[
w_{ij}^S = \frac{w_{ij} + w_{ji}}{2} \quad \text{and} \quad w_{ij}^A = \frac{w_{ij} - w_{ji}}{2}
\]
it is obvious that they satisfy the constraints described in the problem, namely
\[
w_{ij} = w_{ij}^S + w_{ij}^A, \qquad w_{ij}^S = w_{ji}^S, \qquad w_{ij}^A = -w_{ji}^A
\]
To prove (1.132), we only need to expand:
\[
\sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij} x_i x_j
= \sum_{i=1}^{D} \sum_{j=1}^{D} (w_{ij}^S + w_{ij}^A) x_i x_j
= \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij}^S x_i x_j + \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij}^A x_i x_j
\]
Therefore, we only need to prove that the second term equals zero; here we use a simple trick and show that twice the second term equals zero instead:
\[
\begin{aligned}
2 \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij}^A x_i x_j
&= \sum_{i=1}^{D} \sum_{j=1}^{D} (w_{ij}^A + w_{ij}^A) x_i x_j
= \sum_{i=1}^{D} \sum_{j=1}^{D} (w_{ij}^A - w_{ji}^A) x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij}^A x_i x_j - \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ji}^A x_i x_j \\
&= \sum_{i=1}^{D} \sum_{j=1}^{D} w_{ij}^A x_i x_j - \sum_{j=1}^{D} \sum_{i=1}^{D} w_{ji}^A x_j x_i = 0
\end{aligned}
\]
(the two sums are identical after relabelling the dummy indices). Therefore we can choose the coefficient matrix to be symmetric, as described in the problem. Considering the symmetry, the whole matrix is determined once $w_{ij}$ is given for $i = 1, 2, \ldots, D$ and $i \leq j$. Hence the number of independent parameters is
\[
D + (D-1) + \cdots + 1 = \frac{D(D+1)}{2}
\]
Note: you can also see this intuitively by observing that once the upper triangular part of a symmetric matrix is given, the whole matrix is determined.
Problem 1.15 Solution
This problem is a more general form of Prob. 1.14, so the same method can be used here: we find a way to express the redundant coefficients $w_{i_1 i_2 \ldots i_M}$ in terms of a reduced set $\widetilde{w}_{j_1 j_2 \ldots j_M}$. We begin by introducing a mapping:
\[
F(x_{i_1} x_{i_2} \ldots x_{i_M}) = x_{j_1} x_{j_2} \ldots x_{j_M}
\quad \text{s.t.} \quad \prod_{k=1}^{M} x_{i_k} = \prod_{k=1}^{M} x_{j_k}
\;\;\text{and}\;\; j_1 \geq j_2 \geq \cdots \geq j_M
\]
It is cumbersome to write $F$ in a fully formal way. In fact this mapping does a simple job: it rearranges the factors of a monomial in non-increasing order of their subindices. Several examples, for $D = 5$, $M = 4$:
\[
F(x_5 x_2 x_3 x_2) = x_5 x_3 x_2 x_2, \quad
F(x_1 x_3 x_3 x_2) = x_3 x_3 x_2 x_1, \quad
F(x_1 x_4 x_2 x_3) = x_4 x_3 x_2 x_1, \quad
F(x_1 x_1 x_5 x_2) = x_5 x_2 x_1 x_1
\]
After introducing $F$, the solution is simple, based on the fact that $F$ does not change the value of a term but only rearranges it:
\[
\sum_{i_1=1}^{D} \sum_{i_2=1}^{D} \cdots \sum_{i_M=1}^{D} w_{i_1 i_2 \ldots i_M} x_{i_1} x_{i_2} \ldots x_{i_M}
= \sum_{j_1=1}^{D} \sum_{j_2=1}^{j_1} \cdots \sum_{j_M=1}^{j_{M-1}} \widetilde{w}_{j_1 j_2 \ldots j_M} x_{j_1} x_{j_2} \ldots x_{j_M}
\]
where
\[
\widetilde{w}_{j_1 j_2 \ldots j_M} = \sum_{w \in \Omega} w, \qquad
\Omega = \big\{\, w_{i_1 i_2 \ldots i_M} \;\big|\; F(x_{i_1} x_{i_2} \ldots x_{i_M}) = x_{j_1} x_{j_2} \ldots x_{j_M} \,\big\}
\]
So far we have proved (1.134). Mathematical induction will be used to prove (1.135), beginning with $D = 1$, i.e. $n(1, M) = n(1, M-1)$. When $D = 1$, (1.134) degenerates to $w x_1^M$, i.e. it has only one term, whose coefficient is governed by $w$ regardless of the value of $M$. Therefore $n(1, M) = 1$ for all $M$, which proves the base case. Suppose (1.135) holds for $D$; we now show it also holds for $D+1$, and then (1.135) follows by induction. We begin from (1.134):
\[
\sum_{i_1=1}^{D+1} \sum_{i_2=1}^{i_1} \cdots \sum_{i_M=1}^{i_{M-1}} \widetilde{w}_{i_1 i_2 \ldots i_M} x_{i_1} x_{i_2} \ldots x_{i_M} \qquad (*)
\]
We divide $(*)$ into two parts based on the first summation: the first part is made up of $i_1 = 1, 2, \ldots, D$ and the second part of $i_1 = D+1$. After this division, the first part corresponds to $n(D, M)$ independent parameters and the second part to $n(D+1, M-1)$. Therefore we obtain:
\[
n(D+1, M) = n(D, M) + n(D+1, M-1) \qquad (**)
\]
Given that (1.135) holds for $D$, i.e. $n(D, M) = \sum_{i=1}^{D} n(i, M-1)$, we substitute it into $(**)$:
\[
n(D+1, M) = \sum_{i=1}^{D} n(i, M-1) + n(D+1, M-1) = \sum_{i=1}^{D+1} n(i, M-1)
\]
We prove (1.136) in a different but simple way. We rewrite (1.136) using binomial coefficients, denoted $C^m_N = \binom{N}{m}$:
\[
\sum_{i=1}^{D} C^{M-1}_{i+M-2} = C^{M}_{D+M-1}
\]
Firstly, we expand the summation:
\[
C^{M-1}_{M-1} + C^{M-1}_{M} + \cdots + C^{M-1}_{D+M-2} = C^{M}_{D+M-1}
\]
Secondly, we rewrite the first term on the left-hand side as $C^{M}_{M}$, because $C^{M-1}_{M-1} = C^{M}_{M} = 1$. In other words, we only need to prove:
\[
C^{M}_{M} + C^{M-1}_{M} + \cdots + C^{M-1}_{D+M-2} = C^{M}_{D+M-1}
\]
Thirdly, we use the property $C^{r}_{N} = C^{r}_{N-1} + C^{r-1}_{N-1}$: we recursively combine the first two terms on the left-hand side, and the sum ultimately equals the right-hand side.

(1.137) gives the closed form of $n(D, M)$, and we need all the conclusions above to prove it. Let us first build intuition with $M = 0, 1, 2$. When $M = 0$, (1.134) consists of only a constant term, so $n(D, 0) = 1$. When $M = 1$, obviously $n(D, 1) = D$, because in this case (1.134) has exactly $D$ terms when expanded. When $M = 2$, the problem degenerates to Prob. 1.14, so $n(D, 2) = D(D+1)/2$. Now suppose (1.137) holds for $M-1$; we prove it also holds for $M$:
\[
\begin{aligned}
n(D, M) &= \sum_{i=1}^{D} n(i, M-1) \qquad \text{(based on (1.135))} \\
&= \sum_{i=1}^{D} C^{M-1}_{i+M-2} \qquad \text{(based on (1.137) holding for $M-1$)} \\
&= C^{M-1}_{M-1} + C^{M-1}_{M} + C^{M-1}_{M+1} + \cdots + C^{M-1}_{D+M-2} \\
&= \big( C^{M}_{M} + C^{M-1}_{M} \big) + C^{M-1}_{M+1} + \cdots + C^{M-1}_{D+M-2} \\
&= \big( C^{M}_{M+1} + C^{M-1}_{M+1} \big) + \cdots + C^{M-1}_{D+M-2} \\
&= C^{M}_{M+2} + \cdots + C^{M-1}_{D+M-2} \\
&\;\;\vdots \\
&= C^{M}_{D+M-1}
\end{aligned}
\]
This completes the proof.
Problem 1.16 Solution
This problem can be solved in the same way as Prob. 1.15. Firstly, we write the expression consisting of all independent terms up to order $M$, corresponding to $N(D, M)$. Adding a summation over $m$ on the left-hand side of (1.134), we obtain:
\[
\sum_{m=0}^{M} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} \cdots \sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1 i_2 \ldots i_m} x_{i_1} x_{i_2} \ldots x_{i_m} \qquad (*)
\]
(1.138) is quite obvious if we view $m$ as a looping variable iterating through all possible orders not larger than $M$; for every possible order $m$, the number of independent parameters is $n(D, m)$.

Let us prove (1.138) more formally, using mathematical induction. When $M = 1$, $(*)$ degenerates to two terms: $m = 0$, corresponding to $n(D, 0)$, and $m = 1$, corresponding to $n(D, 1)$. Therefore $N(D, 1) = n(D, 0) + n(D, 1)$. Suppose (1.138) holds for $M$; we show it also holds for $M+1$. We write all the independent terms based on $(*)$:
\[
\sum_{m=0}^{M+1} \sum_{i_1=1}^{D} \sum_{i_2=1}^{i_1} \cdots \sum_{i_m=1}^{i_{m-1}} \widetilde{w}_{i_1 i_2 \ldots i_m} x_{i_1} x_{i_2} \ldots x_{i_m} \qquad (**)
\]
Using the same technique as in Prob. 1.15, we divide $(**)$ into two parts based on the summation over $m$: the first part consists of $m = 0, 1, \ldots, M$ and the second part of $m = M+1$. The first part corresponds to $N(D, M)$ and the second part to $n(D, M+1)$. So we obtain:
\[
N(D, M+1) = N(D, M) + n(D, M+1)
\]
Substituting (1.138) into the equation above:
\[
N(D, M+1) = \sum_{m=0}^{M} n(D, m) + n(D, M+1) = \sum_{m=0}^{M+1} n(D, m)
\]
To prove (1.139), we again use the same technique as in Prob. 1.15 instead of mathematical induction. We start from the already proved (1.138) and then use (1.137):
\[
\begin{aligned}
N(D, M) &= \sum_{m=0}^{M} n(D, m) = \sum_{m=0}^{M} C^{m}_{D+m-1} \\
&= C^{0}_{D-1} + C^{1}_{D} + C^{2}_{D+1} + \cdots + C^{M}_{D+M-1} \\
&= \big( C^{0}_{D} + C^{1}_{D} \big) + C^{2}_{D+1} + \cdots + C^{M}_{D+M-1} \qquad (\text{using } C^{0}_{D-1} = C^{0}_{D} = 1) \\
&= \big( C^{1}_{D+1} + C^{2}_{D+1} \big) + \cdots + C^{M}_{D+M-1} \\
&\;\;\vdots \\
&= C^{M}_{D+M}
\end{aligned}
\]
As asked by the problem, we now examine the growth of $N(D, M)$. Note that in $n(D, M)$, $D$ and $M$ enter symmetrically, so we only need to show that when $D \gg M$ it grows like $D^M$; the case $M \gg D$ then follows by symmetry.
\[
\begin{aligned}
N(D, M) &= \frac{(D+M)!}{D!\, M!} \simeq \frac{(D+M)^{D+M}}{D^D M^M}
= \frac{1}{M^M} \Big( \frac{D+M}{D} \Big)^D (D+M)^M \\
&= \frac{1}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^M (D+M)^M
\simeq \Big( \frac{e}{M} \Big)^M (D+M)^M \\
&= \frac{e^M}{M^M} \Big( 1 + \frac{M}{D} \Big)^M D^M
= \frac{e^M}{M^M} \Big[ \Big( 1 + \frac{M}{D} \Big)^{\frac{D}{M}} \Big]^{\frac{M^2}{D}} D^M
\simeq \frac{e^{M + \frac{M^2}{D}}}{M^M} D^M
\simeq \frac{e^M}{M^M} D^M
\end{aligned}
\]
where we used Stirling's approximation $n! \simeq n^n e^{-n}$, the limit $\lim_{n\to+\infty} (1 + \frac{1}{n})^n = e$, and $e^{M^2/D} \simeq e^0 = 1$ when $D \gg M$. According to the description in the problem, when $D \gg M$ we can treat $\frac{e^M}{M^M}$ as a constant, so $N(D, M)$ grows like $D^M$ in this case; by symmetry, $N(D, M)$ grows like $M^D$ when $M \gg D$.
Finally, we are asked to calculate $N(10, 3)$ and $N(100, 3)$:
\[
N(10, 3) = C^{3}_{13} = 286, \qquad N(100, 3) = C^{3}_{103} = 176851
\]
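A quick check of (1.139) and the two requested values, using the binomial-coefficient form $N(D,M) = C^{M}_{D+M}$ derived above:

```python
from math import comb

def N(D, M):
    """Total number of independent coefficients up to order M, i.e. (1.139)."""
    return comb(D + M, M)

print(N(10, 3), N(100, 3))   # 286 176851
```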
Problem 1.17 Solution
\[
\begin{aligned}
\Gamma(x+1) &= \int_{0}^{+\infty} u^x e^{-u}\, du = -\int_{0}^{+\infty} u^x\, d e^{-u} \\
&= \left. -u^x e^{-u} \right|_{0}^{+\infty} + \int_{0}^{+\infty} e^{-u}\, d(u^x) \\
&= \left. -u^x e^{-u} \right|_{0}^{+\infty} + x \int_{0}^{+\infty} e^{-u} u^{x-1}\, du \\
&= \left. -u^x e^{-u} \right|_{0}^{+\infty} + x\, \Gamma(x)
\end{aligned}
\]
Here we used integration by parts; it remains to show that the first term equals zero. By L'Hôpital's rule (applied repeatedly),
\[
\lim_{u \to +\infty} \frac{u^x}{e^u} = 0
\]
and at $u = 0$ we also have $u^x e^{-u} = 0$, so we have proved $\Gamma(x+1) = x\,\Gamma(x)$. Based on the definition of $\Gamma(x)$, we can write:
\[
\Gamma(1) = \int_{0}^{+\infty} e^{-u}\, du = \left. -e^{-u} \right|_{0}^{+\infty} = -(0-1) = 1
\]
Therefore, when $x$ is an integer:
\[
\Gamma(x) = (x-1)\Gamma(x-1) = (x-1)(x-2)\Gamma(x-2) = \cdots = (x-1)!\,\Gamma(1) = (x-1)!
\]
Problem 1.18 Solution
Based on (1.124) and (1.126), substituting $x = \sqrt{2}\sigma y$, it is straightforward to obtain:
\[
\int_{-\infty}^{+\infty} e^{-x_i^2}\, dx_i = \sqrt{\pi}
\]
Therefore, the left-hand side of (1.42) equals $\pi^{D/2}$. For the right-hand side of (1.42):
\[
\begin{aligned}
S_D \int_{0}^{+\infty} e^{-r^2} r^{D-1}\, dr
&= S_D \int_{0}^{+\infty} e^{-u} u^{\frac{D-1}{2}}\, d\sqrt{u} \qquad (u = r^2) \\
&= \frac{S_D}{2} \int_{0}^{+\infty} e^{-u} u^{\frac{D}{2}-1}\, du
= \frac{S_D}{2}\, \Gamma\Big( \frac{D}{2} \Big)
\end{aligned}
\]
Hence we obtain:
\[
\pi^{D/2} = \frac{S_D}{2}\, \Gamma\Big( \frac{D}{2} \Big) \;\Rightarrow\; S_D = \frac{2\pi^{D/2}}{\Gamma(D/2)}
\]
$S_D$ gives the surface area of the unit sphere in dimension $D$. We can extend this conclusion: the surface area of a sphere with radius $r$ in dimension $D$ equals $S_D \cdot r^{D-1}$, which reduces to $S_D$ when $r = 1$. This follows because the surface area of a sphere in dimension $D$ is proportional to the $(D-1)$-th power of its radius, i.e. $r^{D-1}$. Considering the relationship between the volume $V$ and the surface area $S$ of a sphere of arbitrary radius in dimension $D$, namely $\frac{dV}{dr} = S$, we obtain:
\[
V = \int S\, dr = \int S_D r^{D-1}\, dr = \frac{S_D}{D} r^D
\]
The equation above gives the volume of a sphere with radius $r$ in dimension $D$, so letting $r = 1$:
\[
V_D = \frac{S_D}{D}
\]
For $D = 2$ and $D = 3$:
\[
V_2 = \frac{S_2}{2} = \frac{1}{2} \cdot \frac{2\pi}{\Gamma(1)} = \pi
\]
\[
V_3 = \frac{S_3}{3} = \frac{1}{3} \cdot \frac{2\pi^{3/2}}{\Gamma(3/2)} = \frac{1}{3} \cdot \frac{2\pi^{3/2}}{\sqrt{\pi}/2} = \frac{4}{3}\pi
\]
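An illustrative numerical check of (1.143) and (1.144) for $D = 2$ and $D = 3$:

```python
from math import pi, gamma

def S(D):   # surface area of the unit sphere in dimension D, (1.143)
    return 2 * pi ** (D / 2) / gamma(D / 2)

def V(D):   # volume of the unit sphere in dimension D, (1.144)
    return S(D) / D

print(V(2), V(3))   # 3.1415... and 4.1887... (= pi and 4*pi/3)
```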
Problem 1.19 Solution
We have already hinted at the key fact in the solution of Prob. 1.18, and here we make it explicit: the volume of a sphere with radius $r$ is $V_D \cdot r^D$. This is similar to the conclusion obtained in Prob. 1.18 for the surface area, except that the volume is proportional to the $D$-th power of the radius, $r^D$, not $r^{D-1}$.
\[
\frac{\text{volume of sphere}}{\text{volume of cube}}
= \frac{V_D a^D}{(2a)^D} = \frac{S_D}{2^D D} = \frac{\pi^{D/2}}{2^{D-1} D\, \Gamma(D/2)} \qquad (*)
\]
where we have used the result (1.143). When $D \to +\infty$, a simple argument shows that $(*)$ converges to 0. We rewrite it as:
\[
(*) = \frac{2}{D} \cdot \Big( \frac{\pi}{4} \Big)^{D/2} \cdot \frac{1}{\Gamma(D/2)}
\]
It is now quite obvious that all three factors converge to 0 as $D \to +\infty$, and therefore so does their product. The last part is simple:
\[
\frac{\text{centre to one corner}}{\text{centre to one side}} = \frac{\sqrt{a^2 \cdot D}}{a} = \sqrt{D},
\qquad \lim_{D \to +\infty} \sqrt{D} = +\infty
\]
Problem 1.20 Solution
The probability density inside a thin shell with radius $r$ and thickness $\epsilon$ can be treated as constant. Recalling that a sphere in dimension $D$ with radius $r$ has surface area $S_D r^{D-1}$ (as shown in Prob. 1.18):
\[
\int_{\text{shell}} p(\mathbf{x})\, d\mathbf{x} \approx p(\mathbf{x}) \int_{\text{shell}} d\mathbf{x}
= \frac{\exp\big( -\frac{r^2}{2\sigma^2} \big)}{(2\pi\sigma^2)^{D/2}} \cdot V(\text{shell})
= \frac{\exp\big( -\frac{r^2}{2\sigma^2} \big)}{(2\pi\sigma^2)^{D/2}}\, S_D r^{D-1} \epsilon
\]
Thus we denote:
\[
p(r) = \frac{S_D r^{D-1}}{(2\pi\sigma^2)^{D/2}} \exp\Big( -\frac{r^2}{2\sigma^2} \Big)
\]
We calculate the derivative of (1.148) with respect to $r$:
\[
\frac{d p(r)}{dr} = \frac{S_D}{(2\pi\sigma^2)^{D/2}}\, r^{D-2} \exp\Big( -\frac{r^2}{2\sigma^2} \Big) \Big( D - 1 - \frac{r^2}{\sigma^2} \Big) \qquad (*)
\]
Setting the derivative to zero, we obtain its unique root (stationary point) $\hat{r} = \sqrt{D-1}\,\sigma$, because $r \in [0, +\infty)$. When $r < \hat{r}$ the derivative is greater than 0, so $p(r)$ increases with $r$; when $r > \hat{r}$ the derivative is less than 0, so $p(r)$ decreases with $r$. Therefore $\hat{r}$ is the only maximum point, and obviously $\hat{r} \approx \sqrt{D}\,\sigma$ when $D \gg 1$.
\[
\begin{aligned}
\frac{p(\hat{r} + \epsilon)}{p(\hat{r})}
&= \frac{(\hat{r}+\epsilon)^{D-1} \exp\big( -\frac{(\hat{r}+\epsilon)^2}{2\sigma^2} \big)}{\hat{r}^{D-1} \exp\big( -\frac{\hat{r}^2}{2\sigma^2} \big)} \\
&= \Big( 1 + \frac{\epsilon}{\hat{r}} \Big)^{D-1} \exp\Big( -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} \Big) \\
&= \exp\Big( -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D-1)\ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big) \Big)
\end{aligned}
\]
We expand the exponent using Taylor's theorem:
\[
\begin{aligned}
-\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D-1)\ln\Big( 1 + \frac{\epsilon}{\hat{r}} \Big)
&\approx -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + (D-1)\Big( \frac{\epsilon}{\hat{r}} - \frac{\epsilon^2}{2\hat{r}^2} \Big) \\
&= -\frac{2\epsilon\hat{r} + \epsilon^2}{2\sigma^2} + \frac{2\hat{r}\epsilon - \epsilon^2}{2\sigma^2}
= -\frac{\epsilon^2}{\sigma^2}
\end{aligned}
\]
(using $D - 1 = \hat{r}^2/\sigma^2$). Therefore $p(\hat{r} + \epsilon) = p(\hat{r}) \exp\big( -\frac{\epsilon^2}{\sigma^2} \big)$. Note: here I draw a slightly different conclusion compared with (1.149), but I do not think there is any mistake in this deduction.
Finally, we see from (1.147):
\[
p(\mathbf{x})\big|_{\mathbf{x}=\mathbf{0}} = \frac{1}{(2\pi\sigma^2)^{D/2}}
\]
\[
p(\mathbf{x})\big|_{\|\mathbf{x}\|^2 = \hat{r}^2}
= \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\Big( -\frac{\hat{r}^2}{2\sigma^2} \Big)
\approx \frac{1}{(2\pi\sigma^2)^{D/2}} \exp\Big( -\frac{D}{2} \Big)
\]
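A small numerical sketch (with arbitrary $D$ and $\sigma$) confirming that the maximum of $p(r)$ in (1.148) sits near $\sqrt{D-1}\,\sigma$:

```python
import numpy as np
from scipy.optimize import minimize_scalar

D, sigma = 20, 1.0    # arbitrary dimension and scale for the check

def log_p(r):         # log of (1.148), up to a constant independent of r
    return (D - 1) * np.log(r) - r ** 2 / (2 * sigma ** 2)

res = minimize_scalar(lambda r: -log_p(r), bounds=(1e-6, 10), method='bounded')
print(res.x, np.sqrt(D - 1) * sigma)   # both ~4.36
```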
Problem 1.21 Solution
The first part is rather simple:
\[
(ab)^{1/2} - a = a^{1/2} \big( b^{1/2} - a^{1/2} \big) \geq 0
\]
where we used $b \geq a \geq 0$. Based on (1.78):
\[
p(\text{mistake}) = p(\mathbf{x} \in \mathcal{R}_1, \mathcal{C}_2) + p(\mathbf{x} \in \mathcal{R}_2, \mathcal{C}_1)
= \int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x} + \int_{\mathcal{R}_2} p(\mathbf{x}, \mathcal{C}_1)\, d\mathbf{x}
\]
Recall that the decision rule which minimizes misclassification is: if $p(\mathbf{x}, \mathcal{C}_1) > p(\mathbf{x}, \mathcal{C}_2)$ for a given value of $\mathbf{x}$, we assign that $\mathbf{x}$ to class $\mathcal{C}_1$. Hence in decision region $\mathcal{R}_1$ we have $p(\mathbf{x}, \mathcal{C}_1) > p(\mathbf{x}, \mathcal{C}_2)$. Therefore, using the inequality above:
\[
\int_{\mathcal{R}_1} p(\mathbf{x}, \mathcal{C}_2)\, d\mathbf{x}
\leq \int_{\mathcal{R}_1} \big\{ p(\mathbf{x}, \mathcal{C}_1)\, p(\mathbf{x}, \mathcal{C}_2) \big\}^{1/2}\, d\mathbf{x}
\]
The same holds for decision region $\mathcal{R}_2$. Therefore we obtain:
\[
p(\text{mistake}) \leq \int \big\{ p(\mathbf{x}, \mathcal{C}_1)\, p(\mathbf{x}, \mathcal{C}_2) \big\}^{1/2}\, d\mathbf{x}
\]
Problem 1.22 Solution
We need to understand (1.81) thoroughly. When $L_{kj} = 1 - I_{kj}$:
\[
\sum_{k} L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) = \sum_{k} p(\mathcal{C}_k|\mathbf{x}) - p(\mathcal{C}_j|\mathbf{x}) = 1 - p(\mathcal{C}_j|\mathbf{x})
\]
For a given $\mathbf{x}$, the first term on the right-hand side is a constant equal to 1, whichever class $\mathcal{C}_j$ we assign $\mathbf{x}$ to. Therefore, minimizing the loss amounts to maximizing $p(\mathcal{C}_j|\mathbf{x})$: we assign $\mathbf{x}$ to the class $\mathcal{C}_j$ with the largest posterior probability $p(\mathcal{C}_j|\mathbf{x})$.
The interpretation of this loss matrix is simple: a correct label incurs no loss, while any incorrect label incurs the same unit loss, whichever class it is assigned to. The loss matrix is shown below to give an intuitive view:
\[
\begin{pmatrix}
0 & 1 & 1 & \cdots & 1 \\
1 & 0 & 1 & \cdots & 1 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & 1 & 1 & \cdots & 0
\end{pmatrix}
\]
Problem 1.23 Solution
\[
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, \mathcal{C}_k)\, d\mathbf{x}
= \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathcal{C}_k)\, p(\mathbf{x}|\mathcal{C}_k)\, d\mathbf{x}
\]
If we define a new loss matrix $\widetilde{L}_{kj} = L_{kj}\, p(\mathcal{C}_k)$, we obtain the new expression
\[
\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} \widetilde{L}_{kj}\, p(\mathbf{x}|\mathcal{C}_k)\, d\mathbf{x}
\]
Problem 1.24 Solution
The description of this problem is a little confusing; what it really means is that $\lambda$ is the parameter governing the loss, just as $\theta$ governs the posterior probability $p(\mathcal{C}_k|\mathbf{x})$ when we introduce the reject option. Therefore, the reject option can be written in a new way, from the viewpoint of $\lambda$ and the loss:
\[
\text{choice} =
\begin{cases}
\text{class } \mathcal{C}_j & \text{if } \min_{l} \sum_{k} L_{kl}\, p(\mathcal{C}_k|\mathbf{x}) < \lambda \\
\text{reject} & \text{otherwise}
\end{cases}
\]
where $\mathcal{C}_j$ is the class attaining the minimum. If $L_{kj} = 1 - I_{kj}$, then according to what we proved in Prob. 1.22:
\[
\sum_{k} L_{kj}\, p(\mathcal{C}_k|\mathbf{x}) = \sum_{k} p(\mathcal{C}_k|\mathbf{x}) - p(\mathcal{C}_j|\mathbf{x}) = 1 - p(\mathcal{C}_j|\mathbf{x})
\]
Therefore, the reject criterion above, from the viewpoint of $\lambda$, is actually equivalent to requiring the largest posterior probability to be larger than $1 - \lambda$:
\[
\min_{l} \sum_{k} L_{kl}\, p(\mathcal{C}_k|\mathbf{x}) < \lambda
\quad \Longleftrightarrow \quad
\max_{l} p(\mathcal{C}_l|\mathbf{x}) > 1 - \lambda
\]
From the viewpoint of $\theta$ and the posterior probability, we assign a class label to $\mathbf{x}$ (i.e. we do not reject) under the constraint
\[
\max_{l} p(\mathcal{C}_l|\mathbf{x}) > \theta
\]
Hence, comparing the two viewpoints, $\lambda$ and $\theta$ are related by $\lambda + \theta = 1$.
Problem 1.25 Solution
We could prove this by handling one dimension at a time, exactly as in (1.87)-(1.89), because the total loss $\mathbb{E}[L]$ decomposes into a sum of losses over the individual dimensions, which are independent. Here, however, we use a more direct (if informal) argument. In this case the expected loss can be written as:
\[
\mathbb{E}[L] = \iint \| \mathbf{y}(\mathbf{x}) - \mathbf{t} \|^2\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}\, d\mathbf{x}
\]
Therefore, following the same procedure as in (1.87)-(1.89):
\[
\frac{\delta \mathbb{E}[L]}{\delta \mathbf{y}(\mathbf{x})} = 2 \int \{ \mathbf{y}(\mathbf{x}) - \mathbf{t} \}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t} = 0
\quad\Rightarrow\quad
\mathbf{y}(\mathbf{x}) = \frac{\int \mathbf{t}\, p(\mathbf{x}, \mathbf{t})\, d\mathbf{t}}{p(\mathbf{x})} = \mathbb{E}_{\mathbf{t}}[\mathbf{t}|\mathbf{x}]
\]
Problem 1.26 Solution
The process is identical to the deduction used for (1.90), so we do not repeat it here. What we should emphasize is that $\mathbb{E}[t|\mathbf{x}]$ is a function of $\mathbf{x}$, not of $t$; thus the integral over $t$ and $\mathbf{x}$ can be decomposed accordingly, which is how (1.90) is obtained.
Note: there is a mistake in (1.90), namely the second term on the right-hand side is wrong (see (3.37) on page 148 for reference). It should be:
\[
\mathbb{E}[L] = \int \big\{ y(\mathbf{x}) - \mathbb{E}[t|\mathbf{x}] \big\}^2 p(\mathbf{x})\, d\mathbf{x}
+ \iint \big\{ \mathbb{E}[t|\mathbf{x}] - t \big\}^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt
\]
Problem 1.27 Solution
We deal with this problem using the calculus of variations.
\[
\frac{\delta \mathbb{E}[L_q]}{\delta y(\mathbf{x})}
= \int q\, |y(\mathbf{x}) - t|^{q-1}\, \mathrm{sign}\big( y(\mathbf{x}) - t \big)\, p(\mathbf{x}, t)\, dt = 0
\]
\[
\Rightarrow \int_{-\infty}^{y(\mathbf{x})} |y(\mathbf{x}) - t|^{q-1}\, p(\mathbf{x}, t)\, dt
= \int_{y(\mathbf{x})}^{+\infty} |y(\mathbf{x}) - t|^{q-1}\, p(\mathbf{x}, t)\, dt
\]
\[
\Rightarrow \int_{-\infty}^{y(\mathbf{x})} |y(\mathbf{x}) - t|^{q-1}\, p(t|\mathbf{x})\, dt
= \int_{y(\mathbf{x})}^{+\infty} |y(\mathbf{x}) - t|^{q-1}\, p(t|\mathbf{x})\, dt
\]
where we used $p(\mathbf{x}, t) = p(t|\mathbf{x})\, p(\mathbf{x})$ and the property of the sign function. Hence, when $q = 1$, the equation above reduces to:
\[
\int_{-\infty}^{y(\mathbf{x})} p(t|\mathbf{x})\, dt = \int_{y(\mathbf{x})}^{+\infty} p(t|\mathbf{x})\, dt
\]
In other words, when $q = 1$ the optimal $y(\mathbf{x})$ is given by the conditional median. The case $q = 0$ is non-trivial, and we need to rewrite (1.91):
\[
\mathbb{E}[L_q] = \iint |y(\mathbf{x}) - t|^q\, p(t|\mathbf{x})\, p(\mathbf{x})\, dt\, d\mathbf{x}
= \int p(\mathbf{x}) \Big[ \int |y(\mathbf{x}) - t|^q\, p(t|\mathbf{x})\, dt \Big] d\mathbf{x} \qquad (*)
\]
If we want to minimize $\mathbb{E}[L_q]$, we only need to minimize the inner integral of $(*)$ for each $\mathbf{x}$:
\[
\int |y(\mathbf{x}) - t|^q\, p(t|\mathbf{x})\, dt \qquad (**)
\]
When $q \to 0$, $|y(\mathbf{x}) - t|^q$ is close to 1 everywhere except in a small neighbourhood around $t = y(\mathbf{x})$ (this can be seen from Fig. 1.29). Therefore:
\[
(**) \approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} \big( 1 - |y(\mathbf{x}) - t|^q \big)\, p(t|\mathbf{x})\, dt
\approx \int_{U} p(t|\mathbf{x})\, dt - \int_{\epsilon} p(t|\mathbf{x})\, dt
\]
where $\epsilon$ denotes the small neighbourhood and $U$ the whole space in which $t$ lies. Note that $y(\mathbf{x})$ has no influence on the first term, only on the second (because the choice of $y(\mathbf{x})$ determines the location of $\epsilon$). Hence we should place $\epsilon$ where $p(t|\mathbf{x})$ achieves its largest value, i.e. at the mode, because this gives the largest reduction. Therefore it is natural to choose $y(\mathbf{x})$ equal to the $t$ that maximizes $p(t|\mathbf{x})$ for every $\mathbf{x}$.
Problem 1.28 Solution
Basically this problem is focused on the definition of information content, i.e. $h(x)$. Let us restate the problem more precisely. In information theory, $h(\cdot)$ is also called the information content and denoted $I(\cdot)$; here we keep $h(\cdot)$ for consistency. The whole problem is about the properties of $h(x)$. Based on the premise that $h(\cdot)$ is a monotonic function of the probability $p(x)$, we can write:
\[
h(x) = f(p(x))
\]
The equation above means that the information we obtain for a specific value of a random variable $x$ depends on its probability of occurrence $p(x)$, the relationship being given by a mapping $f(\cdot)$. Suppose $C$ is the intersection of two independent events $A$ and $B$; then the information of event $C$ occurring is the combined information of both independent events $A$ and $B$ occurring:
\[
h(C) = h(A \cap B) = h(A) + h(B) \qquad (*)
\]
Because $A$ and $B$ are independent:
\[
P(C) = P(A) \cdot P(B)
\]
Applying $f(\cdot)$ to both sides:
\[
f(P(C)) = f\big( P(A) \cdot P(B) \big) \qquad (**)
\]
Moreover, the left-hand sides of $(*)$ and $(**)$ are equal by definition, so we obtain:
\[
h(A) + h(B) = f\big( P(A) \cdot P(B) \big)
\;\Rightarrow\; f(P(A)) + f(P(B)) = f\big( P(A) \cdot P(B) \big)
\]
We thus obtain an important property of $f(\cdot)$: $f(x \cdot y) = f(x) + f(y)$.
Note: what Problem 1.28 really asks us to prove concerns the form and properties of the function $f$ in our formulation, because the problem description says "In this exercise, we derive the relation between $h$ and $p$ in the form of a function $h(p)$" (i.e. $f(\cdot)$ in our formulation is the $h(p)$ of the description).
At present, what we know is the property of $f(\cdot)$:
\[
f(xy) = f(x) + f(y) \qquad (*)
\]
Firstly, choosing $x = y$, it is obvious that $f(x^2) = 2f(x)$. Secondly, $f(x^n) = n f(x)$, $n \in \mathbb{N}$, clearly holds for $n = 1, 2$. Suppose it holds for $n$; we prove it for $n+1$:
\[
f(x^{n+1}) = f(x^n) + f(x) = n f(x) + f(x) = (n+1) f(x)
\]
Therefore $f(x^n) = n f(x)$ for all $n \in \mathbb{N}$. For an integer $m$, we rewrite $x^n$ as $(x^{n/m})^m$ and apply what we have just proved:
\[
f(x^n) = f\big( (x^{n/m})^m \big) = m f(x^{n/m})
\]
Since $f(x^n)$ also equals $n f(x)$, we have $n f(x) = m f(x^{n/m})$, which simplifies to
\[
f(x^{n/m}) = \frac{n}{m} f(x)
\]
For an arbitrary positive real $x \in \mathbb{R}^+$, we can find two sequences of positive rationals $\{y_n\}$ and $\{z_n\}$ satisfying
\[
y_1 < y_2 < \cdots < y_N < x, \quad \lim_{N\to+\infty} y_N = x
\qquad\text{and}\qquad
z_1 > z_2 > \cdots > z_N > x, \quad \lim_{N\to+\infty} z_N = x
\]
Using the monotonicity of $f(\cdot)$:
\[
y_N f(p) = f(p^{y_N}) \leq f(p^x) \leq f(p^{z_N}) = z_N f(p)
\]
Letting $N \to +\infty$, we obtain $f(p^x) = x f(p)$ for all $x \in \mathbb{R}^+$. Setting $p = e$, this can be rewritten as $f(e^x) = x f(e)$. Finally, denoting $y = e^x$:
\[
f(y) = \ln(y)\, f(e)
\]
where $f(e)$ is a constant once $f(\cdot)$ is fixed. Therefore $f(x) \propto \ln(x)$.
Problem 1.29 Solution
This problem is a little tricky. The entropy of an $M$-state discrete random variable $x$ can be written as
\[
H[x] = -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i)
\]
where $\lambda_i$ is the probability that $x$ takes state $i$. Choosing the concave function $f(\cdot) = \ln(\cdot)$, we rewrite Jensen's inequality, i.e. (1.115):
\[
\ln\Big( \sum_{i=1}^{M} \lambda_i x_i \Big) \geq \sum_{i=1}^{M} \lambda_i \ln(x_i)
\]
Choosing $x_i = \frac{1}{\lambda_i}$ and simplifying, we obtain
\[
\ln M \geq -\sum_{i=1}^{M} \lambda_i \ln(\lambda_i) = H[x]
\]
Problem 1.30 Solution
Based on the definition:
\[
\begin{aligned}
\ln\Big\{ \frac{p(x)}{q(x)} \Big\}
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big[ \frac{1}{2\sigma^2}(x-\mu)^2 - \frac{1}{2s^2}(x-m)^2 \Big] \\
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big[ \Big( \frac{1}{2\sigma^2} - \frac{1}{2s^2} \Big) x^2
- \Big( \frac{\mu}{\sigma^2} - \frac{m}{s^2} \Big) x
+ \Big( \frac{\mu^2}{2\sigma^2} - \frac{m^2}{2s^2} \Big) \Big]
\end{aligned}
\]
We will use the following identities to solve this problem:
\[
\mathbb{E}[x^2] = \int x^2\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu^2 + \sigma^2, \qquad
\mathbb{E}[x] = \int x\, \mathcal{N}(x|\mu, \sigma^2)\, dx = \mu, \qquad
\int \mathcal{N}(x|\mu, \sigma^2)\, dx = 1
\]
Given the identities above, it is easy to see:
\[
\begin{aligned}
\mathrm{KL}(p\|q) &= -\int p(x) \ln\Big\{ \frac{q(x)}{p(x)} \Big\}\, dx
= \int \mathcal{N}(x|\mu, \sigma^2) \ln\Big\{ \frac{p(x)}{q(x)} \Big\}\, dx \\
&= \ln\Big( \frac{s}{\sigma} \Big) - \Big( \frac{1}{2\sigma^2} - \frac{1}{2s^2} \Big)(\mu^2 + \sigma^2)
+ \Big( \frac{\mu}{\sigma^2} - \frac{m}{s^2} \Big)\mu
- \Big( \frac{\mu^2}{2\sigma^2} - \frac{m^2}{2s^2} \Big) \\
&= \ln\Big( \frac{s}{\sigma} \Big) + \frac{\sigma^2 + (\mu - m)^2}{2s^2} - \frac{1}{2}
\end{aligned}
\]
Let us discuss this result in more detail. Firstly, if the KL divergence were defined with base-2 logarithms, as is common in information theory, the first term of the result would be $\log_2(\frac{s}{\sigma})$ instead of $\ln(\frac{s}{\sigma})$. Secondly, if we denote $x = \frac{s}{\sigma}$, the KL divergence can be rewritten as:
\[
\mathrm{KL}(p\|q) = \ln(x) + \frac{1}{2x^2} - \frac{1}{2} + a, \qquad \text{where } a = \frac{(\mu - m)^2}{2s^2}
\]
We compute the derivative of KL with respect to $x$ and set it to zero:
\[
\frac{d(\mathrm{KL})}{dx} = \frac{1}{x} - x^{-3} = 0 \;\Rightarrow\; x = 1 \quad (s, \sigma > 0)
\]
When $x < 1$ the derivative is less than 0, and when $x > 1$ it is greater than 0, which makes $x = 1$ the global minimum. When $x = 1$, $\mathrm{KL}(p\|q) = a$, and $a$ achieves its minimum, 0, when $\mu = m$. In this way, we have shown that the KL divergence between two Gaussian distributions is not less than 0, and equals 0 only when the two Gaussians are identical, i.e. have the same mean and variance.
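An illustrative Monte Carlo check of the closed form just derived, with made-up parameter values:

```python
import numpy as np

mu, sigma, m, s = 0.5, 1.2, -0.3, 2.0    # arbitrary parameters for the check
rng = np.random.default_rng(0)
x = rng.normal(mu, sigma, 1_000_000)     # samples from p

log_p = -0.5 * np.log(2 * np.pi * sigma**2) - (x - mu)**2 / (2 * sigma**2)
log_q = -0.5 * np.log(2 * np.pi * s**2) - (x - m)**2 / (2 * s**2)
kl_mc = np.mean(log_p - log_q)                       # Monte Carlo estimate of KL(p||q)
kl_formula = np.log(s / sigma) + (sigma**2 + (mu - m)**2) / (2 * s**2) - 0.5
print(kl_mc, kl_formula)                             # should agree to ~3 decimals
```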
Problem 1.31 Solution
We evaluate $H[\mathbf{x}] + H[\mathbf{y}] - H[\mathbf{x}, \mathbf{y}]$ by definition. Firstly, let us calculate $H[\mathbf{x}, \mathbf{y}]$:
\[
\begin{aligned}
H[\mathbf{x}, \mathbf{y}] &= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x})\, d\mathbf{x}\, d\mathbf{y}
- \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}
- \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= H[\mathbf{x}] + H[\mathbf{y}|\mathbf{x}]
\end{aligned}
\]
where we used $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{x}) p(\mathbf{y}|\mathbf{x})$, $\int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y} = p(\mathbf{x})$ and (1.111). Note that we have actually solved Prob. 1.37 here. We continue the proof of the present problem, based on what we have just shown:
\[
\begin{aligned}
H[\mathbf{x}] + H[\mathbf{y}] - H[\mathbf{x}, \mathbf{y}] &= H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}] \\
&= -\int p(\mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{y}
+ \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y}
+ \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y}|\mathbf{x})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln \frac{p(\mathbf{x}) p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} \\
&= \mathrm{KL}\big( p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x}) p(\mathbf{y}) \big) = I(\mathbf{x}, \mathbf{y}) \geq 0
\end{aligned}
\]
where we used the relations
\[
p(\mathbf{y}) = \int p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}, \qquad
\frac{p(\mathbf{y})}{p(\mathbf{y}|\mathbf{x})} = \frac{p(\mathbf{x}) p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}
\]
Moreover, by the property of the KL divergence, equality holds if and only if $\mathbf{x}$ and $\mathbf{y}$ are statistically independent. You can also see this directly: when $\mathbf{x}$ and $\mathbf{y}$ are independent,
\[
\begin{aligned}
H[\mathbf{x}, \mathbf{y}] &= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}, \mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x})\, d\mathbf{x}\, d\mathbf{y}
- \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x} - \int p(\mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{y}
= H[\mathbf{x}] + H[\mathbf{y}]
\end{aligned}
\]
Problem 1.32 Solution
It is straightforward based on the definition; note that when we change variables in an integral, we must include the Jacobian determinant.
\[
\begin{aligned}
H[\mathbf{y}] &= -\int p(\mathbf{y}) \ln p(\mathbf{y})\, d\mathbf{y} \\
&= -\int \frac{p(\mathbf{x})}{|\mathbf{A}|} \ln\Big( \frac{p(\mathbf{x})}{|\mathbf{A}|} \Big) \Big| \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \Big|\, d\mathbf{x} \\
&= -\int p(\mathbf{x}) \ln\Big( \frac{p(\mathbf{x})}{|\mathbf{A}|} \Big)\, d\mathbf{x} \\
&= -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x} - \int p(\mathbf{x}) \ln\frac{1}{|\mathbf{A}|}\, d\mathbf{x} \\
&= H[\mathbf{x}] + \ln|\mathbf{A}|
\end{aligned}
\]
where we used the relations
\[
\frac{\partial \mathbf{y}}{\partial \mathbf{x}} = \mathbf{A}, \qquad
p(\mathbf{x}) = p(\mathbf{y}) \Big| \frac{\partial \mathbf{y}}{\partial \mathbf{x}} \Big| = p(\mathbf{y}) |\mathbf{A}|, \qquad
\int p(\mathbf{x})\, d\mathbf{x} = 1
\]
Problem 1.33 Solution
Based on the definition of entropy, we write:
\[
H[y|x] = -\sum_{x_i} \sum_{y_j} p(x_i, y_j) \ln p(y_j|x_i)
\]
Considering the properties of probability, $0 \leq p(y_j|x_i) \leq 1$ and $0 \leq p(x_i, y_j) \leq 1$. Therefore $-p(x_i, y_j) \ln p(y_j|x_i) \geq 0$ when $0 < p(y_j|x_i) \leq 1$. And when $p(y_j|x_i) = 0$, using the fact that $\lim_{p\to 0} p \ln p = 0$, we have $-p(x_i, y_j) \ln p(y_j|x_i) = -p(x_i)\, p(y_j|x_i) \ln p(y_j|x_i) \to 0$ (here we view $p(x_i)$ as a constant). Hence every term in the sum above is non-negative, so $H[y|x] = 0$ if and only if every term of $H[y|x]$ equals 0.
Therefore, for each possible value of the random variable $x$, denoted $x_i$:
\[
-\sum_{y_j} p(x_i, y_j) \ln p(y_j|x_i) = 0 \qquad (*)
\]
If there were more than one possible value of $y$ given $x = x_i$, denoted $y_j$, such that $p(y_j|x_i) \neq 0$ (because $x_i$ and $y_j$ are both "possible", $p(x_i, y_j)$ is also non-zero), then, constrained by $0 \leq p(y_j|x_i) \leq 1$ and $\sum_j p(y_j|x_i) = 1$, there would be at least two values of $y$ satisfying $0 < p(y_j|x_i) < 1$, which would make $(*) > 0$.
Therefore, for each possible value of $x$, there is only one $y$ such that $p(y|x) \neq 0$; in other words, $y$ is determined by $x$. Note: this result is quite intuitive. If $y$ is a function of $x$, we obtain the value of $y$ as soon as we observe $x$, so we gain no additional information when observing $y_j$ given an already observed $x$.
Problem 1.34 Solution
This problem is complicated, so we explain it in detail. According to Appendix D, we have the relation (D.3):
\[
F[y(x) + \epsilon \eta(x)] = F[y(x)] + \int \frac{\delta F}{\delta y}\, \epsilon\, \eta(x)\, dx \qquad (*)
\]
where $y(x)$ can be viewed as an operator that, for any input $x$, gives an output value $y$, and correspondingly $F[y(x)]$ can be viewed as a functional that, for any input function $y(x)$, gives an output value $F[y(x)]$. Consider the functional
\[
I[p(x)] = \int p(x) f(x)\, dx
\]
Under a small variation $p(x) \to p(x) + \epsilon \eta(x)$ we obtain:
\[
I[p(x) + \epsilon \eta(x)] = \int p(x) f(x)\, dx + \int \epsilon\, \eta(x) f(x)\, dx
\]
Comparing the equation above with $(*)$, we conclude:
\[
\frac{\delta I}{\delta p(x)} = f(x)
\]
Similarly, consider another functional:
\[
J[p(x)] = \int p(x) \ln p(x)\, dx
\]
Under a small variation $p(x) \to p(x) + \epsilon \eta(x)$:
\[
\begin{aligned}
J[p(x) + \epsilon \eta(x)] &= \int \big( p(x) + \epsilon \eta(x) \big) \ln\big( p(x) + \epsilon \eta(x) \big)\, dx \\
&= \int p(x) \ln\big( p(x) + \epsilon \eta(x) \big)\, dx + \int \epsilon \eta(x) \ln\big( p(x) + \epsilon \eta(x) \big)\, dx
\end{aligned}
\]
Noting that $\epsilon \eta(x)$ is much smaller than $p(x)$, we expand the logarithm about $p(x)$ using Taylor's theorem:
\[
\ln\big( p(x) + \epsilon \eta(x) \big) = \ln p(x) + \frac{\epsilon \eta(x)}{p(x)} + O\big( \epsilon^2 \eta(x)^2 \big)
\]
Substituting this expansion into $J[p(x) + \epsilon \eta(x)]$:
\[
J[p(x) + \epsilon \eta(x)] = \int p(x) \ln p(x)\, dx + \int \epsilon \eta(x) \big( \ln p(x) + 1 \big)\, dx + O(\epsilon^2)
\]
Therefore we also obtain:
\[
\frac{\delta J}{\delta p(x)} = \ln p(x) + 1
\]
Now we can return to (1.108). Using $\frac{\delta J}{\delta p(x)}$ and $\frac{\delta I}{\delta p(x)}$, we compute the functional derivative of the expression just before (1.108) and set it to zero:
\[
-\ln p(x) - 1 + \lambda_1 + \lambda_2 x + \lambda_3 (x - \mu)^2 = 0
\]
Rearranging gives (1.108). From (1.108) we can see that $p(x)$ takes the form of a Gaussian distribution. Rewriting it in Gaussian form and comparing with a Gaussian with mean $\mu$ and variance $\sigma^2$, it is straightforward that:
\[
\exp(-1 + \lambda_1) = \frac{1}{(2\pi\sigma^2)^{1/2}}, \qquad
\exp\big( \lambda_2 x + \lambda_3 (x - \mu)^2 \big) = \exp\Big\{ -\frac{(x-\mu)^2}{2\sigma^2} \Big\}
\]
Finally, we obtain:
\[
\lambda_1 = 1 - \frac{1}{2}\ln(2\pi\sigma^2), \qquad \lambda_2 = 0, \qquad \lambda_3 = -\frac{1}{2\sigma^2}
\]
Problem 1.35 Solution
If $p(x) = \mathcal{N}(\mu, \sigma^2)$, we write its entropy:
\[
\begin{aligned}
H[x] &= -\int p(x) \ln p(x)\, dx \\
&= -\int p(x) \ln\Big\{ \frac{1}{\sqrt{2\pi\sigma^2}} \Big\}\, dx
+ \int p(x)\, \Big\{ \frac{(x-\mu)^2}{2\sigma^2} \Big\}\, dx \\
&= -\ln\Big\{ \frac{1}{\sqrt{2\pi\sigma^2}} \Big\} + \frac{\sigma^2}{2\sigma^2} \\
&= \frac{1}{2} \big\{ 1 + \ln(2\pi\sigma^2) \big\}
\end{aligned}
\]
where we used the properties of a Gaussian distribution, $\int p(x)\, dx = 1$ and $\int (x-\mu)^2 p(x)\, dx = \sigma^2$.
Problem 1.36 Solution
Here we should make it clear that if the second derivative is strictly positive, the function must be strictly convex; however, the converse is not necessarily true. For example, $f(x) = x^4$, $x \in \mathbb{R}$, is strictly convex by definition, yet $f''(0) = 0$, whereas $g(x) = x^2$ has $g''(x) = 2 > 0$ everywhere (see the entry "convex function" on Wikipedia, or page 71 of the book Convex Optimization by Boyd and Vandenberghe, for more details). Hence, more precisely, we prove that a function being convex is equivalent to its second derivative being non-negative. We first consider Taylor's theorem:
\[
f(x + \epsilon) = f(x) + \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 + \frac{f'''(x)}{3!}\epsilon^3 + \cdots
\]
\[
f(x - \epsilon) = f(x) - \frac{f'(x)}{1!}\epsilon + \frac{f''(x)}{2!}\epsilon^2 - \frac{f'''(x)}{3!}\epsilon^3 + \cdots
\]
Then we can obtain an expression for $f''(x)$:
\[
f''(x) = \lim_{\epsilon \to 0} \frac{f(x+\epsilon) + f(x-\epsilon) - 2f(x)}{\epsilon^2}
\]
where the higher-order terms are neglected. If $f(x)$ is convex, we have:
\[
f(x) = f\Big( \frac{1}{2}(x+\epsilon) + \frac{1}{2}(x-\epsilon) \Big) \leq \frac{1}{2} f(x+\epsilon) + \frac{1}{2} f(x-\epsilon)
\]
Hence $f''(x) \geq 0$. The converse direction is a little more involved; we use the Lagrange form of Taylor's theorem to rewrite the expansion above:
\[
f(x) = f(x_0) + f'(x_0)(x - x_0) + \frac{f''(x_\star)}{2}(x - x_0)^2
\]
where $x_\star$ lies between $x$ and $x_0$. By hypothesis $f''(x_\star) \geq 0$, so the last term is non-negative for all $x$. Let $x_0 = \lambda x_1 + (1-\lambda)x_2$ and $x = x_1$:
\[
f(x_1) \geq f(x_0) + (1-\lambda)(x_1 - x_2) f'(x_0) \qquad (*)
\]
Then let $x = x_2$:
\[
f(x_2) \geq f(x_0) + \lambda(x_2 - x_1) f'(x_0) \qquad (**)
\]
Multiplying $(*)$ by $\lambda$ and $(**)$ by $1-\lambda$ and adding them, we obtain:
\[
\lambda f(x_1) + (1-\lambda) f(x_2) \geq f\big( \lambda x_1 + (1-\lambda) x_2 \big)
\]
Problem 1.37 Solution
See Prob.1.31.
Problem 1.38 Solution
When $M = 2$, (1.115) reduces to (1.114). Suppose (1.115) holds for $M$; we prove that it also holds for $M+1$:
\[
\begin{aligned}
f\Big( \sum_{m=1}^{M+1} \lambda_m x_m \Big)
&= f\Big( \lambda_{M+1} x_{M+1} + (1 - \lambda_{M+1}) \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} x_m \Big) \\
&\leq \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1})\, f\Big( \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} x_m \Big) \\
&\leq \lambda_{M+1} f(x_{M+1}) + (1 - \lambda_{M+1}) \sum_{m=1}^{M} \frac{\lambda_m}{1 - \lambda_{M+1}} f(x_m) \\
&= \sum_{m=1}^{M+1} \lambda_m f(x_m)
\end{aligned}
\]
Hence Jensen's inequality, i.e. (1.115), has been proved by induction.
Problem 1.39 Solution
The values follow directly from the definitions.
\[
H[x] = -\sum_{i} p(x_i) \ln p(x_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365
\]
\[
H[y] = -\sum_{i} p(y_i) \ln p(y_i) = -\frac{2}{3}\ln\frac{2}{3} - \frac{1}{3}\ln\frac{1}{3} = 0.6365
\]
\[
H[x, y] = -\sum_{i,j} p(x_i, y_j) \ln p(x_i, y_j) = -3 \cdot \frac{1}{3}\ln\frac{1}{3} - 0 = 1.0986
\]
\[
H[x|y] = -\sum_{i,j} p(x_i, y_j) \ln p(x_i|y_j)
= -\frac{1}{3}\ln 1 - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} = 0.4621
\]
\[
H[y|x] = -\sum_{i,j} p(x_i, y_j) \ln p(y_j|x_i)
= -\frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln\frac{1}{2} - \frac{1}{3}\ln 1 = 0.4621
\]
\[
I[x, y] = -\sum_{i,j} p(x_i, y_j) \ln\frac{p(x_i)\, p(y_j)}{p(x_i, y_j)}
= -\frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{1}{3}}{1/3}
- \frac{1}{3}\ln\frac{\frac{2}{3}\cdot\frac{2}{3}}{1/3}
- \frac{1}{3}\ln\frac{\frac{1}{3}\cdot\frac{2}{3}}{1/3} = 0.1744
\]
Their relations are given below (diagrams omitted):
\[
I[x, y] = H[x] - H[x|y] = H[y] - H[y|x]
\]
\[
H[x, y] = H[y|x] + H[x] = H[x|y] + H[y]
\]
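The joint distribution used in this exercise (consistent with the conditional probabilities appearing in the terms above) can be used to reproduce these numbers; a short illustrative sketch:

```python
import numpy as np

# Joint distribution consistent with the terms computed above (rows: x, columns: y)
P = np.array([[1/3, 1/3],
              [0.0, 1/3]])
Px, Py = P.sum(axis=1), P.sum(axis=0)

H = lambda p: -np.sum(p[p > 0] * np.log(p[p > 0]))
Hx, Hy, Hxy = H(Px), H(Py), H(P)
print(Hx, Hy, Hxy, Hxy - Hy, Hxy - Hx, Hx + Hy - Hxy)
# 0.6365 0.6365 1.0986 0.4621 0.4621 0.1744
```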
Problem 1.40 Solution
$f(x) = \ln x$ is a strictly concave function, so we apply Jensen's inequality (with the inequality reversed relative to the convex case):
\[
f\Big( \sum_{m=1}^{M} \lambda_m x_m \Big) \geq \sum_{m=1}^{M} \lambda_m f(x_m)
\]
Let $\lambda_m = \frac{1}{M}$, $m = 1, 2, \ldots, M$. Then we obtain:
\[
\ln\Big( \frac{x_1 + x_2 + \cdots + x_M}{M} \Big)
\geq \frac{1}{M} \big[ \ln(x_1) + \ln(x_2) + \cdots + \ln(x_M) \big]
= \frac{1}{M} \ln(x_1 x_2 \cdots x_M)
\]
Since $f(x) = \ln x$ is strictly increasing, we obtain:
\[
\frac{x_1 + x_2 + \cdots + x_M}{M} \geq \sqrt[M]{x_1 x_2 \cdots x_M}
\]
Problem 1.41 Solution
Based on the definition of $I[\mathbf{x}, \mathbf{y}]$, i.e. (1.120), we obtain:
\[
\begin{aligned}
I[\mathbf{x}, \mathbf{y}] &= -\iint p(\mathbf{x}, \mathbf{y}) \ln\frac{p(\mathbf{x})\, p(\mathbf{y})}{p(\mathbf{x}, \mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln\frac{p(\mathbf{x})}{p(\mathbf{x}|\mathbf{y})}\, d\mathbf{x}\, d\mathbf{y} \\
&= -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x})\, d\mathbf{x}\, d\mathbf{y}
+ \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}|\mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \\
&= -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x}
+ \iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{x}|\mathbf{y})\, d\mathbf{x}\, d\mathbf{y} \\
&= H[\mathbf{x}] - H[\mathbf{x}|\mathbf{y}]
\end{aligned}
\]
where we used $p(\mathbf{x}, \mathbf{y}) = p(\mathbf{y})\, p(\mathbf{x}|\mathbf{y})$ and $\int p(\mathbf{x}, \mathbf{y})\, d\mathbf{y} = p(\mathbf{x})$. The same procedure proves $I[\mathbf{x}, \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y}|\mathbf{x}]$ if we substitute $p(\mathbf{x})\, p(\mathbf{y}|\mathbf{x})$ for $p(\mathbf{x}, \mathbf{y})$ in the second step.
0.2 Probability Distribution
Problem 2.1 Solution
Based on the definitions, we obtain:
\[
\sum_{x_i=0,1} p(x_i) = \mu + (1 - \mu) = 1
\]
\[
\mathbb{E}[x] = \sum_{x_i=0,1} x_i\, p(x_i) = 0 \cdot (1-\mu) + 1 \cdot \mu = \mu
\]
\[
\mathrm{var}[x] = \sum_{x_i=0,1} (x_i - \mathbb{E}[x])^2 p(x_i)
= (0-\mu)^2 (1-\mu) + (1-\mu)^2 \mu = \mu(1-\mu)
\]
\[
H[x] = -\sum_{x_i=0,1} p(x_i) \ln p(x_i) = -\mu \ln\mu - (1-\mu)\ln(1-\mu)
\]
Problem 2.2 Solution
The same approach as in Prob. 2.1 can be used here.
\[
\sum_{x_i=-1,1} p(x_i) = \frac{1-\mu}{2} + \frac{1+\mu}{2} = 1
\]
\[
\mathbb{E}[x] = \sum_{x_i=-1,1} x_i\, p(x_i) = -1 \cdot \frac{1-\mu}{2} + 1 \cdot \frac{1+\mu}{2} = \mu
\]
\[
\mathrm{var}[x] = \sum_{x_i=-1,1} (x_i - \mathbb{E}[x])^2 p(x_i)
= (-1-\mu)^2 \cdot \frac{1-\mu}{2} + (1-\mu)^2 \cdot \frac{1+\mu}{2} = 1 - \mu^2
\]
\[
H[x] = -\sum_{x_i=-1,1} p(x_i) \ln p(x_i)
= -\frac{1-\mu}{2}\ln\frac{1-\mu}{2} - \frac{1+\mu}{2}\ln\frac{1+\mu}{2}
\]
Problem 2.3 Solution
(2.262) is an important property of combinations, which we have used before (e.g. in Prob. 1.15). We use the 'old fashioned' notation $C^m_N$ to denote choosing $m$ objects out of a total of $N$. With the prior knowledge
\[
C^m_N = \frac{N!}{m!\,(N-m)!}
\]
we evaluate the left-hand side of (2.262):
\[
\begin{aligned}
C^m_N + C^{m-1}_N &= \frac{N!}{m!\,(N-m)!} + \frac{N!}{(m-1)!\,(N-(m-1))!} \\
&= \frac{N!}{(m-1)!\,(N-m)!} \Big( \frac{1}{m} + \frac{1}{N-m+1} \Big) \\
&= \frac{(N+1)!}{m!\,(N+1-m)!} = C^m_{N+1}
\end{aligned}
\]
To prove (2.263), we prove a more general form:
\[
(x+y)^N = \sum_{m=0}^{N} C^m_N\, x^m y^{N-m} \qquad (*)
\]
If we let $y = 1$, $(*)$ reduces to (2.263). We prove it by induction. First, $(*)$ obviously holds for $N = 1$. Assuming that it holds for $N$, we prove that it also holds for $N+1$:
\[
\begin{aligned}
(x+y)^{N+1} &= (x+y) \sum_{m=0}^{N} C^m_N\, x^m y^{N-m} \\
&= \sum_{m=0}^{N} C^m_N\, x^{m+1} y^{N-m} + \sum_{m=0}^{N} C^m_N\, x^m y^{N+1-m} \\
&= \sum_{m=1}^{N+1} C^{m-1}_N\, x^m y^{N+1-m} + \sum_{m=0}^{N} C^m_N\, x^m y^{N+1-m} \\
&= \sum_{m=1}^{N} \big( C^{m-1}_N + C^m_N \big) x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\
&= \sum_{m=1}^{N} C^m_{N+1}\, x^m y^{N+1-m} + x^{N+1} + y^{N+1} \\
&= \sum_{m=0}^{N+1} C^m_{N+1}\, x^m y^{N+1-m}
\end{aligned}
\]
This proves $(*)$. Therefore, letting $y = 1$ in $(*)$ proves (2.263), and letting $x = \mu$, $y = 1 - \mu$ proves (2.264).
Problem 2.4 Solution
A solution has already been suggested in the problem, but we solve it in a more direct way, beginning from the definition:
\[
\begin{aligned}
\mathbb{E}[m] &= \sum_{m=0}^{N} m\, C^m_N\, \mu^m (1-\mu)^{N-m} = \sum_{m=1}^{N} m\, C^m_N\, \mu^m (1-\mu)^{N-m} \\
&= \sum_{m=1}^{N} \frac{N!}{(m-1)!\,(N-m)!}\, \mu^m (1-\mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} \frac{(N-1)!}{(m-1)!\,(N-m)!}\, \mu^{m-1} (1-\mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} C^{m-1}_{N-1}\, \mu^{m-1} (1-\mu)^{N-m} \\
&= N\mu \sum_{k=0}^{N-1} C^{k}_{N-1}\, \mu^{k} (1-\mu)^{N-1-k} \\
&= N\mu \big[ \mu + (1-\mu) \big]^{N-1} = N\mu
\end{aligned}
\]
Some details should be explained here. Note that the $m = 0$ term does not contribute to the expectation, so we let the summation begin from $m = 1$ (the first step above). Moreover, in the second-to-last step we re-index the summation by setting $k = m - 1$, and in the last step we use (2.264). The variance is straightforward once the expectation has been calculated:
\[
\begin{aligned}
\mathrm{var}[m] &= \mathbb{E}[m^2] - \mathbb{E}[m]^2 \\
&= \sum_{m=0}^{N} m^2 C^m_N\, \mu^m (1-\mu)^{N-m} - (N\mu) \cdot \sum_{m=0}^{N} m\, C^m_N\, \mu^m (1-\mu)^{N-m} \\
&= \sum_{m=1}^{N} m\, \frac{N!}{(m-1)!\,(N-m)!}\, \mu^m (1-\mu)^{N-m}
- N\mu \sum_{m=1}^{N} m\, C^m_N\, \mu^m (1-\mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} m\, \frac{(N-1)!}{(m-1)!\,(N-m)!}\, \mu^{m-1} (1-\mu)^{N-m}
- N\mu \sum_{m=1}^{N} m\, C^m_N\, \mu^m (1-\mu)^{N-m} \\
&= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1-\mu)^{N-m} \big( C^{m-1}_{N-1} - \mu\, C^m_N \big)
\end{aligned}
\]
Here we use a little trick, $-\mu = -1 + (1-\mu)$, together with the property $C^m_N = C^m_{N-1} + C^{m-1}_{N-1}$:
\[
\begin{aligned}
\mathrm{var}[m] &= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1-\mu)^{N-m} \big[ C^{m-1}_{N-1} - C^m_N + (1-\mu) C^m_N \big] \\
&= N\mu \sum_{m=1}^{N} m\, \mu^{m-1} (1-\mu)^{N-m} \big[ (1-\mu) C^m_N - C^m_{N-1} \big] \\
&= N\mu \Big[ \sum_{m=1}^{N} m\, \mu^{m-1} (1-\mu)^{N-m+1} C^m_N
- \sum_{m=1}^{N} m\, \mu^{m-1} (1-\mu)^{N-m} C^m_{N-1} \Big] \\
&= N\mu \Big[ N(1-\mu)\big( \mu + (1-\mu) \big)^{N-1} - (N-1)(1-\mu)\big( \mu + (1-\mu) \big)^{N-2} \Big] \\
&= N\mu \big[ N(1-\mu) - (N-1)(1-\mu) \big] = N\mu(1-\mu)
\end{aligned}
\]
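An illustrative check of the binomial mean and variance derived above, using SciPy with arbitrary parameters:

```python
from scipy.stats import binom

N, mu = 12, 0.3                               # arbitrary parameters for the check
mean, var = binom.stats(N, mu)
print(mean, var, N * mu, N * mu * (1 - mu))   # 3.6 2.52 3.6 2.52
```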
Problem 2.5 Solution
Hints have already been given in the problem description; we make a slight improvement by introducing $t = y + x$ and $x = t\mu$ at the same time, i.e. we perform the following change of variables:
\[
x = t\mu, \quad y = t(1-\mu)
\qquad\text{and conversely}\qquad
t = x + y, \quad \mu = \frac{x}{x+y}
\]
Note that $t \in [0, +\infty)$ and $\mu \in (0, 1)$, and that when we change variables in an integral we must include the Jacobian determinant:
\[
\frac{\partial(x, y)}{\partial(\mu, t)} =
\begin{vmatrix}
\dfrac{\partial x}{\partial \mu} & \dfrac{\partial x}{\partial t} \\[4pt]
\dfrac{\partial y}{\partial \mu} & \dfrac{\partial y}{\partial t}
\end{vmatrix}
=
\begin{vmatrix}
t & \mu \\
-t & 1-\mu
\end{vmatrix}
= t
\]
Now we can calculate the integral:
\[
\begin{aligned}
\Gamma(a)\Gamma(b) &= \int_{0}^{+\infty} \exp(-x)\, x^{a-1}\, dx \int_{0}^{+\infty} \exp(-y)\, y^{b-1}\, dy \\
&= \int_{0}^{+\infty} \int_{0}^{+\infty} \exp(-x-y)\, x^{a-1} y^{b-1}\, dy\, dx \\
&= \int_{0}^{1} \int_{0}^{+\infty} \exp(-t)\, (t\mu)^{a-1} \big( t(1-\mu) \big)^{b-1}\, t\, dt\, d\mu \\
&= \int_{0}^{+\infty} \exp(-t)\, t^{a+b-1}\, dt \cdot \int_{0}^{1} \mu^{a-1} (1-\mu)^{b-1}\, d\mu \\
&= \Gamma(a+b) \cdot \int_{0}^{1} \mu^{a-1} (1-\mu)^{b-1}\, d\mu
\end{aligned}
\]
Therefore, we have obtained:
\[
\int_{0}^{1} \mu^{a-1} (1-\mu)^{b-1}\, d\mu = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}
\]
Problem 2.6 Solution
We solve this problem by direct calculation from the definitions.
\[
\begin{aligned}
\mathbb{E}[\mu] &= \int_{0}^{1} \mu\, \mathrm{Beta}(\mu|a, b)\, d\mu
= \int_{0}^{1} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \mu^{a} (1-\mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a+b)\Gamma(a+1)}{\Gamma(a+1+b)\Gamma(a)}
\int_{0}^{1} \frac{\Gamma(a+1+b)}{\Gamma(a+1)\Gamma(b)}\, \mu^{a} (1-\mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a+b)\Gamma(a+1)}{\Gamma(a+1+b)\Gamma(a)}
\int_{0}^{1} \mathrm{Beta}(\mu|a+1, b)\, d\mu \\
&= \frac{\Gamma(a+b)}{\Gamma(a+1+b)} \cdot \frac{\Gamma(a+1)}{\Gamma(a)} = \frac{a}{a+b}
\end{aligned}
\]
where we used the property $\Gamma(z+1) = z\,\Gamma(z)$. The variance is quite similar; we first evaluate $\mathbb{E}[\mu^2]$:
\[
\begin{aligned}
\mathbb{E}[\mu^2] &= \int_{0}^{1} \mu^2\, \mathrm{Beta}(\mu|a, b)\, d\mu
= \int_{0}^{1} \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \mu^{a+1} (1-\mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a+b)\Gamma(a+2)}{\Gamma(a+2+b)\Gamma(a)}
\int_{0}^{1} \frac{\Gamma(a+2+b)}{\Gamma(a+2)\Gamma(b)}\, \mu^{a+1} (1-\mu)^{b-1}\, d\mu \\
&= \frac{\Gamma(a+b)\Gamma(a+2)}{\Gamma(a+2+b)\Gamma(a)}
\int_{0}^{1} \mathrm{Beta}(\mu|a+2, b)\, d\mu \\
&= \frac{\Gamma(a+b)}{\Gamma(a+2+b)} \cdot \frac{\Gamma(a+2)}{\Gamma(a)} = \frac{a(a+1)}{(a+b)(a+b+1)}
\end{aligned}
\]
Then, using $\mathrm{var}[\mu] = \mathbb{E}[\mu^2] - \mathbb{E}[\mu]^2$:
\[
\mathrm{var}[\mu] = \frac{a(a+1)}{(a+b)(a+b+1)} - \Big( \frac{a}{a+b} \Big)^2
= \frac{ab}{(a+b)^2(a+b+1)}
\]
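A quick check of the Beta mean and variance formulas with SciPy (arbitrary shape parameters):

```python
from scipy.stats import beta

a, b = 2.0, 5.0                                   # arbitrary shape parameters
mean, var = beta.stats(a, b)
print(mean, a / (a + b))                          # 0.2857...  0.2857...
print(var, a * b / ((a + b) ** 2 * (a + b + 1)))  # 0.0255...  0.0255...
```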
Problem 2.7 Solution
35
The maximum likelihood estimation for µ, i.e. (2.8), can be written as :
µML =m
m+l
Where mrepresents how many times we observe ’head’, lrepresents how
many times we observe ’tail’. And the prior mean of µis given by (2.15), the
posterior mean value of xis given by (2.20). Therefore, we will prove that
(m+a)/(m+a+l+b) lies between m/ (m+l), a/(a+b). Given the fact that :
λa
a+b+(1 λ)m
m+l=m+a
m+a+l+bwhere λ=a+b
m+l+a+b
We have solved problem. Note : you can also solve it in a more simple way
by prove that :
(m+a
m+a+l+ba
a+b)·(m+a
m+a+l+bm
m+l)0
The expression above can be proved by reduction of fractions to a common
denominator.
Problem 2.8 Solution
We solve this using the definitions.
\[
\begin{aligned}
\mathbb{E}_y\big[ \mathbb{E}_x[x|y] \big] &= \int \mathbb{E}_x[x|y]\, p(y)\, dy
= \int \Big( \int x\, p(x|y)\, dx \Big) p(y)\, dy \\
&= \iint x\, p(x|y)\, p(y)\, dx\, dy = \iint x\, p(x, y)\, dx\, dy = \int x\, p(x)\, dx = \mathbb{E}[x]
\end{aligned}
\]
(2.271) is more involved, and we calculate each term separately.
\[
\begin{aligned}
\mathbb{E}_y\big[ \mathrm{var}_x[x|y] \big] &= \int \mathrm{var}_x[x|y]\, p(y)\, dy
= \int \Big( \int \big( x - \mathbb{E}_x[x|y] \big)^2 p(x|y)\, dx \Big) p(y)\, dy \\
&= \iint \big( x^2 - 2x\,\mathbb{E}_x[x|y] + \mathbb{E}_x[x|y]^2 \big)\, p(x, y)\, dx\, dy \\
&= \int x^2 p(x)\, dx - \iint 2x\,\mathbb{E}_x[x|y]\, p(x, y)\, dx\, dy + \int \mathbb{E}_x[x|y]^2\, p(y)\, dy
\end{aligned}
\]
We further simplify the second term of the equation above:
\[
\iint 2x\,\mathbb{E}_x[x|y]\, p(x, y)\, dx\, dy
= \int 2\,\mathbb{E}_x[x|y] \Big( \int x\, p(x, y)\, dx \Big) dy
= \int 2\,\mathbb{E}_x[x|y]\, p(y) \Big( \int x\, p(x|y)\, dx \Big) dy
= \int 2\,\mathbb{E}_x[x|y]^2\, p(y)\, dy
\]
Therefore, we obtain a simple expression for the first term on the right-hand side of (2.271):
\[
\mathbb{E}_y\big[ \mathrm{var}_x[x|y] \big] = \int x^2 p(x)\, dx - \int \mathbb{E}_x[x|y]^2\, p(y)\, dy \qquad (*)
\]
Then we process the second term:
\[
\begin{aligned}
\mathrm{var}_y\big[ \mathbb{E}_x[x|y] \big]
&= \int \big( \mathbb{E}_x[x|y] - \mathbb{E}_y[\mathbb{E}_x[x|y]] \big)^2 p(y)\, dy
= \int \big( \mathbb{E}_x[x|y] - \mathbb{E}[x] \big)^2 p(y)\, dy \\
&= \int \mathbb{E}_x[x|y]^2\, p(y)\, dy - 2\,\mathbb{E}[x] \int \mathbb{E}_x[x|y]\, p(y)\, dy + \mathbb{E}[x]^2
\end{aligned}
\]
Following the same procedure, the middle term of the equation above simplifies via
\[
2\,\mathbb{E}[x] \int \mathbb{E}_x[x|y]\, p(y)\, dy = 2\,\mathbb{E}[x]\, \mathbb{E}_y\big[ \mathbb{E}_x[x|y] \big] = 2\,\mathbb{E}[x]^2
\]
Therefore, we obtain a simple expression for the second term on the right-hand side of (2.271):
\[
\mathrm{var}_y\big[ \mathbb{E}_x[x|y] \big] = \int \mathbb{E}_x[x|y]^2\, p(y)\, dy - \mathbb{E}[x]^2 \qquad (**)
\]
Finally, adding $(*)$ and $(**)$:
\[
\mathbb{E}_y\big[ \mathrm{var}_x[x|y] \big] + \mathrm{var}_y\big[ \mathbb{E}_x[x|y] \big]
= \mathbb{E}[x^2] - \mathbb{E}[x]^2 = \mathrm{var}[x]
\]
Problem 2.9 Solution
This problem is involved, but hints have already been given in the description. Let us begin by integrating (2.272) over $\mu_{M-1}$. (Note: by integrating out $\mu_{M-1}$ we actually obtain a Dirichlet distribution over $M-1$ variables.)
\[
\begin{aligned}
p_{M-1}(\mu_1, \ldots, \mu_{M-2})
&= \int_{0}^{1 - \mu_1 - \cdots - \mu_{M-2}} C_M \prod_{k=1}^{M-1} \mu_k^{\alpha_k - 1} \Big( 1 - \sum_{j=1}^{M-1} \mu_j \Big)^{\alpha_M - 1} d\mu_{M-1} \\
&= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1}
\int_{0}^{1 - \mu_1 - \cdots - \mu_{M-2}} \mu_{M-1}^{\alpha_{M-1} - 1} \Big( 1 - \sum_{j=1}^{M-1} \mu_j \Big)^{\alpha_M - 1} d\mu_{M-1}
\end{aligned}
\]
We change variables by
\[
t = \frac{\mu_{M-1}}{1 - \mu_1 - \cdots - \mu_{M-2}}
\]
The reason is that $\mu_{M-1} \in [0,\, 1 - \mu_1 - \cdots - \mu_{M-2}]$; with this change of variable we have $t \in [0, 1]$. We can then further simplify the expression:
\[
\begin{aligned}
p_{M-1} &= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1}
\Big( 1 - \sum_{j=1}^{M-2} \mu_j \Big)^{\alpha_{M-1} + \alpha_M - 1}
\int_{0}^{1} t^{\alpha_{M-1} - 1} (1-t)^{\alpha_M - 1}\, dt \\
&= C_M \prod_{k=1}^{M-2} \mu_k^{\alpha_k - 1}
\Big( 1 - \sum_{j=1}^{M-2} \mu_j \Big)^{\alpha_{M-1} + \alpha_M - 1}
\frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)}
\end{aligned}
\]
Comparing the expression above with a normalized Dirichlet distribution over $M-1$ variables, and supposing that (2.272) holds for $M-1$, we obtain:
\[
C_M\, \frac{\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}{\Gamma(\alpha_{M-1} + \alpha_M)}
= \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_M)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_{M-1} + \alpha_M)}
\]
Therefore we obtain
\[
C_M = \frac{\Gamma(\alpha_1 + \alpha_2 + \cdots + \alpha_M)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_{M-1})\Gamma(\alpha_M)}
\]
as required.
Problem 2.10 Solution
Based on the definition of expectation and (2.38), we can write:
\[
\begin{aligned}
\mathbb{E}[\mu_j] &= \int \mu_j\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu}
= \int \mu_j\, \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)} \prod_{k=1}^{K} \mu_k^{\alpha_k - 1}\, d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)}
\cdot \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_{j-1})\Gamma(\alpha_j+1)\Gamma(\alpha_{j+1})\cdots\Gamma(\alpha_K)}{\Gamma(\alpha_0 + 1)} \\
&= \frac{\Gamma(\alpha_0)\Gamma(\alpha_j + 1)}{\Gamma(\alpha_j)\Gamma(\alpha_0 + 1)} = \frac{\alpha_j}{\alpha_0}
\end{aligned}
\]
The variance is quite similar; we begin by calculating $\mathbb{E}[\mu_j^2]$:
\[
\begin{aligned}
\mathbb{E}[\mu_j^2] &= \int \mu_j^2\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)}
\cdot \frac{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_{j-1})\Gamma(\alpha_j+2)\Gamma(\alpha_{j+1})\cdots\Gamma(\alpha_K)}{\Gamma(\alpha_0 + 2)} \\
&= \frac{\Gamma(\alpha_0)\Gamma(\alpha_j + 2)}{\Gamma(\alpha_j)\Gamma(\alpha_0 + 2)} = \frac{\alpha_j(\alpha_j + 1)}{\alpha_0(\alpha_0 + 1)}
\end{aligned}
\]
Hence we obtain:
\[
\mathrm{var}[\mu_j] = \mathbb{E}[\mu_j^2] - \mathbb{E}[\mu_j]^2
= \frac{\alpha_j(\alpha_j + 1)}{\alpha_0(\alpha_0 + 1)} - \Big( \frac{\alpha_j}{\alpha_0} \Big)^2
= \frac{\alpha_j(\alpha_0 - \alpha_j)}{\alpha_0^2(\alpha_0 + 1)}
\]
It is similar for the covariance:
\[
\begin{aligned}
\mathrm{cov}[\mu_j \mu_l] &= \int \big( \mu_j - \mathbb{E}[\mu_j] \big)\big( \mu_l - \mathbb{E}[\mu_l] \big)\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu} \\
&= \int \big( \mu_j \mu_l - \mathbb{E}[\mu_j]\mu_l - \mathbb{E}[\mu_l]\mu_j + \mathbb{E}[\mu_j]\mathbb{E}[\mu_l] \big)\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu} \\
&= \frac{\Gamma(\alpha_0)\Gamma(\alpha_j + 1)\Gamma(\alpha_l + 1)}{\Gamma(\alpha_j)\Gamma(\alpha_l)\Gamma(\alpha_0 + 2)} - 2\,\mathbb{E}[\mu_j]\mathbb{E}[\mu_l] + \mathbb{E}[\mu_j]\mathbb{E}[\mu_l] \\
&= \frac{\alpha_j \alpha_l}{\alpha_0(\alpha_0 + 1)} - \mathbb{E}[\mu_j]\mathbb{E}[\mu_l]
= \frac{\alpha_j \alpha_l}{\alpha_0(\alpha_0 + 1)} - \frac{\alpha_j \alpha_l}{\alpha_0^2}
= -\frac{\alpha_j \alpha_l}{\alpha_0^2(\alpha_0 + 1)} \qquad (j \neq l)
\end{aligned}
\]
Note: when $j = l$, $\mathrm{cov}[\mu_j \mu_l]$ reduces to $\mathrm{var}[\mu_j]$; however, we cannot simply replace $l$ with $j$ in the expression for $\mathrm{cov}[\mu_j \mu_l]$ to get the right result, because in that case $\int \mu_j \mu_l\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu}$ reduces to $\int \mu_j^2\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu}$ in this case.
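An illustrative check of the Dirichlet mean and variance formulas with SciPy (arbitrary concentration parameters):

```python
import numpy as np
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 5.0])     # arbitrary concentration parameters
a0 = alpha.sum()
print(dirichlet.mean(alpha), alpha / a0)
print(dirichlet.var(alpha),  alpha * (a0 - alpha) / (a0**2 * (a0 + 1)))
```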
Problem 2.11 Solution
Using the definition of expectation and (2.38), we first denote
\[
K(\boldsymbol{\alpha}) = \frac{\Gamma(\alpha_0)}{\Gamma(\alpha_1)\Gamma(\alpha_2)\cdots\Gamma(\alpha_K)}
\]
Then we can write:
\[
\begin{aligned}
\frac{\partial\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})}{\partial \alpha_j}
&= \frac{\partial}{\partial \alpha_j} \Big( K(\boldsymbol{\alpha}) \prod_{i=1}^{K} \mu_i^{\alpha_i - 1} \Big) \\
&= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}
+ K(\boldsymbol{\alpha})\, \frac{\partial \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}}{\partial \alpha_j} \\
&= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}
+ \ln\mu_j \cdot \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})
\end{aligned}
\]
Now we integrate both sides over $\boldsymbol{\mu}$:
\[
\int \frac{\partial\, \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})}{\partial \alpha_j}\, d\boldsymbol{\mu}
= \int \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}\, d\boldsymbol{\mu}
+ \int \ln\mu_j \cdot \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu}
\]
The left-hand side simplifies to:
\[
\text{LHS} = \frac{\partial \int \mathrm{Dir}(\boldsymbol{\mu}|\boldsymbol{\alpha})\, d\boldsymbol{\mu}}{\partial \alpha_j}
= \frac{\partial\, 1}{\partial \alpha_j} = 0
\]
The right-hand side simplifies to:
\[
\text{RHS} = \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \int \prod_{i=1}^{K} \mu_i^{\alpha_i - 1}\, d\boldsymbol{\mu} + \mathbb{E}[\ln\mu_j]
= \frac{\partial K(\boldsymbol{\alpha})}{\partial \alpha_j} \cdot \frac{1}{K(\boldsymbol{\alpha})} + \mathbb{E}[\ln\mu_j]
= \frac{\partial \ln K(\boldsymbol{\alpha})}{\partial \alpha_j} + \mathbb{E}[\ln\mu_j]
\]
Therefore we obtain:
\[
\begin{aligned}
\mathbb{E}[\ln\mu_j] &= -\frac{\partial \ln K(\boldsymbol{\alpha})}{\partial \alpha_j}
= -\frac{\partial \big( \ln\Gamma(\alpha_0) - \sum_{i=1}^{K} \ln\Gamma(\alpha_i) \big)}{\partial \alpha_j} \\
&= \frac{\partial \ln\Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln\Gamma(\alpha_0)}{\partial \alpha_j}
= \frac{\partial \ln\Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln\Gamma(\alpha_0)}{\partial \alpha_0}\frac{\partial \alpha_0}{\partial \alpha_j}
= \frac{\partial \ln\Gamma(\alpha_j)}{\partial \alpha_j} - \frac{\partial \ln\Gamma(\alpha_0)}{\partial \alpha_0} \\
&= \psi(\alpha_j) - \psi(\alpha_0)
\end{aligned}
\]
Therefore, the problem has been solved.
Problem 2.12 Solution
Since we have
\[
\int_{a}^{b} \frac{1}{b-a}\, dx = 1
\]
it is straightforward that the distribution is normalized. Then we calculate its mean:
\[
\mathbb{E}[x] = \int_{a}^{b} x\, \frac{1}{b-a}\, dx
= \left. \frac{x^2}{2(b-a)} \right|_{a}^{b} = \frac{a+b}{2}
\]
Then we calculate its variance:
\[
\mathrm{var}[x] = \mathbb{E}[x^2] - \mathbb{E}[x]^2
= \int_{a}^{b} \frac{x^2}{b-a}\, dx - \Big( \frac{a+b}{2} \Big)^2
= \left. \frac{x^3}{3(b-a)} \right|_{a}^{b} - \Big( \frac{a+b}{2} \Big)^2
\]
Hence we obtain:
\[
\mathrm{var}[x] = \frac{(b-a)^2}{12}
\]
Problem 2.13 Solution
This problem is an extension of Prob. 1.30, and we follow the same procedure. Let us begin by calculating $\ln\frac{p(\mathbf{x})}{q(\mathbf{x})}$:
\[
\ln\Big( \frac{p(\mathbf{x})}{q(\mathbf{x})} \Big) = \frac{1}{2}\ln\Big( \frac{|\mathbf{L}|}{|\Sigma|} \Big)
+ \frac{1}{2}(\mathbf{x} - \mathbf{m})^T \mathbf{L}^{-1} (\mathbf{x} - \mathbf{m})
- \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})
\]
If $\mathbf{x} \sim p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$, we use the following properties:
\[
\int p(\mathbf{x})\, d\mathbf{x} = 1, \qquad
\mathbb{E}[\mathbf{x}] = \int \mathbf{x}\, p(\mathbf{x})\, d\mathbf{x} = \boldsymbol{\mu}, \qquad
\mathbb{E}\big[ (\mathbf{x} - \mathbf{a})^T \mathbf{A} (\mathbf{x} - \mathbf{a}) \big]
= \mathrm{tr}(\mathbf{A}\Sigma) + (\boldsymbol{\mu} - \mathbf{a})^T \mathbf{A} (\boldsymbol{\mu} - \mathbf{a})
\]
We obtain:
\[
\begin{aligned}
\mathrm{KL} &= \int \Big[ \frac{1}{2}\ln\frac{|\mathbf{L}|}{|\Sigma|}
- \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})
+ \frac{1}{2}(\mathbf{x} - \mathbf{m})^T \mathbf{L}^{-1} (\mathbf{x} - \mathbf{m}) \Big] p(\mathbf{x})\, d\mathbf{x} \\
&= \frac{1}{2}\ln\frac{|\mathbf{L}|}{|\Sigma|}
- \frac{1}{2}\mathbb{E}\big[ (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \big]
+ \frac{1}{2}\mathbb{E}\big[ (\mathbf{x} - \mathbf{m})^T \mathbf{L}^{-1} (\mathbf{x} - \mathbf{m}) \big] \\
&= \frac{1}{2}\ln\frac{|\mathbf{L}|}{|\Sigma|}
- \frac{1}{2}\mathrm{tr}\{\mathbf{I}_D\}
+ \frac{1}{2}(\boldsymbol{\mu} - \mathbf{m})^T \mathbf{L}^{-1} (\boldsymbol{\mu} - \mathbf{m})
+ \frac{1}{2}\mathrm{tr}\{\mathbf{L}^{-1}\Sigma\} \\
&= \frac{1}{2}\Big[ \ln\frac{|\mathbf{L}|}{|\Sigma|} - D + \mathrm{tr}\{\mathbf{L}^{-1}\Sigma\}
+ (\mathbf{m} - \boldsymbol{\mu})^T \mathbf{L}^{-1} (\mathbf{m} - \boldsymbol{\mu}) \Big]
\end{aligned}
\]
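An illustrative Monte Carlo check of the multivariate Gaussian KL formula just derived, with made-up parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 2
mu, Sigma = np.array([0.0, 1.0]), np.array([[1.0, 0.3], [0.3, 0.5]])   # p
m,  L     = np.array([0.5, 0.0]), np.array([[2.0, 0.0], [0.0, 1.5]])   # q

kl_formula = 0.5 * (np.log(np.linalg.det(L) / np.linalg.det(Sigma)) - D
                    + np.trace(np.linalg.inv(L) @ Sigma)
                    + (m - mu) @ np.linalg.inv(L) @ (m - mu))

x = rng.multivariate_normal(mu, Sigma, 500_000)        # samples from p

def logpdf(x, mean, cov):
    d = x - mean
    return (-0.5 * np.sum(d @ np.linalg.inv(cov) * d, axis=1)
            - 0.5 * np.log((2 * np.pi) ** D * np.linalg.det(cov)))

kl_mc = np.mean(logpdf(x, mu, Sigma) - logpdf(x, m, L))
print(kl_formula, kl_mc)    # should agree to ~2-3 decimals
```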
Problem 2.14 Solution
The hint given in the problem is straightforward but somewhat tedious to carry out; here we use a simpler method, taking advantage of the properties of the Kullback-Leibler divergence. Let $g(\mathbf{x})$ be a Gaussian PDF with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, and $f(\mathbf{x})$ an arbitrary PDF with the same mean and covariance.
\[
0 \leq \mathrm{KL}(f\|g) = -\int f(\mathbf{x}) \ln\frac{g(\mathbf{x})}{f(\mathbf{x})}\, d\mathbf{x}
= -H(f) - \int f(\mathbf{x}) \ln g(\mathbf{x})\, d\mathbf{x} \qquad (*)
\]
Let us calculate the second term of the equation above:
\[
\begin{aligned}
\int f(\mathbf{x}) \ln g(\mathbf{x})\, d\mathbf{x}
&= \int f(\mathbf{x}) \ln\Big[ \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
\exp\Big( -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big) \Big]\, d\mathbf{x} \\
&= \int f(\mathbf{x}) \ln\Big( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \Big)\, d\mathbf{x}
- \int f(\mathbf{x})\, \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\, d\mathbf{x} \\
&= \ln\Big( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \Big)
- \frac{1}{2}\mathbb{E}\big[ (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \big] \\
&= \ln\Big( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \Big) - \frac{1}{2}\mathrm{tr}\{\mathbf{I}_D\} \\
&= -\frac{1}{2}\ln|\Sigma| - \frac{D}{2}\big( 1 + \ln(2\pi) \big) = -H(g)
\end{aligned}
\]
Here we used two properties of the PDF $f(\mathbf{x})$ with mean $\boldsymbol{\mu}$ and covariance $\Sigma$, listed below, and also the result of Prob. 2.15, which we will prove later:
\[
\int f(\mathbf{x})\, d\mathbf{x} = 1, \qquad
\mathbb{E}\big[ (\mathbf{x} - \mathbf{a})^T \mathbf{A} (\mathbf{x} - \mathbf{a}) \big]
= \mathrm{tr}(\mathbf{A}\Sigma) + (\boldsymbol{\mu} - \mathbf{a})^T \mathbf{A} (\boldsymbol{\mu} - \mathbf{a})
\]
Now we can further simplify $(*)$ to obtain
\[
H(g) \geq H(f)
\]
In other words, we have proved that the entropy of an arbitrary PDF $f(\mathbf{x})$ with the same mean and covariance as a Gaussian PDF $g(\mathbf{x})$ cannot be greater than that of the Gaussian.
Problem 2.15 Solution
We have already used the result of this problem in Prob. 2.14, and now we prove it. Suppose $\mathbf{x} \sim p(\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}, \Sigma)$:
\[
\begin{aligned}
H[\mathbf{x}] &= -\int p(\mathbf{x}) \ln p(\mathbf{x})\, d\mathbf{x} \\
&= -\int p(\mathbf{x}) \ln\Big[ \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}}
\exp\Big( -\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \Big) \Big]\, d\mathbf{x} \\
&= -\int p(\mathbf{x}) \ln\Big( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \Big)\, d\mathbf{x}
+ \int p(\mathbf{x})\, \frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\, d\mathbf{x} \\
&= -\ln\Big( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \Big)
+ \frac{1}{2}\mathbb{E}\big[ (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \big] \\
&= -\ln\Big( \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \Big) + \frac{1}{2}\mathrm{tr}\{\mathbf{I}_D\} \\
&= \frac{1}{2}\ln|\Sigma| + \frac{D}{2}\big( 1 + \ln(2\pi) \big)
\end{aligned}
\]
where we used
\[
\int p(\mathbf{x})\, d\mathbf{x} = 1, \qquad
\mathbb{E}\big[ (\mathbf{x} - \mathbf{a})^T \mathbf{A} (\mathbf{x} - \mathbf{a}) \big]
= \mathrm{tr}(\mathbf{A}\Sigma) + (\boldsymbol{\mu} - \mathbf{a})^T \mathbf{A} (\boldsymbol{\mu} - \mathbf{a})
\]
Note: in Prob. 2.14 we have effectively already solved this problem; you can see this intuitively by replacing the integrand $f(\mathbf{x})\ln g(\mathbf{x})$ with $g(\mathbf{x})\ln g(\mathbf{x})$, and the same procedure used in Prob. 2.14 still applies to calculate $\int g(\mathbf{x})\ln g(\mathbf{x})\, d\mathbf{x}$.
Problem 2.16 Solution
Let us consider a more general conclusion about the probability density function (PDF) of the sum of two independent random variables. Denote two random variables $X$ and $Y$; their sum $Z = X + Y$ is still a random variable. We also denote by $f(\cdot)$ a PDF and by $F(\cdot)$ a cumulative distribution function (CDF). We can write:
\[
F_Z(z) = P(Z < z) = \iint_{x+y \leq z} f_{X,Y}(x, y)\, dx\, dy
\]
where $z$ is an arbitrary real number. We rewrite the double integral as an iterated integral:
\[
F_Z(z) = \int_{-\infty}^{+\infty} \int_{-\infty}^{z-y} f_{X,Y}(x, y)\, dx\, dy
\]
We fix $z$ and $y$, and make the change of variable $x = u - y$ in the inner integral:
\[
F_Z(z) = \int_{-\infty}^{+\infty} \int_{-\infty}^{z} f_{X,Y}(u - y, y)\, du\, dy
\]
Note that $f_{X,Y}(\cdot)$ is the joint PDF of $X$ and $Y$. Rearranging the order of integration, we obtain:
\[
F_Z(z) = \int_{-\infty}^{z} \int_{-\infty}^{+\infty} f_{X,Y}(u - y, y)\, dy\, du
\]
Comparing the equation above with the definition of the CDF,
\[
F_Z(z) = \int_{-\infty}^{z} f_Z(u)\, du
\]
we conclude:
\[
f_Z(u) = \int_{-\infty}^{+\infty} f_{X,Y}(u - y, y)\, dy
\]
And if $X$ and $Y$ are independent, i.e. $f_{X,Y}(x, y) = f_X(x) f_Y(y)$, this simplifies to
\[
f_Z(u) = \int_{-\infty}^{+\infty} f_X(u - y) f_Y(y)\, dy, \qquad \text{i.e.} \quad f_Z = f_X * f_Y
\]
So far we have shown that the PDF of the sum of two independent random variables is the convolution of their PDFs. Hence, in this problem, where the random variable $x$ is the sum of the random variables $x_1$ and $x_2$, the PDF of $x$ is the convolution of the PDFs of $x_1$ and $x_2$. To find the entropy of $x$, we use a simple approach based on (2.113)-(2.117). With
\[
p(x_2) = \mathcal{N}(\mu_2, \tau_2^{-1}), \qquad p(x|x_2) = \mathcal{N}(\mu_1 + x_2, \tau_1^{-1})
\]
we make the analogy: $x_2$ in this problem corresponds to $x$ in (2.113), and $x$ in this problem to $y$ in (2.114). Hence, by (2.115), $p(x)$ is still a Gaussian; and since the entropy of a Gaussian is fully determined by its variance, there is no need to calculate the mean. Again by (2.115), the variance of $x$ is $\tau_1^{-1} + \tau_2^{-1}$, which finally gives the entropy:
\[
H[x] = \frac{1}{2}\Big[ 1 + \ln 2\pi\big( \tau_1^{-1} + \tau_2^{-1} \big) \Big]
\]
Problem 2.17 Solution
This is an extension of Prob. 1.14, and the same procedure can be used. Suppose an arbitrary precision matrix $\Lambda$ can be written as $\Lambda^S + \Lambda^A$, where
\[
\Lambda^S_{ij} = \frac{\Lambda_{ij} + \Lambda_{ji}}{2}, \qquad \Lambda^A_{ij} = \frac{\Lambda_{ij} - \Lambda_{ji}}{2}
\]
It is then straightforward that $\Lambda^S_{ij} = \Lambda^S_{ji}$ and $\Lambda^A_{ij} = -\Lambda^A_{ji}$. Expanding the quadratic form in the exponent:
\[
(\mathbf{x} - \boldsymbol{\mu})^T \Lambda (\mathbf{x} - \boldsymbol{\mu})
= \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda_{ij} (x_j - \mu_j) \qquad (*)
\]
It follows that
\[
(*) = \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda^S_{ij} (x_j - \mu_j)
+ \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda^A_{ij} (x_j - \mu_j)
= \sum_{i=1}^{D} \sum_{j=1}^{D} (x_i - \mu_i) \Lambda^S_{ij} (x_j - \mu_j)
\]
since the antisymmetric part contributes nothing, exactly as shown in Prob. 1.14. Therefore, we can take the precision matrix to be symmetric, and hence so is the covariance matrix.
Problem 2.18 Solution
We follow the hint given in the problem. Firstly, we take the complex conjugate on both sides of (2.45):
\[
\Sigma \mathbf{u}_i = \lambda_i \mathbf{u}_i \;\Rightarrow\; \Sigma \mathbf{u}_i^{*} = \lambda_i^{*} \mathbf{u}_i^{*}
\]
where we used the fact that $\Sigma$ is a real matrix, i.e. $\Sigma^{*} = \Sigma$. Then, using that $\Sigma$ is symmetric, i.e. $\Sigma^T = \Sigma$:
\[
\mathbf{u}_i^{*T} \Sigma \mathbf{u}_i = \mathbf{u}_i^{*T} (\Sigma \mathbf{u}_i) = \mathbf{u}_i^{*T} (\lambda_i \mathbf{u}_i) = \lambda_i\, \mathbf{u}_i^{*T} \mathbf{u}_i
\]
\[
\mathbf{u}_i^{*T} \Sigma \mathbf{u}_i = (\Sigma \mathbf{u}_i^{*})^T \mathbf{u}_i = (\lambda_i^{*} \mathbf{u}_i^{*})^T \mathbf{u}_i = \lambda_i^{*}\, \mathbf{u}_i^{*T} \mathbf{u}_i
\]
Since $\mathbf{u}_i \neq \mathbf{0}$, we have $\mathbf{u}_i^{*T} \mathbf{u}_i \neq 0$. Thus $\lambda_i = \lambda_i^{*}$, which means $\lambda_i$ is real. Next we prove that two eigenvectors corresponding to different eigenvalues are orthogonal:
\[
\lambda_i \langle \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \lambda_i \mathbf{u}_i, \mathbf{u}_j \rangle
= \langle \Sigma \mathbf{u}_i, \mathbf{u}_j \rangle = \langle \mathbf{u}_i, \Sigma^T \mathbf{u}_j \rangle
= \langle \mathbf{u}_i, \Sigma \mathbf{u}_j \rangle = \lambda_j \langle \mathbf{u}_i, \mathbf{u}_j \rangle
\]
where we used $\Sigma^T = \Sigma$ and the fact that, for an arbitrary real matrix $\mathbf{A}$ and vectors $\mathbf{x}$, $\mathbf{y}$,
\[
\langle \mathbf{A}\mathbf{x}, \mathbf{y} \rangle = \langle \mathbf{x}, \mathbf{A}^T \mathbf{y} \rangle
\]
Provided $\lambda_i \neq \lambda_j$, we have $\langle \mathbf{u}_i, \mathbf{u}_j \rangle = 0$, i.e. $\mathbf{u}_i$ and $\mathbf{u}_j$ are orthogonal. Then, if we normalize every eigenvector so that its Euclidean norm equals 1, (2.46) follows immediately. By normalization we mean multiplying the eigenvector by a real number $a$ so that its Euclidean norm (length) equals 1; note that rescaling an eigenvector leaves its eigenvalue unchanged.
Problem 2.19 Solution
For every $N \times N$ real symmetric matrix, the eigenvalues are real and the eigenvectors can be chosen to be mutually orthogonal. Thus a real symmetric matrix $\Sigma$ can be decomposed as $\Sigma = \mathbf{U} \Lambda \mathbf{U}^T$, where $\mathbf{U}$ is an orthogonal matrix and $\Lambda$ is a diagonal matrix whose entries are the eigenvalues of $\Sigma$. Hence, for an arbitrary vector $\mathbf{x}$:
\[
\Sigma \mathbf{x} = \mathbf{U} \Lambda \mathbf{U}^T \mathbf{x}
= \mathbf{U} \Lambda
\begin{pmatrix}
\mathbf{u}_1^T \mathbf{x} \\
\vdots \\
\mathbf{u}_D^T \mathbf{x}
\end{pmatrix}
= \mathbf{U}
\begin{pmatrix}
\lambda_1 \mathbf{u}_1^T \mathbf{x} \\
\vdots \\
\lambda_D \mathbf{u}_D^T \mathbf{x}
\end{pmatrix}
= \Big( \sum_{k=1}^{D} \lambda_k \mathbf{u}_k \mathbf{u}_k^T \Big) \mathbf{x}
\]
And since $\Sigma^{-1} = \mathbf{U} \Lambda^{-1} \mathbf{U}^T$, the same procedure can be used to prove (2.49).
Problem 2.20 Solution
Since $\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_D$ constitute a basis of $\mathbb{R}^D$, we can expand $\mathbf{a}$ as
\[
\mathbf{a} = a_1 \mathbf{u}_1 + a_2 \mathbf{u}_2 + \cdots + a_D \mathbf{u}_D
\]
Substituting this expansion into $\mathbf{a}^T \Sigma \mathbf{a}$ and using the property $\mathbf{u}_i^T \mathbf{u}_j = 1$ if $i = j$ and $0$ otherwise, we obtain:
\[
\begin{aligned}
\mathbf{a}^T \Sigma \mathbf{a} &= (a_1 \mathbf{u}_1 + \cdots + a_D \mathbf{u}_D)^T \Sigma (a_1 \mathbf{u}_1 + \cdots + a_D \mathbf{u}_D) \\
&= (a_1 \mathbf{u}_1^T + \cdots + a_D \mathbf{u}_D^T)(a_1 \lambda_1 \mathbf{u}_1 + \cdots + a_D \lambda_D \mathbf{u}_D) \\
&= \lambda_1 a_1^2 + \lambda_2 a_2^2 + \cdots + \lambda_D a_D^2
\end{aligned}
\]
Since $\mathbf{a}$ is real, the expression above is strictly positive for any non-zero $\mathbf{a}$ if all eigenvalues are strictly positive. Conversely, if some eigenvalue $\lambda_i$ is zero or negative, there exists a vector $\mathbf{a}$ (e.g. $\mathbf{a} = \mathbf{u}_i$) for which the expression is not greater than 0. Thus, having all eigenvalues strictly positive is a necessary and sufficient condition for a real symmetric matrix to be positive definite.
Problem 2.21 Solution
It is straightforward. For a symmetric matrix $\Lambda$ of size $D \times D$, once the lower triangular part is specified, the whole matrix is determined by symmetry. Hence the number of independent parameters is $D + (D-1) + \cdots + 1 = D(D+1)/2$.
Problem 2.22 Solution
Suppose $\mathbf{A}$ is a symmetric matrix; we need to prove that $\mathbf{A}^{-1}$ is also symmetric, i.e. $\mathbf{A}^{-1} = (\mathbf{A}^{-1})^T$. Since the identity matrix $\mathbf{I}$ is symmetric, we have
\[
\mathbf{A}\mathbf{A}^{-1} = \big( \mathbf{A}\mathbf{A}^{-1} \big)^T
\]
Since $(\mathbf{A}\mathbf{B})^T = \mathbf{B}^T \mathbf{A}^T$ holds for arbitrary matrices $\mathbf{A}$ and $\mathbf{B}$, we obtain
\[
\mathbf{A}\mathbf{A}^{-1} = (\mathbf{A}^{-1})^T \mathbf{A}^T
\]
Since $\mathbf{A} = \mathbf{A}^T$, we substitute on the right-hand side:
\[
\mathbf{A}\mathbf{A}^{-1} = (\mathbf{A}^{-1})^T \mathbf{A}
\]
Noting that $\mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1}\mathbf{A} = \mathbf{I}$, we rearrange the left-hand side:
\[
\mathbf{A}^{-1}\mathbf{A} = (\mathbf{A}^{-1})^T \mathbf{A}
\]
Finally, multiplying both sides by $\mathbf{A}^{-1}$ on the right and using $\mathbf{A}\mathbf{A}^{-1} = \mathbf{I}$, we obtain
\[
\mathbf{A}^{-1} = (\mathbf{A}^{-1})^T
\]
Problem 2.23 Solution
Let us reformulate the problem. What the problem asks us to prove is: if $(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) = r^2$, where $r^2$ is a constant, then the volume of the hyperellipsoid defined by this equation equals $V_D |\Sigma|^{1/2} r^D$. Note that the centre of this hyperellipsoid is located at $\boldsymbol{\mu}$, and a translation does not change the volume; thus we only need to prove that the volume of the hyperellipsoid defined by $\mathbf{x}^T \Sigma^{-1} \mathbf{x} = r^2$, centred at $\mathbf{0}$, equals $V_D |\Sigma|^{1/2} r^D$.
The problem can be viewed in two parts. Firstly, consider $V_D$, the volume of the unit sphere in dimension $D$. The expression for $V_D$ has already been given in the solution of Prob. 1.18, i.e. (1.144):
\[
V_D = \frac{S_D}{D} = \frac{2\pi^{D/2}}{D\,\Gamma(\frac{D}{2})} = \frac{\pi^{D/2}}{\Gamma(\frac{D}{2} + 1)}
\]
In that solution we also showed that a $D$-dimensional sphere with radius $r$, i.e. $\mathbf{x}^T\mathbf{x} = r^2$, has volume $V(r) = V_D r^D$. We now go a step further: we apply the linear transformation $\mathbf{x} = \Sigma^{1/2}\mathbf{y}$ to the sphere $\mathbf{y}^T\mathbf{y} = r^2$, which turns it into the hyperellipsoid $\mathbf{x}^T \Sigma^{-1} \mathbf{x} = r^2$ centred at $\mathbf{0}$. Its volume is given by multiplying $V(r)$ by the determinant of the transformation matrix, which gives $|\Sigma|^{1/2} V_D r^D$, as required.
Problem 2.24 Solution
We follow the hint, and first calculate
\[
\begin{pmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{pmatrix}
\times
\begin{pmatrix} \mathbf{M} & -\mathbf{M}\mathbf{B}\mathbf{D}^{-1} \\ -\mathbf{D}^{-1}\mathbf{C}\mathbf{M} & \mathbf{D}^{-1} + \mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} \end{pmatrix}
\]
The result can also be partitioned into four blocks. The top-left block equals
\[
\mathbf{A}\mathbf{M} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C}\mathbf{M}
= (\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})(\mathbf{A} - \mathbf{B}\mathbf{D}^{-1}\mathbf{C})^{-1} = \mathbf{I}
\]
where we used (2.77). The top-right block equals
\[
-\mathbf{A}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} + \mathbf{B}\mathbf{D}^{-1} + \mathbf{B}\mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1}
= \big( \mathbf{I} - \mathbf{A}\mathbf{M} + \mathbf{B}\mathbf{D}^{-1}\mathbf{C}\mathbf{M} \big)\mathbf{B}\mathbf{D}^{-1} = \mathbf{0}
\]
where we used the result for the top-left block. The bottom-left block equals
\[
\mathbf{C}\mathbf{M} - \mathbf{D}\mathbf{D}^{-1}\mathbf{C}\mathbf{M} = \mathbf{0}
\]
and the bottom-right block equals
\[
-\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} + \mathbf{D}\mathbf{D}^{-1} + \mathbf{D}\mathbf{D}^{-1}\mathbf{C}\mathbf{M}\mathbf{B}\mathbf{D}^{-1} = \mathbf{I}
\]
which proves what we are asked. Note: to be fully precise, one should also multiply the block matrix on the right-hand side of (2.76) from the other side and show that the product is again the identity. The procedure above applies there as well, so we omit it; moreover, for two square matrices $\mathbf{X}$ and $\mathbf{Y}$ satisfying $\mathbf{X}\mathbf{Y} = \mathbf{I}$, it can be shown that $\mathbf{Y}\mathbf{X} = \mathbf{I}$ also holds.
Problem 2.25 Solution
We use the results (2.94)-(2.98). Let us first group $\mathbf{x}_a$ and $\mathbf{x}_b$ together, and rewrite what has been given as
\[
\mathbf{x} = \begin{pmatrix} \mathbf{x}_{a,b} \\ \mathbf{x}_c \end{pmatrix}, \qquad
\boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_{a,b} \\ \boldsymbol{\mu}_c \end{pmatrix}, \qquad
\Sigma = \begin{pmatrix} \Sigma_{(a,b)(a,b)} & \Sigma_{(a,b)c} \\ \Sigma_{c(a,b)} & \Sigma_{cc} \end{pmatrix}
\]
Then, by (2.98), we obtain
\[
p(\mathbf{x}_{a,b}) = \mathcal{N}\big( \mathbf{x}_{a,b} \,|\, \boldsymbol{\mu}_{a,b},\, \Sigma_{(a,b)(a,b)} \big)
\]
where we have defined
\[
\boldsymbol{\mu}_{a,b} = \begin{pmatrix} \boldsymbol{\mu}_a \\ \boldsymbol{\mu}_b \end{pmatrix}, \qquad
\Sigma_{(a,b)(a,b)} = \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}
\]
Now that we have the joint distribution of $\mathbf{x}_a$ and $\mathbf{x}_b$, we use (2.96) and (2.97) to obtain the conditional distribution:
\[
p(\mathbf{x}_a|\mathbf{x}_b) = \mathcal{N}\big( \mathbf{x}_a \,|\, \boldsymbol{\mu}_{a|b},\, \Lambda_{aa}^{-1} \big)
\]
where we have defined
\[
\boldsymbol{\mu}_{a|b} = \boldsymbol{\mu}_a - \Lambda_{aa}^{-1} \Lambda_{ab} (\mathbf{x}_b - \boldsymbol{\mu}_b)
\]
The expressions for $\Lambda_{aa}^{-1}$ and $\Lambda_{ab}$ can be obtained using (2.76) and (2.77) once we notice the relation
\[
\begin{pmatrix} \Lambda_{aa} & \Lambda_{ab} \\ \Lambda_{ba} & \Lambda_{bb} \end{pmatrix}
= \begin{pmatrix} \Sigma_{aa} & \Sigma_{ab} \\ \Sigma_{ba} & \Sigma_{bb} \end{pmatrix}^{-1}
\]
Problem 2.26 Solution
This problem is quite straightforward if we just follow the hint.
$$(A+BCD)\left[A^{-1}-A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}\right]$$
$$=AA^{-1}-AA^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}+BCDA^{-1}-BCDA^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}$$
$$=I-B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}+BCDA^{-1}-\left[BCDA^{-1}-B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}\right]$$
$$=I$$
where we have taken advantage of:
$$BCDA^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}$$
$$=BC(-C^{-1}+C^{-1}+DA^{-1}B)(C^{-1}+DA^{-1}B)^{-1}DA^{-1}$$
$$=-BCC^{-1}(C^{-1}+DA^{-1}B)^{-1}DA^{-1}+BC(C^{-1}+DA^{-1}B)(C^{-1}+DA^{-1}B)^{-1}DA^{-1}$$
$$=-B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}+BCDA^{-1}$$
Here we will also calculate the inverse matrix directly to give another solution. Let's begin by introducing two useful formulas. The first is:
$$(I+P)^{-1}=(I+P)^{-1}(I+P-P)=I-(I+P)^{-1}P$$
And since
$$P+PQP=P(I+QP)=(I+PQ)P$$
the second formula is:
$$(I+PQ)^{-1}P=P(I+QP)^{-1}$$
Now let's directly calculate $(A+BCD)^{-1}$:
$$(A+BCD)^{-1}=\left[A(I+A^{-1}BCD)\right]^{-1}$$
$$=(I+A^{-1}BCD)^{-1}A^{-1}$$
$$=\left[I-(I+A^{-1}BCD)^{-1}A^{-1}BCD\right]A^{-1}$$
$$=A^{-1}-(I+A^{-1}BCD)^{-1}A^{-1}BCDA^{-1}$$
$$=A^{-1}-A^{-1}(I+BCDA^{-1})^{-1}BCDA^{-1}$$
$$=A^{-1}-A^{-1}B(I+CDA^{-1}B)^{-1}CDA^{-1}$$
$$=A^{-1}-A^{-1}B(C^{-1}+DA^{-1}B)^{-1}C^{-1}CDA^{-1}$$
$$=A^{-1}-A^{-1}B(C^{-1}+DA^{-1}B)^{-1}DA^{-1}$$
Here we have assumed that $A$ is invertible and used the first formula; we have then also assumed that $C$ is invertible and recursively used the second formula.
Just as required.
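The identity just derived can also be checked numerically. Below is a minimal sketch (assuming NumPy; the matrix sizes, the diagonal shifts used to keep the matrices well conditioned, and the seed are arbitrary) comparing the left- and right-hand sides on random matrices.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(4, 4)) + 4 * np.eye(4)   # shift to keep A comfortably invertible
    C = rng.normal(size=(2, 2)) + 4 * np.eye(2)
    B = rng.normal(size=(4, 2))
    D = rng.normal(size=(2, 4))

    Ai, Ci = np.linalg.inv(A), np.linalg.inv(C)
    lhs = np.linalg.inv(A + B @ C @ D)
    rhs = Ai - Ai @ B @ np.linalg.inv(Ci + D @ Ai @ B) @ D @ Ai
    print(np.allclose(lhs, rhs))   # expect True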
Problem 2.27 Solution
The same procedure used in Prob.1.10 can be used here.
$$\mathbb{E}[\mathbf{x}+\mathbf{z}]=\iint(\mathbf{x}+\mathbf{z})\,p(\mathbf{x},\mathbf{z})\,d\mathbf{x}\,d\mathbf{z}$$
$$=\iint(\mathbf{x}+\mathbf{z})\,p(\mathbf{x})p(\mathbf{z})\,d\mathbf{x}\,d\mathbf{z}$$
$$=\iint\mathbf{x}\,p(\mathbf{x})p(\mathbf{z})\,d\mathbf{x}\,d\mathbf{z}+\iint\mathbf{z}\,p(\mathbf{x})p(\mathbf{z})\,d\mathbf{x}\,d\mathbf{z}$$
$$=\Big(\int p(\mathbf{z})\,d\mathbf{z}\Big)\int\mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}+\Big(\int p(\mathbf{x})\,d\mathbf{x}\Big)\int\mathbf{z}\,p(\mathbf{z})\,d\mathbf{z}$$
$$=\int\mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}+\int\mathbf{z}\,p(\mathbf{z})\,d\mathbf{z}=\mathbb{E}[\mathbf{x}]+\mathbb{E}[\mathbf{z}]$$
And for the covariance matrix, we use the matrix integral:
$$\mathrm{cov}[\mathbf{x}+\mathbf{z}]=\iint\big(\mathbf{x}+\mathbf{z}-\mathbb{E}[\mathbf{x}+\mathbf{z}]\big)\big(\mathbf{x}+\mathbf{z}-\mathbb{E}[\mathbf{x}+\mathbf{z}]\big)^T p(\mathbf{x},\mathbf{z})\,d\mathbf{x}\,d\mathbf{z}$$
The same procedure can be used here, so we omit the proof for simplicity.
Problem 2.28 Solution
It is quite straightforward when we compare the problem with (2.94)-(2.98). We treat $\mathbf{x}$ in (2.94) as $\mathbf{z}$ in this problem, $\mathbf{x}_a$ in (2.94) as $\mathbf{x}$ in this problem, and $\mathbf{x}_b$ in (2.94) as $\mathbf{y}$ in this problem. In other words, we rewrite the problem in the form of (2.94)-(2.98), which gives:
$$\mathbf{z}=\begin{pmatrix}\mathbf{x}\\ \mathbf{y}\end{pmatrix},\qquad \mathbb{E}[\mathbf{z}]=\begin{pmatrix}\boldsymbol{\mu}\\ A\boldsymbol{\mu}+\mathbf{b}\end{pmatrix},\qquad \mathrm{cov}[\mathbf{z}]=\begin{pmatrix}\Lambda^{-1} & \Lambda^{-1}A^T\\ A\Lambda^{-1} & L^{-1}+A\Lambda^{-1}A^T\end{pmatrix}$$
By using (2.98), we can obtain:
$$p(\mathbf{x})=\mathcal{N}(\mathbf{x}|\boldsymbol{\mu},\Lambda^{-1})$$
And by using (2.96) and (2.97), we can obtain:
$$p(\mathbf{y}|\mathbf{x})=\mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_{y|x},\Lambda_{yy}^{-1})$$
where $\Lambda_{yy}$ is given by the bottom-right block of (2.104), i.e., $\Lambda_{yy}=L$; you can also calculate it using (2.105) combined with (2.78) and (2.79). Finally the conditional mean is given by (2.97):
$$\boldsymbol{\mu}_{y|x}=A\boldsymbol{\mu}+\mathbf{b}-L^{-1}(-LA)(\mathbf{x}-\boldsymbol{\mu})=A\mathbf{x}+\mathbf{b}$$
Problem 2.29 Solution
It is straightforward using (2.76) and (2.77). Firstly, we calculate the top-left block:
$$\text{top-left}=\left[(\Lambda+A^TLA)-(-A^TL)(L^{-1})(-LA)\right]^{-1}=\Lambda^{-1}$$
Then the top-right block:
$$\text{top-right}=-\Lambda^{-1}(-A^TL)L^{-1}=\Lambda^{-1}A^T$$
Then the bottom-left block:
$$\text{bottom-left}=-L^{-1}(-LA)\Lambda^{-1}=A\Lambda^{-1}$$
Finally the bottom-right block:
$$\text{bottom-right}=L^{-1}+L^{-1}(-LA)\Lambda^{-1}(-A^TL)L^{-1}=L^{-1}+A\Lambda^{-1}A^T$$
Problem 2.30 Solution
It is straightforward by multiplying (2.105) and (2.107), which gives:
$$\begin{pmatrix}\Lambda^{-1} & \Lambda^{-1}A^T\\ A\Lambda^{-1} & L^{-1}+A\Lambda^{-1}A^T\end{pmatrix}\begin{pmatrix}\Lambda\boldsymbol{\mu}-A^TL\mathbf{b}\\ L\mathbf{b}\end{pmatrix}=\begin{pmatrix}\boldsymbol{\mu}\\ A\boldsymbol{\mu}+\mathbf{b}\end{pmatrix}$$
Just as required in the problem.
Problem 2.31 Solution
According to the problem, we can write down two expressions:
$$p(\mathbf{x})=\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}_x,\Sigma_x),\qquad p(\mathbf{y}|\mathbf{x})=\mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_z+\mathbf{x},\Sigma_z)$$
By comparing these expressions with (2.113)-(2.117), we can write down the expression for $p(\mathbf{y})$:
$$p(\mathbf{y})=\mathcal{N}(\mathbf{y}|\boldsymbol{\mu}_x+\boldsymbol{\mu}_z,\Sigma_x+\Sigma_z)$$
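This result (the means and covariances of the two independent variables simply add) can be illustrated with a quick Monte Carlo sketch. The sketch below assumes NumPy; the chosen means, covariances and sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(2)
    mu_x, mu_z = np.array([1.0, -2.0]), np.array([0.5, 3.0])
    Sigma_x = np.array([[2.0, 0.3], [0.3, 1.0]])
    Sigma_z = np.array([[1.5, -0.2], [-0.2, 0.5]])

    x = rng.multivariate_normal(mu_x, Sigma_x, size=200_000)
    z = rng.multivariate_normal(mu_z, Sigma_z, size=200_000)
    y = x + z                                     # y = x + z with independent noise z

    print(np.round(y.mean(axis=0), 2))            # close to mu_x + mu_z
    print(np.round(np.cov(y.T), 2))               # close to Sigma_x + Sigma_z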
Problem 2.32 Solution
Let's make this problem clearer. The derivation in the main text, i.e., (2.101)-(2.110), first defines a new random variable $\mathbf{z}$ corresponding to the joint distribution, then completes the square with respect to $\mathbf{z}$, i.e., (2.103), obtains the precision matrix $R$ by comparing (2.103) with the PDF of a multivariate Gaussian distribution, takes the inverse of the precision matrix to obtain the covariance matrix, and finally uses the linear term, i.e., (2.106), to calculate the mean.
In this problem, we are asked to proceed from another perspective: we write down the joint distribution $p(\mathbf{x},\mathbf{y})$ and then integrate over $\mathbf{x}$ to obtain the marginal distribution $p(\mathbf{y})$. Let's begin by writing the quadratic form in the exponent of $p(\mathbf{x},\mathbf{y})$:
$$-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\Lambda(\mathbf{x}-\boldsymbol{\mu})-\frac{1}{2}(\mathbf{y}-A\mathbf{x}-\mathbf{b})^TL(\mathbf{y}-A\mathbf{x}-\mathbf{b})$$
We extract those terms involving $\mathbf{x}$:
$$=-\frac{1}{2}\mathbf{x}^T(\Lambda+A^TLA)\mathbf{x}+\mathbf{x}^T\big[\Lambda\boldsymbol{\mu}+A^TL(\mathbf{y}-\mathbf{b})\big]+\text{const}$$
$$=-\frac{1}{2}(\mathbf{x}-\mathbf{m})^T(\Lambda+A^TLA)(\mathbf{x}-\mathbf{m})+\frac{1}{2}\mathbf{m}^T(\Lambda+A^TLA)\mathbf{m}+\text{const}$$
where we have defined:
$$\mathbf{m}=(\Lambda+A^TLA)^{-1}\big[\Lambda\boldsymbol{\mu}+A^TL(\mathbf{y}-\mathbf{b})\big]$$
If we now integrate over $\mathbf{x}$, the first term integrates to a constant, and extracting the terms involving $\mathbf{y}$ from the remaining parts we obtain:
$$-\frac{1}{2}\mathbf{y}^T\big[L-LA(\Lambda+A^TLA)^{-1}A^TL\big]\mathbf{y}+\mathbf{y}^T\Big\{\big[L-LA(\Lambda+A^TLA)^{-1}A^TL\big]\mathbf{b}+LA(\Lambda+A^TLA)^{-1}\Lambda\boldsymbol{\mu}\Big\}+\text{const}$$
We first read off the quadratic term to obtain the precision matrix and then, taking advantage of (2.289), we obtain (2.110). Finally, using the linear term combined with the covariance matrix just obtained, we can obtain (2.109).
Problem 2.33 Solution
According to Bayes' theorem, we can write $p(\mathbf{x}|\mathbf{y})=p(\mathbf{x},\mathbf{y})/p(\mathbf{y})$, where the joint distribution $p(\mathbf{x},\mathbf{y})$ is already known from (2.105) and (2.108), and the marginal distribution $p(\mathbf{y})$ was obtained in Prob.2.32. We can then follow the same procedure as in Prob.2.32, i.e., first obtain the covariance matrix from the quadratic term and then obtain the mean from the linear term. The details are omitted here.
Problem 2.34 Solution
Let's follow the hint by calculating the derivative of (2.118) with respect to $\Sigma$ and setting it equal to zero:
$$\frac{\partial}{\partial\Sigma}\Big[-\frac{N}{2}\ln|\Sigma|-\frac{1}{2}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_n-\boldsymbol{\mu})\Big]=0$$
By using (C.28), the first term reduces to:
$$-\frac{N}{2}\frac{\partial}{\partial\Sigma}\ln|\Sigma|=-\frac{N}{2}(\Sigma^{-1})^T=-\frac{N}{2}\Sigma^{-1}$$
To show that the optimal covariance matrix is the sample covariance, we denote the sample covariance matrix $S$ as:
$$S=\frac{1}{N}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu})(\mathbf{x}_n-\boldsymbol{\mu})^T$$
We rewrite the second term:
$$-\frac{1}{2}\frac{\partial}{\partial\Sigma}\sum_{n=1}^{N}(\mathbf{x}_n-\boldsymbol{\mu})^T\Sigma^{-1}(\mathbf{x}_n-\boldsymbol{\mu})=-\frac{N}{2}\frac{\partial}{\partial\Sigma}\mathrm{Tr}\big[\Sigma^{-1}S\big]=\frac{N}{2}\Sigma^{-1}S\Sigma^{-1}$$
where we have taken advantage of the following property, combined with the fact that $S$ and $\Sigma$ are symmetric (this property can be found in The Matrix Cookbook):
$$\frac{\partial}{\partial X}\mathrm{Tr}(AX^{-1}B)=-(X^{-1}BAX^{-1})^T=-(X^{-1})^TA^TB^T(X^{-1})^T$$
Thus we obtain:
$$-\frac{N}{2}\Sigma^{-1}+\frac{N}{2}\Sigma^{-1}S\Sigma^{-1}=0$$
Obviously, we obtain $\Sigma=S$, just as required.
Problem 2.35 Solution
The proof of (2.62) is quite clear in the main text (pages 82-83), so we won't repeat it here. Let's prove (2.124). We first begin by proving (2.123):
$$\mathbb{E}[\boldsymbol{\mu}_{ML}]=\frac{1}{N}\mathbb{E}\Big[\sum_{n=1}^{N}\mathbf{x}_n\Big]=\frac{1}{N}\cdot N\boldsymbol{\mu}=\boldsymbol{\mu}$$
where we have taken advantage of the fact that the $\mathbf{x}_n$ are independent and identically distributed (i.i.d.). Then we use the expression in (2.122):
$$\mathbb{E}[\Sigma_{ML}]=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\big[(\mathbf{x}_n-\boldsymbol{\mu}_{ML})(\mathbf{x}_n-\boldsymbol{\mu}_{ML})^T\big]$$
$$=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\big[\mathbf{x}_n\mathbf{x}_n^T-2\boldsymbol{\mu}_{ML}\mathbf{x}_n^T+\boldsymbol{\mu}_{ML}\boldsymbol{\mu}_{ML}^T\big]$$
$$=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[\mathbf{x}_n\mathbf{x}_n^T]-\frac{2}{N}\sum_{n=1}^{N}\mathbb{E}[\boldsymbol{\mu}_{ML}\mathbf{x}_n^T]+\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}[\boldsymbol{\mu}_{ML}\boldsymbol{\mu}_{ML}^T]$$
By using (2.291), the first term equals:
$$\text{first term}=\frac{1}{N}\cdot N(\boldsymbol{\mu}\boldsymbol{\mu}^T+\Sigma)=\boldsymbol{\mu}\boldsymbol{\mu}^T+\Sigma$$
The second term equals:
$$\text{second term}=-\frac{2}{N}\sum_{n=1}^{N}\mathbb{E}\Big[\frac{1}{N}\Big(\sum_{m=1}^{N}\mathbf{x}_m\Big)\mathbf{x}_n^T\Big]=-\frac{2}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}\mathbb{E}[\mathbf{x}_m\mathbf{x}_n^T]$$
$$=-\frac{2}{N^2}\sum_{n=1}^{N}\sum_{m=1}^{N}(\boldsymbol{\mu}\boldsymbol{\mu}^T+I_{nm}\Sigma)=-\frac{2}{N^2}(N^2\boldsymbol{\mu}\boldsymbol{\mu}^T+N\Sigma)=-2\Big(\boldsymbol{\mu}\boldsymbol{\mu}^T+\frac{1}{N}\Sigma\Big)$$
Similarly, the third term equals:
$$\text{third term}=\frac{1}{N}\sum_{n=1}^{N}\mathbb{E}\Big[\Big(\frac{1}{N}\sum_{j=1}^{N}\mathbf{x}_j\Big)\Big(\frac{1}{N}\sum_{i=1}^{N}\mathbf{x}_i\Big)^T\Big]=\frac{1}{N^3}\sum_{n=1}^{N}(N^2\boldsymbol{\mu}\boldsymbol{\mu}^T+N\Sigma)=\boldsymbol{\mu}\boldsymbol{\mu}^T+\frac{1}{N}\Sigma$$
Finally, combining these three terms gives:
$$\mathbb{E}[\Sigma_{ML}]=\frac{N-1}{N}\Sigma$$
Note: the same procedure from (2.59) to (2.62) can be carried out to prove (2.291); the only difference is that we need to introduce indices $m$ and $n$ to represent the samples. (2.291) is quite intuitive if we view it this way: if $m=n$, which means $\mathbf{x}_n$ and $\mathbf{x}_m$ are actually the same sample, (2.291) reduces to (2.262) (i.e., the correlation between different dimensions exists), and if $m\neq n$, which means $\mathbf{x}_n$ and $\mathbf{x}_m$ are different, i.i.d. samples, no correlation should exist, so we expect $\mathbb{E}[\mathbf{x}_n\mathbf{x}_m^T]=\boldsymbol{\mu}\boldsymbol{\mu}^T$ in this case.
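The bias factor $(N-1)/N$ can also be observed empirically. The sketch below (assuming NumPy; the dimension, the value of $N$ and the number of trials are arbitrary choices) repeatedly draws $N$ samples, computes the ML covariance using the ML mean, and averages over trials.

    import numpy as np

    rng = np.random.default_rng(3)
    N, D, trials = 5, 2, 20_000
    Sigma = np.array([[1.0, 0.4], [0.4, 2.0]])
    mu = np.zeros(D)

    est = np.zeros((D, D))
    for _ in range(trials):
        X = rng.multivariate_normal(mu, Sigma, size=N)
        Xc = X - X.mean(axis=0)          # subtract the ML mean of this trial
        est += (Xc.T @ Xc) / N           # ML covariance estimate for this trial
    est /= trials

    print(np.round(est, 3))              # close to (N-1)/N * Sigma
    print(np.round((N - 1) / N * Sigma, 3))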
Problem 2.36 Solution
Let's follow the hint. However, we will first find the sequential expression based on the definition, which will make the later step of identifying the coefficient $a_{N-1}$ easier. Suppose we have $N$ observations in total; then we can write:
σ2(N)
ML =1
N
N
n=1
(xnµ(N)
ML)2
=1
NN1
n=1
(xnµ(N)
ML)2+(xNµ(N)
ML)2
=N1
N
1
N1
N1
n=1
(xnµ(N)
ML)2+1
N(xNµ(N)
ML)2
=N1
Nσ2(N1)
ML +1
N(xNµ(N)
ML)2
=σ2(N1)
ML +1
N(xNµ(N)
ML)2σ2(N1)
ML
And then let us write the expression for σML.
∂σ21
N
N
n=1
lnp(xn|µ,σ)σML =0
By exchanging the summation and the derivative, and letting N→ +∞,
we can obtain :
lim
N→+∞
1
N
N
n=1
∂σ2lnp(xn|µ,σ)=Ex
∂σ2lnp(xn|µ,σ)
Comparing it with (2.127), we can obtain the sequential formula to esti-
mate σML :
σ2(N)
ML =σ2(N1)
ML +aN1
∂σ2(N1)
ML
lnp(xN|µ(N)
ML,σ(N1)
ML ) ()
=σ2(N1)
ML +aN11
2σ2(N1)
ML +(xNµ(N)
ML)2
2σ4(N1)
ML
Where we use σ2(N)
ML to represent the Nth estimation of σ2
ML, i.e., the esti-
mation of σ2
ML after the Nth observation. What’s more, if we choose :
aN1=2σ4(N1)
ML
N
Then we will obtain :
σ2(N)
ML =σ2(N1)
ML +1
Nσ2(N1)
ML +(xNµ(N)
ML)2
We can see that the results are the same. An important point should be noted: in maximum likelihood, when estimating the variance $\sigma^{2(N)}_{ML}$, we first estimate the mean $\mu^{(N)}_{ML}$ and then calculate the variance $\sigma^{2(N)}_{ML}$. In other words, they are decoupled. It is the same in the sequential method. For instance, if we want to estimate both the mean and the variance sequentially, after observing the $N$th sample (i.e., $x_N$), we first use $\mu^{(N-1)}_{ML}$ together with (2.126) to estimate $\mu^{(N)}_{ML}$, and then use the conclusion of this problem to obtain $\sigma^{(N)}_{ML}$. That is why in $(*)$ we write $\ln p(x_N|\mu^{(N)}_{ML},\sigma^{(N-1)}_{ML})$ instead of $\ln p(x_N|\mu^{(N-1)}_{ML},\sigma^{(N-1)}_{ML})$.
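The sequential update derived above, $\sigma^{2(N)}_{ML}=\sigma^{2(N-1)}_{ML}+\frac{1}{N}\big[(x_N-\mu^{(N)}_{ML})^2-\sigma^{2(N-1)}_{ML}\big]$, can be compared against the batch ML estimate. Note that, because the recursion reuses the previous variance (which was computed with the previous mean), it matches the batch value only approximately, with the gap shrinking as $N$ grows. A minimal sketch, assuming NumPy; the distribution parameters and sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(4)
    data = rng.normal(loc=2.0, scale=3.0, size=1000)

    mu, var = 0.0, 0.0
    for n, x in enumerate(data, start=1):
        mu += (x - mu) / n                # sequential ML mean, cf. (2.126)
        var += ((x - mu) ** 2 - var) / n  # sequential ML variance (this problem)

    # Batch ML estimates for comparison (ddof=0 gives the 1/N normalisation).
    print(round(mu, 4), round(data.mean(), 4))     # essentially equal
    print(round(var, 4), round(data.var(ddof=0), 4))  # close, small O(1/N) gap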
Problem 2.37 Solution (Wait for revising)
We follow the same procedure in Prob.2.36 to solve this problem. Firstly,
we can obtain the sequential formula based on definition.
Σ(N)
ML =1
N
N
n=1
(xnµ(N)
ML)(xnµ(N)
ML)T
=1
NN1
n=1
(xnµ(N)
ML)(xnµ(N)
ML)T+(xNµ(N)
ML)(xNµ(N)
ML)T
=N1
N
Σ(N1)
ML +1
N(xNµ(N)
ML)(xNµ(N)
ML)T
=Σ(N1)
ML +1
N(xNµ(N)
ML)(xNµ(N)
ML)TΣ(N1)
ML
If we use Robbins-Monro sequential estimation formula, i.e., (2.135), we
can obtain :
Σ(N)
ML =Σ(N1)
ML +aN1
Σ(N1)
ML
lnp(xN|µ(N)
ML,Σ(N1)
ML )
=Σ(N1)
ML +aN1
Σ(N1)
ML
lnp(xN|µ(N)
ML,Σ(N1)
ML )
=Σ(N1)
ML +aN11
2[Σ(N1)
ML ]1+1
2[Σ(N1)
ML ]1(xnµ(N1)
ML )(xnµ(N1)
ML )T[Σ(N1)
ML ]1
Where we have taken advantage of the procedure we carried out in Prob.2.34
to calculate the derivative, and if we choose :
aN1=2
N
Σ2(N1)
ML
We can see that the equation above will be identical with our previous
conclusion based on definition.
Problem 2.38 Solution
It is straightforward. Based on (2.137), (2.138) and (2.139), we focus on
the exponential term of the posterior distribution p(µ|X), which gives :
1
2σ2
N
n=1
(xnµ)21
2σ2
0
(µµ0)2= − 1
2σ2
N
(µµN)2
We rewrite the left side regarding to µ.
quadratic term = −(N
2σ2+1
2σ2
0
)µ2
linear term =(N
n=1xn
σ2+µ0
σ2
0
)µ
We also rewrite the right side regarding to µ, and hence we will obtain :
(N
2σ2+1
2σ2
0
)µ2= − 1
2σ2
N
µ2,(N
n=1xn
σ2+µ0
σ2
0
)µ=µN
σ2
N
µ
Then we will obtain : 1
σ2
N=1
σ2
0+N
σ2
And with the prior knowledge that N
n=1xn=N·µML, we can write :
µN=σ2
N·(N
n=1xn
σ2+µ0
σ2
0
)
=(1
σ2
0+N
σ2)1·(NµML
σ2+µ0
σ2
0
)
=σ2
0σ2
σ2+Nσ2
0·NµMLσ2
0+µ0σ2
σσ2
0
=σ2
Nσ2
0+σ2µ0+Nσ2
0
Nσ2
0+σ2µML
Problem 2.39 Solution
Let’s follow the hint.
1
σ2
N=1
σ2
0+N
σ2=1
σ2
0+N1
σ2+1
σ2=1
σ2
N1+1
σ2
However, it is complicated to derive a sequential formula for µNdirectly.
Based on (2.142), we see that the denominator in (2.141) can be eliminated if
we multiply 1/σ2
Non both side of (2.141). Therefore we will derive a sequen-
tial formula for µN/σ2
Ninstead.
µN
σ2
N=σ2+Nσ2
0
σ2
0σ2(σ2
Nσ2
0+σ2µ0+Nσ2
0
Nσ2
0+σ2µ(N)
ML)
=σ2+Nσ2
0
σ2
0σ2(σ2
Nσ2
0+σ2µ0+Nσ2
0
Nσ2
0+σ2µ(N)
ML)
=µ0
σ2
0+Nµ(N)
ML
σ2=µ0
σ2
0+N
n=1xn
σ2
=µ0
σ2
0+N1
n=1xn
σ2+xN
σ2
=µN1
σ2
N1+xN
σ2
Another possible solution is also given in the problem. We solve it by
completing the square.
1
2σ2(xNµ)21
2σ2
N1
(µµN1)2= − 1
2σ2
N
(µµN)2
By comparing the quadratic and linear term regarding to µ, we can ob-
tain: 1
σ2
N=1
σ2+1
σ2
N1
And : µN
σ2
N=xN
σ2+µN1
σ2
N1
This is the same as the previous result. Note: after obtaining the $N$th observation, we first use the sequential formula to calculate $\sigma^2_N$, and only then $\mu_N$. This is because the sequential formula for $\mu_N$ depends on $\sigma^2_N$.
Problem 2.40 Solution
Based on Bayes Theorem, we can write :
p(µ|X)p(X|µ)p(µ)
We focus on the exponential term on the right side and then rearrange it
regarding to µ.
right =N
n=11
2(xnµ)TΣ1(xnµ)1
2(µµ0)TΣ01(µµ0)
=N
n=11
2(xnµ)TΣ1(xnµ)1
2(µµ0)TΣ01(µµ0)
= −1
2µ(Σ01+NΣ1)µ+µT(Σ1
0µ0+Σ1N
n=1
xn)+const
Where ’const’ represents all the constant terms independent of µ. Accord-
ing to the quadratic term, we can obtain the posterior covariance matrix.
Σ1
N=Σ01+NΣ1
Then using the linear term, we can obtain :
Σ1
NµN=(Σ1
0µ0+Σ1N
n=1
xn)
Finally we obtain posterior mean :
µN=(Σ01+NΣ1)1(Σ1
0µ0+Σ1N
n=1
xn)
Which can also be written as :
µN=(Σ01+NΣ1)1(Σ01µ0+Σ1NµML)
Problem 2.41 Solution
Let’s compute the integral of (2.146) over λ.
+∞
0
1
Γ(a)baλa1exp(bλ)dλ=ba
Γ(a)+∞
0
λa1exp(bλ)dλ
=ba
Γ(a)+∞
0
(u
b)a1exp(u)1
bdu
=1
Γ(a)+∞
0
ua1exp(u)du
=1
Γ(a)·Γ(a)=1
Where we first perform change of variable bλ=u, and then take advan-
tage of the definition of gamma function:
Γ(x)=+∞
0
ux1eudu
Problem 2.42 Solution
We first calculate its mean.
+∞
0
λ1
Γ(a)baλa1exp(bλ)dλ=ba
Γ(a)+∞
0
λaexp(bλ)dλ
=ba
Γ(a)+∞
0
(u
b)aexp(u)1
bdu
=1
Γ(a)·b+∞
0
uaexp(u)du
=1
Γ(a)·b·Γ(a+1) =a
b
Where we have taken advantage of the property Γ(a+1) =aΓ(a). Then
we calculate E[λ2].
+∞
0
λ21
Γ(a)baλa1exp(bλ)dλ=ba
Γ(a)+∞
0
λa+1exp(bλ)dλ
=ba
Γ(a)+∞
0
(u
b)a+1exp(u)1
bdu
=1
Γ(a)·b2+∞
0
ua+1exp(u)du
=1
Γ(a)·b2·Γ(a+2) =a(a+1)
b2
Therefore, according to var[λ]=E[λ2]E[λ]2, we can obtain :
var[λ]=E[λ2]E[λ]2=a(a+1)
b2(a
b)2
=a
b2
For the mode of the gamma distribution, we need to find where the maximum of the PDF occurs, and hence we calculate the derivative of the gamma distribution with respect to $\lambda$:
$$\frac{d}{d\lambda}\Big[\frac{1}{\Gamma(a)}b^a\lambda^{a-1}\exp(-b\lambda)\Big]=\big[(a-1)-b\lambda\big]\frac{1}{\Gamma(a)}b^a\lambda^{a-2}\exp(-b\lambda)$$
It is obvious that $\mathrm{Gam}(\lambda|a,b)$ attains its maximum at $\lambda=(a-1)/b$. In other words, the gamma distribution $\mathrm{Gam}(\lambda|a,b)$ has mode $(a-1)/b$.
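The mean $a/b$, variance $a/b^2$ and mode $(a-1)/b$ can be confirmed numerically by sampling. A minimal sketch assuming NumPy (note that NumPy's gamma sampler is parameterised by shape $a$ and scale $1/b$); the chosen $a$, $b$ and sample size are arbitrary.

    import numpy as np

    rng = np.random.default_rng(6)
    a, b = 3.0, 2.0
    samples = rng.gamma(shape=a, scale=1.0 / b, size=1_000_000)

    print(round(samples.mean(), 3), a / b)       # mean ~ a/b
    print(round(samples.var(), 3), a / b**2)     # variance ~ a/b^2

    # Crude mode estimate from a histogram, compared with (a-1)/b.
    counts, edges = np.histogram(samples, bins=200)
    mode_est = 0.5 * (edges[counts.argmax()] + edges[counts.argmax() + 1])
    print(round(mode_est, 2), (a - 1) / b)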
Problem 2.43 Solution
Let’s firstly calculate the following integral.
+∞
−∞
exp(|x|q
2σ2)dx =2+∞
−∞
exp(xq
2σ2)dx
=2+∞
0
exp(u)(2σ2)1
q
qu1
q1du
=2(2σ2)1
q
q+∞
0
exp(u)u1
q1dx
=2(2σ2)1
q
q
Γ(1
q)
And then it is obvious that (2.293) is normalized. Next, we consider about
the log likelihood function. Since ϵ=ty(x,w) and ϵp(ϵ|σ2,q), we can
write:
ln p(t|X,w,σ2)=
N
n=1
ln py(xn,w)tn|σ2,q
= 1
2σ2
N
n=1|y(xn,w)tn|q+N·lnq
2(2σ2)1/qΓ(1/q)
= 1
2σ2
N
n=1|y(xn,w)tn|qN
qln(2σ2)+const
Problem 2.44 Solution
Here we use a simple method to solve this problem by taking advantage
of (2.152) and (2.153). By writing the prior distribution in the form of (2.153),
i.e., p(µ,λ|β,c,d), we can easily obtain the posterior distribution.
p(µ,λ|X)p(X|µ,λ)·p(µ,λ)
λ1/2 exp(λµ2
2)N+β
ex p (c+
N
n=1
xn)λµ (d+
N
n=1
x2
n
2)λ
Therefore, we can see that the posterior distribution has parameters: β=
β+N,c=c+N
n=1xn,d=d+N
n=1
x2
n
2. And since the prior distribution is
actually the product of a Gaussian distribution and a Gamma distribution:
p(µ,λ|µ0,β,a,b)=Nµ|µ0,(βλ)1Gam(λ|a,b)
Where µ0=c/β,a=1+β/2, b=dc2/2β. Hence the posterior distri-
bution can also be written as the product of a Gaussian distribution and a
Gamma distribution.
p(µ,λ|X)=Nµ|µ
0,(βλ)1Gam(λ|a,b)
Where we have defined:
µ
0=c/β=(c+
N
n=1
xn)(N+β)
a=1+β/2 =1+(N+β)2
b=dc2/2β=d+
N
n=1
x2
n
2(c+
N
n=1
xn)2(2(β+N))
Problem 2.45 Solution
Let’s begin by writing down the dependency of the prior distribution W(Λ|W,v)
and the likelihood function p(X|µ,Λ) on Λ.
p(X|µ,Λ)∝ |Λ|N/2 expN
n=11
2(xnµ)TΛ(xnµ)
And if we denote
S=1
N
N
n=1
(xnµ)(xnµ)T
Then we can rewrite the equation above as:
p(X|µ,Λ)∝ |Λ|N/2 exp1
2Tr(SΛ)
Just as what we have done in Prob.2.34, and comparing this problem with
Prob.2.34, one important thing should be noticed: since Sand Λare both
symmetric, we have: Tr(SΛ)=Tr(SΛ)T=Tr(ΛTST)=Tr(ΛS). And we
can also write down the prior distribution as:
W(Λ|W,v)∝ |Λ|(vD1)/2 exp1
2Tr(W1Λ)
Therefore, the posterior distribution can be obtained:
p(Λ|X,W,v)p(X|µ,Λ)·W(Λ|W,v)
∝ |Λ|(N+vD1)/2 exp1
2Tr(W1+S)Λ
Therefore, p(Λ|X,W,v) is also a Wishart distribution, with parameters:
vN=N+v
WN=(W1+S)1
Problem 2.46 Solution
It is quite straightforward.
p(x|µ,a,b)=
0
N(x|µ,τ1)Gam(τ|a,b)dτ
=
0
baexp(bτ)τa1
Γ(a)(τ
2π)1/2 ex p τ
2(xµ)2dτ
=ba
Γ(a)(1
2π)1/2
0
τa1/2 ex p bττ
2(xµ)2dτ
And if we make change of variable: z=τ[b+(xµ)2/2], the integral above
can be written as:
p(x|µ,a,b)=ba
Γ(a)(1
2π)1/2
0
τa1/2 ex p bττ
2(xµ)2dτ
=ba
Γ(a)(1
2π)1/2
0z
b+(xµ)2/2a1/2
ex p {z}1
b+(xµ)2/2 dz
=ba
Γ(a)(1
2π)1/2 1
b+(xµ)2/2a+1/2
0
za1/2 ex p {z}dz
=ba
Γ(a)(1
2π)1/2 b+(xµ)2
2a1/2
Γ(a+1/2)
And if we substitute a=v/2 and b=v/2λ, we will obtain (2.159).
Problem 2.47 Solution
We focus on the dependency of (2.159) on x.
St(x|µ,λ,v)1+λ(xµ)2
vv/21/2
ex p v1
2ln(1 +λ(xµ)2
v)
ex p v1
2(λ(xµ)2
v+O(v2))
ex p λ(xµ)2
2(v)
Where we have used Taylor Expansion:ln(1 +ϵ)=ϵ+O(ϵ2). We see that
this, up to an overall constant, is a Gaussian distribution with mean µand
precision λ.
Problem 2.48 Solution
The same steps in Prob.2.46 can be used here.
St(xµ,Λ,v)=+∞
0
N(xµ,(ηΛ)1)·Gam(ηv
2,v
2)dη
=+∞
0
1
(2π)D/2 |ηΛ|1/2 ex p 1
2(xµ)T(ηΛ)(xµ)vη
21
Γ(v/2)(v
2)v/2
ηv/21dη
=(v/2)v/2 |Λ|1/2
(2π)D/2 Γ(v/2) +∞
0
ex p 1
2(xµ)T(ηΛ)(xµ)vη
2ηD/2+v/21dη
Where we have taken advantage of the property: |ηΛ| = ηD|Λ|, and if we
denote:
2=(xµ)TΛ(xµ) and z=η
2(2+v)
The expression above can be reduced to :
St(xµ,Λ,v)=(v/2)v/2 |Λ|1/2
(2π)D/2 Γ(v/2) +∞
0
exp(z)( 2z
2+v)
D/2+v/21
·2
2+vdz
=(v/2)v/2 |Λ|1/2
(2π)D/2 Γ(v/2)(2
2+v)
D/2+v/2 +∞
0
exp(z)·zD/2+v/21dz
=(v/2)v/2 |Λ|1/2
(2π)D/2 Γ(v/2)(2
2+v)
D/2+v/2
Γ(D/2 +v/2)
And if we rearrange the expression above, we will obtain (2.162) just as
required.
Problem 2.49 Solution
Firstly, we notice that if and only if x=µ,2equals to 0, so that St(x|µ,Λ,v)
achieves its maximum. In other words, the mode of St(x|µ,Λ,v) is µ. Then
we consider about its mean E[x].
E[x]=xRD
St(x|µ,Λ,v)·xdx
=xRD+∞
0
N(xµ,(ηΛ)1)·Gam(ηv
2,v
2)dηxdx
=xRD+∞
0
xN(xµ,(ηΛ)1)·Gam(ηv
2,v
2)dηdx
=+∞
0xRD
xN(xµ,(ηΛ)1)dx·Gam(ηv
2,v
2)dη
=+∞
0µ·Gam(ηv
2,v
2)dη
=µ+∞
0
Gam(ηv
2,v
2)dη=µ
Where we have taken the following property:
xRD
xN(xµ,(ηΛ)1)dx=E[x]=µ
Then we calculate E[xxT]. The steps above can also be used here.
E[xxT]=xRD
St(x|µ,Λ,v)·xxTdx
=xRD+∞
0
N(xµ,(ηΛ)1)·Gam(ηv
2,v
2)dηxxTdx
=xRD+∞
0
xxTN(xµ,(ηΛ)1)·Gam(ηv
2,v
2)dηdx
=+∞
0xRD
xxTN(xµ,(ηΛ)1)dx·Gam(ηv
2,v
2)dη
=+∞
0E[µµT]·Gam(ηv
2,v
2)dη
=+∞
0µµT+(ηΛ)1Gam(ηv
2,v
2)dη
=µµT++∞
0
(ηΛ)1·Gam(ηv
2,v
2)dη
=µµT++∞
0
(ηΛ)1·1
Γ(v/2)(v
2)v/2ηv/21exp(v
2η)dη
=µµT+Λ11
Γ(v/2)(v
2)v/2 +∞
0
ηv/22exp(v
2η)dη
If we denote: z=vη
2, the equation above can be reduced to :
E[xxT]=µµT+Λ11
Γ(v/2)(v
2)v/2 +∞
0
(2z
v)v/22exp(z)2
vdz
=µµT+Λ11
Γ(v/2) ·v
2+∞
0
zv/22exp(z)dz
=µµT+Λ1Γ(v/2 1)
Γ(v/2) ·v
2
=µµT+Λ11
v/2 1
v
2
=µµT+v
v2Λ1
Where we have taken advantage of the property: Γ(x+1) =xΓ(x), and
since we have cov[x]=E(xE[x])(xE[x])T, together with E[x]=µ, we
can obtain:
cov[x]=v
v2Λ1
Problem 2.50 Solution
The same steps in Prob.2.47 can be used here.
St(x|µ,Λ,v)1+
2
vD/2v/2
ex p (D/2 v/2) ·ln(1 +
2
v)
ex p D+v
2·(2
v+O(v2))
exp(
2
2) (v)
Where we have used Taylor Expansion:ln(1 +ϵ)=ϵ+O(ϵ2). And since
2=(xµ)TΛ(xµ), we see that this, up to an overall constant, is a Gaussian
distribution with mean µand precision Λ.
Problem 2.51 Solution
We first prove (2.177). Since we have exp(iA)·ex p(iA)=1, and exp(i A)=
cosA +isinA. We can obtain:
(cosA +isinA)·(cosA isinA)=1
Which gives cos2A+sin2A=1. And then we prove (2.178) using the hint.
cos(AB)= ℜ[exp(i(AB))]
= ℜ[exp(iA)exp(iB)]
= ℜ[cosA +isinA
cosB +isinB ]
= ℜ[(cosA +isinA)(cosB isinB)
(cosB +isinB)(cosB isinB)]
= ℜ[(cosA +isinA)(cosB isinB)]
=cosAcosB +sinAsinB
It is quite similar for (2.183).
sin(AB)= ℑ[exp(i(AB))]
= ℑ[(cosA +isinA)(cosB isinB)]
=sinAcosB cosAsinB
Problem 2.52 Solution
Let’s follow the hint. We first derive an approximation for exp[mcos(θ
θ0)].
ex p {mcos(θθ0)}=ex p m1(θθ0)2
2+O((θθ0)4)
=ex p mm(θθ0)2
2mO((θθ0)4)
=exp(m)·exp m(θθ0)2
2·ex p mO((θθ0)4)
It is same for exp(mcosθ) :
ex p {mcosθ}=exp(m)·exp(mθ2
2)·ex p mO(θ4)
Now we rearrange (2.179):
p(θ|θ0,m)=1
2πI0(m)ex p {mcos(θθ0)}
=1
2π
0ex p {mcosθ}dθ
ex p {mcos(θθ0)}
=
exp(m)·exp m(θθ0)2
2·ex p mO((θθ0)4)
2π
0exp(m)·exp(mθ2
2)·ex p mO(θ4)dθ
=1
2π
0exp(mθ2
2)dθ
ex p m(θθ0)2
2
Where we have taken advantage of the following fact:
ex p mO((θθ0)4)exp mO(θ4)(when m)
Therefore, it is straightforward that when m→ ∞, (2.179) reduces to a
Gaussian Distribution with mean θ0and precision m.
Problem 2.53 Solution
Let’s rearrange (2.182) according to (2.183).
N
n=1
sin(θθ0)=
N
n=1
(sinθncosθ0cosθnsinθ0)
=cosθ0
N
n=1
sinθnsinθ0
N
n=1
cosθn
Where we have used (2.183), and then together with (2.182), we can ob-
tain :
cosθ0
N
n=1
sinθnsinθ0
N
n=1
cosθn=0
Which gives:
θML
0=tan1nsinθn
ncosθn
Problem 2.54 Solution
We calculate the first and second derivative of (2.179) with respect to θ.
p(θ|θ0,m)=1
2πI0(m)[msin(θθ0)]expmcos(θθ0)
p(θ|θ0,m)′′ =1
2πI0(m)mcos(θθ0)+(msin(θθ0))2expmcos(θθ0)
If we let p(θ|θ0,m)equals to 0, we will obtain its root:
θ=θ0+kπ(kZ)
When k0 (mod2), i.e. θθ0(mod2π), we have:
p(θ|θ0,m)′′ =m exp(m)
2πI0(m)<0
Therefore, when θ=θ0, (2.179) obtains its maximum. And when k
1 (mod2), i.e. θθ0+π(mod2π), we have:
p(θ|θ0,m)′′ =m exp(m)
2πI0(m)>0
Therefore, when θ=θ0+π(mod2π), (2.179) obtains its minimum.
Problem 2.55 Solution
According to (2.185), we have :
A(mML)=1
N
N
n=1
cos(θnθML
0)
By using (2.178), we can write :
A(mML)=1
N
N
n=1
cos(θnθML
0)
=1
N
N
n=1cosθncosθML
0+sinθnsinθML
0
=1
N
N
n=1
cosθNcosθML
0+1
N
N
n=1
sinθNsinθML
0
By using (2.168), we can further derive:
A(mML)=1
N
N
n=1
cosθNcosθML
0+1
N
N
n=1
sinθNsinθML
0
=¯
rcos ¯
θ·cosθML
0+¯
rsin ¯
θ·sinθML
0
=¯
rcos(¯
θθML
0)
And then by using (2.169) and (2.184), it is obvious that ¯
θ=θML
0, and
hence A(mML)=¯
r.
Problem 2.56 Solution
Recall that the distributions belonging to the exponential family have the
form:
p(x|η)=h(x)g(η)exp(ηTu(x))
And according to (2.13), the beta distribution can be written as:
Beta(x|a,b)=
Γ(a+b)
Γ(a)Γ(b)xa1(1 x)b1
=
Γ(a+b)
Γ(a)Γ(b)ex p [(a1)lnx +(b1)ln(1 x)]
=
Γ(a+b)
Γ(a)Γ(b)
ex p [alnx +bln(1 x)]
x(1 x)
Comparing it with the standard form of exponential family, we can obtain:
η=[a,b]T
u(x)=[lnx,ln(1 x)]T
g(η)=Γ(η1+η2)Γ(η1)Γ(η2)
h(x)=1(x(1 x))
Where η1means the first element of η, i.e. η1=a1, and η2means the
second element of η, i.e. η2=b1. According to (2.146), Gamma distribution
can be written as:
Gam(x|a,b)=1
Γ(a)baxa1exp(bx)
Comparing it with the standard form of exponential family, we can obtain:
η=[a,b]T
u(x)=[0,x]
g(η)=ηη1
2Γ(η1)
h(x)=xη11
According to (2.179), the von Mises distribution can be written as:
p(x|θ0,m)=1
2πI0(m)exp(mcos(xθ0))
=1
2πI0(m)ex p [m(cosxcosθ0+sinxsinθ0)]
Comparing it with the standard form of exponential family, we can obtain:
η=[mcosθ0,msinθ0]T
u(x)=[cosx,sinx]
g(η)=12πI0(η2
1+η2
2)
h(x)=1
Note: a given distribution can be written in exponential-family form in several different ways, with different choices of natural parameters.
Problem 2.57 Solution
Recall that the distributions belonging to the exponential family have the
form:
p(x|η)=h(x)g(η)exp(ηTu(x))
And the multivariate Gaussian Distribution has the form:
N(x|µ,Σ)=1
(2π)D/2
1
|Σ|1/2 ex p 1
2(xµ)TΣ1(xµ)
We expand the exponential term with respect to µ.
N(x|µ,Σ)=1
(2π)D/2
1
|Σ|1/2 ex p 1
2(xTΣ1x2µTΣ1x+µΣ1µ)
=1
(2π)D/2
1
|Σ|1/2 ex p 1
2xTΣ1x+µTΣ1xex p 1
2µΣ1µ)
Comparing it with the standard form of exponential family, we can obtain:
η=[Σ1µ,1
2vec(Σ1) ]T
u(x)=[x,vec(xxT) ]
g(η)=exp(1
4η1Tη21η1)+ |2η2|1/2
h(x)=(2π)D/2
Where we have used η1to denote the first element of η, and η2to denote
the second element of η. And we also take advantage of the vectorizing oper-
ator, i.e.vec(·). The vectorization of a matrix is a linear transformation which
converts the matrix into a column vector. This can be viewed in an example :
A=a b
c d => vec(A)=[a,c,b,d]T
Note: By introducing vectorizing operator, we actually have vec(Σ1)·
vec(xxT)=xTΣ1x
Problem 2.58 Solution (Wait for updating)
Based on (2.226), we rewrite the expression for g(η).
g(η)= −g(η)E[u(x)]
And then we calculate the derivative of both sides of the equation above
with respect to η.
∇∇g(η)= −g(η)E[u(x)T]+g(η)E[u(x)T]
If we multiply both sides by 1
g(η), we can obtain :
−∇∇lng(η)= ∇ln g(η)E[u(x)T]+ ∇E[u(x)T]
According to (2.225), we calculate E[u(x)T].
E[u(x)T]= ∇g(η)h(x)ex p ηTu(x)u(x)Tdx+
g(η)h(x)ex p ηTu(x)u(x)u(x)Tdx
=>E[u(x)T]= ∇lng(η)E[u(x)T]+E[u(x)u(x)T]
Therefore, we obtain :
−∇∇lng(η)=2lng(η)E[u(x)T]+E[u(x)u(x)T]= −2E[u(x)]E[u(x)T]+E[u(x)u(x)T]
Problem 2.59 Solution
It is straightforward.
p(x|σ)dx =1
σf(x
σ)dx
=1
σf(u)σdu
=f(u)du =1
Where we have denoted u=x/σ.
Problem 2.60 Solution
Firstly, we write down the log likelihood function.
N
n=1
lnp(xn)=
M
i=1
niln(hi)
Some details should be explained here. If xnfalls into region i, then
p(xn) will equal to hi, and since we have already been given that among
all the Nobservations, there are nisamples fall into region i, we can easily
write down the likelihood function just as the equation above, and note we use
Mto denote the number of different regions. Therefore, an implicit equation
should hold: M
i=1
ni=N
We now need to take account of the constraint that p(x) must integrate
to unity, which can be written as M
j=1hjj=1. We introduce a Lagrange
multiplier to the expression, and then we need to minimize:
M
i=1
niln(hi)+λ(
M
j
hjj1)
We calculate its derivative with respect to hiand let it equal to 0.
ni
hi+λi=0
Multiplying both sides by hi, performing summation over iand then us-
ing the constraint, we can obtain:
N+λ=0
In other words, λ= −N. Then we substitute the result into the likelihood
function, which gives:
hi=ni
N
1
i
Problem 2.61 Solution
It is straightforward. In K nearest neighbours (KNN), when we want to
estimate probability density at a point xi, we will consider a small sphere
centered on xiand then allow the radius to grow until it contains Kdata
points, and then p(xi) will equal to K/(NVi), where Nis total observations
and Viis the volume of the sphere centered on xi. We can assume that Viis
small enough that p(xi) is roughly constant in it. In this way, We can write
down the integral:
p(x)dx
N
i=1
p(xi)·Vi=
N
i=1
K
NVi·Vi=K̸= 1
We also see that if we use "1NN" (K=1), the probability density will be
well normalized. Note that if and only if the volume of all the spheres are
small enough and Nis large enough, the equation above will hold. Fortu-
nately, these two conditions can be satisfied in KNN.
0.3 Linear Models for Regression
Problem 3.1 Solution
Based on (3.6), we can write:
$$2\sigma(2a)-1=\frac{2}{1+\exp(-2a)}-1=\frac{1-\exp(-2a)}{1+\exp(-2a)}=\frac{\exp(a)-\exp(-a)}{\exp(a)+\exp(-a)}$$
which is exactly $\tanh(a)$. Then we find the relation between the parameters in (3.101) and (3.102). Let's start from (3.101):
$$y(x,\mathbf{w})=w_0+\sum_{j=1}^{M}w_j\,\sigma\Big(\frac{x-\mu_j}{s}\Big)$$
$$=w_0+\sum_{j=1}^{M}w_j\,\frac{\tanh\big(\frac{x-\mu_j}{2s}\big)+1}{2}$$
$$=w_0+\frac{1}{2}\sum_{j=1}^{M}w_j+\sum_{j=1}^{M}\frac{w_j}{2}\tanh\Big(\frac{x-\mu_j}{2s}\Big)$$
Hence the relation is given by:
$$u_0=w_0+\frac{1}{2}\sum_{j=1}^{M}w_j\qquad\text{and}\qquad u_j=\frac{w_j}{2}$$
Note: there is a typo in (3.102); the denominator should be $2s$ instead of $s$, or alternatively you can view it as a new $s$ that equals $2s$.
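The identity $\tanh(a)=2\sigma(2a)-1$, and hence the equivalence of the two basis-function expansions, is easy to verify numerically. A minimal sketch assuming NumPy; the weights, centres and scale below are arbitrary.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    rng = np.random.default_rng(7)
    x = np.linspace(-5, 5, 201)
    M, s = 4, 0.7
    w0, w = rng.normal(), rng.normal(size=M)
    mu = rng.uniform(-3, 3, size=M)

    # Original model (3.101) with logistic-sigmoid basis functions.
    y_sig = w0 + sum(w[j] * sigmoid((x - mu[j]) / s) for j in range(M))

    # Equivalent tanh model with u0 = w0 + sum(w)/2 and u_j = w_j/2.
    u0, u = w0 + w.sum() / 2.0, w / 2.0
    y_tanh = u0 + sum(u[j] * np.tanh((x - mu[j]) / (2 * s)) for j in range(M))

    print(np.allclose(y_sig, y_tanh))   # expect True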
Problem 3.2 Solution
We first need to show that $\Phi^T\Phi$ is invertible. Suppose, for the sake of contradiction, that $\mathbf{c}$ is a nonzero vector in the kernel (null space) of $\Phi^T\Phi$. Then $\Phi^T\Phi\mathbf{c}$ equals $\mathbf{0}$, and so we have:
$$0=\mathbf{c}^T\Phi^T\Phi\mathbf{c}=(\Phi\mathbf{c})^T\Phi\mathbf{c}=\|\Phi\mathbf{c}\|^2$$
This shows that $\Phi\mathbf{c}=\mathbf{0}$. However, $\Phi\mathbf{c}=c_1\boldsymbol{\phi}_1+c_2\boldsymbol{\phi}_2+\dots+c_M\boldsymbol{\phi}_M$, and since the columns $\{\boldsymbol{\phi}_1,\boldsymbol{\phi}_2,\dots,\boldsymbol{\phi}_M\}$ are linearly independent, we cannot have $c_1\boldsymbol{\phi}_1+c_2\boldsymbol{\phi}_2+\dots+c_M\boldsymbol{\phi}_M=\mathbf{0}$ with a nonzero $\mathbf{c}$. This is the contradiction, and hence $\Phi^T\Phi$ is invertible. Now let's prove two specific cases.
Case 1: $\mathbf{w}_1$ lies in the column space of $\Phi$. In this case we have $\Phi\mathbf{c}=\mathbf{w}_1$ for some $\mathbf{c}$, so:
$$\Phi(\Phi^T\Phi)^{-1}\Phi^T\mathbf{w}_1=\Phi(\Phi^T\Phi)^{-1}\Phi^T\Phi\mathbf{c}=\Phi\mathbf{c}=\mathbf{w}_1$$
Case 2: $\mathbf{w}_2$ lies in the orthogonal complement of the column space of $\Phi$. Then $\Phi^T\mathbf{w}_2=\mathbf{0}$, which leads to:
$$\Phi(\Phi^T\Phi)^{-1}\Phi^T\mathbf{w}_2=\mathbf{0}$$
Recall that any vector $\mathbf{w}\in\mathbb{R}^N$ can be decomposed as the sum of two vectors $\mathbf{w}_1$ and $\mathbf{w}_2$, where $\mathbf{w}_1$ lies in the column space of $\Phi$ and $\mathbf{w}_2$ in its orthogonal complement. So we have:
$$\Phi(\Phi^T\Phi)^{-1}\Phi^T\mathbf{w}=\Phi(\Phi^T\Phi)^{-1}\Phi^T(\mathbf{w}_1+\mathbf{w}_2)=\mathbf{w}_1$$
which is exactly what an orthogonal projection is supposed to do.
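A quick numerical illustration of this projection property, assuming NumPy: the matrix $P=\Phi(\Phi^T\Phi)^{-1}\Phi^T$ is symmetric, idempotent, leaves vectors in the column space of $\Phi$ unchanged, and annihilates vectors orthogonal to it. The sizes and seed below are arbitrary.

    import numpy as np

    rng = np.random.default_rng(8)
    N, M = 6, 3
    Phi = rng.normal(size=(N, M))
    P = Phi @ np.linalg.inv(Phi.T @ Phi) @ Phi.T

    c = rng.normal(size=M)
    w1 = Phi @ c                    # a vector in the column space of Phi
    w2 = rng.normal(size=N)
    w2 -= P @ w2                    # its component orthogonal to the column space

    print(np.allclose(P, P.T), np.allclose(P @ P, P))     # symmetric and idempotent
    print(np.allclose(P @ w1, w1), np.allclose(P @ w2, 0))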
Problem 3.3 Solution
Let’s calculate the derivative of (3.104) with respect to w.
ED(w)=
N
n=1
rntnwTΦ(xn)Φ(xn)T
We set the derivative equal to 0.
0=
N
n=1
rntnΦ(xn)TwTN
n=1
rnΦ(xn)Φ(xn)T
If we denote prnϕ(xn)=ϕ(xn) and prntn=t
n, we can obtain:
0=
N
n=1
t
nΦ(xn)TwTN
n=1
Φ(xn)Φ(xn)T
Taking advantage of (3.11) – (3.17), we can derive a similar result, i.e.
wML =(ΦTΦ)1ΦTt. But here, we define tas:
t=pr1t1,pr2t2, ... ,prNtNT
We also define Φas a N×Mmatrix, with element Φ(i,j)=priϕj(xi).
Problem 3.4 Solution
Firstly, we rearrange ED(w).
ED(w)=1
2
N
n=1w0+
D
i=1
wi(xi+ϵi)tn2
=1
2
N
n=1w0+
D
i=1
wixitn+
D
i=1
wiϵi2
=1
2
N
n=1y(xn,w)tn+
D
i=1
wiϵi2
=1
2
N
n=1y(xn,w)tn2+D
i=1
wiϵi2+2D
i=1
wiϵiy(xn,w)tn
Where we have used y(xn,w) to denote the output of the linear model
when input variable is xn, without noise added. For the second term in the
equation above, we can obtain :
Eϵ[(
D
i=1
wiϵi)2]=Eϵ[
D
i=1
D
j=1
wiwjϵiϵj]=
D
i=1
D
j=1
wiwjEϵ[ϵiϵj]=σ2D
i=1
D
j=1
wiwjδi j
Which gives
Eϵ[(
D
i=1
wiϵi)2]=σ2D
i=1
w2
i
For the third term, we can obtain:
Eϵ[2D
i=1
wiϵiy(xn,w)tn]=2y(xn,w)tnEϵ[
D
i=1
wiϵi]
=2y(xn,w)tnD
i=1
Eϵ[wiϵi]
=0
Therefore, if we calculate the expectation of ED(w) with respect to ϵ, we
can obtain:
Eϵ[ED(w)] =1
2
N
n=1y(xn,w)tn2+σ2
2
D
i=1
w2
i
Problem 3.5 Solution
We can firstly rewrite the constraint (3.30) as :
1
2M
j=1|wj|qη0
Where we deliberately introduce scaling factor 1/2 for convenience.Then
it is straightforward to obtain the Lagrange function.
L(w,λ)=1
2
N
n=1tnwTϕ(xn)2+λ
2M
j=1|wj|qη
It is obvious that L(w,λ) and (3.29) has the same dependence on w. Mean-
while, if we denote the optimal wthat can minimize L(w,λ) as w(λ), we can
see that
η=
M
j=1|w
j|q
Problem 3.6 Solution
Firstly, we write down the log likelihood function.
lnp(T|X,W,β)= −N
2ln|Σ|1
2
N
n=1tnWTϕ(xn)TΣ1tnWTϕ(xn)
Where we have already omitted the constant term. We set the derivative
of the equation above with respect to Wequals to zero.
0= −
N
n=1
Σ1[tnWTϕ(xn)ϕ(xn)T
Therefore, we can obtain similar result for Was (3.15). For Σ, comparing
with (2.118) – (2.124), we can easily write down a similar result :
Σ=1
N
N
n=1
[tnWT
MLϕ(xn)[tnWT
MLϕ(xn)T
We can see that the solutions for Wand Σare also decoupled.
Problem 3.7 Solution
Let's begin by writing down the prior distribution $p(\mathbf{w})$ and the likelihood function $p(\mathbf{t}|X,\mathbf{w},\beta)$:
$$p(\mathbf{w})=\mathcal{N}(\mathbf{w}|\mathbf{m}_0,S_0),\qquad p(\mathbf{t}|X,\mathbf{w},\beta)=\prod_{n=1}^{N}\mathcal{N}\big(t_n|\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n),\beta^{-1}\big)$$
Since the posterior PDF equals the product of the prior PDF and the likelihood function, up to a normalization constant, we mainly focus on the exponential term of the product:
$$\text{exponential term}=-\frac{\beta}{2}\sum_{n=1}^{N}\big(t_n-\mathbf{w}^T\boldsymbol{\phi}(\mathbf{x}_n)\big)^2-\frac{1}{2}(\mathbf{w}-\mathbf{m}_0)^TS_0^{-1}(\mathbf{w}-\mathbf{m}_0)$$
$$=-\frac{1}{2}\mathbf{w}^T\Big[\sum_{n=1}^{N}\beta\boldsymbol{\phi}(\mathbf{x}_n)\boldsymbol{\phi}(\mathbf{x}_n)^T+S_0^{-1}\Big]\mathbf{w}+\Big[\mathbf{m}_0^TS_0^{-1}+\sum_{n=1}^{N}\beta t_n\boldsymbol{\phi}(\mathbf{x}_n)^T\Big]\mathbf{w}+\text{const}$$
Hence, by comparing the quadratic term with that of a standard Gaussian distribution, we obtain $S_N^{-1}=S_0^{-1}+\beta\Phi^T\Phi$. Comparing the linear term, we obtain:
$$\mathbf{m}_N^TS_N^{-1}=\mathbf{m}_0^TS_0^{-1}+\sum_{n=1}^{N}\beta t_n\boldsymbol{\phi}(\mathbf{x}_n)^T$$
Transposing both sides and multiplying by $S_N$ on the left, we can easily see that $\mathbf{m}_N=S_N(S_0^{-1}\mathbf{m}_0+\beta\Phi^T\mathbf{t})$.
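These update formulas are straightforward to implement. Below is a minimal sketch of the posterior computation, assuming NumPy; the basis functions (simple polynomials via a Vandermonde matrix), the prior and the synthetic data are all arbitrary illustrative choices, not part of the original solution.

    import numpy as np

    rng = np.random.default_rng(9)
    beta = 25.0                                    # known noise precision
    N, M = 30, 3
    x = rng.uniform(-1, 1, size=N)
    t = 0.5 + 2.0 * x - 1.0 * x**2 + rng.normal(scale=1/np.sqrt(beta), size=N)

    Phi = np.vander(x, M, increasing=True)         # design matrix: columns 1, x, x^2
    m0, S0 = np.zeros(M), 10.0 * np.eye(M)         # broad Gaussian prior

    S0_inv = np.linalg.inv(S0)
    SN_inv = S0_inv + beta * Phi.T @ Phi           # S_N^{-1} = S_0^{-1} + beta Phi^T Phi
    SN = np.linalg.inv(SN_inv)
    mN = SN @ (S0_inv @ m0 + beta * Phi.T @ t)     # m_N = S_N (S_0^{-1} m_0 + beta Phi^T t)

    print(np.round(mN, 2))                         # roughly recovers [0.5, 2.0, -1.0]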
Problem 3.8 Solution
Firstly, we write down the prior :
p(w)=N(mN,SN)
Where mN,SNare given by (3.50) and (3.51). And if now we observe
another sample (XN+1,tN+1), we can write down the likelihood function :
p(tN+1|xN+1,w)=N(tN+1|y(xN+1,w),β1)
Since the posterior equals to the production of likelihood function and the
prior, up to a constant, we focus on the exponential term.
exponential term =(wmN)TS1
N(wmN)+β(tN+1wTϕ(xN+1))2
=wTSN1+βϕ(xN+1)ϕ(xN+1)Tw
2wTS1
NmN+βϕ(xN+1)tN+1
+const
Therefore, after observing (XN+1,tN+1), we have p(w)=N(mN+1,SN+1),
where we have defined:
S1
N+1=S1
N+βϕ(xN+1)ϕ(xN+1)T
And
mN+1=SN+1S1
NmN+βϕ(xN+1)tN+1
Problem 3.9 Solution
We know that the prior p(w) can be written as:
p(w)=N(mN,SN)
And the likelihood function p(tN+1|xN+1,w) can be written as:
p(tN+1|xN+1,w)=N(tN+1|y(xN+1,w),β1)
According to the fact that y(xN+1,w)=wTϕ(xN+1)=ϕ(xN+1)Tw, the
likelihood can be further written as:
p(tN+1|xN+1,w)=N(tN+1|(ϕ(xN+1)Tw,β1)
Then we take advantage of (2.113), (2.114) and (2.116), which gives:
p(w|xN+1,tN+1)=N(Σϕ(xN+1)βtN+1+SN1mN,Σ)
Where Σ=(SN1+ϕ(xN+1)βϕ(xN+1)T)1, and we can see that the result
is exactly the same as the one we obtained in the previous problem.
Problem 3.10 Solution
We have already known:
p(t|w,β)=N(t|y(x,w),β1)
And
p(w|t,α,β)=N(w|mN,SN)
Where mN,SNare given by (3.53) and (3.54). As what we do in previous
problem, we can rewrite p(t|w,β) as:
p(t|w,β)=N(t|ϕ(x)Tw,β1)
And then we take advantage of (2.113), (2.114) and (2.115), we can obtain:
p(t|t,α,β)=N(ϕ(x)TmN,β1+ϕ(x)TSNϕ(x))
Which is exactly the same as (3.58), if we notice that
ϕ(x)TmN=mNTϕ(x)
Problem 3.11 Solution
We need to use the result obtained in Prob.3.8. In Prob.3.8, we have de-
rived a formula for S1
N+1:
S1
N+1=S1
N+βϕ(xN+1)ϕ(xN+1)T
And then using (3.110), we can obtain :
SN+1=SN1+βϕ(xN+1)ϕ(xN+1)T1
=SN1+βϕ(xN+1)βϕ(xN+1)T1
=SNSN(βϕ(xN+1))(βϕ(xN+1)T)SN
1+(βϕ(xN+1)T)SN(βϕ(xN+1))
=SNβSNϕ(xN+1)ϕ(xN+1)TSN
1+βϕ(xN+1)TSNϕ(xN+1)
Now we calculate σ2
N(x)σ2
N+1(x) according to (3.59).
σ2
N(x)σ2
N+1(x)=ϕ(x)T(SNSN+1)ϕ(x)
=ϕ(x)TβSNϕ(xN+1)ϕ(xN+1)TSN
1+βϕ(xN+1)TSNϕ(xN+1)ϕ(x)
=ϕ(x)TSNϕ(xN+1)ϕ(xN+1)TSNϕ(x)
1/β+ϕ(xN+1)TSNϕ(xN+1)
=ϕ(x)TSNϕ(xN+1)2
1/β+ϕ(xN+1)TSNϕ(xN+1)()
And since SNis positive definite, () is larger than 0. Therefore, we have
proved that σ2
N(x)σ2
N+1(x)0
Problem 3.12 Solution
Let’s begin by writing down the prior PDF p(w,β):
p(w,β)=N(w|m0,β1S0) Gam(β|a0,b0) ()
(β
|S0|)2exp(1
2(wm0)TβS1
0(wm0)) ba0
0βa01exp(b0β)
And then we write down the likelihood function p(t|X,w,β) :
p(t|X,w,β)=
N
n=1
N(tn|wTϕ(xn),β1)
N
n=1
β1/2 expβ
2(tnwTϕ(xn))2(∗∗)
According to Bayesian Inference, we have p(w,β|t)p(t|X,w,β)×p(w,β).
We first focus on the quadratic term with regard to win the exponent.
quadratic term = −β
2wTS01w+
N
n=1β
2wTϕ(xn)ϕ(xn)Tw
= −β
2wTS01+
N
n=1
ϕ(xn)ϕ(xn)Tw
Where the first term is generated by (), and the second by (∗∗). By now,
we know that:
SN1=S01+
N
n=1
ϕ(xn)ϕ(xn)T
We then focus on the linear term with regard to win the exponent.
linear term =βm0TS01w+
N
n=1
βtnϕ(xn)Tw
=βm0TS01+
N
n=1
tnϕ(xn)Tw
Again, the first term is generated by (), and the second by (∗∗). We can
also obtain:
mNTSN1=m0TS01+
N
n=1
tnϕ(xn)T
Which gives:
mN=SNS01m0+
N
n=1
tnϕ(xn)
Then we focus on the constant term with regard to win the exponent.
constant term =(β
2m0TS01m0b0β)β
2
N
n=1
t2
n
= −β1
2m0TS01m0+b0+1
2
N
n=1
t2
n
Therefore, we can obtain:
1
2mNTSN1mN+bN=1
2m0TS01m0+b0+1
2
N
n=1
t2
n
Which gives :
bN=1
2m0TS01m0+b0+1
2
N
n=1
t2
n1
2mNTSN1mN
Finally, we focus on the exponential term whose base is β.
exponent term =(2 +a01) +N
2
Which gives:
2+aN1=(2 +a01) +N
2
Hence,
aN=a0+N
2
Problem 3.13 Solution(Waiting for update)
Similar to (3.57), we write down the expression of the predictive distribu-
tion p(t|X,t):
p(t|X,t)=  p(t|w,β)p(w,β|X,t)dwdβ()
We know that:
p(t|w,β)=N(t|y(x,w),β1)=N(t|ϕ(x)Tw,β1)
And that:
p(w,β|X,t)=N(w|mN,β1SN) Gam(β|aN,bN)
We go back to (), and we first deal with the integral with regard to w:
p(t|X,t)=N(t|ϕ(x)Tw,β1)N(w|mN,β1SN)dwGam(β|aN,bN)dβ
=N(t|ϕ(x)TmN,β1+ϕ(x)Tβ1SNϕ(x)) Gam(β|aN,bN)dβ
=Nt|ϕ(x)TmN,β1(1 +ϕ(x)TSNϕ(x))Gam(β|aN,bN)dβ
Where we have used (2.113), (2.114) and (2.115). Then, we compare the
expression above with (2.160), we can see that p(t|X,t)=St(t|µ,λ,v), where
we have defined:
µ=ϕ(x)TmN,λ=1+ϕ(x)TSNϕ(x)1,v=2aN
Problem 3.14 Solution(Wait for updating)
Firstly, according to (3.16), if we use the new orthonormal basis set spec-
ified in the problem to construct Φ, we can obtain an important property:
ΦTΦ=I. Hence, if α=0, together with (3.54), we know that SN=1/β.
Finally, according to (3.62), we can obtain:
k(x,x)=βψ(x)TSNψ(x)=ψ(x)Tψ(x)
Problem 3.15 Solution
It is quite obvious if we substitute (3.92) and (3.95) into (3.82), which
gives,
E(mN)=β
2||tΦmN||2+α
2mNTmN=Nγ
2+γ
2=N
2
Problem 3.16 Solution(Waiting for update)
We know that
p(t|w,β)=
N
n=1
N(ϕ(xn)Tw,β1)N(Φw,β1I)
And
p(w|α)=N(0,α1I)
Comparing them with (2.113), (2.114) and (2.115), we can obtain:
p(t|α,β)=N(0,β1I+α1ΦΦT)
Problem 3.17 Solution
We know that:
p(t|w,β)=
N
n=1
N(ϕ(xn)Tw,β1)
=
N
n=1
1
(2πβ1)1/2 exp1
2β1(tnϕ(xn)Tw)2
=(β
2π)N/2 expN
n=1β
2(tnϕ(xn)Tw)2
=(β
2π)N/2 expβ
2||tΦw||2
And that:
p(w|α)=N(0,α1I)
=αM/2
(2π)M/2 expα
2||w||2
If we substitute the expressions above into (3.77), we can obtain (3.78)
just as required.
Problem 3.18 Solution
We expand (3.79) as follows:
E(w)=β
2||tΦw||2+α
2wTw
=β
2(tTt2tTΦw+wTΦTΦw)+α
2wTw
=1
2wT(βΦTΦ+αI)w2βtTΦw+βtTt
Observing the equation above, we see that E(w) contains the following
term : 1
2(wmN)TA(wmN) ()
Now, we need to solve Aand mN. We expand () and obtain:
()=1
2(wTAw2mNTAw+mNTAmN)
We firstly compare the quadratic term, which gives:
A=βΦTΦ+αI
And then we compare the linear term, which gives:
mNTA=βtTΦ
Noticing that A=AT, which implies A1is also symmetric, we first trans-
pose and then multiply A1on both sides, which gives:
mN=βA1ΦTt
Now we rewrite E(w):
E(w)=1
2wT(βΦTΦ+αI)w2βtTΦw+βtTt
=1
2(wmN)TA(wmN)+βtTtmNTAmN
=1
2(wmN)TA(wmN)+1
2(βtTtmNTAmN)
=1
2(wmN)TA(wmN)+1
2(βtTt2mNTAmN+mNTAmN)
=1
2(wmN)TA(wmN)+1
2(βtTt2mNTAmN+mNT(βΦTΦ+αI)mN)
=1
2(wmN)TA(wmN)+1
2βtTt2βtTΦmN+mNT(βΦTΦ)mN+α
2mNTmN
=1
2(wmN)TA(wmN)+β
2||tΦmN||2+α
2mNTmN
Just as required.
Problem 3.19 Solution
Based on the standard form of a multivariate normal distribution, we
know that
1
(2π)M/2
1
|A|1/2 exp1
2(wmN)TA(wmN)dw=1
Hence,
exp1
2(wmN)TA(wmN)dw=(2π)M/2|A|1/2
And since E(mN) doesn’t depend on w, (3.85) is quite obvious. Then we
substitute (3.85) into (3.78), which will immediately gives (3.86).
Problem 3.20 Solution
One can simply follow the steps from (3.87) to (3.92) in the main text, which are already quite clear.
Problem 3.21 Solution
Let’s first prove (3.117). According to (C.47) and (C.48), we know that if A
is a M×Mreal symmetric matrix, with eigenvalues λi,i=1,2, ..., M,|A|and
Tr(A) can be written as:
|A|=
M
i=1
λi,Tr(A)=
M
i=1
λi
Back to this problem, according to section 3.5.2, we know that Ahas
eigenvalues α+λi,i=1,2,..., M. Hence the left side of (3.117) equals to:
left side =d
dαlnM
i=1
(α+λi)=
M
i=1
d
dαln(α+λi)=
M
i=1
1
α+λi
And according to (3.81), we can obtain:
A1d
dαA=A1I=A1
For the symmetric matrix A, its inverse A1has eigenvalues 1/(α+λi),i=
1,2,..., M. Therefore,
Tr(A1d
dαA)=
M
i=1
1
α+λi
Hence there are the same, and (3.92) is quite obvious.
Problem 3.22 Solution
Let’s derive (3.86) with regard to β. The first term dependent on βin
(3.86) is :
d
dβ(N
2lnβ)=N
2β
The second term is :
d
dβE(mN)=1
2||tΦmN||2+β
2
d
dβ||tΦmN||2+d
dβ
α
2mNTmN
The last two terms in the equation above can be further written as:
β
2
d
dβ||tΦmN||2+d
dβ
α
2mNTmN=β
2
d
dmN||tΦmN||2+d
dmN
α
2mNTmN·dmN
dβ
=β
2[2ΦT(tΦmN)] +α
22mN·dmN
dβ
=βΦT(tΦmN)+αmN·dmN
dβ
=βΦTt+(αI+βΦTΦ)mN·dmN
dβ
=βΦTt+AmN·dmN
dβ
=0
Where we have taken advantage of (3.83) and (3.84). Hence
d
dβE(mN)=1
2||tΦmN||2=1
2
N
n=1
(tnmNTϕ(xn))2
The last term dependent on βin (3.86) is:
d
dβ(1
2ln|A|)=γ
2β
Therefore, if we combine all those expressions together, we will obtain
(3.94). And then if we rearrange it, we will obtain (3.95).
Problem 3.23 Solution
First, according to (3.10), we know that p(t|X,w,β) can be further written
as p(t|X,w,β)=N(t|Φw,β1I), and given that p(w|β)=N(m0,β1S0) and
p(β)=Gam(β|a0,b0). Therefore, we just follow the hint in the problem.
p(t)=  p(t|X,w,β)p(w|β)dwp(β)dβ
= (β
2π)N/2 expβ
2(tΦw)T(tΦw)·
(β
2π)M/2|S0|1/2 expβ
2(wm0)TS01(wm0)dw
Γ(a0)1ba0
0βa01exp(b0β)dβ
=ba0
0
(2π)(M+N)/2|S0|1/2   expβ
2(tΦw)T(tΦw)
expβ
2(wm0)TS01(wm0)dw
βa01+N/2+M/2 exp(b0β)dβ
=ba0
0
(2π)(M+N)/2|S0|1/2   expβ
2(wmN)TSN1(wmN)dw
expβ
2(tTt+m0TS01m0mNTSN1mN)
βaN1+M/2 exp(b0β)dβ
Where we have defined
mN=SN(S01m0+ΦTt)
SN1=S01+ΦTΦ
aN=a0+N
2
bN=b0+1
2(m0TS01m0mNTSN1mN+
N
n=1
t2
n)
Which are exactly the same as those in Prob.3.12, and then we evaluate
the integral, taking advantage of the normalized property of multivariate
Gaussian Distribution and Gamma Distribution.
p(t)=ba0
0
(2π)(M+N)/2|S0|1/2 (2π
β)M/2|SN|1/2 βaN1+M/2 exp(bNβ)dβ
=ba0
0
(2π)(M+N)/2|S0|1/2 (2π)M/2|SN|1/2 βaN1exp(bNβ)dβ
=1
(2π)N/2 |SN|1/2
|S0|1/2
ba0
0
baN
N
Γ(aN)
Γ(bN)
Just as required.
Problem 3.24 Solution
Let’s just follow the hint and we begin by writing down expression for the
likelihood, prior and posterior PDF. We know that p(t|w,β)=N(t|Φw,β1I).
What’s more, the form of the prior and posterior are quite similar:
p(w,β)=N(w|m0,β1S0) Gam(β|a0,b0)
And
p(w,β|t)=N(w|mN,β1SN) Gam(β|aN,bN)
Where the relationships among those parameters are shown in Prob.3.12,
Prob.3.23. Now according to (3.119), we can write:
p(t)=N(t|Φw,β1I)N(w|m0,β1S0) Gam(β|a0,b0)
N(w|mN,β1SN) Gam(β|aN,bN)
=N(t|Φw,β1I)N(w|m0,β1S0)
N(w|mN,β1SN)
ba0
0βa01exp(b0β) / Γ(a0)
baN
NβaN1exp(bNβ) / Γ(aN)
=N(t|Φw,β1I)N(w|m0,β1S0)
N(w|mN,β1SN)
ba0
0
baN
N
Γ(aN)
Γ(a0)βa0aNexp(b0bN)β
=N(t|Φw,β1I)N(w|m0,β1S0)
N(w|mN,β1SN)exp(b0bN)βba0
0
baN
N
Γ(aN)
Γ(a0)βN/2
Where we have used aN=a0+N
2. Now we deal with the terms expressed
in the form of Gaussian Distribution:
Gaussian terms =N(t|Φw,β1I)N(w|m0,β1S0)
N(w|mN,β1SN)
=(β
2π)N/2 expβ
2(tΦw)T(tΦw)·
|β1SN|1/2
|β1S0|1/2
expβ
2(wm0)TS01(wm0)
expβ
2(wmN)TSN1(wmN)
=(β
2π)N/2 |SN|1/2
|S0|1/2 expβ
2(tΦw)T(tΦw)·
expβ
2(wm0)TS01(wm0)
expβ
2(wmN)TSN1(wmN)
We look back to the previous problem and we notice that at the last step
in the deduction of p(t), we complete the square according to w. And if we
carefully compare the left and right side at the last step, we can obtain :
expβ
2(tΦw)T(tΦw)expβ
2(wm0)TS01(wm0)
=expβ
2(wmN)TSN1(wmN)exp(bNb0)β
Hence, we go back to deal with the Gaussian terms:
Gaussian terms =(β
2π)N/2 |SN|1/2
|S0|1/2 exp(bNb0)β
If we substitute the expressions above into p(t), we will obtain (3.118)
immediately.
0.4 Linear Models for Classification
Problem 4.1 Solution
If the convex hull of {xn}and {yn}intersects, we know that there will be a
point zwhich can be written as z=nαnxnand also z=nβnyn. Hence we
can obtain:
wTz+w0=
wT(
n
αnxn)+w0
=(
n
αn
wTxn)+(
n
αn)w0
=
n
αn(
wTxn+w0) ()
Where we have used nαn=1. And if {xn}and {yn}are linearly separa-
ble, we have
wTxn+w0>0 and
wTyn+w0<0, for xn,yn. Together with
αn0 and (), we know that
wTz+w0>0. And if we calculate
wTz+w0
from the perspective of {yn}following the same procedure, we can obtain
wTz+w0<0. Hence contradictory occurs. In other words, they are not lin-
early separable if their convex hulls intersect.
Now let’s assume they are linearly separable and try to prove their convex
hulls don’t intersect. This is obvious. We can obtain
wTxn+w0>0 and
wTyn+w0<0, for xn,yn, if the two sets are linearly separable. And if there
is a point z, which belongs to both convex hulls of {xn}and {yn}, following the
same procedure as above, we can see that
wTz+w0>0 from the perspective
of {xn}and
wTz+w0<0 from the perspective of {yn}, which directly leads to
a contradictory. Therefore, the convex hulls don’t intersect.
Problem 4.2 Solution
Let’s make the dependency of ED(
W) on w0explicitly:
ED(
W)=1
2Tr(XW +1w0TT)T(XW +1w0TT)
Then we calculate the derivative of ED(
W) with respect to w0:
ED(
W)
w0=2Nw0+2(XW T)T1
Where we have used the property:
XTr(AXB +C)(AXB +C)T=2AT(AXB +C)BT
We set the derivative equals to 0, which gives:
w0= − 1
N(XW T)T1=¯
tWT¯
x
Where we have denoted:
¯
t=1
NTT1,and ¯
x=1
NXT1
If we substitute the equations above into ED(
W), we can obtain:
ED(
W)=1
2Tr(XW +¯
T¯
XW T)T(XW +¯
T¯
XW T)
Where we further denote
¯
T=1¯
tT,and ¯
X=1¯
xT
Then we set the derivative of ED(
W) with regard to Wto 0, which gives:
W=
X
T
Where we have defined:
X=X¯
X,and
T=T¯
T
Now consider the prediction for a new given x, we have:
y(x)=WTx+w0
=WTx+¯
tWT¯
x
=¯
t+WT(x¯
x)
If we know that aTtn+b=0 holds for some aand b, we can obtain:
aT¯
t=1
NaTTT1=1
N
N
n=1
aTtn= −b
Therefore,
aTy(x)=aT¯
t+WT(x¯
x)
=aT¯
t+aTWT(x¯
x)
= −b+aT
TT(
X)T(x¯
x)
= −b
Where we have used:
aT
TT=aT(T¯
T)T=aT(T1
N11TT)T
=aTTT1
NaTTT11T= −b1T+b1T
=0T
Problem 4.3 Solution
Suppose there are Qconstraints in total. We can write aqTtn+bq=0,q=
1,2,...,Qfor all the target vector tn,n=1,2..., N. Or alternatively, we can
group them together:
ATtn+b=0
Where Ais a Q×Qmatrix, and the qth column of Ais aq, and mean-
while bis a Q×1 column vector, and the qth element is bq. for every pair
of {aq,bq}we can follow the same procedure in the previous problem to show
that aqy(x)+bq=0. In other words, the proofs will not affect each other.
Therefore, it is obvious :
ATy(x)+b=0
Problem 4.4 Solution
We use Lagrange multiplier to enforce the constraint wTw=1. We now
need to maximize :
L(λ,w)=wT(m2m1)+λ(wTw1)
We calculate the derivatives:
L(λ,w)
∂λ =wTw1
And L(λ,w)
w=m2m1+2λw
We set the derivatives above equals to 0, which gives:
w= − 1
2λ(m2m1)(m2m1)
Problem 4.5 Solution
We expand (4.25) using (4.22), (4.23) and (4.24).
J(w)=(m2m1)2
s2
1+s2
2
=||wT(m2m1)||2
nC1(wTxnm1)2+nC2(wTxnm2)2
The numerator can be further written as:
numerator =wT(m2m1)wT(m2m1)T=wTSBw
Where we have defined:
SB=(m2m1)(m2m1)T
And ti is the same for the denominator:
denominator =
nC1
[wT(xnm1)]2+
nC2
[wT(xnm2)]2
=wTSw1w+wTSw2w
=wTSww
Where we have defined:
Sw=
nC1
(xnm1)(xnm1)T+
nC2
(xnm2)(xnm2)T
Just as required.
Problem 4.6 Solution
Let’s follow the hint, beginning by expanding (4.33).
(4.33) =
N
n=1
wTxnxn+w0
N
n=1
xn
N
n=1
tnxn
=
N
n=1
xnxnTwwTm
N
n=1
xn(
nC1
tnxn+
nC2
tnxn)
=
N
n=1
xnxnTwwTm·(Nm)(
nC1
N
N1
xn+
nC2
N
N2
xn)
=
N
n=1
xnxnTwNwTmm N(
nC1
1
N1
xn
nC2
1
N2
xn)
=
N
n=1
xnxnTwNmmTwN(m1m2)
=[
N
n=1
(xnxnT)NmmT]wN(m1m2)
If we let the derivative equal to 0, we will see that:
[
N
n=1
(xnxnT)NmmT]w=N(m1m2)
Therefore, now we need to prove:
N
n=1
(xnxnT)NmmT=Sw+N1N2
NSB
Let’s expand the left side of the equation above:
left =
N
n=1
xnxnTN(N1
Nm1+N2
Nm2)2
=
N
n=1
xnxnTN(N2
1
N2||m1||2+N2
2
N2||m2||2+2N1N2
N2m1m2T)
=
N
n=1
xnxnTN2
1
N||m1||2N2
2
N||m2||22N1N2
Nm1m2T
=
N
n=1
xnxnT+(N1+N1N2
N2N1)||m1||2+(N2+N1N2
N2N2)||m2||22N1N2
Nm1m2T
=
N
n=1
xnxnT+(N12N1)||m1||2+(N22N2)||m2||2+N1N2
N||m1m2||2
=
N
n=1
xnxnT+N1||m1||22m1·(N1m1T)+N2||m2||22m2·(N2m2T)+N1N2
NSB
=
N
n=1
xnxnT+N1||m1||22m1
nC1
xT
n+N2||m2||22m2
nC2
xT
n+N1N2
NSB
=
nC1
xnxnT+N1||m1||22m1
nC1
xT
n
+
nC2
xnxnT+N2||m2||22m2
nC2
xT
n+N1N2
NSB
=
nC1
(xnxnT+||m1||22m1xT
n)+
nC2
(xnxnT+||m2||22m2xnT)+N1N2
NSB
=
nC1||xnm1||2+
nC2||xnm2||2+N1N2
NSB
=Sw+N1N2
NSB
Just as required.
Problem 4.7 Solution
This problem is quite simple. We can solve it by definition. We know that
logistic sigmoid function has the form:
σ(a)=1
1+exp(a)
Therefore, we can obtain:
σ(a)+σ(a)=1
1+exp(a)+1
1+exp(a)
=2+exp(a)+exp(a)
[1 +exp(a)][1 +exp(a)]
=2+exp(a)+exp(a)
2+exp(a)+exp(a)=1
Next we exchange the dependent and independent variables to obtain its
inverse.
a=1
1+exp(y)
We first rearrange the equation above, which gives:
exp(y)=1a
a
Then we calculate the logarithm for both sides, which gives:
y=ln( a
1a)
Just as required.
Problem 4.8 Solution
According to (4.58) and (4.64), we can write:
a=ln p(x|C1)p(C1)
p(x|C2)p(C2)
=ln p(x|C1)ln p(x|C2)+ln p(C1)
p(C2)
= −1
2(xµ1)TΣ1(xµ1)+1
2(xµ2)TΣ1(xµ2)+ln p(C1)
p(C2)
=Σ1(µ1µ2)x1
2µ1TΣ1µ1+1
2µ2TΣ1µ2+ln p(C1)
p(C2)
=wTx+w0
Where in the last second step, we rearrange the term according to x, i.e.,
its quadratic, linear, constant term. We have also defined :
w=Σ1(µ1µ2)
And
w0= −1
2µ1TΣ1µ1+1
2µ2TΣ1µ2+ln p(C1)
p(C2)
Finally, since p(C1|x)=σ(a) as stated in (4.57), we have p(C1|x)=σ(wTx+
w0) just as required.
Problem 4.9 Solution
We begin by writing down the likelihood function.
p({ϕn,tn}|π1,π2, ...,πK)=
N
n=1
K
k=1
[p(ϕn|Ck)p(Ck)]tnk
=
N
n=1
K
k=1
[πkp(ϕn|Ck)]tnk
Hence we can obtain the expression for the logarithm likelihood:
ln p=
N
n=1
K
k=1
tnk lnπk+ln p(ϕn|Ck)
N
n=1
K
k=1
tnk lnπk
Since there is a constraint on πk, so we need to add a Lagrange Multiplier
to the expression, which becomes:
L=
N
n=1
K
k=1
tnk lnπk+λ(
K
k=1
πk1)
We calculate the derivative of the expression above with regard to πk:
L
∂πk=
N
n=1
tnk
πk+λ
And if we set the derivative equal to 0, we can obtain:
πk= −(
N
n=1
tnk) / λ= −Nk
λ()
And if we preform summation on both sides with regard to k, we can see
that:
1= −(
K
k=1
Nk) / λ= −N
λ
Which gives λ= −N, and substitute it into (), we can obtain πk=Nk/N.
Problem 4.10 Solution
This time, we focus on the term which dependent on µkand Σin the
logarithm likelihood.
ln p=
N
n=1
K
k=1
tnk lnπk+ln p(ϕn|Ck)
N
n=1
K
k=1
tnk ln p(ϕn|Ck)
Provided p(ϕ|Ck)=N(ϕ|µk,Σ), we can further derive:
ln p
N
n=1
K
k=1
tnk 1
2ln|Σ|1
2(ϕnµk)Σ1(ϕnµk)T
We first calculate the derivative of the expression above with regard to
µk:
ln p
µk=
N
n=1
tnkΣ1(ϕnµk)
We set the derivative equals to 0, which gives:
N
n=1
tnkΣ1ϕn=
N
n=1
tnkΣ1µk=NkΣ1µk
Therefore, if we multiply both sides by Σ/Nk, we will obtain (4.161). Now
let’s calculate the derivative of ln pwith regard to Σ, which gives:
ln p
Σ=
N
n=1
K
k=1
tnk (1
2Σ1)1
2
Σ
N
n=1
K
k=1
tnk(ϕnµk)Σ1(ϕnµk)T
=
N
n=1
K
k=1tnk
2Σ11
2
Σ
K
k=1
N
n=1
tnk(ϕnµk)Σ1(ϕnµk)T
=
N
n=11
2Σ11
2
Σ
K
k=1
NkTr(Σ1Sk)
= −N
2Σ1+1
2
K
k=1
NkΣ1SkΣ1
Where we have denoted
Sk=1
Nk
N
n=1
tnk(ϕnµk)(ϕnµk)T
Now we set the derivative equals to 0, and rearrange the equation, which
gives:
Σ=
K
k=1
Nk
NSk
Problem 4.11 Solution
Based on definition, we can write down
p(ϕ|Ck)=
M
m=1
L
l=1
µϕml
kml
Note that here only one of the value among ϕm1,ϕm2, ... ϕmL is 1, and the
others are all 0 because we have used a 1 of Lbinary coding scheme, and
also we have taken advantage of the assumption that the Mcomponents of
ϕare independent conditioned on the class Ck. We substitute the expression
above into (4.63), which gives:
ak=
M
m=1
L
l=1
ϕml µkml +ln p(Ck)
Hence it is obvious that akis a linear function of the components of ϕ.
Problem 4.12 Solution
Based on definition, i.e., (4.59), we know that logistic sigmoid has the
form:
σ(a)=1
1+exp(a)
Now, we calculate its derivative with regard to a.
dσ(a)
da =exp(a)
[ 1 +exp(a) ]2=exp(a)
1+exp(a)·1
1+exp(a)=[ 1 σ(a) ] ·σ(a)
Just as required.
Problem 4.13 Solution
Let's follow the hint.
$$\nabla E(\mathbf{w})=-\nabla\sum_{n=1}^{N}\big\{t_n\ln y_n+(1-t_n)\ln(1-y_n)\big\}$$
$$=-\sum_{n=1}^{N}\frac{d\big\{t_n\ln y_n+(1-t_n)\ln(1-y_n)\big\}}{dy_n}\,\frac{dy_n}{da_n}\,\frac{da_n}{d\mathbf{w}}$$
$$=-\sum_{n=1}^{N}\Big(\frac{t_n}{y_n}-\frac{1-t_n}{1-y_n}\Big)\,y_n(1-y_n)\,\boldsymbol{\phi}_n$$
$$=-\sum_{n=1}^{N}\frac{t_n-y_n}{y_n(1-y_n)}\,y_n(1-y_n)\,\boldsymbol{\phi}_n$$
$$=\sum_{n=1}^{N}(y_n-t_n)\,\boldsymbol{\phi}_n$$
where we have used $y_n=\sigma(a_n)$, $a_n=\mathbf{w}^T\boldsymbol{\phi}_n$, the chain rule and (4.88).
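The gradient $\nabla E=\sum_n(y_n-t_n)\boldsymbol{\phi}_n$ can be validated against a finite-difference approximation of the cross-entropy error. A minimal sketch assuming NumPy; the synthetic data, feature dimension and the test point $\mathbf{w}$ are arbitrary.

    import numpy as np

    def sigmoid(a):
        return 1.0 / (1.0 + np.exp(-a))

    def error(w, Phi, t):
        y = sigmoid(Phi @ w)
        return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

    rng = np.random.default_rng(10)
    N, M = 40, 3
    Phi = rng.normal(size=(N, M))
    t = (rng.uniform(size=N) < 0.5).astype(float)
    w = rng.normal(size=M)

    y = sigmoid(Phi @ w)
    grad_analytic = Phi.T @ (y - t)               # sum_n (y_n - t_n) phi_n

    eps = 1e-6
    grad_numeric = np.array([
        (error(w + eps * e, Phi, t) - error(w - eps * e, Phi, t)) / (2 * eps)
        for e in np.eye(M)])

    print(np.allclose(grad_analytic, grad_numeric, atol=1e-4))   # expect True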
Problem 4.14 Solution
According to definition, we know that if a dataset is linearly separable,
we can find w, for some points xn, we have wTϕ(xn)>0, and the others
wTϕ(xm)<0. Then the boundary is given by wTϕ(x) =0. Note that for any
point x0in the dataset, the value of wTϕ(x0)should either be positive or
negative, but it can not equal to 0.
Therefore, the maximum likelihood solution for logistic regression is triv-
ial. We suppose for those points xnbelonging to class C1, we have wTϕ(xn)>
0 and wTϕ(xm)<0 for those belonging to class C2. According to (4.87), if
|w|→∞, we have
p(C1|ϕ(xn)) =σ(wTϕ(xn)) 1
Where we have used wTϕ(xn)+∞. And since wTϕ(xm) −∞, we can
also obtain:
p(C2|ϕ(xm)) =1p(C1|ϕ(xm)) =1σ(wTϕ(xm)) 1
In other words, for the likelihood function (4.89), if we let $|\mathbf{w}|\to\infty$ and label all the points lying on one side of the boundary as class $C_1$ and those on the other side as class $C_2$, then every term in (4.89) can achieve its maximum value, i.e., 1, leading to the maximum of the likelihood. Hence, for a linearly separable dataset, the learning process will tend to drive $|\mathbf{w}|\to\infty$ while using the same linear boundary to label the data, which can cause severe over-fitting.
Problem 4.15 Solution(Waiting for update)
Since ynis the output of the logistic sigmoid function, we know that 0 <
yn<1 and hence yn(1 yn)>0. Then we use (4.97), for an arbitrary non-zero
real vector a̸=0, we have:
aTHa =aTN
n=1
yn(1 yn)ϕnϕT
na
=
N
n=1
yn(1 yn)(ϕT
na)T(ϕT
na)
=
N
n=1
yn(1 yn)b2
n
Where we have denoted bn=ϕT
na. What’s more, there should be at least
one of {b1,b2,..., bN}not equal to zero and then we can see that the expression
above is larger than 0 and hence His positive definite.
Otherwise, if all the bn=0, a=[a1,a2, ..., aM]Twill locate in the null
space of matrix ΦN×M. However, with regard to the rank-nullity theorem,
we know that Rank(Φ)+Nullity(Φ)=M, and we have already assumed that
those Mfeatures are independent, i.e., Rank(Φ)=M, which means there is
only 0in its null space. Therefore contradictory occurs.
Problem 4.16 Solution
We still denote yn=p(t=1|ϕn), and then we can write down the log
likelihood by replacing tnwith πnin (4.89) and (4.90).
ln p(t|w)=
N
n=1
{πnln yn+(1 πn) ln(1 yn)}
Problem 4.17 Solution
We should discuss in two situations separately, namely j=kand j̸= k.
When j̸= k, we have:
yk
aj=exp(ak)·exp(aj)
[jexp(aj)]2= −yk·yj
And when j=k, we have:
yk
ak=exp(ak)jexp(aj)exp(ak)exp(ak)
[jexp(aj)]2=yky2
k=yk(1 yk)
Therefore, we can obtain:
$$\frac{\partial y_k}{\partial a_j}=y_k(I_{kj}-y_j)$$
where $I_{kj}$ denotes the elements of the identity matrix.
Problem 4.18 Solution
We derive every term tnk ln ynk with regard to aj.
tnk ln ynk
wj=tnk ln ynk
ynk
ynk
aj
aj
wj
=tnk
1
ynk ·ynk(Ik j yn j)·ϕn
=tnk (Ik j yn j)ϕn
Where we have used (4.105) and (4.106). Next we perform summation
over nand k.
wjE= −
N
n=1
K
k=1
tnk (Ik j yn j)ϕn
=
N
n=1
K
k=1
tnk yn j ϕn
N
n=1
K
k=1
tnk Ik j ϕn
=
N
n=1(
K
k=1
tnk)yn j ϕn
N
n=1
tn jϕn
=
N
n=1
yn j ϕn
N
n=1
tn jϕn
=
N
n=1
(yn j tn j)ϕn
Where we have used the fact that for arbitrary n, we have K
k=1tnk =1.
Problem 4.19 Solution
We write down the log likelihood.
ln p(t|w)=
N
n=1tnln yn+(1 tn)ln(1 yn)
Therefore, we can obtain:
wln p=ln p
yn·yn
an·an
w
=
N
n=1
(tn
yn1tn
1yn
)Φ(an)ϕn
=
N
n=1
yntn
yn(1 yn)Φ(an)ϕn
Where we have used y=p(t=1|a)=Φ(a) and an=wTϕn. According to
(4.114), we can obtain:
Φ(a)=N(θ|0,1)θ=a=1
p2πexp(1
2a2)
Hence, we can obtain:
wln p=
N
n=1
yntn
yn(1 yn)
exp(a2
n
2)
p2πϕn
To calculate the Hessian Matrix, we need to first evaluate several deriva-
tives.
w{yntn
yn(1 yn)}=
yn
{yntn
yn(1 yn)}·yn
an·an
w
=yn(1 yn)(yntn)(1 2yn)
[yn(1 yn)]2Φ(an)ϕn
=y2
n+tn2yntn
y2
n(1 yn)2
exp(a2
n
2)
p2πϕn
And
w{exp(a2
n
2)
p2π}=
an
{exp(a2
n
2)
p2π}an
w
= an
p2πexp(a2
n
2)ϕn
Therefore, using the chain rule, we can obtain:
w{yntn
yn(1 yn)
exp(a2
n
2)
p2π}=
w{yntn
yn(1 yn)}exp(a2
n
2)
p2π+yntn
yn(1 yn)
w{exp(a2
n
2)
p2π}
=y2
n+tn2yntn
yn(1 yn)
exp(a2
n
2)
p2πan(yntn)exp(a2
n
2)
p2πyn(1 yn)ϕn
Finally if we perform summation over n, we can obtain the Hessian Ma-
trix:
H= ∇∇wln p
=
N
n=1
w{yntn
yn(1 yn)
exp(a2
n
2)
p2π}·ϕn
=
N
n=1y2
n+tn2yntn
yn(1 yn)
exp(a2
n
2)
p2πan(yntn)exp(a2
n
2)
p2πyn(1 yn)ϕnϕnT
Problem 4.20 Solution(waiting for update)
We know that the Hessian Matrix is of size MK ×MK, and the ( j,k)th
block with size M×Mis given by (4.110), where j,k=1,2, ..., K. Therefore,
we can obtain:
uTHu =
K
j=1
K
k=1
uT
jHj,kuk()
Where we use ukto denote the kth block vector of uwith size M×1, and
Hj,kto denote the ( j,k)th block matrix of Hwith size M×M. Then based on
(4.110), we further expand (4.110):
()=
K
j=1
K
k=1
uT
j{
N
n=1
ynk(Ik j yn j)ϕnϕnT}uk
=
K
j=1
K
k=1
N
n=1
uT
j{ynk(Ik j yn j)ϕnϕnT}uk
=
K
j=1
K
k=1
N
n=1
uT
j{ynk Ik j ϕnϕnT}uk+
K
j=1
K
k=1
N
n=1
uT
j{ynk yn j ϕnϕnT}uk
=
K
k=1
N
n=1
uT
k{ynk ϕnϕnT}uk+
K
j=1
K
k=1
N
n=1
yn juT
j{ϕnϕnT}ynkuk
Problem 4.21 Solution
It is quite obvious.
Φ(a)=a
−∞
N(θ|0,1) dθ
=1
2+a
0
N(θ|0,1) dθ
=1
2+a
0
N(θ|0,1) dθ
=1
2+1
p2πa
0
exp(θ2/2) dθ
=1
2+1
p2π
pπ
2a
0
2
pπexp(θ2/2) dθ
=1
2(1 +1
p2a
0
2
pπexp(θ2/2) dθ)
=1
21+1
p2er f (a)
Where we have used
0
−∞
N(θ|0,1) dθ=1
2
Problem 4.22 Solution
If we denote f(θ)=p(D|θ)p(θ), we can write:
p(D)=p(D|θ)p(θ)dθ=f(θ)dθ
=f(θM AP )(2π)M/2
|A|1/2
=p(D|θM AP )p(θM AP )(2π)M/2
|A|1/2
Where θM AP is the value of θat the mode of f(θ), Ais the Hessian Matrix
of ln f(θ) and we have also used (4.135). Therefore,
ln p(D)=ln p(D|θM AP )+ln p(θM AP )+M
2ln2π1
2ln|A|
Just as required.
Problem 4.23 Solution
According to (4.137), we can write:
ln p(D)=ln p(D|θM AP )+ln p(θM AP )+M
2ln2π1
2ln|A|
=ln p(D|θM AP )M
2ln2π1
2ln|V0| − 1
2(θM AP m)TV01(θM AP m)
+M
2ln2π1
2ln|A|
=ln p(D|θM AP )1
2ln|V0| − 1
2(θM AP m)TV01(θM AP m)1
2ln|A|
Where we have used the definition of the multivariate Gaussian Distri-
bution. Then, from (4.138), we can write:
A= −∇∇ln p(D|θM AP )p(θM AP )
= −∇∇ln p(D|θM AP )∇∇ln p(θM AP )
=H∇∇1
2(θM AP m)TV01(θM AP m)
=H+V01(θM AP m)
=H+V01
Where we have denoted H= −∇∇ln p(D|θM AP ). Therefore, the equation
above becomes:
ln p(D)=ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)1
2ln|V0|·|H+V1
0|
=ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)1
2ln|V0H+I|
ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)1
2ln|V0| − 1
2ln|H|
ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)1
2ln|H|+const
where we have used the property of determinants $|A|\cdot|B|=|AB|$, and the fact that the prior is broad, i.e., $I$ can be neglected relative to $V_0H$. What's more, since the prior is given in advance, we can view $V_0$ as constant. If the dataset is large, we can write:
H=
N
n=1
Hn=N
H
Where
H=1/NN
n=1Hn, and then
ln p(D)ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)1
2ln|H|+const
ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)1
2ln|N
H|+const
ln p(D|θM AP )1
2(θM AP m)TV01(θM AP m)M
2ln N1
2ln|
H|+const
ln p(D|θM AP )M
2ln N
This is because when N>>1, other terms can be neglected.
Problem 4.24 Solution(Waiting for updating)
Problem 4.25 Solution
We first need to obtain the expression for the first derivative of probit
function Φ(λa) with regard to a. According to (4.114), we can write down:
d
da
Φ(λa)=dΦ(λa)
d(λa)·dλa
da
=λ
p2πexp1
2(λa)2
Which further gives:
d
da
Φ(λa)a=0=λ
p2π
And for logistic sigmoid function, according to (4.88), we have
dσ
da =σ(1 σ)=0.5×0.5=1
4
Where we have used σ(0) =0.5. Let their derivatives at origin equals, we
have: λ
p2π=1
4
i.e., λ=p2π4. And hence λ2=π8 is obvious.
Problem 4.26 Solution
We will prove (4.152) in a simpler and more intuitive way. But firstly, we
need to prove a trivial yet useful statement: Suppose we have a random vari-
able satisfied normal distribution denoted as XN(X|µ,σ2), the probability
of Xxis P(Xx)=Φ(xµ
σ), and here xis a given real number. We can see
this by writing down the integral:
P(Xx)=x
−∞
1
p2πσ2exp1
2σ2(Xµ)2dX
=xµ
σ
−∞
1
p2πσ2exp(1
2γ2)σdγ
=xµ
σ
−∞
1
p2πexp(1
2γ2)dγ
=Φ(xµ
σ)
Where we have changed the variable X=µ+σγ. Now consider two ran-
dom variables XN(0,λ2) and YN(µ,σ2). We first calculate the condi-
tional probability P(XY|Y=a):
P(XY|Y=a)=P(Xa)=Φ(a0
λ1)=Φ(λa)
Together with Bayesian Formula, we can obtain:
P(XY)=+∞
−∞
P(XY|Y=a)pd f (Y=a)dY
=+∞
−∞
Φ(λa)N(a|µ,σ2)da
Where pd f (·) denotes the probability density function and we have also
used pd f (Y)=N(µ,σ2). What’s more, we know that XYshould also sat-
isfy normal distribution, with:
E[XY]=E[X]E[Y]=0µ= −µ
And
var[XY]=var[X]+var[Y]=λ2+σ2
Therefore, XYN(µ,λ2+σ2) and it follows that:
P(XY0) =Φ(0(µ)
pλ2+σ2)=Φ(µ
pλ2+σ2)
Since P(XY)=P(XY0), we obtain what have been required.
102
0.5 Neural Networks
Problem 5.1 Solution

Based on the definition of $\tanh(\cdot)$, we can obtain:
$$\tanh(a) = \frac{e^{a}-e^{-a}}{e^{a}+e^{-a}} = -1 + \frac{2e^{a}}{e^{a}+e^{-a}} = -1 + \frac{2}{1+e^{-2a}} = 2\sigma(2a)-1$$
If we have parameters $w_{ji}^{(1s)}, w_{j0}^{(1s)}$ and $w_{kj}^{(2s)}, w_{k0}^{(2s)}$ for a network whose hidden units use the logistic sigmoid as activation, and $w_{ji}^{(1t)}, w_{j0}^{(1t)}$ and $w_{kj}^{(2t)}, w_{k0}^{(2t)}$ for another one using $\tanh(\cdot)$, then for the network using $\tanh(\cdot)$ as activation we can write down the following expression by using (5.4):
$$a_k^{(t)} = \sum_{j=1}^{M}w_{kj}^{(2t)}\tanh(a_j^{(t)}) + w_{k0}^{(2t)} = \sum_{j=1}^{M}w_{kj}^{(2t)}\big[2\sigma(2a_j^{(t)})-1\big] + w_{k0}^{(2t)} = \sum_{j=1}^{M}2w_{kj}^{(2t)}\sigma(2a_j^{(t)}) - \sum_{j=1}^{M}w_{kj}^{(2t)} + w_{k0}^{(2t)}$$
What's more, we also have:
$$a_k^{(s)} = \sum_{j=1}^{M}w_{kj}^{(2s)}\sigma(a_j^{(s)}) + w_{k0}^{(2s)}$$
To make the two networks equivalent, i.e., $a_k^{(s)} = a_k^{(t)}$, we should make sure that:
$$a_j^{(s)} = 2a_j^{(t)},\qquad w_{kj}^{(2s)} = 2w_{kj}^{(2t)},\qquad w_{k0}^{(2s)} = -\sum_{j=1}^{M}w_{kj}^{(2t)} + w_{k0}^{(2t)}$$
Note that the first condition can be achieved by simply enforcing:
$$w_{ji}^{(1s)} = 2w_{ji}^{(1t)}\quad\text{and}\quad w_{j0}^{(1s)} = 2w_{j0}^{(1t)}$$
Therefore, these two networks are equivalent under a linear transformation.
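The weight mapping above can be checked numerically. The following sketch builds a toy one-hidden-layer network with arbitrary (randomly drawn) tanh-network weights, applies the transformation derived above, and confirms that the two networks produce identical outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# A toy network with tanh hidden units (weights chosen arbitrarily)
W1_t, b1_t = rng.normal(size=(3, 2)), rng.normal(size=3)   # first layer
W2_t, b2_t = rng.normal(size=(1, 3)), rng.normal(size=1)   # second layer

# Equivalent sigmoid network obtained from the mapping derived above
W1_s, b1_s = 2 * W1_t, 2 * b1_t
W2_s, b2_s = 2 * W2_t, b2_t - W2_t.sum(axis=1)

x = rng.normal(size=2)
y_tanh = W2_t @ np.tanh(W1_t @ x + b1_t) + b2_t
y_sig = W2_s @ sigmoid(W1_s @ x + b1_s) + b2_s
print(np.allclose(y_tanh, y_sig))   # True
```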
Problem 5.2 Solution

It is obvious. We write down the likelihood:
$$p(T|X,w) = \prod_{n=1}^{N}\mathcal{N}\big(t_n|y(x_n,w),\beta^{-1}I\big)$$
Taking the negative logarithm, we can obtain:
$$E(w,\beta) = -\ln p(T|X,w) = \frac{\beta}{2}\sum_{n=1}^{N}\big(y(x_n,w)-t_n\big)^T\big(y(x_n,w)-t_n\big) - \frac{NK}{2}\ln\beta + \text{const}$$
Here we have used const to denote the term independent of both $w$ and $\beta$. Note that here we have used the definition of the multivariate Gaussian distribution. What's more, we see that the covariance matrix $\beta^{-1}I$ and the weight parameter $w$ have decoupled, which is distinct from the next problem. We can first solve for $w_{ML}$ by minimizing the first term on the right of the equation above, or equivalently (5.11), i.e., imagining that $\beta$ is fixed. Then, setting the derivative of $E(w,\beta)$ with regard to $\beta$ to zero, we obtain (5.17) and hence $\beta_{ML}$.
Problem 5.3 Solution

Following the process in the previous question, we first write down the negative logarithm of the likelihood function:
$$E(w,\Sigma) = \frac{1}{2}\sum_{n=1}^{N}\big[y(x_n,w)-t_n\big]^T\Sigma^{-1}\big[y(x_n,w)-t_n\big] + \frac{N}{2}\ln|\Sigma| + \text{const}\qquad(*)$$
Note that here we have assumed $\Sigma$ is unknown and const denotes the term independent of both $w$ and $\Sigma$. In the first situation, if $\Sigma$ is fixed and known, the equation above reduces to:
$$E(w) = \frac{1}{2}\sum_{n=1}^{N}\big[y(x_n,w)-t_n\big]^T\Sigma^{-1}\big[y(x_n,w)-t_n\big] + \text{const}$$
We can simply solve for $w_{ML}$ by minimizing it. If $\Sigma$ is unknown, since $\Sigma$ appears in the first term on the right of $(*)$, solving for $w_{ML}$ will involve $\Sigma$. Note that in the previous problem, the main reason they can decouple is the independence assumption, i.e., $\Sigma$ reduces to $\beta^{-1}I$, so that we can bring $\beta$ to the front and view it as a fixed multiplying factor when solving for $w_{ML}$.
Problem 5.4 Solution

Based on (5.20), the conditional distribution of targets, considering mislabelling, given input $x$ and weight $w$ is:
$$p(t=1|x,w) = (1-\epsilon)\,p(t_r=1|x,w) + \epsilon\,p(t_r=0|x,w)$$
Note that here we use $t$ to denote the observed target label and $t_r$ to denote its real label, and that our network is aimed at predicting the real label $t_r$, not $t$, i.e., $p(t_r=1|x,w) = y(x,w)$; hence we see that:
$$p(t=1|x,w) = (1-\epsilon)\,y(x,w) + \epsilon\,\big(1-y(x,w)\big)\qquad(*)$$
Also, it is the same for $p(t=0|x,w)$:
$$p(t=0|x,w) = (1-\epsilon)\,\big(1-y(x,w)\big) + \epsilon\,y(x,w)\qquad(**)$$
Combining $(*)$ and $(**)$, we can obtain:
$$p(t|x,w) = (1-\epsilon)\,y^t(1-y)^{1-t} + \epsilon\,(1-y)^t y^{1-t}$$
where $y$ is short for $y(x,w)$. Therefore, taking the negative logarithm, we can obtain the error function:
$$E(w) = -\sum_{n=1}^{N}\ln\Big[(1-\epsilon)\,y_n^{t_n}(1-y_n)^{1-t_n} + \epsilon\,(1-y_n)^{t_n}y_n^{1-t_n}\Big]$$
When $\epsilon=0$, it is obvious that the equation above reduces to (5.21).
Problem 5.5 Solution

It is obvious by using (5.22).
$$E(w) = -\ln\prod_{n=1}^{N}p(t|x_n,w) = -\ln\prod_{n=1}^{N}\prod_{k=1}^{K}y_k(x_n,w)^{t_{nk}}\big[1-y_k(x_n,w)\big]^{1-t_{nk}}$$
$$= -\sum_{n=1}^{N}\sum_{k=1}^{K}\ln\Big\{y_k(x_n,w)^{t_{nk}}\big[1-y_k(x_n,w)\big]^{1-t_{nk}}\Big\} = -\sum_{n=1}^{N}\sum_{k=1}^{K}\Big[t_{nk}\ln y_{nk} + (1-t_{nk})\ln(1-y_{nk})\Big]$$
where we have denoted $y_{nk} = y_k(x_n,w)$.
Problem 5.6 Solution

We know that $y_k = \sigma(a_k)$, where $\sigma(\cdot)$ represents the logistic sigmoid function. Moreover, $d\sigma/da = \sigma(1-\sigma)$. Therefore:
$$\frac{\partial E(w)}{\partial a_k} = -\Big[t_k\frac{1}{y_k}y_k(1-y_k) - (1-t_k)\frac{1}{1-y_k}y_k(1-y_k)\Big] = -\big[t_k(1-y_k) - (1-t_k)y_k\big] = y_k - t_k$$
Just as required.
Problem 5.7 Solution

It is similar to the previous problem. First we denote $y_{kn} = y_k(x_n,w)$. If we use the softmax function as the activation for the output units, according to (4.106) we have:
$$\frac{\partial y_{kn}}{\partial a_j} = y_{kn}(I_{kj}-y_{jn})$$
Therefore,
$$\frac{\partial E(w)}{\partial a_j} = -\frac{\partial}{\partial a_j}\sum_{n=1}^{N}\sum_{k=1}^{K}t_{kn}\ln y_k(x_n,w) = -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{\partial}{\partial a_j}\big(t_{kn}\ln y_{kn}\big) = -\sum_{n=1}^{N}\sum_{k=1}^{K}\frac{t_{kn}}{y_{kn}}y_{kn}(I_{kj}-y_{jn})$$
$$= -\sum_{n=1}^{N}\sum_{k=1}^{K}\big(t_{kn}I_{kj} - t_{kn}y_{jn}\big) = -\sum_{n=1}^{N}t_{jn} + \sum_{n=1}^{N}y_{jn} = \sum_{n=1}^{N}\big(y_{jn}-t_{jn}\big)$$
where we have used the fact that $I_{kj}=1$ only when $k=j$ (and is $0$ otherwise), and that $\sum_{k=1}^{K}t_{kn} = 1$.
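The result that the gradient of the softmax cross-entropy with respect to the pre-activations is simply $y-t$ can be verified by finite differences. The following sketch uses an arbitrary one-hot target and random pre-activations:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def cross_entropy(a, t):
    return -np.sum(t * np.log(softmax(a)))

rng = np.random.default_rng(1)
a = rng.normal(size=4)
t = np.array([0.0, 1.0, 0.0, 0.0])          # one-hot target

eps = 1e-6
num_grad = np.array([(cross_entropy(a + eps * np.eye(4)[j], t)
                      - cross_entropy(a - eps * np.eye(4)[j], t)) / (2 * eps)
                     for j in range(4)])
print(np.allclose(num_grad, softmax(a) - t, atol=1e-6))   # True
```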
Problem 5.8 Solution

It is obvious based on the definition of tanh, i.e., (5.59).
$$\frac{d}{da}\tanh(a) = \frac{(e^{a}+e^{-a})(e^{a}+e^{-a}) - (e^{a}-e^{-a})(e^{a}-e^{-a})}{(e^{a}+e^{-a})^2} = 1 - \frac{(e^{a}-e^{-a})^2}{(e^{a}+e^{-a})^2} = 1 - \tanh(a)^2$$
Problem 5.9 Solution

We know that the logistic sigmoid function satisfies $\sigma(a)\in[0,1]$; therefore, if we perform the linear transformation $h(a)=2\sigma(a)-1$, we obtain a mapping function $h(a)$ from $(-\infty,+\infty)$ to $[-1,1]$. In this case, the conditional distribution of targets given inputs can be similarly written as:
$$p(t|x,w) = \Big[\frac{1+y(x,w)}{2}\Big]^{(1+t)/2}\Big[\frac{1-y(x,w)}{2}\Big]^{(1-t)/2}$$
where $\big(1+y(x,w)\big)/2$ represents the conditional probability $p(C_1|x)$. Since now $y(x,w)\in[-1,1]$, we also need to perform the linear transformation to make it satisfy the constraint for a probability. Then we can further obtain:
$$E(w) = -\sum_{n=1}^{N}\Big[\frac{1+t_n}{2}\ln\frac{1+y_n}{2} + \frac{1-t_n}{2}\ln\frac{1-y_n}{2}\Big] = -\frac{1}{2}\sum_{n=1}^{N}\Big[(1+t_n)\ln(1+y_n) + (1-t_n)\ln(1-y_n)\Big] + N\ln 2$$
Problem 5.10 Solution

It is obvious. Suppose $H$ is positive definite, i.e., (5.37) holds. We set $v$ equal to an eigenvector of $H$, i.e., $v=u_i$, which gives:
$$v^T H v = v^T(Hv) = u_i^T\lambda_i u_i = \lambda_i\|u_i\|^2$$
Therefore, every $\lambda_i$ should be positive. On the other hand, if all the eigenvalues $\lambda_i$ are positive, from (5.38) and (5.39) we see that $H$ is positive definite.
Problem 5.11 Solution

It is obvious. We follow (5.35) and then write the error function in the form of (5.36). To obtain the contour, we enforce $E(w)$ to equal a constant $C$:
$$E(w) = E(w^{\star}) + \frac{1}{2}\sum_{i}\lambda_i\alpha_i^2 = C$$
We rearrange the equation above and obtain:
$$\sum_{i}\lambda_i\alpha_i^2 = B$$
where $B = 2C - 2E(w^{\star})$ is a constant. Therefore, the contours of constant error are ellipses whose axes are aligned with the eigenvectors $u_i$ of the Hessian matrix $H$. The length of the $j$th axis is given by setting all $\alpha_i=0,\ i\neq j$:
$$\alpha_j = \sqrt{\frac{B}{\lambda_j}}$$
In other words, the length is inversely proportional to the square root of the corresponding eigenvalue $\lambda_j$.
Problem 5.12 Solution

If $H$ is positive definite, we know the second term on the right side of (5.32) will be positive for arbitrary $w$. Therefore, $E(w^{\star})$ is a local minimum. On the other hand, if $w^{\star}$ is a local minimum, we have
$$E(w^{\star}) - E(w) = -\frac{1}{2}(w-w^{\star})^T H(w-w^{\star}) < 0$$
In other words, for arbitrary $w$, $(w-w^{\star})^T H(w-w^{\star}) > 0$; according to the previous problem, this means $H$ is positive definite.
Problem 5.13 Solution
It is obvious. Suppose that there are Wadaptive parameters in the net-
work. Therefore, bhas Windependent parameters. Since His symmetric,
there should be W(W+1)/2 independent parameters in it. Therefore, there
are W+W(W+1)/2 =W(W+3)/2 parameters in total.
Problem 5.14 Solution

It is obvious. Since we have
$$E_n(w_{ji}+\epsilon) = E_n(w_{ji}) + \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)$$
and
$$E_n(w_{ji}-\epsilon) = E_n(w_{ji}) - \epsilon E_n'(w_{ji}) + \frac{\epsilon^2}{2}E_n''(w_{ji}) + O(\epsilon^3)$$
we combine these two equations, which gives
$$E_n(w_{ji}+\epsilon) - E_n(w_{ji}-\epsilon) = 2\epsilon E_n'(w_{ji}) + O(\epsilon^3)$$
Rearranging the equation above, we obtain what has been required.
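The $O(\epsilon^2)$ accuracy of central differences, versus $O(\epsilon)$ for forward differences, can be seen numerically. The sketch below uses an arbitrary smooth scalar function standing in for $E_n(w_{ji})$, with its exact derivative for comparison:

```python
import numpy as np

def E(w):
    # toy error surface standing in for E_n(w_ji); any smooth function works
    return np.sin(w) + 0.5 * w ** 2

def dE(w):
    return np.cos(w) + w          # exact derivative for comparison

w = 0.7
for eps in (1e-1, 1e-2, 1e-3):
    forward = (E(w + eps) - E(w)) / eps              # O(eps) error
    central = (E(w + eps) - E(w - eps)) / (2 * eps)  # O(eps^2) error
    print(eps, abs(forward - dE(w)), abs(central - dE(w)))
```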
Problem 5.15 Solution

It is obvious. The backpropagation formalism starts by performing a summation near the input, as shown in (5.73). By symmetry, the forward propagation formalism should start near the output:
$$J_{ki} = \frac{\partial y_k}{\partial x_i} = \frac{\partial h(a_k)}{\partial x_i} = h'(a_k)\frac{\partial a_k}{\partial x_i}\qquad(*)$$
where $h(\cdot)$ is the activation function at the output node $a_k$. Considering all the units $j$ which have links to unit $k$:
$$\frac{\partial a_k}{\partial x_i} = \sum_{j}\frac{\partial a_k}{\partial a_j}\frac{\partial a_j}{\partial x_i} = \sum_{j}w_{kj}h'(a_j)\frac{\partial a_j}{\partial x_i}\qquad(**)$$
where we have used:
$$a_k = \sum_{j}w_{kj}z_j,\qquad z_j = h(a_j)$$
It is similar for $\partial a_j/\partial x_i$. In this way we have obtained a recursive formula starting from the input nodes:
$$\frac{\partial a_l}{\partial x_i} = \begin{cases}w_{li}, & \text{if there is a link from input unit } i \text{ to } l\\ 0, & \text{if there is no link from input unit } i \text{ to } l\end{cases}$$
Using the recursive formula $(**)$ and then $(*)$, we can obtain the Jacobian matrix.
Problem 5.16 Solution

It is obvious. We begin by writing down the error function:
$$E = \frac{1}{2}\sum_{n=1}^{N}\|y_n-t_n\|^2 = \frac{1}{2}\sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})^2$$
where the subscript $m$ denotes the $m$th element of the vector. Then we can write down the Hessian matrix as before:
$$H = \nabla\nabla E = \sum_{n=1}^{N}\sum_{m=1}^{M}\nabla y_{n,m}\nabla y_{n,m} + \sum_{n=1}^{N}\sum_{m=1}^{M}(y_{n,m}-t_{n,m})\nabla\nabla y_{n,m}$$
Similarly, we now know that the Hessian matrix can be approximated as:
$$H \simeq \sum_{n=1}^{N}\sum_{m=1}^{M}b_{n,m}b_{n,m}^T$$
where we have defined $b_{n,m} = \nabla y_{n,m}$.
Problem 5.17 Solution

It is obvious.
$$\frac{\partial^2 E}{\partial w_r\partial w_s} = \frac{\partial}{\partial w_r}\Big[\frac{1}{2}\int\!\!\int 2(y-t)\frac{\partial y}{\partial w_s}\,p(x,t)\,dx\,dt\Big] = \int\!\!\int\Big[(y-t)\frac{\partial^2 y}{\partial w_r\partial w_s} + \frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\Big]p(x,t)\,dx\,dt$$
Since we know that
$$\int\!\!\int(y-t)\frac{\partial^2 y}{\partial w_r\partial w_s}\,p(x,t)\,dx\,dt = \int\!\!\int(y-t)\frac{\partial^2 y}{\partial w_r\partial w_s}\,p(t|x)p(x)\,dx\,dt = \int\frac{\partial^2 y}{\partial w_r\partial w_s}\Big[\int(y-t)p(t|x)\,dt\Big]p(x)\,dx = 0$$
Note that in the last step we have used $y = \int t\,p(t|x)\,dt$. Then we substitute it into the second derivative, which gives
$$\frac{\partial^2 E}{\partial w_r\partial w_s} = \int\!\!\int\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(x,t)\,dx\,dt = \int\frac{\partial y}{\partial w_s}\frac{\partial y}{\partial w_r}\,p(x)\,dx$$
Problem 5.18 Solution

By analogy with Section 5.3.2, we denote by $w_{ki}^{skip}$ the parameters corresponding to the skip-layer connections, i.e., those connecting input unit $i$ with output unit $k$. Note that the discussion in Section 5.3.2 is still correct and now we only need to obtain the derivative of the error function with respect to the additional parameters $w_{ki}^{skip}$:
$$\frac{\partial E_n}{\partial w_{ki}^{skip}} = \frac{\partial E_n}{\partial a_k}\frac{\partial a_k}{\partial w_{ki}^{skip}} = \delta_k x_i$$
where we have used $a_k = y_k$ due to the linear activation at the output unit, and:
$$y_k = \sum_{j=0}^{M}w_{kj}^{(2)}z_j + \sum_{i}w_{ki}^{skip}x_i$$
where the first term on the right side corresponds to the information conveyed from the hidden units to the outputs, and the second term corresponds to the information conveyed directly from the inputs to the outputs.
Problem 5.19 Solution

The error function is given by (5.21). Therefore, we can obtain:
$$\nabla E(w) = \sum_{n=1}^{N}\frac{\partial E}{\partial a_n}\nabla a_n = -\sum_{n=1}^{N}\frac{\partial}{\partial a_n}\Big[t_n\ln y_n + (1-t_n)\ln(1-y_n)\Big]\nabla a_n$$
$$= -\sum_{n=1}^{N}\Big[\frac{t_n}{y_n}y_n(1-y_n) - \frac{1-t_n}{1-y_n}y_n(1-y_n)\Big]\nabla a_n = -\sum_{n=1}^{N}\big[t_n(1-y_n) - (1-t_n)y_n\big]\nabla a_n = \sum_{n=1}^{N}(y_n-t_n)\nabla a_n$$
where we have used the conclusion of Problem 5.6. Now we calculate the second derivative:
$$\nabla\nabla E(w) = \sum_{n=1}^{N}\Big[y_n(1-y_n)\nabla a_n\nabla a_n + (y_n-t_n)\nabla\nabla a_n\Big]$$
Similarly, we can drop the last term, which gives exactly what has been asked.
Problem 5.20 Solution (waiting for update)

We begin by writing down the error function:
$$E(w) = -\sum_{n=1}^{N}\sum_{k=1}^{K}t_{nk}\ln y_{nk}$$
Here we assume that the output of the network has $K$ units in total and that there are $W$ weight parameters in the network. We first calculate the first derivative:
$$\nabla E = \sum_{n=1}^{N}\frac{dE}{d\mathbf{a}_n}\cdot\nabla\mathbf{a}_n = \sum_{n=1}^{N}\mathbf{c}_n\cdot\nabla\mathbf{a}_n$$
Note that here $\mathbf{c}_n = dE/d\mathbf{a}_n$ is a vector of size $K\times 1$, and $\nabla\mathbf{a}_n$ is a matrix of size $K\times W$. Moreover, the operator $\cdot$ means the inner product, which gives $\nabla E$ as a vector of size $1\times W$. According to (4.106), we can obtain the $j$th element of $\mathbf{c}_n$:
$$c_{n,j} = \frac{\partial}{\partial a_j}\Big(-\sum_{k=1}^{K}t_{nk}\ln y_{nk}\Big) = -\sum_{k=1}^{K}\frac{t_{nk}}{y_{nk}}y_{nk}(I_{kj}-y_{nj}) = -\sum_{k=1}^{K}t_{nk}I_{kj} + \sum_{k=1}^{K}t_{nk}y_{nj} = -t_{nj} + y_{nj}\Big(\sum_{k=1}^{K}t_{nk}\Big) = y_{nj}-t_{nj}$$
Now we calculate the second derivative:
$$\nabla\nabla E = \sum_{n=1}^{N}\Big[\Big(\frac{d\mathbf{c}_n}{d\mathbf{a}_n}\nabla\mathbf{a}_n\Big)\cdot\nabla\mathbf{a}_n + \mathbf{c}_n\nabla\nabla\mathbf{a}_n\Big]$$
Here $d\mathbf{c}_n/d\mathbf{a}_n$ is a matrix of size $K\times K$. Therefore, the second term can be neglected as before, which gives:
$$H \simeq \sum_{n=1}^{N}\Big(\frac{d\mathbf{c}_n}{d\mathbf{a}_n}\nabla\mathbf{a}_n\Big)\cdot\nabla\mathbf{a}_n$$
Problem 5.21 Solution

We first write down the expression for the Hessian matrix in the case of $K$ outputs:
$$H_{N,K} = \sum_{n=1}^{N}\sum_{k=1}^{K}b_{n,k}b_{n,k}^T$$
where $b_{n,k} = \nabla_w a_{n,k}$. Therefore, we have:
$$H_{N+1,K} = H_{N,K} + \sum_{k=1}^{K}b_{N+1,k}b_{N+1,k}^T = H_{N,K} + B_{N+1}B_{N+1}^T$$
where $B_{N+1} = [b_{N+1,1},b_{N+1,2},\dots,b_{N+1,K}]$ is a matrix of size $W\times K$, and here $W$ is the total number of parameters in the network. By analogy with (5.88)-(5.89), we can obtain:
$$H_{N+1,K}^{-1} = H_{N,K}^{-1} - \frac{H_{N,K}^{-1}B_{N+1}B_{N+1}^T H_{N,K}^{-1}}{1 + B_{N+1}^T H_{N,K}^{-1}B_{N+1}}\qquad(*)$$
Furthermore, similarly, we have:
$$H_{N+1,K+1} = H_{N+1,K} + \sum_{n=1}^{N+1}b_{n,K+1}b_{n,K+1}^T = H_{N+1,K} + B_{K+1}B_{K+1}^T$$
where $B_{K+1} = [b_{1,K+1},b_{2,K+1},\dots,b_{N+1,K+1}]$ is a matrix of size $W\times(N+1)$. Also, we can obtain:
$$H_{N+1,K+1}^{-1} = H_{N+1,K}^{-1} - \frac{H_{N+1,K}^{-1}B_{K+1}B_{K+1}^T H_{N+1,K}^{-1}}{1 + B_{K+1}^T H_{N+1,K}^{-1}B_{K+1}}$$
where $H_{N+1,K}^{-1}$ is defined by $(*)$. If we substitute $(*)$ into the expression above, we obtain the relationship between $H_{N+1,K+1}^{-1}$ and $H_{N,K}^{-1}$.
Problem 5.22 Solution

We begin by handling the first case:
$$\frac{\partial^2 E_n}{\partial w_{kj}^{(2)}\partial w_{k'j'}^{(2)}} = \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{k'j'}^{(2)}}\Big) = \frac{\partial}{\partial w_{kj}^{(2)}}\Big(\frac{\partial E_n}{\partial a_{k'}}z_{j'}\Big) = \frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)\frac{\partial a_k}{\partial w_{kj}^{(2)}}z_{j'} + 0 = z_j z_{j'}M_{kk'}$$
Then we focus on the second case; if here $j\neq j'$:
$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{j'i'}^{(1)}} = \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}\frac{\partial a_{k'}}{\partial w_{j'i'}^{(1)}}\Big) = \sum_{k'}h'(a_{j'})x_{i'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\Big)$$
$$= \sum_{k'}h'(a_{j'})x_{i'}\sum_{k}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j'}^{(2)}\Big)\frac{\partial a_k}{\partial w_{ji}^{(1)}} = \sum_{k'}h'(a_{j'})x_{i'}\sum_{k}M_{kk'}w_{k'j'}^{(2)}w_{kj}^{(2)}h'(a_j)x_i = x_i x_{i'}h'(a_j)h'(a_{j'})\sum_{k'}\sum_{k}w_{k'j'}^{(2)}w_{kj}^{(2)}M_{kk'}$$
When $j=j'$, similarly we have:
$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{ji'}^{(1)}} = x_{i'}\sum_{k'}\frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\Big)h'(a_j) + x_{i'}\sum_{k'}\Big(\frac{\partial E_n}{\partial a_{k'}}w_{k'j}^{(2)}\Big)\frac{\partial h'(a_j)}{\partial w_{ji}^{(1)}}$$
$$= x_i x_{i'}h'(a_j)h'(a_j)\sum_{k'}\sum_{k}w_{k'j}^{(2)}w_{kj}^{(2)}M_{kk'} + h''(a_j)x_i x_{i'}\sum_{k'}\delta_{k'}w_{k'j}^{(2)}$$
It seems that what we have obtained is slightly different from (5.94) when $j=j'$. However, this is not the case, since the summation over $k'$ in the second term of our formulation and the summation over $k$ in the first term of (5.94) are actually the same (i.e., they both represent a summation over all the output units). Combining the situations $j=j'$ and $j\neq j'$, we obtain (5.94) just as required. Finally, we deal with the third case. Similarly, we first focus on $j\neq j'$:
$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{kj'}^{(2)}} = \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}z_{j'}\Big) = z_{j'}\sum_{k'}\frac{\partial}{\partial a_{k'}}\Big(\frac{\partial E_n}{\partial a_k}\Big)\frac{\partial a_{k'}}{\partial w_{ji}^{(1)}} = z_{j'}\sum_{k'}M_{kk'}w_{k'j}^{(2)}h'(a_j)x_i = x_i h'(a_j)z_{j'}\sum_{k'}M_{kk'}w_{k'j}^{(2)}$$
Note that in (5.95) there are two typos: (i) $H_{kk'}$ should be $M_{kk'}$; (ii) $j$ should exchange position with $j'$ on the right side of (5.95). When $j=j'$, we have:
$$\frac{\partial^2 E_n}{\partial w_{ji}^{(1)}\partial w_{kj}^{(2)}} = \frac{\partial}{\partial w_{ji}^{(1)}}\Big(\frac{\partial E_n}{\partial a_k}\Big)z_j + \frac{\partial E_n}{\partial a_k}\frac{\partial z_j}{\partial w_{ji}^{(1)}} = x_i h'(a_j)z_j\sum_{k'}M_{kk'}w_{k'j}^{(2)} + \delta_k h'(a_j)x_i$$
Combining these two situations, we obtain (5.95) just as required.
Problem 5.23 Solution

It is similar to the previous problem.
$$\frac{\partial^2 E_n}{\partial w_{ki}\partial w_{k'j}} = \frac{\partial}{\partial w_{ki}}\Big(\frac{\partial E_n}{\partial a_{k'}}z_j\Big) = z_j\frac{\partial a_k}{\partial w_{ki}}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big) = z_j x_i M_{kk'}$$
And
$$\frac{\partial^2 E_n}{\partial w_{ki}\partial w_{ji'}} = \frac{\partial}{\partial w_{ki}}\Big(\sum_{k'}\frac{\partial E_n}{\partial a_{k'}}w_{k'j}h'(a_j)x_{i'}\Big) = \sum_{k'}h'(a_j)x_{i'}w_{k'j}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)\frac{\partial a_k}{\partial w_{ki}} = x_i x_{i'}h'(a_j)\sum_{k'}w_{k'j}M_{kk'}$$
Finally, we have
$$\frac{\partial^2 E_n}{\partial w_{ki}\partial w_{k'i'}} = \frac{\partial}{\partial w_{ki}}\Big(\frac{\partial E_n}{\partial a_{k'}}x_{i'}\Big) = x_{i'}\frac{\partial}{\partial a_k}\Big(\frac{\partial E_n}{\partial a_{k'}}\Big)\frac{\partial a_k}{\partial w_{ki}} = x_i x_{i'}M_{kk'}$$
Problem 5.24 Solution

It is obvious. According to (5.113), we have:
$$\tilde{a}_j = \sum_{i}\tilde{w}_{ji}\tilde{x}_i + \tilde{w}_{j0} = \sum_{i}\frac{1}{a}w_{ji}(ax_i+b) + w_{j0} - \frac{b}{a}\sum_{i}w_{ji} = \sum_{i}w_{ji}x_i + w_{j0} = a_j$$
where we have used (5.115), (5.116) and (5.117). We have now proved that under the transformation the hidden-unit activation $a_j$ is unchanged. If the activation function at the hidden units is also unchanged, we have $\tilde{z}_j = z_j$. Now we deal with the output unit $y_k$:
$$\tilde{y}_k = \sum_{j}\tilde{w}_{kj}\tilde{z}_j + \tilde{w}_{k0} = \sum_{j}cw_{kj}z_j + cw_{k0} + d = c\Big[\sum_{j}w_{kj}z_j + w_{k0}\Big] + d = cy_k + d$$
where we have used (5.114), (5.119) and (5.120). To be more specific, here we have proved that the linear transformation between $\tilde{y}_k$ and $y_k$ can be achieved by making the transformations (5.119) and (5.120).
Problem 5.25 Solution

Since we know that the gradient of the error function with respect to $w$ is
$$\nabla E = H(w-w^{\star})$$
together with (5.196), we can obtain:
$$w^{(\tau)} = w^{(\tau-1)} - \rho\nabla E = w^{(\tau-1)} - \rho H(w^{(\tau-1)}-w^{\star})$$
Multiplying both sides by $u_j^T$ and using $w_j = w^T u_j$, we can obtain:
$$w_j^{(\tau)} = w_j^{(\tau-1)} - \rho u_j^T H(w^{(\tau-1)}-w^{\star}) = w_j^{(\tau-1)} - \rho\eta_j u_j^T(w^{(\tau-1)}-w^{\star}) = w_j^{(\tau-1)} - \rho\eta_j(w_j^{(\tau-1)}-w_j^{\star}) = (1-\rho\eta_j)w_j^{(\tau-1)} + \rho\eta_j w_j^{\star}$$
where we have used (5.198). Then we use mathematical induction to prove (5.197), beginning by calculating $w_j^{(1)}$:
$$w_j^{(1)} = (1-\rho\eta_j)w_j^{(0)} + \rho\eta_j w_j^{\star} = \rho\eta_j w_j^{\star} = \big[1-(1-\rho\eta_j)\big]w_j^{\star}$$
Suppose (5.197) holds for $\tau$; we now prove that it also holds for $\tau+1$:
$$w_j^{(\tau+1)} = (1-\rho\eta_j)w_j^{(\tau)} + \rho\eta_j w_j^{\star} = (1-\rho\eta_j)\big[1-(1-\rho\eta_j)^{\tau}\big]w_j^{\star} + \rho\eta_j w_j^{\star} = \big[1-(1-\rho\eta_j)^{\tau+1}\big]w_j^{\star}$$
Hence (5.197) holds for $\tau = 1,2,\dots$. Provided $|1-\rho\eta_j|<1$, we have $(1-\rho\eta_j)^{\tau}\to 0$ as $\tau\to\infty$ and thus $w^{(\tau)}\to w^{\star}$. If $\tau$ is finite and $\eta_j\gg(\rho\tau)^{-1}$, the above argument still holds since $\tau$ is still relatively large. Conversely, when $\eta_j\ll(\rho\tau)^{-1}$, we expand the expression above:
$$|w_j^{(\tau)}| = \big|\big[1-(1-\rho\eta_j)^{\tau}\big]w_j^{\star}\big| \approx |\tau\rho\eta_j w_j^{\star}| \ll |w_j^{\star}|$$
We can see that $(\rho\tau)^{-1}$ works as the regularization parameter $\alpha$ in Section 3.5.3.
Problem 5.26 Solution

Based on the definition, or by analogy with (5.128), we have:
$$\Omega_n = \frac{1}{2}\sum_{k}\Big(\frac{\partial y_{nk}}{\partial\xi}\Big|_{\xi=0}\Big)^2 = \frac{1}{2}\sum_{k}\Big(\sum_{i}\frac{\partial y_{nk}}{\partial x_i}\frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}\Big)^2 = \frac{1}{2}\sum_{k}\Big(\sum_{i}\tau_i\frac{\partial y_{nk}}{\partial x_i}\Big)^2$$
where we have denoted
$$\tau_i = \frac{\partial x_i}{\partial\xi}\Big|_{\xi=0}$$
And this is exactly the form given in (5.201) and (5.202) if the $n$th observation $y_{nk}$ is denoted as $y_k$ for short. Firstly, we define $\alpha_j$ and $\beta_j$ as (5.205) shows, where $z_j$ and $a_j$ are given by (5.203). Then we prove that (5.204) holds:
$$\alpha_j = \sum_{i}\tau_i\frac{\partial z_j}{\partial x_i} = \sum_{i}\tau_i\frac{\partial h(a_j)}{\partial a_j}\frac{\partial a_j}{\partial x_i} = h'(a_j)\sum_{i}\tau_i\frac{\partial a_j}{\partial x_i} = h'(a_j)\beta_j$$
Moreover,
$$\beta_j = \sum_{i}\tau_i\frac{\partial a_j}{\partial x_i} = \sum_{i}\tau_i\frac{\partial}{\partial x_i}\Big(\sum_{i'}w_{ji'}z_{i'}\Big) = \sum_{i'}w_{ji'}\sum_{i}\tau_i\frac{\partial z_{i'}}{\partial x_i} = \sum_{i'}w_{ji'}\alpha_{i'}$$
So far we have proved that (5.204) holds, and now we aim to find a forward propagation formula to calculate $\Omega_n$. We begin by evaluating $\{\beta_j\}$ at the input units, then use the first equation in (5.204) to obtain $\{\alpha_j\}$ at the input units, then the second equation to evaluate $\{\beta_j\}$ at the first hidden layer, and again the first equation to evaluate $\{\alpha_j\}$ at the first hidden layer. We repeatedly evaluate $\{\beta_j\}$ and $\{\alpha_j\}$ in this way until reaching the output layer. Then we deal with (5.206):
$$\frac{\partial\Omega_n}{\partial w_{rs}} = \frac{\partial}{\partial w_{rs}}\Big[\frac{1}{2}\sum_{k}(\mathcal{G}y_k)^2\Big] = \sum_{k}\mathcal{G}y_k\,\frac{\partial\mathcal{G}y_k}{\partial w_{rs}} = \sum_{k}\alpha_k\,\mathcal{G}\Big[\frac{\partial y_k}{\partial a_r}\frac{\partial a_r}{\partial w_{rs}}\Big] = \sum_{k}\alpha_k\,\mathcal{G}\big[\delta_{kr}z_s\big] = \sum_{k}\alpha_k\big\{\mathcal{G}[\delta_{kr}]z_s + \mathcal{G}[z_s]\delta_{kr}\big\} = \sum_{k}\alpha_k\big\{\phi_{kr}z_s + \alpha_s\delta_{kr}\big\}$$
Provided with the idea in Section 5.3, the backward propagation formula is easy to derive. We can simply replace $E_n$ with $y_k$ to obtain a backward equation, so we omit it here.
Problem 5.27 Solution

Following the procedure in Section 5.5.5, we can obtain:
$$\Omega = \frac{1}{2}\int\big(\boldsymbol{\tau}^T\nabla y(x)\big)^2 p(x)\,dx$$
Since we have $\boldsymbol{\tau} = \partial s(x,\xi)/\partial\xi$ and $s = x + \xi$, we have $\boldsymbol{\tau} = I$. Therefore, substituting $\boldsymbol{\tau}$ into the equation above, we can obtain:
$$\Omega = \frac{1}{2}\int\|\nabla y(x)\|^2 p(x)\,dx$$
Just as required.
Problem 5.28 Solution
The modifications only affect derivatives with respect to the weights in
the convolutional layer. The units within a feature map (indexed m) have
different inputs, but all share a common weight vector, w(m). Therefore, we
can write:
En
w(m)
i=
j
En
a(m)
j
a(m)
j
w(m)
i=
j
δ(m)
jz(m)
ji
Here a(m)
jdenotes the activation of the jth unit in th mth feature map,
whereas w(m)
idenotes the ith element of the corresponding feature vector
and finally z(m)
i j denotes the ith input for the jth unit in the mth feature map.
Note that δ(m)
jcan be computed recursively from the units in the following
layer.
Problem 5.29 Solution

It is obvious. Firstly, we know that:
$$\frac{\partial}{\partial w_i}\Big[\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\Big] = -\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
We now derive the regularized error function with respect to $w_i$:
$$\frac{\partial\widetilde{E}}{\partial w_i} = \frac{\partial E}{\partial w_i} + \lambda\frac{\partial\Omega(w)}{\partial w_i} = \frac{\partial E}{\partial w_i} - \lambda\frac{\partial}{\partial w_i}\ln\Big(\sum_{j=1}^{M}\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\Big)$$
$$= \frac{\partial E}{\partial w_i} - \lambda\frac{1}{\sum_{j=1}^{M}\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}\frac{\partial}{\partial w_i}\sum_{j=1}^{M}\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
$$= \frac{\partial E}{\partial w_i} + \lambda\frac{1}{\sum_{j=1}^{M}\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}\sum_{j=1}^{M}\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
$$= \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^{M}\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{k}\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{w_i-\mu_j}{\sigma_j^2} = \frac{\partial E}{\partial w_i} + \lambda\sum_{j=1}^{M}\gamma_j(w_i)\frac{w_i-\mu_j}{\sigma_j^2}$$
where we have used (5.138) and the definition (5.140).
Problem 5.30 Solution

It is similar to the previous problem. Since we know that
$$\frac{\partial}{\partial\mu_j}\Big[\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\Big] = \pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
we can derive:
$$\frac{\partial\widetilde{E}}{\partial\mu_j} = \lambda\frac{\partial\Omega(w)}{\partial\mu_j} = -\lambda\sum_{i}\frac{\partial}{\partial\mu_j}\ln\Big(\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)\Big)$$
$$= -\lambda\sum_{i}\frac{1}{\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)}\pi_j\frac{w_i-\mu_j}{\sigma_j^2}\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
$$= \lambda\sum_{i}\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{k=1}^{M}\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\frac{\mu_j-w_i}{\sigma_j^2} = \lambda\sum_{i}\gamma_j(w_i)\frac{\mu_j-w_i}{\sigma_j^2}$$
Note that there is a typo in (5.142): the numerator should be $\mu_j - w_i$ instead of $\mu_i - w_j$. This can be easily seen from the fact that the mean and variance of the Gaussian distribution should have the same subindex, and since $\sigma_j$ appears in the denominator, $\mu_j$ should occur in the numerator instead of $\mu_i$.
Problem 5.31 Solution

It is similar to the previous problem. Since we know that
$$\frac{\partial}{\partial\sigma_j}\Big[\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)\Big] = \Big(-\frac{1}{\sigma_j} + \frac{(w_i-\mu_j)^2}{\sigma_j^3}\Big)\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
we can derive:
$$\frac{\partial\widetilde{E}}{\partial\sigma_j} = \lambda\frac{\partial\Omega(w)}{\partial\sigma_j} = -\lambda\sum_{i}\frac{\partial}{\partial\sigma_j}\ln\Big(\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)\Big)$$
$$= -\lambda\sum_{i}\frac{1}{\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)}\Big(-\frac{1}{\sigma_j} + \frac{(w_i-\mu_j)^2}{\sigma_j^3}\Big)\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)$$
$$= \lambda\sum_{i}\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2)}{\sum_{k=1}^{M}\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}\Big(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\Big) = \lambda\sum_{i}\gamma_j(w_i)\Big(\frac{1}{\sigma_j} - \frac{(w_i-\mu_j)^2}{\sigma_j^3}\Big)$$
Just as required.
Problem 5.32 Solution

It is trivial. We begin by verifying (5.208) when $j\neq k$:
$$\frac{\partial\pi_k}{\partial\eta_j} = \frac{\partial}{\partial\eta_j}\Big[\frac{\exp(\eta_k)}{\sum_{k'}\exp(\eta_{k'})}\Big] = -\frac{\exp(\eta_k)\exp(\eta_j)}{\big[\sum_{k'}\exp(\eta_{k'})\big]^2} = -\pi_j\pi_k$$
And if now we have $j=k$:
$$\frac{\partial\pi_k}{\partial\eta_k} = \frac{\exp(\eta_k)\sum_{k'}\exp(\eta_{k'}) - \exp(\eta_k)\exp(\eta_k)}{\big[\sum_{k'}\exp(\eta_{k'})\big]^2} = \pi_k - \pi_k\pi_k$$
If we combine these two cases, we can easily see that (5.208) holds. Now we prove (5.147):
$$\frac{\partial\widetilde{E}}{\partial\eta_j} = \lambda\frac{\partial\Omega(w)}{\partial\eta_j} = -\lambda\sum_{i}\frac{\partial}{\partial\eta_j}\ln\Big(\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)\Big) = -\lambda\sum_{i}\frac{1}{\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)}\sum_{k=1}^{M}\frac{\partial\pi_k}{\partial\eta_j}\mathcal{N}(w_i|\mu_k,\sigma_k^2)$$
$$= -\lambda\sum_{i}\frac{1}{\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)}\sum_{k=1}^{M}\mathcal{N}(w_i|\mu_k,\sigma_k^2)\big(\delta_{jk}\pi_j - \pi_j\pi_k\big)$$
$$= -\lambda\sum_{i}\frac{\pi_j\mathcal{N}(w_i|\mu_j,\sigma_j^2) - \pi_j\sum_{k=1}^{M}\pi_k\mathcal{N}(w_i|\mu_k,\sigma_k^2)}{\sum_{j'=1}^{M}\pi_{j'}\mathcal{N}(w_i|\mu_{j'},\sigma_{j'}^2)} = -\lambda\sum_{i}\big[\gamma_j(w_i) - \pi_j\big] = \lambda\sum_{i}\big[\pi_j - \gamma_j(w_i)\big]$$
Just as required.
Problem 5.33 Solution

It is trivial. We set the attachment point of the lower arm with the ground as the origin of the coordinate system. We first aim to find the vertical distance from the origin to the target point, and this is also the value of $x_2$:
$$x_2 = L_1\sin(\pi-\theta_1) + L_2\sin\big(\theta_2-(\pi-\theta_1)\big) = L_1\sin\theta_1 - L_2\sin(\theta_1+\theta_2)$$
Similarly, we calculate the horizontal distance from the origin to the target point:
$$x_1 = -L_1\cos(\pi-\theta_1) + L_2\cos\big(\theta_2-(\pi-\theta_1)\big) = L_1\cos\theta_1 - L_2\cos(\theta_1+\theta_2)$$
From these two equations, we can clearly see the 'forward kinematics' of the robot arm.
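A short sketch of this forward kinematics, using the angle convention derived above (the link lengths and joint angles below are arbitrary illustrative values), also shows why the inverse problem is multi-valued: two different joint configurations can reach the same end point.

```python
import math

def forward_kinematics(theta1, theta2, L1=1.0, L2=1.0):
    # target-point coordinates for the two-link arm, following the equations above
    x1 = L1 * math.cos(theta1) - L2 * math.cos(theta1 + theta2)
    x2 = L1 * math.sin(theta1) - L2 * math.sin(theta1 + theta2)
    return x1, x2

# Two distinct configurations mapping to the same end point (elbow-up / elbow-down)
print(forward_kinematics(math.radians(120), math.radians(60)))
print(forward_kinematics(math.radians(0), math.radians(-60)))
```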
Problem 5.34 Solution

By analogy with (5.208), we can write:
$$\frac{\partial\pi_k(x)}{\partial a_j^{\pi}} = \delta_{jk}\pi_j(x) - \pi_j(x)\pi_k(x)$$
Using (5.153), we can see that:
$$E_n = -\ln\Big(\sum_{k=1}^{K}\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)\Big)$$
Therefore, we can derive:
$$\frac{\partial E_n}{\partial a_j^{\pi}} = -\frac{1}{\sum_{k=1}^{K}\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)}\sum_{k=1}^{K}\frac{\partial\pi_k}{\partial a_j^{\pi}}\mathcal{N}(t_n|\mu_k,\sigma_k^2) = -\frac{1}{\sum_{k=1}^{K}\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)}\sum_{k=1}^{K}\big[\delta_{jk}\pi_j(x_n) - \pi_j(x_n)\pi_k(x_n)\big]\mathcal{N}(t_n|\mu_k,\sigma_k^2)$$
$$= -\frac{\pi_j(x_n)\mathcal{N}(t_n|\mu_j,\sigma_j^2) - \pi_j(x_n)\sum_{k=1}^{K}\pi_k(x_n)\mathcal{N}(t_n|\mu_k,\sigma_k^2)}{\sum_{k=1}^{K}\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)}$$
And if we use the definition (5.154), we will have:
$$\frac{\partial E_n}{\partial a_j^{\pi}} = -\gamma_j + \pi_j$$
Note that our result differs from (5.155) only in the subindex; they are actually the same if we substitute index $j$ by index $k$ in the final expression.
Problem 5.35 Solution

We deal with the derivative of the error function with respect to $\mu_k$ instead, which gives a vector as the result; the $l$th element of this vector is what we are required to show. Since we know that
$$\frac{\partial}{\partial\mu_k}\Big[\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)\Big] = \frac{t_n-\mu_k}{\sigma_k^2}\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)$$
One thing worth noticing is that here we focus on the isotropic case, as stated on page 273 of the textbook; to be more precise, $\mathcal{N}(t_n|\mu_k,\sigma_k^2)$ should be $\mathcal{N}(t_n|\mu_k,\sigma_k^2 I)$. Provided with the equation above, we can further obtain:
$$\frac{\partial E_n}{\partial\mu_k} = -\frac{\partial}{\partial\mu_k}\ln\Big(\sum_{k'=1}^{K}\pi_{k'}\mathcal{N}(t_n|\mu_{k'},\sigma_{k'}^2)\Big) = -\frac{1}{\sum_{k'=1}^{K}\pi_{k'}\mathcal{N}(t_n|\mu_{k'},\sigma_{k'}^2)}\cdot\frac{t_n-\mu_k}{\sigma_k^2}\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2) = -\gamma_k\frac{t_n-\mu_k}{\sigma_k^2}$$
Hence, noticing (5.152), the $l$th element of the result above is what we are required to show:
$$\frac{\partial E_n}{\partial a_{kl}^{\mu}} = \frac{\partial E_n}{\partial\mu_{kl}} = \gamma_k\frac{\mu_{kl}-t_l}{\sigma_k^2}$$
Problem 5.36 Solution

Similarly, we know that:
$$\frac{\partial}{\partial\sigma_k}\Big[\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)\Big] = \Big(-\frac{D}{\sigma_k} + \frac{\|t_n-\mu_k\|^2}{\sigma_k^3}\Big)\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)$$
Therefore, we can obtain:
$$\frac{\partial E_n}{\partial\sigma_k} = -\frac{\partial}{\partial\sigma_k}\ln\Big(\sum_{k'=1}^{K}\pi_{k'}\mathcal{N}(t_n|\mu_{k'},\sigma_{k'}^2)\Big) = -\frac{1}{\sum_{k'=1}^{K}\pi_{k'}\mathcal{N}(t_n|\mu_{k'},\sigma_{k'}^2)}\Big(-\frac{D}{\sigma_k} + \frac{\|t_n-\mu_k\|^2}{\sigma_k^3}\Big)\pi_k\mathcal{N}(t_n|\mu_k,\sigma_k^2)$$
$$= \gamma_k\Big(\frac{D}{\sigma_k} - \frac{\|t_n-\mu_k\|^2}{\sigma_k^3}\Big)$$
Note that there is a typo in (5.157), and the underlying reason is that:
$$\big|\sigma_k^2 I_{D\times D}\big| = (\sigma_k^2)^D$$
Problem 5.37 Solution

First we state two properties of the Gaussian distribution $\mathcal{N}(t|\mu,\sigma^2 I)$:
$$\mathbb{E}[t] = \int t\,\mathcal{N}(t|\mu,\sigma^2 I)\,dt = \mu\qquad\text{and}\qquad\mathbb{E}\big[\|t\|^2\big] = \int\|t\|^2\mathcal{N}(t|\mu,\sigma^2 I)\,dt = L\sigma^2 + \|\mu\|^2$$
where we have used $\mathbb{E}[t^T A t] = \mathrm{Tr}[A\sigma^2 I] + \mu^T A\mu$ with $A=I$. This property can be found in the Matrix Cookbook, eq. (378). Here $L$ is the dimension of $t$. Noticing (5.148), we can write:
$$\mathbb{E}[t|x] = \int t\,p(t|x)\,dt = \sum_{k=1}^{K}\pi_k\int t\,\mathcal{N}(t|\mu_k,\sigma_k^2)\,dt = \sum_{k=1}^{K}\pi_k\mu_k$$
Then we prove (5.160):
$$s^2(x) = \mathbb{E}\big[\|t-\mathbb{E}[t|x]\|^2\,\big|\,x\big] = \mathbb{E}\big[\|t\|^2\big|x\big] - \big\|\mathbb{E}[t|x]\big\|^2 = \sum_{k=1}^{K}\pi_k\big(L\sigma_k^2 + \|\mu_k\|^2\big) - \Big\|\sum_{l=1}^{K}\pi_l\mu_l\Big\|^2$$
$$= L\sum_{k=1}^{K}\pi_k\sigma_k^2 + \sum_{k=1}^{K}\pi_k\|\mu_k\|^2 - 2\Big(\sum_{l=1}^{K}\pi_l\mu_l\Big)^T\Big(\sum_{k=1}^{K}\pi_k\mu_k\Big) + \Big(\sum_{k=1}^{K}\pi_k\Big)\Big\|\sum_{l=1}^{K}\pi_l\mu_l\Big\|^2$$
$$= L\sum_{k=1}^{K}\pi_k\sigma_k^2 + \sum_{k=1}^{K}\pi_k\Big\|\mu_k - \sum_{l=1}^{K}\pi_l\mu_l\Big\|^2 = \sum_{k=1}^{K}\pi_k\Big\{L\sigma_k^2 + \Big\|\mu_k - \sum_{l=1}^{K}\pi_l\mu_l\Big\|^2\Big\}$$
Note that there is a typo in (5.160), i.e., the coefficient $L$ in front of $\sigma_k^2$ is missing.
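The mixture mean and variance formulas above translate directly into a few lines of code. The following sketch evaluates them for arbitrary illustrative mixing coefficients, component means and isotropic variances:

```python
import numpy as np

def mixture_mean_and_variance(pi, mu, sigma2):
    """Conditional mean and variance s^2(x) of p(t|x) = sum_k pi_k N(t | mu_k, sigma2_k I),
    following the result derived above.
    pi: (K,) mixing coefficients, mu: (K, L) means, sigma2: (K,) isotropic variances."""
    L = mu.shape[1]
    mean = (pi[:, None] * mu).sum(axis=0)                               # sum_k pi_k mu_k
    var = np.sum(pi * (L * sigma2 + np.sum((mu - mean) ** 2, axis=1)))
    return mean, var

pi = np.array([0.3, 0.7])
mu = np.array([[0.0, 0.0], [2.0, 1.0]])
sigma2 = np.array([0.5, 0.1])
print(mixture_mean_and_variance(pi, mu, sigma2))
```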
Problem 5.38 Solution

From (5.167) and (5.171), we can write down the expression for the predictive distribution:
$$p(t|x,D,\alpha,\beta) = \int p(w|D,\alpha,\beta)\,p(t|x,w,\beta)\,dw \simeq \int q(w|D)\,p(t|x,w,\beta)\,dw = \int\mathcal{N}(w|w_{MAP},A^{-1})\,\mathcal{N}\big(t\,\big|\,g^T w - g^T w_{MAP} + y(x,w_{MAP}),\beta^{-1}\big)\,dw$$
Note that here $p(t|x,w,\beta)$ is given by (5.171) and $q(w|D)$ is the approximation to the posterior $p(w|D,\alpha,\beta)$ given by (5.167). Then, by analogy with (2.115), we first deal with the mean of the predictive distribution:
$$\text{mean} = \big[g^T w - g^T w_{MAP} + y(x,w_{MAP})\big]\Big|_{w=w_{MAP}} = y(x,w_{MAP})$$
Then we deal with the variance:
$$\text{variance} = \beta^{-1} + g^T A^{-1}g$$
Just as required.
Problem 5.39 Solution

Using the Laplace approximation, i.e. (4.135), we can obtain:
$$p(D|w,\beta)p(w|\alpha) \simeq p(D|w_{MAP},\beta)\,p(w_{MAP}|\alpha)\exp\Big\{-\frac{1}{2}(w-w_{MAP})^T A(w-w_{MAP})\Big\}$$
Then, using (5.174), (5.162) and (5.163), we can obtain:
$$p(D|\alpha,\beta) = \int p(D|w,\beta)p(w|\alpha)\,dw \simeq p(D|w_{MAP},\beta)\,p(w_{MAP}|\alpha)\frac{(2\pi)^{W/2}}{|A|^{1/2}} = \Big[\prod_{n=1}^{N}\mathcal{N}\big(t_n|y(x_n,w_{MAP}),\beta^{-1}\big)\Big]\mathcal{N}(w_{MAP}|0,\alpha^{-1}I)\frac{(2\pi)^{W/2}}{|A|^{1/2}}$$
If we take the logarithm of both sides, we will obtain (5.175) just as required.
Problem 5.40 Solution

For a $K$-class classification problem, we need to use the softmax activation function, and the error function is now given by (5.24). Therefore, the Hessian matrix should be derived from (5.24), and the cross entropy in (5.184) will also be replaced by (5.24).
Problem 5.41 Solution

By analogy with Prob. 5.39, we can write:
$$p(D|\alpha) \simeq p(D|w_{MAP})\,p(w_{MAP}|\alpha)\frac{(2\pi)^{W/2}}{|A|^{1/2}}$$
Since the prior $p(w|\alpha)$ follows a Gaussian distribution, i.e., (5.162), as stated in the text, we can obtain:
$$\ln p(D|\alpha) = \ln p(D|w_{MAP}) + \ln p(w_{MAP}|\alpha) - \frac{1}{2}\ln|A| + \text{const} = \ln p(D|w_{MAP}) - \frac{\alpha}{2}w_{MAP}^T w_{MAP} + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|A| + \text{const}$$
$$= -E(w_{MAP}) + \frac{W}{2}\ln\alpha - \frac{1}{2}\ln|A| + \text{const}$$
Just as required.
0.6 Kernel Methods
Problem 6.1 Solution

Recall that in Section 6.1, $a_n$ can be written as (6.4). We can derive:
$$a_n = -\frac{1}{\lambda}\big\{w^T\phi(x_n)-t_n\big\} = -\frac{1}{\lambda}\big\{w_1\phi_1(x_n) + w_2\phi_2(x_n) + \dots + w_M\phi_M(x_n) - t_n\big\}$$
$$= -\frac{w_1}{\lambda}\phi_1(x_n) - \frac{w_2}{\lambda}\phi_2(x_n) - \dots - \frac{w_M}{\lambda}\phi_M(x_n) + \frac{t_n}{\lambda} = \Big(c_n-\frac{w_1}{\lambda}\Big)\phi_1(x_n) + \Big(c_n-\frac{w_2}{\lambda}\Big)\phi_2(x_n) + \dots + \Big(c_n-\frac{w_M}{\lambda}\Big)\phi_M(x_n)$$
Here we have defined:
$$c_n = \frac{t_n/\lambda}{\phi_1(x_n)+\phi_2(x_n)+\dots+\phi_M(x_n)}$$
From what we have derived above, we can see that $a_n$ is a linear combination of the elements of $\phi(x_n)$. What's more, we first substitute $K=\Phi\Phi^T$ into (6.7), and then we obtain (6.5). Next we substitute (6.3) into (6.5) and we obtain (6.2), just as required.
Problem 6.2 Solution

If we set $w^{(0)}=0$ in (4.55), we can obtain:
$$w^{(\tau+1)} = \sum_{n=1}^{N}\eta\,c_n t_n\phi_n$$
where $N$ is the total number of samples and $c_n$ is the number of times that $t_n\phi_n$ has been added from step $0$ to step $\tau+1$. Therefore, it is obvious that we have:
$$w = \sum_{n=1}^{N}\alpha_n t_n\phi_n$$
We further substitute the expression above into (4.55), which gives:
$$\sum_{n=1}^{N}\alpha_n^{(\tau+1)}t_n\phi_n = \sum_{n=1}^{N}\alpha_n^{(\tau)}t_n\phi_n + \eta t_n\phi_n$$
In other words, the update process is to add the learning rate $\eta$ to the coefficient $\alpha_n$ corresponding to the misclassified pattern $x_n$, i.e.,
$$\alpha_n^{(\tau+1)} = \alpha_n^{(\tau)} + \eta$$
Now we similarly substitute it into (4.52):
$$y(x) = f\big(w^T\phi(x)\big) = f\Big(\sum_{n=1}^{N}\alpha_n t_n\phi_n^T\phi(x)\Big) = f\Big(\sum_{n=1}^{N}\alpha_n t_n k(x_n,x)\Big)$$
Problem 6.3 Solution

We begin by expanding the Euclidean metric:
$$\|x-x_n\|^2 = (x-x_n)^T(x-x_n) = x^T x - 2x_n^T x + x_n^T x_n$$
Similar to (6.24)-(6.26), we use a nonlinear kernel $k(x_n,x)$ to replace $x_n^T x$, which gives a general nonlinear nearest-neighbour classifier with the distance measure:
$$k(x,x) + k(x_n,x_n) - 2k(x_n,x)$$
Problem 6.4 Solution

To construct such a matrix, let us suppose the two eigenvalues are $1$ and $2$, and that the matrix has the form:
$$\begin{pmatrix} a & b\\ c & d\end{pmatrix}$$
Therefore, based on the definition of an eigenvalue, we have two equations:
$$(a-2)(d-2) = bc\quad(1)\qquad\text{and}\qquad(a-1)(d-1) = bc\quad(2)$$
Subtracting (1) from (2) yields $a+d = 3$. Therefore, we set $a=4$ and $d=-1$. Then we substitute them into (1), and thus we see $bc = -6$. Finally, we choose $b=3$ and $c=-2$. The constructed matrix is:
$$\begin{pmatrix} 4 & 3\\ -2 & -1\end{pmatrix}$$
Problem 6.5 Solution

Since $k_1(x,x')$ is a valid kernel, it can be written as $k_1(x,x') = \phi(x)^T\phi(x')$. We can obtain:
$$k(x,x') = ck_1(x,x') = \big(\sqrt{c}\,\phi(x)\big)^T\big(\sqrt{c}\,\phi(x')\big)$$
Therefore, (6.13) is a valid kernel. It is similar for (6.14):
$$k(x,x') = f(x)\,k_1(x,x')\,f(x') = \big(f(x)\phi(x)\big)^T\big(f(x')\phi(x')\big)$$
Just as required.
Problem 6.6 Solution

We suppose $q(x)$ can be written as:
$$q(x) = a_n x^n + a_{n-1}x^{n-1} + \dots + a_1 x + a_0$$
We then obtain:
$$k(x,x') = a_n k_1(x,x')^n + a_{n-1}k_1(x,x')^{n-1} + \dots + a_1 k_1(x,x') + a_0$$
By repeatedly using (6.13), (6.17) and (6.18), we can easily verify that $k(x,x')$ is a valid kernel. For (6.16), we can use a Taylor expansion, and since the coefficients of the Taylor expansion are all positive, we can similarly prove its validity.
Problem 6.7 Solution

To prove (6.17), we will use the property stated below (6.12). Since we know $k_1(x,x')$ and $k_2(x,x')$ are valid kernels, their Gram matrices $K_1$ and $K_2$ are both positive semidefinite. Given the relation (6.12), it can easily be shown that $K = K_1 + K_2$ is also positive semidefinite, and thus $k(x,x')$ is also a valid kernel.

To prove (6.18), we assume the mapping function for kernel $k_1(x,x')$ is $\phi^{(1)}(x)$, and similarly $\phi^{(2)}(x)$ for $k_2(x,x')$. Moreover, we further assume the dimension of $\phi^{(1)}(x)$ is $M$ and that of $\phi^{(2)}(x)$ is $N$. We expand $k(x,x')$ based on (6.18):
$$k(x,x') = k_1(x,x')k_2(x,x') = \sum_{i=1}^{M}\phi_i^{(1)}(x)\phi_i^{(1)}(x')\sum_{j=1}^{N}\phi_j^{(2)}(x)\phi_j^{(2)}(x') = \sum_{i=1}^{M}\sum_{j=1}^{N}\big[\phi_i^{(1)}(x)\phi_j^{(2)}(x)\big]\big[\phi_i^{(1)}(x')\phi_j^{(2)}(x')\big] = \sum_{k=1}^{MN}\phi_k(x)\phi_k(x') = \phi(x)^T\phi(x')$$
where $\phi_i^{(1)}(x)$ is the $i$th element of $\phi^{(1)}(x)$, and $\phi_j^{(2)}(x)$ is the $j$th element of $\phi^{(2)}(x)$. To be more specific, we have proved that $k(x,x')$ can be written as $\phi(x)^T\phi(x')$. Here $\phi(x)$ is an $MN\times 1$ column vector, and its $k$th ($k=1,2,\dots,MN$) element is given by $\phi_i^{(1)}(x)\times\phi_j^{(2)}(x)$. What's more, we can also express $i,j$ in terms of $k$:
$$i = (k-1)\ \mathrm{div}\ N + 1\qquad\text{and}\qquad j = (k-1)\ \mathrm{mod}\ N + 1$$
where $\mathrm{div}$ and $\mathrm{mod}$ denote integer division and remainder, respectively.
Problem 6.8 Solution

For (6.19) we suppose $k_3(x,x') = g(x)^T g(x')$, and thus we have:
$$k(x,x') = k_3\big(\phi(x),\phi(x')\big) = g\big(\phi(x)\big)^T g\big(\phi(x')\big) = f(x)^T f(x')$$
where we have denoted $g(\phi(x)) = f(x)$, and now it is obvious that (6.19) holds. To prove (6.20), we suppose $x$ is an $N\times 1$ column vector and $A$ is an $N\times N$ symmetric positive semidefinite matrix. We know that $A$ can be decomposed as $QBQ^T$, where $Q$ is an $N\times N$ orthogonal matrix, and $B$ is an $N\times N$ diagonal matrix whose elements are no less than $0$. Now we can derive:
$$k(x,x') = x^T A x' = x^T QBQ^T x' = (Q^T x)^T B(Q^T x') = y^T B y' = \sum_{i=1}^{N}B_{ii}y_i y_i' = \sum_{i=1}^{N}\big(\sqrt{B_{ii}}\,y_i\big)\big(\sqrt{B_{ii}}\,y_i'\big) = \phi(x)^T\phi(x')$$
To be more specific, we have proved that $k(x,x') = \phi(x)^T\phi(x')$, and here $\phi(x)$ is an $N\times 1$ column vector whose $i$th ($i=1,2,\dots,N$) element is given by $\sqrt{B_{ii}}\,y_i$, i.e., $\sqrt{B_{ii}}\,(Q^T x)_i$.
Problem 6.9 Solution

To prove (6.21), let's first expand the expression:
$$k(x,x') = k_a(x_a,x_a') + k_b(x_b,x_b') = \sum_{i=1}^{M}\phi_i^{(a)}(x_a)\phi_i^{(a)}(x_a') + \sum_{j=1}^{N}\phi_j^{(b)}(x_b)\phi_j^{(b)}(x_b') = \sum_{k=1}^{M+N}\phi_k(x)\phi_k(x') = \phi(x)^T\phi(x')$$
where we have assumed the dimension of $x_a$ is $M$ and the dimension of $x_b$ is $N$. The mapping function $\phi(x)$ is an $(M+N)\times 1$ column vector, whose $k$th ($k=1,2,\dots,M+N$) element $\phi_k(x)$ is:
$$\phi_k(x) = \begin{cases}\phi_k^{(a)}(x_a), & 1\le k\le M\\ \phi_{k-M}^{(b)}(x_b), & M+1\le k\le M+N\end{cases}$$
(6.22) is quite similar to (6.18). We follow the same procedure:
$$k(x,x') = k_a(x_a,x_a')\,k_b(x_b,x_b') = \sum_{i=1}^{M}\phi_i^{(a)}(x_a)\phi_i^{(a)}(x_a')\sum_{j=1}^{N}\phi_j^{(b)}(x_b)\phi_j^{(b)}(x_b') = \sum_{i=1}^{M}\sum_{j=1}^{N}\big[\phi_i^{(a)}(x_a)\phi_j^{(b)}(x_b)\big]\big[\phi_i^{(a)}(x_a')\phi_j^{(b)}(x_b')\big] = \sum_{k=1}^{MN}\phi_k(x)\phi_k(x') = \phi(x)^T\phi(x')$$
By analogy with (6.18), the mapping function $\phi(x)$ is an $MN\times 1$ column vector, whose $k$th ($k=1,2,\dots,MN$) element $\phi_k(x)$ is:
$$\phi_k(x) = \phi_i^{(a)}(x_a)\times\phi_j^{(b)}(x_b)$$
To be more specific, $x_a$ is the sub-vector of $x$ made up of the first $M$ elements of $x$, and $x_b$ is the sub-vector of $x$ made up of the last $N$ elements of $x$. What's more, we can also express $i,j$ in terms of $k$:
$$i = (k-1)\ \mathrm{div}\ N + 1\qquad\text{and}\qquad j = (k-1)\ \mathrm{mod}\ N + 1$$
where $\mathrm{div}$ and $\mathrm{mod}$ denote integer division and remainder, respectively.
Problem 6.10 Solution

According to (6.9), we have:
$$y(x) = k(x)^T(K+\lambda I_N)^{-1}t = k(x)^T a = \sum_{n=1}^{N}f(x_n)f(x)\,a_n = \Big[\sum_{n=1}^{N}f(x_n)a_n\Big]f(x)$$
We see that if we choose $k(x,x') = f(x)f(x')$, we will always find a solution $y(x)$ proportional to $f(x)$.
Problem 6.11 Solution

We follow the hint:
$$k(x,x') = \exp(-x^T x/2\sigma^2)\cdot\exp(x^T x'/\sigma^2)\cdot\exp\big(-(x')^T x'/2\sigma^2\big) = \exp(-x^T x/2\sigma^2)\cdot\Big[1 + \frac{x^T x'}{\sigma^2} + \frac{(x^T x'/\sigma^2)^2}{2!} + \cdots\Big]\cdot\exp\big(-(x')^T x'/2\sigma^2\big) = \phi(x)^T\phi(x')$$
where $\phi(x)$ is a column vector with infinite dimension. To be more specific, (6.12) gives a simple example of how to decompose $(x^T x')^2$. In our case, we can also decompose $(x^T x')^k$, $k=1,2,\dots$, in a similar way. However, since $k\to\infty$, the decomposition will consist of monomials of arbitrarily high degree. Thus, there will be an infinite number of terms in the decomposition and the feature mapping function $\phi(x)$ will have infinite dimension.
Problem 6.12 Solution

First, let's explain the problem a little bit. According to (6.27), what we need to prove here is:
$$k(A_1,A_2) = 2^{|A_1\cap A_2|} = \phi(A_1)^T\phi(A_2)$$
The biggest difference from the previous problem is that $\phi(A)$ is a $2^{|D|}\times 1$ column vector, and instead of being indexed by $1,2,\dots,2^{|D|}$, here we index it by $\{U\,|\,U\subseteq D\}$ (note that $\{U\,|\,U\subseteq D\}$ is the set of all possible subsets of $D$ and thus there are $2^{|D|}$ elements in total). Therefore, according to (6.95), we can obtain:
$$\phi(A_1)^T\phi(A_2) = \sum_{U\subseteq D}\phi_U(A_1)\phi_U(A_2)$$
By using the summation, we actually iterate through all the possible subsets of $D$. If and only if the current subset $U$ is a subset of both $A_1$ and $A_2$ simultaneously does the corresponding term equal $1$. Therefore, we actually count how many subsets of $D$ are contained in the intersection of $A_1$ and $A_2$. Moreover, since $A_1$ and $A_2$ are both defined in the subset space of $D$, what we have deduced above can be written as:
$$\phi(A_1)^T\phi(A_2) = 2^{|A_1\cap A_2|}$$
Just as required.
Problem 6.13 Solution (waiting for update)
Problem 6.14 Solution

Since the covariance matrix $S$ is fixed, according to (6.32) we can obtain:
$$g(\mu,x) = \nabla_{\mu}\ln p(x|\mu) = \nabla_{\mu}\Big[-\frac{1}{2}(x-\mu)^T S^{-1}(x-\mu)\Big] = S^{-1}(x-\mu)$$
Therefore, according to (6.34), we can obtain:
$$F = \mathbb{E}_x\big[g(\mu,x)g(\mu,x)^T\big] = S^{-1}\mathbb{E}_x\big[(x-\mu)(x-\mu)^T\big]S^{-1}$$
Since $x\sim\mathcal{N}(x|\mu,S)$, we have $\mathbb{E}_x\big[(x-\mu)(x-\mu)^T\big] = S$. So we obtain $F = S^{-1}$, and then according to (6.33), we have:
$$k(x,x') = g(\mu,x)^T F^{-1}g(\mu,x') = (x-\mu)^T S^{-1}(x'-\mu)$$
Problem 6.15 Solution

We rewrite the problem: what we are required to prove is that the Gram matrix
$$K = \begin{pmatrix}k_{11} & k_{12}\\ k_{21} & k_{22}\end{pmatrix}$$
where $k_{ij}$ ($i,j=1,2$) is short for $k(x_i,x_j)$, should be positive semidefinite. A positive semidefinite $2\times 2$ matrix has a non-negative determinant, i.e.,
$$k_{11}k_{22} - k_{12}k_{21} \ge 0$$
Using the symmetry property of the kernel, i.e., $k_{12} = k_{21}$, we obtain what has been required, namely $k(x_1,x_2)^2 \le k(x_1,x_1)\,k(x_2,x_2)$.
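As an illustration, the following sketch builds the $2\times 2$ Gram matrix of a Gaussian kernel on two arbitrary random points and checks both the non-negativity of its eigenvalues and the Cauchy-Schwarz-style bound just derived:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    return np.exp(-np.sum((x - y) ** 2) / (2 * sigma ** 2))

rng = np.random.default_rng(2)
X = rng.normal(size=(2, 3))              # two data points in R^3

K = np.array([[gaussian_kernel(X[i], X[j]) for j in range(2)] for i in range(2)])

print(np.all(np.linalg.eigvalsh(K) >= -1e-12))      # both eigenvalues non-negative
print(K[0, 1] ** 2 <= K[0, 0] * K[1, 1])            # the bound k(x1,x2)^2 <= k(x1,x1) k(x2,x2)
```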
Problem 6.16 Solution

Based on the total derivative of the function $f$, the first-order variation under $w\to w+\Delta w$ is:
$$\Delta f\big(w^T\phi_1,w^T\phi_2,\dots,w^T\phi_N\big) = \sum_{n=1}^{N}\frac{\partial f}{\partial(w^T\phi_n)}\,\Delta(w^T\phi_n) = \sum_{n=1}^{N}\frac{\partial f}{\partial(w^T\phi_n)}\,\phi_n^T\Delta w$$
Note that here $\phi_n$ is short for $\phi(x_n)$. Based on the equation above, we can obtain:
$$\nabla_w f = \sum_{n=1}^{N}\frac{\partial f}{\partial(w^T\phi_n)}\,\phi_n^T$$
Now we focus on the derivative of the function $g$ with respect to $w$:
$$\nabla_w g = \frac{\partial g}{\partial(w^T w)}\cdot 2w^T$$
In order to find the optimal $w$, we set the derivative of $J$ with respect to $w$ equal to $0$, yielding:
$$\nabla_w J = \nabla_w f + \nabla_w g = \sum_{n=1}^{N}\frac{\partial f}{\partial(w^T\phi_n)}\,\phi_n^T + \frac{\partial g}{\partial(w^T w)}\cdot 2w^T = 0$$
Rearranging the equation above, we can obtain:
$$w = -\frac{1}{2a}\sum_{n=1}^{N}\frac{\partial f}{\partial(w^T\phi_n)}\,\phi_n$$
where we have defined $a = \partial g/\partial(w^T w)$, and since $g$ is a monotonically increasing function, we have $a>0$. Hence $w$ can be expressed as a linear combination of the $\phi(x_n)$.
Problem 6.17 Solution

We consider a variation in the function $y(x)$ of the form:
$$y(x)\to y(x) + \epsilon\eta(x)$$
Substituting it into (6.39) yields:
$$E[y+\epsilon\eta] = \frac{1}{2}\sum_{n=1}^{N}\int\big[y+\epsilon\eta-t_n\big]^2 v(\xi)\,d\xi = \frac{1}{2}\sum_{n=1}^{N}\int\big[(y-t_n)^2 + 2\epsilon\eta(y-t_n) + (\epsilon\eta)^2\big]v(\xi)\,d\xi = E[y] + \epsilon\sum_{n=1}^{N}\int\{y-t_n\}\eta\,v\,d\xi + O(\epsilon^2)$$
Note that here $y$ is short for $y(x_n+\xi)$, $\eta$ is short for $\eta(x_n+\xi)$ and $v$ is short for $v(\xi)$, respectively. Several clarifications must be made here. What we have done is vary the function $y$ by a little bit (i.e., $\epsilon\eta$) and then expand the corresponding error with respect to the small variation $\epsilon$. The coefficient of $\epsilon$ is actually the first derivative of the error $E[y+\epsilon\eta]$ with respect to $\epsilon$ at $\epsilon=0$. Since we know that $y$ is the optimal function that makes $E$ smallest, the first derivative of the error $E[y+\epsilon\eta]$ should equal zero at $\epsilon=0$, which gives:
$$\sum_{n=1}^{N}\int\{y(x_n+\xi)-t_n\}\eta(x_n+\xi)v(\xi)\,d\xi = 0$$
Now we are required to find a function $y$ that satisfies the equation above no matter what $\eta$ is. We choose:
$$\eta(x) = \delta(x-z)$$
This allows us to evaluate the integral:
$$\sum_{n=1}^{N}\int\{y(x_n+\xi)-t_n\}\eta(x_n+\xi)v(\xi)\,d\xi = \sum_{n=1}^{N}\{y(z)-t_n\}v(z-x_n)$$
We set this to zero and rearrange it, which finally gives (6.40) just as required.
Problem 6.18 Solution

According to the main text below Eq (6.48), we know that $f(x,t)$, i.e., $f(z)$, follows a zero-mean isotropic Gaussian:
$$f(z) = \mathcal{N}(z|0,\sigma^2 I)$$
Then $f(x-x_m,t-t_m)$, i.e., $f(z-z_m)$, should also satisfy a Gaussian distribution:
$$f(z-z_m) = \mathcal{N}(z|z_m,\sigma^2 I)$$
where we have defined $z_m = (x_m,t_m)$. The integral $\int f(z-z_m)\,dt$ corresponds to the marginal distribution with respect to the remaining variable $x$, and thus we obtain:
$$\int f(z-z_m)\,dt = \mathcal{N}(x|x_m,\sigma^2)$$
We substitute all the expressions into Eq (6.48), which gives:
$$p(t|x) = \frac{p(t,x)}{\int p(t,x)\,dt} = \frac{\sum_n\mathcal{N}(z|z_n,\sigma^2 I)}{\sum_m\mathcal{N}(x|x_m,\sigma^2)} = \frac{\sum_n\frac{1}{2\pi\sigma^2}\exp\big[-\frac{1}{2\sigma^2}(x-x_n)^2\big]\exp\big[-\frac{1}{2\sigma^2}(t-t_n)^2\big]}{\sum_m\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big[-\frac{1}{2\sigma^2}(x-x_m)^2\big]}$$
$$= \sum_n\frac{\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big[-\frac{1}{2\sigma^2}(x-x_n)^2\big]}{\sum_m\frac{1}{(2\pi\sigma^2)^{1/2}}\exp\big[-\frac{1}{2\sigma^2}(x-x_m)^2\big]}\cdot\frac{1}{\sqrt{2\pi\sigma^2}}\exp\Big[-\frac{1}{2\sigma^2}(t-t_n)^2\Big] = \sum_n\pi_n\,\mathcal{N}(t|t_n,\sigma^2)$$
where we have defined:
$$\pi_n = \frac{\exp\big[-\frac{1}{2\sigma^2}(x-x_n)^2\big]}{\sum_m\exp\big[-\frac{1}{2\sigma^2}(x-x_m)^2\big]}$$
We also observe that $\sum_n\pi_n = 1$. Therefore, the conditional distribution $p(t|x)$ is given by a Gaussian mixture. Similarly, we attempt to find a specific form for Eq (6.46):
$$k(x,x_n) = \frac{\int f(x-x_n,t)\,dt}{\sum_m\int f(x-x_m,t)\,dt} = \frac{\mathcal{N}(x|x_n,\sigma^2)}{\sum_m\mathcal{N}(x|x_m,\sigma^2)} = \pi_n$$
In other words, the conditional distribution can be more precisely written as:
$$p(t|x) = \sum_n k(x,x_n)\,\mathcal{N}(t|t_n,\sigma^2)$$
Thus its mean is given by:
$$\mathbb{E}[t|x] = \sum_n k(x,x_n)\,t_n$$
Its variance is given by:
$$\mathrm{var}[t|x] = \mathbb{E}\big[t^2|x\big] - \mathbb{E}[t|x]^2 = \sum_n k(x,x_n)\big(t_n^2+\sigma^2\big) - \Big[\sum_n k(x,x_n)\,t_n\Big]^2$$
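The Nadaraya-Watson predictive mean and variance above are straightforward to compute. The following sketch uses an arbitrary toy one-dimensional data set and an illustrative kernel width:

```python
import numpy as np

def nadaraya_watson(x, x_train, t_train, sigma=0.3):
    """Predictive mean and variance of p(t|x) in the mixture form derived above,
    with weights pi_n given by a Gaussian kernel density."""
    log_w = -0.5 * ((x - x_train) / sigma) ** 2
    w = np.exp(log_w - log_w.max())
    pi = w / w.sum()                              # kernel weights k(x, x_n)
    mean = np.sum(pi * t_train)                   # E[t|x]
    var = np.sum(pi * (t_train ** 2 + sigma ** 2)) - mean ** 2
    return mean, var

rng = np.random.default_rng(3)
x_train = np.linspace(0, 1, 20)
t_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=20)
print(nadaraya_watson(0.25, x_train, t_train))
```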
Problem 6.19 Solution

Similar to Prob. 6.17, it is straightforward to show that:
$$y(x) = \sum_n t_n k(x,x_n),\qquad\text{where}\qquad k(x,x_n) = \frac{g(x_n-x)}{\sum_{n'}g(x_{n'}-x)}$$
Problem 6.20 Solution

Since we know that $t_{N+1} = (t_1,t_2,\dots,t_N,t_{N+1})^T$ follows a Gaussian distribution, i.e., $t_{N+1}\sim\mathcal{N}(t_{N+1}|0,C_{N+1})$ as given in Eq (6.64), if we rearrange its order by putting the last element (i.e., $t_{N+1}$) in the first position, denoted $\bar{t}_{N+1}$, it should also satisfy a Gaussian distribution:
$$\bar{t}_{N+1} = (t_{N+1},t_1,\dots,t_N)^T\sim\mathcal{N}(\bar{t}_{N+1}|0,\bar{C}_{N+1}),\qquad\bar{C}_{N+1} = \begin{pmatrix}c & k^T\\ k & C_N\end{pmatrix}$$
where $k$ and $c$ have been given in the main text below Eq (6.65). The conditional distribution $p(t_{N+1}|t_N)$ should also be Gaussian. By analogy with Eq (2.94)-(2.98), we can simply treat $t_{N+1}$ as $x_a$, $t_N$ as $x_b$, $c$ as $\Sigma_{aa}$, $k$ as $\Sigma_{ba}$, $k^T$ as $\Sigma_{ab}$ and $C_N$ as $\Sigma_{bb}$. Substituting them into Eq (2.79) and Eq (2.80) yields:
$$\Lambda_{aa} = \big(c - k^T C_N^{-1}k\big)^{-1}\qquad\text{and}\qquad\Lambda_{ab} = -\big(c - k^T C_N^{-1}k\big)^{-1}k^T C_N^{-1}$$
Then we substitute them into Eq (2.96) and (2.97), which yields:
$$p(t_{N+1}|t_N) = \mathcal{N}(\mu_{a|b},\Lambda_{aa}^{-1})$$
For its mean $\mu_{a|b}$, we have:
$$\mu_{a|b} = 0 - \big(c-k^T C_N^{-1}k\big)\cdot\big[-(c-k^T C_N^{-1}k)^{-1}k^T C_N^{-1}\big]\cdot(t_N-0) = k^T C_N^{-1}t_N = m(x_{N+1})$$
Similarly, for its variance $\Lambda_{aa}^{-1}$ (note that here, since $t_{N+1}$ is a scalar, the mean and the covariance matrix actually degenerate to the one-dimensional case), we have:
$$\Lambda_{aa}^{-1} = c - k^T C_N^{-1}k = \sigma^2(x_{N+1})$$
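These expressions can be evaluated directly. The sketch below performs Gaussian-process prediction on an arbitrary toy data set, with an assumed RBF covariance and an assumed noise precision $\beta$ playing the role of Eq (6.62):

```python
import numpy as np

def rbf(x1, x2, length=0.3):
    return np.exp(-0.5 * (x1[:, None] - x2[None, :]) ** 2 / length ** 2)

rng = np.random.default_rng(4)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=10)
beta = 100.0                                       # assumed noise precision

C_N = rbf(x, x) + np.eye(10) / beta                # covariance of the training targets
x_new = np.array([0.5])
k = rbf(x, x_new).ravel()                          # k(x_n, x_{N+1})
c = rbf(x_new, x_new)[0, 0] + 1.0 / beta

mean = k @ np.linalg.solve(C_N, t)                 # m(x_{N+1}) = k^T C_N^{-1} t
var = c - k @ np.linalg.solve(C_N, k)              # sigma^2(x_{N+1}) = c - k^T C_N^{-1} k
print(mean, var)
```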
Problem 6.21 Solution

We follow the hint, beginning by verifying the mean. We write Eq (6.62) in matrix form:
$$C_N = \frac{1}{\alpha}\Phi\Phi^T + \beta^{-1}I_N$$
where we have used Eq (6.54). Here $\Phi$ is the design matrix defined below Eq (6.51) and $I_N$ is an identity matrix. Before we use Eq (6.66), we need to obtain $k$:
$$k = \big[k(x_1,x_{N+1}),\dots,k(x_N,x_{N+1})\big]^T = \frac{1}{\alpha}\big[\phi(x_1)^T\phi(x_{N+1}),\dots,\phi(x_N)^T\phi(x_{N+1})\big]^T = \frac{1}{\alpha}\Phi\phi(x_{N+1})$$
Now we substitute all the expressions into Eq (6.66), yielding:
$$m(x_{N+1}) = \alpha^{-1}\phi(x_{N+1})^T\Phi^T\big[\alpha^{-1}\Phi\Phi^T + \beta^{-1}I_N\big]^{-1}t$$
Next, using the matrix identity (C.6), we obtain:
$$\Phi^T\big[\alpha^{-1}\Phi\Phi^T + \beta^{-1}I_N\big]^{-1} = \alpha\beta\big[\beta\Phi^T\Phi + \alpha I_M\big]^{-1}\Phi^T = \alpha\beta S_N\Phi^T$$
where we have used Eq (3.54). Substituting it into $m(x_{N+1})$, we obtain:
$$m(x_{N+1}) = \beta\phi(x_{N+1})^T S_N\Phi^T t = \big\langle\phi(x_{N+1}),\ \beta S_N\Phi^T t\big\rangle$$
where $\langle\cdot,\cdot\rangle$ represents the inner product. Comparing the result above with Eq (3.58), (3.54) and (3.53), we conclude that the means are equal. It is similar for the variance: we substitute $c$, $k$ and $C_N$ into Eq (6.67), simplify the expression using the matrix identity (C.7), and finally observe that it is equal to Eq (3.59).
Problem 6.22 Solution

Based on Eq (6.64) and (6.65), we first write down the joint distribution for $t_{N+L} = [t_1(x),t_2(x),\dots,t_{N+L}(x)]^T$:
$$p(t_{N+L}) = \mathcal{N}(t_{N+L}|0,C_{N+L}),\qquad C_{N+L} = \begin{pmatrix}C_{1,N} & K\\ K^T & C_{N+1,N+L}\end{pmatrix}$$
The expression above has already implicitly divided the vector $t_{N+L}$ into two parts. Similar to Prob. 6.20, for later simplicity we rearrange the order of $t_{N+L}$, denoted $\bar{t}_{N+L} = [t_{N+1},\dots,t_{N+L},t_1,\dots,t_N]^T$. Moreover, $\bar{t}_{N+L}$ should also follow a Gaussian distribution:
$$p(\bar{t}_{N+L}) = \mathcal{N}(\bar{t}_{N+L}|0,\bar{C}_{N+L}),\qquad\bar{C}_{N+L} = \begin{pmatrix}C_{N+1,N+L} & K^T\\ K & C_{1,N}\end{pmatrix}$$
Now we use Eq (2.94)-(2.98) and Eq (2.79)-(2.80) to derive the conditional distribution, beginning by calculating $\Lambda_{aa}$ and $\Lambda_{ab}$:
$$\Lambda_{aa} = \big(C_{N+1,N+L} - K^T C_{1,N}^{-1}K\big)^{-1},\qquad\Lambda_{ab} = -\big(C_{N+1,N+L} - K^T C_{1,N}^{-1}K\big)^{-1}K^T C_{1,N}^{-1}$$
Now we can obtain:
$$p(t_{N+1},\dots,t_{N+L}|t_N) = \mathcal{N}(\mu_{a|b},\Lambda_{aa}^{-1}),\qquad\mu_{a|b} = 0 + K^T C_{1,N}^{-1}t_N = K^T C_{1,N}^{-1}t_N$$
If we now want to find the conditional distribution $p(t_j|t_N)$, where $N+1\le j\le N+L$, we only need to find the corresponding entry in the mean (i.e., the $(j-N)$th entry) and covariance matrix (i.e., the $(j-N)$th diagonal entry) of $p(t_{N+1},\dots,t_{N+L}|t_N)$. In this case, it degenerates to Eq (6.66) and (6.67), just as required.
Problem 6.24 Solution

By definition, we only need to prove that for an arbitrary vector $x\neq 0$, $x^T W x$ is positive. Here suppose that $W$ is an $M\times M$ matrix. We expand the multiplication:
$$x^T W x = \sum_{i=1}^{M}\sum_{j=1}^{M}W_{ij}x_i x_j = \sum_{i=1}^{M}W_{ii}x_i^2$$
where we have used the fact that $W$ is a diagonal matrix. Since $W_{ii}>0$, we obtain $x^T W x>0$ just as required. Suppose we have two positive definite matrices, denoted $A_1$ and $A_2$, i.e., for an arbitrary vector $x\neq 0$ we have $x^T A_1 x>0$ and $x^T A_2 x>0$. Therefore, we can obtain:
$$x^T(A_1+A_2)x = x^T A_1 x + x^T A_2 x > 0$$
Just as required.
Problem 6.25 Solution

Based on the Newton-Raphson formula, Eq (6.81) and Eq (6.82), we have:
$$a_N^{new} = a_N - \big(-W_N - C_N^{-1}\big)^{-1}\big(t_N - \sigma_N - C_N^{-1}a_N\big) = a_N + \big(W_N + C_N^{-1}\big)^{-1}\big(t_N - \sigma_N - C_N^{-1}a_N\big)$$
$$= \big(W_N + C_N^{-1}\big)^{-1}\big[(W_N + C_N^{-1})a_N + t_N - \sigma_N - C_N^{-1}a_N\big] = \big(W_N + C_N^{-1}\big)^{-1}\big(t_N - \sigma_N + W_N a_N\big)$$
$$= C_N\big[(W_N + C_N^{-1})C_N\big]^{-1}\big(t_N - \sigma_N + W_N a_N\big) = C_N\big(I + W_N C_N\big)^{-1}\big(t_N - \sigma_N + W_N a_N\big)$$
Just as required.
Problem 6.26 Solution

Using Eq (6.77), (6.78) and (6.86), we can obtain:
$$p(a_{N+1}|t_N) = \int p(a_{N+1}|a_N)\,p(a_N|t_N)\,da_N = \int\mathcal{N}\big(a_{N+1}\,\big|\,k^T C_N^{-1}a_N,\,c-k^T C_N^{-1}k\big)\,\mathcal{N}(a_N|a_N^{\star},H^{-1})\,da_N$$
By analogy with Eq (2.115), i.e., $p(y) = \int p(y|x)p(x)\,dx$, we can obtain:
$$p(a_{N+1}|t_N) = \mathcal{N}\big(A\mu+b,\ L^{-1} + A\Lambda^{-1}A^T\big)\qquad(*)$$
where we have defined:
$$A = k^T C_N^{-1},\quad b=0,\quad L^{-1} = c - k^T C_N^{-1}k,\quad\mu = a_N^{\star},\quad\Lambda = H$$
Therefore, the mean is given by:
$$A\mu + b = k^T C_N^{-1}a_N^{\star} = k^T C_N^{-1}C_N(t_N-\sigma_N) = k^T(t_N-\sigma_N)$$
where we have used Eq (6.84). The covariance is given by:
$$L^{-1} + A\Lambda^{-1}A^T = c - k^T C_N^{-1}k + k^T C_N^{-1}H^{-1}C_N^{-1}k = c - k^T\big[C_N^{-1} - C_N^{-1}(W_N + C_N^{-1})^{-1}C_N^{-1}\big]k = c - k^T\big(W_N^{-1} + C_N\big)^{-1}k$$
where we have used Eq (6.85), the fact that $C_N$ is symmetric, and the matrix identity (C.7) in the last step, which finally gives Eq (6.88).
Problem 6.27 Solution (waiting for update). This problem is really complicated. What's more, I find that Eq (6.91) does not seem right.
0.7 Sparse Kernel Machines
Problem 7.1 Solution

By analogy with Eq (2.249), we can obtain:
$$p(x|t) = \begin{cases}\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}\dfrac{1}{Z_k}k(x,x_n), & t=+1\\[2ex]\dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}\dfrac{1}{Z_k}k(x,x_n), & t=-1\end{cases}$$
where $N_{+1}$ represents the number of samples with label $t=+1$, and similarly for $N_{-1}$; $Z_k$ is a normalization constant representing the volume of the hypercube. Since we have an equal prior for the two classes, i.e., $p(t=+1)=p(t=-1)=0.5$, Bayes' theorem gives $p(t|x)\propto p(x|t)p(t)$, yielding:
$$p(t|x) = \begin{cases}\dfrac{1}{Z}\cdot\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}k(x,x_n), & t=+1\\[2ex]\dfrac{1}{Z}\cdot\dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}k(x,x_n), & t=-1\end{cases}$$
where $1/Z$ is a normalization constant guaranteeing that the posterior sums to $1$. To classify a new sample $x$, we try to find the value of $t$ that maximizes $p(t|x)$. Therefore, we can obtain:
$$t = \begin{cases}+1, & \text{if }\dfrac{1}{N_{+1}}\displaystyle\sum_{n=1}^{N_{+1}}k(x,x_n)\ \ge\ \dfrac{1}{N_{-1}}\displaystyle\sum_{n=1}^{N_{-1}}k(x,x_n)\\[2ex]-1, & \text{otherwise}\end{cases}\qquad(*)$$
If we now choose the kernel function as $k(x,x') = x^T x'$, we have:
$$\frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}k(x,x_n) = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}x^T x_n = x^T\tilde{x}_{+1},\qquad\text{where}\quad\tilde{x}_{+1} = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}x_n$$
and similarly for $\tilde{x}_{-1}$. Therefore, the classification criterion $(*)$ can be written as:
$$t = \begin{cases}+1, & \text{if } x^T\tilde{x}_{+1}\ \ge\ x^T\tilde{x}_{-1}\\ -1, & \text{otherwise}\end{cases}$$
When we choose the kernel function as $k(x,x') = \phi(x)^T\phi(x')$, we can similarly obtain the classification criterion:
$$t = \begin{cases}+1, & \text{if }\phi(x)^T\tilde{\phi}_{+1}\ \ge\ \phi(x)^T\tilde{\phi}_{-1}\\ -1, & \text{otherwise}\end{cases},\qquad\text{where}\quad\tilde{\phi}_{+1} = \frac{1}{N_{+1}}\sum_{n=1}^{N_{+1}}\phi(x_n)$$
Problem 7.2 Solution

Suppose we have found $w_0$ and $b_0$ which satisfy the constraint (7.5) for all points and simultaneously minimize Eq (7.3). The hyperplane decided by $w_0$ and $b_0$ is the optimal classification margin. Now suppose the constraint in Eq (7.5) becomes:
$$t_n\big(w^T\phi(x_n)+b\big)\ge\gamma$$
If we perform the change of variables $w_0\to\gamma w_0$ and $b_0\to\gamma b_0$, the constraint is still satisfied and Eq (7.3) is still minimized. In other words, if the right-hand side of the constraint changes from $1$ to $\gamma$, the new optimal solution is given by $\gamma w_0$ and $\gamma b_0$. However, this defines the same hyperplane, so the minimum distance from the points to the classification boundary is unchanged.
Problem 7.3 Solution

Suppose $x_1$ belongs to class one, with target value $t_1=1$, and $x_2$ belongs to class two, with target value $t_2=-1$. Since we only have two points, they must both satisfy $t_i\,y(x_i)=1$, as shown in Fig. 7.1. Therefore, we have an equality-constrained optimization problem:
$$\text{minimize}\ \frac{1}{2}\|w\|^2\qquad\text{s.t.}\quad w^T\phi(x_1)+b = 1,\quad w^T\phi(x_2)+b = -1$$
This is a convex optimization problem and it has been proved that a global optimum exists.
Problem 7.4 Solution

Since we know that
$$\rho = \frac{1}{\|w\|}$$
we have $1/\rho^2 = \|w\|^2$. In other words, we only need to prove that
$$\|w\|^2 = \sum_{n=1}^{N}a_n$$
At the optimal solution, the second term on the right-hand side of Eq (7.7) vanishes. Based on Eq (7.8) and Eq (7.10), we also observe that its dual is given by:
$$\widetilde{L}(a) = \sum_{n=1}^{N}a_n - \frac{1}{2}\|w\|^2$$
Therefore, we have:
$$\frac{1}{2}\|w\|^2 = L(a) = \widetilde{L}(a) = \sum_{n=1}^{N}a_n - \frac{1}{2}\|w\|^2$$
Rearranging it, we obtain what is required.
Problem 7.5 Solution
We have already proved this problem in the previous one.
Problem 7.6 Solution

If the target variable can only take the values $\{-1,1\}$, and we know that
$$p(t=1|y) = \sigma(y)$$
we can obtain:
$$p(t=-1|y) = 1 - p(t=1|y) = 1 - \sigma(y) = \sigma(-y)$$
Therefore, combining these two cases, we can derive:
$$p(t|y) = \sigma(yt)$$
Consequently, we can obtain the negative log-likelihood:
$$-\ln p(D) = -\ln\prod_{n=1}^{N}\sigma(y_n t_n) = -\sum_{n=1}^{N}\ln\sigma(y_n t_n) = \sum_{n=1}^{N}E_{LR}(y_n t_n)$$
Here $D$ represents the dataset, i.e., $D=\{(x_n,t_n);\ n=1,2,\dots,N\}$, and $E_{LR}(yt)$ is given by Eq (7.48). With the addition of a quadratic regularizer, we obtain exactly Eq (7.47).
Problem 7.7 Solution

The derivatives are easy to obtain. Our main task is to derive Eq (7.61) using Eq (7.57)-(7.60):
$$L = C\sum_{n=1}^{N}(\xi_n+\hat{\xi}_n) + \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}(\mu_n\xi_n+\hat{\mu}_n\hat{\xi}_n) - \sum_{n=1}^{N}a_n(\epsilon+\xi_n+y_n-t_n) - \sum_{n=1}^{N}\hat{a}_n(\epsilon+\hat{\xi}_n-y_n+t_n)$$
$$= C\sum_{n=1}^{N}(\xi_n+\hat{\xi}_n) + \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}(a_n+\mu_n)\xi_n - \sum_{n=1}^{N}(\hat{a}_n+\hat{\mu}_n)\hat{\xi}_n - \sum_{n=1}^{N}(a_n+\hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n-\hat{a}_n)(y_n-t_n)$$
$$= \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}(a_n+\hat{a}_n)\epsilon - \sum_{n=1}^{N}(a_n-\hat{a}_n)\big(w^T\phi(x_n)+b-t_n\big)$$
$$= \frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}(a_n-\hat{a}_n)w^T\phi(x_n) - \sum_{n=1}^{N}(a_n+\hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n-\hat{a}_n)t_n$$
$$= \frac{1}{2}\|w\|^2 - \|w\|^2 - \sum_{n=1}^{N}(a_n+\hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n-\hat{a}_n)t_n = -\frac{1}{2}\|w\|^2 - \sum_{n=1}^{N}(a_n+\hat{a}_n)\epsilon + \sum_{n=1}^{N}(a_n-\hat{a}_n)t_n$$
where we have used Eq (7.57), i.e. $w = \sum_n(a_n-\hat{a}_n)\phi(x_n)$, Eq (7.58), i.e. $\sum_n(a_n-\hat{a}_n)=0$ (which eliminates the $b$ term), and Eq (7.59)-(7.60), i.e. $a_n+\mu_n = C$ and $\hat{a}_n+\hat{\mu}_n = C$. Just as required.
Problem 7.8 Solution
This obviously follows from the KKT condition, described in Eq (7.67) and
(7.68).
Problem 7.9 Solution

The prior is given by Eq (7.80):
$$p(w|\alpha) = \prod_{i=1}^{M}\mathcal{N}(w_i|0,\alpha_i^{-1}) = \mathcal{N}(w|0,A^{-1})$$
where we have defined $A = \mathrm{diag}(\alpha_i)$. The likelihood is given by Eq (7.79):
$$p(t|X,w,\beta) = \prod_{n=1}^{N}p(t_n|x_n,w,\beta^{-1}) = \prod_{n=1}^{N}\mathcal{N}\big(t_n|w^T\phi(x_n),\beta^{-1}\big) = \mathcal{N}(t|\Phi w,\beta^{-1}I)$$
where we have defined:
$$\Phi = [\phi(x_1),\phi(x_2),\dots,\phi(x_N)]^T$$
Our definitions of $\Phi$ and $A$ are consistent with the main text. Therefore, according to Eq (2.113)-(2.117), we have:
$$p(w|t,X,\alpha,\beta) = \mathcal{N}(m,\Sigma),\qquad\Sigma = (A+\beta\Phi^T\Phi)^{-1},\qquad m = \beta\Sigma\Phi^T t$$
Just as required.
Problem 7.10 & 7.11 Solution

It is quite similar to the previous problem. We begin by writing down the prior:
$$p(w|\alpha) = \prod_{i=1}^{M}\mathcal{N}(w_i|0,\alpha_i^{-1}) = \mathcal{N}(w|0,A^{-1})$$
Then we write down the likelihood:
$$p(t|X,w,\beta) = \prod_{n=1}^{N}\mathcal{N}\big(t_n|w^T\phi(x_n),\beta^{-1}\big) = \mathcal{N}(t|\Phi w,\beta^{-1}I)$$
Since we know that
$$p(t|X,\alpha,\beta) = \int p(t|X,w,\beta)\,p(w|\alpha)\,dw$$
first, as required by Prob. 7.10, we solve it by completing the square. We begin by writing down the expression for $p(t|X,\alpha,\beta)$:
$$p(t|X,\alpha,\beta) = \int\mathcal{N}(w|0,A^{-1})\,\mathcal{N}(t|\Phi w,\beta^{-1}I)\,dw = \Big(\frac{\beta}{2\pi}\Big)^{N/2}\frac{1}{(2\pi)^{M/2}}\prod_{i=1}^{M}\alpha_i^{1/2}\int\exp\{-E(w)\}\,dw$$
where we have defined:
$$E(w) = \frac{1}{2}w^T A w + \frac{\beta}{2}\|t-\Phi w\|^2$$
We expand $E(w)$ with respect to $w$:
$$E(w) = \frac{1}{2}\big[w^T(A+\beta\Phi^T\Phi)w - 2\beta t^T\Phi w + \beta t^T t\big] = \frac{1}{2}\big[w^T\Sigma^{-1}w - 2m^T\Sigma^{-1}w + \beta t^T t\big] = \frac{1}{2}(w-m)^T\Sigma^{-1}(w-m) + \frac{1}{2}\big(\beta t^T t - m^T\Sigma^{-1}m\big)$$
where we have used Eq (7.82) and Eq (7.83). Substituting $E(w)$ into the integral, we obtain:
$$p(t|X,\alpha,\beta) = \Big(\frac{\beta}{2\pi}\Big)^{N/2}\frac{1}{(2\pi)^{M/2}}\prod_{i=1}^{M}\alpha_i^{1/2}\,(2\pi)^{M/2}|\Sigma|^{1/2}\exp\Big\{-\frac{1}{2}\big(\beta t^T t - m^T\Sigma^{-1}m\big)\Big\} = \Big(\frac{\beta}{2\pi}\Big)^{N/2}|\Sigma|^{1/2}\prod_{i=1}^{M}\alpha_i^{1/2}\exp\{-E(t)\}$$
We further expand $E(t)$:
$$E(t) = \frac{1}{2}\big(\beta t^T t - m^T\Sigma^{-1}m\big) = \frac{1}{2}\big(\beta t^T t - \beta^2 t^T\Phi\Sigma\Phi^T t\big) = \frac{1}{2}t^T\big[\beta I - \beta\Phi(A+\beta\Phi^T\Phi)^{-1}\Phi^T\beta\big]t = \frac{1}{2}t^T\big(\beta^{-1}I + \Phi A^{-1}\Phi^T\big)^{-1}t = \frac{1}{2}t^T C^{-1}t$$
Note that in the last step we have used the matrix identity Eq (C.7). Therefore, since the pdf is Gaussian and the exponential term is given by $E(t)$, we can easily write down Eq (7.85), taking into account the normalization constants.

What's more, as required by Prob. 7.11, the evaluation of the integral can also be performed easily using Eq (2.113)-(2.117).
Problem 7.12 Solution

According to the previous problem, we can explicitly write down the log marginal likelihood in an alternative form:
$$\ln p(t|X,\alpha,\beta) = \frac{N}{2}\ln\beta - \frac{N}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma| + \frac{1}{2}\sum_{i=1}^{M}\ln\alpha_i - E(t)$$
We first derive:
$$\frac{dE(t)}{d\alpha_i} = -\frac{1}{2}\frac{d}{d\alpha_i}\big(m^T\Sigma^{-1}m\big) = -\frac{1}{2}\frac{d}{d\alpha_i}\big(\beta^2 t^T\Phi\Sigma\Phi^T t\big) = -\frac{1}{2}\mathrm{Tr}\Big[\frac{d}{d\Sigma^{-1}}\big(\beta^2 t^T\Phi\Sigma\Phi^T t\big)\cdot\frac{d\Sigma^{-1}}{d\alpha_i}\Big] = \frac{1}{2}\beta^2\mathrm{Tr}\big[\Sigma(\Phi^T t)(\Phi^T t)^T\Sigma\cdot I_i\big] = \frac{1}{2}m_i^2$$
In the last step, we have utilized the identity
$$\frac{d}{dX}\mathrm{Tr}\big(AX^{-1}B\big) = -X^{-T}A^T B^T X^{-T}$$
Moreover, here $I_i$ is a matrix with all elements equal to zero except the $i$th diagonal element, which equals $1$. Then we utilize the matrix identity Eq (C.22) to derive:
$$\frac{d\ln|\Sigma|}{d\alpha_i} = -\frac{d\ln|\Sigma^{-1}|}{d\alpha_i} = -\mathrm{Tr}\Big[\Sigma\frac{d}{d\alpha_i}\big(A+\beta\Phi^T\Phi\big)\Big] = -\Sigma_{ii}$$
Therefore, we can obtain:
$$\frac{d\ln p}{d\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}m_i^2 - \frac{1}{2}\Sigma_{ii}$$
Setting it to zero, we obtain:
$$\alpha_i = \frac{1-\alpha_i\Sigma_{ii}}{m_i^2} = \frac{\gamma_i}{m_i^2}$$
Then we calculate the derivative of $\ln p$ with respect to $\beta$, beginning with:
$$\frac{d\ln|\Sigma|}{d\beta} = -\mathrm{Tr}\Big[\Sigma\frac{d}{d\beta}\big(A+\beta\Phi^T\Phi\big)\Big] = -\mathrm{Tr}\big[\Sigma\Phi^T\Phi\big]$$
Then we continue:
$$\frac{dE(t)}{d\beta} = \frac{1}{2}t^T t - \frac{1}{2}\frac{d}{d\beta}\big(\beta^2 t^T\Phi\Sigma\Phi^T t\big) = \frac{1}{2}t^T t - \beta t^T\Phi\Sigma\Phi^T t - \frac{1}{2}\beta^2\frac{d}{d\beta}\big(t^T\Phi\Sigma\Phi^T t\big)$$
$$= \frac{1}{2}t^T t - t^T(\Phi m) - \frac{1}{2}\beta^2\mathrm{Tr}\Big[\frac{d}{d\Sigma^{-1}}\big(t^T\Phi\Sigma\Phi^T t\big)\cdot\frac{d\Sigma^{-1}}{d\beta}\Big] = \frac{1}{2}t^T t - t^T(\Phi m) + \frac{1}{2}\beta^2\mathrm{Tr}\big[\Sigma(\Phi^T t)(\Phi^T t)^T\Sigma\cdot\Phi^T\Phi\big]$$
$$= \frac{1}{2}t^T t - t^T(\Phi m) + \frac{1}{2}\mathrm{Tr}\big[mm^T\Phi^T\Phi\big] = \frac{1}{2}t^T t - t^T(\Phi m) + \frac{1}{2}\mathrm{Tr}\big[\Phi mm^T\Phi^T\big] = \frac{1}{2}\|t-\Phi m\|^2$$
Therefore, we have obtained:
$$\frac{d\ln p}{d\beta} = \frac{1}{2}\Big[\frac{N}{\beta} - \|t-\Phi m\|^2 - \mathrm{Tr}\big(\Sigma\Phi^T\Phi\big)\Big]$$
Using Eq (7.83), we can obtain:
$$\Sigma\Phi^T\Phi = \big[\Sigma(\beta\Phi^T\Phi + A) - \Sigma A\big]\beta^{-1} = (I-\Sigma A)\beta^{-1}$$
Setting the derivative equal to zero, we can obtain:
$$\beta^{-1} = \frac{\|t-\Phi m\|^2}{N - \mathrm{Tr}(I-\Sigma A)} = \frac{\|t-\Phi m\|^2}{N - \sum_i\gamma_i}$$
Just as required.
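The re-estimation equations for $\alpha_i$ and $\beta$ can be iterated directly. The sketch below runs them on an arbitrary toy design matrix with a sparse "true" weight vector; all data and initial values are illustrative assumptions:

```python
import numpy as np

def rvm_update(Phi, t, alpha, beta, n_iter=20):
    """Iterate the re-estimation equations derived above:
    alpha_i = gamma_i / m_i^2,  beta^{-1} = ||t - Phi m||^2 / (N - sum_i gamma_i)."""
    N, M = Phi.shape
    for _ in range(n_iter):
        A = np.diag(alpha)
        Sigma = np.linalg.inv(A + beta * Phi.T @ Phi)     # Eq (7.83)
        m = beta * Sigma @ Phi.T @ t                      # Eq (7.82)
        gamma = 1.0 - alpha * np.diag(Sigma)
        alpha = gamma / (m ** 2 + 1e-12)
        beta = (N - gamma.sum()) / np.sum((t - Phi @ m) ** 2)
    return alpha, beta, m

rng = np.random.default_rng(5)
Phi = rng.normal(size=(30, 5))
w_true = np.array([0.0, 2.0, 0.0, -1.0, 0.0])
t = Phi @ w_true + 0.05 * rng.normal(size=30)
alpha, beta, m = rvm_update(Phi, t, alpha=np.ones(5), beta=1.0)
print(np.round(m, 2))        # weights of the irrelevant basis functions shrink towards zero
```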
Problem 7.13 Solution

This problem is quite confusing. In my view, the posterior should be denoted as $p(w|t,X,\{a_i,b_i\},a_\beta,b_\beta)$, where $a_\beta,b_\beta$ control the Gamma distribution of $\beta$, and $a_i,b_i$ control the Gamma distribution of $\alpha_i$. What we should do is maximize the marginal likelihood $p(t|X,\{a_i,b_i\},a_\beta,b_\beta)$ with respect to $\{a_i,b_i\},a_\beta,b_\beta$. Now we do not have a point estimate for the hyperparameters $\beta$ and $\alpha_i$; instead, we have a distribution over them (controlled by the hyperpriors, i.e., $\{a_i,b_i\},a_\beta,b_\beta$).
Problem 7.14 Solution

We begin by writing down $p(t|x,w,\beta)$. Using Eq (7.76) and Eq (7.77), we can obtain:
$$p(t|x,w^{\star},\beta^{\star}) = \mathcal{N}\big(t\,\big|\,(w^{\star})^T\phi(x),(\beta^{\star})^{-1}\big)$$
Then we write down $p(w|X,t,\alpha^{\star},\beta^{\star})$. Using Eq (7.81), (7.82) and (7.83), we can obtain:
$$p(w|X,t,\alpha^{\star},\beta^{\star}) = \mathcal{N}(w|m,\Sigma)$$
where $m$ and $\Sigma$ are evaluated using Eq (7.82) and (7.83) with $\alpha=\alpha^{\star}$ and $\beta=\beta^{\star}$. Then we utilize Eq (7.90) and obtain:
$$p(t|x,X,t,\alpha^{\star},\beta^{\star}) = \int\mathcal{N}\big(t\,\big|\,\phi(x)^T w,(\beta^{\star})^{-1}\big)\,\mathcal{N}(w|m,\Sigma)\,dw$$
Using Eq (2.113)-(2.117), we can obtain:
$$p(t|x,X,t,\alpha^{\star},\beta^{\star}) = \mathcal{N}(\mu,\sigma^2),\qquad\mu = m^T\phi(x),\qquad\sigma^2 = (\beta^{\star})^{-1} + \phi(x)^T\Sigma\phi(x)$$
Just as required.
Problem 7.15 Solution

We just follow the hint:
$$L(\alpha) = -\frac{1}{2}\big\{N\ln 2\pi + \ln|C| + t^T C^{-1}t\big\} = -\frac{1}{2}\Big\{N\ln 2\pi + \ln|C_{-i}| + \ln\big|1+\alpha_i^{-1}\varphi_i^T C_{-i}^{-1}\varphi_i\big| + t^T\Big(C_{-i}^{-1} - \frac{C_{-i}^{-1}\varphi_i\varphi_i^T C_{-i}^{-1}}{\alpha_i+\varphi_i^T C_{-i}^{-1}\varphi_i}\Big)t\Big\}$$
$$= L(\alpha_{-i}) - \frac{1}{2}\ln\big|1+\alpha_i^{-1}\varphi_i^T C_{-i}^{-1}\varphi_i\big| + \frac{1}{2}\frac{t^T C_{-i}^{-1}\varphi_i\varphi_i^T C_{-i}^{-1}t}{\alpha_i+\varphi_i^T C_{-i}^{-1}\varphi_i} = L(\alpha_{-i}) - \frac{1}{2}\ln\big|1+\alpha_i^{-1}s_i\big| + \frac{1}{2}\frac{q_i^2}{\alpha_i+s_i}$$
$$= L(\alpha_{-i}) - \frac{1}{2}\ln\frac{\alpha_i+s_i}{\alpha_i} + \frac{1}{2}\frac{q_i^2}{\alpha_i+s_i} = L(\alpha_{-i}) + \frac{1}{2}\Big[\ln\alpha_i - \ln(\alpha_i+s_i) + \frac{q_i^2}{\alpha_i+s_i}\Big] = L(\alpha_{-i}) + \lambda(\alpha_i)$$
where we have defined $\lambda(\alpha_i)$, $s_i$ and $q_i$ as shown in Eq (7.97)-(7.99).
Problem 7.16 Solution

We first calculate the first derivative of Eq (7.97) with respect to $\alpha_i$:
$$\frac{\partial\lambda}{\partial\alpha_i} = \frac{1}{2}\Big[\frac{1}{\alpha_i} - \frac{1}{\alpha_i+s_i} - \frac{q_i^2}{(\alpha_i+s_i)^2}\Big]$$
Then we calculate the second derivative:
$$\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\Big[-\frac{1}{\alpha_i^2} + \frac{1}{(\alpha_i+s_i)^2} + \frac{2q_i^2}{(\alpha_i+s_i)^3}\Big]$$
Next we aim to prove that when $\alpha_i$ is given by Eq (7.101), i.e., at the point where the first derivative equals $0$, the second derivative (i.e., the expression above) is negative. First we can obtain:
$$\alpha_i + s_i = \frac{s_i^2}{q_i^2-s_i} + s_i = \frac{s_i q_i^2}{q_i^2-s_i}$$
Therefore, substituting $\alpha_i+s_i$ and $\alpha_i$ into the second derivative, we can obtain:
$$\frac{\partial^2\lambda}{\partial\alpha_i^2} = \frac{1}{2}\Big[-\frac{(q_i^2-s_i)^2}{s_i^4} + \frac{(q_i^2-s_i)^2}{s_i^2 q_i^4} + \frac{2q_i^2(q_i^2-s_i)^3}{s_i^3 q_i^6}\Big] = \frac{1}{2}\Big[-\frac{q_i^4(q_i^2-s_i)^2}{q_i^4 s_i^4} + \frac{s_i^2(q_i^2-s_i)^2}{s_i^4 q_i^4} + \frac{2s_i(q_i^2-s_i)^3}{s_i^4 q_i^4}\Big]$$
$$= \frac{1}{2}\frac{(q_i^2-s_i)^2}{q_i^4 s_i^4}\Big[-q_i^4 + s_i^2 + 2s_i(q_i^2-s_i)\Big] = \frac{1}{2}\frac{(q_i^2-s_i)^2}{q_i^4 s_i^4}\Big[-(q_i^2-s_i)^2\Big] = -\frac{1}{2}\frac{(q_i^2-s_i)^4}{q_i^4 s_i^4} < 0$$
Just as required.
Problem 7.17 Solution

We just follow the hint. According to Eq (7.102), Eq (7.86) and the matrix identity (C.7), we have:
$$Q_i = \varphi_i^T C^{-1}t = \varphi_i^T\big(\beta^{-1}I + \Phi A^{-1}\Phi^T\big)^{-1}t = \varphi_i^T\big[\beta I - \beta I\Phi(A+\Phi^T\beta I\Phi)^{-1}\Phi^T\beta I\big]t = \varphi_i^T\big(\beta - \beta^2\Phi\Sigma\Phi^T\big)t = \beta\varphi_i^T t - \beta^2\varphi_i^T\Phi\Sigma\Phi^T t$$
Similarly, we can obtain:
$$S_i = \varphi_i^T C^{-1}\varphi_i = \varphi_i^T\big(\beta - \beta^2\Phi\Sigma\Phi^T\big)\varphi_i = \beta\varphi_i^T\varphi_i - \beta^2\varphi_i^T\Phi\Sigma\Phi^T\varphi_i$$
Just as required.
Problem 7.18 Solution

We begin by deriving the first term in Eq (7.109) with respect to $w$. This can be easily evaluated based on Eq (4.90)-(4.91):
$$\nabla_w\sum_{n=1}^{N}\big[t_n\ln y_n + (1-t_n)\ln(1-y_n)\big] = \sum_{n=1}^{N}(t_n-y_n)\phi_n = \Phi^T(t-y)$$
Since the derivative of the second term in Eq (7.109) with respect to $w$ is rather simple to obtain, the first derivative of Eq (7.109) with respect to $w$ is:
$$\frac{\partial\ln p}{\partial w} = \Phi^T(t-y) - Aw$$
For the Hessian matrix, we can first obtain:
$$\nabla_w\Phi^T(t-y) = \sum_{n=1}^{N}\nabla_w(t_n-y_n)\phi_n = -\sum_{n=1}^{N}\phi_n\frac{\partial\sigma(w^T\phi_n)}{\partial w} = -\sum_{n=1}^{N}\phi_n\frac{\partial\sigma(a)}{\partial a}\frac{\partial a}{\partial w}$$
where we have defined $a = w^T\phi_n$. Then we can utilize Eq (4.88) to derive:
$$\nabla_w\Phi^T(t-y) = -\sum_{n=1}^{N}\sigma(1-\sigma)\,\phi_n\phi_n^T = -\Phi^T B\Phi$$
where $B$ is a diagonal $N\times N$ matrix with elements $b_n = y_n(1-y_n)$. Therefore, we can obtain the Hessian matrix:
$$H = \nabla_w\frac{\partial\ln p}{\partial w} = -(\Phi^T B\Phi + A)$$
Just as required.
Problem 7.19 Solution

We begin from Eq (7.114):
$$p(t|\alpha) = p(t|w)\,p(w|\alpha)\,(2\pi)^{M/2}|\Sigma|^{1/2}\Big|_{w=w^{\star}} = \Big[\prod_{n=1}^{N}p(t_n|x_n,w)\Big]\,\mathcal{N}(w|0,A^{-1})\,(2\pi)^{M/2}|\Sigma|^{1/2}\Big|_{w=w^{\star}}$$
We further take the logarithm of both sides:
$$\ln p(t|\alpha) = \Big[\sum_{n=1}^{N}\ln p(t_n|x_n,w) + \ln\mathcal{N}(w|0,A^{-1}) + \frac{M}{2}\ln 2\pi + \frac{1}{2}\ln|\Sigma|\Big]_{w=w^{\star}}$$
$$= \Big[\sum_{n=1}^{N}\big(t_n\ln y_n + (1-t_n)\ln(1-y_n)\big) - \frac{1}{2}w^T A w + \frac{1}{2}\ln|\Sigma| - \frac{1}{2}\ln|A^{-1}| + \text{const}\Big]_{w=w^{\star}}$$
Using the chain rule, we can obtain:
$$\frac{\partial\ln p(t|\alpha)}{\partial\alpha_i}\Big|_{w=w^{\star}} = \frac{\partial\ln p(t|\alpha)}{\partial w}\frac{\partial w}{\partial\alpha_i}\Big|_{w=w^{\star}} + \frac{\partial}{\partial\alpha_i}\Big[\frac{1}{2}\ln|\Sigma| - \frac{1}{2}\ln|A^{-1}|\Big]_{w=w^{\star}}$$
Observing Eq (7.109) and (7.110), and that (7.110) equals $0$ at $w^{\star}$, we conclude that the first term on the right-hand side vanishes. Therefore, we only need to focus on the second term. It is rather easy to obtain:
$$\frac{\partial}{\partial\alpha_i}\Big[-\frac{1}{2}\ln|A^{-1}|\Big] = -\frac{1}{2}\frac{\partial}{\partial\alpha_i}\sum_{i'}\ln\alpha_{i'}^{-1} = \frac{1}{2\alpha_i}$$
Then, following the same procedure as in Prob. 7.12, we obtain:
$$\frac{\partial}{\partial\alpha_i}\Big[\frac{1}{2}\ln|\Sigma|\Big] = -\frac{1}{2}\Sigma_{ii}$$
Therefore, we obtain:
$$\frac{\partial\ln p(t|\alpha)}{\partial\alpha_i} = \frac{1}{2\alpha_i} - \frac{1}{2}\Sigma_{ii}$$
Note: here I draw a different conclusion from the main text. I have also verified my result in another way. One can write the prior as the product of $\mathcal{N}(w_i|0,\alpha_i^{-1})$ instead of $\mathcal{N}(w|0,A^{-1})$. In this form, since we know that
$$\frac{\partial}{\partial\alpha_i}\sum_{i'=1}^{M}\ln\mathcal{N}(w_{i'}|0,\alpha_{i'}^{-1}) = \frac{\partial}{\partial\alpha_i}\Big(\frac{1}{2}\ln\alpha_i - \frac{\alpha_i}{2}w_i^2\Big) = \frac{1}{2\alpha_i} - \frac{1}{2}(w_i^{\star})^2$$
the above expression can be used to replace the derivative of $-\frac{1}{2}w^T A w - \frac{1}{2}\ln|A^{-1}|$. Since the derivative of the likelihood with respect to $\alpha_i$ is not zero at $w^{\star}$, (7.115) does not seem right anyway.
0.8 Graphical Models
Problem 8.1 Solution

We are required to prove:
$$\int_x p(x)\,dx = \int_x\prod_{k=1}^{K}p(x_k|pa_k)\,dx = 1$$
Here we adopt the same assumption as in the main text: no arrows lead from a higher-numbered node to a lower-numbered node. According to Eq (8.5), we can write:
$$\int_x p(x)\,dx = \int_x\prod_{k=1}^{K}p(x_k|pa_k)\,dx = \int_x p(x_K|pa_K)\prod_{k=1}^{K-1}p(x_k|pa_k)\,dx$$
$$= \int_{[x_1,\dots,x_{K-1}]}\Big\{\int_{x_K}p(x_K|pa_K)\prod_{k=1}^{K-1}p(x_k|pa_k)\,dx_K\Big\}\,dx_1\cdots dx_{K-1} = \int_{[x_1,\dots,x_{K-1}]}\prod_{k=1}^{K-1}p(x_k|pa_k)\Big\{\int_{x_K}p(x_K|pa_K)\,dx_K\Big\}\,dx_1\cdots dx_{K-1}$$
$$= \int_{[x_1,\dots,x_{K-1}]}\prod_{k=1}^{K-1}p(x_k|pa_k)\,dx_1\cdots dx_{K-1}$$
Note that we have used the fact that $x_1,x_2,\dots,x_{K-1}$ do not depend on $x_K$, and thus the product from $k=1$ to $K-1$ can be moved outside the integral with respect to $x_K$, and that the conditional probability is correctly normalized. The aforementioned procedure is repeated $K$ times until all the variables have been integrated out, giving $1$.
Problem 8.2 Solution

This statement is obvious. Suppose that there exists an ordered numbering of the nodes such that for each node there are no links going to a lower-numbered node, and suppose, for the sake of contradiction, that there is a directed cycle in the graph:
$$a_1\to a_2\to\dots\to a_N$$
To make it a real cycle, we also require $a_N\to a_1$. According to the assumption, we have $a_1 < a_2 < \dots < a_N$. Therefore, the last link $a_N\to a_1$ is invalid since $a_N > a_1$.
Problem 8.3 Solution

Based on the definition, we can obtain:
$$p(a,b) = p(a,b,c=0) + p(a,b,c=1) = \begin{cases}0.336, & a=0,b=0\\ 0.264, & a=0,b=1\\ 0.256, & a=1,b=0\\ 0.144, & a=1,b=1\end{cases}$$
Similarly, we can obtain:
$$p(a) = p(a,b=0) + p(a,b=1) = \begin{cases}0.6, & a=0\\ 0.4, & a=1\end{cases}\qquad\text{and}\qquad p(b) = p(a=0,b) + p(a=1,b) = \begin{cases}0.592, & b=0\\ 0.408, & b=1\end{cases}$$
Therefore, we conclude that $p(a,b)\neq p(a)p(b)$. For instance, we have $p(a=1,b=1)=0.144$, $p(a=1)=0.4$ and $p(b=1)=0.408$, and it is obvious that:
$$0.144 = p(a=1,b=1)\neq p(a=1)p(b=1) = 0.4\times 0.408$$
To prove the conditional independence, we first calculate $p(c)$:
$$p(c) = \sum_{a,b=0,1}p(a,b,c) = \begin{cases}0.480, & c=0\\ 0.520, & c=1\end{cases}$$
According to Bayes' theorem, we have:
$$p(a,b|c) = \frac{p(a,b,c)}{p(c)} = \begin{cases}0.400, & a=0,b=0,c=0\\ 0.277, & a=0,b=0,c=1\\ 0.100, & a=0,b=1,c=0\\ 0.415, & a=0,b=1,c=1\\ 0.400, & a=1,b=0,c=0\\ 0.123, & a=1,b=0,c=1\\ 0.100, & a=1,b=1,c=0\\ 0.185, & a=1,b=1,c=1\end{cases}$$
Similarly, we also have:
$$p(a|c) = \frac{p(a,c)}{p(c)} = \begin{cases}0.240/0.480 = 0.500, & a=0,c=0\\ 0.360/0.520 = 0.692, & a=0,c=1\\ 0.240/0.480 = 0.500, & a=1,c=0\\ 0.160/0.520 = 0.308, & a=1,c=1\end{cases}$$
where we have used $p(a,c) = p(a,b=0,c) + p(a,b=1,c)$. Similarly, we can obtain:
$$p(b|c) = \frac{p(b,c)}{p(c)} = \begin{cases}0.384/0.480 = 0.800, & b=0,c=0\\ 0.208/0.520 = 0.400, & b=0,c=1\\ 0.096/0.480 = 0.200, & b=1,c=0\\ 0.312/0.520 = 0.600, & b=1,c=1\end{cases}$$
Now we can easily verify the statement $p(a,b|c) = p(a|c)p(b|c)$. For instance, we have:
$$0.1 = p(a=1,b=1|c=0) = p(a=1|c=0)\,p(b=1|c=0) = 0.5\times 0.2 = 0.1$$
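All of these marginalization and conditioning checks can be carried out in a few lines of code. The joint table used below is the one consistent with the numbers computed above (i.e., Table 8.2 of the textbook, reconstructed from $p(a,b|c)p(c)$):

```python
import numpy as np

# joint distribution p(a, b, c), indexed as p[a, b, c]
p = np.array([[[0.192, 0.144],
               [0.048, 0.216]],
              [[0.192, 0.064],
               [0.048, 0.096]]])

p_ab = p.sum(axis=2)                                   # p(a, b)
p_a, p_b, p_c = p.sum(axis=(1, 2)), p.sum(axis=(0, 2)), p.sum(axis=(0, 1))

print(np.allclose(p_ab, np.outer(p_a, p_b)))           # False: a and b are dependent
p_ab_given_c = p / p_c                                 # p(a, b | c)
p_a_given_c = p.sum(axis=1) / p_c                      # p(a | c)
p_b_given_c = p.sum(axis=0) / p_c                      # p(b | c)
print(np.allclose(p_ab_given_c,
                  p_a_given_c[:, None, :] * p_b_given_c[None, :, :]))   # True
```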
Problem 8.4 Solution

This problem follows the previous one. We have already calculated $p(a)$ and $p(b|c)$; we rewrite them here:
$$p(a) = \begin{cases}0.6, & a=0\\ 0.4, & a=1\end{cases}\qquad p(b|c) = \begin{cases}0.800, & b=0,c=0\\ 0.400, & b=0,c=1\\ 0.200, & b=1,c=0\\ 0.600, & b=1,c=1\end{cases}$$
We can also obtain $p(c|a)$:
$$p(c|a) = \frac{p(a,c)}{p(a)} = \begin{cases}0.24/0.6 = 0.4, & a=0,c=0\\ 0.36/0.6 = 0.6, & a=0,c=1\\ 0.24/0.4 = 0.6, & a=1,c=0\\ 0.16/0.4 = 0.4, & a=1,c=1\end{cases}$$
Now we can easily verify the statement that $p(a,b,c) = p(a)p(c|a)p(b|c)$ given Table 8.2. The corresponding directed graph is:
$$a\ \longrightarrow\ c\ \longrightarrow\ b$$
Problem 8.5 Solution

It looks quite like Figure 8.6. The difference is that we introduce a separate hyperparameter $\alpha_i$ for each weight $w_i$, where $i=1,2,\dots,M$.

[Figure 1: probabilistic graphical model corresponding to the RVM described in (7.79) and (7.80).]

Problem 8.6 Solution (waiting for update)
Problem 8.7 Solution

Let's just follow the hint. We begin by calculating the mean $\mu$:
$$\mathbb{E}[x_1] = b_1$$
According to Eq (8.15), we can obtain:
$$\mathbb{E}[x_2] = \sum_{j\in pa_2}w_{2j}\mathbb{E}[x_j] + b_2 = w_{21}b_1 + b_2$$
Then we can obtain:
$$\mathbb{E}[x_3] = w_{32}\mathbb{E}[x_2] + b_3 = w_{32}(w_{21}b_1+b_2) + b_3 = w_{32}w_{21}b_1 + w_{32}b_2 + b_3$$
Therefore, we obtain Eq (8.17) just as required. Next, we deal with the covariance matrix:
$$\mathrm{cov}[x_1,x_1] = v_1$$
Then we can obtain:
$$\mathrm{cov}[x_1,x_2] = \sum_{k\in pa_2}w_{2k}\mathrm{cov}[x_1,x_k] + I_{12}v_2 = w_{21}\mathrm{cov}[x_1,x_1] = w_{21}v_1$$
and also $\mathrm{cov}[x_2,x_1] = \mathrm{cov}[x_1,x_2] = w_{21}v_1$. Hence, we can obtain:
$$\mathrm{cov}[x_2,x_2] = \sum_{k\in pa_2}w_{2k}\mathrm{cov}[x_2,x_k] + I_{22}v_2 = w_{21}^2 v_1 + v_2$$
Next, we can obtain:
$$\mathrm{cov}[x_1,x_3] = \sum_{k\in pa_3}w_{3k}\mathrm{cov}[x_1,x_k] + I_{13}v_3 = w_{32}w_{21}v_1$$
Then, we can obtain:
$$\mathrm{cov}[x_2,x_3] = \sum_{k\in pa_3}w_{3k}\mathrm{cov}[x_2,x_k] + I_{23}v_3 = w_{32}\big(v_2 + w_{21}^2 v_1\big)$$
Finally, we can obtain:
$$\mathrm{cov}[x_3,x_3] = \sum_{k\in pa_3}w_{3k}\mathrm{cov}[x_3,x_k] + I_{33}v_3 = w_{32}w_{32}\big(v_2 + w_{21}^2 v_1\big) + v_3$$
where we have used the fact that $\mathrm{cov}[x_3,x_2] = \mathrm{cov}[x_2,x_3]$. By now, we have obtained Eq (8.18) just as required.
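The recursion used above generalizes to any topologically ordered linear-Gaussian DAG. The following sketch implements it and evaluates the three-node chain for arbitrary illustrative weights, biases and conditional variances (the function name and the example values are my own):

```python
import numpy as np

def linear_gaussian_moments(W, b, v):
    """Mean and covariance of x for a linear-Gaussian model (Eq 8.15), assuming the
    nodes are topologically ordered so W is strictly lower triangular.
    W[i, j] = w_ij, b = biases, v = conditional variances."""
    D = len(b)
    mu = np.zeros(D)
    cov = np.zeros((D, D))
    for i in range(D):
        mu[i] = W[i, :i] @ mu[:i] + b[i]
        for j in range(i + 1):
            cov[j, i] = W[i, :i] @ cov[j, :i] + (v[i] if i == j else 0.0)
            cov[i, j] = cov[j, i]
    return mu, cov

# The three-node chain x1 -> x2 -> x3 considered above
W = np.array([[0.0, 0.0, 0.0],
              [0.5, 0.0, 0.0],     # w_21
              [0.0, 2.0, 0.0]])    # w_32
b = np.array([1.0, -1.0, 0.5])
v = np.array([1.0, 0.3, 0.2])
print(linear_gaussian_moments(W, b, v))
```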
Problem 8.8 Solution