- p.5, line 5:
$$\operatorname{Var}(\alpha_1'x) = \alpha_1'\Sigma\alpha_1$$
Proof: See Hogg IMS, p.141, thm.2.6.3 or equation (2.6.16).
- p.5, line 13:
To maximize $\alpha_1'\Sigma\alpha_1$ subject to $\alpha_1'\alpha_1 = 1$, the standard approach is to use the technique of Lagrange multipliers. Maximize
$$\alpha_1'\Sigma\alpha_1 - \lambda(\alpha_1'\alpha_1 - 1),$$
where $\lambda$ is a Lagrange multiplier. Differentiation with respect to $\alpha_1$ gives
$$\Sigma\alpha_1 - \lambda\alpha_1 = 0.$$
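As a quick numerical sanity check of this stationarity condition (not from the book; the matrix `Sigma` below is made up for illustration), the unit vector maximizing $\alpha_1'\Sigma\alpha_1$ is an eigenvector of $\Sigma$ for its largest eigenvalue:

```python
import numpy as np

# A made-up symmetric positive-definite matrix standing in for Sigma
Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.8],
                  [0.5, 0.8, 2.0]])

# For a symmetric matrix, eigh returns eigenvalues in ascending order
eigvals, eigvecs = np.linalg.eigh(Sigma)
alpha1 = eigvecs[:, -1]                    # unit eigenvector of the largest eigenvalue
max_var = alpha1 @ Sigma @ alpha1

# Stationarity condition from the Lagrange argument: Sigma alpha1 = lambda alpha1
assert np.allclose(Sigma @ alpha1, eigvals[-1] * alpha1)

# No random unit vector attains a larger value of alpha' Sigma alpha
rng = np.random.default_rng(0)
for _ in range(1000):
    a = rng.normal(size=3)
    a /= np.linalg.norm(a)
    assert a @ Sigma @ a <= max_var + 1e-12
```

The maximum value $\alpha_1'\Sigma\alpha_1$ attained is the largest eigenvalue itself.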
Proof: Recall that we have to solve for $\lambda$ in $\nabla(\alpha_1'\Sigma\alpha_1) = \lambda\,\nabla(\alpha_1'\alpha_1 - 1)$. Suppose that $\alpha_1 = (c_1, c_2, \ldots, c_n)'$. Then
$$\nabla(\alpha_1'\Sigma\alpha_1) = \left(\frac{\partial}{\partial c_1}\sum_{i=1}^n\sum_{j=1}^n c_i c_j \Sigma_{ij},\ \ldots,\ \frac{\partial}{\partial c_n}\sum_{i=1}^n\sum_{j=1}^n c_i c_j \Sigma_{ij}\right) = \left(\sum_{j=1}^n c_j \Sigma_{1j} + \sum_{i=1}^n c_i \Sigma_{i1},\ \ldots,\ \sum_{j=1}^n c_j \Sigma_{nj} + \sum_{i=1}^n c_i \Sigma_{in}\right).$$
Since $\Sigma$ is symmetric, the two sums in each component are equal, so
$$\nabla(\alpha_1'\Sigma\alpha_1) = \left(2\sum_{j=1}^n c_j \Sigma_{1j},\ 2\sum_{j=1}^n c_j \Sigma_{2j},\ \ldots,\ 2\sum_{j=1}^n c_j \Sigma_{nj}\right) = (2\Sigma\alpha_1)^T.$$
On the other hand,
$$\nabla(\alpha_1'\alpha_1 - 1) = \left(\frac{\partial}{\partial c_1}\sum_{i=1}^n c_i^2,\ \ldots,\ \frac{\partial}{\partial c_n}\sum_{i=1}^n c_i^2\right) = (2c_1, 2c_2, \ldots, 2c_n) = (2\alpha_1)^T.$$
Therefore, solving $\nabla(\alpha_1'\Sigma\alpha_1) = \lambda\,\nabla(\alpha_1'\alpha_1 - 1)$ is equivalent to solving $\Sigma\alpha_1 = \lambda\alpha_1$.
- p.5, line -4:
$$\operatorname{Cov}(\alpha_1'x, \alpha_2'x) = \alpha_1'\Sigma\alpha_2$$
Proof: See Casella, p.170, thm.4.5.3.
- p.6, line -11:
It should be noted that sometimes the vectors αk are referred to as `principal components.' This usage, though sometimes defended (see Dawkins (1990), Kuhfeld (1990) for some discussion), is confusing. It is preferable to reserve the term `principal components' for the derived variables α′kx, and refer to αk as the vector of coefficients or loadings for the kth PC. Some authors distinguish between the terms `loadings' and `coefficients,' depending on the normalization constraint used, but they will be used interchangeably in this book.
- p.18, line 12:
It is well known that the eigenvectors of Σ−1 are the same as those of Σ, and that the eigenvalues of Σ−1 are the reciprocals of those of Σ,
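This reciprocal relationship can be checked numerically (a sketch, with a made-up matrix standing in for $\Sigma$):

```python
import numpy as np

# Made-up symmetric positive-definite matrix standing in for Sigma
Sigma = np.array([[2.0, 0.6],
                  [0.6, 1.0]])

vals, vecs = np.linalg.eigh(Sigma)
Sigma_inv = np.linalg.inv(Sigma)
inv_vals, inv_vecs = np.linalg.eigh(Sigma_inv)

# The eigenvalues of Sigma^{-1} are the reciprocals of those of Sigma
assert np.allclose(np.sort(1.0 / vals), np.sort(inv_vals))

# Each eigenvector v of Sigma satisfies Sigma^{-1} v = (1/lambda) v
for k in range(2):
    v, lam = vecs[:, k], vals[k]
    assert np.allclose(Sigma_inv @ v, v / lam)
```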
Proof: If $\Sigma v = \lambda v$ with $\lambda \neq 0$, then
$$\Sigma^{-1}v = \tfrac{1}{\lambda}\Sigma^{-1}(\lambda v) = \tfrac{1}{\lambda}\Sigma^{-1}\Sigma v = \tfrac{1}{\lambda}v.$$
- p.18, line -16:
Equation (2.2.2) also implies that the half-lengths of the principal axes are proportional to $\lambda_1^{1/2}, \lambda_2^{1/2}, \ldots, \lambda_p^{1/2}$.
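A quick numerical check of this claim (the values of $\lambda_1, \lambda_2, c$ below are made up): parametrizing the ellipse, the farthest point from the origin lies at distance $\sqrt{c\lambda_1}$ and the nearest at $\sqrt{c\lambda_2}$.

```python
import numpy as np

# Points on the ellipse z1^2/l1 + z2^2/l2 = c, parametrized by angle t
l1, l2, c = 9.0, 4.0, 2.0                  # made-up values for illustration
t = np.linspace(0, 2 * np.pi, 100001)
z1 = np.sqrt(c * l1) * np.cos(t)
z2 = np.sqrt(c * l2) * np.sin(t)

# Half-lengths of the principal axes: sqrt(c*l1) along the first axis,
# sqrt(c*l2) along the second
r = np.sqrt(z1**2 + z2**2)
assert np.isclose(r.max(), np.sqrt(c * l1))
assert np.isclose(r.min(), np.sqrt(c * l2))
```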
Proof: Consider the simple case
$$\frac{z_1^2}{\lambda_1} + \frac{z_2^2}{\lambda_2} = c.$$
When $z_2 = 0$, $z_1 = \pm\sqrt{c\lambda_1}$, so the length of the first principal axis is $2\sqrt{c\lambda_1}$ and its half-length is $\sqrt{c\lambda_1} = \sqrt{c}\,\lambda_1^{1/2}$.
- p.18, line 15:
$$A\Sigma^{-1}A = \Lambda^{-1}.$$
Proof: It should be $A'\Sigma^{-1}A = \Lambda^{-1}$.
- p.18, line -14:
This result is statistically important if the random vector x has a multivariate normal distribution. In this case, the ellipsoids given by (2.2.1) define contours of constant probability for the distribution of x. The first (largest) principal axis of such ellipsoids will then define the direction in which statistical variation is greatest, which is another way of expressing the algebraic definition of the first PC given in Section 1.1. The direction of the first PC, defining the first principal axis of constant probability ellipsoids, is illustrated in Figures 2.1 and 2.2 in Section 2.3.
Proof: See p.5, line 4: the vector $\alpha_1$ maximizes $\operatorname{Var}(\alpha_1'x)$.
- p.18, line -7:
The second principal axis maximizes statistical variation, subject to being orthogonal to the first, and so on, again corresponding to the algebraic definition. This interpretation of PCs, as defining the principal axes of ellipsoids of constant density, was mentioned by Hotelling (1933) in his original paper.
Proof: See p.5, line -1: $\alpha_2'\alpha_1 = 0$.
- p.19, line -5:
To prove Property G2, first note that $x_1, x_2$ have the same mean $\mu$ and covariance matrix $\Sigma$. Hence $y_1, y_2$ also have the same mean and covariance matrix, $B'\mu$ and $B'\Sigma B$ respectively.
Proof: See Hogg IMS, p.140, thm.2.6.2 or equation (2.6.11), and p.141, thm.2.6.3 or equation (2.6.16).
- p.34, prop.G3:
As before, suppose that the observations $x_1, x_2, \ldots, x_n$ are transformed by $y_i = B'x_i$, $i = 1, 2, \ldots, n$, where $B$ is a $(p \times q)$ matrix with orthonormal columns, so that $y_1, y_2, \ldots, y_n$ are projections of $x_1, x_2, \ldots, x_n$ onto a $q$-dimensional subspace. A measure of `goodness-of-fit' of this $q$-dimensional subspace to $x_1, x_2, \ldots, x_n$ can be defined as the sum of squared perpendicular distances of $x_1, x_2, \ldots, x_n$ from the subspace. This measure is minimized when $B = A_q$.
Proof: In the book's proof of this theorem, I do not understand the sentence "Distances are preserved under orthogonal transformations, so the squared distance $m_i'm_i$ of $y_i$ from the origin is the same in $y$ coordinates as in $x$ coordinates," and the proof ultimately invokes Property A1. Instead, I use the method of p.5 (Lagrange multipliers) to show directly that $\sum_{i=1}^n m_i'm_i$ is maximized when $B = A_q$, so Property A1 is not needed.
Using the orthonormality of $z_1, z_2, \ldots, z_q$,
$$\sum_{i=1}^n \langle m_i, m_i\rangle = \sum_{i=1}^n \Big\langle \sum_{k=1}^q \langle x_i, z_k\rangle z_k,\ \sum_{k=1}^q \langle x_i, z_k\rangle z_k \Big\rangle = \sum_{i=1}^n \sum_{k=1}^q \langle x_i, z_k\rangle^2 = \sum_{k=1}^q z_k^T\Big(\sum_{i=1}^n x_i x_i^T\Big) z_k.$$
Then, following the discussion on p.5, differentiate with respect to each component of $z_1, z_2, \ldots, z_q$: the maximum of $\sum_{i=1}^n \langle m_i, m_i\rangle$ is attained when $z_1, z_2, \ldots, z_q$ are eigenvectors of $\sum_{i=1}^n x_i x_i^T$.
Here I change the notation of the theorem and simplify its statement:
- First, orthogonally diagonalize $\sum_{i=1}^n x_i x_i^T$ to get
$$\sum_{i=1}^n x_i x_i^T = PDP^T,$$
where $P = (z_1 \mid z_2 \mid \cdots \mid z_q)$, i.e., we take only the first $q$ orthonormal eigenvectors, $q < p$, and $D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_q)$ with $\lambda_1 > \lambda_2 > \cdots > \lambda_q$.
- Consider $m_i = \operatorname{proj}_{\operatorname{span}(z_1, z_2, \ldots, z_q)} x_i$.
- Define $r_i = x_i - m_i$.
- $\sum_{i=1}^n m_i'm_i$ is maximized.
- $\sum_{i=1}^n r_i'r_i$ is minimized.
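This restatement can be checked numerically (a sketch with made-up random data; `proj_ss` is a hypothetical helper name): projecting onto the top-$q$ eigenvectors maximizes $\sum_i m_i'm_i$ and, equivalently, minimizes $\sum_i r_i'r_i$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, q = 50, 4, 2
# Made-up sample; the rows of X are the observations x_i'
X = rng.normal(size=(n, p)) @ np.diag([3.0, 2.0, 1.0, 0.5])

# Orthogonally diagonalize sum_i x_i x_i^T and keep the top q eigenvectors
vals, vecs = np.linalg.eigh(X.T @ X)       # eigenvalues in ascending order
Zq = vecs[:, ::-1][:, :q]                  # columns z_1, ..., z_q

def proj_ss(B):
    """Return sum_i m_i' m_i for the projection onto the column space of B (orthonormal)."""
    return np.sum((X @ B) ** 2)

best = proj_ss(Zq)

# Any other q-dimensional subspace (random orthonormal B) does no better
for _ in range(200):
    B, _ = np.linalg.qr(rng.normal(size=(p, q)))
    assert proj_ss(B) <= best + 1e-9

# Equivalently, sum_i r_i' r_i = sum_i x_i' x_i - sum_i m_i' m_i is minimized at B = Z_q
residual = np.sum((X - X @ Zq @ Zq.T) ** 2)
assert np.isclose(residual, np.sum(X**2) - best)
```

The maximum $\sum_i m_i'm_i$ equals the sum of the $q$ largest eigenvalues of $\sum_i x_i x_i^T$.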
Regarding Property G3, one point deserves special attention. Referring to the figure in the original post, consider the simplest case: suppose we have only two sample points, $x_1 = (1, 0)^T$ and $x_2 = (0, 1)^T$, and we want to find a line minimizing the sum of squared perpendicular distances from the sample points. This condition alone is not enough; we must impose an additional condition on the line, for example:
- If the line must pass through the origin, we get the slanted line in the figure.
- If the line must pass through $\frac{x_1 + x_2}{2}$, we get the horizontal line in the figure (not a coordinate axis).
Geometrically, when $p = 2$, $q = 1$, $z_1$ determines a line through the origin minimizing the sum of squared perpendicular distances from the sample points; similarly, when $p = 3$, $q = 2$, $z_1, z_2$ determine a plane through the origin minimizing the sum of squared perpendicular distances from the sample points.
However, the subspaces spanned by $z_1, z_2, \ldots, z_q$ in this Property always pass through the origin, whereas in most applications we want them to pass through $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$. So when applying Property G3, we replace $x_1, x_2, \ldots, x_n$ with $x_1 - \bar{x}, x_2 - \bar{x}, \ldots, x_n - \bar{x}$. (This simply moves the origin to $\bar{x}$ before applying Property G3.)
Here is an example. Suppose we have three sample points $x_1 = (3, 4)^T$, $x_2 = (8, 3)^T$, $x_3 = (9, 8)^T$, and we want a line minimizing the sum of squared perpendicular distances from the sample points.
- If the line must pass through the origin, we find the eigenvector of $\sum_{i=1}^n x_i x_i^T$ corresponding to the largest eigenvalue; the direction of this eigenvector is the direction of the line.
- If the line must pass through $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$, we find the eigenvector of $\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T$ corresponding to the largest eigenvalue; the direction of this eigenvector is the direction of the line. In this case the line is $y = 0.67x + 0.56$, where
$$\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T = \begin{pmatrix} 20.67 & 8 \\ 8 & 14 \end{pmatrix}.$$
Of course, this can also be obtained directly from the formula in Casella, p.582, subsec.12.2.11.
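The numbers in this example can be verified with a short script (numpy used here; the exact values are $\bar{x} = (20/3, 5)$, slope $2/3 \approx 0.67$, intercept $5/9 \approx 0.56$):

```python
import numpy as np

# The three sample points from the example above
X = np.array([[3.0, 4.0],
              [8.0, 3.0],
              [9.0, 8.0]])
xbar = X.mean(axis=0)                      # (20/3, 5)
D = X - xbar
SST = D.T @ D                              # sum_i (x_i - xbar)(x_i - xbar)^T
assert np.allclose(SST, [[62/3, 8.0], [8.0, 14.0]])   # rounds to 20.67, 8, 8, 14

vals, vecs = np.linalg.eigh(SST)
z1 = vecs[:, -1]                           # eigenvector of the largest eigenvalue

# The fitted line passes through xbar with direction z1
slope = z1[1] / z1[0]
intercept = xbar[1] - slope * xbar[0]
assert abs(slope - 2/3) < 1e-9             # 0.666..., rounds to 0.67
assert abs(intercept - 5/9) < 1e-9         # 0.555..., rounds to 0.56
```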
Comparison of the population and sample settings:

Population:
- $X = (X_1, X_2, \ldots, X_p)^T$, or $X - \mu = (X_1 - \mu_{X_1}, X_2 - \mu_{X_2}, \ldots, X_p - \mu_{X_p})^T$
- $\Sigma = \operatorname{Cov}(X) = \operatorname{Cov}(X - \mu)$
- $\Sigma = PDP^T$, where $P = (u_1 \mid u_2 \mid \cdots \mid u_q)$ is a $p \times q$ matrix, $q < p$; that is, we only use the first $q$ orthonormal eigenvectors, and $D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_q)$, $\lambda_1 > \lambda_2 > \cdots > \lambda_q$.
- Principal components: $Z_1 = u_1^T X, Z_2 = u_2^T X, \ldots, Z_q = u_q^T X$, or $Z_1 = u_1^T(X - \mu), Z_2 = u_2^T(X - \mu), \ldots, Z_q = u_q^T(X - \mu)$.

Sample:
- $x_1, x_2, \ldots, x_n$ are $p \times 1$ column vectors; $\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i$
- $S = (x_1 - \bar{x} \mid x_2 - \bar{x} \mid \cdots \mid x_n - \bar{x})$
- $SST = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T = PDP^T$, where $P = (z_1 \mid z_2 \mid \cdots \mid z_q)$ is a $p \times q$ matrix, $q < p$; that is, we only use the first $q$ orthonormal eigenvectors, and $D = \operatorname{diag}(\lambda_1, \lambda_2, \ldots, \lambda_q)$, $\lambda_1 > \lambda_2 > \cdots > \lambda_q$.
- Principal components:
$$\begin{pmatrix} z_1^T(x_1 - \bar{x}) \\ z_1^T(x_2 - \bar{x}) \\ \vdots \\ z_1^T(x_n - \bar{x}) \end{pmatrix},\ \begin{pmatrix} z_2^T(x_1 - \bar{x}) \\ z_2^T(x_2 - \bar{x}) \\ \vdots \\ z_2^T(x_n - \bar{x}) \end{pmatrix},\ \ldots,\ \begin{pmatrix} z_q^T(x_1 - \bar{x}) \\ z_q^T(x_2 - \bar{x}) \\ \vdots \\ z_q^T(x_n - \bar{x}) \end{pmatrix}$$

We want to form $Z_1, Z_2, \ldots, Z_q$ as linear combinations of $X_1, X_2, \ldots, X_p$ such that $\operatorname{Var}(Z_1), \operatorname{Var}(Z_2), \ldots, \operatorname{Var}(Z_q)$ are as large as possible, with $\operatorname{Var}(Z_1) > \operatorname{Var}(Z_2) > \cdots > \operatorname{Var}(Z_q)$. Let
$$M = \begin{pmatrix} z_1^T(x_1-\bar{x}) & z_2^T(x_1-\bar{x}) & \cdots & z_q^T(x_1-\bar{x}) \\ z_1^T(x_2-\bar{x}) & z_2^T(x_2-\bar{x}) & \cdots & z_q^T(x_2-\bar{x}) \\ \vdots & \vdots & \ddots & \vdots \\ z_1^T(x_n-\bar{x}) & z_2^T(x_n-\bar{x}) & \cdots & z_q^T(x_n-\bar{x}) \end{pmatrix},$$
that is, the matrix whose columns are the principal components. Then the $i$th row of $M$ gives the coefficients for expressing $\operatorname{proj}_{\operatorname{span}(z_1, z_2, \ldots, z_q)}(x_i - \bar{x})$ as a linear combination of $z_1, z_2, \ldots, z_q$. If points in the plane are projected onto a line, these coefficients are the (signed) distances of the projected points from the origin; the sum of their squares is maximal ($\sum_{i=1}^n m_i'm_i$ is maximal), which means the variation of the points is greatest in this direction.
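A small sketch of this construction, reusing the three sample points from the earlier example (an assumption for illustration): the score matrix $M$ satisfies $M^T M = \operatorname{diag}(\lambda_1, \lambda_2)$, so its columns have decreasing sums of squares and are uncorrelated.

```python
import numpy as np

# Reusing the three sample points from the earlier example
X = np.array([[3.0, 4.0],
              [8.0, 3.0],
              [9.0, 8.0]])
D = X - X.mean(axis=0)
SST = D.T @ D

vals, vecs = np.linalg.eigh(SST)
order = np.argsort(vals)[::-1]             # sort eigenvalues in decreasing order
Z = vecs[:, order]                         # columns z_1, z_2

# M has the principal components as its columns: M[i, k] = z_k^T (x_i - xbar)
M = D @ Z

# Column k of M has sum of squares lambda_k, and distinct columns are uncorrelated:
# M^T M = Z^T SST Z = diag(lambda_1, lambda_2)
assert np.allclose(M.T @ M, np.diag(vals[order]))
```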
Computationally: first compute $SST$ with Excel, then diagonalize $SST$ with WolframAlpha's diagonalize command to obtain the matrix $Q$ whose columns are eigenvectors (the eigenvectors may need to be reordered so that the corresponding eigenvalues are decreasing), then use transpose(orthogonalize(transpose(Q))) to obtain $z_1, z_2, \ldots, z_q$ (the two transposes are needed because orthogonalize works on row vectors).
The signs of the components of $z_i$ may differ between computation methods, e.g. $z_2 = (0.544, -0.839)^T$ or $z_2 = (-0.544, 0.839)^T$, but these are essentially the same: both satisfy $z_i^T z_i = 1$ and $z_i^T z_j = 0$.
Notes on Jolliffe's Principal Component Analysis