
Notes on Jolliffe's Principal Component Analysis

  • p.5, line 5:
    $\mathrm{Var}(\alpha_1'x)=\alpha_1'\Sigma\alpha_1$


    Proof: See Hogg IMS, p.141, thm.2.6.3 or equation (2.6.16).

  • p.5, line 13:
    To maximize $\alpha_1'\Sigma\alpha_1$ subject to $\alpha_1'\alpha_1=1$, the standard approach is to use the technique of Lagrange multipliers. Maximize $\alpha_1'\Sigma\alpha_1-\lambda(\alpha_1'\alpha_1-1)$,
    where $\lambda$ is a Lagrange multiplier. Differentiation with respect to $\alpha_1$ gives $\Sigma\alpha_1-\lambda\alpha_1=0$,


    Proof: Recall that we have to solve for $\lambda$ in $\nabla(\alpha_1'\Sigma\alpha_1)=\lambda\,\nabla(\alpha_1'\alpha_1-1)$.
    Suppose that $\alpha_1=(c_1,c_2,\dots,c_n)'$. Then, for each $k$,
    $$\frac{\partial}{\partial c_k}\,\alpha_1'\Sigma\alpha_1=\frac{\partial}{\partial c_k}\sum_{i=1}^n\sum_{j=1}^n c_ic_j\Sigma_{ij}=\sum_{j=1}^n c_j\Sigma_{kj}+\sum_{i=1}^n c_i\Sigma_{ik}=2\sum_{j=1}^n c_j\Sigma_{kj}\qquad(\Sigma\text{ is symmetric}),$$
    so $\nabla(\alpha_1'\Sigma\alpha_1)=2\Sigma\alpha_1$.
    On the other hand, $\frac{\partial}{\partial c_k}(\alpha_1'\alpha_1-1)=\frac{\partial}{\partial c_k}\left(\sum_{i=1}^n c_i^2-1\right)=2c_k$, so $\nabla(\alpha_1'\alpha_1-1)=2\alpha_1$.
    Therefore, solving $\nabla(\alpha_1'\Sigma\alpha_1)=\lambda\,\nabla(\alpha_1'\alpha_1-1)$ is equivalent to solving $\Sigma\alpha_1=\lambda\alpha_1$.
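    As a small numeric sanity check of this conclusion (the matrix below is made up for illustration, not from the book), one can scan unit vectors and confirm that the maximum of $\alpha'\Sigma\alpha$ over $\alpha'\alpha=1$ equals the largest eigenvalue:

```python
import math

# Hypothetical 2x2 symmetric matrix (not from the book); its largest
# eigenvalue is 3, with unit eigenvector (1, 1)/sqrt(2).
sigma = [[2.0, 1.0], [1.0, 2.0]]

def quad_form(a):
    """Compute a' * sigma * a for a 2-vector a."""
    return sum(a[i] * sigma[i][j] * a[j] for i in range(2) for j in range(2))

# Scan unit vectors a = (cos t, sin t); the maximum of a' sigma a over the
# unit circle should equal lambda_1 = 3, attained at the eigenvector direction.
best_t = max((t * 0.001 for t in range(6284)),
             key=lambda t: quad_form([math.cos(t), math.sin(t)]))
best_val = quad_form([math.cos(best_t), math.sin(best_t)])
print(round(best_val, 3))   # close to 3, the largest eigenvalue
```

    Here $\alpha'\Sigma\alpha=2+\sin 2t$, so the scan peaks at $t=\pi/4$, the direction of the leading eigenvector.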

  • p.5, line -4:
    $\mathrm{Cov}(\alpha_1'x,\alpha_2'x)=\alpha_1'\Sigma\alpha_2$


    Proof: See Casella, p.170, thm.4.5.3.

  • p.6, line -11:
    It should be noted that sometimes the vectors $\alpha_k$ are referred to as `principal components.' This usage, though sometimes defended (see Dawkins (1990), Kuhfeld (1990) for some discussion), is confusing. It is preferable to reserve the term `principal components' for the derived variables $\alpha_k'x$, and refer to $\alpha_k$ as the vector of coefficients or loadings for the $k$th PC. Some authors distinguish between the terms `loadings' and `coefficients,' depending on the normalization constraint used, but they will be used interchangeably in this book.


  • p.18, line 12:
    It is well known that the eigenvectors of $\Sigma^{-1}$ are the same as those of $\Sigma$, and that the eigenvalues of $\Sigma^{-1}$ are the reciprocals of those of $\Sigma$,

    Proof: If $\Sigma v=\lambda v$, then $\Sigma^{-1}v=\frac{1}{\lambda}\Sigma^{-1}\lambda v=\frac{1}{\lambda}\Sigma^{-1}\Sigma v=\frac{1}{\lambda}v$.
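    The claim is easy to verify numerically. The following sketch uses an illustrative $2\times 2$ matrix (values made up, not from the book) with known eigenpair $v=(1,1)'$, $\lambda=3$, and checks that $\Sigma^{-1}v=\frac{1}{\lambda}v$:

```python
# Illustrative 2x2 symmetric matrix with eigenpair v = (1, 1)', lam = 3.
sigma = [[2.0, 1.0], [1.0, 2.0]]
v, lam = [1.0, 1.0], 3.0

# Inverse of a 2x2 matrix [[a, b], [c, d]] is [[d, -b], [-c, a]] / det.
a, b = sigma[0]
c, d = sigma[1]
det = a * d - b * c
inv = [[d / det, -b / det], [-c / det, a / det]]

sv = [sum(sigma[i][j] * v[j] for j in range(2)) for i in range(2)]  # sigma v
iv = [sum(inv[i][j] * v[j] for j in range(2)) for i in range(2)]    # inv(sigma) v
print(sv, iv)   # sv = lam * v, iv = (1/lam) * v
```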

  • p.18, line -16:
    Equation (2.2.2) also implies that the half-lengths of the principal axes are proportional to $\lambda_1^{1/2},\lambda_2^{1/2},\dots,\lambda_p^{1/2}$.

    Proof: Consider the simple case, $\frac{z_1^2}{\lambda_1}+\frac{z_2^2}{\lambda_2}=c$. When $z_2=0$, we get $z_1=\pm\sqrt{c\lambda_1}$, so the length of the first principal axis is $2\sqrt{c\lambda_1}$, and its half-length is $\sqrt{c\lambda_1}\propto\lambda_1^{1/2}$.
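    A quick numeric check of this (with made-up values $\lambda_1=4$, $\lambda_2=1$, $c=2$, chosen only for illustration): the farthest point of the ellipse from the origin lies on the first principal axis, at distance $\sqrt{c\lambda_1}$.

```python
import math

# Ellipse z1^2/lam1 + z2^2/lam2 = c, parametrized as
# z1 = sqrt(c*lam1) cos t, z2 = sqrt(c*lam2) sin t.
lam1, lam2, c = 4.0, 1.0, 2.0

# Scan the ellipse and record the maximum distance from the origin.
max_dist = max(
    math.hypot(math.sqrt(c * lam1) * math.cos(t * 0.001),
               math.sqrt(c * lam2) * math.sin(t * 0.001))
    for t in range(6284)
)
print(round(max_dist, 3), round(math.sqrt(c * lam1), 3))   # the two agree
```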

  • p.18, line 15:
    $A\Sigma^{-1}A'=\Lambda^{-1}$.

    Proof: It should be $A'\Sigma^{-1}A=\Lambda^{-1}$: since $\Sigma=A\Lambda A'$ with $A$ orthogonal, $\Sigma^{-1}=A\Lambda^{-1}A'$, and hence $A'\Sigma^{-1}A=A'A\Lambda^{-1}A'A=\Lambda^{-1}$.

  • p.18, line -14:
    This result is statistically important if the random vector x has a multivariate normal distribution. In this case, the ellipsoids given by (2.2.1) define contours of constant probability for the distribution of x. The first (largest) principal axis of such ellipsoids will then define the direction in which statistical variation is greatest, which is another way of expressing the algebraic definition of the first PC given in Section 1.1. The direction of the first PC, defining the first principal axis of constant probability ellipsoids, is illustrated in Figures 2.1 and 2.2 in Section 2.3.

    Proof: See p.5, line 4: the vector $\alpha_1$ maximizes $\mathrm{Var}(\alpha_1'x)$.

  • p.18, line -7:
    The second principal axis maximizes statistical variation, subject to being orthogonal to the first, and so on, again corresponding to the algebraic definition. This interpretation of PCs, as defining the principal axes of ellipsoids of constant density, was mentioned by Hotelling (1933) in his original paper.

    Proof: See p.5, line -1: $\alpha_2'\alpha_1=0$.

  • p.19, line -5:
    To prove Property G2, first note that $x_1,x_2$ have the same mean $\mu$ and covariance matrix $\Sigma$. Hence $y_1,y_2$ also have the same mean and covariance matrix, $B'\mu$ and $B'\Sigma B$ respectively.

    Proof: See Hogg IMS, p.140, thm.2.6.2 or equation (2.6.11), and p.141, thm.2.6.3 or equation (2.6.16).

  • p.34, prop.G3:
    As before, suppose that the observations $x_1,x_2,\dots,x_n$ are transformed by $y_i=B'x_i$, $i=1,2,\dots,n$, where $B$ is a $(p\times q)$ matrix with orthonormal columns, so that $y_1,y_2,\dots,y_n$ are projections of $x_1,x_2,\dots,x_n$ onto a $q$-dimensional subspace. A measure of `goodness-of-fit' of this $q$-dimensional subspace to $x_1,x_2,\dots,x_n$ can be defined as the sum of squared perpendicular distances of $x_1,x_2,\dots,x_n$ from the subspace. This measure is minimized when $B=A_q$.

    Proof: In the book's proof of this theorem, the sentence "Distances are preserved under orthogonal transformations, so the squared distance $m_i'm_i$ of $y_i$ from the origin is the same in $y$ coordinates as in $x$ coordinates" is unclear to me, and the argument ends by invoking Property A1. Instead, I use the method of p.5 (Lagrange multipliers) directly to show that $\sum_{i=1}^n m_i'm_i$ is maximized when $B=A_q$, so Property A1 is not needed.

    $$\begin{aligned}\sum_{i=1}^n\langle m_i,m_i\rangle&=\sum_{i=1}^n\big\langle\langle x_i,z_1\rangle z_1+\langle x_i,z_2\rangle z_2+\cdots+\langle x_i,z_q\rangle z_q,\ \langle x_i,z_1\rangle z_1+\langle x_i,z_2\rangle z_2+\cdots+\langle x_i,z_q\rangle z_q\big\rangle\\&=\sum_{i=1}^n\big(\langle x_i,z_1\rangle^2+\langle x_i,z_2\rangle^2+\cdots+\langle x_i,z_q\rangle^2\big)\\&=\sum_{i=1}^n\big(z_1^Tx_ix_i^Tz_1+z_2^Tx_ix_i^Tz_2+\cdots+z_q^Tx_ix_i^Tz_q\big)\\&=z_1^T\Big(\sum_{i=1}^n x_ix_i^T\Big)z_1+z_2^T\Big(\sum_{i=1}^n x_ix_i^T\Big)z_2+\cdots+z_q^T\Big(\sum_{i=1}^n x_ix_i^T\Big)z_q.\end{aligned}$$
    Then, as in the discussion on p.5, differentiating with respect to each component of $z_1,z_2,\dots,z_q$ shows that $\sum_{i=1}^n\langle m_i,m_i\rangle$ attains its maximum when $z_1,z_2,\dots,z_q$ are eigenvectors of $\sum_{i=1}^n x_ix_i^T$.

    Here I change the notation of the theorem and simplify its statement:
    • First orthogonally diagonalize $\sum_{i=1}^n x_ix_i^T$, obtaining $\sum_{i=1}^n x_ix_i^T=PDP^T$,
      where $P=(z_1|z_2|\cdots|z_q)$, that is, we take only the first $q$ orthonormal eigenvectors, $q<p$, and $D=\mathrm{diag}(\lambda_1,\lambda_2,\dots,\lambda_q)$, $\lambda_1>\lambda_2>\cdots>\lambda_q$.
    • Let $m_i=\mathrm{proj}_{\mathrm{span}(z_1,z_2,\dots,z_q)}x_i$.
    • Define $r_i=x_i-m_i$.
    • Then $\sum_{i=1}^n m_i'm_i$ is maximal.
    • Equivalently, $\sum_{i=1}^n r_i'r_i$ is minimal.
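    The equivalence of the last two statements follows from $x_i'x_i=m_i'm_i+r_i'r_i$ (Pythagoras), since $\sum_i x_i'x_i$ is fixed. A minimal sketch of this identity for $q=1$, using arbitrary sample points chosen only for illustration:

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

# For any unit vector z (q = 1), write m_i = <x_i, z> z and r_i = x_i - m_i.
# Then x_i'x_i = m_i'm_i + r_i'r_i, so maximizing sum m_i'm_i over unit
# vectors z is the same as minimizing sum r_i'r_i.
xs = [[3.0, 4.0], [8.0, 3.0], [9.0, 8.0]]        # illustrative sample points
z = [1 / math.sqrt(2), 1 / math.sqrt(2)]         # any unit vector works here

m_sq = r_sq = total = 0.0
for x in xs:
    c = dot(x, z)                                # coefficient of the projection
    m = [c * z[0], c * z[1]]                     # m_i
    r = [x[0] - m[0], x[1] - m[1]]               # r_i = x_i - m_i
    m_sq, r_sq, total = m_sq + dot(m, m), r_sq + dot(r, r), total + dot(x, x)

print(round(m_sq + r_sq, 6), round(total, 6))    # the two sums agree
```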


    One point about Property G3 deserves special mention. Referring to the figure below, consider the simplest case: suppose we have only two sample points, $x_1=(1,0)^T$, $x_2=(0,1)^T$, and we want to determine a line that minimizes the sum of squared perpendicular distances from the sample points. This condition alone is not enough; we must impose further conditions on the line, for example:
    • If the line must pass through the origin, we get the slanted line in the figure.
    • If the line must pass through $\frac{x_1+x_2}{2}$, we get the horizontal line in the figure (not the coordinate axis).




    Geometrically, when $p=2,q=1$, $z_1$ determines a line through the origin for which the sum of squared perpendicular distances from the sample points is minimal; similarly, when $p=3,q=2$, $z_1,z_2$ determine a plane through the origin for which the sum of squared perpendicular distances from the sample points is minimal.

    However, the subspace $\mathrm{span}(z_1,z_2,\dots,z_q)$ in this Property necessarily passes through the origin, which is not what we want in most applications; we usually want it to pass through $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$. So when using Property G3, we replace $x_1,x_2,\dots,x_n$ with $x_1-\bar{x},x_2-\bar{x},\dots,x_n-\bar{x}$. (This simply moves the origin to $\bar{x}$ and then applies Property G3.)

    Here is an example. Suppose we have three sample points $x_1=(3,4)^T$, $x_2=(8,3)^T$, $x_3=(9,8)^T$, and we want to determine a line that minimizes the sum of squared perpendicular distances from the sample points.
    • If the line must pass through the origin, we find the eigenvector of $\sum_{i=1}^n x_ix_i^T$ corresponding to the largest eigenvalue; the direction of this eigenvector is the direction of the line.
    • If the line must pass through $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$, we find the eigenvector of $\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T$ corresponding to the largest eigenvalue; the direction of this eigenvector is the direction of the line. In this case the line is $y=0.67x+0.56$, where $\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T=\begin{pmatrix}20.67&8\\8&14\end{pmatrix}$. Of course, it can also be obtained directly from the formula in Casella, p.582, subsec.12.2.11.
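    The example can be checked numerically. The sketch below uses power iteration (a standard method, chosen here for convenience, not the method in the book) to find the top eigenvector of the centered scatter matrix and recover the slope 0.67 and intercept 0.56 quoted above:

```python
# The three sample points of the example.
xs = [[3.0, 4.0], [8.0, 3.0], [9.0, 8.0]]
n = len(xs)
xbar = [sum(x[k] for x in xs) / n for k in range(2)]
d = [[x[k] - xbar[k] for k in range(2)] for x in xs]     # centered points

# Scatter matrix sum (x_i - xbar)(x_i - xbar)'; should be [[20.67, 8], [8, 14]].
ss = [[sum(di[i] * di[j] for di in d) for j in range(2)] for i in range(2)]

# Power iteration for the eigenvector of the largest eigenvalue.
v = [1.0, 0.0]
for _ in range(100):
    w = [ss[0][0] * v[0] + ss[0][1] * v[1],
         ss[1][0] * v[0] + ss[1][1] * v[1]]
    norm = (w[0] ** 2 + w[1] ** 2) ** 0.5
    v = [w[0] / norm, w[1] / norm]

# The fitted line passes through xbar in the direction of v.
slope = v[1] / v[0]
intercept = xbar[1] - slope * xbar[0]
print(round(slope, 2), round(intercept, 2))   # 0.67 0.56
```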
    Population:
    • $X=(X_1,X_2,\dots,X_p)^T$, or $X-\mu=(X_1-\mu_{X_1},X_2-\mu_{X_2},\dots,X_p-\mu_{X_p})^T$
    • $\Sigma=\mathrm{Cov}(X)=\mathrm{Cov}(X-\mu)$
    • $\Sigma=PDP^T$, where $P=(u_1|u_2|\cdots|u_q)$, a $p\times q$ matrix, $q<p$, that is, we only use the first $q$ orthonormal eigenvectors, and $D=\mathrm{diag}(\lambda_1,\lambda_2,\dots,\lambda_q)$, $\lambda_1>\lambda_2>\cdots>\lambda_q$
    • principal components: $Z_1=u_1^TX,Z_2=u_2^TX,\dots,Z_q=u_q^TX$, or $Z_1=u_1^T(X-\mu),Z_2=u_2^T(X-\mu),\dots,Z_q=u_q^T(X-\mu)$

    Sample:
    • $x_1,x_2,\dots,x_n$ are $p\times 1$ column vectors
    • $\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i$
    • $S=(x_1-\bar{x}|x_2-\bar{x}|\cdots|x_n-\bar{x})$
    • $SS^T=\sum_{i=1}^n(x_i-\bar{x})(x_i-\bar{x})^T=PDP^T$, where $P=(z_1|z_2|\cdots|z_q)$, a $p\times q$ matrix, $q<p$, that is, we only use the first $q$ orthonormal eigenvectors, and $D=\mathrm{diag}(\lambda_1,\lambda_2,\dots,\lambda_q)$, $\lambda_1>\lambda_2>\cdots>\lambda_q$
    • principal components: $\begin{pmatrix}z_1^T(x_1-\bar{x})\\z_1^T(x_2-\bar{x})\\\vdots\\z_1^T(x_n-\bar{x})\end{pmatrix},\begin{pmatrix}z_2^T(x_1-\bar{x})\\z_2^T(x_2-\bar{x})\\\vdots\\z_2^T(x_n-\bar{x})\end{pmatrix},\dots,\begin{pmatrix}z_q^T(x_1-\bar{x})\\z_q^T(x_2-\bar{x})\\\vdots\\z_q^T(x_n-\bar{x})\end{pmatrix}$
    We want linear combinations $Z_1,Z_2,\dots,Z_q$ of $X_1,X_2,\dots,X_p$
    such that $\mathrm{Var}(Z_1),\mathrm{Var}(Z_2),\dots,\mathrm{Var}(Z_q)$ are as large as possible,
    with $\mathrm{Var}(Z_1)>\mathrm{Var}(Z_2)>\cdots>\mathrm{Var}(Z_q)$.
    Let $M=\begin{pmatrix}z_1^T(x_1-\bar{x})&z_2^T(x_1-\bar{x})&\cdots&z_q^T(x_1-\bar{x})\\z_1^T(x_2-\bar{x})&z_2^T(x_2-\bar{x})&\cdots&z_q^T(x_2-\bar{x})\\\vdots&\vdots&&\vdots\\z_1^T(x_n-\bar{x})&z_2^T(x_n-\bar{x})&\cdots&z_q^T(x_n-\bar{x})\end{pmatrix}$,
    that is, the matrix whose columns are the principal components. The $i$th row of $M$ gives the coefficients expressing $\mathrm{proj}_{\mathrm{span}(z_1,z_2,\dots,z_q)}(x_i-\bar{x})$ as a linear combination of $z_1,z_2,\dots,z_q$. For points in the plane projected onto a line, these coefficients are the (signed) distances from the projected points to the origin; the sum of their squares is maximal ($\sum_{i=1}^n m_i'm_i$ is maximal), that is, the points vary most in this direction.

    Computationally, first compute $SS^T$ with Excel, then diagonalize $SS^T$ with WolframAlpha's diagonalize command to obtain the matrix $Q$ whose columns are eigenvectors (the eigenvectors may need to be reordered so that the corresponding eigenvalues are in decreasing order), and then use transpose(orthogonalize(transpose(Q))) to obtain $z_1,z_2,\dots,z_q$ (the two transposes are needed because orthogonalize works on row vectors).
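    What orthogonalize does here can be sketched with classical Gram-Schmidt; the sketch below is a minimal pure-Python version (the input rows are made-up values for illustration), mirroring the row-vector convention that makes the two transposes necessary:

```python
def gram_schmidt(rows):
    """Orthonormalize a list of row vectors (classical Gram-Schmidt)."""
    out = []
    for v in rows:
        # Subtract the projections onto the vectors already produced.
        for u in out:
            c = sum(a * b for a, b in zip(v, u))
            v = [a - c * b for a, b in zip(v, u)]
        norm = sum(a * a for a in v) ** 0.5
        out.append([a / norm for a in v])
    return out

# Illustrative input: two linearly independent row vectors.
z = gram_schmidt([[1.0, 1.0], [1.0, 0.0]])
# The rows of z are orthonormal: z_i'z_i = 1 and z_i'z_j = 0.
print(z)
```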

    The signs of the components of $z_i$ may differ depending on the computation method, e.g. $z_2=(0.544,-0.839)^T$ or $z_2=(-0.544,0.839)^T$, but these are essentially the same: both satisfy $z_i^Tz_i=1$ and $z_i^Tz_j=0$.
