Reading Notes: Hastie and Tibshirani's An Introduction to Statistical Learning

  • p.34, (2.7):
    $E\big(y_0 - \hat{f}(x_0)\big)^2 = \mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2 + \mathrm{Var}(\epsilon).$


    Proof: I don't know why; the authors don't prove it in The Elements of Statistical Learning either (it appears there as (3.22)).
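    A standard derivation (a sketch, assuming $y_0 = f(x_0) + \epsilon$ with $E[\epsilon] = 0$ and $\epsilon$ independent of $\hat{f}(x_0)$):
    $E\big(y_0 - \hat{f}(x_0)\big)^2 = E\big(f(x_0) - \hat{f}(x_0)\big)^2 + 2E\big[\epsilon\big(f(x_0) - \hat{f}(x_0)\big)\big] + E[\epsilon^2],$
    where the cross term is zero and $E[\epsilon^2] = \mathrm{Var}(\epsilon)$. Writing $f(x_0) - \hat{f}(x_0) = \big(E\hat{f}(x_0) - \hat{f}(x_0)\big) + \big(f(x_0) - E\hat{f}(x_0)\big)$ and squaring, the cross term again has expectation zero, leaving $\mathrm{Var}(\hat{f}(x_0)) + [\mathrm{Bias}(\hat{f}(x_0))]^2$.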

  • p.79, line -14:
    Recall that in simple regression, $R^2$ is the square of the correlation of the response and the variable. In multiple linear regression, it turns out that it equals $\mathrm{Cor}(Y, \hat{Y})^2$,

    Proof: I don't know why.
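    Not a proof, but a quick numerical check of the claim with numpy (the simulated data and coefficients below are made up for illustration):

      import numpy as np

      rng = np.random.default_rng(0)
      n = 200
      X = rng.normal(size=(n, 3))
      y = 1.0 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

      # Multiple linear regression by least squares (intercept column included).
      X1 = np.column_stack([np.ones(n), X])
      beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
      y_hat = X1 @ beta

      r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)   # R^2 = 1 - RSS/TSS
      cor2 = np.corrcoef(y, y_hat)[0, 1] ** 2                           # Cor(Y, Y_hat)^2
      print(r2, cor2)   # the two numbers agree up to floating-point error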

  • p.98, (3.37):
    $h_i = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{i'=1}^{n}(x_{i'} - \bar{x})^2}.$


    Proof: See Anderson, p.707, (14.33) and Casella, p.557, subsec. 11.3.5.
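    A quick numerical check that the diagonal of the hat matrix matches (3.37) (numpy only; the simulated data are made up):

      import numpy as np

      rng = np.random.default_rng(1)
      n = 50
      x = rng.normal(size=n)
      X = np.column_stack([np.ones(n), x])     # simple-regression design matrix with intercept

      # Leverages as the diagonal of the hat matrix H = X (X^T X)^{-1} X^T.
      H = X @ np.linalg.inv(X.T @ X) @ X.T
      h_hat = np.diag(H)

      # Leverages from formula (3.37).
      h_formula = 1 / n + (x - x.mean()) ** 2 / np.sum((x - x.mean()) ** 2)

      print(np.allclose(h_hat, h_formula))     # True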

  • p.133, sec.4.3.2:
    Estimating the regression coefficients $\beta_0$ and $\beta_1$ in $p(X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$

    Proof: See Casella, p.593, subsec.12.3.2.
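    A minimal sketch of the estimation itself, maximizing the logistic log-likelihood by Newton-Raphson (numpy only; the data and the "true" coefficients below are made up):

      import numpy as np

      rng = np.random.default_rng(2)
      n = 500
      x = rng.normal(size=n)
      p_true = 1 / (1 + np.exp(-(-1.0 + 2.0 * x)))   # made-up truth: beta0 = -1, beta1 = 2
      y = rng.binomial(1, p_true)

      X = np.column_stack([np.ones(n), x])
      beta = np.zeros(2)
      for _ in range(25):                            # Newton-Raphson iterations
          p = 1 / (1 + np.exp(-X @ beta))
          grad = X.T @ (y - p)                       # score vector
          info = X.T @ (X * (p * (1 - p))[:, None])  # negative Hessian (information matrix)
          beta = beta + np.linalg.solve(info, grad)

      print(beta)                                    # maximum-likelihood estimates, close to (-1, 2)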

  • p.138, sec.4.4:
    Linear Discriminant Analysis

    Proof: The Elements of Statistical Learning, p.116, subsec.4.3.3 gives an intuitive explanation.

  • p.140, (4.13):
    $\delta_k(x) = x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2} + \log(\pi_k)$


    Proof: In (4.12), $p_k(x) = \dfrac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_k)^2\right)}{\sum_{l=1}^{K}\pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp\!\left(-\frac{1}{2\sigma^2}(x - \mu_l)^2\right)}$,
    $x$ is fixed and $k$ varies. The denominator does not change with $k$, so it can be ignored, and likewise the factor $\frac{1}{\sqrt{2\pi}\sigma}$ in the numerator. After expanding $(x - \mu_k)^2$ in the numerator, the term $-\frac{x^2}{2\sigma^2}$ can also be dropped.
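    Spelled out (a sketch): taking the log of the $k$-dependent part of the numerator,
    $\log\pi_k - \frac{1}{2\sigma^2}(x - \mu_k)^2 = \log\pi_k - \frac{x^2}{2\sigma^2} + x \cdot \frac{\mu_k}{\sigma^2} - \frac{\mu_k^2}{2\sigma^2},$
    and dropping $-\frac{x^2}{2\sigma^2}$ (it does not depend on $k$) leaves exactly $\delta_k(x)$ in (4.13).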

  • p.143, (4.19):
    $\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2}\mu_k^T \Sigma^{-1} \mu_k + \log\pi_k$


    Proof: Similar to the proof of (4.13); the main difference here is that because $x^T\Sigma^{-1}\mu_k$ is a scalar, $\mu_k^T\Sigma^{-1}x = x^T\Sigma^{-1}\mu_k$.
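    Spelled out (a sketch, keeping only the terms that depend on $k$):
    $\log\pi_k - \frac{1}{2}(x - \mu_k)^T \Sigma^{-1} (x - \mu_k) = \log\pi_k - \frac{1}{2}x^T\Sigma^{-1}x + x^T\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^T\Sigma^{-1}\mu_k,$
    using $\mu_k^T\Sigma^{-1}x = x^T\Sigma^{-1}\mu_k$ (both are the same scalar, and $\Sigma^{-1}$ is symmetric); discarding $-\frac{1}{2}x^T\Sigma^{-1}x$ gives (4.19).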

  • p.148, fig.4.8:
    The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value.

    Proof: Think of the everyday term "false positive"; the passage is easier to understand with it in mind.

  • p.151, (4.24):
    $\log\!\left(\frac{p_1(x)}{1 - p_1(x)}\right) = \log\!\left(\frac{p_1(x)}{p_2(x)}\right) = c_0 + c_1 x$


    Proof: $c_0 = \ln\frac{\pi_1}{\pi_2} - \frac{\mu_1^2 - \mu_2^2}{2\sigma^2}$, $c_1 = \frac{\mu_1 - \mu_2}{\sigma^2}$.
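    How these constants arise (a sketch): with two classes, $p_2(x) = 1 - p_1(x)$, so by (4.12)
    $\log\frac{p_1(x)}{p_2(x)} = \log\frac{\pi_1 \exp\!\big(-\frac{(x - \mu_1)^2}{2\sigma^2}\big)}{\pi_2 \exp\!\big(-\frac{(x - \mu_2)^2}{2\sigma^2}\big)} = \ln\frac{\pi_1}{\pi_2} - \frac{(x - \mu_1)^2 - (x - \mu_2)^2}{2\sigma^2} = \ln\frac{\pi_1}{\pi_2} - \frac{\mu_1^2 - \mu_2^2}{2\sigma^2} + \frac{\mu_1 - \mu_2}{\sigma^2}\,x.$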

  • p.187, (5.6):
    $\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$


    Proof: By [Casella, p.171, thm.4.5.6], $\mathrm{Var}(\alpha X + (1-\alpha)Y) = \alpha^2\mathrm{Var}(X) + (1-\alpha)^2\mathrm{Var}(Y) + 2\alpha(1-\alpha)\mathrm{Cov}(X,Y)$.
    Expressed as a function of $\alpha$, this is $f(\alpha) = [\mathrm{Var}(X) + \mathrm{Var}(Y) - 2\mathrm{Cov}(X,Y)]\alpha^2 + [-2\mathrm{Var}(Y) + 2\mathrm{Cov}(X,Y)]\alpha + \mathrm{Var}(Y)$.
    Setting $f'(\alpha) = 2[\mathrm{Var}(X) + \mathrm{Var}(Y) - 2\mathrm{Cov}(X,Y)]\alpha - 2\mathrm{Var}(Y) + 2\mathrm{Cov}(X,Y) = 0$ gives $\alpha = \frac{\sigma_Y^2 - \sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY}}$.

  • p.213, line 20
    As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods discussed in Chapter 5.

    Proof: The book does not explain clearly how to use cross-validation here. The procedure is described below, based mainly on p.275, line 4.

    K-fold cross-validation

    In a model $M$ there is a value $n$ to be chosen, and $n_1, n_2, \ldots$ are the candidate values.
    • In sec.6.1 the value to choose is the number of predictors, $p = 1$ or $2$ or $3$ or ...
    • In sec.7.4 it is the number of knots, $k = 1$ or $2$ or $3$ or ... (the book uses a capital $K$, but lower-case $k$ is used here to avoid clashing with the number of folds).
    We decide the value of $n$ by the following procedure (a Python sketch follows this list).
    • Assume $n = n_1$ and split the data into $K$ groups $g_1, g_2, g_3, \ldots, g_K$.
      • Set $g_1$ aside, train the model $M$ on $g_2, g_3, \ldots, g_K$ (remember, the model here assumes $n = n_1$), then use $g_1$ to get the test error $e_1$.
      • Set $g_2$ aside, train the model $M$ on $g_1, g_3, \ldots, g_K$ (again with $n = n_1$), then use $g_2$ to get the test error $e_2$.
      • Repeat for the remaining folds.
      This gives $e_1, e_2, \ldots, e_K$; compute $\frac{e_1 + e_2 + \cdots + e_K}{K} = c_1$.
    • Assume $n = n_2$ and split the data into $K$ groups $g_1, g_2, g_3, \ldots, g_K$.
      • Set $g_1$ aside, train the model $M$ on $g_2, g_3, \ldots, g_K$ (now with $n = n_2$), then use $g_1$ to get the test error $e_1$.
      • Set $g_2$ aside, train the model $M$ on $g_1, g_3, \ldots, g_K$ (again with $n = n_2$), then use $g_2$ to get the test error $e_2$.
      • Repeat for the remaining folds.
      This gives $e_1, e_2, \ldots, e_K$; compute $\frac{e_1 + e_2 + \cdots + e_K}{K} = c_2$.
    • Repeat for every candidate value.
    This gives $c_1, c_2, \ldots$; choose the $n_i$ whose $c_i$ is smallest.
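    A minimal Python sketch of this procedure (numpy only; the simulated data and the candidate polynomial degrees below are made up for illustration — here the value being chosen is the degree of a polynomial fit):

      import numpy as np

      rng = np.random.default_rng(3)
      n_obs, K = 100, 5
      x = rng.uniform(-2, 2, size=n_obs)
      y = np.sin(x) + rng.normal(scale=0.3, size=n_obs)        # made-up data

      folds = np.array_split(rng.permutation(n_obs), K)        # the K groups g_1, ..., g_K
      candidates = [1, 2, 3, 4, 5, 6]                          # candidate values n_1, n_2, ...

      cv_errors = []                                           # c_1, c_2, ...
      for degree in candidates:
          fold_errors = []                                     # e_1, ..., e_K for this candidate
          for k in range(K):
              test_idx = folds[k]
              train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
              coef = np.polyfit(x[train_idx], y[train_idx], degree)   # train on the other K-1 folds
              pred = np.polyval(coef, x[test_idx])                    # test on the held-out fold
              fold_errors.append(np.mean((y[test_idx] - pred) ** 2))
          cv_errors.append(np.mean(fold_errors))

      best = candidates[int(np.argmin(cv_errors))]             # the n_i with the smallest c_i
      print(cv_errors, best)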

  • p.214, line -1
    It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

    Proof: See Bishop's Pattern Recognition and Machine Learning, p.8, table 1.1: when there are $M = 9$ predictors, the estimated coefficients become very large, so the error function of p.5, (1.2), $E(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2$,
    is changed to the one of p.10, (1.4), $\tilde{E}(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w}) - t_n\}^2 + \frac{\lambda}{2}\|\mathbf{w}\|^2$,
    which keeps the size of the estimated coefficients under control.
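    A minimal numpy sketch of the effect (the made-up data imitate Bishop's setting of fitting a degree-9 polynomial to noisy samples of $\sin(2\pi x)$):

      import numpy as np

      rng = np.random.default_rng(4)
      n = 30
      x = rng.uniform(0, 1, size=n)
      t = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=n)   # made-up data

      X = np.vander(x, 10, increasing=True)    # degree-9 polynomial features (Bishop's M = 9)

      for lam in [1e-8, 1e-4, 1e-1, 10.0]:
          # Minimizer of ||t - Xw||^2 + lam * ||w||^2, the same minimizer as Bishop's (1.4).
          w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)
          print(lam, np.linalg.norm(w))        # ||w|| shrinks as lam grows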

  • p.220, (6.8)
    One can show that the lasso and ridge regression coefficient estimates solve the problems $\underset{\beta}{\text{minimize}}\left\{\sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2\right\}$ subject to $\sum_{j=1}^{p}|\beta_j| \le s$


    Proof: I don't know why.
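    One direction of the correspondence (a sketch, not the full argument): fix $\lambda \ge 0$ and let $\hat{\beta}^{\lambda}$ minimize the penalized lasso criterion $\sum_{i=1}^{n}\big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\big)^2 + \lambda\sum_{j=1}^{p}|\beta_j|$. Then for any $\beta$ with $\sum_j |\beta_j| \le \sum_j |\hat{\beta}_j^{\lambda}|$,
    $\mathrm{RSS}(\beta) \ge \mathrm{RSS}(\hat{\beta}^{\lambda}) + \lambda\Big(\sum_j |\hat{\beta}_j^{\lambda}| - \sum_j |\beta_j|\Big) \ge \mathrm{RSS}(\hat{\beta}^{\lambda}),$
    so $\hat{\beta}^{\lambda}$ also solves the constrained problem with $s = \sum_j |\hat{\beta}_j^{\lambda}|$. The same argument works for ridge with the squared penalty; this is the usual Lagrangian/KKT correspondence between penalized and constrained formulations.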

  • p.231, line 24.
    $\mathrm{Var}\big(\phi_{11}\times(\mathrm{pop} - \overline{\mathrm{pop}}) + \phi_{21}\times(\mathrm{ad} - \overline{\mathrm{ad}})\big)$

    Proof: See Jolliffe's Principal Component Analysis, sec.1.1, p.5. Note that Hastie considers the covariance matrix of $X - \mu = \begin{pmatrix}\mathrm{pop} - \mu_{\mathrm{pop}} \\ \mathrm{ad} - \mu_{\mathrm{ad}}\end{pmatrix}$
    rather than the covariance matrix of $X$. This makes no difference, because the two are the same; see Hogg, IMS, p.141, (2.6.13) and p.143, thm.2.6.3, (2.6.15): $\mathrm{Cov}(X - \mu) = E\big((X - \mu)(X - \mu)^T\big) = E\big((X - \mu)(X^T - \mu^T)\big) = E\big(XX^T - \mu X^T - X\mu^T + \mu\mu^T\big) = E(XX^T) - \mu E(X^T) - E(X)\mu^T + \mu\mu^T = E(XX^T) - \mu\mu^T = \mathrm{Cov}(X).$

  • p.232, fig.6.15, left panel.


    Proof: The way the straight line in the figure is obtained can be found in Casella, p.581, subsec.12.2.2, or in Jolliffe's Principal Component Analysis, p.34, prop.G3.
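    A minimal numpy sketch of computing that line, i.e. the first principal component direction drawn through the sample means (the pop/ad data below are made up):

      import numpy as np

      rng = np.random.default_rng(5)
      n = 100
      pop = rng.normal(30, 10, size=n)
      ad = 0.5 * pop + rng.normal(0, 3, size=n)   # made-up correlated pop / ad data

      Z = np.column_stack([pop, ad])
      S = np.cov(Z, rowvar=False)                 # sample covariance matrix

      # First principal component loading vector = eigenvector of S with the largest eigenvalue.
      eigvals, eigvecs = np.linalg.eigh(S)        # eigh returns eigenvalues in ascending order
      phi = eigvecs[:, -1]                        # (phi_11, phi_21)

      # The line passes through (mean(pop), mean(ad)) with slope phi_21 / phi_11.
      slope = phi[1] / phi[0]
      intercept = ad.mean() - slope * pop.mean()
      print(phi, slope, intercept)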

  • p.267, line -5
    What is the variance of the fit, i.e. $\mathrm{Var}(\hat{f}(x_0))$?

    Proof: See Montgomery's Introduction to Linear Regression Analysis, ch.3.
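    For least squares with design matrix $X$ and $\mathrm{Var}(\epsilon) = \sigma^2$, the standard answer (a sketch of the result Montgomery derives) is
    $\mathrm{Var}(\hat{f}(x_0)) = \mathrm{Var}(x_0^T\hat{\beta}) = x_0^T\,\mathrm{Var}(\hat{\beta})\,x_0 = \sigma^2\, x_0^T (X^T X)^{-1} x_0,$
    where $x_0$ includes the leading $1$ for the intercept.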

  • p.278, subsec.7.5.2
    Choosing the Smoothing Parameter λ

    Proof: See Wang's Smoothing Splines Methods and Applications, ch.3.

