Reading notes: Hastie and Tibshirani's An Introduction to Statistical Learning

  • p.34, (2.7):
    \[ \text{E}(y_0-\hat{f}(x_0))^2 =\text{Var}(\hat{f}(x_0))+[\text{Bias}(\hat{f}(x_0))]^2+\text{Var}(\epsilon). \]

    Proof: I don't know why. The authors don't prove it in The Elements of Statistical Learning either (cf. (3.22) there).
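
    A sketch of the standard argument (assuming \(y_0=f(x_0)+\epsilon\) with \(\text{E}(\epsilon)=0\), and \(\epsilon\) independent of the training data used to construct \(\hat{f}\)): \[ \begin{array}{lll} \text{E}(y_0-\hat{f}(x_0))^2 &=& \text{E}(f(x_0)+\epsilon-\hat{f}(x_0))^2 \\ &=& \text{E}(f(x_0)-\hat{f}(x_0))^2+2\,\text{E}[\epsilon(f(x_0)-\hat{f}(x_0))]+\text{E}(\epsilon^2) \\ &=& \text{E}(f(x_0)-\hat{f}(x_0))^2+\text{Var}(\epsilon), \end{array} \] because independence and \(\text{E}(\epsilon)=0\) kill the cross term. Writing \(f(x_0)-\hat{f}(x_0)=\left(f(x_0)-\text{E}\hat{f}(x_0)\right)+\left(\text{E}\hat{f}(x_0)-\hat{f}(x_0)\right)\) and expanding the square, the cross term again has expectation zero, leaving \([\text{Bias}(\hat{f}(x_0))]^2+\text{Var}(\hat{f}(x_0))\).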

  • p.79, line -14:
    Recall that in simple regression, \(R^2\) is the square of the correlation of the response and the variable. In multiple linear regression, it turns out that it equals \(\text{Cor}(Y, \hat{Y})^2\),

    Proof: I don't know why.
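
    A possible argument (assuming the model contains an intercept, so the residuals \(e_i=y_i-\hat{y}_i\) satisfy \(\sum_i e_i=0\) and \(\sum_i e_i\hat{y}_i=0\), which also gives \(\frac{1}{n}\sum_i\hat{y}_i=\bar{y}\)): since \(y_i-\bar{y}=(\hat{y}_i-\bar{y})+e_i\), \[ \sum_i(y_i-\bar{y})(\hat{y}_i-\bar{y})=\sum_i(\hat{y}_i-\bar{y})^2 \quad\text{and}\quad \sum_i(y_i-\bar{y})^2=\sum_i(\hat{y}_i-\bar{y})^2+\sum_i e_i^2, \] so \[ \text{Cor}(Y,\hat{Y})^2=\frac{\left[\sum_i(y_i-\bar{y})(\hat{y}_i-\bar{y})\right]^2}{\sum_i(y_i-\bar{y})^2\sum_i(\hat{y}_i-\bar{y})^2}=\frac{\sum_i(\hat{y}_i-\bar{y})^2}{\sum_i(y_i-\bar{y})^2}=\frac{\text{TSS}-\text{RSS}}{\text{TSS}}=R^2. \]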

  • p.98, (3.37):
    \[ h_i=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i'=1}^{n}(x_{i'}-\bar{x})^2}. \]

    Proof: See Anderson, p.707, (14.33) and Casella, p.557, subsec. 11.3.5.
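
    For context (a standard fact, not the cited proofs themselves): \(h_i\) is the \(i\)-th diagonal entry of the hat matrix \(H=X(X^TX)^{-1}X^T\). In simple linear regression the \(i\)-th row of \(X\) is \((1, x_i)\), and inverting the \(2\times 2\) matrix \(X^TX\) and computing \((1, x_i)(X^TX)^{-1}(1, x_i)^T\) gives exactly (3.37).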

  • p.133, sec.4.3.2:
    Estimating the regression coefficients \(\beta_0\) and \(\beta_1\) in \(p(X)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}\)

    Proof: See Casella, p.593, subsec.12.3.2.
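
    For orientation (not Casella's full treatment): the coefficients are chosen by maximum likelihood. With independent observations \((x_i, y_i)\), \(y_i\in\{0,1\}\), the log-likelihood is \[ \ell(\beta_0, \beta_1)=\sum_{i=1}^{n}\left[y_i(\beta_0+\beta_1 x_i)-\log{\left(1+e^{\beta_0+\beta_1 x_i}\right)}\right], \] and it is maximized numerically, e.g. by Newton–Raphson (iteratively reweighted least squares).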

  • p.138, sec.4.4:
    Linear Discriminant Analysis

    Proof: The Elements of Statistical Learning, p.116, subsec.4.3.3 gives an intuitive explanation.

  • p.140, (4.13):
    \[ \delta_k(x)=x\cdot \frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log{(\pi_k)} \]

    Proof: In (4.12) \[ p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)}}{\sum_{l=1}^{K}\pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_l)^2\right)}}, \] \(x\) is fixed and \(k\) varies. The denominator does not change as \(k\) changes, so it can be ignored, and the same goes for the factor \(\frac{1}{\sqrt{2\pi}\sigma}\) in the numerator. After expanding \((x-\mu_k)^2\) in the numerator, the term \(\frac{-x^2}{2\sigma^2}\) can be ignored as well.
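
    Written out, up to additive terms that do not depend on \(k\), \[ \log{p_k(x)}=\log{\pi_k}-\frac{1}{2\sigma^2}(x-\mu_k)^2=\log{\pi_k}-\frac{x^2}{2\sigma^2}+x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}, \] and dropping the remaining \(k\)-free term \(-\frac{x^2}{2\sigma^2}\) leaves exactly (4.13).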

  • p.143, (4.19):
    \[ \delta_k(x)=x^T\mathbf{\Sigma}^{-1}\mu_k-\frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k+\log{\pi_k} \]

    Proof: Similar to the proof of (4.13); the one difference here is that, because \(\mathbf{x}^T\mathbf{\Sigma}^{-1}\boldsymbol{\mu}_k\) is a scalar (and \(\mathbf{\Sigma}^{-1}\) is symmetric), \[ \boldsymbol{\mu}_k^T\mathbf{\Sigma}^{-1}\mathbf{x} = \mathbf{x}^T\mathbf{\Sigma}^{-1}\boldsymbol{\mu}_k. \]
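
    Written out, up to additive terms that do not depend on \(k\), \[ \log{(\pi_k f_k(x))}=\log{\pi_k}-\frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}^{-1}(x-\mu_k)=\log{\pi_k}-\frac{1}{2}x^T\mathbf{\Sigma}^{-1}x+x^T\mathbf{\Sigma}^{-1}\mu_k-\frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k \] (using the identity above), and dropping the remaining \(k\)-free term \(-\frac{1}{2}x^T\mathbf{\Sigma}^{-1}x\) leaves (4.19).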

  • p.148, fig.4.8:
    The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value.

    Proof: Thinking of the everyday phrase "false positive" makes this statement easier to understand.

  • p.151, (4.24):
    \[ \log{\left(\frac{p_1(x)}{1-p_1(x)}\right)}=\log{\left(\frac{p_1(x)}{p_2(x)}\right)}=c_0+c_1x \]

    Proof: \(c_0=\ln{\frac{\pi_1}{\pi_2}}-\frac{\mu_1^2-\mu_2^2}{2\sigma^2}\), \(c_1=\frac{(\mu_1-\mu_2)}{\sigma^2}\).
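
    A short derivation (same equal-variance normal setup as (4.12)): \[ \log{\left(\frac{p_1(x)}{p_2(x)}\right)}=\log{\left(\frac{\pi_1\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_1)^2\right)}}{\pi_2\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_2)^2\right)}}\right)}=\log{\frac{\pi_1}{\pi_2}}-\frac{\mu_1^2-\mu_2^2}{2\sigma^2}+\frac{\mu_1-\mu_2}{\sigma^2}x, \] which is \(c_0+c_1x\) with \(c_0\) and \(c_1\) as above.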

  • p.187, (5.6):
    \[ \alpha=\frac{\sigma_Y^2-\sigma_{XY}}{\sigma_X^2+\sigma_Y^2-2\sigma_{XY}} \]

    Proof: By [Casella, p.171, thm.4.5.6], \[\text{Var}(\alpha X+(1-\alpha)Y)=\alpha^2\text{Var}(X)+(1-\alpha)^2\text{Var}(Y)+2\alpha(1-\alpha)\text{Cov}(X, Y). \] As a function of \(\alpha\) this is \[ f(\alpha)=[\text{Var}(X)+\text{Var}(Y)-2\text{Cov}(X, Y)]\alpha^2+[-2\text{Var}(Y)+2\text{Cov}(X, Y)]\alpha+\text{Var}(Y). \] Solving \(f'(\alpha)=0\) gives \[ \alpha=\frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}, \] and this is a minimum because \(f''(\alpha)=2[\text{Var}(X)+\text{Var}(Y)-2\text{Cov}(X, Y)]=2\text{Var}(X-Y)\geq 0\).

  • p.213, line 20
    As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods discussed in Chapter 5.

    Proof: The book does not really explain how to use cross-validation here; the procedure below, based mainly on p.275, line 4, spells it out (a Python sketch follows the list).

    \(K\)-fold cross-validation

    In a model \(M\) there is a value \(n\) to be chosen, with candidate values \(n_1, n_2, ...\).
    • In sec.6.1 the choice is the number of predictors, \(p=1\text{ or }2\text{ or }3\text{ or }\cdots\).
    • In sec.7.4 the choice is the number of knots, \(k=1\text{ or }2\text{ or }3\text{ or }\cdots\) (the book uses a capital \(K\), but to avoid clashing with the \(K\) of \(K\)-fold we use a lowercase \(k\) here).
    We choose the value of \(n\) by the following procedure.
    • Assume \(n=n_1\) and split the data into \(K\) folds \(g_1, g_2, g_3, ..., g_K\).
      • Set \(g_1\) aside, train the model \(M\) on \(g_2, g_3, ..., g_K\) (remember, the model here assumes \(n=n_1\)), then test on \(g_1\) to get the error \(e_1\).
      • Set \(g_2\) aside, train the model \(M\) on \(g_1, g_3, ..., g_K\) (remember, the model here assumes \(n=n_1\)), then test on \(g_2\) to get the error \(e_2\).
      • Repeat this step.
      This gives \(e_1, e_2, ..., e_K\); compute \(\frac{e_1+e_2+\cdots+e_K}{K}=c_1\).
    • Assume \(n=n_2\) and split the data into \(K\) folds \(g_1, g_2, g_3, ..., g_K\).
      • Set \(g_1\) aside, train the model \(M\) on \(g_2, g_3, ..., g_K\) (remember, the model here assumes \(n=n_2\)), then test on \(g_1\) to get the error \(e_1\).
      • Set \(g_2\) aside, train the model \(M\) on \(g_1, g_3, ..., g_K\) (remember, the model here assumes \(n=n_2\)), then test on \(g_2\) to get the error \(e_2\).
      • Repeat this step.
      This gives \(e_1, e_2, ..., e_K\); compute \(\frac{e_1+e_2+\cdots+e_K}{K}=c_2\).
    • Repeat this step.
    This gives \(c_1, c_2, ...\); pick the \(n_i\) whose \(c_i\) is smallest.
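
    A minimal sketch of the procedure in Python (an illustrative, hypothetical example: the value being chosen is the degree of a one-dimensional polynomial least-squares fit, standing in for the number of predictors or knots; x, y are 1-D numpy arrays and degrees is the list of candidate values, all placeholders; np.polyfit stands in for whatever model \(M\) is actually being fit):

    import numpy as np

    def kfold_cv(x, y, degrees, K=10, seed=0):
        """Pick the candidate value (here a polynomial degree) with the
        smallest K-fold cross-validation mean squared error."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(x)), K)       # g_1, ..., g_K
        cv_errors = []                                           # c_1, c_2, ...
        for d in degrees:                                        # candidates n_1, n_2, ...
            fold_errors = []                                     # e_1, ..., e_K
            for i in range(K):
                test = folds[i]
                train = np.concatenate([folds[j] for j in range(K) if j != i])
                coef = np.polyfit(x[train], y[train], deg=d)     # train M assuming n = n_i
                pred = np.polyval(coef, x[test])
                fold_errors.append(np.mean((y[test] - pred) ** 2))
            cv_errors.append(np.mean(fold_errors))
        best = degrees[int(np.argmin(cv_errors))]                # the n_i with smallest c_i
        return best, cv_errors

    For example, kfold_cv(x, y, degrees=[1, 2, 3, 4, 5]) returns the degree with the smallest \(c_i\) together with the full list of \(c_i\) values.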

  • p.214, line -1
    It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

    Proof: See Bishop's Pattern Recognition and Machine Learning, p.8, table 1.1: in the \(M=9\) fit the estimated coefficients become very large, so p.5, (1.2) \[ \text{E}(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w})-t_n\}^2 \] is replaced by p.10, (1.4) \[ \tilde{\text{E}}(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w})-t_n\}^2+\frac{\lambda}{2}||\mathbf{w}||^2, \] which constrains the size of those estimated coefficients.

  • p.220, (6.8)
    One can show that the lasso and ridge regression coefficient estimates solve the problems \[ \underset{\beta}{\text{minimize}}\left\{\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\right)^2\right\} \text{ subject to }\sum_{j=1}^{p}|\beta_j|\leq s \]

    Proof: I don't know why.
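
    One direction of the equivalence is elementary (a sketch for the lasso; the ridge case is identical with \(\sum_j\beta_j^2\) in place of \(\sum_j|\beta_j|\)): fix \(\lambda\geq 0\), let \(\hat{\beta}\) minimize the penalized criterion \(\text{RSS}(\beta)+\lambda\sum_j|\beta_j|\), and set \(s=\sum_j|\hat{\beta}_j|\). For any \(\beta\) with \(\sum_j|\beta_j|\leq s\), \[ \text{RSS}(\hat{\beta})=\text{RSS}(\hat{\beta})+\lambda\left(\sum_j|\hat{\beta}_j|-s\right)\leq \text{RSS}(\beta)+\lambda\left(\sum_j|\beta_j|-s\right)\leq \text{RSS}(\beta), \] so \(\hat{\beta}\) also solves the constrained problem (6.8) with that \(s\).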

  • p.231, line 24.
    \(\text{Var}(\phi_{11}\times(\text{pop}-\overline{\text{pop}})+\phi_{21}\times(\text{ad}-\overline{\text{ad}}))\)

    Proof: See Jolliffe's Principal Component Analysis, sec. 1.1, p.5, except that Hastie considers the covariance matrix of \[ \mathbf{X}-\boldsymbol{\mu} = \begin{pmatrix} \text{pop}-\mu_{\text{pop}}\\ \text{ad}-\mu_{\text{ad}} \end{pmatrix} \] rather than the covariance matrix of \(\mathbf{X}\). This makes no difference, because the two are equal; see Hogg, IMS, p.141, (2.6.13) and p.143, thm.2.6.3, (2.6.15). \[ \begin{array}{lll} \text{Cov}(\mathbf{X}-\boldsymbol{\mu}) &=& \text{E}((\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T) \\ &=& \text{E}((\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}^T-\boldsymbol{\mu}^T)) \\ &=& \text{E}(\mathbf{X}\mathbf{X}^T-\boldsymbol{\mu}\mathbf{X}^T-\mathbf{X}\boldsymbol{\mu}^T+\boldsymbol{\mu}\boldsymbol{\mu}^T) \\ &=& \text{E}(\mathbf{X}\mathbf{X}^T)-\boldsymbol{\mu}\text{E}(\mathbf{X}^T)-\text{E}(\mathbf{X})\boldsymbol{\mu}^T+\boldsymbol{\mu}\boldsymbol{\mu}^T \\ &=& \text{E}(\mathbf{X}\mathbf{X}^T)-\boldsymbol{\mu}\boldsymbol{\mu}^T \\ &=& \text{Cov}(\mathbf{X}). \end{array} \]

  • p.232, fig.6.15, left panel.


    Proof: For how the straight line in the figure is obtained, see Casella, p.581, subsec.12.2.2, or Jolliffe's Principal Component Analysis, p.34, prop.G3.

  • p.267, line -5
    What is the variance of the fit, i.e. \(\text{Var}(\hat{f}(x_0))\)?

    Proof: See Montgomery's Introduction to Linear Regression Analysis, ch.3.
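
    For reference (a standard least-squares fact, assuming \(\hat{f}\) is a linear least-squares fit on a fixed design matrix \(X\) — here the matrix of basis functions evaluated at the training points — with \(\text{Var}(\epsilon)=\sigma^2\)): writing \(\hat{f}(x_0)=x_0^T\hat{\beta}\), where \(x_0\) denotes the vector of basis functions evaluated at the point, \[ \text{Var}(\hat{f}(x_0))=\sigma^2\, x_0^T(X^TX)^{-1}x_0. \]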

  • p.278, subsec.7.5.2
    Choosing the Smoothing Parameter \(\lambda\)

    Proof: See Wang's Smoothing Splines Methods and Applications, ch.3.
