Reading notes: Hastie and Tibshirani's An Introduction to Statistical Learning

  • p.34, (2.7):
    \[ \text{E}(y_0-\hat{f}(x_0))^2 =\text{Var}(\hat{f}(x_0))+[\text{Bias}(\hat{f}(x_0))]^2+\text{Var}(\epsilon). \]

    Proof: I don't know why. The authors don't prove it in The Elements of Statistical Learning either (cf. (3.22) there).
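
    A sketch of the standard argument (assuming \(y_0=f(x_0)+\epsilon\) with \(\text{E}(\epsilon)=0\), and \(\epsilon\) independent of the training data used to construct \(\hat{f}\)): \[ \begin{array}{lll} \text{E}(y_0-\hat{f}(x_0))^2 &=& \text{E}(f(x_0)+\epsilon-\hat{f}(x_0))^2 \\ &=& \text{E}(f(x_0)-\hat{f}(x_0))^2+2\,\text{E}[\epsilon(f(x_0)-\hat{f}(x_0))]+\text{E}(\epsilon^2) \\ &=& \text{E}(f(x_0)-\hat{f}(x_0))^2+\text{Var}(\epsilon), \end{array} \] because independence and \(\text{E}(\epsilon)=0\) kill the cross term. Writing \(f(x_0)-\hat{f}(x_0)=\left(f(x_0)-\text{E}\hat{f}(x_0)\right)+\left(\text{E}\hat{f}(x_0)-\hat{f}(x_0)\right)\) and expanding the square, the cross term again has expectation zero, leaving \([\text{Bias}(\hat{f}(x_0))]^2+\text{Var}(\hat{f}(x_0))\).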

  • p.79, line -14:
    Recall that in simple regression, \(R^2\) is the square of the correlation of the response and the variable. In multiple linear regression, it turns out that it equals \(\text{Cor}(Y, \hat{Y})^2\),

    Proof: I don't know why.
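
    A possible argument (assuming the model contains an intercept, so the residuals \(e_i=y_i-\hat{y}_i\) satisfy \(\sum_i e_i=0\) and \(\sum_i e_i\hat{y}_i=0\), which also gives \(\frac{1}{n}\sum_i\hat{y}_i=\bar{y}\)): since \(y_i-\bar{y}=(\hat{y}_i-\bar{y})+e_i\), \[ \sum_i(y_i-\bar{y})(\hat{y}_i-\bar{y})=\sum_i(\hat{y}_i-\bar{y})^2 \quad\text{and}\quad \sum_i(y_i-\bar{y})^2=\sum_i(\hat{y}_i-\bar{y})^2+\sum_i e_i^2, \] so \[ \text{Cor}(Y,\hat{Y})^2=\frac{\left[\sum_i(y_i-\bar{y})(\hat{y}_i-\bar{y})\right]^2}{\sum_i(y_i-\bar{y})^2\sum_i(\hat{y}_i-\bar{y})^2}=\frac{\sum_i(\hat{y}_i-\bar{y})^2}{\sum_i(y_i-\bar{y})^2}=\frac{\text{TSS}-\text{RSS}}{\text{TSS}}=R^2. \]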

  • p.98, (3.37):
    \[ h_i=\frac{1}{n}+\frac{(x_i-\bar{x})^2}{\sum_{i'=1}^{n}(x_{i'}-\bar{x})^2}. \]

    Proof: See Anderson, p.707, (14.33) and Casella, p.557, subsec. 11.3.5.
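
    For context (a standard fact, not the cited proofs themselves): \(h_i\) is the \(i\)-th diagonal entry of the hat matrix \(H=X(X^TX)^{-1}X^T\). In simple linear regression the \(i\)-th row of \(X\) is \((1, x_i)\), and inverting the \(2\times 2\) matrix \(X^TX\) and computing \((1, x_i)(X^TX)^{-1}(1, x_i)^T\) gives exactly (3.37).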

  • p.133, sec.4.3.2:
    Estimating the regression coefficients \(\beta_0\) and \(\beta_1\) in \(p(X)=\frac{e^{\beta_0+\beta_1 X}}{1+e^{\beta_0+\beta_1 X}}\)

    Proof: See Casella, p.593, subsec.12.3.2.
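
    For orientation (not Casella's full treatment): the coefficients are chosen by maximum likelihood. With independent observations \((x_i, y_i)\), \(y_i\in\{0,1\}\), the log-likelihood is \[ \ell(\beta_0, \beta_1)=\sum_{i=1}^{n}\left[y_i(\beta_0+\beta_1 x_i)-\log{\left(1+e^{\beta_0+\beta_1 x_i}\right)}\right], \] and it is maximized numerically, e.g. by Newton–Raphson (iteratively reweighted least squares).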

  • p.138, sec.4.4:
    Linear Discriminant Analysis

    Proof: The Elements of Statistical Learning, p.116, subsec.4.3.3 gives an intuitive explanation.

  • p.140, (4.13):
    \[ \delta_k(x)=x\cdot \frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}+\log{(\pi_k)} \]

    Proof: In (4.12) \[ p_k(x)=\frac{\pi_k \frac{1}{\sqrt{2\pi}\sigma}\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_k)^2\right)}}{\sum_{l=1}^{K}\pi_l \frac{1}{\sqrt{2\pi}\sigma}\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_l)^2\right)}}, \] \(x\) is fixed and \(k\) varies. The denominator does not change as \(k\) changes, so it can be ignored, and the same goes for the factor \(\frac{1}{\sqrt{2\pi}\sigma}\) in the numerator. After expanding \((x-\mu_k)^2\) in the numerator, the term \(\frac{-x^2}{2\sigma^2}\) can be ignored as well.
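
    Written out, up to additive terms that do not depend on \(k\), \[ \log{p_k(x)}=\log{\pi_k}-\frac{1}{2\sigma^2}(x-\mu_k)^2=\log{\pi_k}-\frac{x^2}{2\sigma^2}+x\cdot\frac{\mu_k}{\sigma^2}-\frac{\mu_k^2}{2\sigma^2}, \] and dropping the remaining \(k\)-free term \(-\frac{x^2}{2\sigma^2}\) leaves exactly (4.13).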

  • p.143, (4.19):
    \[ \delta_k(x)=x^T\mathbf{\Sigma}^{-1}\mu_k-\frac{1}{2}\mu_k^T \mathbf{\Sigma}^{-1}\mu_k+\log{\pi_k} \]

    Proof: Similar to the proof of (4.13); the one difference here is that, because \(\mathbf{x}^T\mathbf{\Sigma}^{-1}\boldsymbol{\mu}_k\) is a scalar (and \(\mathbf{\Sigma}^{-1}\) is symmetric), \[ \boldsymbol{\mu}_k^T\mathbf{\Sigma}^{-1}\mathbf{x} = \mathbf{x}^T\mathbf{\Sigma}^{-1}\boldsymbol{\mu}_k. \]
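
    Written out, up to additive terms that do not depend on \(k\), \[ \log{(\pi_k f_k(x))}=\log{\pi_k}-\frac{1}{2}(x-\mu_k)^T\mathbf{\Sigma}^{-1}(x-\mu_k)=\log{\pi_k}-\frac{1}{2}x^T\mathbf{\Sigma}^{-1}x+x^T\mathbf{\Sigma}^{-1}\mu_k-\frac{1}{2}\mu_k^T\mathbf{\Sigma}^{-1}\mu_k \] (using the identity above), and dropping the remaining \(k\)-free term \(-\frac{1}{2}x^T\mathbf{\Sigma}^{-1}x\) leaves (4.19).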

  • p.148, fig.4.8:
    The true positive rate is the sensitivity: the fraction of defaulters that are correctly identified, using a given threshold value. The false positive rate is 1-specificity: the fraction of non-defaulters that we classify incorrectly as defaulters, using that same threshold value.

    Proof: Thinking of the everyday phrase "false positive" makes this statement easier to understand.

  • p.151, (4.24):
    \[ \log{\left(\frac{p_1(x)}{1-p_1(x)}\right)}=\log{\left(\frac{p_1(x)}{p_2(x)}\right)}=c_0+c_1x \]

    Proof: \(c_0=\ln{\frac{\pi_1}{\pi_2}}-\frac{\mu_1^2-\mu_2^2}{2\sigma^2}\), \(c_1=\frac{(\mu_1-\mu_2)}{\sigma^2}\).
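
    A short derivation (same equal-variance normal setup as (4.12)): \[ \log{\left(\frac{p_1(x)}{p_2(x)}\right)}=\log{\left(\frac{\pi_1\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_1)^2\right)}}{\pi_2\exp{\left(-\frac{1}{2\sigma^2}(x-\mu_2)^2\right)}}\right)}=\log{\frac{\pi_1}{\pi_2}}-\frac{\mu_1^2-\mu_2^2}{2\sigma^2}+\frac{\mu_1-\mu_2}{\sigma^2}x, \] which is \(c_0+c_1x\) with \(c_0\) and \(c_1\) as above.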

  • p.187, (5.6):
    \[ \alpha=\frac{\sigma_Y^2-\sigma_{XY}}{\sigma_X^2+\sigma_Y^2-2\sigma_{XY}} \]

    Proof: By [Casella, p.171, thm.4.5.6], \[\text{Var}(\alpha X+(1-\alpha)Y)=\alpha^2\text{Var}(X)+(1-\alpha)^2\text{Var}(Y)+2\alpha(1-\alpha)\text{Cov}(X, Y). \] As a function of \(\alpha\) this is \[ f(\alpha)=[\text{Var}(X)+\text{Var}(Y)-2\text{Cov}(X, Y)]\alpha^2+[-2\text{Var}(Y)+2\text{Cov}(X, Y)]\alpha+\text{Var}(Y). \] Solving \(f'(\alpha)=0\) gives \[ \alpha=\frac{\sigma^2_Y-\sigma_{XY}}{\sigma^2_X+\sigma^2_Y-2\sigma_{XY}}, \] and this is a minimum because \(f''(\alpha)=2[\text{Var}(X)+\text{Var}(Y)-2\text{Cov}(X, Y)]=2\text{Var}(X-Y)\geq 0\).

  • p.213, line 20
    As an alternative to the approaches just discussed, we can directly estimate the test error using the validation set and cross-validation methods discussed in Chapter 5.

    Proof: The book does not really explain how to use cross-validation here; the procedure below, based mainly on p.275, line 4, spells it out (a Python sketch follows the list).

    \(K\)-fold cross-validation

    In a model \(M\) there is a value \(n\) to be chosen, with candidate values \(n_1, n_2, ...\).
    • In sec.6.1 the choice is the number of predictors, \(p=1\text{ or }2\text{ or }3\text{ or }\cdots\).
    • In sec.7.4 the choice is the number of knots, \(k=1\text{ or }2\text{ or }3\text{ or }\cdots\) (the book uses a capital \(K\), but to avoid clashing with the \(K\) of \(K\)-fold we use a lowercase \(k\) here).
    We choose the value of \(n\) by the following procedure.
    • Assume \(n=n_1\) and split the data into \(K\) folds \(g_1, g_2, g_3, ..., g_K\).
      • Set \(g_1\) aside, train the model \(M\) on \(g_2, g_3, ..., g_K\) (remember, the model here assumes \(n=n_1\)), then test on \(g_1\) to get the error \(e_1\).
      • Set \(g_2\) aside, train the model \(M\) on \(g_1, g_3, ..., g_K\) (remember, the model here assumes \(n=n_1\)), then test on \(g_2\) to get the error \(e_2\).
      • Repeat this step.
      This gives \(e_1, e_2, ..., e_K\); compute \(\frac{e_1+e_2+\cdots+e_K}{K}=c_1\).
    • Assume \(n=n_2\) and split the data into \(K\) folds \(g_1, g_2, g_3, ..., g_K\).
      • Set \(g_1\) aside, train the model \(M\) on \(g_2, g_3, ..., g_K\) (remember, the model here assumes \(n=n_2\)), then test on \(g_1\) to get the error \(e_1\).
      • Set \(g_2\) aside, train the model \(M\) on \(g_1, g_3, ..., g_K\) (remember, the model here assumes \(n=n_2\)), then test on \(g_2\) to get the error \(e_2\).
      • Repeat this step.
      This gives \(e_1, e_2, ..., e_K\); compute \(\frac{e_1+e_2+\cdots+e_K}{K}=c_2\).
    • Repeat this step.
    This gives \(c_1, c_2, ...\); pick the \(n_i\) whose \(c_i\) is smallest.
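
    A minimal sketch of the procedure in Python (an illustrative, hypothetical example: the value being chosen is the degree of a one-dimensional polynomial least-squares fit, standing in for the number of predictors or knots; x, y are 1-D numpy arrays and degrees is the list of candidate values, all placeholders; np.polyfit stands in for whatever model \(M\) is actually being fit):

    import numpy as np

    def kfold_cv(x, y, degrees, K=10, seed=0):
        """Pick the candidate value (here a polynomial degree) with the
        smallest K-fold cross-validation mean squared error."""
        rng = np.random.default_rng(seed)
        folds = np.array_split(rng.permutation(len(x)), K)       # g_1, ..., g_K
        cv_errors = []                                           # c_1, c_2, ...
        for d in degrees:                                        # candidates n_1, n_2, ...
            fold_errors = []                                     # e_1, ..., e_K
            for i in range(K):
                test = folds[i]
                train = np.concatenate([folds[j] for j in range(K) if j != i])
                coef = np.polyfit(x[train], y[train], deg=d)     # train M assuming n = n_i
                pred = np.polyval(coef, x[test])
                fold_errors.append(np.mean((y[test] - pred) ** 2))
            cv_errors.append(np.mean(fold_errors))
        best = degrees[int(np.argmin(cv_errors))]                # the n_i with smallest c_i
        return best, cv_errors

    For example, kfold_cv(x, y, degrees=[1, 2, 3, 4, 5]) returns the degree with the smallest \(c_i\) together with the full list of \(c_i\) values.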

  • p.214, line -1
    It may not be immediately obvious why such a constraint should improve the fit, but it turns out that shrinking the coefficient estimates can significantly reduce their variance.

    Proof: See Bishop's Pattern Recognition and Machine Learning, p.8, table 1.1: in the \(M=9\) fit the estimated coefficients become very large, so p.5, (1.2) \[ \text{E}(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w})-t_n\}^2 \] is replaced by p.10, (1.4) \[ \tilde{\text{E}}(\mathbf{w})=\frac{1}{2}\sum_{n=1}^{N}\{y(x_n, \mathbf{w})-t_n\}^2+\frac{\lambda}{2}||\mathbf{w}||^2, \] which constrains the size of those estimated coefficients.

  • p.220, (6.8)
    One can show that the lasso and ridge regression coefficient estimates solve the problems \[ \underset{\beta}{\text{minimize}}\left\{\sum_{i=1}^{n}\left(y_i-\beta_0-\sum_{j=1}^{p}\beta_j x_{ij}\right)^2\right\} \text{ subject to }\sum_{j=1}^{p}|\beta_j|\leq s \]

    Proof: I don't know why.
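
    One direction of the equivalence is elementary (a sketch for the lasso; the ridge case is identical with \(\sum_j\beta_j^2\) in place of \(\sum_j|\beta_j|\)): fix \(\lambda\geq 0\), let \(\hat{\beta}\) minimize the penalized criterion \(\text{RSS}(\beta)+\lambda\sum_j|\beta_j|\), and set \(s=\sum_j|\hat{\beta}_j|\). For any \(\beta\) with \(\sum_j|\beta_j|\leq s\), \[ \text{RSS}(\hat{\beta})=\text{RSS}(\hat{\beta})+\lambda\left(\sum_j|\hat{\beta}_j|-s\right)\leq \text{RSS}(\beta)+\lambda\left(\sum_j|\beta_j|-s\right)\leq \text{RSS}(\beta), \] so \(\hat{\beta}\) also solves the constrained problem (6.8) with that \(s\).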

  • p.231, line 24.
    \(\text{Var}(\phi_{11}\times(\text{pop}-\overline{\text{pop}})+\phi_{21}\times(\text{ad}-\overline{\text{ad}}))\)

    Proof: See Jolliffe's Principal Component Analysis, sec. 1.1, p.5, except that Hastie considers the covariance matrix of \[ \mathbf{X}-\boldsymbol{\mu} = \begin{pmatrix} \text{pop}-\mu_{\text{pop}}\\ \text{ad}-\mu_{\text{ad}} \end{pmatrix} \] rather than the covariance matrix of \(\mathbf{X}\). This makes no difference, because the two are equal; see Hogg, IMS, p.141, (2.6.13) and p.143, thm.2.6.3, (2.6.15). \[ \begin{array}{lll} \text{Cov}(\mathbf{X}-\boldsymbol{\mu}) &=& \text{E}((\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}-\boldsymbol{\mu})^T) \\ &=& \text{E}((\mathbf{X}-\boldsymbol{\mu})(\mathbf{X}^T-\boldsymbol{\mu}^T)) \\ &=& \text{E}(\mathbf{X}\mathbf{X}^T-\boldsymbol{\mu}\mathbf{X}^T-\mathbf{X}\boldsymbol{\mu}^T+\boldsymbol{\mu}\boldsymbol{\mu}^T) \\ &=& \text{E}(\mathbf{X}\mathbf{X}^T)-\boldsymbol{\mu}\text{E}(\mathbf{X}^T)-\text{E}(\mathbf{X})\boldsymbol{\mu}^T+\boldsymbol{\mu}\boldsymbol{\mu}^T \\ &=& \text{E}(\mathbf{X}\mathbf{X}^T)-\boldsymbol{\mu}\boldsymbol{\mu}^T \\ &=& \text{Cov}(\mathbf{X}). \end{array} \]

  • p.232, fig.6.15, left panel.


    Proof: For how the straight line in the figure is obtained, see Casella, p.581, subsec.12.2.2, or Jolliffe's Principal Component Analysis, p.34, prop.G3.

  • p.267, line -5
    What is the variance of the fit, i.e. \(\text{Var}(\hat{f}(x_0))\)?

    Proof: See Montgomery's Introduction to Linear Regression Analysis, ch.3.
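
    For reference (a standard least-squares fact, assuming \(\hat{f}\) is a linear least-squares fit on a fixed design matrix \(X\) — here the matrix of basis functions evaluated at the training points — with \(\text{Var}(\epsilon)=\sigma^2\)): writing \(\hat{f}(x_0)=x_0^T\hat{\beta}\), where \(x_0\) denotes the vector of basis functions evaluated at the point, \[ \text{Var}(\hat{f}(x_0))=\sigma^2\, x_0^T(X^TX)^{-1}x_0. \]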

  • p.278, subsec.7.5.2
    Choosing the Smoothing Parameter \(\lambda\)

    Proof: See Wang's Smoothing Splines Methods and Applications, ch.3.
