A Proof of Pearson's Chi-Square Goodness-of-Fit Test
Theorem
- [Hogg, PSI, p.416] To generalize, we let an experiment have $k$ (instead of only two) mutually exclusive and exhaustive outcomes, say, $A_1, A_2, \dots, A_k$. Let $p_i = P(A_i)$, and thus $\sum_{i=1}^k p_i = 1$. The experiment is repeated $n$ independent times, and we let $Y_i$ represent the number of times the experiment results in $A_i$, $i = 1, 2, \dots, k$. This joint distribution of $Y_1, Y_2, \dots, Y_{k-1}$ is a straightforward generalization of the binomial distribution, as follows.
- [Hogg, PSI, p.416] In considering the joint pmf, we see that $f(y_1, y_2, \dots, y_{k-1}) = P(Y_1 = y_1, Y_2 = y_2, \dots, Y_{k-1} = y_{k-1})$, where $y_1, y_2, \dots, y_{k-1}$ are nonnegative integers such that $y_1 + y_2 + \cdots + y_{k-1} \leq n$. Note that we do not need to consider $Y_k$, since, once the other $k-1$ random variables are observed to equal $y_1, y_2, \dots, y_{k-1}$, respectively, we know that $Y_k = n - y_1 - y_2 - \cdots - y_{k-1} = y_k$, say. From the independence of the trials, the probability of each particular arrangement of $y_1$ $A_1$s, $y_2$ $A_2$s, ..., $y_k$ $A_k$s is $p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k}$. The number of such arrangements is the multinomial coefficient $\binom{n}{y_1, y_2, \dots, y_k} = \frac{n!}{y_1! y_2! \cdots y_k!}$. Hence, the product of these two expressions gives the joint pmf of $Y_1, Y_2, \dots, Y_{k-1}$: $$f(y_1, y_2, \dots, y_{k-1}) = \frac{n!}{y_1! y_2! \cdots y_k!}\, p_1^{y_1} p_2^{y_2} \cdots p_k^{y_k}.$$ (Recall that $y_k = n - y_1 - y_2 - \cdots - y_{k-1}$.)
- [Hogg, PSI, p.416] Pearson then constructed an expression similar to $Q_1$ (Equation 9.1-1), which involves $Y_1$ and $Y_2 = n - Y_1$, that we denote by $Q_{k-1}$, which involves $Y_1, Y_2, \dots, Y_{k-1}$ and $Y_k = n - Y_1 - Y_2 - \cdots - Y_{k-1}$, namely, $$Q_{k-1} = \sum_{i=1}^k \frac{(Y_i - np_i)^2}{np_i}.$$ He argued that $Q_{k-1}$ has an approximate chi-square distribution with $k-1$ degrees of freedom in much the same way we argued that $Q_1$ is approximately $\chi^2(1)$. We accept Pearson's conclusion, as the proof is beyond the level of this text.
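Before turning to the proof, here is a minimal simulation sketch of the theorem (my own illustration, not from the quoted texts; the values of $k$, $n$, $p$, and the seed are arbitrary choices): generate multinomial counts under the null hypothesis and compare the empirical quantiles of $Q_{k-1}$ with those of $\chi^2(k-1)$.

```python
import numpy as np
from scipy import stats

# A minimal sketch: simulate Pearson's statistic Q_{k-1} under the null hypothesis
# and compare its distribution with chi-square(k-1). All parameters are arbitrary.
rng = np.random.default_rng(0)
k, n, reps = 4, 500, 20000
p = np.array([0.1, 0.2, 0.3, 0.4])

Y = rng.multinomial(n, p, size=reps)          # counts (Y_1, ..., Y_k) for each replication
Q = ((Y - n * p) ** 2 / (n * p)).sum(axis=1)  # Pearson's statistic Q_{k-1}

qs = [0.5, 0.9, 0.95, 0.99]
print(np.quantile(Q, qs))                     # simulated quantiles of Q_{k-1}
print(stats.chi2.ppf(qs, df=k - 1))           # chi-square(k-1) quantiles (close to the above)
```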
Proof
The main idea is from this lecture note by David Hunter of the Penn State Department of Statistics. [Keeping, sec.13.5, 13.6, appendix A.17] has a partial discussion. If you know where I can find a complete proof in a book, please send me an email.
Part I
- We need some preliminaries.
- [Hogg, IMS, p.140] In Section 2.5 we discussed the covariance between two random variables. In this section we want to extend this discussion to the $n$-variate case. Let $X = (X_1, \dots, X_n)'$ be an $n$-dimensional random vector. Recall that we defined $E(X) = (E(X_1), \dots, E(X_n))'$, that is, the expectation of a random vector is just the vector of the expectations of its components.
- [Hogg, IMS, p.140] Now suppose $W$ is an $m \times n$ matrix of random variables, say, $W = [W_{ij}]$ for the random variables $W_{ij}$, $1 \leq i \leq m$ and $1 \leq j \leq n$. Note that we can always string out the matrix into an $mn \times 1$ random vector. Hence, we define the expectation of a random matrix $E[W] = [E(W_{ij})]$. As the following theorem shows, the linearity of the expectation operator easily follows from this definition:
- [Hogg, IMS, p.141] Let $X = (X_1, \dots, X_n)'$ be an $n$-dimensional random vector, such that $\sigma_i^2 = \mathrm{Var}(X_i) < \infty$. The mean of $X$ is $\mu = E[X]$ and we define its variance-covariance matrix as $$\mathrm{Cov}(X) = E[(X - \mu)(X - \mu)'] = [\sigma_{ij}],$$ where $\sigma_{ii}$ denotes $\sigma_i^2$. As Exercise 2.6.8 shows, the $i$th diagonal entry of $\mathrm{Cov}(X)$ is $\sigma_i^2 = \mathrm{Var}(X_i)$ and the $(i,j)$th off-diagonal entry is $\mathrm{Cov}(X_i, X_j)$.
- [Hogg, IMS, p.350] As another simple application, consider the multivariate analog of the sample mean and sample variance. Let $\{X_n\}$ be a sequence of iid random vectors with common mean vector $\mu$ and variance-covariance matrix $\Sigma$. Denote the vector of means by $\bar{X}_n = \frac{1}{n} \sum_{i=1}^n X_i$. Of course, $\bar{X}_n$ is just the vector of sample means, $(\bar{X}_1, \dots, \bar{X}_p)'$. By the Weak Law of Large Numbers, Theorem 5.1.1, $\bar{X}_j \to \mu_j$, in probability, for each $j$. Hence, by Theorem 5.4.1, $\bar{X}_n \to \mu$, in probability.
The following diagram may help you understand this: $$X_1 = \begin{bmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{p1} \end{bmatrix},\quad X_2 = \begin{bmatrix} X_{12} \\ X_{22} \\ \vdots \\ X_{p2} \end{bmatrix},\quad \dots,\quad X_n = \begin{bmatrix} X_{1n} \\ X_{2n} \\ \vdots \\ X_{pn} \end{bmatrix},\qquad \bar{X}_n = \begin{bmatrix} \bar{X}_1 \\ \bar{X}_2 \\ \vdots \\ \bar{X}_p \end{bmatrix} \to \begin{bmatrix} \mu_1 \\ \mu_2 \\ \vdots \\ \mu_p \end{bmatrix} = \mu.$$ Here $\bar{X}_k$ is defined as $\bar{X}_k = \frac{\sum_{l=1}^n X_{kl}}{n}$; I think writing it as $\bar{X}_{k\square}$ (averaging over the second index) makes it clearer at a glance. The "common mean vector $\mu$ and variance-covariance matrix $\Sigma$" here mean $E(X_1) = E(X_2) = \cdots = \mu$, that is, $E(X_{i1}) = E(X_{i2}) = \cdots = \mu_i$ for $i = 1, 2, \dots$, and $\mathrm{Cov}(X_1) = \mathrm{Cov}(X_2) = \cdots = \Sigma$. (A small numerical illustration of these definitions follows.)
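Here is a minimal numerical sketch (my own illustration, not from [Hogg, IMS]; the bivariate distribution and parameters are arbitrary choices) of the two ingredients above: the vector of sample means approaches $\mu$, and the empirical version of $E[(X-\mu)(X-\mu)']$ approaches $\Sigma$.

```python
import numpy as np

# A sketch with an arbitrary bivariate example: the sample mean vector approaches mu,
# and the empirical E[(X - mu)(X - mu)'] approaches Sigma.
rng = np.random.default_rng(1)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

n = 100_000
X = rng.multivariate_normal(mu, Sigma, size=n)   # rows are X_1', ..., X_n'

Xbar = X.mean(axis=0)                            # vector of sample means
Cov_hat = (X - Xbar).T @ (X - Xbar) / n          # empirical variance-covariance matrix

print(Xbar)      # close to mu
print(Cov_hat)   # close to Sigma
```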
Part II
- Let $X_1, X_2, \dots$ be a sequence of independent and identically distributed $k$-variate random vectors. For each $j$, $X_j$ has a multinomial distribution $\mathrm{multinomial}(1, p)$, where $$p = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_k \end{bmatrix}.$$
- Each $X_{ij}$ is either 0 or 1. As the following figure shows, each column has exactly one 1 and 0 elsewhere: $$X_1 = \begin{bmatrix} X_{11} \\ X_{21} \\ \vdots \\ X_{k1} \end{bmatrix},\quad X_2 = \begin{bmatrix} X_{12} \\ X_{22} \\ \vdots \\ X_{k2} \end{bmatrix},\quad \dots,\quad X_n = \begin{bmatrix} X_{1n} \\ X_{2n} \\ \vdots \\ X_{kn} \end{bmatrix},$$ with row sums $$\sum_{j=1}^n X_{1j} = n\bar{X}_1 = n\bar{X}_{1\square} = Y_1,\quad \sum_{j=1}^n X_{2j} = n\bar{X}_2 = n\bar{X}_{2\square} = Y_2,\quad \dots,\quad \sum_{j=1}^n X_{kj} = n\bar{X}_k = n\bar{X}_{k\square} = Y_k.$$ Recall that $Y_i$ represents the number of times the experiment results in $A_i$, $i = 1, 2, \dots, k$.
- By [Casella, p.182, line 5], $X_{ij} \sim B(1, p_i)$. So $E(X_{ij}) = p_i$ and $\mathrm{Var}(X_{ij}) = p_i(1 - p_i)$. By [Casella, p.182, line -10], $\mathrm{Cov}(X_{ij}, X_{lj}) = -p_i p_l$ for $i \neq l$. Thus, $E(X_1) = E(X_2) = \cdots = p$, and the variance-covariance matrix of $X_1, X_2, \dots$ is $$\mathrm{Cov}(X_1) = \mathrm{Cov}(X_2) = \cdots = \Sigma = \begin{bmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_k \\ -p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_k \\ \vdots & \vdots & \ddots & \vdots \\ -p_k p_1 & -p_k p_2 & \cdots & p_k(1-p_k) \end{bmatrix}.$$ Note that the sum of the entries in any column or row of $\Sigma$ is 0. Hence, $\det \Sigma = 0$ and $\Sigma$ is not invertible. (A numerical check of these facts appears after this list.)
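A minimal numerical sketch of Part II (my own check; the choice of $p$ is arbitrary): build $\Sigma = \mathrm{diag}(p) - pp^T$, confirm that its rows sum to 0 and that it is singular, and compare it with the empirical covariance of simulated $\mathrm{multinomial}(1, p)$ vectors.

```python
import numpy as np

# A sketch with an arbitrary p: Sigma = diag(p) - p p^T for multinomial(1, p) vectors.
rng = np.random.default_rng(2)
p = np.array([0.1, 0.2, 0.3, 0.4])
Sigma = np.diag(p) - np.outer(p, p)

print(Sigma.sum(axis=0))                 # every row/column sums to 0
print(np.linalg.det(Sigma))              # ~0: Sigma is singular

# Empirical check: covariance of multinomial(1, p) samples X_1, ..., X_n.
X = rng.multinomial(1, p, size=200_000)  # each row is one X_j (a 0/1 vector)
print(np.cov(X, rowvar=False))           # close to Sigma
```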
Part III
- For each $j = 1, 2, \dots$, consider $$Z_j = \begin{bmatrix} X_{1j} \\ X_{2j} \\ \vdots \\ X_{k-1,j} \end{bmatrix}.$$ Check that $$E(Z_1) = E(Z_2) = \cdots = p^* = \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{k-1} \end{bmatrix} \tag{*}$$ and the variance-covariance matrix of $Z_1, Z_2, \dots$ is $$\mathrm{Cov}(Z_1) = \mathrm{Cov}(Z_2) = \cdots = \Sigma^* = \begin{bmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ -p_{k-1} p_1 & -p_{k-1} p_2 & \cdots & p_{k-1}(1-p_{k-1}) \end{bmatrix}, \tag{**}$$ that is, the upper-left $(k-1) \times (k-1)$ submatrix of $\Sigma$.
- Note that $$\Sigma^* = \begin{bmatrix} p_1(1-p_1) & -p_1 p_2 & \cdots & -p_1 p_{k-1} \\ -p_2 p_1 & p_2(1-p_2) & \cdots & -p_2 p_{k-1} \\ \vdots & \vdots & \ddots & \vdots \\ -p_{k-1} p_1 & -p_{k-1} p_2 & \cdots & p_{k-1}(1-p_{k-1}) \end{bmatrix} = \begin{bmatrix} p_1 & & & \\ & p_2 & & \\ & & \ddots & \\ & & & p_{k-1} \end{bmatrix} - \begin{bmatrix} p_1 \\ p_2 \\ \vdots \\ p_{k-1} \end{bmatrix} \begin{bmatrix} p_1 & p_2 & \cdots & p_{k-1} \end{bmatrix} = \mathrm{diag}(p_1, p_2, \dots, p_{k-1}) - p^*(p^*)^T.$$ Now, $\Sigma^*$ is invertible and its inverse is $$(\Sigma^*)^{-1} = \begin{bmatrix} \frac{1}{p_1} + \frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\ \frac{1}{p_k} & \frac{1}{p_2} + \frac{1}{p_k} & \cdots & \frac{1}{p_k} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{1}{p_k} & \frac{1}{p_k} & \cdots & \frac{1}{p_{k-1}} + \frac{1}{p_k} \end{bmatrix} = \begin{bmatrix} \frac{1}{p_1} & & & \\ & \frac{1}{p_2} & & \\ & & \ddots & \\ & & & \frac{1}{p_{k-1}} \end{bmatrix} + \frac{1}{p_k} \begin{bmatrix} 1 & 1 & \cdots & 1 \\ 1 & 1 & \cdots & 1 \\ \vdots & \vdots & \ddots & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} = \mathrm{diag}\left(\frac{1}{p_1}, \frac{1}{p_2}, \dots, \frac{1}{p_{k-1}}\right) + \frac{1}{p_k} \mathbf{1}\mathbf{1}^T, \tag{***}$$ where $\mathbf{1} = \begin{bmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{bmatrix}$. It is easy to check that this is indeed the inverse by using $$(p^*)^T \mathrm{diag}\left(\tfrac{1}{p_1}, \tfrac{1}{p_2}, \dots, \tfrac{1}{p_{k-1}}\right) = \mathbf{1}^T,\qquad \mathrm{diag}(p_1, p_2, \dots, p_{k-1})\,\mathbf{1} = p^*,\qquad (p^*)^T \mathbf{1} = p_1 + p_2 + \cdots + p_{k-1} = 1 - p_k.$$ (A numerical verification of (***) follows this list.)
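A quick numerical sketch (my own; the choice of $p$ is arbitrary) of the inverse formula (***): build $\Sigma^* = \mathrm{diag}(p^*) - p^*(p^*)^T$ and check that $\mathrm{diag}(1/p_1, \dots, 1/p_{k-1}) + \frac{1}{p_k}\mathbf{1}\mathbf{1}^T$ really is its inverse.

```python
import numpy as np

# A sketch with an arbitrary p: verify the closed-form inverse of Sigma* from (***).
p = np.array([0.1, 0.2, 0.3, 0.4])
k = len(p)
p_star = p[:-1]                                        # (p_1, ..., p_{k-1})

Sigma_star = np.diag(p_star) - np.outer(p_star, p_star)
inv_formula = np.diag(1.0 / p_star) + np.ones((k - 1, k - 1)) / p[-1]

print(np.allclose(Sigma_star @ inv_formula, np.eye(k - 1)))   # True
print(np.allclose(inv_formula, np.linalg.inv(Sigma_star)))    # True
```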
Part IV
- [Hogg, IMS, p.351, def.5.4.2] Let $\{X_n\}$ be a sequence of random vectors with $X_n$ having distribution function $F_n(x)$ and $X$ be a random vector with distribution function $F(x)$. Then $\{X_n\}$ converges in distribution to $X$ if $$\lim_{n \to \infty} F_n(x) = F(x),$$ for all points $x$ at which $F(x)$ is continuous. We write $X_n \xrightarrow{D} X$.
- [Hogg, IMS, p.351, thm.5.4.4] (Multivariate Central Limit Theorem). Let $\{X_n\}$ be a sequence of iid random vectors with common mean vector $\mu$ and variance-covariance matrix $\Sigma$ which is positive definite. Assume that the common moment generating function $M(t)$ exists in an open neighborhood of $0$. Let $$Y_n = \frac{1}{\sqrt{n}} \sum_{i=1}^n (X_i - \mu) = \sqrt{n}(\bar{X} - \mu).$$ Then $Y_n$ converges in distribution to a $N_p(0, \Sigma)$ distribution.
- Consider that $$\bar{Z}_n = \frac{Z_1 + Z_2 + \cdots + Z_n}{n} = \begin{bmatrix} \frac{X_{11} + X_{12} + \cdots + X_{1n}}{n} \\ \frac{X_{21} + X_{22} + \cdots + X_{2n}}{n} \\ \vdots \\ \frac{X_{k-1,1} + X_{k-1,2} + \cdots + X_{k-1,n}}{n} \end{bmatrix} = \begin{bmatrix} \bar{X}_{1\square} \\ \bar{X}_{2\square} \\ \vdots \\ \bar{X}_{k-1,\square} \end{bmatrix}.$$ By (*), (**) and the Multivariate Central Limit Theorem, $$\sqrt{n}(\bar{Z}_n - p^*) \xrightarrow{D} N_{k-1}(0, \Sigma^*). \tag{****}$$
- [Hogg, IMS, p.202, thm.3.5.1] Suppose $X$ has a $N_n(\mu, \Sigma)$ distribution, where $\Sigma$ is positive definite. Then the random variable $Y = (X - \mu)' \Sigma^{-1} (X - \mu)$ has a $\chi^2(n)$ distribution.
- $(X - \mu)'$ means the transpose of $X - \mu$. That is, $(X - \mu)' = (X - \mu)^T$.
- By (****) and the continuous mapping theorem, together with [Hogg, IMS, p.202, thm.3.5.1] (applied to the limiting $N_{k-1}(0, \Sigma^*)$ random vector, with $\mu = 0$), $$\sqrt{n}(\bar{Z}_n - p^*)^T (\Sigma^*)^{-1} \sqrt{n}(\bar{Z}_n - p^*)$$ converges in distribution to a $\chi^2(k-1)$ distribution; that is, for large $n$ it has an approximate $\chi^2(k-1)$ distribution. (A simulation of this is sketched after this list.)
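Here is a minimal simulation sketch of Part IV (my own; $p$, $n$, and the seed are arbitrary): generate many replications of $\sqrt{n}(\bar{Z}_n - p^*)$, check that its sample covariance is close to $\Sigma^*$, and compare the quadratic form with $\chi^2(k-1)$ quantiles.

```python
import numpy as np
from scipy import stats

# A sketch (arbitrary p, n): behaviour of sqrt(n)*(Zbar_n - p*) and of the quadratic form.
rng = np.random.default_rng(3)
p = np.array([0.1, 0.2, 0.3, 0.4])
k, n, reps = len(p), 1000, 20000

p_star = p[:-1]
Sigma_star = np.diag(p_star) - np.outer(p_star, p_star)
Sigma_star_inv = np.linalg.inv(Sigma_star)

counts = rng.multinomial(n, p, size=reps)              # each row: (Y_1, ..., Y_k)
Zbar = counts[:, :-1] / n                              # each row: Zbar_n
W = np.sqrt(n) * (Zbar - p_star)                       # sqrt(n) * (Zbar_n - p*)

print(np.cov(W, rowvar=False))                         # close to Sigma_star
Q = np.einsum('ri,ij,rj->r', W, Sigma_star_inv, W)     # one quadratic form per replication
print(np.quantile(Q, [0.9, 0.95, 0.99]))
print(stats.chi2.ppf([0.9, 0.95, 0.99], df=k - 1))     # close to the simulated quantiles
```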
Part V
- The final step is to show that $$\sqrt{n}(\bar{Z}_n - p^*)^T (\Sigma^*)^{-1} \sqrt{n}(\bar{Z}_n - p^*) = \sum_{i=1}^k \frac{(Y_i - np_i)^2}{np_i}.$$ Note that $$\bar{Z}_n - p^* = \begin{bmatrix} \bar{X}_{1\square} - p_1 \\ \bar{X}_{2\square} - p_2 \\ \vdots \\ \bar{X}_{k-1,\square} - p_{k-1} \end{bmatrix}.$$ Therefore, $$\begin{aligned} \sqrt{n}(\bar{Z}_n - p^*)^T (\Sigma^*)^{-1} \sqrt{n}(\bar{Z}_n - p^*) &= n(\bar{Z}_n - p^*)^T (\Sigma^*)^{-1} (\bar{Z}_n - p^*) \\ &\overset{(***)}{=} n(\bar{Z}_n - p^*)^T \left( \mathrm{diag}\left(\tfrac{1}{p_1}, \tfrac{1}{p_2}, \dots, \tfrac{1}{p_{k-1}}\right) + \tfrac{1}{p_k}\,\mathbf{1}\mathbf{1}^T \right) (\bar{Z}_n - p^*) \\ &= n\left[ (\bar{Z}_n - p^*)^T \mathrm{diag}\left(\tfrac{1}{p_1}, \dots, \tfrac{1}{p_{k-1}}\right) (\bar{Z}_n - p^*) + \tfrac{1}{p_k}\,(\bar{Z}_n - p^*)^T \mathbf{1}\mathbf{1}^T (\bar{Z}_n - p^*) \right] \\ &= n\left[ \sum_{i=1}^{k-1} \frac{(\bar{X}_{i\square} - p_i)^2}{p_i} + \frac{1}{p_k} \sum_{i,j=1}^{k-1} (\bar{X}_{i\square} - p_i)(\bar{X}_{j\square} - p_j) \right] \\ &= n\left\{ \sum_{i=1}^{k-1} \frac{(\bar{X}_{i\square} - p_i)^2}{p_i} + \frac{1}{p_k} \left[ \sum_{i=1}^{k-1} (\bar{X}_{i\square} - p_i) \right]^2 \right\} \\ &= \sum_{i=1}^{k-1} \frac{(n\bar{X}_{i\square} - np_i)^2}{np_i} + \frac{\left[ \sum_{i=1}^{k-1} (n\bar{X}_{i\square} - np_i) \right]^2}{np_k} \\ &\overset{\sum_{i=1}^k n\bar{X}_{i\square} = \sum_{i=1}^k Y_i = n}{=} \sum_{i=1}^{k-1} \frac{(n\bar{X}_{i\square} - np_i)^2}{np_i} + \frac{\left[ (n - n\bar{X}_{k\square}) - n(1 - p_k) \right]^2}{np_k} \\ &= \sum_{i=1}^k \frac{(n\bar{X}_{i\square} - np_i)^2}{np_i} = \sum_{i=1}^k \frac{(Y_i - np_i)^2}{np_i}. \end{aligned}$$ (A numerical check of this identity follows.)
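A minimal numerical sketch (my own; the data are arbitrary) of the identity in Part V: for a single multinomial sample, the quadratic form $n(\bar{Z}_n - p^*)^T (\Sigma^*)^{-1} (\bar{Z}_n - p^*)$ and Pearson's statistic $\sum_{i=1}^k (Y_i - np_i)^2/(np_i)$ agree to machine precision.

```python
import numpy as np

# A sketch (arbitrary p, n): the quadratic form equals Pearson's statistic exactly.
rng = np.random.default_rng(4)
p = np.array([0.1, 0.2, 0.3, 0.4])
n = 500

Y = rng.multinomial(n, p)                              # observed counts (Y_1, ..., Y_k)
p_star = p[:-1]
Zbar = Y[:-1] / n                                      # Zbar_n
Sigma_star_inv = np.diag(1.0 / p_star) + np.ones((len(p) - 1,) * 2) / p[-1]

quad_form = n * (Zbar - p_star) @ Sigma_star_inv @ (Zbar - p_star)
pearson = ((Y - n * p) ** 2 / (n * p)).sum()

print(quad_form, pearson)                              # identical up to rounding
print(np.isclose(quad_form, pearson))                  # True
```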
Another Incomplete Proof
Below is another, incomplete, proof, mainly based on this reference. This approach appears not to require the multivariate normal distribution, but at the last step, when one has to prove that a sum of dependent normal random variables is still normal (the place marked with three question marks, ???), the multivariate normal distribution is unavoidable; see here.
- By [Casella, p.182, line 5], $Y_i \sim B(n, p_i)$. Then by the Central Limit Theorem (the De Moivre–Laplace theorem), $Y_i$ is approximately $N(np_i, np_i(1 - p_i))$.
- Let $X_i = \frac{Y_i - np_i}{\sqrt{np_i}}$. Then $$X_i = \frac{Y_i - np_i}{\sqrt{np_i}} = \frac{Y_i - np_i}{\sqrt{np_i(1-p_i)}}\,\sqrt{1 - p_i} \overset{\text{Central Limit Theorem}}{\sim} \sqrt{1 - p_i}\, N(0, 1).$$ By [Casella, p.184, cor.4.6.10], $X_i$ is approximately $N(0, 1 - p_i)$. It follows that $$\mathrm{Var}(X_i) = 1 - p_i. \tag{i}$$
- If $i \neq j$, then $$\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}\left( \frac{Y_i - np_i}{\sqrt{np_i}}, \frac{Y_j - np_j}{\sqrt{np_j}} \right) \overset{\mathrm{Cov}(aX+b,\,cY+d) = ac\,\mathrm{Cov}(X,Y)}{=} \frac{1}{\sqrt{np_i}} \cdot \frac{1}{\sqrt{np_j}} \cdot \mathrm{Cov}(Y_i, Y_j) \overset{\text{[Casella, p.182, line -10]}}{=} \frac{1}{\sqrt{np_i}} \cdot \frac{1}{\sqrt{np_j}} \cdot (-np_i p_j) = -\sqrt{p_i p_j}. \tag{ii}$$
- By (i) and (ii), $$\mathrm{Cov}(X) = \begin{bmatrix} 1 - p_1 & -\sqrt{p_1 p_2} & \cdots & -\sqrt{p_1 p_k} \\ -\sqrt{p_2 p_1} & 1 - p_2 & \cdots & -\sqrt{p_2 p_k} \\ \vdots & \vdots & \ddots & \vdots \\ -\sqrt{p_k p_1} & -\sqrt{p_k p_2} & \cdots & 1 - p_k \end{bmatrix} = I - pp^T, \tag{iii}$$ where $X = (X_1, X_2, \dots, X_k)^T$ and, in this section, $p = \begin{bmatrix} \sqrt{p_1} \\ \sqrt{p_2} \\ \vdots \\ \sqrt{p_k} \end{bmatrix}$ (not the probability vector used earlier).
- We find the eigenvalues of $\mathrm{Cov}(X)$. Note that $p^T p = p_1 + p_2 + \cdots + p_k = 1$. Then $$\det(\mathrm{Cov}(X) - \lambda I) = \det(I - pp^T - \lambda I) = \det((1-\lambda)I - pp^T) = (1-\lambda)^k \det\left(I - \frac{1}{1-\lambda}\,pp^T\right) \overset{\text{Sylvester's theorem}}{=} (1-\lambda)^k \left(1 - \frac{1}{1-\lambda}\,p^T p\right) = -\lambda(1-\lambda)^{k-1}.$$ So the eigenvalues of $\mathrm{Cov}(X)$ are $0$ (with multiplicity $1$) and $1$ (with multiplicity $k-1$).
- Since $\mathrm{Cov}(X)$ is symmetric, $\mathrm{Cov}(X)$ is orthogonally diagonalizable (see [Friedberg, p.384, thm.6.20]). That is, there exists an orthogonal matrix $Q$ (i.e., $QQ^T = Q^TQ = I$) such that $$Q\,\mathrm{Cov}(X)\,Q^T = \begin{bmatrix} I_{k-1} & O \\ O & 0 \end{bmatrix} \quad\text{and}\quad Q\,\mathrm{Cov}(X) = \begin{bmatrix} I_{k-1} & O \\ O & 0 \end{bmatrix} Q. \tag{iv}$$
- Set $$Z = QX. \tag{v}$$
- Note that $$\mathrm{Cov}(Z) = \mathrm{Cov}(QX) \overset{\text{[Hogg, IMS, p.141, thm.2.6.3]}}{=} Q\,\mathrm{Cov}(X)\,Q^T = \begin{bmatrix} I_{k-1} & O \\ O & 0 \end{bmatrix}. \tag{vi}$$ (A numerical check of (iii), (iv), and (vi) is sketched after this list.)
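A minimal numerical sketch (my own; the probabilities are arbitrary) of (iii), the eigenvalue computation, and (iv)/(vi): build $I - pp^T$ with $p = (\sqrt{p_1}, \dots, \sqrt{p_k})^T$, check its eigenvalues, and orthogonally diagonalize it.

```python
import numpy as np

# A sketch (arbitrary probabilities): Cov(X) = I - p p^T with p = (sqrt(p_1), ..., sqrt(p_k)).
probs = np.array([0.1, 0.2, 0.3, 0.4])
k = len(probs)
p = np.sqrt(probs)                               # the vector called p in this section

CovX = np.eye(k) - np.outer(p, p)

eigvals, eigvecs = np.linalg.eigh(CovX)          # CovX is symmetric
print(np.round(eigvals, 10))                     # eigenvalues: 0 once, 1 with multiplicity k-1

# An orthogonal Q with Q CovX Q^T = diag(1, ..., 1, 0): rows of Q are eigenvectors,
# ordered so that the eigenvector for eigenvalue 0 comes last.
Q = eigvecs[:, ::-1].T
print(np.allclose(Q @ Q.T, np.eye(k)))           # Q is orthogonal
print(np.round(Q @ CovX @ Q.T, 10))              # diag(1, ..., 1, 0)
```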
Now, we prove the four main results.
- For each $i$, by (v), $Z_i$ is a linear combination of the normal random variables $X_1, X_2, \dots, X_k$. By ???, $Z_i$ is also normal.
- Since $E(X_i) = 0$, by [Casella, p.57], $E(Z_i) = 0$. By (vi), $\mathrm{Var}(Z_i) = 1$ for $i = 1, 2, \dots, k-1$.
- By (vi), $\mathrm{Cov}(Z_i, Z_j) = 0$ for $i \neq j$. That is, $Z_i$ and $Z_j$ are uncorrelated. By [Roussas, Course, p.466, cor.2], $Z_1, Z_2, \dots, Z_k$ are independent.
- Now, check that $$p^T X = \sum_{i=1}^k \sqrt{p_i}\,\frac{Y_i - np_i}{\sqrt{np_i}} = \frac{1}{\sqrt{n}} \sum_{i=1}^k (Y_i - np_i) = \frac{n - n}{\sqrt{n}} = 0.$$ Hence $$\mathrm{Cov}(X)\,X \overset{\text{(iii)}}{=} (I - pp^T)X = X - p\,p^T X = X,$$ and therefore $$Z \overset{\text{(v)}}{=} QX = Q(\mathrm{Cov}(X)X) = (Q\,\mathrm{Cov}(X))X \overset{\text{(iv)}}{=} \begin{bmatrix} I_{k-1} & O \\ O & 0 \end{bmatrix} QX = \begin{bmatrix} I_{k-1} & O \\ O & 0 \end{bmatrix} Z,$$ so $Z_k = 0$. It follows that $$\sum_{i=1}^{k-1} Z_i^2 = \sum_{i=1}^k Z_i^2 = Z^T Z = (QX)^T(QX) = X^T Q^T Q X = X^T X = \sum_{i=1}^k X_i^2.$$ (A numerical check follows.)
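A minimal numerical sketch (my own; the data are arbitrary) of this chain: for one multinomial sample, $p^T X = 0$, the last coordinate of $Z = QX$ vanishes, and $\sum_{i=1}^{k-1} Z_i^2 = \sum_{i=1}^k X_i^2$, which is exactly Pearson's statistic.

```python
import numpy as np

# A sketch (arbitrary probabilities and n): p^T X = 0, Z_k = 0, and the sum of Z_i^2
# equals the sum of X_i^2, which is Pearson's statistic.
rng = np.random.default_rng(5)
probs = np.array([0.1, 0.2, 0.3, 0.4])
k, n = len(probs), 500

Y = rng.multinomial(n, probs)
X = (Y - n * probs) / np.sqrt(n * probs)         # X_i = (Y_i - n p_i) / sqrt(n p_i)
p = np.sqrt(probs)

CovX = np.eye(k) - np.outer(p, p)
eigvals, eigvecs = np.linalg.eigh(CovX)
Q = eigvecs[:, ::-1].T                           # orthogonal; eigenvalue-0 eigenvector last
Z = Q @ X

print(np.isclose(p @ X, 0.0))                    # p^T X = 0
print(np.isclose(Z[-1], 0.0))                    # Z_k = 0
print((Z[:-1] ** 2).sum(), (X ** 2).sum())       # equal: both are Pearson's statistic
```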
Textbooks that state the theorem without proof
- [Hogg, PSI, p.416] We accept Pearson's conclusion, as the proof is beyond the level of this text.
- [Hogg, IMS, p.284] It is proved in a more advanced course that, as $n \to \infty$, $Q_{k-1}$ has an approximate $\chi^2(k-1)$ distribution.
- [Mood, p.445] We will not prove the above theorem, but we will indicate its proof for k=1.
- [DeGroot, p.626] In 1900, Karl Pearson proved the following result, whose proof will not be given here.
Note
- This theorem can also be proved via maximum likelihood estimation. See [Rice, p.341], [Roussas, Course, p.370], or [Spokoiny, p.205].
- This proof is similar to the approximation of the multinomial distribution by the multivariate normal distribution. See here.
- Some books state the Multivariate Central Limit Theorem without requiring $\Sigma$ to be positive definite (invertible). In that case, you can use [Hogg, IMS, p.202, thm.3.5.2] directly: applying that version of the theorem to the $X_j$'s of Part II gives $\sqrt{n}(\bar{X}_n - p) \xrightarrow{D} N_k(0, \Sigma)$. Then by [Hogg, IMS, p.202, thm.3.5.2] (with $A = [\, I_{k-1} \;\; 0 \,]$ and $b = 0$), $$\sqrt{n}(\bar{Z}_n - p^*) = A\sqrt{n}(\bar{X}_n - p) \xrightarrow{D} N_{k-1}(0, A\Sigma A^T) = N_{k-1}(0, \Sigma^*),$$ the same result as (****). (A numerical check that $A\Sigma A^T = \Sigma^*$ follows.)
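A small numerical sketch (my own; the choice of $p$ is arbitrary) of the last remark: with $A = [\,I_{k-1}\;\;0\,]$, the matrix $A\Sigma A^T$ is exactly the upper-left $(k-1)\times(k-1)$ block of $\Sigma$, i.e., $\Sigma^*$.

```python
import numpy as np

# A sketch (arbitrary p): A Sigma A^T = Sigma* where A = [I_{k-1} 0].
p = np.array([0.1, 0.2, 0.3, 0.4])
k = len(p)

Sigma = np.diag(p) - np.outer(p, p)
A = np.hstack([np.eye(k - 1), np.zeros((k - 1, 1))])

Sigma_star = np.diag(p[:-1]) - np.outer(p[:-1], p[:-1])
print(np.allclose(A @ Sigma @ A.T, Sigma_star))   # True
```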
References
- [Casella] Casella and Berger's Statistical Inference
- [DeGroot] DeGroot and Schervish's Probability and Statistics
- [Friedberg] Friedberg, Insel and Spence's Linear Algebra
- [Hogg, PSI] Hogg and Tanis's Probability and Statistical Inference
- [Hogg, IMS] Hogg, McKean and Craig's Introduction to Mathematical Statistics
- [Keeping] Keeping's Introduction to Statistical Inference
- [Mood] Mood, Graybill and Boes's Introduction to Theory of Statistics
- [Rice] Rice's Mathematical Statistics and Data Analysis
- [Roussas, Course] Roussas's A Course in Mathematical Statistics
- [Spokoiny] Spokoiny and Dickhaus's Basics of Modern Mathematical Statistics