{"title": "Covariance shrinkage for autocorrelated data", "book": "Advances in Neural Information Processing Systems", "page_first": 1592, "page_last": 1600, "abstract": "The accurate estimation of covariance matrices is essential for many signal processing and machine learning algorithms. In high dimensional settings the sample covariance is known to perform poorly, hence regularization strategies such as analytic shrinkage of Ledoit/Wolf are applied. In the standard setting, i.i.d. data is assumed, however, in practice, time series typically exhibit strong autocorrelation structure, which introduces a pronounced estimation bias. Recent work by Sancetta has extended the shrinkage framework beyond i.i.d. data. We contribute in this work by showing that the Sancetta estimator, while being consistent in the high-dimensional limit, suffers from a high bias in finite sample sizes. We propose an alternative estimator, which is (1) unbiased, (2) less sensitive to hyperparameter choice and (3) yields superior performance in simulations on toy data and on a real world data set from an EEG-based Brain-Computer-Interfacing experiment.", "full_text": "Covariance shrinkage for autocorrelated data\n\nDaniel Bartz\n\nDepartment of Computer Science\n\nTU Berlin, Berlin, Germany\n\ndaniel.bartz@tu-berlin.de\n\nklaus-robert.mueller@tu-berlin.de\n\nKlaus-Robert M\u00a8uller\n\nTU Berlin, Berlin, Germany\nKorea University, Korea, Seoul\n\nAbstract\n\nThe accurate estimation of covariance matrices is essential for many signal pro-\ncessing and machine learning algorithms. In high dimensional settings the sample\ncovariance is known to perform poorly, hence regularization strategies such as\nanalytic shrinkage of Ledoit/Wolf are applied. In the standard setting, i.i.d. data\nis assumed, however, in practice, time series typically exhibit strong autocorrela-\ntion structure, which introduces a pronounced estimation bias. 
Recent work by\nSancetta has extended the shrinkage framework beyond i.i.d. data. We contribute\nin this work by showing that the Sancetta estimator, while being consistent in the\nhigh-dimensional limit, suffers from a high bias in \ufb01nite sample sizes. We propose\nan alternative estimator, which is (1) unbiased, (2) less sensitive to hyperparame-\nter choice and (3) yields superior performance in simulations on toy data and on a\nreal world data set from an EEG-based Brain-Computer-Interfacing experiment.\n\n1\n\nIntroduction and Motivation\n\nCovariance matrices are a key ingredient in many algorithms in signal processing, machine learning\nand statistics. The standard estimator, the sample covariance matrix S, has appealing properties in\nthe limit of large sample sizes n: its entries are unbiased and consistent [HTF08]. On the other hand,\nfor sample sizes of the order of the dimensionality p or even smaller, its entries have a high variance\nand the spectrum has a large systematic error. In particular, large eigenvalues are overestimated and\nsmall eigenvalues underestimated, the condition number is large and the matrix dif\ufb01cult to invert\n[MP67, ER05, BS10]. One way to counteract this issue is to shrink S towards a biased estimator T\n(the shrinkage target) with lower variance [Ste56],\n\nCsh := (1 \u2212 \u03bb)S + \u03bbT,\n\nthe default choice being T = p\u22121trace(S)I, the identity multiplied by the average eigenvalue.\nFor the optimal shrinkage intensity \u03bb(cid:63), a reduction of the expected mean squared error is guaran-\nteed [LW04]. Model selection for \u03bb can be done by cross-validation (CV) with the known draw-\nbacks: for (i) problems with many hyperparameters, (ii) very high-dimensional data sets, or (iii)\nonline settings which need fast responses, CV can become unfeasible and a faster model selection\nmethod is required. 
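As a minimal numpy sketch of the convex combination above (the function name shrink and the setup are our own, not from the paper; T is the default scaled-identity target just described):

```python
import numpy as np

def shrink(S, lam):
    # Default target T = trace(S)/p * I: the identity scaled by the
    # average eigenvalue, as described in the text.
    p = S.shape[0]
    T = np.trace(S) / p * np.eye(p)
    # Convex combination C_sh = (1 - lam) * S + lam * T.
    return (1.0 - lam) * S + lam * T
```

Because T has the same trace as S, shrinkage compresses the eigenvalue spectrum toward its mean without changing the total variance.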
A popular alternative to CV is Ledoit and Wolf's analytic shrinkage procedure [LW04] and more recent variants [CWEH10, BM13]. Analytic shrinkage directly estimates the shrinkage intensity which minimizes the expected mean squared error of the convex combination with negligible computational cost, especially for applications which rely on expensive matrix inversions or eigendecompositions in high dimensions.

All of the above algorithms assume i.i.d. data. Real world time series, however, are often non-i.i.d. as they possess pronounced autocorrelation (AC). This makes covariance estimation in high dimensions even harder: the data dependence lowers the effective sample size available for constructing the estimator [TZ84]. Thus, stronger regularization λ will be needed. In Figure 1 the simple case of an autoregressive model serves as an example for an arbitrary generative model with autocorrelation.

Figure 1: Dependency of the eigendecomposition on autocorrelation. p = 200, n = 250.

The Figure shows, for three levels of autocorrelation (left), the population and sample eigenvalues (middle): with increasing autocorrelation the sample eigenvalues become more biased. This bias is an optimistic measure for the quality of the covariance estimator: it neglects that population and sample eigenbasis also differ [LW12]. Comparing sample eigenvalues to the population variance in the sample eigenbasis, the bias is even larger (right).

In practice, violations of the i.i.d. assumption are often ignored [LG11, SBMK13, GLL+14], although Sancetta proposed a consistent shrinkage estimator under autocorrelation [San08]. In this paper, we contribute by showing in theory, simulations and on real world data, that (i) ignoring autocorrelations for shrinkage leads to large estimation errors and (ii) for finite samples Sancetta's estimator is still substantially biased and highly sensitive to the number of incorporated time lags.
We propose a new bias-corrected estimator which (iii) outperforms standard shrinkage and Sancetta's method under the presence of autocorrelation and (iv) is robust to the choice of the lag parameter.

2 Shrinkage for autocorrelated data

Ledoit and Wolf derived a formula for the optimal shrinkage intensity [LW04, SS05]:

\lambda^\star = \frac{\sum_{ij} \mathrm{Var}(S_{ij})}{\sum_{ij} E[(S_{ij} - T_{ij})^2]}.   (1)

The analytic shrinkage estimator \hat{\lambda} is obtained by replacing expectations with sample estimates:

\mathrm{Var}(S_{ij}) \longrightarrow \widehat{\mathrm{Var}}(S_{ij}) = \frac{1}{n^2} \sum_{s=1}^{n} \Big( x_{is} x_{js} - \frac{1}{n} \sum_{t=1}^{n} x_{it} x_{jt} \Big)^2,   (2)

E[(S_{ij} - T_{ij})^2] \longrightarrow \hat{E}[(S_{ij} - T_{ij})^2] = (S_{ij} - T_{ij})^2,   (3)

where x_{it} is the t-th observation of variable i. While the estimator eq. (3) is unbiased even under a violation of the i.i.d. assumption, the estimator eq. (2) is based on

\mathrm{Var}\Big( \frac{1}{n} \sum_{t=1}^{n} x_{it} x_{jt} \Big) \overset{\mathrm{i.i.d.}}{=} \frac{1}{n} \mathrm{Var}(x_{it} x_{jt}).

If the data are autocorrelated, cross terms cannot be ignored and we obtain

\mathrm{Var}\Big( \frac{1}{n} \sum_{t=1}^{n} x_{it} x_{jt} \Big) = \frac{1}{n^2} \sum_{s,t=1}^{n} \mathrm{Cov}(x_{it} x_{jt}, x_{is} x_{js}) = \frac{1}{n} \mathrm{Cov}(x_{it} x_{jt}, x_{it} x_{jt}) + \frac{2}{n} \sum_{s=1}^{n-1} \frac{n-s}{n} \mathrm{Cov}(x_{it} x_{jt}, x_{i,t+s} x_{j,t+s}) =: \frac{1}{n} \Gamma_{ij}(0) + \frac{2}{n} \sum_{s=1}^{n-1} \Gamma_{ij}(s).   (4)

Figure 2 illustrates the effect of ignoring the cross terms for increasing autocorrelation (larger AR-coefficients, see section 3 for details on the simulation). It compares standard shrinkage to an oracle shrinkage based on the population variance of the sample covariance1.
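The plug-in estimator of eqs. (1)-(3) translates directly into code; the following is a sketch (our own function name; the clipping of the intensity to [0, 1] is a common practical safeguard, not part of eq. (1)):

```python
import numpy as np

def analytic_shrinkage_intensity(X):
    # X: p x n data matrix of mean-zero data, rows = variables.
    p, n = X.shape
    S = X @ X.T / n                          # sample covariance
    T = np.trace(S) / p * np.eye(p)          # scaled-identity target
    # eq. (2): Var-hat(S_ij) = 1/n^2 * sum_s (x_is x_js - S_ij)^2
    prods = X[:, None, :] * X[None, :, :]    # p x p x n array of x_is * x_js
    var_hat = ((prods - S[:, :, None]) ** 2).sum(axis=2) / n**2
    # eq. (3): E-hat[(S_ij - T_ij)^2] = (S_ij - T_ij)^2
    sq_err = (S - T) ** 2
    # eq. (1): ratio of the summed estimates, clipped to [0, 1]
    lam = var_hat.sum() / sq_err.sum()
    return float(min(1.0, max(0.0, lam)))
```

The estimate is invariant under a global rescaling of the data, since numerator and denominator both scale with the fourth power of the scale.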
1 calculated by resampling.

Figure 2: Dependency of shrinkage on autocorrelation. p = 200, n = 250.

The population variance of S increases because the effective sample size is reduced [TZ84], yet the standard shrinkage variance estimator eq. (2) does not increase (outer left). As a consequence, for oracle shrinkage the shrinkage intensity increases, for the standard shrinkage estimator it even decreases because the denominator in eq. (1) grows (middle left). With increasing autocorrelation, the sample covariance becomes a less precise estimator: for optimal (stronger) shrinkage more improvement becomes possible, yet standard shrinkage does not improve (middle right). Looking at the variance estimates in the sample eigendirections for AR-coefficients of 0.7, we see that the bias of standard shrinkage is only marginally smaller than the bias of the sample covariance, while oracle shrinkage yields a substantial bias reduction (outer right).

Sancetta-estimator An estimator for eq. (4) was proposed by [San08]:

\hat{\Gamma}^{San}_{ij}(s) := \frac{1}{n} \sum_{t=1}^{n-s} (x_{it} x_{jt} - S_{ij})(x_{i,t+s} x_{j,t+s} - S_{ij}),   (5)

\widehat{\mathrm{Var}}(S_{ij})^{San,b} := \frac{1}{n} \Big( \hat{\Gamma}^{San}_{ij}(0) + 2 \sum_{s=1}^{n-1} \kappa(s/b) \hat{\Gamma}^{San}_{ij}(s) \Big), \quad b > 0,

where \kappa is a kernel which has to fulfill Assumption B in [And91].
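For illustration, eq. (5) can be sketched as follows, using the simple truncated kernel (1 for s/b at most 1, 0 otherwise) as the choice of kappa; the function name is ours and this is a sketch, not Sancetta's reference implementation:

```python
import numpy as np

def sancetta_var_hat(X, b):
    # X: p x n (mean-zero). Returns the matrix Var-hat(S_ij)^{San,b} of
    # eq. (5) with the truncated kernel: lags s = 1..b enter with weight 1.
    p, n = X.shape
    S = X @ X.T / n
    prods = X[:, None, :] * X[None, :, :]        # x_it * x_jt, shape p x p x n
    centered = prods - S[:, :, None]
    # Gamma-hat^San_ij(s) = 1/n * sum_{t=1}^{n-s} centered[..., t] * centered[..., t+s]
    gamma0 = (centered ** 2).sum(axis=2) / n
    acc = gamma0.copy()
    for s in range(1, b + 1):
        gamma_s = (centered[:, :, : n - s] * centered[:, :, s:]).sum(axis=2) / n
        acc += 2.0 * gamma_s
    return acc / n
```

For b = 0 no lagged terms are included and the sketch reduces to the standard i.i.d. estimator of eq. (2).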
We will restrict our analysis to the truncated kernel \kappa_{TR}(x) = 1 for |x| \le 1, 0 otherwise, to obtain less cluttered formulas2. The kernel parameter b describes how many time lags are taken into account.

The Sancetta estimator behaves well in the high dimensional limit: the main theoretical result states that for (i) a fixed decay of the autocorrelation, (ii) b, n \to \infty and (iii) b^2 increasing at a lower rate than n, the estimator is consistent independently of the rate of p (for details, see [San08]). This is in line with the results in [LW04, CWEH10, BM13]: as long as n increases, all of these shrinkage estimators are consistent.

Bias of the Sancetta-estimator In the following we will show that the Sancetta-estimator is suboptimal in finite samples: it has a non-negligible bias. To understand this, consider a lag s large enough to have \Gamma_{ij}(s) \approx 0. If we approximate the expectation of the Sancetta-estimator, we see that it is biased downwards:

E[\hat{\Gamma}^{San}_{ij}(s)] \approx E\Big[ \frac{1}{n} \sum_{t=1}^{n-s} \big( x_{it} x_{jt} x_{i,t+s} x_{j,t+s} - S_{ij}^2 \big) \Big] \approx \frac{n-s}{n} \big( E^2[S_{ij}] - E[S_{ij}^2] \big) = - \frac{n-s}{n} \mathrm{Var}(S_{ij}) < 0.

Bias-corrected (BC) estimator We propose a bias-corrected estimator for the variance of the entries in the sample covariance matrix:

\hat{\Gamma}^{BC}_{ij}(s) := \frac{1}{n} \sum_{t=1}^{n-s} \big( x_{it} x_{jt} x_{i,t+s} x_{j,t+s} - S_{ij}^2 \big),

\widehat{\mathrm{Var}}(S_{ij})^{BC,b} := \frac{1}{n - 1 - 2b + b(b+1)/n} \Big( \hat{\Gamma}^{BC}_{ij}(0) + 2 \sum_{s=1}^{n-1} \kappa_{TR}(s/b) \hat{\Gamma}^{BC}_{ij}(s) \Big), \quad b > 0.   (6)

2 in his simulations, Sancetta uses the Bartlett kernel. For fixed b, this increases the truncation bias.
The estimator \hat{\Gamma}^{BC}_{ij}(s) is very similar to \hat{\Gamma}^{San}_{ij}(s), but slightly easier to compute. The main difference is the denominator in \widehat{\mathrm{Var}}(S_{ij})^{BC,b}: it is smaller than n and thus corrects the downwards bias.

2.1 Theoretical results

It is straightforward to extend the theoretical results on the Sancetta estimator ([San08], see summary above) to our proposed estimator. In the following, to better understand the limitations of the Sancetta estimator, we will provide a complementary theoretical analysis of the behaviour of the estimator for finite n.

Our theoretical results are based on the analysis of a sequence of statistical models indexed by p. X_p denotes a p × n matrix of n observations of p variables with mean zero and covariance matrix C_p. Y_p = R_p^T X_p denotes the same observations rotated in their eigenbasis, having diagonal covariance \Lambda_p = R_p^T C_p R_p. Lower case letters x^p_{it} and y^p_{it} denote the entries of X_p and Y_p, respectively3. The analysis is based on the following assumptions:

Assumption 1 (A1, bound on average eighth moment). There exists a constant K_1 independent of p such that

\frac{1}{p} \sum_{i=1}^{p} E[(x^p_{i1})^8] \le K_1.

Assumption 2 (A2, uncorrelatedness of higher moments).
Let Q_p denote the set of quadruples (i, j, k, l) of distinct integers. For all s,

\sum_{(i,j,k,l) \in Q_p} \mathrm{Cov}^2[ y^p_{i1} y^p_{j1},\, y^p_{k,1+s} y^p_{l,1+s} ] / |Q_p| = O(p^{-1})

and

\sum_{(i,j,k,l) \in Q_p} \mathrm{Cov}[ (y^p_{i1} y^p_{j1})^2,\, (y^p_{k,1+s} y^p_{l,1+s})^2 ] / |Q_p| = O(p^{-1})

hold.

Assumption 3 (A3, non-degeneracy). There exists a constant K_2 such that

\frac{1}{p} \sum_{i=1}^{p} E[(x^p_{i1})^2] \ge K_2.

Assumption 4 (A4, moment relation). There exist constants \alpha_4, \alpha_8, \beta_4 and \beta_8 such that

E[y_i^8] \le (1 + \alpha_8) E^2[y_i^4],   E[y_i^4] \le (1 + \alpha_4) E^2[y_i^2],
E[y_i^8] \ge (1 + \beta_8) E^2[y_i^4],   E[y_i^4] \ge (1 + \beta_4) E^2[y_i^2].

Remarks on the assumptions A restriction on the eighth moment (assumption A1) is necessary because the estimators eq. (2), (3), (5) and (6) contain fourth moments; their variances therefore contain eighth moments. Note that, contrary to the similar assumption in the eigenbasis in [LW04], A1 poses no restriction on the covariance structure [BM13]. To quantify the effect of averaging over dimensions, assumption A2 restricts the correlations of higher moments in the eigenbasis. This assumption is trivially fulfilled for Gaussian data, but much weaker (see [LW04]). Assumption A3 rules out the degenerate case of adding observation channels without any variance and assumption A4 excludes distributions with arbitrarily heavy tails.

Based on these assumptions, we can analyse the difference between the Sancetta-estimator and our proposed estimator for large p:

Theorem 1 (consistency under "fixed n"-asymptotics). Let A1, A2, A3, A4 hold.
We then have

\frac{1}{p^2} \sum_{ij} \mathrm{Var}(S_{ij}) = \Theta(1),

E\Big\| \frac{1}{p^2} \sum_{ij} \Big( \widehat{\mathrm{Var}}^{San,b}(S_{ij}) - \mathrm{Var}(S_{ij}) \Big) \Big\|^2 = \big( \mathrm{Bias}^{San,b} + \mathrm{Bias}^{San,b}_{TR} \big)^2 + O\Big( \sum_j \gamma_j^2 \Big/ \Big( \sum_j \gamma_j \Big)^2 \Big),

E\Big\| \frac{1}{p^2} \sum_{ij} \Big( \widehat{\mathrm{Var}}^{BC,b}(S_{ij}) - \mathrm{Var}(S_{ij}) \Big) \Big\|^2 = \big( \mathrm{Bias}^{BC,b}_{TR} \big)^2 + O\Big( \sum_j \gamma_j^2 \Big/ \Big( \sum_j \gamma_j \Big)^2 \Big),

where the \gamma_j denote the eigenvalues of C and

\mathrm{Bias}^{San,b} := -\frac{1}{p^2} \sum_{ij} \Big\{ \frac{1 + 2b - b(b+1)/n}{n} \mathrm{Var}(S_{ij}) - \frac{4}{n^3} \sum_{s=1}^{b} \sum_{t=n-s}^{n} \sum_{u=1}^{n} \mathrm{Cov}[x_{it} x_{jt}, x_{iu} x_{ju}] \Big\},

\mathrm{Bias}^{San,b}_{TR} := -\frac{1}{p^2} \sum_{ij} \frac{2}{n} \sum_{s=b+1}^{n-1} \frac{n-s}{n} \mathrm{Cov}[x_{it} x_{jt}, x_{i,t+s} x_{j,t+s}],

\mathrm{Bias}^{BC,b}_{TR} := -\frac{1}{p^2} \sum_{ij} \frac{2}{n - 1 - 2b + b(b+1)/n} \sum_{s=b+1}^{n-1} \mathrm{Cov}[x_{it} x_{jt}, x_{i,t+s} x_{j,t+s}].

3 We shall often drop the sequence index p and the observation index t to improve readability of formulas.

Proof. see the supplemental material.

Figure 3: Dependence of the variance estimates on the dimensionality. Averaged over R = 50 models. n = 250.

Remarks on Theorem 1 (i) The mean squared error of both estimators consists of a bias and a variance term. Both estimators have a truncation bias which is a consequence of including only a limited number of time lags into the variance estimation. When b is chosen sufficiently high, this term gets close to zero.
(ii) The Sancetta-estimator has an additional bias term which is smaller than zero in each dimension and therefore does not average out. Simulations will show that, as a consequence, the Sancetta-estimator has a strong bias which gets larger with increasing lag parameter b. (iii) The variance of both estimators behaves as O( \sum_i \gamma_i^2 / (\sum_i \gamma_i)^2 ): the more the variance of the data is spread over the eigendirections, the smaller the variance of the estimators. This bound is minimal if the eigenvalues are identical. (iv) Theorem 1 does not make a statement on the relative sizes of the variances of the estimators. Note that the BC estimator mainly differs by a multiplicative factor > 1, hence the variance is larger, but not relative to the expectation of the estimator.

3 Simulations

Our simulations are based on those in [San08]: We average over R = 50 multivariate Gaussian AR(1) models

\vec{x}_t = A \vec{x}_{t-1} + \vec{\epsilon}_t,

with parameter matrix4 A = \psi_{AC} \cdot I, with \psi_{no AC} = 0, \psi_{low AC} = 0.7, and \psi_{high AC} = 0.95 (see Figure 1). The innovations \epsilon_{it} are Gaussian with variances \sigma_i^2 drawn from a log-normal distribution with mean \mu = 1 and scale parameter \sigma = 0.5. For each model, we generate K = 50 data sets to calculate the standard deviations of the estimators and to obtain an approximation of p^{-2} \sum_{ij} \mathrm{Var}(S_{ij}).

4 more complex parameter matrices or a different generative model do not pose a problem for the bias-corrected estimator. The simple model was chosen for clarity of presentation.

Figure 4: Robustness to the choice of lag parameter b. Variance estimates (upper row), shrinkage intensities (middle row) and improvement over sample covariance (lower row). Averaged over R = 50 models. p = 200, n = 250.

Simulation 1 analyses the dependence of the estimators on the dimensionality of the data. The number of observations is fixed at n = 250 and the lag parameter b chosen by hand such that the whole autocorrelation is covered5: b_{no AC} = 10, b_{low AC} = 20 and b_{high AC} = 90. Figure 3 shows that the standard shrinkage estimator is unbiased and has low variance in the no AC-setting, but under the presence of autocorrelation it strongly underestimates the variance. As predicted by Theorem 1, the Sancetta estimator is also biased; its bias remains constant for increasing dimensionality. Our proposed estimator has no visible bias. For increasing dimensionality the variances of all estimators decrease. Relative to the average estimate, there is no visible difference between the standard deviations of the Sancetta and the BC estimator.

Simulation 2 analyses the dependency on the lag parameter b for fixed dimensionality p = 200 and number of observations n = 250. In addition to variance estimates, Figure 4 reports shrinkage intensities and the percentage improvement in absolute loss (PRIAL) over the sample covariance matrix:

\mathrm{PRIAL}\big( C^{\{pop., shr, San., BC\}} \big) = \frac{ E\|S - C\| - E\|C^{\{pop., shr, San., BC\}} - C\| }{ E\|S - C\| }.

The three quantities show very similar behaviour. Standard shrinkage performs well in the no AC-case, but is strongly biased in the autocorrelated settings. The Sancetta estimator is very sensitive to the choice of the lag parameter b. For low AC, the bias at the optimal b is small: only a small number of biased terms are included. For high AC the optimal b is larger, and the higher number of biased terms causes a larger bias.
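The bias-corrected estimator of eq. (6), whose behaviour these comparisons probe, differs from the Sancetta estimator mainly in its normalization; a minimal sketch with the truncated kernel (our own code, not the authors' implementation):

```python
import numpy as np

def bc_var_hat(X, b):
    # Bias-corrected estimator of eq. (6): lagged fourth-moment sums as in
    # Gamma-hat^BC, normalized by n - 1 - 2b + b(b+1)/n instead of n.
    p, n = X.shape
    S = X @ X.T / n
    prods = X[:, None, :] * X[None, :, :]            # x_it * x_jt

    def gamma_bc(s):
        # Gamma-hat^BC_ij(s) = 1/n sum_{t=1}^{n-s} (x_it x_jt x_{i,t+s} x_{j,t+s} - S_ij^2)
        if s == 0:
            g = (prods ** 2).sum(axis=2)
        else:
            g = (prods[:, :, : n - s] * prods[:, :, s:]).sum(axis=2)
        return (g - (n - s) * S ** 2) / n

    acc = gamma_bc(0)
    for s in range(1, b + 1):                        # truncated kernel: lags 1..b
        acc += 2.0 * gamma_bc(s)
    denom = n - 1 - 2 * b + b * (b + 1) / n          # smaller than n: corrects the bias
    return acc / denom
```

The denominator is positive and smaller than n for the moderate values of b considered here, which is exactly what corrects the downward bias of the n-normalized estimator.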
The BC-estimator is very robust: it performs well for all b large enough to capture the autocorrelation. For very large b its variance increases slightly, but this has practically no effect on the PRIAL. An interesting aspect is that our proposed estimator even outperforms shrinkage based on the population Var(S_ij) (calculated by resampling). This results from the correlation of the estimator \widehat{\mathrm{Var}}(S_{ij})^{BC,b} with the sample estimate eq. (3) of the denominator in eq. (1).

5 for b < 1, optimal in the no AC-setting, Sancetta and BC estimator are equivalent to standard shrinkage.

Figure 5: BCI motor imagery data for lag parameter b = 75 (upper row) and b = 300 (lower row). Averaged over subjects and K = 100 runs.

4 Real World Data: Brain Computer Interface based on Motor Imagery

As an example of autocorrelated data we reanalyzed a data set from a motor imagery experiment. In the experiment, brain activity for two different imagined movements was measured via EEG (p = 55 channels, 80 subjects, 150 trials per subject, each trial with ntrial = 390 measurements [BSH+10]). The frequency band was optimized for each subject and from the class-wise covariance matrices, 1-3 filters per class were extracted by Common Spatial Patterns (CSP), adaptively chosen by a heuristic (see [BTL+08]). We trained Linear Discriminant Analysis on log-variance features.

To improve the estimate of the class covariances on these highly autocorrelated data, standard shrinkage, Sancetta shrinkage, cross-validation and our proposed BC shrinkage estimator were applied.
The covariance structure is far from diagonal, therefore, for each subject, we used the average\nof the class covariances of the other subjects as shrinkage target [BLT+11]. Shrinkage is domi-\nnated by the in\ufb02uence of high-variance directions [BM13], which are pronounced in this data set.\nTo reduce this effect, we rescaled, only for the calculation of the shrinkage intensities, the \ufb01rst \ufb01ve\nprincipal components to have the same variance as the sixth principal component.\nWe analyse the dependency of the four algorithms on the number of supplied training trials. Figure 5\n(upper row) shows results for an optimized time lag (b = 75) which captures well the autocorre-\nlation of the data (outer left). Taking the autocorrelation into account makes a clear difference\n(middle left/right): while standard shrinkage outperforms the sample covariance, it is clearly out-\nperformed by the autocorrelation-adjusted approaches. The Sancetta-estimator is slightly worse\nthan our proposed estimator. The shrinkage intensities (outer right) are extremely low for standard\nshrinkage and the negative bias of the Sancetta-estimator shows clearly for small numbers of train-\ning trials. Figure 5 (lower row) shows results for a too large time lag (b = 300). The performance\nof the Sancetta-estimator strongly degrades as its shrinkage intensities get smaller, while our pro-\nposed estimator is robust to the choice of b, only for the smallest number of trials we observe a\nsmall degradation in performance. 
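The cross-subject shrinkage target and the rescaling of the leading principal components described above can be sketched as follows (function names and the numpy-based setup are ours; a sketch under the description in the text, not the authors' code):

```python
import numpy as np

def leave_one_subject_out_target(covs, subject):
    # covs: list of class covariance matrices, one per subject.
    # Target for 'subject' = average of all other subjects' covariances.
    others = [C for k, C in enumerate(covs) if k != subject]
    return sum(others) / len(others)

def rescale_top_components(S, k=5):
    # For the intensity computation only: rescale the first k principal
    # components to the variance of the (k+1)-th, damping the influence
    # of high-variance directions.
    w, V = np.linalg.eigh(S)                 # ascending eigenvalues
    w = w[::-1].copy()                       # descending order
    V = V[:, ::-1]
    w[:k] = w[k]                             # clip the top-k variances
    return (V * w) @ V.T                     # reassemble the matrix
```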
Figure 6 (left) compares our bias-corrected estimator to the four other approaches for 10 training trials: it significantly outperforms standard shrinkage and Sancetta shrinkage for both the larger (b = 300, p ≤ 0.01) and the smaller time lag (b = 75, p ≤ 0.05).

Figure 6: Subject-wise BCI classification accuracies for 10 training trials (left) and time demands (right). **/* := significant at p ≤ 0.01 or p ≤ 0.05, respectively.

Analytic shrinkage procedures optimize only the mean squared error of the covariance matrix, while cross-validation directly optimizes the classification performance. Yet, Figure 5 (middle) shows that for small numbers of training trials our proposed estimator outperforms CV, although the difference is not significant (see Fig. 6). For larger numbers of training trials CV performs better. This shows that the MSE is not a very good proxy for classification accuracies in the context of CSP: for optimal MSE, shrinkage intensities decrease with increasing number of observations. CV shrinkage intensities instead stay on a constant level between 0.1 and 0.15.
Figure 6 (right) shows that the three shrinkage approaches (b = 300) have a huge performance advantage over cross-validation (10 folds/10 parameter candidates) with respect to runtime.

5 Discussion

Analytic shrinkage estimators are highly useful tools for covariance matrix estimation in time-critical or computationally expensive applications: no time-consuming cross-validation procedure is required. In addition, it has been observed that in some applications, cross-validation is not a good predictor for out-of-sample performance [LG11, BKT+07]. Its speed and good performance have made analytic shrinkage widely used: it is, for example, state-of-the-art in ERP experiments [BLT+11]. While standard shrinkage assumes i.i.d. data, many real world data sets, for example from video, audio, finance, biomedical engineering or energy systems, clearly violate this assumption as strong autocorrelation is present. Intuitively this means that the information content per data point becomes lower, and thus the covariance estimation problem becomes harder: the dimensionality remains unchanged but the effective number of samples available decreases. Thus stronger regularization is required and standard analytic shrinkage [LW04] needs to be corrected.

Sancetta took the first step in this important direction by providing a consistent estimator under i.i.d. violations [San08].
In this work we analysed finite sample sizes and showed that (i) even apart from truncation bias, which results from including a limited number of time lags, Sancetta's estimator is biased, (ii) this bias is only negligible if the autocorrelation decays fast compared to the length of the time series and (iii) the Sancetta estimator is very sensitive to the choice of lag parameter.

We proposed an alternative estimator which is (i) both consistent and, apart from truncation bias, unbiased and (ii) highly robust to the choice of lag parameter: In simulations on toy and real world data we showed that the proposed estimator yields large improvements for small samples and/or suboptimal lag parameter. Even for optimal lag parameter there is a slight but significant improvement. Analysing data from BCI motor imagery experiments we see that (i) the BCI data set possesses significant autocorrelation, that (ii) this adversely affects CSP based on the sample covariance and standard shrinkage, that (iii) this effect can be alleviated using our novel estimator, which is shown to (iv) compare favorably to Sancetta's estimator.

Acknowledgments

This research was also supported by the National Research Foundation grant (No. 2012-005741) funded by the Korean government. We thank Johannes Höhne, Sebastian Bach and Duncan Blythe for valuable discussions and comments.

References

[And91] Donald WK Andrews. Heteroskedasticity and autocorrelation consistent covariance matrix estimation.
Econometrica: Journal of the Econometric Society, pages 817-858, 1991.

[BKT+07] Benjamin Blankertz, Motoaki Kawanabe, Ryota Tomioka, Friederike Hohlefeld, Klaus-Robert Müller, and Vadim V Nikulin. Invariant common spatial patterns: Alleviating nonstationarities in brain-computer interfacing. In Advances in Neural Information Processing Systems, pages 113-120, 2007.

[BLT+11] Benjamin Blankertz, Steven Lemm, Matthias Treder, Stefan Haufe, and Klaus-Robert Müller. Single-trial analysis and classification of ERP components - a tutorial. NeuroImage, 56(2):814-825, 2011.

[BM13] Daniel Bartz and Klaus-Robert Müller. Generalizing analytic shrinkage for arbitrary covariance structures. In C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1869-1877. Curran Associates, Inc., 2013.

[BS10] Zhidong Bai and Jack William Silverstein. Spectral Analysis of Large Dimensional Random Matrices. Springer Series in Statistics. Springer New York, 2010.

[BSH+10] Benjamin Blankertz, Claudia Sannelli, Sebastian Halder, Eva M Hammer, Andrea Kübler, Klaus-Robert Müller, Gabriel Curio, and Thorsten Dickhaus. Neurophysiological predictor of SMR-based BCI performance. Neuroimage, 51(4):1303-1309, 2010.

[BTL+08] Benjamin Blankertz, Ryota Tomioka, Steven Lemm, Motoaki Kawanabe, and Klaus-Robert Müller. Optimizing spatial filters for robust EEG single-trial analysis. Signal Processing Magazine, IEEE, 25(1):41-56, 2008.

[CWEH10] Yilun Chen, Ami Wiesel, Yonina C Eldar, and Alfred O Hero. Shrinkage algorithms for MMSE covariance estimation. Signal Processing, IEEE Transactions on, 58(10):5016-5029, 2010.

[ER05] Alan Edelman and N. Raj Rao. Random matrix theory. Acta Numerica, 14:233-297, 2005.

[GLL+14] Alexandre Gramfort, Martin Luessi, Eric Larson, Denis A.
Engemann, Daniel Strohmeier, Christian Brodbeck, Lauri Parkkonen, and Matti S. Hämäläinen. MNE software for processing MEG and EEG data. NeuroImage, 86(0):446-460, 2014.

[HTF08] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning. Springer, 2008.

[LG11] Fabien Lotte and Cuntai Guan. Regularizing common spatial patterns to improve BCI designs: unified theory and new algorithms. Biomedical Engineering, IEEE Transactions on, 58(2):355-362, 2011.

[LW04] Olivier Ledoit and Michael Wolf. A well-conditioned estimator for large-dimensional covariance matrices. Journal of Multivariate Analysis, 88(2):365-411, 2004.

[LW12] Olivier Ledoit and Michael Wolf. Nonlinear shrinkage estimation of large-dimensional covariance matrices. The Annals of Statistics, 40(2):1024-1060, 2012.

[MP67] Vladimir A. Marčenko and Leonid A. Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457, 1967.

[San08] Alessio Sancetta. Sample covariance shrinkage for high dimensional dependent data. Journal of Multivariate Analysis, 99(5):949-967, May 2008.

[SBMK13] Wojciech Samek, Duncan Blythe, Klaus-Robert Müller, and Motoaki Kawanabe. Robust spatial filtering with beta divergence. In L. Bottou, C.J.C. Burges, M. Welling, Z. Ghahramani, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1007-1015. 2013.

[SS05] Juliane Schäfer and Korbinian Strimmer. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Statistical Applications in Genetics and Molecular Biology, 4(1):1175-1189, 2005.

[Ste56] Charles Stein. Inadmissibility of the usual estimator for the mean of a multivariate normal distribution. In Proc. 3rd Berkeley Sympos. Math. Statist. Probability, volume 1, pages 197-206, 1956.

[TZ84] H. Jean Thiébaux and Francis W. Zwiers. The interpretation and estimation of effective sample size. Journal of Climate and Applied Meteorology, 23(5):800-811, 1984.
", "award": [], "sourceid": 840, "authors": [{"given_name": "Daniel", "family_name": "Bartz", "institution": "TU Berlin"}, {"given_name": "Klaus-Robert", "family_name": "M\u00fcller", "institution": "TU Berlin"}]}