ARTICLE IN PRESS

YNIMG-11553; No. of pages: 12; 4C: 5, 7, 8, 9,10,11

Neurolmage xxx (2014) xxx-xxx

Group-PCA for very large fMRI datasets

Stephen M. Smith a'*, Aapo Hyvarinen b, Gael Varoquauxc, Karla L. Miller a, Christian F. Beckmann d,a

a FMRIB (Oxford University Centre for Functional MRI of the Brain), University of Oxford, UK b Dept of Computer Science, University of Helsinki, Finland c Parietal Team, INRIA Saclay-Ile-de-France, Saclay, France

d Donders Institute for Brain, Cognition and Behaviour, Radboud University Nijmegen, The Netherlands

ARTICLE INFO ABSTRACT

Increasingly-large datasets (for example, the resting-state fMRI data from the Human Connectome Project) are demanding analyses that are problematic because of the sheer scale of the aggregate data. We present two approaches for applying group-level PCA; both give a close approximation to the output of PCA applied to full concatenation of all individual datasets, while having very low memory requirements regardless of the number of datasets being combined. Across a range of realistic simulations, we find that in most situations, both methods are more accurate than current popular approaches for analysis of multi-subject resting-state fMRI studies. The group-PCA output can be used to feed into a range of further analyses that are then rendered practical, such as the estimation of group-averaged voxelwise connectivity, group-level parcellation, and group-ICA.

© 2014 Elsevier Inc. All rights reserved.

Article history: Accepted 25 July 2014 Available online xxxx

Keywords: fMRI PCA ICA

Big data

Introduction

Many branches of science are increasingly needing to tackle problems associated with the increasing scale of datasets. A common approach for identifying the important information within large amounts of data is to identify common patterns, achieving intelligent data reduction. A generic approach for this - one that is often used to reduce data to its dominant constituents - is principal component analysis (PCA). However, even such a simple approach is computationally challenging for very large datasets. Here we are specifically interested in the use of resting-state functional magnetic resonance imaging (rfMRI), a powerful and popular approach for studying functional brain networks. Analysis of a multi-subject rfMRI imaging study often begins at the group level, for example, estimating group-averaged functional connectivity across all subjects in a resting-state fMRI (rfMRI) study. For very large datasets, this can become problematic, as the computational expense and/or memory requirements for many analysis methods increase as the number of subjects increases, and so can quickly become impractical. Several important analysis approaches can be applied in a computationally practical way if the dataset can first be reduced by a group-level PCA.

For example, one of the most widely used methods for analysing such data is independent component analysis (ICA), which identifies multiple distinct networks simultaneously from a given dataset. ICA is often applied with relatively low dimensionality (the number of independent

* Corresponding author. Fax: +441865 222717. E-mail address: steve@fmrib.ox.ac.uk (S.M. Smith).

components), resulting in relatively extended networks being estimated. ICA can also be used to obtain a detailed data-driven functional parcellation, when applied with relatively high dimensionality, e.g., when more than 50 ICA components are estimated. In both cases, if multiple subjects are to be co-analysed (e.g., in order to compare the resting-state networks between subjects), most researchers begin by carrying out a (low- or high-dimensional) ICA on the group dataset as a whole. The groupwise ICA components can then be mapped back onto individual subjects in order to allow for cross-subject network comparisons.

For such data-driven decompositions of multi-subject datasets, the first (and most computationally intensive) stage in the analysis is normally to reduce the entire dataset down to a set of "group-average" spatial eigenvectors, using principal component analysis (PCA, based on singular value decomposition or SVD). The most natural approach here is to temporally demean and concatenate all subjects' datasets, and apply PCA — effectively treating all the data as if it were one single huge dataset. For an n-dimensional group-ICA, the resulting n strongest spatial eigenvectors can then be fed into the core ICA unmixing algorithm, which identifies the set of group-average spatially-independent components. This last stage is not in general computationally burdensome, because the data now comprises at most a few hundred spatial eigenvectors — the dataset has been reduced from the size of (voxels x timepoints x subjects) to (voxels x n).

However, as increasingly large datasets/databases are being generated, with large numbers of timepoints and voxels in each subject's dataset, and very large numbers of subjects being combined, there are major computational challenges associated with group-level data-driven analyses. Current approaches for combining all subjects' datasets, and applying basic dimensionality reduction, cannot be run using the computational facilities available to most researchers, primarily because

http://dx.doi.org/10.1016/j.neuroimage.2014.07.051 1053-8119/© 2014 Elsevier Inc. All rights reserved.

ARTICLE IN PRESS

2 S.M. Smith et al. / Neurolmage xxx (2014) xxx-xxx

of the very large memory requirements. There are two mathematically equivalent ways to estimate the PCA from the temporally-concatenated data from all subjects: a) estimate for each subject the voxels x voxels covariance matrix of temporal correlations (this is very large, as the number of voxels is large), and then average over subjects; or, b) estimate the (subjects x timepoints) x (subjects x timepoints) covariance matrix of spatial correlations, which will be very large if the number of subjects or number of individual-session timepoints is large.

Existing approaches to this computational problem include the initial reduction of each subject's dataset to a (normally comparatively small) number of spatial eigenvectors, before these are then combined across subjects, and a final group-wise PCA computed (Calhoun et al., 2001). However, this approach has potential limitations: a) There can be a significant loss of accuracy (and even bias in final cross-subject comparisons) by virtue of the within-subject reduction; for example, the within-subject dimensionality reduction is variance-greedy, and is not able to de-prioritise potentially strong subject-specific artefacts versus group-common (potentially less strong) components. b) The amount of memory required is proportional to the number of subjects analysed, and hence can still exceed available resources, with large numbers of subjects analysed.

Here we present two group-PCA approaches which generate an accurate approximation to PCA applied to full temporal concatenation of all subjects, in both cases avoiding the reduction of individual subjects to a small number of principal components. Both methods' memory requirements do not increase with increasing numbers of subjects analysed, and overall execution time scales linearly with the number of subjects. We validate the accuracy and computational effectiveness of the approaches using rich simulations, comparing against alternative approaches, with favourable results. We also validate against an extremely large rfMRl dataset from the Human Connectome Project (HCP), utilising a 24-core compute server with 1.25 TB of RAM to calculate the "gold standard" empirical result, achieving an accuracy of greater than 99.99%.

The first of the new approaches for large group-level PCA (referred to as MIGP - MELODlC's Incremental Group-PCA), was developed specifically with rfMRl in mind, and the validations described below use data that attempts to match the characteristics (dimensions, intrinsic rank, etc.) of rfMRl data. Since the original development and validation of MlGP, we have become aware of closely related approaches being developed in computer science, where thorough theoretical investigations of the mathematical properties have been reported (with very positive conclusions regarding the accuracy of these approaches) (Baker et al., 2012; Rehurek, 2010). The HCP has already started disseminating group-average "dense connectomes" (full voxelwise/vertexwise correlation matrices) calculated using MIGP, which is now released as part of the MELODIC ICA tool in FSL (FMRIB Software Library, www.fmrib.ox.ac.uk/fsl).

The second approach (referred to as SMIG — Small-Memory Iterated Group-PCA) is a distinct approach, building on top of the original method proposed in Hyvarinen and Smith (2012), and achieving increased accuracy compared with the original approach, by iterating the main estimation of group-average eigenvectors several times. While both MIGP and SMIG have very low-memory requirements, and are accurate, MIGP may be more time-efficient if file I/O dominates total compute time, and SMIG would be more time-efficient otherwise.

These approaches will hopefully also be of value for analysing large datasets from other modalities (and branches of science), particularly in this era of "Big Data"; however, here we have specifically tailored our detailed simulations to the characteristics of resting-state fMRl data. Note that we have specified resting-state specifically rather than also task-fMRl, because data reduction in the latter case is more straightforward computationally for large studies — in the cases where one does not have to account for the random temporal phase across datasets.

Method

Background

Each subject's dataset comprising t timepoints and v voxels can be represented as a 2-dimensional space-time matrix Y(t x v). We assume that the data has already been preprocessed to remove artefacts and align the data spatially into a standard space (co-ordinate system), so that the voxels are anatomically compatible across all subjects. We also assume that each voxel's timeseries has been demeaned (and potentially variance-normalised (Beckmann and Smith, 2004)).

lf we were to carry out a single-subject n-dimensional spatial-lCA, we would first apply a PCA, or equivalently the singular value decomposition, for example, representing the data as:

Y\txv)~U\txn) x S(nxn) x V(nxv); 0)

where n is typically much smaller than t, U is the set of temporal eigenvectors, V is the set of spatial eigenvectors, and the eigenvalues (components' strengths) lie on the diagonal ofS (the above assumes that the t-n weakest components have already been discarded). ICA is then applied to the matrix V, estimating a new set of spatial maps, which are linear combinations of the maps in V, and which are maximally independent of each other.

ln the case of a multi-subject dataset, a natural way to generate a group-averaged set of spatial eigenvectors (to feed into analyses such as group-average parcellation or group-lCA) is to temporally concatenate all s subjects' datasets, and apply PCA + ICA as described above, or, equivalently, to average the v x v matrices of temporal covariances across all subjects. The resulting PCA-based approximation will then be the same as above, but now with n x s "timepoints" in the temporally concatenated data.

Unfortunately, with large datasets and large numbers of subjects, it becomes impractical to form this full concatenated dataset and then run a full PCA; memory limitations and/or computational time become prohibitive. Various approaches have been suggested previously for large multi-subject fMRl datasets, and we now describe the most relevant prior work.

Calhoun et al. (2001) suggested approximating temporal concatenation by first reducing each subject's dataset to the top m spatial eigenvectors, then concatenating these eigenvectors (estimated separately for each subject) across all subjects, before running a final PCA to further reduce this dataset to the top n eigenvectors, to be fed into ICA. Typically m = n, meaning that each subject is reduced to the same number of PCA components as the final group-average ICA will estimate. Although using a small value for m limits the memory requirements for these operations, the data size does scale linearly with the number of subjects, which can eventually become impractically large. Furthermore, important information may be lost unless m is relatively large (which in general is not the case when using this approach); information may be hard to estimate at the single-subject level, but could be estimable and important at the group level.

MIGP - MELODlC's incremental Group-PCA

MIGP is an incremental approach that aims to provide a very close approximation to full temporal concatenation followed by PCA, but without the large memory requirements. The high accuracy is achieved by never reducing individual subjects' datasets to small numbers of PCA components. The incremental approach keeps an "internal" PCA space of m weighted spatial eigenvectors, where m is typically larger than the number of timepoints in each individual dataset. By "weighted", we mean that the eigenvalues (component strengths) are incorporated into the matrix of spatial eigenvectors. The final set of m components, representing the group-average (or temporally concatenated) PCA output, can then be reduced to the desired dimensionality n by simply

ARTICLE IN PRESS

S.M. Smith et al. / Neuroimage xxx (2014) xxx-xxx

keeping the top n components, and, if required, discarding the weightings (eigenvector amplitudes). We now describe the approach in more detail.

Start by temporally concatenating a number of subjects' datasets such that the number of combined timepoints is larger than m. Typically we might set m = 2 t, and begin by concatenating 2 datasets. This dataset is then fed into m-dimensional PCA, and the matrix W(m x v) = S(m x m) x V(m x v) is kept. Note that this is not just the set of (unweighted) eigenvectors; each is weighted by its respective eigenvalue, which is an important distinction — the eigenvalues, characterizing the strength of an effect, are here used to scale the eigenmaps (spatial eigenvectors) so that the strength is retained in the reduced subspace. W becomes a running estimate of the final group-average set of spatial eigenvectors, and can be considered a pseudo-timeseries matrix of m "timepoints" and v voxels. For each additional subject's dataset in turn, we now incrementally update W, by concatenating W with each dataset Yi, and applying PCA to reduce back to an updated W, again with just the top m components kept (see Appendices A and B). Because the eigenvalue weights are kept in W, this approach automatically achieves the "right" balance (relative weighting) between W and each new dataset — where "right" means that the relative weighting between the running estimate and new datasets matches what would have happened in the full-temporal-concatenation scenario. In other words, this retains the overall variance of each batch of data.1

A computationally efficient approach for estimating the top m weighted eigenvectors is to first estimate the "timepoints" x "timepoints" covariance matrix, apply eigenvalue decomposition to extract the top m temporal eigenvectors (e.g., using the efficient eigs function in MATLAB), and then multiply these eigenvectors into the original data matrix to obtain the weighted spatial eigenvectors. This avoids the need to estimate the very large voxels x voxels covariance matrix.

MIGP does not increase at all in memory requirement with increasing numbers of subjects, as no large matrices are ever formed, and the computation time scales linearly with the number of subjects. It is easily parallelisable, simply by applying the approach in parallel to subsets of subjects, and then combining across these with the same "concatenate and reduce" approach described above. (This is mathematically almost identical to the fully-serial algorithm, particularly for large numbers of subjects.)

SMIG - Small-Memory Iterative Group-PCA

Hyvarinen and Smith (2012) suggested a very fast and low-memory method which rotated each subject's data matrix using a rotation derived from the correlation between the original data matrix and the group-averaged data matrix (i.e., the pure mean of all raw timeseries matrices from all subjects). All subjects' rotated data matrices can then be averaged, and PCA applied, without ever needing to form large concatenated data matrices of timeseries or eigenvectors — although two "passes through" all of the original datasets is necessary.

Building on this, and noting that the initial group-average data matrix will not have intrinsically "high resting-state correlation CNR"

1 It is easy to show that the covariance of concatenated datasets equals the sum of the covariances of the individual datasets (even if they have different lengths and overall variance). Given the decomposition Y = USV' = UW (where W is the weighted spatial eigenvectors), the (voxels x voxels) covariance is W'U'UW = W'W. Hence it is clear that when taking multiple (previously concatenated) datasets and concatenating these with a new single dataset, the concatenation followed by covariance estimation is equivalent to summing the two separate W'W estimates (one for the previously combined multiple datasets and one for the new dataset) — hence no change in the relative weighting of the two separate covariances is necessary in order for the estimate of covariance to be equivalent to the temporal concatenation of a new subject onto a previously concatenated set of subjects. An alternative way of establishing that the concatenation of W with a new data set Yi is using the correct weighting is to note that, due to U being an orthonormal basis, any eigenvalue/vector pair {Sj, Vj} of the covariance matrix Cov( Yf) is also an eigenvalue/ vector pair of UCov(W)U', and therefore transforms into an eigenvalue/vector pair {Sj,U' Vj} of the covariance matrix Cov(W).

(due to the random phase of RSN timeseries in different datasets being averaged), we have developed an iterated approach to the estimation of the group-averaged spatial eigenvectors. We now describe the mathematical justification for this method, showing that the iterations converge towards the same output as PCA applied to full temporal concatenation.

Denote by Yi the data matrix of the i-th subject in a group of s subjects. We want to approximate the group data by a single set of (group) spatial patterns collected as the n rows of the matrix W, as well as the individual time course matrices Mi. This is accomplished by solving the following optimization problem:

W,MiYl "Yi(txV)-Mi(txn)W(nxv)H

with respect to all the time course matrices and the single spatial pattern matrix.

This formulation is equivalent to temporally concatenating the group data into a big matrix and approximating it by a low-rank matrix, but for notational convenience, we do not form such a big matrix. In general, the problem could be solved in the concatenated matrix formulation simply by applying SVD. However, we consider here a simpler and more efficient optimization approach which does not need explicit temporal concatenation.

Our approach is based on two principles. First, we use an alternating variables optimization of the Mi and W in the objective function in Eq. (2) above, which leads to surprisingly simple iteration steps. Second, we divide the computations into two stages: In the first stage, we solve the objective above for a larger number of spatial patterns m than the number n that we want as the final output, and in the second stage, we reduce the number from m to n by ordinary SVD/EVD computation. The justification for this two-stage procedure is similar to MIGP: In the first stage, we compute only a rough approximation of the m-dimensional reduced representation because each iteration may be quite slow due to the high dimensionality of the data and/or slow access to the data matrices Yi, and thus we want to preserve more information in the representation than just the n dimensions.

There is clearly some indeterminacy in the optimization problem, since we could multiply W from the left by any invertible matrix, and multiply all the Mi from the right by the inverse of that matrix, without affecting the value of the objective function. To reduce this indeterminacy, we constrain the concatenated form of Mi, M = (Mf, ...,MTS j to have orthogonal columns of unit norm, i.e.

EM^Mi = i.

We impose this instead of the more typical constraint WWT = I, also found in ICA, because this constraint enables the computation of W in such a form that the singular values (variances of different directions) are preserved. This is crucial if we want to further reduce the dimension in a second stage SVD/EVD.

Now, we can solve the optimization problem by an alternating variables approach: optimizing the function first with respect to Mi, then with respect to W, and iterating this.

To initialize the algorithm, we propose to take the average over the Yi, i.e.

This implies choosing the first-stage dimension m to be equal to the number of timepoints (number of rows in Y'). If we choose to use a smaller m, we can simply reduce the dimension of this initial value by ordinary PCA, without changing anything in what follows.

ARTICLE IN PRESS

S.M. Smith et al./ Neurolmage xxx (2014 ) xxx-xxx

Given some value of W, the optimal Mi is given by first computing

and then projecting M on the constraint set by orthogonalizing it as

Mi -M^M, TMg for all i

This is because by orthogonality of Mi, the objective is equal to tr( EiM[YiWT) plus terms which are constant with respect to Mi on

the constraint set. In general, to maximize an objective tr(M Z

under the orthogonality constraint (i.e. on the Stiefel manifold), the projection of its gradient Z on the tangent space of the Stiefel manifold should vanish, which means Z-MZTM = 0, and this holds for the Mi computed above.

Likewise, given all the Mi, the optimal W can be found by computing

which follows from basic linear algebra since the pseudoinverse of M is equal to MT due to orthogonality.

The initialization in Eq. (3) and the subsequent three formulae in Eqs. (4), (5), and (6) above give the iterations of the (first stage of) SMIG. Thus, we have shown that SMIG is a principled method for finding the optimal approximation of the group data in terms of Mi and W. In other words, SMIG is an iterative method for solving the SVD of the temporally concatenated data matrix, since the two problems are equivalent. In the limit of an infinite number of iterations, it will converge to the true SVD of the data. If we take only a single iteration of the algorithm we have the method proposed in Hyvarinen and Smith (2012).

We can also interpret these iterations as a power method. Grouping the formulae together, we have

the same approach cannot be applied using lCA spatial maps. Hence, if we calculate the matrix of the weighted eigenvectors W, we can then later trivially estimate from that the v x v covariance matrix, and from this the closely-related dense connectome (correlation matrix).

This simple result is not surprising if we consider W to comprise all voxels' pseudo-timeseries — if two voxels have similar timeseries (on average in the group), they will also have similar weights across multiple spatial eigenvectors.

Naturally the estimation of the dense connectome implies having enough RAM to estimate the v x v covariance matrix; part of the original argument for needing more RAM-efficient methods was to avoid ever having to estimate such a potentially large quantity (given that one simple way of estimating a group-PCA is to first average all subjects' individual v x v covariance matrices and then perform an eigenvalue decomposition). However, the above approach (for estimating the dense connectome) only requires a single copy of this large matrix, whereas a running sum/average would require at least two copies, i.e., at least doubling the RAM requirements. Secondly, if the purpose of the group-level analysis is indeed to carry out group-PCA (and not just estimate the group-level dense connectome), then the RAM requirement for passing the group-average dense connectome into an eigenvalue decomposition would be much larger than that just required to store a single copy of the dense connectome, whereas the PCA components estimated by our methods do not ever require the formation of this matrix, or the running of a large-RAM PCA calculation.

Using group-PCA output to carry out group-level parcellation

For similar reasons, we can also feed the W matrix of pseudo-timeseries into clustering algorithms, to achieve group-level spatial parcellation on the basis of the "temporal" similarity of voxels. Standard parcellation methods can be fed either from the raw W matrix, or from the W'W correlation matrix, depending on the algorithm to be applied.

Empirical evaluations

which should be followed by multiplying W from the left by the matrix orthogonalizing M. The matrix in brackets is in fact the covariance matrix (which is very large — voxels x voxels covariance matrix, but in practice we do not need to compute this) of the temporally concatenated data (up to a multiplicative constant). Multiplying it repeatedly by W leads to the power method, the most fundamental numerical algorithm for computing the dominant eigenvectors. However, our normalisation of W is different from that of usual implementations of the power method, and particularly suitable for our two-stage method.

Using group-PCA output to generate the "dense connectome"

It is now trivial to show that one can easily estimate the large group-average "dense connectome" ("space" by "space" matrix of voxel-level temporal correlations) from the PCA output, should this be desired — starting conceptually from the fully-temporally-concatenated dataset Y. Forming the dense connectome is not necessary if one only wishes to run group-level lCA, because normally all that is passed onto spatial-lCA, after the initial PCA, is the group-PCA spatial eigenvectors, as described above. However, if we do want to estimate the group-average v x v covariance matrix (dense connectome), we have:

covariance(Y ) = Y'Y KVS'U'USV' = VS'SV' = W'W.

where W can be the weighted spatial eigenvectors output by (e.g.) MIGP. Thus we do not need to estimate any temporal eigenvectors. Note that ICA component timeseries are not in general orthogonal, so

We compare several group-PCA approaches:

• TemporalConcat — full temporal concatenation of all subjects' datasets, followed by PCA.

• MIGP — as described above.

• SMIG — as described above. We need to specify m (the internal dimensionality retained) and a (the number of iterations).

• GIFT — the method of Calhoun et al. (2001), concatenating within-subject PCA outputs across subjects, followed by a final PCA. We also test GIFT with an internal dimensionality m double that of the final output dimensionality; this keeps more subject-specific detail than is normally done by default, at the cost of increased RAM requirement.

• MeanProjection — an approach utilised in old MELODIC software versions, that projects individual datasets onto a PCA-reduced version of the group-average-data, concatenates the results across subjects, and then reduces down with a final group-level PCA. ln the context of SMIG, this could be seen as a similar approach, but without the temporal rotation that brings datasets "into phase" with each other.

We now present various results: simple calculation of RAM requirements for the different methods; accuracy results from a range of simulations; and accuracy results for MIGP, on a large real dataset.

RAM requirements — different methods

We first estimated the RAM required by several group-PCA methods, as a function of number of timepoints, voxels, subjects and estimated dimensionality. For each method: the size of the largest "timeseries" data matrix formed is estimated (in some cases, for example, this might be voxels x timepoints, in others, voxels x timepoints x subjects, and in others, voxels x subjects x dimensions). Additionally, the largest

ARTICLE IN PRESS

S.M. Smith et al. / Neurolmage xxx (2014) xxx-xxx 5

data covariance formed is estimated (either voxels x voxels, or timepoints x timepoints, depending on the method). Simple testing in MATLAB shows that running the most efficient SVD on a large data co-variance matrix requires approximately double the total RAM than is needed to just store the covariance matrix.

For our purposes then, we plot whichever is the larger of the two quantities; the largest timeseries data matrix formed, and the doubled covariance matrix size. Fig. 1 shows these maximum-RAM estimates for a range of methods, under a number of different study scenarios. We use approximate/typical values for 4 study scenarios: typical small imaging studies; 1000 datasets from the "thousand functional connectomes" (KFC) data (Biswal et al., 2010); 1200 subjects (that will eventually be available) from the Human Connectome Project (Van Essen et al., 2013), with a large number of timepoints per subject; and 100,000 subjects' datasets that will eventually be acquired by the UK Biobank Imaging study (www.ukbiobank.ac.uk), if it runs to completion as hoped.

The voxel counts are either: 25,000 (number of brain voxels in MNI standard space, working at the rather crude resolution of 4 mm); 200,000 (2 mm MNI-space brain voxels); and 100,000 (the approximate number of grayordinates in the HCP standard co-ordinate system, with approximate 2 mm spacing between surface vertices and subcortical voxels).

Full temporal concatenation does not continue to rise greatly for the larger datasets, as might have been expected; this is because full temporal concatenation is mathematically equivalent to an approach based around averaging the (very large) within-subject voxels x voxels covari-ance matrices across all subjects. Therefore, once the dataset becomes large enough, this latter approach becomes the more RAM-efficient option, and hence the resulting RAM requirements stop increasing with the number of subjects.

It can be seen that for none of these scenarios do the MIGP/SMIG methods require large amounts of RAM — indeed never more than around 8 GB.

Simulations - methods

Our primary evaluations were based on simulated datasets containing a hierarchy of effects, including structured signal and artefact in the dataset for each simulated subject, and subject variability (including outlier effects). We generated several distinct simulations, spanning a range of different scenarios as described below.

The core of the simulation is to randomly generate a number of (group-level ground truth) spatial maps (that will be modulated by randomly distinct timecourses), which in the simplest simulations will be exactly the same sets of maps for all subjects. The maps are randomly generated and somewhat sparse, with a unity-standard-deviation Gaussian random distribution embedded on top of values that are 0 in most voxels, and 5 in a minority of voxels. We tested a range of methods for defining these spatial maps, including pure Gaussian "noise", and the final results depended very little on the method chosen. The ground-truth spatial maps may all have the same strength (amplitude), or may have controllably variable amplitudes relative to each other.

Two sets of these spatial maps are created, representing two distinct subject groups (sub-populations). The extent to which the two sets of spatial maps are similar is controllable, varying from being identical (group-difference = 0, i.e., we have just one population of subjects) to being totally unrelated (group-difference = 1, i.e., we have two groups of totally different subjects). The number of subjects formed in the two groups can vary, from having no subjects in group 2 (again — we have just one population of subjects), to having a single subject in group 2 (i.e., we have a single-subject outlier), to having multiple subjects in the second group.

Individual subjects' datasets are then created on the basis of the "group-level" maps defined above. Each subject within a given group has spatial maps with controllable similarity to the group-maps defined above; each map for each subject has a controllable amount of subject-specific variation added, and is then modulated by a single random timecourse (taking the outer product of the spatial map and a random

timepoints 200 200 4800 360

voxels 25,000 200,000 25,000 200,000 100,000 100,000 200,000

subjects 20 1000 1200 100,000

dimensions 30 30 100 30 | 200 30 | 200 200

small study small study KFC KFC HCP UK Biobank UK Biobank

4mm 2mm 4mm 1 2mm grayordinates grayordinates 2mm MNI

Fig. 1. RAM required for different group-PCA methods, analysing a range of study scenarios, shown using a log scale. See main text for detailed description and comments.

ARTICLE IN PRESS

S.M. Smith et al. / Neurolmage xxx (2014) xxx-xxx

timecourse). All components are then added to form the full t x v data matrix. The subject-specific randomness can be added both to the spatial map (both its shape and amplitude) before temporal modulation, and after (the former therefore representing subject-specific alterations to the spatial map, compared with the group ground-truth, and the latter representing unstructured measurement white-noise). Additionally, a controlled number of subject-specific structured "artefact" components (meaning that these do not correspond to any group-average spatial effect) can be added.

Each method was evaluated using 4 primary measures — in all cases "higher is better":

1. Overall "TPR" (analogous to true positive rate). This is the percentage of the total variance of the full set of ground-truth spatial maps that can be explained by the eigenvectors output by a given group-PCA method. lt is calculated by projecting the former onto the latter, and then dividing the sum of squares of the result against the sum of squares of the former. If two groups of subjects were generated, the combined (v x m x 2) space of all ground-truth spatial maps is used.

2. TPR for group 1 only; the percentage of the variance of the ground-truth space of the spatial maps from group 1 explained by the group-PCA. In many cases this will be the "majority" group (as opposed to outlier subjects).

3. TPR for group 2; in many cases this will be the accuracy of estimation of outlier subjects.

4. Overall "1-FPR" (analogous to 1 — false positive rate). This is the fraction of the estimated total spatial map space which lives in the space of the ground-truth maps. A high value close to 1 means a low false positive rate — estimated spatial eigenvectors only reflect the correct underlying group-truths.

All TPR and 1-FPR values are reported as percentages. Each simulation is run 10 times with different randomised data; results are therefore shown as boxplots depicting the TPR and 1-FPR distributions over the 10 runs of each simulation.

Simulations — results

We now present results from a range of dataset scenarios, varying dataset dimensions, as well as the relative sizes of various forms of noise and subject variability. The full simulation parameter definitions are listed within each figure.

In addition to the range of SMIG parameters reported on below, we also tested the following parameter settings; however, because they resulted in no (or very little) improved performance compared with the settings reported on, we do not include any further reporting on these. 1) With an internal (working) dimensionality m smaller than the number of original session timepoints t, we found virtually no accuracy advantage in raising the number of iterations from 10 to 25, and so do not report on more than 10 iterations. 2) With m = t, there was never a significant improvement when moving from 5 to 10 iterations, so we do not report on higher than 5.3) With m = n (final output dimensionality), results were often worse than when keeping a larger number of internal components; a similar result was also found for m = 1.5n, though with less of an accuracy loss. For m = 2n and m = 3n the results were very similar, with a very slight improvement in a few scenarios using 3n (for only a small CPU-time increase), and so we only report on m = 3n and m = t below.

Variations in subject variability and white noise (no artefacts)

Fig. 2 shows results from relatively "high CNR, low variability" datasets (30 subjects, 200 timepoints). The first test has relatively low white noise and a low subject-variability of 0.1 (meaning the extent to which the underlying common spatial maps vary across subjects). All methods perform well, being close to 100% accuracy in terms of both

TPR (finding the correct common maps) and 1-FPR (only estimating the common maps and nothing extra).

The second test increases the subject-variability to 0.3, which might be considered to be relatively high (although high-resolution fMRl data with no spatial smoothing in the preprocessing might be expected to exceed this). All methods are still doing well, with around ~>-/>94% accuracy. Interestingly, GIFT and MeanProjection perform very slightly better than the other methods. In the case of GIFT, this might be because of the potential "2-stage denoising" effect that could happen when the dataset exactly matches the assumptions and parameters being used by the method; given that GIFT is internally keeping exactly the correct number of components (here 10) in each within-subject PCA, and because this simulation has no subject-specific artefacts, then one expects that the estimated set of components for each subject should be close to perfect, achieving within-subject denoising effectively. Then, at the group level, when combining across the sets of within-subject components, the higher level of subject variability can be more effectively dealt with, given that within-subject noise has been suppressed.

The third test reverts to the lower level of subject variability, and increases the white-noise to 10. Now GIFT becomes less accurate than the best of the other methods, with MIGP and SMIG (after more than 2 iterations) performing very well.

Interactions between subject-specific artefacts and estimated dimensionality

Fig. 3 shows the effects of adding in subject-specific artefact components, and how the estimated dimensionality interacts with this. There are 10 common ("good") components in each subject's dataset, as well as 30 subject-specific "artefact" components (a realistic ratio (Griffanti et al., in press)). The difference between the 3 tests is the number of final estimated components, varying from being less than the true number (first test), equal (second test), and greater (third test). Not surprisingly, the TPR improves significantly when estimated dimensionality is raised, and 1-FPR falls.

The performance of the best of the methods is now lower than before, with these more challenging scenarios, with iterated SMlG and MIGP reaching ~>-/>63-88% accuracies (depending on the exact scenario). Note however that TemporalConcat (often considered the "gold standard" — but here we know the true gold standard set of maps that were fed into the simulations) performs no better than SMlG/MlGP.

GIFT performs poorly, only approaching the best of the other methods when the estimated number of components is larger than the true number, and when the internal dimensionality is double that of the default.

Interactions between outlier subject and estimated dimensionality

Fig. 4 shows the effects of adding in an outlier subject, which has a controllable, typically large, difference in its underlying non-artefact spatial maps, compared with the primary group of subjects. We also tested larger numbers of subjects in the second ("outlier") group, but the results from those were simply intermediate between the results reported here and the cases with no outliers, so we do not report further on those here. ln these tests no subject-specific artefacts were added.

ln the first test, with the outlier being different than the other subjects by a factor of 0.3 (i.e., "intermediate"), and when estimating the same number of components as are created for each subject (10), all methods perform similarly, with slightly improved accuracy of estimation of the outlier subject by MeanProjection. Presumably, in all cases, the estimated 10 components are driven almost entirely by the group of 30 homogeneous subjects, and the ~>-/>47% accuracy in estimating the outlier subject's maps is primarily driven by the extent to which its maps are similar to the other subjects.

The second test doubles the final estimated dimensionality, which therefore in theory could allow perfect estimation of both groups of subjects. In this case MeanProjection and GIFT perform poorly. The third test keeps this doubled final dimensionality, and now raises the

ARTICLE IN PRESS

S.M. Smith et al. / Neurolmage xxx (2014) xxx-xxx

varying good-component subject-variability & white-noise

subjects=30 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=10 component-strength-variability=0.5 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=0 artefact-strength=2.0 white-noise=2.0

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIG a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP TemporalConcat

10 20 30 40 50 60 70 80 90 TPR: % gold-standard variance correctly estimated

subject-variability=0.1

10 20 30 40 50 60 70 80 90 1-FPR: % estimated variance within gold-standard space

white-noise=2

subjects=30 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=10 component-strength-variability=0.5 subject-variability=0.3 subjectwise-component-strength-variability=0.5 artefact-components=0 artefact-strength=2.0 white-noise=2.0

MeanProjection SMIG a=1 m=t SMIG a=2 m=t

SMIG a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP TemporalConcat

+ № + №

+ a + »

+ о +o

+m + »

0 10 20 30 40 50 60 70 80 90 0 TPR: % gold-standard variance correctly estimated

subject-variability=0.3

10 20 30 40 50 60 70 80 90 1-FPR: % estimated variance within gold-standard space

white-noise=2

subjects=30 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=10 component-strength-variability=0.5 subject—variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=0 artefact-strength=2.0 white-noise=10.0

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIG a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP TemporalConcat

, , , , ^ 1 ' 1 H

i- -Ю

»-(О

0 10 20 30 40 50 60 70

TPR: % gold-standard variance correctly estimated

subject-variability=0.1

10 20 30 40 50 60 70 80 1-FPR: % estimated variance within gold-standard space

white-noise=10

Fig. 2. Results from relatively "high CNR, low variability" datasets (30 subjects, 200 timepoints).

610 group-difference (the extent to which the outlier subject is different

611 from the others) to 1 (i.e., totally unrelated). Again, MIGP and iterated

612 SMIG perform well (as well as TemporalContat) — and MeanProjection

613 and GIFT perform poorly. It should be noted, though, that most methods

614 perform well with respect to the homogeneous group of subjects,

615 which, arguably, is the most important aspect of these results.

616 Combinations between outlier subjects and subject-specific artefacts

617 Fig. 5 shows tests combining the effects of subject-specific artefact

618 components and the presence of an outlier subject. The results are rea-

619 sonably consistent with being a mixture of effects shown above. MIGP

620 and iterated SMIG perform well overall — as well as TemporalConcat.

621 GIFT overall performs poorly, except when estimating an increased

622 number of components and when the internal dimensionality is double

623 the default — in such cases, the performance can be greater than with

624 TemporalConcat. Possibly the reduced performance of GIFT (particular-

625 ly when compared with its performance in the previous set of tests) is

626 due to the subject-specific artefact components damaging the within-

627 subject PCA applied by the GIFT approach.

628 Very large datasets

629 Fig. 6 shows tests with much larger datasets, with increased num-

630 bers of subjects or timepoints, and varying true and estimated

631 dimensionality.

The first and second tests have 50 (respectively, 200) subjects and 632

200 timepoints, with true and estimated dimensionality of 70 (a dimen- 633

sionality that would most likely be too high to estimate robustly for in- 634

dividual subjects, but hopefully estimable at the group-level). GIFT with 635

m = 2n, MIGP and iterated SMIG perform well, with a very slight differ- 636

ence between their best performances; here, the 10-iterations of m = t 637

SMIG is 1-2% more accurate than the other options, and equal to 638

TemporalConcat. The reason that here m = t, rather than m = 3n as 639

in previous tests, is that here 3n > t. Default GIFT and MeanProjection 640

perform less well than the other methods. 641

The third test has 30 subjects and 1000 timepoints. The estimated 642

dimensionality here is lower than the true dimensionality; all methods 643

perform well except for GIFT. 644

Subject ordering effects in MIGP 645

Although there is no bias in overall eigenvector weighting as new 646

subjects are added into the MIGP calculations, there still might be a 647

small "ordering effect", for example, if eigenvectors found to be present 648

early on (in the incremental estimations) may boost patterns that ran- 649

domly match these in later datasets. Hence in this evaluation we simulat- 650

ed two fairly different groups of subjects. In addition to the standard MIGP 651

analysis (which randomises the order that subjects are processed, and 652

hence should avoid bias between different subject subgroups caused by 653

any order effect), we also ran MIGP twice without randomisation. In 654

ARTICLE IN PRESS

S.M. Smith et al./ Neurolmage xxx (2014 ) xxx-xxx

subject-specific artefacts: varying estimated-dimensionality

subjects=30 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=5 component-strength-variability=0.5 subject—variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0

MeanProjection1 1 1 1 1---1 1---

SMIG a=1 m=t - 1--- I--

SMIG a=2 m=t - 1--н

SMIG a=5 m=t - 1--H

SMIG a=5 m=3n - '--<

SMIG a=10 m=3n - '--H

GIFT- I- - ----'

GIFT m=2n - I-------------1

MIGP- 1---

TemporalConcat ^_,_,_,_,_---

1 1 1 1 1- '— -с —1—I---1 -1

H 1—1

1--с н- ___1

, H Hl

10 20 30 40 50 60 TPR: % gold-standard variance correctly estimated

10 20 30 40 50 60 70 80 90 1-FPR: % estimated variance within gold-standard space

estimated-dimensionality=5

subjects=30 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=10 component-strength-variability=0.5 subject—variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0

MeanProjection - 1 1 1 1 1--'- ^ 1 1 ] 1 1 ''---1 —1 1 1

SMIG a=1 m=t -SMIG a=2 m=t -

■ □□--+

SMIG a=5 m=t -SMIG a=5 m=3n-SMIG a=10 m=3n -GIFTGIFT m=2n -MIGP -TemporalConcat ^ 0

10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 TPR: % gold-standard variance correctly estimated 1-FPR: % estimated variance within gold-standard space

estimated-dimensionality=I0

subjects=30 timepoints=200 voxels=100000 tme-dimensionality-per-group=10 estimated-dimensionality=15 component-strength-variability=0.5 subject—variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0 MeanProjection -

SMIG a=1 m=t -SMIG a=2 m=t -SMIG a=5 m=t -SMIG a=5 m=3n-SMIG a=10 m=3n-GIFTGIFT m=2n -MIGP -TemporalConcat :

1— 1 I— -1

+ 1--ci 1---1

+ 1--с 1---1

+ 1--с 1---1

+ 1--с 1---1

1--- --

1--с 1--1

+ 1— с 1---1

+ 1--с 1---1

0 10 20 30 40 50 60 70 80 90 0 10 20 30 40 50

TPR: % gold-standard variance correctly estimated 1-FPR: % estimated variance within gold-standard space

estimated-dimensionality=I5

Fig. 3. Results showing the effects of adding in subject-specific artefact components, and how the estimated dimensionality interacts with this.

655 one case ("forwards"), we first fed all subjects from group one into MIGP,

656 followed by group two; in the other MIGP analysis ("backwards") we fed

657 in group two first. We also applied these 3 approaches using 5 iterations

658 of the full algorithm, with each iteration initialised from the output of

659 the previous one, rather than by the first subject.

660 Fig. 7 shows the results from this evaluation. This shows that with-

661 out order randomisation, some order bias can occur when different

662 groups of subjects are processed, in group-organised ordering; small

663 but significant biases can be seen in the results for "forwards" and

664 "backwards" versions of MIGP, most strongly with respect to the accura-

665 cy of estimating group 2 in the "forwards" case (where group 1 is

666 processed first).

667 With the default MIGP approach of subject order randomisation, and

668 only a single pass through the entire dataset, accuracy is very close to

669 full temporal concatenation (which processes all subjects simulta-

670 neously). Only an additional ~>-/>0.1% improvement is obtained by it-

671 erating MIGP 5 times, which in general will not be computationally

672 worthwhile, particularly given that sub-group variability would not

673 generally be expected to be as large as was inserted here.

674 Real data results

675 Finally, we validated MIGP on a very large real dataset, utilising a

676 powerful compute cluster having 1.25 TB of RAM to estimate the

group-level PCA both with MIGP and with the "gold-standard" ap- 677 proach of full temporal concatenation. We used rfMRl datasets from 678 the first 131 subjects publicly released by the HCP. Each subject's 679 dataset comprised 4 15-minute runs, totalling 4800 timepoints. The 680 data had been preprocessed and transformed into the compact 681 "grayordinate" standard space (Glasser et al., 2013) of approximately 682 90,000 spatial locations; therefore the entire raw dataset comprises 683 225 GB (if using a 4-bit float for each datapoint) — nearly half a Terabyte 684 when using double precision storage. 685

For this analysis, we set m = 4700, i.e., close to the number of 686 timepoints obtained by combining the 4 runs from each subject. At 687 the completion of MIGP, we saved out the top 4500 components. 688 MIGP took approximately the same amount of time to run than the 689 gold-standard approach, but it required less than 16 GB of memory. 690 The primary final comparison was of the 90,000 x 90,000 full "dense 691 connectome" estimated on the basis of the two approaches. This 692 showed an accuracy of 99.99% for MlGP against the gold standard. lf 693 the gold-standard analysis was reduced to its top 4500 PCA components 694 before re-estimating the ground-truth dense connectome, MlGP had an 695 accuracy (against this) of 100.00%, telling us that the 0.01% error in the 696 initial comparison was in fact due to the PCA approximation of the 697 group dataset to the top 4500 components, and not due to MlGP's 698 approximation to the PCA; this is thus a powerful validation of the over- 699 all algorithm. 700

outlier subject: varying outlier difference and estimated-dimensionality

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIGP a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIGP a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

subjects=31 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=10 component-strength-variability=0.5 group-difference=0.3 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=0 artefact-strength=2.0 white-noise=2.0

10 20 30 40 50 60 70 TPR: % gold-standard variance correctly estimated

TPR: Group1 only (N=30)

TPR: Group2 only (N=1)

10 20 30 40 50 60 70 80 90 1-FPR: % estimated variance within gold-standard space

group-difference=0.3 estimated-dimensionality=10

subjects=31 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=20 component-strength-variability=0.5 group-difference=0.3 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=0 artefact-strength=2.0 white-noise=2.0

20 30 40 50 60 70 80 3: % gold-standard variance correctly estimated

30 40 50 60 70 TPR: Group1 only (N=30)

20 30 40 50

TPR: Group2 only (N=1)

10 20 30 40

1-FPR: % estimated variance within gold-

- nu- mm—

н- п и

I_■— pzzi— '

standard space

group-difference=0.3 estimated-dimensionality=20

subjects=31 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=20 component-strength-variability=0.5 group-difference=1.0 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=0 artefact-strength=2.0 white-noise=2.0

MeanProjection SMIGP a=1 m=t SMIG a=2 m=t SMIGP a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

10 20 30 40 50 60 70 80 TPR: % gold-standard variance correctly estimated

30 40 50 60 70 TPR: Group1 only (N=30)

20 30 40

TPR: Group2 only (N=1)

10 20 30 40 50 60 70 1-FPR: % estimated variance within gold-standard space

group-difference=1.0 estimated-dimensionality=20

Fig. 4. Results showing the effects of adding in an outlier subject, which has a controllable, typically large, difference in its underlying non-artefact spatial maps, compared with the primary group of subjects.

я ro r m tu [Л

о ГО

я' оо

я EU ro

я £1J

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIGP a=5 m=t SMIG1 a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIG a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIG a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

outlier subject and subject-specific artefacts

subjects=31 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=10 component-strength-variability=0.5 group-difference=0.3 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0

10 20 30 40 50 60 0 10 20 30 40 50 60 70 80 0 5 10 15 20 25 30 35 40 0 10 20 30 40 50 60 70

TPR: % gold-standard variance correctly estimated TPR: Group1 only (N=30) TPR: Group2 only (N=1) 1-FPR: % estimated variance within gold-standard space

group-difference=0.3 estimated-dimensionality=10

subjects=31 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=20 component-strength-variability=0.5 group-difference=0.3 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0

-1-1-1-1-1-HZt^'H-1-

10 20 30 40 50 60 70 80 0 10 20 30 40 50 60 70 80 90 0 5 10 15 20 25 30 35 40 45 0 5 10 15 20 25 30 35 40 45 TPR: % gold-standard variance correctly estimated TPR: Group1 only (N=30) TPR: Group2 only (N=1) 1-FPR: % estimated variance within gold-standard space

group-difference=0.3 estimated-dimensionality=20

subjects=31 timepoints=200 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=20 component-strength-variability=0.5 group-difference=1.0 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0

10 20 30 40 50 60 0 10 20 30 40 50 60 70 80 90 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0 5 10 15 20 25 30 35 40 45

TPR: % gold-standard variance correctly estimated TPR: Group1 only (N=30) TPR: Group2 only (N=1) 1-FPR: % estimated variance within gold-standard space

group-difference=1.0 estimated-dimensionality=20

Fig. 5. Results showing tests combining the effects of subject-specific artefact components and the presence of an outlier subject.

ARTICLE IN PRESS

S.M. Smith et al. / Neurolmage xxx (2014) xxx-xxx

very large datasets

subjects=50 timepoints=200 voxels=100000 true-dimensionality-per-group=70 estimated-dimensionality=70 component-strength-variability=0.5 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=1.0 white-noise=5.0

MeanProjection -SMIG a=1 m=t -SMIG a=2 m=t -SMIG a=5 m=t SMIG a=5 m=t SMIG a=10 m=t GIFT GIFT m=2n MIGP TemporalConcat

- 1 1 1 kn-' _ 1 <ÎM- i

- ŒH -

^ - +

•-D-'

•-D-'

iD- i-CHF

- 1 1 1 1 e , '"fW

10 20 30 40 50 60 70 TPR: % gold-standard variance correctly estimated

10 20 30 40 50 60 70 -FPR: % estimated variance within gold-standard space

subjects=50 timepoints=200 true-dimensionality-per-group=70

subjects=200 timepoints=200 voxels=100000 true-dimensionality-per-group=70 estimated-dimensionality=70 component-strength-variability=0.5 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=5.0

MeanProjection - 1 1 1 1 1 1 &

SMIG a=1 m=t - Ko-

SMIG a=2 m=t № -

SMIG a=5 m=t »

SMIG a=5 m=t №

SMIG a=10 m=t IF

GIFT OH -

GIFT m=2n

TemporalConcat - i i i i i i 0*

1 ' ' ^ '+ 1 1 1

-1-1-1-1-1- , KD

10 20 30 40 50 60 70 80 TPR: % gold-standard variance correctly estimated

10 20 30 40 50 60 70 80 -FPR: % estimated variance within gold-standard space

subjects=200 timepoints=200 true-dimensionality-per-group=70

subjects=30 timepoints=1000 voxels=100000 true-dimensionality-per-group=10 estimated-dimensionality=5 component-strength-variability=0.5 subject-variability=0.1 subjectwise-component-strength-variability=0.5 artefact-components=30 artefact-strength=2.0 white-noise=2.0

MeanProjection SMIG a=1 m=t SMIG a=2 m=t SMIG a=5 m=t SMIG a=5 m=3n SMIG a=10 m=3n GIFT GIFT m=2n MIGP

TemporalConcat

il 1 1 + at

i—i i— (

1--1 1 1- H

10 20 30 40 50 60 TPR: % gold-standard variance correctly estimated

subjects=30 timepoints=1000 true-dimensionality-per-group=10 (estimated 5)

10 20 30 40 50 60 70 80 90 1-FPR: % estimated variance within gold-standard space

Fig. 6. Results showing tests with much larger datasets, with increased numbers of subjects or timepoints, and varying true and estimated dimensionality.

701 Conclusions

702 We have shown that two simple approaches for estimating group-

703 level PCA can achieve virtually the same accuracy as full temporal con-

704 catenation of all subjects' datasets, for any number of subjects, or size of

705 data, without needing large amounts of RAM. Indeed, the RAM require-

706 ments do not increase at all with increasing numbers of subjects. Addi-

707 tionally, in the majority of simulation scenarios, the methods (along

708 with the temporal concatenation approach that they approximate) pro-

709 vide more accurate results than the other methods tested.

710 We have not commented on the computational time required for the

711 different methods. This is partly because this work relates to group-level

712 (study-level) analyses, that by definition would not be carried out very

frequently. It is also because most of these methods are parallelisable 713 across multiple CPUs with high efficiency. In the case of MIGP, subsets 714 of a dataset can be processed in parallel (with randomisation of subject 715 membership across subsets), and then the multiple outputs combined 716 with exactly the same approach, treating each subset's output as if it 717 were an individual subject's dataset (which is straightforward given 718 that MIGP retains the scaling/amplitude information). Such an approach 719 would therefore be almost identical to the fully serial MIGP analysis. 720 MIGP and SMIG have similar memory requirements and similar 721 accuracy. They differ in overall computation time in a way that will be 722 data and computer/network dependent. With respect to computation 723 (CPU) time, SMIG is generally faster than MIGP (to an extent that 724 depends primarily on how many SMIG iterations are applied). However, 725

TemporalConcat MIGP 5 iterations MIGP forwards 5 iterations MIGP backwards 5 iterations

90.9 91 91.1 91.2 91.3 91.4 91.5 91.6 91.7 94.05 94.1 94.15 94.2 94.25 87.6 87.8 88 88.2 88.4 88.6 88.8 89 89.2 68.5 69 69.5 70 70.5

TPR: % gold-standard variance correctly estimated TPR: Group1 only (N=20) TPR: Group2 only (N=10) 1-FPR: % estimated variance within gold-standard space

Order effects in MIGP

HCD- H

sub|ects=30 timepoints=100 voxels=10000 true-dimensionality-per-group=15 estimated-dimensionality=36 component-strength-variability=0.0 group-difference=0.3 sub|ect-variability=0.0 sub|ectwise-component-strength-variability=0.0 arteiact-components=0 arteiact-strength=0.0 white-noise=3.0

Fig. 7. Evaluation of subject ordering effects in MIGP.

ARTICLE IN PRESS

S.M. Smith et al. / Neurolmage xxx (2014) xxx-xxx

whereas MIGP only requires each dataset to be read from file once, SMIG requires that every dataset be read from file a + 1 times (i.e. one more time than the number of iterations, because the approach needs to start by forming the average data), if the entire set of raw datasets cannot be held in memory; hence the additional file read time may in some circumstances outweighs the gain in compute time.

Because MIGP uses a one-pass incremental approach to estimate the group-average spatial eigenvectors, it is trivial to add new subjects into a study-level PCA estimation, as they become available (e.g., in the case of HCP, as more subjects' datasets are acquired and publicly released), without needing to restart the processing from scratch. Also, MIGP is able to utilise datasets from different subjects having different numbers of timepoints.

MIGP is implemented in FSL's MELODIC tool, and simple MATLAB code for MIGP and SMIG is given in the Appendices A and B. 4500-component group-level PCA outputs from the first 500 HCP subjects have been computed with MIGP and are publicly available at the humanconnectome.org/data website, along with ICA-based parcellations (at a range of ICA dimensionalities), and "dense connectomes" (grayordinate x grayordinate correlation matrices), both derived from the MIGP group-level PCA decompositions.

Acknowledgments

We are grateful, for funding, to the Wellcome Trust.

Appendix B. SMIG MATLAB code

The internal dimensionality is m, which would normally be set 763

somewhere between n and t. The average data matrix (across all subjects) is meanY. The function nets_svds (from the FSLNets package) is a simple wrapper around the MATLAB function eigs, that ensures that the shorter dimension (regardless of data matrix orientation) is used to calculate the working covariance matrix, and estimating just the top m components. Here, 1e — 3 is a small constant needed in case R is near-singular; in general, it should be set to auto-scale properly with R, e.g., as 1e — 6 * norm(R).

if m<t

,~,V]=nets_svds(meanY,m); W=V';

W=meanY;

for iter=l:a T=zeros(m,v); R=zeros(m); for i=l:s

Mi = Y{i} * W' ;

T = T + Mi'* Y{i};

R = R + Mi' * Mi; end

W=real(inv(sqrtm(R+le-3*eye(m)))) * T;

[W]=nets_svds(W,n);

Appendix A. MIGP MATLAB code

References

Below, Y{i} is the t x v matrix for subject i, and should already have been processed to remove the mean of every column (timeseries), and, in the case of MELODIC, will also by default have had the (subject-specific) temporal variance normalisation procedure applied (Beckmann and Smith, 2004). The final output dimensionality is n.

r = randperm(s); % used to randomise the order subjects are processed in

W = Y{r(1)}; % copy first (randomly chosen) subject Y (t x v matrix)

into W

for i = 2:s % main loop over all other subjects

W = [W; Y{r(i)}]; % concatenate W with the next subject

[U,D] = eigs(W*W', % efficiently get the top "temporal" eigenvectors of

t*2 - 1); W...

W = U'*W; %... and multiply these into W to get weighted spatial

eigenvectors

W = W(1:n,:); % output just the required number of strongest spatial

eigenvectors

To parallelise (across different sub-groups of subjects), simply estimate Wg for each sub-group, concatenate these, and apply the SVD. W can of course be re-loaded later in order to add further subjects into the calculation. Different subjects' datasets can contain different numbers of timepoints.

Baker, C.G., Gallivan, K.A., Van Dooren, P., 2012. Low-rank incremental methods for 776 computing dominant singular subspaces. Linear Algebra Appl. 436 (8), 2866-2888. 777 Beckmann, C.,Smith, S., 2004. Probabilistic independent component analysis for function- 778 al magnetic resonance imaging. IEEE Trans. Med. Imaging 23 (2), 137-152. 779

Biswal, B., Mennes, M.,Zuo, X.,Gohel, S., Kelly, C., Smith, S., Beckmann, C.,Adelstein, J., 780 Milham, M., et al., 2010. Towards discovery science of human brain function. Proc. 781 Nat. Acad. Sci. U.SA 107 (10), 4734-4739. 782

Calhoun, V., Adali, T., Pearlson, G., Pekar, J., 2001. A method for making group inferences 783 from functional MRI data using independent component analysis. Hum. Brain 784 Mapp. 14 (3), 140-151. Glasser, M.,Sotiropoulos, S.,Wilson, J.,Coalson, T.,Fischl, B.,Andersson, J.,Xu,J.,Jbabdi, S. Webster, M., Polimeni, J., Van Essen, D., Jenkinson, M., for the WU-Minn HCP 787 Consortium, 2013. The minimal preprocessing pipelines for the Human Connectome Project. NeuroImage 80, 105-124. Griffanti, L.,Salimi-Khorshidi, G.,Beckmann, C.,Auerbach, E.,Douaud, G.,Sexton, C.,Zsoldos, E.,Ebmeier, K.,Filippini, N.,Mackay, C.,Moeller, S.,Xu, J.,Yacoub, E.,Baselli, G.,Ugurbil, K.,

789 791

Miller, K.,Smith, S., 2014. ICA-based artefact removal and accelerated fMRI acquisition 792 for improved resting state network imaging. NeuroImage (In press). 793

Hyvarinen, A., Smith, S., 2012. Computationally efficient group ICA for large groups. 794 Annual Meeting of the Organization for Human Brain Mapping. 795

ftehurelc, R., 2010. Fast and faster: a comparison of two streamed matrix decomposition 796 algorithms. NIPS Workshop on Low-Rank Methods for Large-Scale Machine Learning. 797 Van Essen, D.,Smith, S.,Barch, D.,Behrens, T.,Yacoub, E.,Ugurbil, K.,for the WU-Minn HCP 798 Consortium, 2013. The WU-Minn Human Connectome Project: an overview. 799 NeuroImage 80, 62-79. 800