Meta-analysis

Meta-analysis of multiple populations along with GxE

Genome-wide association

Let the dataset have $n$ observations, $f$ subpopulations, $m$ markers and $e$ environments. The method starts with a genome-wide association analysis in the $j$th environment, following the alternative model $y_j = \mu_j + Z\hat{\gamma}_i + \xi_j + \epsilon_j$, where $y_j$ is the vector of the response variable, $\mu_j$ is the intercept, $Z$ is an $n \times f$ incidence matrix indicating the haplotype of the marker under evaluation, $\hat{\gamma}_i$ is the vector of allele effects of the $i$th marker in the $f$ subpopulations, $\xi_j$ is a vector of length $n$ corresponding to the polygenic term and $\epsilon_j$ is a vector of residuals of length $n$.
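As a rough illustration of the incidence matrix and of the allele effects it yields, the sketch below builds an $n \times f$ matrix $Z$ for a toy dataset and recovers the effects by ordinary least squares. It rests on several assumptions that are not part of the method description: each individual belongs to exactly one subpopulation, the marker is coded 0/1, the phenotype is noise-free, and the polygenic term is left out entirely, so this is not the association model itself. All variable names and values are hypothetical.

```python
import numpy as np

n, f = 12, 3
subpop = np.arange(n) % f            # hypothetical subpopulation of each individual
allele = (np.arange(n) // f) % 2     # hypothetical 0/1 allele code at the marker

# n x f incidence matrix: the allele code sits in the column of the
# individual's own subpopulation, zeros elsewhere.
Z = np.zeros((n, f))
Z[np.arange(n), subpop] = allele

mu, gamma = 10.0, np.array([1.0, -2.0, 0.5])   # made-up "true" values
y = mu + Z @ gamma                             # noise-free toy phenotype

# Ordinary least squares on [1, Z]; the polygenic term is deliberately omitted.
design = np.column_stack([np.ones(n), Z])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
mu_hat, gamma_hat = coef[0], coef[1:]
```

Because the toy phenotype contains no noise, `gamma_hat` reproduces `gamma`; with real data the polygenic term and the residual covariance would have to be accounted for.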

Meta-analysis

For each marker $i$, the meta-analysis is based on the concept of sufficient statistics: environments are assumed to be independent, and all the information in each environment is assumed to be captured by the allele effects $\hat{\gamma}_i$ and the observed residual matrix $R_i$ obtained from the association analysis. The meta-analysis tests whether the genetic (G) and environmental (E) components of $\gamma$ differ from zero and, in addition, whether a G × E component exists. In this step, the set of $\hat{\gamma}$ from the association analyses becomes the response variable, a vector of length $e \times f$. For the $i$th marker, variance components are obtained from the following random model: $\gamma_i = \mu_i + Z\alpha_i + W\beta_i + H\delta_i + e_i$, where $\mu_i$ is the intercept, $Z$ is an $ef \times f$ incidence matrix indicating the allele source, $\alpha_i$ is the genetic effect associated with the marker, $W$ is an $ef \times e$ incidence matrix indicating the environmental factor, $\beta_i$ is the coefficient associated with each environment, $H$ is the incidence matrix of the genotype-by-environment interaction, $\delta_i$ is the coefficient associated with the G × E term, and $e_i$ is the vector of residuals with a known residual covariance matrix $R$, an $ef \times ef$ block diagonal matrix.
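The incidence matrices of this model can be written compactly with Kronecker products. The sketch below assumes, hypothetically, that the response vector is stacked environment by environment (all $f$ subpopulations of environment 1, then environment 2, and so on); a different ordering would swap the Kronecker factors. The function name `meta_incidence` is illustrative only.

```python
import numpy as np

def meta_incidence(e, f):
    """Incidence matrices for a response stacked environment by environment."""
    ones_e = np.ones((e, 1))
    ones_f = np.ones((f, 1))
    Z = np.kron(ones_e, np.eye(f))   # ef x f: allele source (subpopulation)
    W = np.kron(np.eye(e), ones_f)   # ef x e: environment
    H = np.eye(e * f)                # ef x ef: one column per G x E combination
    return Z, W, H

Z, W, H = meta_incidence(e=4, f=5)
print(Z.shape, W.shape, H.shape)     # (20, 5) (20, 4) (20, 20)
```

With a single estimate per genotype-environment combination, `H` is simply an identity matrix, which is why the interaction term has as many columns as there are observations.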

AMMI term

The G × E term might appear to saturate the model, because each regression coefficient $\gamma$ is observed as an unreplicated combination of genotype and environment. Saturation does not occur, however, because the residuals are not independent and their covariance structure is known. Still, there exists an alternative parameterization of this term: the additive main effect and multiplicative interaction (AMMI) term. The AMMI term works as follows. Suppose that the analysis is performed on a dataset with $f = 5$ subpopulations and $e = 4$ environments. Once $\gamma$ has been estimated from the association analysis (step 1) and the variance components of the genetic and environmental terms have been estimated with the meta-analysis (step 2), one can build the following matrix $E$ of residuals, which also contains the higher-order interaction term:

      E1    E2    E3    E4
G1    ε11   ε21   ε31   ε41
G2    ε12   ε22   ε32   ε42
G3    ε13   ε23   ε33   ε43
G4    ε14   ε24   ε34   ε44
G5    ε15   ε25   ε35   ε45

The AMMI term is extracted with the singular-value decomposition (SVD), a procedure commonly used to extract signal from non-square matrices. The decomposition is $E = UDS$, where $U$ is an $e \times e$ matrix, $D$ is an $e \times f$ rectangular diagonal matrix, and $S$ is an $f \times f$ matrix; these dimensions take $E$ with environments in rows, that is, the transpose of the layout displayed above. In analogy to the eigendecomposition, $U$ and $S$ play the role of eigenvectors, while the diagonal of $D$ holds the singular values, the counterpart of eigenvalues. Likewise, a small number of principal components usually carries most of the information needed to reconstruct the original matrix. Suppose one reconstructs $E$ using the first $p = 2$ principal components:

      E1    E2    E3    E4
G1    q11   q21   q31   q41
G2    q12   q22   q32   q42
G3    q13   q23   q33   q43
G4    q14   q24   q34   q44
G5    q15   q25   q35   q45

The matrix above can be rearranged as a vector and included in the meta-analysis model in place of the current G × E term. Thus, the model for the $i$th marker can also be expressed as $\gamma_i = \mu_i + Z\alpha_i + W\beta_i + Q\tau_i + e_i$.
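The reconstruction and rearrangement steps can be sketched with NumPy as follows. The residual matrix here is random toy data, stored with environments in rows ($e \times f$) to match the dimensions given for $U$, $D$ and $S$, so it is the transpose of the tabular layout. The function name `ammi_term` and the row-major flattening order are assumptions made for illustration.

```python
import numpy as np

def ammi_term(E, p):
    """Rebuild E from its first p singular triplets; also return all singular values."""
    # np.linalg.svd returns U, the singular values, and the (already transposed)
    # right singular vectors, so that E = U @ diag(d) @ S.
    U, d, S = np.linalg.svd(E, full_matrices=False)
    Q = (U[:, :p] * d[:p]) @ S[:p, :]
    return Q, d

rng = np.random.default_rng(0)
e, f = 4, 5
E = rng.normal(size=(e, f))     # made-up residuals, environments in rows
Q, d = ammi_term(E, p=2)        # rank-2 approximation of E
q_vec = Q.ravel()               # length e*f vector, environment by environment
```

The Frobenius norm of `E - Q` equals the square root of the sum of the squared discarded singular values, which gives a direct measure of how much of the interaction signal the first $p$ components retain.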

Hypothesis testing

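The log-likelihood of the model is therefore $L(\mu, \sigma^2_\alpha, \sigma^2_\beta, \sigma^2_\tau) = -0.5\left(\log|V| + (y - \mu)^T V^{-1} (y - \mu)\right)$, where $y$ denotes the response of the meta-analysis model (the vector $\gamma_i$) and the variance is expressed as $\sigma^2_\gamma = V = ZZ^T\sigma^2_\alpha + WW^T\sigma^2_\beta + QQ^T\sigma^2_\tau + R$. This log-likelihood is tested against that of the null model, $L(\mu = \sigma^2_\alpha = \sigma^2_\beta = \sigma^2_\tau = 0)$, providing evidence on whether at least one of the coefficients (intercept and variance components) is not null. Thus, $LRT = -2\left(L_{0,0,0,0} - L_{\mu, \sigma^2_\alpha, \sigma^2_\beta, \sigma^2_\tau}\right)$, so that larger values indicate stronger evidence against the null model.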
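A minimal sketch of these two quantities is given below. It only evaluates the log-likelihood and the test statistic at parameter values supplied by the caller; estimating the intercept and variance components is not shown. The function names, the toy inputs, the identity residual covariance and the random placeholder for $Q$ are all assumptions made for illustration.

```python
import numpy as np

def loglik(y, mu, V):
    """Log-likelihood up to an additive constant, as written in the text."""
    r = y - mu
    _, logdet = np.linalg.slogdet(V)          # log|V| without overflow
    return -0.5 * (logdet + r @ np.linalg.solve(V, r))

def lrt(y, mu, s2_alpha, s2_beta, s2_tau, Z, W, Q, R):
    """-2 * (null - full), evaluated at the supplied parameter values."""
    V = s2_alpha * (Z @ Z.T) + s2_beta * (W @ W.T) + s2_tau * (Q @ Q.T) + R
    full = loglik(y, mu, V)
    null = loglik(y, 0.0, R)                  # mu = 0, all variance components = 0
    return -2.0 * (null - full)

# Toy inputs (all values made up for illustration).
rng = np.random.default_rng(0)
e, f, p = 4, 5, 2
Z = np.kron(np.ones((e, 1)), np.eye(f))      # assumes environment-major stacking
W = np.kron(np.eye(e), np.ones((f, 1)))
Q = rng.normal(size=(e * f, p))              # placeholder for the AMMI columns
R = np.eye(e * f)                            # simplest valid residual covariance
y = 2.0 + rng.normal(size=e * f)

stat = lrt(y, 2.0, 0.5, 0.5, 0.1, Z, W, Q, R)
```

When the supplied parameters coincide with the null model (zero intercept and zero variance components), the statistic is exactly zero, which is a convenient sanity check.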

Woodbury’s matrix identities

The computational burden of the analysis above stems from the determinant and the inverse of the covariance matrix $V$, a square matrix with $ef$ rows and columns. Let $X = [Z\sigma_\alpha \mid W\sigma_\beta \mid Q\sigma_\tau]$, such that $V = XX^T + R$. Using Woodbury’s matrix identities, we have $V^{-1} = R^{-1} - R^{-1}X(X^T R^{-1} X + I)^{-1} X^T R^{-1}$ and $|V| = |X^T R^{-1} X + I|\,|R|$, where the square matrix to be inverted has dimension $e + f + p$. For example, in the analysis of a dataset with $e = 18$ environments, $f = 41$ subpopulations and $p = 2$ principal components for the G × E term, we invert a square matrix with $18 + 41 + 2 = 61$ rows and columns instead of a matrix with $18 \times 41 = 738$ rows and columns.
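The two identities can be checked numerically on a small example. In the sketch below the standard deviations, the random $ef \times p$ placeholder for $Q$ (the count $e + f + p$ implies $p$ columns) and the per-environment blocks of $R$ are all made up; only the algebra follows the text. $R$ is inverted block by block, which is what keeps the block diagonal structure cheap.

```python
import numpy as np

rng = np.random.default_rng(1)
e, f, p = 4, 5, 2
Z = np.kron(np.ones((e, 1)), np.eye(f))       # assumes environment-major stacking
W = np.kron(np.eye(e), np.ones((f, 1)))
Q = rng.normal(size=(e * f, p))               # placeholder AMMI columns
s_alpha, s_beta, s_tau = 0.8, 0.5, 0.3        # made-up standard deviations
X = np.hstack([Z * s_alpha, W * s_beta, Q * s_tau])   # ef x (f + e + p)

# Block diagonal R: one f x f positive definite block per environment.
R = np.zeros((e * f, e * f))
R_inv = np.zeros_like(R)
logdet_R = 0.0
for j in range(e):
    A = rng.normal(size=(f, f))
    block = A @ A.T + f * np.eye(f)
    idx = slice(j * f, (j + 1) * f)
    R[idx, idx] = block
    R_inv[idx, idx] = np.linalg.inv(block)    # only small blocks are inverted
    logdet_R += np.linalg.slogdet(block)[1]

# Small (e + f + p) x (e + f + p) system instead of the ef x ef matrix V.
M = X.T @ R_inv @ X + np.eye(X.shape[1])
V_inv = R_inv - R_inv @ X @ np.linalg.solve(M, X.T @ R_inv)
logdet_V = np.linalg.slogdet(M)[1] + logdet_R

# Direct computation, for comparison only.
V = X @ X.T + R
```

Here `M` has 11 rows and columns while `V` has 20; the gap widens quickly as $e$ and $f$ grow, as in the 61 versus 738 example.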