MaCH-Admix: Genotype Imputation for Admixed Populations
Imputation in admixed populations is an important but challenging problem due to their complex linkage disequilibrium (LD) patterns. The emergence of large reference panels such as that from the 1000 Genomes Project enables more accurate imputation in general, and particularly for admixed populations and for uncommon variants. To benefit efficiently from these large reference panels, one key issue to consider in a modern genotype imputation framework is the selection of effective reference panels. In this work, we consider a number of methods for effective reference panel construction inside a hidden Markov model and specific to each target individual. These methods fall into two categories: identity-by-state (IBS) based and ancestry-weighted approaches. We evaluated their performance on individuals from recently admixed populations. Our target samples include 8,421 African Americans and 3,587 Hispanic Americans from the Women’s Health Initiative, which allow assessment of imputation quality for uncommon variants. Our experiments include both large and small reference panels; large, medium, and small target samples; and genomic regions of varying LD levels. We also include BEAGLE and IMPUTE2 for comparison. Experimental results with the large reference panel suggest that our novel piecewise IBS method yields consistently higher imputation quality than the other methods/software. The advantage is particularly noteworthy among uncommon variants, where we observe up to 5.1% information gain with the difference being highly significant (Wilcoxon signed-rank test P-value < 0.0001). Our work is the first that considers various sensible approaches for imputation in admixed populations and presents a comprehensive comparison.
The rest of the paper is organized as follows. We first present the general framework of our imputation algorithm, followed by the intuition and formulation of our piecewise IBS method and various other effective reference selection methods. We then evaluate all these methods as implemented in MaCH-Admix, the whole-haplotype IBS method implemented in IMPUTE2 [Howie et al., 2009], and BEAGLE [Browning and Browning, 2009], using the datasets described in the Datasets section.
We have implemented all methods evaluated, including our piecewise IBS selection method, in our software package MaCH-Admix. Besides the new reference selection functionality, our software also retains high flexibility in two major aspects. First, both regional and whole-chromosome imputation can be accommodated. Second, both data-independent and data-dependent model parameter estimation are supported. Thus, besides standard reference panel with precalibrated parameters, we can elegantly handle study-specific reference panels and target samples with unknown ethnic origin.
In this work, we evaluated two classes of reference selection methods: IBS-based and ancestry-weighted approaches. Among the IBS-based approaches, we propose a novel method based on IBS matching in a piecewise manner. The method breaks the genomic region under investigation into small pieces and finds reference haplotypes that best represent every small piece, for each target individual separately. The method can be incorporated directly into existing imputation algorithms and has computational complexity identical to that of the existing whole-haplotype IBS-based method. Results from all real datasets evaluated suggest that our piecewise IBS method is highly robust and stable even when a small number of reference haplotypes are selected. Importantly, for uncommon variants, our piecewise IBS selection method manifests a more pronounced advantage with large reference panels.
An alternative approach, based on identity-by-state (IBS) sharing between the target individual and haplotypes in the reference populations, can be embedded within existing imputation models. This approach constructs individual-specific effective reference panels by selecting the most closely related haplotypes (according to IBS score) from the entire reference pool. The IBS-based selection is intuitive, useful for reducing the size of the effective reference panel, and tailored separately for each target individual. The selection is usually conducted by computing pairwise Hamming distances, which is computationally very appealing. A simple IBS-based method, which selects a subset of haplotypes into the effective reference panel according to their Hamming distance from the haplotypes to be inferred across the entire genomic region to be imputed (hereafter referred to as whole haplotype), has been adopted by IMPUTE2 [Howie et al., 2009]. Although some promising results have been shown when compared with random selection, no work has examined alternatives to this simple whole-haplotype matching, partly due to the heavy computational burden involved.
Admixed populations offer a unique opportunity for gene mapping because one could utilize admixture LD to search for genes underlying diseases that differ strikingly in prevalence across populations [Reich and Patterson, 2005; Rosenberg et al., 2010; Tang et al., 2006; Winkler et al., 2010; Zhu et al., 2004]. Although useful for admixture mapping, admixture LD also imposes challenges for imputation. Because an admixed individual’s genome is a mosaic of ancestral chromosomal segments, to appropriately impute the genotypes, it is imperative to incorporate the underlying ancestry information. Practically, this is equivalent to selecting an appropriate reference panel that matches the corresponding ancestral population(s).
MATERIAL AND METHODS
Assume that we have n individuals in the target population that are genotyped at a set of markers denoted by Mg. In addition, we have an independent set of H reference haplotypes, for example, those from the International HapMap or the 1000 Genomes Projects, encompassing a set of markers denoted by Mr. Without loss of generality, we assume that the set of markers assayed in the target population, Mg, is a subset of Mr, the markers in the reference population. The goal of genotype imputation is to fill in missing genotypes including those missing by design (e.g., genotypes at markers in Mr but not Mg, commonly referred to as untyped markers). As described earlier [Li et al., 2010], our HMM as implemented in MaCH fulfills the goal by inferring the haplotypes encompassing Mr markers for each target individual, from unphased genotypes at the directly assayed markers in Mg. Haplotype reconstruction is accomplished by building imperfect mosaics using some of the H reference haplotypes.
GENERAL FRAMEWORK
Because admixed individuals have inherited genetic information from more than one ancestral population, we start with a pooled panel: a panel with haplotypes from all relevant populations, for example, CEU+YRI for African Americans and CEU+YRI+JPT+CHB for Hispanic Americans, where CEU is an abbreviation for Utah residents (CEPH) with Northern and Western European ancestry; YRI for Yoruba in Ibadan, Nigeria; JPT for Japanese in Tokyo, Japan; and CHB for Han Chinese in Beijing, China. Let 𝒢 = (g1, g2, g3, …, gMr) denote the unphased genotypes at the Mr markers for a target individual. Furthermore, we define a series of variables Sm, m = 1, 2, …, Mr to denote the hidden state underlying each unphased genotype gm. The hidden state Sm consists of an ordered pair of indices (xm, ym) indicating that, at marker m, the first chromosome of this particular target individual uses reference haplotype xm as the template and the second chromosome uses reference haplotype ym as the template, where xm and ym both take values from {1, 2, …, H}.
We seek to infer the posterior probabilities of the sequence of hidden states 𝒮 = (S1, S2, …, SMr) for each individual, as knowledge of 𝒮 determines the genotype at each of the Mr markers. Define P(Sm | ℋ, 𝒢) as the posterior probability for Sm, the hidden state at marker m, with ℋ denoting the pool of reference haplotypes and 𝒢 denoting the genotype vector of the target individual. To infer these posterior probabilities, we run multiple Markov iterations. Within each iteration, we calculate the conditional joint probabilities P(Sm, 𝒢 | ℋ) at each marker m via an adapted Baum’s forward and backward algorithm as previously described [Li et al., 2010].
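For concreteness, the posterior at each marker is obtained by normalizing these joint probabilities over all possible hidden states; writing this out in the notation above (the forward/backward symbols αm and βm are ours, used here only as a standard restatement of Baum’s algorithm):

P(Sm | ℋ, 𝒢) = P(Sm, 𝒢 | ℋ) / Σs P(Sm = s, 𝒢 | ℋ),  with  P(Sm = s, 𝒢 | ℋ) = αm(s) βm(s),

where αm(s) accumulates the probability of the observed genotypes at markers 1, …, m with state s at marker m, and βm(s) accumulates that of markers m+1, …, Mr given state s at marker m.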
For admixed populations, as we tend to include more reference haplotypes in the pool under the philosophy of erring on the safe side, and as we attempt not to duplicate haplotypes, one key aspect of the modeling is how to traverse the part of the sample space harboring the most probability mass with minimal computational effort.
PIECEWISE IBS-BASED REFERENCE SELECTION
In piecewise IBS selection, we seek to construct a set of t effective reference haplotypes from the pool of H haplotypes within each HMM iteration for each target individual separately. Selected reference panels are therefore tailored for each target individual. For presentation clarity, we consider a single target individual. Specifically, we calculate the genetic similarity (measured by IBS, the Hamming distance between two haplotypes) in a piecewise manner between the individual and each haplotype in the reference pool, ignoring the subpopulations (e.g., CEU or YRI) within the reference.
Denote (h1′, h2′) as the current haplotype guess for the target individual. We break haplotype h1′ into a maximum of t/2 pieces so that the typed markers are evenly placed across pieces. Each piece has a minimum length of ν typed markers to ensure that the calculated Hamming distance is informative. Denote the number of pieces by p. For each haplotype piece, we calculate the piece-specific IBS score between h1′ and each reference haplotype and select the top t/(2p) reference haplotypes, resulting in a total of t/2 selected for h1′ across all p pieces. We repeat the same procedure for h2′ and select a second set of t/2 reference haplotypes. In our implementation, we set ν = 32, which corresponds to an average length of <200 kb for commonly used genome-wide genotyping platforms. To avoid creating spurious recombinations at piece boundaries, we apply a random offset to the first piece in each sampling iteration so that the boundaries differ across iterations. In the case where t/(2p) is not an integer, we select ⌈t/(2p)⌉ (the ceiling integer) reference haplotypes in each piece for each target haplotype and then sample randomly from the selected reference haplotypes. Note that the piecewise selection is repeated for each individual in each sampling iteration; thus, the selection will change along with the intermediate sampling results.
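To make the procedure above concrete, the following is a minimal Python sketch of piecewise IBS selection for a single target haplotype. The function name, array layout, and the exact handling of the boundary offset are our own illustrative choices, not the MaCH-Admix implementation; the sketch only mirrors the steps just described.

```python
import numpy as np

def piecewise_ibs_select(target_hap, ref_haps, t, nu=32, rng=None):
    """Select roughly t/2 reference haplotypes for one target haplotype
    by piecewise IBS (Hamming-distance) matching.

    target_hap : 1-D 0/1 array over the typed markers (current haplotype guess)
    ref_haps   : 2-D 0/1 array, one row per reference haplotype, same markers
    t          : effective reference panel size; t/2 haplotypes per target haplotype
    nu         : minimum number of typed markers per piece
    """
    rng = rng or np.random.default_rng()
    n_markers = target_hap.shape[0]
    # At most t/2 pieces, each containing at least nu typed markers.
    p = max(1, min(t // 2, n_markers // nu))
    per_piece = int(np.ceil((t / 2) / p))            # ceiling of t/(2p)
    # Evenly spaced boundaries; a random offset shifts interior boundaries
    # so that piece borders differ across sampling iterations.
    bounds = np.linspace(0, n_markers, p + 1, dtype=int)
    offset = int(rng.integers(0, nu))
    bounds[1:-1] = np.clip(bounds[1:-1] + offset, 1, n_markers - 1)

    chosen = set()
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        if hi <= lo:
            continue
        # Piece-specific Hamming distance to every reference haplotype.
        dist = (ref_haps[:, lo:hi] != target_hap[lo:hi]).sum(axis=1)
        chosen.update(np.argsort(dist)[:per_piece].tolist())

    chosen = np.array(sorted(chosen))
    # If ceiling rounding selected more than t/2 in total, thin at random.
    if chosen.size > t // 2:
        chosen = rng.choice(chosen, size=t // 2, replace=False)
    return chosen
```

In the full algorithm, this selection would be run twice per individual (once for h1′ and once for h2′) within every sampling iteration, so the effective panel tracks the intermediate haplotype guesses.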
We have also implemented two whole-haplotype IBS-based methods, IBS Single Queue (IBS-SQ) and IBS Double Queue (IBS-DQ). The former defines the IBS score of each reference haplotype as its minimum Hamming distance to h1′ and h2′, thus ordering the H reference haplotypes in a single queue; the top t reference haplotypes are then selected. The latter defines two separate IBS scores, one for h1′ and one for h2′, thus ordering the H reference haplotypes in two queues; the top t/2 reference haplotypes are selected for h1′ according to the IBS scores for h1′, and similarly another t/2 reference haplotypes are selected for h2′.
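For comparison, the two whole-haplotype strategies reduce to a global ranking of the reference pool. A minimal sketch, again with illustrative names and assuming the same 0/1 array layout as above:

```python
import numpy as np

def hamming_to_each(hap, ref_haps):
    """Hamming distance from one target haplotype to every reference haplotype."""
    return (ref_haps != hap).sum(axis=1)

def ibs_single_queue(h1, h2, ref_haps, t):
    # Score each reference haplotype by its minimum distance to h1 or h2,
    # then take the top t from a single ranked queue.
    score = np.minimum(hamming_to_each(h1, ref_haps), hamming_to_each(h2, ref_haps))
    return np.argsort(score)[:t]

def ibs_double_queue(h1, h2, ref_haps, t):
    # Rank the reference pool separately for h1 and h2 and take t/2 from each queue.
    top_for_h1 = np.argsort(hamming_to_each(h1, ref_haps))[: t // 2]
    top_for_h2 = np.argsort(hamming_to_each(h2, ref_haps))[: t // 2]
    return np.union1d(top_for_h1, top_for_h2)
```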
The three IBS strategies can be illustrated under two simple scenarios. In both scenarios, there are eight markers measured in both target and reference, with color indicating allelic status: the same color at the same locus implies the same allele. In both scenarios, the first chromosome of the target individual shares all eight alleles with the dark-colored reference haplotypes and zero alleles with the light-shaded reference haplotypes. In the first scenario, the second chromosome of the target individual shares two alleles with the dark-colored reference haplotypes and the remaining six alleles with the light-shaded reference haplotypes; whereas in the second scenario, the second chromosome shares six alleles with the dark-colored reference haplotypes and the remaining two alleles with the light-shaded reference haplotypes.
Suppose t = H/2. The first scenario illustrates a case where the whole-haplotype Single Queue strategy is not optimal because only dark-colored haplotypes will be selected into the effective reference panel. By combining two sets selected from two separate queues, the whole-haplotype Double Queue strategy is advantageous in that scenario. On the other hand, neither the whole-haplotype Single Queue nor the whole-haplotype Double Queue strategy handles the second scenario well, because both strategies would select only the dark-colored reference haplotypes. Ideally, the selected reference haplotypes should, when possible, contain information to represent every part of both chromosomes carried by the target individual. In the second scenario, because the target individual carries a segment of the light-shaded haplotype, it is desirable to have some representation of the light-shaded haplotypes in the effective reference panel. Our piecewise IBS method achieves this by breaking the whole region into pieces and selecting some reference haplotypes according to genetic matching within each piece. By conducting local IBS matching and choosing a few reference haplotypes within each piece, it retains some representation of the light-shaded reference haplotypes. As a result, all parts of the target chromosomes are well represented by the selected reference haplotypes. In general, we believe that selecting a small number of reference haplotypes for each piece locally performs better than selecting globally at the whole-haplotype level. Note that the piecewise IBS method has the same computational complexity as the two whole-haplotype IBS methods.
ANCESTRY-WEIGHTED APPROACH
Besides IBS-based methods, we also evaluate an ancestry-weighted selection method, which is motivated by the idea of the weighted cosmopolitan panel discussed in the Introduction section. This method concerns the scenario where the reference panel consists of haplotypes from several populations, for instance, CEU and YRI, such that the H reference haplotypes are naturally decomposed into several groups. Let Q denote the number of populations included and Hq denote the number of haplotypes from reference population q, q = 1, 2, …, Q. We first consider the issue of weight determination for each contributing reference population, that is, the fraction of reference haplotypes to be selected from that population. Intuitively, the weights should depend on the proportions of ancestry from these reference populations for the target admixed individual(s). The weights can be, on one extreme, the same for all individuals in the target population (e.g., when the admixture makeup is similar across all individuals), or different for subpopulations within the target population, or, on the other extreme, specific to each target individual. For presentation clarity, we suppress the individual index i and denote w = (w1, w2, …, wQ) as the vector of weights, under the constraint that w1 + w2 + ⋯ + wQ = 1. In this work, we consider the same set of weights for all target individuals. The weights represent the average contributions over the imputation region and across all target individuals. We choose such average weights over weights specific to each single individual because the average weights can be more stably estimated.
There are several natural ways to estimate the weights. One could prespecify the weights according to estimates of ancestry proportion. For example, it is reasonable to use a ~2:8 CEU:YRI weighting scheme for African Americans, who are estimated to have about 20% Caucasian and 80% African ancestry [Lind et al., 2007; Parra et al., 1998; Reiner et al., 2007; Stefflova et al., 2011]. Alternatively, one can estimate the ancestry proportions for the target individuals under investigation. We have implemented an imputation-based approach within MaCH-Admix to infer ancestry proportions according to the contributions of reference haplotypes from each population to the constructed mosaics of the target individuals, so that the weights can be estimated by MaCH-Admix internally. We use the software package structure [Pritchard et al., 2000], specifically its Admix+LocPrior model, on an LD-pruned set of single nucleotide polymorphisms (SNPs) to confirm our internal ancestry inference.
Having determined the weights, we are interested in constructing a set of t effective reference haplotypes within each Markov iteration from the pool of H reference haplotypes according to the ancestry proportions. We achieve this by sampling without replacement t × wq haplotypes from the Hq haplotypes in reference population q. For each target individual, we sample a different reference panel under the same set of weights.
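A minimal sketch of this sampling step, assuming the pooled reference is indexed by population (the function and variable names are illustrative only, not MaCH-Admix internals):

```python
import numpy as np

def ancestry_weighted_select(ref_groups, weights, t, rng=None):
    """Draw an effective reference panel of roughly t haplotypes by sampling
    about t * w_q indices (without replacement) from each population q.

    ref_groups : list of integer arrays; ref_groups[q] holds the indices of the
                 H_q haplotypes contributed by population q to the pooled panel
    weights    : ancestry weights (w_1, ..., w_Q) summing to 1
    t          : effective reference panel size
    """
    rng = rng or np.random.default_rng()
    selected = []
    for group, w in zip(ref_groups, weights):
        k = min(len(group), int(round(t * w)))
        selected.extend(rng.choice(group, size=k, replace=False).tolist())
    return np.array(selected)
```

For example, with the ~2:8 CEU:YRI weighting for African Americans, weights = (0.2, 0.8) would draw about 0.2t CEU and 0.8t YRI haplotypes, with a fresh draw per target individual in each iteration under the same weights.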
MACH-ADMIX
We have implemented the aforementioned methods (three IBS-based and one ancestry-weighted) in our software package MaCH-Admix. MaCH-Admix breaks the one-step imputation in MaCH into three steps: phasing, model parameter (including error rate and recombination rate parameters) estimation, and haplotype-based imputation. The splitting into phasing and haplotype-based imputation is similar to IMPUTE2. Our software can accommodate both regional and whole-chromosome imputation and allows both data-dependent and data-independent model parameter estimation. The flexibility regarding model parameter estimation allows one to perform imputation with standard reference panels such as those from the HapMap or the 1000 Genomes Projects with precalibrated parameters in a data-independent fashion, similar to IMPUTE2, which uses recombination rates estimated from the HapMap data and a constant mutation rate. Alternatively, if one works with study-specific reference panels, or suspects the model parameters differ from those precalibrated (e.g., when target individuals are of unknown ethnicity or from an isolated population), one has the option to simultaneously estimate these model parameters while performing imputation.
DATASETS
We assessed the reference selection methods in the following six target sets:
- 3,587 WHI Hispanic Americans (WHI-HA),
- 8,421 WHI African Americans (WHI-AA),
- 200 randomly sampled WHI-HA individuals,
- 200 randomly sampled WHI-AA individuals,
- 49 HapMap III African Americans (ASW),
- 50 HapMap III Mexican individuals (MEX).
The WHI SHARe consortium offers one of the largest genetic studies in admixed populations. WHI [Anderson et al., 2003; The WHI Study Group, 1998] recruited a total of 161,808 women, with 17% from minority groups (mostly African Americans and Hispanics), from 1993 to 1998 at 40 clinical centers across the United States. The WHI SHARe consortium genotyped all the WHI-AA and WHI-HA individuals using the Affymetrix 6.0 platform. Detailed demographic and recruitment information for these genotyped samples has been described previously [Qayyum et al., 2012]. Besides standard quality control (details described previously in [Liu et al., 2012]), we removed SNPs with minor allele frequency (MAF) below 0.5%. To evaluate imputation performance on target sets of smaller size, we randomly sampled 200 individuals from WHI-HA and WHI-AA separately.
For the two HapMap III datasets, our target individuals are ASW (individuals of African ancestry in Southwest USA) and MEX (individuals of Mexican ancestry in Los Angeles, California), respectively, from phase III of the International HapMap Project [The International HapMap Consortium, 2010]. These individuals (83 ASW and 77 MEX) were all genotyped using two platforms: the Illumina Human1M and the Affymetrix 6.0. We restricted our analysis to founders only: 49 ASW and 50 MEX.
The main focus of our work is imputation with a large reference panel. Thus, we first evaluated imputation performance on all six target sets with reference haplotypes from the 1000 Genomes Project (release 20101123, H = 2,188 haplotypes). For the WHI datasets, the number of markers overlapping between target and reference, bounded by the number of markers typed in the target samples, is smaller than that for the HapMap individuals. Therefore, we performed imputation 10 times, each time masking a different 5% of the Affymetrix 6.0 markers. This masking strategy allowed us to evaluate imputation quality at 50% of the Affymetrix 6.0 SNPs. For HapMap III ASW and MEX individuals, we randomly masked 50% of the overlapping markers and evaluated performance at these markers. We used two different masking schemes for the HapMap and WHI samples because we have ~1.5 million typed markers in the HapMap samples and thus can still achieve reasonable imputation accuracy by masking 50% of the markers in a single trial. In the WHI samples, masking 50% of the ~0.8 million markers in a single trial would substantially reduce imputation accuracy, while using one trial with a small percentage of markers masked would leave an insufficient number of markers for evaluation. Therefore, we used multiple trials with 5% masking for the WHI datasets.
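As a small illustration of the WHI masking scheme just described (under our reading of it: ten disjoint 5% subsets that together cover 50% of the typed markers), one could generate per-trial masks as follows; the function name and random partitioning are purely illustrative.

```python
import numpy as np

def make_masking_plan(marker_ids, n_trials=10, frac=0.05, seed=2012):
    """Partition a random 50% of markers into n_trials disjoint masks of size frac each."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(np.asarray(marker_ids))
    k = int(len(shuffled) * frac)
    # Trial i masks markers shuffled[i*k : (i+1)*k]; evaluation pools all trials.
    return [shuffled[i * k:(i + 1) * k] for i in range(n_trials)]
```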
To provide a comprehensive evaluation, we also conducted imputation on all six target sets using HapMap II or HapMap III haplotypes as reference. We used HapMap II CEU+YRI (H = 240) for WHI-AA individuals and HapMap II CEU+YRI+JPT+CHB (H = 420) for WHI-HA individuals. The evaluation is based on masking 50% of the overlapping markers. For the HapMap III ASW target set, we considered three different reference panels: HapMap II CEU+YRI (H = 240), HapMap III CEU+YRI (H = 464), and HapMap III CEU+YRI+LWK+MKK (H = 930), where LWK (Luhya in Webuye, Kenya) and MKK (Maasai in Kinyawa, Kenya) are two African populations from Kenya. For the HapMap III MEX target set, we considered HapMap II CEU+YRI+JPT+CHB (H = 420) and HapMap III CEU+YRI+JPT+CHB (H = 804). For the HapMap target sets with HapMap references, we used genotypes at SNPs on the Illumina HumanHap650 BeadChip as imputation input and reserved the other genotypes for evaluation. We have posted the HapMap data and the command lines used in this work on the MaCH-Admix website (see Web Resources).
We picked five 5-Mb regions across the genome to represent a wide spectrum of LD levels. We first calculated the median half-life of r2, defined as the physical distance at which the median r2 between pairs of SNPs is 0.5, for every 5-Mb region using a sliding window of 1 Mb, in CEU, YRI, and JPT+CHB, respectively. We used HapMap II phased haplotypes for the calculation. The five regions we picked are: chromosome 3: 80–85 Mb, chromosome 1: 75–80 Mb, chromosome 4: 57–62 Mb, chromosome 14: 50–55 Mb, and chromosome 8: 18–23 Mb, in decreasing order of LD level. The median half-life of r2 falls around the 90th, 70th, 50th, 30th, and 10th percentiles within each of the three HapMap populations for the five regions, respectively (Supplementary Table S1). Supplementary Figure S1 shows the LD levels for the five residing chromosomes. For each region, we treat the middle 4 Mb as the core region and the 500 kb on each end as flanking regions. Only SNPs imputed in the core region were evaluated to gauge imputation accuracy.
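The half-life statistic above can be approximated from phased haplotypes with a short routine such as the following sketch (the binning choices, names, and scan limit are ours; it assumes phased 0/1 haplotypes and sorted base-pair positions).

```python
import numpy as np

def median_half_life_r2(haps, pos_bp, bin_kb=10, max_kb=500):
    """Approximate the physical distance at which the median pairwise r^2
    between SNPs drops below 0.5, scanning distance bins outward.

    haps   : (n_haplotypes, n_snps) array of 0/1 alleles (phased)
    pos_bp : sorted SNP positions in base pairs
    """
    n_snps = haps.shape[1]
    freq = haps.mean(axis=0)
    n_bins = max_kb // bin_kb
    r2_by_bin = [[] for _ in range(n_bins)]
    for i in range(n_snps):
        for j in range(i + 1, n_snps):
            d_kb = (pos_bp[j] - pos_bp[i]) / 1000.0
            if d_kb >= max_kb:
                break                      # positions are sorted
            denom = freq[i] * (1 - freq[i]) * freq[j] * (1 - freq[j])
            if denom == 0:
                continue                   # monomorphic SNP, r^2 undefined
            # r^2 from haplotype frequencies: D = P(AB) - p_A * p_B
            d_coef = (haps[:, i] * haps[:, j]).mean() - freq[i] * freq[j]
            r2_by_bin[int(d_kb // bin_kb)].append(d_coef * d_coef / denom)
    for b, vals in enumerate(r2_by_bin):
        if vals and np.median(vals) < 0.5:
            return (b + 0.5) * bin_kb      # kb; midpoint of first bin below 0.5
    return None                            # median r^2 stays above 0.5 within max_kb
```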
METHODS COMPARED
We evaluated the following reference selection approaches implemented in MaCH-Admix:
- random selection (MaCH-Admix Random, i.e., original MaCH),
- IBS Piecewise selection (MaCH-Admix IBS-PW),
- IBS Single-Queue selection (MaCH-Admix IBS-SQ),
- IBS Double-Queue selection (MaCH-Admix IBS-DQ),
- Ancestry-weighted selection (MaCH-Admix AW; for the HapMap III datasets).
We also included IMPUTE2 [Howie et al., 2009] and BEAGLE [Browning and Browning, 2009] for comparison. We used IMPUTE2 version 2.1.2 and BEAGLE version 3.3.1 with default settings (-k_hap 500 -iter 30 for IMPUTE2; niterations = 10 nsamples = 4 for BEAGLE). As aforementioned, MaCH-Admix can conduct imputation with precalibrated parameters (similar to IMPUTE2); alternatively, MaCH-Admix can perform imputation together with data-dependent parameter estimation in an integrated mode. The integrated mode generates slightly better results at the cost of increased computing time. Here, we report results from the precalibrated mode.
MEASURE OF IMPUTATION QUALITY
We and others have proposed multiple statistics to measure imputation quality [Browning and Browning, 2009; Li et al., 2009; Lin et al., 2010; Marchini and Howie, 2010], measuring either the concordance rate, correlation, or agreement between the imputed genotypes or estimated allele dosages (the fractional counts of an arbitrary allele at each SNP for each individual, ranging continuously from 0 to 2) and their experimental counterpart. We opt to report the dosage r2 values, which are the squared Pearson correlation between the estimated allele dosages and the true experimental genotypes (recoded as 0, 1, and 2 corresponding to the number of minor alleles), because it is a better measure for uncommon variants by taking allele frequency into account and directly related to the effective sample size for downstream association analysis [Pritchard and Przeworski, 2001]. For the remainder of the work, with no special note, average dosage r2 values will be plotted as a function of approximation level (measured by the effective reference panel size, i.e., t described in Methods section, corresponding to MaCH-Admix’s –states option and IMPUTE2’ s -k option). Hereafter, we use approximation level, effective reference size, t, and #states/-k interchangeably. We note that for standard haplotypes-to-genotype imputation (i.e., using reference haplotypes to imputed target individuals with genotypes), computational costs increase quadratically with the approximation level. MaCH-Admix and IMPUTE2 both also have an approximation parameter at the haplotype-based imputation step, MaCH-Admix’s –imputeStates and IMPUTE2’s -k_hap, which increases the computation time linearly and is by default set at a large value (500). We kept both at the default value because increasing beyond the default has rather negligible effects on imputation quality and that total computing time attributable to the haplotype-based imputation step is typically much smaller compared to –states and -k.