Global and Local Persistence of Influenza A(H5N1) Virus

109 23
Global and Local Persistence of Influenza A(H5N1) Virus

Materials and Methods

Sequence Data and Genetic Diversity

All available sequences of the hemagglutinin gene of HPAI (H5N1) viruses isolated from avian hosts were obtained from Influenza Virus Resources at the National Center for Biotechnology Information ( These sequences were aligned by using MUSCLE.. After short sequences (>60 bp shorter than the full-length hemagglutinin-1) were removed, the final dataset included 3,365 sequences from 9 geographic regions, which, to our knowledge, made it the largest influenza A(H5N1) virus dataset analyzed (Table 1; online Technical Appendix Figure 1, Therefore, we consider that the sequence data available in the database are informative and representative of the geographic distribution and global circulation of HPAI (H5N1) viruses, although they were not obtained through systematic global influenza virus sampling that was random in terms of time and space.

To evaluate whether the classification of 9 regions was appropriate, we used the same method for estimating nucleotide diversity of avian influenza A(H5N1) virus that had been used for influenza A(H3N2) virus. Within-region nucleotide diversity was estimated in terms of πw .

where n is the number of regions and refers to diversity estimates in which both samples in each pair are from region i.

The overall between-region diversity was estimated as πb.

where π refers to diversity estimates in which 1 sample is from region i and the other sample is from region j.

Confidence intervals were estimated by taking 1,000 bootstrap replicates from the total pool of sequences. FST (genetic distance) was calculated as (πb−πw)/πb, with FST >0 indicating genetic isolation among regions and supporting the geographic classification mentioned above.

Estimating Global Parameters and Testing Geographic Association at Tips

The program Migrate requires input of 2 parameters: the transition/transversion ratio (κ) and the rate of nucleotide substitution (μ). We estimated these parameters by using the Bayesian phylogenetic method implemented in BEAST version 1.7.2. For all analyses, we used the uncorrelated lognormal relaxed molecular clock to accommodate rate variation among lineages. We used the HKY85 model of nucleotide substitution to parameterize the mutational process; equilibrium nucleotide frequencies were derived from observed frequencies; equilibrium nucleotide frequency rates were homogeneous across sites. Posterior distributions of parameters were estimated by using Markov chain Monte Carlo (MCMC) sampling. Samples were drawn every 5,000 steps over a total of 4.0 × 10 steps, and 2.5 × 10 steps were removed as burn-in. The transition/transversion ratio (κ) was estimated to be 9.163 (95% CI 8.633–9.676). The rate of nucleotide substitution () was estimated to be 5.595 × 10 substitutions/site/year (95% CI 5.249 × 10– 5.970 × 10 substitutions/site/year).

To estimate the extent of geographic structure (extent to which viruses from the same geographic region are more likely to cluster together in the phylogenetic tree than expected by chance) in the HPAI (H5N1) influenza virus populations, we performed a phylogenetic-trait association analysis on the posterior distribution of trees produced by BEAST. These geographic regions were coded onto the tips of the 3,000 trees sampled from the posterior, which were then analyzed by using the maximum monophyletic clade size statistic implemented in the Bayesian Analysis of Time Series program with 1,000 randomizations. For each of the 9 regions included in the analysis, the Bayesian Analysis of Time Series program was used to calculate a p value, which indicated whether the sequences from this region are more inclined to cluster together in the tree than expected by chance.

Estimating Migration Rates between Regions through Resampling

To estimate coalescent parameters for each geographic region, we used an MCMC technique implemented in Migrate version 3.3.0. The prior distribution of Θ (mutation-scaled population size) and 2 Nm (migration rate) values was assumed to be exponential with a mean of 1, and mutational parameters were fixed in the analyses. To minimize the influence of potential sampling biases on our results, we performed independent analyses of 100 resampled replicates. For each replicate, we randomly sampled 50 sequences without replacement from each region (online Technical Appendix Table 2 For each of the 100 bootstrap replicates, 50 MCMC simulations were run for 6 × 10 steps each. The first 5 × 10 steps of each chain were removed as burn-in. Parameter values were sampled every 10 steps. Convergence was assessed visually and through comparison of chains by using the Gelman-Rubin convergence statistic. We combined the remaining samples from each chain to give a total of 5,000 samples for each of the resampled replicates. Estimates of migration rates varied little across the 100 replicates (online Technical Appendix Table 1; mean values are shown in Table 2.

Genealogical Inference and Trunk Extraction

Most of the viruses we analyzed were isolated from domestic chickens and ducks. To infer the genealogy, we reduced the dataset so that a maximum of 30 sequences per year were sampled from chickens and ducks from each region. Combining this subset of sequences with those from the other hosts yielded a final dataset of 2,392 sequences; this dataset had notably fewer sequences from Africa, China, and Southeast Asia than did the original dataset (online Technical Appendix Table 2 We fixed Θ and 2Nm at the values estimated in the previous analysis.

We ran 4 MCMC chains for 2 × 10 steps each, of which the first 10 steps were removed as burn-in; genealogies were sampled every 10 steps. We combined the remaining samples of 4,000 genealogical trees and performed trunk reconstruction on them.

Trunk extraction and processing was performed by using the program PACT (, which is able to estimate the mean and 95% credible interval for the proportion of the trunk assigned to each geographic region. The bigger the proportion, the more the corresponding region accounts for virus variation and evolution. By calculating the proportion of sampled genealogies for which the trunk is assigned to a particular region at different points in time, we could also assess the temporal dynamics, which illustrate the annual change of trunk proportion for each region.

Testing the Robustness of the Results

For many sequences, the time of isolation is known to the nearest year only, which limits the precision of estimates of relative genetic diversity through time. Therefore, we further analyzed a subsample in which we included only sequences isolated during 2006–2011 and for which detailed isolation times were available. This dataset included 1,173 samples from 6 geographic regions. We repeated the above analyses on this subset of the dataset. Our sampling strategies can be found in online Technical Appendix Table 3 MCMC simulations were run for 4 × 10 steps, and 2 × 10 steps were removed as burn-in.

Estimating Genetic Diversity of Each Region

To estimate the relative genetic diversity of each region through time, we extracted sequences of viruses from each region from the subsampled subdataset with 1,173 sequences from 2006 through 2011, which composed 5 new datasets: for Africa, Southeast Asia, China, southern Asia, and Europe (including Siberia). Each of these datasets was analyzed by using the Bayesian skyride method in BEAST. Because the sizes of these subdatasets differed, we ran MCMC simulations for different steps for each and collected samples every 10 steps.

Subscribe to our newsletter
Sign up here to get the latest news, updates and special offers delivered directly to your inbox.
You can unsubscribe at any time

Leave A Reply

Your email address will not be published.