Analysis of 2.1 million SARS-CoV-2 genomes identifies mutations associated with transmissibility

## Metadata Authors: Fritz Obermeyer, Stephen F. Schaffner, Martin Jankowiak, Nikolaos Barkas, Jesse D. Pyle, Daniel J. Park, Bronwyn L. MacInnis, Jeremy Luban, Pardis C. Sabeti, Jacob E. Lemieux Year: 2021 DOI: [10.1101/2021.09.07.21263228](https://doi.org/10.1101/2021.09.07.21263228) ([Open in Zotero](zotero://select/items/@obermeyerAnalysisMillionSARSCoV22021) ) URL: http://medrxiv.org/lookup/doi/10.1101/2021.09.07.21263228 Code: https://github.com/broadinstitute/pyro-cov [Development board for project](https://github.com/broadinstitute/pyro-cov/projects/1) ## Abstract Repeated emergence of SARS-CoV-2 variants with increased transmissibility necessitates rapid detection and characterization of new lineages. To address this need, we developed PyR0, a hierarchical Bayesian multinomial logistic regression model that infers relative transmissibility of all viral lineages across geographic regions, detects lineages increasing in prevalence, and identifies mutations relevant to transmissibility. Applying PyR0 to all publicly available SARS-CoV-2 genomes, we identify numerous substitutions that increase transmissibility, including previously identified spike mutations and many non-spike mutations within the nucleocapsid and nonstructural proteins. PyR0 forecasts growth of new lineages from their mutational profile, identifies viral lineages of concern as they emerge, and prioritizes mutations of biological and public health concern for functional characterization. One Sentence summary A Bayesian hierarchical model of all viral genomes predicts lineage transmissibility and identifies associated mutations. ## Aim Develop a principled approach to modeling transmissibility of SARS-CoV-2 lineages, estimate transmissibility, and model individual mutations in an efficient framework. ## Summary - Hierarchical multinomial regressions model using sequences from the GISAID database to estimate relative transmissibility of SARs-CoV-2 variants with hierarchical structure across lineages and locations. - Use estimates of relative transmissibility for mutations, a baseline reproduction number estimate, and a fixed generation time of 5.5 days to estimate growth rates and reproduction numbers for all variants relative to the Wuhan strain. - Main contribution of the paper is constructing an efficient computational pipeline for the GISAID data using mutation level effects. - Hierarchical multinomial pooled approach is in line with best practices, similar to other implementations but with some novel extensions. - Model is coded in `Pyro` (python and built on `PyTorch`) and fit using stochastic variational inference. This leads to fast fit times but potentially also biased posterior uncertainty (spuriously tight). ![[2021-10-07-covid-strain-pipeline.png]] ![[2021-10-07-covid-strain-rt.png]] ## Comments A very complex and technical piece of work that largely seems to achieve its aims (which are worthwhile). However, hard to say whether they have been fully achieved as the validation is both not extensive enough or analysed in sufficient detail. Excited to see the progression of this work and the approach more generally. Overall well worth a read. ### Major - Ideally the model would be deployed and evaluated in real-time for both its utility and to validate its real-time performance claims. This may actually be happening but if so not mentioned in the text as far as I can tell. - Variational inference produces approximate posterior estimates with substantially reduced run-times compared to MCMC etc. but at the cost of potentially biased posterior uncertainty estimates (usually underestimating). This trade-off is not evaluated in the manuscript. - Manuscript does not discuss viable alternatives for estimating growth rates across strains such as the [Sanger COVID-19 Genomic tracking tool](https://covid19.sanger.ac.uk/lineages/modelled). Doing so would make it easier to understand the novel advantages of this approach (runtime and mutations). - Sources of bias are not considered in the main text or explored in detail in the SI. The discussion says explicitly that the the method is unbiased. but the SI says the following. > However the model is susceptible to the following sources of bias: biased sampling in any (time, region) cell (e.g. sequencing only in case of S-gene target failure); and changes in sampling bias within a single region over time (e.g. a country has a lab in only one city, then spins up a second lab in another distant city with different lineage proportions). > and, > In our case the most notable disadvantage of variational inference is its propensity to yield biased posterior uncertainty estimates. Typically posterior uncertainty is underestimated, leading to credible intervals (CI) that in some cases can be unrealistically narrow. > - Not clear if validation performed on retrospectively available sequence data or sequences available at the time. From figures appears to be retrospectively available data which casts doubt on the claims made about when real-time detection would have been possible. - Validation is mostly graphical, currently hard to follow in the paper as written, and not as convincing as I would really like (leave one out analysis of strains predicted by mutations looks fairly poor for variants of interest (such as Alpha and Delta) for example). - Code is present and results appear to be fully reproducible which is great. However, not as user friendly as it might be and could really do with some more documentation highlighting where everything is (for example the additional validation work present in the code but not the paper). As it stands seems like using this approach will be difficult without involvement of the authors. ### Minor - Model uses a 14 day time-step so presumably will always be lagged by a few weeks even with optimal sequence collection and reporting. Though there appears to be an internal transform to 5.5 days (assumed mean of the generation time). - Very unclear what if any the assumption of a fixed mean generation time has or what role this actually plays in the model. - Little spatial variation found in transmissibility estimates. This could either indicate that there was few sources of bias and therefore the hierarchical structure was not needed or that the model was not sufficiently flexible to capture them. Prior used for pooling of spatial and strain specific variation is very tight potentially leading to this result. - Table 1 which shows increases in growth rates due to mutations does not include credible intervals. Missing comparable table for growth rate differences between variants. - How are importations between regions handled? Model specification shows a single intercept for each region and lineage (pooled) based on prevalence at the start of the time-series. Does this impact/bias growth rate estimates? - Interested to see a variant of the model compared that has no mutation level structure (i.e to explore benefits and negatives of including this). - Logistic prior is unusual for a regression coefficient. Other uses of this in the literature? Could do with a citation and potential exploration. - Model posteriors appear overly precise especially in the UK and more generally. ![[2021-10-07-strain-model-posterior.png]] - Model performance by mutation seems unstable when subsets of data are used (Figure S10). In particular estimates in just Europe or in the world outside Europe agree consistently but disagree with estimates from the whole data set (i.e Europe and the rest of the world). I find this hard to explain as I would have assumed an averaging effect between data subsets. - Case counts are mentioned in the paper and in the code but how are they used for outputs? Are they just used for Figure 2? ## Detailed notes ### Introduction - Identifying lineages with higher transmissibility may help guide outbreak response during the SARS-CoV pandemic. - SARs-CoV-2 has a lot of genomes (over 2.5 million) with both geographic and temporal variability. - Current phylogenetic approaches are computationally inefficient taking days to run (no reference). - Ad-hoc methods to estimate particular lineages are an alternative but cannot capture the complexity of co-circulating lineages (great point and the approach outlined in this manuscript seems broadly the way to go). - Mutation based analysis may allow identification of specific determinants of a lineages phenotype (i.e spike (S) D614G mutation -> more transmissibility). - Using mutations in lineages this approach can be used to forecast the growth rate of unobserved lineages. - **Aim: Develop a principled approach to modelling transmissibility of SARS-CoV-2 lineages, estimate transmissibility, and model individual mutations in an efficient framework** - Modelling only proportion of different lineages rather than absolute number of samples for each lineage and doing so in 14-day intervals across a large number of areas improves the models robustness as mitigates bias due to sampling rates, and changes in transmission. ![[2021-10-07-covid-strain-pipeline.png]] ### Methods #### Data - Sequences from GISAID (collated submissions from around the world usually with a few weeks delay and partial reporting for a few weeks after that). - Lineages: PANGO lineage aliases from https://cov-lineages.org #### Model - Bayesian, hierarchical multinomial logistic regression model using `Pyro` (based on`PyTorch` ). - Samples aggregated by location (country and administrative level 1 region with at least 50 samples). - Regions without at least two time periods of data are dropped (this means some sequences will be dropped as not re-aggregated to country level). - Uses hierarchical structure vs overdispersion (Dirichlet-Multinomial) to account for variability as improved performance (not shown). ![[2021-10-07-strain-model.png]] - $\alpha_{ps}$: initial prevalence of each lineage in each region. - $\beta_{ps}$ location specific growth rate modeled around global growth rates (constructed from mutation effects. This means that quoted strain transmissibility differences are estimated from mutation level differences directly with no additional variation and that these effects can vary across all locations and strains (with the same standard deviation used for pooling). I would have thought some kind of nested structure here may have been sensible to explore. - Very tight prior for between region growth rate variation. - Logistic prior is very unusual for a regression coefficient. Citation? - Sources of bias: > However the model is susceptible to the following sources of bias: biased sampling in any (time, region) cell (e.g. sequencing only in case of S-gene target failure); and changes in sampling bias within a single region over time (e.g. a country has a lab in only one city, then spins up a second lab in another distant city with different lineage proportions). > #### Validation - Compared results from full data-set to disjoint subsets. Compared using correlation coefficient. Stratified by region and conducted two-fold cross validation for both lineages and mutations. - Explored a model that grouped mutations. Found no pairwise correlations. GISAID may be underpowered. Not clear where this analysis is in the manuscript. - Consistency of results across data subsets appears to imply that this hierarchical approach is not working as intended if variation is present or that no variation/bias is present. Perhaps subsets are just not small enough to see variation due to bias? ![[2021-10-07-strain-regional-variation.png]] - Comparing estimated transmissibility from mutations for variants with estimate when variant data is present. Performance is okay but looks a bit biased and not very good for highly transmissible variants which is what we presumably really care about. Not my area so it may be that this performance is better than previously available but not shown here. ![[2021-10-07-loo-validation.png]] - Retrospective performance appears to not account for when sequences were available to be used. So in England for examples sequences may be lagged by several weeks and take several more weeks to be fully reported. This means backcast behaviour is likely optimistic. Backtested forecast performance (Table S1) is difficult for me to parse. What is good or bad performance? How does it compare to any other approach one might use? ![[2021-10-07-backcast-performance.png]] - Model performance by mutation seems unstable when subsets of data are used. In below figure effect sizes are lower for both the partial estimates fairly consistently and then higher when all the data is aggregated. I find this hard to explain (I would have thought they should effectively average in the overall data set). ![[2021-10-07-mutations-subset-analysis.png]] - I have no idea what good or bad looks like for variational inference diagnostics and there is no hand holding here. These are the settings and resulting convergence, > Variational inference is performed for 10,000 iterations with the Adam optimizer with clipped gradients and an exponentially decreasing learning rate schedule and initial learning rates between 0.05 and 0.0025 for different parameter groups (see Figure S15). Optimization proceeds in batch-mode, i.e. without any data subsampling. We initialize model parameters to median prior values with a small amount of noise added to avoid scale parameters collapsing early in training. After inference we make predictions by drawing 1000 posterior samples. See source code for detailed optimizer and initialization configuration. > ![[2021-10-07-convergence-diagnostics.png]] ### Results - Upwards trend in transmissibility over time. - Growth rates estimates varied little across spatial data subsets (so did the hierarchical approach really mitigate bias). - Estimated 2.6-fold (95% CI, 2.27-2.97) growth rate advantage for Delta vs wild type using sequences from July (available in mid to late August). ![[2021-10-07-covid-strain-rt.png]] - Model can forecast growth rate of unobserved lineages using mutations alone. - Tested this claim using leave-one out analysis. Showed estimated agreement with observed behavior (assessed using Pearson correlation coefficient p = 0.91). - Model could have provided an early warning for the rise of VoCs had it been applied to SARS-CoV-2 data. Dominance of Alpha from November 2020 and Delta from late April 2021. - Variant specific modelling efforts were accurate and useful but efforts were limited to particular lineages. and geographic regions. - Predictive ability was tested and forecasts were found to be reliable for 1-2 months. - Model also allows contribution of amino acids to be understand (though see supplement these estimates are likely spuriously precise). - Detected regions that have been found elsewhere to be linked to transmissibility. - Explored a model that grouped mutations. Found no pairwise correlations ### Discussion - Model provides unbiased (aside from variational inference induced bias and bias from sampling discussed in the SI) approach for detecting viral lineages with increased growth rate. - Assesses contribution of individual mutations and aggregates across all lineages + geographic regions. ### Code #### Positives - Code available + has a README. - From a read through looks reproducible but with quite a bit of work and likely not all parts of the validation. - Makefiles used for updating the project. - [Tutorial.md](https://github.com/broadinstitute/pyro-cov/blob/master/Tutorial.md) gives some guidance on reproducing but not clear its complete. - Evidence in the code of additional validation that didn't make it into the paper. Would have been great to see some of this ([here](https://github.com/broadinstitute/pyro-cov/blob/9496e8d2d78239950f58470afd516b9dd9e5d594/pyrocov/mutrans.py#L862) for example). - Overall code quality and software engineering looks pretty good for research code (though as not good at python hard to go beyond code smell indicators). #### Negatives - Code base is under active development (which is a good thing) but not clear that the analysis as presented in the manuscript can be fully reproduced without significant work to understand the versions of the model used etc. - Work may be in theory usable in other contexts but may hard to do so in practice without developer help. - README is missing a basic overview of what all the many files are (for example where is the main model and its documentation). - Lots of hard coded paths + Google cloud commands. - Additional unrelated code present in the repository (such as the `strains` model which appears to be unused but cool). It may have been better to have a package repository for the model and tooling and then a paper repository for the application specific code and paper write up. - Installation of the tooling seems surprisingly hard for a python package which otherwise looks well developed (perhaps due to the analysis and package sharing a repository?). #### Useful links - [Model as presented in the paper](https://github.com/broadinstitute/pyro-cov/blob/master/pyrocov/mutrans.py) - [Another model that includes strains, cases and deaths](https://github.com/broadinstitute/pyro-cov/blob/master/pyrocov/strains.py) - [Project board showing open development (amazing to see)](https://github.com/broadinstitute/pyro-cov/projects/1) ## Tags [[Papers]] #papers #papers/COVID19 #papers/real-time-analysis #python/pyro #inference/variational-inference #papers/strain-dynamics #papers/hierarchical