BMC Genomics

Identifying SARS-CoV-2 variants using genetic sequences of the virus’s spike protein

Abstract

Spike-only nucleotide sequences can reliably identify many SARS-CoV-2 lineages, including the main variants of concern.

  • Spike-only sequences can be shared among multiple Pango lineages, sometimes numbering in the tens or hundreds.
  • The concept of 'lineage sets' is introduced, representing the range of Pango lineages consistent with observed mutations.
  • Identifying lineages using spike-only sequences provides a foundation for developing software tools for lineage classification.
  • The findings suggest that spike sequences hold significant information for understanding SARS-CoV-2 genetic diversity.

Key numbers

600
Lineages per spike sequence
One spike sequence was observed in 600 different Pango lineages.
337
Lineage sets with unique identification
Sets containing a single lineage uniquely distinguished by consensus spike haplotypes.
1296
Pango lineages analyzed
Total number of Pango lineages with complete representative spike proteins.

Key figures

Fig. 1
Genetic variation patterns in the SARS-CoV-2 spike protein across genome sequences
Highlights the distribution and frequency of different mutation types across the spike protein, spotlighting mutation hotspots
  • Panel A
    Number of sequences with non-synonymous mutations at each amino acid position in the spike protein (log10 scale)
  • Panel B
    Number of sequences with synonymous mutations at each nucleotide position in the spike protein (log10 scale)
  • Panel C
    Number of sequences with insertions and deletions (indels) at each nucleotide position in the spike protein (log10 scale), with points marking the 5′ nucleotide site at the start of each indel
Fig. 2
Genetic diversity of mutations in four main SARS-CoV-2 variants of concern
Highlights distinct mutation patterns and frequencies across major SARS-CoV-2 variants of concern in the spike protein
  • Panel A
    Spike protein mutations in lineage B.1.1.7 (alpha) showing numerous non-synonymous mutations spread across spike positions with some insertions/deletions
  • Panel B
    Spike protein mutations in lineage B.1.351 (beta) with fewer mutations overall and scattered non-synonymous and synonymous mutations
  • Panel C
    Spike protein mutations in lineage P.1 (gamma) showing sparse non-synonymous mutations at specific spike positions
  • Panel D
    Spike protein mutations in lineage B.1.617.2 (delta) with limited non-synonymous mutations and some insertions/deletions
Fig. 3
Proportion of ambiguous nucleotide sites in SARS-CoV-2 spike sequences with Pango lineage designations
Highlights that most spike sequences have low ambiguity, supporting reliable Pango lineage assignment from spike data
  • Panel single
    Histogram showing counts of sequences by proportion of ambiguous nucleotide sites, with most sequences having very low ambiguity near 0.00
Fig. 4
Proportion of ambiguous nucleotide sites in spike gene sequences across eight Pango lineages
Highlights variation in ambiguity proportions across small lineage groups, spotlighting challenges in spike-only sequence lineage assignment
  • Panel single
    Scatter plot of individual sequences showing proportion of spike nucleotide sites with ambiguity codes for each of eight Pango lineages; number of sequences per lineage is indicated above each group
Fig. 5
Distribution of spike nucleotide sequences by the number of Pango lineages they appear in
Highlights that most spike sequences are unique to one lineage but some overlap widely across many lineages
  • Panel single
    Histogram showing most spike nucleotide sequences occur in only one lineage, with counts decreasing as the number of lineages per sequence increases; a few sequences appear in many lineages, including one found in 658 lineages
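
The ambiguity proportion plotted in Figs. 3 and 4 can be illustrated with a minimal sketch. It assumes that any character other than A, C, G or T counts as an ambiguity code and that alignment gaps are excluded from the denominator; both choices are assumptions made for illustration, not the authors' stated procedure, and the function name `ambiguity_proportion` is hypothetical.

```python
# Bases treated as unambiguous; every other IUPAC code (N, R, Y, ...) is
# counted as ambiguous. This convention is an assumption for illustration.
UNAMBIGUOUS = set("ACGT")


def ambiguity_proportion(seq: str, ignore_gaps: bool = True) -> float:
    """Return the fraction of nucleotide sites in seq that are ambiguous."""
    # Optionally drop alignment gaps so they do not count as sites.
    sites = [base for base in seq.upper() if not (ignore_gaps and base == "-")]
    if not sites:
        return 0.0
    ambiguous = sum(1 for base in sites if base not in UNAMBIGUOUS)
    return ambiguous / len(sites)


# Toy sequences (not real spike data).
print(ambiguity_proportion("ACGTNNRY"))
print(ambiguity_proportion("AC-GT"))
```

A value near 0.00, as most sequences in Fig. 3 show, means nearly every site was called unambiguously.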

Full Text

What this is

  • Over 2 million SARS-CoV-2 genome sequences have been generated, aiding public health responses.
  • The Pango nomenclature system classifies SARS-CoV-2 lineages based on complete or near-complete genomes.
  • This research investigates the classification of SARS-CoV-2 lineages using spike-only nucleotide sequences.
  • The authors propose a 'lineage set' concept to represent multiple Pango lineages associated with a given spike sequence.

Essence

  • Many SARS-CoV-2 lineages, including key variants of concern, can be identified using only spike nucleotide sequences. The study introduces 'lineage sets' to capture the range of Pango lineages corresponding to observed mutations in spike sequences.

Key takeaways

  • Spike-only sequences can reliably identify many Pango lineages, especially variants of concern. However, some sequences are shared among numerous lineages, complicating precise lineage assignment.
  • The concept of 'lineage sets' allows classification of spike sequences into groups of related Pango lineages. This approach acknowledges the uncertainty in lineage designation when using partial genomic data.
  • The findings support the development of software tools for assigning spike sequences to lineage sets, enhancing genomic surveillance and outbreak response capabilities.

Caveats

  • Not all Pango lineages can be distinguished by spike sequences alone, particularly those defined by mutations outside the spike protein. This limitation may affect the accuracy of lineage assignments.
  • The study relies on existing genomic data, which may not capture all relevant mutations or lineages, potentially biasing the findings.

Definitions

  • lineage set: The set of Pango lineages consistent with the mutations observed in a given spike sequence. A set with more than one member reflects uncertainty in classification.
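
As a rough illustration of the idea, the sketch below groups labelled spike sequences into lineage sets by exact sequence match. This is a simplification: the study describes lineage sets in terms of lineages consistent with observed mutations, whereas this example only compares whole strings. The function names, the toy sequences and the lineage labels (`LIN1`, `LIN2`, `LIN3`) are all hypothetical.

```python
from collections import defaultdict


def build_lineage_sets(records):
    """Map each spike sequence to the set of lineages it was observed in.

    records: iterable of (spike_sequence, lineage_name) pairs.
    """
    sets_by_spike = defaultdict(set)
    for spike, lineage in records:
        sets_by_spike[spike].add(lineage)
    # Freeze the sets so they can be compared and used as dictionary keys.
    return {spike: frozenset(names) for spike, names in sets_by_spike.items()}


def assign_lineage_set(spike, lineage_sets):
    """Look up the lineage set for a query spike; empty set if unseen."""
    return lineage_sets.get(spike, frozenset())


# Toy data: hypothetical sequences and lineage labels, not real Pango data.
records = [
    ("ATGTTT", "LIN1"),
    ("ATGTTT", "LIN2"),
    ("ATGCTT", "LIN3"),
    ("ATGTTT", "LIN1"),
]
lineage_sets = build_lineage_sets(records)

# Spike sequences whose lineage set holds exactly one lineage.
unique = [s for s, names in lineage_sets.items() if len(names) == 1]
print(sorted(assign_lineage_set("ATGTTT", lineage_sets)), unique)
```

Under these assumptions, a spike sequence whose set holds a single lineage is uniquely identified, while a larger set records every lineage the sequence could belong to.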
