Phyloepidemiology and adaptive evolution of SARS-CoV2 during the ﬁrst and second wave of COVID-19 in India

,


Introduction
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which was first identified in Wuhan City in the Hubei Province of China in December 2019, has now spread worldwide (1)(2)(3).In India, the first case of COVID-19 was identified on January 27th, 2020 in Kerala (4).During this ongoing pandemic, whole-genome sequencing (WGS) has been employed globally to gain insights into the geographical distribution of variants of SARS-CoV-2.As a result, WGS is a powerful tool to understand the mutations occurring across different segments of the viral genome and the selection pressure affecting various genes.

Introduction
The severe acute respiratory syndrome coronavirus-2 (SARS-CoV-2), which was first identified in Wuhan City in the Hubei Province of China in December 2019, has now spread worldwide (1)(2)(3).In India, the first case of COVID-19 was identified on January 27th, 2020 in Kerala (4).During this ongoing pandemic, whole-genome sequencing (WGS) has been employed globally to gain insights into the geographical distribution of variants of SARS-CoV-2.As a result, WGS is a powerful tool to understand the mutations occurring across different segments of the viral genome and the selection pressure affecting various genes.
The B.1.617.1,B.1.617.2, and B.1.617.3 lineages of SARS-CoV-2 were first detected in India in December 2020 and have since become increasingly prevalent.These lineages are distinct and differ in their characteristic mutations (5).Therefore, it is crucial to understand the genetic variation among the different variants circulating in a country to discern mutation patterns among SARS-CoV-2 variants.

Phyloepidemiology and adaptive evolution of SARS-CoV2 during the first and second wave of COVID-19 in India
transmissibility and generate virus escape mutants from convalescent sera (10)(11)(12).Due to the large number of genetic variants of SARS-CoV-2, the virus exists as a quasispecies, as revealed by next-generation sequencing and single nucleotide polymorphism (SNP) analysis (13,14).
Previous studies in India have assessed the mutational profile of SARS-CoV-2 during the second wave of COVID-19, which identified the presence of mutations such as L452R, T478K, E484Q, D614G, and P681R (15,16).It was also found that mutations such as L452R, T478K, and E484Q may result in increased ACE2 binding (15), and P681R may increase the rate of S1-S2 cleavage (15).As of March 25th, 2022, there have been 1,660 new cases of SARS-CoV-2 reported in India, (https://www.worldometers.info/coronavirus/country/india/)It has been observed that human RNA viruses such as HIV-1, influenza viruses, SARS-CoV, and hepatitis C viruses may undergo adaptive changes to evade the host's immune system and establish an infection in a new host (17).The devastating second wave of COVID-19 in India has resulted in numerous deaths (18).Consequently, questions have arisen regarding the adaptive changes that SARS-CoV-2 has undergone during the pandemic's second wave in India.SARS-CoV-2 infections across different countries have affected individuals with varying immunological strength, age, sex, and environmental conditions, imposing selection pressure on the virus.A previous study found that positive selection of ORF1ab, ORF3a, and ORF8 genes drove the early evolutionary trends of SARS-CoV-2 during the 2020 COVID-19 pandemic (19).Therefore, this study aimed to detect the genetic diversity and crucial mutational events of SARS-CoV-2 strains collected from different Indian states at different time points from June 2020 to June 2021.The study investigated the various lineages circulating during the pandemic's first and second wave, their mutations, and selection pressure across the SARS-CoV-2 genome.

Genome sequences retrieval and alignment
In this study, we retrieved a total of 1451 complete and high-coverage SARS-CoV-2 viral genome sequences from the Global Initiative on Sharing All Influenza Database (GISAID) platform and National Center for Biotechnology Information (NCBI).To ensure the quality of the data, only sequences affecting human hosts were selected, and lowquality sequences with more than 5% NNNs and other ambiguous characters were removed using BioEdit 7.2 software.We included only full-length sequences with more than 29000 nucleotides, including the reference sequence MN908947.3,for further analysis.As a result, 1106 sequences were eligible for further analysis, after removing 344 sequences with >5% NNNs and other ambiguous characters.

Multiple sequence alignment and lineage prediction
To analyze the evolutionary trends of SARS-CoV-2 during the first and second waves of COVID-19 in India, we divided the samples into two groups: the first wave, which spanned from June 2020 to December 2020, and the second wave, which started from 15th January 2021 to June 2021, with a peak in April 2021, as reported by Worldometer (https://www.worldometers.info/coronavirus/country/india/) To align the consensus sequences with the reference sequence of the SARS-CoV-2 Wuhan-Hu-1 isolate (GenBank accession number MN908947.3),we used Multiple Alignment using Fast Fourier Transform (MAFFT) (20).Additionally, we predicted the lineages using the Pangolin COVID-19 lineage assigner (https://pangolin.cog-uk.io/)(5).

Mapping nucleotide substitutions in the genome and expressed proteins of SARS CoV-2
The aligned sequences were mapped with the reference sequence of the SARS-CoV-2 Wuhan-Hu-1 isolate (GenBank accession number MN908947.3) to identify mutations across the genome using an online software tool (21).

Selection pressure amongst the lineages of SARS-CoV-2
To detect positive (diversifying) and negative (purifying) selection, a combination of different analysis methods was employed.Single-Likelihood Ancestor Counting (SLAC) (22), fixed effects likelihood (FEL) (22), and mixed effects model of evolution (MEME) (23) were used to analyze selection pressure using the datamonkey server.The results were expressed as dN/dS at a 0.1 level of significance.These methods utilize maximum likelihood to infer nonsynonymous (dN) and synonymous (dS) substitution rates on a per site basis for a given coding alignment and corresponding phylogeny (24).While SLAC and FEL identified sites experiencing pervasive diversifying or purifying selection, MEME detected sites experiencing both pervasive and episodic diversifying selection (25).The mutated sites exhibiting selective pressure were annotated using Molecular Evolutionary Genetics Analysis (MEGA 11) software.

Prediction of the effect of non-synonymous mutations under positive selective pressure on the functionalities of SARS-CoV-2
To predict the effect of non-synonymous mutations on the functionalities of SARS-CoV-2, the Protein Variation Effect Analyser (PROVEAN) software was used.This analysis assumes that protein sequences that are evolutionarily conserved among living organisms have survived natural selection (26).Mutations were predicted to have a deleterious effect if they had a threshold value below the default value of -2.5.

Phylogenomic analysis
A maximum likelihood tree was generated in the CIPRES Science Gateway using the generalized time reversible (GTR) nucleotide substitution model with 1000 bootstrap transmissibility and generate virus escape mutants from convalescent sera (10)(11)(12).Due to the large number of genetic variants of SARS-CoV-2, the virus exists as a quasispecies, as revealed by next-generation sequencing and single nucleotide polymorphism (SNP) analysis (13,14).
Previous studies in India have assessed the mutational profile of SARS-CoV-2 during the second wave of COVID-19, which identified the presence of mutations such as L452R, T478K, E484Q, D614G, and P681R (15,16).It was also found that mutations such as L452R, T478K, and E484Q may result in increased ACE2 binding (15), and P681R may increase the rate of S1-S2 cleavage (15).As of March 25th, 2022, there have been 1,660 new cases of SARS-CoV-2 reported in India, (https://www.worldometers.info/coronavirus/country/india/)It has been observed that human RNA viruses such as HIV-1, influenza viruses, SARS-CoV, and hepatitis C viruses may undergo adaptive changes to evade the host's immune system and establish an infection in a new host (17).The devastating second wave of COVID-19 in India has resulted in numerous deaths (18).Consequently, questions have arisen regarding the adaptive changes that SARS-CoV-2 has undergone during the pandemic's second wave in India.SARS-CoV-2 infections across different countries have affected individuals with varying immunological strength, age, sex, and environmental conditions, imposing selection pressure on the virus.A previous study found that positive selection of ORF1ab, ORF3a, and ORF8 genes drove the early evolutionary trends of SARS-CoV-2 during the 2020 COVID-19 pandemic (19).Therefore, this study aimed to detect the genetic diversity and crucial mutational events of SARS-CoV-2 strains collected from different Indian states at different time points from June 2020 to June 2021.The study investigated the various lineages circulating during the pandemic's first and second wave, their mutations, and selection pressure across the SARS-CoV-2 genome.

Genome sequences retrieval and alignment
In this study, we retrieved a total of 1451 complete and high-coverage SARS-CoV-2 viral genome sequences from the Global Initiative on Sharing All Influenza Database (GISAID) platform and National Center for Biotechnology Information (NCBI).To ensure the quality of the data, only sequences affecting human hosts were selected, and lowquality sequences with more than 5% NNNs and other ambiguous characters were removed using BioEdit 7.2 software.We included only full-length sequences with more than 29000 nucleotides, including the reference sequence MN908947.3,for further analysis.As a result, 1106 sequences were eligible for further analysis, after removing 344 sequences with >5% NNNs and other ambiguous characters.

Multiple sequence alignment and lineage prediction
To analyze the evolutionary trends of SARS-CoV-2 during the first and second waves of COVID-19 in India, we divided the samples into two groups: the first wave, which spanned from June 2020 to December 2020, and the second wave, which started from 15th January 2021 to June 2021, with a peak in April 2021, as reported by Worldometer (https://www.worldometers.info/coronavirus/country/india/) To align the consensus sequences with the reference sequence of the SARS-CoV-2 Wuhan-Hu-1 isolate (GenBank accession number MN908947.3),we used Multiple Alignment using Fast Fourier Transform (MAFFT) (20).Additionally, we predicted the lineages using the Pangolin COVID-19 lineage assigner (https://pangolin.cog-uk.io/)(5).

Mapping nucleotide substitutions in the genome and expressed proteins of SARS CoV-2
The aligned sequences were mapped with the reference sequence of the SARS-CoV-2 Wuhan-Hu-1 isolate (GenBank accession number MN908947.3) to identify mutations across the genome using an online software tool (21).

Selection pressure amongst the lineages of SARS-CoV-2
To detect positive (diversifying) and negative (purifying) selection, a combination of different analysis methods was employed.Single-Likelihood Ancestor Counting (SLAC) (22), fixed effects likelihood (FEL) (22), and mixed effects model of evolution (MEME) (23) were used to analyze selection pressure using the datamonkey server.The results were expressed as dN/dS at a 0.1 level of significance.These methods utilize maximum likelihood to infer nonsynonymous (dN) and synonymous (dS) substitution rates on a per site basis for a given coding alignment and corresponding phylogeny (24).While SLAC and FEL identified sites experiencing pervasive diversifying or purifying selection, MEME detected sites experiencing both pervasive and episodic diversifying selection (25).The mutated sites exhibiting selective pressure were annotated using Molecular Evolutionary Genetics Analysis (MEGA 11) software.

Prediction of the effect of non-synonymous mutations under positive selective pressure on the functionalities of SARS-CoV-2
To predict the effect of non-synonymous mutations on the functionalities of SARS-CoV-2, the Protein Variation Effect Analyser (PROVEAN) software was used.This analysis assumes that protein sequences that are evolutionarily conserved among living organisms have survived natural selection (26).Mutations were predicted to have a deleterious effect if they had a threshold value below the default value of -2.5.

Phylogenomic analysis
A maximum likelihood tree was generated in the CIPRES Science Gateway using the generalized time reversible (GTR) nucleotide substitution model with 1000 bootstrap values.The resulting tree was visualized using iTOL (https://itol.embl.de/)and MEGA 11.The phylogenetic tree expressing the Nextstrain clades was created using https://clades.nextstrain.org/.

Results
Out of the total of 1451 downloaded sequences, 344 sequences (23.7%) were removed due to the presence of >5% NNNs and other ambiguous characters.Hence, 1106 sequences (76.3%) were used for further analysis.

Table 1 Lineages of SARS-Co-V2 circulating during 2020 and 2021 in India
Lineages circulating during the first and second waves of COVID-19 in India Lineages circulating during the first and second waves of COVID-19 in India have been comparatively evaluated.From June 2020 to December 2020 (first wave), a total of 35 lineages were in circulation, while from January 2021 to June 2021, a total of 20 lineages were in circulation (Table 1).In 2021 (second wave), fourteen lineages were common as in the first wave, and six new lineages were found to be circulating.Out of the six new circulating lineages, one was VUM B.1.617.1 lineage (Kappa variant), and the other two were VOCs B.1.351lineage (Beta variant) and B.1.617.2 lineage (Delta variant).In the first wave, one case was from a virus of an undesignated lineage, and in the second wave, seventeen cases were from viruses of undesignated lineage.
In the first wave, six sequences were of VOC-B.1.1.7,whereas in 2021, sixty-nine sequences were found to be of the B.1.1.7 lineage, which was more than ten times higher than the number of B.1.1.7 cases found in the first wave.Another VOC, B.1.351(Beta variant), did not exist during 2020, but in 2021, eight cases were found to be due to this VOC.The most drastic surge was for the VOC-B.1.617.2 lineage (Delta variant).This VOC did not exhibit any case in the first wave, whereas in the second wave, 181 cases were due to this lineage (Table 1, Fig. 1).At the same time, one VUM-B.1.617.1 lineage (Kappa variant) was also prevalent in the second wave, which was absent in the first wave, but in the second wave, forty-five cases were due to the B.1.617.1 lineage.values.The resulting tree was visualized using iTOL (https://itol.embl.de/)and MEGA 11.The phylogenetic tree expressing the Nextstrain clades was created using https://clades.nextstrain.org/.

Results
Out of the total of 1451 downloaded sequences, 344 sequences (23.7%) were removed due to the presence of >5% NNNs and other ambiguous characters.Hence, 1106 sequences (76.3%) were used for further analysis.

Lineages circulating during the first and second waves of COVID-19 in India
Lineages circulating during the first and second waves of COVID-19 in India have been comparatively evaluated.From June 2020 to December 2020 (first wave), a total of 35 lineages were in circulation, while from January 2021 to June 2021, a total of 20 lineages were in circulation (Table 1).In 2021 (second wave), fourteen lineages were common as in the first wave, and six new lineages were found to be circulating.Out of the six new circulating lineages, one was VUM B.1.617.1 lineage (Kappa variant), and the other two were VOCs B.1.351lineage (Beta variant) and B.1.617.2 lineage (Delta variant).In the first wave, one case was from a virus of an undesignated lineage, and in the second wave, seventeen cases were from viruses of undesignated lineage.
In the first wave, six sequences were of VOC-B.1.1.7,whereas in 2021, sixty-nine sequences were found to be of the B.1.1.7 lineage, which was more than ten times higher than the number of B.1.1.7 cases found in the first wave.Another VOC, B.1.351(Beta variant), did not exist during 2020, but in 2021, eight cases were found to be due to this VOC.The most drastic surge was for the VOC-B.1.617.2 lineage (Delta variant).This VOC did not exhibit any case in the first wave, whereas in the second wave, 181 cases were due to this lineage (Table 1, Fig. 1).At the same time, one VUM-B.1.617.1 lineage (Kappa variant) was also prevalent in the second wave, which was absent in the first wave, but in the second wave, forty-five cases were due to the B.1.617.1 lineage.

Mutational events across the genome segment of SARS-CoV-2 lineages
In this study, we analyzed the mutational events across the genome segment of SARS-CoV-2 lineages circulating in India from June 2020 to June 2021.Out of the 1451 downloaded sequences, 344 were removed due to ambiguous characters and NNNs, leaving 1106 sequences for further analysis.The ten most mutated samples belonged to the Delta variant of B.1.617.2 lineage, all of which were found in the second wave, suggesting the severity of the second wave compared to the first (Fig. 2).

Figure 2 Ten most mutated samples during one year time period from June 2020 to June 2021
We identified a total of 1031 mutational events from June 2020 to June 2021.The most frequent events per class in decreasing order of their occurrence were SNP, SNP_silent, e x t r a g e n i c , S N P _ s t o p , d e l e t i o n _ f r a m e s h i f t , insertion_frameshift.The most common nucleotide mutations were A23403G, C3037T, C241T, C14408T, C25563T, C26735T, C18877T, C22444T, C28854T, G210T (Fig. 3.a), while the most frequent protein mutations were S: D614G, NSP3:F106F, NSP 12b: P314L, ORF3a: Q57H, M: Y71Y, NSP14C279C, S: D294D, N: S194L (Fig. 3.b).

Prevalence of L452R, T478K, E484Q, N501Y, and D614G mutations in the RBD of SARS-CoV-2
We also investigated the prevalence of mutations in the receptor-binding domain (RBD) of SARS-CoV-2.Out of the 1106 samples analyzed, 1050 had the D614G mutation, with a prevalence rate of 94.93%.The L452R mutation was found in 233 samples (21.06% prevalence rate), T478K in 186 samples (16.81% prevalence rate), N501Y in 96 samples (8.77% prevalence rate), and E484Q in 47 samples (4.24% prevalence rate) (Table 2, Fig. 3C).The frequency of all mutations was significantly higher in the second wave, except for D614G, which was present in both waves and was the most frequent mutational event observed.L452R and T478K were not present in the first wave but were found in significantly higher numbers during the second wave (Table 2).E484Q and N501Y were sporadically present in the first wave but were present in significantly higher numbers in the second wave.

Mutational events across the genome segment of SARS-CoV-2 lineages
In this study, we analyzed the mutational events across the genome segment of SARS-CoV-2 lineages circulating in India from June 2020 to June 2021.Out of the 1451 downloaded sequences, 344 were removed due to ambiguous characters and NNNs, leaving 1106 sequences for further analysis.The ten most mutated samples belonged to the Delta variant of B.1.617.2 lineage, all of which were found in the second wave, suggesting the severity of the second wave compared to the first (Fig. 2).

Figure 2 Ten most mutated samples during one year time period from June 2020 to June 2021
We identified a total of 1031 mutational events from June 2020 to June 2021.The most frequent events per class in decreasing order of their occurrence were SNP, SNP_silent, e x t r a g e n i c , S N P _ s t o p , d e l e t i o n _ f r a m e s h i f t , insertion_frameshift.The most common nucleotide mutations were A23403G, C3037T, C241T, C14408T, C25563T, C26735T, C18877T, C22444T, C28854T, G210T (Fig. 3.a), while the most frequent protein mutations were S: D614G, NSP3:F106F, NSP 12b: P314L, ORF3a: Q57H, M: Y71Y, NSP14C279C, S: D294D, N: S194L (Fig. 3.b).

Prevalence of L452R, T478K, E484Q, N501Y, and D614G mutations in the RBD of SARS-CoV-2
We also investigated the prevalence of mutations in the receptor-binding domain (RBD) of SARS-CoV-2.Out of the 1106 samples analyzed, 1050 had the D614G mutation, with a prevalence rate of 94.93%.The L452R mutation was found in 233 samples (21.06% prevalence rate), T478K in 186 samples (16.81% prevalence rate), N501Y in 96 samples (8.77% prevalence rate), and E484Q in 47 samples (4.24% prevalence rate) (Table 2, Fig. 3C).The frequency of all mutations was significantly higher in the second wave, except for D614G, which was present in both waves and was the most frequent mutational event observed.L452R and T478K were not present in the first wave but were found in significantly higher numbers during the second wave (Table 2).E484Q and N501Y were sporadically present in the first wave but were present in significantly higher numbers in the second wave.

Table 2
Percentage prevalence of the mutations in RBD of spike protein responsible for increased virulence and virus escape mutants from convalescent sera (Total samples=1106)

Mutations in VOC, VUM, and their comparison with other lineages
The genetic sequences were classified into VUMs, VOCs, and other lineages.A comparison of the mutational events occurring in these groups was presented in Table 2.The most common mutational event observed at the nucleotide level was A23403G, while the S:D614G substitution was the most common at the protein level in the spike protein.Synonymous mutations were observed in NSP3, M, NSP14, and S, which did not result in changes in the amino acid sequence.However, non-synonymous mutations in S, NSP12b, ORF3a, and N indicated that these proteins were under constant selection pressure.3).
One VUM, the B.1.617.1 lineage, was identified, which had two mutations (L452R & D614G) in the spike protein and five mutations in the NSP (Table 3).Among these five mutations, two were silent.The mutations in the spike protein of VUM were D614G and L452R.

Selection pressure amongst the various lineages of SARS CoV-2
A codon-based selection pressure analysis was performed to identify purifying (negative) selection (dN < dS) and positive selection (dN > dS).All sequences were analyzed through the Datamonkey server, and after the removal of stop codons, sites under positive selection with a significance level < 0.1 were annotated (Table 4, Fig. 4a-f).The annotation of sites under positive selection has been provided in Figure S1-S4.The annotation was performed by comparing the substitution with the reference sequence using MEGA 11.

Table 2
Percentage prevalence of the mutations in RBD of spike protein responsible for increased virulence and virus escape mutants from convalescent sera (Total samples=1106)

Mutations in VOC, VUM, and their comparison with other lineages
The genetic sequences were classified into VUMs, VOCs, and other lineages.A comparison of the mutational events occurring in these groups was presented in Table 2.The most common mutational event observed at the nucleotide level was A23403G, while the S:D614G substitution was the most common at the protein level in the spike protein.Synonymous mutations were observed in NSP3, M, NSP14, and S, which did not result in changes in the amino acid sequence.However, non-synonymous mutations in S, NSP12b, ORF3a, and N indicated that these proteins were under constant selection pressure.3).
One VUM, the B.1.617.1 lineage, was identified, which had two mutations (L452R & D614G) in the spike protein and five mutations in the NSP (Table 3).Among these five mutations, two were silent.The mutations in the spike protein of VUM were D614G and L452R.

Selection pressure amongst the various lineages of SARS CoV-2
A codon-based selection pressure analysis was performed to identify purifying (negative) selection (dN < dS) and positive selection (dN > dS).All sequences were analyzed through the Datamonkey server, and after the removal of stop codons, sites under positive selection with a significance level < 0.1 were annotated (Table 4, Fig. 4a-f).The annotation of sites under positive selection has been provided in Figure S1-S4.The annotation was performed by comparing the substitution with the reference sequence using MEGA 11.
In NSP2, NSP3, and NSP13, the P to L substitution was significantly higher in the second wave.Some mutations, such as NSP3: P822L, NSP13: P77L, S: G142D, S: E484K, and ORF3a: S26L, were unique to the second wave and absent in the first wave (Table 5).Mutations NSP2: P129A, NSP2: V381F, NSP3: P822S, S: L54F, and S: E484D were present in the first wave but absent in the second wave.Other mutations were present in both the first and second waves.Interestingly, the highest number of substitutions were substitutions by L (Leucine), D (Aspartic acid), and R (Arginine) amino acids in NSP2, NSP3, NSP13, S, ORF3a, and ORF9 (Table 5).

Table 5
Frequencies of mutations in first and second wave evolving under positive selection pressure.Mutations frequency higher in second wave are marked with * sign in front.Mutations frequency higher in first wave are marked with + sign The SARS-CoV-2 genomes underwent analysis using the FEL method, which showed that purifying selection was uniformly distributed throughout the genome (refer to Table S1).Diversifying selection was observed in several locations, including NSP1 (1 site), NSP2 (3 sites), NSP3 (3 sites), NSP6 (1 site), NSP12 (1 site), NSP16 (2 sites), ORF3A (6 sites), N (9 sites), and the S protein (11 sites) (refer to Table S1).
In NSP2, NSP3, and NSP13, the P to L substitution was significantly higher in the second wave.Some mutations, such as NSP3: P822L, NSP13: P77L, S: G142D, S: E484K, and ORF3a: S26L, were unique to the second wave and absent in the first wave (Table 5).Mutations NSP2: P129A, NSP2: V381F, NSP3: P822S, S: L54F, and S: E484D were present in the first wave but absent in the second wave.Other mutations were present in both the first and second waves.Interestingly, the highest number of substitutions were substitutions by L (Leucine), D (Aspartic acid), and R (Arginine) amino acids in NSP2, NSP3, NSP13, S, ORF3a, and ORF9 (Table 5).

Table 5
Frequencies of mutations in first and second wave evolving under positive selection pressure.Mutations frequency higher in second wave are marked with * sign in front.Mutations frequency higher in first wave are marked with + sign The SARS-CoV-2 genomes underwent analysis using the FEL method, which showed that purifying selection was uniformly distributed throughout the genome (refer to Table S1).Diversifying selection was observed in several locations, including NSP1 (1 site), NSP2 (3 sites), NSP3 (3 sites), NSP6 (1 site), NSP12 (1 site), NSP16 (2 sites), ORF3A (6 sites), N (9 sites), and the S protein (11 sites) (refer to Table S1).

Phylogenomic analysis
The 1106 genomes of SARS-CoV-2 underwent clade and phylogenetic analysis using Nextclade v1.14.0.The analysis identified twelve Nextstrain clades: 19A, 19B, 20A, 20B, 20C, 20G, 20H, 20I, 21B, 21A, 21I, and 21J.Out of these, 21J, 21I, and 21A were categorized under the Delta lineage, 21B as Kappa, 20I as alpha V1, and 20H as Beta V2 (Fig. 5).Upon constructing a Fast tree, we observed that the tree initially divided into three clades, with MN908947.3(reference strain) in the first clade, and the second and third clades clustering the remaining sequences (Figure S5).Some sequences did not fall into any of the clusters upon subjecting them to clustering.These sequences belonged to lineages B, B. Upon analyzing the effect of mutations on the fitness of the virus, NSP2: P129L, P129S, P129A; NSP13: P77L; ORF9: S194L were found to have a deleterious effect.The remaining mutational events that evolved under positive selection pressure had a neutral effect on the fitness of the virus.

Discussion
In the last two years, SARS-CoV-2 has caused devastating losses to both human and economic resources.Therefore, antiviral drugs have been used as a therapeutic and vaccination as a prophylactic measure to combat the COVID-19 pandemic.Antiviral antibodies elicited by vaccination can be used to combat COVID-19 viral infections, but they can also increase the risk of developing resistant strains due to the rapid mutations of viruses (27).The resistance imparted due to selective pressure is remarkable, as seen in the case of G614 which has become common (28).The viruses may also develop antiviral drug resistance due to the extreme use of antiviral therapeutics, which is one of the factors that imparts selective pressure to the viral genome.Therefore, it is imperative to study the viral genome at regular periods when a novel virus is introduced into a population, followed by extreme use of vaccination and antiviral therapeutics, as seen in the current COVID-19 pandemic.Hence, the present study aimed to determine the prevalent SARS-CoV-2 lineages, their mutations, and selection pressure during the first and second waves of COVID-19 in India.
According to our study, amongst all VOCs, the maximum number of cases in the second wave were predominantly of B.

Discussion
In the last two years, SARS-CoV-2 has caused devastating losses to both human and economic resources.Therefore, antiviral drugs have been used as a therapeutic and vaccination as a prophylactic measure to combat the COVID-19 pandemic.Antiviral antibodies elicited by vaccination can be used to combat COVID-19 viral infections, but they can also increase the risk of developing resistant strains due to the rapid mutations of viruses (27).The resistance imparted due to selective pressure is remarkable, as seen in the case of G614 which has become common (28).The viruses may also develop antiviral drug resistance due to the extreme use of antiviral therapeutics, which is one of the factors that imparts selective pressure to the viral genome.Therefore, it is imperative to study the viral genome at regular periods when a novel virus is introduced into a population, followed by extreme use of vaccination and antiviral therapeutics, as seen in the current COVID-19 pandemic.Hence, the present study aimed to determine the prevalent SARS-CoV-2 lineages, their mutations, and selection pressure during the first and second waves of COVID-19 in India.SARS-CoV-2 exhibits mutation and diversification as a result of the selection pressure imparted on it (29).However, most mutations do not impart a selective advantage to the virus.Still, some of them may provide a selective advantage to the virus, such as increased transmissibility due to increased receptor binding ability or escaping the host's immune system by conferring changes in the viral structure recognized by antibodies (30).Mutations L452R, E484Q, D614G, and N501Y, responsible for high transmission ability due to increased binding with the receptor and host immune escape, were also present.These mutations were present in significantly higher numbers in the second wave compared to the first wave.The L452R mutation is associated with increased transmission of SARS-CoV-2 and a reduction in neutralization by convalescent plasma and therapeutic antibodies (10).The E484Q mutation (found only in the B.1.617.1 and B.1.617.3 lineages) is associated with a reduction in neutralization by convalescent sera (12), and the D614G mutation is associated with increased transmissibility and is carried by the majority of currently circulating viruses (11).
In a study, it has been reported that serum derived from vaccinated individuals was able to neutralize the B.1.617.1 and B.1.1.7 lineages, with a two-fold reduction in neutralization compared to the vaccine strain virus B.1 (31).In a recent study (32), the major vaccine breakthrough variant was the Delta variant (B.1.617.2 lineage), and the breakthrough events were P384L, K417N, E484K, and N501Y.In our study, the major dominant mutations and co-mutations in the receptor-binding domain (RBD) were L452R, T478K, N501Y, and E484K.These mutations were predicted to be dominant in the near future in earlier studies (32).The host's immune system exerts constant selective pressure on the virus.Therefore, to study the selective pressure on the SARS-CoV-2 genome, the sequences were subjected to selective pressure analysis.On subjecting the sequences to selective pressure analysis, a few sites were found to be evolving under positive selection.NSP2, which acts to suppress the immune system of the host, has been described.The function of NSP2 is not entirely known, but it is thought to associate with the endosome of the host and disrupt the host cell environment (33).It is also a highly conserved protein among coronaviruses.NSP2 and NSP3 of SARS-CoV are detected not only as matured processed proteins but also as NSP2 and NSP3 precursors (34)(35)(36), suggesting the role of precursors in replication.NSP2 interacts with PHB1 and PHB2 host protein complexes, which are involved in mitochondrial biogenesis (37).In our study, we found two sites (site numbers 129 and 381) in NSP2 evolving under positive selection.NSP3 has two significant jobs.The first is cutting other viral proteins to free them to do their tasks, and the second is removing tags from old proteins that are set for destruction.The removal of tags from old and damaged proteins by NSP3 changes the balance of proteins, thus possibly compromising the cell's ability to fight the virus (38).NSP3 had amino acid substitutions P822L and P822S.The NSP13 helicase had an amino acid substitution P77L evolving under positive selection pressure.The helicase separates double-stranded RNA or DNA with a 5 3 polarity.Besides the helicase activity on double-stranded DNA and RNA, it is also capable of unwinding RNA/DNA duplex.Moreover, it has NTPase activity as well as 5 mRNA capping activity (39).However, the above substitution was found to have a deleterious effect on virus fitness.
In the spike protein, three mutations (E484D, E484Q, and E484K) were identified at site number 484, which are under positive selection pressure.All these mutations are present in the receptor-binding motif (RBM) of the receptor binding domain (RBD) of the spike protein, which binds to the human ACE-2 receptor.The RBD is a 273 amino acid segment of the whole spike protein, stretching from residue 319 to 591.Ten substitutions were found in the spike protein that is evolving under positive selection pressure, indicating that the SARS-CoV-2 genome is still evolving to evade the host's immune system.Therefore, the emergence of a highly virulent novel variant cannot be ruled out.
ORF3a is involved in the high expression of cytokines and chemokines that cause cell death and is a high pathogenicity factor.Site number 26 of ORF3a had a nonsynonymous mutation (S26L) evolving under positive selection, indicating the high virulence of these strains.
ORF9 modulates the host immune response by compromising type I IFN synthesis, interferon signaling, antigen processing and presentation, complement signaling, and inducing IL-6 signaling.Previous studies have suggested that high induction of IL-6 is related to increased viral virulence.Biological activities affected by the production of IL-6 include regulating the expression of macrophage colonystimulating factor, increasing B-cell IgG production, negatively regulating dendritic cell maturation, and promoting the Th2 response.Site numbers 13 and 194 of ORF9 had nonsynonymous mutations evolving under high pressure.

Conclusion
Based on the above facts, it can be inferred that the second wave was more devastating due to variants of concern (VOCs) with high transmissibility.The mutations from Variants Under Monitoring B.1.617.1 lineage (Kappa variant) and Variants Of Concern (B.1.1.7,B.1.351& B.1.617.2 lineages) were significantly higher in 2021 than in 2020.This could be because unlocking and public gatherings might have flared up the dissemination of the virus strains.The ten most frequent mutational events observed were exhibited by the B.1.617.2 lineage (Delta variants), which shows the high genetic variance during the second wave.The beta, delta, and kappa variants exhibited a high percentage (75%) of nonsynonymous mutations.The higher percentage of the nonsynonymous mutations in the beta, delta, and kappa variants shows that these variants are still evolving under constant positive selection pressure and pose a threat to the future.Moreover, all three variants -Delta, Alpha, and Kappaexhibited mutations in the receptor-binding domain.
SARS-CoV-2 exhibits mutation and diversification as a result of the selection pressure imparted on it (29).However, most mutations do not impart a selective advantage to the virus.Still, some of them may provide a selective advantage to the virus, such as increased transmissibility due to increased receptor binding ability or escaping the host's immune system by conferring changes in the viral structure recognized by antibodies (30).Mutations L452R, E484Q, D614G, and N501Y, responsible for high transmission ability due to increased binding with the receptor and host immune escape, were also present.These mutations were present in significantly higher numbers in the second wave compared to the first wave.The L452R mutation is associated with increased transmission of SARS-CoV-2 and a reduction in neutralization by convalescent plasma and therapeutic antibodies (10).The E484Q mutation (found only in the B.1.617.1 and B.1.617.3 lineages) is associated with a reduction in neutralization by convalescent sera (12), and the D614G mutation is associated with increased transmissibility and is carried by the majority of currently circulating viruses (11).
In a study, it has been reported that serum derived from vaccinated individuals was able to neutralize the B.1.617.1 and B.1.1.7 lineages, with a two-fold reduction in neutralization compared to the vaccine strain virus B.1 (31).In a recent study (32), the major vaccine breakthrough variant was the Delta variant (B.1.617.2 lineage), and the breakthrough events were P384L, K417N, E484K, and N501Y.In our study, the major dominant mutations and co-mutations in the receptor-binding domain (RBD) were L452R, T478K, N501Y, and E484K.These mutations were predicted to be dominant in the near future in earlier studies (32).The host's immune system exerts constant selective pressure on the virus.Therefore, to study the selective pressure on the SARS-CoV-2 genome, the sequences were subjected to selective pressure analysis.On subjecting the sequences to selective pressure analysis, a few sites were found to be evolving under positive selection.NSP2, which acts to suppress the immune system of the host, has been described.The function of NSP2 is not entirely known, but it is thought to associate with the endosome of the host and disrupt the host cell environment (33).It is also a highly conserved protein among coronaviruses.NSP2 and NSP3 of SARS-CoV are detected not only as matured processed proteins but also as NSP2 and NSP3 precursors (34)(35)(36), suggesting the role of precursors in replication.NSP2 interacts with PHB1 and PHB2 host protein complexes, which are involved in mitochondrial biogenesis (37).In our study, we found two sites (site numbers 129 and 381) in NSP2 evolving under positive selection.NSP3 has two significant jobs.The first is cutting other viral proteins to free them to do their tasks, and the second is removing tags from old proteins that are set for destruction.The removal of tags from old and damaged proteins by NSP3 changes the balance of proteins, thus possibly compromising the cell's ability to fight the virus (38).NSP3 had amino acid substitutions P822L and P822S.The NSP13 helicase had an amino acid substitution P77L evolving under positive selection pressure.The helicase separates double-stranded RNA or DNA with a 5 3 polarity.Besides the helicase activity on double-stranded DNA and RNA, it is also capable of unwinding RNA/DNA duplex.Moreover, it has NTPase activity as well as 5 mRNA capping activity (39).However, the above substitution was found to have a deleterious effect on virus fitness.
In the spike protein, three mutations (E484D, E484Q, and E484K) were identified at site number 484, which are under positive selection pressure.All these mutations are present in the receptor-binding motif (RBM) of the receptor binding domain (RBD) of the spike protein, which binds to the human ACE-2 receptor.The RBD is a 273 amino acid segment of the whole spike protein, stretching from residue 319 to 591.Ten substitutions were found in the spike protein that is evolving under positive selection pressure, indicating that the SARS-CoV-2 genome is still evolving to evade the host's immune system.Therefore, the emergence of a highly virulent novel variant cannot be ruled out.
ORF3a is involved in the high expression of cytokines and chemokines that cause cell death and is a high pathogenicity factor.Site number 26 of ORF3a had a nonsynonymous mutation (S26L) evolving under positive selection, indicating the high virulence of these strains.
ORF9 modulates the host immune response by compromising type I IFN synthesis, interferon signaling, antigen processing and presentation, complement signaling, and inducing IL-6 signaling.Previous studies have suggested that high induction of IL-6 is related to increased viral virulence.Biological activities affected by the production of IL-6 include regulating the expression of macrophage colonystimulating factor, increasing B-cell IgG production, negatively regulating dendritic cell maturation, and promoting the Th2 response.Site numbers 13 and 194 of ORF9 had nonsynonymous mutations evolving under high pressure.

Conclusion
Based on the above facts, it can be inferred that the second wave was more devastating due to variants of concern (VOCs) with high transmissibility.The mutations present in the spike protein of SARS-CoV-2 made the virus highly transmissible, and the mutations in the NSPs made the host's body more vulnerable to infection by the virus, followed by secondary infection by bacteria and fungus.This may be the possible reason that the second wave had a sharper peak than the first wave, which had a broader peak due to the slow transmission of the virus.The study suggests that viral components responsible for immunosuppression, entry inside the host cell, viral replication, and cytokine storm are targeted by the host's immune response, based on the evidence of positive selection on NSP2, NSP3, NSP13, S protein, ORF3a, and ORF9.However, genomic surveillance needs to be carried out to closely monitor mutational events and selection pressure, as the virus may mutate further to develop escape mutants due to increasing vaccination rates.

Figure 1
Figure 1 Bar diagram showing the prevalent Variants of concern (VOCs) and Variants under monitoring (VUM) during 2020 and 2021 as derived from the whole genome sequencing data from NCBI & GISAID.Frequency of samples of first wave are depicted in the year 2020 and of second wave are depicted in the year 2021

Figure 1
Figure 1 Bar diagram showing the prevalent Variants of concern (VOCs) and Variants under monitoring (VUM) during 2020 and 2021 as derived from the whole genome sequencing data from NCBI & GISAID.Frequency of samples of first wave are depicted in the year 2020 and of second wave are depicted in the year 2021

Figure 3
Figure 3Frequency of various mutations at nucleotide level and at protein level

Figure 3
Figure 3Frequency of various mutations at nucleotide level and at protein level Three types of VOCs were prevalent among all circulating lineages, including the B.1.1.7 (alpha variant), B.1.351(beta variant), and B.1.617.2 (delta variant) lineages, which accounted for 23.86% of the total sequences analyzed.The most common mutational events shared by all three VOCs were 5'UTR: 241 and NSP3: F106F.Mutations were also observed in S, 5'UTR, ORF3a, ORF7a NSPs, M, E, and N proteins, with the Delta variant having the three most frequent substitutions in the Spike protein (D614G, L452R, and T478K).The Kappa variant had two non-synonymous mutations (L452R & D614G) in the spike protein.The alpha variant exhibited only one mutation, S: D614G, in the spike protein, while the beta variant did not show any mutation in the spike protein.The NSPs of the alpha and beta variants had six mutations, with four and two being silent, respectively.The Delta variant exhibited only one silent mutation in the NSP3:F106F.Mutations in the N protein varied among all three VOCs (Table Three types of VOCs were prevalent among all circulating lineages, including the B.1.1.7 (alpha variant), B.1.351(beta variant), and B.1.617.2 (delta variant) lineages, which accounted for 23.86% of the total sequences analyzed.The most common mutational events shared by all three VOCs were 5'UTR: 241 and NSP3: F106F.Mutations were also observed in S, 5'UTR, ORF3a, ORF7a NSPs, M, E, and N proteins, with the Delta variant having the three most frequent substitutions in the Spike protein (D614G, L452R, and T478K).The Kappa variant had two non-synonymous mutations (L452R & D614G) in the spike protein.The alpha variant exhibited only one mutation, S: D614G, in the spike protein, while the beta variant did not show any mutation in the spike protein.The NSPs of the alpha and beta variants had six mutations, with four and two being silent, respectively.The Delta variant exhibited only one silent mutation in the NSP3:F106F.Mutations in the N protein varied among all three VOCs (Table

Figure 5
Figure 5 Phylogenetic analysis of SARS-CoV-2 genomes using Nextclade v1.14.0.Twelve Nextstrain clades viz.19A, 19B, 20A, 20B, 20C, 20G, 20H, 20I, 21B, 21A, 21I and 21J were identified by the phylogenetic analysis 1.617.2lineage (Delta variant), whereas no cases of B.1.617.2 lineage (Delta variant) were found in the first wave.The movement of people, lifting of lockdowns, public gatherings, and festivals could be reasons why the cases of B.1.617.2 lineage (Delta variant) rose significantly.During the second wave, the B.1.617.2 lineage (Delta variant) was the predominant variant in all outbreaks.Additionally, the cases According to our study, amongst all VOCs, the maximum number of cases in the second wave were predominantly of B.1.617.2 lineage (Delta variant), whereas no cases of B.1.617.2 lineage (Delta variant) were found in the first wave.The movement of people, lifting of lockdowns, public gatherings, and festivals could be reasons why the cases of B.1.617.2 lineage (Delta variant) rose significantly.During the second wave, the B.1.617.2 lineage (Delta variant) was the predominant variant in all outbreaks.Additionally, the cases from Variants Under Monitoring B.1.617.1 lineage (Kappa variant) and Variants Of Concern (B.1.1.7,B.1.351& B.1.617.2 lineages) were significantly higher in 2021 than in 2020.This could be because unlocking and public gatherings might have flared up the dissemination of the virus strains.The ten most frequent mutational events observed were exhibited by the B.1.617.2 lineage (Delta variants), which shows the high genetic variance during the second wave.The beta, delta, and kappa variants exhibited a high percentage (75%) of nonsynonymous mutations.The higher percentage of the nonsynonymous mutations in the beta, delta, and kappa variants shows that these variants are still evolving under constant positive selection pressure and pose a threat to the future.Moreover, all three variants -Delta, Alpha, and Kappaexhibited mutations in the receptor-binding domain.

Table 4
Mutations and mutational sites in the SARS CoV-2 evolving under positive selection pressure.