Research output: Contribution to journal › Article › peer-review
Principal Components of Genetic Sequences: Correlations and Significance. / Efimov, V. M.; Efimov, K. V.; Kovaleva, V. Yu. et al.
In: Mathematical Biology and Bioinformatics, Vol. 16, No. 2, 2021, p. 299-316.
TY - JOUR
T1 - Principal Components of Genetic Sequences: Correlations and Significance
AU - Efimov, V. M.
AU - Efimov, K. V.
AU - Kovaleva, V. Yu.
AU - Matushkin, Yu. G.
N1 - Publisher Copyright: © 2021
PY - 2021
Y1 - 2021
N2 - Any numerical series can be decomposed into principal components using singular spectrum analysis (SSA). We have recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the resulting principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. When assessing the significance of correlations between sequences, one should bear in mind that standard significance criteria assume independent observations, an assumption that, as a rule, does not hold for real sequences. The article discusses the use of the anchor bootstrap technique, also developed previously by the authors, for these purposes. This approach assumes that the objects can be represented by points of a metric space and that, taken together, they form a fixed structure in it, in particular a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and to assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of applying the anchor bootstrap technique to genetic sequence analysis. The first principal component showed significant correlations with the hydrophobicity / “transmembraneity” of the corresponding fragments of the amino acid sequence, with their phenylalanine content, and with the difference between the T- and A-content of the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes. It is very likely of a more general nature.
KW - anchor bootstrap
KW - CDS
KW - external factors
KW - PCA-Seq
KW - protein secondary structure
KW - SLC9A1 (NHE1) gene
KW - SSA
UR - http://www.scopus.com/inward/record.url?scp=85116449403&partnerID=8YFLogxK
U2 - 10.17537/2021.16.299
DO - 10.17537/2021.16.299
M3 - Article
AN - SCOPUS:85116449403
VL - 16
SP - 299
EP - 316
JO - Mathematical Biology and Bioinformatics
JF - Mathematical Biology and Bioinformatics
SN - 1994-6538
IS - 2
ER -
ID: 34377456