Standard

Principal Components of Genetic Sequences: Correlations and Significance. / Efimov, V. M.; Efimov, K. V.; Kovaleva, V. Yu. et al.

In: Mathematical Biology and Bioinformatics, Vol. 16, No. 2, 2021, pp. 299-316.

Research output: Scientific publications in periodicals › Article › Peer-reviewed

Harvard

Efimov, VM, Efimov, KV, Kovaleva, VY & Matushkin, YG 2021, 'Principal Components of Genetic Sequences: Correlations and Significance', Mathematical Biology and Bioinformatics, vol. 16, no. 2, pp. 299-316. https://doi.org/10.17537/2021.16.299

APA

Efimov, V. M., Efimov, K. V., Kovaleva, V. Y., & Matushkin, Y. G. (2021). Principal Components of Genetic Sequences: Correlations and Significance. Mathematical Biology and Bioinformatics, 16(2), 299-316. https://doi.org/10.17537/2021.16.299

Vancouver

Efimov VM, Efimov KV, Kovaleva VY, Matushkin YG. Principal Components of Genetic Sequences: Correlations and Significance. Mathematical Biology and Bioinformatics. 2021;16(2):299-316. doi: 10.17537/2021.16.299

Author

Efimov, V. M. ; Efimov, K. V. ; Kovaleva, V. Yu. et al. / Principal Components of Genetic Sequences: Correlations and Significance. In: Mathematical Biology and Bioinformatics. 2021 ; Vol. 16, No. 2. pp. 299-316.

BibTeX

@article{40d6e7104f8d4e62b4b11e83b545607a,
title = "Principal Components of Genetic Sequences: Correlations and Significance",
abstract = "Any numerical series can be decomposed into principal components using singular spectrum analysis. We have recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the obtained principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. To assess the significance of correlations between sequences, one should bear in mind that standard significance criteria are based on the assumption of independence of observations, which, as a rule, does not hold for real sequences. The article discusses the use for these purposes of an anchor bootstrap technique, also previously developed by the authors of the article. In this approach, it is assumed that the objects can be represented by points of a metric space. Taken together, they make up some fixed structure in it, in particular, a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of using the anchor bootstrap technique in genetic sequence analysis. Significant correlations of the first principal component were revealed with the hydrophobicity / “transmembraneity” of the corresponding fragments of the amino acid sequence, the phenylalanine content in them, as well as the difference in the T- and A-content in the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes. Very likely, it is of a more general nature.",
keywords = "anchor bootstrap, CDS, external factors, PCA-Seq, protein secondary structure, SLC9A1 (NHE1) gene, SSA",
author = "Efimov, {V. M.} and Efimov, {K. V.} and Kovaleva, {V. Yu.} and Matushkin, {Yu. G.}",
note = "Publisher Copyright: {\textcopyright} 2021",
year = "2021",
doi = "10.17537/2021.16.299",
language = "English",
volume = "16",
pages = "299--316",
journal = "Mathematical Biology and Bioinformatics",
issn = "1994-6538",
publisher = "Institute of Mathematical Problems of Biology",
number = "2",

}

RIS

TY - JOUR

T1 - Principal Components of Genetic Sequences: Correlations and Significance

AU - Efimov, V. M.

AU - Efimov, K. V.

AU - Kovaleva, V. Yu.

AU - Matushkin, Yu. G.

N1 - Publisher Copyright: © 2021

PY - 2021

Y1 - 2021

N2 - Any numerical series can be decomposed into principal components using singular spectrum analysis. We have recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the obtained principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. To assess the significance of correlations between sequences, one should bear in mind that standard significance criteria are based on the assumption of independence of observations, which, as a rule, does not hold for real sequences. The article discusses the use for these purposes of an anchor bootstrap technique, also previously developed by the authors of the article. In this approach, it is assumed that the objects can be represented by points of a metric space. Taken together, they make up some fixed structure in it, in particular, a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of using the anchor bootstrap technique in genetic sequence analysis. Significant correlations of the first principal component were revealed with the hydrophobicity / “transmembraneity” of the corresponding fragments of the amino acid sequence, the phenylalanine content in them, as well as the difference in the T- and A-content in the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes. Very likely, it is of a more general nature.

AB - Any numerical series can be decomposed into principal components using singular spectrum analysis. We have recently proposed a new analysis method, PCA-Seq, which allows numerical principal components to be calculated for a sequence of elements of any type. In particular, the sequence may be composed of nucleotide base pairs or amino acid residues. Two questions inevitably arise: how to interpret the obtained principal components and how to assess their reliability. To interpret the principal components of a symbolic sequence, it is reasonable to evaluate their correlations with numerical characteristics of the sequence elements. To assess the significance of correlations between sequences, one should bear in mind that standard significance criteria are based on the assumption of independence of observations, which, as a rule, does not hold for real sequences. The article discusses the use for these purposes of an anchor bootstrap technique, also previously developed by the authors of the article. In this approach, it is assumed that the objects can be represented by points of a metric space. Taken together, they make up some fixed structure in it, in particular, a sequence. The objects are assigned the same random integer weights as in the classical bootstrap. This is sufficient to obtain the bootstrap distribution of the correlation coefficients and assess their significance. The coding sequence of the SLC9A1 gene (synonyms APNH, NHE1, PPP1R143) was taken as an example of using the anchor bootstrap technique in genetic sequence analysis. Significant correlations of the first principal component were revealed with the hydrophobicity / “transmembraneity” of the corresponding fragments of the amino acid sequence, the phenylalanine content in them, as well as the difference in the T- and A-content in the corresponding nucleotide fragments. A similar pattern was found earlier by other authors for other genes. Very likely, it is of a more general nature.

KW - anchor bootstrap

KW - CDS

KW - external factors

KW - PCA-Seq

KW - protein secondary structure

KW - SLC9A1 (NHE1) gene

KW - SSA

UR - http://www.scopus.com/inward/record.url?scp=85116449403&partnerID=8YFLogxK

U2 - 10.17537/2021.16.299

DO - 10.17537/2021.16.299

M3 - Article

AN - SCOPUS:85116449403

VL - 16

SP - 299

EP - 316

JO - Mathematical Biology and Bioinformatics

JF - Mathematical Biology and Bioinformatics

SN - 1994-6538

IS - 2

ER -

ID: 34377456