Standard

Principal component analysis and its generalizations for any type of sequence (PCA-Seq). / Efimov, V. M.; Efimov, K. V.; Kovaleva, V. Y.

In: Вавиловский журнал генетики и селекции, Vol. 23, No. 8, 01.08.2019, p. 1032-1036.

Research output: Contribution to journal, Review article, peer-review

Harvard

Efimov, VM, Efimov, KV & Kovaleva, VY 2019, 'Principal component analysis and its generalizations for any type of sequence (PCA-Seq)', Вавиловский журнал генетики и селекции, vol. 23, no. 8, pp. 1032-1036. https://doi.org/10.18699/VJ19.584

APA

Efimov, V. M., Efimov, K. V., & Kovaleva, V. Y. (2019). Principal component analysis and its generalizations for any type of sequence (PCA-Seq). Вавиловский журнал генетики и селекции, 23(8), 1032-1036. https://doi.org/10.18699/VJ19.584

Vancouver

Efimov VM, Efimov KV, Kovaleva VY. Principal component analysis and its generalizations for any type of sequence (PCA-Seq). Вавиловский журнал генетики и селекции. 2019 Aug 1;23(8):1032-1036. doi: 10.18699/VJ19.584

Author

Efimov, V. M. ; Efimov, K. V. ; Kovaleva, V. Y. / Principal component analysis and its generalizations for any type of sequence (PCA-Seq). In: Вавиловский журнал генетики и селекции. 2019 ; Vol. 23, No. 8. pp. 1032-1036.

BibTeX

@article{2fc1cbfee5c4445e98f1566b81acaa9c,
title = "Principal component analysis and its generalizations for any type of sequence (PCA-Seq)",
abstract = "In the 1940s, Karhunen and Lo{\`e}ve proposed a method for processing a one-dimensional numeric time series by converting it into multidimensional by shifts. In fact, a one-dimensional number series was decomposed into several orthogonal time series. This method has many times been independently developed and applied in practice under various names (EOF, SSA, Caterpillar, etc.). Nowadays, the name 'SSA' (Singular Spectral Analysis) is the most often used. It turned out that it is universal, applicable to any time series without requiring stationary assumptions, automatically decomposes time series into a trend, cyclic components and noise. By the beginning of the 1980s, Takens had shown that for a dynamical system such a method makes it possible to obtain an attractor from observing only one of these variables, thereby bringing the method to a powerful theoretical basis. In the same years, the practical benefits of phase portraits became clear. In particular, it was used in the analysis and forecast of animal abundance dynamics. In this paper we propose to extend SSA to a one-dimensional sequence of any type of elements, including numbers, symbols, figures, etc., and, as a special case, to a molecular sequence. Technically, the problem is solved using an algorithm like SSA. The sequence is cut by a sliding window into fragments of a given length. Between all fragments, the matrix of Euclidean distances is calculated. This is always possible. For example, the square root of the Hamming distance between fragments is a Euclidean distance. For the resulting matrix, the principal components are calculated by the principal-coordinate method (PCo). Instead of a distance matrix, one can use a matrix of any similarity/dissimilarity indexes and apply methods of multidimensional scaling (MDS). The result will always be PCs in some Euclidean space. We called this method 'PCA-Seq'. It is certainly an exploratory method, as is its particular case SSA. 
For any sequence, including molecular, PCA-Seq without any additional assumptions allows presenting its principal components in a numerical form and visualizing them in the form of phase portraits. A long history of SSA application for numerical data gives all reason to believe that PCA-Seq will be not less useful in the analysis of non-numerical data, especially in hypothesizing. PCA-Seq is implemented in the freely distributed Jacobi 4 package (http://jacobi4.ru/).",
keywords = "MDS, Molecular sequences, P-distance, PCA, PCo, SSA, SVD, Time series",
author = "Efimov, {V. M.} and Efimov, {K. V.} and Kovaleva, {V. Y.}",
year = "2019",
month = aug,
day = "1",
doi = "10.18699/VJ19.584",
language = "English",
volume = "23",
pages = "1032--1036",
journal = "Вавиловский журнал генетики и селекции",
issn = "2500-0462",
publisher = "Institute of Cytology and Genetics of Siberian Branch of the Russian Academy of Sciences",
number = "8",

}

RIS

TY - JOUR

T1 - Principal component analysis and its generalizations for any type of sequence (PCA-Seq)

AU - Efimov, V. M.

AU - Efimov, K. V.

AU - Kovaleva, V. Y.

PY - 2019/8/1

Y1 - 2019/8/1

N2 - In the 1940s, Karhunen and Loève proposed a method for processing a one-dimensional numeric time series by converting it into multidimensional by shifts. In fact, a one-dimensional number series was decomposed into several orthogonal time series. This method has many times been independently developed and applied in practice under various names (EOF, SSA, Caterpillar, etc.). Nowadays, the name 'SSA' (Singular Spectral Analysis) is the most often used. It turned out that it is universal, applicable to any time series without requiring stationary assumptions, automatically decomposes time series into a trend, cyclic components and noise. By the beginning of the 1980s, Takens had shown that for a dynamical system such a method makes it possible to obtain an attractor from observing only one of these variables, thereby bringing the method to a powerful theoretical basis. In the same years, the practical benefits of phase portraits became clear. In particular, it was used in the analysis and forecast of animal abundance dynamics. In this paper we propose to extend SSA to a one-dimensional sequence of any type of elements, including numbers, symbols, figures, etc., and, as a special case, to a molecular sequence. Technically, the problem is solved using an algorithm like SSA. The sequence is cut by a sliding window into fragments of a given length. Between all fragments, the matrix of Euclidean distances is calculated. This is always possible. For example, the square root of the Hamming distance between fragments is a Euclidean distance. For the resulting matrix, the principal components are calculated by the principal-coordinate method (PCo). Instead of a distance matrix, one can use a matrix of any similarity/dissimilarity indexes and apply methods of multidimensional scaling (MDS). The result will always be PCs in some Euclidean space. We called this method 'PCA-Seq'. It is certainly an exploratory method, as is its particular case SSA. 
For any sequence, including molecular, PCA-Seq without any additional assumptions allows presenting its principal components in a numerical form and visualizing them in the form of phase portraits. A long history of SSA application for numerical data gives all reason to believe that PCA-Seq will be not less useful in the analysis of non-numerical data, especially in hypothesizing. PCA-Seq is implemented in the freely distributed Jacobi 4 package (http://jacobi4.ru/).

KW - MDS

KW - Molecular sequences

KW - P-distance

KW - PCA

KW - PCo

KW - SSA

KW - SVD

KW - Time series

UR - http://www.scopus.com/inward/record.url?scp=85081976114&partnerID=8YFLogxK

U2 - 10.18699/VJ19.584

DO - 10.18699/VJ19.584

M3 - Review article

AN - SCOPUS:85081976114

VL - 23

SP - 1032

EP - 1036

JO - Вавиловский журнал генетики и селекции

JF - Вавиловский журнал генетики и селекции

SN - 2500-0462

IS - 8

ER -

ID: 23878986