Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices

Standard

Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices. / Kazakovtsev, Vladimir; Плеханов, Михаил Станиславович; Naumchev, Alexandr et al.

In: Big Data and Cognitive Computing, Vol. 9, No. 10, 254, 09.10.2025.

Research output: Contribution to journal › Article › peer-review

Harvard

Kazakovtsev, V, Плеханов, МС, Naumchev, A, Shkaberina, G, Masich, I, Egorova, L, Stupina, A, Popov, A & Kazakovtsev, L 2025, 'Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices', Big Data and Cognitive Computing, vol. 9, no. 10, 254. https://doi.org/10.3390/bdcc9100254

APA

Kazakovtsev, V., Плеханов, М. С., Naumchev, A., Shkaberina, G., Masich, I., Egorova, L., Stupina, A., Popov, A., & Kazakovtsev, L. (2025). Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices. Big Data and Cognitive Computing, 9(10), [254]. https://doi.org/10.3390/bdcc9100254

Vancouver

Kazakovtsev V, Плеханов МС, Naumchev A, Shkaberina G, Masich I, Egorova L et al. Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices. Big Data and Cognitive Computing. 2025 Oct 9;9(10):254. doi: 10.3390/bdcc9100254

Author

Kazakovtsev, Vladimir ; Плеханов, Михаил Станиславович ; Naumchev, Alexandr et al. / Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices. In: Big Data and Cognitive Computing. 2025 ; Vol. 9, No. 10.

BibTeX

@article{231cfd2b68f648c4be1dee9b06b2c47a,

title = "Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices",

abstract = "In this study, we propose a novel adaptive algorithm for approximate nearest neighbor (ANN) search, based on the inverted file (IVF) index (cluster-based index) and online query complexity classification. The concept of the classical IVF search implemented in vector databases is as follows: all data vectors are divided into clusters, and each cluster is assigned to its central point (centroid). For an ANN search query, the closest centroids are determined, and the further search continues in the corresponding clusters only. In our study, the complexity of each query is assessed and classified with the use of results of an initial trial search in a limited number of clusters. Based on this classification, the algorithm dynamically determines the presumably sufficient number of clusters which is sufficient to achieve the desired Recall value, thereby improving vector search efficiency. Our experiments show that such a complexity classifier can be built with the use of a single feature, and we propose an algorithm for its training. We studied the impact of various features on the query processing and discovered a strong dependence on the number of clusters that contains at least one nearest neighbor (productive clusters). The new algorithm is designed to be implemented on top of the IVF search which is a well-known algorithm for approximate nearest neighbor search and uses existing IVF indexes that are widely used in the most popular vector database management systems, such as pgvector. The results obtained demonstrate a significant increase in the speed of nearest neighbor search (up to 35%) while maintaining a high Recall rate of 0.99. Additionally, the search algorithm is deterministic, which might be extremely important for tasks where the reproducibility of results plays a crucial role. The developed algorithm has been tested on datasets of varying sizes up to one billion data vectors.",

author = "Vladimir Kazakovtsev and Плеханов, {Михаил Станиславович} and Alexandr Naumchev and Guzel Shkaberina and Igor Masich and Lyudmila Egorova and Alena Stupina and Aleksey Popov and Lev Kazakovtsev",

note = "Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices / V. Kazakovtsev, M. Plekhanov, A. Naumchev, G. Shkaberina, I. Masich, L. Egorova, A. Stupina, A. Popov, L. Kazakovtsev // Big Data and Cognitive Computing. - 2025. - Т. 9. № 10. - С. 254",

year = "2025",

month = oct,

day = "9",

doi = "10.3390/bdcc9100254",

language = "English",

volume = "9",

journal = "Big Data and Cognitive Computing",

issn = "2504-2289",

publisher = "Multidisciplinary Digital Publishing Institute (MDPI)",

number = "10",

}

RIS

TY - JOUR

T1 - Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices

AU - Kazakovtsev, Vladimir

AU - Плеханов, Михаил Станиславович

AU - Naumchev, Alexandr

AU - Shkaberina, Guzel

AU - Masich, Igor

AU - Egorova, Lyudmila

AU - Stupina, Alena

AU - Popov, Aleksey

AU - Kazakovtsev, Lev

N1 - Fast Adaptive Approximate Nearest Neighbor Search with Cluster-Shaped Indices / V. Kazakovtsev, M. Plekhanov, A. Naumchev, G. Shkaberina, I. Masich, L. Egorova, A. Stupina, A. Popov, L. Kazakovtsev // Big Data and Cognitive Computing. - 2025. - Т. 9. № 10. - С. 254

PY - 2025/10/9

Y1 - 2025/10/9

N2 - In this study, we propose a novel adaptive algorithm for approximate nearest neighbor (ANN) search, based on the inverted file (IVF) index (cluster-based index) and online query complexity classification. The concept of the classical IVF search implemented in vector databases is as follows: all data vectors are divided into clusters, and each cluster is assigned to its central point (centroid). For an ANN search query, the closest centroids are determined, and the further search continues in the corresponding clusters only. In our study, the complexity of each query is assessed and classified with the use of results of an initial trial search in a limited number of clusters. Based on this classification, the algorithm dynamically determines the presumably sufficient number of clusters which is sufficient to achieve the desired Recall value, thereby improving vector search efficiency. Our experiments show that such a complexity classifier can be built with the use of a single feature, and we propose an algorithm for its training. We studied the impact of various features on the query processing and discovered a strong dependence on the number of clusters that contains at least one nearest neighbor (productive clusters). The new algorithm is designed to be implemented on top of the IVF search which is a well-known algorithm for approximate nearest neighbor search and uses existing IVF indexes that are widely used in the most popular vector database management systems, such as pgvector. The results obtained demonstrate a significant increase in the speed of nearest neighbor search (up to 35%) while maintaining a high Recall rate of 0.99. Additionally, the search algorithm is deterministic, which might be extremely important for tasks where the reproducibility of results plays a crucial role. The developed algorithm has been tested on datasets of varying sizes up to one billion data vectors.

AB - In this study, we propose a novel adaptive algorithm for approximate nearest neighbor (ANN) search, based on the inverted file (IVF) index (cluster-based index) and online query complexity classification. The concept of the classical IVF search implemented in vector databases is as follows: all data vectors are divided into clusters, and each cluster is assigned to its central point (centroid). For an ANN search query, the closest centroids are determined, and the further search continues in the corresponding clusters only. In our study, the complexity of each query is assessed and classified with the use of results of an initial trial search in a limited number of clusters. Based on this classification, the algorithm dynamically determines the presumably sufficient number of clusters which is sufficient to achieve the desired Recall value, thereby improving vector search efficiency. Our experiments show that such a complexity classifier can be built with the use of a single feature, and we propose an algorithm for its training. We studied the impact of various features on the query processing and discovered a strong dependence on the number of clusters that contains at least one nearest neighbor (productive clusters). The new algorithm is designed to be implemented on top of the IVF search which is a well-known algorithm for approximate nearest neighbor search and uses existing IVF indexes that are widely used in the most popular vector database management systems, such as pgvector. The results obtained demonstrate a significant increase in the speed of nearest neighbor search (up to 35%) while maintaining a high Recall rate of 0.99. Additionally, the search algorithm is deterministic, which might be extremely important for tasks where the reproducibility of results plays a crucial role. The developed algorithm has been tested on datasets of varying sizes up to one billion data vectors.

U2 - 10.3390/bdcc9100254

DO - 10.3390/bdcc9100254

M3 - Article

VL - 9

JO - Big Data and Cognitive Computing

JF - Big Data and Cognitive Computing

SN - 2504-2289

IS - 10

M1 - 254

ER -

ID: 70653084