Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor

Standard

Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. / Heinecke, Alexander; Vaidyanathan, Karthikeyan; Smelyanskiy, Mikhail et al.

2013. 126-137 Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States.

Research output: Contribution to conference › Paper › peer-review

Harvard

Heinecke, A, Vaidyanathan, K, Smelyanskiy, M, Kobotov, A, Dubtsov, R, Henry, G, Shet, AG, Chrysos, G & Dubey, P 2013, 'Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor', Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, United States, 20 May 2013 - 24 May 2013, pp. 126-137. https://doi.org/10.1109/IPDPS.2013.113

APA

Heinecke, A., Vaidyanathan, K., Smelyanskiy, M., Kobotov, A., Dubtsov, R., Henry, G., Shet, A. G., Chrysos, G., & Dubey, P. (2013). Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. 126-137. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States. https://doi.org/10.1109/IPDPS.2013.113

Vancouver

Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov R, Henry G et al. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. 2013. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States. doi: 10.1109/IPDPS.2013.113

Author

Heinecke, Alexander ; Vaidyanathan, Karthikeyan ; Smelyanskiy, Mikhail et al. / Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States. 12 p.

BibTeX

@conference{21bfed4cb442454ebb7bc1fd48ee60e1,

title = "Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel{\textregistered} Xeon Phi{\texttrademark} coprocessor",

abstract = "Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel{\textregistered} Xeon Phi{\texttrademark} co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90\% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80\% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80\% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76\% efficiency while delivering the total performance of 107 TFLOPS.",

keywords = "HPL, hybrid parallelization, LU factorization, panel factorization, SIMD, TLP, Xeon Phi",

author = "Alexander Heinecke and Karthikeyan Vaidyanathan and Mikhail Smelyanskiy and Alexander Kobotov and Roman Dubtsov and Greg Henry and Shet, {Aniruddha G.} and George Chrysos and Pradeep Dubey",

note = "Copyright: Copyright 2013 Elsevier B.V., All rights reserved.; 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013 ; Conference date: 20-05-2013 Through 24-05-2013",

year = "2013",

doi = "10.1109/IPDPS.2013.113",

language = "English",

pages = "126--137",

}

RIS

TY - CONF

T1 - Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor

AU - Heinecke, Alexander

AU - Vaidyanathan, Karthikeyan

AU - Smelyanskiy, Mikhail

AU - Kobotov, Alexander

AU - Dubtsov, Roman

AU - Henry, Greg

AU - Shet, Aniruddha G.

AU - Chrysos, George

AU - Dubey, Pradeep

N1 - Conference code: 27

PY - 2013

Y1 - 2013

N2 - Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.

AB - Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.

KW - HPL

KW - hybrid parallelization

KW - LU factorization

KW - panel factorization

KW - SIMD

KW - TLP

KW - Xeon Phi

UR - http://www.scopus.com/inward/record.url?scp=84884866137&partnerID=8YFLogxK

U2 - 10.1109/IPDPS.2013.113

DO - 10.1109/IPDPS.2013.113

M3 - Paper

AN - SCOPUS:84884866137

SP - 126

EP - 137

T2 - 27th IEEE International Parallel and Distributed Processing Symposium

Y2 - 20 May 2013 through 24 May 2013

ER -
