Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor

Standard

Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. / Heinecke, Alexander; Vaidyanathan, Karthikeyan; Smelyanskiy, Mikhail et al.

2013. 126-137 Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States.

Research output: Contribution to conference › Paper › peer-review

Harvard

Heinecke, A, Vaidyanathan, K, Smelyanskiy, M, Kobotov, A, Dubtsov, R, Henry, G, Shet, AG, Chrysos, G & Dubey, P 2013, 'Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor', Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, United States, 20.05.2013 - 24.05.2013, pp. 126-137. https://doi.org/10.1109/IPDPS.2013.113

APA

Heinecke, A., Vaidyanathan, K., Smelyanskiy, M., Kobotov, A., Dubtsov, R., Henry, G., Shet, A. G., Chrysos, G., & Dubey, P. (2013). Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. 126-137. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States. https://doi.org/10.1109/IPDPS.2013.113

Vancouver

Heinecke A, Vaidyanathan K, Smelyanskiy M, Kobotov A, Dubtsov R, Henry G et al. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. 2013. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States. doi: 10.1109/IPDPS.2013.113

Author

Heinecke, Alexander ; Vaidyanathan, Karthikeyan ; Smelyanskiy, Mikhail et al. / Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, Boston, Massachusetts, United States. 12 p.

BibTeX

@conference{21bfed4cb442454ebb7bc1fd48ee60e1,

title = "Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel{\textregistered} Xeon Phi{\texttrademark} coprocessor",

abstract = "Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel{\textregistered} Xeon Phi{\texttrademark} co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90\% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80\% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80\% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76\% efficiency while delivering the total performance of 107 TFLOPS.",

keywords = "HPL, hybrid parallelization, LU factorization, panel factorization, SIMD, TLP, Xeon Phi",

author = "Alexander Heinecke and Karthikeyan Vaidyanathan and Mikhail Smelyanskiy and Alexander Kobotov and Roman Dubtsov and Greg Henry and Shet, {Aniruddha G.} and George Chrysos and Pradeep Dubey",

note = "Copyright: Copyright 2013 Elsevier B.V., All rights reserved.; 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013 ; Conference date: 20-05-2013 Through 24-05-2013",

year = "2013",

doi = "10.1109/IPDPS.2013.113",

language = "English",

pages = "126--137",

}

RIS

TY - CONF

T1 - Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor

AU - Heinecke, Alexander

AU - Vaidyanathan, Karthikeyan

AU - Smelyanskiy, Mikhail

AU - Kobotov, Alexander

AU - Dubtsov, Roman

AU - Henry, Greg

AU - Shet, Aniruddha G.

AU - Chrysos, George

AU - Dubey, Pradeep

N1 - Conference code: 27

PY - 2013

Y1 - 2013

N2 - Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.

AB - Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.

KW - HPL

KW - hybrid parallelization

KW - LU factorization

KW - panel factorization

KW - SIMD

KW - TLP

KW - Xeon Phi

UR - http://www.scopus.com/inward/record.url?scp=84884866137&partnerID=8YFLogxK

U2 - 10.1109/IPDPS.2013.113

DO - 10.1109/IPDPS.2013.113

M3 - Paper

AN - SCOPUS:84884866137

SP - 126

EP - 137

T2 - 27th IEEE International Parallel and Distributed Processing Symposium

Y2 - 20 May 2013 through 24 May 2013

ER -
