Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor. / Heinecke, Alexander; Vaidyanathan, Karthikeyan; Smelyanskiy, Mikhail et al.
2013. 126-137. Paper presented at 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013, Boston, MA, United States.
Research output: Contribution to conference › Paper › peer-review
TY - CONF
T1 - Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel® Xeon Phi™ coprocessor
AU - Heinecke, Alexander
AU - Vaidyanathan, Karthikeyan
AU - Smelyanskiy, Mikhail
AU - Kobotov, Alexander
AU - Dubtsov, Roman
AU - Henry, Greg
AU - Shet, Aniruddha G.
AU - Chrysos, George
AU - Dubey, Pradeep
N1 - Copyright: Copyright 2013 Elsevier B.V., All rights reserved.
PY - 2013
Y1 - 2013
N2 - Dense linear algebra has been traditionally used to evaluate the performance and efficiency of new architectures. This trend has continued for the past half decade with the advent of multi-core processors and hardware accelerators. In this paper we describe how several flavors of the Linpack benchmark are accelerated on Intel's recently released Intel® Xeon Phi™ co-processor (code-named Knights Corner) in both native and hybrid configurations. Our native DGEMM implementation takes full advantage of Knights Corner's salient architectural features and successfully utilizes close to 90% of its peak compute capability. Our native Linpack implementation running entirely on Knights Corner employs novel dynamic scheduling and achieves close to 80% efficiency - the highest published co-processor efficiency. Similarly to native, our single-node hybrid implementation of Linpack also achieves nearly 80% efficiency. Using dynamic scheduling and an enhanced look-ahead scheme, this implementation scales well to a 100-node cluster, on which it achieves over 76% efficiency while delivering the total performance of 107 TFLOPS.
KW - HPL
KW - hybrid parallelization
KW - LU factorization
KW - panel factorization
KW - SIMD
KW - TLP
KW - Xeon Phi
UR - http://www.scopus.com/inward/record.url?scp=84884866137&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2013.113
DO - 10.1109/IPDPS.2013.113
M3 - Paper
AN - SCOPUS:84884866137
SP - 126
EP - 137
T2 - 27th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2013
Y2 - 20 May 2013 through 24 May 2013
ER -
ID: 27580126