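Parallel Example
Symptom
On a machine with many physical cores (128), the program runs on a single thread, so it takes a long time to complete.
Tuning Approach
Parallelize the computation using basic OpenMP programming. In a multi-core environment, increasing the degree of parallelism is the most direct optimization.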
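The kind of change this approach describes can be sketched as follows. This is a minimal illustration, not the example's actual source code: the function name `matmul_parallel`, the row-major `double` layout, and the loop order are all assumptions made for this sketch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not taken from the matmul example itself):
 * computes c = a * b for square n x n matrices stored in row-major order. */
void matmul_parallel(size_t n, const double *a, const double *b, double *c)
{
    assert(n > 0);

    /* Each iteration of the outer loop writes a distinct row of c, so the
     * rows can be distributed across threads without any synchronization.
     * The guard keeps the code compilable when OpenMP is not enabled. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (long i = 0; i < (long)n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++) {
                sum += a[(size_t)i * n + k] * b[k * n + j];
            }
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

With GCC, OpenMP is enabled by compiling with `-fopenmp`, and the number of threads can be set through the `OMP_NUM_THREADS` environment variable. Without that flag the pragma is skipped and the loop runs serially, which makes it easy to compare the two behaviors using the same source.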
Procedure
Figure 1 Parallel example code
- Run the parallel_matmult example with a matrix dimension of 2048.
./matmul 2048 1
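The following information is returned:
Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 0.175117s
Matrix multiplication time = 2.971563s
With a matrix dimension of 2048, the parallel computation takes about 3 seconds.
- Create a roofline task for the parallel_matmult example with a matrix dimension of 2048.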
devkit tuner roofline -o parallel_matmult_2048 -m region ./matmul 2048 1
The following information is returned:
Note:
1. Roofline task is currently only supported on the 920 platform.
2. The application must be a binary file in ELF format.
3. Roofline task collection needs to ensure the application has finished running.
4. The estimated time of roofline collection is about 3 * application estimated time.
5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
RFCOLLECT: Start collection for ./matmul
RFCOLLECT: Launch application to collect performance metrics of ./matmul
Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 0.174164s
ROOFLINE_EVENTS are initialized.
Matrix multiplication time = 2.996051s
RFCOLLECT: Launch application to do binary instrumentation of ./matmul
Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
Initialization time = 0.523171s
Matrix multiplication time = 3.427321s
RFCOLLECT: Launch benchmarks for measuring roofs
RFCOLLECT: Processing all collected data
RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-154009.json
RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-154009.json" to get report.
Get roofline report ...
The roofline json report: /matrix_multiplication/parallel_matmult_2048.json
The roofline html report: /matrix_multiplication/parallel_matmult_2048.html
- View the parallel_matmult_2048_html report.
Figure 2 parallel_matmult_2048_html report
In this report the roofs are measured at a parallelism of 128. The report shows an Elapsed Time of 2.953 s, a GFLOP Count of 17.18, and a Performance of 5.818 GFLOPS.
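The reported figures can be cross-checked by hand. The sketch below assumes the conventional operation count for a dense n x n matrix multiplication, 2·n³ floating-point operations (one multiply and one add per inner-loop step). The report does not state how it counts operations, but for n = 2048 this gives 17,179,869,184 operations, which matches the reported 17.18 GFLOP. The function names are hypothetical.

```c
#include <assert.h>

/* Assumed operation count for a dense n x n matrix multiplication:
 * 2 * n^3 floating-point operations, expressed in GFLOP (1e9 operations). */
double matmul_gflop(double n)
{
    return 2.0 * n * n * n / 1e9;
}

/* Performance in GFLOPS is the operation count divided by elapsed seconds. */
double gflops(double gflop, double seconds)
{
    assert(seconds > 0.0);
    return gflop / seconds;
}
```

Calling `gflops(17.18, 2.953)` reproduces the reported performance of roughly 5.818 GFLOPS.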
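Optimization Results
In a multi-core environment, increasing the degree of parallelism is the most direct optimization. Compared with the base_matmult_2048 case, the parallel version performs the same 17.18 GFLOP of work about 21 times faster.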
case | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Per-unit-time performance ratio (vs. previous case) | End-to-end performance ratio (vs. previous case)
---|---|---|---|---|---
base_matmult_2048 | 62.699 | 17.18 | 0.274 | -- | --
parallel_matmult_2048 | 2.953 | 17.18 | 5.818 | 21.232 | 21.232
Figure 3 Comparative analysis
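The two ratio columns in the comparison can be reproduced with a simple division. The sketch below assumes the end-to-end ratio is the previous elapsed time divided by the current one, and the per-unit-time ratio is the current GFLOPS divided by the previous GFLOPS. The document does not spell out these definitions, and the function name `ratio` is hypothetical.

```c
#include <assert.h>

/* Generic ratio helper: how many times larger the numerator is. */
double ratio(double numerator, double denominator)
{
    assert(denominator > 0.0);
    return numerator / denominator;
}

/* End-to-end ratio: elapsed time of base case / elapsed time of parallel case. */
double end_to_end_ratio(void)
{
    return ratio(62.699, 2.953);
}

/* Per-unit-time ratio: GFLOPS of parallel case / GFLOPS of base case. */
double per_unit_time_ratio(void)
{
    return ratio(5.818, 0.274);
}
```

Both values come out at about 21.23. Because the GFLOP count is identical in the two cases, the two ratios are expected to agree. Small differences in the last digits arise from using the rounded values shown in the comparison.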