
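Parallel Example

Symptom

On a machine with many physical cores (128 in this case), the program runs with only one thread, so it takes a long time to complete.

Tuning Approach

Parallelize the computation with basic OpenMP (omp) programming. In a multi-core environment, increasing the degree of parallelism is the most direct optimization.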
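The source of the matmul example is not reproduced in this text, so the following is only a minimal sketch of what basic OpenMP parallelization of a matrix multiplication can look like. It assumes a naive triple loop over row-major n x n matrices; the function name matmul_parallel is hypothetical and is not taken from the example program.

```c
#include <stddef.h>

/* Hypothetical sketch: naive row-major multiply c = a * b for n x n matrices.
 * The outer loop over rows is divided among OpenMP threads. Each thread
 * writes a disjoint set of rows of c, so no synchronization is needed.
 * If the compiler is not given an OpenMP flag, the pragma is ignored and
 * the loop simply runs serially. */
void matmul_parallel(int n, const double *a, const double *b, double *c)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            }
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

With GCC, a file containing this function would typically be built with the -fopenmp option, and the number of threads can be limited through the standard OMP_NUM_THREADS environment variable.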

Procedure

Figure 1 Parallel example code
  1. Run the parallel_matmult example with a matrix dimension (rows and columns) of 2048.
    ./matmul 2048 1

    The following information is returned:

    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.175117s
    Matrix multiplication time = 2.971563s

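    With a 2048 x 2048 matrix, the parallel computation takes about 3 seconds.

  2. Create a roofline task for the parallel_matmult example with a matrix dimension of 2048.
    devkit tuner roofline -o parallel_matmult_2048 -m region ./matmul 2048 1

    The following information is returned: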

    Note:
      1. Roofline task is currently only supported on the 920 platform.
      2. The application must be a binary file in ELF format.
      3. Roofline task collection needs to ensure the application has finished running.
      4. The estimated time of roofline collection is about 3 * application estimated time.
      5. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.174164s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 2.996051s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.523171s
    Matrix multiplication time = 3.427321s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-154009.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-154009.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/parallel_matmult_2048.json
    The roofline html report: /matrix_multiplication/parallel_matmult_2048.html
  3. View the parallel_matmult_2048_html report.
    Figure 2 parallel_matmult_2048_html report

    The roofs in this report are measured at a parallelism of 128. The report shows Elapsed Time 2.953 s, GFLOP Count 17.18, and Performance 5.818 GFLOPS.
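The reported figures can be cross-checked with simple arithmetic. The sketch below assumes that the tool counts a naive n x n matrix multiplication as 2 * n^3 floating-point operations (one multiply and one add per inner-loop iteration). That assumption is not stated in the report, but it is consistent with the GFLOP Count of 17.18 for n = 2048. The function names are hypothetical.

```c
/* Hypothetical helper: floating-point work of a naive n x n matrix
 * multiply in GFLOP, assuming n*n output elements that each need
 * n multiplies and n adds (2 * n^3 operations in total). */
double matmul_gflop(int n)
{
    double d = (double)n;
    return 2.0 * d * d * d / 1e9;
}

/* Hypothetical helper: achieved performance in GFLOPS, that is,
 * the amount of work divided by the elapsed time in seconds. */
double gflops(double gflop_count, double elapsed_seconds)
{
    return gflop_count / elapsed_seconds;
}
```

For n = 2048, matmul_gflop gives about 17.18, and gflops(17.18, 2.953) gives about 5.818, matching the Elapsed Time, GFLOP Count, and Performance values in the report.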

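Optimization Result

In a multi-core environment, increasing the degree of parallelism is the most direct optimization. Compared with base_matmult_2048, the elapsed time drops from 62.699 s to 2.953 s, a speedup of about 21.2x.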

Table 1 Performance comparison

case                    Elapsed Time (s)   GFLOP Count   Performance (GFLOPS)   Per-unit-time speedup (vs. previous case)   End-to-end speedup (vs. previous case)
base_matmult_2048       62.699             17.18         0.274                  --                                          --
parallel_matmult_2048   2.953              17.18         5.818                  21.232                                      21.232

Figure 3 Comparison analysis
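As a worked check of the two speedup columns, the ratios can be derived from the values in the table. The symbols below (T for elapsed time, P for performance, W for the GFLOP count) are introduced here only for illustration. Because both cases perform the same amount of work, the two ratios are the same.

```latex
\begin{align*}
\text{end-to-end speedup}
  &= \frac{T_{\text{base}}}{T_{\text{parallel}}}
   = \frac{62.699\ \text{s}}{2.953\ \text{s}} \approx 21.232 \\
\text{per-unit-time speedup}
  &= \frac{P_{\text{parallel}}}{P_{\text{base}}}
   = \frac{W / T_{\text{parallel}}}{W / T_{\text{base}}}
   = \frac{T_{\text{base}}}{T_{\text{parallel}}} \approx 21.232,
  \qquad W = 17.18\ \text{GFLOP}
\end{align*}
```

Dividing the rounded Performance values directly (5.818 / 0.274) gives roughly 21.23, which agrees within rounding.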