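Parallel Example

Symptom

The machine has 128 physical cores, but the program runs with only 1 parallel thread, so it takes a long time to finish.

Tuning Approach

Parallelize the computation using the most basic OpenMP programming approach. On a multi-core system, increasing the degree of parallelism is the most direct optimization.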
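To make the approach concrete, here is a minimal sketch of what a basic OpenMP parallelization of matrix multiplication could look like. It is not the source of the matmul sample used in this example; the function name `matmul_parallel` and the row-major layout are assumptions made for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not taken from the matmul sample):
 * computes C = A * B for n x n matrices stored in row-major order. */
void matmul_parallel(int n, const double *a, const double *b, double *c)
{
    /* Split the outer loop over rows of C across the OpenMP threads.
     * Each thread writes a disjoint set of rows, so no locking is needed. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[(size_t)i * n + k] * b[(size_t)k * n + j];
            }
            c[(size_t)i * n + j] = sum;
        }
    }
}
```

With GCC, the pragma takes effect only when the code is compiled with `-fopenmp`; without that flag it is ignored and the loop runs serially. The number of threads can be set through the `OMP_NUM_THREADS` environment variable.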

Procedure

Figure 1 Parallel example code
  1. Run the parallel_matmult example with a matrix size of 2048 x 2048.
    ./matmul 2048 1
    

    The following information is returned:

    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.175117s
    Matrix multiplication time = 2.971563s
    

    With a 2048 x 2048 matrix, the parallel computation takes about 3 seconds.

  2. Create a roofline task for the parallel_matmult example with a matrix size of 2048 x 2048.
    devkit tuner roofline -o parallel_matmult_2048 -m region ./matmul 2048 1
    

    The following information is returned:

    Note:
        1. Roofline task is currently only supported on the 920 platform.
        2. The application must be a binary file in ELF format, and read permissions are required to detect the format of the application.
        3. Roofline task collection needs to ensure the application has finished running.
        4. The estimated time of roofline collection is about 3 * application estimated time.
        5. Roofline analysis is available only on physical machines.
        6. You can learn about the roofline profiling method by looking at document /usr/local/devkit/tuner/docs/ROOFLINE_KNOW_HOW.MD
    RFCOLLECT: Start collection for ./matmul
    RFCOLLECT: Launch application to collect performance metrics of ./matmul
    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.174164s
    ROOFLINE_EVENTS are initialized.
    Matrix multiplication time = 2.996051s
    RFCOLLECT: Launch application to do binary instrumentation of ./matmul
    Size is 2048, Matrix multiplication method is: 1, Check correctness is: 0
    Initialization time = 0.523171s
    Matrix multiplication time = 3.427321s
    RFCOLLECT: Launch benchmarks for measuring roofs
    RFCOLLECT: Processing all collected data
    RFCOLLECT: Result is captured at /matrix_multiplication/rfcollect-20240506-154009.json
    RFCOLLECT: Run "rfreport /matrix_multiplication/rfcollect-20240506-154009.json" to get report.
    
    Get roofline report ...
    The roofline json report: /matrix_multiplication/parallel_matmult_2048.json
    The roofline html report: /matrix_multiplication/parallel_matmult_2048.html
    
  3. View the parallel_matmult_2048.html report.
    Figure 2 parallel_matmult_2048.html report

    The roofs are measured at a parallelism of 128. The report shows Elapsed Time 2.953 s, GFLOP Count 17.18, and Performance 5.818 GFLOPS.
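The reported figures can be cross-checked by hand. The sketch below assumes a naive matrix multiplication, which performs n multiplications and n additions for each of the n x n output elements (2n^3 floating-point operations in total). The tool measures the GFLOP count rather than deriving it, so this formula is only a consistency check, and the function names are hypothetical.

```c
/* Floating-point work of a naive n x n matrix multiplication, in GFLOP:
 * n * n output elements, each needing n multiplies and n adds. */
double matmul_gflop(int n)
{
    double nd = (double)n;
    return 2.0 * nd * nd * nd / 1e9;
}

/* Performance in GFLOPS = work (GFLOP) / elapsed time (s). */
double gflops(double gflop, double elapsed_s)
{
    return gflop / elapsed_s;
}
```

For n = 2048 the first function gives about 17.18 GFLOP, and dividing by the elapsed time of 2.953 s gives about 5.818 GFLOPS, in line with the report.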

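Optimization Result

Parallelizing the computation reduces Elapsed Time from 62.699 s to 2.953 s, a speedup of about 21x. This confirms that increasing the degree of parallelism is the most direct optimization on a multi-core system.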

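Table 1 Performance comparison

case                  | Elapsed Time (s) | GFLOP Count | Performance (GFLOPS) | Per-unit-time speedup (vs. previous case) | End-to-end speedup (vs. previous case)
base_matmult_2048     | 62.699           | 17.18       | 0.274                | --                                        | --
parallel_matmult_2048 | 2.953            | 17.18       | 5.818                | 21.232                                    | 21.232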
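The two speedup columns can be reproduced from the other columns. The sketch below assumes that the end-to-end speedup is the ratio of elapsed times and that the per-unit-time speedup is the ratio of GFLOPS values. The source does not state how the tool computes them, and the function names are hypothetical.

```c
/* End-to-end speedup: how many times shorter the optimized run is. */
double time_speedup(double base_elapsed_s, double opt_elapsed_s)
{
    return base_elapsed_s / opt_elapsed_s;
}

/* Per-unit-time speedup: how many times more GFLOPS the optimized run achieves. */
double perf_speedup(double base_gflops, double opt_gflops)
{
    return opt_gflops / base_gflops;
}
```

With the table values, `time_speedup(62.699, 2.953)` is about 21.232. Because both cases execute the same 17.18 GFLOP, the performance ratio is the same quantity; computing it from the rounded GFLOPS values (5.818 and 0.274) gives about 21.23.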

Figure 3 Comparison analysis

Roofline analysis tasks in web mode support task comparison, so you can use web mode to view the comparison results.