
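Building the Adaptation Code for the Machine Learning Spark Algorithm Library

To build Spark-ml-algo-lib, the adaptation code for the machine learning algorithm acceleration library, perform the following steps: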

  1. Download the Spark 2.3.2 source zip package to the "/opt/" directory and decompress it to obtain the Spark source directory "/opt/spark-2.3.2".

    Download URL: https://github.com/apache/spark/archive/v2.3.2.zip

    cd /opt/
    wget https://github.com/apache/spark/archive/v2.3.2.zip
    unzip v2.3.2.zip
  2. Download the Breeze 0.13.1 source zip package to the "/opt/" directory and decompress it to obtain the Breeze source directory "/opt/breeze-releases-v0.13.1".

    Download URL: https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip

    cd /opt/
    wget https://github.com/scalanlp/breeze/archive/releases/v0.13.1.zip
    unzip v0.13.1.zip
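
Before continuing, it can help to confirm that both archives were extracted to the directories that the later steps expect. The following is a minimal sketch, not part of the original procedure; the function name check_dirs is hypothetical.

```shell
# check_dirs (hypothetical helper): print every argument that is not an
# existing directory and return non-zero if at least one is missing.
check_dirs() {
    missing=0
    for dir in "$@"; do
        if [ ! -d "$dir" ]; then
            echo "missing: $dir"
            missing=1
        fi
    done
    return "$missing"
}

# Example, using the directory names from steps 1 and 2:
# check_dirs /opt/spark-2.3.2 /opt/breeze-releases-v0.13.1
```

If the function prints nothing and returns 0, both source trees are in place.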
  3. In the "/opt/" directory, create a project named Spark-ml-algo-lib with the following directory hierarchy.

    cd /opt/
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
    mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
    mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity
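
The same 16 directories can also be created from a list, which makes it easier to spot a missing entry or to build the layout under a different root. This is only a sketch under that assumption; the function name make_layout and its ROOT argument are hypothetical, while the relative paths are the ones listed above.

```shell
# make_layout ROOT (hypothetical helper): create the step 3 directory
# hierarchy under ROOT/Spark-ml-algo-lib.
make_layout() {
    acc="$1/Spark-ml-algo-lib/ml-accelerator/src/main/scala"
    core="$1/Spark-ml-algo-lib/ml-core/src/main/scala"

    # Directories of the ml-accelerator module.
    for sub in \
        breeze/optimize \
        org/apache/spark/ml/classification \
        org/apache/spark/ml/optim/aggregator \
        org/apache/spark/ml/optim/loss \
        org/apache/spark/ml/recommendation \
        org/apache/spark/ml/regression \
        org/apache/spark/ml/tree/impl \
        org/apache/spark/mllib/clustering \
        org/apache/spark/mllib/fpm \
        org/apache/spark/mllib/linalg/distributed \
        org/apache/spark/mllib/tree
    do
        mkdir -p "$acc/$sub" || return 1
    done

    # Directories of the ml-core module.
    for sub in \
        breeze/numerics \
        org/apache/spark/ml/tree/impl \
        org/apache/spark/mllib/clustering \
        org/apache/spark/mllib/fpm \
        org/apache/spark/mllib/tree/impurity
    do
        mkdir -p "$core/$sub" || return 1
    done
}

# Example: make_layout /opt
```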
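  4. Copy the required source files from Spark 2.3.2 and Breeze 0.13.1 into the Spark-ml-algo-lib directory according to the mappings in Table 1 and Table 2. In each table, the two left columns give the target directory and file name, and the two right columns give the directory and file name of the source file to copy. Because many files need to be copied, only two example commands are given here.

    Some files must be renamed when they are copied to the target directory.

    Example commands:
    cp /opt/spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
    cp /opt/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/FirstOrderMinimizer.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/FirstOrderMinimizerX.scala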
    
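    Table 1 Spark files to be placed in the Spark-ml-algo-lib project

    | Spark-ml-algo-lib project directory | Spark-ml-algo-lib file name | Spark source directory | Spark source file name |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LogisticRegression.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/classification/ | LogisticRegression.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | DifferentiableLossAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HingeAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HingeAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HuberAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | HuberAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LeastSquaresAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LeastSquaresAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LogisticAggregatorX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/aggregator/ | LogisticAggregator.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunctionX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/optim/loss/ | RDDLossFunction.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | LinearRegression.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | LinearRegression.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | NodeIdCache.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | NodeIdCache.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest4GBDTX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForestRaw.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionForest.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMACCm.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | KMeans.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Split.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/ | Split.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTFeatureStatsAggregator.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTreesCore.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointX.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointY.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Entropy.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Gini.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Gini.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurities.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurities.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurity.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Impurity.scala |
    | Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Variance.scala | spark-2.3.2/mllib/src/main/scala/org/apache/spark/mllib/tree/impurity/ | Variance.scala |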

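    Table 2 Breeze files to be placed in the Spark-ml-algo-lib project

    | Spark-ml-algo-lib project directory | Spark-ml-algo-lib file name | Breeze source directory | Breeze source file name |
    |---|---|---|---|
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | FirstOrderMinimizerX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | FirstOrderMinimizer.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | LBFGSX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | LBFGS.scala |
    | Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize/ | OWLQNX.scala | breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize/ | OWLQN.scala |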
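
Because each table row pairs a source file name with a target file name, the copy-and-rename step can be driven from a list of such pairs instead of one cp command per file. The sketch below illustrates this for the three Breeze files in Table 2; the function names copy_mapped and copy_breeze_files are hypothetical, and the optional argument (default /opt) is an assumption that makes the base directory configurable.

```shell
# copy_mapped SRC_DIR DST_DIR (hypothetical helper): read
# "source-name target-name" pairs from standard input and copy each source
# file from SRC_DIR into DST_DIR under its target name.
copy_mapped() {
    src_dir=$1
    dst_dir=$2
    while read -r src_name dst_name; do
        [ -n "$src_name" ] || continue   # skip blank lines
        cp "$src_dir/$src_name" "$dst_dir/$dst_name" || return 1
    done
}

# copy_breeze_files [BASE] (hypothetical helper): copy the Table 2 files.
# BASE defaults to /opt, the directory used throughout this procedure.
copy_breeze_files() {
    base=${1:-/opt}
    copy_mapped \
        "$base/breeze-releases-v0.13.1/math/src/main/scala/breeze/optimize" \
        "$base/Spark-ml-algo-lib/ml-accelerator/src/main/scala/breeze/optimize" <<'EOF'
FirstOrderMinimizer.scala FirstOrderMinimizerX.scala
LBFGS.scala LBFGSX.scala
OWLQN.scala OWLQNX.scala
EOF
}

# Example: copy_breeze_files
```

The same copy_mapped function could be called once per directory group in Table 1, with the pairs taken from the table rows.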

    After step 4 is complete, the directory structure and files of the Spark-ml-algo-lib project are as follows:

    Spark-ml-algo-lib
    ├── ml-accelerator
    │   └── src
    │       └── main
    │           └── scala
    │               ├── breeze
    │               │   └── optimize
    │               │       ├── FirstOrderMinimizerX.scala
    │               │       ├── LBFGSX.scala
    │               │       └── OWLQNX.scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   ├── classification
    │                           │   │   ├── DecisionTreeClassifier.scala
    │                           │   │   ├── GBTClassifier.scala
    │                           │   │   ├── LinearSVC.scala
    │                           │   │   ├── LogisticRegression.scala
    │                           │   │   └── RandomForestClassifier.scala
    │                           │   ├── optim
    │                           │   │   ├── aggregator
    │                           │   │   │   ├── DifferentiableLossAggregatorX.scala
    │                           │   │   │   ├── HingeAggregatorX.scala
    │                           │   │   │   ├── HuberAggregatorX.scala
    │                           │   │   │   ├── LeastSquaresAggregatorX.scala
    │                           │   │   │   └── LogisticAggregatorX.scala
    │                           │   │   └── loss
    │                           │   │       └── RDDLossFunctionX.scala
    │                           │   ├── regression
    │                           │   │   ├── DecisionTreeRegressor.scala
    │                           │   │   ├── GBTRegressor.scala
    │                           │   │   ├── LinearRegression.scala
    │                           │   │   └── RandomForestRegressor.scala
    │                           │   └── tree
    │                           │       ├── impl
    │                           │       │   ├── DecisionForest.scala
    │                           │       │   ├── GradientBoostedTrees.scala
    │                           │       │   ├── NodeIdCache.scala
    │                           │       │   ├── RandomForest4GBDTX.scala
    │                           │       │   ├── RandomForestRaw.scala
    │                           │       │   └── RandomForest.scala
    │                           │       └── treeParams.scala
    │                           └── mllib
    │                               ├── clustering
    │                               │   ├── KMACCm.scala
    │                               │   └── KMeans.scala
    │                               ├── linalg
    │                               │   ├── distributed
    │                               │   │   └── RowMatrix.scala
    │                               │   └── EigenValueDecomposition.scala
    │                               └── tree
    │                                   └── DecisionTree.scala
    └── ml-core
        └── src
            └── main
                └── scala
                    └── org
                        └── apache
                            └── spark
                                ├── ml
                                │   └── tree
                                │       ├── impl
                                │       │   ├── BaggedPoint.scala
                                │       │   ├── DTFeatureStatsAggregator.scala
                                │       │   ├── DTStatsAggregator.scala
                                │       │   ├── GradientBoostedTreesCore.scala
                                │       │   ├── TreePointX.scala
                                │       │   └── TreePointY.scala
                                │       ├── Node.scala
                                │       └── Split.scala
                                └── mllib
                                    └── tree
                                        └── impurity
                                            ├── Entropy.scala
                                            ├── Gini.scala
                                            ├── Impurities.scala
                                            ├── Impurity.scala
                                            └── Variance.scala
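
A quick way to check the result of step 4 is to count the .scala files in the project. The tree above lists 43 of them (30 under ml-accelerator and 13 under ml-core). The following is a minimal sketch; the function name count_scala is hypothetical.

```shell
# count_scala ROOT (hypothetical helper): print the number of .scala files
# found anywhere below ROOT.
count_scala() {
    find "$1" -type f -name '*.scala' | wc -l | tr -d ' '
}

# Example: after step 4, the tree above implies a count of 43.
# count_scala /opt/Spark-ml-algo-lib
```

A smaller number suggests that a copy from Table 1 or Table 2 was skipped or that a file was copied without the required rename.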
  5. Download Spark-ml-algo-lib.patch to the "/opt/Spark-ml-algo-lib/" directory and apply it to the Spark-ml-algo-lib project to obtain the complete adaptation code for the machine learning algorithm acceleration library.
    cd /opt/Spark-ml-algo-lib
    wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v1.1.0/Spark-ml-algo-lib.patch
    patch -p1 < Spark-ml-algo-lib.patch
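
If any file from step 4 is missing or misnamed, the patch may apply only partially. One cautious option, sketched below, is to run patch with its --dry-run option first and apply the patch only when the dry run succeeds. The function name apply_patch_checked is hypothetical and is not part of the original procedure.

```shell
# apply_patch_checked DIR PATCH_FILE (hypothetical helper): apply PATCH_FILE
# inside DIR only if a dry run succeeds, so that a failing patch does not
# leave the tree half-modified. PATCH_FILE should be an absolute path
# because it is read after changing into DIR.
apply_patch_checked() {
    (
        cd "$1" || exit 1
        # --dry-run reports what would happen without changing any file.
        patch -p1 --dry-run < "$2" >/dev/null || exit 1
        patch -p1 < "$2"
    )
}

# Example:
# apply_patch_checked /opt/Spark-ml-algo-lib /opt/Spark-ml-algo-lib/Spark-ml-algo-lib.patch
```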
    

    The directory structure and files of the complete Spark-ml-algo-lib adaptation code are as follows:

    Spark-ml-algo-lib
    ├── LICENSE
    ├── ml-accelerator
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               ├── breeze
    │               │   └── optimize
    │               │       ├── FirstOrderMinimizerX.scala
    │               │       ├── LBFGSX.scala
    │               │       └── OWLQNX.scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   ├── classification
    │                           │   │   ├── DecisionTreeClassifier.scala
    │                           │   │   ├── GBTClassifier.scala
    │                           │   │   ├── LinearSVC.scala
    │                           │   │   ├── LogisticRegression.scala
    │                           │   │   └── RandomForestClassifier.scala
    │                           │   ├── optim
    │                           │   │   ├── aggregator
    │                           │   │   │   ├── DifferentiableLossAggregatorX.scala
    │                           │   │   │   ├── HingeAggregatorX.scala
    │                           │   │   │   ├── HuberAggregatorX.scala
    │                           │   │   │   ├── LeastSquaresAggregatorX.scala
    │                           │   │   │   └── LogisticAggregatorX.scala
    │                           │   │   └── loss
    │                           │   │       └── RDDLossFunctionX.scala
    │                           │   ├── regression
    │                           │   │   ├── DecisionTreeRegressor.scala
    │                           │   │   ├── GBTRegressor.scala
    │                           │   │   ├── LinearRegression.scala
    │                           │   │   └── RandomForestRegressor.scala
    │                           │   └── tree
    │                           │       ├── impl
    │                           │       │   ├── DecisionForest.scala
    │                           │       │   ├── GradientBoostedTrees.scala
    │                           │       │   ├── NodeIdCache.scala
    │                           │       │   ├── RandomForest4GBDTX.scala
    │                           │       │   ├── RandomForestRaw.scala
    │                           │       │   └── RandomForest.scala
    │                           │       └── treeParams.scala
    │                           └── mllib
    │                               ├── clustering
    │                               │   ├── KMACCm.scala
    │                               │   └── KMeans.scala
    │                               ├── linalg
    │                               │   ├── distributed
    │                               │   │   └── RowMatrix.scala
    │                               │   └── EigenValueDecomposition.scala
    │                               └── tree
    │                                   └── DecisionTree.scala
    ├── ml-core
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   └── tree
    │                           │       ├── impl
    │                           │       │   ├── BaggedPoint.scala
    │                           │       │   ├── DTFeatureStatsAggregator.scala
    │                           │       │   ├── DTStatsAggregator.scala
    │                           │       │   ├── GradientBoostedTreesCore.scala
    │                           │       │   ├── TreePointX.scala
    │                           │       │   └── TreePointY.scala
    │                           │       ├── Node.scala
    │                           │       └── Split.scala
    │                           └── mllib
    │                               └── tree
    │                                   └── impurity
    │                                       ├── Entropy.scala
    │                                       ├── Gini.scala
    │                                       ├── Impurities.scala
    │                                       ├── Impurity.scala
    │                                       └── Variance.scala
    ├── ml-kernel-client
    │   ├── pom.xml
    │   └── src
    │       └── main
    │           └── scala
    │               ├── breeze
    │               │   ├── linalg
    │               │   │   ├── blas
    │               │   │   │   ├── Dgemv.scala
    │               │   │   │   └── Gramian.scala
    │               │   │   ├── DenseMatrixUtil.scala
    │               │   │   ├── DenseVectorUtil.scala
    │               │   │   └── lapack
    │               │   │       └── EigenDecomposition.scala
    │               │   └── optimize
    │               │       ├── ACC.scala
    │               │       ├── LBFGSL.scala
    │               │       └── OWLQNL.scala
    │               └── org
    │                   └── apache
    │                       └── spark
    │                           ├── ml
    │                           │   └── tree
    │                           │       └── impl
    │                           │           ├── DTUtils.scala
    │                           │           ├── GradientBoostedTreesUtil.scala
    │                           │           └── RFUtils.scala
    │                           ├── mllib.clustering
    │                           │   └── KmeansUtil.scala
    │                           └── mllib.linalg.distributed
    │                               └── RowMatrixUtil.scala
    ├── pom.xml
    ├── README.md
    └── scalastyle-config.xml