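Building the Adaptation Code of the Machine Learning Algorithm Acceleration Library
- The procedure for building Spark-ml-algo-lib, the adaptation code of the machine learning algorithm acceleration library, is as follows. It uses the build of the adaptation code for Spark 3.3.1 as an example.
- Perform the following operations in a Linux environment. This section is for reference only.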
- Download the Spark 3.3.1 source code zip package to the “/opt/” directory and decompress it to obtain the Spark source directory.
Download URL: https://github.com/apache/spark/archive/v3.3.1.zip
wget https://github.com/apache/spark/archive/v3.3.1.zip
- Download the Breeze 1.0 source code zip package to the “/opt/” directory and decompress it to obtain the Breeze source directory.
Download URL: https://github.com/scalanlp/breeze/archive/releases/v1.0.zip
wget https://github.com/scalanlp/breeze/archive/releases/v1.0.zip
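The two download steps can be combined with the decompression that the text asks for. The following is a minimal sketch, assuming `wget` and `unzip` are installed and that the archives unpack into `spark-3.3.1` and `breeze-releases-v1.0` (the directory names used by the copy commands later in this section). The helper names `fetch_zip` and `require_dir` are hypothetical and not part of Spark-ml-algo-lib.

```shell
# Hypothetical helpers for illustration only.

# Download a zip archive into a directory and unpack it there.
fetch_zip() (
    url=$1
    dest=$2
    cd "$dest" || exit 1
    wget "$url" || exit 1
    # wget saves the archive under the last component of the URL.
    unzip -q "${url##*/}"
)

# Fail with a message if an expected directory is missing.
require_dir() {
    if [ ! -d "$1" ]; then
        echo "missing directory: $1" >&2
        return 1
    fi
}

# Usage (needs network access and write permission on /opt):
#   fetch_zip https://github.com/apache/spark/archive/v3.3.1.zip /opt
#   fetch_zip https://github.com/scalanlp/breeze/archive/releases/v1.0.zip /opt
#   require_dir /opt/spark-3.3.1/mllib/src/main/scala
#   require_dir /opt/breeze-releases-v1.0/math/src/main/scala
```

Checking for the extracted directories before continuing makes a failed or partial download visible early, rather than as a missing file during the copy step.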
- In the “/opt/” directory, create the Spark-ml-algo-lib project with the following directory structure.
cd /opt/
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/fpm
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tuning
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/optimization
mkdir -p Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/classification
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/recommendation
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/regression
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tuning
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/feature
mkdir -p Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm
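Because every directory shares the `<module>/src/main/scala/` prefix, the same layout can be produced with a small loop. This is only a sketch: `make_pkg_dirs` and `make_layout` are hypothetical helper names, and the package lists are copied from the commands above.

```shell
# Hypothetical helper: create package directories under
# <root>/<module>/src/main/scala/.
make_pkg_dirs() (
    root=$1
    module=$2
    shift 2
    for pkg in "$@"; do
        mkdir -p "$root/$module/src/main/scala/$pkg" || exit 1
    done
)

# Hypothetical helper: create the whole project layout under <root>.
make_layout() {
    make_pkg_dirs "$1" ml-accelerator \
        org/apache/spark/ml/classification \
        org/apache/spark/ml/feature \
        org/apache/spark/ml/fpm \
        org/apache/spark/ml/recommendation \
        org/apache/spark/ml/regression \
        org/apache/spark/ml/tree/impl \
        org/apache/spark/ml/tuning \
        org/apache/spark/mllib/clustering \
        org/apache/spark/mllib/feature \
        org/apache/spark/mllib/fpm \
        org/apache/spark/mllib/linalg/distributed \
        org/apache/spark/mllib/optimization \
        org/apache/spark/mllib/tree &&
    make_pkg_dirs "$1" ml-core \
        breeze/numerics \
        org/apache/spark/ml/classification \
        org/apache/spark/ml/recommendation \
        org/apache/spark/ml/regression \
        org/apache/spark/ml/tree/impl \
        org/apache/spark/ml/tuning \
        org/apache/spark/mllib/clustering \
        org/apache/spark/mllib/feature \
        org/apache/spark/mllib/fpm
}

# Usage:
#   make_layout /opt/Spark-ml-algo-lib
```

Since `mkdir -p` also creates parent directories, `ml/tree/` and `mllib/linalg/` exist once `ml/tree/impl` and `mllib/linalg/distributed` have been created.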
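- Copy the source files from Spark 3.3.1 and Breeze 1.0 into the “Spark-ml-algo-lib” directory according to the mappings in Table 1 and Table 2. In each table, the two left columns give the target directory and file name, and the two right columns give the directory and name of the source file to copy. Because many files need to be copied, only two example commands are given.
Some files must be renamed when they are copied to the target directory, as in the second example.
Example commands:
cp /opt/spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala /opt/Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/GBTClassifier.scala
cp /opt/breeze-releases-v1.0/math/src/main/scala/breeze/numerics/package.scala /opt/Spark-ml-algo-lib/ml-core/src/main/scala/breeze/numerics/DigammaX.scala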
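Table 1 Spark files to be placed in the Spark-ml-algo-lib project

| Spark-ml-algo-lib project directory | Spark-ml-algo-lib project file name | Spark source directory | Spark source file name |
|---|---|---|---|
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/ | GBTClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/ | LinearSVC.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/ | DecisionTreeClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/classification/ | FMClassifier.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/ | FMClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/feature/ | IDF.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | Word2Vec.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/feature/ | Word2Vec.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/feature/ | DecisionTreeBucketizer.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/classification/ | RandomForestClassifier.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/fpm/ | PrefixSpan.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/fpm/ | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/recommendation/ | NMF.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/recommendation/ | ALS.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/regression/ | DecisionTreeRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/regression/ | GBTRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | FMRegressor.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/regression/ | FMRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/regression/ | RandomForestRegressor.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest4GBDTX.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForestRaw.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionForest.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeBucket.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | RandomForest.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeMetadata.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DecisionTreeMetadata.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeParams.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tree/ | treeModels.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/ | treeModels.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tuning/ | BayesianCrossValidator.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tuning/ | CrossValidator.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | LDA.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDA.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/feature/ | IDF.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/feature/ | PCA.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/feature/ | PCA.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/fpm/ | FPGrowth.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | FPGrowth.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/ | RowMatrix.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/linalg/ | EigenValueDecomposition.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/optimization/ | LBFGSN.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/optimization/ | LBFGS.scala |
| Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/tree/ | DecisionTree.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/ | Node.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/ | Node.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | BaggedPoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | DTFeatureStatsAggregator.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | DTStatsAggregator.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTreesCore.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | GradientBoostedTrees.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointX.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePointY.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tree/impl/ | TreePoint.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtilsX.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAUtils.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/clustering/ | OnlineLDAOptimizerXObj.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/clustering/ | LDAOptimizer.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/feature/ | VocabWord.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/feature/ | Word2Vec.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | LocalPrefixSpan.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpanBase.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | PrefixSpan.scala |
| Spark-ml-algo-lib/ml-core/src/main/scala/org/apache/spark/mllib/fpm/ | FPGrowthCore.scala | spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/fpm/ | FPGrowth.scala |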
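With this many rows, typing one `cp` command per file is error-prone. One possible approach, sketched here under assumptions, is to keep the table as plain text with four whitespace-separated fields per line (target directory, target file name, source directory, source file name) and let a loop do the copying. The function name `copy_from_table` and the text format are hypothetical; all paths are taken relative to a base directory such as `/opt`.

```shell
# Hypothetical helper: read mapping rows from standard input and copy each
# source file to its target name under the given base directory.
# Row format: <target dir> <target file> <source dir> <source file>
copy_from_table() (
    base=$1
    status=0
    while read -r tdir tfile sdir sfile; do
        # Skip blank lines and comment lines.
        case $tdir in ''|'#'*) continue ;; esac
        if [ ! -f "$base/$sdir/$sfile" ]; then
            echo "source not found: $base/$sdir/$sfile" >&2
            status=1
            continue
        fi
        mkdir -p "$base/$tdir" &&
            cp "$base/$sdir/$sfile" "$base/$tdir/$tfile" || status=1
    done
    exit "$status"
)

# Usage, with two rows from Table 1 that rename the file:
#   copy_from_table /opt <<'EOF'
#   Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/ml/tuning BayesianCrossValidator.scala spark-3.3.1/mllib/src/main/scala/org/apache/spark/ml/tuning CrossValidator.scala
#   Spark-ml-algo-lib/ml-accelerator/src/main/scala/org/apache/spark/mllib/optimization LBFGSN.scala spark-3.3.1/mllib/src/main/scala/org/apache/spark/mllib/optimization LBFGS.scala
#   EOF
```

The function keeps going after a missing source file and returns a non-zero status at the end, so a single run reports every row that could not be copied.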
- Download the patch to the “/opt/Spark-ml-algo-lib/” directory. Taking Spark 3.3.1 as an example, apply the Spark 3.3.1 patch to Spark-ml-algo-lib to obtain the complete adaptation code of the machine learning algorithm acceleration library.
cd /opt/Spark-ml-algo-lib
wget https://github.com/kunpengcompute/Spark-ml-algo-lib/releases/download/v3.0.0-spark3.3.1/Spark-ml-algo-lib-Spark3.3.1.patch
patch -p1 < Spark-ml-algo-lib-Spark3.3.1.patch
The directory structure of the complete Spark-ml-algo-lib adaptation code is consistent with the code in the repository.
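If a file from the copy step is missing or was copied from the wrong source, the patch can fail partway and leave the tree half-modified. A minimal sketch of a safer variant, assuming GNU `patch` (whose `--dry-run` option checks a patch without changing any file), is shown below. The function name `apply_patch_checked` is hypothetical.

```shell
# Hypothetical helper: apply a -p1 patch only if a dry run succeeds.
# $1 = project directory, $2 = absolute path of the patch file.
apply_patch_checked() (
    cd "$1" || exit 1
    if ! patch -p1 --dry-run < "$2" > /dev/null; then
        echo "patch does not apply cleanly in $1" >&2
        exit 1
    fi
    patch -p1 < "$2"
)

# Usage:
#   apply_patch_checked /opt/Spark-ml-algo-lib \
#       /opt/Spark-ml-algo-lib/Spark-ml-algo-lib-Spark3.3.1.patch
```

When the dry run fails, nothing in the project directory has been changed, so the copy step can be corrected and the patch retried.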