Public Datasets

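The machine learning test cases in this document use datasets published on their official websites. Download the house, HIGGS, nytimes, Kosarak, DEEP1B, Mnist8m, Epsilon, and MESH_DEFORM datasets from those sites. All download, decompression, and upload operations described below are performed on the server1 node.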

Downloading the house Dataset from the Official Website

Create the “/test/dataset/ml” directory and go to it.
```
mkdir -p /test/dataset/ml
cd /test/dataset/ml
```
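Download the house dataset from the official website (the machine must be able to reach Google). Open the following address in a browser: https://sites.google.com/view/approxdbscan/datasets
Place the downloaded dataset in the “/test/dataset/ml” directory.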

Create the folders in HDFS.

```
hadoop fs -mkdir -p /tmp/dataset/ml
hadoop fs -mkdir -p /tmp/ml/dataset
```

Upload the dataset to “/tmp/dataset/ml”.

```
hadoop fs -put /test/dataset/ml/house.ds /tmp/dataset/ml
```

Start spark-shell.
```
spark-shell
```
Enter the following command (the leading colon is part of the command).
```
:paste
```

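Paste and run the following code to process the dataset.

```
// Load the raw dataset and inspect the first rows.
val file = sc.textFile("/tmp/dataset/ml/house.ds")
file.take(10).foreach(println(_))
file.count
// Keep rows with exactly 8 space-separated fields and drop the first field.
val data = file.map(x => x.split(" ")).filter(_.length == 8).map(x => x.slice(1, 8).mkString(" "))
data.count
data.take(10).foreach(println(_))
// Write the processed data to HDFS as a single partition.
data.repartition(1).saveAsTextFile("/tmp/ml/dataset/house")
```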

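Press Enter, then press Ctrl+D to end paste mode and run the code.
Check whether the processed dataset exists in the corresponding HDFS directory. The result is shown in the following figure.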
```
hadoop fs -ls /tmp/ml/dataset/house
```

Remove the dataset directory that is no longer needed from HDFS.

```
hadoop fs -rm -r /tmp/dataset/ml
```
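The row filtering done in spark-shell for the house dataset can be previewed locally before touching HDFS. The following is a minimal Python sketch of the same rule (keep rows with exactly 8 space-separated fields, then drop the first field). The function name `clean_house_lines` and the sample rows are hypothetical; real house.ds values will differ, and Python's `split(" ")` treats trailing spaces slightly differently from Scala's.

```python
def clean_house_lines(lines, expected_fields=8):
    """Keep rows with exactly `expected_fields` space-separated fields
    and drop the first field of each kept row."""
    cleaned = []
    for line in lines:
        fields = line.split(" ")
        if len(fields) == expected_fields:
            cleaned.append(" ".join(fields[1:expected_fields]))
    return cleaned


# Hypothetical sample rows, used only for illustration.
sample = [
    "1 0.1 0.2 0.3 0.4 0.5 0.6 0.7",  # 8 fields: kept, first field dropped
    "7 7",                            # too few fields: dropped
    "2 1.0 2.0 3.0 4.0 5.0 6.0 7.0",  # 8 fields: kept, first field dropped
]

for row in clean_house_lines(sample):
    print(row)
```

Running the sketch on the first few lines of the downloaded file is a quick way to confirm that the field count of 8 matches the data.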

Downloading the HIGGS Dataset from the Official Website

Create the “/test/dataset/ml/higgs” directory and go to it.
```
mkdir -p /test/dataset/ml/higgs
cd /test/dataset/ml/higgs
```

Download the HIGGS dataset from the official website.

```
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/HIGGS.bz2
```

Decompress the dataset into the current directory.
```
bzip2 -d HIGGS.bz2
```
Create the “/tmp/ml/dataset/higgs” folder in HDFS.
```
hadoop fs -mkdir -p /tmp/ml/dataset/higgs
```

Upload the dataset to HDFS.

```
hadoop fs -put /test/dataset/ml/higgs/HIGGS /tmp/ml/dataset/higgs
```

Start spark-shell.
```
spark-shell
```
Enter the following command (the leading colon is part of the command).
```
:paste
```

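Paste and run the following code to split the dataset into a training set and a test set.

```
// Read the libsvm file; numFeatures sets the feature dimension to 28.
val reader = spark.read.format("libsvm").option("numFeatures", 28)
val dataPath = "/tmp/ml/dataset/higgs"
val data = reader.load(dataPath)
// Split roughly 70/30 with a fixed seed so that the split is reproducible.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
val trainOutputPath = s"${dataPath}_train"
val testOutputPath = s"${dataPath}_test"
// Save both parts in libsvm format next to the source directory.
trainingData.write.format("libsvm").save(trainOutputPath)
testData.write.format("libsvm").save(testOutputPath)
```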

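Press Enter, then press Ctrl+D to end paste mode and run the code.
Remove the dataset directory that is no longer needed from HDFS.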
```
hadoop fs -rm -r /tmp/ml/dataset/higgs
```
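Check whether the training set and test set exist in the corresponding HDFS directory. The result is shown in the following figure.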
```
hadoop fs -ls /tmp/ml/dataset
```
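The split above relies on a fixed seed (2020) so that repeated runs produce the same training and test sets. The following plain-Python sketch illustrates that idea only; it is not Spark's implementation, and the function name `random_split` is hypothetical. Each row is assigned independently, so the resulting sizes are approximately, not exactly, 70% and 30%.

```python
import random


def random_split(rows, train_weight=0.7, seed=2020):
    """Assign each row to the training or test set with a seeded generator."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_weight:
            train.append(row)
        else:
            test.append(row)
    return train, test


rows = list(range(1000))
train, test = random_split(rows)
# Sizes vary around 700/300; the same seed always gives the same split.
print(len(train), len(test))
```

Changing the seed changes which rows land in each set, which is why the procedure fixes it for comparable test runs.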

Downloading the nytimes Dataset from the Official Website

Create the “/test/dataset/ml/nytimes” directory and go to it.
```
mkdir -p /test/dataset/ml/nytimes
cd /test/dataset/ml/nytimes
```

Download the nytimes dataset from the official website.

```
wget https://archive.ics.uci.edu/ml/machine-learning-databases/bag-of-words/docword.nytimes.txt.gz
```

Decompress the dataset into the current directory.
```
gzip -d docword.nytimes.txt.gz
```

Create dataset_process.py (it is a Python file, so pay attention to the indentation).

```
vim dataset_process.py
```

The file content is as follows.

```
import sys

if __name__ == "__main__":
    if len(sys.argv) <= 1:
        print("Please input dataset")
        sys.exit(1)
    filename = sys.argv[1]
    print("Reading data")
    processed_data = {}
    with open(filename, 'r') as fp:
        data = fp.readlines()
    print("Pre-processing data")
    # Skip the 3 header lines, then group "vocab_id:count" pairs by document ID.
    for line in data[3:]:
        line_split = line.strip().split()
        if len(line_split) < 3:
            continue
        doc_id = int(line_split[0])
        vocab_id = line_split[1]
        term_num = line_split[2]
        if doc_id not in processed_data:
            processed_data[doc_id] = str(doc_id)
        processed_data[doc_id] += (" %s:%s" % (vocab_id, term_num))
    print("Post-processing data")
    # Emit one line per document, ordered by document ID.
    doc_ids = list(processed_data.keys())
    doc_ids.sort()
    data = []
    for doc_id in doc_ids:
        data.append(processed_data[doc_id] + "\n")
    print("Writing data")
    with open(filename + ".libsvm", 'w') as fp:
        fp.writelines(data)
```

Use dataset_process.py to convert the dataset to the libsvm format.
```
python3 dataset_process.py docword.nytimes.txt
```
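To make the conversion concrete, the following Python sketch applies the same transformation to a tiny in-memory input. It assumes the bag-of-words layout that the script relies on: three header lines followed by "docID wordID count" triples. The function name `docword_to_libsvm` and the sample values are hypothetical.

```python
def docword_to_libsvm(lines, header_lines=3):
    """Group "docID wordID count" triples into one libsvm-style line per document."""
    docs = {}
    for line in lines[header_lines:]:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        doc_id = int(parts[0])
        docs.setdefault(doc_id, []).append("%s:%s" % (parts[1], parts[2]))
    return ["%d %s" % (doc_id, " ".join(docs[doc_id])) for doc_id in sorted(docs)]


# Hypothetical sample: 3 header lines, then docID wordID count.
sample = ["2", "5", "4", "1 3 2", "1 1 1", "2 5 4", "2 2 1"]

for row in docword_to_libsvm(sample):
    print(row)
```

In the output, the document ID takes the position of the libsvm label, and the word IDs keep the order in which they appear in the input.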
Rename the dataset file docword.nytimes.txt.libsvm to docword.nytimes.txt.libsvm.raw.

Create reorder.py (it is a Python file, so pay attention to the indentation).

```
vim reorder.py
```

The file content is as follows.

```
filename = "docword.nytimes.txt.libsvm.raw"
new_filename = "docword.nytimes.txt.libsvm"
with open(filename, 'r') as fp:
    filedata = fp.readlines()
print("Data length: %d" % len(filedata))
count = 0
data = []
# Parse each line into [doc_index, {term_index: term_count}].
for line in filedata:
    line_split = line.strip().split()
    doc_index = int(line_split[0])
    doc_terms = {}
    for term in line_split[1:]:
        term_split = term.strip().split(":")
        assert int(term_split[0]) not in doc_terms
        doc_terms[int(term_split[0])] = int(term_split[1])
    data.append([doc_index, doc_terms])
    count += 1
    if count % 100000 == 0:
        print("Processed %d00K" % int(count / 100000))
count = 0
new_filedata = []
# Rebuild each line with term indices in ascending order.
for doc in data:
    doc_string = str(doc[0])
    term_indices = list(doc[1].keys())
    term_indices.sort()
    for term_index in term_indices:
        doc_string += (" " + str(term_index) + ":" + str(doc[1][term_index]))
    doc_string += "\n"
    new_filedata.append(doc_string)
    count += 1
    if count % 100000 == 0:
        print("Generated %d00K" % int(count / 100000))
with open(new_filename, 'w') as fp:
    fp.writelines(new_filedata)
```

Use reorder.py to reorder the renamed dataset.
```
python3 reorder.py
```
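The reordering step sorts the term indices of every document in ascending numeric order. Readers of the libsvm format, Spark's among them as far as we know, generally expect indices in that order. The sketch below shows the per-line operation on sample strings; the function name `sort_libsvm_line` is hypothetical and the sample lines are made up.

```python
def sort_libsvm_line(line):
    """Return the line with its index:value pairs sorted by numeric index."""
    label, *terms = line.split()
    pairs = {}
    for term in terms:
        index_text, value = term.split(":")
        index = int(index_text)
        if index in pairs:
            raise ValueError("duplicate index %d" % index)
        pairs[index] = value
    ordered = ["%d:%s" % (index, pairs[index]) for index in sorted(pairs)]
    return " ".join([label] + ordered)


print(sort_libsvm_line("1 3:2 1:1"))
print(sort_libsvm_line("2 10:1 9:3 2:5"))
```

Sorting on the integer value matters: a plain string sort would place index 10 before index 2.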
Create the “/tmp/ml/dataset/nytimes” folder in HDFS.
```
hadoop fs -mkdir -p /tmp/ml/dataset/nytimes/
```

Upload the dataset to HDFS.

```
hadoop fs -put /test/dataset/ml/nytimes/docword.nytimes.txt.libsvm /tmp/ml/dataset/nytimes/
```

Downloading the Kosarak Dataset from the Official Website

Create the “/test/dataset/ml/Kosarak” directory and go to it.
```
mkdir -p /test/dataset/ml/Kosarak
cd /test/dataset/ml/Kosarak
```

Download the Kosarak dataset from the official website.

```
wget http://www.philippe-fournier-viger.com/spmf/datasets/kosarak_sequences.txt
```

Create the “/tmp/ml/dataset/Kosarak” folder in HDFS.
```
hadoop fs -mkdir -p /tmp/ml/dataset/Kosarak/
```

Upload the dataset to HDFS.

```
hadoop fs -put /test/dataset/ml/Kosarak/kosarak_sequences.txt /tmp/ml/dataset/Kosarak/
```

Downloading the DEEP1B Dataset from the Official Website

Create the “/test/dataset/ml/DEEP1B” directory and go to it.
```
mkdir -p /test/dataset/ml/DEEP1B
cd /test/dataset/ml/DEEP1B
```

Download the DEEP1B dataset from the official website.

```
wget http://ann-benchmarks.com/deep-image-96-angular.hdf5
```

Create processHDF5.py (it is a Python file, so pay attention to the indentation).

```
vim processHDF5.py
```

The file content is as follows.

```
import os
import h5py

# downloaded hdf5 file
inputFile = h5py.File('deep-image-96-angular.hdf5', 'r')
# directory name to store output files
outputDir = "deep1b"
# the number of samples in each output file
samplesPerFile = 5000
sampleCnt = 0
fileCnt = 0
writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
# write the 'train' samples first, then the 'test' samples, one sample per line
for split in ('train', 'test'):
    data = inputFile[split]
    for feature in data:
        writer.write(','.join([str(d) for d in feature]) + "\n")
        sampleCnt += 1
        # start a new part file once the current one is full
        if sampleCnt == samplesPerFile:
            writer.close()
            fileCnt += 1
            sampleCnt = 0
            writer = open(os.path.join(outputDir, 'part-{}'.format(fileCnt)), 'w')
writer.close()
```

Process the data: convert the hdf5 file into text files, with one sample per line and features separated by commas.
```
mkdir deep1b
python3 processHDF5.py
```
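If the script fails because the h5py module is not installed (for example, with ModuleNotFoundError: No module named 'h5py'), run python3 -m pip install h5py and then run the script again.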
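The conversion writes a fixed number of samples to each part file. The following self-contained Python sketch shows that chunking pattern on a small in-memory list, without h5py, so the file layout can be checked before converting the full dataset. The function name `write_in_parts` and the sample rows are hypothetical, and this variant opens a part file only when there is a row to write, so it is a sketch of the pattern rather than a copy of processHDF5.py.

```python
import os
import tempfile


def write_in_parts(rows, output_dir, rows_per_file):
    """Write rows as comma-separated lines, rows_per_file lines per part file."""
    paths = []
    writer = None
    for i, row in enumerate(rows):
        if i % rows_per_file == 0:
            # Current part is full (or this is the first row): open the next part.
            if writer is not None:
                writer.close()
            path = os.path.join(output_dir, "part-{}".format(len(paths)))
            writer = open(path, "w")
            paths.append(path)
        writer.write(",".join(str(value) for value in row) + "\n")
    if writer is not None:
        writer.close()
    return paths


with tempfile.TemporaryDirectory() as tmp:
    rows = [[i, i * 10] for i in range(5)]  # hypothetical samples
    paths = write_in_parts(rows, tmp, rows_per_file=2)
    names = [os.path.basename(p) for p in paths]
    with open(paths[-1]) as fp:
        last_lines = fp.read().splitlines()

print(names)
print(last_lines)
```

With 5 rows and 2 rows per file, the sketch produces three part files, and the last one holds the single remaining row.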
Create the “/tmp/ml/dataset/DEEP1B” folder in HDFS.
```
hadoop fs -mkdir -p /tmp/ml/dataset/DEEP1B
```

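Upload the dataset to HDFS. The part files are in the deep1b directory created under “/test/dataset/ml/DEEP1B” in the previous step.

```
hadoop fs -put /test/dataset/ml/DEEP1B/deep1b/* /tmp/ml/dataset/DEEP1B/
```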

Downloading the Mnist8m Dataset from the Official Website

Create the “/test/dataset/ml/mnist8m” directory and go to it.
```
mkdir -p /test/dataset/ml/mnist8m
cd /test/dataset/ml/mnist8m
```

Download the mnist8m dataset from the official website.

```
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/mnist8m.bz2
```

Decompress the dataset into the current directory.
```
bzip2 -d mnist8m.bz2
```
Create the “/tmp/ml/dataset/mnist8m” folder in HDFS.
```
hadoop fs -mkdir -p /tmp/ml/dataset/mnist8m
```

Upload the dataset to HDFS.

```
hadoop fs -put /test/dataset/ml/mnist8m/mnist8m /tmp/ml/dataset/mnist8m
```

Start spark-shell.
```
spark-shell
```
Enter the following command (the leading colon is part of the command).
```
:paste
```

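Paste and run the following code to split the dataset into a training set and a test set.

```
// Read the libsvm file; numFeatures sets the feature dimension to 784.
val reader = spark.read.format("libsvm").option("numFeatures", 784)
val dataPath = "/tmp/ml/dataset/mnist8m"
val data = reader.load(dataPath)
// Split roughly 70/30 with a fixed seed so that the split is reproducible.
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), 2020)
val trainOutputPath = s"${dataPath}_train"
val testOutputPath = s"${dataPath}_test"
// Save both parts in libsvm format next to the source directory.
trainingData.write.format("libsvm").save(trainOutputPath)
testData.write.format("libsvm").save(testOutputPath)
```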

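Press Enter, then press Ctrl+D to end paste mode and run the code.

Remove the dataset directory that is no longer needed from HDFS.

```
hadoop fs -rm -r /tmp/ml/dataset/mnist8m
```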

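Check whether the training set and test set exist in the corresponding HDFS directory. The result is shown in the following figure.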
```
hadoop fs -ls /tmp/ml/dataset
```
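The numFeatures value passed to the libsvm reader (784 for mnist8m, 28 for HIGGS) should be at least the largest feature index that occurs in the file. A quick way to sanity-check this on a sample of lines is sketched below in Python. The function name `max_feature_index` and the sample lines are hypothetical; on the real file, the scan would typically be limited to the first few thousand lines.

```python
def max_feature_index(lines):
    """Return the largest feature index found in libsvm-formatted lines."""
    largest = 0
    for line in lines:
        # The first token is the label; the rest are index:value pairs.
        for term in line.split()[1:]:
            index = int(term.split(":", 1)[0])
            largest = max(largest, index)
    return largest


# Hypothetical sample lines in libsvm format.
sample = ["5 152:13 153:254 784:7", "0 10:255 300:1"]
print(max_feature_index(sample))
```

If the reported index exceeds the configured numFeatures, the value in the loading code needs to be raised accordingly.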

Downloading the Epsilon Dataset from the Official Website

Create the “/test/dataset/ml/epsilon” directory and go to it.
```
mkdir -p /test/dataset/ml/epsilon
cd /test/dataset/ml/epsilon
```

Download the epsilon training set and test set from the official website.

```
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.bz2
wget https://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary/epsilon_normalized.t.bz2
```

Decompress the training set and test set into the current directory.

```
bzip2 -d epsilon_normalized.bz2
bzip2 -d epsilon_normalized.t.bz2
```

Create the “/tmp/ml/dataset/epsilon_train” and “/tmp/ml/dataset/epsilon_test” folders in HDFS.

```
hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_train
hadoop fs -mkdir -p /tmp/ml/dataset/epsilon_test
```

Upload the training set and test set to HDFS.

```
hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized /tmp/ml/dataset/epsilon_train
hadoop fs -put /test/dataset/ml/epsilon/epsilon_normalized.t /tmp/ml/dataset/epsilon_test
```

Downloading the MESH_DEFORM Dataset from the Official Website

Create the “/test/dataset/ml/mesh_deform” directory and go to it.

```
mkdir -p /test/dataset/ml/mesh_deform
cd /test/dataset/ml/mesh_deform
```

Download the MESH_DEFORM dataset from the official website.

```
wget https://suitesparse-collection-website.herokuapp.com/MM/Yoshiyasu/mesh_deform.tar.gz
```

Extract the archive into the current directory.
```
tar zxvf mesh_deform.tar.gz
```
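Extraction produces the mesh_deform.mtx file (if the archive extracts into a subdirectory, go to that directory first). Open the mtx file and delete lines 1 to 25: lines 1 to 24 are information lines, and line 25 gives the number of rows, the number of columns, and the number of nonzero entries of the matrix. The actual matrix data starts at line 26.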
```
vim mesh_deform.mtx
```
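As an alternative to deleting the header by hand, the same result can be scripted. The sketch below assumes the usual Matrix Market layout, in which the information lines start with `%` and the first line after them holds the row count, column count, and nonzero count. The function name `strip_matrix_market_header` and the sample content are hypothetical. On the real file, the result should be compared against the 25-line rule described above before uploading.

```python
def strip_matrix_market_header(lines):
    """Drop '%' comment lines and the size line; return only the data lines."""
    body = []
    size_line_seen = False
    for line in lines:
        if not line.strip() or line.startswith("%"):
            continue  # blank line or information line
        if not size_line_seen:
            size_line_seen = True  # rows, columns, nonzeros: not matrix data
            continue
        body.append(line)
    return body


# Hypothetical miniature .mtx content.
sample = [
    "%%MatrixMarket matrix coordinate real general",
    "% comment line",
    "3 3 2",
    "1 1 0.5",
    "3 2 1.5",
]

for row in strip_matrix_market_header(sample):
    print(row)
```

To apply it to mesh_deform.mtx, read the file with `readlines()`, pass the lines to the function, and write the returned lines back with `writelines()`.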
Create the “/tmp/ml/dataset/MESH_DEFORM” folder in HDFS.
```
hadoop fs -mkdir -p /tmp/ml/dataset/MESH_DEFORM
```

Upload the dataset to HDFS.

```
hadoop fs -put mesh_deform.mtx /tmp/ml/dataset/MESH_DEFORM/
```

Check whether the data exists in the corresponding HDFS directory. The result is shown in the following figure.
```
hadoop fs -ls /tmp/ml/dataset/MESH_DEFORM
```
