Spark入门：基本的统计工具（2） – spark.mllib

【版权声明】博客内容由厦门大学数据库实验室拥有版权，未经允许，请勿转载！
[返回Spark教程首页]

五、假设检验 Hypothesis testing

Spark目前支持皮尔森卡方检测（Pearson’s chi-squared tests），包括“适配度检定”（Goodness of fit）以及“独立性检定”（independence）。

首先，我们导入必要的包

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics._

接下来，我们从数据集中选择要分析的数据，比如说我们取出iris数据集中的前两条数据v1和v2。不同的输入类型决定了是做拟合度检验还是独立性检验。拟合度检验要求输入为Vector, 独立性检验要求输入是Matrix。

scala> val v1: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).first
v1: org.apache.spark.mllib.linalg.Vector = [5.1,3.5,1.4,0.2]
scala> val v2: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).take(2).last
v2: org.apache.spark.mllib.linalg.Vector = [4.9,3.0,1.4,0.2]

(一) 适合度检验 Goodness fo fit

Goodness fo fit（适合度检验）：验证一组观察值的次数分配是否异于理论上的分配。其 H0假设（虚无假设，null hypothesis）为一个样本中已发生事件的次数分配会服从某个特定的理论分配。实际执行多项式试验而得到的观察次数，与虚无假设的期望次数相比较，检验二者接近的程度，利用样本数据以检验总体分布是否为某一特定分布的统计方法。

通常情况下这个特定的理论分配指的是均匀分配，目前Spark默认的是均匀分配。以下是代码：

scala> val goodnessOfFitTestResult = Statistics.chiSqTest(v1)
goodnessOfFitTestResult: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 5.588235294117647
pValue = 0.1334553914430291
No presumption against null hypothesis: observed follows the same distribution as expected..

可以看到P值，自由度，检验统计量，所使用的方法，以及零假设等信息。我们先简单介绍下每个输出的意义：

method: 方法。这里采用pearson方法。

statistic：检验统计量。简单来说就是用来决定是否可以拒绝原假设的证据。检验统计量的值是利用样本数据计算得到的，它代表了样本中的信息。检验统计量的绝对值越大，拒绝原假设的理由越充分，反之，不拒绝原假设的理由越充分。

degrees of freedom：自由度。表示可自由变动的样本观测值的数目，

pValue：统计学根据显著性检验方法所得到的P 值。一般以P < 0.05 为显著， P<0.01 为非常显著，其含义是样本间的差异由抽样误差所致的概率小于0.05 或0.01。

一般来说，假设检验主要看P值就够了。在本例中pValue =0.133，说明两组的差别无显著意义。通过V1的观测值[5.1, 3.5, 1.4, 0.2]，无法拒绝其服从于期望分配（这里默认是均匀分配）的假设。

（二）独立性检验 Indenpendence

卡方独立性检验是用来检验两个属性间是否独立。其中一个属性做为行，另外一个做为列，通过貌似相关的关系考察其是否真实存在相关性。比如天气温变化和肺炎发病率。

首先，我们通过v1、v2构造一个举证Matrix，然后进行独立性检验：

scala> val mat: Matrix =Matrices.dense(2,2,Array(v1(0),v1(1),v2(0),v2(1)))
mat: org.apache.spark.mllib.linalg.Matrix =
5.1  4.9
3.5  3.0
scala> val a =Statistics.chiSqTest(mat)
a: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.012787584067389817
pValue = 0.90996538641943
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..

这里所要检验是否独立的两个属性，一个是样本的序号，另一个是样本的数据值。在本例中pValue =0.91，说明无法拒绝“样本序号与数据值无关”的假设。这也符合数据集的实际情况，因为v1和v2是从同一个样本抽取的两条数据，样本的序号与数据的取值应该是没有关系的。

我们也可以把v1作为样本，把v2作为期望值，进行卡方检验：

scala> val c1 = Statistics.chiSqTest(v1, v2)
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 0.03717820461517941
pValue = 0.9981145601231336
No presumption against null hypothesis: observed follows the same distribution as expected..

本例中pValue =0.998，说明样本v1与期望值等于V2的数据分布并无显著差异。事实上，v1=[5.1,3.5,1.4,0.2]与v2= [4.9,3.0,1.4,0.2]很像，v1可以看做是从期望值为v2的数据分布中抽样出来的的。

同样的，键值对也可以进行独立性检验，这里我们取iris的数据组成键值对：

scala> val data=sc.textFile("G:/spark/iris.data")
data: org.apache.spark.rdd.RDD[String] = G:/spark/iris.data MapPartitionsRDD[13] at textFile at <console>:44
scala>     val obs = data.map{ line =>
     |       val parts = line.split(',')
     |       LabeledPoint(if(parts(4)=="Iris-setosa") 0.toDouble else if (parts(4)=="Iris-versicolor") 1.toDouble else
     |       2.toDouble, Vectors.dense(parts(0).toDouble,parts(1).toDouble,parts
(2).toDouble,parts(3).toDouble))}
obs: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[14] at map at <console>:46

进行独立性检验，返回一个包含每个特征对于标签的卡方检验的数组：

scala> val featureTestResults= Statistics.chiSqTest(obs)
featureTestResults: Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent.., Chi...

这里实际上是把特征数据中的每一列都与标签进行独立性检验。可以看出，P值都非常小，说明可以拒绝“某列与标签列无关”的假设。也就是说，可以认为每一列的数据都与最后的标签有相关性。我们用foreach把完整结果打印出来：

scala> var i = 1
i: Int = 1
scala> featureTestResults.foreach { result =>
     |   println(s"Column $i:\n$result")
     |   i += 1
     | }
Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..
Column 4:
Chi squared test summary:
method: pearson
degrees of freedom = 42
statistic = 271.75
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes
is statistically independent..

spark也支持Kolmogorov-Smirnov 检验，下面将展示具体的步骤：

scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44
// run a KS test for the sample versus a standard normal distribution
scala> val testResult = Statistics.kolmogorovSmirnovTest(test, "norm", 0, 1)
testResult: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult =
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.999991460094529
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
// perform a KS test using a cumulative distribution function of our making
scala>     val myCDF: Double => Double = (p=>p*2)
myCDF: Double => Double = <function1>
scala>     val testResult2 = Statistics.kolmogorovSmirnovTest(test, myCDF)
testResult2: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult = Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 14.806666666666668
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.

六、随机数生成 Random data generation

RandomRDDs 是一个工具集，用来生成含有随机数的RDD，可以按各种给定的分布模式生成数据集，Random RDDs包下现支持正态分布、泊松分布和均匀分布三种分布方式。RandomRDDs提供随机double RDDS或vector RDDS。

下面的例子中生成一个随机double RDD，其值是标准正态分布N（0，1），然后将其映射到N（1，4）。

首先，导入必要的包：

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

生成1000000个服从正态分配N(0,1)的RDD[Double]，并且分布在 10 个分区中：

scala> val u = normalRDD(sc, 10000000L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[35] at RDD at RandomRDD.scala:38

把生成的随机数转化成N(1,4) 正态分布：

scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[36] at map at <console>:50

七、核密度估计 Kernel density estimation

Spark ML 提供了一个工具类 KernelDensity 用于核密度估算，核密度估算的意思是根据已知的样本估计未知的密度，属於非参数检验方法之一。核密度估计的原理是。观察某一事物的已知分布，如果某一个数在观察中出现了，可认为这个数的概率密度很大，和这个数比较近的数的概率密度也会比较大，而那些离这个数远的数的概率密度会比较小。Spark1.6.2版本支持高斯核(Gaussian kernel)。

首先，导入必要的包：

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

同时留意到已经导入的数据：

scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44

用样本数据构建核函数，这里用假设检验中得到的iris的第一个属性的数据作为样本数据进行估计：


scala> val kd = new KernelDensity().setSample(test).setBandwidth(3.0)
kd: org.apache.spark.mllib.stat.KernelDensity = org.apache.spark.mllib.stat.KernelDensity@26216fa3

其中setBandwidth表示高斯核的宽度，为一个平滑参数，可以看做是高斯核的标准差。

构造了核密度估计kd，就可以对给定数据数据进行核估计：

scala> val densities = kd.estimate(Array(-1.0, 2.0, 5.0, 5.8))
densities: Array[Double] = Array(0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114)

这里表示的是，在样本-1.0, 2.0, 5.0, 5.8等样本点上，其估算的概率密度函数值分别是：0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114。

子雨大数据之Spark入门
扫一扫访问本博客

厦大数据库实验室博客

五、假设检验 Hypothesis testing

(一) 适合度检验 Goodness fo fit

（二）独立性检验 Indenpendence

六、随机数生成 Random data generation

七、核密度估计 Kernel density estimation