Getting Started with Spark: Basic Statistical Tools (2) – spark.mllib

5. Hypothesis Testing

Spark currently supports Pearson's chi-squared tests, including the goodness-of-fit test and the independence test.

First, import the required packages:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.stat.Statistics

Next, we select the data to analyze; for example, we take the first two records of the iris dataset as v1 and v2. The input type determines which test is performed: the goodness-of-fit test takes a Vector as input, while the independence test takes a Matrix.

scala> val v1: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).first
v1: org.apache.spark.mllib.linalg.Vector = [5.1,3.5,1.4,0.2]
scala> val v2: Vector = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble)).take(2).last
v2: org.apache.spark.mllib.linalg.Vector = [4.9,3.0,1.4,0.2]

(1) Goodness-of-Fit Test

The goodness-of-fit test checks whether the observed frequency distribution of a sample differs from a theoretical distribution. Its null hypothesis (H0) is that the frequencies of events observed in the sample follow a particular theoretical distribution. The observed counts obtained from a multinomial experiment are compared with the counts expected under the null hypothesis to see how close they are; in other words, it is a statistical method that uses sample data to test whether the population follows a specific distribution.

In most cases this theoretical distribution is the uniform distribution, which is also Spark's default when no expected distribution is supplied. The code is as follows:

scala> val goodnessOfFitTestResult = Statistics.chiSqTest(v1)
goodnessOfFitTestResult: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 5.588235294117647
pValue = 0.1334553914430291
No presumption against null hypothesis: observed follows the same distribution as expected..

The output shows the p-value, the degrees of freedom, the test statistic, the method used, and the conclusion about the null hypothesis. Let us briefly explain what each field means:

method: the test method. Pearson's method is used here.

statistic: the test statistic. Simply put, it is the evidence used to decide whether the null hypothesis can be rejected. Its value is computed from the sample data and summarizes the information in the sample. The larger the absolute value of the test statistic, the stronger the case for rejecting the null hypothesis; the smaller it is, the stronger the case for not rejecting it.

degrees of freedom: the number of sample observations that are free to vary.

pValue: the p-value obtained from the significance test. Conventionally, P < 0.05 is considered significant and P < 0.01 highly significant, meaning the probability that the observed difference is due to sampling error alone is less than 0.05 or 0.01, respectively.

In practice, looking at the p-value is usually enough. In this example pValue = 0.133, so the difference is not significant: based on the observations in v1, [5.1, 3.5, 1.4, 0.2], we cannot reject the hypothesis that they follow the expected distribution (the uniform distribution by default).
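To make the statistic field concrete, here is a minimal sketch (our own addition, with illustrative variable names) that recomputes Pearson's chi-square statistic for v1 by hand under the default uniform expectation:

// Pearson's chi-square statistic: sum over categories of (observed - expected)^2 / expected.
// With the uniform expectation, each expected count is sum(v1) / 4 = 10.2 / 4 = 2.55.
val observed = Array(5.1, 3.5, 1.4, 0.2)              // the values of v1
val expected = observed.sum / observed.length         // 2.55 for every category
val chi2 = observed.map(o => math.pow(o - expected, 2) / expected).sum
println(chi2)                                         // ≈ 5.588, matching the statistic above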

(2) Independence Test

The chi-square independence test checks whether two attributes are independent of each other. One attribute forms the rows of a contingency table and the other forms the columns, and the test examines whether an apparent association between them is real, for example the relationship between temperature changes and the incidence of pneumonia. A sketch of such a contingency table is shown below.
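As an illustration of this idea, here is a minimal sketch with hypothetical counts (the numbers are made up purely for demonstration and are not from any dataset): rows represent weather (cold/warm days) and columns represent outcome (ill/healthy), and the test asks whether the two attributes are independent:

//                 ill   healthy
//   cold days      30        70
//   warm days      15        85
val contingency = Matrices.dense(2, 2, Array(30.0, 15.0, 70.0, 85.0))   // column-major order
val weatherTest = Statistics.chiSqTest(contingency)
println(weatherTest)                                  // a small p-value would argue against independence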

Returning to the iris data, we first construct a Matrix from v1 and v2, and then run the independence test:

scala> val mat: Matrix =Matrices.dense(2,2,Array(v1(0),v1(1),v2(0),v2(1)))
mat: org.apache.spark.mllib.linalg.Matrix =
5.1  4.9
3.5  3.0
scala> val a =Statistics.chiSqTest(mat)
a: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 1
statistic = 0.012787584067389817
pValue = 0.90996538641943
No presumption against null hypothesis: the occurrence of the outcomes is statistically independent..

Here the two attributes being tested for independence are the record index and the attribute values. In this example pValue = 0.91, so we cannot reject the hypothesis that the record index is unrelated to the values. This matches the reality of the dataset: v1 and v2 are two records drawn from the same sample, so which record a value comes from should have no bearing on the value itself.

We can also run a chi-square test with v1 as the observed values and v2 as the expected values:

scala> val c1 = Statistics.chiSqTest(v1, v2)
c1: org.apache.spark.mllib.stat.test.ChiSqTestResult =
Chi squared test summary:
method: pearson
degrees of freedom = 3
statistic = 0.03717820461517941
pValue = 0.9981145601231336
No presumption against null hypothesis: observed follows the same distribution as expected..

Here pValue = 0.998, so there is no significant difference between the sample v1 and a distribution whose expected values equal v2. Indeed, v1 = [5.1, 3.5, 1.4, 0.2] and v2 = [4.9, 3.0, 1.4, 0.2] are very similar, and v1 can be regarded as a sample drawn from a distribution with expected values v2. Note that when the sums of the two vectors differ, the expected vector is rescaled to the observed total before the statistic is computed, as the sketch below illustrates.
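To see where the statistic 0.0372 comes from, here is a minimal sketch (our own addition) that rescales v2 to the total of v1 and then applies the same chi-square formula as before:

val obsArr = v1.toArray                               // [5.1, 3.5, 1.4, 0.2]
val scale  = obsArr.sum / v2.toArray.sum              // 10.2 / 9.5
val expArr = v2.toArray.map(_ * scale)                // expected counts rescaled to the observed total
val stat   = obsArr.zip(expArr).map { case (o, e) => math.pow(o - e, 2) / e }.sum
println(stat)                                         // ≈ 0.0372, matching the statistic above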

Similarly, labeled data can be tested for independence. Here we turn the iris data into an RDD of LabeledPoint, pairing each label with its feature vector:

scala> val data=sc.textFile("G:/spark/iris.data")
data: org.apache.spark.rdd.RDD[String] = G:/spark/iris.data MapPartitionsRDD[13] at textFile at <console>:44
scala>     val obs = data.map{ line =>
     |       val parts = line.split(',')
     |       LabeledPoint(if(parts(4)=="Iris-setosa") 0.toDouble else if (parts(4)=="Iris-versicolor") 1.toDouble else
     |       2.toDouble, Vectors.dense(parts(0).toDouble,parts(1).toDouble,parts(2).toDouble,parts(3).toDouble))}
obs: org.apache.spark.rdd.RDD[org.apache.spark.mllib.regression.LabeledPoint] = MapPartitionsRDD[14] at map at <console>:46

Running the independence test returns an array containing the chi-square test result of each feature against the label:

scala> val featureTestResults= Statistics.chiSqTest(obs)
featureTestResults: Array[org.apache.spark.mllib.stat.test.ChiSqTestResult] =
Array(Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent.., Chi...

What happens here is that each column of the feature data is tested for independence against the label. As the results show, all the p-values are extremely small, so we can reject the hypothesis that a given column is unrelated to the label column; in other words, every feature column can be considered correlated with the final label. We use foreach to print the complete results:

scala> var i = 1
i: Int = 1
scala> featureTestResults.foreach { result =>
     |   println(s"Column $i:\n$result")
     |   i += 1
     | }
Column 1:
Chi squared test summary:
method: pearson
degrees of freedom = 68
statistic = 156.26666666666665
pValue = 6.6659873176888595E-9
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 2:
Chi squared test summary:
method: pearson
degrees of freedom = 44
statistic = 88.36446886446883
pValue = 8.303947787857702E-5
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 3:
Chi squared test summary:
method: pearson
degrees of freedom = 84
statistic = 271.79999999999995
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
Column 4:
Chi squared test summary:
method: pearson
degrees of freedom = 42
statistic = 271.75
pValue = 0.0
Very strong presumption against null hypothesis: the occurrence of the outcomes is statistically independent..
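Results like these are often used for simple feature screening. The following minimal sketch (the 0.05 threshold and the variable names are our own choice, not part of the tutorial) keeps only the columns whose p-value falls below the threshold:

val significantColumns = featureTestResults.zipWithIndex
  .filter { case (result, _) => result.pValue < 0.05 }   // keep features with small p-values
  .map { case (_, idx) => idx + 1 }                      // 1-based column numbers, as printed above
println(significantColumns.mkString("Columns correlated with the label: ", ", ", ""))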

Spark also supports the Kolmogorov-Smirnov test. The concrete steps are shown below:

scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44
// run a KS test for the sample versus a standard normal distribution
scala> val testResult = Statistics.kolmogorovSmirnovTest(test, "norm", 0, 1)
testResult: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult =
Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 0.999991460094529
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
// perform a KS test using a cumulative distribution function of our making
scala>     val myCDF: Double => Double = (p=>p*2)
myCDF: Double => Double = <function1>
scala>     val testResult2 = Statistics.kolmogorovSmirnovTest(test, myCDF)
testResult2: org.apache.spark.mllib.stat.test.KolmogorovSmirnovTestResult = Kolmogorov-Smirnov test summary:
degrees of freedom = 0
statistic = 14.806666666666668
pValue = 0.0
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
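The strong rejections above are expected: the sepal-length values cluster around 5–7, far from a standard normal N(0,1), and p => p*2 is not a genuine cumulative distribution function. As a sanity check, here is a minimal sketch (our own addition) that tests the same sample against a normal distribution fitted to its own mean and standard deviation:

val sampleMean = test.mean()                          // mean of the sepal-length sample
val sampleStd  = test.stdev()                         // standard deviation of the sample
val fittedResult = Statistics.kolmogorovSmirnovTest(test, "norm", sampleMean, sampleStd)
println(fittedResult)                                 // a large p-value would mean no evidence against normality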

6. Random Data Generation

RandomRDDs is a utility class for generating RDDs of random numbers that follow a given distribution. It currently supports three distributions, normal, Poisson, and uniform, and it can produce either RDDs of random doubles or RDDs of random vectors.

The example below generates a random double RDD whose values follow the standard normal distribution N(0,1), and then maps it to N(1,4).

First, import the required packages:

import org.apache.spark.SparkContext
import org.apache.spark.mllib.random.RandomRDDs._

Generate an RDD[Double] containing 10,000,000 values drawn from the standard normal distribution N(0,1), spread across 10 partitions:

scala> val u = normalRDD(sc, 10000000L, 10)
u: org.apache.spark.rdd.RDD[Double] = RandomRDD[35] at RDD at RandomRDD.scala:38

Transform the generated random numbers to the normal distribution N(1,4):

scala> val v = u.map(x => 1.0 + 2.0 * x)
v: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[36] at map at <console>:50
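The other distributions mentioned above are generated in the same way. Here is a minimal sketch (sizes and parameters are arbitrary, chosen only for illustration) showing the uniform, Poisson, and vector-valued variants:

import org.apache.spark.mllib.random.RandomRDDs._

val uni  = uniformRDD(sc, 1000000L, 10)               // 1,000,000 doubles uniform on [0, 1], in 10 partitions
val poi  = poissonRDD(sc, 5.0, 1000000L, 10)          // 1,000,000 Poisson-distributed doubles with mean 5.0
val vecs = normalVectorRDD(sc, 1000000L, 4, 10)       // 1,000,000 vectors of length 4 with i.i.d. N(0,1) entries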

7. Kernel Density Estimation

Spark MLlib provides a utility class, KernelDensity, for kernel density estimation, which estimates an unknown probability density from a known sample and belongs to the family of non-parametric methods. The intuition is as follows: given observations of some quantity, if a value appears in the observations, its probability density can be considered high; values close to it also have relatively high density, while values far from it have low density. Spark 1.6.2 supports the Gaussian kernel.

First, import the required packages:

import org.apache.spark.mllib.stat.KernelDensity
import org.apache.spark.rdd.RDD

Note that we reuse the data loaded earlier:

scala> val test = sc.textFile("G:/spark/iris.data").map(_.split(",")).map(p => p(0).toDouble)
test: org.apache.spark.rdd.RDD[Double] = MapPartitionsRDD[22] at map at <console>:44

Build the kernel density estimator from sample data; here we use the first attribute of the iris data, loaded in the hypothesis-testing section, as the sample:


scala> val kd = new KernelDensity().setSample(test).setBandwidth(3.0)
kd: org.apache.spark.mllib.stat.KernelDensity = org.apache.spark.mllib.stat.KernelDensity@26216fa3

Here setBandwidth specifies the width of the Gaussian kernel; it is a smoothing parameter and can be regarded as the standard deviation of the Gaussian kernel.

With the kernel density estimator kd constructed, we can evaluate the estimated density at given points:

scala> val densities = kd.estimate(Array(-1.0, 2.0, 5.0, 5.8))
densities: Array[Double] = Array(0.011372003554433524, 0.059925911357198915, 0.12365409462424519, 0.12816280708978114)

This means that at the points -1.0, 2.0, 5.0, and 5.8, the estimated probability density values are 0.011372003554433524, 0.059925911357198915, 0.12365409462424519, and 0.12816280708978114, respectively.
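To make the role of the bandwidth concrete, the following minimal sketch (our own addition) reproduces the Gaussian kernel density estimate at a single point by hand: the estimate is the average, over all sample values, of a normal density centered at each sample value with standard deviation equal to the bandwidth:

// Manually compute (1/n) * sum_i N(x; mean = x_i, sd = h) at x = 5.0 with bandwidth h = 3.0.
val h = 3.0                                           // the bandwidth set above
val x = 5.0                                           // the point at which to estimate the density
val samples = test.collect()
val manual = samples.map { xi =>
  math.exp(-0.5 * math.pow((x - xi) / h, 2)) / (h * math.sqrt(2 * math.Pi))
}.sum / samples.length
println(manual)                                       // should agree with kd.estimate(Array(5.0)).head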
