Getting Started with Spark 2.1.0: Logistic Regression Classifier


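[Copyright notice] The content of this blog is copyrighted by the Database Laboratory of Xiamen University. Please do not repost it without permission.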


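Logistic Regression

Overview

Logistic regression is a classic classification method in statistical learning and belongs to the family of log-linear models. The dependent variable in logistic regression can be binary or multiclass.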

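Basic Principles

The logistic distribution

Let X be a continuous random variable. X follows the logistic distribution if it has the following distribution function and density function:

\(F(x) = P(X \le x) = \dfrac{1}{1 + e^{-(x-\mu)/\gamma}}\)

\(f(x) = F'(x) = \dfrac{e^{-(x-\mu)/\gamma}}{\gamma \left(1 + e^{-(x-\mu)/\gamma}\right)^{2}}\)

where $\mu$ is the location parameter and $\gamma > 0$ is the shape parameter.

The distribution function \(F(x)\) is an S-shaped curve that is symmetric about the point \((\mu, \frac{1}{2})\), and $f(x)$ is its derivative. The smaller \(\gamma\) is, the faster the curve changes near the center.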
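As a small numerical illustration, the sketch below evaluates the distribution function and the density, assuming the standard forms F(x) = 1 / (1 + exp(-(x - mu) / gamma)) and f(x) = F'(x). The object and method names are illustrative and not part of Spark.

```scala
// Sketch of the logistic distribution; names here are illustrative only.
object LogisticDistribution {
  // Distribution function F(x) = 1 / (1 + exp(-(x - mu) / gamma)), with gamma > 0.
  def cdf(x: Double, mu: Double, gamma: Double): Double =
    1.0 / (1.0 + math.exp(-(x - mu) / gamma))

  // Density f(x) = e / (gamma * (1 + e)^2), where e = exp(-(x - mu) / gamma).
  def pdf(x: Double, mu: Double, gamma: Double): Double = {
    val e = math.exp(-(x - mu) / gamma)
    e / (gamma * (1.0 + e) * (1.0 + e))
  }
}
```

For example, `LogisticDistribution.cdf(mu, mu, gamma)` is 0.5 for any valid gamma, and `cdf(mu + d, mu, gamma) + cdf(mu - d, mu, gamma)` is 1, which is the symmetry about the point (mu, 1/2).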

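The binomial logistic regression model

The binomial logistic regression model is given by the following conditional probabilities:

\(P(Y=1 \mid x) = \dfrac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}\)

\(P(Y=0 \mid x) = \dfrac{1}{1 + \exp(w \cdot x + b)}\)

where \(x \in \mathbf{R}^{n}\) is the input, \(Y \in \{0, 1\}\) is the output, w is the weight vector, b is the bias, and \(w \cdot x\) is the inner product of w and x.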
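The model can be sketched in plain Scala as follows, assuming the usual form P(Y = 1 | x) = exp(w·x + b) / (1 + exp(w·x + b)). The names `BinomialLogistic`, `probOne`, `probZero` and `predict` are hypothetical helpers for illustration, not Spark APIs.

```scala
// Minimal sketch of the binomial logistic regression model (illustrative names).
object BinomialLogistic {
  // Inner product w · x of two equal-length vectors.
  def dot(w: Array[Double], x: Array[Double]): Double = {
    require(w.length == x.length, "w and x must have the same length")
    w.zip(x).map { case (wi, xi) => wi * xi }.sum
  }

  // P(Y = 1 | x), written in the equivalent sigmoid form 1 / (1 + exp(-(w·x + b))).
  def probOne(w: Array[Double], b: Double, x: Array[Double]): Double =
    1.0 / (1.0 + math.exp(-(dot(w, x) + b)))

  // P(Y = 0 | x) = 1 - P(Y = 1 | x).
  def probZero(w: Array[Double], b: Double, x: Array[Double]): Double =
    1.0 - probOne(w, b, x)

  // Predict class 1 when P(Y = 1 | x) exceeds the threshold (0.5 by default).
  def predict(w: Array[Double], b: Double, x: Array[Double], threshold: Double = 0.5): Int =
    if (probOne(w, b, x) > threshold) 1 else 0
}
```

With all weights and the bias equal to zero, both classes get probability 0.5; a positive value of w·x + b pushes the prediction toward class 1.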

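Parameter estimation

To simplify the notation, absorb the bias into the weight vector by appending a constant 1 to x, so that \(w \cdot x\) stands for the whole linear term. For training data \((x_i, y_i)\), \(i = 1, \dots, N\), with \(y_i \in \{0, 1\}\), let:

\(P(Y=1 \mid x) = \pi(x), \quad P(Y=0 \mid x) = 1 - \pi(x)\)

The likelihood function is then:

\(\prod_{i=1}^{N} [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1 - y_i}\)

Taking the logarithm gives the log-likelihood function:

\(L(w) = \sum_{i=1}^{N} \left[ y_i \log \pi(x_i) + (1 - y_i) \log (1 - \pi(x_i)) \right]\)

\(= \sum_{i=1}^{N} \left[ y_i (w \cdot x_i) - \log \left(1 + \exp(w \cdot x_i)\right) \right]\)

Maximizing \(L(w)\) yields the estimate of w. The maximum can be found with iterative methods such as gradient ascent on \(L(w)\) or, equivalently, gradient descent on \(-L(w)\).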
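To make the estimation step concrete, here is a minimal sketch of batch gradient ascent on the log-likelihood. It assumes the bias is absorbed into w (each input ends with a constant 1.0) and uses the gradient sum_i (y_i - pi(x_i)) x_i. The learning rate, the iteration count and all names are illustrative choices; this is not how Spark's optimizer works internally.

```scala
// Gradient ascent on the logistic log-likelihood (illustrative sketch, not Spark's optimizer).
object GradientAscentSketch {
  private def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

  private def dot(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (ai, bi) => ai * bi }.sum

  // Log-likelihood L(w) = sum_i [ y_i (w·x_i) - log(1 + exp(w·x_i)) ].
  def logLikelihood(w: Array[Double], xs: Array[Array[Double]], ys: Array[Int]): Double =
    xs.zip(ys).map { case (x, y) =>
      val z = dot(w, x)
      y * z - math.log(1.0 + math.exp(z))
    }.sum

  // Batch update: w <- w + rate * sum_i (y_i - pi(x_i)) x_i, starting from w = 0.
  // Each x_i is expected to end with a constant 1.0 so that the bias is part of w.
  def fit(xs: Array[Array[Double]], ys: Array[Int], rate: Double, iterations: Int): Array[Double] = {
    var w = Array.fill(xs.head.length)(0.0)
    for (_ <- 0 until iterations) {
      val grad = Array.fill(w.length)(0.0)
      for ((x, y) <- xs.zip(ys)) {
        val err = y - sigmoid(dot(w, x))
        for (j <- w.indices) grad(j) += err * x(j)
      }
      w = w.zip(grad).map { case (wj, gj) => wj + rate * gj }
    }
    w
  }
}
```

On a toy dataset such as inputs 0, 1, 2, 3 (each followed by the constant 1.0) with labels 0, 0, 1, 1, calling `fit(xs, ys, 0.1, 200)` returns weights whose log-likelihood is higher than that of the all-zero starting point.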

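Example Code

We use the iris dataset as an example. It records measurements of iris flowers and contains 150 samples in 3 classes, with 50 samples per class and 4 attributes per sample. It is one of the most commonly used datasets for testing and training in data mining and classification. The code below builds each feature vector from all four attributes (the last two being petal length and petal width). spark.ml currently supports both binary and multiclass logistic regression, so we cover three cases in turn: solving a binary classification problem with binomial logistic regression, solving a binary classification problem with multinomial logistic regression, and solving a multiclass classification problem with multinomial logistic regression.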

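Solving a binary classification problem with binomial logistic regression

First, we take only the last two classes of the data and use binomial logistic regression for binary classification.

1. Import the required packages: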
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, HashingTF, Tokenizer}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression, LogisticRegressionModel}
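2. Read the data and take a quick look:

Import spark.implicits._ so that an RDD can be implicitly converted to a DataFrame. We define the schema with a case class, Iris, which describes the structure of the data we need. We then read the text file. The first map splits each line on ","; in our dataset each line yields 5 parts, where the first 4 are the features of the flower and the last is its class. The second map stores the features in a Vector and builds an RDD of Iris records, which toDF() converts to a DataFrame. Finally, show() displays the first rows.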

scala> import spark.implicits._
import spark.implicits._

scala> case class Iris(features: org.apache.spark.ml.linalg.Vector, label: String)
defined class Iris

scala> val data = spark.sparkContext.textFile("file:///usr/local/spark/iris.txt").map(_.split(",")).map(p => Iris(Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble), p(4).toString())).toDF()
data: org.apache.spark.sql.DataFrame = [features: vector, label: string]

scala> data.show()
+-----------------+-----------+
|         features|      label|
+-----------------+-----------+
|[5.1,3.5,1.4,0.2]|Iris-setosa|
|[4.9,3.0,1.4,0.2]|Iris-setosa|
|[4.7,3.2,1.3,0.2]|Iris-setosa|
|[4.6,3.1,1.5,0.2]|Iris-setosa|
|[5.0,3.6,1.4,0.2]|Iris-setosa|
|[5.4,3.9,1.7,0.4]|Iris-setosa|
|[4.6,3.4,1.4,0.3]|Iris-setosa|
|[5.0,3.4,1.5,0.2]|Iris-setosa|
|[4.4,2.9,1.4,0.2]|Iris-setosa|
|[4.9,3.1,1.5,0.1]|Iris-setosa|
|[5.4,3.7,1.5,0.2]|Iris-setosa|
|[4.8,3.4,1.6,0.2]|Iris-setosa|
|[4.8,3.0,1.4,0.1]|Iris-setosa|
|[4.3,3.0,1.1,0.1]|Iris-setosa|
|[5.8,4.0,1.2,0.2]|Iris-setosa|
|[5.7,4.4,1.5,0.4]|Iris-setosa|
|[5.4,3.9,1.3,0.4]|Iris-setosa|
|[5.1,3.5,1.4,0.3]|Iris-setosa|
|[5.7,3.8,1.7,0.3]|Iris-setosa|
|[5.1,3.8,1.5,0.3]|Iris-setosa|
+-----------------+-----------+
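only showing top 20 rows

Because we are dealing with a binary classification problem here, we do not need all 3 classes and select two of them. We first register the DataFrame as a temporary view named iris, which lets us query it with SQL; here we select all rows whose class is not "Iris-setosa". Printing the result confirms that no "Iris-setosa" rows remain.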

scala> data.createOrReplaceTempView("iris")

scala> val df = spark.sql("select * from iris where label != 'Iris-setosa'")
df: org.apache.spark.sql.DataFrame = [features: vector, label: string]

scala> df.map(t => t(1)+":"+t(0)).collect().foreach(println)
Iris-versicolor:[7.0,3.2,4.7,1.4]
Iris-versicolor:[6.4,3.2,4.5,1.5]
Iris-versicolor:[6.9,3.1,4.9,1.5]
Iris-versicolor:[5.5,2.3,4.0,1.3]
Iris-versicolor:[6.5,2.8,4.6,1.5]
Iris-versicolor:[5.7,2.8,4.5,1.3]
Iris-versicolor:[6.3,3.3,4.7,1.6]
Iris-versicolor:[4.9,2.4,3.3,1.0]
Iris-versicolor:[6.6,2.9,4.6,1.3]
Iris-versicolor:[5.2,2.7,3.9,1.4]
Iris-versicolor:[5.0,2.0,3.5,1.0]
Iris-versicolor:[5.9,3.0,4.2,1.5]
Iris-versicolor:[6.0,2.2,4.0,1.0]
Iris-versicolor:[6.1,2.9,4.7,1.4]
Iris-versicolor:[5.6,2.9,3.6,1.3]
Iris-versicolor:[6.7,3.1,4.4,1.4]
Iris-versicolor:[5.6,3.0,4.5,1.5]
Iris-versicolor:[5.8,2.7,4.1,1.0]
Iris-versicolor:[6.2,2.2,4.5,1.5]
Iris-versicolor:[5.6,2.5,3.9,1.1]
Iris-versicolor:[5.9,3.2,4.8,1.8]
Iris-versicolor:[6.1,2.8,4.0,1.3]
Iris-versicolor:[6.3,2.5,4.9,1.5]
Iris-versicolor:[6.1,2.8,4.7,1.2]
Iris-versicolor:[6.4,2.9,4.3,1.3]
Iris-versicolor:[6.6,3.0,4.4,1.4]
Iris-versicolor:[6.8,2.8,4.8,1.4]
Iris-versicolor:[6.7,3.0,5.0,1.7]
Iris-versicolor:[6.0,2.9,4.5,1.5]
Iris-versicolor:[5.7,2.6,3.5,1.0]
Iris-versicolor:[5.5,2.4,3.8,1.1]
Iris-versicolor:[5.5,2.4,3.7,1.0]
Iris-versicolor:[5.8,2.7,3.9,1.2]
Iris-versicolor:[6.0,2.7,5.1,1.6]
Iris-versicolor:[5.4,3.0,4.5,1.5]
Iris-versicolor:[6.0,3.4,4.5,1.6]
Iris-versicolor:[6.7,3.1,4.7,1.5]
Iris-versicolor:[6.3,2.3,4.4,1.3]
Iris-versicolor:[5.6,3.0,4.1,1.3]
Iris-versicolor:[5.5,2.5,4.0,1.3]
Iris-versicolor:[5.5,2.6,4.4,1.2]
Iris-versicolor:[6.1,3.0,4.6,1.4]
Iris-versicolor:[5.8,2.6,4.0,1.2]
Iris-versicolor:[5.0,2.3,3.3,1.0]
Iris-versicolor:[5.6,2.7,4.2,1.3]
Iris-versicolor:[5.7,3.0,4.2,1.2]
Iris-versicolor:[5.7,2.9,4.2,1.3]
Iris-versicolor:[6.2,2.9,4.3,1.3]
Iris-versicolor:[5.1,2.5,3.0,1.1]
Iris-versicolor:[5.7,2.8,4.1,1.3]
Iris-virginica:[6.3,3.3,6.0,2.5]
Iris-virginica:[5.8,2.7,5.1,1.9]
Iris-virginica:[7.1,3.0,5.9,2.1]
Iris-virginica:[6.3,2.9,5.6,1.8]
Iris-virginica:[6.5,3.0,5.8,2.2]
Iris-virginica:[7.6,3.0,6.6,2.1]
Iris-virginica:[4.9,2.5,4.5,1.7]
Iris-virginica:[7.3,2.9,6.3,1.8]
Iris-virginica:[6.7,2.5,5.8,1.8]
Iris-virginica:[7.2,3.6,6.1,2.5]
Iris-virginica:[6.5,3.2,5.1,2.0]
Iris-virginica:[6.4,2.7,5.3,1.9]
Iris-virginica:[6.8,3.0,5.5,2.1]
Iris-virginica:[5.7,2.5,5.0,2.0]
Iris-virginica:[5.8,2.8,5.1,2.4]
Iris-virginica:[6.4,3.2,5.3,2.3]
Iris-virginica:[6.5,3.0,5.5,1.8]
Iris-virginica:[7.7,3.8,6.7,2.2]
Iris-virginica:[7.7,2.6,6.9,2.3]
Iris-virginica:[6.0,2.2,5.0,1.5]
Iris-virginica:[6.9,3.2,5.7,2.3]
Iris-virginica:[5.6,2.8,4.9,2.0]
Iris-virginica:[7.7,2.8,6.7,2.0]
Iris-virginica:[6.3,2.7,4.9,1.8]
Iris-virginica:[6.7,3.3,5.7,2.1]
Iris-virginica:[7.2,3.2,6.0,1.8]
Iris-virginica:[6.2,2.8,4.8,1.8]
Iris-virginica:[6.1,3.0,4.9,1.8]
Iris-virginica:[6.4,2.8,5.6,2.1]
Iris-virginica:[7.2,3.0,5.8,1.6]
Iris-virginica:[7.4,2.8,6.1,1.9]
Iris-virginica:[7.9,3.8,6.4,2.0]
Iris-virginica:[6.4,2.8,5.6,2.2]
Iris-virginica:[6.3,2.8,5.1,1.5]
Iris-virginica:[6.1,2.6,5.6,1.4]
Iris-virginica:[7.7,3.0,6.1,2.3]
Iris-virginica:[6.3,3.4,5.6,2.4]
Iris-virginica:[6.4,3.1,5.5,1.8]
Iris-virginica:[6.0,3.0,4.8,1.8]
Iris-virginica:[6.9,3.1,5.4,2.1]
Iris-virginica:[6.7,3.1,5.6,2.4]
Iris-virginica:[6.9,3.1,5.1,2.3]
Iris-virginica:[5.8,2.7,5.1,1.9]
Iris-virginica:[6.8,3.2,5.9,2.3]
Iris-virginica:[6.7,3.3,5.7,2.5]
Iris-virginica:[6.7,3.0,5.2,2.3]
Iris-virginica:[6.3,2.5,5.0,1.9]
Iris-virginica:[6.5,3.0,5.2,2.0]
Iris-virginica:[6.2,3.4,5.4,2.3]
Iris-virginica:[5.9,3.0,5.1,1.8]
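3. Build the ML pipeline

We fit one indexer for the label column and one for the feature column; each writes its result to a new, renamed column (indexedLabel and indexedFeatures).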

scala> val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
labelIndexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e53e67411169

scala> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(df)
featureIndexer: org.apache.spark.ml.feature.VectorIndexerModel = vecIdx_53b988077b38

Next, we randomly split the dataset into a training set and a test set, with the training set taking about 70% of the data.

scala> val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: string]
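testData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: string]

Then we set the parameters of the logistic regression. Here we use setter methods throughout; a ParamMap can be used instead (see the Spark MLlib documentation for details). We set the maximum number of iterations to 10, the regularization parameter to 0.3, and the elastic-net mixing parameter to 0.8. explainParams() lists all available parameters together with the values we have set.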

scala> val lr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_692899496c23

scala> println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")
LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial. (default: auto)
featuresCol: features column name (default: features, current: indexedFeatures)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label, current: indexedLabel)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
standardization: whether to standardize the training features before fitting the model (default: true)
threshold: threshold in binary classification prediction, in range [0, 1] (default: 0.5)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold (undefined)
tol: the convergence tolerance for iterative algorithms (>= 0) (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)

Here we set up a labelConverter, whose purpose is to convert the predicted class indices back to their string labels.

scala> val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
labelConverter: org.apache.spark.ml.feature.IndexToString = idxToStr_c204eafabf57

We build the pipeline, set its stages, and call fit() to train the model.

scala> val lrPipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, lr, labelConverter))
lrPipeline: org.apache.spark.ml.Pipeline = pipeline_eb1b201af1e0

scala> val lrPipelineModel = lrPipeline.fit(trainingData)
lrPipelineModel: org.apache.spark.ml.PipelineModel = pipeline_eb1b201af1e0

A Pipeline is essentially an Estimator: calling fit() on it produces a PipelineModel, which is essentially a Transformer. The PipelineModel can then call transform() to make predictions, producing a new DataFrame; here we apply the trained model to the test set.

scala> val lrPredictions = lrPipelineModel.transform(testData)
lrPredictions: org.apache.spark.sql.DataFrame = [features: vector, label: string ... 6 more fields]

Finally we print the predictions: select picks the columns to output, collect fetches all rows, and foreach prints each row. Each printed line shows, in order, the true class and feature values of the row, the predicted probability of each class, and the predicted class.

scala> lrPredictions.select("predictedLabel", "label", "features", "probability").collect().foreach { case Row(predictedLabel: String, label: String, features: Vector, prob: Vector) => println(s"($label, $features) --> prob=$prob, predictedLabel=$predictedLabel")}
(Iris-virginica, [4.9,2.5,4.5,1.7]) --> prob=[0.4796551461409372,0.5203448538590628], predictedLabel=Iris-virginica
(Iris-versicolor, [5.1,2.5,3.0,1.1]) --> prob=[0.5892626391059901,0.41073736089401], predictedLabel=Iris-versicolor
(Iris-versicolor, [5.5,2.3,4.0,1.3]) --> prob=[0.5577310241453046,0.4422689758546954], predictedLabel=Iris-versicolor
(Iris-versicolor, [5.5,2.4,3.8,1.1]) --> prob=[0.5930925958747064,0.40690740412529364], predictedLabel=Iris-versicolor
(Iris-virginica, [5.8,2.7,5.1,1.9]) --> prob=[0.4524998736099255,0.5475001263900744], predictedLabel=Iris-virginica
(Iris-versicolor, [5.9,3.0,4.2,1.5]) --> prob=[0.5257270436631982,0.47427295633680183], predictedLabel=Iris-versicolor
(Iris-versicolor, [6.1,3.0,4.6,1.4]) --> prob=[0.5457035031286359,0.45429649687136403], predictedLabel=Iris-versicolor
(Iris-virginica, [6.2,3.4,5.4,2.3]) --> prob=[0.3859565350120368,0.6140434649879633], predictedLabel=Iris-virginica
(Iris-versicolor, [6.3,2.3,4.4,1.3]) --> prob=[0.5655339019128637,0.43446609808713627], predictedLabel=Iris-versicolor
(Iris-versicolor, [6.3,3.3,4.7,1.6]) --> prob=[0.5116086280721536,0.4883913719278465], predictedLabel=Iris-versicolor
(Iris-virginica, [6.3,3.3,6.0,2.5]) --> prob=[0.35315821989064367,0.6468417801093564], predictedLabel=Iris-virginica
(Iris-virginica, [6.3,3.4,5.6,2.4]) --> prob=[0.3698681732152967,0.6301318267847033], predictedLabel=Iris-virginica
(Iris-virginica, [6.4,2.8,5.6,2.2]) --> prob=[0.4051590481630649,0.5948409518369352], predictedLabel=Iris-virginica
(Iris-versicolor, [6.4,3.2,4.5,1.5]) --> prob=[0.530663405141585,0.4693365948584149], predictedLabel=Iris-versicolor
(Iris-versicolor, [6.5,2.8,4.6,1.5]) --> prob=[0.5316499889105902,0.46835001108940977], predictedLabel=Iris-versicolor
(Iris-virginica, [6.5,3.0,5.5,1.8]) --> prob=[0.4774053949325432,0.5225946050674568], predictedLabel=Iris-virginica
(Iris-virginica, [6.5,3.2,5.1,2.0]) --> prob=[0.44145814886994916,0.5585418511300508], predictedLabel=Iris-virginica
(Iris-versicolor, [6.6,2.9,4.6,1.3]) --> prob=[0.5684518414201765,0.43154815857982354], predictedLabel=Iris-versicolor
(Iris-virginica, [6.7,3.0,5.2,2.3]) --> prob=[0.3906615305372632,0.6093384694627368], predictedLabel=Iris-virginica
(Iris-virginica, [6.7,3.3,5.7,2.1]) --> prob=[0.4256244550522105,0.5743755449477895], predictedLabel=Iris-virginica
(Iris-virginica, [6.8,3.0,5.5,2.1]) --> prob=[0.42659325401060527,0.5734067459893948], predictedLabel=Iris-virginica
(Iris-virginica, [6.8,3.2,5.9,2.3]) --> prob=[0.3916050059160978,0.6083949940839022], predictedLabel=Iris-virginica
(Iris-versicolor, [6.9,3.1,4.9,1.5]) --> prob=[0.5355937737733117,0.46440622622668826], predictedLabel=Iris-versicolor
(Iris-virginica, [6.9,3.2,5.7,2.3]) --> prob=[0.3925492919563345,0.6074507080436655], predictedLabel=Iris-virginica
(Iris-virginica, [7.2,3.2,6.0,1.8]) --> prob=[0.4843281406266611,0.515671859373339], predictedLabel=Iris-virginica
(Iris-virginica, [7.2,3.6,6.1,2.5]) --> prob=[0.36134526511761866,0.6386547348823813], predictedLabel=Iris-virginica
(Iris-virginica, [7.3,2.9,6.3,1.8]) --> prob=[0.4853176576364371,0.5146823423635629], predictedLabel=Iris-virginica
(Iris-virginica, [7.4,2.8,6.1,1.9]) --> prob=[0.4682458710345799,0.5317541289654201], predictedLabel=Iris-virginica
(Iris-virginica, [7.7,2.8,6.7,2.0]) --> prob=[0.4532108745431632,0.5467891254568368], predictedLabel=Iris-virginica
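4. Model evaluation

We create a MulticlassClassificationEvaluator instance and use setters to specify the names of the prediction column and the true-label column, then call evaluate() on the predictions. Note that no metric name is set here, so the evaluator uses its default metric, the F1 score, rather than accuracy (which would require setMetricName("accuracy")). The value stored in lrAccuracy below is therefore an F1 score.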

scala> val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")
evaluator: org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator = mcEval_a80353e4211d

scala> val lrAccuracy = evaluator.evaluate(lrPredictions)
lrAccuracy: Double = 1.0

scala> println("Test Error = " + (1.0 - lrAccuracy))
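Test Error = 0.0

A score of 1.0 means that every sample in the test set was classified correctly. Next, we retrieve the trained logistic regression model. As noted above, lrPipelineModel is a PipelineModel, so we can obtain the model from its stages, as follows: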

scala> val lrModel = lrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
lrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_692899496c23

scala> println("Coefficients: " + lrModel.coefficients+"Intercept: "+lrModel.intercept+"numClasses: "+lrModel.numClasses+"numFeatures: "+lrModel.numFeatures)
Coefficients: [-0.0396171957643483,0.0,0.0,0.07240315639651046]Intercept: -0.23127346342015379numClasses: 2numFeatures: 4
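5. Model summary

The spark.ml library also provides a summary of the model, although at present it supports only binomial logistic regression and has to be cast explicitly to BinaryLogisticRegressionSummary. In the code below, we first obtain the summary of the binomial logistic regression model. We then retrieve and print the objective history: the initial value of the loss followed by its value after each of the 10 iterations. The loss decreases steadily, and a smaller loss means a better fit to the training data. Next, we cast the summary to BinaryLogisticRegressionSummary to get the metrics used to evaluate model performance. The ROC curve indicates how good the model is; areaUnderROC reaches 0.969551282051282, so our classifier performs quite well. Finally, we choose the most suitable classification threshold by maximizing fMeasure, a metric that combines precision and recall.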

scala> val trainingSummary = lrModel.summary
trainingSummary: org.apache.spark.ml.classification.LogisticRegressionTrainingSummary = org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary@158f5154

scala> val objectiveHistory = trainingSummary.objectiveHistory
objectiveHistory: Array[Double] = Array(0.6882791293138858, 0.68577138350929, 0.6828927674985116, 0.6770239468050248, 0.6733358509192351, 0.671857682317743, 0.6704799223654864, 0.6698239590515892, 0.6692485998664358, 0.6689383488804157, 0.6686619485854871)

scala> objectiveHistory.foreach(loss => println(loss))
0.6882791293138858
0.68577138350929
0.6828927674985116
0.6770239468050248
0.6733358509192351
0.671857682317743
0.6704799223654864
0.6698239590515892
0.6692485998664358
0.6689383488804157
0.6686619485854871

scala> val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]
binarySummary: org.apache.spark.ml.classification.BinaryLogisticRegressionSummary = org.apache.spark.ml.classification.BinaryLogisticRegressionTrainingSummary@158f5154

scala> println(s"areaUnderROC: ${binarySummary.areaUnderROC}")
areaUnderROC: 0.969551282051282

scala> val fMeasure = binarySummary.fMeasureByThreshold
fMeasure: org.apache.spark.sql.DataFrame = [threshold: double, F-Measure: double]

scala> val maxFMeasure = fMeasure.select(functions.max("F-Measure")).head().getDouble(0)
maxFMeasure: Double = 0.9180327868852458

scala> val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure).select("threshold").head().getDouble(0)
bestThreshold: Double = 0.5206174467023945

scala> lrModel.setThreshold(bestThreshold)
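The threshold selection above can be illustrated without Spark. The sketch below computes the F-measure (with beta = 1) at each candidate threshold and keeps the best one. It assumes a sample is predicted positive when its score is greater than or equal to the threshold; the object name and the toy data are made up for illustration.

```scala
// Choosing a threshold by maximizing the F-measure (illustrative sketch).
object ThresholdSketch {
  // F1 at a given threshold, where a sample is predicted positive when score >= threshold.
  // F1 = 2 * TP / (2 * TP + FP + FN), the harmonic mean of precision and recall.
  def fMeasure(scored: Seq[(Double, Int)], threshold: Double): Double = {
    val tp = scored.count { case (s, y) => s >= threshold && y == 1 }.toDouble
    val fp = scored.count { case (s, y) => s >= threshold && y == 0 }
    val fn = scored.count { case (s, y) => s < threshold && y == 1 }
    if (tp == 0.0) 0.0 else 2 * tp / (2 * tp + fp + fn)
  }

  // Try every distinct score as a threshold and return (threshold, F1) with the largest F1.
  def bestThreshold(scored: Seq[(Double, Int)]): (Double, Double) =
    scored.map(_._1).distinct
      .map(t => (t, fMeasure(scored, t)))
      .maxBy(_._2)
}
```

For the toy input `Seq((0.9, 1), (0.8, 1), (0.6, 0), (0.55, 1), (0.3, 0), (0.1, 0))`, the threshold 0.55 captures all three positives at the cost of one false positive and gives the largest F1, 6/7.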

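Solving a binary classification problem with multinomial logistic regression

For a binary classification problem, we can also use multinomial logistic regression. It works just like binomial logistic regression, except that the family parameter of the model is set to multinomial. Here we only list the results: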

scala> val mlr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8).setFamily("multinomial")
mlr: org.apache.spark.ml.classification.LogisticRegression = logreg_82bf612d153e

scala> val mlrPipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, mlr, labelConverter))
mlrPipeline: org.apache.spark.ml.Pipeline = pipeline_016e81dbca1f

scala> val mlrPipelineModel = mlrPipeline.fit(trainingData)
mlrPipelineModel: org.apache.spark.ml.PipelineModel = pipeline_016e81dbca1f

scala> val mlrPredictions = mlrPipelineModel.transform(testData)
mlrPredictions: org.apache.spark.sql.DataFrame = [features: vector, label: string ... 6 more fields]

scala> mlrPredictions.select("predictedLabel", "label", "features", "probability").collect().foreach { case Row(predictedLabel: String, label: String, features: Vector, prob: Vector) => println(s"($label, $features) --> prob=$prob, predictedLabel=$predictedLabel")}
(Iris-virginica, [4.9,2.5,4.5,1.7]) --> prob=[0.4706991400166566,0.5293008599833434], predictedLabel=Iris-virginica
(Iris-versicolor, [5.1,2.5,3.0,1.1]) --> prob=[0.6123754219240134,0.38762457807598644], predictedLabel=Iris-versicolor
(Iris-versicolor, [5.5,2.3,4.0,1.3]) --> prob=[0.5724859784244956,0.42751402157550444], predictedLabel=Iris-versicolor
(Iris-versicolor, [5.5,2.4,3.8,1.1]) --> prob=[0.617700896993959,0.3822991030060409], predictedLabel=Iris-versicolor
(Iris-virginica, [5.8,2.7,5.1,1.9]) --> prob=[0.43670908827255583,0.563290911727444], predictedLabel=Iris-virginica
(Iris-versicolor, [5.9,3.0,4.2,1.5]) --> prob=[0.5316312191190347,0.4683687808809653], predictedLabel=Iris-versicolor
(Iris-versicolor, [6.1,3.0,4.6,1.4]) --> prob=[0.5577018837203559,0.44229811627964416], predictedLabel=Iris-versicolor
(Iris-virginica, [6.2,3.4,5.4,2.3]) --> prob=[0.3525986631597158,0.6474013368402842], predictedLabel=Iris-virginica
(Iris-versicolor, [6.3,2.3,4.4,1.3]) --> prob=[0.5834583948072782,0.4165416051927219], predictedLabel=Iris-versicolor
(Iris-versicolor, [6.3,3.3,4.7,1.6]) --> prob=[0.5138182157242249,0.4861817842757751], predictedLabel=Iris-versicolor
(Iris-virginica, [6.3,3.3,6.0,2.5]) --> prob=[0.31220890350252506,0.6877910964974749], predictedLabel=Iris-virginica
(Iris-virginica, [6.3,3.4,5.6,2.4]) --> prob=[0.33271906258823314,0.6672809374117669], predictedLabel=Iris-virginica
(Iris-virginica, [6.4,2.8,5.6,2.2]) --> prob=[0.3769557902496345,0.6230442097503656], predictedLabel=Iris-virginica
(Iris-versicolor, [6.4,3.2,4.5,1.5]) --> prob=[0.5386254134782479,0.461374586521752], predictedLabel=Iris-versicolor
(Iris-versicolor, [6.5,2.8,4.6,1.5]) --> prob=[0.5400225182386277,0.4599774817613724], predictedLabel=Iris-versicolor
(Iris-virginica, [6.5,3.0,5.5,1.8]) --> prob=[0.4697204592971098,0.5302795407028901], predictedLabel=Iris-virginica
(Iris-virginica, [6.5,3.2,5.1,2.0]) --> prob=[0.42334262258749805,0.576657377412502], predictedLabel=Iris-virginica
(Iris-versicolor, [6.6,2.9,4.6,1.3]) --> prob=[0.587552435510627,0.412447564489373], predictedLabel=Iris-versicolor
(Iris-virginica, [6.7,3.0,5.2,2.3]) --> prob=[0.35904307162043886,0.6409569283795613], predictedLabel=Iris-virginica
(Iris-virginica, [6.7,3.3,5.7,2.1]) --> prob=[0.4033033147932801,0.5966966852067198], predictedLabel=Iris-virginica
(Iris-virginica, [6.8,3.0,5.5,2.1]) --> prob=[0.40465727034257987,0.5953427296574202], predictedLabel=Iris-virginica
(Iris-virginica, [6.8,3.2,5.9,2.3]) --> prob=[0.36033816936772717,0.6396618306322728], predictedLabel=Iris-virginica
(Iris-versicolor, [6.9,3.1,4.9,1.5]) --> prob=[0.5456044340218952,0.4543955659781048], predictedLabel=Iris-versicolor
(Iris-virginica, [6.9,3.2,5.7,2.3]) --> prob=[0.361635302911029,0.638364697088971], predictedLabel=Iris-virginica
(Iris-virginica, [7.2,3.2,6.0,1.8]) --> prob=[0.47953540973471454,0.5204645902652856], predictedLabel=Iris-virginica
(Iris-virginica, [7.2,3.6,6.1,2.5]) --> prob=[0.3231782795184636,0.6768217204815363], predictedLabel=Iris-virginica
(Iris-virginica, [7.3,2.9,6.3,1.8]) --> prob=[0.48093901384053533,0.5190609861594646], predictedLabel=Iris-virginica
(Iris-virginica, [7.4,2.8,6.1,1.9]) --> prob=[0.4589531699542302,0.5410468300457698], predictedLabel=Iris-virginica
(Iris-virginica, [7.7,2.8,6.7,2.0]) --> prob=[0.43989505147330155,0.5601049485266986], predictedLabel=Iris-virginica

scala> val mlrAccuracy = evaluator.evaluate(mlrPredictions)
mlrAccuracy: Double = 1.0

scala> println("Test Error = " + (1.0 - mlrAccuracy))
Test Error = 0.0

scala> val mlrModel = mlrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
mlrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_82bf612d153e

scala> println("Multinomial coefficients: " + mlrModel.coefficientMatrix+"Multinomial intercepts: "+mlrModel.interceptVector+"numClasses: "+mlrModel.numClasses+"numFeatures: "+mlrModel.numFeatures)
Multinomial coefficients: 0.028116025552572706   0.0  0.0  -0.046949976003541706
-0.028116025552572695  0.0  0.0  0.046949976003541706   Multinomial intercepts: [0.13221236569525302,-0.13221236569525302]numClasses: 2numFeatures: 4
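In the output above, the two rows of the coefficient matrix are negatives of each other (up to rounding), and so are the two intercepts. This is expected: with only two classes, the softmax used by the multinomial model reduces to a sigmoid of the difference between the two class margins, so a two-row multinomial model describes the same probabilities as a binomial model whose coefficients are the row difference. The following plain-Scala sketch checks that identity numerically; the object and method names are illustrative.

```scala
// Two-class softmax equals a sigmoid of the margin difference (illustrative sketch).
object SoftmaxSketch {
  // Softmax over per-class margins z_k = w_k · x + b_k.
  def softmax(margins: Array[Double]): Array[Double] = {
    val m = margins.max                            // subtract the max for numerical stability
    val exps = margins.map(z => math.exp(z - m))
    val total = exps.sum
    exps.map(_ / total)
  }

  def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))
}
```

For any two margins z0 and z1, `SoftmaxSketch.softmax(Array(z0, z1))(1)` agrees with `SoftmaxSketch.sigmoid(z1 - z0)` up to floating-point error.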

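Solving a multiclass classification problem with multinomial logistic regression

For a multiclass classification problem, we need multinomial logistic regression. Here we use the entire iris dataset, which has three classes. The procedure is essentially the same as above, so again we only list the results: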

scala> mlrPredictions.select("predictedLabel", "label", "features", "probability").collect().foreach { case Row(predictedLabel: String, label: String, features: Vector, prob: Vector) => println(s"($label, $features) --> prob=$prob, predictedLabel=$predictedLabel")}
(Iris-setosa, [4.3,3.0,1.1,0.1]) --> prob=[0.49856067730476944,0.2623440805400292,0.23909524215520148], predictedLabel=Iris-setosa
(Iris-setosa, [4.4,2.9,1.4,0.2]) --> prob=[0.46571089790971687,0.277891570222724,0.25639753186755915], predictedLabel=Iris-setosa
(Iris-setosa, [4.6,3.4,1.4,0.3]) --> prob=[0.5001101367665973,0.25928904940719977,0.24060081382620296], predictedLabel=Iris-setosa
(Iris-setosa, [4.6,3.6,1.0,0.2]) --> prob=[0.5463459284110406,0.236823238870237,0.2168308327187224], predictedLabel=Iris-setosa
(Iris-setosa, [4.7,3.2,1.6,0.2]) --> prob=[0.48370179709200706,0.2689591735381297,0.24733902936986318], predictedLabel=Iris-setosa
(Iris-setosa, [4.8,3.0,1.4,0.1]) --> prob=[0.4851576852808171,0.2693562861247639,0.24548602859441893], predictedLabel=Iris-setosa
(Iris-setosa, [4.9,3.0,1.4,0.2]) --> prob=[0.47467118791268154,0.2733753207454508,0.2519534913418676], predictedLabel=Iris-setosa
(Iris-setosa, [4.9,3.1,1.5,0.1]) --> prob=[0.48967732688779486,0.267131626783035,0.2431910463291701], predictedLabel=Iris-setosa
(Iris-versicolor, [5.0,2.3,3.3,1.0]) --> prob=[0.26303122888674907,0.36560215832179155,0.3713666127914594], predictedLabel=Iris-virginica
(Iris-setosa, [5.0,3.5,1.3,0.3]) --> prob=[0.5135688079539743,0.2524416257183621,0.2339895663276636], predictedLabel=Iris-setosa
(Iris-setosa, [5.0,3.5,1.6,0.6]) --> prob=[0.4686356517088239,0.2713034457686629,0.26006090252251307], predictedLabel=Iris-setosa
(Iris-setosa, [5.1,3.5,1.4,0.3]) --> prob=[0.5091020180722664,0.25475974124614675,0.23613824068158687], predictedLabel=Iris-setosa
(Iris-setosa, [5.1,3.8,1.5,0.3]) --> prob=[0.531570061574297,0.24348517949467904,0.22494475893102403], predictedLabel=Iris-setosa
(Iris-setosa, [5.1,3.8,1.9,0.4]) --> prob=[0.503222274154322,0.25683175058110785,0.23994597526457007], predictedLabel=Iris-setosa
(Iris-setosa, [5.2,3.5,1.5,0.2]) --> prob=[0.5151370941776632,0.2529823490495923,0.2318805567727446], predictedLabel=Iris-setosa
(Iris-setosa, [5.3,3.7,1.5,0.2]) --> prob=[0.5330773525753305,0.24387796024925384,0.22304468717541576], predictedLabel=Iris-setosa
(Iris-versicolor, [5.4,3.0,4.5,1.5]) --> prob=[0.2306542447600023,0.372383222489962,0.39696253275003573], predictedLabel=Iris-virginica
(Iris-setosa, [5.4,3.9,1.3,0.4]) --> prob=[0.5389512877303541,0.23848657002728416,0.2225621422423618], predictedLabel=Iris-setosa
(Iris-versicolor, [5.5,2.4,3.7,1.0]) --> prob=[0.25620601559263473,0.36919246180632764,0.37460152260103763], predictedLabel=Iris-virginica
(Iris-setosa, [5.5,3.5,1.3,0.2]) --> prob=[0.5240613549472979,0.24832602160956213,0.22761262344314004], predictedLabel=Iris-setosa
(Iris-setosa, [5.5,4.2,1.4,0.2]) --> prob=[0.5818115053858839,0.21899706180633755,0.19919143280777854], predictedLabel=Iris-setosa
(Iris-versicolor, [5.6,2.5,3.9,1.1]) --> prob=[0.24827164138938784,0.3712338899987297,0.38049446861188246], predictedLabel=Iris-virginica
(Iris-versicolor, [5.6,2.7,4.2,1.3]) --> prob=[0.23609842674482123,0.3733910806218104,0.39051049263336834], predictedLabel=Iris-virginica
(Iris-virginica, [5.6,2.8,4.9,2.0]) --> prob=[0.17353784667372726,0.38803750951559646,0.43842464381067625], predictedLabel=Iris-virginica
(Iris-versicolor, [5.6,2.9,3.6,1.3]) --> prob=[0.26994082035183004,0.35725015822484213,0.37280902142332784], predictedLabel=Iris-virginica
(Iris-setosa, [5.7,4.4,1.5,0.4]) --> prob=[0.5744990088621882,0.22068271118182198,0.20481827995598978], predictedLabel=Iris-setosa
(Iris-virginica, [5.8,2.8,5.1,2.4]) --> prob=[0.14589555459093273,0.39150544114527663,0.4625990042637906], predictedLabel=Iris-virginica
(Iris-virginica, [5.9,3.0,5.1,1.8]) --> prob=[0.19164845952411863,0.38448782728830505,0.42386371318757643], predictedLabel=Iris-virginica
(Iris-versicolor, [6.0,2.2,4.0,1.0]) --> prob=[0.23300779791940326,0.3802856918956981,0.3867065101848985], predictedLabel=Iris-virginica
(Iris-versicolor, [6.0,2.7,5.1,1.6]) --> prob=[0.18810463050749873,0.3900406691963187,0.42185470029618244], predictedLabel=Iris-virginica
(Iris-versicolor, [6.1,2.8,4.0,1.3]) --> prob=[0.24928433400912278,0.3671520807495573,0.3835635852413199], predictedLabel=Iris-virginica
(Iris-versicolor, [6.2,2.2,4.5,1.5]) --> prob=[0.18351550396686533,0.3934066675024647,0.42307782853066994], predictedLabel=Iris-virginica
(Iris-virginica, [6.2,2.8,4.8,1.8]) --> prob=[0.1888126898204262,0.38539188363903437,0.4257954265405395], predictedLabel=Iris-virginica
(Iris-versicolor, [6.2,2.9,4.3,1.3]) --> prob=[0.24600050420847877,0.3689652108789115,0.38503428491260977], predictedLabel=Iris-virginica
(Iris-virginica, [6.2,3.4,5.4,2.3]) --> prob=[0.17337730890542696,0.3825617039174212,0.44406098717715176], predictedLabel=Iris-virginica
(Iris-virginica, [6.3,2.9,5.6,1.8]) --> prob=[0.1729681423511942,0.3931462837297906,0.4338855739190153], predictedLabel=Iris-virginica
(Iris-virginica, [6.4,3.1,5.5,1.8]) --> prob=[0.18621090846131505,0.3872972795834499,0.42649181195523495], predictedLabel=Iris-virginica
(Iris-versicolor, [6.6,2.9,4.6,1.3]) --> prob=[0.23618909578565045,0.373766365784125,0.3900445384302246], predictedLabel=Iris-virginica
(Iris-virginica, [6.7,3.3,5.7,2.5]) --> prob=[0.1496994275680708,0.38855932284425526,0.4617412495876739], predictedLabel=Iris-virginica
(Iris-virginica, [6.8,3.0,5.5,2.1]) --> prob=[0.16265889090899283,0.39126984184915486,0.4460712672418523], predictedLabel=Iris-virginica
(Iris-virginica, [7.2,3.2,6.0,1.8]) --> prob=[0.1782593898810351,0.3913068582491216,0.43043375186984334], predictedLabel=Iris-virginica
(Iris-virginica, [7.7,2.6,6.9,2.3]) --> prob=[0.10733085394350968,0.41117706558989164,0.4814920804665987], predictedLabel=Iris-virginica
(Iris-virginica, [7.7,3.8,6.7,2.2]) --> prob=[0.16693678799079806,0.38877323991855633,0.44428997209064564], predictedLabel=Iris-virginica
(Iris-virginica, [7.9,3.8,6.4,2.0]) --> prob=[0.18714592916724979,0.3838745095632083,0.42897956126954184], predictedLabel=Iris-virginica

scala> val mlrAccuracy = evaluator.evaluate(mlrPredictions)
mlrAccuracy: Double = 0.6339712918660287

scala> println("Test Error = " + (1.0 - mlrAccuracy))
Test Error = 0.36602870813397126
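The value 0.6339712918660287 reported above is not the fraction of correctly classified rows. The evaluator was created without a metric name, and its default metric is F1 (per-class F1 weighted by each class's share of the true labels). The plain-Scala sketch below computes both accuracy and weighted F1 from (label, prediction) pairs so the two can be compared; the object and method names are illustrative and are not Spark APIs.

```scala
// Accuracy versus weighted F1 on (label, prediction) pairs (illustrative sketch).
object MetricsSketch {
  // Fraction of (label, prediction) pairs that agree.
  def accuracy(pairs: Seq[(Double, Double)]): Double =
    pairs.count { case (l, p) => l == p }.toDouble / pairs.size

  // F1 for one class: harmonic mean of precision and recall (0.0 when there are no true positives).
  def f1(pairs: Seq[(Double, Double)], cls: Double): Double = {
    val tp = pairs.count { case (l, p) => l == cls && p == cls }.toDouble
    val predicted = pairs.count { case (_, p) => p == cls }
    val actual = pairs.count { case (l, _) => l == cls }
    if (tp == 0.0) 0.0
    else {
      val precision = tp / predicted
      val recall = tp / actual
      2 * precision * recall / (precision + recall)
    }
  }

  // Per-class F1 averaged with weights equal to each class's share of the true labels.
  def weightedF1(pairs: Seq[(Double, Double)]): Double =
    pairs.map(_._1).distinct.map { cls =>
      val weight = pairs.count(_._1 == cls).toDouble / pairs.size
      weight * f1(pairs, cls)
    }.sum
}
```

The test output listed above contains 19 setosa rows (all predicted correctly), 12 versicolor rows (all predicted as virginica) and 13 virginica rows (all predicted correctly). Feeding those 44 pairs into the sketch gives an accuracy of 32/44 (about 0.727) and a weighted F1 of about 0.634, which agrees with the reported value.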

scala> val mlrModel = mlrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
mlrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_9661a4f56149

scala> println("Multinomial coefficients: " + mlrModel.coefficientMatrix+"Multinomial intercepts: "+mlrModel.interceptVector+"numClasses: "+mlrModel.numClasses+"numFeatures: "+mlrModel.numFeatures)
Multinomial coefficients: 0.0  0.35442627664118775    -0.1787646656602406  -0.36662299325180614
0.0  0.0                    0.0                  0.0
0.0  -0.010992364266212548  0.0                  0.11193811404312962   Multinomial intercepts: [-0.10160079218819881,0.0863062310816332,0.01529456110656562]numClasses: 3numFeatures: 4


