[Copyright notice] The content of this blog is copyrighted by the Database Laboratory of Xiamen University. Please do not reproduce it without permission!
Logistic Regression

Method Overview

Logistic regression is a classic classification method in statistical learning and belongs to the family of log-linear models. The dependent variable of logistic regression can be binary or multi-class.
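The label "log-linear model" can be made precise: under the binomial logistic regression model, the log-odds (logit) of the positive class is a linear function of the input,

\[\log \frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = w \cdot x + b\]

so fitting a logistic regression amounts to fitting a linear model to the log-odds.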
Basic Principles

The logistic distribution

Let X be a continuous random variable. X follows a logistic distribution if X has the following distribution function and density function:

\[F(x) = P(X \le x) = \frac{1}{1 + e^{-(x-\mu)/\gamma}}\]

\[f(x) = F'(x) = \frac{e^{-(x-\mu)/\gamma}}{\gamma \, (1 + e^{-(x-\mu)/\gamma})^2}\]

where \(\mu\) is the location parameter and \(\gamma > 0\) is the shape parameter. The graph of the distribution function is an S-shaped curve, symmetric about the point \((\mu, 1/2)\); the smaller \(\gamma\) is, the faster the curve changes near the center.

The binomial logistic regression model

The binomial logistic regression model is:

\[P(Y=1|x) = \frac{\exp(w \cdot x + b)}{1 + \exp(w \cdot x + b)}\]

\[P(Y=0|x) = \frac{1}{1 + \exp(w \cdot x + b)}\]

where \(x \in R^n\) is the input, \(Y \in \{0,1\}\) is the output, w is called the weight vector, b is called the bias, and \(w \cdot x\) is the inner product of w and x.

Parameter estimation

Assume:

\[P(Y=1|x) = \pi(x), \quad P(Y=0|x) = 1 - \pi(x)\]

Then the likelihood function is:

\[\prod_{i=1}^N [\pi(x_i)]^{y_i} [1 - \pi(x_i)]^{1-y_i}\]

Taking the log gives the log-likelihood:

\[L(w) = \sum_{i=1}^N [y_i \log \pi(x_i) + (1-y_i) \log (1 - \pi(x_i))] = \sum_{i=1}^N \left[ y_i \log \frac{\pi(x_i)}{1 - \pi(x_i)} + \log (1 - \pi(x_i)) \right]\]

Maximizing L(w) yields the estimate of w. The maximization can be carried out by methods such as gradient descent or gradient ascent.

Example Code

We use the iris dataset (http://dblab.xmu.edu.cn/blog/wp-content/uploads/2017/03/iris.txt) as our example. iris takes the characteristics of iris flowers as its data source; the dataset contains 150 records in 3 classes of 50 records each, and every record has 4 attributes. It is a very commonly used training/test set in data mining and classification. For ease of understanding, we mainly rely on the last two attributes (petal length and petal width) for classification. Currently spark.ml supports both binary and multi-class classification, so we will analyze three cases: using binomial logistic regression for binary classification, using multinomial logistic regression for binary classification, and using multinomial logistic regression for multi-class classification.

Using binomial logistic regression for binary classification

First we take the last two classes of the data and run a binary classification analysis with binomial logistic regression.

1. Import the required packages:

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.SparkSession
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.{Pipeline, PipelineModel}
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer, HashingTF, Tokenizer}
import org.apache.spark.ml.classification.{BinaryLogisticRegressionSummary, LogisticRegression, LogisticRegressionModel}
import org.apache.spark.sql.functions
```

2. Read the data and take a first look:

We import spark.implicits._ so that an RDD can be implicitly converted to a DataFrame. We use a case class to define a schema Iris, which is the structure our data needs. Then we read the text file; the first map splits each line on ",": in our dataset every line is split into 5 parts, of which the first 4 are the flower's features and the last is its class. We store the features in a Vector, create an RDD of Iris, convert it to a DataFrame, and finally call show() to view part of the data.

```scala
scala> import spark.implicits._
import spark.implicits._

scala> case class Iris(features: org.apache.spark.ml.linalg.Vector, label: String)
defined class Iris

scala> val data = spark.sparkContext.textFile("file:///usr/local/spark/iris.txt").map(_.split(",")).map(p => Iris(Vectors.dense(p(0).toDouble, p(1).toDouble, p(2).toDouble, p(3).toDouble), p(4).toString())).toDF()
data: org.apache.spark.sql.DataFrame = [features: vector, label: string]

scala> data.show()
+-----------------+-----------+
|         features|      label|
+-----------------+-----------+
|[5.1,3.5,1.4,0.2]|Iris-setosa|
|[4.9,3.0,1.4,0.2]|Iris-setosa|
|[4.7,3.2,1.3,0.2]|Iris-setosa|
|[4.6,3.1,1.5,0.2]|Iris-setosa|
|[5.0,3.6,1.4,0.2]|Iris-setosa|
|[5.4,3.9,1.7,0.4]|Iris-setosa|
|[4.6,3.4,1.4,0.3]|Iris-setosa|
|[5.0,3.4,1.5,0.2]|Iris-setosa|
|[4.4,2.9,1.4,0.2]|Iris-setosa|
|[4.9,3.1,1.5,0.1]|Iris-setosa|
|[5.4,3.7,1.5,0.2]|Iris-setosa|
|[4.8,3.4,1.6,0.2]|Iris-setosa|
|[4.8,3.0,1.4,0.1]|Iris-setosa|
|[4.3,3.0,1.1,0.1]|Iris-setosa|
|[5.8,4.0,1.2,0.2]|Iris-setosa|
|[5.7,4.4,1.5,0.4]|Iris-setosa|
|[5.4,3.9,1.3,0.4]|Iris-setosa|
|[5.1,3.5,1.4,0.3]|Iris-setosa|
|[5.7,3.8,1.7,0.3]|Iris-setosa|
|[5.1,3.8,1.5,0.3]|Iris-setosa|
+-----------------+-----------+
only showing top 20 rows
```

Because we are dealing with a binary classification problem, we do not need all 3 classes and must select two of them. We first register the data we just obtained as a table iris; once it is registered we can query it with SQL statements, so here we select all records whose class is not "Iris-setosa". After selecting the data we need, we print the result to check it: there are no more "Iris-setosa" records.

```scala
scala> data.createOrReplaceTempView("iris")

scala> val df = spark.sql("select * from iris where label != 'Iris-setosa'")
df: org.apache.spark.sql.DataFrame = [features: vector, label: string]

scala> df.map(t => t(1)+":"+t(0)).collect().foreach(println)
Iris-versicolor:[7.0,3.2,4.7,1.4]
Iris-versicolor:[6.4,3.2,4.5,1.5]
Iris-versicolor:[6.9,3.1,4.9,1.5]
Iris-versicolor:[5.5,2.3,4.0,1.3]
Iris-versicolor:[6.5,2.8,4.6,1.5]
Iris-versicolor:[5.7,2.8,4.5,1.3]
Iris-versicolor:[6.3,3.3,4.7,1.6]
Iris-versicolor:[4.9,2.4,3.3,1.0]
Iris-versicolor:[6.6,2.9,4.6,1.3]
Iris-versicolor:[5.2,2.7,3.9,1.4]
Iris-versicolor:[5.0,2.0,3.5,1.0]
Iris-versicolor:[5.9,3.0,4.2,1.5]
Iris-versicolor:[6.0,2.2,4.0,1.0]
Iris-versicolor:[6.1,2.9,4.7,1.4]
Iris-versicolor:[5.6,2.9,3.6,1.3]
Iris-versicolor:[6.7,3.1,4.4,1.4]
Iris-versicolor:[5.6,3.0,4.5,1.5]
Iris-versicolor:[5.8,2.7,4.1,1.0]
Iris-versicolor:[6.2,2.2,4.5,1.5]
Iris-versicolor:[5.6,2.5,3.9,1.1]
Iris-versicolor:[5.9,3.2,4.8,1.8]
Iris-versicolor:[6.1,2.8,4.0,1.3]
Iris-versicolor:[6.3,2.5,4.9,1.5]
Iris-versicolor:[6.1,2.8,4.7,1.2]
Iris-versicolor:[6.4,2.9,4.3,1.3]
Iris-versicolor:[6.6,3.0,4.4,1.4]
Iris-versicolor:[6.8,2.8,4.8,1.4]
Iris-versicolor:[6.7,3.0,5.0,1.7]
Iris-versicolor:[6.0,2.9,4.5,1.5]
Iris-versicolor:[5.7,2.6,3.5,1.0]
Iris-versicolor:[5.5,2.4,3.8,1.1]
Iris-versicolor:[5.5,2.4,3.7,1.0]
Iris-versicolor:[5.8,2.7,3.9,1.2]
Iris-versicolor:[6.0,2.7,5.1,1.6]
Iris-versicolor:[5.4,3.0,4.5,1.5]
Iris-versicolor:[6.0,3.4,4.5,1.6]
Iris-versicolor:[6.7,3.1,4.7,1.5]
Iris-versicolor:[6.3,2.3,4.4,1.3]
Iris-versicolor:[5.6,3.0,4.1,1.3]
Iris-versicolor:[5.5,2.5,4.0,1.3]
Iris-versicolor:[5.5,2.6,4.4,1.2]
Iris-versicolor:[6.1,3.0,4.6,1.4]
Iris-versicolor:[5.8,2.6,4.0,1.2]
Iris-versicolor:[5.0,2.3,3.3,1.0]
Iris-versicolor:[5.6,2.7,4.2,1.3]
Iris-versicolor:[5.7,3.0,4.2,1.2]
Iris-versicolor:[5.7,2.9,4.2,1.3]
Iris-versicolor:[6.2,2.9,4.3,1.3]
Iris-versicolor:[5.1,2.5,3.0,1.1]
Iris-versicolor:[5.7,2.8,4.1,1.3]
Iris-virginica:[6.3,3.3,6.0,2.5]
Iris-virginica:[5.8,2.7,5.1,1.9]
Iris-virginica:[7.1,3.0,5.9,2.1]
Iris-virginica:[6.3,2.9,5.6,1.8]
Iris-virginica:[6.5,3.0,5.8,2.2]
Iris-virginica:[7.6,3.0,6.6,2.1]
Iris-virginica:[4.9,2.5,4.5,1.7]
Iris-virginica:[7.3,2.9,6.3,1.8]
Iris-virginica:[6.7,2.5,5.8,1.8]
Iris-virginica:[7.2,3.6,6.1,2.5]
Iris-virginica:[6.5,3.2,5.1,2.0]
Iris-virginica:[6.4,2.7,5.3,1.9]
Iris-virginica:[6.8,3.0,5.5,2.1]
Iris-virginica:[5.7,2.5,5.0,2.0]
Iris-virginica:[5.8,2.8,5.1,2.4]
Iris-virginica:[6.4,3.2,5.3,2.3]
Iris-virginica:[6.5,3.0,5.5,1.8]
Iris-virginica:[7.7,3.8,6.7,2.2]
Iris-virginica:[7.7,2.6,6.9,2.3]
Iris-virginica:[6.0,2.2,5.0,1.5]
Iris-virginica:[6.9,3.2,5.7,2.3]
Iris-virginica:[5.6,2.8,4.9,2.0]
Iris-virginica:[7.7,2.8,6.7,2.0]
Iris-virginica:[6.3,2.7,4.9,1.8]
Iris-virginica:[6.7,3.3,5.7,2.1]
Iris-virginica:[7.2,3.2,6.0,1.8]
Iris-virginica:[6.2,2.8,4.8,1.8]
Iris-virginica:[6.1,3.0,4.9,1.8]
Iris-virginica:[6.4,2.8,5.6,2.1]
Iris-virginica:[7.2,3.0,5.8,1.6]
Iris-virginica:[7.4,2.8,6.1,1.9]
Iris-virginica:[7.9,3.8,6.4,2.0]
Iris-virginica:[6.4,2.8,5.6,2.2]
Iris-virginica:[6.3,2.8,5.1,1.5]
Iris-virginica:[6.1,2.6,5.6,1.4]
Iris-virginica:[7.7,3.0,6.1,2.3]
Iris-virginica:[6.3,3.4,5.6,2.4]
Iris-virginica:[6.4,3.1,5.5,1.8]
Iris-virginica:[6.0,3.0,4.8,1.8]
Iris-virginica:[6.9,3.1,5.4,2.1]
Iris-virginica:[6.7,3.1,5.6,2.4]
Iris-virginica:[6.9,3.1,5.1,2.3]
Iris-virginica:[5.8,2.7,5.1,1.9]
Iris-virginica:[6.8,3.2,5.9,2.3]
Iris-virginica:[6.7,3.3,5.7,2.5]
Iris-virginica:[6.7,3.0,5.2,2.3]
Iris-virginica:[6.3,2.5,5.0,1.9]
Iris-virginica:[6.5,3.0,5.2,2.0]
Iris-virginica:[6.2,3.4,5.4,2.3]
Iris-virginica:[5.9,3.0,5.1,1.8]
```

3. Build the ML pipeline

We get the label column and the feature column separately, index them, and rename them:

```scala
scala> val labelIndexer = new StringIndexer().setInputCol("label").setOutputCol("indexedLabel").fit(df)
labelIndexer: org.apache.spark.ml.feature.StringIndexerModel = strIdx_e53e67411169

scala> val featureIndexer = new VectorIndexer().setInputCol("features").setOutputCol("indexedFeatures").fit(df)
featureIndexer: org.apache.spark.ml.feature.VectorIndexerModel = vecIdx_53b988077b38
```

Next we randomly split the dataset into a training set and a test set, with 70% of the data used for training:

```scala
scala> val Array(trainingData, testData) = df.randomSplit(Array(0.7, 0.3))
trainingData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: string]
testData: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [features: vector, label: string]
```

Then we set the parameters of the logistic regression. Here we set them uniformly with setter methods; they can also be set with a ParamMap (see the Spark MLlib website for details). We set the number of iterations to 10, the regularization parameter to 0.3, and so on. explainParams() lists every configurable parameter together with the values we have already set:

```scala
scala> val lr = new LogisticRegression().setLabelCol("indexedLabel").setFeaturesCol("indexedFeatures").setMaxIter(10).setRegParam(0.3).setElasticNetParam(0.8)
lr: org.apache.spark.ml.classification.LogisticRegression = logreg_692899496c23

scala> println("LogisticRegression parameters:\n" + lr.explainParams() + "\n")
LogisticRegression parameters:
aggregationDepth: suggested depth for treeAggregate (>= 2) (default: 2)
elasticNetParam: the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty (default: 0.0, current: 0.8)
family: The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial. (default: auto)
featuresCol: features column name (default: features, current: indexedFeatures)
fitIntercept: whether to fit an intercept term (default: true)
labelCol: label column name (default: label, current: indexedLabel)
maxIter: maximum number of iterations (>= 0) (default: 100, current: 10)
predictionCol: prediction column name (default: prediction)
probabilityCol: Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities (default: probability)
rawPredictionCol: raw prediction (a.k.a. confidence) column name (default: rawPrediction)
regParam: regularization parameter (>= 0) (default: 0.0, current: 0.3)
standardization: whether to standardize the training features before fitting the model (default: true)
threshold: threshold in binary classification prediction, in range [0, 1] (default: 0.5)
thresholds: Thresholds in multi-class classification to adjust the probability of predicting each class. Array must have length equal to the number of classes, with values > 0 excepting that at most one value may be 0. The class with largest value p/t is predicted, where p is the original probability of that class and t is the class's threshold (undefined)
tol: the convergence tolerance for iterative algorithms (>= 0) (default: 1.0E-6)
weightCol: weight column name. If this is not set or empty, we treat all instance weights as 1.0 (undefined)
```

Here we set up a labelConverter, whose purpose is to convert the predicted numeric class back to its string form:

```scala
scala> val labelConverter = new IndexToString().setInputCol("prediction").setOutputCol("predictedLabel").setLabels(labelIndexer.labels)
labelConverter: org.apache.spark.ml.feature.IndexToString = idxToStr_c204eafabf57
```

We build the pipeline by setting its stages, then call fit() to train the model:

```scala
scala> val lrPipeline = new Pipeline().setStages(Array(labelIndexer, featureIndexer, lr, labelConverter))
lrPipeline: org.apache.spark.ml.Pipeline = pipeline_eb1b201af1e0

scala> val lrPipelineModel = lrPipeline.fit(trainingData)
lrPipelineModel: org.apache.spark.ml.PipelineModel = pipeline_eb1b201af1e0
```

A pipeline is essentially an Estimator: calling the pipeline's fit() produces a PipelineModel, which is essentially a Transformer. The PipelineModel can then call transform() to make predictions and generate a new DataFrame; in other words, we use the trained model to validate the test set:

```scala
scala> val lrPredictions = lrPipelineModel.transform(testData)
lrPredictions: org.apache.spark.sql.DataFrame = [features: vector, label: string ... 6 more fields]
```

Finally we can output the prediction results: select chooses the columns to output, collect fetches the data of all rows, and foreach prints each row. The printed values are, in order, the true class and feature values of the row, the predicted probability of each class, and the predicted class:

```scala
scala> lrPredictions.select("predictedLabel", "label", "features", "probability").collect().foreach { case Row(predictedLabel: String, label: String, features: Vector, prob: Vector) => println(s"($label, $features) --> prob=$prob, predictedLabel=$predictedLabel")}
```
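As a side note to the parameter-estimation section above, the gradient-ascent idea can be sketched outside Spark in a few lines of plain Scala. This is an illustrative toy implementation only, not the optimizer spark.ml actually uses; the one-dimensional data, learning rate, and iteration count below are made up for the example.

```scala
// Toy gradient ascent for binomial logistic regression (illustrative only).
// Model: P(Y=1|x) = sigmoid(w . x); the constant 1.0 feature plays the role of the bias b.
def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

// One ascent step on the log-likelihood L(w): dL/dw_j = sum_i (y_i - pi(x_i)) * x_ij
def step(w: Array[Double], xs: Array[Array[Double]], ys: Array[Double], eta: Double): Array[Double] = {
  val grad = Array.fill(w.length)(0.0)
  for ((x, y) <- xs.zip(ys)) {
    val p = sigmoid(x.zip(w).map { case (xi, wi) => xi * wi }.sum)
    for (j <- w.indices) grad(j) += (y - p) * x(j)
  }
  w.zip(grad).map { case (wj, gj) => wj + eta * gj }
}

// Made-up data: class 0 for x <= 1, class 1 for x >= 2; second column is the bias feature.
val xs = Array(Array(0.0, 1.0), Array(1.0, 1.0), Array(2.0, 1.0), Array(3.0, 1.0))
val ys = Array(0.0, 0.0, 1.0, 1.0)

// 500 ascent steps with learning rate 0.1, starting from w = 0
val w = (1 to 500).foldLeft(Array(0.0, 0.0))((acc, _) => step(acc, xs, ys, 0.1))
```

After training, sigmoid(3.0 * w(0) + w(1)) is close to 1 and sigmoid(w(1)) is close to 0, i.e. the fitted log-odds line separates the two classes.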
4. Model evaluation

We create a MulticlassClassificationEvaluator, use its setters to configure the prediction column and the true-label column, and compute the accuracy of the predictions on the test set (this evaluator is reused later for the multinomial model):

```scala
scala> val evaluator = new MulticlassClassificationEvaluator().setLabelCol("indexedLabel").setPredictionCol("prediction")

scala> val lrAccuracy = evaluator.evaluate(lrPredictions)
```

We can also obtain the training summary of the logistic regression model inside the pipeline, cast it to a BinaryLogisticRegressionSummary, and inspect metrics such as the area under the ROC curve:

```scala
scala> val lrModel = lrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]

scala> val trainingSummary = lrModel.summary

scala> val binarySummary = trainingSummary.asInstanceOf[BinaryLogisticRegressionSummary]

scala> println(s"areaUnderROC: ${binarySummary.areaUnderROC}")
areaUnderROC: 0.969551282051282
```

The summary also provides the F-Measure at every threshold; we can take the threshold that maximizes the F-Measure as the model's classification threshold:

```scala
scala> val fMeasure = binarySummary.fMeasureByThreshold
fMeasure: org.apache.spark.sql.DataFrame = [threshold: double, F-Measure: double]

scala> val maxFMeasure = fMeasure.select(functions.max("F-Measure")).head().getDouble(0)
maxFMeasure: Double = 0.9180327868852458

scala> val bestThreshold = fMeasure.where($"F-Measure" === maxFMeasure).select("threshold").head().getDouble(0)

scala> lrModel.setThreshold(bestThreshold)
```
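For reference, the F-Measure that is being maximized over thresholds is the harmonic mean of precision P and recall R:

\[F = \frac{2PR}{P + R}\]

Choosing the threshold that maximizes F balances precision against recall, instead of keeping the default classification threshold of 0.5.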
Using multinomial logistic regression for multi-class classification

For the three-class problem we go back to the complete dataset containing all three iris classes and repeat the same steps: index the label and feature columns, split off 30% of the data for testing, and fit a pipeline whose LogisticRegression stage is multinomial (the corresponding variables below are prefixed with mlr). Printing the test-set predictions in the same format as before gives:

```scala
scala> mlrPredictions.select("predictedLabel", "label", "features", "probability").collect().foreach { case Row(predictedLabel: String, label: String, features: Vector, prob: Vector) => println(s"($label, $features) --> prob=$prob, predictedLabel=$predictedLabel")}
(Iris-setosa, [4.3,3.0,1.1,0.1]) --> prob=[0.49856067730476944,0.2623440805400292,0.23909524215520148], predictedLabel=Iris-setosa
(Iris-setosa, [4.4,2.9,1.4,0.2]) --> prob=[0.46571089790971687,0.277891570222724,0.25639753186755915], predictedLabel=Iris-setosa
(Iris-setosa, [4.6,3.4,1.4,0.3]) --> prob=[0.5001101367665973,0.25928904940719977,0.24060081382620296], predictedLabel=Iris-setosa
(Iris-setosa, [4.6,3.6,1.0,0.2]) --> prob=[0.5463459284110406,0.236823238870237,0.2168308327187224], predictedLabel=Iris-setosa
(Iris-setosa, [4.7,3.2,1.6,0.2]) --> prob=[0.48370179709200706,0.2689591735381297,0.24733902936986318], predictedLabel=Iris-setosa
(Iris-setosa, [4.8,3.0,1.4,0.1]) --> prob=[0.4851576852808171,0.2693562861247639,0.24548602859441893], predictedLabel=Iris-setosa
(Iris-setosa, [4.9,3.0,1.4,0.2]) --> prob=[0.47467118791268154,0.2733753207454508,0.2519534913418676], predictedLabel=Iris-setosa
(Iris-setosa, [4.9,3.1,1.5,0.1]) --> prob=[0.48967732688779486,0.267131626783035,0.2431910463291701], predictedLabel=Iris-setosa
(Iris-versicolor, [5.0,2.3,3.3,1.0]) --> prob=[0.26303122888674907,0.36560215832179155,0.3713666127914594], predictedLabel=Iris-virginica
(Iris-setosa, [5.0,3.5,1.3,0.3]) --> prob=[0.5135688079539743,0.2524416257183621,0.2339895663276636], predictedLabel=Iris-setosa
(Iris-setosa, [5.0,3.5,1.6,0.6]) --> prob=[0.4686356517088239,0.2713034457686629,0.26006090252251307], predictedLabel=Iris-setosa
(Iris-setosa, [5.1,3.5,1.4,0.3]) --> prob=[0.5091020180722664,0.25475974124614675,0.23613824068158687], predictedLabel=Iris-setosa
(Iris-setosa, [5.1,3.8,1.5,0.3]) --> prob=[0.531570061574297,0.24348517949467904,0.22494475893102403], predictedLabel=Iris-setosa
(Iris-setosa, [5.1,3.8,1.9,0.4]) --> prob=[0.503222274154322,0.25683175058110785,0.23994597526457007], predictedLabel=Iris-setosa
(Iris-setosa, [5.2,3.5,1.5,0.2]) --> prob=[0.5151370941776632,0.2529823490495923,0.2318805567727446], predictedLabel=Iris-setosa
(Iris-setosa, [5.3,3.7,1.5,0.2]) --> prob=[0.5330773525753305,0.24387796024925384,0.22304468717541576], predictedLabel=Iris-setosa
(Iris-versicolor, [5.4,3.0,4.5,1.5]) --> prob=[0.2306542447600023,0.372383222489962,0.39696253275003573], predictedLabel=Iris-virginica
(Iris-setosa, [5.4,3.9,1.3,0.4]) --> prob=[0.5389512877303541,0.23848657002728416,0.2225621422423618], predictedLabel=Iris-setosa
(Iris-versicolor, [5.5,2.4,3.7,1.0]) --> prob=[0.25620601559263473,0.36919246180632764,0.37460152260103763], predictedLabel=Iris-virginica
(Iris-setosa, [5.5,3.5,1.3,0.2]) --> prob=[0.5240613549472979,0.24832602160956213,0.22761262344314004], predictedLabel=Iris-setosa
(Iris-setosa, [5.5,4.2,1.4,0.2]) --> prob=[0.5818115053858839,0.21899706180633755,0.19919143280777854], predictedLabel=Iris-setosa
(Iris-versicolor, [5.6,2.5,3.9,1.1]) --> prob=[0.24827164138938784,0.3712338899987297,0.38049446861188246], predictedLabel=Iris-virginica
(Iris-versicolor, [5.6,2.7,4.2,1.3]) --> prob=[0.23609842674482123,0.3733910806218104,0.39051049263336834], predictedLabel=Iris-virginica
(Iris-virginica, [5.6,2.8,4.9,2.0]) --> prob=[0.17353784667372726,0.38803750951559646,0.43842464381067625], predictedLabel=Iris-virginica
(Iris-versicolor, [5.6,2.9,3.6,1.3]) --> prob=[0.26994082035183004,0.35725015822484213,0.37280902142332784], predictedLabel=Iris-virginica
(Iris-setosa, [5.7,4.4,1.5,0.4]) --> prob=[0.5744990088621882,0.22068271118182198,0.20481827995598978], predictedLabel=Iris-setosa
(Iris-virginica, [5.8,2.8,5.1,2.4]) --> prob=[0.14589555459093273,0.39150544114527663,0.4625990042637906], predictedLabel=Iris-virginica
(Iris-virginica, [5.9,3.0,5.1,1.8]) --> prob=[0.19164845952411863,0.38448782728830505,0.42386371318757643], predictedLabel=Iris-virginica
(Iris-versicolor, [6.0,2.2,4.0,1.0]) --> prob=[0.23300779791940326,0.3802856918956981,0.3867065101848985], predictedLabel=Iris-virginica
(Iris-versicolor, [6.0,2.7,5.1,1.6]) --> prob=[0.18810463050749873,0.3900406691963187,0.42185470029618244], predictedLabel=Iris-virginica
(Iris-versicolor, [6.1,2.8,4.0,1.3]) --> prob=[0.24928433400912278,0.3671520807495573,0.3835635852413199], predictedLabel=Iris-virginica
(Iris-versicolor, [6.2,2.2,4.5,1.5]) --> prob=[0.18351550396686533,0.3934066675024647,0.42307782853066994], predictedLabel=Iris-virginica
(Iris-virginica, [6.2,2.8,4.8,1.8]) --> prob=[0.1888126898204262,0.38539188363903437,0.4257954265405395], predictedLabel=Iris-virginica
(Iris-versicolor, [6.2,2.9,4.3,1.3]) --> prob=[0.24600050420847877,0.3689652108789115,0.38503428491260977], predictedLabel=Iris-virginica
(Iris-virginica, [6.2,3.4,5.4,2.3]) --> prob=[0.17337730890542696,0.3825617039174212,0.44406098717715176], predictedLabel=Iris-virginica
(Iris-virginica, [6.3,2.9,5.6,1.8]) --> prob=[0.1729681423511942,0.3931462837297906,0.4338855739190153], predictedLabel=Iris-virginica
(Iris-virginica, [6.4,3.1,5.5,1.8]) --> prob=[0.18621090846131505,0.3872972795834499,0.42649181195523495], predictedLabel=Iris-virginica
(Iris-versicolor, [6.6,2.9,4.6,1.3]) --> prob=[0.23618909578565045,0.373766365784125,0.3900445384302246], predictedLabel=Iris-virginica
(Iris-virginica, [6.7,3.3,5.7,2.5]) --> prob=[0.1496994275680708,0.38855932284425526,0.4617412495876739], predictedLabel=Iris-virginica
(Iris-virginica, [6.8,3.0,5.5,2.1]) --> prob=[0.16265889090899283,0.39126984184915486,0.4460712672418523], predictedLabel=Iris-virginica
(Iris-virginica, [7.2,3.2,6.0,1.8]) --> prob=[0.1782593898810351,0.3913068582491216,0.43043375186984334], predictedLabel=Iris-virginica
(Iris-virginica, [7.7,2.6,6.9,2.3]) --> prob=[0.10733085394350968,0.41117706558989164,0.4814920804665987], predictedLabel=Iris-virginica
(Iris-virginica, [7.7,3.8,6.7,2.2]) --> prob=[0.16693678799079806,0.38877323991855633,0.44428997209064564], predictedLabel=Iris-virginica
(Iris-virginica, [7.9,3.8,6.4,2.0]) --> prob=[0.18714592916724979,0.3838745095632083,0.42897956126954184], predictedLabel=Iris-virginica
```
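The three-component probability vectors printed above come from the multinomial generalization of the binomial model: with K classes, and a weight vector \(w_k\) and bias \(b_k\) per class, the class probabilities are given by the softmax

\[P(Y=k \mid x) = \frac{\exp(w_k \cdot x + b_k)}{\sum_{j=1}^{K} \exp(w_j \cdot x + b_j)}, \quad k = 1, \dots, K\]

which reduces to the binomial model when K = 2.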
We then use the evaluator to compute the accuracy and the test error of the multinomial model, and take the model out of the pipeline to inspect its coefficient matrix, intercept vector, number of classes, and number of features:

```scala
scala> val mlrAccuracy = evaluator.evaluate(mlrPredictions)
mlrAccuracy: Double = 0.6339712918660287

scala> println("Test Error = " + (1.0 - mlrAccuracy))
Test Error = 0.36602870813397126

scala> val mlrModel = mlrPipelineModel.stages(2).asInstanceOf[LogisticRegressionModel]
mlrModel: org.apache.spark.ml.classification.LogisticRegressionModel = logreg_9661a4f56149

scala> println("Multinomial coefficients: " + mlrModel.coefficientMatrix + "\nMultinomial intercepts: " + mlrModel.interceptVector + "\nnumClasses: " + mlrModel.numClasses + "\nnumFeatures: " + mlrModel.numFeatures)
Multinomial coefficients: 0.0  0.35442627664118775    -0.1787646656602406  -0.36662299325180614
                          0.0  0.0                    0.0                  0.0
                          0.0  -0.010992364266212548  0.0                  0.11193811404312962
Multinomial intercepts: [-0.10160079218819881,0.0863062310816332,0.01529456110656562]
numClasses: 3
numFeatures: 4
```
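To see that the coefficient matrix and intercept vector fully determine the model's predictions, we can recompute the probability vector of one test record by hand with the softmax formula. This is a standalone sketch: the numbers below are copied from the output above (the matrix rows are the per-class weight vectors), and [4.3,3.0,1.1,0.1] is the first record in the printed predictions, whose probability vector was [0.4986..., 0.2623..., 0.2391...].

```scala
// Per-class weight vectors (rows of coefficientMatrix) and intercepts, copied from the output above.
val coef = Array(
  Array(0.0, 0.35442627664118775, -0.1787646656602406, -0.36662299325180614),
  Array(0.0, 0.0, 0.0, 0.0),
  Array(0.0, -0.010992364266212548, 0.0, 0.11193811404312962)
)
val intercept = Array(-0.10160079218819881, 0.0863062310816332, 0.01529456110656562)

// Softmax over the per-class linear scores w_k . x + b_k
def predictProb(x: Array[Double]): Array[Double] = {
  val scores = coef.zip(intercept).map { case (w, b) =>
    w.zip(x).map { case (wi, xi) => wi * xi }.sum + b
  }
  val exps = scores.map(math.exp)
  val total = exps.sum
  exps.map(_ / total)
}

// First record of the test set shown above
val prob = predictProb(Array(4.3, 3.0, 1.1, 0.1))
// prob is approximately [0.4986, 0.2623, 0.2391]; its largest component is class 0,
// which matches predictedLabel = Iris-setosa in the printed output.
```

Note that the all-zero first column of the coefficient matrix reflects the L1 (elastic-net) regularization, which drove the weights of the first feature to zero.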