Posted to user@spark.apache.org by pun <pu...@gmail.com> on 2017/10/05 21:12:09 UTC
Spark ML - LogisticRegression extract words with highest weights
I am using Spark ML's pipeline to classify text documents with the following
steps:
Tokenizer -> CountVectorizer -> LogisticRegression
I want to be able to print the words with the highest weights. Can this be
done?
So far I have been able to extract the LR coefficients, but can those be
tied back to the actual words?
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val countVectorizer = new CountVectorizer()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)
val results = model.transform(test)
val lrm: LogisticRegressionModel =
  model.stages.last.asInstanceOf[LogisticRegressionModel]

// PRINT COEFFICIENTS
println(s"LR Model coefficients:\n${lrm.coefficients.toArray.mkString("\n")}")
(lrm.intercept, lrm.coefficients)
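For what it's worth, here is a rough sketch of one way the mapping might be done. It relies on the fitted CountVectorizerModel exposing a vocabulary array whose index i corresponds to feature index i, so the coefficient at position i belongs to vocabulary(i). The helper name topWeightedWords is hypothetical (not a Spark API), and the usage in the comments assumes a binary model (a single coefficient vector) with the CountVectorizer as the second pipeline stage.

```scala
// Hypothetical helper: pair each vocabulary term with its coefficient
// and keep the n terms with the largest (signed) weights.
def topWeightedWords(
    vocabulary: Array[String],
    coefficients: Array[Double],
    n: Int): Array[(String, Double)] = {
  require(vocabulary.length == coefficients.length,
    "vocabulary and coefficient vector must have the same length")
  vocabulary
    .zip(coefficients)
    .sortBy { case (_, weight) => -weight } // descending by weight
    .take(n)
}

// Possible usage with the fitted pipeline (assumptions: binary
// classification, CountVectorizerModel at stage index 1):
//   import org.apache.spark.ml.feature.CountVectorizerModel
//   val cvm = model.stages(1).asInstanceOf[CountVectorizerModel]
//   topWeightedWords(cvm.vocabulary, lrm.coefficients.toArray, 20)
//     .foreach { case (word, weight) => println(s"$word\t$weight") }
```

This sorts by signed weight, so it returns the words pushing hardest toward the positive class. Sorting by math.abs(weight) instead would give the most influential words in either direction.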
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/