
Posted to user@spark.apache.org by pun <pu...@gmail.com> on 2017/10/05 21:12:09 UTC

Spark ML - LogisticRegression extract words with highest weights

I am using a Spark ML pipeline to classify text documents with the following
stages:
Tokenizer -> CountVectorizer -> LogisticRegression
I want to print the words with the highest weights. Can this be done?
So far I have been able to extract the LR coefficients, but can they be
tied back to the actual words?

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.{LogisticRegression, LogisticRegressionModel}
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")
val countVectorizer = new CountVectorizer()
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")
val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)
val pipeline = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)
val results = model.transform(test)
val lrm: LogisticRegressionModel =
  model.stages.last.asInstanceOf[LogisticRegressionModel]

// PRINT COEFFICIENTS
println(s"LR Model coefficients:\n${lrm.coefficients.toArray.mkString("\n")}")
(lrm.intercept, lrm.coefficients)
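
One way the coefficients might be tied back to words, as a minimal sketch rather than a confirmed answer: the fitted CountVectorizerModel exposes a vocabulary array whose index i corresponds to feature index i, so it can be zipped with the coefficient array. The names TopWords and topWeightedWords are hypothetical helpers, not part of Spark, and the commented wiring assumes a binary classifier (so that lrm.coefficients is available) and that the CountVectorizer is the second pipeline stage.

```scala
// Sketch only: pairs each vocabulary term with its coefficient and keeps the
// n terms with the largest absolute weight. TopWords and topWeightedWords are
// hypothetical names introduced for illustration.
object TopWords {
  def topWeightedWords(
      vocabulary: Array[String],
      coefficients: Array[Double],
      n: Int): List[(String, Double)] = {
    require(vocabulary.length == coefficients.length,
      "vocabulary and coefficients must have the same length")
    vocabulary.zip(coefficients)
      .sortBy { case (_, weight) => -math.abs(weight) } // largest magnitude first
      .take(n)
      .toList
  }
}

// Assumed wiring with the fitted pipeline above (binary classification):
//   import org.apache.spark.ml.feature.CountVectorizerModel
//   val cvModel = model.stages(1).asInstanceOf[CountVectorizerModel]
//   val top = TopWords.topWeightedWords(cvModel.vocabulary, lrm.coefficients.toArray, 20)
//   top.foreach { case (word, weight) => println(s"$word\t$weight") }
```

Sorting by absolute value surfaces strongly negative words as well as strongly positive ones; sorting by the raw weight instead would list only the words that push toward the positive class.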



