Posted to user@spark.apache.org by pun <pu...@gmail.com> on 2017/10/04 18:29:14 UTC
mllib - CountVectorizer + LogisticRegression unexpected behavior
Hello, I have a model that uses CountVectorizer and LogisticRegression.
*Everything seems to work fine, except that when I am running the last step
to get results and predictions, the document ids (doc_id) are being changed
completely. Do you know why that is? Am I doing something wrong?*
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{CountVectorizer, Tokenizer}

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val countVectorizer = new CountVectorizer()
  //.setVocabSize(50000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setMaxIter(10)
  .setRegParam(0.01)

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, countVectorizer, lr))

// Fit the pipeline to training documents.
val model = pipeline.fit(training)
val results = model.transform(test)
Training and test are two DFs with the following structure:
root
 |-- doc_id: string (nullable = true)
 |-- text: string (nullable = true)
 |-- label: integer (nullable = false)
Thanks in advance!
--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
Re: Spark ML - CountVectorizer + LogisticRegression unexpected behavior
Posted by pun <pu...@gmail.com>.
Never mind, rookie error: I wasn't caching the DF.
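For anyone landing on this thread later: the post does not say how doc_id was produced, so the following is only a sketch under an assumption, namely that the ids came from a nondeterministic expression (here a hypothetical random-UUID UDF called randomId). In that case an uncached DataFrame recomputes the ids on every action, which would make them appear to change between steps. The names CacheIdsSketch, randomId and raw are made up for illustration, and asNondeterministic() needs Spark 2.3 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object CacheIdsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("cache-ids-sketch")
      .getOrCreate()
    import spark.implicits._

    // Assumption: ids are generated by a nondeterministic expression.
    // Hypothetical generator; re-evaluated whenever the plan is recomputed.
    val randomId = udf(() => java.util.UUID.randomUUID().toString)
      .asNondeterministic()

    val raw = Seq(("spark is fast", 1), ("hello world", 0))
      .toDF("text", "label")

    // cache() marks the DataFrame for reuse; count() forces it to be computed once.
    val test = raw.withColumn("doc_id", randomId()).cache()
    test.count()

    // Both reads are served from the cached data, so the ids agree.
    val first = test.select("doc_id").as[String].collect().toSet
    val second = test.select("doc_id").as[String].collect().toSet
    assert(first == second)

    spark.stop()
  }
}
```

Caching is best-effort (evicted partitions are recomputed), so if the ids must stay fixed it is probably safer to write the DataFrame out once and read it back before fitting and transforming.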