Posted to dev@spark.apache.org by Yuhao Yang <hh...@gmail.com> on 2017/03/15 01:05:08 UTC

Re: how to construct parameter for model.transform() from datafile

Hi Jinhong,


Based on the error messages, your second collection of vectors has a
dimension of at least 804202, while your training vectors had a dimension
of 144109. transform() multiplies the model's coefficient matrix, which
has 144109 columns, by each feature vector (that is the gemv call in your
second stack trace), so please make sure your test dataset is of the same
dimension as the training data.
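
For example, if you can write the new data in the same libsvm format as
the training file, the libsvm reader lets you pin the dimension
explicitly so the two loads always agree. A sketch (untested; the path
is a placeholder):

      // Pin the feature dimension so the loaded vectors match the model;
      // 144109 comes from your training output.
      val test = spark.read.format("libsvm")
        .option("numFeatures", "144109")
        .load("/tmp/ljhn1829/aplus/test_data")   // placeholder path

      val predictions = model.transform(test)
      predictions.show()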

From the test dataset you posted, the vector dimension is much larger
than 144109: the failing write at index 804201 implies a dimension of at
least 804202.
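
If those high indices are features the model never saw during training,
one workaround is a variant of your lineToVector that keeps the training
dimension and drops out-of-range indices. A sketch (untested; it assumes
unseen features can simply be ignored):

      import org.apache.spark.ml.linalg.{Vector, Vectors}

      // Parse "index:value index:value ..." into a sparse vector of the
      // training dimension, dropping any index the model has never seen.
      def lineToVector(line: String, numFeatures: Int = 144109): Vector = {
        val pairs = line.split(" ").toSeq.map { s =>
          val Array(index, value) = s.split(":")
          (index.toInt, value.toDouble)
        }.filter { case (index, _) => index < numFeatures }
        Vectors.sparse(numFeatures, pairs)
      }

Features the model never saw have no coefficients in it anyway, but if
they matter for your task, the right fix is to retrain on the larger
dimension instead.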

Regards,
Yuhao


2017-03-13 4:59 GMT-07:00 jinhong lu <lu...@gmail.com>:

> Can anyone help?
>
> > On 13 Mar 2017, at 19:38, jinhong lu <lu...@gmail.com> wrote:
> >
> > After training the model, I got a result that looks like this:
> >
> >
> >       scala> predictionResult.show()
> >       +-----+--------------------+--------------------+--------------------+----------+
> >       |label|            features|       rawPrediction|         probability|prediction|
> >       +-----+--------------------+--------------------+--------------------+----------+
> >       |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> >       |  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
> >       |  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|
> >       +-----+--------------------+--------------------+--------------------+----------+
> >
> > And then I transform() the data with this code:
> >
> >       import org.apache.spark.ml.linalg.Vectors
> >       import org.apache.spark.ml.linalg.Vector
> >       import scala.collection.mutable
> >
> >       def lineToVector(line: String): Vector = {
> >         val seq = new mutable.Queue[(Int, Double)]
> >         val content = line.split(" ")
> >         for (s <- content) {
> >           val index = s.split(":")(0).toInt
> >           val value = s.split(":")(1).toDouble
> >           seq += ((index, value))
> >         }
> >         Vectors.sparse(144109, seq)
> >       }
> >
> >       val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
> >         .map(line => line._2)
> >         .map(line => (line.toString.split("\t")(0), lineToVector(line.toString.split("\t")(1))))
> >         .toDF("udid", "features")
> >       val predictionResult = model.transform(df)
> >       predictionResult.show()
> >
> >
> > But I got an error that looks like this:
> >
> > Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
> >  at scala.Predef$.require(Predef.scala:224)
> >  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
> >  at lineToVector(<console>:55)
> >  at $anonfun$4.apply(<console>:50)
> >  at $anonfun$4.apply(<console>:50)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> >  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
> >  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> >  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> >  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> >  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> >  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> >
> > So I changed
> >
> >       Vectors.sparse(144109, seq)
> >
> > to
> >
> >       Vectors.sparse(804202, seq)
> >
> > Another error occurred:
> >
> >       Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
> >         at scala.Predef$.require(Predef.scala:224)
> >         at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> >         at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> >         at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
> >
> > What should I do?
> >> On 13 Mar 2017, at 16:31, jinhong lu <lu...@gmail.com> wrote:
> >>
> >> Hi, all:
> >>
> >> I have this training data:
> >>
> >>      0 31607:17
> >>      0 111905:36
> >>      0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> >>      0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> >>      0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> >>      0 19109:7 29705:4 123305:32
> >>      0 15309:1 43005:1 108509:1
> >>      1 604:1 6401:1 6503:1 15207:4 31607:40
> >>      0 1807:19
> >>      0 301:14 501:1 1502:14 2507:12 123305:4
> >>      0 607:14 19109:460 123305:448
> >>      0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> >>      1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
> >>
> >> And then I train the model with Spark:
> >>
> >>      import org.apache.spark.ml.classification.NaiveBayes
> >>      import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> >>      import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> >>      import org.apache.spark.sql.SparkSession
> >>
> >>      val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
> >>      val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
> >>      val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> >>      //val model = new NaiveBayes().fit(trainingData)
> >>      val model = new NaiveBayes().setThresholds(Array(10.0, 1.0)).fit(trainingData)
> >>      val predictions = model.transform(testData)
> >>      predictions.show()
> >>
> >>
> >> OK, I have got my model from the code above, but how can I use this model
> >> to predict the classification of other data like this:
> >>
> >>      ID1     509:2 5102:4 25909:1 31709:4 121905:19
> >>      ID2     800201:1
> >>      ID3     116005:4
> >>      ID4     800201:1
> >>      ID5     19109:1  21708:1 23208:1 49809:1 88609:1
> >>      ID6     800201:1
> >>      ID7     43505:7 106405:7
> >>
> >> I know I can use the transform() method, but how do I construct the
> >> parameter for the transform() method?
> >>
> >>
> >>
> >>
> >>
> >> Thanks,
> >> lujinhong
> >>
> >
> > Thanks,
> > lujinhong
> >
>
> Thanks,
> lujinhong
>
>