Posted to dev@spark.apache.org by jinhong lu <lu...@gmail.com> on 2017/03/13 11:38:35 UTC

Re: how to construct parameter for model.transform() from datafile

After training the model, I got a result that looks like this:


	scala>  predictionResult.show()
	+-----+--------------------+--------------------+--------------------+----------+
	|label|            features|       rawPrediction|         probability|prediction|
	+-----+--------------------+--------------------+--------------------+----------+
	|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
	|  0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...|       0.0|
	|  0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...|       1.0|

And then I transform() the data with this code:

	import org.apache.spark.ml.linalg.Vectors
	import org.apache.spark.ml.linalg.Vector
	import scala.collection.mutable

	  // Parse one line of space-separated "index:value" pairs into a sparse vector.
	  // The declared size must match the dimension the model was trained with.
	  def lineToVector(line: String): Vector = {
	    val seq = new mutable.Queue[(Int, Double)]
	    for (s <- line.split(" ")) {
	      val index = s.split(":")(0).toInt
	      val value = s.split(":")(1).toDouble
	      seq += ((index, value))
	    }
	    return Vectors.sparse(144109, seq)
	  }

	 val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text](
	     "/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
	   .map(_._2.toString)
	   .map(line => (line.split("\t")(0), lineToVector(line.split("\t")(1))))
	   .toDF("udid", "features")
	 val predictionResult = model.transform(df)
	 predictionResult.show()


But I got an error like this:

 Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
  at lineToVector(<console>:55)
  at $anonfun$4.apply(<console>:50)
  at $anonfun$4.apply(<console>:50)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)

So I changed

 	return Vectors.sparse(144109, seq)

to 

	return Vectors.sparse(804202, seq)

Then another error occurred:

	Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
	  at scala.Predef$.require(Predef.scala:224)
	  at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
	  at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
	  at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)

What should I do?
> On 2017-03-13, at 16:31, jinhong lu <lu...@gmail.com> wrote:
> 
> Hi, all:
> 
> I have this training data:
> 
> 	0 31607:17
> 	0 111905:36
> 	0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> 	0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> 	0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> 	0 19109:7 29705:4 123305:32
> 	0 15309:1 43005:1 108509:1
> 	1 604:1 6401:1 6503:1 15207:4 31607:40
> 	0 1807:19
> 	0 301:14 501:1 1502:14 2507:12 123305:4
> 	0 607:14 19109:460 123305:448
> 	0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> 	1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
> 
> And then I train the model with Spark:
> 
> 	import org.apache.spark.ml.classification.NaiveBayes
> 	import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> 	import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> 	import org.apache.spark.sql.SparkSession
> 
> 	val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
> 	val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
> 	val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> 	//val model = new NaiveBayes().fit(trainingData)
> 	val model = new NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData)
> 	val predictions = model.transform(testData)
> 	predictions.show()
> 
> 
> OK, I have got my model from the code above, but how can I use this model to predict the classification of other data like this:
> 
> 	ID1	509:2 5102:4 25909:1 31709:4 121905:19
> 	ID2	800201:1
> 	ID3	116005:4
> 	ID4	800201:1
> 	ID5	19109:1  21708:1 23208:1 49809:1 88609:1
> 	ID6	800201:1
> 	ID7	43505:7 106405:7
> 
> I know I can use the transform() method, but how do I construct the parameter for the transform() method?
> 
> 
> 
> 
> 
> Thanks,
> lujinhong
> 

Thanks,
lujinhong




Re: how to construct parameter for model.transform() from datafile

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Just found that you can specify the number of features when loading a libsvm
source:

val df = spark.read.option("numFeatures", "100").format("libsvm")
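For the data in this thread, a complete read might look like the sketch below (the path and the 144109 dimension are taken from the earlier messages; this is illustrative, not something tested against your cluster):

	val numFeatures = 144109  // dimension used when the model was trained
	val training = spark.read
	  .option("numFeatures", numFeatures.toString)
	  .format("libsvm")
	  .load("/tmp/ljhn1829/aplus/training_data3")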



Liang-Chi Hsieh wrote
> As the libsvm format can't specify the number of features, and NaiveBayes
> doesn't seem to have such a parameter, the number of features inferred from
> the data files can be inconsistent when your training/testing data is sparse.
> 
> We may need to fix this.
> 
> Until a fix goes into NaiveBayes, a workaround is to align the number of
> features between the training and testing data before fitting the model.
> 





-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: how to construct parameter for model.transform() from datafile

Posted by Liang-Chi Hsieh <vi...@gmail.com>.
As the libsvm format can't specify the number of features, and NaiveBayes
doesn't seem to have such a parameter, the number of features inferred from
the data files can be inconsistent when your training/testing data is sparse.

We may need to fix this.

Until a fix goes into NaiveBayes, a workaround is to align the number of
features between the training and testing data before fitting the model, as
sketched below.
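A rough sketch of that alignment, using the paths posted earlier in this thread (illustrative only): scan both datasets for the largest feature index and use one common dimension when loading the training data and when building the scoring vectors.

	// Largest feature index in the libsvm training file.
	val trainIdx = sc.textFile("/tmp/ljhn1829/aplus/training_data3")
	  .flatMap(_.split(" ").drop(1).map(_.split(":")(0).toInt))
	// Largest feature index in the "udid<TAB>index:value ..." sequence file.
	val testIdx = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text](
	    "/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0")
	  .flatMap(_._2.toString.split("\t")(1).split(" ").filter(_.nonEmpty).map(_.split(":")(0).toInt))
	val numFeatures = (trainIdx union testIdx).max() + 1
	// Use this same numFeatures when loading the training data
	// (.option("numFeatures", numFeatures.toString)) and as the size passed to
	// Vectors.sparse() in lineToVector, then fit and transform as before.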








-----
Liang-Chi Hsieh | @viirya 
Spark Technology Center 
http://www.spark.tc/ 



Re: how to construct parameter for model.transform() from datafile

Posted by Yuhao Yang <hh...@gmail.com>.
Hi Jinhong,


Based on the error message, your second collection of vectors has a dimension
of 804202, while the dimension of your training vectors was 144109. So please
make sure your test dataset is of the same dimension as the training data.

From the test dataset you posted, the vector dimension is much larger than
144109 (804202?).
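If it helps to double check, the fitted model itself reports the dimension it expects (a small sanity check against the model from the earlier snippet):

	// numFeatures is the dimension the NaiveBayes model was trained with; every
	// features vector passed to model.transform() must have exactly this size.
	println(model.numFeatures)  // 144109 for the model in this thread
	// So lineToVector should declare Vectors.sparse(model.numFeatures, seq), and
	// any index >= model.numFeatures cannot be represented for this model.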

Regards,
Yuhao




Re: how to construct parameter for model.transform() from datafile

Posted by jinhong lu <lu...@gmail.com>.
Anyone help?


Thanks,
lujinhong



