Posted to dev@spark.apache.org by jinhong lu <lu...@gmail.com> on 2017/03/13 11:38:35 UTC
Re: how to construct parameter for model.transform() from datafile
After training the model, I got a result that looks like this:
scala> predictionResult.show()
+-----+--------------------+--------------------+--------------------+----------+
|label| features| rawPrediction| probability|prediction|
+-----+--------------------+--------------------+--------------------+----------+
| 0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...| 0.0|
| 0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...| 0.0|
| 0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...| 1.0|
Then I call transform() on the data with this code:
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.linalg.Vector
import scala.collection.mutable
def lineToVector(line: String): Vector = {
  val seq = new mutable.Queue[(Int, Double)]
  val content = line.split(" ")
  for (s <- content) {
    val index = s.split(":")(0).toInt
    val value = s.split(":")(1).toDouble
    seq += ((index, value))
  }
  return Vectors.sparse(144109, seq)
}
val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0").map(line=>line._2).map(line => (line.toString.split("\t")(0),lineToVector(line.toString.split("\t")(1)))).toDF("udid", "features")
val predictionResult = model.transform(df)
predictionResult.show()
But I got an error like this:
Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
at lineToVector(<console>:55)
at $anonfun$4.apply(<console>:50)
at $anonfun$4.apply(<console>:50)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
So I changed
return Vectors.sparse(144109, seq)
to
return Vectors.sparse(804202, seq)
Then another error occurred:
Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
What should I do?
> On 2017-03-13, at 16:31, jinhong lu <lu...@gmail.com> wrote:
>
> Hi, all:
>
> I have this training data:
>
> 0 31607:17
> 0 111905:36
> 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
> 0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
> 0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
> 0 19109:7 29705:4 123305:32
> 0 15309:1 43005:1 108509:1
> 1 604:1 6401:1 6503:1 15207:4 31607:40
> 0 1807:19
> 0 301:14 501:1 1502:14 2507:12 123305:4
> 0 607:14 19109:460 123305:448
> 0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
> 1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
>
> Then I train the model with Spark:
>
> import org.apache.spark.ml.classification.NaiveBayes
> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
> import org.apache.spark.sql.SparkSession
>
> val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
> val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
> //val model = new NaiveBayes().fit(trainingData)
> val model = new NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData)
> val predictions = model.transform(testData)
> predictions.show()
>
>
> OK, I have got my model with the code above, but how can I use this model to predict the classification of other data like this:
>
> ID1 509:2 5102:4 25909:1 31709:4 121905:19
> ID2 800201:1
> ID3 116005:4
> ID4 800201:1
> ID5 19109:1 21708:1 23208:1 49809:1 88609:1
> ID6 800201:1
> ID7 43505:7 106405:7
>
> I know I can use the transform() method, but how do I construct the parameter for transform()?
>
>
>
>
>
> Thanks,
> lujinhong
>
Thanks,
lujinhong
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org
Re: how to construct parameter for model.transform() from datafile
Posted by Liang-Chi Hsieh <vi...@gmail.com>.
I just found that you can specify the number of features when loading a
libsvm source:
val df = spark.read.option("numFeatures", "100").format("libsvm")
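To make the option above concrete, here is a minimal sketch in spark-shell style. It assumes the training path from the original post and uses 804202, the size tried earlier in this thread, as a stand-in for a value that covers every feature index in both the training data and the data to be scored later. The right value depends on your own data.

```scala
import org.apache.spark.ml.classification.NaiveBayes
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()

// Assumption: 804202 is at least as large as the biggest feature index
// that appears in the training file or in any data scored later.
val numFeatures = 804202

// Fix the vector size up front instead of letting the reader infer it
// from the largest index that happens to occur in this particular file.
val data = spark.read
  .format("libsvm")
  .option("numFeatures", numFeatures.toString)
  .load("/tmp/ljhn1829/aplus/training_data3")

// The fitted model now expects feature vectors of size numFeatures.
val model = new NaiveBayes().fit(data)
```

Any DataFrame passed to model.transform() afterwards needs a "features" column whose vectors have this same size.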
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-how-to-construct-parameter-for-model-transform-from-datafile-tp21155p21180.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: how to construct parameter for model.transform() from datafile
Posted by Liang-Chi Hsieh <vi...@gmail.com>.
Because the libsvm format can't specify the number of features, and it looks
like NaiveBayes doesn't have such a parameter, the number of features
inferred from the data files can be inconsistent if your training/testing
data is sparse.
We may need to fix this.
Until a fix goes into NaiveBayes, a workaround is to align the number of
features between the training and testing data before fitting the model.
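One way to pick a common feature count before fitting is to scan both files for the largest index. The sketch below is plain Scala with no Spark dependency. The object and method names are made up for illustration, and the sample lines are taken from the data posted in this thread. It assumes every line starts with a label or ID followed by index:value pairs.

```scala
object FeatureDimension {
  // Largest feature index found in lines such as "0 31607:17" or "ID2 800201:1".
  // The first whitespace-separated token (label or ID) is skipped.
  def maxFeatureIndex(lines: Seq[String]): Int =
    lines.flatMap { line =>
      line.trim.split("\\s+").drop(1).map(token => token.split(":")(0).toInt).toSeq
    }.foldLeft(0)((a, b) => math.max(a, b))

  def main(args: Array[String]): Unit = {
    val training = Seq("0 31607:17", "0 19109:7 29705:4 123305:32")
    val toScore = Seq("ID2 800201:1", "ID7 43505:7 106405:7")
    // libsvm indices are one-based, so the largest index equals the
    // vector size needed to hold every feature from both data sets.
    val numFeatures = math.max(maxFeatureIndex(training), maxFeatureIndex(toScore))
    println(numFeatures)
  }
}
```

The resulting number can then be used as the vector size for both the training data and the data to be scored.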
-----
Liang-Chi Hsieh | @viirya
Spark Technology Center
http://www.spark.tc/
--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/Re-how-to-construct-parameter-for-model-transform-from-datafile-tp21155p21179.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
Re: how to construct parameter for model.transform() from datafile
Posted by Yuhao Yang <hh...@gmail.com>.
Hi Jinhong,
Based on the error message, your second collection of vectors has a
dimension of 804202, while the dimension of your training vectors
was 144109. So please make sure your test dataset has the same dimension
as the training data.
From the test dataset you posted, the vector dimension is much larger
than 144109 (804202?).
Regards,
Yuhao
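If the model is kept at its trained size, the new data has to be mapped into that same index space. The sketch below is plain Scala and the names are hypothetical. It makes two assumptions that are not stated in the thread: the new data uses the same one-based indices as the libsvm training file (Spark's libsvm loader converts these to zero-based), and features whose index falls outside the trained range are simply dropped, since the model has no parameters for them. Retraining with a larger feature count is the alternative when those features matter.

```scala
object LineToSparse {
  // Parses "509:2 5102:4" into sorted zero-based indices and their values,
  // keeping only the indices that fit a model trained with numFeatures features.
  def parseFeatures(text: String, numFeatures: Int): (Array[Int], Array[Double]) = {
    val tokens = text.trim.split("\\s+").filter(_.nonEmpty)
    val pairs = tokens.map { token =>
      val Array(index, value) = token.split(":")
      (index.toInt - 1, value.toDouble) // one-based -> zero-based
    }
    // Assumption: drop features the model never saw during training.
    val kept = pairs
      .filter { case (index, _) => index >= 0 && index < numFeatures }
      .sortBy(_._1)
    (kept.map(_._1), kept.map(_._2))
  }
}
```

With Spark on the classpath, the two arrays can be passed to Vectors.sparse(numFeatures, indices, values) to build the "features" column for transform().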
Re: how to construct parameter for model.transform() from datafile
Posted by jinhong lu <lu...@gmail.com>.
Can anyone help?
> 在 2017年3月13日,19:38,jinhong lu <lu...@gmail.com> 写道:
>
> After train the mode, I got the result look like this:
>
>
> scala> predictionResult.show()
> +-----+--------------------+--------------------+--------------------+----------+
> |label| features| rawPrediction| probability|prediction|
> +-----+--------------------+--------------------+--------------------+----------+
> | 0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...| 0.0|
> | 0.0|(144109,[100],[2.0])|[-12.246737725034...|[0.96061209556737...| 0.0|
> | 0.0|(144109,[100],[24...|[-146.81612388602...|[9.73704654529197...| 1.0|
>
> Then I tried to transform() the data with this code:
>
> import org.apache.spark.ml.linalg.Vectors
> import org.apache.spark.ml.linalg.Vector
> import scala.collection.mutable
>
> def lineToVector(line:String ):Vector={
>   val seq = new mutable.Queue[(Int,Double)]
>   val content = line.split(" ");
>   for( s <- content){
>     val index = s.split(":")(0).toInt
>     val value = s.split(":")(1).toDouble
>     seq += ((index,value))
>   }
>   return Vectors.sparse(144109, seq)
> }
>
> val df = sc.sequenceFile[org.apache.hadoop.io.LongWritable, org.apache.hadoop.io.Text]("/data/gamein/gameall_sdc/wh/gameall.db/edt_udid_label_format/ds=20170312/001006_0").map(line=>line._2).map(line => (line.toString.split("\t")(0),lineToVector(line.toString.split("\t")(1)))).toDF("udid", "features")
> val predictionResult = model.transform(df)
> predictionResult.show()
>
>
> But I got an error that looks like this:
>
> Caused by: java.lang.IllegalArgumentException: requirement failed: You may not write an element to index 804201 because the declared size of your vector is 144109
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.linalg.Vectors$.sparse(Vectors.scala:219)
> at lineToVector(<console>:55)
> at $anonfun$4.apply(<console>:50)
> at $anonfun$4.apply(<console>:50)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
> at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:84)
> at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
> at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:246)
> at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
> at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
>
> So I changed
>
> return Vectors.sparse(144109, seq)
>
> to
>
> return Vectors.sparse(804202, seq)
>
> Then another error occurred:
>
> Caused by: java.lang.IllegalArgumentException: requirement failed: The columns of A don't match the number of elements of x. A: 144109, x: 804202
> at scala.Predef$.require(Predef.scala:224)
> at org.apache.spark.ml.linalg.BLAS$.gemv(BLAS.scala:521)
> at org.apache.spark.ml.linalg.Matrix$class.multiply(Matrices.scala:110)
> at org.apache.spark.ml.linalg.DenseMatrix.multiply(Matrices.scala:176)
>
> What should I do?
>> On 2017-03-13, at 16:31, jinhong lu <lu...@gmail.com> wrote:
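A note on the two errors above: taken together they suggest that every vector passed to model.transform() must have exactly the size the model was trained with (144109 here), and that no index may reach or exceed that size. The sketch below is one possible way to build such input, not a confirmed fix from this thread. parseFeatures is a hypothetical helper name. It assumes the indices in the prediction file are 1-based, like the LIBSVM training file (the libsvm reader converts them to 0-based when loading), and it simply drops features the model never saw during training, such as 804201.

```scala
// Hypothetical helper (not part of Spark): turn "idx:val idx:val ..." into
// (index, value) pairs that fit a vector with `numFeatures` entries.
// Assumption: indices in the file are 1-based, so they are shifted by -1
// to line up with the 0-based indices the libsvm reader produces.
def parseFeatures(line: String, numFeatures: Int): Seq[(Int, Double)] =
  line.trim.split("\\s+").toSeq
    .filter(_.nonEmpty)                                     // tolerate empty input
    .map { token =>
      val Array(i, v) = token.split(":", 2)
      (i.toInt - 1, v.toDouble)
    }
    .filter { case (i, _) => i >= 0 && i < numFeatures }   // drop unseen features
    .sortBy(_._1)                                           // ascending index order
```

Inside the existing map, the result could then be passed as Vectors.sparse(model.numFeatures, parseFeatures(text, model.numFeatures)), so the declared size always matches the model instead of being hard-coded. A row whose features are all unseen (for example "800201:1") becomes an empty vector.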
>>
>> Hi, all:
>>
>> I have this training data:
>>
>> 0 31607:17
>> 0 111905:36
>> 0 109:3 506:41 1509:1 2106:4 5309:1 7209:5 8406:1 27108:1 27709:1 30209:8 36109:20 41408:1 42309:1 46509:1 47709:5 57809:1 58009:1 58709:2 112109:4 123305:48 142509:1
>> 0 407:14 2905:2 5209:2 6509:2 6909:2 14509:2 18507:10
>> 0 604:3 3505:9 6401:3 6503:2 6505:3 7809:8 10509:3 12109:3 15207:19 31607:19
>> 0 19109:7 29705:4 123305:32
>> 0 15309:1 43005:1 108509:1
>> 1 604:1 6401:1 6503:1 15207:4 31607:40
>> 0 1807:19
>> 0 301:14 501:1 1502:14 2507:12 123305:4
>> 0 607:14 19109:460 123305:448
>> 0 5406:14 7209:4 10509:3 19109:6 24706:10 26106:4 31409:1 123305:48 128209:1
>> 1 1606:1 2306:3 3905:19 4408:3 4506:8 8707:3 19109:50 24809:1 26509:2 27709:2 56509:8 122705:62 123305:31 124005:2
>>
>> Then I trained the model with Spark:
>>
>> import org.apache.spark.ml.classification.NaiveBayes
>> import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
>> import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
>> import org.apache.spark.sql.SparkSession
>>
>> val spark = SparkSession.builder.appName("NaiveBayesExample").getOrCreate()
>> val data = spark.read.format("libsvm").load("/tmp/ljhn1829/aplus/training_data3")
>> val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3), seed = 1234L)
>> //val model = new NaiveBayes().fit(trainingData)
>> val model = new NaiveBayes().setThresholds(Array(10.0,1.0)).fit(trainingData)
>> val predictions = model.transform(testData)
>> predictions.show()
>>
>>
>> OK, I have got my model with the code above, but how can I use this model to predict the classification of other data like this:
>>
>> ID1 509:2 5102:4 25909:1 31709:4 121905:19
>> ID2 800201:1
>> ID3 116005:4
>> ID4 800201:1
>> ID5 19109:1 21708:1 23208:1 49809:1 88609:1
>> ID6 800201:1
>> ID7 43505:7 106405:7
>>
>> I know I can use the transform() method, but how do I construct the parameter for the transform() method?
>>
>>
>>
>>
>>
>> Thanks,
>> lujinhong
>>
>
> Thanks,
> lujinhong
>
Thanks,
lujinhong
---------------------------------------------------------------------
To unsubscribe e-mail: dev-unsubscribe@spark.apache.org