You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@spark.apache.org by "Igor L." <ta...@gmail.com> on 2016/02/18 10:28:23 UTC

How to train and predict in parallel via Spark MLlib?

Good day, Spark team!
I have to solve regression problem for different restricitons. There is a
bunch of criteria and rules for them, I have to build model and make
predictions for each, combine all and save.
So, now my solution looks like:
    
    criteria2Rules: List[(String, Set[String])]
    var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
    criteria2Rules.foreach {
      case (criterion, rules) =>
        val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion,
data)
        val model: GradientBoostedTreesModel = buildModel(trainDataSet)
        val predictionDataSet = preparePredictionDataSet(criterion, data)
        val predictedScores = predictScores(predictionDataSet, model,
criterion, rules)
        result = result.union(predictedScores)
    }

It works almost nice, but too slow for the reason GradientBoostedTreesModel
training not so fast, especially in case of big amount of features, samples
and also quite big list of using criteria. 
I suppose it could work better, if Spark will train models and make
predictions in parallel.

I've tried to use a relational way of data operation:

    val criteria2RulesRdd: RDD[(String, Set[String])]
    
    val cartesianCriteriaRules2DataRdd =
criteria2RulesRdd.cartesian(dataRdd)
    cartesianCriteriaRules2DataRdd
      .aggregateByKey(List[Data]())(
        { case (lst, tuple) => lst :+ tuple }, { case (lstL, lstR) => lstL
::: lstR}
      )
      .map {
        case (criteria, rulesSet, scorePredictionDataList) =>
          val trainSet = ???
          val model = ???
          val predictionSet = ???
       	  val predictedScores = ???
      }
      ...

but it inevitably brings to situation when one RDD is produced inside
another RDD (GradientBoostedTreesModel is trained on RDD[LabeledPoint]) and
as far as I know it's a bad scenario, e.g.
toy example below doesn't work:
scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_
* 2).collect)).collect.

Is there any way to use Spark MLlib in parallel way?

Thank u for attention!

--
BR,
Junior Scala/Python Developer
Igor L.



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
For additional commands, e-mail: user-help@spark.apache.org

Re: How to train and predict in parallel via Spark MLlib?

Posted by Xiangrui Meng <me...@gmail.com>.

I put a simple example here:
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/1233855/3877825096667927/588180/d9d264e39a.html

On Thu, Feb 18, 2016 at 6:47 AM Игорь Ляхов <ta...@gmail.com> wrote:

> Xiangrui, thnx for your answer!
> Could you clarify some details?
> What do you mean "I can trigger training jobs in different threads on the
> driver"? I have 4-machine cluster (It will grow in future), and I wish
> use them in parallel for training and predicting.
> Do you have any example? It will be great if you show me anyone.
>
> Thanks a lot for your participation!
> --Igor
>
> 2016-02-18 17:24 GMT+03:00 Xiangrui Meng <me...@gmail.com>:
>
>> If you have a big cluster, you can trigger training jobs in different
>> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
>>
>> On Thu, Feb 18, 2016, 4:28 AM Igor L. <ta...@gmail.com> wrote:
>>
>>> Good day, Spark team!
>>> I have to solve regression problem for different restricitons. There is a
>>> bunch of criteria and rules for them, I have to build model and make
>>> predictions for each, combine all and save.
>>> So, now my solution looks like:
>>>
>>>     criteria2Rules: List[(String, Set[String])]
>>>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>>>     criteria2Rules.foreach {
>>>       case (criterion, rules) =>
>>>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion,
>>> data)
>>>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>>>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>>>         val predictedScores = predictScores(predictionDataSet, model,
>>> criterion, rules)
>>>         result = result.union(predictedScores)
>>>     }
>>>
>>> It works almost nice, but too slow for the reason
>>> GradientBoostedTreesModel
>>> training not so fast, especially in case of big amount of features,
>>> samples
>>> and also quite big list of using criteria.
>>> I suppose it could work better, if Spark will train models and make
>>> predictions in parallel.
>>>
>>> I've tried to use a relational way of data operation:
>>>
>>>     val criteria2RulesRdd: RDD[(String, Set[String])]
>>>
>>>     val cartesianCriteriaRules2DataRdd =
>>> criteria2RulesRdd.cartesian(dataRdd)
>>>     cartesianCriteriaRules2DataRdd
>>>       .aggregateByKey(List[Data]())(
>>>         { case (lst, tuple) => lst :+ tuple }, { case (lstL, lstR) =>
>>> lstL
>>> ::: lstR}
>>>       )
>>>       .map {
>>>         case (criteria, rulesSet, scorePredictionDataList) =>
>>>           val trainSet = ???
>>>           val model = ???
>>>           val predictionSet = ???
>>>           val predictedScores = ???
>>>       }
>>>       ...
>>>
>>> but it inevitably brings to situation when one RDD is produced inside
>>> another RDD (GradientBoostedTreesModel is trained on RDD[LabeledPoint])
>>> and
>>> as far as I know it's a bad scenario, e.g.
>>> toy example below doesn't work:
>>> scala> sc.parallelize(1 to 100).map(x => (x,
>>> sc.parallelize(Array(2)).map(_
>>> * 2).collect)).collect.
>>>
>>> Is there any way to use Spark MLlib in parallel way?
>>>
>>> Thank u for attention!
>>>
>>> --
>>> BR,
>>> Junior Scala/Python Developer
>>> Igor L.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>>> For additional commands, e-mail: user-help@spark.apache.org
>>>
>>>
>
>
> --
> --
> С уважением,
> Игорь Ляхов
>

Re: How to train and predict in parallel via Spark MLlib?

Posted by Игорь Ляхов <ta...@gmail.com>.

Xiangrui, thnx for your answer!
Could you clarify some details?
What do you mean "I can trigger training jobs in different threads on the
driver"? I have 4-machine cluster (It will grow in future), and I wish use
them in parallel for training and predicting.
Do you have any example? It will be great if you show me anyone.

Thanks a lot for your participation!
--Igor

2016-02-18 17:24 GMT+03:00 Xiangrui Meng <me...@gmail.com>:

> If you have a big cluster, you can trigger training jobs in different
> threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui
>
> On Thu, Feb 18, 2016, 4:28 AM Igor L. <ta...@gmail.com> wrote:
>
>> Good day, Spark team!
>> I have to solve regression problem for different restricitons. There is a
>> bunch of criteria and rules for them, I have to build model and make
>> predictions for each, combine all and save.
>> So, now my solution looks like:
>>
>>     criteria2Rules: List[(String, Set[String])]
>>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>>     criteria2Rules.foreach {
>>       case (criterion, rules) =>
>>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion,
>> data)
>>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>>         val predictedScores = predictScores(predictionDataSet, model,
>> criterion, rules)
>>         result = result.union(predictedScores)
>>     }
>>
>> It works almost nice, but too slow for the reason
>> GradientBoostedTreesModel
>> training not so fast, especially in case of big amount of features,
>> samples
>> and also quite big list of using criteria.
>> I suppose it could work better, if Spark will train models and make
>> predictions in parallel.
>>
>> I've tried to use a relational way of data operation:
>>
>>     val criteria2RulesRdd: RDD[(String, Set[String])]
>>
>>     val cartesianCriteriaRules2DataRdd =
>> criteria2RulesRdd.cartesian(dataRdd)
>>     cartesianCriteriaRules2DataRdd
>>       .aggregateByKey(List[Data]())(
>>         { case (lst, tuple) => lst :+ tuple }, { case (lstL, lstR) => lstL
>> ::: lstR}
>>       )
>>       .map {
>>         case (criteria, rulesSet, scorePredictionDataList) =>
>>           val trainSet = ???
>>           val model = ???
>>           val predictionSet = ???
>>           val predictedScores = ???
>>       }
>>       ...
>>
>> but it inevitably brings to situation when one RDD is produced inside
>> another RDD (GradientBoostedTreesModel is trained on RDD[LabeledPoint])
>> and
>> as far as I know it's a bad scenario, e.g.
>> toy example below doesn't work:
>> scala> sc.parallelize(1 to 100).map(x => (x,
>> sc.parallelize(Array(2)).map(_
>> * 2).collect)).collect.
>>
>> Is there any way to use Spark MLlib in parallel way?
>>
>> Thank u for attention!
>>
>> --
>> BR,
>> Junior Scala/Python Developer
>> Igor L.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-help@spark.apache.org
>>
>>


-- 
--
С уважением,
Игорь Ляхов

Re: How to train and predict in parallel via Spark MLlib?

Posted by Xiangrui Meng <me...@gmail.com>.

If you have a big cluster, you can trigger training jobs in different
threads on the driver. Putting RDDs inside an RDD won't work. -Xiangrui

On Thu, Feb 18, 2016, 4:28 AM Igor L. <ta...@gmail.com> wrote:

> Good day, Spark team!
> I have to solve regression problem for different restricitons. There is a
> bunch of criteria and rules for them, I have to build model and make
> predictions for each, combine all and save.
> So, now my solution looks like:
>
>     criteria2Rules: List[(String, Set[String])]
>     var result: RDD[(Id, Double)] = sc.parallelize(Array[(Id, Double)]())
>     criteria2Rules.foreach {
>       case (criterion, rules) =>
>         val trainDataSet: RDD[LabeledPoint] = prepareTrainSet(criterion,
> data)
>         val model: GradientBoostedTreesModel = buildModel(trainDataSet)
>         val predictionDataSet = preparePredictionDataSet(criterion, data)
>         val predictedScores = predictScores(predictionDataSet, model,
> criterion, rules)
>         result = result.union(predictedScores)
>     }
>
> It works almost nice, but too slow for the reason GradientBoostedTreesModel
> training not so fast, especially in case of big amount of features, samples
> and also quite big list of using criteria.
> I suppose it could work better, if Spark will train models and make
> predictions in parallel.
>
> I've tried to use a relational way of data operation:
>
>     val criteria2RulesRdd: RDD[(String, Set[String])]
>
>     val cartesianCriteriaRules2DataRdd =
> criteria2RulesRdd.cartesian(dataRdd)
>     cartesianCriteriaRules2DataRdd
>       .aggregateByKey(List[Data]())(
>         { case (lst, tuple) => lst :+ tuple }, { case (lstL, lstR) => lstL
> ::: lstR}
>       )
>       .map {
>         case (criteria, rulesSet, scorePredictionDataList) =>
>           val trainSet = ???
>           val model = ???
>           val predictionSet = ???
>           val predictedScores = ???
>       }
>       ...
>
> but it inevitably brings to situation when one RDD is produced inside
> another RDD (GradientBoostedTreesModel is trained on RDD[LabeledPoint]) and
> as far as I know it's a bad scenario, e.g.
> toy example below doesn't work:
> scala> sc.parallelize(1 to 100).map(x => (x, sc.parallelize(Array(2)).map(_
> * 2).collect)).collect.
>
> Is there any way to use Spark MLlib in parallel way?
>
> Thank u for attention!
>
> --
> BR,
> Junior Scala/Python Developer
> Igor L.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-train-and-predict-in-parallel-via-Spark-MLlib-tp26261.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
> For additional commands, e-mail: user-help@spark.apache.org
>
>