Posted to user@spark.apache.org by Jaonary Rabarisoa <ja...@gmail.com> on 2014/03/21 14:32:34 UTC

N-Fold validation and RDD partitions

Hi

I need to partition my data, represented as an RDD, into n folds, run
metrics computation on each fold, and finally compute the mean of my
metrics over all the folds.
Can Spark do this data partitioning out of the box, or do I need to
implement it myself? I know that RDD has a partitions method and
mapPartitions, but I really don't understand the purpose and meaning of
"partition" here.



Cheers,

Jaonary
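
For reference: an RDD's "partitions" are the physical chunks Spark splits the
data into for parallel execution, not statistical folds. A minimal sketch,
assuming a SparkContext named sc, of what the partitions method and
mapPartitions actually expose:

    // An RDD created with 4 partitions; Spark runs one task per partition.
    val rdd = sc.parallelize(1 to 100, 4)
    println(rdd.partitions.length)   // prints 4

    // mapPartitions applies a function to each partition's iterator as a whole,
    // e.g. producing one partial sum per physical partition.
    val partialSums = rdd.mapPartitions(iter => Iterator(iter.sum)).collect()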

Re: N-Fold validation and RDD partitions

Posted by Walrus theCat <wa...@gmail.com>.
If someone wanted or needed to implement this themselves, are partitions the
correct way to go? Any tips on how to get started (say, dividing an RDD
into 5 parts)?
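
One way to get started, sketched below under the assumption that rdd is the
input RDD (numFolds and the other names are illustrative only): tag each
element with a fold id, then filter once per fold.

    import org.apache.spark.SparkContext._   // pair-RDD operations in Spark 0.9
    import scala.util.Random

    val numFolds = 5
    // Assign a random fold id to every element; cache so the random assignment
    // is not recomputed (and re-drawn) for each of the filters below.
    val keyed = rdd.map(x => (Random.nextInt(numFolds), x)).cache()
    val folds = (0 until numFolds).map(i => keyed.filter(_._1 == i).values)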



On Fri, Mar 21, 2014 at 9:51 AM, Jaonary Rabarisoa <ja...@gmail.com> wrote:

> Thank you, Hai-Anh. Are the files CrossValidation.scala and
> RandomSplitRDD.scala enough to use it? I'm currently using Spark 0.9.0 and
> would like to avoid rebuilding everything.
>
>
>
>
> On Fri, Mar 21, 2014 at 4:58 PM, Hai-Anh Trinh <ah...@adatao.com> wrote:
>
>> Hi Jaonary,
>>
>> You can find the code for k-fold CV in
>> https://github.com/apache/incubator-spark/pull/448. I have not found the
>> time to resubmit the pull request against the latest master.
>>
>>
>> On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani <sanjay_awat@yahoo.com
>> > wrote:
>>
>>> Hi Jaonary,
>>>
>>> I believe the n folds should be mapped to n keys in Spark using a map
>>> function. You can then reduce the returned PairRDD to get your metric.
>>> I don't understand partitions fully, but from what I understand of them,
>>> they aren't required in your scenario.
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <ja...@gmail.com>
>>> wrote:
>>>   Hi
>>>
>>> I need to partition my data, represented as an RDD, into n folds, run
>>> metrics computation on each fold, and finally compute the mean of my
>>> metrics over all the folds.
>>> Can Spark do this data partitioning out of the box, or do I need to
>>> implement it myself? I know that RDD has a partitions method and
>>> mapPartitions, but I really don't understand the purpose and meaning of
>>> "partition" here.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>>
>>>
>>
>>
>>  --
>> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
>> http://www.linkedin.com/in/haianh
>>
>>
>

Re: N-Fold validation and RDD partitions

Posted by Jaonary Rabarisoa <ja...@gmail.com>.
Thank you, Hai-Anh. Are the files CrossValidation.scala and
RandomSplitRDD.scala enough to use it? I'm currently using Spark 0.9.0 and
would like to avoid rebuilding everything.




On Fri, Mar 21, 2014 at 4:58 PM, Hai-Anh Trinh <ah...@adatao.com> wrote:

> Hi Jaonary,
>
> You can find the code for k-fold CV in
> https://github.com/apache/incubator-spark/pull/448. I have not found the
> time to resubmit the pull request against the latest master.
>
>
On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani <sa...@yahoo.com> wrote:
>
>> Hi Jaonary,
>>
>> I believe the n folds should be mapped to n keys in Spark using a map
>> function. You can then reduce the returned PairRDD to get your metric.
>> I don't understand partitions fully, but from what I understand of them,
>> they aren't required in your scenario.
>>
>> Regards,
>> Sanjay
>>
>>
>>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <ja...@gmail.com>
>> wrote:
>>   Hi
>>
>> I need to partition my data, represented as an RDD, into n folds, run
>> metrics computation on each fold, and finally compute the mean of my
>> metrics over all the folds.
>> Can Spark do this data partitioning out of the box, or do I need to
>> implement it myself? I know that RDD has a partitions method and
>> mapPartitions, but I really don't understand the purpose and meaning of
>> "partition" here.
>>
>>
>>
>> Cheers,
>>
>> Jaonary
>>
>>
>>
>
>
> --
> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
> http://www.linkedin.com/in/haianh
>
>

Re: N-Fold validation and RDD partitions

Posted by Jaonary Rabarisoa <ja...@gmail.com>.
There is also a "randomSplit" method in the latest version of spark
https://github.com/apache/incubator-spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala
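
A rough sketch of how randomSplit could be used for k-fold evaluation,
assuming sc is the SparkContext, data is the input RDD, and evaluate(...) is
just a placeholder for whatever model training / metric computation applies:

    val k = 5
    // Split into k roughly equal, randomly assigned parts.
    val folds = data.randomSplit(Array.fill(k)(1.0 / k), seed = 11L)

    val metrics = (0 until k).map { i =>
      val test  = folds(i)
      val train = sc.union(folds.indices.filter(_ != i).map(folds(_)))
      evaluate(train, test)   // placeholder returning a Double metric
    }
    val meanMetric = metrics.sum / metrics.size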


On Tue, Mar 25, 2014 at 1:21 AM, Holden Karau <ho...@pigscanfly.ca> wrote:

> There is also https://github.com/apache/spark/pull/18 against the current
> repo, which may be easier to apply.
>
>
> On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh <ah...@adatao.com> wrote:
>
>> Hi Jaonary,
>>
>> You can find the code for k-fold CV in
>> https://github.com/apache/incubator-spark/pull/448. I have not found the
>> time to resubmit the pull request against the latest master.
>>
>>
>> On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani <sanjay_awat@yahoo.com
>> > wrote:
>>
>>> Hi Jaonary,
>>>
>>> I believe the n folds should be mapped to n keys in Spark using a map
>>> function. You can then reduce the returned PairRDD to get your metric.
>>> I don't understand partitions fully, but from what I understand of them,
>>> they aren't required in your scenario.
>>>
>>> Regards,
>>> Sanjay
>>>
>>>
>>>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <ja...@gmail.com>
>>> wrote:
>>>   Hi
>>>
>>> I need to partition my data, represented as an RDD, into n folds, run
>>> metrics computation on each fold, and finally compute the mean of my
>>> metrics over all the folds.
>>> Can Spark do this data partitioning out of the box, or do I need to
>>> implement it myself? I know that RDD has a partitions method and
>>> mapPartitions, but I really don't understand the purpose and meaning of
>>> "partition" here.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Jaonary
>>>
>>>
>>>
>>
>>
>>  --
>> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
>> http://www.linkedin.com/in/haianh
>>
>>
>
>
> --
> Cell : 425-233-8271
>

Re: N-Fold validation and RDD partitions

Posted by Holden Karau <ho...@pigscanfly.ca>.
There is also https://github.com/apache/spark/pull/18 against the current
repo, which may be easier to apply.


On Fri, Mar 21, 2014 at 8:58 AM, Hai-Anh Trinh <ah...@adatao.com> wrote:

> Hi Jaonary,
>
> You can find the code for k-fold CV in
> https://github.com/apache/incubator-spark/pull/448. I have not found the
> time to resubmit the pull request against the latest master.
>
>
On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani <sa...@yahoo.com> wrote:
>
>> Hi Jaonary,
>>
>> I believe the n folds should be mapped to n keys in Spark using a map
>> function. You can then reduce the returned PairRDD to get your metric.
>> I don't understand partitions fully, but from what I understand of them,
>> they aren't required in your scenario.
>>
>> Regards,
>> Sanjay
>>
>>
>>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <ja...@gmail.com>
>> wrote:
>>   Hi
>>
>> I need to partition my data, represented as an RDD, into n folds, run
>> metrics computation on each fold, and finally compute the mean of my
>> metrics over all the folds.
>> Can Spark do this data partitioning out of the box, or do I need to
>> implement it myself? I know that RDD has a partitions method and
>> mapPartitions, but I really don't understand the purpose and meaning of
>> "partition" here.
>>
>>
>>
>> Cheers,
>>
>> Jaonary
>>
>>
>>
>
>
> --
> Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
> http://www.linkedin.com/in/haianh
>
>


-- 
Cell : 425-233-8271

Re: N-Fold validation and RDD partitions

Posted by Hai-Anh Trinh <ah...@adatao.com>.
Hi Jaonary,

You can find the code for k-fold CV in
https://github.com/apache/incubator-spark/pull/448. I have not found the
time to resubmit the pull request against the latest master.


On Fri, Mar 21, 2014 at 8:46 PM, Sanjay Awatramani <sa...@yahoo.com> wrote:

> Hi Jaonary,
>
> I believe the n folds should be mapped to n keys in Spark using a map
> function. You can then reduce the returned PairRDD to get your metric.
> I don't understand partitions fully, but from what I understand of them,
> they aren't required in your scenario.
>
> Regards,
> Sanjay
>
>
>   On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <ja...@gmail.com>
> wrote:
>   Hi
>
> I need to partition my data, represented as an RDD, into n folds, run
> metrics computation on each fold, and finally compute the mean of my
> metrics over all the folds.
> Can Spark do this data partitioning out of the box, or do I need to
> implement it myself? I know that RDD has a partitions method and
> mapPartitions, but I really don't understand the purpose and meaning of
> "partition" here.
>
>
>
> Cheers,
>
> Jaonary
>
>
>


-- 
Hai-Anh Trinh | Senior Software Engineer | http://adatao.com/
http://www.linkedin.com/in/haianh

Re: N-Fold validation and RDD partitions

Posted by Sanjay Awatramani <sa...@yahoo.com>.
Hi Jaonary,

I believe the n folds should be mapped to n keys in Spark using a map function. You can then reduce the returned PairRDD to get your metric.
I don't understand partitions fully, but from what I understand of them, they aren't required in your scenario.

Regards,
Sanjay
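
To make the keyed idea a bit more concrete, a rough, untested sketch in which
score(...) is a placeholder for the per-record metric and data stands for the
input RDD (all names here are illustrative):

    import org.apache.spark.SparkContext._   // pair-RDD and double-RDD operations
    import scala.util.Random

    val n = 5
    val keyed = data.map(x => (Random.nextInt(n), x))             // (foldId, record)
    val perFoldMean = keyed
      .mapValues(x => (score(x), 1L))                             // (metric, count)
      .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }
      .mapValues { case (s, c) => s / c }                         // mean metric per fold
    val meanOverFolds = perFoldMean.values.mean()                 // average over the n folds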



On Friday, 21 March 2014 7:03 PM, Jaonary Rabarisoa <ja...@gmail.com> wrote:
 
Hi

I need to partition my data, represented as an RDD, into n folds, run metrics computation on each fold, and finally compute the mean of my metrics over all the folds.
Can Spark do this data partitioning out of the box, or do I need to implement it myself? I know that RDD has a partitions method and mapPartitions, but I really don't understand the purpose and meaning of "partition" here.



Cheers,

Jaonary