Posted to user@spark.apache.org by RK Aduri <rk...@collectivei.com> on 2016/07/20 18:32:01 UTC

MultiThreading in Spark 1.6.0

Spark version: 1.6.0 
So, here is the background:

	I have a data frame (Large_Row_DataFrame) that I created from an
array of Row objects, and I also have an array of unique ids (U_ID) that I
use to look up rows in the cached Large_Row_DataFrame and apply a
customized function to them.
	For each unique id I run a collect on the cached dataframe
Large_Row_DataFrame, so Spark has to run a separate ‘collect’ action per
id. Since I execute this in a loop over the unique ids (U_ID), all of these
collect actions run sequentially.
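
In outline, the sequential version looks something like the sketch below. The
schema, ids, and per-id work here are only placeholders (and it assumes the
usual spark-shell style sc / sqlContext), not the real code:

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

// Sketch only: the schema, ids and per-id work stand in for the real ones.
val schema = StructType(Array(
  StructField("u_id", StringType),
  StructField("value", IntegerType)))
val rows = Seq(Row("a", 1), Row("a", 2), Row("b", 3))
val largeRowDF = sqlContext.createDataFrame(sc.parallelize(rows), schema).cache()

val uniqueIds = Array("a", "b")

// Sequential approach: one Spark job (one collect) per unique id.
val results = uniqueIds.map { uid =>
  val matched = largeRowDF.filter(largeRowDF("u_id") === uid).collect()
  // the customized per-id function would run here on the driver
  (uid, matched.length)
}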

Solution that I implemented:

To avoid waiting for each collect in turn, I split the unique ids into a few
subsets of a fixed size and run one thread per subset. Each thread submits
its Spark jobs (the collects) in sequence, but only for its own subset, so
there are as many threads as subsets. Somewhat surprisingly, the resulting
run time is better than the earlier sequential approach.
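
A rough sketch of the threaded version is below (pool size and subset size are
placeholders; it reuses largeRowDF and uniqueIds from the sketch above). Jobs
submitted from separate threads share the same SparkContext and can run
concurrently, which is where the speed-up comes from:

import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

val subsetSize = 1                                  // placeholder subset size
val threadPool = Executors.newFixedThreadPool(4)    // placeholder pool size
implicit val ec = ExecutionContext.fromExecutorService(threadPool)

// One future (thread) per subset: each thread runs its collects sequentially,
// but the subsets proceed in parallel within the one SparkContext.
val futures = uniqueIds.grouped(subsetSize).toSeq.map { subset =>
  Future {
    subset.map { uid =>
      val matched = largeRowDF.filter(largeRowDF("u_id") === uid).collect()
      (uid, matched.length)
    }
  }
}

val allResults = futures.flatMap(f => Await.result(f, Duration.Inf))
threadPool.shutdown()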

Now the question:

	Is multithreading the right approach here, or is there a better way
of doing this?



--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/MultiThreading-in-Spark-1-6-0-tp27374.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.



Re: MultiThreading in Spark 1.6.0

Posted by RK Aduri <rk...@collectivei.com>.
Thanks for the idea, Maciej. The data is roughly 10 GB.

I’m wondering if there is any way to avoid the collect for each unit operation and somehow capture all the resulting arrays and collect them at once.

> On Jul 20, 2016, at 2:52 PM, Maciej Bryński <ma...@brynski.pl> wrote:
> 
> RK Aduri,
> Another idea is to union all the results and then run a single collect.
> The question is how big the collected data is.
> 
> -- 
> Maciek Bryński




Re: MultiThreading in Spark 1.6.0

Posted by Maciej Bryński <ma...@brynski.pl>.
RK Aduri,
Another idea is to union all the results and then run a single collect.
The question is how big the collected data is.
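
Roughly like the sketch below (it reuses the largeRowDF / uniqueIds names from
your sketch, and assumes the per-id work can stay as DataFrame transformations
such as filter rather than needing the rows on the driver):

// Build one lazy DataFrame per id, union them all, then collect once.
val perId = uniqueIds.map { uid =>
  largeRowDF.filter(largeRowDF("u_id") === uid)    // or any per-id transformation
}

val combined = perId.reduce(_ unionAll _)          // unionAll is the 1.6 DataFrame union
val allRows = combined.collect()                   // one job, one collect

With very many ids the chained union plan itself gets large, and if most of the
data survives the per-id step a single collect on the driver may not be feasible
at all, which is why the size matters.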

-- 
Maciek Bryński

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscribe@spark.apache.org