Posted to user@spark.apache.org by Brandon White <bw...@gmail.com> on 2015/07/08 02:04:41 UTC

Parallelizing multiple RDD / DataFrame creation in Spark

Say I have a Spark job that looks like the following:

def loadTable1(): Unit = {
  val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
  table1.cache().registerTempTable("table1")
}

def loadTable2(): Unit = {
  val table2 = sqlContext.jsonFile("s3://testfiledirectory2/")
  table2.cache().registerTempTable("table2")
}

def loadAllTables(): Unit = {
  loadTable1()
  loadTable2()
}

loadAllTables()

How do I parallelize this Spark job so that both tables are created at the
same time or in parallel?

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by ayan guha <gu...@gmail.com>.
Do you have a benchmark showing that running these two statements as-is will
be slower than what you suggest?
On 9 Jul 2015 01:06, "Brandon White" <bw...@gmail.com> wrote:

> The point of running them in parallel would be faster creation of the
> tables. Has anybody been able to efficiently parallelize something like
> this in Spark?

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by Srikanth <sr...@gmail.com>.
Your loadTable() functions don't perform any actions. The files will be read
fully only when an action is performed.
If the action is something like table1.join(table2), then I think both
files will be read in parallel.
Can you try that and look at the execution plan? In 1.4 this is shown in the
Spark UI.
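For example, a sketch against the 1.3/1.4-era SQLContext API, using the paths from the original post; "id" is a hypothetical join column, not something from the original code:

```scala
// jsonFile returns a DataFrame; the full read is deferred until an action
// (although schema inference may still scan the input once).
val table1 = sqlContext.jsonFile("s3://textfiledirectory/")
val table2 = sqlContext.jsonFile("s3://testfiledirectory2/")

// Only an action forces the scans. "id" is a hypothetical join column.
val joined = table1.join(table2, table1("id") === table2("id"))
joined.explain() // inspect the physical plan before running anything
joined.count()   // the action that actually triggers both reads
```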

Srikanth

On Wed, Jul 8, 2015 at 11:06 AM, Brandon White <bw...@gmail.com>
wrote:

> The point of running them in parallel would be faster creation of the
> tables. Has anybody been able to efficiently parallelize something like
> this in Spark?

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by Brandon White <bw...@gmail.com>.
The point of running them in parallel would be faster creation of the
tables. Has anybody been able to efficiently parallelize something like
this in Spark?
On Jul 8, 2015 12:29 AM, "Akhil Das" <ak...@sigmoidanalytics.com> wrote:

> What's the point of creating them in parallel? You can multi-thread it and
> run it in parallel, though.
>
> Thanks
> Best Regards

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by Ashish Dutt <as...@gmail.com>.
Thank you, Akhil, for the link.


Sincerely,
Ashish Dutt
PhD Candidate
Department of Information Systems
University of Malaya, Lembah Pantai,
50603 Kuala Lumpur, Malaysia

On Wed, Jul 8, 2015 at 3:43 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> Have a look at
> http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala
> and create two threads, then call thread1.start() and thread2.start().
>
> Thanks
> Best Regards

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
Have a look at
http://alvinalexander.com/scala/how-to-create-java-thread-runnable-in-scala
and create two threads, then call thread1.start() and thread2.start().
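A minimal sketch of that approach; these loadTable functions are hypothetical stand-ins for the ones in the original post (the real ones would call jsonFile and registerTempTable):

```scala
import java.util.concurrent.ConcurrentLinkedQueue

// Hypothetical stand-ins for loadTable1()/loadTable2() from the original post.
val loaded = new ConcurrentLinkedQueue[String]()
def loadTable1(): Unit = loaded.add("table1")
def loadTable2(): Unit = loaded.add("table2")

// One JVM thread per loader, as in the linked article.
val t1 = new Thread(new Runnable { def run(): Unit = loadTable1() })
val t2 = new Thread(new Runnable { def run(): Unit = loadTable2() })
t1.start(); t2.start()
// Block until both loaders have finished before querying the tables.
t1.join(); t2.join()
```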

Thanks
Best Regards

On Wed, Jul 8, 2015 at 1:06 PM, Ashish Dutt <as...@gmail.com> wrote:

> Thanks for your reply Akhil.
> How do you multithread it?
>
> Sincerely,
> Ashish Dutt

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by Ashish Dutt <as...@gmail.com>.
Thanks for your reply Akhil.
How do you multithread it?

Sincerely,
Ashish Dutt


On Wed, Jul 8, 2015 at 3:29 PM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> What's the point of creating them in parallel? You can multi-thread it and
> run it in parallel, though.
>
> Thanks
> Best Regards

Re: Parallelizing multiple RDD / DataFrame creation in Spark

Posted by Akhil Das <ak...@sigmoidanalytics.com>.
What's the point of creating them in parallel? You can multi-thread it and
run it in parallel, though.
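One way to do that multi-threading is with Scala Futures rather than raw threads. A sketch, where these loadTable functions are hypothetical stand-ins that just return table names rather than the real Spark loaders:

```scala
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration._

// Hypothetical stand-ins for loadTable1()/loadTable2() from the original post.
def loadTable1(): String = "table1"
def loadTable2(): String = "table2"

// Start both loads concurrently on the default thread pool.
val f1 = Future(loadTable1())
val f2 = Future(loadTable2())

// Block until both have finished.
val tables = Await.result(Future.sequence(Seq(f1, f2)), 10.minutes)
```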

Thanks
Best Regards
