Posted to user@spark.apache.org by huanglr <hu...@CeBiTec.Uni-Bielefeld.DE> on 2015/07/10 15:52:42 UTC

Spark Broadcasting large dataset

Hey, Guys!

I am using Spark for an NGS (next-generation sequencing) data application.

In my case I have to broadcast a very big dataset to each task.

However, there are several tasks (say 48 tasks) running on CPUs (also 48 cores) in the same node. These tasks, which run on the same node, could share the same dataset, but Spark broadcasts it 48 times (if I understand correctly).
Is there a way to broadcast just one copy per node and share it among all tasks running on that node?
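
For context, here is roughly how I create and use the broadcast variable (a minimal Scala sketch; loadReference, annotate, and the paths are illustrative placeholders, not my real code):

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("ngs-broadcast"))

    // Load the large reference dataset once in the driver.
    val reference: Map[String, String] = loadReference()   // placeholder helper

    // Broadcast it; the handle is small, the data is shipped to executors.
    val refBc = sc.broadcast(reference)

    val reads = sc.textFile("hdfs:///data/reads.txt")      // placeholder path

    // Every task dereferences the same broadcast handle in its closure.
    val annotated = reads.map(read => annotate(read, refBc.value))  // placeholder helper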

Much appreciated!

best!



huanglr

Re: RE: Spark Broadcasting large dataset

Posted by huanglr <hu...@CeBiTec.Uni-Bielefeld.DE>.
Hi, Ashic,

Thank you very much for your reply!

The tasks I mention are instances of a Function that I implemented with the Spark API and pass to each partition of an RDD. Within the Function, each partition queries a big broadcast variable.

So when I run on a 48-core slave node, I have 48 partitions corresponding to 48 tasks (or closures), where each task gets a copy of the broadcast value (I see this from the memory usage and the API docs). Is there a way to share one value among all 48 partitions / 48 tasks?
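
The pattern looks roughly like this (a minimal Scala sketch reusing the illustrative names from my first mail), dereferencing the broadcast once per partition rather than once per record:

    // 'reads' is the input RDD, 'refBc' the broadcast handle from the driver.
    val result = reads.mapPartitions { iter =>
      val ref = refBc.value                    // one dereference per task
      iter.map(read => annotate(read, ref))    // placeholder per-read work
    }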


best!


huanglr
 
From: Ashic Mahtab
Date: 2015-07-10 17:02
To: huanglr; Apache Spark
Subject: RE: Spark Broadcasting large dataset

RE: Spark Broadcasting large dataset

Posted by Ashic Mahtab <as...@live.com>.
When you say tasks, do you mean different applications, or different tasks in the same application? If it's the same program, they should be able to share the broadcast value. But given you're asking the question, I imagine they're separate.

And in that case, afaik, the answer is no. You might look into putting the data into a fast store like Cassandra - that might help depending on your use case. 
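
For the cross-application case, the Cassandra route could look roughly like this with the DataStax spark-cassandra-connector (a sketch only; the keyspace, table, and column names are made up, and it assumes the connector is on the classpath with spark.cassandra.connection.host configured):

    import com.datastax.spark.connector._

    // Each application reads the shared reference data from Cassandra
    // instead of broadcasting it, so nothing is duplicated per app.
    val refRdd = sc.cassandraTable("ngs", "reference")
      .map(row => (row.getString("key"), row.getString("sequence")))

    // Join the reads against the reference rather than shipping the
    // whole dataset to every executor. extractKey is a placeholder.
    val annotated = reads.keyBy(extractKey _).join(refRdd)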

Cheers,
Ashic.
