Posted to user@spark.apache.org by Parthus <pe...@gmail.com> on 2014/07/22 21:54:03 UTC

What if there are large, read-only variables shared by all map functions?

Hi there,

I was wondering if anybody could help me find an efficient way to make a
MapReduce program like this:

1) Each map function needs to access some huge files, which total around
6 GB.

2) These files are READ-ONLY. They are essentially a huge look-up table
that will not change for 2-3 years.

I tried two ways to make the program work, but neither of them is efficient:

1) The first approach I tried was to have each map function load those files
independently, like this:

rdd.map { x => val table = load(files); doMapTask(x, table) }

2) The second approach I tried was to load the files before RDD.map(...) and
broadcast them. However, because the files are so large, the broadcast
overhead is 30 minutes to 1 hour.

Could anybody help me find an efficient way to solve it?

Thanks very much.

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/What-if-there-are-large-read-only-variables-shared-by-all-map-functions-tp10435.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: What if there are large, read-only variables shared by all map functions?

Posted by Aaron Davidson <il...@gmail.com>.
In particular, take a look at TorrentBroadcast, which should be much more
efficient than HttpBroadcast (the default in 1.0) for large files.
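
Roughly, switching the factory and broadcasting the table could look like the
sketch below. It is only a sketch: loadTable(), doMapTask() and inputRdd are
placeholders for your own code, and spark.broadcast.factory is the 1.x config
key for picking the broadcast implementation, as far as I recall.

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("lookup-table-job")
  // ask for TorrentBroadcast explicitly (the default changed after 1.0)
  .set("spark.broadcast.factory",
       "org.apache.spark.broadcast.TorrentBroadcastFactory")
val sc = new SparkContext(conf)

val tableBc = sc.broadcast(loadTable())   // load the 6 GB table once, on the driver
val result = inputRdd.map(x => doMapTask(x, tableBc.value))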

If you find that TorrentBroadcast doesn't work for you, then another way to
solve this problem is to place the data on all nodes' local disks, and
amortize the cost of the data loading by using RDD#mapPartitions instead of
#map, which allows you to do the loading once for a large set of elements.
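
A rough sketch of that, with loadTableFromLocalDisk() and doMapTask() standing
in for your own code and /local/path/to/table for wherever you copied the
files on each node:

val result = inputRdd.mapPartitions { iter =>
  // one load per partition instead of one load per element
  val table = loadTableFromLocalDisk("/local/path/to/table")
  iter.map(x => doMapTask(x, table))
}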

You could refine this model further by keeping some sort of (perhaps
static) state on your Executors, like

object LookupTable {
  // loaded lazily, once per executor JVM; loadTable() is a placeholder for your own loader
  private lazy val table: LookupTable = loadTable()
  def getOrLoadTable(): LookupTable = table
}

and then calling this method inside your mapPartitions function. This would
ensure the table is only loaded once on each Executor, and could also be used
to keep the data around between jobs. You should be careful, though, about
using so much memory outside of Spark's knowledge -- you may need to tune the
Spark memory options if you run into OutOfMemoryErrors.
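
Putting it together, the mapPartitions side might look like this (again only a
sketch, with doMapTask() as a placeholder for your own per-element work):

val result = inputRdd.mapPartitions { iter =>
  // only does real work the first time it runs in each executor JVM;
  // later tasks reuse the table that is already in memory
  val table = LookupTable.getOrLoadTable()
  iter.map(x => doMapTask(x, table))
}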


On Wed, Jul 23, 2014 at 8:39 PM, Mayur Rustagi <ma...@gmail.com>
wrote:

> Have a look at broadcast variables.

Re: What if there are large, read-only variables shared by all map functions?

Posted by Mayur Rustagi <ma...@gmail.com>.
Have a look at broadcast variables.

-- 
Sent from Gmail Mobile