Posted to mapreduce-user@hadoop.apache.org by Arko Provo Mukherjee <ar...@gmail.com> on 2011/11/01 00:45:10 UTC

Sharing data in a mapper for all values

Hello,

I have a situation where I am reading a big file from HDFS and then
comparing all the data in that file with each input to the mapper.

Now since my mapper re-reads the entire HDFS file for each of its inputs, the
amount of data it has to read and keep in memory becomes large (file size *
number of inputs to the mapper).

Can we somehow avoid this by loading the file once for each mapper, so that
the mapper can reuse the loaded file for each of the inputs it receives?

If this can be done, then for each mapper, I can just load the file once
and then the mapper can use it for the entire slice of data that it
receives.

Thanks a lot in advance!

Warm regards
Arko

Re: Sharing data in a mapper for all values

Posted by Bibek Paudel <et...@gmail.com>.
On Tue, Nov 1, 2011 at 12:52 AM, Joey Echeverria <jo...@cloudera.com> wrote:
> Yes, you can read the file in the configure() (old api) and setup()
> (new api) methods. The data can be saved in a variable that will be
> accessible to every call to map().

If I understood the question correctly, the situation is to keep a big
HDFS file in memory for each mapper. I know that for small files, the
DistributedCache is a solution.
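
Roughly, that approach looks like the sketch below (old API; the cached file
path, property names and class names are only placeholders, not from Arko's
setup):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CachedFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private final List<String> reference = new ArrayList<String>();

  // Driver side (before submitting the job):
  //   DistributedCache.addCacheFile(new URI("/data/reference.txt"), conf);

  @Override
  public void configure(JobConf conf) {
    try {
      // Each task gets a local copy of the cached file; load it once here.
      Path[] cached = DistributedCache.getLocalCacheFiles(conf);
      BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
      String line;
      while ((line = in.readLine()) != null) {
        reference.add(line);
      }
      in.close();
    } catch (IOException e) {
      throw new RuntimeException("Failed to load cached reference file", e);
    }
  }

  @Override
  public void map(LongWritable key, Text value, OutputCollector<Text, Text> output,
      Reporter reporter) throws IOException {
    // The reference data is already in memory; compare each record against it.
    for (String ref : reference) {
      // ... comparison and output.collect(...) as needed ...
    }
  }
}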

Bibek

Re: Sharing data in a mapper for all values

Posted by Joey Echeverria <jo...@cloudera.com>.
Yes, you can read the file in the configure() (old api) and setup()
(new api) methods. The data can be saved in a variable that will be
accessible to every call to map().
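
Something like the following sketch, with the new API (the
"reference.file.path" property and the class name are made up for
illustration):

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CompareMapper extends Mapper<LongWritable, Text, Text, Text> {

  private final List<String> reference = new ArrayList<String>();

  @Override
  protected void setup(Context context) throws IOException, InterruptedException {
    // Runs once per map task, before any call to map().
    Path refPath = new Path(context.getConfiguration().get("reference.file.path"));
    FileSystem fs = refPath.getFileSystem(context.getConfiguration());
    BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(refPath)));
    try {
      String line;
      while ((line = in.readLine()) != null) {
        reference.add(line);
      }
    } finally {
      in.close();
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The file contents are reused for every input record of this task.
    for (String ref : reference) {
      // ... compare value against ref and context.write(...) as needed ...
    }
  }
}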

-Joey

On Mon, Oct 31, 2011 at 7:45 PM, Arko Provo Mukherjee
<ar...@gmail.com> wrote:
> Hello,
> I have a situation where I am reading a big file from HDFS and then
> comparing all the data in that file with each input to the mapper.
> Now since my mapper re-reads the entire HDFS file for each of its inputs,
> the amount of data it has to read and keep in memory becomes large (file
> size * number of inputs to the mapper).
> Can we somehow avoid this by loading the file once for each mapper, so that
> the mapper can reuse the loaded file for each of the inputs it receives?
> If this can be done, then for each mapper, I can just load the file once and
> then the mapper can use it for the entire slice of data that it receives.
> Thanks a lot in advance!
>
> Warm regards
> Arko



-- 
Joseph Echeverria
Cloudera, Inc.
443.305.9434

Re: Sharing data in a mapper for all values

Posted by Harsh J <ha...@cloudera.com>.
Arko,

Have you considered using Hive/Pig for the same kind of functionality instead?

There are also ways to do this with reducers, with the proper group/sort
comparators in place (we'd need to understand more about what you're trying
to achieve here before giving out a concrete solution), but you can use the
above tools instead - they may offer a more 'natural' way out.
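
Just to illustrate the comparator part (this is not a complete solution, and
every name below is made up): a grouping comparator that groups reducer input
on only the portion of a Text key before a tab could look like this. A
partitioner that also uses only that natural-key portion would be needed
alongside it.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

// Assumes map output keys of the form "naturalKey\tsourceTag" (Text).
// Sorting still uses the full key, but grouping ignores the tag, so all
// records sharing a natural key reach the same reduce() call together.
public class NaturalKeyGroupComparator extends WritableComparator {

  public NaturalKeyGroupComparator() {
    super(Text.class, true);
  }

  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    String ka = ((Text) a).toString();
    String kb = ((Text) b).toString();
    int ta = ka.indexOf('\t');
    int tb = kb.indexOf('\t');
    String na = ta >= 0 ? ka.substring(0, ta) : ka;
    String nb = tb >= 0 ? kb.substring(0, tb) : kb;
    return na.compareTo(nb);
  }
}

// Driver wiring (new API):
//   job.setGroupingComparatorClass(NaturalKeyGroupComparator.class);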

On Tue, Nov 1, 2011 at 5:15 AM, Arko Provo Mukherjee
<ar...@gmail.com> wrote:
> Hello,
> I have a situation where I am reading a big file from HDFS and then
> comparing all the data in that file with each input to the mapper.
> Now since my mapper re-reads the entire HDFS file for each of its inputs,
> the amount of data it has to read and keep in memory becomes large (file
> size * number of inputs to the mapper).
> Can we somehow avoid this by loading the file once for each mapper, so that
> the mapper can reuse the loaded file for each of the inputs it receives?
> If this can be done, then for each mapper, I can just load the file once and
> then the mapper can use it for the entire slice of data that it receives.
> Thanks a lot in advance!
>
> Warm regards
> Arko



-- 
Harsh J

Re: Sharing data in a mapper for all values

Posted by Anthony Urso <an...@cs.ucla.edu>.
Arko:

If you have keyed both the big blob and the input files similarly, and
you can output both streams to HDFS sorted by key, then you can
reformulate this whole process as a map-side join.  It will be a lot
simpler and more efficient than scanning the whole blob for each
input.
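
One common way to wire that up with the old API is Hadoop's
CompositeInputFormat; the sketch below is only illustrative (the paths and
job name are placeholders), and it assumes both inputs are already sorted by
key and partitioned into the same number of files:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;

public class MapSideJoinJob {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(MapSideJoinJob.class);
    conf.setJobName("map-side join sketch");

    // Join the big reference data against the regular mapper input by key;
    // the join is evaluated while the splits are read, so the big file is
    // never re-scanned per record.
    conf.setInputFormat(CompositeInputFormat.class);
    conf.set("mapred.join.expr",
        CompositeInputFormat.compose("inner", KeyValueTextInputFormat.class,
            new Path("/data/big_blob"), new Path("/data/mapper_input")));

    // A mapper implementing Mapper<Text, TupleWritable, Text, Text> would be
    // set here to consume the joined records (omitted from this sketch):
    // conf.setMapperClass(YourJoinMapper.class);

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(conf, new Path("/data/joined"));

    JobClient.runJob(conf);
  }
}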

Also, do whatever loading you have to do in the constructor or the
configure() method, to save a lot of repetition.

Hope this helps,
Anthony

On Mon, Oct 31, 2011 at 4:45 PM, Arko Provo Mukherjee
<ar...@gmail.com> wrote:
> Hello,
> I have a situation where I am reading a big file from HDFS and then
> comparing all the data in that file with each input to the mapper.
> Now since my mapper re-reads the entire HDFS file for each of its inputs,
> the amount of data it has to read and keep in memory becomes large (file
> size * number of inputs to the mapper).
> Can we somehow avoid this by loading the file once for each mapper, so that
> the mapper can reuse the loaded file for each of the inputs it receives?
> If this can be done, then for each mapper, I can just load the file once and
> then the mapper can use it for the entire slice of data that it receives.
> Thanks a lot in advance!
>
> Warm regards
> Arko