Posted to common-user@hadoop.apache.org by Prasan Ary <vo...@yahoo.com> on 2008/03/12 17:24:57 UTC

reading input file only once for multiple map functions

I have a very large XML file as input and a couple of Map/Reduce functions. The input key/value pairs to all of my map functions are the same.
  I was wondering if there is a way to read the input XML file only once, then create the key/value pairs (also once) and give these k/v pairs as input to my map functions, as opposed to having to read the XML and generate the key/value pairs once for each map function?

  thanks.
   

       

RE: reading input file only once for multiple map functions

Posted by Joydeep Sen Sarma <js...@facebook.com>.
The short answer is no - you can't do this.

There are some special cases:

If the map output key for the same XML record is the same for both jobs (i.e. sort/partition/grouping is based on the same value), then you can do this in the application layer, as sketched below.

If the map output keys differ, then there's no way to do this. You can combine both jobs and send tagged outputs from each job's map function, but that doesn't achieve much (it saves the file scan, but unrelated data now has to be sorted together, which on the whole may be a loss rather than a win).
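
For the same-key case, the application-layer combine looks roughly like the sketch below (classic org.apache.hadoop.mapred API; CombinedMapper, sharedKey, and the per-job value helpers are hypothetical placeholders, not code from this thread):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// One mapper applies both jobs' per-record logic in a single scan of the
// XML and tags each value so the reducer can tell the two streams apart.
public class CombinedMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  public void map(LongWritable key, Text xmlRecord,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    Text shared = sharedKey(xmlRecord);
    // Same sort/partition/grouping key for both jobs, so one shuffle serves both.
    out.collect(shared, new Text("A\t" + valueForJobA(xmlRecord)));
    out.collect(shared, new Text("B\t" + valueForJobB(xmlRecord)));
  }

  // Hypothetical stand-ins for the real per-record logic.
  private Text sharedKey(Text rec)      { return rec; }
  private String valueForJobA(Text rec) { return rec.toString(); }
  private String valueForJobB(Text rec) { return rec.toString(); }
}

The matching reducer peels the "A"/"B" tag off each value and hands it to the corresponding reduce logic.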

If the map itself achieves a dramatic reduction in data size, then you can consider running a first job that has no reduces - it just applies the map functions to produce two sets of (much smaller) files that are written out to HDFS. Then you launch two jobs (with an identity mapper and the original reduce functions) that work against those data sets. You end up with three jobs, but only a single scan of the initial file.
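
A rough driver for that three-job plan (classic mapred API again; BothMapsMapper, ReducerOne, and all paths are made-up placeholders, and how the scan job separates its output into the two data sets - e.g. a tag that each downstream job filters on - is glossed over):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;

public class ThreeJobDriver {
  public static void main(String[] args) throws Exception {
    // Job 0: the only scan of the big XML file. No reduces, so the
    // (much smaller) map output is written straight to HDFS.
    JobConf scan = new JobConf(ThreeJobDriver.class);
    scan.setMapperClass(BothMapsMapper.class);   // hypothetical: applies both map functions
    scan.setNumReduceTasks(0);                   // map-only job
    scan.setOutputKeyClass(Text.class);
    scan.setOutputValueClass(Text.class);
    scan.setOutputFormat(SequenceFileOutputFormat.class);
    FileInputFormat.setInputPaths(scan, new Path("/input/big.xml"));
    FileOutputFormat.setOutputPath(scan, new Path("/tmp/scanned"));
    JobClient.runJob(scan);

    // Job 1: identity map over the pre-mapped data plus the original reduce.
    JobConf job1 = new JobConf(ThreeJobDriver.class);
    job1.setMapperClass(IdentityMapper.class);
    job1.setReducerClass(ReducerOne.class);      // hypothetical: job 1's original reducer
    job1.setOutputKeyClass(Text.class);
    job1.setOutputValueClass(Text.class);
    job1.setInputFormat(SequenceFileInputFormat.class);
    FileInputFormat.setInputPaths(job1, new Path("/tmp/scanned"));
    FileOutputFormat.setOutputPath(job1, new Path("/output/one"));
    JobClient.runJob(job1);

    // Job 2: same shape again with the second job's original reducer.
  }
}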

---

It would help if you described the nature of the problem (size/schema of the input data, outputs desired) rather than the solution you are trying to implement.

-----Original Message-----
From: Prasan Ary [mailto:voicesnthedark@yahoo.com]
Sent: Wed 3/12/2008 10:38 AM
To: core-user@hadoop.apache.org
Subject: Re: reading input file only once for multiple map functions
 
Ted,
  Say I have two Mapper classes. The map functions for both of these classes get their input splits from a very large XML file.

  Right now I am creating two different jobs, Job_1 and Job_2, and both of these jobs have the same input path (to the XML file). However, since I am using a custom InputFormat to split the XML at record boundaries, all splits for Job_1 and Job_2 should be the same (equal to the number of records in the XML).

  So basically I am splitting the XML twice and getting the same splits each time. It would be nice if I could split the XML once and send those splits to the Maps of Job_1 and Job_2.

  ===========================================================
  
Ted Dunning <td...@veoh.com> wrote:
  
Your request sounds very strange.

First off, different map objects are created on different machines (that IS
the point, after all) and thus any reading of data has to be done on at
least all of those machines. The map object is only created once per split,
though, so that might be a bit more what you are getting at.

Your basic requirement is a little odd, however, since you say that the
input to all of the maps is the same. What is the point of parallelism in
that case? Are your maps random in some sense? Are they really operating
on different parts of the single input? If so, shouldn't they just be
getting the part of the input that they will be working on?


Perhaps you should describe what you are trying to do at a higher level. It
really sounds like you have taken a bit of an odd turn somewhere in your
porting your algorithm to a parallel form.


On 3/12/08 9:24 AM, "Prasan Ary" wrote:

> I have a very large XML file as input and a couple of Map/Reduce functions.
> The input key/value pairs to all of my map functions are the same.
> I was wondering if there is a way to read the input XML file only once,
> then create the key/value pairs (also once) and give these k/v pairs as input to my
> map functions, as opposed to having to read the XML and generate the key/value pairs
> once for each map function?
> 
> thanks.



       


Re: reading input file only once for multiple map functions

Posted by Ted Dunning <td...@veoh.com>.
Ahhh...

There is an old saying for this.  I think you are pulling fly specks out of
pepper.

Unless your input format is very, very strange, doing the split again for
two jobs does, indeed, lead to some small inefficiency, but this cost should
be so low compared to other inefficiencies that you are wasting your time
trying to optimize it away. Remember, you don't know where the maps will
execute, so getting the splits to the correct nodes would be a nightmare.

If your splitting is actually so expensive that you can measure it, then you
should consider changing formats. This is analogous to having a single
gzipped input file. Splitting such a file involves reading it from the
beginning, because gzip is a stream compression algorithm. There are a
few proposals going around to optimize that by concatenating gzip files with
special marker files in between, but the real answer is either not to use
gzipped input files or to split the files before gzipping.
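
For what it's worth, pre-splitting can be as simple as cutting the raw file into block-sized pieces and gzipping each piece on its own. A toy sketch (names and sizes are arbitrary, and a real version would cut at record boundaries rather than raw byte counts):

import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.util.zip.GZIPOutputStream;

public class SplitThenGzip {
  public static void main(String[] args) throws Exception {
    final long chunkBytes = 64L * 1024 * 1024;   // roughly one HDFS block
    InputStream in = new FileInputStream(args[0]);
    byte[] buf = new byte[8192];
    GZIPOutputStream out = null;
    long written = chunkBytes;                   // forces the first chunk to open
    int part = 0, n;
    while ((n = in.read(buf)) != -1) {
      if (written >= chunkBytes) {               // start a new independently gzipped chunk
        if (out != null) out.close();
        out = new GZIPOutputStream(
            new FileOutputStream(args[0] + ".part-" + (part++) + ".gz"));
        written = 0;
      }
      out.write(buf, 0, n);
      written += n;
    }
    if (out != null) out.close();
    in.close();
  }
}

Each .gz part is then a separately decompressable input, so map tasks no longer have to read from the start of one giant stream.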


On 3/12/08 10:38 AM, "Prasan Ary" <vo...@yahoo.com> wrote:

>   So basically I am splitting the XML twice and getting the same splits each time.
> It would be nice if I could split the XML once and send those splits to the Maps
> of Job_1 and Job_2.


Re: reading input file only once for multiple map functions

Posted by Prasan Ary <vo...@yahoo.com>.
Ted,
  Say I have two Mapper classes. The map functions for both of these classes get their input splits from a very large XML file.

  Right now I am creating two different jobs, Job_1 and Job_2, and both of these jobs have the same input path (to the XML file). However, since I am using a custom InputFormat to split the XML at record boundaries, all splits for Job_1 and Job_2 should be the same (equal to the number of records in the XML).

  So basically I am splitting the XML twice and getting the same splits each time. It would be nice if I could split the XML once and send those splits to the Maps of Job_1 and Job_2.
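
For reference, the custom InputFormat described above is shaped roughly like this in the classic mapred API (a skeleton only; XmlRecordReader and the actual record-scanning logic are hypothetical and elided):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class XmlRecordInputFormat extends FileInputFormat<LongWritable, Text> {
  // getSplits(...) would also be overridden here to cut the file at XML
  // record boundaries, as described above (elided).
  public RecordReader<LongWritable, Text> getRecordReader(
      InputSplit split, JobConf conf, Reporter reporter) throws IOException {
    // Hypothetical reader: seeks to the first record start inside the
    // split and returns one whole XML record per next() call.
    return new XmlRecordReader(split, conf);
  }
}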

  ===========================================================
  
Ted Dunning <td...@veoh.com> wrote:
  
Your request sounds very strange.

First off, different map objects are created on different machines (that IS
the point, after all) and thus any reading of data has to be done on at
least all of those machines. The map object is only created once per split,
though, so that might be a bit more what you are getting at.

Your basic requirement is a little odd, however, since you say that the
input to all of the maps is the same. What is the point of parallelism in
that case? Are your maps random in some sense? Are they really operating
on different parts of the single input? If so, shouldn't they just be
getting the part of the input that they will be working on?


Perhaps you should describe what you are trying to do at a higher level. It
really sounds like you have taken a bit of an odd turn somewhere in your
porting your algorithm to a parallel form.


On 3/12/08 9:24 AM, "Prasan Ary" wrote:

> I have a very large XML file as input and a couple of Map/Reduce functions.
> The input key/value pairs to all of my map functions are the same.
> I was wondering if there is a way to read the input XML file only once,
> then create the key/value pairs (also once) and give these k/v pairs as input to my
> map functions, as opposed to having to read the XML and generate the key/value pairs
> once for each map function?
> 
> thanks.



       

Re: reading input file only once for multiple map functions

Posted by Ted Dunning <td...@veoh.com>.
Your request sounds very strange.

First off, different map objects are created on different machines (that IS
the point, after all) and thus any reading of data has to be done on at
least all of those machines.  The map object is only created once per split,
though, so that might be a bit more what you are getting at.

Your basic requirement is a little odd, however, since you say that the
input to all of the maps is the same.  What is the point of parallelism in
that case?  Are your maps random in some sense?  Are they really operating
on different parts of the single input?  If so, shouldn't they just be
getting the part of the input that they will be working on?


Perhaps you should describe what you are trying to do at a higher level.  It
really sounds like you have taken a bit of an odd turn somewhere in your
porting your algorithm to a parallel form.


On 3/12/08 9:24 AM, "Prasan Ary" <vo...@yahoo.com> wrote:

> I have a very large XML file as input and a couple of Map/Reduce functions.
> The input key/value pairs to all of my map functions are the same.
> I was wondering if there is a way to read the input XML file only once,
> then create the key/value pairs (also once) and give these k/v pairs as input to my
> map functions, as opposed to having to read the XML and generate the key/value pairs
> once for each map function?
>    
>   thanks.
>    