
Posted to common-user@hadoop.apache.org by stolikp <st...@o2.pl> on 2010/01/23 15:49:22 UTC

Passing whole text file to a single map

I've got some text files in my input directory and I want to pass each
text file (the whole file, not just a line) to a map, one file per map. How
can I do this? TextInputFormat splits the text into lines and I do not want
this to happen.
I tried:
http://hadoop.apache.org/common/docs/r0.20./streaming.html#How+do+I+process+files%2C+one+per+map%3F
but it doesn't work for me; the compiler doesn't know what
NonSplitableTextInputFormat.class is.
I'm using Hadoop 0.20.1.
-- 
View this message in context: http://old.nabble.com/Passing-whole-text-file-to-a-single-map-tp27286204p27286204.html
Sent from the Hadoop core-user mailing list archive at Nabble.com.

Re: Passing whole text file to a single map

Posted by Jason Venner <ja...@gmail.com>.

http://prohadoop.ning.com/forum/topics/passing-whole-file-to-map


Re: Passing whole text file to a single map

Posted by Edward Capriolo <ed...@gmail.com>.

My bible code problem is somewhat similar. I have many small files and
one mapper needs to process an entire file. So I generate an input
file listing one path per line:

/user/bc/ecapriolo/bible1/grid/10/0,dictionary.txt
/user/bc/ecapriolo/bible1/grid/10/1,dictionary.txt
/user/bc/ecapriolo/bible1/grid/10/2,dictionary.txt

Then I use NLineInputFormat:

    JobConf conf = new JobConf(getConf(), GridSearcher.class);
    conf.setJobName("GridSearcher");
    conf.setMapperClass(MapClass.class);
    // NLineInputFormat hands each map task N lines of the input file
    // (N defaults to 1), so each mapper receives a single path.
    conf.setInputFormat(NLineInputFormat.class);
    conf.setMapOutputKeyClass(Text.class);
    conf.setMapOutputValueClass(Text.class);
    // The job input is the list of paths, not the data files themselves.
    FileInputFormat.setInputPaths(conf, new Path("/user/bc/gridsearchcmd.txt"));
    FileOutputFormat.setOutputPath(conf, new Path("/user/bc/gridsearchres"));

Now each mapper opens and processes the entire file using
FSDataInputStream. It is an anti-pattern, since the map input is NOT the
data line by line; it is only the names of files to open. One map, one
file.
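
For anyone following along, here is a rough sketch of the "read the whole
file" step in plain Java. It is not the code from the job above: the class
and method names (WholeFileReader, readFully) are made up for illustration,
and it works on any InputStream so it can be tried without a cluster. The
assumption is that inside map() the stream would come from the Hadoop
FileSystem, as noted in the comments.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Hadoop or of the job shown above.
final class WholeFileReader {

    // Reads every byte from the stream and decodes it as UTF-8.
    // The caller stays responsible for closing the stream.
    //
    // Assumption: in a mapper the stream would be obtained with something
    // like FileSystem.get(conf).open(new Path(value.toString())), where
    // value is the single line (a file path) passed to map().
    static String readFully(InputStream in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[8192];
        try {
            int n;
            // read() returns -1 once the end of the stream is reached
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
        } catch (IOException e) {
            // Rethrown unchecked so the helper is easy to call from a test;
            // inside map() plain IOException propagation would also do.
            throw new UncheckedIOException(e);
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

This holds the whole file in memory, so it only suits files that fit
comfortably in the task's heap.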

Re: Passing whole text file to a single map

Posted by Raymond Jennings III <ra...@yahoo.com>.

Not sure if this solves your problem, but I had a similar case: there was unique data at the beginning of the file, and if the file was split between maps, the second and subsequent maps would lose it. I was able to pull the file name from the conf and read the first two lines in every map.
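
A rough sketch of the "read the first two lines" part, in plain Java so it can be tried outside a job. The names (HeaderReader, readHeader) are made up for illustration and this is not the original code. It assumes the stream is opened on the file whose name was taken from the conf; with the old mapred API that name is usually available under the map.input.file property, but check this against your Hadoop version.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Hadoop.
final class HeaderReader {

    // Returns up to maxLines lines from the start of the stream and
    // closes the stream afterwards.
    //
    // Assumption: in a map task this would be called once, for example
    // from configure(JobConf), on a stream opened for the task's input file.
    static List<String> readHeader(InputStream in, int maxLines) {
        List<String> header = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            // Stop at maxLines or at end of file, whichever comes first
            while (header.size() < maxLines && (line = reader.readLine()) != null) {
                header.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return header;
    }
}
```

Calling readHeader(stream, 2) on a file that starts with two header lines returns just those two lines, whichever split the task was given.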
