Posted to common-user@hadoop.apache.org by Per Stolpe <pe...@gmail.com> on 2009/06/04 18:18:11 UTC

Letting the Mapper handle multiple lines.

Hi.
I'm quite new to Hadoop programming, so to get a good start I began
writing my own program that summarizes a column in a large tab-separated
file (~100 000 000 lines). My first naive implementation was quite simple, a
small rework of the WordCount example that comes with Hadoop. This program
did calculate the correct answer, but it performed quite badly, since every
line in the file triggers a call to map(). To solve this, I wrote my own
RecordReader, one that returns a List<Text> instead of just a Text. It
type-checks in Eclipse and all seems to be fine until I actually run the
program. When I do, I get the following error:

java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to
java.util.List
        at Summarizer$TokenizerMapper.map(Summarizer.java:1)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:518)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:303)
        at org.apache.hadoop.mapred.Child.main(Child.java:170)

(repeated several times)

What might be the problem?
Also, is there perhaps an InputFormat (one not marked as deprecated) that
already solves my problem?

Source code:
Summarizer: http://pastebin.com/m52876939
RecordReader: http://pastebin.com/m2c541a00
InputFormat: http://pastebin.com/m7714b0c

Hadoop version: 0.20.0
Java JDK version: 1.6 u14

Regards,
Per and Felix

Re: Letting the Mapper handle multiple lines.

Posted by Per Stolpe <pe...@gmail.com>.

I did indeed think that addInputPath() set the InputFormat class, so 
this is probably what has been my problem. I'll try this when I gain 
access to my cluster again on Monday, but I'm fairly confident that this 
will fix my program.

Thank you very much for a good answer.
Take care, I will post an update on Monday.

HRoger wrote:
> I have read your code, and I think you should add
> job.setInputFormatClass(MultiLineInputFormat.class);
> If you do not set it, the job falls back to the default, TextInputFormat,
> whose values are Text. You may have thought
> that "MultiLineInputFormat.addInputPath()" would set the InputFormat class
> automatically, but it does not.
> You can also call configuration.set("mapred.job.tracker","local") and add
> some log output to debug your program.
>
> Good Luck!

Re: Letting the Mapper handle multiple lines.

Posted by HRoger <ha...@163.com>.

I have read your code, and I think you should add
job.setInputFormatClass(MultiLineInputFormat.class);
If you do not set it, the job falls back to the default, TextInputFormat,
whose values are Text. You may have thought
that "MultiLineInputFormat.addInputPath()" would set the InputFormat class
automatically, but it does not.
You can also call configuration.set("mapred.job.tracker","local") and add
some log output to debug your program.
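
To illustrate why the program type-checks but still fails at run time, here
is a minimal plain-Java sketch. It does not use Hadoop at all: the Job,
InputFormat and Mapper types below are hypothetical stand-ins written only for
this example, and the column index and batch size are assumptions. The point
it tries to show is that generic type parameters are erased, so a framework
that feeds line-sized records to a mapper expecting a List only fails when the
mapper is actually called.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class InputFormatDemo {

    /** Hypothetical stand-in for a mapper that is generic in its value type. */
    interface Mapper<V> {
        int map(V value);
    }

    /** Hypothetical stand-in for an input format: it decides what each record is. */
    interface InputFormat {
        List<Object> records(List<String> lines);
    }

    /** Default behaviour, analogous to TextInputFormat: one record per line. */
    static final InputFormat LINE_FORMAT = lines -> new ArrayList<Object>(lines);

    /** Analogous to a custom multi-line format: each record is a batch of lines. */
    static InputFormat batchFormat(int batchSize) {
        return lines -> {
            List<Object> records = new ArrayList<>();
            for (int i = 0; i < lines.size(); i += batchSize) {
                int end = Math.min(i + batchSize, lines.size());
                records.add(new ArrayList<String>(lines.subList(i, end)));
            }
            return records;
        };
    }

    /** Toy job: uses the line format unless another one is set explicitly. */
    static class Job {
        private InputFormat format = LINE_FORMAT;

        void setInputFormat(InputFormat format) {
            this.format = format;
        }

        @SuppressWarnings("unchecked")
        <V> int run(Mapper<V> mapper, List<String> lines) {
            int total = 0;
            for (Object record : format.records(lines)) {
                // Unchecked cast: erased at compile time, so nothing is verified here.
                total += mapper.map((V) record);
            }
            return total;
        }
    }

    /** Sums the second tab-separated column of every line in a batch. */
    static class SumMapper implements Mapper<List<String>> {
        @Override
        public int map(List<String> batch) {
            int sum = 0;
            for (String line : batch) {
                sum += Integer.parseInt(line.split("\t")[1]);
            }
            return sum;
        }
    }

    /** Runs with the default format; the mapper receives a String, not a List. */
    static String runWithDefaultFormat(List<String> lines) {
        try {
            return String.valueOf(new Job().run(new SumMapper(), lines));
        } catch (ClassCastException e) {
            return "ClassCastException";
        }
    }

    /** Runs after explicitly selecting the batching format. */
    static int runWithBatchFormat(List<String> lines) {
        Job job = new Job();
        job.setInputFormat(batchFormat(2));
        return job.run(new SumMapper(), lines);
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("a\t1", "b\t2", "c\t3");
        System.out.println(runWithDefaultFormat(lines)); // prints "ClassCastException"
        System.out.println(runWithBatchFormat(lines));   // prints 6
    }
}
```

In the real job the corresponding step is the single call quoted above,
job.setInputFormatClass(MultiLineInputFormat.class), made in the driver before
the job is submitted.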

Good Luck!
