Posted to mapreduce-user@hadoop.apache.org by Kunal Gupta <ku...@techlead-india.com> on 2009/12/01 07:17:19 UTC

How to write a custom input format and record reader to read multiple lines of text from files

Can someone explain how to override the "FileInputFormat" and
"RecordReader" in order to be able to read multiple lines of text from
input files in a single map task?

Here the key will be the offset of the first line of text and the value will
be the N lines of text.
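For example, with N = 2 and an input file whose four lines are "alpha",
"beta", "gamma" and "delta", the map function would be called twice: once with
key 0 and value "alpha\nbeta", and once with key 11 (the byte offset where
"gamma" starts) and value "gamma\ndelta".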

I have extended the class FileInputFormat:

public class MultiLineFileInputFormat
	extends FileInputFormat<LongWritable, Text>{
...
}

and implemented the abstract method:

public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
                TaskAttemptContext context)
         throws IOException, InterruptedException {...}

I have also extended the RecordReader class:

public class MultiLineFileRecordReader extends
RecordReader<LongWritable, Text>
{...}

and in the job configuration, specified this new InputFormat class:

job.setInputFormatClass(MultiLineFileInputFormat.class);
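
For context, the driver is wired up roughly like this (a sketch, not my exact
driver; MyMapper and the paths are placeholders):

Configuration conf = new Configuration();
Job job = new Job(conf, "multi-line records");
job.setJarByClass(CustomRecordReader.class);
job.setInputFormatClass(MultiLineFileInputFormat.class);
job.setMapperClass(MyMapper.class);   // any Mapper<LongWritable, Text, ?, ?>
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
System.exit(job.waitForCompletion(true) ? 0 : 1);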

--------------------------------------------------------------------------
When I run this new map/reduce program, I get the following Java error:
--------------------------------------------------------------------------
Exception in thread "main" java.lang.RuntimeException: java.lang.NoSuchMethodException: CustomRecordReader$MultiLineFileInputFormat.<init>()
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
	at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
	at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
	at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
	at CustomRecordReader.main(CustomRecordReader.java:257)
Caused by: java.lang.NoSuchMethodException: CustomRecordReader$MultiLineFileInputFormat.<init>()
	at java.lang.Class.getConstructor0(Class.java:2706)
	at java.lang.Class.getDeclaredConstructor(Class.java:1985)
	at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
	... 5 more


Re: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Kunal Gupta <ku...@techlead-india.com>.
I have not implemented a constructor for the class extending
FileInputFormat.
I had actually implemented a constructor with arguments for the class
extending RecordReader, and that class did not have a no-arg constructor.
After reading your comment I wrote a no-arg constructor for it, but I am
still getting the same error.

Following is the class extending the FileInputFormat class:

	public class MultiLineFileInputFormat
			extends FileInputFormat<LongWritable, Text> {

		/*public MultiLineFileInputFormat()
		{
			super();
		}*/

		@Override
		public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
				TaskAttemptContext context) throws IOException, InterruptedException {
			context.setStatus(split.toString());
			return new MultiLineFileRecordReader((FileSplit) split, context);
		}
	}


On Tue, 2009-12-01 at 07:20 +0000, Sean Owen wrote:
> It sounds like you have declared a constructor in
> MultiLineFileInputFormat that needs an argument. By doing so, no
> no-arg constructor is automatically generated. Unless you write one,
> it won't exist. The Hadoop framework instantiates your class by
> calling the no-arg constructor. The error you get says this directly:
> there is no no-arg constructor. Write one to fix it.
> 
> The example you reference has a no-arg constructor, by default, since
> it declares no constructors at all.
> 
> On Tue, Dec 1, 2009 at 6:57 AM, Kunal Gupta <ku...@techlead-india.com> wrote:
> > Can you kindly guide me on what initialisation i need to do in the
> > implemented class constructor - MultiLineFileInputFormat?
> >
> > i was following the sample provided on this yahoo page:
> >
> > http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
> >
> >
> >
> >
> > On Tue, 2009-12-01 at 06:45 +0000, Sean Owen wrote:
> >> It sounds like you have no provided a no-arg constructor in
> >> MultiLineFileInputFormat.
> >>
> >> On Tue, Dec 1, 2009 at 6:17 AM, Kunal Gupta <ku...@techlead-india.com> wrote:
> >> > Can someone explain how to override the "FileInputFormat" and
> >> > "RecordReader" in order to be able to read multiple lines of text from
> >> > input files in a single map task?
> >> >
> >> > Here the key will be the offset of the first line of text and value will
> >> > be the N lines of text.
> >> >
> >> > I have overridden the class FileInputFormat:
> >> >
> >> > public class MultiLineFileInputFormat
> >> >        extends FileInputFormat<LongWritable, Text>{
> >> > ...
> >> > }
> >> >
> >> > and implemented the abstract method:
> >> >
> >> > public RecordReader createRecordReader(InputSplit split,
> >> >                TaskAttemptContext context)
> >> >         throws IOException, InterruptedException {...}
> >> >
> >> > I have also overridden the recordreader class:
> >> >
> >> > public class MultiLineFileRecordReader extends
> >> > RecordReader<LongWritable, Text>
> >> > {...}
> >> >
> >> > and in the job configuration, specified this new InputFormat class:
> >> >
> >> > job.setInputFormatClass(MultiLineFileInputFormat.class);
> >> >
> >> > --------------------------------------------------------------------------
> >> > When I  run this new map/reduce program, i get the following java error:
> >> > --------------------------------------------------------------------------
> >> > Exception in thread "main" java.lang.RuntimeException:
> >> > java.lang.NoSuchMethodException: CustomRecordReader
> >> > $MultiLineFileInputFormat.<init>()
> >> >        at
> >> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
> >> >        at
> >> > org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
> >> >        at
> >> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
> >> >        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
> >> >        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
> >> >        at CustomRecordReader.main(CustomRecordReader.java:257)
> >> > Caused by: java.lang.NoSuchMethodException: CustomRecordReader
> >> > $MultiLineFileInputFormat.<init>()
> >> >        at java.lang.Class.getConstructor0(Class.java:2706)
> >> >        at java.lang.Class.getDeclaredConstructor(Class.java:1985)
> >> >        at
> >> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
> >> >        ... 5 more
> >> >
> >> >
> >>
> >
> >
> 


Re: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Sean Owen <sr...@gmail.com>.
It sounds like you have declared a constructor in
MultiLineFileInputFormat that needs an argument. By doing so, no
no-arg constructor is automatically generated. Unless you write one,
it won't exist. The Hadoop framework instantiates your class by
calling the no-arg constructor. The error you get says this directly:
there is no no-arg constructor. Write one to fix it.
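
Concretely, something along these lines should keep ReflectionUtils happy (a
quick sketch, not tested; note that if the class is nested inside your driver,
as the CustomRecordReader$MultiLineFileInputFormat name in the trace suggests,
it must also be declared static, because a non-static inner class never has a
true no-arg constructor):

public static class MultiLineFileInputFormat
        extends FileInputFormat<LongWritable, Text> {

    // Explicit no-arg constructor so ReflectionUtils.newInstance() can call it.
    public MultiLineFileInputFormat() {
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        context.setStatus(split.toString());
        return new MultiLineFileRecordReader((FileSplit) split, context);
    }
}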

The example you reference has a no-arg constructor, by default, since
it declares no constructors at all.

On Tue, Dec 1, 2009 at 6:57 AM, Kunal Gupta <ku...@techlead-india.com> wrote:
> Can you kindly guide me on what initialisation i need to do in the
> implemented class constructor - MultiLineFileInputFormat?
>
> i was following the sample provided on this yahoo page:
>
> http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat
>
>
>
>
> On Tue, 2009-12-01 at 06:45 +0000, Sean Owen wrote:
>> It sounds like you have no provided a no-arg constructor in
>> MultiLineFileInputFormat.
>>
>> On Tue, Dec 1, 2009 at 6:17 AM, Kunal Gupta <ku...@techlead-india.com> wrote:
>> > Can someone explain how to override the "FileInputFormat" and
>> > "RecordReader" in order to be able to read multiple lines of text from
>> > input files in a single map task?
>> >
>> > Here the key will be the offset of the first line of text and value will
>> > be the N lines of text.
>> >
>> > I have overridden the class FileInputFormat:
>> >
>> > public class MultiLineFileInputFormat
>> >        extends FileInputFormat<LongWritable, Text>{
>> > ...
>> > }
>> >
>> > and implemented the abstract method:
>> >
>> > public RecordReader createRecordReader(InputSplit split,
>> >                TaskAttemptContext context)
>> >         throws IOException, InterruptedException {...}
>> >
>> > I have also overridden the recordreader class:
>> >
>> > public class MultiLineFileRecordReader extends
>> > RecordReader<LongWritable, Text>
>> > {...}
>> >
>> > and in the job configuration, specified this new InputFormat class:
>> >
>> > job.setInputFormatClass(MultiLineFileInputFormat.class);
>> >
>> > --------------------------------------------------------------------------
>> > When I  run this new map/reduce program, i get the following java error:
>> > --------------------------------------------------------------------------
>> > Exception in thread "main" java.lang.RuntimeException:
>> > java.lang.NoSuchMethodException: CustomRecordReader
>> > $MultiLineFileInputFormat.<init>()
>> >        at
>> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>> >        at
>> > org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>> >        at
>> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>> >        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>> >        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>> >        at CustomRecordReader.main(CustomRecordReader.java:257)
>> > Caused by: java.lang.NoSuchMethodException: CustomRecordReader
>> > $MultiLineFileInputFormat.<init>()
>> >        at java.lang.Class.getConstructor0(Class.java:2706)
>> >        at java.lang.Class.getDeclaredConstructor(Class.java:1985)
>> >        at
>> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>> >        ... 5 more
>> >
>> >
>>
>
>

Re: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Kunal Gupta <ku...@techlead-india.com>.
Can you kindly guide me on what initialisation I need to do in the
constructor of the implementing class, MultiLineFileInputFormat?

I was following the sample provided on this Yahoo page:

http://developer.yahoo.com/hadoop/tutorial/module5.html#fileformat




On Tue, 2009-12-01 at 06:45 +0000, Sean Owen wrote:
> It sounds like you have no provided a no-arg constructor in
> MultiLineFileInputFormat.
> 
> On Tue, Dec 1, 2009 at 6:17 AM, Kunal Gupta <ku...@techlead-india.com> wrote:
> > Can someone explain how to override the "FileInputFormat" and
> > "RecordReader" in order to be able to read multiple lines of text from
> > input files in a single map task?
> >
> > Here the key will be the offset of the first line of text and value will
> > be the N lines of text.
> >
> > I have overridden the class FileInputFormat:
> >
> > public class MultiLineFileInputFormat
> >        extends FileInputFormat<LongWritable, Text>{
> > ...
> > }
> >
> > and implemented the abstract method:
> >
> > public RecordReader createRecordReader(InputSplit split,
> >                TaskAttemptContext context)
> >         throws IOException, InterruptedException {...}
> >
> > I have also overridden the recordreader class:
> >
> > public class MultiLineFileRecordReader extends
> > RecordReader<LongWritable, Text>
> > {...}
> >
> > and in the job configuration, specified this new InputFormat class:
> >
> > job.setInputFormatClass(MultiLineFileInputFormat.class);
> >
> > --------------------------------------------------------------------------
> > When I  run this new map/reduce program, i get the following java error:
> > --------------------------------------------------------------------------
> > Exception in thread "main" java.lang.RuntimeException:
> > java.lang.NoSuchMethodException: CustomRecordReader
> > $MultiLineFileInputFormat.<init>()
> >        at
> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
> >        at
> > org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
> >        at
> > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
> >        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
> >        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
> >        at CustomRecordReader.main(CustomRecordReader.java:257)
> > Caused by: java.lang.NoSuchMethodException: CustomRecordReader
> > $MultiLineFileInputFormat.<init>()
> >        at java.lang.Class.getConstructor0(Class.java:2706)
> >        at java.lang.Class.getDeclaredConstructor(Class.java:1985)
> >        at
> > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
> >        ... 5 more
> >
> >
> 


Re: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Sean Owen <sr...@gmail.com>.
It sounds like you have not provided a no-arg constructor in
MultiLineFileInputFormat.

On Tue, Dec 1, 2009 at 6:17 AM, Kunal Gupta <ku...@techlead-india.com> wrote:
> Can someone explain how to override the "FileInputFormat" and
> "RecordReader" in order to be able to read multiple lines of text from
> input files in a single map task?
>
> Here the key will be the offset of the first line of text and value will
> be the N lines of text.
>
> I have overridden the class FileInputFormat:
>
> public class MultiLineFileInputFormat
>        extends FileInputFormat<LongWritable, Text>{
> ...
> }
>
> and implemented the abstract method:
>
> public RecordReader createRecordReader(InputSplit split,
>                TaskAttemptContext context)
>         throws IOException, InterruptedException {...}
>
> I have also overridden the recordreader class:
>
> public class MultiLineFileRecordReader extends
> RecordReader<LongWritable, Text>
> {...}
>
> and in the job configuration, specified this new InputFormat class:
>
> job.setInputFormatClass(MultiLineFileInputFormat.class);
>
> --------------------------------------------------------------------------
> When I  run this new map/reduce program, i get the following java error:
> --------------------------------------------------------------------------
> Exception in thread "main" java.lang.RuntimeException:
> java.lang.NoSuchMethodException: CustomRecordReader
> $MultiLineFileInputFormat.<init>()
>        at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>        at
> org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>        at
> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>        at CustomRecordReader.main(CustomRecordReader.java:257)
> Caused by: java.lang.NoSuchMethodException: CustomRecordReader
> $MultiLineFileInputFormat.<init>()
>        at java.lang.Class.getConstructor0(Class.java:2706)
>        at java.lang.Class.getDeclaredConstructor(Class.java:1985)
>        at
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>        ... 5 more
>
>

RE: How to write a custom input format and record reader to read multiple lines of text from files

Posted by gu...@orange-ftgroup.com.
Sorry, in my previous post I forgot the attachments.
MultipleLineTextInputFormat reads N lines at a time (the count is customizable with the mapred.textrecordreader.linecount property). Only the simple case where files are non-splittable (compressed, or forced to be non-splittable as MultipleLineTextInputFormat does) is implemented.
The full implementation for splittable files is much harder, because it has to handle the case where a record composed of N lines overlaps a split boundary...
I'm interested in the full implementation.
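
The non-splittable part boils down to something like the following (a sketch
of the idea rather than the exact attached code, written here against the 0.20
"mapreduce" API used elsewhere in this thread; MultipleLineTextRecordReader
stands in for the reader that concatenates the lines):

public class MultipleLineTextInputFormat
        extends FileInputFormat<LongWritable, Text> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        // One split per file, so a record of N lines can never straddle a split boundary.
        return false;
    }

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) {
        int linesPerRecord = context.getConfiguration()
                .getInt("mapred.textrecordreader.linecount", 1);
        return new MultipleLineTextRecordReader(linesPerRecord);
    }
}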


-----Original Message-----
From: guillaume.viland@orange-ftgroup.com [mailto:guillaume.viland@orange-ftgroup.com]
Sent: Tuesday, December 1, 2009 10:22
To: mapreduce-user@hadoop.apache.org
Subject: RE: How to write a custom input format and record reader to read multiple lines of text from files

I've developed a version of a MultipleLineTextInputFormat for hadoop 0.19. I think it is not perfect but it works for my needs.
I've attached the code, feel free to improve or use it. Do not hesitate to contact me if you improve the code.




-----Original Message-----
From: Kunal Gupta [mailto:kunal@techlead-india.com]
Sent: Tuesday, December 1, 2009 09:50
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to write a custom input format and record reader to read multiple lines of text from files

NLineInputFormat will help in splitting N lines of text for each Mapper,
but it will still pass single line of text to each call to the Map
function.

I want N lines of text to be passed as 'value' to the Map function.

By extending FileInputFormat and RecordReader classes i am concatinating
N lines of text and setting that as the 'value'.

But this program is not running. Probably some initialization error.

I am intimating the framework to use my extended classes as InputFormat:

job.setInputFormatClass(MultiLineFileInputFormat.class);

On Tue, 2009-12-01 at 13:53 +0530, Amogh Vasekar wrote:
> Hi,
> The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less
> the same, and should help you guide writing custom input format :)
> 
> Amogh
> 
> 
> On 12/1/09 11:47 AM, "Kunal Gupta" <ku...@techlead-india.com> wrote:
> 
>         Can someone explain how to override the "FileInputFormat" and
>         "RecordReader" in order to be able to read multiple lines of
>         text from
>         input files in a single map task?
>         
>         Here the key will be the offset of the first line of text and
>         value will
>         be the N lines of text.
>         
>         I have overridden the class FileInputFormat:
>         
>         public class MultiLineFileInputFormat
>                 extends FileInputFormat<LongWritable, Text>{
>         ...
>         }
>         
>         and implemented the abstract method:
>         
>         public RecordReader createRecordReader(InputSplit split,
>                         TaskAttemptContext context)
>                  throws IOException, InterruptedException {...}
>         
>         I have also overridden the recordreader class:
>         
>         public class MultiLineFileRecordReader extends
>         RecordReader<LongWritable, Text>
>         {...}
>         
>         and in the job configuration, specified this new InputFormat
>         class:
>         
>         job.setInputFormatClass(MultiLineFileInputFormat.class);
>         
>         --------------------------------------------------------------------------
>         When I  run this new map/reduce program, i get the following
>         java error:
>         --------------------------------------------------------------------------
>         Exception in thread "main" java.lang.RuntimeException:
>         java.lang.NoSuchMethodException: CustomRecordReader
>         $MultiLineFileInputFormat.<init>()
>                 at
>         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>                 at
>         org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>                 at
>         org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>                 at
>         org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>                 at
>         org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>                 at
>         CustomRecordReader.main(CustomRecordReader.java:257)
>         Caused by: java.lang.NoSuchMethodException: CustomRecordReader
>         $MultiLineFileInputFormat.<init>()
>                 at java.lang.Class.getConstructor0(Class.java:2706)
>                 at
>         java.lang.Class.getDeclaredConstructor(Class.java:1985)
>                 at
>         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>                 ... 5 more
>         
>         



RE: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Kunal Gupta <ku...@techlead-india.com>.
I am extending the class FileInputFormat. This class has an abstract
method, createRecordReader. I have implemented the method, but running
the program still gives me constructor errors.

I tried passing FileInputFormat itself as my InputFormat class in the job
configuration, and as expected it gave me an instantiation error:

(InstantiationExceptionConstructorAccessorImpl.java:30)

But I was hoping that after implementing the abstract method of
FileInputFormat this issue would not arise.

What can I do to correctly extend the FileInputFormat class and use it
as my custom InputFormat?

On Tue, 2009-12-01 at 10:21 +0100, guillaume.viland@orange-ftgroup.com
wrote:
> I've developed a version of a MultipleLineTextInputFormat for hadoop 0.19. I think it is not perfect but it works for my needs.
> I've attached the code, feel free to improve or use it. Do not hesitate to contact me if you improve the code.
> 
> 
> 
> 
> -----Original Message-----
> From: Kunal Gupta [mailto:kunal@techlead-india.com]
> Sent: Tuesday, December 1, 2009 09:50
> To: mapreduce-user@hadoop.apache.org
> Subject: Re: How to write a custom input format and record reader to read multiple lines of text from files
> 
> NLineInputFormat will help in splitting N lines of text for each Mapper,
> but it will still pass single line of text to each call to the Map
> function.
> 
> I want N lines of text to be passed as 'value' to the Map function.
> 
> By extending FileInputFormat and RecordReader classes i am concatinating
> N lines of text and setting that as the 'value'.
> 
> But this program is not running. Probably some initialization error.
> 
> I am intimating the framework to use my extended classes as InputFormat:
> 
> job.setInputFormatClass(MultiLineFileInputFormat.class);
> 
> On Tue, 2009-12-01 at 13:53 +0530, Amogh Vasekar wrote:
> > Hi,
> > The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less
> > the same, and should help you guide writing custom input format :)
> > 
> > Amogh
> > 
> > 
> > On 12/1/09 11:47 AM, "Kunal Gupta" <ku...@techlead-india.com> wrote:
> > 
> >         Can someone explain how to override the "FileInputFormat" and
> >         "RecordReader" in order to be able to read multiple lines of
> >         text from
> >         input files in a single map task?
> >         
> >         Here the key will be the offset of the first line of text and
> >         value will
> >         be the N lines of text.
> >         
> >         I have overridden the class FileInputFormat:
> >         
> >         public class MultiLineFileInputFormat
> >                 extends FileInputFormat<LongWritable, Text>{
> >         ...
> >         }
> >         
> >         and implemented the abstract method:
> >         
> >         public RecordReader createRecordReader(InputSplit split,
> >                         TaskAttemptContext context)
> >                  throws IOException, InterruptedException {...}
> >         
> >         I have also overridden the recordreader class:
> >         
> >         public class MultiLineFileRecordReader extends
> >         RecordReader<LongWritable, Text>
> >         {...}
> >         
> >         and in the job configuration, specified this new InputFormat
> >         class:
> >         
> >         job.setInputFormatClass(MultiLineFileInputFormat.class);
> >         
> >         --------------------------------------------------------------------------
> >         When I  run this new map/reduce program, i get the following
> >         java error:
> >         --------------------------------------------------------------------------
> >         Exception in thread "main" java.lang.RuntimeException:
> >         java.lang.NoSuchMethodException: CustomRecordReader
> >         $MultiLineFileInputFormat.<init>()
> >                 at
> >         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
> >                 at
> >         org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
> >                 at
> >         org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
> >                 at
> >         org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
> >                 at
> >         org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
> >                 at
> >         CustomRecordReader.main(CustomRecordReader.java:257)
> >         Caused by: java.lang.NoSuchMethodException: CustomRecordReader
> >         $MultiLineFileInputFormat.<init>()
> >                 at java.lang.Class.getConstructor0(Class.java:2706)
> >                 at
> >         java.lang.Class.getDeclaredConstructor(Class.java:1985)
> >                 at
> >         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
> >                 ... 5 more
> >         
> >         
> 
> 
> 


RE: How to write a custom input format and record reader to read multiple lines of text from files

Posted by gu...@orange-ftgroup.com.
I've developed a version of a MultipleLineTextInputFormat for Hadoop 0.19. I think it is not perfect, but it works for my needs.
I've attached the code; feel free to improve or use it. Do not hesitate to contact me if you improve the code.




-----Original Message-----
From: Kunal Gupta [mailto:kunal@techlead-india.com]
Sent: Tuesday, December 1, 2009 09:50
To: mapreduce-user@hadoop.apache.org
Subject: Re: How to write a custom input format and record reader to read multiple lines of text from files

NLineInputFormat will help in splitting N lines of text for each Mapper,
but it will still pass single line of text to each call to the Map
function.

I want N lines of text to be passed as 'value' to the Map function.

By extending FileInputFormat and RecordReader classes i am concatinating
N lines of text and setting that as the 'value'.

But this program is not running. Probably some initialization error.

I am intimating the framework to use my extended classes as InputFormat:

job.setInputFormatClass(MultiLineFileInputFormat.class);

On Tue, 2009-12-01 at 13:53 +0530, Amogh Vasekar wrote:
> Hi,
> The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less
> the same, and should help you guide writing custom input format :)
> 
> Amogh
> 
> 
> On 12/1/09 11:47 AM, "Kunal Gupta" <ku...@techlead-india.com> wrote:
> 
>         Can someone explain how to override the "FileInputFormat" and
>         "RecordReader" in order to be able to read multiple lines of
>         text from
>         input files in a single map task?
>         
>         Here the key will be the offset of the first line of text and
>         value will
>         be the N lines of text.
>         
>         I have overridden the class FileInputFormat:
>         
>         public class MultiLineFileInputFormat
>                 extends FileInputFormat<LongWritable, Text>{
>         ...
>         }
>         
>         and implemented the abstract method:
>         
>         public RecordReader createRecordReader(InputSplit split,
>                         TaskAttemptContext context)
>                  throws IOException, InterruptedException {...}
>         
>         I have also overridden the recordreader class:
>         
>         public class MultiLineFileRecordReader extends
>         RecordReader<LongWritable, Text>
>         {...}
>         
>         and in the job configuration, specified this new InputFormat
>         class:
>         
>         job.setInputFormatClass(MultiLineFileInputFormat.class);
>         
>         --------------------------------------------------------------------------
>         When I  run this new map/reduce program, i get the following
>         java error:
>         --------------------------------------------------------------------------
>         Exception in thread "main" java.lang.RuntimeException:
>         java.lang.NoSuchMethodException: CustomRecordReader
>         $MultiLineFileInputFormat.<init>()
>                 at
>         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>                 at
>         org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>                 at
>         org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>                 at
>         org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>                 at
>         org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>                 at
>         CustomRecordReader.main(CustomRecordReader.java:257)
>         Caused by: java.lang.NoSuchMethodException: CustomRecordReader
>         $MultiLineFileInputFormat.<init>()
>                 at java.lang.Class.getConstructor0(Class.java:2706)
>                 at
>         java.lang.Class.getDeclaredConstructor(Class.java:1985)
>                 at
>         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>                 ... 5 more
>         
>         



Re: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Kunal Gupta <ku...@techlead-india.com>.
NLineInputFormat will help in splitting the input so that each mapper gets
N lines of text, but it will still pass a single line of text to each call
to the map function.

I want N lines of text to be passed as the 'value' to the map function.

By extending the FileInputFormat and RecordReader classes I am concatenating
N lines of text and setting that as the 'value'.

But this program is not running. Probably some initialization error.

I am telling the framework to use my extended class as the InputFormat:

job.setInputFormatClass(MultiLineFileInputFormat.class);
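
The core of the record reader is roughly the following (a sketch of the
approach rather than my exact code; it delegates single-line reading to
LineRecordReader, hard-codes N for illustration, and ignores the case where a
group of N lines would cross a split boundary):

public class MultiLineFileRecordReader extends RecordReader<LongWritable, Text> {

    private static final int N = 5;  // illustrative; this could come from the job configuration
    private final LineRecordReader lineReader = new LineRecordReader();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        StringBuilder lines = new StringBuilder();
        int read = 0;
        while (read < N && lineReader.nextKeyValue()) {
            if (read == 0) {
                // The key is the byte offset of the first line in the group.
                key.set(lineReader.getCurrentKey().get());
            } else {
                lines.append('\n');
            }
            lines.append(lineReader.getCurrentValue().toString());
            read++;
        }
        if (read == 0) {
            return false;  // nothing left in this split
        }
        value.set(lines.toString());
        return true;
    }

    @Override
    public LongWritable getCurrentKey() { return key; }

    @Override
    public Text getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException, InterruptedException {
        return lineReader.getProgress();
    }

    @Override
    public void close() throws IOException {
        lineReader.close();
    }
}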

On Tue, 2009-12-01 at 13:53 +0530, Amogh Vasekar wrote:
> Hi,
> The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less
> the same, and should help you guide writing custom input format :)
> 
> Amogh
> 
> 
> On 12/1/09 11:47 AM, "Kunal Gupta" <ku...@techlead-india.com> wrote:
> 
>         Can someone explain how to override the "FileInputFormat" and
>         "RecordReader" in order to be able to read multiple lines of
>         text from
>         input files in a single map task?
>         
>         Here the key will be the offset of the first line of text and
>         value will
>         be the N lines of text.
>         
>         I have overridden the class FileInputFormat:
>         
>         public class MultiLineFileInputFormat
>                 extends FileInputFormat<LongWritable, Text>{
>         ...
>         }
>         
>         and implemented the abstract method:
>         
>         public RecordReader createRecordReader(InputSplit split,
>                         TaskAttemptContext context)
>                  throws IOException, InterruptedException {...}
>         
>         I have also overridden the recordreader class:
>         
>         public class MultiLineFileRecordReader extends
>         RecordReader<LongWritable, Text>
>         {...}
>         
>         and in the job configuration, specified this new InputFormat
>         class:
>         
>         job.setInputFormatClass(MultiLineFileInputFormat.class);
>         
>         --------------------------------------------------------------------------
>         When I  run this new map/reduce program, i get the following
>         java error:
>         --------------------------------------------------------------------------
>         Exception in thread "main" java.lang.RuntimeException:
>         java.lang.NoSuchMethodException: CustomRecordReader
>         $MultiLineFileInputFormat.<init>()
>                 at
>         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
>                 at
>         org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
>                 at
>         org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
>                 at
>         org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
>                 at
>         org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
>                 at
>         CustomRecordReader.main(CustomRecordReader.java:257)
>         Caused by: java.lang.NoSuchMethodException: CustomRecordReader
>         $MultiLineFileInputFormat.<init>()
>                 at java.lang.Class.getConstructor0(Class.java:2706)
>                 at
>         java.lang.Class.getDeclaredConstructor(Class.java:1985)
>                 at
>         org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
>                 ... 5 more
>         
>         


Re: How to write a custom input format and record reader to read multiple lines of text from files

Posted by Amogh Vasekar <am...@yahoo-inc.com>.
Hi,
The NLineInputFormat (o.a.h.mapreduce.lib.input) achieves more or less the same thing, and should help guide you in writing a custom input format :)
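
For reference, the old-API counterpart (org.apache.hadoop.mapred.lib.NLineInputFormat)
is driven by a single property; a rough sketch of its use, assuming I remember
the property name correctly (MyDriver is just a placeholder for the job's main
class):

JobConf conf = new JobConf(MyDriver.class);
conf.setInt("mapred.line.input.format.linespermap", 10);   // N lines per input split
conf.setInputFormat(org.apache.hadoop.mapred.lib.NLineInputFormat.class);
// Each split now spans 10 lines, but map() is still called once per line.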

Amogh


On 12/1/09 11:47 AM, "Kunal Gupta" <ku...@techlead-india.com> wrote:

Can someone explain how to override the "FileInputFormat" and
"RecordReader" in order to be able to read multiple lines of text from
input files in a single map task?

Here the key will be the offset of the first line of text and value will
be the N lines of text.

I have overridden the class FileInputFormat:

public class MultiLineFileInputFormat
        extends FileInputFormat<LongWritable, Text>{
...
}

and implemented the abstract method:

public RecordReader createRecordReader(InputSplit split,
                TaskAttemptContext context)
         throws IOException, InterruptedException {...}

I have also overridden the recordreader class:

public class MultiLineFileRecordReader extends
RecordReader<LongWritable, Text>
{...}

and in the job configuration, specified this new InputFormat class:

job.setInputFormatClass(MultiLineFileInputFormat.class);

--------------------------------------------------------------------------
When I  run this new map/reduce program, i get the following java error:
--------------------------------------------------------------------------
Exception in thread "main" java.lang.RuntimeException:
java.lang.NoSuchMethodException: CustomRecordReader
$MultiLineFileInputFormat.<init>()
        at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:115)
        at
org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:882)
        at
org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
        at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
        at CustomRecordReader.main(CustomRecordReader.java:257)
Caused by: java.lang.NoSuchMethodException: CustomRecordReader
$MultiLineFileInputFormat.<init>()
        at java.lang.Class.getConstructor0(Class.java:2706)
        at java.lang.Class.getDeclaredConstructor(Class.java:1985)
        at
org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:109)
        ... 5 more