Posted to user@hadoop.apache.org by Chen He <ai...@gmail.com> on 2012/08/29 07:43:26 UTC

Custom InputFormat error

Hi guys

I ran into an interesting problem while implementing my own custom InputFormat,
which extends FileInputFormat. (I rewrote the RecordReader class but not
the InputSplit class.)

My RecordReader treats the following format as one basic record. (It
extends LineRecordReader and returns a record once it meets #Trailer# and
the record contains #Header#. I have a single input file composed of many
of these basic records.)

#Header#
.....(many lines, may be 0 lines or 1000 lines, it varies)
#Trailer#

Everything works fine when the number of these basic records in the file is
an integer multiple of the number of mappers. For example, 2 mappers with two
basic records in the input file, or 3 mappers with 6 basic records.

However, with 4 mappers and only 3 basic records in the input file (not an
integer multiple), the final output is incorrect, and the "Map Input
Bytes" job counter is also less than the input file size. How can I
fix this? Do I need to rewrite the InputSplit?

Any reply will be appreciated!

Regards!

Chen
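[Editor's note: the header/trailer grouping described above can be sketched in plain Java, independent of Hadoop. This is a minimal, illustrative model of the parsing only — the class and method names are made up, and a real RecordReader must additionally deal with split boundaries, which is the crux of the thread below.]

```java
import java.util.ArrayList;
import java.util.List;

public class RecordParser {
    // Collect the lines between each #Header#/#Trailer# pair into one record.
    // A trailer seen before any header (i.e. we started mid-record) is skipped,
    // and a record with no closing trailer is dropped.
    static List<String> parseRecords(String[] lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null;                // null = not inside a record
        for (String line : lines) {
            if (line.equals("#Header#")) {
                current = new StringBuilder();       // start a new record
            } else if (line.equals("#Trailer#")) {
                if (current != null) {               // only emit complete records
                    records.add(current.toString());
                    current = null;
                }
            } else if (current != null) {
                current.append(line).append('\n');   // body line (0..1000 of them)
            }
        }
        return records;
    }

    public static void main(String[] args) {
        String[] file = {"#Header#", "a", "b", "#Trailer#", "#Header#", "#Trailer#"};
        System.out.println(parseRecords(file).size());  // prints 2
    }
}
```

Note the zero-line body case ("#Header#" immediately followed by "#Trailer#") yields an empty but valid record, matching the "may be 0 lines" remark above.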

Re: Custom InputFormat error

Posted by Chen He <ai...@gmail.com>.
Hi Harsh

That means I would lose input data, because Hadoop's FileSplit evenly
splits the input file according to "numSplits". But I want to prevent
this. Is there any way?

Regards!

Chen

On Wed, Aug 29, 2012 at 9:49 PM, Harsh J <ha...@cloudera.com> wrote:

> No, what I mean is that your RecordReader should be able to handle a
> case where it may start from middle of a record and hence not be able
> to read any record (i.e. return false or whatever right up front).
>
> On Wed, Aug 29, 2012 at 1:27 PM, Chen He <ai...@gmail.com> wrote:
> > Hi Harsh
> >
> > Thank you for your reply. Do you mean I need to change the FileSplit to
> > avoid those errors I mentioned happen?
> >
> > Regards!
> >
> > Chen
> >
> > On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <ha...@cloudera.com> wrote:
> >>
> >> Hi Chen,
> >>
> >> Does your record reader and mapper handle the case where one map split
> >> may not exactly get the whole record? Your case is not very different
> >> from the newlines logic presented here:
> >> http://wiki.apache.org/hadoop/HadoopMapReduce
> >>
> >> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <ai...@gmail.com> wrote:
> >> > Hi guys
> >> >
> >> > I met a interesting problem when I implement my own custom InputFormat
> >> > which
> >> > extends the FileInputFormat.(I rewrite the RecordReader class but not
> >> > the
> >> > InputSplit class)
> >> >
> >> > My recordreader will take following format as a basic record: (my
> >> > recordreader extends the LineRecordReader. It returns a record if it
> >> > meets
> >> > #Trailer# and contains #Header#. I only have one input file that is
> >> > composed
> >> > of many of following basic record)
> >> >
> >> > #Header#
> >> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> >> > #Trailer#
> >> >
> >> > Everything works fine if above basic input unit in a file is integer
> >> > times
> >> > of mapper. For example, I use 2 mappers and there are two basic
> records
> >> > in
> >> > my input file. Or I use 3 mappers and there are 6 basic units in the
> >> > input
> >> > file.
> >> >
> >> > However, if I use 4 mappers but there are 3 basic units in the input
> >> > file(not integer times). The final output is incorrect. The "Map Input
> >> > Bytes" in the job counter is also less than the input file size. How
> can
> >> > I
> >> > fix it? Do I need to rewrite the inputSplit?
> >> >
> >> > Any reply will be appreciated!
> >> >
> >> > Regards!
> >> >
> >> > Chen
> >>
> >>
> >>
> >> --
> >> Harsh J
> >
> >
>
>
>
> --
> Harsh J
>
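[Editor's note: the data loss Chen is worried about is exactly what happens with a reader that never crosses its split boundary. A plain-Java illustration (not Hadoop code; illustrative names, and line indices stand in for byte offsets) — note it also reproduces the symptom of the "Map Input Bytes" counter coming up short, since the dropped lines are never consumed:]

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSplitDemo {
    // A naive reader that only emits records lying ENTIRELY inside [start, end).
    // Records straddling a split boundary are silently dropped.
    static List<String> readStrict(String[] lines, int start, int end) {
        List<String> out = new ArrayList<>();
        StringBuilder rec = null;
        for (int i = start; i < end; i++) {
            if (lines[i].equals("#Header#")) {
                rec = new StringBuilder();           // start of a record
            } else if (lines[i].equals("#Trailer#")) {
                if (rec != null) { out.add(rec.toString()); rec = null; }
            } else if (rec != null) {
                rec.append(lines[i]).append('\n');
            }
        }
        return out;  // a record cut off by 'end' is never emitted
    }

    public static void main(String[] args) {
        // 3 records split evenly among "4 mappers", as in the original post.
        String[] file = {"#Header#","a","#Trailer#",
                         "#Header#","b","b2","#Trailer#",
                         "#Header#","c","#Trailer#"};
        int n = file.length, splits = 4, total = 0;
        for (int s = 0; s < splits; s++) {
            total += readStrict(file, s * n / splits, (s + 1) * n / splits).size();
        }
        System.out.println(total);  // prints 1: two of the three records are lost
    }
}
```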

Re: Custom InputFormat error

Posted by Harsh J <ha...@cloudera.com>.
No, what I mean is that your RecordReader should be able to handle a
case where it may start from middle of a record and hence not be able
to read any record (i.e. return false or whatever right up front).

On Wed, Aug 29, 2012 at 1:27 PM, Chen He <ai...@gmail.com> wrote:
> Hi Harsh
>
> Thank you for your reply. Do you mean I need to change the FileSplit to
> avoid those errors I mentioned happen?
>
> Regards!
>
> Chen
>
> On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <ha...@cloudera.com> wrote:
>>
>> Hi Chen,
>>
>> Does your record reader and mapper handle the case where one map split
>> may not exactly get the whole record? Your case is not very different
>> from the newlines logic presented here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <ai...@gmail.com> wrote:
>> > Hi guys
>> >
>> > I met a interesting problem when I implement my own custom InputFormat
>> > which
>> > extends the FileInputFormat.(I rewrite the RecordReader class but not
>> > the
>> > InputSplit class)
>> >
>> > My recordreader will take following format as a basic record: (my
>> > recordreader extends the LineRecordReader. It returns a record if it
>> > meets
>> > #Trailer# and contains #Header#. I only have one input file that is
>> > composed
>> > of many of following basic record)
>> >
>> > #Header#
>> > .....(many lines, may be 0 lines or 1000 lines, it varies)
>> > #Trailer#
>> >
>> > Everything works fine if above basic input unit in a file is integer
>> > times
>> > of mapper. For example, I use 2 mappers and there are two basic records
>> > in
>> > my input file. Or I use 3 mappers and there are 6 basic units in the
>> > input
>> > file.
>> >
>> > However, if I use 4 mappers but there are 3 basic units in the input
>> > file(not integer times). The final output is incorrect. The "Map Input
>> > Bytes" in the job counter is also less than the input file size. How can
>> > I
>> > fix it? Do I need to rewrite the inputSplit?
>> >
>> > Any reply will be appreciated!
>> >
>> > Regards!
>> >
>> > Chen
>>
>>
>>
>> --
>> Harsh J
>
>



-- 
Harsh J
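[Editor's note: the handling Harsh describes can be sketched in plain Java rather than Hadoop code (illustrative names; line indices stand in for byte offsets). Each split's reader skips forward to the first #Header# at or after its start, reads past its end to finish any record it claimed, and a split containing no header emits nothing — so every record is read exactly once, even when the record count is not a multiple of the split count:]

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    // Reader for one split [start, end): claim only records whose #Header#
    // lies inside the split, but keep reading past 'end' to finish the last
    // claimed record. A split containing no header emits nothing.
    static List<String> readSplit(String[] lines, int start, int end) {
        List<String> out = new ArrayList<>();
        int i = start;
        while (i < end) {
            while (i < end && !lines[i].equals("#Header#")) i++;  // skip partial record
            if (i >= end) break;                                   // no header here: emit nothing
            StringBuilder rec = new StringBuilder();
            i++;                                                   // consume #Header#
            while (i < lines.length && !lines[i].equals("#Trailer#")) {
                rec.append(lines[i++]).append('\n');               // may run past 'end'
            }
            i++;                                                   // consume #Trailer#
            out.add(rec.toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // 3 records over "4 mappers": all 3 are recovered, one reader emits none.
        String[] file = {"#Header#","a","#Trailer#",
                         "#Header#","b","b2","#Trailer#",
                         "#Header#","c","#Trailer#"};
        List<String> all = new ArrayList<>();
        int n = file.length, splits = 4;
        for (int s = 0; s < splits; s++) {
            all.addAll(readSplit(file, s * n / splits, (s + 1) * n / splits));
        }
        System.out.println(all.size());  // prints 3
    }
}
```

This is the same convention Hadoop's own line handling uses: the split whose range contains a record's start owns that record, so no FileSplit changes are needed.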

Re: Custom InputFormat error

Posted by Chen He <ai...@gmail.com>.
BTW, I am using the old API from org.apache.hadoop.mapred, not
org.apache.hadoop.mapreduce.

Thanks!

On Wed, Aug 29, 2012 at 2:57 AM, Chen He <ai...@gmail.com> wrote:

> Hi Harsh
>
> Thank you for your reply. Do you mean I need to change the FileSplit to
> avoid those errors I mentioned happen?
>
> Regards!
>
> Chen
>
> On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <ha...@cloudera.com> wrote:
>
>> Hi Chen,
>>
>> Does your record reader and mapper handle the case where one map split
>> may not exactly get the whole record? Your case is not very different
>> from the newlines logic presented here:
>> http://wiki.apache.org/hadoop/HadoopMapReduce
>>
>> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <ai...@gmail.com> wrote:
>> > Hi guys
>> >
>> > I met a interesting problem when I implement my own custom InputFormat
>> which
>> > extends the FileInputFormat.(I rewrite the RecordReader class but not
>> the
>> > InputSplit class)
>> >
>> > My recordreader will take following format as a basic record: (my
>> > recordreader extends the LineRecordReader. It returns a record if it
>> meets
>> > #Trailer# and contains #Header#. I only have one input file that is
>> composed
>> > of many of following basic record)
>> >
>> > #Header#
>> > .....(many lines, may be 0 lines or 1000 lines, it varies)
>> > #Trailer#
>> >
>> > Everything works fine if above basic input unit in a file is integer
>> times
>> > of mapper. For example, I use 2 mappers and there are two basic records
>> in
>> > my input file. Or I use 3 mappers and there are 6 basic units in the
>> input
>> > file.
>> >
>> > However, if I use 4 mappers but there are 3 basic units in the input
>> > file(not integer times). The final output is incorrect. The "Map Input
>> > Bytes" in the job counter is also less than the input file size. How
>> can I
>> > fix it? Do I need to rewrite the inputSplit?
>> >
>> > Any reply will be appreciated!
>> >
>> > Regards!
>> >
>> > Chen
>>
>>
>>
>> --
>> Harsh J
>>
>
>

Re: Custom InputFormat error

Posted by Chen He <ai...@gmail.com>.
Hi Harsh

Thank you for your reply. Do you mean I need to change the FileSplit to
avoid the errors I mentioned?

Regards!

Chen

On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Chen,
>
> Does your record reader and mapper handle the case where one map split
> may not exactly get the whole record? Your case is not very different
> from the newlines logic presented here:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <ai...@gmail.com> wrote:
> > Hi guys
> >
> > I met a interesting problem when I implement my own custom InputFormat
> which
> > extends the FileInputFormat.(I rewrite the RecordReader class but not the
> > InputSplit class)
> >
> > My recordreader will take following format as a basic record: (my
> > recordreader extends the LineRecordReader. It returns a record if it
> meets
> > #Trailer# and contains #Header#. I only have one input file that is
> composed
> > of many of following basic record)
> >
> > #Header#
> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> > #Trailer#
> >
> > Everything works fine if above basic input unit in a file is integer
> times
> > of mapper. For example, I use 2 mappers and there are two basic records
> in
> > my input file. Or I use 3 mappers and there are 6 basic units in the
> input
> > file.
> >
> > However, if I use 4 mappers but there are 3 basic units in the input
> > file(not integer times). The final output is incorrect. The "Map Input
> > Bytes" in the job counter is also less than the input file size. How can
> I
> > fix it? Do I need to rewrite the inputSplit?
> >
> > Any reply will be appreciated!
> >
> > Regards!
> >
> > Chen
>
>
>
> --
> Harsh J
>

Re: Custom InputFormat errer

Posted by Chen He <ai...@gmail.com>.
Hi Harsh

Thank you for your reply. Do you mean I need to change the FileSplit to
avoid those errors I mentioned happen?

Regards!

Chen

On Wed, Aug 29, 2012 at 2:46 AM, Harsh J <ha...@cloudera.com> wrote:

> Hi Chen,
>
> Does your record reader and mapper handle the case where one map split
> may not exactly get the whole record? Your case is not very different
> from the newlines logic presented here:
> http://wiki.apache.org/hadoop/HadoopMapReduce
>
> On Wed, Aug 29, 2012 at 11:13 AM, Chen He <ai...@gmail.com> wrote:
> > Hi guys
> >
> > I met a interesting problem when I implement my own custom InputFormat
> which
> > extends the FileInputFormat.(I rewrite the RecordReader class but not the
> > InputSplit class)
> >
> > My recordreader will take following format as a basic record: (my
> > recordreader extends the LineRecordReader. It returns a record if it
> meets
> > #Trailer# and contains #Header#. I only have one input file that is
> composed
> > of many of following basic record)
> >
> > #Header#
> > .....(many lines, may be 0 lines or 1000 lines, it varies)
> > #Trailer#
> >
> > Everything works fine if above basic input unit in a file is integer
> times
> > of mapper. For example, I use 2 mappers and there are two basic records
> in
> > my input file. Or I use 3 mappers and there are 6 basic units in the
> input
> > file.
> >
> > However, if I use 4 mappers but there are 3 basic units in the input
> > file(not integer times). The final output is incorrect. The "Map Input
> > Bytes" in the job counter is also less than the input file size. How can
> I
> > fix it? Do I need to rewrite the inputSplit?
> >
> > Any reply will be appreciated!
> >
> > Regards!
> >
> > Chen
>
>
>
> --
> Harsh J
>

Re: Custom InputFormat error

Posted by Harsh J <ha...@cloudera.com>.
Hi Chen,

Does your record reader and mapper handle the case where one map split
may not exactly get the whole record? Your case is not very different
from the newlines logic presented here:
http://wiki.apache.org/hadoop/HadoopMapReduce
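
The newline logic translates to the #Header#/#Trailer# format roughly as
follows. This is a standalone sketch of the boundary rule only (plain Java,
no Hadoop classes; the name readSplit and the split arithmetic are
illustrative, not part of any Hadoop API): each split skips a partial
record at its front and reads past its own end to finish its last record,
so every record is read exactly once no matter how the file is split.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
    static final String H = "#Header#", T = "#Trailer#";

    // Return every record whose #Header# begins inside [start, end).
    // The matching #Trailer# may lie beyond 'end'; we read past the
    // split boundary, just as LineRecordReader finishes its last line.
    static List<String> readSplit(String data, int start, int end) {
        List<String> records = new ArrayList<>();
        int pos = data.indexOf(H, start);   // skip any partial record at the front
        while (pos >= 0 && pos < end) {     // claim a record only if its header starts here
            int t = data.indexOf(T, pos);
            if (t < 0) break;               // truncated record: no trailer found
            records.add(data.substring(pos, t + T.length()));
            pos = data.indexOf(H, t + T.length());
        }
        return records;
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 3; i++)        // 3 basic records, as in the question
            sb.append(H).append("\nrecord ").append(i).append("\n").append(T).append("\n");
        String data = sb.toString();

        int numSplits = 4;                  // 4 "mappers" over 3 records
        int total = 0, len = data.length();
        for (int s = 0; s < numSplits; s++) {
            int start = s * len / numSplits, end = (s + 1) * len / numSplits;
            total += readSplit(data, start, end).size();
        }
        System.out.println("records read = " + total);  // prints 3: nothing lost
    }
}
```

With this rule a split whose range contains no record start simply emits
nothing, which is why the RecordReader must be able to return "no record"
right up front.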

On Wed, Aug 29, 2012 at 11:13 AM, Chen He <ai...@gmail.com> wrote:
> Hi guys
>
> I met an interesting problem when implementing my own custom
> InputFormat, which extends FileInputFormat. (I rewrote the
> RecordReader class but not the InputSplit class.)
>
> My RecordReader treats the following format as a basic record (it
> extends LineRecordReader and returns a record once it meets #Trailer#
> and contains #Header#; my single input file is composed of many such
> basic records):
>
> #Header#
> .....(many lines; it varies, from 0 to 1000 lines)
> #Trailer#
>
> Everything works fine if the number of basic records in the file is an
> integer multiple of the number of mappers. For example, I use 2 mappers
> and there are two basic records in my input file, or I use 3 mappers
> and there are 6 basic records.
>
> However, if I use 4 mappers and there are 3 basic records in the input
> file (not an integer multiple), the final output is incorrect. The "Map
> Input Bytes" job counter is also less than the input file size. How can
> I fix it? Do I need to rewrite the InputSplit?
>
> Any reply will be appreciated!
>
> Regards!
>
> Chen



-- 
Harsh J
