Posted to common-dev@hadoop.apache.org by Bwolen Yang <wb...@gmail.com> on 2007/06/12 19:35:57 UTC

\r\n problem in LineRecordReader.java

I wrote a map/reduce job to process an ASCII input file, and noticed that
the "Map input bytes" stat does not match the actual file size (only
8.5MB read of a 52MB file).   I simplified the program to an identity
map/reduce
   public static void main(String[] args) throws IOException {
     JobConf conf = new JobConf(CanonicalizeFriends.class);
     conf.setJobName("Identity Text Tester");
     // (input/output path and format configuration elided)
     JobClient.runJob(conf);
   }

and still have the same problem.

It turned out that this input file has "\r\n" as line breaks. If those
are replaced with "\n", then "Map input bytes" matches properly.
The problem seems to be related to the read-till-newline code in the
constructor, where readLine is called on the raw FSDataInputStream
("fileIn") instead of on a BufferedInputStream.

  public LineRecordReader(Configuration job, FileSplit split)
      ....
    } else if (start != 0) {
      fileIn.seek(start - 1);
      LineRecordReader.readLine(fileIn, null);
      start = fileIn.getPos();
    }

    this.in = new BufferedInputStream(in);

If I replace this so that all readLine() calls go through the
BufferedInputStream, as in
    boolean skipFirstLine = false;
    ...
    } else if (start != 0) {
      --start;
      fileIn.seek(start);
      skipFirstLine = true;
    }

    this.in = new BufferedInputStream(in);
    if (skipFirstLine) {
      start += LineRecordReader.readLine(this.in, null);
    }

the problem goes away.    It seems that readLine() calling
    in.mark(1)
on fileIn (FSDataInputStream) somehow corrupts the underlying
buffer.   I checked this against the original code in two ways:
   1. Just comment out "in.mark(1)"; since this input file always
has \r\n, mark()/reset() is not needed for this input.   This case
gets the correct number of input bytes.

OR

   2. Leave the original code as is, and throw an exception if "\r" is
not followed by "\n".  In this case, the exception does get thrown.  I
instrumented the code to output its file position.  While it
accurately pinpoints where the "\r" is, the original file does have a
"\n" right after it.

----------

I dug around the code for a few hours but wasn't fluent enough in it to
pinpoint why this happens.   Any suggestions?

thanks

bwolen

Re: \r\n problem in LineRecordReader.java

Posted by Doug Cutting <cu...@apache.org>.
Bwolen Yang wrote:
> done.
>  https://issues.apache.org/jira/browse/HADOOP-1489

Thanks!

For extra-credit, attach a unit test that demonstrates it and a patch 
that fixes it!

http://wiki.apache.org/lucene-hadoop/HowToContribute

> a couple questions.
> 
> - When including code/stack traces, my input was nicely indented, but
> after submission the text formatting was all messed up
> (e.g., the entire code segment became one line).   Is an attachment the
> only way to do this right?

You can edit the issue and use wiki markup.  There are preview and help 
buttons beside the text box.  {code}foo{code} or {noformat}foo{noformat} 
are probably what you want.

> - To track this, I essentially was doing "printf" debugging.   It is
> rather slow/painful. 

That's actually my preferred method of debugging!

> I generally find the ability to debugger-attach to programs and just
> roam around the stack helps greatly in understanding complex programs
> quickly (even if I am not looking for bugs).

If you run with LocalRunner (instead of JobTracker and TaskTracker) then 
normal debuggers should work well.  But, when things are distributed, 
debugging is more difficult.

Doug

Re: \r\n problem in LineRecordReader.java

Posted by Bwolen Yang <wb...@gmail.com>.
> You should file a Jira on this (with your traces etc).

done.
  https://issues.apache.org/jira/browse/HADOOP-1489

a couple questions.

- When including code/stack traces, my input was nicely indented, but
after submission the text formatting was all messed up
(e.g., the entire code segment became one line).   Is an attachment the
only way to do this right?

- To track this down, I was essentially doing "printf" debugging, which
is rather slow/painful.  Is there a way to suspend the tasks, which
would then allow people to attach a debugger in time to trace through
the execution?   If not, how do people generally debug problems like this?

I generally find the ability to debugger-attach to programs and just
roam around the stack helps greatly in understanding complex programs
quickly (even if I am not looking for bugs).


bwolen

Re: \r\n problem in LineRecordReader.java

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Hi Bwolen,

Thanks for investigating this.

You should file a Jira on this (with your traces etc.). Let me know if 
you want me to file it.

 > and I think this is precisely why, when you have mark()/reset() (as in
 > the \r\n case), the read can be small.  I can dig into Java's
 > BufferedInputStream code to check the details, but it seems pretty
 > clear from the actual execution stack.

As you noted, this could be caused by "marking" the input stream. I think 
Hadoop's dependence on the read length passed down by BufferedInputStream 
should be fixed. Alternately, we could make this stream non-markable (we 
would still be depending on BufferedInputStream behavior that is not part 
of its contract).
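The "non-markable" alternative could look roughly like the following
hypothetical sketch (the class name is illustrative; this is not Hadoop
code). A caller that checks markSupported() would then have to buffer
any peeked byte itself rather than calling mark(1) on the checksummed
stream.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

// Wrapper that advertises no mark/reset support, forcing callers to
// handle lookahead themselves instead of marking the underlying stream.
public class NonMarkableInputStream extends FilterInputStream {
    public NonMarkableInputStream(InputStream in) { super(in); }

    @Override public boolean markSupported() { return false; }
    @Override public synchronized void mark(int readlimit) { /* no-op */ }
    @Override public synchronized void reset() throws IOException {
        throw new IOException("mark/reset not supported");
    }

    public static void main(String[] args) throws IOException {
        InputStream in =
            new NonMarkableInputStream(new ByteArrayInputStream("abc".getBytes()));
        System.out.println(in.markSupported());  // false
        System.out.println((char) in.read());    // a
    }
}
```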

You might find HADOOP-1470 and HADOOP-1134(last 4-6 comments) relevant.

Raghu.

Bwolen Yang wrote:
>> These are values taken at runtime (I capture them via exceptions when
>> the result is 0 and print out the values).
> 
> the \r\n problem was observed on the 0.13.0 release.
> To study the behavior, I instrumented the Hadoop source from the head of 
> the tree.
> 
> More specifically, attached are two sample stacks.  (I have readBuffer()
> throw when it gets 0 bytes, and have the input checker catch the
> exception and rethrow.  This way, I catch the values from both
> caller and callee.)
> 
> On a separate note, if the assumption len >= bytesPerSum exists, would
> it be OK to throw an exception when it is violated?   Most of the time
> (e.g., in crawling/indexing), people won't notice that some part of the
> input data is getting thrown away.   It would be a lot easier to debug
> as the code changes (and the assumption gets violated), and the cost in
> this case is probably not too bad, as a good part of the cost is in the
> network and disk.

Re: \r\n problem in LineRecordReader.java

Posted by Bwolen Yang <wb...@gmail.com>.
> These are values taken at runtime (I capture them via exceptions when
> the result is 0 and print out the values).

the \r\n problem was observed on the 0.13.0 release.
To study the behavior, I instrumented the Hadoop source from the head of the tree.

More specifically, attached are two sample stacks.  (I have readBuffer()
throw when it gets 0 bytes, and have the input checker catch the
exception and rethrow.  This way, I catch the values from both
caller and callee.)

On a separate note, if the assumption len >= bytesPerSum exists, would
it be OK to throw an exception when it is violated?   Most of the time
(e.g., in crawling/indexing), people won't notice that some part of the
input data is getting thrown away.   It would be a lot easier to debug
as the code changes (and the assumption gets violated), and the cost in
this case is probably not too bad, as a good part of the cost is in the
network and disk.

bwolen
-------------------------------------

java.lang.RuntimeException: end of read()
in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=127
pos=45223932 res=-999999
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
	at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
	at java.io.FilterInputStream.read(FilterInputStream.java:66)
	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
Caused by: java.lang.RuntimeException: end of read()
datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=45223932
len=-381 bytesPerSum=512 eof=false read=0
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
	... 11 more

-------------------------
java.lang.RuntimeException: end of read()
in=org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker len=127
pos=45223932 res=-999999
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:50)
	at java.io.BufferedInputStream.fill(BufferedInputStream.java:218)
	at java.io.BufferedInputStream.read(BufferedInputStream.java:237)
	at org.apache.hadoop.fs.FSDataInputStream$Buffer.read(FSDataInputStream.java:116)
	at java.io.FilterInputStream.read(FilterInputStream.java:66)
	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:132)
	at org.apache.hadoop.mapred.LineRecordReader.readLine(LineRecordReader.java:124)
	at org.apache.hadoop.mapred.LineRecordReader.next(LineRecordReader.java:108)
	at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:168)
	at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:44)
	at org.apache.hadoop.mapred.MapTask.run(MapTask.java:186)
	at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1720)
Caused by: java.lang.RuntimeException: end of read()
datas=org.apache.hadoop.dfs.DFSClient$DFSDataInputStream pos=45223932
len=-381 bytesPerSum=512 eof=false read=0
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.readBuffer(ChecksumFileSystem.java:200)
	at org.apache.hadoop.fs.ChecksumFileSystem$FSInputChecker.read(ChecksumFileSystem.java:175)
	at org.apache.hadoop.fs.FSDataInputStream$PositionCache.read(FSDataInputStream.java:47)
	... 11 more

Re: \r\n problem in LineRecordReader.java

Posted by Bwolen Yang <wb...@gmail.com>.
On 6/13/07, Raghu Angadi <ra...@yahoo-inc.com> wrote:
> Bwolen Yang wrote:
> > Here is probably the cause of this bug:
> >
> >    public int read(byte b[], int off, int len) throws IOException {
> >      // make sure that it ends at a checksum boundary
> >      long curPos = getPos();
> >      long endPos = len+curPos/bytesPerSum*bytesPerSum;
> >      return readBuffer(b, off, (int)(endPos-curPos));
> >    }
> >
> > Here, the caller calls the function with 127 bytes, and bytesPerSum is 256.
>
> Is this from looking at the code or you actually saw the values like
> this at runtime?

These are values taken at runtime (I capture them via exceptions when
the result is 0 and print out the values).

> I think 'len' is never supposed to be less than bytesPerChecksum because
> there is a BufferedInputStream between with a buffer size of
> io.buffer.size (default 4096). So this buffer size is supposed to be
> larger than bytesPerChecksum (requirement changes with HADOOP-1450 where
> it uses buffer size equal to bytesPerChecksum).

bwolen

Re: \r\n problem in LineRecordReader.java

Posted by Raghu Angadi <ra...@yahoo-inc.com>.
Bwolen Yang wrote:
> Here is probably the cause of this bug:
> 
>    public int read(byte b[], int off, int len) throws IOException {
>      // make sure that it ends at a checksum boundary
>      long curPos = getPos();
>      long endPos = len+curPos/bytesPerSum*bytesPerSum;
>      return readBuffer(b, off, (int)(endPos-curPos));
>    }
> 
> Here, the caller calls the function with 127 bytes, and bytesPerSum is 256.

Is this from looking at the code or you actually saw the values like 
this at runtime?

I think 'len' is never supposed to be less than bytesPerChecksum because 
there is a BufferedInputStream between with a buffer size of 
io.buffer.size (default 4096). So this buffer size is supposed to be 
larger than bytesPerChecksum (requirement changes with HADOOP-1450 where 
it uses buffer size equal to bytesPerChecksum).

Raghu.

Re: \r\n problem in LineRecordReader.java

Posted by Bwolen Yang <wb...@gmail.com>.
> > Here, the caller calls the function with 127 bytes, and bytesPerSum is 256.

make that bytesPerSum=512.  (sorry... I have been staring at this for a
day... my eyes are blurry by now :)

bwolen

Re: \r\n problem in LineRecordReader.java

Posted by Bwolen Yang <wb...@gmail.com>.
Blah, forgot to mention: the code came from
   ChecksumFileSystem.FSInputChecker.read()


On 6/12/07, Bwolen Yang <wb...@gmail.com> wrote:
> Here is probably the cause of this bug:
>
>     public int read(byte b[], int off, int len) throws IOException {
>       // make sure that it ends at a checksum boundary
>       long curPos = getPos();
>       long endPos = len+curPos/bytesPerSum*bytesPerSum;
>       return readBuffer(b, off, (int)(endPos-curPos));
>     }
>
> Here, the caller calls the function with 127 bytes, and bytesPerSum is 256.
> So endPos-curPos becomes a negative number (e.g., -381), and
> readBuffer() gets called with a negative length, for which the
> underlying DFSInputStream::read() naturally returns 0.  When
> readBuffer() sees 0 bytes read, it assumes end-of-file and returns -1.
> The rest of the system treats -1 as end-of-file, and hence a large
> part of the input file is not read.
>
> I guess the small read is somehow triggered by mark()/reset(), which
> in turn triggers this bug.  Any suggestions for how to fix this?
>
> btw, appended is roughly what the stack looks like on a distributed
> file system read.   Hopefully it will help the next person who wants
> to understand how read() happens.   Note that FSDataInputStream gets
> wrapped twice.
>
> bwolen
>
>
> What happens on a DFS read: note that some of these are a subclass
> calling its superclass's read(); others are a class calling its member
> variable's read().
>
>   FSDataInputStream::read() == DataInputStream::read()
>   Buffer::read()
>   BufferedInputStream::read()
>   PositionCache::read()
>   FSInputChecker::read()
>   DFSDataInputStream::read() == FSDataInputStream::read() == DataInputStream...
>   Buffer::read()
>   BufferedInputStream::read()
>   PositionCache::read()
>   DFSInputStream::read()
>   DataInputStream::read()    (blockStream)
>   BufferedInputStream::read()
>   socket
>

Re: \r\n problem in LineRecordReader.java

Posted by Bwolen Yang <wb...@gmail.com>.
Here is probably the cause of this bug:

    public int read(byte b[], int off, int len) throws IOException {
      // make sure that it ends at a checksum boundary
      long curPos = getPos();
      long endPos = len+curPos/bytesPerSum*bytesPerSum;
      return readBuffer(b, off, (int)(endPos-curPos));
    }

Here, the caller calls the function with 127 bytes, and bytesPerSum is 256.
So endPos-curPos becomes a negative number (e.g., -381), and
readBuffer() gets called with a negative length, for which the
underlying DFSInputStream::read() naturally returns 0.  When
readBuffer() sees 0 bytes read, it assumes end-of-file and returns -1.
The rest of the system treats -1 as end-of-file, and hence a large
part of the input file is not read.
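The arithmetic can be reproduced in isolation. A minimal self-contained
sketch, using the values reported in this thread (the class and method
names here are illustrative, not Hadoop's):

```java
public class ChecksumReadLength {
    // Reproduces the arithmetic from FSInputChecker.read() with the
    // reported values: len=127, curPos=45223932, bytesPerSum=512.
    // Integer division rounds curPos *down* to the previous checksum
    // boundary, so when len is smaller than curPos's offset past that
    // boundary, endPos - curPos goes negative.
    static long requestedLength(long curPos, int len, int bytesPerSum) {
        long endPos = len + curPos / bytesPerSum * bytesPerSum;
        return endPos - curPos;
    }

    public static void main(String[] args) {
        // 45223932 % 512 == 508, so endPos - curPos == 127 - 508 == -381.
        System.out.println(requestedLength(45223932L, 127, 512));  // -381
    }
}
```

This matches the len=-381 seen in the readBuffer() frames of the stack
traces.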

I guess the small read is somehow triggered by mark()/reset(), which
in turn triggers this bug.  Any suggestions for how to fix this?

btw, appended is roughly what the stack looks like on a distributed
file system read.   Hopefully it will help the next person who wants
to understand how read() happens.   Note that FSDataInputStream gets
wrapped twice.

bwolen


What happens on a DFS read: note that some of these are a subclass
calling its superclass's read(); others are a class calling its member
variable's read().

  FSDataInputStream::read() == DataInputStream::read()
  Buffer::read()
  BufferedInputStream::read()
  PositionCache::read()
  FSInputChecker::read()
  DFSDataInputStream::read() == FSDataInputStream::read() == DataInputStream...
  Buffer::read()
  BufferedInputStream::read()
  PositionCache::read()
  DFSInputStream::read()
  DataInputStream::read()    (blockStream)
  BufferedInputStream::read()
  socket