Posted to common-user@hadoop.apache.org by Saptarshi Guha <sa...@gmail.com> on 2009/08/10 00:38:51 UTC

LineReader, Buffering for FileInputFormat

Hello,
I am using TextInputFormat and its associated LineReader. The
RecordReader for this class
reads the key and value using LineReader.
My question is: does LineReader hit the disk every time it needs to read a
line?
I notice it uses DataInputStream; does that do some internal buffering?

I guess it would be a performance hit if LineReader read from disk every
time it needs to fetch a line,
so I'm guessing it reads a chunk and parses lines from the chunk, but I
didn't see that happening.

I am using Hadoop 0.20.

Any comments would be appreciated.

Regards
Saptarshi

Re: LineReader, Buffering for FileInputFormat

Posted by Saptarshi Guha <sa...@gmail.com>.

Thank you. Is 64 KB a good choice? In my experience, there is a trade-off
between using large chunks and the time taken to read each chunk.
I wonder if a larger value would be better.
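
One rough way to reason about this trade-off is to count how often the
underlying stream is actually hit for a given buffer size. The sketch below
uses plain java.io rather than Hadoop, and the names BufferSizeProbe,
CountingStream and underlyingReads are made up for illustration. It only
counts read calls and says nothing about wall-clock time on a real disk or
HDFS. (If I recall correctly, Hadoop's LineReader takes its buffer size from
the io.file.buffer.size property when it is constructed with a Configuration,
but please check that against the version you run.)

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Hypothetical helper for illustration; not part of Hadoop.
class BufferSizeProbe {

    /** Counts every read call that reaches the wrapped (unbuffered) stream. */
    static final class CountingStream extends FilterInputStream {
        int calls = 0;

        CountingStream(InputStream in) {
            super(in);
        }

        @Override
        public int read() throws IOException {
            calls++;
            return super.read();
        }

        @Override
        public int read(byte[] b, int off, int len) throws IOException {
            calls++;
            return super.read(b, off, len);
        }
    }

    /** Reads totalBytes one byte at a time through a buffer of bufferSize. */
    static int underlyingReads(int totalBytes, int bufferSize) {
        // An in-memory array stands in for the file (an assumption of this sketch).
        CountingStream source =
                new CountingStream(new ByteArrayInputStream(new byte[totalBytes]));
        try (InputStream buffered = new BufferedInputStream(source, bufferSize)) {
            while (buffered.read() != -1) {
                // Drain the stream; only the call count matters here.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return source.calls;
    }

    public static void main(String[] args) {
        int total = 1 << 20; // 1 MiB of dummy data
        for (int size : new int[] {4 * 1024, 64 * 1024, 1 << 20}) {
            System.out.println(size + " byte buffer: "
                    + underlyingReads(total, size) + " underlying reads");
        }
    }
}
```

The number of underlying reads shrinks roughly in proportion to
totalBytes / bufferSize, so each further increase in buffer size saves fewer
calls in absolute terms while costing more memory per open stream. Whether a
value above 64 KB helps in practice is something to measure on the actual
cluster.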

On Sun, Aug 9, 2009 at 7:41 PM, Harold Valdivia Garcia <
harold.valdivia@upr.edu> wrote:

> You can see these two files:
>
>
> http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?revision=796148
>
>
> http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/LineReader.java?revision=786726
>
> I think it doesn't access the disk every time it reads a line.
>
> LineReader reads 64 KB into a buffer and then parses the data into
> lines.

Re: LineReader, Buffering for FileInputFormat

Posted by Harold Valdivia Garcia <ha...@upr.edu>.

You can see these two files:

http://svn.apache.org/viewvc/hadoop/mapreduce/trunk/src/java/org/apache/hadoop/mapreduce/lib/input/LineRecordReader.java?revision=796148

http://svn.apache.org/viewvc/hadoop/common/trunk/src/java/org/apache/hadoop/util/LineReader.java?revision=786726

I think it doesn't access the disk every time it reads a line.

LineReader reads 64 KB into a buffer and then parses the data into
lines.
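
To make the idea concrete, here is a minimal sketch of "fill a buffer, then
cut lines out of it". It is a simplified stand-in written against plain
java.io, not the actual Hadoop source: the class name ChunkedLineReader, its
fields and the refills counter are invented for this example, and it only
treats '\n' as a line terminator.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical class for illustration; this is NOT Hadoop's LineReader.
class ChunkedLineReader {
    private final InputStream in;
    private final byte[] buffer; // filled by one bulk read at a time
    private int length = 0;      // number of valid bytes in buffer
    private int pos = 0;         // next unread byte in buffer
    private int refills = 0;     // bulk reads that returned data

    ChunkedLineReader(InputStream in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    /** Returns the next line without its '\n', or null at end of stream. */
    String readLine() throws IOException {
        ByteArrayOutputStream line = new ByteArrayOutputStream();
        boolean consumed = false;
        while (true) {
            if (pos >= length) {
                // Buffer exhausted: this is the only place the stream is touched.
                int n = in.read(buffer);
                pos = 0;
                length = Math.max(n, 0);
                if (n <= 0) {
                    return consumed ? decode(line) : null;
                }
                refills++;
            }
            int start = pos;
            while (pos < length && buffer[pos] != '\n') {
                pos++;
            }
            line.write(buffer, start, pos - start);
            consumed = consumed || pos > start;
            if (pos < length) {
                pos++; // skip the newline itself
                return decode(line);
            }
            // No newline in the rest of the buffer: loop, refill, keep appending.
        }
    }

    int refills() {
        return refills;
    }

    private static String decode(ByteArrayOutputStream bytes) {
        return new String(bytes.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Convenience wrapper: reads every line of an in-memory byte array. */
    static List<String> readAllLines(byte[] data, int bufferSize) {
        ChunkedLineReader reader =
                new ChunkedLineReader(new ByteArrayInputStream(data), bufferSize);
        List<String> lines = new ArrayList<>();
        try {
            for (String l = reader.readLine(); l != null; l = reader.readLine()) {
                lines.add(l);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "first\nsecond\nthird\n".getBytes(StandardCharsets.UTF_8);
        ChunkedLineReader reader =
                new ChunkedLineReader(new ByteArrayInputStream(data), 64 * 1024);
        for (String l = reader.readLine(); l != null; l = reader.readLine()) {
            System.out.println(l);
        }
        System.out.println("refills: " + reader.refills());
    }
}
```

Most calls to readLine() are served from the bytes already in memory; the
underlying stream is only read again when the buffer runs out, including the
case where a line straddles two chunks.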






-- 
******************************************
Harold Dwight Valdivia Garcia
Graduate Student
M.S Computer Engineering
University of Puerto Rico, Mayaguez Campus
******************************************