Posted to common-user@hadoop.apache.org by tsuraan <ts...@gmail.com> on 2009/06/26 23:16:52 UTC

HDFS Random Access

All the documentation for HDFS says that it's for large streaming
jobs, but I couldn't find an explicit answer to this, so I'll try
asking here.  How is HDFS's random seek performance within an
FSDataInputStream?  I use lucene with a lot of indices (potentially
thousands), so I was thinking of putting them into HDFS and
reimplementing my search as a Hadoop map-reduce.  I've noticed that
lucene tends to do a bit of random seeking when searching though; I
don't believe that it guarantees that all seeks be to increasing file
positions either.

Would HDFS be a bad fit for an access pattern that involves seeks to
random positions within a stream?

Also, is getFileStatus the typical way of getting the length of a file
in HDFS, or is there some method on FSDataInputStream that I'm not
seeing?

Please cc: me on any reply; I'm not on the hadoop list.  Thanks!

Re: HDFS Random Access

Posted by Raghu Angadi <ra...@yahoo-inc.com>.

Yes, FSDataInputStream allows random access. There are two ways to read x
bytes at a position p:
1) in.seek(p); in.read(buf, 0, x);
2) in.read(p, buf, 0, x);
These two have slightly different semantics. The second one is preferred 
and is easier for HDFS to optimize further.
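
As a rough illustration of how the semantics differ: a seek followed by a
read moves the stream's current offset, while a positional read leaves it
alone. The sketch below does not use Hadoop at all. It assumes that
java.nio's FileChannel on a local temp file is a fair stand-in, since
FileChannel.read(ByteBuffer, long) likewise does not modify the channel's
position. The class and method names (PositionalReadDemo, seekThenRead,
positionalRead) are made up for the example.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Local-file analogy only: FileChannel stands in for FSDataInputStream.
class PositionalReadDemo {

    // Style 1: move the channel's own position to p, then read x bytes.
    // Afterwards the position sits just past the bytes that were read.
    static byte[] seekThenRead(FileChannel ch, long p, int x) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(x);
        ch.position(p);
        while (buf.hasRemaining() && ch.read(buf) != -1) {
            // keep reading until the buffer is full or EOF is reached
        }
        return toArray(buf);
    }

    // Style 2: positional read of x bytes at p.
    // The channel's own position is not touched.
    static byte[] positionalRead(FileChannel ch, long p, int x) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(x);
        long pos = p;
        while (buf.hasRemaining()) {
            int n = ch.read(buf, pos);
            if (n == -1) {
                break; // EOF before x bytes were available
            }
            pos += n;
        }
        return toArray(buf);
    }

    // Copy the bytes actually read out of the buffer.
    private static byte[] toArray(ByteBuffer buf) {
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("random-access", ".bin");
        try {
            Files.write(tmp, "0123456789abcdef".getBytes(StandardCharsets.US_ASCII));
            try (FileChannel ch = FileChannel.open(tmp, StandardOpenOption.READ)) {
                byte[] a = positionalRead(ch, 10, 4);
                // prints "abcd position=0"
                System.out.println(new String(a, StandardCharsets.US_ASCII)
                        + " position=" + ch.position());

                byte[] b = seekThenRead(ch, 10, 4);
                // prints "abcd position=14"
                System.out.println(new String(b, StandardCharsets.US_ASCII)
                        + " position=" + ch.position());
            }
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```

Because the positional form does not depend on a shared current offset,
several readers can use the same open stream without stepping on each
other's seeks, which is presumably part of why it is the easier one to
optimize.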

Random access should be pretty good with HDFS, and it is steadily gaining
users and therefore importance. HBase is one of those users.

Just yesterday I attached a benchmark, with comparisons to random access
on the native filesystem, to https://issues.apache.org/jira/browse/HDFS-236 .

As of now, the average overhead is about 2 ms on top of the 9-10 ms a
native read takes. A few fairly simple fixes are possible to reduce
this gap.

I think getFileStatus() is the way to find the length, though there 
might have been a call added to FSDataInputStream recently. I am not sure.

Raghu.