Posted to general@hadoop.apache.org by Mike Anderson <mi...@mit.edu> on 2009/05/21 20:13:29 UTC

interface to HDFS

Hello, I'm working on a Hadoop project where my data is composed of many
HTML files (websites). One aspect of the project involves traditional
MapReduce analysis on the data set, but I would also like to use Hadoop as a
sort of "cache server," i.e., having the ability to retrieve the HTML for a
website I have already visited.

My question is this: what is the best way to interact with HDFS to make
simple existence queries and to retrieve specific files for reading? Ideally I
would like to do this at the application level (most likely in Ruby). So far
I have explored mounting HDFS in userspace with one of the FUSE packages,
but I ran into quite a bit of difficulty installing either of the two popular
ones. My second option seems to be Hive, but I haven't been able to find any
bindings for it in Ruby or Python.

Any suggestions or advice would be greatly appreciated!

Cheers,
Mike

Re: interface to HDFS

Posted by Amr Awadallah <aa...@cloudera.com>.
Mike,

webdav should work for you, see:

https://issues.apache.org/jira/browse/HADOOP-496
http://www.hadoop.iponweb.net/Home/hdfs-over-webdav/webdav-server
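Since WebDAV is just HTTP underneath, a minimal sketch of the existence-check and
retrieval side from Ruby needs only the standard library. The host, port, and path
below are placeholders for wherever your WebDAV gateway actually runs, not anything
mandated by the server:

```ruby
require 'net/http'
require 'uri'

# Build the HTTP URL for an HDFS path exposed over WebDAV.
# Host and port are assumptions; point them at your gateway.
def webdav_uri(host, port, hdfs_path)
  URI::HTTP.build(host: host, port: port, path: hdfs_path)
end

# Existence query: a HEAD request, treating any 2xx response as "exists".
def hdfs_exists?(uri)
  res = Net::HTTP.start(uri.host, uri.port) { |http| http.head(uri.path) }
  res.is_a?(Net::HTTPSuccess)
end

# Retrieval: a plain GET returns the file body as a string.
def hdfs_read(uri)
  Net::HTTP.get(uri)
end
```

Usage would be along the lines of
`uri = webdav_uri("localhost", 9800, "/cache/example.html")` followed by
`hdfs_read(uri) if hdfs_exists?(uri)`.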

That said, note that HDFS is not optimized for handling lots of small
files: it will store them fine, and disk space will not be wasted, but
the NameNode will not scale very well (the default block size is 64 MB,
and HTML docs are far smaller than that). See this blog post for hints
on how to work with many small files:

http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
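The usual workaround described there is to pack many small files into a few
large ones (SequenceFiles or Hadoop Archives) and keep an index from name to
location. Stripped of the Hadoop specifics, the idea looks roughly like this
illustrative Ruby sketch (the file layout here is invented for illustration,
not the actual SequenceFile format):

```ruby
# Pack small files into one large file; return an index mapping each
# name to its (byte offset, byte length) within the packed file.
def pack(files, packed_path)
  index = {}
  File.open(packed_path, "wb") do |out|
    files.each do |name, body|
      index[name] = [out.pos, body.bytesize]
      out.write(body)
    end
  end
  index
end

# Read one file back by seeking to its recorded offset.
def unpack(packed_path, index, name)
  offset, length = index.fetch(name)
  File.open(packed_path, "rb") do |f|
    f.seek(offset)
    f.read(length)
  end
end
```

With a layout like this, the NameNode tracks one large file instead of
millions of tiny ones, and random access stays cheap via the index.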

Cheers,

-- amr
