Posted to general@hadoop.apache.org by Mike Anderson <mi...@mit.edu> on 2009/05/21 20:13:29 UTC
interface to HDFS
Hello, I'm working on a Hadoop project where my data consists of many
HTML files (websites). One aspect of the project involves traditional
MapReduce analysis on the data set, but I would also like to use Hadoop as a
sort of "cache server," i.e., having the ability to retrieve the HTML for a
website that I have already visited.
My question is this: what is the best way to interact with HDFS to make
simple existence queries and retrieve specific files for reading? Ideally I
would like to do this at the application level (most likely written in
Ruby). So far I have explored using one of the FUSE packages to mount
HDFS in userspace, but I ran into quite a bit of difficulty
installing either of the two popular packages. My second option seems to be
Hive, but I haven't been able to find any bindings for Ruby, Python, etc.
Any suggestions or advice would be greatly appreciated!
Cheers,
Mike
Re: interface to HDFS
Posted by Amr Awadallah <aa...@cloudera.com>.
Mike,
webdav should work for you, see:
https://issues.apache.org/jira/browse/HADOOP-496
http://www.hadoop.iponweb.net/Home/hdfs-over-webdav/webdav-server
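Once a WebDAV gateway like the one above is running, it can be driven from Ruby with nothing but the standard library. A minimal sketch follows; the host, port, and cache path are all hypothetical placeholders, not values from the WebDAV project itself:

```ruby
require "net/http"
require "uri"

WEBDAV_HOST = "namenode.example.com" # hypothetical gateway host
WEBDAV_PORT = 9800                   # hypothetical gateway port

# Build the HTTP URI that the WebDAV gateway maps to an HDFS path.
def hdfs_uri(path)
  URI::HTTP.build(host: WEBDAV_HOST, port: WEBDAV_PORT, path: path)
end

# Existence query: a HEAD request succeeds (2xx) when the file is there.
def hdfs_exists?(path)
  uri = hdfs_uri(path)
  Net::HTTP.start(uri.host, uri.port) do |http|
    http.head(uri.path).is_a?(Net::HTTPSuccess)
  end
end

# Fetch the cached HTML for a page that was stored earlier.
def hdfs_read(path)
  Net::HTTP.get(hdfs_uri(path))
end
```

That covers both halves of the question (existence check plus read) without FUSE or any native bindings, at the cost of one HTTP round trip per lookup.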
That said, note that HDFS is not optimized for handling lots of small
files. It will store them fine and disk space will not be wasted, but
the NameNode will not scale very well (the default block size is 64 MB,
and HTML docs are way smaller than that). See this blog post for hints
on how to work with many small files:
http://www.cloudera.com/blog/2009/02/02/the-small-files-problem/
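If WebDAV or FUSE proves painful to set up, one more low-tech route is shelling out to the `hadoop fs` command-line tool from Ruby. A sketch, assuming only that the `hadoop` binary is on the client's PATH (the function names here are made up for illustration):

```ruby
# Argument vector for `hadoop fs -test -e <path>`, which exits with
# status 0 when the path exists in HDFS.
def hdfs_test_cmd(path)
  ["hadoop", "fs", "-test", "-e", path]
end

# Existence query: system() returns true iff the command exited 0.
def hdfs_present?(path)
  system(*hdfs_test_cmd(path))
end

# Read a whole file by streaming `hadoop fs -cat` through a pipe.
def hdfs_cat(path)
  IO.popen(["hadoop", "fs", "-cat", path], &:read)
end
```

Forking a JVM per request is slow, so this is only reasonable for occasional lookups, but it needs no extra servers or gems.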
Cheers,
-- amr