Posted to common-dev@hadoop.apache.org by Ari King <ar...@gmail.com> on 2014/12/10 21:20:04 UTC

Hadoop without HDFS

Hi,

I'm doing a research paper on Hadoop -- specifically relating to its
dependency on HDFS. I need to determine if and how HDFS can be replaced. As
I understand it, there are a number of organizations that have produced
HDFS alternatives that support the Hadoop ecosystem, i.e. MapReduce, Hive,
HBase, etc.

With the "if" part being answered, I'd appreciate insight/guidance on the
"how" part. Essentially, where can I find information on what MapReduce and
the other Hadoop subprojects require of the underlying file system and how
these subprojects expect to interact with the file system.

Thanks!

Best,
Ari

Re: Hadoop without HDFS

Posted by Jay Vyas <ja...@gmail.com>.
Yup, that's a great summary.  More details...

The HCFS wiki page describes some tests you can run against your FileSystem plugin class, which you will package in a jar file as described below.

In general, Hadoop apps are written against the FileSystem interface, whose implementation is loaded at runtime. So as long as you configure the core-site file correctly, and qualify your file paths with the URI scheme you mapped to the right Java classes (packaged in a jar somewhere in hadoop/lib, for example), everything should work.
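
For example, a minimal core-site.xml mapping could look like the following; the "myfs" scheme and the class name are hypothetical placeholders for whatever your plugin actually provides:

<configuration>
  <!-- Map the "myfs://" URI scheme to the plugin's FileSystem class.
       Scheme and class name here are made-up examples. -->
  <property>
    <name>fs.myfs.impl</name>
    <value>org.example.hadoop.fs.MyFileSystem</value>
  </property>
  <!-- Optionally make it the default filesystem for unqualified paths. -->
  <property>
    <name>fs.defaultFS</name>
    <value>myfs://storage-host/</value>
  </property>
</configuration>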

We've tested Solr, HBase, Mahout and many other systems that use the FileSystem interface in various ways. In general, it works pretty well... with the exception of Impala, which is HDFS specific (it checks at runtime that you're running HDFS, and if not it throws an error).

A good suite of tests to run for HCFS compatibility is the BigTop smoke tests, which exercise Pig, Flume, MapReduce, Mahout and more; we use those to validate GlusterFS.



> On Dec 10, 2014, at 3:50 PM, Roman Shaposhnik <ro...@shaposhnik.org> wrote:
> 
>> On Wed, Dec 10, 2014 at 12:20 PM, Ari King <ar...@gmail.com> wrote:
>> Hi,
>> 
>> I'm doing a research paper on Hadoop -- specifically relating to its
>> dependency on HDFS. I need to determine if and how HDFS can be replaced. As
>> I understand it, there are a number of organizations that have produced
>> HDFS alternatives that support the Hadoop ecosystem, i.e. MapReduce, Hive,
>> HBase, etc.
> 
> There's a difference between producing a storage solution with an
> on-the-wire protocol compatible with HDFS vs. an HCFS one (see
> below).
> 
>> With the "if" part being answered, I'd appreciate insight/guidance on the
>> "how" part. Essentially, where can I find information on what MapReduce and
>> the other Hadoop subprojects require of the underlying file system and how
>> these subprojects expect to interact with the file system.
> 
> It really boils down to the storage solution exposing a Hadoop Compatible
> Filesystem API. This should give you a sufficient overview of the details:
>    https://wiki.apache.org/hadoop/HCFS
> 
> A lot of open source (Ceph, GlusterFS, etc.) and closed source storage solutions
> (Isilon, etc.) do that and can be used as a replacement for HDFS.
> 
> The real question, of course, is the different tradeoffs that the
> various implementations are making. That's where it gets fascinating.
> 
> Thanks,
> Roman.

Re: Hadoop without HDFS

Posted by Roman Shaposhnik <ro...@shaposhnik.org>.
On Wed, Dec 10, 2014 at 12:20 PM, Ari King <ar...@gmail.com> wrote:
> Hi,
>
> I'm doing a research paper on Hadoop -- specifically relating to its
> dependency on HDFS. I need to determine if and how HDFS can be replaced. As
> I understand it, there are a number of organizations that have produced
> HDFS alternatives that support the Hadoop ecosystem, i.e. MapReduce, Hive,
> HBase, etc.

There's a difference between producing a storage solution with an
on-the-wire protocol compatible with HDFS vs. an HCFS one (see
below).

> With the "if" part being answered, I'd appreciate insight/guidance on the
> "how" part. Essentially, where can I find information on what MapReduce and
> the other Hadoop subprojects require of the underlying file system and how
> these subprojects expect to interact with the file system.

It really boils down to the storage solution exposing a Hadoop Compatible
Filesystem API. This should give you a sufficient overview of the details:
    https://wiki.apache.org/hadoop/HCFS
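
To make the "how" concrete, here is a minimal sketch of the surface such a plugin implements: the abstract methods of org.apache.hadoop.fs.FileSystem. The class and package names are hypothetical, and the stubs just throw instead of talking to a real backing store:

import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.util.Progressable;

// Hypothetical plugin class; the methods below are the minimum that
// org.apache.hadoop.fs.FileSystem requires a subclass to provide.
public class MyFileSystem extends FileSystem {
  private URI uri;
  private Path workingDir = new Path("/");

  @Override
  public void initialize(URI name, Configuration conf) throws IOException {
    super.initialize(name, conf); // wires up statistics, config, etc.
    this.uri = name;              // connect to the backing store here
  }

  @Override
  public URI getUri() { return uri; }

  @Override
  public FSDataInputStream open(Path f, int bufferSize) throws IOException {
    throw new UnsupportedOperationException("read from the backing store");
  }

  @Override
  public FSDataOutputStream create(Path f, FsPermission permission,
      boolean overwrite, int bufferSize, short replication, long blockSize,
      Progressable progress) throws IOException {
    throw new UnsupportedOperationException("create in the backing store");
  }

  @Override
  public FSDataOutputStream append(Path f, int bufferSize,
      Progressable progress) throws IOException {
    throw new UnsupportedOperationException("append");
  }

  @Override
  public boolean rename(Path src, Path dst) throws IOException {
    throw new UnsupportedOperationException("rename, ideally atomic");
  }

  @Override
  public boolean delete(Path f, boolean recursive) throws IOException {
    throw new UnsupportedOperationException("delete");
  }

  @Override
  public FileStatus[] listStatus(Path f) throws IOException {
    throw new UnsupportedOperationException("list a directory");
  }

  @Override
  public void setWorkingDirectory(Path dir) { workingDir = dir; }

  @Override
  public Path getWorkingDirectory() { return workingDir; }

  @Override
  public boolean mkdirs(Path f, FsPermission permission) throws IOException {
    throw new UnsupportedOperationException("mkdirs");
  }

  @Override
  public FileStatus getFileStatus(Path f) throws IOException {
    throw new UnsupportedOperationException("stat a path");
  }
}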

A lot of open source (Ceph, GlusterFS, etc.) and closed source storage solutions
(Isilon, etc.) do that and can be used as a replacement for HDFS.

The real question, of course, is the different tradeoffs that the
various implementations are making. That's where it gets fascinating.

Thanks,
Roman.

Re: Hadoop without HDFS

Posted by Steve Loughran <st...@hortonworks.com>.
One more thing: the "if" excludes object stores which don't offer
consistency and atomic create-no-overwrite and rename. You can't run all
Hadoop apps directly on top of Amazon S3 without extra work (see Netflix's
S3mper). Object stores do not always behave as filesystems, even if they
implement the relevant Hadoop APIs (some do, though, like Google's and
Microsoft's).
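
To illustrate, here is a small sketch of the commit pattern those two guarantees protect. The FileSystem calls are the real API; the class name and paths are made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CommitSketch {
  public static void main(String[] args) throws Exception {
    // Resolves whatever fs.defaultFS points at (HDFS, an HCFS plugin, ...).
    FileSystem fs = FileSystem.get(new Configuration());

    // 1. Atomic create-no-overwrite: with overwrite=false, create() must
    //    fail if the path already exists, giving exactly-one-winner
    //    semantics for lock files and task outputs. (Rerunning this
    //    sketch therefore fails here by design.)
    Path lock = new Path("/tmp/commit-sketch/_lock");
    FSDataOutputStream lockStream = fs.create(lock, false);
    lockStream.close();

    // 2. Atomic rename: write to a temporary location, then rename into
    //    place. On HDFS this is one atomic metadata operation; on many
    //    object stores it degrades to a non-atomic copy-then-delete.
    Path attempt = new Path("/tmp/commit-sketch/_temporary/part-00000");
    FSDataOutputStream task = fs.create(attempt, true);
    task.writeBytes("task output\n");
    task.close();

    Path committed = new Path("/tmp/commit-sketch/part-00000");
    if (!fs.rename(attempt, committed)) {
      throw new IllegalStateException("commit rename failed");
    }
  }
}

On an eventually consistent store the rename may also appear to succeed while a later listing still misses the committed file; that listing gap is what S3mper papers over.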

HADOOP-9361 and the filesystem documentation attempt to formally specify
what an FS should do:

http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html

where "formally" means "try to rigorously define what HDFS does and how
other filesystems (especially POSIX ones) differ".


HADOOP-9565 is looking at an explicit ObjectStore subclass of FileSystem to
provide more detail on how object stores behave.

On 10 December 2014 at 20:20, Ari King <ar...@gmail.com> wrote:

> Hi,
>
> I'm doing a research paper on Hadoop -- specifically relating to its
> dependency on HDFS. I need to determine if and how HDFS can be replaced. As
> I understand it, there are a number of organizations that have produced
> HDFS alternatives that support the Hadoop ecosystem, i.e. MapReduce, Hive,
> HBase, etc.
>
> With the "if" part being answered, I'd appreciate insight/guidance on the
> "how" part. Essentially, where can I find information on what MapReduce and
> the other Hadoop subprojects require of the underlying file system and how
> these subprojects expect to interact with the file system.
>
> Thanks!
>
> Best,
> Ari
>
