Posted to dev@accumulo.apache.org by Dylan Hutchison <dh...@cs.washington.edu> on 2017/01/16 20:16:31 UTC

Running Accumulo on a standard file system, without Hadoop

Hi folks,

A friend of mine asked about running Accumulo on a normal file system in
place of Hadoop, similar to the way MiniAccumulo runs.  How possible is
this, or how much work would it take to do so?

I think my friend is just interested in running on a single node, but I am
curious about both the single-node and distributed (via parallel file
system like Lustre) cases.

Thanks, Dylan

Re: Running Accumulo on a standard file system, without Hadoop

Posted by Christopher <ct...@apache.org>.
My recent blog post about running Accumulo on Fedora 25 describes how to do
this using the RawLocalFileSystem implementation of Hadoop for Accumulo
volumes matching file://

https://accumulo.apache.org/blog/2016/12/19/running-on-fedora-25.html

This works with other packaging also, not just in Fedora 25, but I think
the step-by-step process in my blog post is probably the simplest way to
get started with that scenario. Currently, only version 1.6.6 on Hadoop
2.4.1 is available, though.
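For orientation, the configuration the blog post describes boils down to two settings, sketched below. The property names are the standard Accumulo/Hadoop ones, but the paths and values here are illustrative; follow the post for the exact steps.

```xml
<!-- accumulo-site.xml: point Accumulo volumes at the local file system -->
<property>
  <name>instance.volumes</name>
  <value>file:///var/accumulo</value>
</property>

<!-- core-site.xml: map the file:// scheme to RawLocalFileSystem instead
     of the default checksummed LocalFileSystem -->
<property>
  <name>fs.file.impl</name>
  <value>org.apache.hadoop.fs.RawLocalFileSystem</value>
</property>
```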

On Mon, Jan 16, 2017 at 3:17 PM Dylan Hutchison <dh...@cs.washington.edu>
wrote:

> Hi folks,
>
> A friend of mine asked about running Accumulo on a normal file system in
> place of Hadoop, similar to the way MiniAccumulo runs.  How possible is
> this, or how much work would it take to do so?
>
> I think my friend is just interested in running on a single node, but I am
> curious about both the single-node and distributed (via parallel file
> system like Lustre) cases.
>
> Thanks, Dylan
>
-- 
Christopher

Re: Running Accumulo on a standard file system, without Hadoop

Posted by Keith Turner <ke...@deenlo.com>.
On Mon, Jan 16, 2017 at 5:53 PM, Josh Elser <jo...@gmail.com> wrote:
>
>
> Dylan Hutchison wrote:
>>>
>>> You can configure HDFS to use the RawLocalFileSystem class for file://
>>> URIs, which is what is done for a majority of the integration tests.
>>> Beware: unless you configure the RawLocalFileSystem, the
>>> ChecksumFileSystem (default for file://) will fail miserably around WAL
>>> recovery.
>>>
>>> https://github.com/apache/accumulo/blob/master/test/src/main
>>> /java/org/apache/accumulo/test/BulkImportVolumeIT.java#L61
>>>
>>
>> Hi Josh, are you saying that the ChecksumFileSystem is required or
>> forbidden for WAL recovery?  Looking at the Hadoop code it seems that
>> LocalFileSystem wraps around a RawLocalFileSystem to provide checksum
>> capabilities.  Is that right?
>>
>
> Sorry I wasn't clearer: forbidden. If you use the RawLocalFileSystem,
> you should not see any issues. If you use the ChecksumFileSystem (which
> is the default), you *will* see issues.

The ChecksumFileSystem does nothing for flush; that's why there are WAL
problems.  The RawLocalFileSystem pushes data to the OS (which may
buffer it in memory for a short period) when flush is called.  However,
RawLocalFileSystem does not offer a way to force data to disk.  So
with RawLocalFileSystem you can restart Accumulo processes w/o losing
data.  However, if the OS is restarted then data may be lost.
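Keith's distinction between flushing to the OS and forcing data to disk can be illustrated with plain JDK I/O. This is an analogy, not Hadoop code; the class and file names are made up:

```java
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class FlushVsSync {

    // Writes one log entry, flushes it to the OS page cache, forces it
    // to the physical disk, and returns what can be read back.
    static String writeDurably(Path log) throws IOException {
        try (FileOutputStream out = new FileOutputStream(log.toFile())) {
            out.write("entry-1\n".getBytes(StandardCharsets.UTF_8));
            // flush(): bytes reach the OS buffers. This survives a
            // process restart (like RawLocalFileSystem's flush) but not
            // an OS crash or power loss.
            out.flush();
            // getFD().sync(): bytes reach the disk. This is the step
            // RawLocalFileSystem does not expose, hence the caveat above.
            out.getFD().sync();
        }
        return new String(Files.readAllBytes(log), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path log = Files.createTempFile("wal-demo", ".log");
        System.out.println(writeDurably(log).trim());
        Files.delete(log);
    }
}
```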

Re: Running Accumulo on a standard file system, without Hadoop

Posted by Josh Elser <jo...@gmail.com>.

Dylan Hutchison wrote:
>> You can configure HDFS to use the RawLocalFileSystem class for file://
>> URIs, which is what is done for a majority of the integration tests.
>> Beware: unless you configure the RawLocalFileSystem, the
>> ChecksumFileSystem (default for file://) will fail miserably around WAL
>> recovery.
>>
>> https://github.com/apache/accumulo/blob/master/test/src/main
>> /java/org/apache/accumulo/test/BulkImportVolumeIT.java#L61
>>
> Hi Josh, are you saying that the ChecksumFileSystem is required or
> forbidden for WAL recovery?  Looking at the Hadoop code it seems that
> LocalFileSystem wraps around a RawLocalFileSystem to provide checksum
> capabilities.  Is that right?
>

Sorry I wasn't clearer: forbidden. If you use the RawLocalFileSystem, 
you should not see any issues. If you use the ChecksumFileSystem (which 
is the default), you *will* see issues.

Re: Running Accumulo on a standard file system, without Hadoop

Posted by Dylan Hutchison <dh...@cs.washington.edu>.
On Mon, Jan 16, 2017 at 1:56 PM, Josh Elser <jo...@gmail.com> wrote:

> That's true, but HDFS supports multiple "implementations" based on the
> scheme of the URI being used.
>
> e.g. hdfs:// is mapped to DistributedFileSystem
>
> You can configure HDFS to use the RawLocalFileSystem class for file://
> URIs, which is what is done for a majority of the integration tests.
> Beware: unless you configure the RawLocalFileSystem, the ChecksumFileSystem
> (default for file://) will fail miserably around WAL recovery.
>
> https://github.com/apache/accumulo/blob/master/test/src/main
> /java/org/apache/accumulo/test/BulkImportVolumeIT.java#L61
>
>
Hi Josh, are you saying that the ChecksumFileSystem is required or
forbidden for WAL recovery?  Looking at the Hadoop code it seems that
LocalFileSystem wraps around a RawLocalFileSystem to provide checksum
capabilities.  Is that right?


>
> Dave Marion wrote:
>
>> IIRC, Accumulo *only* uses the HDFS client, so it needs something on the
>> other side that can respond to that protocol. MiniAccumulo starts up
>> MiniHDFS for this. You could run some other type of service locally that is
>> HDFS client compatible (something like Quantcast QFS[1], setting up client
>> [2]). If Accumulo is using something in Hadoop outside of the public client
>> API, this may not work.
>>
>> [1] https://github.com/quantcast/qfs
>> [2] https://github.com/quantcast/qfs/wiki/Migration-Guide
>>
>>
>> -----Original Message-----
>>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>>> Sent: Monday, January 16, 2017 3:17 PM
>>> To: dev@accumulo.apache.org
>>> Subject: Running Accumulo on a standard file system, without Hadoop
>>>
>>> Hi folks,
>>>
>>> A friend of mine asked about running Accumulo on a normal file system in
>>> place of Hadoop, similar to the way MiniAccumulo runs.  How possible is
>>> this,
>>> or how much work would it take to do so?
>>>
>>> I think my friend is just interested in running on a single node, but I
>>> am
>>> curious about both the single-node and distributed (via parallel file
>>> system
>>> like Lustre) cases.
>>>
>>> Thanks, Dylan
>>>
>>
>>

Re: Running Accumulo on a standard file system, without Hadoop

Posted by Josh Elser <jo...@gmail.com>.
That's true, but HDFS supports multiple "implementations" based on the 
scheme of the URI being used.

e.g. hdfs:// is mapped to DistributedFileSystem

You can configure HDFS to use the RawLocalFileSystem class for file:// 
URIs, which is what is done for a majority of the integration tests. 
Beware: unless you configure the RawLocalFileSystem, the 
ChecksumFileSystem (default for file://) will fail miserably around WAL 
recovery.

https://github.com/apache/accumulo/blob/master/test/src/main/java/org/apache/accumulo/test/BulkImportVolumeIT.java#L61

Dave Marion wrote:
> IIRC, Accumulo *only* uses the HDFS client, so it needs something on the other side that can respond to that protocol. MiniAccumulo starts up MiniHDFS for this. You could run some other type of service locally that is HDFS client compatible (something like Quantcast QFS[1], setting up client [2]). If Accumulo is using something in Hadoop outside of the public client API, this may not work.
>
> [1] https://github.com/quantcast/qfs
> [2] https://github.com/quantcast/qfs/wiki/Migration-Guide
>
>
>> -----Original Message-----
>> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
>> Sent: Monday, January 16, 2017 3:17 PM
>> To: dev@accumulo.apache.org
>> Subject: Running Accumulo on a standard file system, without Hadoop
>>
>> Hi folks,
>>
>> A friend of mine asked about running Accumulo on a normal file system in
>> place of Hadoop, similar to the way MiniAccumulo runs.  How possible is this,
>> or how much work would it take to do so?
>>
>> I think my friend is just interested in running on a single node, but I am
>> curious about both the single-node and distributed (via parallel file system
>> like Lustre) cases.
>>
>> Thanks, Dylan
>

RE: Running Accumulo on a standard file system, without Hadoop

Posted by Dave Marion <dl...@comcast.net>.
IIRC, Accumulo *only* uses the HDFS client, so it needs something on the other side that can respond to that protocol. MiniAccumulo starts up MiniHDFS for this. You could run some other type of service locally that is HDFS client compatible (something like Quantcast QFS[1], setting up client [2]). If Accumulo is using something in Hadoop outside of the public client API, this may not work.

[1] https://github.com/quantcast/qfs
[2] https://github.com/quantcast/qfs/wiki/Migration-Guide
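For the QFS route, the client plugs into Hadoop's scheme-to-class mapping. The sketch below uses the filesystem class named in the QFS Hadoop plugin; the metaserver host and port are purely illustrative, so verify everything against the migration guide [2]:

```xml
<!-- core-site.xml sketch: map the qfs:// scheme to the QFS Hadoop client
     (class name from the QFS wiki; confirm for your QFS version) -->
<property>
  <name>fs.qfs.impl</name>
  <value>com.quantcast.qfs.hadoop.QuantcastFileSystem</value>
</property>
```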


> -----Original Message-----
> From: Dylan Hutchison [mailto:dhutchis@cs.washington.edu]
> Sent: Monday, January 16, 2017 3:17 PM
> To: dev@accumulo.apache.org
> Subject: Running Accumulo on a standard file system, without Hadoop
> 
> Hi folks,
> 
> A friend of mine asked about running Accumulo on a normal file system in
> place of Hadoop, similar to the way MiniAccumulo runs.  How possible is this,
> or how much work would it take to do so?
> 
> I think my friend is just interested in running on a single node, but I am
> curious about both the single-node and distributed (via parallel file system
> like Lustre) cases.
> 
> Thanks, Dylan