Posted to user@accumulo.apache.org by James Hughes <jn...@virginia.edu> on 2017/04/14 14:16:13 UTC

Accumulo on Azure / WebHDFS

Hi all,

I know folks have asked about Accumulo on S3 before (1).

Has anyone tried running Accumulo on Azure's blob storage or data lake
solutions (2)?  (Or perhaps more generally, has anyone tried Accumulo on
WebHDFS?)

As more background, I have deployed Accumulo on HDP clouds in Azure, and
that works great.  I'm interested in using the blob / data lake storage for
benefits with scaling, etc.

Thanks in advance,

Jim

1.  http://apache-accumulo.1065345.n5.nabble.com/Accumulo-on-s3-td16737.html
2.  https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-integrate-with-other-services

Re: Accumulo on Azure / WebHDFS

Posted by James Hughes <jn...@virginia.edu>.
Thanks.  Can you say whether the performance is on par with a cloud you might
otherwise spin up?

In terms of the drop-in bits, is it as easy as setting 'instance.volumes'
to point at the new URL?
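
To make the question concrete, here is a rough sketch of what I have in mind.
It assumes the Accumulo side really is just 'instance.volumes' (a
comma-separated list of volume URIs) in accumulo-site.xml, which nobody has
confirmed yet. The account and container names are made up, the helper
volumes_property() and the scheme list are hypothetical, and any Hadoop-side
credential setup for WASB / ADL is not shown.

```python
from urllib.parse import urlparse
from xml.sax.saxutils import escape

# Schemes accepted by this sketch only (an assumption, not an Accumulo rule).
ALLOWED_SCHEMES = {"hdfs", "wasb", "wasbs", "adl"}


def volumes_property(uris):
    """Render an accumulo-site.xml <property> block for 'instance.volumes'.

    Hypothetical helper: it only checks that each URI has an expected
    scheme and an authority part, then joins the URIs with commas.
    """
    for uri in uris:
        parsed = urlparse(uri)
        if parsed.scheme not in ALLOWED_SCHEMES or not parsed.netloc:
            raise ValueError(f"unexpected volume URI: {uri!r}")
    value = ",".join(uris)
    return (
        "<property>\n"
        "  <name>instance.volumes</name>\n"
        f"  <value>{escape(value)}</value>\n"
        "</property>"
    )


if __name__ == "__main__":
    # Placeholder container/account names, not a real deployment.
    print(volumes_property([
        "wasb://mycontainer@myaccount.blob.core.windows.net/accumulo",
    ]))
```

An ADL volume would presumably look like
adl://myaccount.azuredatalakestore.net/accumulo in the same list.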

Thanks!

On Mon, Apr 17, 2017 at 4:57 PM, Josh Elser <jo...@gmail.com> wrote:

> I don't have any performance numbers handy. I'm not sure if
> Microsoft/Azure-team publishes them.
>
> In general, my understanding is that each of them is intended to be a
> "drop-in replacement". There might be some implementation-specific
> configuration (e.g. account/billing), but that's it.

Re: Accumulo on Azure / WebHDFS

Posted by Josh Elser <jo...@gmail.com>.
I don't have any performance numbers handy. I'm not sure if 
Microsoft/Azure-team publishes them.

In general, my understanding is that each of them is intended to be a
"drop-in replacement". There might be some implementation-specific
configuration (e.g. account/billing), but that's it.

James Hughes wrote:
> Hi Josh,
>
> Thanks again!
>
> As a follow-up, is any of the information about Accumulo on WASB or ADL
> public?  I suppose I'm curious about configuration (is it just
> plug-and-play?) and performance.
>
> Thanks in advance,
>
> Jim

Re: Accumulo on Azure / WebHDFS

Posted by James Hughes <jn...@virginia.edu>.
Hi Josh,

Thanks again!

As a follow-up, is any of the information about Accumulo on WASB or ADL
public?  I suppose I'm curious about configuration (is it just
plug-and-play?) and performance.

Thanks in advance,

Jim

On Sat, Apr 15, 2017 at 2:25 PM, Josh Elser <jo...@gmail.com> wrote:

> As I understand it, S3 is currently still a non-starter.
>
> Long term, Amazon may provide some more features to fix the sync issue.
> Or, someone could modify Accumulo to support putting RFiles on S3 exclusively.
>
> Happy to expand on this further if you're curious.

Re: Accumulo on Azure / WebHDFS

Posted by Josh Elser <jo...@gmail.com>.
As I understand it, S3 is currently still a non-starter.

Long term, Amazon may provide some more features to fix the sync issue. Or,
someone could modify Accumulo to support putting RFiles on S3 exclusively.

Happy to expand on this further if you're curious.


On Apr 14, 2017 15:16, "James Hughes" <jn...@virginia.edu> wrote:

> Hi Josh,
>
> Thanks!  Sounds like Azure's offerings are providing better performance and
> sync()'ing than S3?  (I.e., is S3 still a no-go for Accumulo?)
>
> Your description of WebHDFS makes total sense.  I figured there may be an
> outside chance that WebHDFS handled or worked around limitations from S3,
> etc.
>
> Cheers,
>
> Jim


Re: Accumulo on Azure / WebHDFS

Posted by James Hughes <jn...@virginia.edu>.
Hi Josh,

Thanks!  Sounds like Azure's offerings are providing better performance and
sync()'ing than S3?  (I.e., is S3 still a no-go for Accumulo?)

Your description of WebHDFS makes total sense.  I figured there may be an
outside chance that WebHDFS handled or worked around limitations from S3,
etc.

Cheers,

Jim

On Fri, Apr 14, 2017 at 12:47 PM, Josh Elser <jo...@gmail.com> wrote:

> Hi Jim,
>
> I can say that Accumulo will work on Azure's blob store and their data
> lake store. This is based on testing I'm involved with at
> Hortonworks (day job). I know that these filesystems are tested to an
> appropriate degree, proving that they do provide the things that
> Accumulo needs.
>
> As a refresher, the things we need from a filesystem are: performance
> (Accumulo's write performance is largely dominated by I/O) and
> durability guarantees (when we call sync() on a file, the data we just
> wrote had better be there).
>
> For WebHDFS, I think performance would suffer, and I would
> be surprised if it actually provided the durability guarantees. My
> understanding is that WebHDFS is meant more to give non-Java clients
> easy access to HDFS (as a one-off) than to act as a fully-fledged
> access layer.
>
> - Josh

Re: Accumulo on Azure / WebHDFS

Posted by Josh Elser <jo...@gmail.com>.
Hi Jim,

I can say that Accumulo will work on Azure's blob store and their data
lake store. This is based on testing I'm involved with at
Hortonworks (day job). I know that these filesystems are tested to an
appropriate degree, proving that they do provide the things that
Accumulo needs.

As a refresher, the things we need from a filesystem are: performance
(Accumulo's write performance is largely dominated by I/O) and
durability guarantees (when we call sync() on a file, the data we just
wrote had better be there).
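
To illustrate the durability half of that, here is a minimal sketch of the
contract using a plain local file and Python's os.fsync as a stand-in. It is
an analogy only: it is not Accumulo's write-ahead log code and not the Hadoop
API, and the names durable_append() and read_all() are made up for the
example.

```python
import os
import tempfile


def durable_append(path: str, payload: bytes) -> None:
    """Append payload and return only after the OS has been asked to persist it."""
    with open(path, "ab") as f:
        f.write(payload)      # data may still sit in a user-space buffer
        f.flush()             # hand the buffered bytes to the operating system
        os.fsync(f.fileno())  # ask the OS to push them to the storage device


def read_all(path: str) -> bytes:
    """Read the file back through a fresh handle, as a recovering process would."""
    with open(path, "rb") as f:
        return f.read()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        log = os.path.join(d, "wal.log")  # hypothetical log file name
        durable_append(log, b"mutation-1\n")
        durable_append(log, b"mutation-2\n")
        print(read_all(log))
```

The point is the ordering: the writer does not acknowledge anything until the
sync call has returned. A filesystem whose sync is a no-op, or only makes data
visible when the file is closed, cannot honor that, no matter how fast it is.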

For WebHDFS, I think performance would suffer, and I would
be surprised if it actually provided the durability guarantees. My
understanding is that WebHDFS is meant more to give non-Java clients
easy access to HDFS (as a one-off) than to act as a fully-fledged
access layer.

- Josh

On Fri, Apr 14, 2017 at 10:16 AM, James Hughes <jn...@virginia.edu> wrote:
> Hi all,
>
> I know folks have asked about Accumulo on S3 before (1).
>
> Has anyone tried running Accumulo on Azure's blob storage or data lake
> solutions (2)?  (Or perhaps more generally, has anyone tried Accumulo on
> WebHDFS?)
>
> As more background, I have deployed Accumulo on HDP clouds in Azure, and
> that works great.  I'm interested in using the blob / data lake storage for
> benefits with scaling, etc.
>
> Thanks in advance,
>
> Jim
>
> 1.  http://apache-accumulo.1065345.n5.nabble.com/Accumulo-on-s3-td16737.html
> 2.  https://docs.microsoft.com/en-us/azure/data-lake-store/data-lake-store-integrate-with-other-services