Posted to dev@accumulo.apache.org by Josh Elser <el...@apache.org> on 2019/10/25 21:28:47 UTC

WALs and HDFS (was Re: Accumulo on Azure - Long Term Monitoring)

Forking this off because I don't think it's related to Tushar's original 
question.

HBase and Accumulo both implement a WAL that relies on a distributed
FileSystem which:

1. Is API compatible with HDFS
2. Guarantees that data written prior to an hflush/hsync() is durable
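
To make the second requirement concrete, here is a minimal sketch of the contract a WAL writer depends on. It is only a local-file analogy, not Accumulo's or HBase's actual WAL code or record format: `TinyWal`, `replay` and `demo` are hypothetical names, and Python's `flush()` plus `os.fsync()` stand in (loosely) for hflush/hsync on a Hadoop output stream.

```python
import os
import struct
import tempfile


class TinyWal:
    """Hypothetical append-only log; not Accumulo's or HBase's real WAL format."""

    def __init__(self, path):
        self.path = path
        # Append in binary mode so existing records survive a reopen.
        self._fh = open(path, "ab")

    def append(self, payload: bytes) -> None:
        # Length-prefixed record: 4-byte big-endian length, then the payload.
        self._fh.write(struct.pack(">I", len(payload)) + payload)

    def sync(self) -> None:
        # flush() hands buffered bytes to the OS (loosely like hflush);
        # os.fsync() asks the OS to persist them (loosely like hsync).
        self._fh.flush()
        os.fsync(self._fh.fileno())

    def close(self) -> None:
        self._fh.close()


def replay(path):
    """Return every complete record, ignoring a torn trailing record."""
    records = []
    with open(path, "rb") as fh:
        while True:
            header = fh.read(4)
            if len(header) < 4:
                break
            (length,) = struct.unpack(">I", header)
            payload = fh.read(length)
            if len(payload) < length:
                break  # partial write that was never synced
            records.append(payload)
    return records


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    wal = TinyWal(path)
    wal.append(b"row1 -> a")
    wal.append(b"row2 -> b")
    wal.sync()  # only after this returns should the writes be acknowledged
    wal.close()
    return replay(path)
```

The point of the sketch is the ordering: a write is acknowledged to the client only after `sync()` returns, so recovery can trust everything `replay()` finds. A filesystem whose sync call returns before the data is actually durable breaks that assumption.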

There are actually a few filesystems capable of this: HDFS (duh), 
Azure's Windows Azure Storage Blob (WASB), Azure's Data Lake Store 
(ADLS), and Azure's Blob Filesystem (ABFS).

Azure has worked with the upstream Hadoop project for a long time (with
some ties into the HBase project) to make sure that we know how to
configure the Hadoop drivers for those Azure blob stores so that they
provide that durability guarantee.

That said, it's wrong to say that HBase/Accumulo in a cloud solution
requires HDFS. It is accurate to say that S3 (via the S3A adapter) does
not provide the durability guarantees that HBase/Accumulo need for WALs
(EMRFS does, from what I've heard through the grapevine, but it requires
you to be using EMR).

On 10/25/19 1:49 PM, David Mollitor wrote:
> Hello Team,
> 
> One shortcoming of Apache Accumulo and Apache HBase, as I understand it,
> is that they both rely on the HDFS for replicated WAL management.
> Therefore, HDFS is a requirement even if deploying to a cloud solution.  I
> believe Google has developed a consensus enabled WAL management so that
> three instances can be stood up without any external dependencies (other
> than storage for the collection of rfile/hfile).
> 
> Be interested to hear your thoughts on this.
> 
> On Fri, Oct 25, 2019 at 1:46 PM Mike Miller <mm...@apache.org> wrote:
> 
>> Hi Tushar,
>>
>> The closest thing we have are the performance tests in accumulo-testing,
>> which is probably the best place.
>> https://github.com/apache/accumulo-testing#performance-test
>> The instructions for setting up the scripts are in the README.  There are
>> only a limited number of tests written though and they used to be
>> integration tests that were moved out of the main test package.
>>
>> org.apache.accumulo.testing.performance.tests.DurabilityWriteSpeedPT
>> org.apache.accumulo.testing.performance.tests.YieldingScanExecutorPT
>> org.apache.accumulo.testing.performance.tests.ScanExecutorPT
>> org.apache.accumulo.testing.performance.tests.ScanFewFamiliesPT
>> org.apache.accumulo.testing.performance.tests.ConditionalMutationsPT
>> org.apache.accumulo.testing.performance.tests.RandomCachedLookupsPT
>>
>> On Thu, Oct 24, 2019 at 8:09 PM Tushar Dhadiwal <tu...@gmail.com> wrote:
>>
>>> Hello Everyone,
>>>
>>>
>>> I am a Software Engineer at Microsoft and our team is currently working
>>> on making the deployment and operations of Accumulo on Azure as seamless
>>> as possible. As part of this effort, we are attempting to observe / measure
>>> some standard Accumulo operations (e.g. scan, canary queries, ingest, etc.)
>>> and how their performance varies over time on long standing Accumulo
>>> clusters running in Azure. As part of this we’re looking to come up with a
>>> metric that we can use to evaluate how healthy / available an Accumulo
>>> cluster is. Over time we intend to use this to understand how underlying
>>> platform changes in Azure can affect overall health of Accumulo workloads.
>>>
>>>
>>>
>>> As a starting metric for example, we are thinking of continually doing
>>> scans of random values across various tablet servers and capturing timing
>>> information related to how long such scans take. I took a quick look at the
>>> accumulo-testing repo and didn’t find any tests or probes attempting to do
>>> something along these lines. Does something like this seem reasonable? Has
>>> anyone previously attempted something similar? Does accumulo-testing seem
>>> like a reasonable place for code that attempts to do something like this?
>>>
>>>
>>>
>>> Appreciate your thoughts and feedback.
>>>
>>>
>>>
>>> Cheers,
>>>
>>> Tushar Dhadiwal
>>>
>>
> 

Re: WALs and HDFS (was Re: Accumulo on Azure - Long Term Monitoring)

Posted by Josh Elser <el...@apache.org>.
No worries. Just wanted to make sure that people know what is actually
out there.

Your question is actually right on the money -- there isn't any place
that I'm aware of which actually documents this well. It could be
something that lives in Hadoop (given that we're using Hadoop API
semantics to define what Accumulo/HBase require), but it could also be
something we put in the HBase or Accumulo documentation.

On Sat, Oct 26, 2019 at 12:26 PM David Mollitor <da...@gmail.com> wrote:
>
> Hey Josh,
>
> Thank you for the thoughtful clarification to my otherwise boorish remarks.
>
> I was surprised to learn that there are a few file systems that provide the
> required WAL append guarantees.
>
> Is there documentation that covers this topic and lists the file systems?
>
> Thanks

Re: WALs and HDFS (was Re: Accumulo on Azure - Long Term Monitoring)

Posted by David Mollitor <da...@gmail.com>.
Hey Josh,

Thank you for the thoughtful clarification to my otherwise boorish remarks.

I was surprised to learn that there are a few file systems that provide the
required WAL append guarantees.

Is there documentation that covers this topic and lists the file systems?

Thanks
