Posted to dev@accumulo.apache.org by Tushar Dhadiwal <tu...@gmail.com> on 2019/10/25 00:09:36 UTC

Accumulo on Azure - Long Term Monitoring

Hello Everyone,


I am a Software Engineer at Microsoft and our team is currently working on
making the deployment and operations of Accumulo on Azure as seamless as
possible. As part of this effort, we are attempting to observe / measure
some standard Accumulo operations (e.g. scan, canary queries, ingest, etc.)
and how their performance varies over time on long standing Accumulo
clusters running in Azure. As part of this we’re looking to come up with a
metric that we can use to evaluate how healthy / available an Accumulo
cluster is. Over time we intend to use this to understand how underlying
platform changes in Azure can affect overall health of Accumulo workloads.



As a starting metric for example, we are thinking of continually doing
scans of random values across various tablet servers and capturing timing
information related to how long such scans take. I took a quick look at the
accumulo-testing repo and didn’t find any tests or probes attempting to do
something along these lines. Does something like this seem reasonable? Has
anyone previously attempted something similar? Does accumulo-testing seem
like a reasonable place for code that attempts to do something like this?
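
One way to picture "random values across various tablet servers" is to pick one random row per tablet, so that every tablet of the probed table gets touched. The Python sketch below is only an illustration under stated assumptions: the split points and candidate rows are made up, the helper names (tablet_index, pick_rows_per_tablet) are hypothetical, and it assumes a tablet covers the range (previous split, split] with an inclusive end row. It does not call any Accumulo API.

```python
import bisect
import random
from typing import Dict, List


def tablet_index(splits: List[str], row: str) -> int:
    """Index of the tablet holding `row`.

    Assumes sorted split points and that tablet i covers
    (splits[i-1], splits[i]], i.e. the end row is inclusive.
    """
    return bisect.bisect_left(splits, row)


def pick_rows_per_tablet(
    splits: List[str], candidate_rows: List[str], rng: random.Random
) -> Dict[int, str]:
    """Choose one random candidate row from each tablet that has candidates."""
    by_tablet: Dict[int, List[str]] = {}
    for row in candidate_rows:
        by_tablet.setdefault(tablet_index(splits, row), []).append(row)
    return {t: rng.choice(rows) for t, rows in sorted(by_tablet.items())}


if __name__ == "__main__":
    # Hypothetical table with two split points, hence three tablets.
    splits = ["g", "n"]
    rows = ["apple", "banana", "grape", "kiwi", "mango", "pear", "zucchini"]
    print(pick_rows_per_tablet(splits, rows, random.Random()))
```

Each chosen row would then be handed to whatever performs the actual scan; which tablet server hosts which tablet is left to the cluster.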



Appreciate your thoughts and feedback.



Cheers,

Tushar Dhadiwal

Re: WALs and HDFS (was Re: Accumulo on Azure - Long Term Monitoring)

Posted by Josh Elser <el...@apache.org>.

No worries. Just wanted to make sure that people know what is actually
out there.

Your question is actually right on the money -- there isn't any place
that I'm aware of which actually documents this well. It could be
something that lives in the Hadoop docs (given that we're using Hadoop
API semantics to define what Accumulo/HBase require), but it could also
be something we put in the HBase or Accumulo documentation.

On Sat, Oct 26, 2019 at 12:26 PM David Mollitor <da...@gmail.com> wrote:
>
> Hey Josh,
>
> Thank you for the thoughtful clarification to my otherwise boorish remarks.
>
> I was surprised to learn that there are a few file systems that provide the
> required WAL append guarantees.
>
> Is there documentation that covers this topic and lists the file systems?
>
> Thanks

Re: WALs and HDFS (was Re: Accumulo on Azure - Long Term Monitoring)

Posted by David Mollitor <da...@gmail.com>.

Hey Josh,

Thank you for the thoughtful clarification to my otherwise boorish remarks.

I was surprised to learn that there are a few file systems that provide the
required WAL append guarantees.

Is there documentation that covers this topic and lists the file systems?

Thanks

On Fri, Oct 25, 2019, 5:28 PM Josh Elser <el...@apache.org> wrote:

> Forking this off because I don't think it's related to Tushar's original
> question.
>
> HBase and Accumulo both implement a WAL, which can be said to
> rely on a distributed FileSystem that:
>
> 1. Is API compatible with HDFS
> 2. Guarantees that data written prior to an hflush/hsync() is durable
>
> There are actually a few filesystems capable of this: HDFS (duh),
> Azure's Windows Azure Storage Blob (WASB), Azure's Data Lake Store
> (ADLS), and Azure's Blob Filesystem (ABFS).
>
> Azure has had a pretty long interaction with the upstream Hadoop project
> (and some ties in with the HBase project) to make sure that we know how
> to configure their Hadoop drivers that work with those Azure blob stores
> to make that durability guarantee.
>
> That said, it's wrong to say that HBase/Accumulo in a cloud solution
> requires HDFS. It is accurate to say that S3 (via the S3A adapter) does
> not provide the durability guarantees that HBase/Accumulo need for WALs.
> EMRFS does, from what I've heard through the grapevine, but it requires
> you to be using EMR.

WALs and HDFS (was Re: Accumulo on Azure - Long Term Monitoring)

Posted by Josh Elser <el...@apache.org>.

Forking this off because I don't think it's related to Tushar's original 
question.

HBase and Accumulo both implement a WAL, which can be said to
rely on a distributed FileSystem that:

1. Is API compatible with HDFS
2. Guarantees that data written prior to an hflush/hsync() is durable

There are actually a few filesystems capable of this: HDFS (duh), 
Azure's Windows Azure Storage Blob (WASB), Azure's Data Lake Store 
(ADLS), and Azure's Blob Filesystem (ABFS).

Azure has had a pretty long interaction with the upstream Hadoop project 
(and some ties in with the HBase project) to make sure that we know how 
to configure their Hadoop drivers that work with those Azure blob stores 
to make that durability guarantee.

That said, it's wrong to say that HBase/Accumulo in a cloud solution
requires HDFS. It is accurate to say that S3 (via the S3A adapter) does
not provide the durability guarantees that HBase/Accumulo need for WALs.
EMRFS does, from what I've heard through the grapevine, but it requires
you to be using EMR.
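
To make the second requirement concrete, here is a local-filesystem analogue in Python; it is a sketch, not Accumulo or HBase code. The pattern is "append a record, sync, and only then treat the write as durable". On an HDFS-compatible filesystem the corresponding step is the hflush()/hsync() call on the output stream; here os.fsync stands in for it, and the function names and record format are made up for illustration.

```python
import os
import tempfile
from typing import List


def append_durably(path: str, record: bytes) -> None:
    """Append one record and return only after the sync call has returned."""
    with open(path, "ab") as log:
        log.write(record)
        log.flush()             # push Python's userspace buffer to the OS
        os.fsync(log.fileno())  # ask the OS to persist the data to the device


def replay(path: str) -> List[bytes]:
    """Read back every record, as recovery after a crash would."""
    with open(path, "rb") as log:
        return log.read().splitlines()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        wal = os.path.join(tmp, "wal.log")
        # Hypothetical record format, one mutation per line.
        append_durably(wal, b"put row1 cf:cq=v1\n")
        append_durably(wal, b"put row2 cf:cq=v2\n")
        print(replay(wal))
```

A store that cannot honor the sync step (or silently buffers past it) can lose acknowledged writes on failure, which is why that guarantee matters for WALs.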

On 10/25/19 1:49 PM, David Mollitor wrote:
> Hello Team,
> 
> One shortcoming of Apache Accumulo and Apache HBase, as I understand it,
> is that they both rely on HDFS for replicated WAL management.
> Therefore, HDFS is a requirement even when deploying to a cloud solution. I
> believe Google has developed consensus-enabled WAL management so that
> three instances can be stood up without any external dependencies (other
> than storage for the collection of rfile/hfile).
> 
> Be interested to hear your thoughts on this.

Re: Accumulo on Azure - Long Term Monitoring

Posted by David Mollitor <da...@gmail.com>.

Hello Team,

One shortcoming of Apache Accumulo and Apache HBase, as I understand it,
is that they both rely on HDFS for replicated WAL management.
Therefore, HDFS is a requirement even when deploying to a cloud solution. I
believe Google has developed consensus-enabled WAL management so that
three instances can be stood up without any external dependencies (other
than storage for the collection of rfile/hfile).

Be interested to hear your thoughts on this.

On Fri, Oct 25, 2019 at 1:46 PM Mike Miller <mm...@apache.org> wrote:

> Hi Tushar,
>
> The closest thing we have is the set of performance tests in
> accumulo-testing, which is probably the best place for this.
> https://github.com/apache/accumulo-testing#performance-test
> The instructions for setting up the scripts are in the README. There are
> only a limited number of tests written, though, and they used to be
> integration tests that were moved out of the main test package.
>
> org.apache.accumulo.testing.performance.tests.DurabilityWriteSpeedPT
> org.apache.accumulo.testing.performance.tests.YieldingScanExecutorPT
> org.apache.accumulo.testing.performance.tests.ScanExecutorPT
> org.apache.accumulo.testing.performance.tests.ScanFewFamiliesPT
> org.apache.accumulo.testing.performance.tests.ConditionalMutationsPT
> org.apache.accumulo.testing.performance.tests.RandomCachedLookupsPT

Re: Accumulo on Azure - Long Term Monitoring

Posted by Mike Miller <mm...@apache.org>.

Hi Tushar,

The closest thing we have is the set of performance tests in
accumulo-testing, which is probably the best place for this.
https://github.com/apache/accumulo-testing#performance-test
The instructions for setting up the scripts are in the README. There are
only a limited number of tests written, though, and they used to be
integration tests that were moved out of the main test package.

org.apache.accumulo.testing.performance.tests.DurabilityWriteSpeedPT
org.apache.accumulo.testing.performance.tests.YieldingScanExecutorPT
org.apache.accumulo.testing.performance.tests.ScanExecutorPT
org.apache.accumulo.testing.performance.tests.ScanFewFamiliesPT
org.apache.accumulo.testing.performance.tests.ConditionalMutationsPT
org.apache.accumulo.testing.performance.tests.RandomCachedLookupsPT

On Thu, Oct 24, 2019 at 8:09 PM Tushar Dhadiwal <tu...@gmail.com>
wrote:

> Hello Everyone,
>
>
> I am a Software Engineer at Microsoft and our team is currently working on
> making the deployment and operations of Accumulo on Azure as seamless as
> possible. As part of this effort, we are attempting to observe / measure
> some standard Accumulo operations (e.g. scan, canary queries, ingest, etc.)
> and how their performance varies over time on long standing Accumulo
> clusters running in Azure. As part of this we’re looking to come up with a
> metric that we can use to evaluate how healthy / available an Accumulo
> cluster is. Over time we intend to use this to understand how underlying
> platform changes in Azure can affect overall health of Accumulo workloads.
>
>
>
> As a starting metric for example, we are thinking of continually doing
> scans of random values across various tablet servers and capturing timing
> information related to how long such scans take. I took a quick look at the
> accumulo-testing repo and didn’t find any tests or probes attempting to do
> something along these lines. Does something like this seem reasonable? Has
> anyone previously attempted something similar? Does accumulo-testing seem
> like a reasonable place for code that attempts to do something like this?
>
>
>
> Appreciate your thoughts and feedback.
>
>
>
> Cheers,
>
> Tushar Dhadiwal
>

Re: Accumulo on Azure - Long Term Monitoring

Posted by Tushar Dhadiwal <tu...@gmail.com>.

Hello Gang,

I took a stab at adding initial code for an Accumulo availability monitor
that scans random values across various tablet servers and captures
timing information about how long such scans take. Here is the PR for
it: https://github.com/apache/accumulo-testing/pull/118

The timings from this probe were piped through a very lightweight log4j
appender to a Log Analytics service. By querying the logs and plotting
read timings against time, I was able to spot changes in Accumulo
cluster availability (e.g. tablet servers down, poor network
connectivity, etc.). I can provide more info and share the related code
if anyone is interested.
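
For readers who want the general shape without opening the PR, here is a minimal Python sketch of such a probe. It is not the code in the PR: fake_scan is a stand-in for a real Accumulo scan, and the helper names and summary fields (time_scan, run_probe, summarize, "availability", "median_ms") are assumptions made for illustration.

```python
import random
import statistics
import time
from typing import Callable, Dict, List, Optional


def time_scan(scan: Callable[[str], object], row: str) -> Optional[float]:
    """Return the scan latency in milliseconds, or None if the scan failed."""
    start = time.perf_counter()
    try:
        scan(row)
    except Exception:
        return None  # a failed scan counts against availability
    return (time.perf_counter() - start) * 1000.0


def run_probe(
    scan: Callable[[str], object], rows: List[str], samples: int, rng: random.Random
) -> List[Optional[float]]:
    """Scan randomly chosen rows and collect one latency (or None) per attempt."""
    return [time_scan(scan, rng.choice(rows)) for _ in range(samples)]


def summarize(latencies: List[Optional[float]]) -> Dict[str, object]:
    """Reduce raw samples to the numbers one would log and plot over time."""
    ok = [x for x in latencies if x is not None]
    return {
        "attempts": len(latencies),
        "availability": len(ok) / len(latencies) if latencies else 0.0,
        "median_ms": statistics.median(ok) if ok else None,
        "max_ms": max(ok) if ok else None,
    }


def fake_scan(row: str) -> list:
    """Stand-in for a real scan; one hypothetical row is always unreachable."""
    if row == "row_3":
        raise RuntimeError("tablet server unavailable")
    return [(row, "value")]


if __name__ == "__main__":
    rows = [f"row_{i}" for i in range(10)]
    print(summarize(run_probe(fake_scan, rows, 100, random.Random())))
```

Emitting one such summary per interval gives a time series in which a drop in availability or a jump in median latency stands out.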

Would appreciate your thoughts and feedback on this.

Cheers,
Tushar Dhadiwal

On Thu, Oct 24, 2019 at 5:09 PM Tushar Dhadiwal <tu...@gmail.com>
wrote:

> Hello Everyone,
>
>
> I am a Software Engineer at Microsoft and our team is currently working
> on making the deployment and operations of Accumulo on Azure as seamless as
> possible. As part of this effort, we are attempting to observe / measure
> some standard Accumulo operations (e.g. scan, canary queries, ingest, etc.)
> and how their performance varies over time on long standing Accumulo
> clusters running in Azure. As part of this we’re looking to come up with a
> metric that we can use to evaluate how healthy / available an Accumulo
> cluster is. Over time we intend to use this to understand how underlying
> platform changes in Azure can affect overall health of Accumulo workloads.
>
>
>
> As a starting metric for example, we are thinking of continually doing
> scans of random values across various tablet servers and capturing timing
> information related to how long such scans take. I took a quick look at
> the accumulo-testing repo and didn’t find any tests or probes attempting to
> do something along these lines. Does something like this seem
> reasonable? Has anyone previously attempted something similar? Does
> accumulo-testing seem like a reasonable place for code that attempts to do
> something like this?
>
>
>
> Appreciate your thoughts and feedback.
>
>
>
> Cheers,
>
> Tushar Dhadiwal
>