
Posted to dev@hbase.apache.org by Peter Somogyi <ps...@cloudera.com> on 2017/06/14 10:53:21 UTC

Problem with IntegrationTestRegionReplicaReplication

Hi,

As one of my first tasks with HBase, I started looking into why
IntegrationTestRegionReplicaReplication fails. I would like to get some
suggestions from you.

I noticed that when I run the test on either a normal cluster or a
minicluster, I get the same error message: "Error checking data for key
[null], no data returned". I looked into the code and here are my conclusions.

Multiple threads write data in parallel, and multiple reader threads read
it simultaneously. Each writer gets a portion of the keys to write
(e.g. 0-2000), and these keys are added to a ConstantDelayQueue. The reader
threads take elements (e.g. key=1000) from the queue and assume that all
keys up to that one are already in the database. Since we're using multiple
writers, another thread may not yet have written key=500, and verifying
these keys will cause the test to fail.

Do you think my assumption is correct?

Thanks,
Peter
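
For illustration, the hand-off between writers and readers described above
can be sketched in plain Java. This is only a sketch under assumptions: it
uses java.util.concurrent.DelayQueue as a stand-in for the test's
ConstantDelayQueue (whose implementation is not shown in this thread), and
the names DelayedHandOffSketch and DelayedKey are hypothetical. It shows a
key being enqueued after its write and becoming visible to a reader only
once a fixed delay has elapsed.

```java
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.TimeUnit;

public class DelayedHandOffSketch {

    /** Hypothetical element: a written key that readers may see only after delayMs. */
    static final class DelayedKey implements Delayed {
        final long key;
        private final long readyAtNanos;

        DelayedKey(long key, long delayMs) {
            this.key = key;
            this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
        }

        @Override
        public long getDelay(TimeUnit unit) {
            // Remaining time until the key may be handed to a reader.
            return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
        }

        @Override
        public int compareTo(Delayed other) {
            return Long.compare(getDelay(TimeUnit.NANOSECONDS),
                                other.getDelay(TimeUnit.NANOSECONDS));
        }
    }

    /** Returns {visible immediately, visible after the delay} for one written key. */
    public static boolean[] demo(long delayMs) {
        DelayQueue<DelayedKey> queue = new DelayQueue<>();

        // Writer side: the key is enqueued once its write has completed.
        queue.put(new DelayedKey(1000L, delayMs));

        // Reader side: an immediate poll returns nothing, the delay has not elapsed.
        boolean visibleEarly = queue.poll() != null;

        boolean visibleLate;
        try {
            // Waiting past the delay hands the reader exactly that key to verify.
            DelayedKey k = queue.poll(delayMs * 10, TimeUnit.MILLISECONDS);
            visibleLate = k != null && k.key == 1000L;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            visibleLate = false;
        }
        return new boolean[] {visibleEarly, visibleLate};
    }

    public static void main(String[] args) {
        boolean[] r = demo(200);
        System.out.println("visible immediately: " + r[0]
            + ", visible after delay: " + r[1]);
    }
}
```

In this model a reader only ever receives a key that some writer has already
written and whose delay has passed; whether the real test also verifies
other keys below the dequeued one is exactly the open question in this mail.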

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Peter Somogyi <ps...@cloudera.com>.

I did some testing and found some interesting behavior that you might be
able to comment on.

When running the test against apache/branch-1.1 and apache/branch-1.2 with
the following command, the tests consistently failed for me:
`mvn -pl hbase-it -am -Dtest=NoUnitTests -Dit.test=IntegrationTestRegionReplicaReplication verify`

If I remove line 103 from the test, the test passes on both the Apache
branch and CDH based on v1.2:
    conf.setLong(HConstants.HREGION_MEMSTORE_FLUSH_SIZE, 1024L * 1024 * 4); // flush every 4 MB

Do you know why setting hbase.hregion.memstore.flush.size is needed? As far
as I understand the test verifies that async WAL replication works. Don't
we bypass that functionality if we flush too frequently?

Thanks,
Peter

On Mon, Jun 19, 2017 at 2:55 AM, Devaraj Das <dd...@hortonworks.com> wrote:

> If it is failing consistently I'd suspect we have introduced a bug in the
> 1.2 line or something. We do run the same test with a version based on
> 1.1.2 (HDP-2.3 and beyond) and it works fine
>
>
>
>
> On Sun, Jun 18, 2017 at 8:26 AM -0700, "Peter Somogyi" <
> psomogyi@cloudera.com> wrote:
>
>
> I'm using hbase based on 1.2 version.
>
> On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das  wrote:
>
> > Peter, which version of HBase are you testing with?
> >
> >
> >
> >
> > On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <
> > psomogyi@cloudera.com> wrote:
> >
> >
> > I tried with those parameters but the test still failed.
> > I noticed that some of the rows were not replicated to the replicas just
> > after I called flush manually. I think memstore replication is not
> working
> > on my system even though it is enabled in the configuration.
> > I will look into it today.
> >
> > On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das  wrote:
> >
> > > Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> > > At the top of the file, the ways to specify the options are documented ..
> > > You need to add something like
> > > -DIntegrationTestRegionReplicaReplication.read_delay_ms ..
> > > ________________________________________
> > > From: Josh Elser
> > > Sent: Thursday, June 15, 2017 10:40 AM
> > > To: dev@hbase.apache.org
> > > Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > >
> > > I'd start trying a read_delay_ms=60000, region_replication=2,
> > > num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
> > > reader and writer threads.
> > >
> > > Again, this can be quite dependent on the kind of hardware you have.
> > > You'll definitely have to tweak ;)
> > >
> > > On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > > > Thanks Josh and Devaraj!
> > > >
> > > > I will try to increase the timeouts. Devaraj, could you share the
> > > > parameters you used for this test which worked?
> > > >
> > > > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das
> > > wrote:
> > > >
> > > >> That sounds about right, Josh. Peter, in our internal testing we
> have
> > > seen
> > > >> this test failing and increasing timeouts (look at the test code
> > > options to
> > > >> do with increasing timeout) helped quite some.
> > > >> ________________________________________
> > > >> From: Josh Elser
> > > >> Sent: Wednesday, June 14, 2017 3:17 PM
> > > >> To: dev@hbase.apache.org
> > > >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > > >>
> > > >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > > >>> Hi,
> > > >>>
> > > >>> As one of my first task with HBase I started to look into
> > > >>> why IntegrationTestRegionReplicaReplication fails. I would like to
> > get
> > > >> some
> > > >>> suggestions from you.
> > > >>>
> > > >>> I noticed when I run the test using normal cluster or minicluster I
> > get
> > > >> the
> > > >>> same error messages: "Error checking data for key [null], no data
> > > >>> returned". I looked into the code and here are my conclusions.
> > > >>>
> > > >>> There are multiple threads writing data parallel which are read by
> > > >> multiple
> > > >>> reader threads simultaneously. Each writer gets a portion of the
> keys
> > > to
> > > >>> write (e.g. 0-2000) and these keys are added to a
> ConstantDelayQueue.
> > > >>> The reader threads get the elements (e.g. key=1000) from the queue
> > and
> > > >>> these reader threads assume that all the keys up to this are
> already
> > in
> > > >> the
> > > >>> database. Since we're using multiple writers it can happen that
> > another
> > > >>> thread has not yet written key=500 and verifying these keys will
> > cause
> > > >> the
> > > >>> test failure.
> > > >>>
> > > >>> Do you think my assumption is correct?
> > > >>
> > > >> Hi Peter,
> > > >>
> > > >> No, as my memory serves, this is not correct. Readers are not made
> > aware
> > > >> of keys to verify until the write occur plus some delay. The delay
> is
> > > >> used to provide enough time for the internal region replication to
> > take
> > > >> effect.
> > > >>
> > > >> So: primary-write, pause, [region replication happens in
> background],
> > > >> add updated key to read queue, reader gets key from queue verifies
> the
> > > >> value on a replica.
> > > >>
> > > >> The primary should always have seen the new value for a key. If the
> > test
> > > >> is showing that a replica does not see the result, it's either a
> > timing
> > > >> issue (you need to give a larger delay for HBase to perform the
> region
> > > >> replication) or a bug in the region replication framework itself.
> That
> > > >> said, if you can show that you are seeing what you describe, that
> > sounds
> > > >> like the test framework itself is broken :)
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > >
> > >
> > >
> >
> >
>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Devaraj Das <dd...@hortonworks.com>.

If it is failing consistently, I'd suspect we have introduced a bug in the 1.2 line or something. We do run the same test with a version based on 1.1.2 (HDP-2.3 and beyond) and it works fine.




On Sun, Jun 18, 2017 at 8:26 AM -0700, "Peter Somogyi" <ps...@cloudera.com> wrote:


I'm using hbase based on 1.2 version.

On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das  wrote:

> Peter, which version of HBase are you testing with?
>
>
>
>
> On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <
> psomogyi@cloudera.com> wrote:
>
>
> I tried with those parameters but the test still failed.
> I noticed that some of the rows were not replicated to the replicas just
> after I called flush manually. I think memstore replication is not working
> on my system even though it is enabled in the configuration.
> I will look into it today.
>
> On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das  wrote:
>
> > Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> > At the top of the file, the ways to specify the options are documented ..
> > You need to add something like
> > -DIntegrationTestRegionReplicaReplication.read_delay_ms ..
> > ________________________________________
> > From: Josh Elser
> > Sent: Thursday, June 15, 2017 10:40 AM
> > To: dev@hbase.apache.org
> > Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> >
> > I'd start trying a read_delay_ms=60000, region_replication=2,
> > num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
> > reader and writer threads.
> >
> > Again, this can be quite dependent on the kind of hardware you have.
> > You'll definitely have to tweak ;)
> >
> > On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > > Thanks Josh and Devaraj!
> > >
> > > I will try to increase the timeouts. Devaraj, could you share the
> > > parameters you used for this test which worked?
> > >
> > > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das
> > wrote:
> > >
> > >> That sounds about right, Josh. Peter, in our internal testing we have
> > seen
> > >> this test failing and increasing timeouts (look at the test code
> > options to
> > >> do with increasing timeout) helped quite some.
> > >> ________________________________________
> > >> From: Josh Elser
> > >> Sent: Wednesday, June 14, 2017 3:17 PM
> > >> To: dev@hbase.apache.org
> > >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > >>
> > >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > >>> Hi,
> > >>>
> > >>> As one of my first task with HBase I started to look into
> > >>> why IntegrationTestRegionReplicaReplication fails. I would like to
> get
> > >> some
> > >>> suggestions from you.
> > >>>
> > >>> I noticed when I run the test using normal cluster or minicluster I
> get
> > >> the
> > >>> same error messages: "Error checking data for key [null], no data
> > >>> returned". I looked into the code and here are my conclusions.
> > >>>
> > >>> There are multiple threads writing data parallel which are read by
> > >> multiple
> > >>> reader threads simultaneously. Each writer gets a portion of the keys
> > to
> > >>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> > >>> The reader threads get the elements (e.g. key=1000) from the queue
> and
> > >>> these reader threads assume that all the keys up to this are already
> in
> > >> the
> > >>> database. Since we're using multiple writers it can happen that
> another
> > >>> thread has not yet written key=500 and verifying these keys will
> cause
> > >> the
> > >>> test failure.
> > >>>
> > >>> Do you think my assumption is correct?
> > >>
> > >> Hi Peter,
> > >>
> > >> No, as my memory serves, this is not correct. Readers are not made
> aware
> > >> of keys to verify until the write occur plus some delay. The delay is
> > >> used to provide enough time for the internal region replication to
> take
> > >> effect.
> > >>
> > >> So: primary-write, pause, [region replication happens in background],
> > >> add updated key to read queue, reader gets key from queue verifies the
> > >> value on a replica.
> > >>
> > >> The primary should always have seen the new value for a key. If the
> test
> > >> is showing that a replica does not see the result, it's either a
> timing
> > >> issue (you need to give a larger delay for HBase to perform the region
> > >> replication) or a bug in the region replication framework itself. That
> > >> said, if you can show that you are seeing what you describe, that
> sounds
> > >> like the test framework itself is broken :)
> > >>
> > >>
> > >>
> > >>
> > >
> >
> >
> >
>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Peter Somogyi <ps...@cloudera.com>.

I'm using HBase based on version 1.2.

On Sat, Jun 17, 2017 at 4:00 PM, Devaraj Das <dd...@hortonworks.com> wrote:

> Peter, which version of HBase are you testing with?
>
>
>
>
> On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <
> psomogyi@cloudera.com> wrote:
>
>
> I tried with those parameters but the test still failed.
> I noticed that some of the rows were not replicated to the replicas just
> after I called flush manually. I think memstore replication is not working
> on my system even though it is enabled in the configuration.
> I will look into it today.
>
> On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das  wrote:
>
> > Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> > At the top of the file, the ways to specify the options are documented ..
> > You need to add something like
> > -DIntegrationTestRegionReplicaReplication.read_delay_ms ..
> > ________________________________________
> > From: Josh Elser
> > Sent: Thursday, June 15, 2017 10:40 AM
> > To: dev@hbase.apache.org
> > Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> >
> > I'd start trying a read_delay_ms=60000, region_replication=2,
> > num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
> > reader and writer threads.
> >
> > Again, this can be quite dependent on the kind of hardware you have.
> > You'll definitely have to tweak ;)
> >
> > On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > > Thanks Josh and Devaraj!
> > >
> > > I will try to increase the timeouts. Devaraj, could you share the
> > > parameters you used for this test which worked?
> > >
> > > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das
> > wrote:
> > >
> > >> That sounds about right, Josh. Peter, in our internal testing we have
> > seen
> > >> this test failing and increasing timeouts (look at the test code
> > options to
> > >> do with increasing timeout) helped quite some.
> > >> ________________________________________
> > >> From: Josh Elser
> > >> Sent: Wednesday, June 14, 2017 3:17 PM
> > >> To: dev@hbase.apache.org
> > >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> > >>
> > >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > >>> Hi,
> > >>>
> > >>> As one of my first task with HBase I started to look into
> > >>> why IntegrationTestRegionReplicaReplication fails. I would like to
> get
> > >> some
> > >>> suggestions from you.
> > >>>
> > >>> I noticed when I run the test using normal cluster or minicluster I
> get
> > >> the
> > >>> same error messages: "Error checking data for key [null], no data
> > >>> returned". I looked into the code and here are my conclusions.
> > >>>
> > >>> There are multiple threads writing data parallel which are read by
> > >> multiple
> > >>> reader threads simultaneously. Each writer gets a portion of the keys
> > to
> > >>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> > >>> The reader threads get the elements (e.g. key=1000) from the queue
> and
> > >>> these reader threads assume that all the keys up to this are already
> in
> > >> the
> > >>> database. Since we're using multiple writers it can happen that
> another
> > >>> thread has not yet written key=500 and verifying these keys will
> cause
> > >> the
> > >>> test failure.
> > >>>
> > >>> Do you think my assumption is correct?
> > >>
> > >> Hi Peter,
> > >>
> > >> No, as my memory serves, this is not correct. Readers are not made
> aware
> > >> of keys to verify until the write occur plus some delay. The delay is
> > >> used to provide enough time for the internal region replication to
> take
> > >> effect.
> > >>
> > >> So: primary-write, pause, [region replication happens in background],
> > >> add updated key to read queue, reader gets key from queue verifies the
> > >> value on a replica.
> > >>
> > >> The primary should always have seen the new value for a key. If the
> test
> > >> is showing that a replica does not see the result, it's either a
> timing
> > >> issue (you need to give a larger delay for HBase to perform the region
> > >> replication) or a bug in the region replication framework itself. That
> > >> said, if you can show that you are seeing what you describe, that
> sounds
> > >> like the test framework itself is broken :)
> > >>
> > >>
> > >>
> > >>
> > >
> >
> >
> >
>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Devaraj Das <dd...@hortonworks.com>.

Peter, which version of HBase are you testing with?




On Thu, Jun 15, 2017 at 11:57 PM -0700, "Peter Somogyi" <ps...@cloudera.com> wrote:


I tried with those parameters but the test still failed.
I noticed that some of the rows were not replicated to the replicas just
after I called flush manually. I think memstore replication is not working
on my system even though it is enabled in the configuration.
I will look into it today.

On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das  wrote:

> Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> At the top of the file, the ways to specify the options are documented ..
> You need to add something like -DIntegrationTestRegionReplicaReplication.read_delay_ms
> ..
> ________________________________________
> From: Josh Elser
> Sent: Thursday, June 15, 2017 10:40 AM
> To: dev@hbase.apache.org
> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>
> I'd start trying a read_delay_ms=60000, region_replication=2,
> num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
> reader and writer threads.
>
> Again, this can be quite dependent on the kind of hardware you have.
> You'll definitely have to tweak ;)
>
> On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > Thanks Josh and Devaraj!
> >
> > I will try to increase the timeouts. Devaraj, could you share the
> > parameters you used for this test which worked?
> >
> > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das
> wrote:
> >
> >> That sounds about right, Josh. Peter, in our internal testing we have
> seen
> >> this test failing and increasing timeouts (look at the test code
> options to
> >> do with increasing timeout) helped quite some.
> >> ________________________________________
> >> From: Josh Elser
> >> Sent: Wednesday, June 14, 2017 3:17 PM
> >> To: dev@hbase.apache.org
> >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> >>
> >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> >>> Hi,
> >>>
> >>> As one of my first task with HBase I started to look into
> >>> why IntegrationTestRegionReplicaReplication fails. I would like to get
> >> some
> >>> suggestions from you.
> >>>
> >>> I noticed when I run the test using normal cluster or minicluster I get
> >> the
> >>> same error messages: "Error checking data for key [null], no data
> >>> returned". I looked into the code and here are my conclusions.
> >>>
> >>> There are multiple threads writing data parallel which are read by
> >> multiple
> >>> reader threads simultaneously. Each writer gets a portion of the keys
> to
> >>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> >>> The reader threads get the elements (e.g. key=1000) from the queue and
> >>> these reader threads assume that all the keys up to this are already in
> >> the
> >>> database. Since we're using multiple writers it can happen that another
> >>> thread has not yet written key=500 and verifying these keys will cause
> >> the
> >>> test failure.
> >>>
> >>> Do you think my assumption is correct?
> >>
> >> Hi Peter,
> >>
> >> No, as my memory serves, this is not correct. Readers are not made aware
> >> of keys to verify until the write occur plus some delay. The delay is
> >> used to provide enough time for the internal region replication to take
> >> effect.
> >>
> >> So: primary-write, pause, [region replication happens in background],
> >> add updated key to read queue, reader gets key from queue verifies the
> >> value on a replica.
> >>
> >> The primary should always have seen the new value for a key. If the test
> >> is showing that a replica does not see the result, it's either a timing
> >> issue (you need to give a larger delay for HBase to perform the region
> >> replication) or a bug in the region replication framework itself. That
> >> said, if you can show that you are seeing what you describe, that sounds
> >> like the test framework itself is broken :)
> >>
> >>
> >>
> >>
> >
>
>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Peter Somogyi <ps...@cloudera.com>.

I tried with those parameters but the test still failed.
I noticed that some of the rows were not replicated to the replicas until
I called flush manually. I think memstore replication is not working
on my system even though it is enabled in the configuration.
I will look into it today.

On Fri, Jun 16, 2017 at 7:09 AM, Devaraj Das <dd...@hortonworks.com> wrote:

> Peter, do have a look at IntegrationTestRegionReplicaReplication.java ..
> At the top of the file, the ways to specify the options are documented ..
> You need to add something like -DIntegrationTestRegionReplicaReplication.read_delay_ms
> ..
> ________________________________________
> From: Josh Elser <jo...@gmail.com>
> Sent: Thursday, June 15, 2017 10:40 AM
> To: dev@hbase.apache.org
> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>
> I'd start trying a read_delay_ms=60000, region_replication=2,
> num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
> reader and writer threads.
>
> Again, this can be quite dependent on the kind of hardware you have.
> You'll definitely have to tweak ;)
>
> On 6/15/17 4:44 AM, Peter Somogyi wrote:
> > Thanks Josh and Devaraj!
> >
> > I will try to increase the timeouts. Devaraj, could you share the
> > parameters you used for this test which worked?
> >
> > On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <dd...@hortonworks.com>
> wrote:
> >
> >> That sounds about right, Josh. Peter, in our internal testing we have
> seen
> >> this test failing and increasing timeouts (look at the test code
> options to
> >> do with increasing timeout) helped quite some.
> >> ________________________________________
> >> From: Josh Elser <jo...@gmail.com>
> >> Sent: Wednesday, June 14, 2017 3:17 PM
> >> To: dev@hbase.apache.org
> >> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
> >>
> >> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> >>> Hi,
> >>>
> >>> As one of my first task with HBase I started to look into
> >>> why IntegrationTestRegionReplicaReplication fails. I would like to get
> >> some
> >>> suggestions from you.
> >>>
> >>> I noticed when I run the test using normal cluster or minicluster I get
> >> the
> >>> same error messages: "Error checking data for key [null], no data
> >>> returned". I looked into the code and here are my conclusions.
> >>>
> >>> There are multiple threads writing data parallel which are read by
> >> multiple
> >>> reader threads simultaneously. Each writer gets a portion of the keys
> to
> >>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> >>> The reader threads get the elements (e.g. key=1000) from the queue and
> >>> these reader threads assume that all the keys up to this are already in
> >> the
> >>> database. Since we're using multiple writers it can happen that another
> >>> thread has not yet written key=500 and verifying these keys will cause
> >> the
> >>> test failure.
> >>>
> >>> Do you think my assumption is correct?
> >>
> >> Hi Peter,
> >>
> >> No, as my memory serves, this is not correct. Readers are not made aware
> >> of keys to verify until the write occur plus some delay. The delay is
> >> used to provide enough time for the internal region replication to take
> >> effect.
> >>
> >> So: primary-write, pause, [region replication happens in background],
> >> add updated key to read queue, reader gets key from queue verifies the
> >> value on a replica.
> >>
> >> The primary should always have seen the new value for a key. If the test
> >> is showing that a replica does not see the result, it's either a timing
> >> issue (you need to give a larger delay for HBase to perform the region
> >> replication) or a bug in the region replication framework itself. That
> >> said, if you can show that you are seeing what you describe, that sounds
> >> like the test framework itself is broken :)
> >>
> >>
> >>
> >>
> >
>
>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Devaraj Das <dd...@hortonworks.com>.

Peter, do have a look at IntegrationTestRegionReplicaReplication.java .. At the top of the file, the ways to specify the options are documented .. You need to add something like -DIntegrationTestRegionReplicaReplication.read_delay_ms .. 
________________________________________
From: Josh Elser <jo...@gmail.com>
Sent: Thursday, June 15, 2017 10:40 AM
To: dev@hbase.apache.org
Subject: Re: Problem with IntegrationTestRegionReplicaReplication

I'd start trying a read_delay_ms=60000, region_replication=2,
num_keys_per_server=5000, num_regions_per_server=5 with a maybe 10's of
reader and writer threads.

Again, this can be quite dependent on the kind of hardware you have.
You'll definitely have to tweak ;)

On 6/15/17 4:44 AM, Peter Somogyi wrote:
> Thanks Josh and Devaraj!
>
> I will try to increase the timeouts. Devaraj, could you share the
> parameters you used for this test which worked?
>
> On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <dd...@hortonworks.com> wrote:
>
>> That sounds about right, Josh. Peter, in our internal testing we have seen
>> this test failing and increasing timeouts (look at the test code options to
>> do with increasing timeout) helped quite some.
>> ________________________________________
>> From: Josh Elser <jo...@gmail.com>
>> Sent: Wednesday, June 14, 2017 3:17 PM
>> To: dev@hbase.apache.org
>> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>>
>> On 6/14/17 3:53 AM, Peter Somogyi wrote:
>>> Hi,
>>>
>>> As one of my first task with HBase I started to look into
>>> why IntegrationTestRegionReplicaReplication fails. I would like to get
>> some
>>> suggestions from you.
>>>
>>> I noticed when I run the test using normal cluster or minicluster I get
>> the
>>> same error messages: "Error checking data for key [null], no data
>>> returned". I looked into the code and here are my conclusions.
>>>
>>> There are multiple threads writing data parallel which are read by
>> multiple
>>> reader threads simultaneously. Each writer gets a portion of the keys to
>>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
>>> The reader threads get the elements (e.g. key=1000) from the queue and
>>> these reader threads assume that all the keys up to this are already in
>> the
>>> database. Since we're using multiple writers it can happen that another
>>> thread has not yet written key=500 and verifying these keys will cause
>> the
>>> test failure.
>>>
>>> Do you think my assumption is correct?
>>
>> Hi Peter,
>>
>> No, as my memory serves, this is not correct. Readers are not made aware
>> of keys to verify until the write occur plus some delay. The delay is
>> used to provide enough time for the internal region replication to take
>> effect.
>>
>> So: primary-write, pause, [region replication happens in background],
>> add updated key to read queue, reader gets key from queue verifies the
>> value on a replica.
>>
>> The primary should always have seen the new value for a key. If the test
>> is showing that a replica does not see the result, it's either a timing
>> issue (you need to give a larger delay for HBase to perform the region
>> replication) or a bug in the region replication framework itself. That
>> said, if you can show that you are seeing what you describe, that sounds
>> like the test framework itself is broken :)
>>
>>
>>
>>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Josh Elser <jo...@gmail.com>.

I'd start by trying read_delay_ms=60000, region_replication=2,
num_keys_per_server=5000, num_regions_per_server=5, with maybe tens of
reader and writer threads.

Again, this can be quite dependent on the kind of hardware you have. 
You'll definitely have to tweak ;)
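
As a rough illustration, the starting values above can be rendered as JVM
system properties in the -DIntegrationTestRegionReplicaReplication.read_delay_ms
form mentioned elsewhere in this thread. This is a sketch under assumptions:
that the other three options take the same prefix is assumed here and should
be checked against the comment at the top of the test file, and the class
ItOptionFlags and its methods are hypothetical helpers, not part of HBase.
The thread counts are left out because no concrete numbers or option names
were given for them.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

public class ItOptionFlags {

    // Prefix taken from the -D example quoted in this thread.
    static final String PREFIX = "IntegrationTestRegionReplicaReplication";

    /** Renders each option as -D<PREFIX>.<name>=<value>, space separated, in map order. */
    public static String toFlags(Map<String, String> options) {
        StringJoiner joiner = new StringJoiner(" ");
        for (Map.Entry<String, String> e : options.entrySet()) {
            joiner.add("-D" + PREFIX + "." + e.getKey() + "=" + e.getValue());
        }
        return joiner.toString();
    }

    /** The starting values suggested above; expect to tune them per cluster. */
    public static Map<String, String> suggestedOptions() {
        Map<String, String> options = new LinkedHashMap<>();
        options.put("read_delay_ms", "60000");
        options.put("region_replication", "2");
        options.put("num_keys_per_server", "5000");
        options.put("num_regions_per_server", "5");
        return options;
    }

    public static void main(String[] args) {
        // Prints flags that could be appended to the mvn command line.
        System.out.println(toFlags(suggestedOptions()));
    }
}
```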

On 6/15/17 4:44 AM, Peter Somogyi wrote:
> Thanks Josh and Devaraj!
> 
> I will try to increase the timeouts. Devaraj, could you share the
> parameters you used for this test which worked?
> 
> On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <dd...@hortonworks.com> wrote:
> 
>> That sounds about right, Josh. Peter, in our internal testing we have seen
>> this test failing and increasing timeouts (look at the test code options to
>> do with increasing timeout) helped quite some.
>> ________________________________________
>> From: Josh Elser <jo...@gmail.com>
>> Sent: Wednesday, June 14, 2017 3:17 PM
>> To: dev@hbase.apache.org
>> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>>
>> On 6/14/17 3:53 AM, Peter Somogyi wrote:
>>> Hi,
>>>
>>> As one of my first task with HBase I started to look into
>>> why IntegrationTestRegionReplicaReplication fails. I would like to get
>> some
>>> suggestions from you.
>>>
>>> I noticed when I run the test using normal cluster or minicluster I get
>> the
>>> same error messages: "Error checking data for key [null], no data
>>> returned". I looked into the code and here are my conclusions.
>>>
>>> There are multiple threads writing data parallel which are read by
>> multiple
>>> reader threads simultaneously. Each writer gets a portion of the keys to
>>> write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
>>> The reader threads get the elements (e.g. key=1000) from the queue and
>>> these reader threads assume that all the keys up to this are already in
>> the
>>> database. Since we're using multiple writers it can happen that another
>>> thread has not yet written key=500 and verifying these keys will cause
>> the
>>> test failure.
>>>
>>> Do you think my assumption is correct?
>>
>> Hi Peter,
>>
>> No, as my memory serves, this is not correct. Readers are not made aware
>> of keys to verify until the write occur plus some delay. The delay is
>> used to provide enough time for the internal region replication to take
>> effect.
>>
>> So: primary-write, pause, [region replication happens in background],
>> add updated key to read queue, reader gets key from queue verifies the
>> value on a replica.
>>
>> The primary should always have seen the new value for a key. If the test
>> is showing that a replica does not see the result, it's either a timing
>> issue (you need to give a larger delay for HBase to perform the region
>> replication) or a bug in the region replication framework itself. That
>> said, if you can show that you are seeing what you describe, that sounds
>> like the test framework itself is broken :)
>>
>>
>>
>>
>

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Peter Somogyi <ps...@cloudera.com>.

Thanks Josh and Devaraj!

I will try to increase the timeouts. Devaraj, could you share the
parameters you used for this test which worked?

On Thu, Jun 15, 2017 at 6:44 AM, Devaraj Das <dd...@hortonworks.com> wrote:

> That sounds about right, Josh. Peter, in our internal testing we have seen
> this test failing and increasing timeouts (look at the test code options to
> do with increasing timeout) helped quite some.
> ________________________________________
> From: Josh Elser <jo...@gmail.com>
> Sent: Wednesday, June 14, 2017 3:17 PM
> To: dev@hbase.apache.org
> Subject: Re: Problem with IntegrationTestRegionReplicaReplication
>
> On 6/14/17 3:53 AM, Peter Somogyi wrote:
> > Hi,
> >
> > As one of my first tasks with HBase I started to look into
> > why IntegrationTestRegionReplicaReplication fails. I would like to get
> > some suggestions from you.
> >
> > I noticed that when I run the test on a normal cluster or a minicluster
> > I get the same error message: "Error checking data for key [null], no
> > data returned". I looked into the code and here are my conclusions.
> >
> > There are multiple threads writing data in parallel, and that data is
> > read by multiple reader threads simultaneously. Each writer gets a
> > portion of the keys to write (e.g. 0-2000) and these keys are added to
> > a ConstantDelayQueue. The reader threads get the elements (e.g.
> > key=1000) from the queue and assume that all the keys up to this one
> > are already in the database. Since we're using multiple writers it can
> > happen that another thread has not yet written key=500, and verifying
> > these keys will cause the test failure.
> >
> > Do you think my assumption is correct?
>
> Hi Peter,
>
> No, as my memory serves, this is not correct. Readers are not made aware
> of keys to verify until the write occurs, plus some delay. The delay is
> used to provide enough time for the internal region replication to take
> effect.
>
> So: primary-write, pause, [region replication happens in background],
> add updated key to read queue, reader gets key from queue and verifies
> the value on a replica.
>
> The primary should always have seen the new value for a key. If the test
> is showing that a replica does not see the result, it's either a timing
> issue (you need to give a larger delay for HBase to perform the region
> replication) or a bug in the region replication framework itself. That
> said, if you can show that you are seeing what you describe, that sounds
> like the test framework itself is broken :)

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Devaraj Das <dd...@hortonworks.com>.

That sounds about right, Josh. Peter, in our internal testing we have seen this test fail, and increasing the timeouts (look at the test's options for increasing timeouts) helped quite a bit.
________________________________________
From: Josh Elser <jo...@gmail.com>
Sent: Wednesday, June 14, 2017 3:17 PM
To: dev@hbase.apache.org
Subject: Re: Problem with IntegrationTestRegionReplicaReplication

On 6/14/17 3:53 AM, Peter Somogyi wrote:
> Hi,
>
> As one of my first tasks with HBase I started to look into
> why IntegrationTestRegionReplicaReplication fails. I would like to get some
> suggestions from you.
>
> I noticed that when I run the test on a normal cluster or a minicluster I
> get the same error message: "Error checking data for key [null], no data
> returned". I looked into the code and here are my conclusions.
>
> There are multiple threads writing data in parallel, and that data is read
> by multiple reader threads simultaneously. Each writer gets a portion of the
> keys to write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> The reader threads get the elements (e.g. key=1000) from the queue and
> assume that all the keys up to this one are already in the database. Since
> we're using multiple writers it can happen that another thread has not yet
> written key=500, and verifying these keys will cause the test failure.
>
> Do you think my assumption is correct?

Hi Peter,

No, as my memory serves, this is not correct. Readers are not made aware
of keys to verify until the write occurs, plus some delay. The delay is
used to provide enough time for the internal region replication to take
effect.

So: primary-write, pause, [region replication happens in background],
add updated key to read queue, reader gets key from queue and verifies
the value on a replica.

The primary should always have seen the new value for a key. If the test
is showing that a replica does not see the result, it's either a timing
issue (you need to give a larger delay for HBase to perform the region
replication) or a bug in the region replication framework itself. That
said, if you can show that you are seeing what you describe, that sounds
like the test framework itself is broken :)

Re: Problem with IntegrationTestRegionReplicaReplication

Posted by Josh Elser <jo...@gmail.com>.

On 6/14/17 3:53 AM, Peter Somogyi wrote:
> Hi,
>
> As one of my first tasks with HBase I started to look into
> why IntegrationTestRegionReplicaReplication fails. I would like to get some
> suggestions from you.
>
> I noticed that when I run the test on a normal cluster or a minicluster I
> get the same error message: "Error checking data for key [null], no data
> returned". I looked into the code and here are my conclusions.
>
> There are multiple threads writing data in parallel, and that data is read
> by multiple reader threads simultaneously. Each writer gets a portion of the
> keys to write (e.g. 0-2000) and these keys are added to a ConstantDelayQueue.
> The reader threads get the elements (e.g. key=1000) from the queue and
> assume that all the keys up to this one are already in the database. Since
> we're using multiple writers it can happen that another thread has not yet
> written key=500, and verifying these keys will cause the test failure.
>
> Do you think my assumption is correct?

Hi Peter,

No, as my memory serves, this is not correct. Readers are not made aware
of keys to verify until the write occurs, plus some delay. The delay is
used to provide enough time for the internal region replication to take
effect.

So: primary-write, pause, [region replication happens in background],
add updated key to read queue, reader gets key from queue and verifies
the value on a replica.

The primary should always have seen the new value for a key. If the test 
is showing that a replica does not see the result, it's either a timing 
issue (you need to give a larger delay for HBase to perform the region 
replication) or a bug in the region replication framework itself. That 
said, if you can show that you are seeing what you describe, that sounds 
like the test framework itself is broken :)
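
The sequence described above (primary write, pause, key added to the read
queue, reader verifies on a replica) can be illustrated with a small
standalone sketch. It does not use HBase or the test's actual classes: two
in-memory maps stand in for the primary and a replica, a scheduled task stands
in for async region replication, and java.util.concurrent.DelayQueue stands in
for the ConstantDelayQueue mentioned in the thread. The class, method, and
parameter names (DelayedVerifySketch, verify, replicationLagMs, readDelayMs)
are all hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.DelayQueue;
import java.util.concurrent.Delayed;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the test flow; none of these names come from HBase.
class DelayedVerifySketch {

  /** A written key that readers may only see once a fixed delay has elapsed. */
  static final class DelayedKey implements Delayed {
    final long key;
    final long readyAtNanos;

    DelayedKey(long key, long delayMs) {
      this.key = key;
      this.readyAtNanos = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(delayMs);
    }

    @Override
    public long getDelay(TimeUnit unit) {
      return unit.convert(readyAtNanos - System.nanoTime(), TimeUnit.NANOSECONDS);
    }

    @Override
    public int compareTo(Delayed other) {
      return Long.compare(getDelay(TimeUnit.NANOSECONDS), other.getDelay(TimeUnit.NANOSECONDS));
    }
  }

  /** Returns how many keys were missing on the simulated replica when read. */
  static int verify(int numKeys, long replicationLagMs, long readDelayMs) {
    Map<Long, String> primary = new ConcurrentHashMap<>();
    Map<Long, String> replica = new ConcurrentHashMap<>();
    DelayQueue<DelayedKey> readQueue = new DelayQueue<>();
    ScheduledExecutorService replication = Executors.newSingleThreadScheduledExecutor();
    int missing = 0;
    try {
      for (long k = 0; k < numKeys; k++) {
        final long key = k;
        primary.put(key, "v" + key);                      // 1. primary write
        replication.schedule(                             // 2. replication in the background
            () -> { replica.put(key, primary.get(key)); },
            replicationLagMs, TimeUnit.MILLISECONDS);
        readQueue.add(new DelayedKey(key, readDelayMs));  // 3. key reaches readers after the pause
      }
      for (int i = 0; i < numKeys; i++) {
        DelayedKey next = readQueue.take();               // 4. blocks until the delay has expired
        if (!replica.containsKey(next.key)) {
          missing++;                                      // replica had not caught up yet
        }
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while verifying", e);
    } finally {
      replication.shutdownNow();
    }
    return missing;
  }

  public static void main(String[] args) {
    System.out.println("missing with long pause:  " + verify(20, 10, 200));
    System.out.println("missing with short pause: " + verify(20, 300, 20));
  }
}
```

In this toy model a read fails only when the pause before a key becomes
readable is shorter than the replication lag, which corresponds to the "timing
issue" case above. It says nothing about whether the real test or the region
replication code has a bug.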