Posted to user@hbase.apache.org by "George P. Stathis" <gs...@traackr.com> on 2011/04/19 21:08:13 UTC

Latency related configs for 0.90

Hi all,

In this chapter of our 0.89 to 0.90 migration saga, we are seeing what we
suspect might be latency related artifacts.

The setting:

   - Our EC2 dev environment running our CI builds
   - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode

We have several unit tests that have started mysteriously failing in random
ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
tests used to run against 0.89 and never failed before. They also run OK on
our local MacBooks. On EC2, we are seeing lots of issues where the setup
data is not being persisted in time for the tests to assert against it, and
it is not always being torn down properly either.

We first suspected our new code around secondary indexes; we do have
extensive unit tests around it that provide us with a solid level of
confidence that it works properly in our CRUD scenarios. We also performance
tested against the old hbase-trx contrib code and our new secondary indexes
seem to be running slightly faster as well (of course, that could be due to
the bump from 0.89 to 0.90).

We first started seeing issues running our Hudson build on the same machine
as the HBase pseudo-cluster. We figured that was putting too much load on
the box, so we created a separate large instance on EC2 to host just the
0.90 stack. This migration at times nearly quadrupled the number of failing
unit tests. The only difference between the first and second CI setups is
the network in between.

Before we start tearing our code apart line by line, I'd like to see if there
are latency-related configuration tweaks we could try to make the setup
more resilient to network lag. Are there any HBase/ZooKeeper settings that
might help? For instance, we see things such as HBASE_SLAVE_SLEEP
in hbase-env.sh. Can that help?

Any suggestions are more than welcome. Also, the overview above may not be
enough to go on, so please let me know if I could provide more details.

Thank you in advance for any help.

-GS

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
On Thu, Apr 28, 2011 at 3:31 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> I'll try to repro here.
>
> Also I was thinking that in order to slow you down that much, you
> would need to have a query pattern that requires reading and writing a
> LOT to .META., so would it be possible for you to verify this indeed
> your case?
>

I doubt that's the case. The unit tests we are using were doing simple CRUD
operations on regular data tables. Granted, they attempted to do them
quickly, but still: not that much data was added or removed at any one time.
It was just quickly added and removed repeatedly for each test method, via
the setup/teardown methods in our unit tests.


>
> And finally, just to make sure, you're saying that you didn't see that
> sort of performance issue with 0.89 right? Which one exactly? The only
> difference I can think of between 0924 and 0.90 is that the latter
> does .META. warming by fetching more rows after every lookup. I can't
> think of any other cause at the moment.
>

Yes, the same exact unit tests were doing just fine under 0.89. BUT: we were
not running trunk 0.89. We were running this one:
https://github.com/jameskennedy/hbase/tree/HLogSplit_0.89.20100726 because
we were using the hbase-trx contrib. This whole migration was intended to
put us back to a supported 0.90 trunk.


>
> Thank you so much for keeping us updated,
>
> J-D
>
> On Wed, Apr 27, 2011 at 5:51 PM, George P. Stathis <gs...@traackr.com>
> wrote:
> > On Wed, Apr 27, 2011 at 8:13 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> I have a hard time digesting this... You ran the script, didn't change
> >> anything else, ran the test and everything was back to normal, right?
> >> Did you restart HBase or moved .META. around? The reason I'm asking is
> >> that this script doesn't have any effect until .META. is reopened so I
> >> would be quite flabbergasted to learn that _just_ running the script
> >> makes things faster.
> >>
> >
> > No, I did indeed restart hbase as the instructions suggest we do after
> the
> > script is executed.
> >
> >
> >>
> >> Another thing that's weird is that that setting for .META. has been
> >> there since 0.20.0 so if it's the cause then it should have been the
> >> same with 0.89
> >>
> >
> > The setting were there in the legacy dev hbase instance we have. It was
> set
> > to 16K and the script set it to 64K. So it seems this was expected. Where
> I
> > don't understand is the effect the script had on a brand new 0.90
> install.
> > Before we ran the script, the setting was not there. After we ran it, it
> was
> > there with the 64K value.
> >
> > For both legacy and new instances, running the script and re-starting
> hbase
> > solved our issue.
> >
> >
>

Re: Latency related configs for 0.90

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I'll try to repro here.

Also, I was thinking that in order to slow you down that much, you
would need to have a query pattern that reads and writes a
LOT to .META., so would it be possible for you to verify whether this is
indeed your case?

And finally, just to make sure, you're saying that you didn't see that
sort of performance issue with 0.89 right? Which one exactly? The only
difference I can think of between 0924 and 0.90 is that the latter
does .META. warming by fetching more rows after every lookup. I can't
think of any other cause at the moment.

Thank you so much for keeping us updated,

J-D

On Wed, Apr 27, 2011 at 5:51 PM, George P. Stathis <gs...@traackr.com> wrote:
> On Wed, Apr 27, 2011 at 8:13 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> I have a hard time digesting this... You ran the script, didn't change
>> anything else, ran the test and everything was back to normal, right?
>> Did you restart HBase or moved .META. around? The reason I'm asking is
>> that this script doesn't have any effect until .META. is reopened so I
>> would be quite flabbergasted to learn that _just_ running the script
>> makes things faster.
>>
>
> No, I did indeed restart hbase as the instructions suggest we do after the
> script is executed.
>
>
>>
>> Another thing that's weird is that that setting for .META. has been
>> there since 0.20.0 so if it's the cause then it should have been the
>> same with 0.89
>>
>
> The setting were there in the legacy dev hbase instance we have. It was set
> to 16K and the script set it to 64K. So it seems this was expected. Where I
> don't understand is the effect the script had on a brand new 0.90 install.
> Before we ran the script, the setting was not there. After we ran it, it was
> there with the 64K value.
>
> For both legacy and new instances, running the script and re-starting hbase
> solved our issue.
>
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
On Wed, Apr 27, 2011 at 8:13 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> I have a hard time digesting this... You ran the script, didn't change
> anything else, ran the test and everything was back to normal, right?
> Did you restart HBase or moved .META. around? The reason I'm asking is
> that this script doesn't have any effect until .META. is reopened so I
> would be quite flabbergasted to learn that _just_ running the script
> makes things faster.
>

No, I did indeed restart hbase as the instructions suggest we do after the
script is executed.


>
> Another thing that's weird is that that setting for .META. has been
> there since 0.20.0 so if it's the cause then it should have been the
> same with 0.89
>

The setting was there in the legacy dev HBase instance we have. It was set
to 16K and the script set it to 64K, so that seems expected. What I
don't understand is the effect the script had on a brand new 0.90 install.
Before we ran the script, the setting was not there. After we ran it, it was
there with the 64K value.

For both legacy and new instances, running the script and re-starting hbase
solved our issue.


>
> Hope we can resolve this mystery.
>
> J-D
>
> On Mon, Apr 25, 2011 at 4:01 PM, George P. Stathis <gs...@traackr.com>
> wrote:
> > Quick update:
> >
> > It turns out that we needed to run bin/set_meta_memstore_size.rb (
> > http://hbase.apache.org/upgrading.html) . I'm curious though: I
> understand
> > that our legacy dev machine would suffer because of the old
> > MEMSTORE_FLUSHSIZE setting. But we setup a brand new dev box with a
> pristine
> > 0.90 version that had no legacy MEMSTORE_FLUSHSIZE setting. Why would
> that
> > instance have been affected? I looked at the -ROOT- table before I ran
> the
> > script and there was no MEMSTORE_FLUSHSIZE setting. As soon as I execute
> the
> > script and re-run the unit tests, they all started passing again. Also,
> the
> > legacy hbase instances running on our dev laptops didn't have the script
> > executed against them. Why would they work without it?
> >
> > Anyway, trying to understand for future reference.
> >
> > -GS
> >
> > On Thu, Apr 21, 2011 at 2:07 PM, George P. Stathis <gstathis@traackr.com
> >wrote:
> >
> >>
> >> On Thu, Apr 21, 2011 at 1:56 PM, Jean-Daniel Cryans <
> jdcryans@apache.org>wrote:
> >>
> >>> On Thu, Apr 21, 2011 at 10:49 AM, George P. Stathis
> >>> <gs...@traackr.com> wrote:
> >>> > I gave the thread that name because that was the best way I could
> come
> >>> up
> >>> > with to describe the symptoms. We still have the problem, it just may
> >>> end up
> >>> > be timestamp related after all.
> >>> >
> >>>
> >>> No worries.
> >>>
> >>> > This does look similar, yes. I see there is no resolution currently.
> >>>
> >>> Well, if your use case is really to Put and Delete the same stuff
> >>> right after the other, then yeah you will need a time resolution that
> >>> more precise than milliseconds. Or you pass your own timestamps by
> >>> doing the first puts are current time, then delete at current time +1,
> >>> then put again if you want at current time +2, etc.
> >>>
> >>
> >> Looking at HBASE-2256, seems that it has been around for a while. We
> were
> >> not seeing that issue when we were running on 0.20.6 and 0.89. The unit
> >> tests that are failing have been in place since then. So I'm actually
> less
> >> hopeful now. Still, I'll check the clocks on EC2.
> >>
> >>
> >>>
> >>> >
> >>> > We have also noticed this problem in our local systems (laptops) but
> >>> > actually much less frequently. I'll check the EC2 dates as soon as
> AWS
> >>> > recovers (http://twitter.com/#!/search/%23aws). Our dev setup is
> >>> currently
> >>> > hosed...
> >>>
> >>> Heh :P
> >>>
> >>> J-D
> >>>
> >>
> >>
> >
>

Re: Latency related configs for 0.90

Posted by Jean-Daniel Cryans <jd...@apache.org>.
I have a hard time digesting this... You ran the script, didn't change
anything else, ran the test, and everything was back to normal, right?
Did you restart HBase or move .META. around? The reason I'm asking is
that this script doesn't have any effect until .META. is reopened, so I
would be quite flabbergasted to learn that _just_ running the script
makes things faster.

Another thing that's weird is that that setting for .META. has been
there since 0.20.0, so if it's the cause then it should have been the
same with 0.89.

Hope we can resolve this mystery.

J-D

On Mon, Apr 25, 2011 at 4:01 PM, George P. Stathis <gs...@traackr.com> wrote:
> Quick update:
>
> It turns out that we needed to run bin/set_meta_memstore_size.rb (
> http://hbase.apache.org/upgrading.html) . I'm curious though: I understand
> that our legacy dev machine would suffer because of the old
> MEMSTORE_FLUSHSIZE setting. But we setup a brand new dev box with a pristine
> 0.90 version that had no legacy MEMSTORE_FLUSHSIZE setting. Why would that
> instance have been affected? I looked at the -ROOT- table before I ran the
> script and there was no MEMSTORE_FLUSHSIZE setting. As soon as I execute the
> script and re-run the unit tests, they all started passing again. Also, the
> legacy hbase instances running on our dev laptops didn't have the script
> executed against them. Why would they work without it?
>
> Anyway, trying to understand for future reference.
>
> -GS
>
> On Thu, Apr 21, 2011 at 2:07 PM, George P. Stathis <gs...@traackr.com>wrote:
>
>>
>> On Thu, Apr 21, 2011 at 1:56 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>>
>>> On Thu, Apr 21, 2011 at 10:49 AM, George P. Stathis
>>> <gs...@traackr.com> wrote:
>>> > I gave the thread that name because that was the best way I could come
>>> up
>>> > with to describe the symptoms. We still have the problem, it just may
>>> end up
>>> > be timestamp related after all.
>>> >
>>>
>>> No worries.
>>>
>>> > This does look similar, yes. I see there is no resolution currently.
>>>
>>> Well, if your use case is really to Put and Delete the same stuff
>>> right after the other, then yeah you will need a time resolution that
>>> more precise than milliseconds. Or you pass your own timestamps by
>>> doing the first puts are current time, then delete at current time +1,
>>> then put again if you want at current time +2, etc.
>>>
>>
>> Looking at HBASE-2256, seems that it has been around for a while. We were
>> not seeing that issue when we were running on 0.20.6 and 0.89. The unit
>> tests that are failing have been in place since then. So I'm actually less
>> hopeful now. Still, I'll check the clocks on EC2.
>>
>>
>>>
>>> >
>>> > We have also noticed this problem in our local systems (laptops) but
>>> > actually much less frequently. I'll check the EC2 dates as soon as AWS
>>> > recovers (http://twitter.com/#!/search/%23aws). Our dev setup is
>>> currently
>>> > hosed...
>>>
>>> Heh :P
>>>
>>> J-D
>>>
>>
>>
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
Quick update:

It turns out that we needed to run bin/set_meta_memstore_size.rb
(http://hbase.apache.org/upgrading.html). I'm curious though: I understand
that our legacy dev machine would suffer because of the old
MEMSTORE_FLUSHSIZE setting. But we set up a brand new dev box with a pristine
0.90 version that had no legacy MEMSTORE_FLUSHSIZE setting. Why would that
instance have been affected? I looked at the -ROOT- table before I ran the
script and there was no MEMSTORE_FLUSHSIZE setting. As soon as I executed the
script and re-ran the unit tests, they all started passing again. Also, the
legacy HBase instances running on our dev laptops didn't have the script
executed against them. Why would they work without it?
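
For reference, one way to see what .META. is currently carrying for
MEMSTORE_FLUSHSIZE is to read its table descriptor from a client. A minimal
sketch, assuming the stock 0.90 Java client (the class name is just
illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.client.HTable;

public class MetaFlushSizeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // Open .META. like any other table and read its descriptor.
    HTable meta = new HTable(conf, HConstants.META_TABLE_NAME);
    HTableDescriptor desc = meta.getTableDescriptor();
    // MEMSTORE_FLUSHSIZE is stored as a plain key/value on the descriptor;
    // null means the default applies.
    System.out.println("MEMSTORE_FLUSHSIZE = " + desc.getValue("MEMSTORE_FLUSHSIZE"));
    meta.close();
  }
}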

Anyway, trying to understand for future reference.

-GS

On Thu, Apr 21, 2011 at 2:07 PM, George P. Stathis <gs...@traackr.com>wrote:

>
> On Thu, Apr 21, 2011 at 1:56 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> On Thu, Apr 21, 2011 at 10:49 AM, George P. Stathis
>> <gs...@traackr.com> wrote:
>> > I gave the thread that name because that was the best way I could come
>> up
>> > with to describe the symptoms. We still have the problem, it just may
>> end up
>> > be timestamp related after all.
>> >
>>
>> No worries.
>>
>> > This does look similar, yes. I see there is no resolution currently.
>>
>> Well, if your use case is really to Put and Delete the same stuff
>> right after the other, then yeah you will need a time resolution that
>> more precise than milliseconds. Or you pass your own timestamps by
>> doing the first puts are current time, then delete at current time +1,
>> then put again if you want at current time +2, etc.
>>
>
> Looking at HBASE-2256, seems that it has been around for a while. We were
> not seeing that issue when we were running on 0.20.6 and 0.89. The unit
> tests that are failing have been in place since then. So I'm actually less
> hopeful now. Still, I'll check the clocks on EC2.
>
>
>>
>> >
>> > We have also noticed this problem in our local systems (laptops) but
>> > actually much less frequently. I'll check the EC2 dates as soon as AWS
>> > recovers (http://twitter.com/#!/search/%23aws). Our dev setup is
>> currently
>> > hosed...
>>
>> Heh :P
>>
>> J-D
>>
>
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
On Thu, Apr 21, 2011 at 1:56 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> On Thu, Apr 21, 2011 at 10:49 AM, George P. Stathis
> <gs...@traackr.com> wrote:
> > I gave the thread that name because that was the best way I could come up
> > with to describe the symptoms. We still have the problem, it just may end
> up
> > be timestamp related after all.
> >
>
> No worries.
>
> > This does look similar, yes. I see there is no resolution currently.
>
> Well, if your use case is really to Put and Delete the same stuff
> right after the other, then yeah you will need a time resolution that
> more precise than milliseconds. Or you pass your own timestamps by
> doing the first puts are current time, then delete at current time +1,
> then put again if you want at current time +2, etc.
>

Looking at HBASE-2256, it seems it has been around for a while. We were
not seeing that issue when we were running 0.20.6 and 0.89, and the unit
tests that are failing have been in place since then. So I'm actually less
hopeful now. Still, I'll check the clocks on EC2.


>
> >
> > We have also noticed this problem in our local systems (laptops) but
> > actually much less frequently. I'll check the EC2 dates as soon as AWS
> > recovers (http://twitter.com/#!/search/%23aws). Our dev setup is
> currently
> > hosed...
>
> Heh :P
>
> J-D
>

Re: Latency related configs for 0.90

Posted by Jean-Daniel Cryans <jd...@apache.org>.
On Thu, Apr 21, 2011 at 10:49 AM, George P. Stathis
<gs...@traackr.com> wrote:
> I gave the thread that name because that was the best way I could come up
> with to describe the symptoms. We still have the problem, it just may end up
> be timestamp related after all.
>

No worries.

> This does look similar, yes. I see there is no resolution currently.

Well, if your use case is really to Put and Delete the same stuff one right
after the other, then yeah, you will need a time resolution that is more
precise than milliseconds. Or you pass your own timestamps: do the first
puts at the current time, then the delete at current time + 1, then, if you
want, another put at current time + 2, etc. (sketched below).
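
A minimal sketch of that explicit-timestamp pattern, assuming the stock 0.90
Java client (the table, family, qualifier and value names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class ExplicitTimestamps {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "some_table");  // placeholder table name

    byte[] row = Bytes.toBytes("some_row_id");
    byte[] fam = Bytes.toBytes("familyA");
    byte[] qual = Bytes.toBytes("qualifierA");

    long t = System.currentTimeMillis();

    // First put, pinned at t.
    Put put = new Put(row);
    put.add(fam, qual, t, Bytes.toBytes("valueA"));
    table.put(put);

    // Whole-row delete pinned at t + 1 so it cannot collide with the put
    // above (the third argument is an optional RowLock).
    table.delete(new Delete(row, t + 1, null));

    // Re-insert at t + 2 so the new cell is not masked by the delete marker.
    Put rePut = new Put(row);
    rePut.add(fam, qual, t + 2, Bytes.toBytes("valueA2"));
    table.put(rePut);

    table.close();
  }
}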

>
> We have also noticed this problem in our local systems (laptops) but
> actually much less frequently. I'll check the EC2 dates as soon as AWS
> recovers (http://twitter.com/#!/search/%23aws). Our dev setup is currently
> hosed...

Heh :P

J-D

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
On Thu, Apr 21, 2011 at 12:28 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Sorry, I mostly focused on performance related issues I saw in your
> code given this thread is called "Latency related configs for 0.90".
> So it's not the case anymore right?
>

I gave the thread that name because that was the best way I could come up
with to describe the symptoms. We still have the problem; it just may end up
being timestamp related after all.


>
> What you are describing indeed sounds exactly like a timestamp issue,
> such as https://issues.apache.org/jira/browse/HBASE-2256


This does look similar, yes. I see there is no resolution currently.


>
>
> In the code you just pasted I don't see anything regarding the setting
> of timestamps BTW.
>
> Finally, it seems odd to me that the issue happens only when the
> machines are distant... HBASE-2256 happens when the client is as close
> as possible to the server. You might want to take a look at the
> system's dates and see if there's a skew.
>

We have also noticed this problem in our local systems (laptops) but
actually much less frequently. I'll check the EC2 dates as soon as AWS
recovers (http://twitter.com/#!/search/%23aws). Our dev setup is currently
hosed...



>
> J-D
>
> On Thu, Apr 21, 2011 at 5:48 AM, George P. Stathis <gs...@traackr.com>
> wrote:
> > Thanks J-D. Here is an updated one: http://pastebin.com/MZDgVBam
> >
> > I posted this test case as a sample of the type of operations we are
> doing;
> > it's not the actual code itself though. In our actual code, the htable
> pool
> > and config are all spring managed singleton instances available across
> the
> > entire app, so we don't keep creating them and dropping them. I fixed the
> > unit test to take your pointers into consideration. It allows to
> > drop hbase.zookeeper.property.maxClientCnxns back to the default 30, so
> > thanks for that.
> >
> > But this example was simply meant to illustrate what we are trying to do
> > with hbase; basically, create a secondary index row for a given record.
> >
> > The actual symptoms that we are experiencing are not maxClientCnxns
> issues.
> > We are seeing data not being persisted when we think they are or not
> being
> > entirely deleted when we think they are; this mostly happens when we
> > introduce a network in between the client and the hbase server (although
> > it's been seen to happen much less frequently when the client and server
> are
> > on the same box).
> >
> > As an example, we see things like this (pseudo-code):
> >
> > // Insert data
> > Put p = new Put("some_row_id");
> > p.add("familiyA","qualifierA","valueA");
> > p.add("familiyA","qualifierB","valueB");
> > p.add("familiyA","qualifierC","valueC");
> > table.put(p);
> >
> > // Validate row presence
> > Result row = table.get(new Get("some_row_id"));
> > System.out.println(row.toString());
> > => keyvalues={some_row_id/
> > familiyA:qualifierA/1303389288609/Put/vlen=13,
> > some_row_id/familiyA:qualifierB/1303389288610/Put/vlen=13,
> > some_row_id/familiyA:qualifierC/1303389289262/Put/vlen=13}
> >
> > // Delete row
> > table.delete(new Delete("some_row_id"));
> >
> > // Validate row deletion
> > Result deletedRow = table.get(new Get("some_row_id"));
> > System.out.println(row.toString());
> > => keyvalues={some_row_id/familiyA:qualifierC/1303389289262/Put/vlen=13}
> ///
> > orphaned cell !!!
> >
> > I was seeing this case happen last night for hours on end with the same
> test
> > data. I began suspecting timestamp issues as possible culprits. I went to
> > bed and left the test environment alone overnight (no processes running
> on
> > it). This morning, I re-ran the same test case: the orphaned cell
> phenomenon
> > is no longer happening. So it's very hit or miss, but the example I gave
> > above was definitely reproducible at will for a few hours.
> >
> > Are there any known cases where a deliberate delete on an entire row will
> > still leave data behind? Could we be messing our timestamps in such a way
> > that we could be causing this?
> >
> > -GS
> >
> >
> >
> > On Wed, Apr 20, 2011 at 6:58 PM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
> >
> >> Regarding the test:
> >>
> >>  - Try to only keep one HBaseAdmin, one HTablePool and always reuse
> >> the same conf between tests, creating a new HBA or HTP creates a new
> >> HBaseConfiguration thus a new connection. Use methods like
> >> setUpBeforeClass. Another option is to close the connection once you
> >> used those classes and the close the first one in tearDown that you
> >> created in setUp. Right now I can count 25 connections being created
> >> in this test (I know it stucks, it's a regression in 0.90)
> >>  - The fact that you are creating new HTablePools in do* means you are
> >> re-creating new HTables for almost every request you are doing and
> >> that's a pretty expensive operation. Again, keeping only a single
> >> instance will help a lot.
> >>
> >> That's the most obvious stuff I saw.
> >>
> >> J-D
> >>
> >> On Wed, Apr 20, 2011 at 12:46 PM, George P. Stathis
> >> <gs...@traackr.com> wrote:
> >> > On Wed, Apr 20, 2011 at 12:48 PM, Stack <st...@duboce.net> wrote:
> >> >
> >> >> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
> >> >> <gs...@traackr.com> wrote:
> >> >> > We have several unit tests that have started mysteriously failing
> in
> >> >> random
> >> >> > ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3.
> >> Those
> >> >> > tests used to run against 0.89 and never failed before. They also
> run
> >> OK
> >> >> on
> >> >> > our local macbooks. On EC2, we are seeing lots of issues where the
> >> setup
> >> >> > data is not being persisted in time for the tests to assert against
> >> them.
> >> >> > They are also not always being torn down properly.
> >> >> >
> >> >>
> >> >> These are your tests George dependent on HBase.  What are they asking
> >> >> of HBase?  You are spinning up a cluster and then the takedown is not
> >> >> working?   Want to pastebin some log?  We might see something.
> >> >>
> >> >
> >> >
> >> > It's not practical to paste all the secondary-indexing code we have in
> >> > place. It's very likely that there is an issue in our code though, so
> I
> >> > don't want to send folks down a rabbit hole. I just wanted to validate
> >> that
> >> > there are no new configs in 0.90 (from 0.89) that could affect
> read/write
> >> > consistency.
> >> >
> >> > I created a test that simulates what most of our secondary-indexing
> code
> >> > does:
> >> >
> >> > http://pastebin.com/M9qKv87u
> >> >
> >> > It's a simplified version and of course, this one does not fail, or
> >> rather,
> >> > I have not been able to make it fail in the same way. The only thing
> that
> >> > I've hit with this test in pseudo-distributed mode
> >> > is hbase.zookeeper.property.maxClientCnxns which I bumped up and was
> able
> >> to
> >> > force it past it. The issue we are seeing does not throw any errors in
> >> any
> >> > of the master/regionserver/zookeeper logs, so, right now, all
> indications
> >> > are that the problem is on our side. I just need to diff deeper.
> >> >
> >> > BTW, we are not spinning up a temporary mini-cluster to test; instead,
> we
> >> > have a dedicated dev pseudo-distributed machine against which our CI
> >> tests
> >> > run. That's the environment that is presenting issues at the moment.
> >> Again,
> >> > the odd part is that we have setup our local instances the same way as
> >> our
> >> > dev pseudo-distributed machine and the tests pass. The differences are
> >> that
> >> > we run on macs and the dev instance is on EC2.
> >> >
> >> >
> >> >>
> >> >> > We first started seeing issues running our hudson build on the same
> >> >> machine
> >> >> > as the hbase pseudo-cluster. We figured that was putting too much
> load
> >> on
> >> >> > the box, so we created a separate large instance on EC2 to host
> just
> >> the
> >> >> > 0.90 stack. This migration nearly quadrupled the number of unit
> tests
> >> >> > failing at times. The only difference between for first and second
> CI
> >> >> setup
> >> >> > is the network in between.
> >> >> >
> >> >>
> >> >> Yeah.  EC2.  But we should be able to manage with a flakey network
> >> anyways.
> >> >>
> >> >
> >> > Just wanted to make sure that this was indeed the case.
> >> >
> >> >
> >> >>
> >> >>
> >> >> > Before we start tearing down our code line by line, I'd like to see
> if
> >> >> there
> >> >> > are latency related configuration tweaks we could try to make the
> >> setup
> >> >> > more resilient to network lag. Are there any hbase/zookepper
> settings
> >> >> that
> >> >> > might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> >> >> > in hbase-env.sh . Can that help?
> >> >> >
> >> >>
> >> >> You've seen that hbase uses a different config. when it runs tests;
> >> >> its in src/tests/resources/hbase-site.xml.
> >> >>
> >> >> But if stuff used to work on 0.89 w/ old config. this is probably not
> >> it.
> >> >>
> >> >
> >> > I reverted all our configs back to default but the issue remains. I'll
> >> take
> >> > a look at the test config and see if any of those settings may help
> out.
> >> > From what I can gather at first glance, the test settings are more
> >> > aggressive actually, so they seem even less tolerant of delays.
> >> >
> >> > Will keep digging and I'll post and update when we get somewhere.
> >> >
> >> >
> >> >>
> >> >> > Any suggestions are more than welcome. Also, the overview above may
> >> not
> >> >> be
> >> >> > enough to go on, so please let me know if I could provide more
> >> details.
> >> >> >
> >> >>
> >> >> I think pastebin of a failing test, one that used pass, with
> >> >> description (or code) of what is being done would be place to start;
> >> >> we might recognize the diff in 0.89 to 0.90.
> >> >>
> >> >> St.Ack
> >> >>
> >> >
> >>
> >
>

Re: Latency related configs for 0.90

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Sorry, I mostly focused on the performance-related issues I saw in your
code, given this thread is called "Latency related configs for 0.90".
So that's not the case anymore, right?

What you are describing indeed sounds exactly like a timestamp issue,
such as https://issues.apache.org/jira/browse/HBASE-2256
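
The kind of collision HBASE-2256 describes can be forced deliberately by
pinning a put and a whole-row delete to the same millisecond: the delete
marker then masks the put even though the put was issued afterwards. A
minimal sketch, assuming the 0.90 client (table and column names are
placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class TimestampCollision {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "some_table");  // placeholder table name
    byte[] row = Bytes.toBytes("some_row_id");

    long t = System.currentTimeMillis();

    // Whole-row delete pinned at t.
    table.delete(new Delete(row, t, null));

    // Put issued after the delete, but also at t: the delete marker covers
    // every cell in the row with a timestamp <= t, so this cell is masked.
    Put put = new Put(row);
    put.add(Bytes.toBytes("familyA"), Bytes.toBytes("qualifierA"), t, Bytes.toBytes("v1"));
    table.put(put);

    Result r = table.get(new Get(row));
    System.out.println("empty = " + r.isEmpty());  // expected: true, the put is masked

    table.close();
  }
}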

In the code you just pasted I don't see anything regarding the setting
of timestamps BTW.

Finally, it seems odd to me that the issue happens only when the
machines are distant... HBASE-2256 happens when the client is as close
as possible to the server. You might want to take a look at the
system's dates and see if there's a skew.

J-D

On Thu, Apr 21, 2011 at 5:48 AM, George P. Stathis <gs...@traackr.com> wrote:
> Thanks J-D. Here is an updated one: http://pastebin.com/MZDgVBam
>
> I posted this test case as a sample of the type of operations we are doing;
> it's not the actual code itself though. In our actual code, the htable pool
> and config are all spring managed singleton instances available across the
> entire app, so we don't keep creating them and dropping them. I fixed the
> unit test to take your pointers into consideration. It allows to
> drop hbase.zookeeper.property.maxClientCnxns back to the default 30, so
> thanks for that.
>
> But this example was simply meant to illustrate what we are trying to do
> with hbase; basically, create a secondary index row for a given record.
>
> The actual symptoms that we are experiencing are not maxClientCnxns issues.
> We are seeing data not being persisted when we think they are or not being
> entirely deleted when we think they are; this mostly happens when we
> introduce a network in between the client and the hbase server (although
> it's been seen to happen much less frequently when the client and server are
> on the same box).
>
> As an example, we see things like this (pseudo-code):
>
> // Insert data
> Put p = new Put("some_row_id");
> p.add("familiyA","qualifierA","valueA");
> p.add("familiyA","qualifierB","valueB");
> p.add("familiyA","qualifierC","valueC");
> table.put(p);
>
> // Validate row presence
> Result row = table.get(new Get("some_row_id"));
> System.out.println(row.toString());
> => keyvalues={some_row_id/
> familiyA:qualifierA/1303389288609/Put/vlen=13,
> some_row_id/familiyA:qualifierB/1303389288610/Put/vlen=13,
> some_row_id/familiyA:qualifierC/1303389289262/Put/vlen=13}
>
> // Delete row
> table.delete(new Delete("some_row_id"));
>
> // Validate row deletion
> Result deletedRow = table.get(new Get("some_row_id"));
> System.out.println(row.toString());
> => keyvalues={some_row_id/familiyA:qualifierC/1303389289262/Put/vlen=13} ///
> orphaned cell !!!
>
> I was seeing this case happen last night for hours on end with the same test
> data. I began suspecting timestamp issues as possible culprits. I went to
> bed and left the test environment alone overnight (no processes running on
> it). This morning, I re-ran the same test case: the orphaned cell phenomenon
> is no longer happening. So it's very hit or miss, but the example I gave
> above was definitely reproducible at will for a few hours.
>
> Are there any known cases where a deliberate delete on an entire row will
> still leave data behind? Could we be messing our timestamps in such a way
> that we could be causing this?
>
> -GS
>
>
>
> On Wed, Apr 20, 2011 at 6:58 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:
>
>> Regarding the test:
>>
>>  - Try to only keep one HBaseAdmin, one HTablePool and always reuse
>> the same conf between tests, creating a new HBA or HTP creates a new
>> HBaseConfiguration thus a new connection. Use methods like
>> setUpBeforeClass. Another option is to close the connection once you
>> used those classes and the close the first one in tearDown that you
>> created in setUp. Right now I can count 25 connections being created
>> in this test (I know it stucks, it's a regression in 0.90)
>>  - The fact that you are creating new HTablePools in do* means you are
>> re-creating new HTables for almost every request you are doing and
>> that's a pretty expensive operation. Again, keeping only a single
>> instance will help a lot.
>>
>> That's the most obvious stuff I saw.
>>
>> J-D
>>
>> On Wed, Apr 20, 2011 at 12:46 PM, George P. Stathis
>> <gs...@traackr.com> wrote:
>> > On Wed, Apr 20, 2011 at 12:48 PM, Stack <st...@duboce.net> wrote:
>> >
>> >> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
>> >> <gs...@traackr.com> wrote:
>> >> > We have several unit tests that have started mysteriously failing in
>> >> random
>> >> > ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3.
>> Those
>> >> > tests used to run against 0.89 and never failed before. They also run
>> OK
>> >> on
>> >> > our local macbooks. On EC2, we are seeing lots of issues where the
>> setup
>> >> > data is not being persisted in time for the tests to assert against
>> them.
>> >> > They are also not always being torn down properly.
>> >> >
>> >>
>> >> These are your tests George dependent on HBase.  What are they asking
>> >> of HBase?  You are spinning up a cluster and then the takedown is not
>> >> working?   Want to pastebin some log?  We might see something.
>> >>
>> >
>> >
>> > It's not practical to paste all the secondary-indexing code we have in
>> > place. It's very likely that there is an issue in our code though, so I
>> > don't want to send folks down a rabbit hole. I just wanted to validate
>> that
>> > there are no new configs in 0.90 (from 0.89) that could affect read/write
>> > consistency.
>> >
>> > I created a test that simulates what most of our secondary-indexing code
>> > does:
>> >
>> > http://pastebin.com/M9qKv87u
>> >
>> > It's a simplified version and of course, this one does not fail, or
>> rather,
>> > I have not been able to make it fail in the same way. The only thing that
>> > I've hit with this test in pseudo-distributed mode
>> > is hbase.zookeeper.property.maxClientCnxns which I bumped up and was able
>> to
>> > force it past it. The issue we are seeing does not throw any errors in
>> any
>> > of the master/regionserver/zookeeper logs, so, right now, all indications
>> > are that the problem is on our side. I just need to diff deeper.
>> >
>> > BTW, we are not spinning up a temporary mini-cluster to test; instead, we
>> > have a dedicated dev pseudo-distributed machine against which our CI
>> tests
>> > run. That's the environment that is presenting issues at the moment.
>> Again,
>> > the odd part is that we have setup our local instances the same way as
>> our
>> > dev pseudo-distributed machine and the tests pass. The differences are
>> that
>> > we run on macs and the dev instance is on EC2.
>> >
>> >
>> >>
>> >> > We first started seeing issues running our hudson build on the same
>> >> machine
>> >> > as the hbase pseudo-cluster. We figured that was putting too much load
>> on
>> >> > the box, so we created a separate large instance on EC2 to host just
>> the
>> >> > 0.90 stack. This migration nearly quadrupled the number of unit tests
>> >> > failing at times. The only difference between for first and second CI
>> >> setup
>> >> > is the network in between.
>> >> >
>> >>
>> >> Yeah.  EC2.  But we should be able to manage with a flakey network
>> anyways.
>> >>
>> >
>> > Just wanted to make sure that this was indeed the case.
>> >
>> >
>> >>
>> >>
>> >> > Before we start tearing down our code line by line, I'd like to see if
>> >> there
>> >> > are latency related configuration tweaks we could try to make the
>> setup
>> >> > more resilient to network lag. Are there any hbase/zookepper settings
>> >> that
>> >> > might help? For instance, we see things such as HBASE_SLAVE_SLEEP
>> >> > in hbase-env.sh . Can that help?
>> >> >
>> >>
>> >> You've seen that hbase uses a different config. when it runs tests;
>> >> its in src/tests/resources/hbase-site.xml.
>> >>
>> >> But if stuff used to work on 0.89 w/ old config. this is probably not
>> it.
>> >>
>> >
>> > I reverted all our configs back to default but the issue remains. I'll
>> take
>> > a look at the test config and see if any of those settings may help out.
>> > From what I can gather at first glance, the test settings are more
>> > aggressive actually, so they seem even less tolerant of delays.
>> >
>> > Will keep digging and I'll post and update when we get somewhere.
>> >
>> >
>> >>
>> >> > Any suggestions are more than welcome. Also, the overview above may
>> not
>> >> be
>> >> > enough to go on, so please let me know if I could provide more
>> details.
>> >> >
>> >>
>> >> I think pastebin of a failing test, one that used pass, with
>> >> description (or code) of what is being done would be place to start;
>> >> we might recognize the diff in 0.89 to 0.90.
>> >>
>> >> St.Ack
>> >>
>> >
>>
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
Thanks J-D. Here is an updated one: http://pastebin.com/MZDgVBam

I posted this test case as a sample of the type of operations we are doing;
it's not the actual code itself though. In our actual code, the HTable pool
and config are Spring-managed singleton instances available across the
entire app, so we don't keep creating and dropping them. I fixed the
unit test to take your pointers into consideration. It allowed us to
drop hbase.zookeeper.property.maxClientCnxns back to the default of 30, so
thanks for that.

But this example was simply meant to illustrate what we are trying to do
with hbase; basically, create a secondary index row for a given record.

The actual symptoms we are experiencing are not maxClientCnxns issues.
We are seeing data not being persisted when we think it is, or not being
entirely deleted when we think it is; this mostly happens when we
introduce a network between the client and the HBase server (although
it has been seen to happen, much less frequently, when the client and
server are on the same box).

As an example, we see things like this (pseudo-code):

// Insert data
Put p = new Put("some_row_id");
p.add("familiyA","qualifierA","valueA");
p.add("familiyA","qualifierB","valueB");
p.add("familiyA","qualifierC","valueC");
table.put(p);

// Validate row presence
Result row = table.get(new Get("some_row_id"));
System.out.println(row.toString());
=> keyvalues={some_row_id/
familiyA:qualifierA/1303389288609/Put/vlen=13,
some_row_id/familiyA:qualifierB/1303389288610/Put/vlen=13,
some_row_id/familiyA:qualifierC/1303389289262/Put/vlen=13}

// Delete row
table.delete(new Delete("some_row_id"));

// Validate row deletion
Result deletedRow = table.get(new Get("some_row_id"));
System.out.println(deletedRow.toString());
=> keyvalues={some_row_id/familiyA:qualifierC/1303389289262/Put/vlen=13} ///
orphaned cell !!!
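
For completeness, the same sequence written against the real 0.90 client API
looks roughly like this; a sketch only, with the table name a placeholder and
the family/qualifier names taken from the pseudo-code above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class OrphanedCellRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "some_table");  // placeholder table name
    byte[] row = Bytes.toBytes("some_row_id");
    byte[] fam = Bytes.toBytes("familiyA");

    // Insert data
    Put p = new Put(row);
    p.add(fam, Bytes.toBytes("qualifierA"), Bytes.toBytes("valueA"));
    p.add(fam, Bytes.toBytes("qualifierB"), Bytes.toBytes("valueB"));
    p.add(fam, Bytes.toBytes("qualifierC"), Bytes.toBytes("valueC"));
    table.put(p);

    // Validate row presence: all three cells should come back
    Result before = table.get(new Get(row));
    System.out.println(before);

    // Delete the whole row
    table.delete(new Delete(row));

    // Validate row deletion: an empty Result is the expected outcome
    Result after = table.get(new Get(row));
    System.out.println(after);

    table.close();
  }
}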

I was seeing this case happen last night for hours on end with the same test
data. I began suspecting timestamp issues as possible culprits. I went to
bed and left the test environment alone overnight (no processes running on
it). This morning, I re-ran the same test case: the orphaned cell phenomenon
is no longer happening. So it's very hit or miss, but the example I gave
above was definitely reproducible at will for a few hours.

Are there any known cases where a deliberate delete of an entire row will
still leave data behind? Could we be messing up our timestamps in such a way
that we are causing this?

-GS



On Wed, Apr 20, 2011 at 6:58 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Regarding the test:
>
>  - Try to only keep one HBaseAdmin, one HTablePool and always reuse
> the same conf between tests, creating a new HBA or HTP creates a new
> HBaseConfiguration thus a new connection. Use methods like
> setUpBeforeClass. Another option is to close the connection once you
> used those classes and the close the first one in tearDown that you
> created in setUp. Right now I can count 25 connections being created
> in this test (I know it stucks, it's a regression in 0.90)
>  - The fact that you are creating new HTablePools in do* means you are
> re-creating new HTables for almost every request you are doing and
> that's a pretty expensive operation. Again, keeping only a single
> instance will help a lot.
>
> That's the most obvious stuff I saw.
>
> J-D
>
> On Wed, Apr 20, 2011 at 12:46 PM, George P. Stathis
> <gs...@traackr.com> wrote:
> > On Wed, Apr 20, 2011 at 12:48 PM, Stack <st...@duboce.net> wrote:
> >
> >> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
> >> <gs...@traackr.com> wrote:
> >> > We have several unit tests that have started mysteriously failing in
> >> random
> >> > ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3.
> Those
> >> > tests used to run against 0.89 and never failed before. They also run
> OK
> >> on
> >> > our local macbooks. On EC2, we are seeing lots of issues where the
> setup
> >> > data is not being persisted in time for the tests to assert against
> them.
> >> > They are also not always being torn down properly.
> >> >
> >>
> >> These are your tests George dependent on HBase.  What are they asking
> >> of HBase?  You are spinning up a cluster and then the takedown is not
> >> working?   Want to pastebin some log?  We might see something.
> >>
> >
> >
> > It's not practical to paste all the secondary-indexing code we have in
> > place. It's very likely that there is an issue in our code though, so I
> > don't want to send folks down a rabbit hole. I just wanted to validate
> that
> > there are no new configs in 0.90 (from 0.89) that could affect read/write
> > consistency.
> >
> > I created a test that simulates what most of our secondary-indexing code
> > does:
> >
> > http://pastebin.com/M9qKv87u
> >
> > It's a simplified version and of course, this one does not fail, or
> rather,
> > I have not been able to make it fail in the same way. The only thing that
> > I've hit with this test in pseudo-distributed mode
> > is hbase.zookeeper.property.maxClientCnxns which I bumped up and was able
> to
> > force it past it. The issue we are seeing does not throw any errors in
> any
> > of the master/regionserver/zookeeper logs, so, right now, all indications
> > are that the problem is on our side. I just need to diff deeper.
> >
> > BTW, we are not spinning up a temporary mini-cluster to test; instead, we
> > have a dedicated dev pseudo-distributed machine against which our CI
> tests
> > run. That's the environment that is presenting issues at the moment.
> Again,
> > the odd part is that we have setup our local instances the same way as
> our
> > dev pseudo-distributed machine and the tests pass. The differences are
> that
> > we run on macs and the dev instance is on EC2.
> >
> >
> >>
> >> > We first started seeing issues running our hudson build on the same
> >> machine
> >> > as the hbase pseudo-cluster. We figured that was putting too much load
> on
> >> > the box, so we created a separate large instance on EC2 to host just
> the
> >> > 0.90 stack. This migration nearly quadrupled the number of unit tests
> >> > failing at times. The only difference between for first and second CI
> >> setup
> >> > is the network in between.
> >> >
> >>
> >> Yeah.  EC2.  But we should be able to manage with a flakey network
> anyways.
> >>
> >
> > Just wanted to make sure that this was indeed the case.
> >
> >
> >>
> >>
> >> > Before we start tearing down our code line by line, I'd like to see if
> >> there
> >> > are latency related configuration tweaks we could try to make the
> setup
> >> > more resilient to network lag. Are there any hbase/zookepper settings
> >> that
> >> > might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> >> > in hbase-env.sh . Can that help?
> >> >
> >>
> >> You've seen that hbase uses a different config. when it runs tests;
> >> its in src/tests/resources/hbase-site.xml.
> >>
> >> But if stuff used to work on 0.89 w/ old config. this is probably not
> it.
> >>
> >
> > I reverted all our configs back to default but the issue remains. I'll
> take
> > a look at the test config and see if any of those settings may help out.
> > From what I can gather at first glance, the test settings are more
> > aggressive actually, so they seem even less tolerant of delays.
> >
> > Will keep digging and I'll post and update when we get somewhere.
> >
> >
> >>
> >> > Any suggestions are more than welcome. Also, the overview above may
> not
> >> be
> >> > enough to go on, so please let me know if I could provide more
> details.
> >> >
> >>
> >> I think pastebin of a failing test, one that used pass, with
> >> description (or code) of what is being done would be place to start;
> >> we might recognize the diff in 0.89 to 0.90.
> >>
> >> St.Ack
> >>
> >
>

Re: Latency related configs for 0.90

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Regarding the test:

 - Try to keep only one HBaseAdmin and one HTablePool, and always reuse
the same conf between tests; creating a new HBA or HTP creates a new
HBaseConfiguration and thus a new connection. Use methods like
setUpBeforeClass (see the sketch below). Another option is to close the
connection once you are done with those classes, and then, in tearDown,
close the one you created in setUp. Right now I count 25 connections
being created in this test (I know it sucks, it's a regression in 0.90).
 - The fact that you are creating new HTablePools in do* means you are
re-creating HTables for almost every request you make, and that's a
pretty expensive operation. Again, keeping only a single instance will
help a lot.

That's the most obvious stuff I saw.
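
A minimal sketch of that setup, assuming JUnit 4 and the 0.90 client API
(the table, family and column names are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HBaseAdmin;
import org.apache.hadoop.hbase.client.HConnectionManager;
import org.apache.hadoop.hbase.client.HTableInterface;
import org.apache.hadoop.hbase.client.HTablePool;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

public class SecondaryIndexTest {
  // One configuration, one admin and one pool shared by every test method,
  // so a single client connection is created for the whole class.
  private static Configuration conf;
  private static HBaseAdmin admin;
  private static HTablePool pool;

  @BeforeClass
  public static void setUpBeforeClass() throws Exception {
    conf = HBaseConfiguration.create();
    admin = new HBaseAdmin(conf);   // reuse for any table create/disable/drop
    pool = new HTablePool(conf, 10);
  }

  @AfterClass
  public static void tearDownAfterClass() throws Exception {
    // Release the connection that was created from this conf.
    HConnectionManager.deleteConnection(conf, true);
  }

  @Test
  public void putAndIndex() throws Exception {
    HTableInterface table = pool.getTable("some_table");  // placeholder table
    try {
      Put p = new Put(Bytes.toBytes("some_row_id"));
      p.add(Bytes.toBytes("familyA"), Bytes.toBytes("qualifierA"),
            Bytes.toBytes("valueA"));
      table.put(p);
    } finally {
      pool.putTable(table);  // return the table to the pool, don't drop it
    }
  }
}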

J-D

On Wed, Apr 20, 2011 at 12:46 PM, George P. Stathis
<gs...@traackr.com> wrote:
> On Wed, Apr 20, 2011 at 12:48 PM, Stack <st...@duboce.net> wrote:
>
>> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
>> <gs...@traackr.com> wrote:
>> > We have several unit tests that have started mysteriously failing in
>> random
>> > ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
>> > tests used to run against 0.89 and never failed before. They also run OK
>> on
>> > our local macbooks. On EC2, we are seeing lots of issues where the setup
>> > data is not being persisted in time for the tests to assert against them.
>> > They are also not always being torn down properly.
>> >
>>
>> These are your tests George dependent on HBase.  What are they asking
>> of HBase?  You are spinning up a cluster and then the takedown is not
>> working?   Want to pastebin some log?  We might see something.
>>
>
>
> It's not practical to paste all the secondary-indexing code we have in
> place. It's very likely that there is an issue in our code though, so I
> don't want to send folks down a rabbit hole. I just wanted to validate that
> there are no new configs in 0.90 (from 0.89) that could affect read/write
> consistency.
>
> I created a test that simulates what most of our secondary-indexing code
> does:
>
> http://pastebin.com/M9qKv87u
>
> It's a simplified version and of course, this one does not fail, or rather,
> I have not been able to make it fail in the same way. The only thing that
> I've hit with this test in pseudo-distributed mode
> is hbase.zookeeper.property.maxClientCnxns which I bumped up and was able to
> force it past it. The issue we are seeing does not throw any errors in any
> of the master/regionserver/zookeeper logs, so, right now, all indications
> are that the problem is on our side. I just need to diff deeper.
>
> BTW, we are not spinning up a temporary mini-cluster to test; instead, we
> have a dedicated dev pseudo-distributed machine against which our CI tests
> run. That's the environment that is presenting issues at the moment. Again,
> the odd part is that we have setup our local instances the same way as our
> dev pseudo-distributed machine and the tests pass. The differences are that
> we run on macs and the dev instance is on EC2.
>
>
>>
>> > We first started seeing issues running our hudson build on the same
>> machine
>> > as the hbase pseudo-cluster. We figured that was putting too much load on
>> > the box, so we created a separate large instance on EC2 to host just the
>> > 0.90 stack. This migration nearly quadrupled the number of unit tests
>> > failing at times. The only difference between for first and second CI
>> setup
>> > is the network in between.
>> >
>>
>> Yeah.  EC2.  But we should be able to manage with a flakey network anyways.
>>
>
> Just wanted to make sure that this was indeed the case.
>
>
>>
>>
>> > Before we start tearing down our code line by line, I'd like to see if
>> there
>> > are latency related configuration tweaks we could try to make the setup
>> > more resilient to network lag. Are there any hbase/zookepper settings
>> that
>> > might help? For instance, we see things such as HBASE_SLAVE_SLEEP
>> > in hbase-env.sh . Can that help?
>> >
>>
>> You've seen that hbase uses a different config. when it runs tests;
>> its in src/tests/resources/hbase-site.xml.
>>
>> But if stuff used to work on 0.89 w/ old config. this is probably not it.
>>
>
> I reverted all our configs back to default but the issue remains. I'll take
> a look at the test config and see if any of those settings may help out.
> From what I can gather at first glance, the test settings are more
> aggressive actually, so they seem even less tolerant of delays.
>
> Will keep digging and I'll post and update when we get somewhere.
>
>
>>
>> > Any suggestions are more than welcome. Also, the overview above may not
>> be
>> > enough to go on, so please let me know if I could provide more details.
>> >
>>
>> I think pastebin of a failing test, one that used pass, with
>> description (or code) of what is being done would be place to start;
>> we might recognize the diff in 0.89 to 0.90.
>>
>> St.Ack
>>
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
On Wed, Apr 20, 2011 at 12:48 PM, Stack <st...@duboce.net> wrote:

> On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
> <gs...@traackr.com> wrote:
> > We have several unit tests that have started mysteriously failing in
> random
> > ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
> > tests used to run against 0.89 and never failed before. They also run OK
> on
> > our local macbooks. On EC2, we are seeing lots of issues where the setup
> > data is not being persisted in time for the tests to assert against them.
> > They are also not always being torn down properly.
> >
>
> These are your tests George dependent on HBase.  What are they asking
> of HBase?  You are spinning up a cluster and then the takedown is not
> working?   Want to pastebin some log?  We might see something.
>


It's not practical to paste all the secondary-indexing code we have in
place. It's very likely that there is an issue in our code though, so I
don't want to send folks down a rabbit hole. I just wanted to validate that
there are no new configs in 0.90 (from 0.89) that could affect read/write
consistency.

I created a test that simulates what most of our secondary-indexing code
does:

http://pastebin.com/M9qKv87u

It's a simplified version and, of course, this one does not fail, or rather,
I have not been able to make it fail in the same way. The only thing
I've hit with this test in pseudo-distributed mode
is hbase.zookeeper.property.maxClientCnxns, which I bumped up to get
past it. The issue we are seeing does not throw any errors in any
of the master/regionserver/zookeeper logs, so, right now, all indications
are that the problem is on our side. I just need to dig deeper.

BTW, we are not spinning up a temporary mini-cluster to test; instead, we
have a dedicated dev pseudo-distributed machine against which our CI tests
run. That's the environment that is presenting issues at the moment. Again,
the odd part is that we have set up our local instances the same way as our
dev pseudo-distributed machine and the tests pass. The difference is that
we run on Macs and the dev instance is on EC2.


>
> > We first started seeing issues running our hudson build on the same
> machine
> > as the hbase pseudo-cluster. We figured that was putting too much load on
> > the box, so we created a separate large instance on EC2 to host just the
> > 0.90 stack. This migration nearly quadrupled the number of unit tests
> > failing at times. The only difference between for first and second CI
> setup
> > is the network in between.
> >
>
> Yeah.  EC2.  But we should be able to manage with a flakey network anyways.
>

Just wanted to make sure that this was indeed the case.


>
>
> > Before we start tearing down our code line by line, I'd like to see if
> there
> > are latency related configuration tweaks we could try to make the setup
> > more resilient to network lag. Are there any hbase/zookepper settings
> that
> > might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> > in hbase-env.sh . Can that help?
> >
>
> You've seen that hbase uses a different config. when it runs tests;
> its in src/tests/resources/hbase-site.xml.
>
> But if stuff used to work on 0.89 w/ old config. this is probably not it.
>

I reverted all our configs back to default but the issue remains. I'll take
a look at the test config and see if any of those settings may help out.
From what I can gather at first glance, the test settings are actually more
aggressive, so they seem even less tolerant of delays.

Will keep digging, and I'll post an update when we get somewhere.


>
> > Any suggestions are more than welcome. Also, the overview above may not
> be
> > enough to go on, so please let me know if I could provide more details.
> >
>
> I think pastebin of a failing test, one that used pass, with
> description (or code) of what is being done would be place to start;
> we might recognize the diff in 0.89 to 0.90.
>
> St.Ack
>

Re: Latency related configs for 0.90

Posted by Stack <st...@duboce.net>.
On Tue, Apr 19, 2011 at 12:08 PM, George P. Stathis
<gs...@traackr.com> wrote:
> We have several unit tests that have started mysteriously failing in random
> ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
> tests used to run against 0.89 and never failed before. They also run OK on
> our local macbooks. On EC2, we are seeing lots of issues where the setup
> data is not being persisted in time for the tests to assert against them.
> They are also not always being torn down properly.
>

These are your tests, George, that depend on HBase. What are they asking
of HBase? You are spinning up a cluster and then the teardown is not
working? Want to pastebin some logs? We might see something.

> We first started seeing issues running our hudson build on the same machine
> as the hbase pseudo-cluster. We figured that was putting too much load on
> the box, so we created a separate large instance on EC2 to host just the
> 0.90 stack. This migration nearly quadrupled the number of unit tests
> failing at times. The only difference between for first and second CI setup
> is the network in between.
>

Yeah.  EC2.  But we should be able to manage with a flakey network anyways.


> Before we start tearing down our code line by line, I'd like to see if there
> are latency related configuration tweaks we could try to make the setup
> more resilient to network lag. Are there any hbase/zookepper settings that
> might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> in hbase-env.sh . Can that help?
>

You've seen that HBase uses a different config when it runs tests;
it's in src/tests/resources/hbase-site.xml.

But if stuff used to work on 0.89 with the old config, this is probably not it.

> Any suggestions are more than welcome. Also, the overview above may not be
> enough to go on, so please let me know if I could provide more details.
>

I think a pastebin of a failing test, one that used to pass, with a
description (or code) of what is being done, would be the place to start;
we might recognize the difference between 0.89 and 0.90.

St.Ack

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
On Wed, Apr 20, 2011 at 12:27 PM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Hey George,
>
> Sorry for the late answer, there's nothing that comes to mind when
> reading your email.
>
> HBASE_SLAVE_SLEEP is only used by the bash scripts, like when you do
> hbase-daemons.sh it will wait that sleep time between each machine.
>
> Would you be able to come up with a test that shows the issues you are
> seeing? Like stripping out everything that's related to your stuff and
> leave only the parts that play with hbase?
>

:-) Right in the middle of it. Need to rule pure HBase out first and make
sure it's just on our side of the code (which is most likely).
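
For reference, the stripped-down version I'm putting together looks roughly
like this (a hypothetical sketch with made-up table and column family names,
using only the plain HBase client and none of our secondary-index code):

import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;
import org.junit.Test;

public class PlainHBaseCrudTest {

  @Test
  public void putGetDelete() throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "test_table"); // made-up table, assumed to already exist

    byte[] row = Bytes.toBytes("row1");
    byte[] fam = Bytes.toBytes("f");               // made-up column family
    byte[] qual = Bytes.toBytes("q");

    // Write with auto-flush left at its default (on), so the put goes straight to the region server.
    Put put = new Put(row);
    put.add(fam, qual, Bytes.toBytes("value"));
    table.put(put);

    // Read the value straight back.
    Result result = table.get(new Get(row));
    assertEquals("value", Bytes.toString(result.getValue(fam, qual)));

    // Delete the row and verify it is gone.
    table.delete(new Delete(row));
    assertTrue(table.get(new Get(row)).isEmpty());

    table.close();
  }
}

If even that flaps on the EC2 box, at least it points the finger away from our
own code.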


>
> Have you inspected the logs of those test for anything weird looking
> exceptions? Maybe the logs are screaming about something that needs to
> be taken care of? (just guesses)
>

No weird exceptions other than the ones raised because the data the tests
expect is missing. We've also tailed the hbase logs while running the tests
and hbase seems to be happy.


>
> Our own experience migrating to 0.90 has been pretty good, we found a
> couple of issues with the new master but not one performance-related
> issue. We ran 0.90.1 for some weeks and now we are on 0.90.2
>

Good to know. Just need to make sure it's on our side then.


>
> J-D
>
> On Wed, Apr 20, 2011 at 6:15 AM, George P. Stathis <gs...@traackr.com>
> wrote:
> > Sorry to bump this, but we could really use a hand here. Right now, we
> have
> > a very hard time seeing repeatable read/write consistency. Any
> suggestions
> > are welcome.
> >
> > -GS
> >
> > On Tue, Apr 19, 2011 at 3:08 PM, George P. Stathis <gstathis@traackr.com
> >wrote:
> >
> >> Hi all,
> >>
> >> In this chapter of our 0.89 to 0.90 migration saga, we are seeing what
> we
> >> suspect might be latency related artifacts.
> >>
> >> The setting:
> >>
> >>    - Our EC2 dev environment running our CI builds
> >>    - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode
> >>
> >> We have several unit tests that have started mysteriously failing in
> random
> >> ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
> >> tests used to run against 0.89 and never failed before. They also run OK
> on
> >> our local macbooks. On EC2, we are seeing lots of issues where the setup
> >> data is not being persisted in time for the tests to assert against
> them.
> >> They are also not always being torn down properly.
> >>
> >> We first suspected our new code around secondary indexes; we do have
> >> extensive unit tests around it that provide us with a solid level of
> >> confidence that it works properly in our CRUD scenarios. We also
> performance
> >> tested against the old hbase-trx contrib code and our new secondary
> indexes
> >> seem to be running slightly faster as well (of course, that could be due
> to
> >> the bump from 0.89 to 0.90).
> >>
> >> We first started seeing issues running our hudson build on the same
> machine
> >> as the hbase pseudo-cluster. We figured that was putting too much load
> on
> >> the box, so we created a separate large instance on EC2 to host just the
> >> 0.90 stack. This migration nearly quadrupled the number of unit tests
> >> failing at times. The only difference between for first and second CI
> setup
> >> is the network in between.
> >>
> >> Before we start tearing down our code line by line, I'd like to see if
> >> there are latency related configuration tweaks we could try to make the
> >> setup more resilient to network lag. Are there any hbase/zookepper
> settings
> >> that might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> >> in hbase-env.sh . Can that help?
> >>
> >> Any suggestions are more than welcome. Also, the overview above may not
> be
> >> enough to go on, so please let me know if I could provide more details.
> >>
> >> Thank you in advance for any help.
> >>
> >> -GS
> >>
> >>
> >>
> >
>

Re: Latency related configs for 0.90

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
So far just the latency problems, sorry. It may still not be hbase related,
although that's not very likely.

On Wed, Apr 20, 2011 at 12:54 PM, George P. Stathis
<gs...@traackr.com> wrote:
> Dmitriy, what are you seeing on your side? Missing inserts? Deletes that are
> never applied? Both?
>
> On Wed, Apr 20, 2011 at 3:04 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:
>
>> exactly my symptoms with 0.90.1. What gives.
>>
>> On Wed, Apr 20, 2011 at 9:54 AM, George P. Stathis <gs...@traackr.com>
>> wrote:
>> > Ted, what makes you say that? Have you seen similar issues in
>> > pseudo-clustered mode? We have been running in that mode on our dev
>> > environment for a year now, we haven't had any issues like this before.
>> At
>> > any rate, I'll set it to standalone just in case to see if it makes a
>> > difference.
>>
>

Re: Latency related configs for 0.90

Posted by Gary Helmling <gh...@gmail.com>.
Hmm, by any chance is either of you disabling auto-flush on table
instances? i.e.,

HTable.setAutoFlush(false)

I don't see it in the example code you posted, but just wondering if there's
any way this could be a case of:
https://issues.apache.org/jira/browse/HBASE-3750

This fix came after 0.90.2, so it has not yet been included in a release.
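
To spell out the pattern I mean, here is a minimal hypothetical snippet
(made-up table/family names): with auto-flush off, puts sit in the client-side
write buffer and only reach the region server on flushCommits() or close().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class AutoFlushExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "some_table"); // made-up table name

    table.setAutoFlush(false);                     // puts are now buffered client-side

    Put put = new Put(Bytes.toBytes("row1"));
    put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v"));
    table.put(put);                                // may not have reached the region server yet

    table.flushCommits();                          // explicit flush pushes the buffered puts out
    table.close();                                 // close() flushes too
  }
}

If a test's setup relied on buffered puts like these being visible before an
explicit flush, that would look exactly like the missing setup data you are
describing.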

--gh


On Wed, Apr 20, 2011 at 12:54 PM, George P. Stathis <gs...@traackr.com>wrote:

> Dmitriy, what are you seeing on your side? Missing inserts? Deletes that
> are
> never applied? Both?
>
> On Wed, Apr 20, 2011 at 3:04 PM, Dmitriy Lyubimov <dl...@gmail.com>
> wrote:
>
> > exactly my symptoms with 0.90.1. What gives.
> >
> > On Wed, Apr 20, 2011 at 9:54 AM, George P. Stathis <gstathis@traackr.com
> >
> > wrote:
> > > Ted, what makes you say that? Have you seen similar issues in
> > > pseudo-clustered mode? We have been running in that mode on our dev
> > > environment for a year now, we haven't had any issues like this before.
> > At
> > > any rate, I'll set it to standalone just in case to see if it makes a
> > > difference.
> >
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
Dmitriy, what are you seeing on your side? Missing inserts? Deletes that are
never applied? Both?

On Wed, Apr 20, 2011 at 3:04 PM, Dmitriy Lyubimov <dl...@gmail.com> wrote:

> exactly my symptoms with 0.90.1. What gives.
>
> On Wed, Apr 20, 2011 at 9:54 AM, George P. Stathis <gs...@traackr.com>
> wrote:
> > Ted, what makes you say that? Have you seen similar issues in
> > pseudo-clustered mode? We have been running in that mode on our dev
> > environment for a year now, we haven't had any issues like this before.
> At
> > any rate, I'll set it to standalone just in case to see if it makes a
> > difference.
>

Re: Latency related configs for 0.90

Posted by Dmitriy Lyubimov <dl...@gmail.com>.
exactly my symptoms with 0.90.1. What gives.

On Wed, Apr 20, 2011 at 9:54 AM, George P. Stathis <gs...@traackr.com> wrote:
> Ted, what makes you say that? Have you seen similar issues in
> pseudo-clustered mode? We have been running in that mode on our dev
> environment for a year now, we haven't had any issues like this before. At
> any rate, I'll set it to standalone just in case to see if it makes a
> difference.

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
Ted, what makes you say that? Have you seen similar issues in
pseudo-clustered mode? We have been running in that mode on our dev
environment for a year now, we haven't had any issues like this before. At
any rate, I'll set it to standalone just in case to see if it makes a
difference.

On Wed, Apr 20, 2011 at 12:32 PM, Ted Yu <yu...@gmail.com> wrote:

> I guess George's case has something to do with pseudo-clustered mode.
>
> On Wed, Apr 20, 2011 at 9:27 AM, Jean-Daniel Cryans <jdcryans@apache.org
> >wrote:
>
> > Hey George,
> >
> > Sorry for the late answer, there's nothing that comes to mind when
> > reading your email.
> >
> > HBASE_SLAVE_SLEEP is only used by the bash scripts, like when you do
> > hbase-daemons.sh it will wait that sleep time between each machine.
> >
> > Would you be able to come up with a test that shows the issues you are
> > seeing? Like stripping out everything that's related to your stuff and
> > leave only the parts that play with hbase?
> >
> > Have you inspected the logs of those test for anything weird looking
> > exceptions? Maybe the logs are screaming about something that needs to
> > be taken care of? (just guesses)
> >
> > Our own experience migrating to 0.90 has been pretty good, we found a
> > couple of issues with the new master but not one performance-related
> > issue. We ran 0.90.1 for some weeks and now we are on 0.90.2
> >
> > J-D
> >
> > On Wed, Apr 20, 2011 at 6:15 AM, George P. Stathis <gstathis@traackr.com
> >
> > wrote:
> > > Sorry to bump this, but we could really use a hand here. Right now, we
> > have
> > > a very hard time seeing repeatable read/write consistency. Any
> > suggestions
> > > are welcome.
> > >
> > > -GS
> > >
> > > On Tue, Apr 19, 2011 at 3:08 PM, George P. Stathis <
> gstathis@traackr.com
> > >wrote:
> > >
> > >> Hi all,
> > >>
> > >> In this chapter of our 0.89 to 0.90 migration saga, we are seeing what
> > we
> > >> suspect might be latency related artifacts.
> > >>
> > >> The setting:
> > >>
> > >>    - Our EC2 dev environment running our CI builds
> > >>    - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode
> > >>
> > >> We have several unit tests that have started mysteriously failing in
> > random
> > >> ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3.
> Those
> > >> tests used to run against 0.89 and never failed before. They also run
> OK
> > on
> > >> our local macbooks. On EC2, we are seeing lots of issues where the
> setup
> > >> data is not being persisted in time for the tests to assert against
> > them.
> > >> They are also not always being torn down properly.
> > >>
> > >> We first suspected our new code around secondary indexes; we do have
> > >> extensive unit tests around it that provide us with a solid level of
> > >> confidence that it works properly in our CRUD scenarios. We also
> > performance
> > >> tested against the old hbase-trx contrib code and our new secondary
> > indexes
> > >> seem to be running slightly faster as well (of course, that could be
> due
> > to
> > >> the bump from 0.89 to 0.90).
> > >>
> > >> We first started seeing issues running our hudson build on the same
> > machine
> > >> as the hbase pseudo-cluster. We figured that was putting too much load
> > on
> > >> the box, so we created a separate large instance on EC2 to host just
> the
> > >> 0.90 stack. This migration nearly quadrupled the number of unit tests
> > >> failing at times. The only difference between for first and second CI
> > setup
> > >> is the network in between.
> > >>
> > >> Before we start tearing down our code line by line, I'd like to see if
> > >> there are latency related configuration tweaks we could try to make
> the
> > >> setup more resilient to network lag. Are there any hbase/zookepper
> > settings
> > >> that might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> > >> in hbase-env.sh . Can that help?
> > >>
> > >> Any suggestions are more than welcome. Also, the overview above may
> not
> > be
> > >> enough to go on, so please let me know if I could provide more
> details.
> > >>
> > >> Thank you in advance for any help.
> > >>
> > >> -GS
> > >>
> > >>
> > >>
> > >
> >
>

Re: Latency related configs for 0.90

Posted by Ted Yu <yu...@gmail.com>.
I guess George's case has something to do with pseudo-clustered mode.

On Wed, Apr 20, 2011 at 9:27 AM, Jean-Daniel Cryans <jd...@apache.org>wrote:

> Hey George,
>
> Sorry for the late answer, there's nothing that comes to mind when
> reading your email.
>
> HBASE_SLAVE_SLEEP is only used by the bash scripts, like when you do
> hbase-daemons.sh it will wait that sleep time between each machine.
>
> Would you be able to come up with a test that shows the issues you are
> seeing? Like stripping out everything that's related to your stuff and
> leave only the parts that play with hbase?
>
> Have you inspected the logs of those test for anything weird looking
> exceptions? Maybe the logs are screaming about something that needs to
> be taken care of? (just guesses)
>
> Our own experience migrating to 0.90 has been pretty good, we found a
> couple of issues with the new master but not one performance-related
> issue. We ran 0.90.1 for some weeks and now we are on 0.90.2
>
> J-D
>
> On Wed, Apr 20, 2011 at 6:15 AM, George P. Stathis <gs...@traackr.com>
> wrote:
> > Sorry to bump this, but we could really use a hand here. Right now, we
> have
> > a very hard time seeing repeatable read/write consistency. Any
> suggestions
> > are welcome.
> >
> > -GS
> >
> > On Tue, Apr 19, 2011 at 3:08 PM, George P. Stathis <gstathis@traackr.com
> >wrote:
> >
> >> Hi all,
> >>
> >> In this chapter of our 0.89 to 0.90 migration saga, we are seeing what
> we
> >> suspect might be latency related artifacts.
> >>
> >> The setting:
> >>
> >>    - Our EC2 dev environment running our CI builds
> >>    - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode
> >>
> >> We have several unit tests that have started mysteriously failing in
> random
> >> ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
> >> tests used to run against 0.89 and never failed before. They also run OK
> on
> >> our local macbooks. On EC2, we are seeing lots of issues where the setup
> >> data is not being persisted in time for the tests to assert against
> them.
> >> They are also not always being torn down properly.
> >>
> >> We first suspected our new code around secondary indexes; we do have
> >> extensive unit tests around it that provide us with a solid level of
> >> confidence that it works properly in our CRUD scenarios. We also
> performance
> >> tested against the old hbase-trx contrib code and our new secondary
> indexes
> >> seem to be running slightly faster as well (of course, that could be due
> to
> >> the bump from 0.89 to 0.90).
> >>
> >> We first started seeing issues running our hudson build on the same
> machine
> >> as the hbase pseudo-cluster. We figured that was putting too much load
> on
> >> the box, so we created a separate large instance on EC2 to host just the
> >> 0.90 stack. This migration nearly quadrupled the number of unit tests
> >> failing at times. The only difference between for first and second CI
> setup
> >> is the network in between.
> >>
> >> Before we start tearing down our code line by line, I'd like to see if
> >> there are latency related configuration tweaks we could try to make the
> >> setup more resilient to network lag. Are there any hbase/zookepper
> settings
> >> that might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> >> in hbase-env.sh . Can that help?
> >>
> >> Any suggestions are more than welcome. Also, the overview above may not
> be
> >> enough to go on, so please let me know if I could provide more details.
> >>
> >> Thank you in advance for any help.
> >>
> >> -GS
> >>
> >>
> >>
> >
>

Re: Latency related configs for 0.90

Posted by Jean-Daniel Cryans <jd...@apache.org>.
Hey George,

Sorry for the late answer, there's nothing that comes to mind when
reading your email.

HBASE_SLAVE_SLEEP is only used by the bash scripts; for example, when you run
hbase-daemons.sh it will wait that sleep time between each machine.

Would you be able to come up with a test that shows the issues you are
seeing? Like stripping out everything that's related to your stuff and
leaving only the parts that play with hbase?

Have you inspected the logs of those tests for any weird-looking
exceptions? Maybe the logs are screaming about something that needs to
be taken care of? (just guesses)

Our own experience migrating to 0.90 has been pretty good; we found a
couple of issues with the new master but not one performance-related
issue. We ran 0.90.1 for some weeks and now we are on 0.90.2.

J-D

On Wed, Apr 20, 2011 at 6:15 AM, George P. Stathis <gs...@traackr.com> wrote:
> Sorry to bump this, but we could really use a hand here. Right now, we have
> a very hard time seeing repeatable read/write consistency. Any suggestions
> are welcome.
>
> -GS
>
> On Tue, Apr 19, 2011 at 3:08 PM, George P. Stathis <gs...@traackr.com>wrote:
>
>> Hi all,
>>
>> In this chapter of our 0.89 to 0.90 migration saga, we are seeing what we
>> suspect might be latency related artifacts.
>>
>> The setting:
>>
>>    - Our EC2 dev environment running our CI builds
>>    - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode
>>
>> We have several unit tests that have started mysteriously failing in random
>> ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
>> tests used to run against 0.89 and never failed before. They also run OK on
>> our local macbooks. On EC2, we are seeing lots of issues where the setup
>> data is not being persisted in time for the tests to assert against them.
>> They are also not always being torn down properly.
>>
>> We first suspected our new code around secondary indexes; we do have
>> extensive unit tests around it that provide us with a solid level of
>> confidence that it works properly in our CRUD scenarios. We also performance
>> tested against the old hbase-trx contrib code and our new secondary indexes
>> seem to be running slightly faster as well (of course, that could be due to
>> the bump from 0.89 to 0.90).
>>
>> We first started seeing issues running our hudson build on the same machine
>> as the hbase pseudo-cluster. We figured that was putting too much load on
>> the box, so we created a separate large instance on EC2 to host just the
>> 0.90 stack. This migration nearly quadrupled the number of unit tests
>> failing at times. The only difference between for first and second CI setup
>> is the network in between.
>>
>> Before we start tearing down our code line by line, I'd like to see if
>> there are latency related configuration tweaks we could try to make the
>> setup more resilient to network lag. Are there any hbase/zookepper settings
>> that might help? For instance, we see things such as HBASE_SLAVE_SLEEP
>> in hbase-env.sh . Can that help?
>>
>> Any suggestions are more than welcome. Also, the overview above may not be
>> enough to go on, so please let me know if I could provide more details.
>>
>> Thank you in advance for any help.
>>
>> -GS
>>
>>
>>
>

Re: Latency related configs for 0.90

Posted by "George P. Stathis" <gs...@traackr.com>.
Sorry to bump this, but we could really use a hand here. Right now, we have
a very hard time seeing repeatable read/write consistency. Any suggestions
are welcome.

-GS

On Tue, Apr 19, 2011 at 3:08 PM, George P. Stathis <gs...@traackr.com>wrote:

> Hi all,
>
> In this chapter of our 0.89 to 0.90 migration saga, we are seeing what we
> suspect might be latency related artifacts.
>
> The setting:
>
>    - Our EC2 dev environment running our CI builds
>    - CDH3 U0 (both hadoop and hbase) setup in pseudo-clustered mode
>
> We have several unit tests that have started mysteriously failing in random
> ways as soon as we migrated our EC2 CI build to the new 0.90 CDH3. Those
> tests used to run against 0.89 and never failed before. They also run OK on
> our local macbooks. On EC2, we are seeing lots of issues where the setup
> data is not being persisted in time for the tests to assert against them.
> They are also not always being torn down properly.
>
> We first suspected our new code around secondary indexes; we do have
> extensive unit tests around it that provide us with a solid level of
> confidence that it works properly in our CRUD scenarios. We also performance
> tested against the old hbase-trx contrib code and our new secondary indexes
> seem to be running slightly faster as well (of course, that could be due to
> the bump from 0.89 to 0.90).
>
> We first started seeing issues running our hudson build on the same machine
> as the hbase pseudo-cluster. We figured that was putting too much load on
> the box, so we created a separate large instance on EC2 to host just the
> 0.90 stack. This migration nearly quadrupled the number of unit tests
> failing at times. The only difference between for first and second CI setup
> is the network in between.
>
> Before we start tearing down our code line by line, I'd like to see if
> there are latency related configuration tweaks we could try to make the
> setup more resilient to network lag. Are there any hbase/zookepper settings
> that might help? For instance, we see things such as HBASE_SLAVE_SLEEP
> in hbase-env.sh . Can that help?
>
> Any suggestions are more than welcome. Also, the overview above may not be
> enough to go on, so please let me know if I could provide more details.
>
> Thank you in advance for any help.
>
> -GS
>
>
>