You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@kudu.apache.org by Todd Lipcon <to...@cloudera.com> on 2016/10/04 05:35:57 UTC

Fault tolerant scanner returning out-of-order partitions

Hey Dan (+CC dev in case anyone else knows about this too)

I'm debugging some flakiness in alter_table-randomized-test, and ti seems
like it's failing because the verification scan is returning some out of
order rows, despite using "SetFaultTolerant()". Granted, fault tolerance
isn't publicly guaranteed to return rows in order, but I was under the
impression that, with range partitioned tablets, it would always do so.

The scan result I'm seeing has the following sequence within it:

(int32 key=537424064, int32 c945=NULL, int32 c79=234639860, int32 c990=NULL)
>>>> OUT OF ORDER ROW
(int32 key=552025439, int32 c945=NULL, int32 c79=234639860, int32 c990=NULL)
>>>> BACK TO NORMAL ORDER
(int32 key=539314778, int32 c945=1708089980, int32 c79=-878787336, int32
c990=829302644)
(int32 key=541817227, int32 c945=2064952224, int32 c79=2064952224, int32
c990=NULL)
(int32 key=546056206, int32 c945=26527696, int32 c79=26527696, int32
c990=26527696)
(int32 key=601960253, int32 c945=NULL, int32 c79=1088757503, int32
c990=NULL)
(int32 key=677154987, int32 c945=823764490, int32 c79=823764490, int32
c990=823764490)

The prior alter was:
I1004 05:17:48.192611 28113 alter_table-randomized-test.cc:481] Dropping
range partition: [805306356, 872415219) resulting partitions: (134217726,
201326589], (268435452, 335544315], (335544315, 402653178], (402653178,
469762041], (536870904, 603979767], (671088630, 738197493], (738197493,
805306356], (939524082, 1006632945], (1006632945, 1073741808], (1275068397,
1342177260], (1342177260, 1409286123], (1409286123, 1476394986],
(1610612712, 1677721575], (1879048164, 1946157027], (2013265890,
2080374753], (2080374753, 2147483616)
I1004 05:17:48.193013 28113 alter_table-randomized-test.cc:406] Committing
Alterations

The whole log is available here:
https://gist.githubusercontent.com/toddlipcon/466976caf973f496885da9efc2f7246c/raw/f9baf418dad4ad07f33961b131c86e84803815a8/alter_table-randomized-test.txt

Any ideas what might be causing this out-of-order result? Is the test
making some incorrect assumptions or might we have a bug?

-Todd






-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Fault tolerant scanner returning out-of-order partitions

Posted by Todd Lipcon <to...@cloudera.com>.
For those following along, I figured this one out:
https://issues.apache.org/jira/browse/KUDU-1677

Turns out to be a long-standing bug, unrelated to alter-table or add/drop
range partitions.

On Tue, Oct 4, 2016 at 12:24 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Good point that they're in the same partition. I have the data dir from
> this failure as well, so I'll dig in and see what I can determine. Dont
> worry about spending time on it for now, since it seems like it's not
> related to add/drop. Will report back if I need any more pointers.
>
> -Todd
>
> On Mon, Oct 3, 2016 at 10:56 PM, Dan Burkert <da...@cloudera.com> wrote:
>
>> It's also strange that the first two rows have the same value for c79.
>> That is extremely unlikely.  I can dig in more tomorrow.
>>
>> - Dan
>>
>> On Mon, Oct 3, 2016 at 10:53 PM, Dan Burkert <da...@cloudera.com> wrote:
>>
>>> The first three rows (including the out of order row) all fall in the
>>> same range partition, so the issue is likely that the intra-tablet scan
>>> returned out of order results (as opposed to the client scanning tablets
>>> out of order). I'm under the same impression about SetFaultTolerant(),
>>> which is why the test explicitly sets it.  How often is this happening?
>>> Back when this test was committed a few months ago I ran it a few thousand
>>> times and never saw anything like this.
>>>
>>> On Mon, Oct 3, 2016 at 10:35 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>>
>>>> Hey Dan (+CC dev in case anyone else knows about this too)
>>>>
>>>> I'm debugging some flakiness in alter_table-randomized-test, and ti
>>>> seems like it's failing because the verification scan is returning some out
>>>> of order rows, despite using "SetFaultTolerant()". Granted, fault tolerance
>>>> isn't publicly guaranteed to return rows in order, but I was under the
>>>> impression that, with range partitioned tablets, it would always do so.
>>>>
>>>> The scan result I'm seeing has the following sequence within it:
>>>>
>>>> (int32 key=537424064, int32 c945=NULL, int32 c79=234639860, int32
>>>> c990=NULL)
>>>> >>>> OUT OF ORDER ROW
>>>> (int32 key=552025439, int32 c945=NULL, int32 c79=234639860, int32
>>>> c990=NULL)
>>>> >>>> BACK TO NORMAL ORDER
>>>> (int32 key=539314778, int32 c945=1708089980, int32 c79=-878787336,
>>>> int32 c990=829302644)
>>>> (int32 key=541817227, int32 c945=2064952224, int32 c79=2064952224,
>>>> int32 c990=NULL)
>>>> (int32 key=546056206, int32 c945=26527696, int32 c79=26527696, int32
>>>> c990=26527696)
>>>> (int32 key=601960253, int32 c945=NULL, int32 c79=1088757503, int32
>>>> c990=NULL)
>>>> (int32 key=677154987, int32 c945=823764490, int32 c79=823764490, int32
>>>> c990=823764490)
>>>>
>>>> The prior alter was:
>>>> I1004 05:17:48.192611 28113 alter_table-randomized-test.cc:481]
>>>> Dropping range partition: [805306356, 872415219) resulting partitions:
>>>> (134217726, 201326589], (268435452, 335544315], (335544315, 402653178],
>>>> (402653178, 469762041], (536870904, 603979767], (671088630, 738197493],
>>>> (738197493, 805306356], (939524082, 1006632945], (1006632945, 1073741808],
>>>> (1275068397, 1342177260], (1342177260, 1409286123], (1409286123,
>>>> 1476394986], (1610612712, 1677721575], (1879048164, 1946157027],
>>>> (2013265890, 2080374753], (2080374753, 2147483616)
>>>> I1004 05:17:48.193013 28113 alter_table-randomized-test.cc:406]
>>>> Committing Alterations
>>>>
>>>> The whole log is available here:
>>>> https://gist.githubusercontent.com/toddlipcon/466976caf973f4
>>>> 96885da9efc2f7246c/raw/f9baf418dad4ad07f33961b131c86e8480381
>>>> 5a8/alter_table-randomized-test.txt
>>>>
>>>> Any ideas what might be causing this out-of-order result? Is the test
>>>> making some incorrect assumptions or might we have a bug?
>>>>
>>>> -Todd
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Todd Lipcon
>>>> Software Engineer, Cloudera
>>>>
>>>
>>>
>>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>



-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Fault tolerant scanner returning out-of-order partitions

Posted by Todd Lipcon <to...@cloudera.com>.
Good point that they're in the same partition. I have the data dir from
this failure as well, so I'll dig in and see what I can determine. Dont
worry about spending time on it for now, since it seems like it's not
related to add/drop. Will report back if I need any more pointers.

-Todd

On Mon, Oct 3, 2016 at 10:56 PM, Dan Burkert <da...@cloudera.com> wrote:

> It's also strange that the first two rows have the same value for c79.
> That is extremely unlikely.  I can dig in more tomorrow.
>
> - Dan
>
> On Mon, Oct 3, 2016 at 10:53 PM, Dan Burkert <da...@cloudera.com> wrote:
>
>> The first three rows (including the out of order row) all fall in the
>> same range partition, so the issue is likely that the intra-tablet scan
>> returned out of order results (as opposed to the client scanning tablets
>> out of order). I'm under the same impression about SetFaultTolerant(),
>> which is why the test explicitly sets it.  How often is this happening?
>> Back when this test was committed a few months ago I ran it a few thousand
>> times and never saw anything like this.
>>
>> On Mon, Oct 3, 2016 at 10:35 PM, Todd Lipcon <to...@cloudera.com> wrote:
>>
>>> Hey Dan (+CC dev in case anyone else knows about this too)
>>>
>>> I'm debugging some flakiness in alter_table-randomized-test, and ti
>>> seems like it's failing because the verification scan is returning some out
>>> of order rows, despite using "SetFaultTolerant()". Granted, fault tolerance
>>> isn't publicly guaranteed to return rows in order, but I was under the
>>> impression that, with range partitioned tablets, it would always do so.
>>>
>>> The scan result I'm seeing has the following sequence within it:
>>>
>>> (int32 key=537424064, int32 c945=NULL, int32 c79=234639860, int32
>>> c990=NULL)
>>> >>>> OUT OF ORDER ROW
>>> (int32 key=552025439, int32 c945=NULL, int32 c79=234639860, int32
>>> c990=NULL)
>>> >>>> BACK TO NORMAL ORDER
>>> (int32 key=539314778, int32 c945=1708089980, int32 c79=-878787336, int32
>>> c990=829302644)
>>> (int32 key=541817227, int32 c945=2064952224, int32 c79=2064952224, int32
>>> c990=NULL)
>>> (int32 key=546056206, int32 c945=26527696, int32 c79=26527696, int32
>>> c990=26527696)
>>> (int32 key=601960253, int32 c945=NULL, int32 c79=1088757503, int32
>>> c990=NULL)
>>> (int32 key=677154987, int32 c945=823764490, int32 c79=823764490, int32
>>> c990=823764490)
>>>
>>> The prior alter was:
>>> I1004 05:17:48.192611 28113 alter_table-randomized-test.cc:481]
>>> Dropping range partition: [805306356, 872415219) resulting partitions:
>>> (134217726, 201326589], (268435452, 335544315], (335544315, 402653178],
>>> (402653178, 469762041], (536870904, 603979767], (671088630, 738197493],
>>> (738197493, 805306356], (939524082, 1006632945], (1006632945, 1073741808],
>>> (1275068397, 1342177260], (1342177260, 1409286123], (1409286123,
>>> 1476394986], (1610612712, 1677721575], (1879048164, 1946157027],
>>> (2013265890, 2080374753], (2080374753, 2147483616)
>>> I1004 05:17:48.193013 28113 alter_table-randomized-test.cc:406]
>>> Committing Alterations
>>>
>>> The whole log is available here:
>>> https://gist.githubusercontent.com/toddlipcon/466976caf973f4
>>> 96885da9efc2f7246c/raw/f9baf418dad4ad07f33961b131c86e8480381
>>> 5a8/alter_table-randomized-test.txt
>>>
>>> Any ideas what might be causing this out-of-order result? Is the test
>>> making some incorrect assumptions or might we have a bug?
>>>
>>> -Todd
>>>
>>>
>>>
>>>
>>>
>>>
>>> --
>>> Todd Lipcon
>>> Software Engineer, Cloudera
>>>
>>
>>
>


-- 
Todd Lipcon
Software Engineer, Cloudera

Re: Fault tolerant scanner returning out-of-order partitions

Posted by Dan Burkert <da...@cloudera.com>.
It's also strange that the first two rows have the same value for c79.
That is extremely unlikely.  I can dig in more tomorrow.

- Dan

On Mon, Oct 3, 2016 at 10:53 PM, Dan Burkert <da...@cloudera.com> wrote:

> The first three rows (including the out of order row) all fall in the same
> range partition, so the issue is likely that the intra-tablet scan returned
> out of order results (as opposed to the client scanning tablets out of
> order). I'm under the same impression about SetFaultTolerant(), which is
> why the test explicitly sets it.  How often is this happening? Back when
> this test was committed a few months ago I ran it a few thousand times and
> never saw anything like this.
>
> On Mon, Oct 3, 2016 at 10:35 PM, Todd Lipcon <to...@cloudera.com> wrote:
>
>> Hey Dan (+CC dev in case anyone else knows about this too)
>>
>> I'm debugging some flakiness in alter_table-randomized-test, and ti seems
>> like it's failing because the verification scan is returning some out of
>> order rows, despite using "SetFaultTolerant()". Granted, fault tolerance
>> isn't publicly guaranteed to return rows in order, but I was under the
>> impression that, with range partitioned tablets, it would always do so.
>>
>> The scan result I'm seeing has the following sequence within it:
>>
>> (int32 key=537424064, int32 c945=NULL, int32 c79=234639860, int32
>> c990=NULL)
>> >>>> OUT OF ORDER ROW
>> (int32 key=552025439, int32 c945=NULL, int32 c79=234639860, int32
>> c990=NULL)
>> >>>> BACK TO NORMAL ORDER
>> (int32 key=539314778, int32 c945=1708089980, int32 c79=-878787336, int32
>> c990=829302644)
>> (int32 key=541817227, int32 c945=2064952224, int32 c79=2064952224, int32
>> c990=NULL)
>> (int32 key=546056206, int32 c945=26527696, int32 c79=26527696, int32
>> c990=26527696)
>> (int32 key=601960253, int32 c945=NULL, int32 c79=1088757503, int32
>> c990=NULL)
>> (int32 key=677154987, int32 c945=823764490, int32 c79=823764490, int32
>> c990=823764490)
>>
>> The prior alter was:
>> I1004 05:17:48.192611 28113 alter_table-randomized-test.cc:481] Dropping
>> range partition: [805306356, 872415219) resulting partitions: (134217726,
>> 201326589], (268435452, 335544315], (335544315, 402653178], (402653178,
>> 469762041], (536870904, 603979767], (671088630, 738197493], (738197493,
>> 805306356], (939524082, 1006632945], (1006632945, 1073741808], (1275068397,
>> 1342177260], (1342177260, 1409286123], (1409286123, 1476394986],
>> (1610612712, 1677721575], (1879048164, 1946157027], (2013265890,
>> 2080374753], (2080374753, 2147483616)
>> I1004 05:17:48.193013 28113 alter_table-randomized-test.cc:406]
>> Committing Alterations
>>
>> The whole log is available here:
>> https://gist.githubusercontent.com/toddlipcon/466976caf973f4
>> 96885da9efc2f7246c/raw/f9baf418dad4ad07f33961b131c86e8480381
>> 5a8/alter_table-randomized-test.txt
>>
>> Any ideas what might be causing this out-of-order result? Is the test
>> making some incorrect assumptions or might we have a bug?
>>
>> -Todd
>>
>>
>>
>>
>>
>>
>> --
>> Todd Lipcon
>> Software Engineer, Cloudera
>>
>
>

Re: Fault tolerant scanner returning out-of-order partitions

Posted by Dan Burkert <da...@cloudera.com>.
The first three rows (including the out of order row) all fall in the same
range partition, so the issue is likely that the intra-tablet scan returned
out of order results (as opposed to the client scanning tablets out of
order). I'm under the same impression about SetFaultTolerant(), which is
why the test explicitly sets it.  How often is this happening? Back when
this test was committed a few months ago I ran it a few thousand times and
never saw anything like this.

On Mon, Oct 3, 2016 at 10:35 PM, Todd Lipcon <to...@cloudera.com> wrote:

> Hey Dan (+CC dev in case anyone else knows about this too)
>
> I'm debugging some flakiness in alter_table-randomized-test, and ti seems
> like it's failing because the verification scan is returning some out of
> order rows, despite using "SetFaultTolerant()". Granted, fault tolerance
> isn't publicly guaranteed to return rows in order, but I was under the
> impression that, with range partitioned tablets, it would always do so.
>
> The scan result I'm seeing has the following sequence within it:
>
> (int32 key=537424064, int32 c945=NULL, int32 c79=234639860, int32
> c990=NULL)
> >>>> OUT OF ORDER ROW
> (int32 key=552025439, int32 c945=NULL, int32 c79=234639860, int32
> c990=NULL)
> >>>> BACK TO NORMAL ORDER
> (int32 key=539314778, int32 c945=1708089980, int32 c79=-878787336, int32
> c990=829302644)
> (int32 key=541817227, int32 c945=2064952224, int32 c79=2064952224, int32
> c990=NULL)
> (int32 key=546056206, int32 c945=26527696, int32 c79=26527696, int32
> c990=26527696)
> (int32 key=601960253, int32 c945=NULL, int32 c79=1088757503, int32
> c990=NULL)
> (int32 key=677154987, int32 c945=823764490, int32 c79=823764490, int32
> c990=823764490)
>
> The prior alter was:
> I1004 05:17:48.192611 28113 alter_table-randomized-test.cc:481] Dropping
> range partition: [805306356, 872415219) resulting partitions: (134217726,
> 201326589], (268435452, 335544315], (335544315, 402653178], (402653178,
> 469762041], (536870904, 603979767], (671088630, 738197493], (738197493,
> 805306356], (939524082, 1006632945], (1006632945, 1073741808], (1275068397,
> 1342177260], (1342177260, 1409286123], (1409286123, 1476394986],
> (1610612712, 1677721575], (1879048164, 1946157027], (2013265890,
> 2080374753], (2080374753, 2147483616)
> I1004 05:17:48.193013 28113 alter_table-randomized-test.cc:406]
> Committing Alterations
>
> The whole log is available here:
> https://gist.githubusercontent.com/toddlipcon/
> 466976caf973f496885da9efc2f7246c/raw/f9baf418dad4ad07f33961b131c86e
> 84803815a8/alter_table-randomized-test.txt
>
> Any ideas what might be causing this out-of-order result? Is the test
> making some incorrect assumptions or might we have a bug?
>
> -Todd
>
>
>
>
>
>
> --
> Todd Lipcon
> Software Engineer, Cloudera
>