You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@kudu.apache.org by "Andrew Wong (Code Review)" <ge...@cloudera.org> on 2019/05/21 00:30:22 UTC

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Andrew Wong has uploaded this change for review. ( http://gerrit.cloudera.org:8080/13384


Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................

hms: fix flakiness of master-stress-test from managed tables

The switch to managed tables opened the door for the race described in
HIVE-21759, described with more Kudu-specific details below:

1. Master A* sends a create table request for T.
2. The HMS creates a directory for that table and an entry in the HMS
   catalog.
3. Before A* can write this to the Kudu catalog, master B* becomes
   leader. A's write fails, and A sends a drop request to the HMS to
   roll back the created metadata for T.
4. The HMS receives A's drop request, and the HMS entry is dropped.
   Note: In the process of dropping a table from the HMS, the last thing
   the operation will do is drop any underlying data in the HMS (the
   directory, in this case).
5. The create request for T is retried by B*. Because the HMS entry for
   T is already dropped in the HMS, the request will be processed.
6. The HMS finally drops the underlying directory for T created by A.
7. Along the path of creating the table, the request from B* hits a "Not
   found" exception, expecting for the directory for T to exist.
8. The create table request is failed, since the HMS transaction to
   create the table is failed.

The end result is that the table doesn't exist in Kudu or in the HMS
following this sequence of events. This patch addresses this by catching
such errors with the HMS enabled and checking that an underlying Kudu
table and HMS entry is not created if so.

Without this patch, master-stress-test failed 24/300 times with 7 stress
threads. With it, it failed 9/300 times (mostly due to KUDU-2779).

Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
---
M src/kudu/integration-tests/master-stress-test.cc
1 file changed, 35 insertions(+), 0 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/13384/1
-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Adar Dembo (Code Review)" <ge...@cloudera.org>.
Adar Dembo has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 1:

(1 comment)

Looks like the patch this one is based on is failing tests.

http://gerrit.cloudera.org:8080/#/c/13384/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13384/1//COMMIT_MSG@22
PS1, Line 22: is retried by B
> I would think it's being retried by the client that initiated the operation
Yes, but in context the master is acting as a proxy for the client (i.e. the client retries on B*, which talks to HMS). For example, step 1 is also initiated by the client; it's just implicit.



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 21 May 2019 20:15:02 +0000
Gerrit-HasComments: Yes

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Hao Hao (Code Review)" <ge...@cloudera.org>.
Hao Hao has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 3: Code-Review+2

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13384/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13384/2//COMMIT_MSG@9
PS2, Line 9:  managed tables opened the door
> This can't happen before the switch because external tables will not drop d
Ah, right, thanks a lot for the explanation.


http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@280
PS1, Line 280:       // underlying Kudu table should not be created.
             :       if (hms_client_ && s.IsRemoteError() &&
             :           MatchPattern(s.ToString(), "*FileNotFoundException*")) {
             :         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_c
> No, just master_failover-itest. It doesn't show up here because the RPC tim
Ok, sounds good.



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 24 May 2019 01:16:49 +0000
Gerrit-HasComments: Yes

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Hao Hao (Code Review)" <ge...@cloudera.org>.
Hao Hao has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 2: Code-Review+1

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13384/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13384/2//COMMIT_MSG@9
PS2, Line 9:  managed tables opened the door
I think this can happen without the switch? Switching only means it increase the risk of such failure?


http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@280
PS1, Line 280:           MatchPattern(s.ToString(), "*FileNotFoundException*")) {
             :         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_create, &db, &table_name));
             :         hive::Table hms_table;
             :         Status s = hms_client_->GetTable(db.To
> If the table hasn't been rolled back, the new leader will see an AlreadyPre
Have you ever saw master-stress-test failed with such error? If so, should we also account for such failure?



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 24 May 2019 00:47:14 +0000
Gerrit-HasComments: Yes

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, Grant Henke, Hao Hao, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13384

to look at the new patch set (#2).

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................

hms: fix flakiness of master-stress-test from managed tables

The switch to managed tables opened the door for the race described in
HIVE-21759, described with more Kudu-specific details below:

1. Master A* sends a create table request for T.
2. The HMS creates a directory for that table and an entry in the HMS
   catalog.
3. Before A* can write this to the Kudu catalog, master B* becomes
   leader. A's write fails, and A sends a drop request to the HMS to
   roll back the created metadata for T.
4. The HMS receives A's drop request, and the HMS entry is dropped.
   Note: In the process of dropping a table from the HMS, the last thing
   the operation will do is drop any underlying data in the HMS (the
   directory, in this case).
5. The create request for T is retried by B*. Because the HMS entry for
   T is already dropped in the HMS, the request will be processed.
6. The HMS finally drops the underlying directory for T created by A.
7. Along the path of creating the table, the request from B* hits a "Not
   found" exception, expecting for the directory for T to exist.
8. The create table request is failed, since the HMS transaction to
   create the table is failed.

The end result is that the table doesn't exist in Kudu or in the HMS
following this sequence of events. This patch addresses this by catching
such errors with the HMS enabled and checking that an underlying Kudu
table and HMS entry is not created if so.

Without this patch, master-stress-test failed 24/300 times with 7 stress
threads. With it, it failed 9/300 times (mostly due to KUDU-2779).

Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
---
M src/kudu/integration-tests/master-stress-test.cc
1 file changed, 36 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/13384/2
-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 3: Verified+1

Flakes caused by unsynchronized NTP :(


-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 24 May 2019 01:30:31 +0000
Gerrit-HasComments: No

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has removed a vote on this change.

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Removed Verified-1 by Kudu Jenkins (120)
-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: deleteVote
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13384/2//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13384/2//COMMIT_MSG@9
PS2, Line 9:  managed tables opened the door
> I think this can happen without the switch? Switching only means it increas
This can't happen before the switch because external tables will not drop data when dropping the tables. See:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2647

which checks if it's external here:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2704

and based on that, it doesn't drop data:
https://github.com/apache/hive/blob/master/standalone-metastore/metastore-server/src/main/java/org/apache/hadoop/hive/metastore/HiveMetaStore.java#L2682


http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@280
PS1, Line 280:           MatchPattern(s.ToString(), "*FileNotFoundException*")) {
             :         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_create, &db, &table_name));
             :         hive::Table hms_table;
             :         Status s = hms_client_->GetTable(db.To
> Have you ever saw master-stress-test failed with such error? If so, should 
No, just master_failover-itest. It doesn't show up here because the RPC timeout is bumped up enough (60s in this test) that the client will wait long enough for the rollback to complete before sending the second create table request.



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Fri, 24 May 2019 01:03:40 +0000
Gerrit-HasComments: Yes

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/13384/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13384/1//COMMIT_MSG@22
PS1, Line 22: is retried by B
I would think it's being retried by the client that initiated the operation, no?


http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@279
PS1, Line 279:       if (hms_client_ && s.IsRemoteError()) {
             :         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_create, &db, &table_name));
             :         hive::Table hms_table;
             :         Status s = hms_client_->GetTable(db.ToString(), table_name.ToString(), &hms_table);
             :         CHECK(s.IsNotFound()) << s.ToString();
I'm curious whether this piece might inadvertently hide issues in other scenarios.  Is there anything specific in the Status::RemoteError() in this case that we could use to make sure this outcome has happened due to the leader change and nothing else?


http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@280
PS1, Line 280:         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_create, &db, &table_name));
             :         hive::Table hms_table;
             :         Status s = hms_client_->GetTable(db.ToString(), table_name.ToString(), &hms_table);
             :         CHECK(s.IsNotFound()) << s.ToString();
Another this I'm curious is whether this check catches all the outcomes of the race described in the commit message.

For example, is it possible that by the time of this check the entry is still in the HMS and hasn't been rolled back by the scoped cleanup in the former master, and the table is seen as not-yet-existing in Kudu by the newly elected leader master?



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 21 May 2019 06:13:57 +0000
Gerrit-HasComments: Yes

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 2:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc
File src/kudu/integration-tests/master-stress-test.cc:

http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@279
PS1, Line 279:       if (hms_client_ && s.IsRemoteError() &&
             :           MatchPattern(s.ToString(), "*FileNotFoundException*")) {
             :         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_create, &db, &table_name));
             :         hive::Table hms_table;
             :         Status s = hms_client_->GetTable(db.To
> I'm curious whether this piece might inadvertently hide issues in other sce
Done


http://gerrit.cloudera.org:8080/#/c/13384/1/src/kudu/integration-tests/master-stress-test.cc@280
PS1, Line 280:           MatchPattern(s.ToString(), "*FileNotFoundException*")) {
             :         // Check that the table was not created in Kudu.
             :         client::sp::shared_ptr<KuduTable> table;
             :         s = client_->OpenTable(to_create, &table);
             :         CHECK(s.IsNotFound()) << s.ToString();
             : 
             :         // Check that the table was not created in the HMS.
             :         Slice db;
             :         Slice table_name;
             :         CHECK_OK(ParseHiveTableIdentifier(to_create, &db, &table_name));
             :         hive::Table hms_table;
             :         Status s = hms_client_->GetTable(db.To
> Another this I'm curious is whether this check catches all the outcomes of 
If the table hasn't been rolled back, the new leader will see an AlreadyPresent error. The race you described is the cause of: https://jira.apache.org/jira/browse/KUDU-2718

The error there looks like:

Bad status: Already present: Error creating table default.test_delete_table_sync on the master: an error occurred while creating table default.test_delete_table_sync in the HMS: failed to create Hive MetaStore table: TException - service has thrown: AlreadyExistsException(message=Table test_delete_table_sync already exists)



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 2
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Thu, 23 May 2019 23:23:52 +0000
Gerrit-HasComments: Yes

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................

hms: fix flakiness of master-stress-test from managed tables

The switch to managed tables opened the door for the race described in
HIVE-21759, described with more Kudu-specific details below:

1. Master A* sends a create table request for T.
2. The HMS creates a directory for that table and an entry in the HMS
   catalog.
3. Before A* can write this to the Kudu catalog, master B* becomes
   leader. A's write fails, and A sends a drop request to the HMS to
   roll back the created metadata for T.
4. The HMS receives A's drop request, and the HMS entry is dropped.
   Note: In the process of dropping a table from the HMS, the last thing
   the operation will do is drop any underlying data in the HMS (the
   directory, in this case).
5. The create request for T is retried by B*. Because the HMS entry for
   T is already dropped in the HMS, the request will be processed.
6. The HMS finally drops the underlying directory for T created by A.
7. Along the path of creating the table, the request from B* hits a "Not
   found" exception, expecting for the directory for T to exist.
8. The create table request is failed, since the HMS transaction to
   create the table is failed.

The end result is that the table doesn't exist in Kudu or in the HMS
following this sequence of events. This patch addresses this by catching
such errors with the HMS enabled and checking that an underlying Kudu
table and HMS entry is not created if so.

Without this patch, master-stress-test failed 24/300 times with 7 stress
threads. With it, it failed 9/300 times (mostly due to KUDU-2779).

Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Reviewed-on: http://gerrit.cloudera.org:8080/13384
Reviewed-by: Hao Hao <ha...@cloudera.com>
Tested-by: Andrew Wong <aw...@cloudera.com>
---
M src/kudu/integration-tests/master-stress-test.cc
1 file changed, 38 insertions(+), 0 deletions(-)

Approvals:
  Hao Hao: Looks good to me, approved
  Andrew Wong: Verified

-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 4
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Hello Alexey Serbin, Kudu Jenkins, Adar Dembo, Grant Henke, Hao Hao, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/13384

to look at the new patch set (#3).

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................

hms: fix flakiness of master-stress-test from managed tables

The switch to managed tables opened the door for the race described in
HIVE-21759, described with more Kudu-specific details below:

1. Master A* sends a create table request for T.
2. The HMS creates a directory for that table and an entry in the HMS
   catalog.
3. Before A* can write this to the Kudu catalog, master B* becomes
   leader. A's write fails, and A sends a drop request to the HMS to
   roll back the created metadata for T.
4. The HMS receives A's drop request, and the HMS entry is dropped.
   Note: In the process of dropping a table from the HMS, the last thing
   the operation will do is drop any underlying data in the HMS (the
   directory, in this case).
5. The create request for T is retried by B*. Because the HMS entry for
   T is already dropped in the HMS, the request will be processed.
6. The HMS finally drops the underlying directory for T created by A.
7. Along the path of creating the table, the request from B* hits a "Not
   found" exception, expecting for the directory for T to exist.
8. The create table request is failed, since the HMS transaction to
   create the table is failed.

The end result is that the table doesn't exist in Kudu or in the HMS
following this sequence of events. This patch addresses this by catching
such errors with the HMS enabled and checking that an underlying Kudu
table and HMS entry is not created if so.

Without this patch, master-stress-test failed 24/300 times with 7 stress
threads. With it, it failed 9/300 times (mostly due to KUDU-2779).

Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
---
M src/kudu/integration-tests/master-stress-test.cc
1 file changed, 38 insertions(+), 0 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/84/13384/3
-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 3
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)

[kudu-CR] hms: fix flakiness of master-stress-test from managed tables

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/13384 )

Change subject: hms: fix flakiness of master-stress-test from managed tables
......................................................................


Patch Set 1:

(1 comment)

http://gerrit.cloudera.org:8080/#/c/13384/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/13384/1//COMMIT_MSG@22
PS1, Line 22: is retried by B
> Yes, but in context the master is acting as a proxy for the client (i.e. th
Ah, I see.  Thank you for the clarification.



-- 
To view, visit http://gerrit.cloudera.org:8080/13384
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I1233e5b133035d36f98ce08265c0273570caa454
Gerrit-Change-Number: 13384
Gerrit-PatchSet: 1
Gerrit-Owner: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Adar Dembo <ad...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Grant Henke <gr...@apache.org>
Gerrit-Reviewer: Hao Hao <ha...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Comment-Date: Tue, 21 May 2019 22:41:32 +0000
Gerrit-HasComments: Yes