Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2018/03/13 19:10:00 UTC

[jira] [Commented] (KUDU-2342) Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "

    [ https://issues.apache.org/jira/browse/KUDU-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16397479#comment-16397479 ] 

Todd Lipcon commented on KUDU-2342:
-----------------------------------

The server vc1515 has the following spewing in its logs:

{code}
I0313 11:56:27.615651 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer f7376c96c6b64e7fa6a7bfc84fd0cd64 (vc1534.halxg.cloudera.com:7050): Could not obtain request from queue for peer: f7376c96c6b64e7fa6a7bfc84fd0cd64. Status: Not found: Failed to read ops 1143..1221: Segment 1130 which contained index 1143 has been GCed
I0313 11:56:27.973654 43703 consensus_peers.cc:230] T b8431200388d486995a4426c88bc06a2 P a260dca5a9c846e99cb621881a7b86b8 -> Peer e3fdd8da21a643aba21b7acdd6b17499 (va1038.halxg.cloudera.com:7050): Could not obtain request from queue for peer: e3fdd8da21a643aba21b7acdd6b17499. Status: Not found: Failed to read ops 1055..1221: Segment 1043 which contained index 1055 has been GCed
{code}

In other words, it appears to have GCed the log segments needed to catch up both of its followers. As a result it cannot replicate or commit any writes, so the write here timed out. Rather than letting the write time out, we should of course respond more quickly that the tablet is unavailable, but that's a separate issue.
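
For triage, the relevant fields can be pulled out of messages like the two above with a short script. This is only a sketch: the regular expression is inferred from those two log lines, not from Kudu's source, and `parse_gc_failure` is a hypothetical helper name, so the pattern may need adjusting for other versions or message variants.

```python
import re

# Pattern inferred from the two sample log lines above (an assumption,
# not a format guaranteed by Kudu).
GC_FAILURE = re.compile(
    r"T (?P<tablet>[0-9a-f]+) P (?P<leader>[0-9a-f]+) -> "
    r"Peer (?P<peer>[0-9a-f]+) \((?P<host>[^)]+)\): .*?"
    r"Failed to read ops (?P<first_op>\d+)\.\.(?P<last_op>\d+): "
    r"Segment (?P<segment>\d+) which contained index (?P<index>\d+) "
    r"has been GCed"
)


def parse_gc_failure(line):
    """Return a dict of fields for a matching log line, or None otherwise."""
    match = GC_FAILURE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Convert the numeric fields so they can be compared and sorted.
    for key in ("first_op", "last_op", "segment", "index"):
        fields[key] = int(fields[key])
    return fields
```

Running this over the leader's log and grouping the results by `peer` shows how far behind each follower is and which segment it is waiting for.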

I guess in this case we can't recover, because the leader won't evict a follower either: it knows that it wouldn't be able to commit the config change. So how did it get into a state where it had GCed logs behind the majority_replicated watermark? [~aserbin] said he can take a look.
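
To make the watermark reasoning concrete, here is a toy calculation in Python. It is not Kudu's implementation; `majority_replicated_index` is a hypothetical helper, and it assumes each follower already holds every op below the first index the leader tried to read for it (1143 and 1055 in the log above).

```python
def majority_replicated_index(last_received):
    """Highest op index that a strict majority of replicas has received."""
    ordered = sorted(last_received, reverse=True)
    majority = len(ordered) // 2 + 1
    # The majority-th highest value is present on at least `majority` replicas.
    return ordered[majority - 1]


# Assumed state, derived from the log lines above:
leader_last = 1221                   # last op the leader tried to send
follower_last = [1143 - 1, 1055 - 1]  # assumption: followers hold all earlier ops

watermark = majority_replicated_index([leader_last] + follower_last)
print(watermark)  # 1142
```

Under those assumptions the watermark is 1142, so ops 1143..1221 exist only on the leader and cannot be committed until at least one follower catches up, and the GCed segments are exactly what a follower would need in order to do that.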

> Insert into Lineitem table with 1340 tablets on 129 node cluster failed with "Failed to write batch "
> -----------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2342
>                 URL: https://issues.apache.org/jira/browse/KUDU-2342
>             Project: Kudu
>          Issue Type: Bug
>          Components: tablet
>    Affects Versions: 1.7.0
>            Reporter: Mostafa Mokhtar
>            Priority: Major
>              Labels: scalability
>         Attachments: Impala query profile.txt
>
>
> While loading TPCH 30TB on a 129-node cluster via Impala, a write operation failed with:
>     Query Status: Kudu error(s) reported, first error: Timed out: Failed to write batch of 38590 ops to tablet b8431200388d486995a4426c88bc06a2 after 1 attempt(s): Failed to write to server: a260dca5a9c846e99cb621881a7b86b8 (vc1515.halxg.cloudera.com:7050): Write RPC to X.X.X.X:7050 timed out after 180.000s (SENT)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)