Posted to issues@drill.apache.org by "Deneche A. Hakim (JIRA)" <ji...@apache.org> on 2016/01/07 22:37:40 UTC

[jira] [Commented] (DRILL-3845) PartitionSender doesn't send last batch for receivers that already terminated

    [ https://issues.apache.org/jira/browse/DRILL-3845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088156#comment-15088156 ] 

Deneche A. Hakim commented on DRILL-3845:
-----------------------------------------

We've seen this issue again in a different query. An intermediate fragment contains a hashjoin; the left side generates lots of data (it's a view that contains 2 flatten operators) and takes more than 10 minutes to finish sending all its data. The right side is really small and sends everything in less than 2 seconds.
For some reason (maybe a skew caused by our hashing function), some fragments don't receive any data at all on either side, and the hashjoin stops the fragment. But because the left side didn't send them any data either, it will still send the "last batch" when it's done, 10 minutes later, and the query fails because the fragment is not even in the recently finished cache.

The proposed fix updates PartitionSender to not send the "last batch" for any receiver that sent an early termination request.
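
To illustrate the idea, here is a minimal sketch in Java. It is not the actual patch: the class PartitionSenderSketch and the methods onEarlyTermination and lastBatchTargets are hypothetical names made up for this example, and receivers are reduced to plain integer ids. The sender remembers which receivers asked for early termination and leaves them out when it picks the targets for the "last batch".

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch only: these are not Drill's real classes or method names.
class PartitionSenderSketch {
    private final int receiverCount;
    // Ids of receivers that sent an early termination request.
    private final Set<Integer> terminated = new HashSet<>();

    PartitionSenderSketch(int receiverCount) {
        this.receiverCount = receiverCount;
    }

    // Record that a receiver has finished and needs no more data.
    void onEarlyTermination(int receiverId) {
        terminated.add(receiverId);
    }

    // Receivers that should still get the "last batch":
    // every receiver except those that already terminated.
    List<Integer> lastBatchTargets() {
        List<Integer> targets = new ArrayList<>();
        for (int id = 0; id < receiverCount; id++) {
            if (!terminated.contains(id)) {
                targets.add(id);
            }
        }
        return targets;
    }

    public static void main(String[] args) {
        PartitionSenderSketch sender = new PartitionSenderSketch(4);
        sender.onEarlyTermination(1);
        sender.onEarlyTermination(3);
        System.out.println(sender.lastBatchTargets()); // prints [0, 2]
    }
}
```

With this check in place, how long the sender takes to finish no longer matters for a terminated receiver, since nothing is sent to a fragment that may already have aged out of the recently finished cache.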

> PartitionSender doesn't send last batch for receivers that already terminated
> -----------------------------------------------------------------------------
>
>                 Key: DRILL-3845
>                 URL: https://issues.apache.org/jira/browse/DRILL-3845
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Execution - Relational Operators
>            Reporter: Deneche A. Hakim
>            Assignee: Deneche A. Hakim
>             Fix For: 1.5.0
>
>         Attachments: 29c45a5b-e2b9-72d6-89f2-d49ba88e2939.sys.drill
>
>
> Even if a receiver has finished and informed the corresponding partition sender, the sender will still try to send a "last batch" to the receiver when it's done. In most cases this is fine as those batches will be silently dropped by the receiving DataServer, but if a receiver finished more than 10 minutes ago, DataServer will throw an exception because it can't find the corresponding FragmentManager (WorkEventBus has a 10-minute recentlyFinished cache).
> DRILL-2274 is a reproduction for this case (after the corresponding fix is applied).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)