You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "peay (JIRA)" <ji...@apache.org> on 2017/04/01 14:43:42 UTC

[jira] [Commented] (BEAM-1848) GroupByKey stuck with more than one worker on Dataflow

    [ https://issues.apache.org/jira/browse/BEAM-1848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15952249#comment-15952249 ] 

peay commented on BEAM-1848:
----------------------------

Indeed, the issue is gone. Thanks for looking into it so quickly -- I hadn't found those issues.

> GroupByKey stuck with more than one worker on Dataflow
> ------------------------------------------------------
>
>                 Key: BEAM-1848
>                 URL: https://issues.apache.org/jira/browse/BEAM-1848
>             Project: Beam
>          Issue Type: Bug
>          Components: runner-dataflow, sdk-java-core
>    Affects Versions: 0.6.0
>            Reporter: peay
>            Assignee: Kenneth Knowles
>            Priority: Blocker
>
> I have a simple pipeline which has a sliding window, a {{GroupByKey}} and then a simple stateful {{DoFn}}. I run in batch mode ({{--streaming=false}}) on Dataflow. 
> On a very small dataset  of a couple KBs, I can run the pipeline to completion. Dataflow does show "successful".  On a larger dataset (but still very small, 100s MB read by source), the pipeline stays stuck, no matter how long I wait. In addition, it never really gets stuck at the same point. I expect about 340k output records, and never get more than 70k before it gets stuck.
> Dataflow always autoscales from 1 to 8 workers, which is my limit.
> Run A: after 30mins+: no elements added out of {{GroupByKey}}, but logs have repeating occurrences of
> {code}
> Proposing dynamic split of work unit xxxxxx;aaaaaa;bbbbbb at {"position":{"shufflePosition":"AAAAAAD_AP8A_wEAAQ"}}
> Refusing to split <unstarted in shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ): unstarted   
> Refused to split GroupingShuffleReader <unstarted in shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ)
> {code}
> Run B: after a couple minutes, elements get added to output of {{GroupByKey}}, up to 56,128 and then stays stuck doing nothing, but logs have repeating occurrences of
> {code}
> Proposing dynamic split of work unit xxxxxx;ccccccc;dddddd at {"position":{"shufflePosition":"AAAAAQD_AP8A_wEAAQ"}}
> Refusing to split <unstarted in shuffle range [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ): unstarted
> Refused to split GroupingShuffleReader <unstarted in shuffle range [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ)
> {code}
> Run C: after 10mins: elements get added to output of {{GroupByKey}}, up to 70,262 and then stays stuck doing nothing. No logs as above as far as I can find.
> I've run this about a dozen times and it always gets stuck. I am trying out right now to run the pipeline with the worker limit set to one, and {{GroupByKey}} has output 150k so far, still increasing. This seems like a workaround, but using one worker only is not ideal.
> cc [~dhalperi@google.com]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)