You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@beam.apache.org by "peay (JIRA)" <ji...@apache.org> on 2017/03/31 16:31:41 UTC

[jira] [Created] (BEAM-1848) GroupByKey stuck with more than one worker on Dataflow

peay created BEAM-1848:
--------------------------

             Summary: GroupByKey stuck with more than one worker on Dataflow
                 Key: BEAM-1848
                 URL: https://issues.apache.org/jira/browse/BEAM-1848
             Project: Beam
          Issue Type: Bug
          Components: runner-dataflow, sdk-java-core
    Affects Versions: 0.6.0
            Reporter: peay
            Assignee: Davor Bonaci
            Priority: Blocker


I have a simple pipeline which has a sliding window, a {{GroupByKey}} and then a simple stateful {{DoFn}}. I run in batch mode ({{--streaming=false}}) on Dataflow. 

On a very small dataset  of a couple KBs, I can run the pipeline to completion. Dataflow does show "successful".  On a larger dataset (but still very small, 100s MB read by source), the pipeline stays stuck, no matter how long I wait. In addition, it never really gets stuck at the same point. I expect about 340k output records, and never get more than 70k before it gets stuck.

Dataflow always autoscales from 1 to 8 workers, which is my limit.

Run A: after 30mins+: no elements added out of {{GroupByKey}}, but logs have repeating occurrences of
{code}
Proposing dynamic split of work unit xxxxxx;aaaaaa;bbbbbb at {"position":{"shufflePosition":"AAAAAAD_AP8A_wEAAQ"}}
Refusing to split <unstarted in shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ): unstarted   
Refused to split GroupingShuffleReader <unstarted in shuffle range [ShufflePosition(base64:AAAAAAD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAAD_AP8A_wEAAQ)
{code}

Run B: after a couple minutes, elements get added to output of {{GroupByKey}}, up to 56,128 and then stays stuck doing nothing, but logs have repeating occurrences of
{code}
Proposing dynamic split of work unit xxxxxx;ccccccc;dddddd at {"position":{"shufflePosition":"AAAAAQD_AP8A_wEAAQ"}}
Refusing to split <unstarted in shuffle range [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ): unstarted
Refused to split GroupingShuffleReader <unstarted in shuffle range [ShufflePosition(base64:AAAAAQD_AP8A_wD_AAE), ShufflePosition(base64:AAAAAgD_AP8A_wD_AAE))> at ShufflePosition(base64:AAAAAQD_AP8A_wEAAQ)
{code}

Run C: after 10mins: elements get added to output of {{GroupByKey}}, up to 70,262 and then stays stuck doing nothing. No logs as above as far as I can find.

I've run this about a dozen times and it always gets stuck. I am trying out right now to run the pipeline with the worker limit set to one, and {{GroupByKey}} has output 150k so far, still increasing. This seems like a workaround, but using one worker only is not ideal.

cc [~dhalperi@google.com]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)