Posted to issues@beam.apache.org by "Daniel Oliveira (Jira)" <ji...@apache.org> on 2021/03/05 03:03:00 UTC

[jira] [Commented] (BEAM-11916) Combine failed on large PCollection of uint64 arrays

    [ https://issues.apache.org/jira/browse/BEAM-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17295707#comment-17295707 ] 

Daniel Oliveira commented on BEAM-11916:
----------------------------------------

Hey Tao, I left a possible solution for this on your [Stack Overflow question|https://stackoverflow.com/questions/66446338/issue-with-combine-function-in-apache-beam-go-sdk]. Let me know if it works.

> Combine failed on large PCollection of uint64 arrays
> ----------------------------------------------------
>
>                 Key: BEAM-11916
>                 URL: https://issues.apache.org/jira/browse/BEAM-11916
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-go
>    Affects Versions: 2.28.0
>         Environment: Google Dataflow
>            Reporter: Tao Liao
>            Assignee: Daniel Oliveira
>            Priority: P2
>              Labels: GCP
>         Attachments: dataflow autoscaling.png
>
>
> We came across an issue with the Combine operation in the Apache Beam Go SDK (v2.28.0) when running a pipeline on Google Cloud Dataflow. Source code: 
> https://github.com/le0000000/dataflow_combine
> We understand that the Go SDK is experimental, but it would be great if someone could help us understand whether there's anything wrong with our code, or whether there's a bug in the Go SDK or Dataflow. The issue only happens when running the pipeline on Google Dataflow with a large data set. We are trying to combine a _PCollection<pairedVec>_, with
> type pairedVec struct {
>     Vec1 [1048576]uint64
>     Vec2 [1048576]uint64
> }
> There are 10,000,000 items in the PCollection. After reading the input file, Dataflow scheduled 1000 workers to generate the PCollection and started the combine step. The number of workers then dropped to almost 1 and stayed there for a very long time. Eventually the job failed with the following error log:
> 2021-03-02T06:13:40.438112597Z Workflow failed. Causes: S09:CombinePerKey/CoGBK'1/Read+CombinePerKey/main.combineVecFn+CombinePerKey/main.combineVecFn/Extract+beam.dropKeyFn+main.flattenVecFn+textio.Write/beam.addFixedKeyFn+textio.Write/CoGBK/Write failed., The job failed because a work item has failed 4 times. Look in previous log entries for the cause of each one of the 4 failures. For more information, see https://cloud.google.com/dataflow/docs/guides/common-errors. The work item was attempted on these workers: go-job-1-1614659244459204-03012027-u5s6-harness-q8tx Root cause: The worker lost contact with the service., go-job-1-1614659244459204-03012027-u5s6-harness-44hk Root cause: The worker lost contact with the service., go-job-1-1614659244459204-03012027-u5s6-harness-05nm Root cause: The worker lost contact with the service., go-job-1-1614659244459204-03012027-u5s6-harness-l22w Root cause: The worker lost contact with the service.
>  
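To put the sizes in the description in perspective, here is a minimal Go sketch of the arithmetic. It reuses the pairedVec type and the element count of 10,000,000 from the report; the helper names itemBytes and totalBytes are made up for illustration and are not part of the linked repository or the Beam SDK. It measures the in-memory size of the struct only, which is an assumption standing in for whatever encoded size Beam actually ships between workers.

```go
package main

import (
	"fmt"
	"unsafe"
)

// vecLen matches the array length in the pairedVec type from the report.
const vecLen = 1048576

// pairedVec mirrors the struct quoted in the issue description.
type pairedVec struct {
	Vec1 [vecLen]uint64
	Vec2 [vecLen]uint64
}

// itemBytes (hypothetical helper) returns the in-memory size of one
// pairedVec value. unsafe.Sizeof is evaluated at compile time, so no
// array is actually allocated here.
func itemBytes() uint64 {
	return uint64(unsafe.Sizeof(pairedVec{}))
}

// totalBytes (hypothetical helper) returns the combined in-memory size
// of n pairedVec values. This ignores Beam's coder overhead.
func totalBytes(n uint64) uint64 {
	return n * itemBytes()
}

func main() {
	const items = 10000000 // element count stated in the report

	fmt.Printf("per item: %d MiB\n", itemBytes()>>20)
	fmt.Printf("total:    %.1f TiB\n", float64(totalBytes(items))/float64(uint64(1)<<40))
}
```

Under these assumptions each element is 16 MiB and the whole PCollection is roughly 152.6 TiB in memory, so any stage that funnels a large share of the elements through a single worker has a very large amount of data to move.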


