Posted to user@beam.apache.org by Pawel <pa...@o2.pl> on 2022/03/30 05:01:13 UTC

Failing job

Hi,

I have very limited knowledge about Apache Beam and I'm facing a strange problem when starting a job in Dataflow using the Go SDK.

First I get this debug message (I see more of such splits):

PB Split: instruction_id:"process_bundle-2455586051109757402-4328"  desired_splits:{key:"ptransform-11"  value:{fraction_of_remainder:0.5  allowed_split_points:0  allowed_split_points:1  estimated_input_elements:1}}

And just after this debug message there is an error:

github.com/apache/beam/sdks/v2@v2.37.0/go/pkg/beam/core/runtime/harness/harness.go:432
message: "unable to split process_bundle-2455586051109757402-4328: failed to split DataSource (at index: 1) at requested splits: {[0 1]}"

After this error the whole job crashes. Is this likely a problem in my pipeline, or a bug in Dataflow and/or Beam?

Thanks in advance

--
Paweł

Re: Re: Failing job

Posted by Pawel <pa...@o2.pl>.
Thank you Danny for your help. I think you are right. I changed the function to my own implementation of Write, which indeed writes to shards, and the process didn't fail. Thanks for your help.

--
Paweł
On 31 March 2022 at 21:03 Danny McCormick <dannymccormick@google.com> wrote:

Re: Re: Failing job

Posted by Danny McCormick <da...@google.com>.
Hey Pawel, the size is almost definitely the problem - textio.Write isn't
really built to scale to single files of that size (at this time at least).
Specifically, it performs an AddFixedKey/GroupByKey
<https://github.com/apache/beam/blob/410ad7699621e28433d81809f6b9c42fe7bd6a60/sdks/go/pkg/beam/io/textio/textio.go#L130>
operation
before writing to a file, which gathers all the file contents onto a single
machine before writing the whole file at once. It seems likely that this is
causing some sort of memory issue which is eventually causing the crash (I
suspect the failed split is a red herring since failing to split *shouldn't*
be fatal).

If you need to write it all to GCS, I'd recommend writing some sort of
sharding function to split it up into several files and then writing it to
GCS that way - there's an example of how to do something like that in Beam
Go's large wordcount example
<https://github.com/apache/beam/blob/32cec70f7fe92b20a79cc9def65289c12d650a2c/sdks/go/examples/large_wordcount/large_wordcount.go#L132>
which
you could probably adapt to your use case.
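For illustration, the per-element piece of such a sharding function can be as simple as hashing each line into a fixed number of buckets. In a Beam Go pipeline this would feed a ParDo emitting (shard, line) pairs, followed by a GroupByKey and a per-key file write, along the lines of the large wordcount example. This is only a sketch of the shard-assignment step; shardKey, the shard count, and the FNV hash are illustrative choices, not Beam API:

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// shardKey deterministically maps a line to one of numShards buckets.
// In a pipeline, a ParDo would emit (shardKey(line, N), line) pairs so a
// GroupByKey collects each shard on one worker, and a final DoFn writes
// each group to its own GCS object (e.g. out-00001-of-00016), keeping any
// single worker's in-memory share to roughly 1/N of the total output.
func shardKey(line string, numShards uint32) uint32 {
	h := fnv.New32a()
	h.Write([]byte(line)) // fnv's Write never returns an error
	return h.Sum32() % numShards
}

func main() {
	for _, line := range []string{"alpha", "beta", "gamma"} {
		fmt.Printf("%q -> shard %d\n", line, shardKey(line, 16))
	}
}
```

The key property is determinism: the same line always lands in the same shard, so retries of a bundle reproduce the same grouping.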

Thanks,
Danny

On Thu, Mar 31, 2022 at 3:15 AM Pawel <pa...@o2.pl> wrote:

> Hi Danny,
> Thank you for your reply. I don't use a splittable DoFn explicitly. The
> problem appears when I attach a write function at the end of my pipeline.
>
> After ~10 hours of running a job with 30 workers first I start getting a
> lot of errors like this from Dataflow
>
> Error syncing pod 3cfc1e3bc4c70cd9c668944ce9769a42
> ("df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)"),
> skipping: failed to "StartContainer" for "sdk-0-0" with CrashLoopBackOff:
> "back-off 40s restarting failed container=sdk-0-0
> pod=df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)
>
> Then the only 2 errors which appeared on workers are
>
> unable to split process_bundle-2455586051109759173-4329: failed to split
> DataSource (at index: 1) at requested splits: {[0 1]}
>
> Just after that the pipeline was marked as failed in Dataflow. This makes
> me think that this split is causing the pipeline to fail.
> I am pretty sure this happened in the last step, which is simply
> textio.Write. Maybe the size of the output file is the problem, because
> I'm expecting a file between 1TB and 2TB to be written to GCS.
>
> --
> Paweł
>
> On 30 March 2022 at 14:47 Danny McCormick <da...@google.com>
> wrote:
>
> Hey Pawel, are you able to provide any more information about the
> splittable DoFn that is failing and what its SplitRestriction
> function looks like? I'm not sure that we have enough information to fully
> diagnose the issue with the error alone.
>
> The error itself is coming from here -
> https://github.com/apache/beam/blob/ec62f3d9fa38e67279a70df28f2be8b17f344e27/sdks/go/pkg/beam/core/runtime/exec/datasource.go#L538
> - and is pretty much what it sounds like - the runner is trying to initiate
> a split but it fails because the sdk can't find a split point. I am kinda
> surprised that this causes the whole pipeline to fail though - was there
> any additional logging around the job crash? I wouldn't expect a failed
> split to be enough to bring the whole job down.
>
> @Daniel Oliveira <da...@google.com> probably has more context here,
> Daniel do you have anything to add?
>
> Thanks,
> Danny
>
> On Wed, Mar 30, 2022 at 1:01 AM Pawel <pa...@o2.pl>
> wrote:
>
> Hi
> I have very limited knowledge about Apache Beam and I'm facing a strange
> problem when starting a job in Dataflow using the Go SDK.
> First I get this debug message (I see more of such Splits)
>
> PB Split: instruction_id:"process_bundle-2455586051109757402-4328"
> desired_splits:{key:"ptransform-11"  value:{fraction_of_remainder:0.5
> allowed_split_points:0  allowed_split_points:1  estimated_input_elements:1}}
>
> And just after this debug message there is an error
>
> github.com/apache/beam/sdks/v2@v2.37.0/go/pkg/beam/core/runtime/harness/harness.go:432
> "
> message: "unable to split process_bundle-2455586051109757402-4328: failed
> to split DataSource (at index: 1) at requested splits: {[0 1]}"
>
> After this error the whole job crashes. Is this likely a problem in my
> pipeline or a bug in Dataflow and/or Beam?
> Thanks in advance
>
> --
> Paweł
>
>

Odp: Re: Failing job

Posted by Pawel <pa...@o2.pl>.
Hi Danny,

Thank you for your reply. I don't use a splittable DoFn explicitly. The problem appears when I attach a write function at the end of my pipeline.

After ~10 hours of running a job with 30 workers, first I start getting a lot of errors like this from Dataflow:

Error syncing pod 3cfc1e3bc4c70cd9c668944ce9769a42 ("df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)"), skipping: failed to "StartContainer" for "sdk-0-0" with CrashLoopBackOff: "back-off 40s restarting failed container=sdk-0-0 pod=df-go-job-1-1648642256936168-03300511-upar-harness-nn4g_default(3cfc1e3bc4c70cd9c668944ce9769a42)

Then the only 2 errors which appeared on workers are:

unable to split process_bundle-2455586051109759173-4329: failed to split DataSource (at index: 1) at requested splits: {[0 1]}

Just after that the pipeline was marked as failed in Dataflow. This makes me think that this split is causing the pipeline to fail.

I am pretty sure this happened in the last step, which is simply textio.Write. Maybe the size of the output file is the problem, because I'm expecting a file between 1TB and 2TB to be written to GCS.

--
Paweł
On 30 March 2022 at 14:47 Danny McCormick <dannymccormick@google.com> wrote:

Re: Failing job

Posted by Danny McCormick <da...@google.com>.
Hey Pawel, are you able to provide any more information about the
splittable DoFn that is failing and what its SplitRestriction
function looks like? I'm not sure that we have enough information to fully
diagnose the issue with the error alone.
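(For readers following the thread: in the Go SDK a splittable DoFn is a struct with specially named methods, and SplitRestriction is the one the split machinery relies on. A rough sketch of what that method typically computes, using a simplified stand-in for the SDK's offsetrange.Restriction type; the method name and signature shape follow the SDK convention, but the types and helper here are illustrative, not real Beam API:)

```go
package main

import "fmt"

// Restriction is a simplified stand-in for Beam Go's offsetrange.Restriction.
type Restriction struct{ Start, End int64 }

// splitEvenly mimics what a SplitRestriction method usually does: divide an
// element's restriction into n roughly equal sub-restrictions the runner can
// schedule independently. On a real DoFn struct this logic would live in
// a method like:
//   func (fn *myFn) SplitRestriction(elem string, r offsetrange.Restriction) []offsetrange.Restriction
func splitEvenly(r Restriction, n int64) []Restriction {
	out := make([]Restriction, 0, n)
	size := r.End - r.Start
	for i := int64(0); i < n; i++ {
		out = append(out, Restriction{
			Start: r.Start + i*size/n,
			End:   r.Start + (i+1)*size/n,
		})
	}
	return out
}

func main() {
	// A 10-unit restriction split three ways: [{0 3} {3 6} {6 10}].
	fmt.Println(splitEvenly(Restriction{Start: 0, End: 10}, 3))
}
```

The sub-restrictions must tile the original exactly (each End equals the next Start), which is what lets the runner redistribute unfinished work without losing or duplicating elements.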

The error itself is coming from here -
https://github.com/apache/beam/blob/ec62f3d9fa38e67279a70df28f2be8b17f344e27/sdks/go/pkg/beam/core/runtime/exec/datasource.go#L538
- and is pretty much what it sounds like - the runner is trying to initiate
a split but it fails because the sdk can't find a split point. I am kinda
surprised that this causes the whole pipeline to fail though - was there
any additional logging around the job crash? I wouldn't expect a failed
split to be enough to bring the whole job down.

@Daniel Oliveira <da...@google.com> probably has more context here,
Daniel do you have anything to add?

Thanks,
Danny

On Wed, Mar 30, 2022 at 1:01 AM Pawel <pa...@o2.pl> wrote:

> Hi
> I have very limited knowledge about Apache Beam and I'm facing a strange
> problem when starting a job in Dataflow using the Go SDK.
> First I get this debug message (I see more of such Splits)
>
> PB Split: instruction_id:"process_bundle-2455586051109757402-4328"
> desired_splits:{key:"ptransform-11"  value:{fraction_of_remainder:0.5
> allowed_split_points:0  allowed_split_points:1  estimated_input_elements:1}}
>
> And just after this debug message there is an error
>
> github.com/apache/beam/sdks/v2@v2.37.0/go/pkg/beam/core/runtime/harness/harness.go:432
> "
> message: "unable to split process_bundle-2455586051109757402-4328: failed
> to split DataSource (at index: 1) at requested splits: {[0 1]}"
>
> After this error the whole job crashes. Is this likely a problem in my
> pipeline or a bug in Dataflow and/or Beam?
> Thanks in advance
>
> --
> Paweł
>