You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@beam.apache.org by "Kenneth Knowles (Jira)" <ji...@apache.org> on 2022/01/24 20:26:00 UTC

[jira] [Updated] (BEAM-13721) ZeroDivisionError if source bundle smaller than 1mb

     [ https://issues.apache.org/jira/browse/BEAM-13721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kenneth Knowles updated BEAM-13721:
-----------------------------------
    Description: 
Hi,
I built a (GCP) DataFlow + apache-beam (version 2.24.0) pipeline, using python.
The pipeline's stages are:
1. reading from BigQuery
2. running a custom function (using the ParDo while passing it a class inheriting 'beam.DoFn'), that transforms the read data.
3. writing the transformed data to BigQuery
 
The pipeline works fine when stage 1 is querying a small amount of data, but when it is querying data from the last six months (lots of data), I am getting this error:
 
{code:python}
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workercustomsources.py", line 69, in _{_}iter{_}_
    self._source.start_position, self._source.stop_position)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 78, in get_range_tracker
    start_position, stop_position, self._source_bundles)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 131, in _{_}init{_}_
    self._compute_cumulative_weights(source_bundles[start[0]:last]) + [1] *
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 154, in _compute_cumulative_weights
    running_total.append(max(min_diff, min(1, running_total[-1] + w / total)))
ZeroDivisionError: float division by zero
{code}

 
I saw this issue in your repository:
 
https://issues.apache.org/jira/browse/BEAM-10004
 
which is referencing the same problem I have, but even though this fix is already implemented in my version of apache-beam (2.24.0), I still get this error.
 
Can you please guide me in fixing this issue?
Is it something I am doing wrong or is this your bug?
 
Thank you in advance and have a good one!
 
Inbar Dekel

  was:
Hi,
I built a (GCP) DataFlow + apache-beam (version 2.24.0) pipeline, using python.
The pipeline's stages are:
1. reading from BigQuery
2. running a custom function (using the ParDo while passing it a class inheriting 'beam.DoFn'), that transforms the read data.
3. writing the transformed data to BigQuery
 
The pipeline works fine when stage 1 is querying a small amount of data, but when it is querying data from the last six months (lots of data), I am getting this error:
 
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
  File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
  File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
  File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workercustomsources.py", line 69, in __iter__
    self._source.start_position, self._source.stop_position)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 78, in get_range_tracker
    start_position, stop_position, self._source_bundles)
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 131, in __init__
    self._compute_cumulative_weights(source_bundles[start[0]:last]) + [1] *
  File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 154, in _compute_cumulative_weights
    running_total.append(max(min_diff, min(1, running_total[-1] + w / total)))
ZeroDivisionError: float division by zero
 
I saw this issue in your repository:
 
https://issues.apache.org/jira/browse/BEAM-10004
 
which is referencing the same problem I have, but even though this fix is already implemented in my version of apache-beam (2.24.0), I still get this error.
 
Can you please guide me in fixing this issue?
Is it something I am doing wrong or is this your bug?
 
Thank you in advance and have a good one!
 
Inbar Dekel


> ZeroDivisionError if source bundle smaller than 1mb
> ---------------------------------------------------
>
>                 Key: BEAM-13721
>                 URL: https://issues.apache.org/jira/browse/BEAM-13721
>             Project: Beam
>          Issue Type: Bug
>          Components: io-py-gcp
>    Affects Versions: 2.24.0
>            Reporter: Inbar Dekel
>            Priority: P2
>
> Hi,
> I built a (GCP) DataFlow + apache-beam (version 2.24.0) pipeline, using python.
> The pipeline's stages are:
> 1. reading from BigQuery
> 2. running a custom function (using the ParDo while passing it a class inheriting 'beam.DoFn'), that transforms the read data.
> 3. writing the transformed data to BigQuery
>  
> The pipeline works fine when stage 1 is querying a small amount of data, but when it is querying data from the last six months (lots of data), I am getting this error:
>  
> {code:python}
> apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
> Traceback (most recent call last):
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
>     work_executor.execute()
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/executor.py", line 179, in execute
>     op.start()
>   File "dataflow_worker/native_operations.py", line 38, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 39, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 44, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "dataflow_worker/native_operations.py", line 48, in dataflow_worker.native_operations.NativeReadOperation.start
>   File "/usr/local/lib/python3.7/site-packages/dataflow_worker/workercustomsources.py", line 69, in _{_}iter{_}_
>     self._source.start_position, self._source.stop_position)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 78, in get_range_tracker
>     start_position, stop_position, self._source_bundles)
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 131, in _{_}init{_}_
>     self._compute_cumulative_weights(source_bundles[start[0]:last]) + [1] *
>   File "/usr/local/lib/python3.7/site-packages/apache_beam/io/concat_source.py", line 154, in _compute_cumulative_weights
>     running_total.append(max(min_diff, min(1, running_total[-1] + w / total)))
> ZeroDivisionError: float division by zero
> {code}
>  
> I saw this issue in your repository:
>  
> https://issues.apache.org/jira/browse/BEAM-10004
>  
> which is referencing the same problem I have, but even though this fix is already implemented in my version of apache-beam (2.24.0), I still get this error.
>  
> Can you please guide me in fixing this issue?
> Is it something I am doing wrong or is this your bug?
>  
> Thank you in advance and have a good one!
>  
> Inbar Dekel



--
This message was sent by Atlassian Jira
(v8.20.1#820001)