Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 23:31:18 UTC

[GitHub] [beam] damccorm opened a new issue, #21432: `beam.CombineValues` on DataFlow runner causes ambiguous failure with python SDK

damccorm opened a new issue, #21432:
URL: https://github.com/apache/beam/issues/21432

    
   
   The following beam pipeline works correctly using `DirectRunner` but fails with a very vague error when using `DataflowRunner`.
   ```
   (
       pipeline
       | beam.io.ReadFromPubSub(input_topic, with_attributes=True)
       | beam.Map(pubsub_message_to_row)
       | beam.WindowInto(beam.transforms.window.FixedWindows(5))
       | beam.GroupBy(<beam.Row col name>)
       | beam.CombineValues(<instance of beam.CombineFn subclass>)
       | beam.Values()
       | beam.io.gcp.bigquery.WriteToBigQuery(. . .)
   )
   ```
   
   Stacktrace:
   ```
   Traceback (most recent call last):
     File "src/read_quality_pipeline/__init__.py", line 128, in <module>
       (
     File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/pipeline.py", line 597, in __exit__
       self.result.wait_until_finish()
     File "/home/pkg_dev/.cache/pypoetry/virtualenvs/apache-beam-poc-5nxBvN9R-py3.8/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1633, in wait_until_finish
       raise DataflowRuntimeException(
   apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
   Error processing pipeline.
   ```
   
   Log output:
   ```
   2022-02-01T16:54:43.645Z: JOB_MESSAGE_WARNING: Autoscaling is enabled for Dataflow Streaming Engine. Workers will scale between 1 and 100 unless maxNumWorkers is specified.
   2022-02-01T16:54:43.736Z: JOB_MESSAGE_DETAILED: Autoscaling is enabled for job 2022-02-01_08_54_40-8791019287477103665. The number of workers will be between 1 and 100.
   2022-02-01T16:54:43.757Z: JOB_MESSAGE_DETAILED: Autoscaling was automatically enabled for job 2022-02-01_08_54_40-8791019287477103665.
   2022-02-01T16:54:44.624Z: JOB_MESSAGE_ERROR: Error processing pipeline.
   ```
   
   With the `CombineValues` step removed, this pipeline starts successfully on Dataflow.
   
   I initially thought this was a server-side Dataflow issue, since the Dataflow API (v1b3.projects.locations.jobs.messages) returns only the textPayload "Error processing pipeline". But then I found BEAM-12636, where a Go SDK user gets the same error message, seemingly as a result of bugs in the Go SDK.
   
   Imported from Jira [BEAM-13795](https://issues.apache.org/jira/browse/BEAM-13795). Original Jira may contain additional context.
   Reported by: Jake_Zuliani.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] MOscity commented on issue #21432: `beam.CombineValues` on DataFlow runner causes ambiguous failure with python SDK

Posted by GitBox <gi...@apache.org>.
MOscity commented on issue #21432:
URL: https://github.com/apache/beam/issues/21432#issuecomment-1398541772

   Hey, I'm facing the same issue here: the whole pipeline (all steps) works with DirectRunner, but DataflowRunner fails after 1-3 seconds and emits no logs. It works fine without the CountCombineFn step.
   
   ```
   def transform_data(right_side_data, step):
       data_out = (
               right_side_data
               | 'Step 1'.format(step) >> beam.Map(prepare_key_value)
               | 'Step 2'.format(step) >> beam.GroupByKey()
               
               # This line fails with DataflowRunner, but runs in DirectRunner locally:
               | 'Step 3'.format(step) >> beam.CombineValues(beam.combiners.CountCombineFn())
       )
       return data_out
   ```
   
   Error Log:
   ```
   ERROR:apache_beam.runners.dataflow.dataflow_runner:Console URL: https://console.cloud.google.com/dataflow/jobs/<RegionId>/2023-01-20_06_59_03-4426498189309546663?project=<ProjectId>
   Traceback (most recent call last):
     File "./path/to/file/my_python.py", line 618, in <module>
       run_pipeline()
     File "./path/to/file/my_python.py", line 598, in run_pipeline
       print(f'----- After Step: {step}.')
     File "/home/myusername/.local/share/virtualenvs/pipenv_20-Y278SNFx/lib/python3.8/site-packages/apache_beam/pipeline.py", line 598, in __exit__
       self.result.wait_until_finish()
     File "/home/myusername/.local/share/virtualenvs/pipenv_20-Y278SNFx/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1641, in wait_until_finish
       raise DataflowRuntimeException(
   apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
   Error processing pipeline.
   ```
   
   I haven't found a workaround yet. Does anyone have an idea?
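
For readers comparing formulations: `GroupByKey` followed by `CombineValues` and a single per-key combine are expected to produce the same result, because a `CombineFn` is defined by four steps (create an accumulator, add inputs, merge accumulators, extract the output). The sketch below models that contract in plain Python with no Beam dependency. `CountFn`, `group_then_combine`, and `combine_per_key` are hypothetical names for illustration, not Beam APIs, and the two-way split of inputs is an arbitrary stand-in for how a runner might partition work. In Beam itself the single-step form is `beam.CombinePerKey(...)`; whether switching to it avoids the Dataflow failure reported here is not established in this thread.

```python
from collections import defaultdict


class CountFn:
    """Plain-Python stand-in for a counting CombineFn (hypothetical, not a Beam class)."""

    def create_accumulator(self):
        return 0

    def add_input(self, accumulator, element):
        return accumulator + 1

    def merge_accumulators(self, accumulators):
        return sum(accumulators)

    def extract_output(self, accumulator):
        return accumulator


def group_then_combine(pairs, fn):
    """Model of GroupByKey followed by CombineValues: group first, then fold each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    result = {}
    for key, values in groups.items():
        acc = fn.create_accumulator()
        for value in values:
            acc = fn.add_input(acc, value)
        result[key] = fn.extract_output(acc)
    return result


def combine_per_key(pairs, fn, num_shards=2):
    """Model of a per-key combine: partial accumulators per shard, merged per key."""
    shards = [defaultdict(fn.create_accumulator) for _ in range(num_shards)]
    for index, (key, value) in enumerate(pairs):
        shard = shards[index % num_shards]
        shard[key] = fn.add_input(shard[key], value)
    keys = {key for shard in shards for key in shard}
    return {
        key: fn.extract_output(
            fn.merge_accumulators([shard[key] for shard in shards if key in shard])
        )
        for key in keys
    }


pairs = [("a", 1), ("b", 2), ("a", 3), ("a", 4)]
grouped = group_then_combine(pairs, CountFn())
combined = combine_per_key(pairs, CountFn())
```

Both `grouped` and `combined` hold the same per-key counts, which is why the two pipeline shapes are interchangeable as far as results go, even though a runner may translate and execute them differently.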




[GitHub] [beam] viniciusdsmello commented on issue #21432: `beam.CombineValues` on DataFlow runner causes ambiguous failure with python SDK

Posted by GitBox <gi...@apache.org>.
viniciusdsmello commented on issue #21432:
URL: https://github.com/apache/beam/issues/21432#issuecomment-1361451153

   Hi, I'm facing the same issue here. Did anyone figure out a workaround?




[GitHub] [beam] lohmingyao1993 commented on issue #21432: `beam.CombineValues` on DataFlow runner causes ambiguous failure with python SDK

Posted by GitBox <gi...@apache.org>.
lohmingyao1993 commented on issue #21432:
URL: https://github.com/apache/beam/issues/21432#issuecomment-1170602646

   Hi, I have run into the same issue as well. Is there a workaround or a stable version to use? Thanks!
   ```
   with beam.Pipeline(options=pipeline_options) as p:
       p \
       | "Read From PubSub Subscription" >> ReadFromPubSub(
           subscription=subscription) \
       | beam.Map(lambda row: logging.info(row))  # log each incoming message
   ```
   
   ```
   Traceback (most recent call last):
     File "src/main.py", line 101, in <module>
       run()
     File "src/main.py", line 54, in run
       p \
     File "/Users/name/.local/share/virtualenvs/QqobXVVz/lib/python3.8/site-packages/apache_beam/pipeline.py", line 597, in __exit__
       self.result.wait_until_finish()
     File "/Users/name/.local/share/virtualenvs/QqobXVVz/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1667, in wait_until_finish
       raise DataflowRuntimeException(
   apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
   Error processing pipeline.
   ```
   
   




[GitHub] [beam] liferoad commented on issue #21432: `beam.CombineValues` on DataFlow runner causes ambiguous failure with python SDK

Posted by "liferoad (via GitHub)" <gi...@apache.org>.
liferoad commented on issue #21432:
URL: https://github.com/apache/beam/issues/21432#issuecomment-1504085727

   .take-issue




[GitHub] [beam] linamartensson commented on issue #21432: `beam.CombineValues` on DataFlow runner causes ambiguous failure with python SDK

Posted by "linamartensson (via GitHub)" <gi...@apache.org>.
linamartensson commented on issue #21432:
URL: https://github.com/apache/beam/issues/21432#issuecomment-1472766256

   We started encountering this issue on Nov 2 2022 with a job running daily.
   It runs from a template which was created on June 14 2022, but ran just fine at first. We're able to work around it with the suggestion here, but this is odd - and it doesn't line up with anything from the [Dataflow release schedule](https://cloud.google.com/dataflow/docs/release-notes) as far as I can tell.
   
   So - how could this have happened?
   It's worrisome that a job that was already running could just stop. I'm also wondering if we may have done some Cloud change on our end that might have suddenly triggered it. Also, clearly we should have discovered this issue sooner, but here we are. ;)

