Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 18:07:16 UTC

[GitHub] [beam] damccorm opened a new issue, #20542: Running Apache Beam to distribute the cleaning of a dataset in Google Cloud Dataflow

damccorm opened a new issue, #20542:
URL: https://github.com/apache/beam/issues/20542

   I'm trying to download C4 via [these instructions](https://github.com/google-research/text-to-text-transfer-transformer#c4). Three hours into the job, it fails with the error below. I can't find any help on Google for this error.
   
   Traceback (most recent call last):
    File "/usr/local/lib/python3.6/site-packages/dataflow_worker/batchworker.py", line 649, in do_work
    work_executor.execute()
    File "/usr/local/lib/python3.6/site-packages/dataflow_worker/executor.py", line 179, in execute
    op.start()
    File "dataflow_worker/shuffle_operations.py", line 63, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    File "dataflow_worker/shuffle_operations.py", line 64, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    File "dataflow_worker/shuffle_operations.py", line 79, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    File "dataflow_worker/shuffle_operations.py", line 80, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    File "dataflow_worker/shuffle_operations.py", line 84, in dataflow_worker.shuffle_operations.GroupedShuffleReadOperation.start
    File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
    File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
    File "dataflow_worker/shuffle_operations.py", line 261, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
    File "dataflow_worker/shuffle_operations.py", line 268, in dataflow_worker.shuffle_operations.BatchGroupAlsoByWindowsOperation.process
    File "apache_beam/runners/worker/operations.py", line 332, in apache_beam.runners.worker.operations.Operation.output
    File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
    File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
    File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
    File "apache_beam/runners/common.py", line 1215, in apache_beam.runners.common.DoFnRunner.process
    File "apache_beam/runners/common.py", line 1279, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
    File "apache_beam/runners/common.py", line 569, in apache_beam.runners.common.SimpleInvoker.invoke_process
    File "apache_beam/runners/common.py", line 1371, in apache_beam.runners.common._OutputProcessor.process_outputs
    File "apache_beam/runners/worker/operations.py", line 195, in apache_beam.runners.worker.operations.SingletonConsumerSet.receive
    File "apache_beam/runners/worker/operations.py", line 670, in apache_beam.runners.worker.operations.DoOperation.process
    File "apache_beam/runners/worker/operations.py", line 671, in apache_beam.runners.worker.operations.DoOperation.process
    File "apache_beam/runners/common.py", line 1215, in apache_beam.runners.common.DoFnRunner.process
    File "apache_beam/runners/common.py", line 1294, in apache_beam.runners.common.DoFnRunner._reraise_augmented
    File "/usr/local/lib/python3.6/site-packages/future/utils/__init__.py", line 446, in raise_with_traceback
    raise exc.with_traceback(traceback)
    File "apache_beam/runners/common.py", line 1213, in apache_beam.runners.common.DoFnRunner.process
    File "apache_beam/runners/common.py", line 570, in apache_beam.runners.common.SimpleInvoker.invoke_process
    File "/mnt/pccfs/backed_up/crytting/persuasion/createc4/lib/python3.6/site-packages/apache_beam/transforms/core.py", line 815, in <lambda>
    self.process = lambda element: fn(element)
   TypeError: clean_page() got an unexpected keyword argument 'badwords_regex' [while running 'clean_pages']
   
   Imported from Jira [BEAM-11098](https://issues.apache.org/jira/browse/BEAM-11098). Original Jira may contain additional context.
   Reported by: crytting.
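
The `TypeError` above means the worker called `clean_page` with a `badwords_regex` keyword that the function, as defined on the worker, does not accept. The thread does not confirm the root cause. One plausible explanation, stated here only as an assumption, is that the code submitting the pipeline and the module installed on the workers come from different versions. The sketch below reproduces the same class of error with two hypothetical stand-ins (`clean_page_old`, `clean_page_new`) and shows a signature check with `inspect.signature` that could be run locally before launching a long job. It is not the actual C4 code.

```python
import functools
import inspect


def accepts_keyword(fn, name):
    """Return True if `fn` can be called with the keyword argument `name`."""
    params = inspect.signature(fn).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs parameter accepts any keyword.
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


# Hypothetical stand-ins for two versions of the same function.
def clean_page_old(page):
    return page


def clean_page_new(page, badwords_regex=None):
    return page


def reproduce_error():
    """Bind a keyword the old signature lacks, then call it like a worker would."""
    fn = functools.partial(clean_page_old, badwords_regex=None)
    try:
        fn("some page")
    except TypeError as exc:
        return str(exc)
    return None


if __name__ == "__main__":
    print(accepts_keyword(clean_page_new, "badwords_regex"))
    print(accepts_keyword(clean_page_old, "badwords_regex"))
    print(reproduce_error())
```

If a check like this fails locally, the mismatch is in the local environment. If it passes locally but the job still fails, the workers are probably importing a different copy of the module than the one used to build the pipeline.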


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [beam] tvalentyn closed issue #20542: Running Apache Beam to distribute the cleaning of a dataset in Google Cloud Dataflow

Posted by GitBox <gi...@apache.org>.
tvalentyn closed issue #20542: Running Apache Beam to distribute the cleaning of a dataset in Google Cloud Dataflow
URL: https://github.com/apache/beam/issues/20542




[GitHub] [beam] tvalentyn commented on issue #20542: Running Apache Beam to distribute the cleaning of a dataset in Google Cloud Dataflow

Posted by GitBox <gi...@apache.org>.
tvalentyn commented on issue #20542:
URL: https://github.com/apache/beam/issues/20542#issuecomment-1164301014

   Closing as obsolete.

