Posted to github@beam.apache.org by GitBox <gi...@apache.org> on 2022/06/04 18:11:18 UTC

[GitHub] [beam] damccorm opened a new issue, #20568: Cannot run Python PortableRunner on EMR cluster

damccorm opened a new issue, #20568:
URL: https://github.com/apache/beam/issues/20568

   I have been trying to run the Python word-count example on an [AWS EMR](https://aws.amazon.com/emr/) cluster, and it does not work.
   
   Things I have tried:
    * Running with 
   ```
   
   python3 py_codes/word_count_beam.py --output word_count_output --runner=SparkRunner
   
   ```
   
   This implicitly runs with `--spark-master-url local[4]`, which defeats the purpose of running it on a cluster.
   
    * Tried
   ```
   
   python3 py_codes/word_count_beam.py --output word_count_output --runner=SparkRunner --spark-master-url=yarn
   
   ```
   
   Still uses local master.
   
    * Could not use the method described in [https://beam.apache.org/documentation/runners/spark/](https://beam.apache.org/documentation/runners/spark/) under "Running on a pre-deployed Spark cluster", because under YARN the master is not exposed at a URL like localhost:7077.
   
    * Tried
   ```
   
   python3 py_codes/word_count_beam.py --output word_count_output --runner=SparkRunner --output_executable_path=jars/beam_word_count.jar
   
   ```
   
   as described in https://issues.apache.org/jira/browse/BEAM-8970.
    This creates a jar file, but when I submit the jar with spark-submit I get a Docker permission-denied exception, possibly related to https://issues.apache.org/jira/browse/BEAM-6020.
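
For reference, the pre-deployed cluster method mentioned in the third item above submits the pipeline to a separately started Beam job server rather than to the Spark master directly. The sketch below only assembles the pipeline flags that method uses; the helper name `portable_runner_args` is hypothetical, and the endpoint `localhost:8099` and the `LOOPBACK` environment are assumptions taken from the local-testing setup in the Beam docs, not something verified on EMR.

```python
def portable_runner_args(job_endpoint="localhost:8099", environment_type="LOOPBACK"):
    """Build the flags for submitting a pipeline to an already-running Beam job server.

    job_endpoint is assumed to be the job server's address, not the Spark
    master URL; the job server itself is what connects to the Spark master.
    """
    return [
        "--runner=PortableRunner",
        f"--job_endpoint={job_endpoint}",
        f"--environment_type={environment_type}",
    ]


if __name__ == "__main__":
    # Print the flags so they can be appended to the word-count command.
    print(" ".join(portable_runner_args()))
```

The returned list could be passed to `PipelineOptions` or appended to the command line. It does not solve the problem reported here, since the job server still needs a reachable Spark master URL.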
   
   *So, is there no way to run Python Beam code on a YARN Spark cluster?*
    This would also mean there is no way to run TFX code (which uses Beam) on a YARN cluster.
   
   Imported from Jira [BEAM-11378](https://issues.apache.org/jira/browse/BEAM-11378). Original Jira may contain additional context.
   Reported by: ratulray.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@beam.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Cannot run Python PortableRunner -> SparkRunner on EMR cluster [beam]

Posted by "moradology (via GitHub)" <gi...@apache.org>.
moradology commented on issue #20568:
URL: https://github.com/apache/beam/issues/20568#issuecomment-1819423006

   Hey - for anyone familiar with this issue and the SDK harness environments that could be used here: should the `PROCESS` environment work around the Docker permissions issue described above? Naively, I'd have thought so, but when I experimented with it, the `boot` executable built via Gradle still appeared to try to start a container. Is that expected behavior that I somehow missed in the relevant [docs](https://beam.apache.org/documentation/runtime/sdk-harness-config/)?




Re: [I] Cannot run Python PortableRunner -> SparkRunner on EMR cluster [beam]

Posted by "moradology (via GitHub)" <gi...@apache.org>.
moradology commented on issue #20568:
URL: https://github.com/apache/beam/issues/20568#issuecomment-1839577712

   It appears to me that the `PROCESS` SDK environment avoids the problems encountered here. What ended up working for me is discussed in this issue from a downstream project: https://github.com/pangeo-forge/pangeo-forge-runner/issues/133
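
As a rough illustration of the `PROCESS` environment discussed above, the sketch below builds the two pipeline flags described in the SDK harness configuration docs: `--environment_type=PROCESS` and an `--environment_config` JSON object whose `command` field points at the `boot` executable. The helper name `process_environment_args` and the example path `/opt/apache/beam/boot` are hypothetical, and this is not taken from the linked downstream discussion.

```python
import json


def process_environment_args(boot_path):
    """Build the flags that select the PROCESS SDK harness environment.

    boot_path is assumed to be the location of the Beam `boot` executable
    on every worker node; it must exist there before the job starts.
    """
    # The config is a JSON object; "command" is the executable that the
    # runner starts instead of launching a Docker container.
    env_config = json.dumps({"command": boot_path})
    return [
        "--environment_type=PROCESS",
        f"--environment_config={env_config}",
    ]


if __name__ == "__main__":
    # Hypothetical path, for illustration only.
    print(" ".join(process_environment_args("/opt/apache/beam/boot")))
```

These flags would be added to whichever runner flags are already in use. The `boot` executable and a matching Python environment still have to be installed on each node by some other means, such as an EMR bootstrap action.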

