Posted to user@beam.apache.org by Kyle Weaver <kc...@google.com> on 2021/11/03 01:05:22 UTC

Re: Trying to run Beam on Spark cluster

Hi Mark. Sorry for the late reply. There's a long-standing feature request
for mounting a volume on Dockerized workers:
https://issues.apache.org/jira/browse/BEAM-5440
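
In the meantime, one way to sidestep the issue is the PROCESS environment,
which runs the SDK harness directly on each Spark worker instead of in a
container, so it can read any directory the worker can (including an NFS
share). A minimal sketch, assuming you have copied the boot binary out of
the SDK container to /opt/apache/beam/boot on every worker (that path is a
placeholder):

python -m apache_beam.examples.wordcount
 gs://datapipeline-output/shakespeare-alls-11.txt --output
gs://datapipeline-output/output/   --project august-ascent-325423
--runner PortableRunner --job_endpoint=localhost:8099
--environment_type=PROCESS
--environment_config='{"command": "/opt/apache/beam/boot"}'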

On Wed, Sep 29, 2021 at 4:55 PM Mark Striebeck <ma...@gmail.com>
wrote:

> Thanks Kyle,
>
> On Fri, Sep 24, 2021 at 1:48 PM Kyle Weaver <kc...@google.com> wrote:
>
>> Hi Mark. Looks like a problem with artifact staging. PortableRunner
>> implicitly requires a directory (configurable with --artifacts_dir, under
>> /tmp by default) that is accessible by both the job server and Beam worker.
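>>
>> For example, if /mnt/nfs/beam-artifacts is a share visible to every
>> machine, you could point the job server at it when you start it (a
>> sketch; the path and Spark master URL are placeholders):
>>
>> docker run --net=host apache/beam_spark_job_server:2.32.0 \
>>   --spark-master-url=spark://spark-master:7077 \
>>   --artifacts-dir=/mnt/nfs/beam-artifacts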
>>
>
> Hmmm, I guess I could create an NFS share between the machines and use
> that for the artifacts_dir. But if I use environment_type=DOCKER, the
> docker image won't have access to that. Is there some easy way to modify
> the docker command that the worker runs when it starts the container, so
> that it maps this directory (via '-v') into it?
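>
> (I suppose, assuming the worker just invokes whatever `docker` it finds
> on $PATH, one hack would be a shim script placed ahead of the real binary
> that injects the mount on every `run` - hypothetical and untested:
>
> #!/bin/bash
> # shim: forward to the real docker, adding -v on `docker run`
> if [ "$1" = "run" ]; then
>   shift
>   exec /usr/bin/docker run -v /mnt/nfs:/mnt/nfs "$@"
> else
>   exec /usr/bin/docker "$@"
> fi
>
> But that feels fragile, so a supported option would be better.)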
>
>
>> You should be able to get around this by using --runner SparkRunner
>> instead:
>>
>> python -m apache_beam.examples.wordcount
>>  gs://datapipeline-output/shakespeare-alls-11.txt --output
>> gs://datapipeline-output/output/   --project august-ascent-325423
>> --environment_type=DOCKER --runner SparkRunner
>> --spark_rest_url http://hostname:6066
>>
>> This requires you to enable REST on your Spark master by setting
>> `spark.master.rest.enabled` to `true` in its config, and then setting the
>> Beam pipeline option --spark_rest_url to its address (6066 is the default
>> port).
>>
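>> For example, in $SPARK_HOME/conf/spark-defaults.conf on the master (a
>> minimal sketch; 6066 is the default port, shown here for clarity):
>>
>> spark.master.rest.enabled  true
>> spark.master.rest.port     6066
>>
>> and then restart the master so the settings take effect.
>>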
>
> I'll try that next while waiting for the answer above.
>
> Thanks
>      Mark
>
>>
>> This starts the job server for you, so you don't need to do that ahead of
>> time.
>>
>> Best,
>> Kyle
>>
>> On Sun, Sep 19, 2021 at 12:22 PM Mark Striebeck <ma...@gmail.com>
>> wrote:
>>
>>> Hi,
>>>
>>> I am trying to run Beam on a small Spark cluster. I set up Spark (master
>>> plus one slave). I am using the portable runner and invoke the Beam
>>> pipeline with:
>>>
>>> python -m apache_beam.examples.wordcount
>>>  gs://datapipeline-output/shakespeare-alls-11.txt --output
>>> gs://datapipeline-output/output/   --project august-ascent-325423 --runner
>>> PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER
>>>
>>> I always get an error:
>>> Caused by: java.util.concurrent.TimeoutException: Timed out while
>>> waiting for command 'docker run -d --network=host
>>> --env=DOCKER_MAC_CONTAINER=null apache/beam_python3.8_sdk:2.32.0 --id=4-1
>>> --provision_endpoint=localhost:46757'
>>>
>>> It takes ~2.5 minutes to pull the Beam image; the timeout should allow
>>> for that. But to rule it out, I pulled the image manually (docker pull
>>> apache/beam_python3.8_sdk:2.32.0) and then tried to run the pipeline
>>> again.
>>>
>>> Now, when I run the pipeline I get an error:
>>> java.io.FileNotFoundException:
>>> /tmp/beam-artifact-staging/60321f712323c195764ab31b3e205b228a405fbb80b50fafa67b38b21959c63f/1-ref_Environment_default_e-pickled_main_session
>>> (No such file or directory)
>>>
>>> and then further down
>>>
>>> ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
>>> org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException:
>>> java.lang.IllegalStateException: No container running for id
>>> 7014a9ea98dc0b3f453a9d3860aff43ba42214195d2240d7cefcefcfabf93879
>>>
>>> (here is the full stack trace:
>>> https://drive.google.com/file/d/1mRzt8G7I9Akkya48KfAbrqPp8wRCzXDe/view)
>>>
>>> Any pointers or ideas are appreciated (sorry if this is something
>>> obvious - I'm still pretty new to Beam/Spark).
>>>
>>> Thanks
>>>       Mark
>>>
>>