Posted to user@beam.apache.org by Mark Striebeck <ma...@gmail.com> on 2021/09/19 19:21:54 UTC

Trying to run Beam on Spark cluster

Hi,

I am trying to run Beam on a small Spark cluster. I set up Spark (a master
plus one slave). I am using the portable runner and invoke the Beam
pipeline with:

python -m apache_beam.examples.wordcount
 gs://datapipeline-output/shakespeare-alls-11.txt --output
gs://datapipeline-output/output/   --project august-ascent-325423 --runner
PortableRunner --job_endpoint=localhost:8099 --environment_type=DOCKER

I always get an error:
Caused by: java.util.concurrent.TimeoutException: Timed out while waiting
for command 'docker run -d --network=host --env=DOCKER_MAC_CONTAINER=null
apache/beam_python3.8_sdk:2.32.0 --id=4-1
--provision_endpoint=localhost:46757'

Pulling the Beam image takes ~2.5 minutes, which I would have expected the
timeout to accommodate. To rule the pull out anyway, I pulled the image
manually (docker pull apache/beam_python3.8_sdk:2.32.0) and then tried
running the pipeline again.

Now when I run the pipeline, I get a different error:
java.io.FileNotFoundException:
/tmp/beam-artifact-staging/60321f712323c195764ab31b3e205b228a405fbb80b50fafa67b38b21959c63f/1-ref_Environment_default_e-pickled_main_session
(No such file or directory)

and then further down

ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
org.apache.beam.vendor.guava.v26_0_jre.com.google.common.util.concurrent.UncheckedExecutionException:
java.lang.IllegalStateException: No container running for id
7014a9ea98dc0b3f453a9d3860aff43ba42214195d2240d7cefcefcfabf93879

(here is the full stack trace:
https://drive.google.com/file/d/1mRzt8G7I9Akkya48KfAbrqPp8wRCzXDe/view)

Any pointers or ideas are appreciated (sorry if this is something obvious -
I'm still pretty new to Beam/Spark).

Thanks
      Mark

Re: Trying to run Beam on Spark cluster

Posted by Kyle Weaver <kc...@google.com>.
Hi Mark. Sorry for the late reply. There's a long-standing feature request
for mounting a volume on Dockerized workers:
https://issues.apache.org/jira/browse/BEAM-5440
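
Until that lands, one blunt workaround is to shadow the docker binary on
each worker's PATH with a wrapper that injects the mount. A rough, untested
sketch (the NFS path is hypothetical):

#!/bin/sh
# Hypothetical shim: install this ahead of the real docker on each worker's
# PATH. It injects a bind mount of the shared staging dir into 'docker run'.
if [ "$1" = "run" ]; then
  shift
  exec /usr/bin/docker run \
    -v /mnt/nfs/beam-artifact-staging:/tmp/beam-artifact-staging "$@"
fi
exec /usr/bin/docker "$@"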

On Wed, Sep 29, 2021 at 4:55 PM Mark Striebeck <ma...@gmail.com>
wrote:

> Thanks Kyle,
>
> On Fri, Sep 24, 2021 at 1:48 PM Kyle Weaver <kc...@google.com> wrote:
>
>> Hi Mark. Looks like a problem with artifact staging. PortableRunner
>> implicitly requires a directory (configurable with --artifacts_dir, under
>> /tmp by default) that is accessible by both the job server and Beam worker.
>>
>
> Hmmm, I guess I could create an NFS share between the machines and use
> that for the artifacts_dir. But if I use environment_type=DOCKER, the
> docker container won't have access to it. Is there some easy way to modify
> the docker command that the worker runs when it starts the container, to
> map this directory (via '-v') into the container?
>
>
>> You should be able to get around this by using --runner SparkRunner
>> instead:
>>
>> python -m apache_beam.examples.wordcount
>>  gs://datapipeline-output/shakespeare-alls-11.txt --output
>> gs://datapipeline-output/output/   --project august-ascent-325423
>> --environment_type=DOCKER --runner SparkRunner
>> --spark_rest_url http://hostname:6066
>>
>> This requires you to enable REST on your Spark master by setting
>> `spark.master.rest.enabled` to `true` in your config, and then setting the
>> Beam pipeline option --spark_rest_url to its address (6066 is the default
>> port).
>>
>
> I'll try that next while waiting for the answer above.
>
> Thanks
>      Mark
>
>>
>> This starts the job server for you, so you don't need to do that ahead of
>> time.
>>
>> Best,
>> Kyle
>>
>> On Sun, Sep 19, 2021 at 12:22 PM Mark Striebeck <ma...@gmail.com>
>> wrote:
>>
>>> [original message trimmed - see the top of the thread]

Re: Trying to run Beam on Spark cluster

Posted by Mark Striebeck <ma...@gmail.com>.
Thanks Kyle,

On Fri, Sep 24, 2021 at 1:48 PM Kyle Weaver <kc...@google.com> wrote:

> Hi Mark. Looks like a problem with artifact staging. PortableRunner
> implicitly requires a directory (configurable with --artifacts_dir, under
> /tmp by default) that is accessible by both the job server and Beam worker.
>

Hmmm, I guess I could create an NFS share between the machines and use that
for the artifacts_dir. But if I use environment_type=DOCKER, the docker
container won't have access to it. Is there some easy way to modify the
docker command that the worker runs when it starts the container, to map
this directory (via '-v') into the container?
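
For illustration, taking the 'docker run' command from the stack trace, I'd
want it to end up roughly like this (the -v flag and the NFS path are the
hypothetical part):

# the -v mount and the /mnt/nfs path below are hypothetical
docker run -d --network=host \
    -v /mnt/nfs/beam-artifact-staging:/tmp/beam-artifact-staging \
    apache/beam_python3.8_sdk:2.32.0 --id=4-1 \
    --provision_endpoint=localhost:46757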


> You should be able to get around this by using --runner SparkRunner
> instead:
>
> python -m apache_beam.examples.wordcount
>  gs://datapipeline-output/shakespeare-alls-11.txt --output
> gs://datapipeline-output/output/   --project august-ascent-325423
> --environment_type=DOCKER --runner SparkRunner
> --spark_rest_url http://hostname:6066
>
> This requires you to enable REST on your Spark master by setting
> `spark.master.rest.enabled` to `true` in your config, and then setting the
> Beam pipeline option --spark_rest_url to its address (6066 is the default
> port).
>

I'll try that next while waiting for the answer above.

Thanks
     Mark

>
> This starts the job server for you, so you don't need to do that ahead of
> time.
>
> Best,
> Kyle
>
> On Sun, Sep 19, 2021 at 12:22 PM Mark Striebeck <ma...@gmail.com>
> wrote:
>
>> [original message trimmed - see the top of the thread]

Re: Trying to run Beam on Spark cluster

Posted by Kyle Weaver <kc...@google.com>.
Hi Mark. Looks like a problem with artifact staging. PortableRunner
implicitly requires a directory (configurable with --artifacts_dir, under
/tmp by default) that is accessible by both the job server and Beam worker.
You should be able to get around this by using --runner SparkRunner instead:

python -m apache_beam.examples.wordcount
 gs://datapipeline-output/shakespeare-alls-11.txt --output
gs://datapipeline-output/output/   --project august-ascent-325423
--environment_type=DOCKER --runner SparkRunner
--spark_rest_url http://hostname:6066

This requires you to enable REST on your Spark master by setting
`spark.master.rest.enabled` to `true` in your config, and then setting the
Beam pipeline option --spark_rest_url to its address (6066 is the default
port).
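
For example, in your Spark config:

# in conf/spark-defaults.conf on the master (restart the master afterwards);
# the port line is optional since 6066 is already the default
spark.master.rest.enabled  true
spark.master.rest.port     6066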

This starts the job server for you, so you don't need to do that ahead of
time.
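
For reference, if you do want to run the job server yourself, a rough sketch
would be the following (the Spark master URL and the shared staging
directory are placeholders for your setup):

# the master URL and the artifacts dir below are placeholders
docker run --net=host apache/beam_spark_job_server:2.32.0 \
    --spark-master-url=spark://spark-master:7077 \
    --artifacts-dir=/mnt/nfs/beam-artifact-staging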

Best,
Kyle

On Sun, Sep 19, 2021 at 12:22 PM Mark Striebeck <ma...@gmail.com>
wrote:

> [original message trimmed - see the top of the thread]