Posted to user@beam.apache.org by 赵 毓文 <zh...@outlook.com> on 2023/03/06 09:07:17 UTC
[Question] Dataproc (GCE) + Spark + Beam (Python) seems to use only one worker
Hi Beam Community,
I followed this doc (https://beam.apache.org/documentation/runners/spark/#running-on-dataproc-cluster-yarn-backed) to run my Beam pipeline on Dataproc.
I created a Dataproc cluster (1 master, 2 workers), but only one worker appears to be doing any work (I checked with `docker ps` on each worker node).
I also increased the cluster to 4 workers and got the same result.
Are there any parameters I could use, or am I doing something wrong?
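One idea I came across while searching (just a guess on my part, I am not sure it is the right fix) is adding a `beam.Reshuffle()` right after `Create` to break fusion so the elements get redistributed across workers. A minimal sketch, reusing the names from my code below (the 'Redistribute' label is mine):
```
# Same pipeline as in my code further down, with one extra step:
# Reshuffle after Create to force a redistribution of the
# in-memory elements across workers.
plants = (
    pipeline
    | 'Gardening plants' >> beam.Create(data)
    | 'Redistribute' >> beam.Reshuffle()  # break fusion, spread the work
    | 'Split words' >> beam.ParDo(SplitWords(','))
    | beam.Map(print))
```
Would something like that be the right direction?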
I also tried Dataproc (GCE) + Flink + Beam (Python) with the `parallelism` option; that did not seem to work either.
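For reference, this is roughly how I passed it (a simplified sketch of my options; other options are omitted and 4 is just an example value):
```
from apache_beam.options.pipeline_options import PipelineOptions

# Simplified sketch of the Flink attempt: I expected 4 parallel
# workers, but still saw only one doing work.
options = PipelineOptions([
    '--runner=FlinkRunner',
    '--parallelism=4',
])
```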
This is my code.
```
import logging
import time

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=SparkRunner',
    '--output_executable_path=job.jar',
    '--spark_version=3',
])

class SplitWords(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        time.sleep(10)  # simulate slow per-element work
        for word in text.split(self.delimiter):
            yield word

logging.getLogger().setLevel(logging.INFO)

# 1600 small elements, so there should be plenty to spread across workers.
data = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato'] * 800

with beam.Pipeline(options=options) as pipeline:
    plants = (
        pipeline
        | 'Gardening plants' >> beam.Create(data)
        | 'Split words' >> beam.ParDo(SplitWords(','))
        | beam.Map(print))
```
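In case it helps, this is a small diagnostic version of the DoFn I can use to see which host/process handles each element (a sketch, not in my original code; the log message wording is mine):
```
import logging
import os
import socket

import apache_beam as beam

class SplitWords(beam.DoFn):
    def __init__(self, delimiter=','):
        self.delimiter = delimiter

    def process(self, text):
        # Log the host and pid so I can see how elements are distributed
        # across the Dataproc workers.
        logging.info('processing %r on %s (pid %d)',
                     text, socket.gethostname(), os.getpid())
        for word in text.split(self.delimiter):
            yield word
```
With this, the logs only ever show one hostname, which matches what I see with `docker ps`.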
Thanks
Yuwen Zhao