Posted to user@beam.apache.org by 赵 毓文 <zh...@outlook.com> on 2023/03/06 09:07:17 UTC

[Question] dataproc(gce)+spark+beam(python) seems to use only one worker

Hi Beam Community,

I followed this doc (https://beam.apache.org/documentation/runners/spark/#running-on-dataproc-cluster-yarn-backed)
and tried to use Dataproc to run my Beam pipeline.

I created a Dataproc cluster (1 master, 2 workers), but I can see only one worker doing any work (checked with `docker ps` on each worker node).
I also increased the cluster to 4 workers and got the same result.

Are there some parameters I could use, or am I doing something wrong?
I also tried dataproc(gce)+flink+beam(python) with the `parallelism` option, but that does not seem to work either.
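For the Flink attempt, the options were roughly along these lines (only a sketch; the `flink_master` address below is a placeholder, not the real address from my cluster):
```
from apache_beam.options.pipeline_options import PipelineOptions

# Sketch of the Flink attempt; flink_master is a placeholder address.
flink_options = PipelineOptions([
    '--runner=FlinkRunner',
    '--flink_master=localhost:8081',  # placeholder; the real JobManager address was used
    '--parallelism=4',
])
```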


This is my code.
```
import apache_beam as beam
import logging
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    '--runner=SparkRunner',
    '--output_executable_path=job.jar',  # write an executable jar instead of running directly
    '--spark_version=3',
])

class SplitWords(beam.DoFn):
  def __init__(self, delimiter=','):
    self.delimiter = delimiter

  def process(self, text):
    # Sleep 10 seconds per element so the work takes long enough
    # to observe on the worker nodes.
    import time
    time.sleep(10)

    for word in text.split(self.delimiter):
      yield word

logging.getLogger().setLevel(logging.INFO)
data = ['Strawberry,Carrot,Eggplant', 'Tomato,Potato'] * 800  # 1,600 elements in total



with beam.Pipeline(options=options) as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create(data)
      | 'Split words' >> beam.ParDo(SplitWords(','))
      | beam.Map(print))
```
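
Would adding a `Reshuffle` after the `Create` help here? A sketch of what I mean (untested on Dataproc, reusing `options`, `data`, and `SplitWords` from above):
```
import apache_beam as beam

# Untested variant of the pipeline above: a Reshuffle after Create,
# so elements are redistributed instead of staying in one fused bundle.
with beam.Pipeline(options=options) as pipeline:
  plants = (
      pipeline
      | 'Gardening plants' >> beam.Create(data)
      | 'Redistribute' >> beam.Reshuffle()
      | 'Split words' >> beam.ParDo(SplitWords(','))
      | beam.Map(print))
```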

Thanks
Yuwen Zhao