Posted to user@beam.apache.org by Xander Song <ia...@gmail.com> on 2020/01/30 22:52:13 UTC

Running Beam Pipelines with GPUs (and other questions)

Hello,

I am new to the Apache ecosystem and am attempting to use Beam to build a
horizontally scalable pipeline for feature extraction from video data. The
extraction process for certain features can be accelerated using GPUs,
while other features require only a CPU to compute. I have several
questions, listed in order of decreasing priority:

   1. Can I run a Beam pipeline with GPUs? (as far as I can tell, Google
   Cloud Dataflow does not currently support this option)
   2. Is it possible to achieve this functionality using Spark or Flink as
   a runner?
   3. Is it possible to mix hardware types in a Beam pipeline (e.g., to
   have certain features extracted by CPUs and others extracted by GPUs), or
   does this go against the Beam paradigm of abstracting away such details?
   4. Do the Spark and Flink runners have support for auto-scaling like
   Google Cloud Dataflow?
   5. What are relevant considerations when selecting between Spark vs.
   Flink as a runner?

Any guidance, resources, or tips are appreciated. Thank you in advance!
-Xander

Re: Running Beam Pipelines with GPUs (and other questions)

Posted by Valentyn Tymofieiev <va...@google.com>.
Hi, some responses inline:

On Thu, Jan 30, 2020 at 2:52 PM Xander Song <ia...@gmail.com> wrote:

> Hello,
>
> I am new to the Apache ecosystem and am attempting to use Beam to build a
> horizontally scalable pipeline for feature extraction from video data. The
> extraction process for certain features can be accelerated using GPUs,
> while other features require only a CPU to compute. I have several
> questions, listed in order of decreasing priority:
>
>    1. Can I run a Beam pipeline with GPUs? (as far as I can tell, Google
>    Cloud Dataflow does not currently support this option)
>

There was a thread on the user list [1] that discusses this. I think the
status quo hasn't changed much since then.

>
>    2. Is it possible to achieve this functionality using Spark or Flink
>    as a runner?
>

It should be possible, although I have not tried it. Beam pipelines can run
on Flink/Spark clusters, and it is possible to create a Flink cluster
with GPUs. Beam custom containers [4] can provide a way to manage the
required GPU dependencies (CUDA toolkit, cuDNN, etc.). Google Cloud Dataproc
offers a way to create managed Flink/Spark clusters and to attach GPUs to
them [2].
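
As a rough illustration of the setup described above, the sketch below builds
the option list for a Beam Python pipeline that targets a Flink cluster and
uses a custom worker container. The flag names reflect my understanding of the
Beam Flink runner documentation and may differ between Beam versions; the
helper name, master address, and image name are hypothetical placeholders, and
the image is assumed to already bundle the GPU dependencies.

```python
# Sketch only: flink_gpu_pipeline_args is a hypothetical helper, and the
# master address and image name used below are placeholders.
def flink_gpu_pipeline_args(flink_master, container_image):
    """Build command-line style options for a Beam Python pipeline on Flink.

    Assumes container_image is a custom Beam SDK container that includes
    the GPU dependencies (CUDA toolkit, cuDNN, etc.).
    """
    return [
        "--runner=FlinkRunner",
        f"--flink_master={flink_master}",
        "--environment_type=DOCKER",
        f"--environment_config={container_image}",
    ]


args = flink_gpu_pipeline_args(
    "localhost:8081", "example.com/beam-gpu-worker:latest"
)
print(args)
# With Beam installed, the list could then be handed to the SDK, e.g.:
#   from apache_beam.options.pipeline_options import PipelineOptions
#   options = PipelineOptions(args)
```

Whether the container can actually see the GPUs depends on how the cluster
nodes and their container runtime are configured, which this sketch does not
cover.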

>
>    3. Is it possible to mix hardware types in a Beam pipeline (e.g., to
>    have certain features extracted by CPUs and others extracted by GPUs), or
>    does this go against the Beam paradigm of abstracting away such details?
>

It does not go against the paradigm, but support for annotating parts of
Beam pipelines with hardware requirements has not been implemented yet [3].
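
Until such annotations exist, one possible workaround (an assumption on my
part, not an established Beam pattern) is for the feature extraction code to
pick a device at runtime on whichever worker it lands. The sketch below uses a
hypothetical helper, pick_device, and treats the presence of nvidia-smi on the
PATH as a crude signal that a GPU is available.

```python
import shutil


def pick_device(gpu_capable, which=shutil.which):
    """Return "gpu" or "cpu" for one feature extractor.

    gpu_capable: whether this extractor can be accelerated by a GPU.
    which: lookup function for executables, injectable for testing.
    Assumption: finding nvidia-smi on the PATH means a GPU is usable.
    """
    has_gpu = which("nvidia-smi") is not None
    return "gpu" if gpu_capable and has_gpu else "cpu"


# Example: a CPU-only feature never requests the GPU.
print(pick_device(gpu_capable=False))
```

Inside a pipeline, a call like this could run once per worker (for example in
a DoFn's setup method) so that the same transform falls back to the CPU on
machines without a GPU.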

>
>    4. Do the Spark and Flink runners have support for auto-scaling like
>    Google Cloud Dataflow?
>

Autoscaling would need to be implemented in Flink/Spark itself rather than
in the Beam Flink/Spark runners. To my knowledge, the answer is no.

>
>    5. What are relevant considerations when selecting between Spark vs.
>    Flink as a runner?
>

Language support, pipeline type (batch/streaming), and runner capabilities
are all relevant considerations. There are two Spark runners: portable
(Python, Java, Go; supports custom containers) and non-portable (Java only).
I think you'd want to go with a portable runner for your use case. Among
portable runners, I think Flink had the most capabilities implemented as of
last year, see [5] and [6], but the information may be out of date.

> Any guidance, resources, or tips are appreciated. Thank you in advance!
> -Xander
>

[1] https://lists.apache.org/thread.html/00c1b5b44204b5c7f33bdae53da20d84739e1f80c3c286db8a9151b6%40%3Cuser.beam.apache.org%3E
[2] https://cloud.google.com/dataproc/docs/release-notes#September_24_2019
[3] https://issues.apache.org/jira/browse/BEAM-2085
[4] https://beam.apache.org/documentation/runtime/environments/
[5] https://s.apache.org/apache-beam-portability-support-table
[6] https://beam.apache.org/documentation/runners/capability-matrix/