You are viewing a plain text version of this content. The canonical link for it is here.

Posted to user@beam.apache.org by Steve973 <st...@gmail.com> on 2019/09/25 10:38:28 UTC

A couple questions from someone new to Beam

Hi, all.  I am still ramping up on my learning of how to use Beam, and I
have a couple of questions for the experts.  And, while I have read the
documentation, I have either looked at the wrong parts, or my particular
questions were not specifically answered.  If I have missed something, then
please point me in the right direction.

   1. When using the MongoDB, for reading and writing from an execution
   node, does it need to take the time, each time an executor runs, to set up
   the connection to Mongo?  Or does Beam cache the connections and reuse them
   to mitigate the performance hit of setting up the connection each time?  If
   so, I am curious how it handles that for multiple nodes, unless Beam is
   "smart" enough to pre-cache connections in a pool on execution nodes in
   advance.
   2. When something is executed in parallel (ParDo), do the parallel jobs
   run in one thread on an execution node?  Or, will Beam utilize more
   resources/threads, as available, on a node?  I would like to utilize as
   many threads as possible on available cluster nodes.  My thought is that,
   if a job is stateless, it seems reasonable to be able to utilize multiple
   threads on a node to further parallelize and maximize performance.
   Although, it also occurs to me that this would probably be
   implementation-dependent on the runner.  The other approach that I can see
   is to simply use CompletableFutures in my jobs, which is what I am already
   doing in my code that does not (yet) use Beam. But it would be preferable
   to allow Beam to manage all of the parallelization.

I am sure that I will have some more questions as time goes on, but this
would be great info to have for now.

Thanks,
Steve

Re: A couple questions from someone new to Beam

Posted by Alexey Romanenko <ar...@gmail.com>.

Hi Steve,

1) In general, managing the client connections is a responsibility of IO transform. Usually, one client instance is used per input split (bounded or unbounded source) and it opens a connection in the beginning of reading this split and closes in the end. The is in theory. Practically, every IO can implement it a bit differently and provide some additional options for connection configuration for users, like setting maximum connection time, accepting custom data source provider and connection pool and so on. Every input split will be processed in parallel but this is already a responsibility of backend data processing engine how to schedule and manage it on cluster. 

2) Again, this is responsibility of backend data processing engine how to run it in parallel. Beam is a SDK that allows user to create a pipeline once and run it on different engines. To make it possible, Bean incorporates different runners, that are responsible to translate unified Beam pipeline DAG into specific backend engine one. Also, as Jeff mentioned before, Beam SDK is not thread-safe, so your should assume this while writing your pipeline code. In the same time, your backend engine can decide to fusion some of your pipeline stages, so, you can add break a fusion by adding GBK transform, but number of parallel threads/workers again will be depended on your backend engine. 

> On 25 Sep 2019, at 12:38, Steve973 <st...@gmail.com> wrote:
> 
> Hi, all.  I am still ramping up on my learning of how to use Beam, and I have a couple of questions for the experts.  And, while I have read the documentation, I have either looked at the wrong parts, or my particular questions were not specifically answered.  If I have missed something, then please point me in the right direction.
> When using the MongoDB, for reading and writing from an execution node, does it need to take the time, each time an executor runs, to set up the connection to Mongo?  Or does Beam cache the connections and reuse them to mitigate the performance hit of setting up the connection each time?  If so, I am curious how it handles that for multiple nodes, unless Beam is "smart" enough to pre-cache connections in a pool on execution nodes in advance.
> When something is executed in parallel (ParDo), do the parallel jobs run in one thread on an execution node?  Or, will Beam utilize more resources/threads, as available, on a node?  I would like to utilize as many threads as possible on available cluster nodes.  My thought is that, if a job is stateless, it seems reasonable to be able to utilize multiple threads on a node to further parallelize and maximize performance.  Although, it also occurs to me that this would probably be implementation-dependent on the runner.  The other approach that I can see is to simply use CompletableFutures in my jobs, which is what I am already doing in my code that does not (yet) use Beam. But it would be preferable to allow Beam to manage all of the parallelization.
> I am sure that I will have some more questions as time goes on, but this would be great info to have for now.
> 
> Thanks,
> Steve

Re: A couple questions from someone new to Beam

Posted by Jeff Klukas <jk...@mozilla.com>.

(1) I don't have direct experience with contacting MongoDB in Beam, but my
general expectation is that yes, Beam IOs will do reasonable things for
connection setup and teardown. For the case of MongoDbIO in the Java SDK,
it looks like this connection setup happens in BoundedMongoDbReader [0]. In
general, the IOs in Beam tend to implement classes like BoundedSource that
handle hooking into Beam's execution model and provide hooks for things
like connection setup.

(2) As to threads, there is some relevant documentation in the Beam
Programming Guide about requirements for user transforms [1]. In
particular, that states that instances of the function you provide in your
ParDo will only be accessed by one thread at a time, and "note that static
members in your function object are not passed to worker instances and that
multiple instances of your function may be accessed from different
threads". So details of how many instances of a function run concurrently
is left up to runners to decide, but it certainly allowed by the Beam model.

[0]
https://github.com/apache/beam/blob/master/sdks/java/io/mongodb/src/main/java/org/apache/beam/sdk/io/mongodb/MongoDbIO.java#L642
[1]
https://beam.apache.org/documentation/programming-guide/#requirements-for-writing-user-code-for-beam-transforms

On Wed, Sep 25, 2019 at 6:38 AM Steve973 <st...@gmail.com> wrote:

> Hi, all.  I am still ramping up on my learning of how to use Beam, and I
> have a couple of questions for the experts.  And, while I have read the
> documentation, I have either looked at the wrong parts, or my particular
> questions were not specifically answered.  If I have missed something, then
> please point me in the right direction.
>
>    1. When using the MongoDB, for reading and writing from an execution
>    node, does it need to take the time, each time an executor runs, to set up
>    the connection to Mongo?  Or does Beam cache the connections and reuse them
>    to mitigate the performance hit of setting up the connection each time?  If
>    so, I am curious how it handles that for multiple nodes, unless Beam is
>    "smart" enough to pre-cache connections in a pool on execution nodes in
>    advance.
>    2. When something is executed in parallel (ParDo), do the parallel
>    jobs run in one thread on an execution node?  Or, will Beam utilize more
>    resources/threads, as available, on a node?  I would like to utilize as
>    many threads as possible on available cluster nodes.  My thought is that,
>    if a job is stateless, it seems reasonable to be able to utilize multiple
>    threads on a node to further parallelize and maximize performance.
>    Although, it also occurs to me that this would probably be
>    implementation-dependent on the runner.  The other approach that I can see
>    is to simply use CompletableFutures in my jobs, which is what I am already
>    doing in my code that does not (yet) use Beam. But it would be preferable
>    to allow Beam to manage all of the parallelization.
>
> I am sure that I will have some more questions as time goes on, but this
> would be great info to have for now.
>
> Thanks,
> Steve
>