Posted to user@beam.apache.org by Ana Markovic <am...@york.ac.uk> on 2021/09/07 10:32:27 UTC

[GENERAL QUESTION] How independent are worker nodes

To whom this may concern,

I've been looking into polyglot data processing frameworks recently. I have
read Beam's documentation and developed a few examples to get some hands-on
experience. One thing I haven't found in the documentation: is there a way
to set up worker nodes so that they are "opinionated" or "smart", in the
sense that they can decide for themselves which jobs they will perform? For
example, in a word count pipeline, an opinionated worker node could decide
to monitor occurrences of a word only if it is among the node's favourite
words.

I hope I explained it well, but please let me know if more details are
needed to answer this question.

Thankful in advance,
Ana

Re: [GENERAL QUESTION] How independent are worker nodes

Posted by Ana Markovic <am...@york.ac.uk>.
Hi everyone and thanks for all the replies and suggestions!

Stateful DoFns seem like something that could do the trick. I'll give it a
try and let you know if I have any particular feedback (compared to the
example I shared previously).


Thanks,
Ana


On Wed, 8 Sept 2021 at 05:14, Thanh Phan Truong <th...@quod.ai> wrote:

> Hi Ana,
>
> I faced an issue where multiple workers needed to access files from
> downloaded repositories. In my experience, you could try an NFS disk so
> that multiple workers can share the same disk. Performance is slower, so you
> could copy the files to a local disk for git operations.
>
> For a Flink on K8s cluster, setting up an NFS disk is quite easy. You can
> also use an AWS volume type that supports ReadWriteMany, such as EFS (EBS
> volumes are generally ReadWriteOnce).
>
> Best,
>
> Thanh

Re: [GENERAL QUESTION] How independent are worker nodes

Posted by Thanh Phan Truong <th...@quod.ai>.
Hi Ana,

I faced an issue where multiple workers needed to access files from downloaded repositories. In my experience, you could try an NFS disk so that multiple workers can share the same disk. Performance is slower, so you could copy the files to a local disk for git operations.

For a Flink on K8s cluster, setting up an NFS disk is quite easy. You can also use an AWS volume type that supports ReadWriteMany, such as EFS (EBS volumes are generally ReadWriteOnce).

Best,
Thanh

On Sep 8 2021, at 12:12 am, Ana Markovic <am...@york.ac.uk> wrote:
> Hi Jan,
>
> Thanks for the fast reply! I came across an example that I wanted to recreate in Beam, and I'm sharing the link below. Generally speaking, nodes keep their favourite words and accept only jobs that involve those favourites. This is a simple example but could be beneficial in processing large pieces of data (for example, software repositories), where nodes could work on the repositories they already processed (and have some files already downloaded) and avoid downloading unnecessary repository contents if another node already has them. This could be enabled by allowing nodes to check their internal state and decide if they want to accept/reject a certain repository as a job. I know that the "more complicated" example might be far-fetched, but I wanted to give you more context on what I'd want to know about Beam.
>
> Thanks for all the insights!
>
> Best,
> Ana
>
> [1] https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated


Re: [GENERAL QUESTION] How independent are worker nodes

Posted by Luke Cwik <lc...@google.com>.
+1 on using a stateful DoFn.

On Tue, Sep 7, 2021 at 11:29 AM Jan Lukavský <je...@seznam.cz> wrote:

> Hi Ana,
>
> What you describe sounds like logical grouping to me. For example, when
> Beam runs a stateful operation (DoFn), every record has to be associated
> with a _key_. All records with the same key are then processed by the same
> worker. If you have some resources that need to be downloaded (cached) from
> outside the Pipeline, one option would be to use a stateful DoFn,
> which would look into its local cache (held in state) and download the
> required resource if it does not have it (or if it is stale). More logic
> would probably be needed around freeing the state, but I'll leave that
> out for now.
>
> Would that work for your case?
>
>  Jan

Re: [GENERAL QUESTION] How independent are worker nodes

Posted by Jan Lukavský <je...@seznam.cz>.
Hi Ana,

What you describe sounds like logical grouping to me. For example, when
Beam runs a stateful operation (DoFn), every record has to be associated
with a _key_. All records with the same key are then processed by the
same worker. If you have some resources that need to be downloaded
(cached) from outside the Pipeline, one option would be to use a
stateful DoFn, which would look into its local cache (held in state)
and download the required resource if it does not have it (or if it is
stale). More logic would probably be needed around freeing the
state, but I'll leave that out for now.
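
The caching idea can be sketched in plain Python. This is only a model of
the behaviour, not Beam's state API: the class CachingKeyedProcessor, the
fake_fetch download function, the repository names and the 60-second
staleness threshold are all assumptions made for illustration. Each key
(here, a repository name) has its own state cell, and the resource is
downloaded only when that cell is empty or stale.

```python
import time
from typing import Callable, Dict, Tuple


class CachingKeyedProcessor:
    """Plain-Python stand-in for a stateful DoFn: one state cell per key."""

    def __init__(self, fetch: Callable[[str], str], max_age_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self._fetch = fetch              # hypothetical download function
        self._max_age = max_age_seconds  # assumed staleness threshold
        self._clock = clock
        # key -> (time the resource was fetched, the resource itself)
        self._state: Dict[str, Tuple[float, str]] = {}
        self.downloads = 0

    def process(self, key: str, element: str) -> str:
        now = self._clock()
        cached = self._state.get(key)
        if cached is None or now - cached[0] > self._max_age:
            # Nothing cached for this key, or the copy is stale: download it.
            resource = self._fetch(key)
            self._state[key] = (now, resource)
            self.downloads += 1
        else:
            resource = cached[1]
        return f"{element} processed with {resource}"

    def clear(self, key: str) -> None:
        """Free the state for a key, e.g. once a repository is finished."""
        self._state.pop(key, None)


def fake_fetch(repo: str) -> str:
    # Stands in for cloning or downloading a repository.
    return f"contents-of-{repo}"


processor = CachingKeyedProcessor(fake_fetch, max_age_seconds=60.0)
results = [
    processor.process(repo, task)
    for repo, task in [
        ("repo-a", "count-files"),
        ("repo-b", "count-files"),
        ("repo-a", "list-authors"),
    ]
]
print(processor.downloads)  # two downloads for three records
```

In this model the second record for "repo-a" reuses the cached copy. In a
real pipeline the runner would hold the per-key state, and the clear step
would need its own trigger, which is the extra logic mentioned above.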

Would that work for your case?

  Jan

On 9/7/21 7:12 PM, Ana Markovic wrote:
> Hi Jan,
>
> Thanks for the fast reply! I came across an example that I wanted to 
> recreate in Beam, and I'm sharing the link below. Generally speaking, 
> nodes keep their favourite words and accept only jobs that involve 
> those favourites. This is a simple example but could be beneficial in 
> processing large pieces of data (for example, software repositories), 
> where nodes could work on the repositories they already processed (and 
> have some files already downloaded) and avoid downloading unnecessary 
> repository contents if another node already has them. This could be 
> enabled by allowing nodes to check their internal state and decide if 
> they want to accept/reject a certain repository as a job. I know that 
> the "more complicated" example might be far-fetched, but I wanted to
> give you more context on what I'd want to know about Beam.
>
> Thanks for all the insights!
>
> Best,
> Ana
>
> [1] 
> https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated 

Re: [GENERAL QUESTION] How independent are worker nodes

Posted by Ana Markovic <am...@york.ac.uk>.
Hi Jan,

Thanks for the fast reply! I came across an example that I wanted to
recreate in Beam, and I'm sharing the link below. Generally speaking, nodes
keep their favourite words and accept only jobs that involve those
favourites. This is a simple example but could be beneficial in processing
large pieces of data (for example, software repositories), where nodes
could work on the repositories they already processed (and have some files
already downloaded) and avoid downloading unnecessary repository contents
if another node already has them. This could be enabled by allowing nodes
to check their internal state and decide if they want to accept/reject a
certain repository as a job. I know that the "more complicated" example
might be far-fetched, but I wanted to give you more context on what I'd
want to know about Beam.

Thanks for all the insights!

Best,
Ana

[1]
https://github.com/crossflowlabs/crossflow/tree/master/org.crossflow.tests/src/org/crossflow/tests/opinionated


On Tue, 7 Sept 2021 at 13:57, Jan Lukavský <je...@seznam.cz> wrote:

> Hi Ana,
>
> In general, worker nodes do not share any state, and cannot themselves
> decide which work to accept and which to reject. How the work is
> distributed to downstream processing is defined by a runner, not the Beam
> model. On the other hand, what you ask for might be accomplished
> using a grouping operation: either a GroupByKey or a stateful DoFn might
> help you with that. Can you describe your intent further?
>
> Best,
>
>  Jan

Re: [GENERAL QUESTION] How independent are worker nodes

Posted by Jan Lukavský <je...@seznam.cz>.
Hi Ana,

In general, worker nodes do not share any state, and cannot themselves
decide which work to accept and which to reject. How the work is
distributed to downstream processing is defined by a runner, not the
Beam model. On the other hand, what you ask for might be
accomplished using a grouping operation: either a GroupByKey or a
stateful DoFn might help you with that. Can you describe your intent
further?
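
To make the grouping idea concrete, here is a small plain-Python model.
It is not Beam code, and it does not describe how any particular runner
assigns work: the assign_worker hash function, the worker count of 3 and
the sample words are assumptions chosen for illustration. The point it
shows is that once records carry a key, every record with the same key
ends up in the same group, so a word is always counted in one place.

```python
import hashlib
from collections import defaultdict


def assign_worker(key: str, num_workers: int) -> int:
    """Map a key to a worker index with a stable hash (illustrative only)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_workers


def route_by_key(records, num_workers):
    """Group (key, value) records per worker, then per key."""
    workers = defaultdict(lambda: defaultdict(list))
    for key, value in records:
        workers[assign_worker(key, num_workers)][key].append(value)
    return workers


records = [("beam", 1), ("flink", 1), ("beam", 1), ("spark", 1), ("beam", 1)]
routed = route_by_key(records, num_workers=3)

# Each key lives on exactly one worker, so per-key sums are complete.
word_counts = {
    key: sum(values)
    for groups in routed.values()
    for key, values in groups.items()
}
print(word_counts)
```

The worker does not choose its words here; the key does. That is the
difference from an "opinionated" node, but the effect (one place owns
each word) is similar.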

Best,

  Jan

On 9/7/21 12:32 PM, Ana Markovic wrote:
> To whom this may concern,
>
> I've been looking into polyglot data processing frameworks recently. I
> have read Beam's documentation and developed a few examples to get some
> hands-on experience. One thing I haven't found in the documentation: is
> there a way to set up worker nodes so that they are "opinionated" or
> "smart", in the sense that they can decide for themselves which jobs they
> will perform? For example, in a word count pipeline, an opinionated worker
> node could decide to monitor occurrences of a word only if it is among the
> node's favourite words.
>
> I hope I explained it well, but please let me know if more details are 
> needed to answer this question.
>
> Thankful in advance,
> Ana