Posted to user@beam.apache.org by Steve973 <st...@gmail.com> on 2019/09/17 09:51:38 UTC

Any possibility to run larger data sets with DirectRunner?

Hi, all. I would like to begin setting up my workflow in Apache Beam, but
run it only on a local machine until our system administrators have the
capacity to set up an adequate (Spark or Hadoop) cluster. From the
documentation, I understand that we should be mindful of the memory
requirements of the data set that we use, but is there any way (at the
cost of speed, of course) to use a larger data set with the
DirectRunner? Can we configure it to spill to disk, possibly?

Thanks,
Steve

Re: Any possibility to run larger data sets with DirectRunner?

Posted by Lukasz Cwik <lc...@google.com>.
+1 for local execution using Flink.

On Tue, Sep 17, 2019 at 4:24 AM Paweł Kordek <pa...@farfetch.com>
wrote:

> Hi Steve
>
> Maybe local execution on a Flink cluster will work for you:
> https://beam.apache.org/documentation/runners/flink/ ?
>
> Cheers
> Pawel

Re: Any possibility to run larger data sets with DirectRunner?

Posted by Paweł Kordek <pa...@farfetch.com>.
Hi Steve

Maybe local execution on a Flink cluster will work for you:
https://beam.apache.org/documentation/runners/flink/ ?

Cheers
Pawel
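
For reference, a minimal sketch of what "local execution using Flink" could
look like with the Beam Python SDK. It assumes the flags described on the
Flink runner page linked above (--runner=FlinkRunner, --environment_type=LOOPBACK,
and an optional --flink_master); the helper name local_flink_args and the tiny
word-count pipeline are hypothetical and only for illustration. It does not by
itself show whether a given data set will fit on one machine.

```python
def local_flink_args(flink_master=None):
    """Build Beam pipeline flags for running on Flink from a local machine.

    flink_master is optional (for example "localhost:8081" for a Flink
    cluster already running locally). Leaving it unset is assumed here to
    let the runner start its own local Flink instance.
    """
    args = [
        "--runner=FlinkRunner",
        # LOOPBACK runs the Python workers in the submitting process,
        # which is convenient for local testing.
        "--environment_type=LOOPBACK",
    ]
    if flink_master is not None:
        args.append(f"--flink_master={flink_master}")
    return args


def run(argv):
    """Run a tiny illustrative pipeline with the given flags."""
    # Imported lazily so the flag helper above works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create(["a", "b", "a"])
            | "Count" >> beam.combiners.Count.PerElement()
            | "Print" >> beam.Map(print)
        )
```

Calling run(local_flink_args()) would submit the pipeline; this needs
apache-beam installed plus whatever the Flink runner itself requires (such as
a Java runtime), so check the runner page for the versions that match.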

