Posted to user@beam.apache.org by Steve973 <st...@gmail.com> on 2019/09/17 09:51:38 UTC
Any possibility to run larger data sets with DirectRunner?
Hi, all. I would like to start setting up my workflow in Apache Beam, but
run it only on a local machine until our system administrators have the
capacity to set up an adequate (Spark or Hadoop) cluster. From the
documentation, I understand that we should be mindful of the memory
requirements of the data set we use, but is there any way (at the cost
of speed, of course) to process a larger data set with the
DirectRunner? Can we configure it to spill to disk, possibly?
Thanks,
Steve
Re: Any possibility to run larger data sets with DirectRunner?
Posted by Lukasz Cwik <lc...@google.com>.
+1 for local execution using Flink.
On Tue, Sep 17, 2019 at 4:24 AM Paweł Kordek <pa...@farfetch.com>
wrote:
> Hi Steve
>
> Maybe local execution on a Flink cluster will work for you:
> https://beam.apache.org/documentation/runners/flink/ ?
>
> Cheers
> Pawel
Re: Any possibility to run larger data sets with DirectRunner?
Posted by Paweł Kordek <pa...@farfetch.com>.
Hi Steve
Maybe local execution on a Flink cluster will work for you:
https://beam.apache.org/documentation/runners/flink/ ?
Cheers
Pawel
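For readers trying the suggestion above, here is a minimal sketch of pointing a Beam Python pipeline at an embedded, single-machine Flink instead of the DirectRunner. It is not taken from this thread: the helper names flink_local_args and run are hypothetical, the toy word count is only a placeholder, and the flags are the ones I understand the Beam Python SDK's Flink runner to accept (--flink_master, --parallelism, --environment_type). Flag names may differ between Beam versions, and the runner needs a Java runtime available locally, so please check the Flink runner page linked above before relying on it.

```python
def flink_local_args(parallelism=2):
    """Build Beam flags for an embedded, single-machine Flink run.

    Hypothetical helper; the flag names are assumptions based on the
    Beam Python SDK documentation and may vary by version.
    """
    return [
        "--runner=FlinkRunner",
        # "[local]" asks the runner to start a local Flink in the JVM
        # instead of connecting to an existing cluster.
        "--flink_master=[local]",
        f"--parallelism={parallelism}",
        # LOOPBACK runs the Python SDK workers in the submitting process,
        # which avoids needing Docker on a development machine.
        "--environment_type=LOOPBACK",
    ]


def run(argv):
    """Run a placeholder pipeline with the given flags (needs apache_beam)."""
    # Imported lazily so flink_local_args() works without Beam installed.
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    with beam.Pipeline(options=PipelineOptions(argv)) as p:
        (p
         | "Create" >> beam.Create(["a", "b", "a"])
         | "Count" >> beam.combiners.Count.PerElement()
         | "Print" >> beam.Map(print))
```

Calling run(flink_local_args(parallelism=4)) would submit the placeholder pipeline. Moving to a real cluster later should mostly be a matter of changing the --flink_master value to the cluster's host:port, assuming the rest of the pipeline is runner-independent.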