Posted to user@beam.apache.org by Shree Tanna <sh...@gmail.com> on 2022/07/19 16:40:18 UTC

[Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Hi all,

I'm planning to use Apache Beam for the extract and load parts of an ETL
pipeline and to run the jobs on Dataflow. I will have to do REST API
ingestion on our platform. I can opt to make synchronous API calls from a
DoFn, but with that approach the pipeline will stall while REST requests
are made over the network.
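
Roughly, the DoFn I have in mind makes one blocking call per element,
something like this (a minimal sketch; the endpoint and helper names are
placeholders):

import apache_beam as beam
import requests

class FetchFromApiFn(beam.DoFn):
    """Makes one blocking REST call per element; the bundle stalls on I/O."""

    def setup(self):
        # Reuse one HTTP session per DoFn instance across bundles.
        self.session = requests.Session()

    def process(self, resource_id):
        # Placeholder endpoint: fetch the record for this resource id.
        response = self.session.get(
            f"https://api.example.com/v1/resources/{resource_id}",
            timeout=30)
        response.raise_for_status()
        yield response.json()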

Is it best practice to run REST ingestion jobs on Dataflow? Is there any
best practice I can follow to accomplish this? Just as a reference, I'm
adding this
<https://stackoverflow.com/questions/50335521/best-practices-in-http-calls-in-cloud-dataflow-java>
StackOverflow thread here too. Also, I notice that the Rest I/O transform
<https://beam.apache.org/documentation/io/built-in/> built-in connector is
in progress for Java.

Let me know if this is the right group to ask this question. I can also ask
dev@beam.apache.org if needed.
-- 
Thanks,
Shree

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Posted by Shree Tanna <sh...@gmail.com>.
Thanks all, this discussion was very helpful. I will test all the ideas out.


-- 
Best,
Shree

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Posted by Chamikara Jayalath via user <us...@beam.apache.org>.
On Wed, Jul 20, 2022 at 12:57 PM Chamikara Jayalath <ch...@google.com>
wrote:

> I don't think it's an antipattern per se. You can implement arbitrary
> operations in a DoFn or an SDF to read data.
>
> But if a single resource ID maps to a large amount of data, Beam runners
> (including Dataflow) will be able to parallelize reading, hence your
> solution may have suboptimal performance compared to reading from a Beam
> source that can be fully parallelized.
>

*will not be able to*

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Posted by Chamikara Jayalath via user <us...@beam.apache.org>.
I don't think it's an antipattern per se. You can implement arbitrary
operations in a DoFn or an SDF to read data.

But if a single resource ID maps to a large amount of data, Beam runners
(including Dataflow) will be able to parallelize reading, hence your
solution may have suboptimal performance compared to reading from a Beam
source that can be fully parallelized.

Thanks,
Cham


Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Posted by Shree Tanna <sh...@gmail.com>.
Thank you!
I will try this out.
One more question on this: is it considered an anti-pattern to do HTTP
ingestion on GCP Dataflow, due to the reasoning I mentioned in my original
message? I ask because I am getting that indication from some of my
co-workers and also from Google Cloud support. Not sure if this is the
right place to ask this question; happy to move this conversation
somewhere else if not.


-- 
Best,
Shree

Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Posted by Luke Cwik via user <us...@beam.apache.org>.
Even if you don't have the resource ids ahead of time, you can have a
pipeline like:
Impulse -> ParDo(GenerateResourceIds) -> Reshuffle ->
ParDo(ReadResourceIds) -> ...

You could also compose these as splittable DoFns [1, 2, 3]:
ParDo(SplittableGenerateResourceIds) -> ParDo(SplittableReadResourceIds)

The first approach is the simplest: the reshuffle will rebalance the
reading of each resource id across worker nodes, but it is limited to
generating the resource ids on a single worker. Making the generation a
splittable DoFn means you can increase the parallelism of generation, which
is important if there are so many ids that generating them could crash a
worker or fail to have the output committed (these kinds of failures are
runner dependent, based on how well the runner handles single bundles with
large outputs). Making the reading splittable allows you to handle a large
resource (imagine a large file) so that it can be read and processed in
parallel (with similar failures if the runner can't handle single bundles
with large outputs).
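
In Python, the first shape might look roughly like this (a sketch only;
list_resource_ids and fetch_resource stand in for your API calls):

import apache_beam as beam

def generate_resource_ids(_):
    # Placeholder: a single blocking call that lists everything to read.
    yield from list_resource_ids()

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Impulse()
        | "GenerateResourceIds" >> beam.FlatMap(generate_resource_ids)
        | beam.Reshuffle()  # rebalances the ids across workers
        | "ReadResourceIds" >> beam.Map(fetch_resource)
    )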

You can always start with the first solution and swap either piece to be a
splittable DoFn depending on your performance requirements and how well the
simple solution works.

1: https://beam.apache.org/blog/splittable-do-fn/
2: https://beam.apache.org/blog/splittable-do-fn-is-available/
3: https://beam.apache.org/documentation/programming-guide/#splittable-dofns
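
As a rough illustration of making the reading splittable, following the
Python SDF pattern in [3] (a sketch only; count_pages and fetch_page are
placeholder helpers for a paged REST resource):

import apache_beam as beam
from apache_beam.io.restriction_trackers import (
    OffsetRange, OffsetRestrictionTracker)
from apache_beam.transforms.core import RestrictionProvider

class PagesRestrictionProvider(RestrictionProvider):
    def initial_restriction(self, resource_id):
        # Placeholder: ask the API how many pages this resource has.
        return OffsetRange(0, count_pages(resource_id))

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, resource_id, restriction):
        return restriction.size()

class SplittableReadResourceIdsFn(beam.DoFn):
    def process(
        self,
        resource_id,
        tracker=beam.DoFn.RestrictionParam(PagesRestrictionProvider())):
        page = tracker.current_restriction().start
        while tracker.try_claim(page):
            # Placeholder: one blocking REST call per page; the runner can
            # split the unclaimed page range onto other workers.
            yield fetch_page(resource_id, page)
            page += 1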



Re: [Dataflow][Python] Guidance on HTTP ingestion on Dataflow

Posted by Damian Akpan <da...@gmail.com>.
Provided you have all the resource ids ahead of fetching, Beam will spread
the fetches across its workers. Each fetch will still be synchronous, but
only within its worker.
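
For example (a sketch, assuming a FetchFromApiFn DoFn that makes the
blocking call and a resource_ids list known up front):

import apache_beam as beam

with beam.Pipeline() as pipeline:
    _ = (
        pipeline
        | beam.Create(resource_ids)     # ids known ahead of fetching
        | beam.Reshuffle()              # spread them across workers
        | beam.ParDo(FetchFromApiFn())  # each fetch blocks only its worker
    )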
