Posted to user@beam.apache.org by Cristian Garcia <cg...@gmail.com> on 2018/06/21 15:48:49 UTC

Python: Single vs Multiple DoFns for Image Processing

Hi,

I am running Beam with the DataflowRunner and want to do 3 tasks:

   1. Read an image from GCS
   2. Process the image (data augmentation)
   3. Serialize the image to a string

I could do all this in a single DoFn, but I could also split it into these
3 stages. I don't know what would be better given the Beam model. Here are
some thoughts:

   - Doing it in a single DoFn wastes concurrency: with separate stages, one
   could be reading an image while another does the processing.
   - Doing it in multiple DoFns might mean sending the images through the
   network, increasing latency.

Sorry if these questions are very basic; I am trying to get my head around
this. The pipeline I currently have processes about 15 imgs/sec, which
seems really slow. Dataflow suggests that I increase some quotas to enable
around 400 workers (is this overkill?).

Regards,
Cristian

Re: Python: Single vs Multiple DoFns for Image Processing

Posted by Cristian Garcia <cg...@gmail.com>.
Hi Robert!

I read the images from GCS using TensorFlow's "FileIO" module.
I am starting to realize that the bottleneck may be the machine type, so I
will try machines with better CPUs to process the images.
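
As a rough sketch of the machine-type idea above, the worker type and worker
cap can be passed as pipeline flags. The helper name dataflow_args and the
default values (n1-highcpu-8, 400) are assumptions chosen for illustration,
not settings taken from this thread.

```python
def dataflow_args(machine_type="n1-highcpu-8", max_workers=400):
    """Build Dataflow flags; the defaults are illustrative assumptions."""
    return [
        "--runner=DataflowRunner",
        # CPU-heavy augmentation may benefit from a high-CPU machine type.
        f"--worker_machine_type={machine_type}",
        # Upper bound for autoscaling; project quotas must allow it.
        f"--max_num_workers={max_workers}",
    ]
```

The list would typically be combined with the usual project and
temp-location flags and passed to apache_beam's PipelineOptions.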

Regards,
Cristian

On Thu, Jun 21, 2018 at 10:52 AM Robert Bradshaw <ro...@google.com>
wrote:

> I would write these as three separate DoFns; they will get fused together
> to minimize IO.
>
> 400 workers may not be overkill, depending on how many images you have. Is
> Dataflow not scaling up and sharing the work? Where is your list of images
> coming from?

Re: Python: Single vs Multiple DoFns for Image Processing

Posted by Robert Bradshaw <ro...@google.com>.
I would write these as three separate DoFns; they will get fused together
to minimize IO.
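
A minimal sketch of the three-step layout suggested above, assuming Beam's
FileSystems API for reading (the original pipeline used TensorFlow's FileIO).
The augment and serialize bodies are hypothetical stand-ins, not the real
processing. beam.Map wraps each function in a simple DoFn, and a runner such
as Dataflow would normally fuse the consecutive steps so elements stay on one
worker rather than crossing the network.

```python
import base64


def read_image(path):
    """Step 1: read the raw bytes for one path (local or gs://)."""
    # Lazy import so the other steps can be exercised without Beam installed.
    from apache_beam.io.filesystems import FileSystems
    with FileSystems.open(path) as f:
        return path, f.read()


def augment(element):
    """Step 2: hypothetical stand-in for augmentation (reverses the bytes)."""
    path, data = element
    return path, data[::-1]


def serialize(element):
    """Step 3: stand-in serialization of the image bytes to a base64 string."""
    _, data = element
    return base64.b64encode(data).decode("ascii")


def build(pipeline, paths):
    """Chain the three steps as separate transforms on a list of paths."""
    import apache_beam as beam
    return (
        pipeline
        | "Paths" >> beam.Create(paths)
        | "Read" >> beam.Map(read_image)
        | "Augment" >> beam.Map(augment)
        | "Serialize" >> beam.Map(serialize)
    )
```

Keeping the steps separate also means each one shows up under its own name
in the Dataflow monitoring UI, which may help locate the slow stage.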

400 workers may not be overkill, depending on how many images you have. Is
Dataflow not scaling up and sharing the work? Where is your list of images
coming from?
