Posted to user@beam.apache.org by André Rocha Silva <a....@portaltelemedicina.com.br> on 2020/03/24 20:34:56 UTC

N TFRecords in Dataflow

Fellow beamers

I am facing a problem regarding the architecture of an ML pipeline running
in batch mode.

I have several images that I need to split across a number of TFRecords in
order to train a model.
The problem is that the number of images may vary, so I have no idea how
many shards I will need upfront.

1) Is it possible to tell the function "beam.io.tfrecordio.WriteToTFRecord()"
how many images to put in each TFRecord?
The location where the TFRecords are saved is also decided by the pipeline.
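For what it's worth, WriteToTFRecord takes a fixed num_shards rather than a
records-per-file count, so one possible workaround is to count the images
before constructing the pipeline and derive the shard count from that. A
minimal sketch (n_images, images_per_shard, and output_prefix are
hypothetical names, not part of the Beam API):

```python
import math

def num_shards_for(total_images, images_per_shard):
    """Derive a shard count so each TFRecord holds roughly
    images_per_shard images; always at least one shard."""
    return max(1, math.ceil(total_images / images_per_shard))

# In the pipeline, the derived count would then be passed to the sink,
# e.g. (sketch, assuming `records` is a PCollection of serialized
# tf.train.Example protos):
#
#   records | beam.io.tfrecordio.WriteToTFRecord(
#       file_path_prefix=output_prefix,                  # decided upfront
#       num_shards=num_shards_for(n_images, 1000))
```

The catch is that num_shards must be known when the pipeline is built, so
this only works if the image count is available before launch (e.g. from a
listing step outside the pipeline).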

2) A solution I found is to have a Cloud Function perform this split, which
takes about 50 seconds. Then, for each chunk, I start a Dataflow job via a
template.
This way I have around 40 jobs running simultaneously. Is this a problem?
Is it bad practice?

Thank you very much

-- 

   *ANDRÉ ROCHA SILVA*
  * DATA ENGINEER*
  (48) 3181-0611
