Posted to user@beam.apache.org by Jean Wisser <jw...@flowtraders.com> on 2023/03/23 09:48:45 UTC

PubsubIO to AvroIO huge fanout

Hi,

I have a pipeline that reads file paths from Pub/Sub, passes them to AvroIO to parse the actual files, and writes the records back out as Parquet.

PubsubIO(filepaths) ----> AvroIO.parseFilesGenericRecords() ----> Window ----> FileWriter

Each file can contain millions of records, which creates a large fanout in AvroIO.
If I start the pipeline while there is already a backlog of messages in Pub/Sub, all of the messages are quickly consumed and acknowledged, but the pipeline then struggles to parse all of the files and memory usage blows up.
I have tried different types of windows, and adding a reshuffle step right after parseFilesGenericRecords, but without success.

Any ideas on how to resolve this?

Thanks,
Jean Wisser.


