Posted to user@flink.apache.org by "Sandys-Lumsdaine, James" <Ja...@systematica.com> on 2022/03/11 13:59:40 UTC

Slowness using GenericWriteAheadSink

Hello,

We are using the GenericWriteAheadSink to buffer values and then send them to a SQL Server database with a fast bulk copy upload. However, when I watch my process running, it seems to spend a huge amount of time iterating the Iterable<T> provided to the sendValues() method. It takes so long that I’ve had to increase the checkpoint timeout, because it causes the whole workflow to suspend.

I am using Flink 1.14.0 and have attached a simple, self-contained example. If I had to guess, there is a very large deserialization overhead from the checkpointed data, even though I’m currently using a HashMapStateBackend. I have profiled the application and it definitely seems to spend most of its time there. The object involved is just a plain POJO.

A second “issue” is that I am forced to clone the objects provided by the iterator. When I dug into the code I could see a ReusingMutableToRegularIteratorWrapper class being used, and the values passed to me were being reused, alternating between 2 objects. I don’t know the reasoning behind this (except to prevent extra garbage?), but it would be nice if I could specify a “non-reusing” one; otherwise there is a deserialization AND a clone for every object in the list.
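
To make the reuse problem concrete, here is a minimal plain-Java sketch that does not use Flink at all. TwoSlotReusingIterator and Reading are hypothetical names invented for this illustration; the iterator only imitates the behaviour I observed (two instances handed out alternately) and is not the actual Flink wrapper, whose internals may differ.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

class ReuseDemo {

    /** Plain mutable POJO standing in for the record type (hypothetical). */
    static final class Reading {
        long id;
        double value;

        Reading() {}

        /** Copy constructor: the "clone" a buffering consumer has to perform. */
        Reading(Reading other) {
            this.id = other.id;
            this.value = other.value;
        }
    }

    /**
     * Hypothetical stand-in for an object-reusing iterator: each call to next()
     * overwrites one of two pre-allocated instances, alternating between them.
     */
    static final class TwoSlotReusingIterator implements Iterator<Reading> {
        private final Reading[] slots = {new Reading(), new Reading()};
        private final int count;
        private int index = 0;

        TwoSlotReusingIterator(int count) {
            this.count = count;
        }

        @Override
        public boolean hasNext() {
            return index < count;
        }

        @Override
        public Reading next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            Reading slot = slots[index % 2];
            slot.id = index;          // simulates deserializing into the slot
            slot.value = index * 1.5;
            index++;
            return slot;
        }
    }

    /** Buffers the returned references directly, so later calls overwrite earlier elements. */
    static List<Long> idsWithoutCopy(int count) {
        List<Reading> buffered = new ArrayList<>();
        Iterator<Reading> it = new TwoSlotReusingIterator(count);
        while (it.hasNext()) {
            buffered.add(it.next());
        }
        return ids(buffered);
    }

    /** Copies each element before buffering it, at the cost of one allocation per element. */
    static List<Long> idsWithCopy(int count) {
        List<Reading> buffered = new ArrayList<>();
        Iterator<Reading> it = new TwoSlotReusingIterator(count);
        while (it.hasNext()) {
            buffered.add(new Reading(it.next()));
        }
        return ids(buffered);
    }

    private static List<Long> ids(List<Reading> readings) {
        List<Long> out = new ArrayList<>();
        for (Reading r : readings) {
            out.add(r.id);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(idsWithoutCopy(5)); // prints [4, 3, 4, 3, 4]
        System.out.println(idsWithCopy(5));    // prints [0, 1, 2, 3, 4]
    }
}
```

The copy is only needed because the elements are buffered. A consumer that fully processes each element inside the loop, before asking for the next one, would not need it.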

Any pointers or advice on a better way to send large amounts of data to a SQL Server sink would be appreciated.

James.

The information transmitted is intended only for the person or entity to which it is addressed and may contain confidential and/or privileged material. Any review, retransmission, dissemination or other use of, or taking of any action in reliance upon, this information by persons or entities other than the intended recipient is prohibited. If you received this in error, please contact the sender and delete the material from any computer.

This communication is for informational purposes only. It is not intended as an offer or solicitation for the purchase or sale of any financial instrument or as an official confirmation of any transaction. Any market prices, data and other information are not warranted as to completeness or accuracy and are subject to change without notice. Any comments or statements made herein do not necessarily reflect those of Systematica Investments UK LLP, its parents, subsidiaries or affiliates.

Systematica Investments UK LLP (“SIUK”), which is authorised and regulated by the Financial Conduct Authority of the United Kingdom (the “FCA”) is authorised and regulated by the Financial Conduct Authority and is registered with the U.S. Securities and Exchange Commission as an investment adviser under the Investment Advisers Act of 1940.

Systematica Investments UK LLP is registered in England and Wales with a partnership number OC424197. Registered Office: Equitable House, 47 King William Street, London EC4R 9AF.

Recipients of this communication should note that electronic communication, whether by email, website, SWIFT or otherwise, is an unsafe method of communication. Emails and SWIFT messages may be lost, delivered to the wrong address, intercepted or affected by delays, interference by third parties or viruses and their confidentiality, security and integrity cannot be guaranteed. None of SIGPL or any of its affiliates bear any liability or responsibility therefor.

Please see the important information at www.systematica.com/disclaimer.<http://www.systematica.com/disclaimer>

Please see the important information, including regarding the processing of personal data by Systematica, at www.systematica.com/PrivacyNotice.

www.systematica.com<http://www.systematica.com/>

Re: Slowness using GenericWriteAheadSink

Posted by James Sandys-Lumsdaine <ja...@hotmail.com>.
Is anyone able to comment on the below? My worry is that this class isn’t well supported, so I may need to find an alternative way to bulk copy data into SQL Server, e.g. use a simple file sink and then have some process bulk copy the files.
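
For reference, the file-based fallback could look roughly like the plain-Java sketch below. It is only an illustration under assumptions: StagingFileWriter, Row and the tab-delimited layout are hypothetical names and choices made up here, field values are assumed to contain no tabs or line breaks, and the later load into SQL Server (for example with a bulk-load tool) is not shown.

```java
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

class StagingFileWriter {

    /** Hypothetical record type standing in for the POJO in the job. */
    static final class Row {
        final long id;
        final String name;
        final double value;

        Row(long id, String name, double value) {
            this.id = id;
            this.name = name;
            this.value = value;
        }
    }

    /**
     * Streams rows to a tab-delimited file, one line per row, and returns the
     * number of rows written. Each row is formatted as soon as it is read, so
     * no reference to it is kept once the loop moves on.
     * Assumption: field values contain no tabs or line breaks.
     */
    static int writeRows(Iterable<Row> rows, Path target) throws IOException {
        int written = 0;
        try (BufferedWriter out = Files.newBufferedWriter(target, StandardCharsets.UTF_8)) {
            for (Row row : rows) {
                out.write(Long.toString(row.id));
                out.write('\t');
                out.write(row.name);
                out.write('\t');
                out.write(Double.toString(row.value));
                out.write('\n');
                written++;
            }
        }
        return written;
    }

    /** Writes two sample rows to a temporary file and returns the lines read back. */
    static List<String> demo() {
        try {
            Path target = Files.createTempFile("staging-", ".tsv");
            writeRows(List.of(new Row(1, "a", 1.5), new Row(2, "b", 2.5)), target);
            List<String> lines = Files.readAllLines(target, StandardCharsets.UTF_8);
            Files.delete(target);
            return lines;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        demo().forEach(System.out::println);
    }
}
```

Because every row is written out immediately rather than buffered, this style of consumer would not need to clone objects even when the iterator reuses instances.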


RE: Slowness using GenericWriteAheadSink

Posted by "Sandys-Lumsdaine, James" <Ja...@systematica.com>.
Is anyone able to comment on the below? My worry is that this class isn’t well supported, so I may need to find an alternative way to bulk copy data into SQL Server, e.g. use a simple file sink and then have some process bulk copy the files.



Slowness using GenericWriteAheadSink

Posted by James Sandys-Lumsdaine <ja...@hotmail.com>.
Hello,

We are using the GenericWriteAheadSink to buffer values and then send them to a SQL Server database with a fast bulk copy upload. However, when I watch my process running, it seems to spend a huge amount of time iterating the Iterable<T> provided to the sendValues() method. It takes so long that I’ve had to increase the checkpoint timeout, because it causes the whole workflow to suspend.

I am using Flink 1.14.0 and have attached a simple, self-contained example. If I had to guess, there is a very large deserialization overhead from the checkpointed data, even though I’m currently using a HashMapStateBackend. I have profiled the application and it definitely seems to spend most of its time there. The object involved is just a plain POJO.

A second “issue” is that I am forced to clone the objects provided by the iterator. When I dug into the code I could see a ReusingMutableToRegularIteratorWrapper class being used, and the values passed to me were being reused, alternating between 2 objects. I don’t know the reasoning behind this (except to prevent extra garbage?), but it would be nice if I could specify a “non-reusing” one; otherwise there is a deserialization AND a clone for every object in the list.

Any pointers or advice on a better way to send large amounts of data to a SQL Server sink would be appreciated.

James.