You are viewing a plain text version of this content. The canonical link for it is here.
Posted to user@flink.apache.org by Francis Conroy <fr...@switchdin.com> on 2021/12/22 03:30:45 UTC

Re: PyFlink Perfomance

I've just run an analysis using a similar example which involves a single
python flatmap operator and we're getting 100x less through by using python
over java. I'm interested to know if you can do such a comparison. I'm
using Flink 14.0.

Thanks,
Francis

On Thu, 18 Nov 2021 at 02:20, Thomas Portugal <th...@gmail.com>
wrote:

> Hello community,
> My team is developing an application using Pyflink. We are using the
> Datastream API. Basically, we read from a kafka topic, do some maps, and
> write on another kafka topic. One restriction about it is the first map,
> that has to be serialized and with parallelism equals to one. This is
> causing a bottleneck on the throughput, and we are achieving approximately
> 2k msgs/sec. Monitoring the cpu usage and the number of records on each
> operator, it seems that the first operator is causing the issue.
> The first operator is like a buffer that groups the messages from kafka
> and sends them to the next operators. We are using a dequeue from python's
> collections. Since we are stuck on this issue, could you answer some
> questions about this matter?
>
> 1 - Using data structures from python can introduce some latency or
> increase the CPU usage?
> 2 - There are alternatives to this approach? We were thinking about Window
> structure, from Flink, but in our case it's not time based, and we didn't
> find an equivalent on python API.
> 3 - Using Table API to read from Kafka Topic and do the windowing can
> improve our performance?
>
> We already set some parameters like python.fn-execution.bundle.time and
> buffer.timeout to improve our performance.
>
> Thanks for your attention.
> Best Regards
>

-- 
This email and any attachments are proprietary and confidential and are 
intended solely for the use of the individual to whom it is addressed. Any 
views or opinions expressed are solely those of the author and do not 
necessarily reflect or represent those of SwitchDin Pty Ltd. If you have 
received this email in error, please let us know immediately by reply email 
and delete it from your system. You may not use, disseminate, distribute or 
copy this message nor disclose its contents to anyone. 
SwitchDin Pty Ltd 
(ABN 29 154893857) PO Box 1165, Newcastle NSW 2300 Australia

Re: PyFlink Perfomance

Posted by Francis Conroy <fr...@switchdin.com>.
Hi Dian, I'll build up something similar and post it, my current test code
contains proprietary information.

On Wed, 22 Dec 2021 at 14:49, Dian Fu <di...@gmail.com> wrote:

> Hi Francis,
>
> Could you share the benchmark code you use?
>
> Regards,
> Dian
>
> On Wed, Dec 22, 2021 at 11:31 AM Francis Conroy <
> francis.conroy@switchdin.com> wrote:
>
>> I've just run an analysis using a similar example which involves a single
>> python flatmap operator and we're getting 100x less through by using python
>> over java. I'm interested to know if you can do such a comparison. I'm
>> using Flink 14.0.
>>
>> Thanks,
>> Francis
>>
>> On Thu, 18 Nov 2021 at 02:20, Thomas Portugal <th...@gmail.com>
>> wrote:
>>
>>> Hello community,
>>> My team is developing an application using Pyflink. We are using the
>>> Datastream API. Basically, we read from a kafka topic, do some maps, and
>>> write on another kafka topic. One restriction about it is the first map,
>>> that has to be serialized and with parallelism equals to one. This is
>>> causing a bottleneck on the throughput, and we are achieving approximately
>>> 2k msgs/sec. Monitoring the cpu usage and the number of records on each
>>> operator, it seems that the first operator is causing the issue.
>>> The first operator is like a buffer that groups the messages from kafka
>>> and sends them to the next operators. We are using a dequeue from python's
>>> collections. Since we are stuck on this issue, could you answer some
>>> questions about this matter?
>>>
>>> 1 - Using data structures from python can introduce some latency or
>>> increase the CPU usage?
>>> 2 - There are alternatives to this approach? We were thinking about
>>> Window structure, from Flink, but in our case it's not time based, and we
>>> didn't find an equivalent on python API.
>>> 3 - Using Table API to read from Kafka Topic and do the windowing can
>>> improve our performance?
>>>
>>> We already set some parameters like python.fn-execution.bundle.time and
>>> buffer.timeout to improve our performance.
>>>
>>> Thanks for your attention.
>>> Best Regards
>>>
>>
>> This email and any attachments are proprietary and confidential and are
>> intended solely for the use of the individual to whom it is addressed. Any
>> views or opinions expressed are solely those of the author and do not
>> necessarily reflect or represent those of SwitchDin Pty Ltd. If you have
>> received this email in error, please let us know immediately by reply email
>> and delete it from your system. You may not use, disseminate, distribute or
>> copy this message nor disclose its contents to anyone.
>> SwitchDin Pty Ltd (ABN 29 154893857) PO Box 1165, Newcastle NSW 2300
>> Australia
>>
>

-- 
This email and any attachments are proprietary and confidential and are 
intended solely for the use of the individual to whom it is addressed. Any 
views or opinions expressed are solely those of the author and do not 
necessarily reflect or represent those of SwitchDin Pty Ltd. If you have 
received this email in error, please let us know immediately by reply email 
and delete it from your system. You may not use, disseminate, distribute or 
copy this message nor disclose its contents to anyone. 
SwitchDin Pty Ltd 
(ABN 29 154893857) PO Box 1165, Newcastle NSW 2300 Australia

Re: PyFlink Perfomance

Posted by Dian Fu <di...@gmail.com>.
Hi Francis,

Could you share the benchmark code you use?

Regards,
Dian

On Wed, Dec 22, 2021 at 11:31 AM Francis Conroy <
francis.conroy@switchdin.com> wrote:

> I've just run an analysis using a similar example which involves a single
> python flatmap operator and we're getting 100x less through by using python
> over java. I'm interested to know if you can do such a comparison. I'm
> using Flink 14.0.
>
> Thanks,
> Francis
>
> On Thu, 18 Nov 2021 at 02:20, Thomas Portugal <th...@gmail.com>
> wrote:
>
>> Hello community,
>> My team is developing an application using Pyflink. We are using the
>> Datastream API. Basically, we read from a kafka topic, do some maps, and
>> write on another kafka topic. One restriction about it is the first map,
>> that has to be serialized and with parallelism equals to one. This is
>> causing a bottleneck on the throughput, and we are achieving approximately
>> 2k msgs/sec. Monitoring the cpu usage and the number of records on each
>> operator, it seems that the first operator is causing the issue.
>> The first operator is like a buffer that groups the messages from kafka
>> and sends them to the next operators. We are using a dequeue from python's
>> collections. Since we are stuck on this issue, could you answer some
>> questions about this matter?
>>
>> 1 - Using data structures from python can introduce some latency or
>> increase the CPU usage?
>> 2 - There are alternatives to this approach? We were thinking about
>> Window structure, from Flink, but in our case it's not time based, and we
>> didn't find an equivalent on python API.
>> 3 - Using Table API to read from Kafka Topic and do the windowing can
>> improve our performance?
>>
>> We already set some parameters like python.fn-execution.bundle.time and
>> buffer.timeout to improve our performance.
>>
>> Thanks for your attention.
>> Best Regards
>>
>
> This email and any attachments are proprietary and confidential and are
> intended solely for the use of the individual to whom it is addressed. Any
> views or opinions expressed are solely those of the author and do not
> necessarily reflect or represent those of SwitchDin Pty Ltd. If you have
> received this email in error, please let us know immediately by reply email
> and delete it from your system. You may not use, disseminate, distribute or
> copy this message nor disclose its contents to anyone.
> SwitchDin Pty Ltd (ABN 29 154893857) PO Box 1165, Newcastle NSW 2300
> Australia
>