Posted to issues@flink.apache.org by "Robert Metzger (JIRA)" <ji...@apache.org> on 2019/02/28 12:48:00 UTC
[jira] [Updated] (FLINK-9597) Flink application does not scale as expected
[ https://issues.apache.org/jira/browse/FLINK-9597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Metzger updated FLINK-9597:
----------------------------------
Component/s: (was: Core)
> Flink application does not scale as expected
> --------------------------------------------
>
> Key: FLINK-9597
> URL: https://issues.apache.org/jira/browse/FLINK-9597
> Project: Flink
> Issue Type: Bug
> Affects Versions: 1.5.0
> Reporter: swy
> Priority: Major
> Attachments: JM.png, TM.png, flink_app_parser_git.zip, sample.png, scaleNotWork.png
>
>
> Hi, we found that our Flink application, which has simple logic and uses a process function, does not scale beyond a parallelism of 8 even with sufficient resources. The results below are capped at ~250k TPS. No matter how we tune the parallelism of the operators, throughput does not scale, and increasing the source parallelism does not help either.
> Please refer to "scaleNotWork.png":
> 1. fixed source parallelism 4, other operators parallelism 8
> 2. fixed source parallelism 4, other operators parallelism 16
> 3. fixed source parallelism 4, other operators parallelism 32
> 4. fixed source parallelism 6, other operators parallelism 8
> 5. fixed source parallelism 6, other operators parallelism 16
> 6. fixed source parallelism 6, other operators parallelism 32
> 7. fixed source parallelism 6, other operators parallelism 64 (performance is worse than with parallelism 32)
> Sample source code is attached (flink_app_parser_git.zip). It is a simple program that parses JSON records into objects and passes them to a Flink process function with empty logic. RocksDB is in use, and the source data is generated by the program itself. This can be reproduced easily.
> We chose Flink because of its scalability, but that is not what we are seeing now. We would appreciate any help, as this is impacting our projects. Thank you.
> To run the program, use parameters such as:
> "aggrinterval=6000000 loop=7500000 statsd=1 psrc=4 pJ2R=32 pAggr=72 URL=do36.mycompany.com:8127"
> * aggrinterval: interval in ms at which the timer triggers
> * loop: number of rows of data to feed
> * statsd: send results to statsd
> * psrc: source parallelism
> * pJ2R: parallelism of the map operator (JsonRecTranslator)
> * pAggr: parallelism of the process+timer operator (AggregationDuration)
> We are running on VMware with 5 Task Managers, each with 32 slots.
> lscpu
> Architecture: x86_64
> CPU op-mode(s): 32-bit, 64-bit
> Byte Order: Little Endian
> CPU(s): 32
> On-line CPU(s) list: 0-31
> Thread(s) per core: 1
> Core(s) per socket: 1
> Socket(s): 32
> NUMA node(s): 1
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 63
> Model name: Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz
> Stepping: 2
> CPU MHz: 2593.993
> BogoMIPS: 5187.98
> Hypervisor vendor: VMware
> Virtualization type: full
> L1d cache: 32K
> L1i cache: 32K
> L2 cache: 256K
> L3 cache: 20480K
> NUMA node0 CPU(s): 0-31
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology tsc_reliable nonstop_tsc aperfmperf pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand hypervisor lahf_lm epb fsgsbase smep dtherm ida arat pln pts
>               total        used        free      shared  buff/cache   available
> Mem:             98          24          72           0           1          72
> Swap:             3           0           3
> Please refer to TM.png and JM.png for further details.
> The test was run without checkpointing enabled.
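
For readers without the attachment, the sample parameter string quoted above can be read as space-separated key=value pairs. The sketch below is only an illustration of that reading: the class JobParams and its methods are hypothetical and are not taken from flink_app_parser_git.zip. It also compares the largest requested parallelism against the reported 5 x 32 slots, assuming all operators stay in Flink's default slot sharing group, in which case a job needs as many slots as its highest operator parallelism.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the attached sample program.
public class JobParams {

    /** Splits a space-separated "key=value" string into an ordered map. */
    public static Map<String, String> parse(String line) {
        Map<String, String> params = new LinkedHashMap<>();
        for (String token : line.trim().split("\\s+")) {
            int eq = token.indexOf('=');
            if (eq <= 0) {
                throw new IllegalArgumentException("Expected key=value, got: " + token);
            }
            params.put(token.substring(0, eq), token.substring(eq + 1));
        }
        return params;
    }

    /**
     * Slots needed if every operator shares the default slot sharing group
     * (an assumption): the largest of the three parallelism settings.
     */
    public static int slotsNeeded(Map<String, String> params) {
        int max = 0;
        for (String key : new String[] {"psrc", "pJ2R", "pAggr"}) {
            max = Math.max(max, Integer.parseInt(params.get(key)));
        }
        return max;
    }

    public static void main(String[] args) {
        Map<String, String> p = parse(
            "aggrinterval=6000000 loop=7500000 statsd=1 psrc=4 pJ2R=32 pAggr=72 "
                + "URL=do36.mycompany.com:8127");
        int available = 5 * 32; // 5 Task Managers x 32 slots each, as reported
        System.out.println("slots needed=" + slotsNeeded(p) + ", available=" + available);
    }
}
```

With the sample parameters this prints "slots needed=72, available=160", which is consistent with the report that slot capacity is not the limiting factor.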
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)