Posted to user@flink.apache.org by Jon Yeargers <jo...@cedexis.com> on 2016/08/15 20:22:21 UTC

Performance issues - is my topology not setup properly?

Flink 1.1.1 is running on AWS / EMR. 3 boxes - total 24 cores and 90Gb of
RAM.

Job is submitted via yarn.

Topology:

read CSV files from SQS -> parse each file by line and create an object for
each line -> pass through a 'KeySelector' to pair entries (by hash) over a
60-second window -> write the original and matched sets to BigQuery.
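
In rough DataStream API terms the pipeline is something like this (class and
function names are placeholders, not the actual job code):

import org.apache.flink.api.java.functions.KeySelector;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.time.Time;

StreamExecutionEnvironment env =
        StreamExecutionEnvironment.getExecutionEnvironment();

// read CSV files from SQS, then turn each line into an object
DataStream<LogEntry> entries = env
        .addSource(new SqsFileSource())
        .flatMap(new ParseLineFunction());   // one LogEntry per CSV line

entries
        .keyBy(new KeySelector<LogEntry, String>() {
            @Override
            public String getKey(LogEntry e) {
                return e.getHash();           // pair entries by hash
            }
        })
        .timeWindow(Time.seconds(60))         // 60 second window
        .apply(new MatchWindowFunction())     // emit original and matched sets
        .addSink(new BigQuerySink());

env.execute();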

Each file contains ~ 15K lines and there are ~10 files / second.

My topology can't keep up with this stream. What am I doing wrong?

Articles like this (
http://data-artisans.com/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink/)
speak of > 1 million events / sec / core. I'm not clear on what constitutes an
'event', but given the number of cores I'm throwing at this problem I would
expect higher throughput.

I run the job as:

HADOOP_CONF_DIR=/etc/hadoop/conf bin/flink run -m yarn-cluster -yn 3 -ys 8
-yst -ytm 4096 ../flink_all.jar

Re: Performance issues - is my topology not setup properly?

Posted by Ufuk Celebi <uc...@apache.org>.
Hey Jon! Thanks for sharing this. The blog post refers to each record
in the stream as an event.

The YARN command you've shared looks good. You could give the machines
more memory, but I would not expect that to be the problem here. I would
rather suspect that the sources are the bottleneck (though of course I'm
not sure).

As a first step we can sanity check this: look at the back pressure
monitoring tab in the web frontend (see [1]). What status do you see
there? If the sources all say "OK", they are producing as fast as they
can and the part of the topology after them is not back pressuring or
slowing them down.

Can we be sure that reading from SQS can keep up with the throughput you need?

You could also replace the sources with a custom source function that
produces data inside Flink, to check how much data the rest of the
topology can handle.
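
For example, a throwaway generator source along these lines (just a sketch,
the class name and payload are made up) emits records as fast as possible, so
you can see what throughput the parse/window/sink part of the job sustains on
its own:

import org.apache.flink.streaming.api.functions.source.SourceFunction;

// Emits synthetic "CSV lines" as fast as possible so the downstream
// operators can be benchmarked without the SQS source in the picture.
public class GeneratorSource implements SourceFunction<String> {

    private volatile boolean running = true;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        long i = 0;
        while (running) {
            // shape the fake line to roughly match your real records
            ctx.collect("host-" + (i % 1000) + ",value," + i);
            i++;
        }
    }

    @Override
    public void cancel() {
        running = false;
    }
}

You would swap it in for the SQS source with env.addSource(new
GeneratorSource()) and watch the throughput numbers in the web UI.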

[1] https://ci.apache.org/projects/flink/flink-docs-master/internals/back_pressure_monitoring.html


