Posted to user@flink.apache.org by Meghashyam Sandeep V <vr...@gmail.com> on 2016/12/15 17:11:50 UTC

benchmarking flink streaming

Hi There,

We are evaluating Flink streaming for real-time data analysis. I have my
Flink job running on EMR with YARN. What benchmarking tools work best with
Flink? I couldn't find this information on the Apache website.

Thanks,
Sandeep

Re: benchmarking flink streaming

Posted by Stephan Ewen <se...@apache.org>.
The latency markers "pass through" windows, so they do not take the latency
of windows into account.
They represent only the latency of the actual streams and their
backpressure.

Re: benchmarking flink streaming

Posted by Meghashyam Sandeep V <vr...@gmail.com>.
Hi Stephan,

That's great to hear. We are using EMR, which is still on Flink 1.1.3. I'll
use the latency markers once Flink on EMR is upgraded.

Thanks,
Sandeep

Re: benchmarking flink streaming

Posted by Dominik Safaric <do...@gmail.com>.
Hi Stephan,

As I’m already familiar with the latency markers in Flink 1.2, there is one question that bothers me regarding them: how does Flink measure end-to-end latency when dealing with, e.g., aggregations?

Suppose you have a topology ingesting data from Kafka, and you want to output a frequency per key. In this case, the sink is only given tuples of (key: String, frequency: Int).
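One common workaround (not something the latency markers give you) is to carry the ingest timestamp of the oldest contributing record through the aggregation, so the sink sees (key, frequency, oldest_ingest_ts) and can compute a worst-case latency per emitted aggregate. A minimal framework-agnostic sketch of that bookkeeping, independent of Flink's API:

```python
import time

def aggregate(events):
    """Fold (key, ingest_ts) events into key -> (frequency, oldest_ts).
    Carrying the oldest timestamp lets the sink derive a worst-case
    end-to-end latency for each emitted aggregate."""
    state = {}
    for key, ts in events:
        count, oldest = state.get(key, (0, ts))
        state[key] = (count + 1, min(oldest, ts))
    return state

# Simulated run: three events, two keys, ingested at different times.
now = time.time()
events = [("a", now - 2.0), ("b", now - 1.0), ("a", now - 0.5)]
for key, (freq, oldest) in aggregate(events).items():
    worst_case_latency = time.time() - oldest
    print(key, freq, worst_case_latency)
```

In an actual Flink job the same idea would live inside the aggregate/window function's accumulator; the sketch above only illustrates the state that needs to be carried.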

Re: benchmarking flink streaming

Posted by Stephan Ewen <se...@apache.org>.
Hi!

There are new latency metrics in Flink 1.2 that you can use. They are
sampled, so they do not cover every record.

You can always attach your own timestamps in order to measure the latency
of specific records.

Stephan


Re: benchmarking flink streaming

Posted by Meghashyam Sandeep V <vr...@gmail.com>.
Hi Stephan,

Thanks for your answer. Is there a way to get metrics such as the latency
of each message in the stream? For example, I have a Kafka source, a
Cassandra sink, and some processing in between. I would like to know how
long each message takes from beginning (entering Flink from Kafka) to end
(executing the Cassandra query).


Re: benchmarking flink streaming

Posted by Stephan Ewen <se...@apache.org>.
Hi!

I am not sure there is a recommended benchmarking tool. Performance
comparisons depend heavily on the scenarios you are looking at: simple
event processing, shuffles (grouping aggregation), joins, small state,
large state, etc.

As far as I know, most people write a "mock" version of a job that is
representative of the jobs they want to run, and test with that.

That said, I agree that it would actually be helpful to collect some jobs
in the form of an "evaluation suite".

Stephan
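In the spirit of the "mock job" suggestion, such a benchmark usually boils down to three steps: generate synthetic records shaped like the production data, push them through the same operations the real job performs, and measure throughput. A framework-agnostic sketch (all names are illustrative; a real benchmark would run this as an actual Flink job on the cluster):

```python
import random
import time

def generate_records(n):
    """Synthetic records shaped roughly like the production data."""
    keys = ["user-%d" % i for i in range(100)]
    return [(random.choice(keys), random.random()) for _ in range(n)]

def mock_job(records):
    """Stand-in for the real pipeline: here, a keyed sum aggregation."""
    totals = {}
    for key, value in records:
        totals[key] = totals.get(key, 0.0) + value
    return totals

n = 200_000
records = generate_records(n)
start = time.time()
mock_job(records)
elapsed = time.time() - start
print("throughput: %.0f records/s" % (n / elapsed))
```

The important part is that `mock_job` mirrors the shape of the real workload (key skew, state size, shuffle pattern); absolute numbers from a toy generator are only meaningful relative to each other.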


