Posted to user@storm.apache.org by John Yost <ho...@gmail.com> on 2016/04/01 15:16:13 UTC

Complete Latency Vs. Throughput--when do they not change in same direction?

Hi Everyone,

I am a little puzzled by what I am seeing in some testing with a topology
I have: the topo reads from a KafkaSpout, does some CPU-intensive
processing, and then writes back out to Kafka via the standard KafkaBolt.

I am doing testing in a multi-tenant environment and so test results can
vary by 10-20% on average.  However, results are much more variable the
last couple of days.

The big thing I am noticing: whereas the throughput--as measured in tuples
acked/minute--is half of what it was yesterday for the same configuration,
the Complete Latency (the total time a tuple is in the topology, from the
time it hits the KafkaSpout to the time it is acked in the KafkaBolt) today
is a third of what it was yesterday.

Any ideas as to how the throughput could go down dramatically at the same
time the Complete Latency is improving?

Thanks

--John
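
For readers who want to reproduce this kind of setup, here is a minimal
sketch of the topology shape described above, using the backtype-era
storm-kafka KafkaSpout/KafkaBolt API. The host names, topic names, and the
CPU-heavy bolt are placeholders rather than the actual topology from this
thread, and the exact storm-kafka API varies a bit between releases.

    import java.util.Properties;

    import backtype.storm.Config;
    import backtype.storm.LocalCluster;
    import backtype.storm.spout.SchemeAsMultiScheme;
    import backtype.storm.topology.BasicOutputCollector;
    import backtype.storm.topology.OutputFieldsDeclarer;
    import backtype.storm.topology.TopologyBuilder;
    import backtype.storm.topology.base.BaseBasicBolt;
    import backtype.storm.tuple.Fields;
    import backtype.storm.tuple.Tuple;
    import backtype.storm.tuple.Values;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;
    import storm.kafka.bolt.KafkaBolt;
    import storm.kafka.bolt.mapper.FieldNameBasedTupleToKafkaMapper;
    import storm.kafka.bolt.selector.DefaultTopicSelector;

    public class CpuHeavyTopology {

        // Stand-in for the CPU-intensive processing step.
        public static class CpuHeavyBolt extends BaseBasicBolt {
            public void execute(Tuple input, BasicOutputCollector collector) {
                String msg = input.getString(0);
                // ... CPU-intensive work on msg goes here ...
                collector.emit(new Values("key", msg));
            }
            public void declareOutputFields(OutputFieldsDeclarer declarer) {
                // Field names the default KafkaBolt mapper looks for.
                declarer.declare(new Fields("key", "message"));
            }
        }

        public static void main(String[] args) throws Exception {
            // Spout: read the (static) input topic; offsets are tracked in ZooKeeper.
            SpoutConfig spoutConf = new SpoutConfig(
                    new ZkHosts("kafka-zk1:2181,kafka-zk2:2181"),
                    "input-topic", "/kafka-offsets", "cpu-heavy-topo");
            spoutConf.scheme = new SchemeAsMultiScheme(new StringScheme());

            // Bolt: write results back out to Kafka.
            KafkaBolt<String, String> kafkaBolt = new KafkaBolt<String, String>()
                    .withTopicSelector(new DefaultTopicSelector("output-topic"))
                    .withTupleToKafkaMapper(new FieldNameBasedTupleToKafkaMapper<String, String>());

            TopologyBuilder builder = new TopologyBuilder();
            builder.setSpout("kafka-spout", new KafkaSpout(spoutConf), 4);
            builder.setBolt("cpu-heavy-bolt", new CpuHeavyBolt(), 8).shuffleGrouping("kafka-spout");
            builder.setBolt("kafka-bolt", kafkaBolt, 4).shuffleGrouping("cpu-heavy-bolt");

            Config conf = new Config();
            Properties producerProps = new Properties();
            producerProps.put("metadata.broker.list", "kafka1:9092,kafka2:9092");
            producerProps.put("serializer.class", "kafka.serializer.StringEncoder");
            conf.put(KafkaBolt.KAFKA_BROKER_PROPERTIES, producerProps);
            conf.setNumWorkers(4);
            conf.setMaxSpoutPending(1000); // bound in-flight tuples so Complete Latency stays meaningful

            // LocalCluster for a quick local test; use StormSubmitter on a real cluster.
            new LocalCluster().submitTopology("cpu-heavy-topo", conf, builder.createTopology());
        }
    }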

Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by John Yost <ho...@gmail.com>.
Hey Nikos,

Yep, what we're describing here is precisely why there are TONS of DevOps
jobs available. These systems are, at times, highly difficult to
troubleshoot.  But...unless there is a DEFCON 1 deadline, it's normally
quite fun to figure all of this out.

Regarding Storm's heavy use of ZooKeeper, that's a very important point and
precisely why we have a separate ZooKeeper cluster solely for Storm. We
learned that having both Kafka and Storm share the same ZooKeeper cluster
is something to avoid if possible.
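
For anyone wiring this up, the split is just a matter of pointing the two
systems at different ensembles in their respective configs. A sketch with
placeholder host names; the keys are the standard storm.yaml and Kafka
server.properties settings:

    # storm.yaml -- Storm gets its own dedicated ZooKeeper ensemble
    storm.zookeeper.servers:
      - "storm-zk1.example.com"
      - "storm-zk2.example.com"
      - "storm-zk3.example.com"
    storm.zookeeper.port: 2181

    # Kafka server.properties -- Kafka keeps a separate ensemble
    zookeeper.connect=kafka-zk1.example.com:2181,kafka-zk2.example.com:2181,kafka-zk3.example.com:2181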

I know that Bobby Evans and the Yahoo team have done some great work on a
separate heartbeat server that bypasses ZooKeeper (mentioned in this
presentation: http://bit.ly/1GKmHi6).
--John

Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by "Nikos R. Katsipoulakis" <ni...@gmail.com>.
Hello again,

Well, it is not hand-waving... It is just hard to trace these issues in a
real system.

Varying behavior on Kafka makes all metrics unreliable and will definitely
create unstable behavior in your pipeline. Remember, Kafka
is a distributed system itself, relying on ZooKeeper, which in turn is
heavily abused by Storm. Therefore, there is an "entanglement" between
Kafka and Storm in your setting. So, yes, your explanation sounds
reasonable.

Overall, I think your observations captured that heavy usage of Kafka, and
therefore cannot be considered the norm (they are rather an anomaly).
Interesting problem though! I have been battling latency and throughput
myself with Storm, and I find it a fascinating subject when one is dealing
with real-time data analysis.

Cheers,
Nikos



-- 
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh

Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by John Yost <ho...@gmail.com>.
Hey Nikos,

I believe I have it figured out.  I spoke with another DevOps-er and she
said we had a large backlog of data that's been hitting Kafka for the last
couple of days, so our Kafka cluster has been under heavy load.
Consequently, my hypothesis that, despite the fact my topo is reading off
of a static Kafka partition, there is significant variability in the
KafkaSpout read rates is likely correct; this could explain the lower
throughput combined with the lower Complete Latency. In addition, since
Kafka is under heavy load, there is likely corresponding variability in the
Kafka writes; this could explain the tuple failures coupled with the lower
throughput and Complete Latency.

There's definitely a bit of hand-waving here, but the picture as I've
written it out seems reasonable to me. What do you think?

Thanks

--John
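
One way to check the read-rate hypothesis, rather than eyeballing the UI,
is to register Storm's built-in LoggingMetricsConsumer, which writes the
per-component built-in metrics (__emit-count, __ack-count,
__complete-latency, etc.) to the workers' metrics log at a fixed interval,
so spout read rates can be compared across runs. A sketch; these lines just
go into the topology Config before submitting, and the 10-second bucket
size is only an illustration:

    import backtype.storm.Config;
    import backtype.storm.metric.LoggingMetricsConsumer;

    Config conf = new Config();
    // Sample the built-in metrics every 10s instead of the default 60s so
    // short test runs still yield enough data points to compare.
    conf.put(Config.TOPOLOGY_BUILTIN_METRICS_BUCKET_SIZE_SECS, 10);
    // One consumer task is plenty for logging.
    conf.registerMetricsConsumer(LoggingMetricsConsumer.class, 1);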


Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by "Nikos R. Katsipoulakis" <ni...@gmail.com>.
Glad that we are on the same page.

The observation you present in your last paragraph is quite hard to answer.
Before I present my explanation, I assume you are using more than one
machine (i.e. using Netty for sending/receiving tuples). If that is the
case, then I am pretty sure that if you check your workers' logs you will
see a bunch of Netty events for failed messages. In the other scenario,
where you are running Storm on a single node and using the LMAX Disruptor,
you would need to use a different wait strategy than BlockingWaitStrategy
in order to see failed messages. Please clarify your setup a bit further.

Carrying on, if tuple failures happen in the Netty layer, then one
explanation might be that something causes time-outs in your system (either
context-switching or something else) and tuples are considered failed.
Concretely, if you have a low input rate, then your Netty buffer needs more
time to fill up a batch before sending it to the next node. Therefore, if
your timeout is somewhere close to the time the oldest tuple spends in your
Netty buffer, you might end up flagging that tuple as failed and having to
re-send it.

The previous is one possible scenario. Nevertheless, what you are
experiencing might be the outcome of any of the layers of Storm and is hard
to pinpoint exactly.
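
If the failures do turn out to be timeout-driven, the two per-topology
knobs that usually matter are the tuple timeout and the cap on in-flight
tuples. A sketch; the 120/500 values are purely illustrative, not
recommendations:

    import backtype.storm.Config;

    Config conf = new Config();
    // Give slow end-to-end paths more headroom before a tuple is failed and
    // replayed by the spout (topology.message.timeout.secs, default 30).
    conf.setMessageTimeoutSecs(120);
    // Bound how many tuples the spout keeps un-acked at once; a smaller cap
    // keeps queueing delay, and with it Complete Latency, down.
    conf.setMaxSpoutPending(500);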

Cheers,
Nikos




-- 
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh

Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by John Yost <ho...@gmail.com>.
Hey Nikos,

Yep, totally agree with what you've written here. As I wrote in my
response, the key point is whether the input rate differs. If the input
rate is lower, it makes good sense that both the throughput and Complete
Latency go down. Similarly, if the rate goes up, the Complete Latency goes
up since there's more data for the topo to process.

For my purposes, I need to confirm if the rate of data coming in--in other
words, the rate at which KafkaSpout is reading off of a static Kafka
topic--varies between experiments. Up until now I've discounted this
possibility, but, given the great points you've made here, I'm going to
revisit it.

Having written all of this...another important point: when I see the
throughput and Complete Latency decrease in tandem, I also see tuple
failures go from zero at or close to the best throughput to a 1-5% failure
rate. This seems to contradict what you and I are discussing here, but I
may be missing something.  What do you think?

--John


Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by "Nikos R. Katsipoulakis" <ni...@gmail.com>.
Hello again John,

No need to apologize. All experiments in distributed environments have so
many details that it is only normal to forget some of them in an initial
explanation.

Going back to my point on seeing less throughput with less latency, I will
give you an example:

Assume you own a system that has the resources to handle up to 100
events/sec and guarantees a mean latency of 5 msec. If you run an
experiment in which you send in 100 events/sec, your system runs at 100%
capacity and you observe 5 msec end-to-end latency. Your throughput is
expected to be somewhere close to 100 events/sec (but lower if you factor
in latency). Now, if you run another experiment in which you send in 50
events/sec, your system runs at 50% capacity and you observe an average
end-to-end latency somewhere around 2.5 msec. In the second experiment, you
are expected to see lower throughput compared to the first experiment,
somewhere around 50 events/sec.

Of course, in the example above I assumed that there is a 1:1 mapping
between each input data point and each output. If that is not the case,
then you have to give more details.
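
One compact way to state the relationship in the example above is Little's
Law; this is general queueing reasoning, not anything specific to Storm's
implementation:

    % Little's Law: average in-flight tuples = throughput x time in system
    N = \lambda W
    % First run:  \lambda = 100 events/sec, W = 5 ms    =>  N = 0.5 tuples in flight
    % Second run: \lambda = 50 events/sec,  W = 2.5 ms  =>  N = 0.125 tuples in flight

When the offered rate, rather than the topology's capacity, is the
bottleneck, the throughput \lambda is pinned to that rate, so throughput
and Complete Latency can fall together, which is exactly the pattern in the
original question.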

Thanks,
Nikos

On Fri, Apr 1, 2016 at 12:20 PM, John Yost <ho...@gmail.com> wrote:

> Hey Nikos,
>
> Thanks for responding so quickly, and I apologize for leaving out a
> crucially important detail--the Kafka topic. My topo is reading from a
> static topic.  I definitely agree that reading from a live topic could--and
> would likely--lead to variable throughput rates, both in terms of raw input
> rates as well as variability in the content. Again, great question and
> points, I should have specified my topo is reading from a static Kafka
> topic in my original post.
>
> Regarding your third point, my thinking is that throughput would go up if
> Complete Latency went down, since it's my understanding that Complete
> Latency measures the avg amount of time that each tuple spends in the
> topology. The key "if" here is whether the input rate stays the same. If
> Complete Latency decreases, more tuples can be processed by the topology
> in a given amount of time. But I see what you're saying: the avg time
> spent on each tuple would be less if the input rate goes up because
> there's more data per second, more context switching amongst the
> executors, etc... Please confirm if I am thinking about this the wrong
> way, because this seems to be a pretty fundamental fact about Storm that
> I need to have right.
>
> Great point regarding waiting for topology to complete warm up.  I let my
> topo run for 20 minutes before measuring anything.
>
> Thanks
>
> --John
>
>
>
> On Fri, Apr 1, 2016 at 9:54 AM, Nikos R. Katsipoulakis <
> nick.katsip@gmail.com> wrote:
>
>> Hello John,
>>
>> I have to say that a system's telemetry is not a mystery easily
>> understood. Then, let us try to deduce what might be the case in your
>> use-case that causes inconsistent performance metrics.
>>
>> At first, I would like to ask if your KafkaSpouts produce tuples with
>> the same rate. In other words, do you produce or read data in a
>> deterministic (replay-able) way; or do you attach your KafkaSpout to a
>> non-controllable source of data (like Twitter feed, news feed etc)? The
>> reason I am asking is because figuring out what happens in the source of
>> your data (in terms of input rate) is really important. If your use-case
>> involves varying input-rate for your sources, I would suggest picking a
>> particular snapshot of that source, and replay your experiments in order to
>> check if the variance in latency/throughput still exists.
>>
>> The second point I would like to make is that sometimes throughput (or
>> ack-rate as you correctly put it) might be related to the data you are
>> pushing. For instance, a computation-heavy task might take more time for a
>> particular value distribution than for another. Therefore, please make sure
>> that the data you send into the system always causes the same amount of
>> computation.
>>
>> And third, seeing throughput and latency drop at the same time
>> immediately points to a drop in the input rate. Think about it. If I send in
>> tuples with a lower input rate, I expect throughput to drop (since I am
>> sending tuples with a lower input rate), and at the same time the heavy
>> computation has to work with less data (thus end-to-end latency also
>> drops). Does the previous make sense to you? Can you verify that among the
>> different runs, you had consistent input rates?
>>
>> Finally, I would suggest that you let Storm warm up and drop your initial
>> metrics. In my experience with Storm, latency and throughput at the
>> beginning of a task (until all buffers get full) are highly variable, and
>> therefore not reliable data points to include in your analysis. You can
>> verify my claim by plotting your data over time.
>>
>> Thanks,
>> Nikos
>>
>> On Fri, Apr 1, 2016 at 9:16 AM, John Yost <ho...@gmail.com> wrote:
>>
>>> Hi Everyone,
>>>
>>> I am a little puzzled by what I am seeing in some testing with a
>>> topology I have where the topo is reading from a KafkaSpout, doing some CPU
>>> intensive processing, and then writing out to Kafka via the standard
>>> KafkaBolt.
>>>
>>> I am doing testing in a multi-tenant environment and so test results can
>>> vary by 10-20% on average.  However, results are much more variable the
>>> last couple of days.
>>>
>>> The big thing I am noticing: whereas the throughput--as measured in
>>> tuples acked/minute--is half today of what it was yesterday for the same
>>> configuration, the Complete Latency (total time a tuple is in the topology
>>> from the time it hits the KafkaSpout to the time it is acked in the
>>> KafkaBolt) today is a third of what it was yesterday.
>>>
>>> Any ideas as to how the throughput could go down dramatically at the
>>> same time the Complete Latency is improving?
>>>
>>> Thanks
>>>
>>> --John
>>>
>>
>>
>>
>> --
>> Nikos R. Katsipoulakis,
>> Department of Computer Science
>> University of Pittsburgh
>>
>
>


-- 
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh

Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by John Yost <ho...@gmail.com>.
Hey Nikos,

Thanks for responding so quickly, and I apologize for leaving out a
crucially important detail--the Kafka topic. My topo is reading from a
static topic.  I definitely agree that reading from a live topic could--and
would likely--lead to variable throughput rates, both in terms of raw input
rates as well as variability in the content. Again, great question and
points, I should have specified my topo is reading from a static Kafka
topic in my original post.

Regarding your third point, my thinking is that throughput would go up if
Complete Latency went down, since it's my understanding that Complete Latency
measures the avg amount of time each tuple spends in the topology. The key
"if" here is whether the input rate stays the same. If Complete Latency
decreases, more tuples can be processed by the topology in a given amount of
time. But I see what you're saying: the avg time spent on each tuple would be
greater if the input rate goes up because there's more data per second and
more context switching amongst the executors, etc... Please confirm if I am
thinking about this the wrong way, because this seems to be a pretty
fundamental fact about Storm that I need to have right.
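
As a sanity check on my own reasoning, here is a rough sketch of how I think
the two metrics relate if you treat the topology as a simple queueing system
(Little's Law: tuples in flight ~= throughput x complete latency). The
in-flight bound and the latencies are made-up numbers for illustration, not
measurements from my topo:

    // Rough Little's Law sketch: in-flight tuples ~= throughput * complete latency.
    // The in-flight count (in practice bounded by something like topology.max.spout.pending)
    // and the latencies below are made up, purely for illustration.
    public class LittlesLawSketch {
        public static void main(String[] args) {
            double inFlightTuples = 2000.0;              // tuples pending in the topology
            double[] completeLatenciesSec = {0.5, 0.25}; // two hypothetical complete latencies
            for (double latencySec : completeLatenciesSec) {
                double throughputPerSec = inFlightTuples / latencySec;
                System.out.printf("latency=%.2fs in-flight=%.0f implied throughput~%.0f tuples/sec%n",
                        latencySec, inFlightTuples, throughputPerSec);
            }
        }
    }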

Great point regarding waiting for the topology to finish warming up.  I let my
topo run for 20 minutes before measuring anything.

Thanks

--John



On Fri, Apr 1, 2016 at 9:54 AM, Nikos R. Katsipoulakis <
nick.katsip@gmail.com> wrote:

> Hello John,
>
> I have to say that a system's telemetry is not easily understood. So, let us
> try to deduce what in your use-case might be causing the inconsistent
> performance metrics.
>
> At first, I would like to ask if your KafkaSpouts produce tuples at the same
> rate. In other words, do you produce or read data in a deterministic
> (replay-able) way, or do you attach your KafkaSpout to a non-controllable
> source of data (like a Twitter feed, news feed, etc.)? The reason I am asking
> is that figuring out what happens at the source of your data (in terms of
> input rate) is really important. If your use-case involves a varying input
> rate for your sources, I would suggest picking a particular snapshot of that
> source and replaying your experiments in order to check whether the variance
> in latency/throughput still exists.
>
> The second point I would like to make is that sometimes throughput (or
> ack-rate as you correctly put it) might be related to the data you are
> pushing. For instance, a computation-heavy task might take more time for a
> particular value distribution than for another. Therefore, please make sure
> that the data you send into the system always causes the same amount of
> computation.
>
> And third, seeing throughput and latency drop at the same time
> immediately points to a drop in the input rate. Think about it. If I send in
> tuples with a lower input rate, I expect throughput to drop (since I am
> sending tuples with a lower input rate), and at the same time the heavy
> computation has to work with less data (thus end-to-end latency also
> drops). Does the previous make sense to you? Can you verify that among the
> different runs, you had consistent input rates?
>
> Finally, I would suggest that you let Storm warm up and drop your initial
> metrics. In my experience with Storm, latency and throughput at the
> beginning of a task (until all buffers get full) are highly variable, and
> therefore not reliable data points to include in your analysis. You can
> verify my claim by plotting your data over time.
>
> Thanks,
> Nikos
>
> On Fri, Apr 1, 2016 at 9:16 AM, John Yost <ho...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I am a little puzzled by what I am seeing in some testing with a topology
>> I have where the topo is reading from a KafkaSpout, doing some CPU
>> intensive processing, and then writing out to Kafka via the standard
>> KafkaBolt.
>>
>> I am doing testing in a multi-tenant environment and so test results can
>> vary by 10-20% on average.  However, results are much more variable the
>> last couple of days.
>>
>> The big thing I am noticing: whereas the throughput--as measured in
>> tuples acked/minute--is half today of what it was yesterday for the same
>> configuration, the Complete Latency (total time a tuple is in the topology
>> from the time it hits the KafkaSpout to the time it is acked in the
>> KafkaBolt) today is a third of what it was yesterday.
>>
>> Any ideas as to how the throughput could go down dramatically at the same
>> time the Complete Latency is improving?
>>
>> Thanks
>>
>> --John
>>
>
>
>
> --
> Nikos R. Katsipoulakis,
> Department of Computer Science
> University of Pittsburgh
>

Re: Complete Latency Vs. Throughput--when do they not change in same direction?

Posted by "Nikos R. Katsipoulakis" <ni...@gmail.com>.
Hello John,

I have to say that a system's telemetry is not easily understood. So, let us
try to deduce what in your use-case might be causing the inconsistent
performance metrics.

At first, I would like to ask if your KafkaSpouts produce tuples at the same
rate. In other words, do you produce or read data in a deterministic
(replay-able) way, or do you attach your KafkaSpout to a non-controllable
source of data (like a Twitter feed, news feed, etc.)? The reason I am asking
is that figuring out what happens at the source of your data (in terms of
input rate) is really important. If your use-case involves a varying input
rate for your sources, I would suggest picking a particular snapshot of that
source and replaying your experiments in order to check whether the variance
in latency/throughput still exists.
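
For example, with the storm-kafka spout you can pin an experiment to a static
topic and force it to re-read from the earliest offset on every run, so each
run consumes exactly the same data. A rough sketch follows; the package,
class, and field names are what I recall from the storm-kafka module and may
differ across versions (e.g. forceFromStart was later renamed
ignoreZkOffsets), so treat it as an assumption rather than copy-paste code:

    // Sketch only: pin a KafkaSpout to a static topic and always replay it from the
    // earliest offset, so every experiment run consumes the same snapshot of data.
    // Names below are assumptions for a pre-1.0 storm-kafka setup; check your version.
    import backtype.storm.spout.SchemeAsMultiScheme;
    import storm.kafka.BrokerHosts;
    import storm.kafka.KafkaSpout;
    import storm.kafka.SpoutConfig;
    import storm.kafka.StringScheme;
    import storm.kafka.ZkHosts;

    public class ReplayableSpoutSketch {
        public static KafkaSpout buildReplaySpout() {
            BrokerHosts hosts = new ZkHosts("zk-host:2181");          // hypothetical ZK connect string
            SpoutConfig cfg = new SpoutConfig(hosts, "static-topic",  // hypothetical static topic
                    "/kafka-spout-offsets", "replay-experiment-spout");
            cfg.scheme = new SchemeAsMultiScheme(new StringScheme());
            cfg.startOffsetTime = kafka.api.OffsetRequest.EarliestTime(); // read from the beginning
            cfg.forceFromStart = true; // ignore stored offsets so each run replays the same data
            return new KafkaSpout(cfg);
        }
    }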

The second point I would like to make is that sometimes throughput (or
ack-rate as you correctly put it) might be related to the data you are
pushing. For instance, a computation-heavy task might take more time for a
particular value distribution than for another. Therefore, please make sure
that the data you send into the system always causes the same amount of
computation.
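
As a toy illustration (nothing to do with your actual bolt logic, just an
assumed stand-in for "CPU intensive" work), the same code path can cost very
different amounts of CPU depending on the values it happens to receive:

    // Toy illustration of data-dependent cost: identical code, very different CPU
    // time depending on the payload. Not related to any real bolt in this topology.
    public class DataDependentCost {
        static long work(String payload) {
            long acc = 0;
            // Cost grows roughly with payload length squared.
            for (int i = 0; i < payload.length(); i++) {
                for (int j = 0; j < payload.length(); j++) {
                    acc += payload.charAt(i) ^ payload.charAt(j);
                }
            }
            return acc;
        }

        public static void main(String[] args) {
            String small = "short";
            String large = new String(new char[5000]).replace('\0', 'x');
            for (String p : new String[] {small, large}) {
                long start = System.nanoTime();
                work(p);
                System.out.printf("length=%d took %.1f ms%n", p.length(),
                        (System.nanoTime() - start) / 1e6);
            }
        }
    }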

And third, seeing throughput and latency drop at the same time
immediately points to a drop in the input rate. Think about it. If I send in
tuples with a lower input rate, I expect throughput to drop (since I am
sending tuples with a lower input rate), and at the same time the heavy
computation has to work with less data (thus end-to-end latency also
drops). Does the previous make sense to you? Can you verify that among the
different runs, you had consistent input rates?

Finally, I would suggest that you let Storm warm up and drop your initial
metrics. In my experience with Storm, latency and throughput at the
beginning of a task (until all buffers get full) are highly variable, and
therefore not reliable data points to include in your analysis. You can
verify my claim by plotting your data over time.
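
As a concrete (made-up) example of what I mean by dropping the warm-up window
before you average anything, something along these lines, where the sample
shape and the 20-minute cutoff are purely assumptions:

    // Sketch: discard metric samples collected during an assumed warm-up window
    // before computing an average. The sample format and cutoff are made up.
    import java.util.Arrays;
    import java.util.List;

    public class WarmupFilterSketch {
        static class Sample {
            final long elapsedMillis; // time since the topology was submitted
            final double value;       // e.g. complete latency or acks/min at that instant
            Sample(long elapsedMillis, double value) { this.elapsedMillis = elapsedMillis; this.value = value; }
        }

        static double meanAfterWarmup(List<Sample> samples, long warmupMillis) {
            return samples.stream()
                    .filter(s -> s.elapsedMillis >= warmupMillis) // keep only post-warm-up points
                    .mapToDouble(s -> s.value)
                    .average()
                    .orElse(Double.NaN);
        }

        public static void main(String[] args) {
            List<Sample> samples = Arrays.asList(
                    new Sample(60_000, 900.0),    // noisy start-up reading, dropped
                    new Sample(1_500_000, 120.0), // post-warm-up readings, kept
                    new Sample(1_800_000, 118.0));
            System.out.println(meanAfterWarmup(samples, 20 * 60 * 1000L)); // 20-minute warm-up
        }
    }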

Thanks,
Nikos

On Fri, Apr 1, 2016 at 9:16 AM, John Yost <ho...@gmail.com> wrote:

> Hi Everyone,
>
> I am a little puzzled by what I am seeing in some testing with a topology
> I have where the topo is reading from a KafkaSpout, doing some CPU
> intensive processing, and then writing out to Kafka via the standard
> KafkaBolt.
>
> I am doing testing in a multi-tenant environment and so test results can
> vary by 10-20% on average.  However, results are much more variable the
> last couple of days.
>
> The big thing I am noticing: whereas the throughput--as measured in tuples
> acked/minute--is half today of what it was yesterday for the same
> configuration, the Complete Latency (total time a tuple is in the topology
> from the time it hits the KafkaSpout to the time it is acked in the
> KafkaBolt) today is a third of what it was yesterday.
>
> Any ideas as to how the throughput could go down dramatically at the same
> time the Complete Latency is improving?
>
> Thanks
>
> --John
>



-- 
Nikos R. Katsipoulakis,
Department of Computer Science
University of Pittsburgh