You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@samza.apache.org by "Chris Riccomini (JIRA)" <ji...@apache.org> on 2014/04/04 03:15:16 UTC

[jira] [Commented] (SAMZA-220) SystemConsumers is slow when consuming from a large number of partitions

    [ https://issues.apache.org/jira/browse/SAMZA-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13959508#comment-13959508 ] 

Chris Riccomini commented on SAMZA-220:
---------------------------------------

One of the nice characteristics observed about this approach is that containers that can't keep up will automatically start polling consumers less often since it takes longer for their buffer to drop below refreshThreshold, which will cause them to spend more time processing messages. This is a good thing, since they're behind.

Another nice thing observed about this approach is that containers that are caught up spend a lot of their CPU time running maybeCall. This is good because it means in the caught-up case, we are checking for new messages very frequently.

> SystemConsumers is slow when consuming from a large number of partitions
> ------------------------------------------------------------------------
>
>                 Key: SAMZA-220
>                 URL: https://issues.apache.org/jira/browse/SAMZA-220
>             Project: Samza
>          Issue Type: Bug
>          Components: container
>    Affects Versions: 0.6.0
>            Reporter: Chris Riccomini
>         Attachments: SAMZA-220.0.patch
>
>
> We have observed poor throughput when a SamzaContainer is consuming many partitions (100s). The more partitions, the worse the performance gets.
> When hooking up VisualVM, two operations take up more than 65% of the CPU in SystemConsumers:
> {code}
>     refresh.maybeCall()
>     updateMessageChooser
> {code}
> The problem is that we run each of these operations once before every process() call to a StreamTask. Both of these operations iterate over *all* SystemStreamPartitions that the SystemConsumers is consuming from. If you have hundreds of partitions, it means you do two loops of 100+ items for every message you process. This is true even if the SystemConsumers buffer has a lot of messages (10,000+), and also true even if most systemStreamPartitions have no messages available.
> I have two proposed solutions to this problem:
> 1. Only call refresh.maybeCall() when the total number of buffered messages in the SystemConsumers has dropped below some low watermark.
> 2. Only have updateMessageChooser call messageChooser.update for systemStreamPartitions that actually *have* a message.
> I have implemented this and deployed it on a few jobs, and I am seeing significant performance improvement. From 10k-20k msgs/sec to 50k+.
> The trade off, as I see it is really around (1), which will introduce a little latency for topics that are low volume. In such a case, the time from when a message arrives to when it gets refreshed in the buffer, and updated in the chooser increases.



--
This message was sent by Atlassian JIRA
(v6.2#6252)